Marvin Humphrey
2006-05-19 05:00:48 UTC
Greets,
I have a tokenizing algorithm which uses regexes, and it would
presumably faster if it were implemented in XS. The algorithm
itself, I've appended below, as it's less important than the more
general concept of how to get at regexes from XS.
It looks like the relevant functions are pregcomp() and pregexec().
There isn't anything about these in perlapi, so accessing them might
be a little naughty. However, I have found some prior art: Tk uses
them, in the file tkGlue.c.
What follows is what I've been able to deduce so far. If someone can
help fill in the blanks, I'll be much obliged.
This function header is from regcomp.c:
regexp *
Perl_pregcomp(pTHX_ char *exp, char *xend, PMOP *pm)
I gather that the first two arguments to pregcomp are the start and
the limit (a la SvEND) of the pattern. The returned regexp*, it
looks like I would immediately supply to pregexec(). I'm not too
sure how to supply a PMOP*, but I saw in a Nick Ing-Simmons post to
p5p that you have to "fake an op" in order to make this work. Looks
like that's what this function from Tk does:
/* An "XS" routine to call with G_EVAL set */
static void
do_comp(pTHX_ CV *cv)
{
dMARK;
dAX;
struct WrappedRegExp *p = (struct WrappedRegExp *) CvXSUBANY
(cv).any_ptr;
int len = 0;
char *string = Tcl_GetStringFromObj(p->source,&len);
p->op.op_pmdynflags |= PMdf_DYN_UTF8;
p->pat = pregcomp(string,string+len,&p->op);
#if 0
LangDebug("/%.*s/ => %p\n",len,string,p->pat);
#endif
XSRETURN(0);
}
It seems the PMOP stores some flags which affect how pregcomp()
behaves. In this case, it appears that pregcomp() needs to know that
UTF-8 is in effect. Comments elsewhere in tkGlue.c indicate that any
string coming from Tk will be UTF-8.
This function header is from regexec.c:
/*
- pregexec - match a regexp against a string
*/
I32
Perl_pregexec(pTHX_ register regexp *prog, char *stringarg, register
char *strend,
char *strbeg, I32 minend, SV *screamer, U32 nosave)
/* strend: pointer to null at end of string */
/* strbeg: real beginning of string */
/* minend: end of match must be >=minend after stringarg. */
/* nosave: For optimizations. */
{
I think I understand most of that. stringarg may differ from strbeg
if, for example, we're in the middle of an m//g sequence. I'm not
sure under what circumstances it would be useful to set minend to
something other than 0, but maybe for the tokenizer it should be 1.
One of these days I'll figure out what a "screaming" SV is, but it's
clear from the Tk example that it can simply be the SV that to which
strarg belongs. nosave looks like it affects whether matches will be
saved, though I'm not clear whether that means $1 $2 etc, or $` etc,
or both.
Most of the code in the Tk function which invokes pregexec() is
concerned with wrapping an SV around a C string. The actual matching
only takes one line...
int
Tcl_RegExpExec(interp, re, cstring, cstart)
Tcl_Interp *interp;
Tcl_RegExp re;
CONST char *cstring;
CONST char *cstart;
{
dTHX;
SV *tmp = sv_newmortal();
int code;
sv_upgrade(tmp,SVt_PV);
SvCUR_set(tmp,strlen(cstring));
SvPVX(tmp) = (char *) cstring;
SvLEN(tmp) = 0;
SvREADONLY_on(tmp);
SvPOK_on(tmp);
/* From Tk all strings are UTF-8 */
SvUTF8_on(tmp);
#ifdef ROPT_MATCH_UTF8
RX_MATCH_UTF8_on(re->pat);
#else
/* eeek what do we do now ... */
#endif
code = pregexec(re->pat,SvPVX(tmp),SvEND(tmp),(char *) cstart,0,
tmp,REXEC_COPY_STR);
#if 0
LangDebug("%d '%.*s'\n",code,SvCUR(tmp),SvPVX(tmp));
sv_dump(tmp);
regdump(re->pat);
#endif
return code;
}
If I were to rewrite the tokenizer algo below in C, I surmise that I
would keep calling pregexex(), advancing stringarg each time, until
the match fails.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
sub tokenize {
for ( $_[0] ) {
pos = 0;
my ( @starts, @ends );
1 while (
m/$separator_re/g # could be qr/\W*/
and push @starts, pos
and m/$token_re/g # could be qr/\w+/
and push @ends, pos
);
add_many_tokens( $_, \@starts, \@ends );
}
}
I have a tokenizing algorithm which uses regexes, and it would
presumably faster if it were implemented in XS. The algorithm
itself, I've appended below, as it's less important than the more
general concept of how to get at regexes from XS.
It looks like the relevant functions are pregcomp() and pregexec().
There isn't anything about these in perlapi, so accessing them might
be a little naughty. However, I have found some prior art: Tk uses
them, in the file tkGlue.c.
What follows is what I've been able to deduce so far. If someone can
help fill in the blanks, I'll be much obliged.
This function header is from regcomp.c:
regexp *
Perl_pregcomp(pTHX_ char *exp, char *xend, PMOP *pm)
I gather that the first two arguments to pregcomp are the start and
the limit (a la SvEND) of the pattern. The returned regexp*, it
looks like I would immediately supply to pregexec(). I'm not too
sure how to supply a PMOP*, but I saw in a Nick Ing-Simmons post to
p5p that you have to "fake an op" in order to make this work. Looks
like that's what this function from Tk does:
/* An "XS" routine to call with G_EVAL set */
static void
do_comp(pTHX_ CV *cv)
{
dMARK;
dAX;
struct WrappedRegExp *p = (struct WrappedRegExp *) CvXSUBANY
(cv).any_ptr;
int len = 0;
char *string = Tcl_GetStringFromObj(p->source,&len);
p->op.op_pmdynflags |= PMdf_DYN_UTF8;
p->pat = pregcomp(string,string+len,&p->op);
#if 0
LangDebug("/%.*s/ => %p\n",len,string,p->pat);
#endif
XSRETURN(0);
}
It seems the PMOP stores some flags which affect how pregcomp()
behaves. In this case, it appears that pregcomp() needs to know that
UTF-8 is in effect. Comments elsewhere in tkGlue.c indicate that any
string coming from Tk will be UTF-8.
This function header is from regexec.c:
/*
- pregexec - match a regexp against a string
*/
I32
Perl_pregexec(pTHX_ register regexp *prog, char *stringarg, register
char *strend,
char *strbeg, I32 minend, SV *screamer, U32 nosave)
/* strend: pointer to null at end of string */
/* strbeg: real beginning of string */
/* minend: end of match must be >=minend after stringarg. */
/* nosave: For optimizations. */
{
I think I understand most of that. stringarg may differ from strbeg
if, for example, we're in the middle of an m//g sequence. I'm not
sure under what circumstances it would be useful to set minend to
something other than 0, but maybe for the tokenizer it should be 1.
One of these days I'll figure out what a "screaming" SV is, but it's
clear from the Tk example that it can simply be the SV that to which
strarg belongs. nosave looks like it affects whether matches will be
saved, though I'm not clear whether that means $1 $2 etc, or $` etc,
or both.
Most of the code in the Tk function which invokes pregexec() is
concerned with wrapping an SV around a C string. The actual matching
only takes one line...
int
Tcl_RegExpExec(interp, re, cstring, cstart)
Tcl_Interp *interp;
Tcl_RegExp re;
CONST char *cstring;
CONST char *cstart;
{
dTHX;
SV *tmp = sv_newmortal();
int code;
sv_upgrade(tmp,SVt_PV);
SvCUR_set(tmp,strlen(cstring));
SvPVX(tmp) = (char *) cstring;
SvLEN(tmp) = 0;
SvREADONLY_on(tmp);
SvPOK_on(tmp);
/* From Tk all strings are UTF-8 */
SvUTF8_on(tmp);
#ifdef ROPT_MATCH_UTF8
RX_MATCH_UTF8_on(re->pat);
#else
/* eeek what do we do now ... */
#endif
code = pregexec(re->pat,SvPVX(tmp),SvEND(tmp),(char *) cstart,0,
tmp,REXEC_COPY_STR);
#if 0
LangDebug("%d '%.*s'\n",code,SvCUR(tmp),SvPVX(tmp));
sv_dump(tmp);
regdump(re->pat);
#endif
return code;
}
If I were to rewrite the tokenizer algo below in C, I surmise that I
would keep calling pregexex(), advancing stringarg each time, until
the match fails.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
sub tokenize {
for ( $_[0] ) {
pos = 0;
my ( @starts, @ends );
1 while (
m/$separator_re/g # could be qr/\W*/
and push @starts, pos
and m/$token_re/g # could be qr/\w+/
and push @ends, pos
);
add_many_tokens( $_, \@starts, \@ends );
}
}