Discussion: Regexes from XS
Marvin Humphrey
2006-05-19 05:00:48 UTC
Greets,

I have a tokenizing algorithm which uses regexes, and it would
presumably be faster if it were implemented in XS. I've appended the
algorithm itself below, but it's less important than the more general
question of how to get at regexes from XS.

It looks like the relevant functions are pregcomp() and pregexec().
There isn't anything about these in perlapi, so accessing them might
be a little naughty. However, I have found some prior art: Tk uses
them, in the file tkGlue.c.

What follows is what I've been able to deduce so far. If someone can
help fill in the blanks, I'll be much obliged.

This function header is from regcomp.c:

regexp *
Perl_pregcomp(pTHX_ char *exp, char *xend, PMOP *pm)

I gather that the first two arguments to pregcomp() are the start and
the limit (a la SvEND) of the pattern. The returned regexp* looks
like something I would supply directly to pregexec(). I'm not too
sure how to supply a PMOP*, but I saw in a Nick Ing-Simmons post to
p5p that you have to "fake an op" to make this work. It looks like
that's what this function from Tk does:

/* An "XS" routine to call with G_EVAL set */
static void
do_comp(pTHX_ CV *cv)
{
dMARK;
dAX;
struct WrappedRegExp *p = (struct WrappedRegExp *) CvXSUBANY
(cv).any_ptr;
int len = 0;
char *string = Tcl_GetStringFromObj(p->source,&len);
p->op.op_pmdynflags |= PMdf_DYN_UTF8;
p->pat = pregcomp(string,string+len,&p->op);
#if 0
LangDebug("/%.*s/ => %p\n",len,string,p->pat);
#endif
XSRETURN(0);
}


It seems the PMOP stores some flags which affect how pregcomp()
behaves. In this case, it appears that pregcomp() needs to know that
UTF-8 is in effect. Comments elsewhere in tkGlue.c indicate that any
string coming from Tk will be UTF-8.
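
So my guess is that the minimum you'd have to do is fake up a zeroed
PMOP and set whichever flag bits matter - something like this,
completely untested (pattern_sv stands in for an SV holding the
pattern text):

/* Untested guess: fake up a PMOP for pregcomp(), setting the UTF-8
 * flag the same way tkGlue.c does. */
PMOP   *pm;
regexp *rx;
STRLEN  len;
char   *pat = SvPV(pattern_sv, len);

Newz(1, pm, 1, PMOP);                    /* zeroed, so all flags start off */
if (SvUTF8(pattern_sv))
    pm->op_pmdynflags |= PMdf_DYN_UTF8;  /* treat the pattern as UTF-8 */

rx = pregcomp(pat, pat + len, pm);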

This function header is from regexec.c:

/*
 - pregexec - match a regexp against a string
 */
I32
Perl_pregexec(pTHX_ register regexp *prog, char *stringarg,
              register char *strend, char *strbeg, I32 minend,
              SV *screamer, U32 nosave)
/* strend: pointer to null at end of string */
/* strbeg: real beginning of string */
/* minend: end of match must be >=minend after stringarg. */
/* nosave: For optimizations. */
{

I think I understand most of that. stringarg may differ from strbeg
if, for example, we're in the middle of an m//g sequence. I'm not
sure under what circumstances it would be useful to set minend to
something other than 0, but maybe for the tokenizer it should be 1.
One of these days I'll figure out what a "screaming" SV is, but it's
clear from the Tk example that it can simply be the SV to which
stringarg belongs. nosave looks like it affects whether matches will
be saved, though I'm not clear whether that means $1, $2, etc., or $`
etc., or both.
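
So, as far as I can tell, a bare-bones call would look something like
this (completely untested, just my reading of the header; rx is a
regexp* from pregcomp() and target_sv holds the string to match):

/* Untested: match rx against the whole PV of target_sv. */
char *beg = SvPVX(target_sv);
char *end = SvEND(target_sv);
I32   matched;

matched = pregexec(rx,
                   beg,        /* stringarg: where to start looking       */
                   end,        /* strend:    one past the last byte       */
                   beg,        /* strbeg:    real start, for ^ and \b     */
                   0,          /* minend:    no minimum match end         */
                   target_sv,  /* screamer:  the SV the string belongs to */
                   0);         /* nosave:    0 seems to mean "do save"    */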

Most of the code in the Tk function which invokes pregexec() is
concerned with wrapping an SV around a C string. The actual matching
only takes one line...

int
Tcl_RegExpExec(interp, re, cstring, cstart)
    Tcl_Interp *interp;
    Tcl_RegExp re;
    CONST char *cstring;
    CONST char *cstart;
{
    dTHX;
    SV *tmp = sv_newmortal();
    int code;
    sv_upgrade(tmp, SVt_PV);
    SvCUR_set(tmp, strlen(cstring));
    SvPVX(tmp) = (char *) cstring;
    SvLEN(tmp) = 0;
    SvREADONLY_on(tmp);
    SvPOK_on(tmp);
    /* From Tk all strings are UTF-8 */
    SvUTF8_on(tmp);
#ifdef ROPT_MATCH_UTF8
    RX_MATCH_UTF8_on(re->pat);
#else
    /* eeek what do we do now ... */
#endif
    code = pregexec(re->pat, SvPVX(tmp), SvEND(tmp), (char *) cstart, 0,
                    tmp, REXEC_COPY_STR);
#if 0
    LangDebug("%d '%.*s'\n", code, SvCUR(tmp), SvPVX(tmp));
    sv_dump(tmp);
    regdump(re->pat);
#endif
    return code;
}

If I were to rewrite the tokenizer algo below in C, I surmise that I
would keep calling pregexec(), advancing stringarg each time, until
the match fails.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

sub tokenize {
    for ( $_[0] ) {
        pos = 0;
        my ( @starts, @ends );
        1 while (
                m/$separator_re/g    # could be qr/\W*/
            and push @starts, pos
            and m/$token_re/g        # could be qr/\w+/
            and push @ends, pos
        );
        add_many_tokens( $_, \@starts, \@ends );
    }
}
Vaclav Barta
2006-05-19 04:24:25 UTC
Hi,
Post by Marvin Humphrey
I have a tokenizing algorithm which uses regexes, and it would
presumably be faster if it were implemented in XS. The algorithm
...
Post by Marvin Humphrey
It looks like the relevant functions are pregcomp() and pregexec().
There isn't anything about these in perlapi, so accessing them might
be a little naughty. However, I have found some prior art: Tk uses
them, in the file tkGlue.c.
For an XS perspective on Perl regexes, I found
http://perl.plover.com/Rx/paper/ useful - it does very fancy regex matching.
Post by Marvin Humphrey
regexp *
Perl_pregcomp(pTHX_ char *exp, char *xend, PMOP *pm)
I gather that the first two arguments to pregcomp() are the start and
the limit (a la SvEND) of the pattern. The returned regexp* looks
like something I would supply directly to pregexec(). I'm not too
sure how to supply a PMOP*, but I saw in a Nick Ing-Simmons post to
p5p that you have to "fake an op" to make this work.
I hacked
    PMOP *pm;
    char *end;
    regexp *rx;

    Newz(1, pm, 1, PMOP);
    end = strchr(rx_string, '\0');
    rx = pregcomp(rx_string, end, pm);
out of Rx, and it seems to be working (I'm not setting any flags, but
Rx does implement that). One unsettling thing is that Rx also does
some magic around the call to pregcomp, which changes the regex
slightly (i.e. my module gets a different regex signature from the
one documented in the Rx paper), but I haven't noticed any negative
effects of my simplification so far. Also, Rx doesn't bother with
memory management - AFAIK pm should be freed with Safefree() and rx
with pregfree().
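IOW, the cleanup would presumably be just

    pregfree(rx);    /* free the compiled regexp */
    Safefree(pm);    /* free the faked-up PMOP   */

but that's only my reading of the sources.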
Post by Marvin Humphrey
I32
Perl_pregexec(pTHX_ register regexp *prog, char *stringarg, register
char *strend,
char *strbeg, I32 minend, SV *screamer, U32 nosave)
Didn't get to use that - I'll pass.

Bye
Vasek
Marvin Humphrey
2006-05-20 05:27:35 UTC
:It seems the PMOP stores some flags which affect how pregcomp()
:behaves. In this case, it appears that pregcomp() needs to know that
:UTF-8 is in effect. Comments elsewhere in tkGlue.c indicate that any
:string coming from Tk will be UTF-8.
Yes, you probably want to avoid that if you don't need utf8.
I assume that this means the *pattern* will be interpreted as if it
were in UTF-8, which would most likely happen within the scope of a
'use utf8;' pragma (hence your spelling of 'utf8'). The behavior of
the regex against the string to be matched has to be determined too
late for pregcomp(), since it depends on each scalar's SvUTF8 flag at
match time. IOW, regardless of the conditions under which the
REGEXP* struct was prepared, it has to be ready to deal with either
UTF-8 scalars or byte-oriented scalars.

It makes my head hurt to consider what would happen if a UTF8 scalar
had to be interpolated into a pattern outside the scope of a utf8
pragma.

$foo =~ qr/stuff$a_utf8_string/;

I imagine you'd have to perform scalar concatenation, then set the
UTF-8 flag on the PMOP based on the value of the UTF8 flag of the
concatenated string.
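
At the C level I imagine that works out to something like this
(untested; a_utf8_string_sv stands in for the scalar being
interpolated, and pm is the faked-up PMOP from before):

/* Untested guess: build the interpolated pattern first, then let its
 * SvUTF8 flag decide how pregcomp() should treat it. */
SV *pat_sv = newSVpvn("stuff", 5);
sv_catsv(pat_sv, a_utf8_string_sv);   /* upgrades pat_sv to UTF-8 if needed */

if (SvUTF8(pat_sv))
    pm->op_pmdynflags |= PMdf_DYN_UTF8;
rx = pregcomp(SvPVX(pat_sv), SvEND(pat_sv), pm);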

I'm starting to understand why this stuff isn't in the official
public API. :)

In my case, what I'll really need to do is retrieve a precompiled
regular expression from within a passed-in qr// construct. That's
another headache, but maybe it means I don't have to worry about
pregcomp() at all.
:I'm not
:sure under what circumstances it would be useful to set minend to
:something other than 0, but maybe for the tokenizer it should be 1.
I think this is there to handle the avoidance of infinite zero-length
matches - after a zero-length match at position p, the next match
must end *after* position p. This allows correct behaviour for things
like
"abc" =~ /( | . )/xg;
which returns ("", "a", "", "b", "", "c", "").
Yes. Snooping the code of pp_match in pp_hot.c, I see that the
variable minmatch starts out as 0, but if global matching is in
effect, it can be reset to something else on subsequent loops.
When doing //g matches, supply 0 for the first call; for subsequent
calls supply (I think) $+[0] + (matchlen == 0 ? 1 : 0).
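Translated to the C level, I think that comes out something like the
following (untested; rx is the compiled regexp and text_sv holds the
source text):

/* Untested: global matching with the zero-length-match guard.  After
 * a zero-length match, a minend of 1 forces the next match to end at
 * least one byte further along, so the loop can't spin forever. */
char *beg    = SvPVX(text_sv);
char *end    = SvEND(text_sv);
char *arg    = beg;
I32   minend = 0;

while (pregexec(rx, arg, end, beg, minend, text_sv, 1)) {
    I32 start = rx->startp[0];    /* byte offsets from beg */
    I32 stop  = rx->endp[0];

    /* ... record the token spanning [start, stop) ... */

    minend = (start == stop) ? 1 : 0;
    arg    = beg + stop;
}
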
I've decided that I can simplify my tokenizing algo:

sub tokenize {
    my ( $token_re, $source_text ) = @_;
    for ($source_text) {
        # accumulate token start_offsets and end_offsets
        my ( @starts, @ends );
        1 while (
                m/$token_re/g
            and push @starts, $-[0]
            and push @ends,   $+[0]
        );

        # add the new tokens to the batch
        add_many_tokens( $_, \@starts, \@ends );
    }
}

Unfortunately, experimenting with this uncovered a bug in my algo.
@+, @-, and pos() all give answers in terms of characters if the
scalar which matched was marked with SvUTF8. But I always need
@starts and @ends measured in *bytes*.

At the C level, I can get at that information using the startp and
endp members of the regexp struct. Unfortunately, that's a deeper
violation of the private API than I intended. There are two levels
of hackery here: there's naughty, and then there's evil. Using
pregcomp() and pregexec() is naughty. Using rx->endp[0] is evil.
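
The one documented thing I've spotted that might help is
sv_pos_u2b(), which is in perlapi and converts a character offset
into a byte offset on a UTF-8 scalar. Untested, but presumably
something like this could translate what @- and @+ report:

/* Untested: convert a character offset (e.g. an element of @+) into
 * a byte offset using the public sv_pos_u2b(). */
I32 offset = char_offset;    /* character count from the start of text_sv */
I32 len    = 0;
sv_pos_u2b(text_sv, &offset, &len);
/* offset now holds the equivalent byte count */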

Oh well. I'm in too deep to quit now. At least I'm learning a lot.
See below for a demo app that manages to successfully match once. I
haven't figured out how to turn on global matching though, and I
definitely need that for the tokenizer.
:One of these days I'll figure out what a "screaming" SV is, but it's
:clear from the Tk example that it can simply be the SV to which
:stringarg belongs.
This is the Boyer-Moore optimisation. Off the top of my head, it is
applied only when you 'study()' the target string - this upgrades the
string to a different type (SVt_PVBM) which adds a structure giving
a frequency table and linked lists of occurrences of each character in
the target string. This is useful only when you have one string to
which you plan to apply many patterns.
Interesting. I wonder why Tk bothers with it, then, since it looks
like the matching from Tk is all one-shot and the SV* gets discarded.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


#!/usr/bin/perl
use strict;
use warnings;
use Inline C => <<'END_C';

SV*
regex_once(SV *regex_ref, SV *text) {
    SV *regex_sv;
    regexp *rx;
    char *stringarg, *stringbeg, *stringend;
    MAGIC *mg = NULL;

    /* Dig the compiled regexp out of the qr// construct's magic. */
    if (!SvROK(regex_ref)) croak("not a ref");
    regex_sv = SvRV(regex_ref);
    if (!SvMAGICAL(regex_sv)) croak("No magic");
    mg = mg_find(regex_sv, PERL_MAGIC_qr);
    if (!mg) croak("No qr magic");
    rx = (REGEXP*)mg->mg_obj;

    stringbeg = SvPV_nolen(text);
    stringarg = stringbeg;
    stringend = SvEND(text);

    if (!pregexec(rx, stringarg, stringend, stringbeg, 1, text, 1))
        return &PL_sv_undef;

    /* endp[0] is a byte offset, so this copies everything up to the
     * end of the first match. */
    return newSVpv(stringbeg, rx->endp[0]);
}

SV*
regex_many(SV *either, SV *text) {
    PMOP *pm;
    SV *regex_sv;
    regexp *rx;
    char *stringarg, *stringbeg, *stringend;
    MAGIC *mg = NULL;
    int safety = 0;

    /* Fake up a zeroed PMOP and ask for global matching. */
    Newz(1, pm, 1, PMOP);
    pm->op_pmflags |= PMf_GLOBAL;

    if (SvROK(either)) {
        /* Passed a qr// construct: reuse its compiled regexp. */
        regex_sv = SvRV(either);
        if (!SvMAGICAL(regex_sv)) croak("No magic");
        mg = mg_find(regex_sv, PERL_MAGIC_qr);
        if (!mg) croak("No qr magic");
        rx = (REGEXP*)mg->mg_obj;
    }
    else {
        /* Passed a plain pattern string: compile it ourselves. */
        if (!SvPOK(either)) croak("need a pattern");
        rx = pregcomp(SvPVX(either), SvEND(either), pm);
    }

    stringbeg = SvPV_nolen(text);
    stringarg = stringbeg;
    stringend = SvEND(text);

    /* Keep matching, advancing stringarg past each match.  A minend
     * of 1 keeps a zero-length match from looping forever; the safety
     * counter is a belt-and-braces guard while experimenting. */
    while (pregexec(rx, stringarg, stringend, stringbeg, 1, text, 1)) {
        stringarg = stringbeg + rx->endp[0];
        fprintf(stderr, "%d\n", safety);
        if (safety++ > 10)
            break;
    }

    /* Return up to the end of the last match (endp[0] is a byte offset). */
    return newSVpv(stringbeg, rx->endp[0]);
}

END_C

my $regex = qr/../;
my $string = join '', 'a' .. 'z';
my $matched = regex_once($regex, $string);
print "Matched once: $matched\n";

$matched = regex_many('..', $string);
print "match_many with pattern: $matched\n";

$matched = regex_many($regex, $string);
print "match_many with qr// construct: $matched\n";
rvtol+ (Dr.Ruud)
2006-05-20 10:31:12 UTC
@+, @-, and pos() all give answers in terms of characters if the
scalar which matched was marked with SvUTF8. But I always need
@starts and @ends measured in *bytes*.
Better call them "octets", see `perldoc Encode`.

<quote>
When you run "$octets = encode("utf8", $string)", then
$octets may not be equal to $string. Though they both contain the
same data, the utf8 flag for $octets is always off.
</quote>
--
Affijn, Ruud

"Gewoon is een tijger."