|
Perl Differences
The differences described here are with respect to Perl 5.005.
-
By default, a whitespace character is any character that
the C library function isspace() recognizes, though it is
possible to compile PCRE with alternative character type
tables. Normally isspace() matches space, formfeed, newline,
carriage return, horizontal tab, and vertical tab. Perl 5 no
longer includes vertical tab in its set of whitespace characters.
The \v escape that was in the Perl documentation for
a long time was never in fact recognized. However, the character
itself was treated as whitespace at least up to 5.002.
In 5.004 and 5.005 it does not match \s.
-
PCRE does not allow repeat quantifiers on lookahead
assertions. Perl permits them, but they do not mean what you
might think. For example, (?!a){3} does not assert that the
next three characters are not "a". It just asserts that the
next character is not "a" three times.
-
Capturing subpatterns that occur inside negative
lookahead assertions are counted, but their entries in the
offsets vector are never set. Perl sets its numerical
variables from any such patterns that are matched before the
assertion fails to match something (thereby succeeding), but
only if the negative lookahead assertion contains just one
branch.
-
Though binary zero characters are supported in the subject string,
they are not allowed in a pattern string because it is passed as a
normal C string, terminated by zero. The escape sequence "\x00" can
be used in the pattern to represent a binary zero.
-
The following Perl escape sequences are not supported:
\l, \u, \L, \U. In fact these are implemented by
Perl's general string-handling and are not part of its
pattern matching engine.
-
The Perl \G assertion is not supported as it is not
relevant to single pattern matches.
-
Fairly obviously, PCRE does not support the (?{code}) and (??{code})
construction. However, there is support for recursive patterns.
-
There are at the time of writing some oddities in Perl
5.005_02 concerned with the settings of captured strings
when part of a pattern is repeated. For example, matching
"aba" against the pattern /^(a(b)?)+$/ sets $2 to the value
"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2
unset. However, if the pattern is changed to
/^(aa(b(b))?)+$/ then $2 (and $3) get set.
In Perl 5.004 $2 is set in both cases, and that is also
true
of PCRE. If in the future Perl changes to a consistent state
that is different, PCRE may change to follow.
-
Another as yet unresolved discrepancy is that in Perl
5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string
"a", whereas in PCRE it does not. However, in both Perl and
PCRE /^(a)?a/ matched against "a" leaves $1 unset.
-
PCRE provides some extensions to the Perl regular
expression facilities:
-
Although lookbehind assertions must match fixed length
strings, each alternative branch of a lookbehind assertion
can match a different length of string. Perl 5.005 requires
them all to have the same length.
-
If PCRE_DOLLAR_ENDONLY
is set and PCRE_MULTILINE is
not set, the $ meta-character matches only at the very end of the
string.
-
If PCRE_EXTRA is
set, a backslash followed by a letter with no special meaning is
faulted.
-
If PCRE_UNGREEDY is
set, the greediness of the repetition quantifiers is inverted,
that is, by default they are not greedy, but if followed by a
question mark they are.
|