This document was extracted from the pcre2.3.html documentation, Copyright (c) 1997-2025 University of Cambridge, and minimally adapted for use in TinyFugue. The latest reference can be found at: https://www.pcre.org/current/doc/html/index.html
The syntax and semantics of the regular expressions that are supported by PCRE2 are described in detail below. PCRE2 tries to match Perl syntax and semantics as closely as it can. PCRE2 also supports some alternative regular expression syntax that does not conflict with the Perl syntax in order to provide some compatibility with regular expressions in Python, .NET, and Oniguruma. There are in addition some options that enable alternative syntax and semantics that are not the same as in Perl.
Perl's regular expressions are described in its own documentation, and regular expressions in general are covered in a number of books, some of which have copious examples. Jeffrey Friedl's "Mastering Regular Expressions", published by O'Reilly, covers regular expressions in great detail. This description of PCRE2's regular expressions is intended as reference material.
A regular expression is a pattern that is matched against a subject string from left to right. Most characters stand for themselves in a pattern, and match the corresponding characters in the subject. As a trivial example, the pattern
The quick brown fox
matches a portion of a subject string that is identical to itself. The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are encoded in the pattern by the use of meta-characters, which do not stand for themselves but instead are interpreted in some special way.
There are two different sets of meta-characters: those that are recognized anywhere in the pattern except within square brackets, and those that are recognized in square brackets. Outside square brackets, the meta-characters are as follows:
\ general escape character with several uses
^ assert start of subject (or line, in multiline mode)
$ assert end of subject (or line, in multiline mode)
. match any character except newline (by default)
[ start character class definition
| start of alternative branch
( start subpattern
) end subpattern
? extends the meaning of (
also 0 or 1 quantifier
also quantifier minimizer
* 0 or more quantifier
+ 1 or more quantifier
{ start min/max quantifier
Brace characters { and } are also used to enclose data for constructions such
as \g{2} or \k{name}. In almost all uses of braces, space and/or horizontal
tab characters that follow { or precede } are allowed and are ignored.
Part of a pattern that is in square brackets is called a "character class". In a character class the only meta-characters are:
\ general escape character
^ negate the class, but only if the first character
- indicates character range
] terminates the character class
The following sections describe the use of each of the meta-characters.
The backslash character has several uses. Firstly, if it is followed by a non-alphameric character, it takes away any special meaning that character may have. This use of backslash as an escape character applies both inside and outside character classes.
For example, if you want to match a "*" character, you write "\*" in the pattern. This applies whether or not the following character would otherwise be interpreted as a meta-character, so it is always safe to precede a non-alphameric with "\" to specify that it stands for itself. In particular, if you want to match a backslash, you write "\\".
If you want to treat all characters in a sequence as literals, you can do so by putting them between \Q and \E. Note that this includes white space even when the PCRE2_EXTENDED option is set so that most other white space is ignored. The behaviour is different from Perl in that $ and @ are handled as literals in \Q...\E sequences in PCRE2, whereas in Perl, $ and @ cause variable interpolation. Also, Perl does "double-quotish backslash interpolation" on any backslashes between \Q and \E which, its documentation says, "may lead to confusing results". PCRE2 treats a backslash between \Q and \E just like any other character. Note the following examples:
Pattern            PCRE2 matches   Perl matches
\Qabc$xyz\E        abc$xyz         abc followed by the contents of $xyz
\Qabc\$xyz\E       abc\$xyz        abc\$xyz
\Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz
\QA\B\E            A\B             A\B
\Q\\E              \               \\E
The \Q...\E sequence is recognized both inside and outside character classes. An isolated \E that is not preceded by \Q is ignored. If \Q is not followed by \E later in the pattern, the literal interpretation continues to the end of the pattern (that is, \E is assumed at the end). If the isolated \Q is inside a character class, this causes an error, because the character class is then not terminated by a closing square bracket.
Another difference from Perl is that any appearance of \Q or \E inside what might otherwise be a quantifier causes PCRE2 not to recognize the sequence as a quantifier. Perl recognizes a quantifier if (redundantly) either of the numbers is inside \Q...\E, but not if the separating comma is. When not recognized as a quantifier a sequence such as {\Q1\E,2} is treated as the literal string "{1,2}".
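The \Q...\E rules can be sketched as a small translation step. This sketch uses Python's built-in re module as a stand-in, which is an assumption: it is neither PCRE2 nor tf, and it has no \Q...\E support of its own. The helper name expand_quoted is hypothetical. It rewrites the quoted parts as escaped literals so that the "PCRE2 matches" column of the table can be checked.

```python
import re

def expand_quoted(pattern):
    # Hypothetical helper (not part of PCRE2 or tf): rewrite \Q...\E spans
    # as escaped literals, following the rules described above.
    out = []
    quoting = False
    i = 0
    while i < len(pattern):
        pair = pattern[i:i + 2]
        if quoting:
            if pair == "\\E":
                quoting = False                    # \E ends the literal span
                i += 2
            else:
                out.append(re.escape(pattern[i]))  # everything else is literal
                i += 1
        elif pair == "\\Q":
            quoting = True                         # \Q starts a literal span
            i += 2
        elif pair == "\\E":
            i += 2                                 # an isolated \E is ignored
        elif pattern[i] == "\\" and len(pair) == 2:
            out.append(pair)                       # keep an ordinary escape intact
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)                            # an unterminated \Q runs to the end

# Rows of the table: (pattern, subject that PCRE2 is documented to match)
rows = [
    (r"\Qabc$xyz\E", "abc$xyz"),
    (r"\Qabc\$xyz\E", "abc\\$xyz"),
    (r"\Qabc\E\$\Qxyz\E", "abc$xyz"),
    (r"\QA\B\E", "A\\B"),
    (r"\Q\\E", "\\"),
]
results = [re.fullmatch(expand_quoted(p), s) is not None for p, s in rows]
print(results)
```

The sketch ignores character classes and quantifiers, so it does not reproduce the class and quantifier behaviour described for \Q and \E.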
A second use of backslash provides a way of encoding non-printing characters in patterns in a visible manner. There is no restriction on the appearance of non-printing characters, apart from the binary zero that terminates a pattern, but when a pattern is being prepared by text editing, it is usually easier to use one of the following escape sequences than the binary character it represents:
\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is a non-control ASCII character
\e escape (hex 1B)
\f form feed (hex 0C)
\n linefeed (hex 0A)
\r carriage return (hex 0D) (but see below)
\t tab (hex 09)
\0dd character with octal code 0dd
\ddd character with octal code ddd, or back reference
\o{ddd..} character with octal code ddd..
\xhh character with hex code hh
\x{hhh..} character with hex code hhh..
The precise effect of "\cx" is as follows: if "x" is a lower case letter, it is converted to upper case. Then bit 6 of the character (hex 40) is inverted. Thus "\cz" becomes hex 1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.
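The arithmetic behind "\cx" is easy to check by hand. The following minimal Python sketch does so; the function name control_char is hypothetical, and it assumes an ASCII input character.

```python
def control_char(x):
    # Hypothetical helper mirroring the rule above: convert a lower case
    # letter to upper case, then invert bit 6 (hex 40) of the character code.
    if "a" <= x <= "z":
        x = x.upper()
    return ord(x) ^ 0x40

for ch in "z{;":
    print(repr(ch), hex(control_char(ch)))
# 'z' 0x1a
# '{' 0x3b
# ';' 0x7b
```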
After "\x", up to two hexadecimal digits are read (letters can be in upper or lower case).
After "\0" up to two further octal digits are read. In both cases, if there are fewer than two digits, just those that are present are used. Thus the sequence "\0\x\07" specifies two binary zeros followed by a BEL character. Make sure you supply two digits after the initial zero if the character that follows is itself an octal digit.
The handling of a backslash followed by a digit other than 0 is complicated. Outside a character class, PCRE2 reads the digit and any following digits as a decimal number. If the number is less than 10, begins with the digit 8 or 9, or if there have been at least that many previous capturing left parentheses in the expression, the entire sequence is taken as a back reference. A description of how this works is given later, following the discussion of parenthesized subpatterns.
Inside a character class, or if the decimal number is greater than 9, does not begin with 8 or 9, and there have not been that many capturing subpatterns, PCRE2 re-reads up to three octal digits following the backslash, and uses them to generate a data character. Any subsequent digits stand for themselves. For example:
\040 is another way of writing a space
\40 is the same, provided there are fewer than 40
previous capturing subpatterns
\7 is always a back reference
\11 might be a back reference, or another way of
writing a tab
\011 is always a tab
\0113 is a tab followed by the character "3"
\113 might be a back reference, otherwise the
character with octal code 113
\377 might be a back reference, otherwise the
value 255 (decimal)
\81 is always a back reference
Note that octal values of 100 or greater must not be introduced by a leading zero, because no more than three octal digits are ever read.
All the sequences that define a single byte value can be used both inside and outside character classes. In addition, inside a character class, the sequence "\b" is interpreted as the backspace character (hex 08). Outside a character class it has a different meaning (see below).
The third use of backslash is for specifying generic character types:
\d any decimal digit
\D any character that is not a decimal digit
\h any horizontal white space character
\H any character that is not a horizontal white space character
\N any character that is not a newline
\s any white space character
\S any character that is not a white space character
\v any vertical white space character
\V any character that is not a vertical white space character
\w any "word" character
\W any "non-word" character
The \N escape sequence has the same meaning as the "." metacharacter when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the meaning of \N. Note that when \N is followed by an opening brace it has a different meaning. Perl uses \N{name} to specify characters by Unicode name; PCRE2 does not support this.
Each pair of escape sequences partitions the complete set of characters into two disjoint sets. Any given character matches one, and only one, of each pair.
A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE2's character tables, and may vary if locale-specific matching is taking place (see "Locale support" above). For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.
These character type sequences can appear both inside and outside character classes. They each match one character of the appropriate type. If the current matching point is at the end of the subject string, all of them fail, since there is no character to match.
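The partition property can be checked mechanically. The sketch below uses Python's built-in re module as a stand-in (an assumption: it is not PCRE2, and it lacks \h, \v and \N, so only the three pairs it shares with PCRE2 are tested). ASCII matching is assumed, and the function name partition_holds is hypothetical.

```python
import re

# Complementary escape pairs that Python's re also understands.
PAIRS = [(r"\d", r"\D"), (r"\s", r"\S"), (r"\w", r"\W")]

def partition_holds(lower, upper):
    # True if every ASCII character matches exactly one escape of the pair.
    for code in range(128):
        ch = chr(code)
        hits = [re.fullmatch(p, ch, re.ASCII) is not None for p in (lower, upper)]
        if hits.count(True) != 1:
            return False
    return True

print([partition_holds(lo, up) for lo, up in PAIRS])  # [True, True, True]

# At the end of the subject there is no character left, so every type fails.
at_end = [re.match(p, "") for pair in PAIRS for p in pair]
print(all(m is None for m in at_end))  # True
```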
The fourth use of backslash is for certain simple assertions. An assertion specifies a condition that has to be met at a particular point in a match, without consuming any characters from the subject string. The use of subpatterns for more complicated assertions is described below. The backslashed assertions are
\b word boundary
\B not a word boundary
\A start of subject (same as "^" in tf)
\Z end of subject (same as "$" in tf)
\z end of subject (same as "$" in tf)
These assertions may not appear in character classes (but note that "\b" has a different meaning, namely the backspace character, inside a character class).
A word boundary is a position in the subject string where the current character and the previous character do not both match \w or \W (i.e. one matches \w and the other matches \W), or the start or end of the string if the first or last character matches \w, respectively.
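The definition of a word boundary can be written out directly and compared with what a regex engine reports. The sketch below uses Python's built-in re module as a stand-in for PCRE2 (an assumption) and restricts \w to ASCII. The helper names are hypothetical.

```python
import re

def is_word_char(ch):
    # ASCII-only notion of \w, assumed here for simplicity.
    return re.fullmatch(r"\w", ch, re.ASCII) is not None

def boundaries_by_definition(subject):
    # A position is a boundary when exactly one of its two neighbours is a
    # word character; a missing neighbour (start or end) counts as non-word.
    found = []
    for i in range(len(subject) + 1):
        before = i > 0 and is_word_char(subject[i - 1])
        after = i < len(subject) and is_word_char(subject[i])
        if before != after:
            found.append(i)
    return found

subject = "hi, you"
by_definition = boundaries_by_definition(subject)
by_regex = [m.start() for m in re.finditer(r"\b", subject, re.ASCII)]
print(by_definition, by_regex)  # [0, 2, 4, 7] [0, 2, 4, 7]
```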
Outside a character class, in the default matching mode, the circumflex character is an assertion which is true only if the current matching point is at the start of the subject string. Inside a character class, circumflex has an entirely different meaning (see below).
Circumflex need not be the first character of the pattern if a number of alternatives are involved, but it should be the first thing in each alternative in which it appears if the pattern is ever to match that branch. If all possible alternatives start with a circumflex, that is, if the pattern is constrained to match only at the start of the subject, it is said to be an "anchored" pattern. (There are also other constructs that can cause a pattern to be anchored.)
A dollar character is an assertion which is true only if the current matching point is at the end of the subject string. Dollar need not be the last character of the pattern if a number of alternatives are involved, but it should be the last item in any branch in which it appears. Dollar has no special meaning in a character class.
An opening square bracket introduces a character class, terminated by a closing square bracket. A closing square bracket on its own is not special. If a closing square bracket is required as a member of the class, it should be the first data character in the class (after an initial circumflex, if present) or escaped with a backslash.
A character class matches a single character in the subject; the character must be in the set of characters defined by the class, unless the first character in the class is a circumflex, in which case the subject character must not be in the set defined by the class. If a circumflex is actually required as a member of the class, ensure it is not the first character, or escape it with a backslash.
For example, the character class [aeiou] matches any lower case vowel, while [^aeiou] matches any character that is not a lower case vowel. Note that a circumflex is just a convenient notation for specifying the characters which are in the class by enumerating those that are not. It is not an assertion: it still consumes a character from the subject string, and fails if the current pointer is at the end of the string.
When caseless matching is set, any letters in a class represent both their upper case and lower case versions, so for example, a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a caseful version would.
The minus (hyphen) character can be used to specify a range of characters in a character class. For example, [d-m] matches any letter between d and m, inclusive. If a minus character is required in a class, it must be escaped with a backslash or appear in a position where it cannot be interpreted as indicating a range, typically as the first or last character in the class.
It is not possible to have the literal character "]" as the end character of a range. A pattern such as [W-]46] is interpreted as a class of two characters ("W" and "-") followed by a literal string "46]", so it would match "W46]" or "-46]". However, if the "]" is escaped with a backslash it is interpreted as the end of range, so [W-\]46] is interpreted as a single class containing a range followed by two separate characters. The octal or hexadecimal representation of "]" can also be used to end a range.
Ranges operate in ASCII collating sequence. They can also be used for characters specified numerically, for example [\000-\037]. If a range that includes letters is used when caseless matching is set, it matches the letters in either case. For example, [W-c] is equivalent to [][\\^_`wxyzabc], matched caselessly, and if character tables for the "fr" locale are in use, [\xc8-\xcb] matches accented E characters in both cases.
The character types \d, \D, \s, \S, \w, and \W may also appear in a character class, and add the characters that they match to the class. For example, [\dABCDEF] matches any hexadecimal digit. A circumflex can conveniently be used with the upper case character types to specify a more restricted set of characters than the matching lower case type. For example, the class [^\W_] matches any letter or digit, but not underscore.
All non-alphameric characters other than \, -, ^ (at the start) and the terminating ] are non-special in character classes, but it does no harm if they are escaped.
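A few of the classes discussed above can be tried out quickly. The sketch below uses Python's built-in re module as a stand-in for PCRE2 (an assumption; for these particular classes the two are expected to agree). Matching here is caseful, so tf's caseless default does not apply. The helper name matching_chars is hypothetical.

```python
import re

def matching_chars(char_class, candidates):
    # Return the candidate characters that the class matches.
    return [c for c in candidates if re.fullmatch(char_class, c)]

print(matching_chars(r"[aeiou]", "aAeZ"))      # ['a', 'e']
print(matching_chars(r"[^aeiou]", "aAeZ"))     # ['A', 'Z']
print(matching_chars(r"[\dABCDEF]", "7aFG"))   # ['7', 'F']
print(matching_chars(r"[^\W_]", "a_9-"))       # ['a', '9']

# An escaped "]" can end a range: W through ], plus "4" and "6".
print(matching_chars(r"[W-\]46]", "WZ]4-5"))   # ['W', 'Z', ']', '4']

# Unescaped, the class ends early: [W-] followed by the literal "46]".
print(re.fullmatch(r"[W-]46]", "-46]") is not None)  # True
```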
Vertical bar characters are used to separate alternative patterns. For example, the pattern
gilbert|sullivan
matches either "gilbert" or "sullivan". Any number of alternatives may appear, and an empty alternative is permitted (matching the empty string). The matching process tries each alternative in turn, from left to right, and the first one that succeeds is used. If the alternatives are within a subpattern (defined below), "succeeds" means matching the rest of the main pattern as well as the alternative in the subpattern.
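The left-to-right rule matters when one alternative is a prefix of another. The following sketch uses Python's built-in re module as a stand-in for PCRE2 (an assumption; the two engines try alternatives in the same order). The patterns other than gilbert|sullivan are made up for illustration.

```python
import re

# The first alternative that succeeds wins, even if a later one is longer.
first = re.search(r"gil|gilbert", "gilbert").group()
second = re.search(r"gilbert|gil", "gilbert").group()
print(first, second)  # gil gilbert

# An empty alternative matches the empty string.
empty_ok = re.fullmatch(r"gilbert|sullivan|", "") is not None
print(empty_ok)  # True

# Inside a subpattern, "succeeds" includes the rest of the pattern:
# "gil" is tried first, but "s" cannot follow it, so "gilbert" is used.
inner = re.fullmatch(r"(gil|gilbert)s", "gilberts").group(1)
print(inner)  # gilbert
```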
The settings of several options can be changed within a pattern by a sequence of letters enclosed between "(?" and ")". The following are Perl-compatible, and are described in detail in the pcre2api documentation. The option letters are:
i for PCRE2_CASELESS
m for PCRE2_MULTILINE
n for PCRE2_NO_AUTO_CAPTURE
s for PCRE2_DOTALL
x for PCRE2_EXTENDED
xx for PCRE2_EXTENDED_MORE
For example, (?im) sets caseless, multiline matching. It is also possible to unset these options by preceding the relevant letters with a hyphen, for example (?-im). The two "extended" options are not independent; unsetting either one cancels the effects of both of them.
A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also permitted. Only one hyphen may appear in the options string. If a letter appears both before and after the hyphen, the option is unset. An empty options setting "(?)" is allowed. Needless to say, it has no effect.
If the first character following (? is a circumflex, it causes all of the above options to be unset. Letters may follow the circumflex to cause some options to be re-instated, but a hyphen may not appear.
Some PCRE2-specific options can be changed by the same mechanism using these pairs or individual letters:
aD for PCRE2_EXTRA_ASCII_BSD
aS for PCRE2_EXTRA_ASCII_BSS
aW for PCRE2_EXTRA_ASCII_BSW
aP for PCRE2_EXTRA_ASCII_POSIX and PCRE2_EXTRA_ASCII_DIGIT
aT for PCRE2_EXTRA_ASCII_DIGIT
r for PCRE2_EXTRA_CASELESS_RESTRICT
J for PCRE2_DUPNAMES
U for PCRE2_UNGREEDY
However, except for 'r', these are not unset by (?^), which is equivalent to (?-imnrsx). If 'a' is not followed by any of the upper case letters shown above, it sets (or unsets) all the ASCII options.
Such "top level" settings apply to the whole pattern (unless there are other changes inside subpatterns). If there is more than one setting of the same option at top level, the rightmost setting is used.
If an option change occurs inside a subpattern, the effect is different. An option change inside a subpattern affects only that part of the subpattern that follows it, so
(a(?-i)b)c
matches abc, Abc, abC and AbC, and no other strings (remember, in tf, regexps are caseless by default if they do not contain any capital letters). By this means, options can be made to have different settings in different parts of the pattern. Any changes made in one alternative do carry on into subsequent branches within the same subpattern. For example,
X(a(?i)b|c)
matches "Xab", "XaB", "Xc", and "XC", even though when matching "C" the first branch is abandoned before the option setting. This is because the effects of option settings happen at compile time. There would be some very weird behaviour otherwise.
Groups are delimited by parentheses (round brackets), which can be nested.
Turning part of a pattern into a group does two things:
1. It localizes a set of alternatives. For example, the pattern
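cat(aract|erpillar|)
matches "cataract", "caterpillar", or "cat". Without the parentheses, it would match "cataract", "erpillar" or an empty string.
2. It creates a capture group: when the whole pattern matches, the portion of the subject string that matched the group is recorded separately from the portion that matched the whole pattern.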
Opening parentheses are counted from left to right (starting from 1) to obtain numbers for capture groups. For example, if the string "the red king" is matched against the pattern
the ((red|white) (king|queen))
the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3, respectively.
The fact that plain parentheses fulfil two functions is not always helpful. There are often times when grouping is required without capturing. If an opening parenthesis is followed by a question mark and a colon, the group does not do any capturing, and is not counted when computing the number of any subsequent capture groups. For example, if the string "the white queen" is matched against the pattern
the ((?:red|white) (king|queen))
the captured substrings are "white queen" and "queen", and are numbered 1 and 2. The maximum number of capture groups is 65535.
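The numbering of capturing and non-capturing groups can be observed directly. The sketch below uses Python's built-in re module as a stand-in for PCRE2 (an assumption; group numbering works the same way in both). It shows the two example patterns side by side.

```python
import re

capturing = re.search(r"the ((red|white) (king|queen))", "the red king")
non_capturing = re.search(r"the ((?:red|white) (king|queen))", "the white queen")

# groups() lists captured substrings in group-number order, starting at 1.
print(capturing.groups())      # ('red king', 'red', 'king')
print(non_capturing.groups())  # ('white queen', 'queen')
```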
As a convenient shorthand, if any option settings are required at the start of a non-capturing group, the option letters may appear between the "?" and the ":". Thus the two patterns
(?i:saturday|sunday)
(?:(?i)saturday|sunday)
match exactly the same set of strings. Because alternative branches are tried from left to right, and options are not reset until the end of the group is reached, an option setting in one branch does affect subsequent branches, so the above patterns match "SUNDAY" as well as "Saturday".
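The scoped form can be tried with Python's built-in re module, used here as a stand-in for PCRE2 (an assumption). Only the (?i: form is shown, because recent versions of Python's re do not accept an unscoped option change such as (?i) in the middle of a pattern. Matching is caseful unless an option says otherwise, so tf's caseless default does not apply. The pattern a(?i:b)c is made up for illustration.

```python
import re

# The option applies to every branch inside the group.
scoped = re.compile(r"(?i:saturday|sunday)")
print([bool(scoped.fullmatch(s)) for s in ("Saturday", "SUNDAY", "monday")])
# [True, True, False]

# Outside the group the option is not in effect.
mixed = re.compile(r"a(?i:b)c")
print([bool(mixed.fullmatch(s)) for s in ("abc", "aBc", "Abc")])
# [True, True, False]
```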
Perl 5.10 introduced a feature whereby each alternative in a group uses the same numbers for its capturing parentheses. Such a group starts with (?| and is itself a non-capturing group. For example, consider this pattern:
(?|(Sat)ur|(Sun))day
Because the two alternatives are inside a (?| group, both sets of capturing parentheses are numbered one. Thus, when the pattern matches, you can look at captured substring number one, whichever alternative matched. This construct is useful when you want to capture part, but not all, of one of a number of alternatives.
Inside a (?| group, parentheses are numbered as usual, but the number is reset at the start of each branch. The numbers of any capturing parentheses that follow the whole group start after the highest number used in any branch. The following example is taken from the Perl documentation. The numbers underneath show in which buffer the captured content will be stored.
# before ---------------branch-reset-----------  after
/ ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1            2         2  3        2     3     4
A backreference to a capture group uses the most recent value that is set for the group. The following pattern matches "abcabc" or "defdef":
/(?|(abc)|(def))\1/
In contrast, a subroutine call to a capture group always refers to the first one in the pattern with the given number. The following pattern matches "abcabc" or "defabc":
/(?|(abc)|(def))(?1)/
A relative reference such as (?-1) is no different: it is just a convenient way of computing an absolute group number.
If a condition test for a group's having matched refers to a non-unique number, the test is true if any group with that number has matched.
Repetition is specified by quantifiers, which may follow any one of these items:
a literal data character
the dot metacharacter
the \C escape sequence
the \R escape sequence
the \X escape sequence
any escape sequence that matches a single character
a character class
a backreference
a parenthesized group (including lookaround assertions)
a subroutine call (recursive or otherwise)
If a quantifier does not follow a repeatable item, an error occurs. The general repetition quantifier specifies a minimum and maximum number of permitted matches by giving two numbers in curly brackets (braces), separated by a comma. The numbers must be less than 65536, and the first must be less than or equal to the second. For example,
z{2,4}
matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special
character. If the second number is omitted, but the comma is present, there is
no upper limit; if the second number and the comma are both omitted, the
quantifier specifies an exact number of required matches. Thus
[aeiou]{3,}
matches at least 3 successive vowels, but may match many more, whereas
\d{8}
matches exactly 8 digits. If the first number is omitted, the lower limit is
taken as zero; in this case the upper limit must be present.
X{,4} is interpreted as X{0,4}
This is a change in behaviour that happened in Perl 5.34.0 and PCRE2 10.43. In
earlier versions such a sequence was not interpreted as a quantifier. Other
regular expression engines may behave either way.
If the characters that follow an opening brace do not match the syntax of a quantifier, the brace is taken as a literal character. In particular, this means that {,} is a literal string of three characters.
Note that not every opening brace is potentially the start of a quantifier because braces are used in other items such as \N{U+345} or \k{name}.
The quantifier {0} is permitted, causing the expression to behave as if the previous item and the quantifier were not present. This may be useful for capture groups that are referenced as subroutines from elsewhere in the pattern (but see also the section entitled "Defining capture groups for use by reference only" below). Except for parenthesized groups, items that have a {0} quantifier are omitted from the compiled pattern.
For convenience, the three most common quantifiers have single-character abbreviations:
* is equivalent to {0,}
+ is equivalent to {1,}
? is equivalent to {0,1}
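The quantifier forms above can be exercised with a few subjects. The sketch below uses Python's built-in re module as a stand-in for PCRE2 (an assumption). Python also reads X{,4} as X{0,4}, but it does not treat {,} as a literal string, so that case is left out. The helper name full is hypothetical.

```python
import re

def full(pattern, subject):
    # True if the pattern matches the whole subject.
    return re.fullmatch(pattern, subject) is not None

print([full(r"z{2,4}", "z" * n) for n in range(1, 6)])
# [False, True, True, True, False]
print(full(r"[aeiou]{3,}", "aeiouae"), full(r"[aeiou]{3,}", "ae"))  # True False
print(full(r"\d{8}", "12345678"), full(r"\d{8}", "1234567"))        # True False
print(full(r"X{,4}", ""), full(r"X{,4}", "XXXX"), full(r"X{,4}", "XXXXX"))
# True True False

# The single-character forms behave like their brace equivalents.
equivalents = [(r"a*", r"a{0,}"), (r"a+", r"a{1,}"), (r"a?", r"a{0,1}")]
same = all(
    full(short, "a" * n) == full(braced, "a" * n)
    for short, braced in equivalents
    for n in range(4)
)
print(same)  # True
```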
It is possible to construct infinite loops by following a group that can match
no characters with a quantifier that has no upper limit, for example:
(a?)*
Earlier versions of Perl and PCRE1 used to give an error at compile time for such patterns. However, because there are cases where this can be useful, such patterns are now accepted, but whenever an iteration of such a group matches no characters, matching moves on to the next item in the pattern instead of repeatedly matching an empty string. This does not prevent backtracking into any of the iterations if a subsequent item fails to match.
By default, quantifiers are "greedy", that is, they match as much as possible (up to the maximum number of permitted repetitions), without causing the rest of the pattern to fail. The classic example of where this gives problems is in trying to match comments in C programs. These appear between /* and */ and within the comment, individual * and / characters may appear. An attempt to match C comments by applying the pattern
/\*.*\*/
to the string
/* first comment */ not comment /* second comment */
fails, because it matches the entire string owing to the greediness of the .* item. However, if a quantifier is followed by a question mark, it ceases to be greedy, and instead matches the minimum number of times possible, so the pattern
/\*.*?\*/
does the right thing with C comments. The meaning of the various quantifiers is not otherwise changed, just the preferred number of matches. Do not confuse this use of question mark with its use as a quantifier in its own right. Because it has two uses, it can sometimes appear doubled, as in
\d??\d
which matches one digit by preference, but can match two if that is the only way the rest of the pattern matches.
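The difference between greedy and minimizing quantifiers shows up clearly on the C comment example. The sketch below uses Python's built-in re module as a stand-in for PCRE2 (an assumption; both engines treat a trailing question mark the same way).

```python
import re

source = "/* first comment */ not comment /* second comment */"

greedy = re.findall(r"/\*.*\*/", source)
lazy = re.findall(r"/\*.*?\*/", source)
print(greedy == [source])  # True: one match spanning the whole string
print(lazy)                # ['/* first comment */', '/* second comment */']

# \d??\d prefers one digit, but takes two when the rest of the match needs it.
preferred = re.match(r"\d??\d", "12").group()
forced = re.fullmatch(r"\d??\d", "12").group()
print(preferred, forced)   # 1 12
```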
If the PCRE2_UNGREEDY option is set (an option that is not available in Perl), the quantifiers are not greedy by default, but individual ones can be made greedy by following them with a question mark. In other words, it inverts the default behaviour.
When a parenthesized group is quantified with a minimum repeat count that is greater than 1 or with a limited maximum, more memory is required for the compiled pattern, in proportion to the size of the minimum or maximum.
If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option (equivalent to Perl's /s) is set, thus allowing the dot to match newlines, the pattern is implicitly anchored, because whatever follows will be tried against every character position in the subject string, so there is no point in retrying the overall match at any position after the first. PCRE2 normally treats such a pattern as though it were preceded by \A.
In cases where it is known that the subject string contains no newlines, it is worth setting PCRE2_DOTALL in order to obtain this optimization, or alternatively, using ^ to indicate anchoring explicitly.
However, there are some cases where the optimization cannot be used. When .* is inside capturing parentheses that are the subject of a backreference elsewhere in the pattern, a match at the start may fail where a later one succeeds. Consider, for example:
(.*)abc\1
If the subject is "xyz123abc123" the match point is the fourth character. For this reason, such a pattern is not implicitly anchored.
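The match point in this example can be confirmed with a short check. The sketch below uses Python's built-in re module as a stand-in for PCRE2 (an assumption). It shows only where the match starts and says nothing about PCRE2's internal optimization.

```python
import re

match = re.search(r"(.*)abc\1", "xyz123abc123")

# Offsets are zero-based, so start() == 3 is the fourth character.
print(match.start(), match.group(), match.group(1))  # 3 123abc123 123
```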
Another case where implicit anchoring is not applied is when the leading .* is inside an atomic group. Once again, a match at the start may fail where a later one succeeds. Consider this pattern:
(?>.*?a)b
It matches "ab" in the subject "aab". The use of the backtracking control verbs (*PRUNE) and (*SKIP) also disables this optimization. To disable it explicitly, either pass the compile option PCRE2_NO_DOTSTAR_ANCHOR, or call pcre2_set_optimize() with a PCRE2_DOTSTAR_ANCHOR_OFF directive.
When a capture group is repeated, the value captured is the substring that matched the final iteration. For example, after
(tweedle[dume]{3}\s*)+
has matched "tweedledum tweedledee" the value of the captured substring is
"tweedledee". However, if there are nested capture groups, the corresponding
captured values may have been set in previous iterations. For example, after
(a|(b))+
matches "aba" the value of the second captured substring is "b".
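Both cases can be reproduced in a few lines. The sketch below uses Python's built-in re module as a stand-in for PCRE2 (an assumption; for these two patterns the engines report the same captures).

```python
import re

repeated = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
nested = re.fullmatch(r"(a|(b))+", "aba")

# Group 1 holds what the final iteration matched.
print(repeated.group(1))  # tweedledee

# Group 2 was last set in the second iteration and keeps that value,
# although the final iteration matched "a" through the first branch.
print(nested.groups())    # ('a', 'b')
```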
An assertion is a test that does not consume any characters. The test must succeed for the match to continue. The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described above.
More complicated assertions are coded as parenthesized groups. If matching such a group succeeds, matching continues after it, but with the matching position in the subject string reset to what it was before the assertion was processed.
A special kind of assertion, called a "scan substring" assertion, matches a subpattern against a previously captured substring. This is described in the section entitled "Scan substring assertions" below. It is a PCRE2 extension, not compatible with Perl.
The other group-based assertions are of two kinds: those that look ahead of the current position in the subject string, and those that look behind it, and in each case an assertion may be positive (must match for the assertion to be true) or negative (must not match for the assertion to be true).
The Perl-compatible lookaround assertions are atomic. If an assertion is true, but there is a subsequent matching failure, there is no backtracking into the assertion. However, there are some cases where non-atomic assertions can be useful. PCRE2 has some support for these, described in the section entitled "Non-atomic assertions" below, but they are not Perl-compatible.
A lookaround assertion may appear as the condition in a conditional group (see below). In this case, the result of matching the assertion determines which branch of the condition is followed.
Assertion groups are not capture groups. If an assertion contains capture groups within it, these are counted for the purposes of numbering the capture groups in the whole pattern. Within each branch of an assertion, locally captured substrings may be referenced in the usual way. For example, a sequence such as (.)\g{-1} can be used to check that two adjacent characters are the same.
When a branch within an assertion fails to match, any substrings that were captured are discarded (as happens with any pattern branch that fails to match). A negative assertion is true only when all its branches fail to match; this means that no captured substrings are ever retained after a successful negative assertion. When an assertion contains a matching branch, what happens depends on the type of assertion.
For a positive assertion, internally captured substrings in the successful branch are retained, and matching continues with the next pattern item after the assertion. For a negative assertion, a matching branch means that the assertion is not true. If such an assertion is being used as a condition in a conditional group (see below), captured substrings are retained, because matching continues with the "no" branch of the condition. For other failing negative assertions, control passes to the previous backtracking point, thus discarding any captured strings within the assertion.
Most assertion groups may be repeated; though it makes no sense to assert the same thing several times, the side effect of capturing in positive assertions may occasionally be useful. However, an assertion that forms the condition for a conditional group may not be quantified. PCRE2 used to restrict the repetition of assertions, but from release 10.35 the only restriction is that an unlimited maximum repetition is changed to be one more than the minimum. For example, {3,} is treated as {3,4}.
Traditionally, symbolic sequences such as (?= and (?<= have been used to specify lookaround assertions. Perl 5.28 introduced some experimental alphabetic alternatives which might be easier to remember. They all start with (* instead of (? and must be written using lower case letters. PCRE2 supports the following synonyms:
(*positive_lookahead: or (*pla: is the same as (?=
(*negative_lookahead: or (*nla: is the same as (?!
(*positive_lookbehind: or (*plb: is the same as (?<=
(*negative_lookbehind: or (*nlb: is the same as (?<!
For example, (*pla:foo) is the same assertion as (?=foo). In the following sections, the various assertions are described using the original symbolic forms.
Lookahead assertions start with (?= for positive assertions and (?! for negative assertions. For example,
\w+(?=;)
matches a word followed by a semicolon, but does not include the semicolon in the match, and
foo(?!bar)
matches any occurrence of "foo" that is not followed by "bar". Note that the apparently similar pattern
(?!foo)bar
does not find an occurrence of "bar" that is preceded by something other than "foo"; it finds any occurrence of "bar" whatsoever, because the assertion (?!foo) is always true when the next three characters are "bar". A lookbehind assertion is needed to achieve the other effect.
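The difference between these patterns can be checked with a short script. This is a sketch using Python's built-in re module, which is assumed to behave like PCRE2 for these particular assertions; the subject strings are invented for illustration.

```python
import re

# \w+(?=;) : a word followed by a semicolon; the semicolon is not consumed.
word = re.search(r"\w+(?=;)", "int count;").group(0)

# foo(?!bar) : "foo" that is not followed by "bar".
foo_positions = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar : still matches the "bar" that follows "foo", because the
# assertion inspects the characters "bar" themselves, not what precedes them.
after_foo = re.search(r"(?!foo)bar", "foobar")

# (?<!foo)bar : the lookbehind form rejects a "bar" preceded by "foo".
lookbehind_hit = re.search(r"(?<!foo)bar", "foobar")

print(word, foo_positions, after_foo is not None, lookbehind_hit is None)
```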
If you want to force a matching failure at some point in a pattern, the most convenient way to do it is with (?!) because an empty string always matches, so an assertion that requires there not to be an empty string must always fail. The backtracking control verb (*FAIL) or (*F) is a synonym for (?!).
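A forced failure can be sketched like this, again using Python's re module as a stand-in for PCRE2 (an assumption; the (*FAIL) verb itself is not available there, so only the (?!) form is shown).

```python
import re

# The first alternative can never succeed because (?!) always fails,
# so only the second alternative can produce a match.
m = re.search(r"(?:foo(?!)|bar)", "foobar")
matched = m.group(0)

# A pattern whose only path runs through (?!) never matches at all.
never = re.search(r"foo(?!)", "foofoofoo")

print(matched, never)
```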
Lookbehind assertions start with (?<= for positive assertions and (?<! for negative assertions. For example,
(?<!foo)bar
does find an occurrence of "bar" that is not preceded by "foo". The contents of a lookbehind assertion are restricted such that there must be a known maximum to the lengths of all the strings it matches. There are two cases:
If every top-level alternative matches a fixed length, for example
(?<=colour|color)
there is a limit of 65535 characters to the lengths, which do not have to be the same, as this example demonstrates. This is the only kind of lookbehind supported by PCRE2 versions earlier than 10.43 and by the alternative matching function pcre2_dfa_match().
In PCRE2 10.43 and later, pcre2_match() supports lookbehind assertions in which one or more top-level alternatives can match more than one string length, for example
(?<=colou?r)
The maximum matching length for any branch of the lookbehind is limited to a value set by the calling program (default 255 characters). Unlimited repetition (for example \d*) is not supported. In some cases, the escape sequence \K (see above) can be used instead of a lookbehind assertion at the start of a pattern to get round the length limit restriction.
In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which matches a single code unit even in a UTF mode) to appear in lookbehind assertions, because it makes it impossible to calculate the length of the lookbehind. The \X and \R escapes, which can match different numbers of code units, are never permitted in lookbehinds.
"Subroutine" calls such as (?2) or (?&X) are permitted in lookbehinds, as long as the called capture group matches a limited-length string. However, recursion, that is, a "subroutine" call into a group that is already active, is not supported.
PCRE2 supports backreferences in lookbehinds, but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option must not be set, there must be no use of (?| in the pattern (it creates duplicate group numbers), and if the backreference is by name, the name must be unique. Of course, the referenced group must itself match a limited length substring. The following pattern matches words containing at least two characters that begin and end with the same character:
\b(\w)\w++(?<=\1)
Possessive quantifiers can be used in conjunction with lookbehind assertions to specify efficient matching at the end of subject strings. Consider a simple pattern such as
abcd$
when applied to a long string that does not match. Because matching proceeds from left to right, PCRE2 will look for each "a" in the subject and then see if what follows matches the rest of the pattern. If the pattern is specified as
^.*abcd$
the initial .* matches the entire string at first, but when this fails (because there is no following "a"), it backtracks to match all but the last character, then all but the last two characters, and so on. Once again the search for "a" covers the entire string, from right to left, so we are no better off. However, if the pattern is written as
^.*+(?<=abcd)
there can be no backtracking for the .*+ item because of the possessive quantifier; it can match only the entire string. The subsequent lookbehind assertion does a single test on the last four characters. If it fails, the match fails immediately. For long strings, this approach makes a significant difference to the processing time.
Several assertions (of any sort) may occur in succession. For example,
(?<=\d{3})(?<!999)foo
matches "foo" preceded by three digits that are not "999". Notice that each of the assertions is applied independently at the same point in the subject string. First there is a check that the previous three characters are all digits, and then there is a check that the same three characters are not "999". This pattern does not match "foo" preceded by six characters, the first three of which are digits and the last three of which are not "999". For example, it doesn't match "123abcfoo". A pattern to do that is
(?<=\d{3}...)(?<!999)foo
This time the first assertion looks at the preceding six characters, checking that the first three are digits, and then the second assertion checks that the preceding three characters are not "999".
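The two patterns can be compared side by side. The following sketch uses Python's re module, which is assumed to agree with PCRE2 for these fixed-length lookbehinds; the variable names and subjects are illustrative only.

```python
import re

three_digits_not_999 = re.compile(r"(?<=\d{3})(?<!999)foo")
six_chars = re.compile(r"(?<=\d{3}...)(?<!999)foo")

# For each subject, record whether each pattern finds a match.
results = {
    subject: (bool(three_digits_not_999.search(subject)),
              bool(six_chars.search(subject)))
    for subject in ("123foo", "999foo", "123abcfoo", "123999foo")
}

for subject, outcome in results.items():
    print(subject, outcome)
```

Only "123foo" satisfies the first pattern and only "123abcfoo" satisfies the second, because in each case both assertions are tested at the same point, immediately before "foo".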
Assertions can be nested in any combination. For example,
(?<=(?<!foo)bar)baz
matches an occurrence of "baz" that is preceded by "bar" which in turn is not preceded by "foo", while
(?<=\d{3}(?!999)...)foo
is another pattern that matches "foo" preceded by three digits and any three characters that are not "999".
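Both nested forms can be exercised with a short script. This is a sketch using Python's re module, on the assumption that it handles these nested fixed-length assertions in the same way as PCRE2; the subjects are invented.

```python
import re

nested_behind = re.compile(r"(?<=(?<!foo)bar)baz")
ahead_in_behind = re.compile(r"(?<=\d{3}(?!999)...)foo")

bar_without_foo = bool(nested_behind.search("xyzbarbaz"))
bar_after_foo = bool(nested_behind.search("foobarbaz"))
digits_then_abc = bool(ahead_in_behind.search("123abcfoo"))
digits_then_999 = bool(ahead_in_behind.search("123999foo"))

print(bar_without_foo, bar_after_foo, digits_then_abc, digits_then_999)
```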
It is possible to cause the matching process to obey a pattern fragment conditionally or to choose between two alternative fragments, depending on the result of an assertion, or whether a specific capture group has already been matched. The two possible forms of conditional group are:
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
If the condition is satisfied, the yes-pattern is used; otherwise the no-pattern (if present) is used. An absent no-pattern is equivalent to an empty string (it always matches). If there are more than two alternatives in the group, a compile-time error occurs. Each of the two alternatives may itself contain nested groups of any form, including conditional groups; the restriction to two alternatives applies only at the level of the condition itself. This pattern fragment is an example where the alternatives are complex:
(?(1) (A|B|C) | (D | (?(2)E|F) | E) )
There are five kinds of condition: references to capture groups, references to recursion, two pseudo-conditions called DEFINE and VERSION, and assertions.
If the text between the parentheses consists of a sequence of digits, the condition is true if a capture group of that number has previously matched. If there is more than one capture group with the same number (see the earlier section about duplicate group numbers), the condition is true if any of them have matched. An alternative notation, which is a PCRE2 extension, not supported by Perl, is to precede the digits with a plus or minus sign. In this case, the group number is relative rather than absolute. The most recently opened capture group (which could be enclosing this condition) can be referenced by (?(-1), the next most recent by (?(-2), and so on. Inside loops it can also make sense to refer to subsequent groups. The next capture group to be opened can be referenced as (?(+1), and so on. The value zero in any of these forms is not used; it provokes a compile-time error.
Consider the following pattern, which contains non-significant white space to make it more readable (assume the PCRE2_EXTENDED option) and to divide it into three parts for ease of discussion:
( \( )? [^()]+ (?(1) \) )
The first part matches an optional opening parenthesis, and if that character is present, sets it as the first captured substring. The second part matches one or more characters that are not parentheses. The third part is a conditional group that tests whether or not the first capture group matched. If it did, that is, if the subject started with an opening parenthesis, the condition is true, and so the yes-pattern is executed and a closing parenthesis is required. Otherwise, since no-pattern is not present, the conditional group matches nothing. In other words, this pattern matches a sequence of non-parentheses, optionally enclosed in parentheses.
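The behaviour of this conditional group can be tried out as follows. The sketch uses Python's re module, whose (?(1)...) syntax is assumed to match PCRE2 for this pattern; re.VERBOSE stands in for PCRE2_EXTENDED, and fullmatch() is used so that the whole subject must be consumed.

```python
import re

# Same pattern as above; white space is ignored because of re.VERBOSE.
pattern = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

checks = {
    subject: bool(pattern.fullmatch(subject))
    for subject in ("abc", "(abc)", "(abc", "abc)")
}

for subject, matched in checks.items():
    print(repr(subject), matched)
```

The unbalanced subjects fail because a closing parenthesis is required exactly when the opening one was captured.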
If you were embedding this pattern in a larger one, you could use a relative reference:
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
This makes the fragment independent of the parentheses in the larger pattern.
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used capture group by name. For compatibility with earlier versions of PCRE1, which had this facility before Perl, the syntax (?(name)...) is also recognized. Note, however, that undelimited names consisting of the letter R followed by digits are ambiguous (see the following section). Rewriting the above example to use a named group gives this:
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
If the name used in a condition of this kind is a duplicate, the test is applied to all groups of the same name, and is true if any one of them has matched.
"Recursion" in this sense refers to any subroutine-like call from one part of the pattern to another, whether or not it is actually recursive. See the sections entitled "Recursive patterns" and "Groups as subroutines" below for details of recursion and subroutine calls.
If a condition is the string (R), and there is no capture group with the name R, the condition is true if matching is currently in a recursion or subroutine call to the whole pattern or any capture group. If digits follow the letter R, and there is no group with that name, the condition is true if the most recent call is into a group with the given number, which must exist somewhere in the overall pattern. This is a contrived example that is equivalent to a+b:
((?(R1)a+|(?1)b))
However, in both cases, if there is a capture group with a matching name, the condition tests for its being set, as described in the section above, instead of testing for recursion. For example, creating a group with the name R1 by adding (?<R1>) to the above pattern completely changes its meaning.
If a name preceded by ampersand follows the letter R, for example:
(?(R&name)...)
the condition is true if the most recent recursion is into a group of that name (which must exist within the pattern).
This condition does not check the entire recursion stack. It tests only the current level. If the name used in a condition of this kind is a duplicate, the test is applied to all groups of the same name, and is true if any one of them is the most recent recursion.
At "top level", all these recursion test conditions are false.
If the condition is the string (DEFINE), the condition is always false, even if there is a group with the name DEFINE. In this case, there may be only one alternative in the rest of the conditional group. It is always skipped if control reaches this point in the pattern; the idea of DEFINE is that it can be used to define subroutines that can be referenced from elsewhere. (The use of subroutines is described below.) For example, a pattern to match an IPv4 address such as "192.168.23.245" could be written like this (ignore white space and line breaks):
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
\b (?&byte) (\.(?&byte)){3} \b
The first part of the pattern is a DEFINE group inside which another group named "byte" is defined. This matches an individual component of an IPv4 address (a number less than 256). When matching takes place, this part of the pattern is skipped because DEFINE acts like a false condition. The rest of the pattern uses references to the named group to match the four dot-separated components of an IPv4 address, insisting on a word boundary at each end.
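Python's built-in re module has neither DEFINE nor subroutine calls, so the sketch below does not use the PCRE2 syntax. Instead it approximates the same idea by composing the pattern from a reusable string fragment; the names BYTE and ipv4 are hypothetical and exist only for this illustration.

```python
import re

# Stand-in for the group named "byte" inside (?(DEFINE)...).
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Stand-in for:  \b (?&byte) (\.(?&byte)){3} \b
ipv4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")

ok = bool(ipv4.fullmatch("192.168.23.245"))
too_big = bool(ipv4.fullmatch("192.168.23.256"))
too_short = bool(ipv4.fullmatch("192.168.23"))

print(ok, too_big, too_short)
```

As with the DEFINE version, the fragment is written once and used four times, which keeps the component definition in a single place.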
Programs that link with a PCRE2 library can check the version by calling pcre2_config() with appropriate arguments. Users of applications that do not have access to the underlying code cannot do this. A special "condition" called VERSION exists to allow such users to discover which version of PCRE2 they are dealing with by using this condition to match a string such as "yesno". VERSION must be followed either by "=" or ">=" and a version number. For example:
(?(VERSION>=10.4)yes|no)
This pattern matches "yes" if the PCRE2 version is greater than or equal to 10.4, or "no" otherwise. The fractional part of the version number may not contain more than two digits.
If the condition is not in any of the above formats, it must be a parenthesized assertion. This may be a positive or negative lookahead or lookbehind assertion. However, it must be a traditional atomic assertion, not one of the non-atomic assertions.
Consider this pattern, again containing non-significant white space, and with the two alternatives on the second line:
(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
The condition is a positive lookahead assertion that matches an optional sequence of non-letters followed by a letter. In other words, it tests for the presence of at least one letter in the subject. If a letter is found, the subject is matched against the first alternative; otherwise it is matched against the second. This pattern matches strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
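Python's re module does not accept an assertion as a condition, so the following sketch is an approximation rather than the PCRE2 syntax. It assumes that, for this particular example, the conditional group can be emulated by guarding the first alternative with the positive lookahead and the second with its negation; the name HAS_LETTER is hypothetical.

```python
import re

HAS_LETTER = r"[^a-z]*[a-z]"

# Emulates (?(?=A) X | Y) as (?: (?=A) X | (?!A) Y ) for this example.
pattern = re.compile(
    rf"(?: (?={HAS_LETTER}) \d{{2}}-[a-z]{{3}}-\d{{2}}"
    rf"  | (?!{HAS_LETTER}) \d{{2}}-\d{{2}}-\d{{2}} )",
    re.VERBOSE,
)

with_letters = bool(pattern.match("12-abc-34"))
digits_only = bool(pattern.match("12-34-56"))
# A letter later in the subject selects the first alternative, which then fails.
wrong_branch = bool(pattern.match("12-34-56x"))

print(with_letters, digits_only, wrong_branch)
```

The last case shows that the condition chooses a branch once: when a letter is present, the all-digit form is never tried.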
When an assertion that is a condition contains capture groups, any capturing that occurs in a matching branch is retained afterwards, for both positive and negative assertions, because matching always continues after the assertion, whether it succeeds or fails. (Compare non-conditional assertions, for which captures are retained only for positive assertions that succeed.)
Consider the problem of matching a string in parentheses, allowing for unlimited nested parentheses. Without the use of recursion, the best that can be done is to use a pattern that matches up to some fixed depth of nesting. It is not possible to handle an arbitrary nesting depth.
For some time, Perl has provided a facility that allows regular expressions to recurse (amongst other things). It does this by interpolating Perl code in the expression at run time, and the code can refer to the expression itself. A Perl pattern using code interpolation to solve the parentheses problem can be created like this:
$re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
The (?p{...}) item interpolates Perl code at run time, and in this case refers recursively to the pattern in which it appears.
Obviously, PCRE2 cannot support the interpolation of Perl code. Instead, it supports special syntax for recursion of the entire pattern, and also for individual capture group recursion. After its introduction in PCRE1 and Python, this kind of recursion was subsequently introduced into Perl at release 5.10.
A special item that consists of (? followed by a number greater than zero and a closing parenthesis is a recursive subroutine call of the capture group of the given number, provided that it occurs inside that group. (If not, it is a non-recursive subroutine call, which is described in the next section.) The special item (?R) or (?0) is a recursive call of the entire regular expression.
This PCRE2 pattern solves the nested parentheses problem (assume the PCRE2_EXTENDED option is set so that white space is ignored):
\( ( [^()]++ | (?R) )* \)
First it matches an opening parenthesis. Then it matches any number of substrings which can either be a sequence of non-parentheses, or a recursive match of the pattern itself (that is, a correctly parenthesized substring). Finally there is a closing parenthesis. Note the use of a possessive quantifier to avoid backtracking into sequences of non-parentheses.
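Python's built-in re module has no (?R) item, so the structure of this pattern is sketched below as a hand-written recursive function instead of a regular expression. The function name match_balanced is hypothetical, and the function is anchored at a given position rather than searching the subject.

```python
def match_balanced(subject, pos=0):
    """Return the index just past a balanced group starting at pos, or None."""
    # \(  : an opening parenthesis is required.
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1
    # ( [^()]++ | (?R) )*  : any number of plain characters or nested groups.
    while pos < len(subject) and subject[pos] != ")":
        if subject[pos] == "(":
            end = match_balanced(subject, pos)  # plays the role of (?R)
            if end is None:
                return None
            pos = end
        else:
            pos += 1  # one character of a non-parenthesis run
    # \)  : a closing parenthesis is required.
    if pos < len(subject):
        return pos + 1
    return None


print(match_balanced("(ab(cd)ef)"), match_balanced("(aaa()"))
```

Like the possessive quantifier in the pattern, the loop never revisits characters it has already accepted, so a subject with a missing closing parenthesis is rejected in a single pass.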
If this were part of a larger pattern, you would not want to recurse the entire pattern, so instead you could use this:
( \( ( [^()]++ | (?1) )* \) )
We have put the pattern into parentheses, and caused the recursion to refer to them instead of the whole pattern.
In a larger pattern, keeping track of parenthesis numbers can be tricky. This is made easier by the use of relative references. Instead of (?1) in the pattern above you can write (?-2) to refer to the second most recently opened parentheses preceding the recursion. In other words, a negative number counts capturing parentheses leftwards from the point at which it is encountered.
Be aware, however, that if duplicate capture group numbers are in use, relative references refer to the earliest group with the appropriate number. Consider, for example:
(?|(a)|(b)) (c) (?-2)
The first two capture groups (a) and (b) are both numbered 1, and group (c) is number 2. When the reference (?-2) is encountered, the second most recently opened parentheses has the number 1, but it is the first such group (the (a) group) to which the recursion refers. This would be the same if an absolute reference (?1) was used. In other words, relative references are just a shorthand for computing a group number.
It is also possible to refer to subsequent capture groups, by writing references such as (?+2). However, these cannot be recursive because the reference is not inside the parentheses that are referenced. They are always non-recursive subroutine calls, as described in the next section.
An alternative approach is to use named parentheses. The Perl syntax for this is (?&name); PCRE1's earlier syntax (?P>name) is also supported. We could rewrite the above example as follows:
(?<pn> \( ( [^()]++ | (?&pn) )* \) )
If there is more than one group with the same name, the earliest one is used.
The example pattern that we have been looking at contains nested unlimited repeats, and so the use of a possessive quantifier for matching strings of non-parentheses is important when applying the pattern to strings that do not match. For example, when this pattern is applied to
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
it yields "no match" quickly. However, if a possessive quantifier is not used, the match runs for a very long time indeed because there are so many different ways the + and * repeats can carve up the subject, and all have to be tested before failure can be reported.
At the end of a match, the values of capturing parentheses are those from the outermost level. If you want to obtain intermediate values, a callout function can be used (see below and the pcre2callout documentation). If the pattern above is matched against
(ab(cd)ef)
the value for the inner capturing parentheses (numbered 2) is "ef", which is the last value taken on at the top level. If a capture group is not matched at the top level, its final captured value is unset, even if it was (temporarily) set at a deeper level during the matching process.
Do not confuse the (?R) item with the condition (R), which tests for recursion. Consider this pattern, which matches text in angle brackets, allowing for arbitrary nesting. Only digits are allowed in nested brackets (that is, when recursing), whereas any characters are permitted at the outer level.
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
In this pattern, (?(R) is the start of a conditional group, with two different alternatives for the recursive and non-recursive cases. The (?R) item is the actual recursive call.
Some former differences between PCRE2 and Perl no longer exist.
Before release 10.30, recursion processing in PCRE2 differed from Perl in that a recursive subroutine call was always treated as an atomic group. That is, once it had matched some of the subject string, it was never re-entered, even if it contained untried alternatives and there was a subsequent matching failure. (Historical note: PCRE implemented recursion before Perl did.)
Starting with release 10.30, recursive subroutine calls are no longer treated as atomic. That is, they can be re-entered to try unused alternatives if there is a matching failure later in the pattern. This is now compatible with the way Perl works. If you want a subroutine call to be atomic, you must explicitly enclose it in an atomic group.
Supporting backtracking into recursions simplifies certain types of recursive pattern. For example, this pattern matches palindromic strings:
^((.)(?1)\2|.?)$
The second branch in the group matches a single central character in the palindrome when there are an odd number of characters, or nothing when there are an even number of characters, but in order to work it has to be able to try the second case when the rest of the pattern match fails. If you want to match typical palindromic phrases, the pattern has to ignore all non-word characters, which can be done like this:
^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A man, a plan, a canal: Panama!". Note the use of the possessive quantifier *+ to avoid backtracking into sequences of non-word characters. Without this, PCRE2 takes a great deal longer (ten times or more) to match typical phrases, and Perl takes so long that you think it has gone into a loop.
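Because Python's re module cannot express recursive patterns, the logic of the two palindrome patterns is sketched here as plain recursive functions instead. The function names are hypothetical, and the phrase version only approximates \W and PCRE2_CASELESS by filtering and lowercasing ASCII text beforehand.

```python
def is_palindrome(text):
    # Second branch of the group: .? matches nothing or a single character.
    if len(text) <= 1:
        return True
    # First branch: (.)(?1)\2 requires equal outer characters around a
    # recursive match of the inner part.
    return text[0] == text[-1] and is_palindrome(text[1:-1])


def is_palindromic_phrase(text):
    # Rough counterpart of skipping \W sequences and matching caselessly.
    letters = "".join(ch for ch in text if ch.isalnum() or ch == "_").lower()
    return is_palindrome(letters)


print(is_palindrome("abcba"), is_palindromic_phrase("A man, a plan, a canal: Panama!"))
```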
Another way in which PCRE2 and Perl used to differ in their recursion processing is in the handling of captured values. Formerly in Perl, when a group was called recursively or as a subroutine (see the next section), it had no access to any values that were captured outside the recursion, whereas in PCRE2 these values can be referenced. Consider this pattern:
^(.)(\1|a(?2))
This pattern matches "bab". The first capturing parentheses match "b", then in the second group, when the backreference \1 fails to match "b", the second alternative matches "a" and then recurses. In the recursion, \1 does now match "b" and so the whole match succeeds. This match used to fail in Perl, but in later versions (I tried 5.024) it now works.
If the syntax for a recursive group call (either by number or by name) is used outside the parentheses to which it refers, it operates a bit like a subroutine in a programming language. More accurately, PCRE2 treats the referenced group as an independent subpattern which it tries to match at the current matching position. The called group may be defined before or after the reference. A numbered reference can be absolute or relative, as in these examples:
(...(absolute)...)...(?2)...
(...(relative)...)...(?-1)...
(...(?+1)...(relative)...
An earlier example pointed out that the pattern
(sens|respons)e and \1ibility
matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility". If instead the pattern
(sens|respons)e and (?1)ibility
is used, it does match "sense and responsibility" as well as the other two strings. Another example is given in the discussion of DEFINE above.
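The contrast between a backreference and a subroutine call can be sketched in Python. Its re module supports \1 but not (?1), so the subroutine call is approximated here by repeating the source text of the group, which is assumed to be equivalent for this simple pattern; the names GROUP, backref and subroutine_like are hypothetical.

```python
import re

GROUP = r"sens|respons"

# Backreference: the second occurrence must repeat the text actually captured.
backref = re.compile(rf"({GROUP})e and \1ibility")

# Subroutine-like: the group's pattern is matched again, independently.
subroutine_like = re.compile(rf"({GROUP})e and (?:{GROUP})ibility")

subjects = (
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
)
outcome = {
    s: (bool(backref.fullmatch(s)), bool(subroutine_like.fullmatch(s)))
    for s in subjects
}

for subject, flags in outcome.items():
    print(subject, flags)
```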
Like recursions, subroutine calls used to be treated as atomic, but this changed at PCRE2 release 10.30, so backtracking into subroutine calls can now occur. However, any capturing parentheses that are set during the subroutine call revert to their previous values afterwards.
Processing options such as case-independence are fixed when a group is defined, so if it is used as a subroutine, such options cannot be changed for different calls. For example, consider this pattern:
(abc)(?i:(?-1))
It matches "abcabc". It does not match "abcABC" because the change of processing option does not affect the called group.
The behaviour of backtracking control verbs in groups when called as subroutines is described in the section entitled "Backtracking verbs in subroutines" in the PCRE2 documentation.
Philip Hazel
Retired from University Computing Service
Cambridge, England.
Last updated: 27 November 2024
Copyright © 1997-2024 University of Cambridge.