In the regex tutorials and books I have read, these various points of syntax are introduced in stages. But
(?: ) looks a lot like (?= ), so that at some point they are bound to clash in the mind of the regex
apprentice. To facilitate study, I have pulled all the (? ) usages I know about into one place. I'll start
by pointing out three confusing couples; details of usage will follow.
Jumping Points
For easy navigation, here are some jumping points to various sections of the page:
Confusing Couples
Lookahead and Lookbehind: (?= ), (?! ), (?<= ), (?<! )
Non-Capturing Groups: (?: ) and (?is: )
Atomic Groups: (?> )
Named Capture: (?<foo> ) and (?P<foo> )
Inline Modifiers: (?isx-m)
Subroutines: (?1)
Recursion: (?R)
Conditionals: (?(A)B) and (?(A)B|C)
Pre-Defined Subroutines: (?(DEFINE)(?<foo> )(?<bar> )) and (?&foo)
Branch Reset: (?| )
Inline Comments: (?# )
http://www.rexegg.com/regex-disambiguation.html
1/19
10/11/2014
Confusing Couples
Confusing Couple #1: (?: ) and (?= )
These false twins have very different jobs. (?: ) contains a non-capturing group, while (?= ) is a
lookahead.
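To make the difference concrete, here is a minimal Python sketch (the sample string Bob44 is just an illustration). It shows that a non-capturing group consumes characters without storing a group, while a lookahead consumes nothing at all.

```python
import re

text = "Bob44"

# (?:Bob) is a non-capturing group: it consumes "Bob" but stores no group.
grouped = re.match(r"(?:Bob)\d\d", text)

# (?=Bob) is a lookahead: it only checks that "Bob" follows, consuming nothing.
peeked = re.match(r"(?=Bob)", text)

print(grouped.group(0), grouped.groups())  # whole match, and an empty tuple of groups
print(peeked.end())                        # the match position has not moved
```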
Confusing Couple #2: (?<= ) and (?> )
(?<= ) is a lookbehind, so (?> ) must be a lookahead, right? Not so. (?> ) contains an atomic
group. The actual lookahead marker is (?= ). More about all these guys below.
Confusing Couple #3: (?(1) ) and (?1)
This pair is delightfully confusing. The first is a conditional expression that tests whether Group 1 has
been captured. The second is a subroutine call that matches the sub-pattern contained within the
capturing parentheses of Group 1.
Now that these three "big ones" are out of the way, let's drill into the syntax.
Note that this pattern achieves the same result as \d+(?= dollars) from above, but it is less efficient
because \d+ is matched twice. A better use of looking ahead before matching characters is to validate
multiple conditions in a password.
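As a rough illustration of the password idea, here is a Python sketch that stacks several lookaheads at the start of the pattern. The policy itself (eight or more characters, one digit, one lower-case and one upper-case letter) and the name is_valid_password are assumptions made up for this example.

```python
import re

# Hypothetical policy: 8+ characters, with at least one digit,
# one lower-case letter and one upper-case letter.
# Each lookahead checks one condition from the start of the string
# without consuming anything; .{8,} then does the actual matching.
PASSWORD = re.compile(r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,}$")

def is_valid_password(candidate: str) -> bool:
    return PASSWORD.match(candidate) is not None

print(is_valid_password("Secret123"))
print(is_valid_password("secret123"))
```

Because every lookahead is anchored at the same position, the conditions can be listed in any order and new ones can be added without touching the rest.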
Negative Lookahead After the Match: \d+(?! dollars)
Sample Match: 100 in 100 pesos
Explanation: \d+ matches 100, then the negative lookahead (?! dollars) asserts that at that position in the
string, what immediately follows is not the characters "dollars"
Negative Lookahead Before the Match: (?!\d+ dollars)\d+
Sample Match: 100 in 100 pesos
Explanation: The negative lookahead (?!\d+ dollars) asserts that at the current position in the string,
what follows is not digits then the characters "dollars". If the assertion succeeds, the engine matches the
digits with \d+.
Note that on this sample string, this pattern achieves the same result as \d+(?! dollars) from above, but it
is less efficient because \d+ is matched twice. A better use of looking ahead before matching characters is to
validate multiple conditions in a password.
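Both negative lookahead patterns can be tried in Python with the short sketch below. The extra test string 100 dollars is an addition for illustration: there the two patterns behave differently, because \d+ in the first pattern can backtrack to 10, which is not followed by " dollars".

```python
import re

after = r"\d+(?! dollars)"      # lookahead after the match
before = r"(?!\d+ dollars)\d+"  # lookahead before the match

# On the sample string, both patterns match 100.
pesos_after = re.search(after, "100 pesos")
pesos_before = re.search(before, "100 pesos")

# On "100 dollars" the first pattern backtracks and settles for 10,
# while the second one finds no match at all.
dollars_after = re.search(after, "100 dollars")
dollars_before = re.search(before, "100 dollars")

print(pesos_after.group(0), pesos_before.group(0))
print(dollars_after.group(0), dollars_before)
```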
Lookbehind Before the Match: (?<=USD)\d{3}
Sample Match: 100 in USD100
Explanation: The lookbehind (?<=USD) asserts that at the current position in the string, what precedes
is the characters "USD". If the assertion succeeds, the engine matches three digits with \d{3}.
Lookbehind After the Match: \d{3}(?<=USD\d{3})
Sample Match: 100 in USD100
Explanation: \d{3} matches 100, then the lookbehind (?<=USD\d{3}) asserts that at that position in the
string, what immediately precedes is the characters "USD" then three digits.
Note that this pattern achieves the same result as (?<=USD)\d{3} from above, but it is less efficient
because \d{3} is matched twice.
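Here is a small Python sketch of both lookbehind placements. Python's re module only accepts fixed-width lookbehinds; both USD and USD\d{3} qualify.

```python
import re

# Lookbehind before the match: check for "USD", then match three digits.
before = re.search(r"(?<=USD)\d{3}", "USD100")

# Lookbehind after the match: match three digits, then look back over
# "USD" plus those same three digits.
after = re.search(r"\d{3}(?<=USD\d{3})", "USD100")

print(before.group(0), after.group(0))
```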
Negative Lookbehind Before the Match: (?<!USD)\d{3}
Sample Match: 100 in JPY100
Explanation: The negative lookbehind (?<!USD) asserts that at the current position in the string, what
precedes is not the characters "USD". If the assertion succeeds, the engine matches three digits with
\d{3}.
Negative Lookbehind After the Match: \d{3}(?<!USD\d{3})
Sample Match: 100 in JPY100
Explanation: \d{3} matches 100, then the negative lookbehind (?<!USD\d{3}) asserts that at that
position in the string, what immediately precedes is not the characters "USD" then three digits.
Note that this pattern achieves the same result as (?<!USD)\d{3} from above, but it is less efficient
because \d{3} is matched twice.
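The two negative lookbehind patterns can be checked the same way in Python. The USD100 string is added here as a counter-example that neither pattern should match.

```python
import re

before = r"(?<!USD)\d{3}"        # negative lookbehind before the match
after = r"\d{3}(?<!USD\d{3})"    # negative lookbehind after the match

jpy_before = re.search(before, "JPY100")
jpy_after = re.search(after, "JPY100")

# With a USD prefix, both assertions fail and there is no match.
usd_before = re.search(before, "USD100")
usd_after = re.search(after, "USD100")

print(jpy_before.group(0), jpy_after.group(0))
print(usd_before, usd_after)
```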
Likewise, you can capture the content of a non-capturing group by surrounding it with parentheses. For
instance, ((?:Bob|Chloe)\d\d) would capture "Chloe44".
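In Python, this can be verified with a short sketch; the surrounding sentence in the test string is just filler.

```python
import re

# The inner (?:Bob|Chloe) only groups the alternation; the outer
# parentheses are the single capture group.
m = re.search(r"((?:Bob|Chloe)\d\d)", "Hello Chloe44!")

print(m.group(1))
print(len(m.groups()))  # only one capture group exists
```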
Mode Modifiers within Non-Capture Groups
On all engines that support inline modifiers such as (?i) (in Python, from version 3.6 onward), you can blend the non-capture group syntax with mode modifiers. Here are some examples:
(?i:Bob|Chloe) This non-capturing group is case-insensitive.
(?ism:^BEGIN.*?END) This non-capturing group matches everything between "begin" and "end"
(case-insensitive), allowing such content to span multiple lines (the s modifier), starting at the beginning
of any line (the m modifier allows the ^ anchor to match the beginning of any line).
(?i-sm:^BEGIN.*?END) As above, but turns off the "s" and "m" modifiers
See below for more on inline modifiers.
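A minimal sketch of the first two examples in Python (this needs Python 3.6 or later, where scoped modifiers are accepted). The test strings are invented for illustration.

```python
import re

# Case-insensitivity applies only inside the group.
name = re.search(r"(?i:bob|chloe)\d\d", "CHLOE44")

# Outside the group, matching is case-sensitive again, so "x" does not match X.
scoped = re.search(r"(?i:bob)X", "BOBx")

# i, s and m together: case-insensitive, dot matches line breaks,
# and ^ matches at the start of any line.
block = re.search(r"(?ism:^BEGIN.*?END)", "intro\nbegin\nsome text\nend\noutro")

print(name.group(0))
print(scoped)
print(repr(block.group(0)))
```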
This will fail against ABC, whereas (?:A|.B)C would have succeeded. After matching the A in the atomic
group, the engine tries to match the C but fails. Because it is atomic, it is unable to try the .B part of the
alternation, which would also succeed, and allow the final token C to match.
Example 2: With Quantifier
(?>A+)[A-Z]C
This will fail against AAC, whereas (?:A+)[A-Z]C would have succeeded. After matching the AA in the
atomic group, the engine tries to match the [A-Z], succeeds by matching the C, then tries to match the
token C but fails as the end of the string has been reached. Because the group is atomic, it is unable to
give up the second A, which would allow the rest of the pattern to match.
If, before the atomic group, there were other options to which the engine can backtrack (such as
quantifiers or alternations), then the whole atomic group can be given up in one go.
When are Atomic Groups Important?
When a series of characters only makes sense as a block, using an atomic group can prevent needless
backtracking. This is explored in the section on possessive quantifiers. In such situations atomic
groups can be useful, but not necessarily mission-critical.
On the other hand, there are situations where atomic groups can save your pattern from disaster. They
are particularly useful:
In order to avoid the Lazy Trap with patterns that contain lazy quantifiers whose token can eat the
delimiter
To avoid certain forms of the Explosive Quantifier Trap
Supported Engines, and Workaround
Atomic groups are supported in most of the major engines: .NET, Perl, PCRE, Java and Ruby. For engines that
don't support atomic grouping syntax, such as JavaScript and Python before version 3.11, see the well-known
pseudo-atomic group workaround.
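One common form of that workaround wraps the sub-pattern in a capturing lookahead and then consumes the capture with a back-reference. The Python sketch below applies it to the A+ example from above; it assumes, as is the case in Python's re, that the engine never backtracks into a lookahead that has already succeeded.

```python
import re

# Plain non-capturing group: A+ can give back its second A, so AAC matches.
backtracking = re.search(r"(?:A+)[A-Z]C", "AAC")

# Pseudo-atomic version of (?>A+): the lookahead captures A+ into Group 1
# without consuming it, then \1 consumes exactly that text. Since the
# lookahead is not re-entered, the As cannot be given back.
pseudo_atomic = re.search(r"(?=(A+))\1[A-Z]C", "AAC")

print(backtracking.group(0))
print(pseudo_atomic)
```

The price of the workaround is an extra capture group, which shifts the numbering of any groups that follow it.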
Alternate Syntax: Possessive Quantifier
When an atomic group only contains a token with a quantifier, an alternate syntax (in engines that
support it) is a possessive quantifier, where a + is added to the quantifier. For instance,
(?>A+) is equivalent to A++
(?>A*) is equivalent to A*+
(?>A?) is equivalent to A?+
(?>A{m,n}) is equivalent to A{m,n}+
This works in Perl, PCRE, Java and Ruby 2+.
For more, see the possessive quantifiers section of the quantifiers page.
Non-Capturing
Atomic groups are non-capturing, though as with other non-capturing groups, you can place the group
inside another set of parentheses to capture the group's entire match; and you can place parentheses
inside the atomic group to capture a section of the match.
Watch out, as the atomic group syntax is confusingly similar to the lookbehind syntax (?<= ).
To create a back-reference to the intpart group in the pattern, depending on the engine, you'll use
\k<intpart> or (?P=intpart). To insert the named group in a replacement string, depending on the engine,
you'll either use ${intpart}, \g<intpart>, $+{intpart} or the group number \1. For the gory details, see
Naming Groups and referring back to them.
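In Python, which uses the (?P<name> ) flavor, this looks roughly as follows. The pattern and the second group name decpart are assumptions made up for this sketch; only intpart comes from the text above.

```python
import re

# Hypothetical pattern: a decimal number with two named groups.
NUMBER = re.compile(r"(?P<intpart>\d+)\.(?P<decpart>\d+)")

m = NUMBER.search("price: 12.75")
print(m.group("intpart"), m.group("decpart"))

# Back-reference inside the pattern: Python uses (?P=name).
repeated = re.search(r"(?P<intpart>\d+)\.(?P=intpart)", "12.12")

# Named group in a replacement string: Python uses \g<name>.
swapped = NUMBER.sub(r"\g<decpart>.\g<intpart>", "12.75")
print(repeated.group(0), swapped)
```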
To name, or not to name?
I'll admit that I don't use named groups a whole lot, but some people love them.
Sure, named captures are bulkier than a quick (capture) and reference to \1, but they can save hassles in
expressions that contain many groups.
Do they make your patterns easier to read? That's subjective. For my part, if the regex is short, I always
prefer numbered groups. And if it is long, I would rather read a regex with numbered groups and good
comments in free-spacing mode than a one-liner with named groups.
In Java, (?d) turns on "Unix lines" mode, which means that the dot and the anchors ^ and $ only
care about line break characters when they are line feeds \n.
Combining Non-Capture Group with Inline Modifiers
As we saw in the section on non-capture groups, you can blend mode modifiers into the non-capture
group syntax in all engines that support inline modifiers (in Python, from version 3.6 onward). For instance,
(?i:bob) is a non-capturing group with the case-insensitive flag turned on. It matches strings such as "bob"
and "boB".
But don't get carried away: you cannot blend inline modifiers with any random bit of regex syntax. For
instance, the following are all illegal: (?i=bob), (?iP<name>bob) and (?i>bob).
Using Inline Modifiers in the Middle of a Pattern
Usually, you'll use your inline modifiers at the start of the regex string to set the mode for the entire
pattern. However, changing modes in the middle of a pattern can be useful, so I'll give you two examples.
(\b[A-Z]+\b)(?i).*?\b\1\b
This ensures that an upper-case word is repeated somewhere in the string, in any letter-case. First we
capture an upper-case word to Group 1 (for instance DOG), then we set case-insensitive mode, then .*?
matches any characters up to the back-reference \1, which could be dog or dOg. As a neat variation,
(\b[A-Z]+\b).*?\b(?=[a-z]+\b)(?i)\1\b ensures that the back-reference is in lower-case.
^(\w+)\b.*\r?\n(?s).*?\b\1\b
This ensures that the first word of the string is repeated on a different line. First we capture a word to
Group 1, then we get to the end of the line with .*, match a line break, then set DOTALL mode, allowing
the .*? to match across lines, which brings us to our back-reference \1.
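Recent versions of Python (3.11 and later) reject a bare (?i) or (?s) in the middle of a pattern, so the sketch below is an adaptation rather than a literal transcription: it wraps the part that needs the mode in a scoped group, (?i: ) or (?s: ), which gives the same effect for these two examples. The test strings are invented.

```python
import re

# Upper-case word repeated later in any letter-case:
# only the back-reference is matched case-insensitively.
repeat_any_case = re.compile(r"(\b[A-Z]+\b).*?\b(?i:\1)\b")
m1 = repeat_any_case.search("DOG and a dog")

# First word repeated on a later line: only the lazy .*? is
# allowed to cross line breaks.
repeat_other_line = re.compile(r"^(\w+)\b.*\r?\n(?s:.*?)\b\1\b")
text = "apple pie\nbanana\nan apple"
m2 = repeat_other_line.search(text)

print(m1.group(0))
print(m2.group(1))
```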
will match Hey Ho. The parentheses in (\w+) not only capture Hey to Group 1, they also define
Subroutine 1, whose pattern is \w+. Later, (?1) is a call to subroutine 1. The entire regex is therefore
equivalent to (\w+) \w+.
Subroutines can make long expressions much easier to look at and far less prone to copy-paste errors.
Relative Subroutines
Instead of referring to a subroutine by its number, you can refer to the relative position of its defining
group, counting left or right from the current position in the pattern. For instance, (?-1) refers to the last
defined subroutine, and (?+1) refers to the next defined subroutine. Therefore,
(\w+) (?-1)
and
(?+1) (\w+)
are both equivalent to our first example with numbered group 1. In Ruby 2+, for relative subroutine calls,
you would use \g<-1> and \g<+1>.
Named Subroutines
Instead of using numbered groups, you can use named groups. In that case, in Perl and PHP the syntax
for the subroutine call will be (?&group_name). In Ruby 2+ the syntax is \g<some_word>. For instance,
(?<some_word>\w+) (?&some_word) is equivalent to our first example with numbered group 1.
Pre-Defined Subroutines
So far, when we defined our subroutines, we also matched something. For instance, (\w+) defines
subroutine 1 but also immediately matches some word characters. It so happens that Perl and PCRE have
terrific syntax that allows you to pre-define a subroutine without initially matching anything. This
syntax is extremely useful to build large, modular expressions. We will look at it in the corresponding
section: Pre-Defined Subroutines: (?(DEFINE)(?<foo> )(?<bar> ))
Subroutines and Recursion
If you place a subroutine such as (?1) within the very capture group to which it refers (Group 1 in this
case), then you have a recursive expression. For instance, the regex ^(A(?1)?Z)$ contains a recursive
sub-pattern, because the call (?1) to subroutine 1 is embedded in the parentheses that define Group 1.
If you try to trace the matching path of this regex in your mind, you will see that it matches strings like
AAAZZZ, strings which start with any number of letters A and end with letters Z that perfectly balance the
As. After you open the parenthesis, the A matches an A then the optional (?1)? opens another
parenthesis and tries to match an A and so on.
We'll look at recursion syntax in the next section. There is also a page dedicated to recursion.
Warning
Note that the (?1) syntax looks confusingly similar to the (?(1) ) syntax found in conditionals.
perfectly balanced by a number of letters Z at the end. The initial token A matches an A. Then the
optional (?R)? tries to repeat the whole pattern right there, and therefore attempts the token A to match an
A and so on.
Recursion of a Subroutine: (?1) and (?-1)
You also have recursion when a subroutine calls itself. For instance, in
^(A(?1)?Z)$ subroutine 1 (defined by the outer parentheses) contains a call to itself. This regex matches
entire strings such as AAAZZZ, where a number of letters A at the start are perfectly balanced by a
number of letters Z at the end.
As we saw in the section on subroutines, you can also call a subroutine by the relative position of its
defining group at the current position in the pattern. Therefore,
^(A(?-1)?Z)$ performs exactly like the above regex.
There is much more to be said about recursion. See the page dedicated to recursive regex patterns.
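Python's built-in re module has no recursion syntax, so the sketch below is not a regex at all: it is a plain recursive function that mirrors the structure of ^(A(?1)?Z)$ step by step, which may help when tracing the matching path. The function names are made up for this illustration.

```python
def match_group1(s: str, pos: int = 0) -> int:
    """Mimic Group 1, A(?1)?Z, starting at pos.

    Returns the index just past the match, or -1 on failure.
    """
    # Token A
    if pos >= len(s) or s[pos] != "A":
        return -1
    pos += 1
    # Optional recursive call (?1)?: use it if it succeeds, skip it otherwise.
    inner_end = match_group1(s, pos)
    if inner_end != -1:
        pos = inner_end
    # Token Z
    if pos < len(s) and s[pos] == "Z":
        return pos + 1
    return -1


def is_balanced(s: str) -> bool:
    """Mimic the anchors ^ and $: Group 1 must span the whole string."""
    return match_group1(s) == len(s)


print(is_balanced("AAAZZZ"))
print(is_balanced("AAZ"))
```

No backtracking is needed at the optional call here: the recursive call can only succeed when the next character is an A, in which case skipping it and matching Z directly could not succeed anyway.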
Likewise, (?(foo)) checks if the capture group named foo has been set.
This pattern matches a string of digits that may or may not be embedded in curly braces. The optional
capture Group 1 ({)? captures an opening brace. Later, the conditional checks if capture 1 was set, and if
so it matches the closing brace.
Let's expand this example to use the "else" part of the syntax:
^(?:({)|")\d+(?(1)}|")$
This pattern matches strings of digits that are either embedded in double quotes or in curly braces. The
non-capture group (?:({)|") matches the opening delimiter, capturing it to Group 1 if it is a curly brace.
After matching the digits, (?(1)}|") checks whether Group 1 was set. If so, we match a closing curly
brace. If not, we match a double quote.
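Python supports this conditional syntax, so the pattern can be tried directly. In the sketch below the braces are escaped, which is an optional precaution and not part of the original pattern; the helper name is_delimited_number is made up.

```python
import re

# If Group 1 captured an opening brace, require a closing brace;
# otherwise require a closing double quote.
DELIMITED = re.compile(r'^(?:(\{)|")\d+(?(1)\}|")$')

def is_delimited_number(s: str) -> bool:
    return DELIMITED.match(s) is not None

for sample in ['{123}', '"123"', '{123"', '"123}']:
    print(sample, is_delimited_number(sample))
```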
Lookaround in Conditions
In (?(A)B), the condition you'll most frequently see is a check as to whether a capture group has been set.
In .NET, PCRE and Perl (but not Python and Ruby), you can also use lookarounds:
\b(?(?<=5D:)\d{5}|\d{10})\b
If the prefix 5D: can be found, the pattern will match five digits. Otherwise, it will match ten digits.
Needless to say, that is not the only way to perform this task.
Checking if a relative capture group was set
(?(1)A) checks whether Group 1 was set. In PCRE, instead of hard-coding the group number, we can also
check whether a group at a relative position to the current position in the pattern has been set: for
instance, (?(-1)A) checks whether the previous group has been set. Likewise, (?(+1)A) checks whether
the next capture group has been set. (This last scenario would be found within a larger repeating group,
so that on the second pass through the pattern, the next capture group may indeed have been set on the
previous pass.)
Checking if a recursion level was reached
This is not the place to be talking in depth about recursion, which has a section above and a dedicated
page, but for completeness I should mention two other uses of conditionals, available in Perl and PCRE:
(?(R)A) tests whether the regex engine is currently working within a recursion depth (reached from a
recursive call to the whole pattern or a subroutine).
(?(R1)A) tests whether the current recursion level has been reached by a recursive call to subroutine 1.
See examples here.
Availability of Regex Conditionals
Conditionals are available in PCRE, Perl, .NET, Python, and Ruby 2+. In other engines, the work of a
conditional can usually be handled by the careful use of lookarounds.
Similar Syntax
Note that the (?(1)B) syntax can look confusingly similar to (?1) which stands for a regex subroutine,
where the regex pattern defined by Group 1 must be matched.
The subroutine noun_phrase is called twice: there is no need to paste a large repeated regex sub-pattern,
and if we decide to change the definition of noun_phrase, that immediately trickles to the two places
where it is used.
Note also that noun_phrase itself is built by assembling smaller blocks: its code
(?&quant)\ (?&adj)\ (?&object) uses the quant, adj and object subroutines.
With this kind of modularity, you can build regex cathedrals. There is a beautiful example on the page
with the regex to match numbers in plain English.
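Python's re module has neither (?(DEFINE) ) nor (?&name), but a similar modularity can be approximated by assembling the pattern from string fragments. The sketch below is such a stand-in, not the Perl/PCRE feature itself; the word lists and the verb fragment are hypothetical, and only the names quant, adj, object and noun_phrase echo the text above.

```python
import re

# Hypothetical building blocks (the word lists are invented for illustration).
quant = r"(?:five|some|many)"
adj = r"(?:blue|large|interesting)"
obj = r"(?:dogs|books|cars)"
verb = r"(?:like|eat|see)"

# noun_phrase is assembled from the smaller blocks, then used twice.
noun_phrase = rf"(?:{quant} {adj} {obj})"
SENTENCE = re.compile(rf"^{noun_phrase} {verb} {noun_phrase}$")

print(bool(SENTENCE.match("five blue dogs like some large cars")))
print(bool(SENTENCE.match("five blue dogs like cars")))
```

As with the real subroutines, changing the definition of noun_phrase in one place changes it everywhere it is used.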
A Note on Group Numbering
Please be mindful that each named subroutine consumes one capture group number, so if you use capture
groups later in the regex, remember to count from left to right. The gory details are on the page about
Capture Group Numbering & Naming.
If you've read the page about Capture Group Numbering & Naming, you'll remember that capture groups
get numbered from left to right. Therefore, if you have two sets of capturing parentheses, they have two
group numbers. Sometimes, you might wish that these two sets of parentheses might capture to the same
numbered group.
Perl and PCRE (and therefore C, PHP, R) have a feature that lets you reuse a group number when
capturing parentheses are present on different sides of an alternation.
This is rather abstract, so let's take an example. Let's say you want to match a number, but only in three
situations:
If it follows an A, as in A00
If it precedes a B, as in 11B
If it is sandwiched between C and D, as in C22D
This poses no problem using lookahead and lookbehind, but the branch reset syntax (?| ) gives you
another, potentially more readable, option:
(?|A(\d+)|(\d+)B|C(\d+)D)
After the initial (?|, which introduces a branch reset, the group has a three-piece alternation (two |). Each
of those contains a capture group (\d+). The number of all of those capture groups is the same: Group 1.
You are not limited to one group. For instance, if you are also interested in capturing a potential suffix
after the number (which can happen in the situations 11B and C22D), place another set of parentheses
wherever you find a suffix:
(?|A(\d+)|(\d+)(B)|C(\d+)(D))
Using this regex to match the string A00 11B C22D, you obtain these groups:
Match
----A00
11B
C22D
Group 1: Number
--------------00
11
22
Group 2: Suffix
--------------(not set)
B
D
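Python's re module does not support (?| ), so the sketch below only imitates the result: it uses an ordinary alternation, where every set of parentheses gets its own number, and then folds the numbered groups back into one "number" and one "suffix" by hand. The function name branch_reset_like is made up.

```python
import re

# Ordinary alternation: groups are numbered 1 to 5 from left to right.
PATTERN = re.compile(r"A(\d+)|(\d+)(B)|C(\d+)(D)")

def branch_reset_like(text):
    """Return (match, number, suffix) tuples, as the branch reset would group them."""
    rows = []
    for m in PATTERN.finditer(text):
        # Only one branch takes part in each match; the other groups are None.
        number = m.group(1) or m.group(2) or m.group(4)
        suffix = m.group(3) or m.group(5)
        rows.append((m.group(0), number, suffix))
    return rows

for row in branch_reset_like("A00 11B C22D"):
    print(row)
```

The manual folding is exactly the bookkeeping that the branch reset syntax spares you in Perl and PCRE.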
Group 1    Group 2
-------    -------
song       Sweet Home Alabama
fruit      apple
color      blue
motto      Don't Worry
Group 1 (\S+) is a straight capture group that captures the key. In the branch reset, the two sets of
capturing parentheses allow you to capture different kinds of values in different formats to the same
group, i.e. Group 2. You can check the group captures in the right pane of this online regex demo.
To me, this alternative with a conditional and a lookbehind
(\S+):"?((?(?<!")[^"\s]+|[^"]+)) feels a little less satisfying. But hey, it works too.
\d{4} matches four digits, while (?# the year) tells you what we are trying to match.
How useful is this? Not very. I almost never use this feature: when I want comments, I just turn on free-spacing mode for the whole regex.
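Both styles are available in Python, as the sketch below suggests. The month part of the pattern is an assumption added to give the example two comments; only \d{4} and (?# the year) come from the text above.

```python
import re

# Inline comments: (?# ...) is ignored by the engine.
inline = re.compile(r"\d{4}(?# the year)-\d{2}(?# the month)")

# Free-spacing mode: whitespace in the pattern is ignored and
# everything after an unescaped # is a comment.
free_spacing = re.compile(
    r"""
    \d{4}   # the year
    -
    \d{2}   # the month
    """,
    re.VERBOSE,
)

print(inline.fullmatch("2014-10") is not None)
print(free_spacing.fullmatch("2014-10") is not None)
```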
Ask Rex
Duncan UK
March 12, 2014 - 02:40
Subject: Removing Confusion Around (? Regex Syntax
This topic is very well written and much appreciated. Distills large works like Friedl's book into an easily
digestible quarter of an hour. I look forward to reading the rest!
xtello France
February 19, 2014 - 08:03
Subject: RE: Your banner regex
Thanks Rex, you really made me laugh!! I see you always have the same excellent sense of humor as in
your (brilliant) articles & tutorials! Thank you for this great site and for the joke :) (and for the new
regex)
Greetings from (the south of) France! Xavier Tello
Rex
February 21, 2014 - 10:45
Subject: RE: Your banner regex
Hi Xavier, Thank you for your very kind encouragements! If only everyone could be like you. When the
technology becomes available, would you mind if I get back in touch in order to clone you? Wishing you
a fun weekend, Rex
xtello France
February 17, 2014 - 10:07
Subject: Your banner regex
I looked at the regex displayed in your banner. Applying this regex to the string [spoiler] will produce
[spoiler] (if I'm not wrong!). What's this easter egg? ;-)
Rex
February 17, 2014 - 16:37
Subject: RE: Your banner regex
Hi Xavier, Thank you for writing, it was a treat to hear from you. Wow, you are the first person to notice!
In fact, you made me change the banner to satisfy your sense of completion (and make it harder for the
next guy). > What's this easter egg? This Easter Egg (pun intended, I presume) is that you are the grand
winner of a secret contest. From the time I launched the site, I had planned that the first person to
discover this would win a free trip to the South of France. You won!!! :) :) :) Wishing you a beautiful
day, Rex
Nicolas Brussels
August 05, 2013 - 10:09
Subject: Little question about capture
Hi Andy. Thank you for all these articles, they are amazing! I learn a lot with this website. So glad to
found it! Like they said : Best ressource on internet :)
I tried some of your examples, and I'm stuck with one of them: (?:(\()|-)\d{6}(?(1)\)). When I'm trying
"(111111)" with "preg_match_all", it captures "(". Do you think it's possible to bypass this capture? When
I use "-222222", it catches an empty string, and I don't understand why. Could you please explain this?
Thank you Andy! And again: Nice work!
Rex
August 05, 2013 - 18:56
Subject: RE: Little question about capture
Hi Nicolas,
Run this:
$regex='~(?:(\()|-)\d{6}(?(1)\))~';
$string='(such as "(444444)"), or it is preceded by a minus sign (such as "-333333").';
preg_match_all($regex,$string,$m);
var_dump( $m );
You will see that the MATCHES are (444444) and -333333
The CAPTURES are "(" and "". The captured left par is what makes the ?(1) work later in the regex.
Let me know if this is still unclear.
Aravind P S
May 03, 2013 - 17:39
Subject: Great Work man.
I enjoyed reading this article and learnt a lot. Thanks for your wonderful work. :)
Vin Switzerland
November 28, 2012 - 21:05
Subject: Brilliant
Best resource I've found yet on regular expressions. Much appreciate the work you put into this. Why not
create an eBook that could be downloaded? I for one would willingly cough up a few dollars. Regards
Vin
Andy
December 02, 2012 - 09:03
Subject: Re: Brilliant
Hi Vin, Thank you very much for your encouragements, and also for your suggestion. I've been itching to
make a print-on-demand book with the lowest price possible, to make it easy to read offline. Will
probably do that as soon as they extend the length of a day to 49 hours. Wishing you a fun weekend,
Andy
Skrell
November 22, 2012 - 08:21
Subject: amazing
These articles you post on regular expressions are among the best, I've found on the entire internet! No
joke! Much appreciated!!!
Andy
November 22, 2012 - 21:13
Subject: Re: amazing
Hi Skrell, thank you very much for your supportive comment. I'm glad to know that someone likes these
pages! They took weeks to write and I've been surprised by how little time visitors have spent on them.
To enjoy a certain presentation of technical information I guess we must be of like minds at least in some
small way. :) Wishing you a fun end of the week, -A
Copyright RexEgg.com