Regex lookahead and lookbehind

A common search scenario involves finding all occurrences of a string x that are not followed by a string y. Here’s a contrived example. Let’s say you are fond of using the variables foo, bar, and foobar, and they appear everywhere in the code. Now you want to search for all occurrences of the variable “foo”. Unfortunately, a simple search will return foobar in the results as well. So you might attempt a grep search with this as your regex: “foo[^b][^a][^r]”

Now, let’s say the test.txt file consists of the following 3 lines:

foo
foobar
hello foo world

Running “grep 'foo[^b][^a][^r]' test.txt” returns only the third line, and not the first. The reason is that this regex matches “foo” followed by three more characters, where the first is not “b”, the second is not “a”, and the third is not “r”. What we actually want is to match “foo” not followed by “bar”. There is a subtle semantic difference here. The former requires three characters to exist after “foo”. If those three characters do not exist, as on the first line, the match fails. It also rejects a string such as “foobox”, which is not followed by “bar” but does have a “b” right after “foo”. We can express the latter using a negative lookahead: grep -P 'foo(?!bar)' test.txt. Note that -P tells grep to use Perl-compatible regular expressions, which support lookaheads and lookbehinds. The pattern is single-quoted so that the shell leaves the brackets, parentheses, and “!” alone. This time, grep returns both lines one and three.
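The same comparison can be reproduced outside of grep. Here is a minimal sketch using Python's re module instead; the list named lines stands in for test.txt, and the helper matching_lines is a made-up name for illustration, not part of any library.

```python
import re

# Stand-in for the three lines of test.txt.
lines = ["foo", "foobar", "hello foo world"]


def matching_lines(pattern, lines):
    """Return the lines in which the pattern matches anywhere, like grep does."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


# Character classes: each class must consume one character after "foo".
char_class_hits = matching_lines(r"foo[^b][^a][^r]", lines)

# Negative lookahead: only asserts that "bar" does not come next.
lookahead_hits = matching_lines(r"foo(?!bar)", lines)

print(char_class_hits)  # ['hello foo world']
print(lookahead_hits)   # ['foo', 'hello foo world']
```

The bare “foo” line is the interesting one: the character-class version needs three characters that are not there, while the lookahead version asks for nothing after “foo”.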

A lookahead does exactly what it sounds like. The regex engine will look ahead of its current position for the specified pattern. A negative lookahead causes the match to fail at that position if the pattern is found, while a positive lookahead causes it to fail if the pattern is not found. Now, the key point here is that these constructs are what are known as zero width assertions.

A zero width assertion does not actually consume any characters when doing a match. What does this mean? Let’s take the following regex as an example: “(\d+)(\w+)”. When matched against the input string “12345xyz”, the “(\d+)” part of the regex “consumes” the characters “12345”. These will be returned as the first match group, and will not be available to be matched against anything else in the regex, so “(\w+)” is left with “xyz”. A zero width assertion, on the other hand, leaves the string intact, so the characters it inspects remain available for matching against the rest of the regex. Start of line “^” and end of line “$” are two examples of zero width assertions that most people are familiar with.
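To make “consuming” concrete, here is a small sketch in Python's re module (my choice of tool, not something the grep examples depend on). It prints the span of each group, and then the span of a match on “^” alone, which is empty.

```python
import re

text = "12345xyz"

# Both groups consume characters, so together their spans cover the input.
m = re.match(r"(\d+)(\w+)", text)
digits_span = m.span(1)  # (0, 5): "12345" is consumed by (\d+)
rest_span = m.span(2)    # (5, 8): only "xyz" is left for (\w+)

# "^" asserts a position, so the match it produces has zero width.
anchor_span = re.match(r"^", text).span()  # (0, 0)

print(m.group(1), m.group(2))  # 12345 xyz
print(digits_span, rest_span, anchor_span)  # (0, 5) (5, 8) (0, 0)
```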

A lookahead is also a zero width assertion. For example, let’s take a look at the regex foo(?=bar)(\w+) applied to the string “foobarhelloworld”. The positive lookahead portion “foo(?=bar)” will succeed, because in the input string, “foo” is followed by “bar”. However, the lookahead does not consume “bar”, so it is still there to be consumed by “(\w+)”. So match group one will consist of “barhelloworld”.
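This is easy to check for yourself. The sketch below uses Python's re module and compares the lookahead regex with a variant (my addition, for contrast) that consumes “bar” as ordinary text.

```python
import re

text = "foobarhelloworld"

# The lookahead checks for "bar" but leaves it for the group to consume.
m = re.search(r"foo(?=bar)(\w+)", text)
print(m.group(0))  # foobarhelloworld
print(m.group(1))  # barhelloworld

# Without the lookahead, "bar" is consumed before the group starts.
consuming = re.search(r"foobar(\w+)", text)
print(consuming.group(1))  # helloworld
```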

Lookbehinds work in the same way. A lookbehind causes the regex engine to look behind its current position for the specified pattern. A negative lookbehind will fail the match if the pattern is found, and a positive one will fail it if the pattern is not found. Let’s take a look at the regex (\w+)(?<=foo)bar(\d+), on the input string "Testfoobar12345". Again, since the lookbehind is a zero width assertion, the first "(\w+)" matches "Testfoo". The lookbehind, "(?<=foo)", succeeds because the preceding characters are "foo", then "bar" matches literally, and the rest of the string, "12345", is matched against the "(\d+)".

Here is a quick CliffsNotes summary.

Positive lookahead: "foo(?=bar)" matches on foo followed by bar
Negative lookahead: "foo(?!bar)" matches on foo not followed by bar
Positive lookbehind: "(?<=foo)bar" matches on bar preceded by foo
Negative lookbehind: "(?<!foo)bar" matches on bar not preceded by foo

Lookahead and lookbehind provide a concise way of specifying that a given pattern must or must not follow or precede a given position in the string. It’s a powerful feature, but unfortunately not all tools and languages with regex engines support it, especially older versions (hence calling grep with "-P" in one of the examples given above). However, whenever you can use them, they can make this kind of pattern matching that much easier.
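The lookbehind example and the four forms in the summary can be tried out with a short script. This is a sketch in Python's re module. The inputs "foobaz" and "bazbar" are made up for illustration, and note that Python's re only accepts lookbehind patterns of fixed length, which "foo" is.

```python
import re

# The lookbehind example from the text.
m = re.search(r"(\w+)(?<=foo)bar(\d+)", "Testfoobar12345")
print(m.group(1))  # Testfoo
print(m.group(2))  # 12345

# Each form with one input it should match and one it should not.
cases = [
    (r"foo(?=bar)", "foobar", "foobaz"),   # positive lookahead
    (r"foo(?!bar)", "foobaz", "foobar"),   # negative lookahead
    (r"(?<=foo)bar", "foobar", "bazbar"),  # positive lookbehind
    (r"(?<!foo)bar", "bazbar", "foobar"),  # negative lookbehind
]
results = [
    (pattern, bool(re.search(pattern, hit)), bool(re.search(pattern, miss)))
    for pattern, hit, miss in cases
]
for pattern, hit_matched, miss_matched in results:
    # Each row prints the pattern followed by True False.
    print(pattern, hit_matched, miss_matched)
```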
