Advanced Regular Expressions: part 1 – look ahead/look behind

We are going to dig into one of the most powerful “languages” for processing text: Regular Expressions (RegExp).
 
In this series of posts I will only discuss advanced features of RegExp. If you are not already familiar with the basics, I advise taking a look here.

Today I start with the look ahead/look behind construct.

Most of us are quite familiar with the most common special characters of RegExp: the star quantifier (*) and the dot metacharacter (.). These two, combined with some clever patterns, cover the vast majority of common use cases.
But sometimes we need to go further into the advanced features, because the task at hand cannot easily be solved with basic regexps.

As a typographic convention:

  • all regexp patterns are set in italics and in a large font:  [a-zA-Z]+
  • all strings to be matched are underlined:  This is a text to be matched
  • when necessary, I indicate the current matching position in the string with a ↑ symbol

 This is↑ a text to be matched

I  Look ahead/look behind

One day I had a requirement to do some input field validation for a list of email addresses.

We had an input text field in the form, and we expected the user to provide a list of emails separated by a comma (,).

For the sake of simplicity, let’s say that the pattern for a valid email is: [-_a-zA-Z0-9]+@[-_a-zA-Z0-9]+\.[a-z]{3}

  • [-_a-zA-Z0-9]+ : one or more alphanumeric characters, underscores ( _ ) or dashes (-)
  • @ : the literal @ character, followed by a second [-_a-zA-Z0-9]+ group
  • \. : a literal dot (.) character. The backslash escapes the special meaning of the dot (.) as “any character”
  • [a-z]{3} : exactly 3 lowercase letters. Digits are not allowed here.
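
To make the breakdown above concrete, here is a minimal Java sketch of the simplified pattern. The class and method names (SimpleEmailCheck, isSimpleEmail) are made up for illustration; only the pattern itself comes from this post.

```java
import java.util.regex.Pattern;

class SimpleEmailCheck {
    // Simplified email pattern from this post; it is not a full validator.
    // In a Java string literal the regex \. must be written "\\."
    static final Pattern EMAIL =
            Pattern.compile("[-_a-zA-Z0-9]+@[-_a-zA-Z0-9]+\\.[a-z]{3}");

    // Hypothetical helper: true only if the whole input fits the pattern.
    static boolean isSimpleEmail(String input) {
        return EMAIL.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSimpleEmail("john-smith@gmail.com")); // true
        System.out.println(isSimpleEmail("john-smith@gmail.c0m")); // false: digit in the last part
        System.out.println(isSimpleEmail("john.smith@gmail.com")); // false: dot not allowed before the @
    }
}
```

The last case shows how restrictive the simplified pattern is: perfectly valid real-world addresses are rejected.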

This pattern is indeed very simplistic; the address syntax defined in RFC 5322, which covers all possible email addresses, is far more complex.

The pattern above only deals with a single email address. We have not yet looked at the comma separator.

One naïve approach could be:  (?:[-_a-zA-Z0-9]+@[-_a-zA-Z0-9]+\.[a-z]{3},)+

  • first is the pattern for the email address
  • followed by the comma ,
  • both patterns are inside a non-capturing group  (?:  )
  • followed by the + quantifier, meaning that this group occurs one or more times

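This pattern matches an email address followed by a comma, one or more times. It therefore accepts a string that ends with a comma, such as john-smith@gmail.com, (trailing comma included), which we do not want. In fact it even requires that trailing comma. How do we get rid of the last comma?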
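
The behaviour of the naive pattern can be checked with a short Java sketch. The class and method names are hypothetical, and matches() is used so that the whole input has to fit the pattern.

```java
import java.util.regex.Pattern;

class NaiveListCheck {
    // Simplified email pattern from this post.
    static final String EMAIL = "[-_a-zA-Z0-9]+@[-_a-zA-Z0-9]+\\.[a-z]{3}";

    // Naive list pattern: (email followed by a comma), one or more times.
    static final Pattern NAIVE = Pattern.compile("(?:" + EMAIL + ",)+");

    static boolean matchesNaive(String input) {
        return NAIVE.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Accepted, although the trailing comma is exactly what we want to forbid
        System.out.println(matchesNaive("john-smith@gmail.com,")); // true
        // Rejected, because the last address is not followed by a comma
        System.out.println(matchesNaive("john-smith@gmail.com,adam-smith@yahoo.net")); // false
    }
}
```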

Ideally we want to match an email address followed by a comma, one or more times, but the last group should contain only an email address, without the comma.

In other words, the text string should NOT end with a comma, i.e. the comma is NOT allowed if it is the last character of the string.

For this we can use the negative look-ahead construct: pattern1(?!pattern2)

This construct means: match pattern1 only if it is NOT followed by pattern2. The look-ahead only checks what comes next; it does not consume any character.
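
As a small illustration of the negative look-ahead on its own, the sketch below looks for commas that are not the last character of the string, using ,(?!$). The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonTerminalComma {
    // A comma that is NOT immediately followed by the end of the input.
    static final Pattern NON_TERMINAL_COMMA = Pattern.compile(",(?!$)");

    // Hypothetical helper: indexes of all non-terminal commas in the input.
    static List<Integer> positions(String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = NON_TERMINAL_COMMA.matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        // Commas sit at indexes 1, 3 and 5; the one at index 5 is terminal
        System.out.println(positions("a,b,c,")); // [1, 3]
    }
}
```

The comma at the very end is skipped because the look-ahead sees the end of the string right after it.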

So we will have: (?:[-_a-zA-Z0-9]+@[-_a-zA-Z0-9]+\.[a-z]{3},(?!$))+

But this still doesn’t work. Why?

By appending (?!$) to the comma we slightly changed the meaning of the pattern. It now means one or more groups of (a valid email address followed by a non-terminal comma). Read in plain English, it seems obvious that this should work.

Let’s take the following string:  john-smith@gmail.com,adam-smith@yahoo.net

The first part, john-smith@gmail.com, (comma included), surely matches the pattern [-_a-zA-Z0-9]+@[-_a-zA-Z0-9]+\.[a-z]{3},(?!$). However, the last part adam-smith@yahoo.net won’t match anything, because the only allowed form is “email address followed by a comma”.

The mistake here is the ,(?!$) part. It does mean a non-terminal comma, but that non-terminal comma IS NOT OPTIONAL. Every address must be followed by a non-terminal comma, otherwise the whole match fails.
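
The failure described above can be reproduced with the following Java sketch (class and method names are hypothetical). lookingAt() is used to show how far the pattern gets before giving up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MandatoryCommaCheck {
    static final String EMAIL = "[-_a-zA-Z0-9]+@[-_a-zA-Z0-9]+\\.[a-z]{3}";

    // Every email MUST be followed by a non-terminal comma.
    static final Pattern MANDATORY = Pattern.compile("(?:" + EMAIL + ",(?!$))+");

    static boolean matchesWhole(String input) {
        return MANDATORY.matcher(input).matches();
    }

    // Returns the part matched from the start of the input, or "" if none.
    static String matchedPrefix(String input) {
        Matcher m = MANDATORY.matcher(input);
        return m.lookingAt() ? m.group() : "";
    }

    public static void main(String[] args) {
        String text = "john-smith@gmail.com,adam-smith@yahoo.net";
        System.out.println(matchesWhole(text));  // false
        System.out.println(matchedPrefix(text)); // john-smith@gmail.com,
    }
}
```

Only the first address and its comma are consumed; the second address has no comma after it, so the whole-string match fails.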

To make the non-terminal comma optional, we simply wrap it inside a non-capturing group with the ? (zero or one) quantifier: (?:,(?!$))?

So the final regexp is: (?:[-_a-zA-Z0-9]+@[-_a-zA-Z0-9]+\.[a-z]{3}(?:,(?!$))?)+

In the previous example:

  • john-smith@gmail.com, will match [-_a-zA-Z0-9]+@[-_a-zA-Z0-9]+\.[a-z]{3},(?!$)
  • adam-smith@yahoo.net will match [-_a-zA-Z0-9]+@[-_a-zA-Z0-9]+\.[a-z]{3} without the non-terminal comma.

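import java.util.regex.Matcher;
import java.util.regex.Pattern;

String text = "john-smith@hotmail.com,brian_adams@yahoo.net";

// In Java source, the regex \. is written "\\." inside a string literal
Pattern pattern = Pattern.compile("(?:[-_a-zA-Z0-9]+@[-_a-zA-Z0-9]+\\.[a-z]{3}(?:,(?!$))?)+");
Matcher matcher = pattern.matcher(text);
// matches() succeeds only if the entire input fits the pattern
System.out.println(" Matches ? " + matcher.matches());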

The output:

 Matches ? true
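
One caveat with this pattern: because the comma is now optional, two addresses written back to back with no comma between them are accepted as well. The sketch below shows this, together with one possible stricter variant in which each address must be followed either by a non-terminal comma or by the end of the string. That variant is only a suggestion and not part of the original solution, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class StrictListCheck {
    static final String EMAIL = "[-_a-zA-Z0-9]+@[-_a-zA-Z0-9]+\\.[a-z]{3}";

    // Final pattern from this post: the non-terminal comma is optional.
    static final Pattern OPTIONAL_COMMA =
            Pattern.compile("(?:" + EMAIL + "(?:,(?!$))?)+");

    // Suggested variant (an assumption, not the post's pattern):
    // each email is followed by a non-terminal comma OR by the end of the input.
    static final Pattern STRICT =
            Pattern.compile("(?:" + EMAIL + "(?:,(?!$)|$))+");

    static boolean matchesOptional(String input) {
        return OPTIONAL_COMMA.matcher(input).matches();
    }

    static boolean matchesStrict(String input) {
        return STRICT.matcher(input).matches();
    }

    public static void main(String[] args) {
        String glued = "john-smith@gmail.comadam-smith@yahoo.net";
        String list = "john-smith@gmail.com,adam-smith@yahoo.net";

        System.out.println(matchesOptional(glued)); // true: the missing comma goes unnoticed
        System.out.println(matchesStrict(glued));   // false
        System.out.println(matchesStrict(list));    // true
        System.out.println(matchesStrict(list + ",")); // false: trailing comma rejected
    }
}
```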

The other constructs, namely positive look-ahead, positive look-behind and negative look-behind work similarly.

  • Positive look-ahead:  pattern(?=look-ahead)
  • Negative look-ahead: pattern(?!look-ahead)
  • Positive look-behind: (?<=look-behind)pattern
  • Negative look-behind: (?<!look-behind)pattern
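
To see the four constructs side by side, here is a small Java sketch. The sample strings, the currency examples and the findAll helper are all made up for illustration. The word boundaries (\b) are there to keep the engine from backtracking into a shorter number that would satisfy a negative look-around.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Hypothetical helper: collects every match of regex found in input.
    static List<String> findAll(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String amounts = "10 EUR, 20 USD";
        // Positive look-ahead: a number followed by " EUR"
        System.out.println(findAll("\\b\\d+\\b(?= EUR)", amounts)); // [10]
        // Negative look-ahead: a number NOT followed by " EUR"
        System.out.println(findAll("\\b\\d+\\b(?! EUR)", amounts)); // [20]

        String prices = "$30 and 40";
        // Positive look-behind: a number preceded by "$"
        System.out.println(findAll("(?<=\\$)\\d+", prices)); // [30]
        // Negative look-behind: a number NOT preceded by "$"
        System.out.println(findAll("(?<!\\$)\\b\\d+", prices)); // [40]
    }
}
```

In all four cases the look-around text (" EUR" or "$") is checked but never included in the match itself.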

About DuyHai DOAN
Cassandra Technical Evangelist. LinkedIn profile : http://fr.linkedin.com/pub/duyhai-doan/2/224/848. Follow me on Twitter: @doanduyhai for latest updates on Cassandra

2 Responses to Advanced Regular Expressions: part 1 – look ahead/look behind

  1. I love your blog. Thank you for this very nice explanation of look-aheads. Please keep up the posts.

  2. Ivan Stankov says:

    The final pattern does not work coz if there is no comma in the middle it matches too.
