Advanced Regular Expressions: part 2 – greedy vs reluctant

In this post we will look at the greedy nature of common quantifiers (* and +).

 

II  Greedy vs Reluctant

We’re so used with the * quantifier that we almost forget its fundamental greedy nature.  A quantifier is considered “greedy” when it tries to match the longest text chain possible before bactracking if the pattern does not match. Its counterpart is a reluctant quantifier.

Example is better that words, let’s see how we can extract the root directory in an Unix system given the complete path of a file:

Input: /home/data/ebooks/JavaForDummies.pdf

Pattern:  ^(.+)/.+$


String text = "/home/data/books/JavaForDummies.pdf";

         Pattern pattern = Pattern.compile("^/(.+)(?:/.+)*$");
         Matcher matcher = pattern.matcher(text);
         if(matcher.matches())
         {
             System.out.println(" Root folder : " + matcher.group(1));
         }

Not surprisingly the console would display:

Matches : true

Group 1 : home/data/books/JavaForDummies.pdf

What the heck ? But we did clearly isolate the first part /(.+) of the pattern from the remaining trail, how comes ?

This is a very common mistake when you just happen to forget the greedy nature of the quantifier (greedy is their default behavior). First the regexp engine tries to match the pattern/(.+) with the longest possible text. The regexp engine starts by consuming all the text:

/(.+) :     /home/data/books/JavaForDummies.pdf

The first part of the pattern does match, and the engine will stop here, reporting the entire text as the match. The last part of the pattern (?:/.+)* did indeed match with .. nothing because the * quantifier means nothing or more and it this case it matched nothing.

To circumvent this issue, one can thing to turn the * into + to force the regexp engine to match. The result is no even close to what we expect:

Matches : true

Group 1 : home/data/books

We did get rid of the last matching block (/JavaForDummies.pdf) but it’s still not satisfactory.

But ? Wasn’t the + in (?:/.+)+ supposed to be greedy ? Why didn’t it match /data/books/JavaForDummies.pdf but only the last part ?

Well, first come, first served. The regexp engine starts processing the /(.+) part first and since it is greedy it will consume all the string. But by doing so, there is nothing left to be matched with (?:/.+)+ so the regexp engine is forced to give up as many characters as needed so that the WHOLE pattern matches.

Step 1.  /home/data/books/JavaForDummies.pdf↑

  •  /(.+) matches /home/data/books/JavaForDummies.pdf
  • (?:/.+)+ matches nothing -> KO

Step 2. /home/data/books/JavaForDummies.pd↑f

  • /(.+) matches /home/data/books/JavaForDummies.pd
  • (?:/.+)+ matches  f -> KO

Step 3. /home/data/books/JavaForDummies.p↑df

  • /(.+) matches /home/data/books/JavaForDummies.p
  • (?:/.+)+ matches  df -> KO

Step 20. /home/data/books↑

  • /(.+) matches /home/data/books
  • (?:/.+)+ matches  /JavaForDummies.pdf -> OK

As long as (?:/.+)+ can match anything, be it the shortest length, the regexp engine will stop processing and return the successful result. Indeed the greediness of the first part /(.+) has priority over the greediness of the second one (?:/.+)+.

Now, back to our issue, we still want to extract the root folder from the file path. Each folder level in the path is delimited by the slash /. The idea here would be to take the first group of text between slashes:

Patter: ^/([^/]+)/.+$

This time the output is:

 Matches ? true
Group 1 : home

That’s what we wanted. The trick here is the use of a negation in a character class: [^/]. This means “anything except the a/“. Combined with the beginning / and the second /, we can isolate the root folder from the remaining path: /([^/]+)/

Alternatively we can use the reluctant version of our + quantifier:  +?

Pattern:  ^/(.+?)(?:/.+)+$

The output:

 Matches ? true
Group 1 : home

This time, instead of starting by consuming all the string, the reluctant +? will consume character by character until the whole pattern matches, thus giving us the smallest text that can match the string match.

 

Recommended readings:

Advertisements

About DuyHai DOAN
Cassandra Technical Evangelist. LinkedIn profile : http://fr.linkedin.com/pub/duyhai-doan/2/224/848. Follow me on Twitter: @doanduyhai for latest updates on Cassandra

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: