
Ok, Regex, You Win!

[Image: xkcd comic “perl_problems”]

Regular expressions, or Regex, are used in programming to describe patterns in strings (of characters). My most frequent encounters with Regex are when I’m parsing a string in Java and I call the split() function.

When Regex has been useful:

Let’s say I wanted to count how many times each word in the Bible occurs. I have the book in the form of a large text file. If I were writing a Java program, I would need a way to separate the words from the punctuation and numbers. I’d also need to consider that “The” and “the” are the same word, despite being different to the computer. I tried this exercise here: http://pastebin.com/PaajTueh. I used Regex once in the program:

String[] words = s.toLowerCase().split("[\\p{Punct}\\s\\d]+");

This code would take any string of text, convert it to lowercase, and then split it wherever the pattern within the quotation marks matches. Now, have I ever really learned what “[\\p{Punct}\\s\\d]+” stands for? It’s not the most intuitive syntax, so I’ve relied too heavily on looking it up every time I’ve needed it.
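
Read piece by piece, the pattern is less mysterious: the square brackets form a character class, \p{Punct} is any ASCII punctuation character, \s is whitespace, \d is a digit, and the trailing + means "one or more in a row" (the doubled backslashes are just Java string escaping). Here is a minimal sketch of the word count built around that one line. The class and method names are made up for illustration and are not taken from the pastebin program.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: WordCount and countWords are illustrative names,
// not taken from the original program.
final class WordCount {

    // Splits on runs of punctuation, whitespace, or digits and tallies each word.
    static Map<String, Integer> countWords(String s) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : s.toLowerCase().split("[\\p{Punct}\\s\\d]+")) {
            // Text that starts with a delimiter yields an empty first element; skip it.
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String verse = "1 In the beginning, God created the heaven and the earth.";
        System.out.println(countWords(verse));
    }
}
```

One caveat with this sketch: the apostrophe counts as punctuation, so a word like "don't" is split into "don" and "t".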

The above is an example of parsing text, and text is a common way we pass around information that is both human-readable and computationally useful.

Another example would be if I had a string of characters, say, “1-(800)-Reg-Expr,” and I wanted to remove the hyphens and parentheses so that a phone could dial the number for me. I may also want to convert the uppercase letters to lowercase and map the letters to numbers.
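
A rough sketch of that cleanup in Java might look like the following. The class name, method name, and keypad lookup string are my own illustrative choices, and the letter mapping assumes the standard phone keypad (abc=2, def=3, and so on).

```java
// Hypothetical sketch: PhoneWords and toDialable are illustrative names.
final class PhoneWords {

    // Keypad digit for each letter a..z (abc=2, def=3, ..., pqrs=7, tuv=8, wxyz=9).
    private static final String KEYPAD = "22233344455566677778889999";

    static String toDialable(String raw) {
        // The character class [-()] matches a hyphen or either parenthesis.
        String cleaned = raw.replaceAll("[-()]", "").toLowerCase();
        StringBuilder digits = new StringBuilder();
        for (char c : cleaned.toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                digits.append(KEYPAD.charAt(c - 'a'));
            } else {
                digits.append(c); // digits pass through unchanged
            }
        }
        return digits.toString();
    }

    public static void main(String[] args) {
        System.out.println(toDialable("1-(800)-Reg-Expr")); // prints 18007343977
    }
}
```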

Sometimes you get output that needs to be tidied up or reformatted so another program can use the information. I had to do this in my research a couple of times and ended up with Regex, in a sed-style substitution, that looked like: 's/\(.*\)\([0-9]\)\(.*\)/\1 \2 \3;/'
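
That substitution captures three groups (everything before a digit, the digit itself, and everything after it) and writes them back with spaces in between and a semicolon at the end. Because the first .* is greedy, the digit it isolates is the last one on the line. A sketch of the same idea in Java, where groups are written with plain parentheses and referenced as $1, $2, $3; the class name, method name, and sample input are invented for illustration.

```java
// Hypothetical sketch: SedLike and spaceOutLastDigit are illustrative names.
final class SedLike {

    // Java counterpart of s/\(.*\)\([0-9]\)\(.*\)/\1 \2 \3;/ applied to one line.
    static String spaceOutLastDigit(String line) {
        return line.replaceFirst("(.*)([0-9])(.*)", "$1 $2 $3;");
    }

    public static void main(String[] args) {
        System.out.println(spaceOutLastDigit("run12done")); // prints run1 2 done;
    }
}
```

A line with no digit in it does not match and comes back unchanged.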

Here’s an interesting scenario that I found involving Regex as a technical interview question:

Last year my team had to remove all the phone numbers from 50,000 Amazon web page templates, since many of the numbers were no longer in service, and we also wanted to route all customer contacts through a single page.

Let’s say you’re on my team, and we have to identify the pages having probable U.S. phone numbers in them. To simplify the problem slightly, assume we have 50,000 HTML files in a Unix directory tree, under a directory called “/website”. We have 2 days to get a list of file paths to the editorial staff. You need to give me a list of the .html files in this directory tree that appear to contain phone numbers in the following two formats: (xxx) xxx-xxxx and xxx-xxx-xxxx.

How would you solve this problem? Keep in mind our team is on a short (2-day) timeline.

(https://sites.google.com/site/steveyegge2/five-essential-phone-screen-questions)

The given Regex answer is as follows:

grep -l -R --perl-regexp "\b(\(\d{3}\)\s*|\d{3}-)\d{3}-\d{4}\b" * > output.txt

Somehow “\b(\(\d{3}\)\s*|\d{3}-)\d{3}-\d{4}\b” translates into phone numbers that either look like (xxx) xxx-xxxx or xxx-xxx-xxxx.
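
Taken apart, the pattern reads: \b is a word boundary; the parenthesized group offers two alternatives, either \(\d{3}\)\s* (three digits in literal parentheses, then optional whitespace) or \d{3}- (three digits and a hyphen); then \d{3}-\d{4} is the remaining xxx-xxxx, closed by another \b. Below is a small Java sketch for trying it out. The class and method names are hypothetical, and the second pattern is my own assumed variant rather than part of the quoted answer. It is there because, at least in Java's regex engine, the leading \b in front of a literal "(" only matches when a letter or digit comes right before the parenthesis.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: PhoneFinder and containsPhone are illustrative names.
final class PhoneFinder {

    // The pattern from the quoted answer, with backslashes doubled for a Java string.
    static final Pattern QUOTED =
            Pattern.compile("\\b(\\(\\d{3}\\)\\s*|\\d{3}-)\\d{3}-\\d{4}\\b");

    // Assumed variant (not from the source): the leading \b guards only the
    // xxx-xxx-xxxx alternative, so "(xxx) xxx-xxxx" can follow a space.
    static final Pattern VARIANT =
            Pattern.compile("(\\(\\d{3}\\)\\s*|\\b\\d{3}-)\\d{3}-\\d{4}\\b");

    // True when the pattern matches anywhere in the text.
    static boolean containsPhone(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "Call 206-555-1234 today",
            "Call (206) 555-1234 today",
            "Call 2065551234 today"
        };
        for (String s : samples) {
            System.out.println(s + " | quoted: " + containsPhone(QUOTED, s)
                    + " | variant: " + containsPhone(VARIANT, s));
        }
    }
}
```

In this sketch the quoted pattern finds the hyphenated sample but not the parenthesized one, while the variant finds both; neither matches the ten digits run together.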

The bottom line:

Although it looks unintelligible at first, I’ve realized that Regex is a valuable part of my programming endeavors. There’s just no better way to handle strings, and if I could go back in time, I would tell my younger self to embrace the power of regular expressions.

Anyhow, I thought I’d pull together a few useful resources for Regex:

[Image: xkcd comic “regular_expressions”]
