Fear No Longer Regular Expressions

*** This post has been originally published on ACRL TechConnect Blog on July 31, 2013. ***

Regex, it’s your friend

You may have heard the term, “regular expressions” before. If you have, you will know that it usually comes in a notation that is quite hard to make out like this:

(?=^[0-5\- ]+$)(?!.*0123)\d{3}-\d{3,4}-\d{4}

Despite its appearance, regular expressions (regex) is an extremely useful tool to clean up and/or manipulate textual data. I will show you an example that is easy to understand. Don’t worry if you can’t make sense of the regex above. We can talk about the crazy metacharacters and other confusing regex notations later. But hopefully, this example will help you appreciate the power of regex and give you some ideas of how to make use of regex to make your everyday library work easier.

What regex can do for you – an example

I looked for the zipcodes for all cities in Miami-Dade County and found a very nice PDF online (http://www.miamifocusgroup.com/usertpl/1vg116-three-column/miami-dade-zip-codes-and-map.pdf). But when I copy and paste the text from the PDF file to my text editor (Sublime), the format immediately goes haywire. The county name, ‘Miami-Dade,’ which should be in one line becomes three lines, each of which lists Miami, -, Dade.

Ugh, right? I do not like this one bit. So let’s fix this using regular expressions.

(**Click the images to bring up the larger version.)

Like many text editors, Sublime offers the find/replace with regex feature. This is a much powerful tool than the usual find/replace in MS Word, for example, because it allows you to match many complex cases that fit a pattern. Regular expressions lets you specify that pattern.

In this example, I capture the three lines each saying Miami,-,Dade with this regex:

Miami\n-\nDade.

When I enter this into the ‘Find What’ field, Sublime starts highlighting all the matches. I am sure you already guessed that \n means a new line. Now let me enter Miami-Dade in the ‘Replace With’ field and hit ‘Replace All.’

As you can see below, things are much better now. But I want each set of three lines – Miami-Dade, zipcode, and city – to be one line and each element to be separated by comma and a space such as ‘Miami-Dade, 33010, Hialeah’. So let’s do some more magic with regex.

How do I describe the pattern of three lines – Miami-Dade, zipcode, and city? When I look at the PDF, I notice that the zipcode is a 5 digit number and the city name consists of alphabet characters and space. I don’t see any hypen or comma in the city name in this particular case. And the first line is always ‘Miami-Dade.” So the following regular expression captures this pattern.

Miami-Dade\n\d{5}\n[A-Za-z ]+

Can you guess what this means? You already know that \n means a new line. \d{5} means a 5 digit number. So it will match 33013, 33149, 98765, or any number that consists of five digits. [A-Za-z ] means any alphabet character either in upper or lower case or space (N.B. the space at the end right before ‘]’).

Anything that goes inside [ ] is one character. Just like \d is one digit. So I need to specify how many of the characters are to be matched. if I put {5}, as I did in \d{5}, it will only match a city name that has five characters like ‘Miami,’ The pattern should match any length of city name as long as it is not zero. The + sign does that. [A-Za-z ]+ means that any alphabet character either in upper or lower case or space should appear at least or more than once. (N.B. * and ? are also quantifiers like +. See the metacharacter table below to find out what they do.)

Now I hit the “Find” button, and we can see the pattern worked as I intended. Hurrah!

Now, let’s make these three lines one line each. One great thing about regex is that you can refer back to matched items. This is really useful for text manipulation. But in order to use the backreference feature in regex, we have to group the items with parentheses. So let’s change our regex to something like this:

(Miami-Dade)\n\(d{5})\n([A-Za-z ]+)

This regex shows three groups separated by a new line (\n). You will see that Sublime still matches the same three line sets in the text file. But now that we have grouped the units we want – county name, zipcode, and city name – we can refer back to them in the ‘Replace With’ field. There were three units, and each unit can be referred by backslash and the order of appearance. So the county name is \1, zipcode is \2, and the city name is \3. Since we want them to be all in one line and separated by a comma and a space, the following expression will work for our purpose. (N.B. Usually you can have up to nine backreferences in total from \1 to\9. So if you want to backreference the later group, you can opt not to create a backreference from a group by using (?: ) instead of (). )

\1, \2, \3

Do a few Replaces and then if satisfied, hit ‘Replace All’.

Ta-da! It’s magic.

Regex Metacharacters

Regex notations look a bit funky. But it’s worth learning them since they enable you to specify a general pattern that can match many different cases that you cannot catch without the regular expression.

We have already learned the four regex metacharacters: \n, \d, { }, (). Not surprisingly, there are many more beyond these. Below is a pretty extensive list of regex metacharacters, which I borrowed from the regex tutorial here: http://www.hscripts.com/tutorials/regular-expression/metacharacter-list.php . I also highly recommend this one-page Regex cheat sheet from MIT (http://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf).

Note that \w will match not only a alphabetical character but also an underscore and a number. For example, \w+ matches Little999Prince_1892. Also remember that a small number of regular expression notations can vary depending on what programming language you use such as Perl, JavaScript, PHP, Ruby, or Python.

Metacharacter	Description
\	Specifies the next character as either a special character, a literal, a back reference, or an octal escape.
^	Matches the position at the beginning of the input string.
$	Matches the position at the end of the input string.
*	Matches the preceding subexpression zero or more times.
+	Matches the preceding subexpression one or more times.
?	Matches the preceding subexpression zero or one time.
{n}	Matches exactly n times, where n is a non-negative integer.
{n,}	Matches at least n times, n is a non-negative integer.
{n,m}	Matches at least n and at most m times, where m and n are non-negative integers and n <= m.
.	Matches any single character except “\n”.
[xyz]	A character set. Matches any one of the enclosed characters.
x\|y	Matches either x or y.
[^xyz]	A negative character set. Matches any character not enclosed.
[a-z]	A range of characters. Matches any character in the specified range.
[^a-z]	A negative range characters. Matches any character not in the specified range.
\b	Matches a word boundary, that is, the position between a word and a space.
\B	Matches a nonword boundary. ‘er\B’ matches the ‘er’ in “verb” but not the ‘er’ in “never”.
\d	Matches a digit character.
\D	Matches a non-digit character.
\f	Matches a form-feed character.
\n	Matches a newline character.
\r	Matches a carriage return character.
\s	Matches any whitespace character including space, tab, form-feed, etc.
\S	Matches any non-whitespace character.
\t	Matches a tab character.
\v	Matches a vertical tab character.
\w	Matches any word character including underscore.
\W	Matches any non-word character.
\un	Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol

Matching modes

You also need to know about the Regex matching modes. In order to use these modes, you write your regex as shown above, and then at the end you add one or more of these modes. Note that in text editors, these options often appear as checkboxes and may apply without you doing anything by default.

/i makes the regex match case insensitive.
/g matches all instances, instead of the first one, matching the regular expression.
/s enables “single-line mode”. In this mode, the dot matches newlines.
/m enables “multi-line mode”. In this mode, the caret and dollar match before and after newlines in the subject string.

For example, [d]\w+[g] will match only the three lower case words in ding DONG dang DING dong DANG. On the other hand, [d]\w+[g]/i will match all six words whether they are in the upper or the lower case.

Look-ahead and Look-behind

There are also the ‘look-ahead’ and the ‘look-behind’ pattern in regular expressions. These often cause confusion and are considered to be a tricky part of regex. So, let me show you a simple example of how it can be used.

Below are several lines of a person’s last name, first name, middle name, separated by his or her department name. You can see that this is a snippet from a csv file. The problem is that a value in one field – the department name- also includes a comma, which is supposed to appear only between different fields not inside a field. So the comma becomes an unreliable separator. One way to solve this issue is to convert this csv file into a tab limited file, that is, using a tab instead of a comma as a field separater. That means that I need to replace all commas with tabs ‘except those commas that appear inside a department field.’

How do I achieve that? Luckily, the commas inside the department field value are all followed by a space character whereas the separator commas in between different fields are not so. Using the negative look-ahead regex, I can successfully specify the pattern of a comma that is not followed by (?!) a space \s.

,(?!\s)

Below, you can see that this regex matches all commas except those that are followed by a space.

For another example, the positive look-ahead regex, Ham(?=burg) , on the other hand, will match ‘Ham‘ in Hamburg when it is applied to the text: Hamilton, Hamburg, Hamlet, Hammock.

Below are the complete look-ahead and look-behind notations both positive and negative.

(?=pattern)is a positive look-ahead assertion
(?!pattern)is a negative look-ahead assertion
(?<=pattern)is a positive look-behind assertion
(?<!pattern)is a negative look-behind assertion

Can you think of any example where you can successfully apply a look-behind regular expression? (No? Then check out this page for more examples: http://www.rexegg.com/regex-lookarounds.html)

Now that we have covered even the look-ahead and the look-behind, you should be ready to tackle the very first crazy-looking regex that I introduced in the beginning of this post.

(?=^[0-5\- ]+$)(?!.*0123)\d{3}-\d{3,4}-\d{4}

Tell me what this will match! Post in the comment below and be proud of yourself.

More tools and resources for practicing regular expressions

There are many tools and resources out there that can help you practice regular expressions. Text editors such as EditPad Pro (Windows), Sublime, TextWrangler (Mac OS), Vi, EMacs all provide regex support. Wikipedia (https://en.wikipedia.org/wiki/Comparison_of_text_editors#Basic_features) offers a useful comparison chart of many text editors you can refer to. RegexPal.com is a convenient online Javascript Regex tester. FireFox also has Regular Expressions add-on (https://addons.mozilla.org/en-US/firefox/addon/rext/).

For more tools and resources, check out “Regular Expressions: 30 Useful Tools and Resources” http://www.hongkiat.com/blog/regular-expression-tools-resources/.

Library problems you can solve with regex

The best way to learn regex is to start using it right away every time you run into a problem that can be solved faster with regex. What library problem can you solve with regular expressions? What problem did you solve with regular expressions? I use regex often to clean up or manipulate large data. Suppose you have 500 links and you need to add either EZproxy suffix or prefix to each. With regex, you can get this done in a matter of a minute.

To give you an idea, I will wrap up this post with some regex use cases several librarians generously shared with me. (Big thanks to the librarians who shared their regex use cases through Twitter! )

Some ebook vendors don’t alert you to new (or removed!) books in their collections but do have on their website a big A-Z list of all of their titles. For each such vendor, each month, I run a script that downloads that page’s HTML, and uses a regex to identify the lines that have ebook links in them. It uses another regex to extract the useful data from those lines, like URL and Title. I compare the resulting spreadsheet against last month’s (using a tool like diff or vimdiff) to discover what has changed, and modify the catalog accordingly. (@zemkat)

Sometimes when I am cross-walking data into a MARC record, I find fields that includes irregular spacing that may have been used for alignment in the old setting but just looks weird in the new one. I use a regex to convert instances of more than two spaces into just two spaces. (@zemkat)

Recently while loading e-resource records for government documents, we noted that many of them were items that we had duplicated in fiche: the ones with a call number of the pattern “H, S, or J, followed directly by four digits”. We are identifying those duplicates using a regex for that pattern, and making holdings for those at the same time. (@zemkat)

I used regex as part of crosswalking metadata schemas in XML. I changed scientific OME-XML into MODS + METS to include research images into the library catalog. (@KristinBriney)

Parsing MARC just has to be one. Simple things like getting the GMD out of a 245 field require an expression like this: |h\[(.*)\] MARCedit supports regex search and replace, which catalogers can use. (@phette23)

I used regex to adjust the printed label in Millennium depending on several factors, including the material type in the MARC record. (@brianslone )

I strip out non-Unicode characters from the transformed finding aids daily with regex and replace them with Unicode equivalents. (@bryjbrown)