Wingtip Labs Blog

Regular Expression Repentance

I love Regular Expressions, so much so that I built an interactive regular expressions tutorial.

Today, I’m going to share with you some of the worst regular expressions I’ve used in real projects, and the lessons I wish I could share with the n00b me that wrote this crap.

What makes my regex tutorial great is that it uses real world examples. So I decided to build some new levels using material pulled from a large (~45,000 lines-of-code) multi-year PHP project I recently worked on.

This is the second article looking at that project. The first article, Using Regular Expressions to Nose Around a Large PHP Project, used find and grep to get a sense of the project’s size and how it used regex (matching, replacing, splitting, etc).

This article explores some of the worst regular expressions I contributed to that project, with the wisdom only hindsight can provide.

Case Insensitivity

1
2
3
if(preg_match("/soft$/", strtolower($search_term))){
  //This is a soft phone
}

This takes a user-provided string $search_term, and tests whether it ends with soft.

The Good — This code tries to accept any reasonable input, since I have no idea whether users will provide Soft or SOFT or soft or SoFt.

The Bad — I don’t need strtolower($search_term) to manage capitalization. I should be using the case insensitive regex flag, i.

This code is equivalent and simpler:

1
2
3
if(preg_match("/soft$/i", $search_term)){
  //This is a soft phone
}

This one’s the same sin, only more forehead-slappy:

1
$input = preg_replace('/<[bB][rR]\ ?\/?>/', "\r\n", $input);

That code turns <br> tags in the input into plaintext friendly carriage-return line-feed. The project uses this when we’ve accepted input from rich text editors (that will usually be output in web pages) that now needs to be displayed at a CLI or mailed out in plaintext.

The Good — Like the previous example, it’s very tolerant.

  • [bB][rR] accepts the tag in any case (<br> and <BR> and even <bR>)
  • \/? accepts normal or self-closing form (<br> and <br/>)
  • ? (space ?) allows a space before the self-closing form (<br/> and <br />))

The Bad — Thank goodness it’s only a two-letter tag, because I’d hate to find I’d ever written [bB][lL][iI][nN][kK] (for more than one reason).

Here’s a less-goofy, functionally-equivalent rewrite:

1
$input = preg_replace('/<br ?\/?>/i', "\r\n", $input);

Not using libraries

This isn’t a terrible practice, but let’s call it a regex smell: never roll your own parser for any file format you can’t explain in one breath.

1
$columns = preg_split("/\s*,\s*/", $input);

This turns the $input string, which is one row of a comma-separated-value file (CSV), into an array, $columns.

The Good — By the time I wrote this, I’d already learned (probably the hard way) that sometimes CSVs have whitespace you don’t expect. The pattern \s* will accept zero-or-more spaces on either side of the comma.

The Bad — Writing your own CSV parser is fun and easy while the CSV is small and written by a programmer. But lord help you if it comes out of Excel or is written by anyone but you. Here’s a better idea, use the CSV parser the language provides:

1
2
$columns = str_getcsv($input);
$columns = array_map("trim", $columns);

This has a few advantages:

  • It also strips leading whitespace off the first column, and trailing space off the last column.
  • It’s properly anticipates columns with line breaks and commas in them, the way they’ll appear from Excel (wrapped in “)
  • It separates “break this line into columns” in line 1, from “apply sensible filtering to the data” in line 2. In this example all the columns are the same, and we just want to trim whitespace. But in the real app, some columns are user IDs that need strtolower, some are numbers that need intval or floatval and some are keywords that can only be one of a few values.

Here’s that same bad idea, except now it’s a security risk.

1
$safe = preg_replace("/<[^>]+>/"," ", $raw);

This code takes input from a user and throws out any HTML tags. It assumes a tags start with a <, have some content that isn’t > matched by [^>]+, and end with >. The most obvious flaw is that this is a valid HTML tag:

1
<img title='monkeys > ninjas' src='monkeypunchesninja.png'>

that this regex mangles into

1
 ninjas' src='monkeypunchesninja.png'>

But the broader problem is that parsing HTML is difficult, and the stakes here are very high. This code needs to make sure users can’t inject potentially dangerous HTML into site content, so trusting myself to a hokey regular expression was a terrible idea.

I should have been using PHP’s built-in function that doesn’t have these problems:

1
$safe = strip_tags($raw);

When in doubt, look for a language feature or a well-written library to parse any complicated data format, especially when that data is coming from untrusted (or just non-technical) sources.

Parsing key-value pairs

This example was probably written the day I found PHP’s list construct.

1
2
3
4
5
6
7
8
foreach($lines as $line){
  if(preg_match("/^[\w]+=[\w]+$/", $line)){
      list($key, $value) = preg_split("/\s*=\s*/", $line);
      $return_array[$key] = $value;
  }else{ //Consider this like setting a flag
      $return_array[$line] = "true";
  }
}

The Bad — This code splits up detecting the type of row you’re parsing from actually parsing it. It’s like the poster child for don’t-repeat-yourself, since the two regexes are deceptively similar but subtly, brokenly, different.

The regex at line 2:

  • ^ matches from the beginning of the row
  • [\w]+ matches one-or-more alphanumeric characters
  • = matches a literal equals sign
  • [\w]+ matches one-or-more alphanumeric characters
  • $ matches to the end of the row

The square brackets are totally unnecessary, this is equivalent and cleaner:

1
if(preg_match("/^\w+=\w+$/", $line)){

Example Matches:

  • cow=moo
  • ducks=5
  • Johnny5=Still_Alive (\w includes underscore)

Example Misses:

  • ducks (no =)
  • author=Orson Scott Card (\w doesn’t match spaces)

The regex at line 3 then splits the string around an equal sign with zero or more spaces \s* on the left or right. The problem is, the regex on line 2 would fail if there were any spaces anywhere on the row.

This uses just one smarter regex:

1
2
3
4
5
6
7
foreach($lines as $line){
  if(preg_match("/^(\w+)\s*=\s*(\w+)$/", $line, $matches)){
      $return_array[$matches[1]] = $matches[2];
  }else{ //Consider this like setting a flag
      $return_array[$line] = "true";
  }
}

The new regular expression:

  • ^ matches from the beginning of the row
  • (\w+) captures one-or-more alphanumeric characters into $matches[1]
  • \s*=\s* matches (and doesn’t capture) a = with zero or more spaces on either side
  • (\w+) captures one-or-more alphanumeric characters into $matches[2]
  • $ matches to the end of the row

Hope you’ve enjoyed this cringe-inducing trip down memory lane. If you’d like to learn more about writing good regular expressions, you should try my regular expressions tutorial.