Using Regular Expressions to Nose Around a Large PHP Project

I love PHP, and I love Regular Expressions. I’m putting together a few blog posts to show how I used regular expressions and PHP together in a relatively large project that I was in charge of for the past few years. (Unfortunately, this project is owned by my previous employer, so I’ll be sharing metadata and short code samples rather than the whole shebang.)

Why? I built a regular expressions tutorial, and part of what makes it amazing is that it uses real-world examples. This was a chance to find more examples to incorporate.

In this post, I’ll be using regular expressions to dredge up some information about the project:

How big is it?
How many files are actually PHP? (vs CSS, PNG, etc)
How often did I use regular expressions?
Which of PHP’s regular expressions functions do I lean on?

Today you’ll see some practical uses for regular expressions with find and egrep.

How large is this project?

How many files are in the repository?

Let’s ask find to tell us all the file names, then have wc count them.

$ find . -type f | wc -l
4433

How many lines of code are there in the project?

$ find . -type f | xargs wc -l | grep total
1478127 total
   98070 total

Here, we’re still asking find for a list of files, but instead of counting the file names, we’re using xargs to pass the filenames to wc, and asking wc to count the number of lines in each file.

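The call to grep addresses a caveat for really big projects: the operating system limits how long a command line can be, so xargs can’t hand an arbitrarily large set of arguments to a single wc. When find produces a very long list, xargs runs wc several times, and each run prints its own subtotal. Piping through grep total keeps just those subtotal lines.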
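If adding up subtotals by hand gets tedious, the sum can be left to the shell. This is a minimal sketch rather than what I ran on the real project: the sample files and the variable names (demo, summed, counted) are made up for illustration, and -n 2 is only there to force the batching on a tiny tree.

```shell
# Throwaway sample tree (hypothetical files) so the example is self-contained.
demo=$(mktemp -d)
printf 'one\ntwo\n'        > "$demo/a.txt"
printf 'three\n'           > "$demo/b.txt"
printf 'four\nfive\nsix\n' > "$demo/c.txt"
printf 'seven\n'           > "$demo/d.txt"

# -n 2 makes xargs run wc with two files at a time, imitating the
# batching that happens on its own with a very long file list.
# awk adds up the "total" line from every batch.
summed=$(find "$demo" -type f | xargs -n 2 wc -l |
  awk '$2 == "total" { sum += $1 } END { print sum + 0 }')

# Alternative: concatenate everything and count once, so there is
# only ever one number to read.
counted=$(find "$demo" -type f -exec cat {} + | wc -l | tr -d ' ')

echo "summed subtotals: $summed"
echo "single count:     $counted"
rm -rf "$demo"
```

One thing to watch with the first approach: wc only prints a total line when it is given more than one file, so a final batch containing a single file would be missed. The second approach doesn’t have that problem.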

All told, the project contains 1,576,197 lines of code (1,478,127 + 98,070). Wait, that’s not right. Not everything in the project is code. Let’s look at file types.

What filetypes does this project contain?

What are the most popular file suffixes in the project?

$ find . -type f | egrep -o "\.[a-z]+$" | sort | uniq -c | sort -nr
.conf
.php
.png
.inc
.ulaw
.js
.gif
.jpg
.txt
.bmp
.bin
.gsm
   ... some contents deleted ...

Let’s break that command down a little.

find . -type f is getting the names of all the files throughout the repository.
egrep -o "\.[a-z]+$" returns only the part of the file name that matches this regular expression for the file suffix. More on this in a moment.
sort sorts all these file suffixes alphabetically
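uniq -c collapses repeated lines into one and prefixes each with the number of times it occurred (it only compares adjacent lines, which is why the input needs to be sorted first)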
sort -nr sorts again, but from most occurrences to least (numerically, reversed)
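To see why the first sort matters, here is a small sketch that runs the counting half of the pipeline on a handful of made-up suffixes. The list and the variable names (suffixes, unsorted_rows, ranked) are hypothetical; only sort, uniq -c and sort -nr come from the command above.

```shell
# Hypothetical suffixes, in the order find might happen to emit them.
suffixes='.php
.inc
.png
.php
.png
.php'

# Without sort, no two equal lines are adjacent, so nothing is merged.
unsorted_rows=$(printf '%s\n' "$suffixes" | uniq -c | wc -l | tr -d ' ')

# With sort, equal suffixes sit next to each other and get one row each.
ranked=$(printf '%s\n' "$suffixes" | sort | uniq -c | sort -nr)

echo "rows without sort: $unsorted_rows"
printf '%s\n' "$ranked"
```

Without the first sort, uniq -c reports six rows with a count of 1 each. With it, there are three rows: .php first, then .png, then .inc.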

The regular expression \.[a-z]+$ that we passed to egrep breaks down as:

\. a literal dot. (Without the backslash, . matches any character)
[a-z] any lower case letter
+ repeat the match to my left (“any lower case letter”) one or more times
$ the end of the file name. (There can’t be any more text after the last letter.)
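It’s worth trying the pattern on a few awkward names to see what it skips. This is a sketch on made-up file names (none of them are from the project), written with grep -E, which is the same thing as egrep. The LC_ALL=C prefix is my addition: it pins [a-z] to the ASCII lower case letters regardless of locale.

```shell
# Hypothetical file names chosen to exercise each part of the pattern.
names='./index.php
./backup/site.tar.gz
./audio/greeting.mp3
./img/LOGO.PNG
./README
./conf.d/settings'

# -o prints only the matched suffix; names that do not match print nothing.
suffix_matches=$(printf '%s\n' "$names" | LC_ALL=C grep -E -o '\.[a-z]+$')
printf '%s\n' "$suffix_matches"
```

Only .php and .gz come out. The .mp3 and .PNG suffixes are skipped because [a-z] doesn’t cover digits or upper case letters, README has no dot at all, and the dot in conf.d isn’t followed by letters all the way to the end of the name. So a count made this way can miss some file types.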

The output tells us that there are 610 .php files. It also jogs my memory that we used .inc files for PHP library code that couldn’t be called by Apache directly. So we need to count another 375 .inc files.

How many PHP lines of code does the project contain?

Now let’s get the line count for all files ending .php or .inc.

$ find -E . -regex '.+\.(php|inc)' -type f | xargs wc -l | grep total
  143509 total

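Here we’re using the -regex functionality of find. The -E flag turns on “extended” regular expressions. It is specific to the BSD find that ships with macOS; with GNU find, the equivalent is to put -regextype posix-extended before -regex.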

Let’s break down the regular expression .+\.(php|inc) that we passed to find:

.+ One or more of any character. The -regex flag takes a pattern that must match the whole path that find prints (for example ./lib/db.inc), so we use this pattern to match “whatever” the path starts with. The . means any character, the + means “one or more of that thing to my left.”
\. Literally a dot. Without the backslash, this would mean “any one character.”
(php|inc) One of php or inc.
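If the regex flags of your find are a worry, the same selection can be made without -regex at all. This is a sketch of an alternative, not the command I used: it swaps the regular expression for two -name globs joined with -o, and it runs on a made-up tree (demo, php_files and php_count are names I picked for the example).

```shell
# Throwaway sample tree (hypothetical files).
demo=$(mktemp -d)
mkdir -p "$demo/lib"
: > "$demo/index.php"
: > "$demo/lib/db.inc"
: > "$demo/lib/util.php"
: > "$demo/logo.png"

# -name globs plus -o ("or") select the same suffixes without any regex flag.
php_files=$(find "$demo" -type f \( -name '*.php' -o -name '*.inc' \) | sort)
php_count=$(printf '%s\n' "$php_files" | wc -l | tr -d ' ')

echo "PHP source files: $php_count"
rm -rf "$demo"
```

The escaped parentheses group the two -name tests so that -type f applies to both of them.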

How can I avoid counting branch lines-of-code when estimating project size?

143,509 lines is at least the right order of magnitude, but when I ran it without the grep filter, I could see that it was double-counting some files that exist, unmodified, in different ongoing branches. So let’s tighten up the count by only counting lines of text in PHP files in project trunks.

$ find -E . -regex '.*/trunk/.*\.(php|inc)' -type f | xargs wc -l | grep total
   45682 total

This regular expression we’re using with find matches:

.* Zero or more of any character.
/trunk/ The literal text /trunk/. In other words, the path contains a folder trunk somewhere.
.* Zero or more of any character. This means /trunk/ can be anywhere in the path.
\. Literally a dot.
(php|inc) One of php or inc.

For paths that contain a folder trunk, and filenames that end with .php or .inc, I have 45,682 lines of code.
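A quick way to sanity-check a path pattern like this is to feed it a few paths by hand. The sketch below does that with grep -E on made-up paths (they are not from the project). Because find -regex has to match the whole path, I’ve added ^ and $ so that grep behaves the same way.

```shell
# Hypothetical paths: two should match, three should not.
paths='./project/trunk/index.php
./project/trunk/lib/db.inc
./project/trunk/img/logo.png
./project/branches/v2/index.php
./trunk.php'

# ^ and $ anchor the pattern to the whole line, as find -regex does
# for the whole path.
trunk_matches=$(printf '%s\n' "$paths" | grep -E '^.*/trunk/.*\.(php|inc)$')
printf '%s\n' "$trunk_matches"
```

The .png under trunk, the branch copy of index.php, and a file that merely has trunk in its name are all left out.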

Where am I using regular expressions in my PHP code?

Personally, I always use PHP’s Perl-compatible regular expression functions, which all begin “preg_”.

$ find -E . -regex '.*/trunk/.*\.(php|inc)$' -type f | xargs grep preg_
./project/trunk/announce-number-change.php:       if(preg_match($pattern,$dialed_number)){
./project/trunk/admintools/about.php: list($fs, $blocks, $used, $avail, $percent, $mount) = preg_split("/[\s]+/", array_shift($result));
   ... 352 other examples deleted ...

In this command, we use xargs to pass all the file names as arguments to grep. This grep command outputs all the lines in all those files that contain “preg_”.
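One caveat with piping find into xargs: by default xargs splits its input on whitespace, so a file name containing a space gets broken into pieces. A hedged sketch of the usual workaround follows, using the -print0 and -0 options that both the GNU and BSD versions of find and xargs support. The files and the variable names (demo, preg_lines) are invented for the example.

```shell
# Throwaway sample tree; one file name contains a space on purpose.
demo=$(mktemp -d)
printf '%s\n' 'if (preg_match($p, $s)) { echo "hit"; }' > "$demo/old code.php"
printf '%s\n' '$parts = preg_split("/,/", $s);' 'echo count($parts);' > "$demo/new.php"

# -print0 and -0 separate names with a NUL byte instead of whitespace,
# so "old code.php" reaches grep as one argument.
preg_lines=$(find "$demo" -type f -name '*.php' -print0 |
  xargs -0 grep -h 'preg_' | wc -l | tr -d ' ')

echo "lines mentioning preg_: $preg_lines"
rm -rf "$demo"
```

Here grep finds one line in each file, including the file with the space in its name.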

What preg_ functions am I using, and how often?

$ find -E . -regex '.*/trunk/.*\.(php|inc)$' -type f | xargs egrep -o "preg_[a-z_]+" -h | sort | uniq -c | sort -nr
preg_match
preg_replace
preg_split
preg_replace_callback
preg_output
preg_match_all

Let’s break down that command:

find -E . -regex '.*/trunk/.*\.(php|inc)$' -type f Get all the files under a trunk folder ending .php or .inc
xargs egrep -o "preg_[a-z_]+" -h Find the full preg_ function name. (More on the regex below) The -o option causes egrep to return only the part of the line that matches the expression. The -h option suppresses the file name you found it in.
sort | uniq -c | sort -nr Count how many times each function appears, and sort descending.

And the regular expression we hand to egrep:

preg_ - The literal text preg_
[a-z_] - Any lower case letter from a to z, or an underscore.
+ - Repeat “a to z, or an underscore” one or more times. This will stop matching when it gets to the ( around the arguments.
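Here is the same extraction run on a small, made-up PHP file, so the counts are easy to check by eye. The PHP lines are hypothetical stand-ins, not project code, and the names demo and preg_counts are mine. The LC_ALL=C prefix is also my addition, to keep [a-z_] limited to ASCII.

```shell
# Throwaway PHP file with a known mix of preg_ calls (hypothetical code).
demo=$(mktemp -d)
cat > "$demo/sample.php" <<'PHP'
<?php
if (preg_match($pattern, $dialed_number)) { $ok = true; }
$fields = preg_split("/[\s]+/", $line);
$words  = preg_split('/,/', $csv);
$out    = preg_replace_callback($re, $fn, $text);
if (preg_match('/^[0-9]+$/', $id)) { $ok = true; }
$found  = preg_match($pattern, $other);
PHP

# -o keeps only the function name; the match ends at "(" because
# "(" is not in [a-z_].
preg_counts=$(LC_ALL=C grep -E -o 'preg_[a-z_]+' "$demo/sample.php" |
  sort | uniq -c | sort -nr)
printf '%s\n' "$preg_counts"
rm -rf "$demo"
```

This reports preg_match three times, preg_split twice, and preg_replace_callback once. The underscore in the character class is what lets preg_replace_callback come through whole instead of being cut off at preg_replace.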

Next time we’ll talk about those functions in detail and how I used them.

If you’d like to learn more about regular expressions, you should try Wingtip Labs’ regular expressions tutorial.