I love PHP, and I love Regular Expressions. I’m putting together a few blog posts to show how I used regular expressions and PHP together in a relatively large project that I was in charge of for the past few years. (Unfortunately, this project is owned by my previous employer, so I’ll be sharing metadata and short code samples rather than the whole shebang.)
Why? I built a regular expressions tutorial, and part of what makes it amazing is that it uses real-world examples. This was a chance to find more examples to incorporate.
In this post, I’ll be using regular expressions to dredge up some information about the project:
- How big is it?
- How many files are actually PHP? (vs CSS, PNG, etc)
- How often did I use regular expressions?
- Which of PHP’s regular expression functions do I lean on?
Today you’ll see some practical uses for regular expressions with command-line tools such as find and egrep.
How large is this project?
How many files are in the repository?
We ask find to tell us all the file names, then have wc count them.
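The original command is not reproduced here, so this is only a minimal sketch of the idea, run against a small scratch directory. The directory layout and the names demo_dir and file_count are made up for illustration.

```shell
# Build a tiny scratch tree so the command has something to count.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/app/lib"
touch "$demo_dir/app/index.php" "$demo_dir/app/lib/db.inc" "$demo_dir/app/style.css"

# find prints one path per regular file; wc -l counts those lines.
# tr strips the padding that some versions of wc put before the number.
file_count=$(cd "$demo_dir" && find . -type f | wc -l | tr -d ' ')
echo "files: $file_count"

rm -rf "$demo_dir"
```

In a real repository, the only line that matters is the find . -type f | wc -l pipeline, run from the project root.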
How many lines of code are there in the project?
Here, we’re still asking find for a list of files, but instead of counting the file names, we’re using xargs to pass the filenames to wc, and asking wc to count the number of lines in each file.
The call to grep addresses a caveat for really big projects: xargs can’t hand an infinitely large set of arguments to wc. For large output from find, xargs runs wc more than once, so wc will output several interim subtotals.
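As a hedged illustration of the subtotal caveat, here is one way to get a single grand total no matter how many batches xargs creates. It is not the command from the original post: it swaps the grep step for awk, which adds up the per-file counts and skips every "total" line. The scratch files and variable names are hypothetical.

```shell
demo_dir=$(mktemp -d)
printf 'one\ntwo\n' > "$demo_dir/a.php"
printf 'one\ntwo\nthree\n' > "$demo_dir/b.inc"

# wc -l prints one "<count> <path>" line per file, plus a "total" line for
# each batch that xargs hands it. Paths from find start with "./", so a
# second field of exactly "total" can only be a subtotal line.
# -print0 / -0 keep file names with spaces intact (GNU and BSD both support them).
total_lines=$(cd "$demo_dir" && find . -type f -print0 | xargs -0 wc -l |
  awk '$2 != "total" { sum += $1 } END { print sum + 0 }')
echo "lines: $total_lines"

rm -rf "$demo_dir"
```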
All told, the project contains 1,576,197 lines of code. Wait, that’s not right: not everything in the project is code. Let’s look at file types.
What filetypes does this project contain?
What are the most popular file suffixes in the project?
Let’s break that command down a little.
- find . -type f: gets the names of all the files throughout the repository.
- egrep -o "\.[a-z]+$": returns only the part of the file name that matches this regular expression for the file suffix. More on this in a moment.
- sort: sorts all these file suffixes alphabetically.
- uniq -c: counts all the unique suffixes (but needs them to be sorted).
- sort -nr: sorts again, but from most occurrences to least (numerically, reversed).
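Putting those five pieces back together gives a pipeline like the sketch below, run here against a few made-up files. It uses grep -E, which is the same thing as egrep; the file names and the variables suffix_counts and top_suffix are assumptions for the demo.

```shell
demo_dir=$(mktemp -d)
touch "$demo_dir/index.php" "$demo_dir/admin.php" "$demo_dir/db.inc" "$demo_dir/README"

# List files, keep only the suffix, then tally suffixes from most to least common.
suffix_counts=$(cd "$demo_dir" && find . -type f |
  grep -Eo '\.[a-z]+$' | sort | uniq -c | sort -nr)
echo "$suffix_counts"

# The first row of the tally is the most common suffix.
top_suffix=$(echo "$suffix_counts" | awk 'NR == 1 { print $2 }')

rm -rf "$demo_dir"
```

Files with no suffix, such as README, simply drop out of the tally because the pattern never matches them.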
The regular expression \.[a-z]+$ that we passed to egrep breaks down as:
- \.: a literal dot. (Without the backslash, . matches any character.)
- [a-z]: any lowercase letter.
- +: repeat the match to my left (“any lowercase letter”) one or more times.
- $: the end of the file name. (There can’t be any more text after the last letter.)
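To see what the pattern does and does not catch, it can help to feed it a handful of names directly. The names below are invented; LC_ALL=C is an extra precaution, not part of the original command, that keeps [a-z] limited to plain ASCII lowercase letters whatever the locale.

```shell
# Only suffixes made entirely of lowercase letters survive:
# archive.tar.gz yields just its last suffix, while logo.PNG (uppercase),
# track.mp3 (contains a digit), and README (no dot) produce nothing.
matches=$(printf '%s\n' index.php archive.tar.gz logo.PNG track.mp3 README |
  LC_ALL=C grep -Eo '\.[a-z]+$')
echo "$matches"
```

So a tally built on this pattern undercounts suffixes with digits or capitals, which is worth keeping in mind when reading the results.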
The output tells us that there are 610 .php files. It also jogs my memory that we used .inc files for PHP library code that couldn’t be called by Apache directly. So we need to count another 375 .inc files.
How many PHP lines of code does the project contain?
Now let’s get the line count for all files ending .php or .inc.
Here we’re using the -regex functionality of find. The -E flag turns on “extended” regular expressions.
Let’s break down the regular expression .+\.(php|inc) that we passed to find:
- .+: one or more of any character. The -regex flag takes a pattern that matches the whole file name, so we use this pattern to match “whatever” the file name starts with. The . means any character, and the + means “one or more of that thing to my left.”
- \.: literally a dot. Without the backslash, this would mean “one more of anything.”
- (php|inc): either the text php or the text inc.
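The -E flag belongs to the BSD/macOS version of find; GNU find spells the same idea -regextype posix-extended. The sketch below sidesteps the difference by letting grep -E do the filtering instead, so it is a portable stand-in rather than the post's original command. The scratch files and the name php_lines are hypothetical.

```shell
demo_dir=$(mktemp -d)
printf '<?php\necho 1;\n' > "$demo_dir/index.php"
printf '<?php\n' > "$demo_dir/db.inc"
printf 'body {}\n' > "$demo_dir/style.css"
printf 'notes\n' > "$demo_dir/php.txt"

# grep matches anywhere in the line, so the pattern needs a trailing $ to
# pin the suffix to the end of the path (find -regex matches the whole path,
# which is why the pattern in the text can go without it).
# tr + xargs -0 keeps paths with spaces intact; awk sums the per-file counts.
php_lines=$(cd "$demo_dir" && find . -type f |
  grep -E '.+\.(php|inc)$' | tr '\n' '\0' | xargs -0 wc -l |
  awk '$2 != "total" { sum += $1 } END { print sum + 0 }')
echo "PHP lines: $php_lines"

rm -rf "$demo_dir"
```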
How can I avoid counting branch lines-of-code when estimating project size?
143,509 lines is at least the right order of magnitude, but when I ran it without the grep statement, I could see that it was double-counting some files that exist, unmodified, in different ongoing branches. So let’s tighten up the count by only counting lines of text in PHP files in project trunks.
The regular expression we’re using with find, .*/trunk/.*\.(php|inc)$, breaks down as:
- .*: zero or more of any character.
- /trunk/: the literal text /trunk/. In other words, the path contains a folder named trunk.
- .*: zero or more of any character. This means /trunk/ can be anywhere in the path.
- \.: literally a dot.
- (php|inc)$: either php or inc, at the end of the file name.
For paths that contain a folder trunk, and filenames that end with .php or .inc, I have 45,682 lines of code.
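Here is a small sketch of the trunk-only count, using an invented trunk/branches layout. As a portability assumption it filters with grep -E rather than find -E, but the regular expression is the one discussed above.

```shell
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/site/trunk/lib" "$demo_dir/site/branches/redesign"
printf '<?php\necho 1;\n' > "$demo_dir/site/trunk/index.php"
printf '<?php\n' > "$demo_dir/site/trunk/lib/db.inc"
# Same file copied into a branch: this is the double-counting we want to avoid.
printf '<?php\necho 1;\n' > "$demo_dir/site/branches/redesign/index.php"

# Keep only PHP files that live somewhere under a trunk folder, then sum lines.
trunk_lines=$(cd "$demo_dir" && find . -type f |
  grep -E '.*/trunk/.*\.(php|inc)$' | tr '\n' '\0' | xargs -0 wc -l |
  awk '$2 != "total" { sum += $1 } END { print sum + 0 }')
echo "trunk PHP lines: $trunk_lines"

rm -rf "$demo_dir"
```

The branch copy of index.php is ignored, so the total reflects the two trunk files only.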
Where am I using regular expressions in my PHP code?
Personally, I always use PHP’s Perl-compatible regular expression functions, which all begin with “preg_”.
In this command, we use xargs to pass all the file names as arguments to grep. The grep command outputs all the lines in all those files that contain “preg_”.
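A minimal sketch of that search, assuming a made-up PHP file under a trunk folder. The file contents and the names preg_hits and hit_count are illustrative only.

```shell
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/trunk"
cat > "$demo_dir/trunk/form.php" <<'PHP'
<?php
if (preg_match('/^\d+$/', $id)) {
    $slug = preg_replace('/\s+/', '-', $title);
}
echo $slug;
PHP

# Hand every trunk PHP file to grep and print each line that mentions preg_.
preg_hits=$(cd "$demo_dir" && find . -type f |
  grep -E '.*/trunk/.*\.(php|inc)$' | tr '\n' '\0' | xargs -0 grep 'preg_')
echo "$preg_hits"
hit_count=$(printf '%s\n' "$preg_hits" | wc -l | tr -d ' ')

rm -rf "$demo_dir"
```

When grep is given more than one file, it prefixes each matching line with the file name, which makes it easy to jump to the code.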
What preg_ functions am I using, and how often?
Let’s break down that command:
- find -E . -regex '.*/trunk/.*\.(php|inc)$' -type f: get all the files ending in .php or .inc under a trunk folder.
- xargs egrep -o "preg_[a-z_]+" -h: find the full preg_ function name. (More on the regex below.) The -o option tells egrep to return only the part of the line that matches the expression. The -h option suppresses the file name you found it in.
- sort | uniq -c | sort -nr: count how many times each function appears, and sort descending.
And the regular expression we hand to egrep:
- preg_: the literal text preg_.
- [a-z_]: any lowercase letter from a to z, or an underscore.
- +: repeat “a to z, or an underscore” one or more times. This will stop matching when it gets to the ( around the arguments.
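The whole tally can be sketched end to end on two invented files. As before, grep -E stands in for both find -E and egrep so the example runs on GNU and BSD tools alike; the file contents and variable names are assumptions.

```shell
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/trunk"
cat > "$demo_dir/trunk/form.php" <<'PHP'
<?php
if (preg_match('/^\d+$/', $id) && preg_match('/^[a-z]+$/', $name)) {
    $slug = preg_replace('/\s+/', '-', $title);
}
PHP
cat > "$demo_dir/trunk/util.inc" <<'PHP'
<?php
$parts = preg_split('/,/', $csv);
$ok = preg_match('/x/', $s);
PHP

# -o prints each match on its own line (two on the first line of form.php),
# -h drops the file name, and sort | uniq -c | sort -nr builds the tally.
preg_tally=$(cd "$demo_dir" && find . -type f |
  grep -E '.*/trunk/.*\.(php|inc)$' | tr '\n' '\0' |
  xargs -0 grep -Eoh 'preg_[a-z_]+' | sort | uniq -c | sort -nr)
echo "$preg_tally"
most_used=$(echo "$preg_tally" | awk 'NR == 1 { print $2 }')

rm -rf "$demo_dir"
```

Because the character class includes the underscore, longer names such as preg_replace_callback are captured whole rather than being cut off at preg_replace.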
Next time we’ll talk about those functions in detail and how I used them.
If you’d like to learn more about regular expressions, you should try Wingtip Labs’ regular expressions tutorial.