I love PHP, and I love Regular Expressions. I’m putting together a few blog posts to show how I used regular expressions and PHP together in a relatively large project that I was in charge of for the past few years. (Unfortunately, this project is owned by my previous employer, so I’ll be sharing metadata and short code samples rather than the whole shebang.)
Why? I built a regular expressions tutorial, and part of what makes it amazing is that it uses real-world examples. This was a chance to find more examples to incorporate.
In this post, I’ll be using regular expressions to dredge up some information about the project:
- How big is it?
- How many files are actually PHP? (vs. CSS, PNG, etc.)
- How often did I use regular expressions?
- Which of PHP’s regular expression functions do I lean on?
Today you’ll see some practical uses for regular expressions with `find` and `egrep`.
How large is this project?
How many files are in the repository?
Let’s ask `find` to tell us all the file names, then have `wc` count them.
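As a rough sketch of that idea (the throwaway directory and its file names are made up for illustration, not taken from the project), the pipeline looks something like this. In a real repository you would presumably point `find` at `.` instead of a temp directory.

```shell
# Build a tiny throwaway tree so the command has something to count.
# (demo_dir and the file names are hypothetical.)
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/app/lib"
touch "$demo_dir/index.php" "$demo_dir/app/style.css" "$demo_dir/app/lib/util.inc"

# find prints one path per regular file; wc -l counts those lines.
file_count=$(find "$demo_dir" -type f | wc -l)
echo "$file_count"

rm -rf "$demo_dir"
```

One caveat: a file name containing a newline would be counted twice, which is rare enough that I’d ignore it for a quick estimate.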
How many lines of code are there in the project?
Here, we’re still asking `find` for a list of files, but instead of counting the file names, we’re using `xargs` to pass the file names to `wc`, and asking `wc` to count the number of lines in each file.
The call to `grep` addresses a caveat for really big projects: `xargs` can’t hand an infinitely large set of arguments to `wc`. For large output from `find`, `xargs` splits the file list across several `wc` runs, so `wc` will output several interim subtotals.
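Here is a minimal sketch of how those subtotals could be handled, using a made-up two-file tree. The `awk` step that adds up the `total` lines is my own addition for illustration. The second command avoids subtotals entirely by concatenating the files first, which also covers the edge case where a `wc` run receives a single file and prints no `total` line at all.

```shell
# Hypothetical sample tree: two files, three lines in total.
demo_dir=$(mktemp -d)
printf 'one\ntwo\n' > "$demo_dir/a.php"
printf 'three\n' > "$demo_dir/b.css"

# wc -l prints a "total" line whenever it is given more than one file.
# If xargs splits a long list into several wc runs there are several such
# lines, so awk adds them up. (Assumes no file is literally named "total".)
total_lines=$(find "$demo_dir" -type f | xargs wc -l \
  | awk '$2 == "total" { sum += $1 } END { print sum + 0 }')

# Cross-check: concatenate everything and count once, no subtotals involved.
cat_lines=$(find "$demo_dir" -type f -exec cat {} + | wc -l)

echo "$total_lines $cat_lines"
rm -rf "$demo_dir"
```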
All told, the project contains 1,576,197 lines of code. Wait, that’s not right – not everything in the project is code. Let’s look at file types.
What filetypes does this project contain?
What are the most popular file suffixes in the project?
Let’s break that command down a little.
- `find . -type f` is getting the names of all the files throughout the repository.
- `egrep -o "\.[a-z]+$"` returns only the part of the file name that matches this regular expression for the file suffix. More on this in a moment.
- `sort` sorts all these file suffixes alphabetically.
- `uniq -c` counts all the unique suffixes (but needs them to be sorted).
- `sort -nr` sorts again, but from most occurrences to least (numerically, reversed).
The regular expression `\.[a-z]+$` that we passed to `egrep` breaks down as:
- `\.` a literal dot. (Without the backslash, `.` matches any character.)
- `[a-z]` any lower case letter
- `+` repeat the match to my left (“any lower case letter”) one or more times
- `$` the end of the file name. (There can’t be any more text after the last letter.)
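Putting the pieces from the breakdown together, a sketch of the whole pipeline might look like this. The sample files are hypothetical, and I’ve written `grep -E`, which behaves the same as `egrep`.

```shell
# Hypothetical sample tree with a mix of suffixes and one suffix-less file.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.php" "$demo_dir/b.php" "$demo_dir/c.inc" \
      "$demo_dir/logo.png" "$demo_dir/README"

# -o prints only the matched suffix; README has no lower case suffix,
# so it produces no output line and drops out of the count.
suffix_counts=$(find "$demo_dir" -type f \
  | grep -E -o '\.[a-z]+$' | sort | uniq -c | sort -nr)
echo "$suffix_counts"

rm -rf "$demo_dir"
```

Because the pattern only allows lower case letters, a file like `PHOTO.JPG` would also be skipped. Adding `-i` to `grep` would be one way to catch those.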
The output tells us that there are 610 .php files. It also jogs my memory that we used .inc files for PHP library code that couldn’t be called by Apache directly. So we need to count another 375 .inc files.
How many PHP lines of code does the project contain?
Now let’s get the line count for all files ending in .php or .inc.
Here we’re using the `-regex` functionality of `find`. The `-E` flag turns on “extended” regular expressions.
Let’s break down the regular expression `.+\.(php|inc)` that we passed to `find`:
- `.+` One or more of any character. The `-regex` flag takes a pattern that matches the whole file path, so we use this pattern to match “whatever” the file name starts with. The `.` means any character, the `+` means “one or more of that thing to my left.”
- `\.` Literally a dot. Without the backslash, this would match any single character.
- `(php|inc)` One of `php` or `inc`.
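One portability note, with a sketch: the `-E` flag belongs to the BSD/macOS version of `find`, while GNU `find` spells the same thing `-regextype posix-extended`. To keep the example runnable on either, the version below is a stand-in that lets `grep -E` apply the suffix test instead. The sample files are invented for illustration.

```shell
# Hypothetical sample tree: two PHP-ish files (3 lines) and one CSS file.
demo_dir=$(mktemp -d)
printf '<?php\necho 1;\n' > "$demo_dir/index.php"
printf '<?php\n' > "$demo_dir/util.inc"
printf 'body {}\n' > "$demo_dir/style.css"

# BSD/macOS: find -E "$demo_dir" -regex '.+\.(php|inc)' -type f
# GNU find:  find "$demo_dir" -regextype posix-extended -regex '.+\.(php|inc)' -type f
# Portable stand-in: grep is not anchored to the whole path, so the
# leading .+ is unnecessary and a trailing $ pins the suffix to the end.
php_files=$(find "$demo_dir" -type f | grep -E '\.(php|inc)$' | sort)
php_lines=$(echo "$php_files" | xargs cat | wc -l)
echo "$php_lines"

rm -rf "$demo_dir"
```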
How can I avoid counting branch lines-of-code when estimating project size?
143,509 lines is at least the right order of magnitude, but when I ran it without the `grep` statement, I could see that it was double-counting some files that exist, unmodified, in different ongoing branches. So let’s tighten up the count by only counting lines of text in PHP files in project trunks.
This regular expression we’re using with `find` matches:
- `.*` Zero or more of any character.
- `/trunk/` The literal text `/trunk/`. In other words, the path contains a folder `trunk` somewhere.
- `.*` Zero or more of any character. This means `/trunk/` can be anywhere in the path.
- `\.` Literally a dot.
- `(php|inc)` One of `php` or `inc`.
For paths that contain a folder `trunk`, and filenames that end with `.php` or `.inc`, I have 45,682 lines of code.
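To see the trunk filter do its job, here is a small sketch with an invented layout: one `trunk` folder and one branch holding an unmodified copy of the same file. As before, it filters with `grep -E` rather than `find -E` so it runs on both GNU and BSD systems.

```shell
# Hypothetical layout: the branch duplicates index.php from trunk.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/site/trunk/lib" "$demo_dir/site/branches/redesign"
printf 'a\nb\n' > "$demo_dir/site/trunk/index.php"
printf 'c\n'    > "$demo_dir/site/trunk/lib/db.inc"
printf 'a\nb\n' > "$demo_dir/site/branches/redesign/index.php"

# Keep only paths that pass through a trunk folder and end in .php or .inc.
trunk_files=$(find "$demo_dir" -type f | grep -E '/trunk/.*\.(php|inc)$')
trunk_lines=$(echo "$trunk_files" | xargs cat | wc -l)
echo "$trunk_lines"

rm -rf "$demo_dir"
```

Without the `/trunk/` part of the pattern, the branch copy would be counted too and the total would come out as 5 instead of 3.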
Where am I using regular expressions in my PHP code?
Personally, I always use PHP’s Perl-compatible regular expression functions, which all begin with “preg_”.
In this command, we use `xargs` to pass all the file names as arguments to `grep`. This `grep` command outputs all the lines in all those files that contain “preg_”.
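A sketch of that search on two invented PHP files (the file contents are placeholders, not code from the project):

```shell
# Hypothetical trunk with two small PHP files.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/trunk"
cat > "$demo_dir/trunk/form.php" <<'EOF'
<?php
if (preg_match('/^\d+$/', $id)) {
    $clean = trim($name);
}
EOF
cat > "$demo_dir/trunk/util.inc" <<'EOF'
<?php
$slug = preg_replace('/\s+/', '-', $title);
EOF

# xargs turns the list of paths into arguments for grep, which prints
# every line containing "preg_", prefixed with the file it came from.
preg_lines=$(find "$demo_dir" -type f \
  | grep -E '/trunk/.*\.(php|inc)$' | xargs grep 'preg_')
echo "$preg_lines"

rm -rf "$demo_dir"
```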
What preg_ functions am I using, and how often?
Let’s break down that command:
- `find -E . -regex '.*/trunk/.*\.(php|inc)$' -type f` Get all the files ending `.php` or `.inc`.
- `xargs egrep -o "preg_[a-z_]+" -h` Find the full preg_ function name. (More on the regex below.) The `-o` option causes `egrep` to return only the part of the line that matches the expression. The `-h` option suppresses the file name you found it in.
- `sort | uniq -c | sort -nr` Count how many times each function appears, and sort descending.
And the regular expression we hand to `egrep`:
- `preg_` The literal text `preg_`.
- `[a-z_]` Any lower case letter from a to z, or an underscore.
- `+` Repeat “a to z, or an underscore” one or more times. This will stop matching when it gets to the `(` around the arguments.
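Assembled from the pieces above, a runnable sketch of the tally could look like this. The two PHP files are invented, and the file filter uses `grep -E` in place of `find -E` so it works with both GNU and BSD tools.

```shell
# Hypothetical trunk: preg_match appears twice, preg_replace once.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/trunk"
cat > "$demo_dir/trunk/form.php" <<'EOF'
<?php
if (preg_match('/^\d+$/', $id)) { echo 'ok'; }
$slug = preg_replace('/\s+/', '-', $title);
EOF
cat > "$demo_dir/trunk/util.inc" <<'EOF'
<?php
$ok = preg_match('/^[a-z]+$/', $word);
EOF

# -o keeps only the function name, -h drops the file name prefix,
# then sort | uniq -c | sort -nr builds the descending tally.
preg_counts=$(find "$demo_dir" -type f \
  | grep -E '/trunk/.*\.(php|inc)$' \
  | xargs grep -E -o -h 'preg_[a-z_]+' | sort | uniq -c | sort -nr)
echo "$preg_counts"

rm -rf "$demo_dir"
```

Note that this counts every textual occurrence, so a `preg_` name mentioned in a comment or string would be tallied as well.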
Next time we’ll talk about those functions in detail and how I used them.
If you’d like to learn more about regular expressions, you should try Wingtip Labs’ regular expressions tutorial.