The mysqlslap utility helps you visualize how your database performance will improve with more hardware, new tuning, or different indexes.
The trouble is, it’s designed to use fake data. You can tune the fake data to look increasingly like your real data, and that will help you get a feel for changes like bigger hardware. But fake data teaches you almost nothing for deep changes like altering your schema, adding new indexes, or tweaking memory and cache parameters.
So we’re going to get completely authentic data from your database so you can plan and test performance changes with higher confidence.
This procedure is really ideal if you have a second server about the same size as the one in production. You absolutely do not want to run this on a production server; mysqlslap will try to contain the changes to a throwaway database (named mysqlslap), but it isn't very bright and bad things are incredibly likely to happen. Run this process on a lab server, or a cloud server, or be prepared to do a full data restore when you're finished.
Log into MySQL and turn on the innocently named “general log.” Don’t let the name fool you, this records every connect, disconnect, and query (write and read) occurring on the server. This is a bad idea if your server is anywhere near capacity, because it adds a disk hit to every single action on the server.
Now run some traffic through your system that you want to simulate. You could take a test account through a tour of your product’s important features, collect real traffic for a day, etc. When you’re done, turn the general log off:
Copy the general log file into your home directory and change its permissions. (It was created with strong permissions since it could contain sensitive data.)
Now get these files ready for mysqlslap. The general log file contains more than just queries.
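A representative sample in the general log's format (the thread ID, user, and queries here are illustrative stand-ins, not the real excerpt):

```
110815 14:22:33    57 Connect   webapp@localhost on prod_db
                   57 Query     SELECT * FROM users WHERE id = 42
                   57 Query     SELECT * FROM carts WHERE user_id = 42
                   57 Quit
```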
These four records show a user (really a PHP app) connecting, running two SELECT queries, and disconnecting. mysqlslap only needs the queries, so we'll create a new slap.sql with only the Query rows, and without the log columns before the SQL statements.
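At the shell, that extraction can be a single sed pass (a sketch: the column layout varies a little between MySQL versions, and a query that itself contains the word Query would be over-trimmed, so eyeball the output):

```shell
# Print only lines that contain a Query entry, stripping everything up to
# and including the word "Query" and the whitespace after it.
sed -nE 's/^.*[[:space:]]Query[[:space:]]+//p' general.log > slap.sql
```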
We’re also going to re-execute these queries repeatedly, both in parallel (to load the server) and sequentially (to get multiple runs of the data for statistical significance, and to see “warm cache” performance). Repeating INSERTs in tables with unique primary keys is a problem, so we’ll replace all INSERT INTO with REPLACE INTO, which prevents key collisions (but can cause trouble with constraints or cascading deletes).
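With GNU sed, that swap is a one-line in-place edit (assuming every INSERT starts its line, which is true for statements harvested from the general log):

```shell
# REPLACE INTO acts like INSERT INTO, except a row that would collide on a
# unique key is overwritten instead of raising a duplicate-key error.
sed -i 's/^INSERT INTO/REPLACE INTO/' slap.sql
```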
You might have noticed that the SQL statements in slap.sql don't have the customary ; at the end. That's actually OK: by default, mysqlslap considers each line to be a SQL statement, so slap.sql is fine. But you will need to make each of the CREATE TABLE statements in create.sql into one line, so open it in your favorite text editor and join each multi-line statement into a single row.
If you use vim, it's easy: just search for tables with /CREATE TABLE, then use J to join lines until you have the entire statement in one row.
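If vim isn't your editor, awk can do the joining mechanically (a sketch that assumes each statement ends with a ; at the end of a line, which is how mysqldump writes them):

```shell
# Strip leading indentation, then glue lines together with spaces until we
# see a line ending in ";", which closes the statement.
awk '{gsub(/^[[:space:]]+/, ""); printf "%s%s", $0, (/;$/ ? "\n" : " ")}' \
    create.sql > create-oneline.sql
```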
mysqlslap will create a new database (named mysqlslap) and set it up using create.sql (from our backup). Then it will run the queries in slap.sql (collected from the general log) as many times as we dictate.
Run the whole shebang once:
Run create.sql once, then slap.sql 10 times (single-threaded, one after another). This gives us more test data and helps eliminate outliers.
Run create.sql once, then have 10 concurrent clients each run slap.sql 10 times (100 times total):
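Pulling those flags together, the concurrent run might look like this (a sketch using mysqlslap's standard options; adjust credentials and file paths to your setup):

```shell
# --create runs once to build the schema in the throwaway mysqlslap database;
# --query is the workload file; --concurrency is the number of simulated
# clients; --iterations is how many times each client replays the file.
mysqlslap --user=root --password \
  --create=create.sql \
  --query=slap.sql \
  --concurrency=10 \
  --iterations=10
```

Drop --concurrency (or set it to 1) for the sequential runs.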
These numbers don’t get interesting until you have a change that lets you collect before and after slaps.
I'm considering turning on sync-binlog, which can prevent a master server crash from causing some transactions to never replicate out when the master comes back online, but it's notorious for its performance penalty. So I'll run a slap, then make the proposed change, then rerun the slap.
This tells me that turning on sync-binlog is likely to increase my run times by about 70% (the average run jumped from 3 seconds to more than 5 seconds). I can combine that with other knowledge (from the humble top to sophisticated tools like vmstat and iostat) to see if my existing hardware can take that kind of performance hit, or if I need faster disk or some other preventative planning.
That outcome might be completely different for an application with a higher read/write ratio, or a database that fits entirely in memory, or with a bigger write cache on my RAID card. But, because I fed mysqlslap my own data, I have high confidence that I understand the consequences of the change I'm planning.
Some DBAs go their whole career using just one or two storage engines. In this article, we’ll take a peek at four remarkable storage engines you might have overlooked, all of which ship with MySQL 5.5.
Different applications create different data with different needs. Some applications require consistency and crash safety, others require speed, others require vast storage and can accept slow queries. MySQL’s pluggable storage engines let DBAs pick a storage model that fits the data while the application continues to use the same MySQL client libraries and SQL statements.
Most MySQL developers choose between MyISAM and InnoDB. MyISAM is fast, and has more than a decade of developer and DBA goodwill behind it. InnoDB brings transaction support for full ACID compliance, and includes row-level locks and caching that can make it even faster than MyISAM under concurrent load.
Ultimately both these storage engines make wise compromises for the vast majority of everyday workloads. But when you have an extraordinary problem, you might turn to one of these:
The Memory storage engine is a good fit for super fast access to information that is either ephemeral (like session management) or exists elsewhere (like a temporary table). Memory tables have no disk-based permanence; if MySQL crashes or the server reboots, that data is gone. Note that you can get better performance than Memory tables, along with higher scale and permanence, if you make the jump to MySQL Cluster—but that’s a little more involved than switching storage engines.
The Archive engine is ideal for tables that you append lots of data to, but hardly ever read, like logs. It only supports INSERT and SELECT; it does not support UPDATE, REPLACE, or DELETE, or even ORDER BY. INSERTed rows are compressed before being appended to the table on disk. The engine provides no row cache and no indexes; SELECT statements always perform a complete table scan, uncompressing the table as they go. As a result, the data takes up very little space, but will be very slow to retrieve.
The CSV engine stores data in a comma-separated value file on disk. A CSV file provides an easy compatibility point to share this data with other systems (like Excel). The benefit of using the CSV engine (instead of import/export options like load data infile and select into outfile) is that the underlying CSV file is kept continuously up to date, while being manipulated with standard SQL statements.
Blackhole is a storage engine that doesn’t actually store anything. SELECT, UPDATE, and DELETE always return 0 rows. Any INSERT succeeds, but the data is thrown away. It’s commonly used in building a Replication Relay: the relay copies events from the master’s binlog to its own binlog, then downstream slaves can read from the relay’s binlog. But no one cares if the relay applied the master’s changes to itself, so the relay can use the Blackhole engine to dramatically reduce I/O. You can get hands-on experience building a relay with the blackhole engine in our online MySQL Replication course.
Thanks for reading! If you’d like to get a little better at MySQL every week, you should sign up for our MySQL Tip of the Week mailing list. Not only will it make you smarter, we periodically send subscribers discounts for our courses!
Raw HTML is a low-level language, and it's starting to bum me out. I'm working on a project that has me writing a large number of relatively simply marked up pages. (We'll see the structure below.) In this post, I'm going to implement a Domain-Specific Markup Language, using PHP. It'll use nouns relevant to my subject as I write it, and output good old HTML when it's time to render to a user.
(It looks like I’m not the first person to use the name DSML, but I promise to keep this article a little less academic by adding code samples and not using the word “indeed.”)
When I fire up my editor, my programming language knows nothing about my problem. One way to construct a program is to build up your language, to make it easier to express your problem in your problem’s language. When Ivan Jovanovic says programming languages are simply not powerful enough, he means it’s your job to make them more powerful, by shaping the primitives to fit your problem domain.
Likewise, HTML is full of wonderful primitives to mark up just about any human written knowledge. But I have a concrete problem that is a tiny specialized subset of human written knowledge, and so do you. Building a DSML is a way to build up HTML into the domain I’m working in. And, because HTML is the lingua franca of display, we’ll convert to it when it’s time to display.
I’m writing a course on MySQL administration. The largest volume of actual lines-typed-into-editors in the project is the content of the lessons. Every lesson is an HTML page, with one ordered list of steps, and every step has one or more tests that we perform to make sure the student did the work correctly.
Here’s how we used to write the lessons:
This presents a few problems:
- When I type </li></ol>, what did I just close? Where am I?
- Is <pre class='cli'> superior to <cli>? What do I get in return for 10 more characters?
Here's what I want to write:
This is starting to look like XML, without accidentally becoming XHTML. The joy of my DSML is that I’m writing a language that knows what I mean, not that I want a new layer of quoting rules.
Let's start with the easy work: converting the <steps> list and the <step> elements back into <ol> and <li>s. I'm using the QueryPath library. It's a very similar API to jQuery, and because I do the transform server-side, I can provide well-formed pages to clients without JavaScript (like search engine spiders).
Of course, nobody’s perfect, so let’s add some rules to catch operator error:
While I'm writing lessons, my warn() function adds bold red error messages to the top of the parsed document. In production, warn() will quietly log them.
Those were simple replacements; you could do that with a regular expression and some duct tape. Let's make this <cli> tag fix my problems with HTML's <pre>: I want to put the closing </pre> tag on its own line, without showing the student an empty line at the bottom of the code block. In other words, I want it to work like this:
But let me edit it like this:
Here’s the code that does it:
In lessons, when we introduce new terms, the student can hover over them to get a Bootstrap Popover that loads the definition from our glossary, dynamically. Here’s how that used to look in our code:
Here’s how I want it to look:
And here’s how we do it:
Now we get warnings about terms we haven't written glossary entries for (and we don't call attention to them, to avoid embarrassment in front of students). Tagging glossary entries is easier (so we'll do it more). And we're free to make dramatic changes to the way we present glossary terms without touching a zillion lesson files. For example, we could switch to the <dfn> tag instead of <abbr>.
HTML is pretty great, but I wouldn't want to write in it.
Got comments? Head over to Reddit
I was setting up replication on a new MySQL server, which starts with turning on binary logging by editing /etc/my.cnf. Of course, I was logged in as a low-privilege user, and /etc/my.cnf is owned by root, and I don't have write privilege to it.
Typically, I'd run sudo vi /etc/my.cnf. That works, but it wasn't a good long-term fit here. I'm writing a hands-on MySQL course and I want to give students all the access they need to administer the MySQL database, but not access to, say, turn the lab server into a BitTorrent seed at my expense.
As an administrator, I need to control which files my users can edit with elevated privileges.
In the old sudo vi /etc/my.cnf world, I would need an entry in /etc/sudoers like:
There are several problems for administrators here. The most serious is that you can use vi to launch other commands (with ! in command mode):
There's a fix for that: I can change my /etc/sudoers entry to:
Now I have a new problem: some people don’t love vi. I don’t want to be in the business of telling you which editor you can run, I want to be in the business of telling you which files you can modify.
And heaven forbid I end up with a (# of editors) x (# of files) matrix I have to keep current in sudoers. Blerg.
Instead, I can authorize students to edit specific files using whatever editor they want (more on that below) with this entry in /etc/sudoers:
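The entry leans on sudo's built-in sudoedit pseudo-command; it looks something like this (student is a stand-in for your user or group):

```
student ALL = (root) sudoedit /etc/my.cnf
```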
Most importantly to me as a user, I get to use whatever editor I want. There's a system-wide default, but I can override it for myself.
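sudoedit picks the editor from the SUDO_EDITOR environment variable first, then VISUAL, then EDITOR, so the override is a single export:

```shell
export SUDO_EDITOR=vim   # or emacs, nano, whatever you like
```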
I can run that every time I log in, but I'd rather append it to my ~/.bashrc. The other bonus is that my editor is running as me. That means all the effort I put into my kickin' ~/.vimrc, my favorite syntax highlighters, my favorite plugins, all follow me even when I escalate privilege. You don't get that with sudo vi; you get root's crappy preferences.
sudoedit actually doesn't let you edit the file directly. Instead, it creates a copy in /tmp that only you have access to.
You can see more about the special copy with :! ls -l % in vi (the % expands to the file currently being edited).
You can see (at the bottom) that there's a new file in /tmp whose name is based on my.cnf but with some extra characters in the middle to prevent collisions. It's owned by the low-privilege user, and only that user can read/write it.
When you exit, sudoedit overwrites the original. (Protip: sudoedit does not update the real file every time you write changes to the temp file. It waits until you exit your editor.)
Why not just sudo $FAVORITE_EDITOR? sudoedit lets the admin tighten sudoers with a "least privilege" model, while still letting the user choose which editor to use. And sudoedit preserves all your editor customizations; sudo $EDITOR doesn't.
If you're not using a least privilege model for your users, or if you don't customize your editor, sudoedit is probably not right for you. But if you're like me, this is gonna make your day.
Today, I’m going to share with you some of the worst regular expressions I’ve used in real projects, and the lessons I wish I could share with the n00b me that wrote this crap.
What makes my regex tutorial great is that it uses real world examples. So I decided to build some new levels using material pulled from a large (~45,000 lines-of-code) multi-year PHP project I recently worked on.
This is the second article looking at that project. The first article, Using Regular Expressions to Nose Around a Large PHP Project, used find and grep to get a sense of the project's size and how it used regex (matching, replacing, splitting, etc.).
This article explores some of the worst regular expressions I contributed to that project, with the wisdom only hindsight can provide.
This takes a user-provided string $search_term, and tests whether it ends with soft.
The Good — This code tries to accept any reasonable input, since I have no idea whether users will provide Soft or SOFT or soft or SoFt.
The Bad — I don't need strtolower($search_term) to manage capitalization. I should be using the case-insensitive regex flag, i.
This code is equivalent and simpler:
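The same lesson holds outside PHP. At the shell, grep's -i flag does the case folding (the input string here is just an illustration):

```shell
# -E enables extended regexes, -i makes the match case-insensitive.
echo 'MicroSOFT' | grep -Ei 'soft$'   # prints: MicroSOFT
```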
This one’s the same sin, only more forehead-slappy:
That code turns <br> tags in the input into plaintext-friendly carriage-return line-feeds. The project uses this when we've accepted input from rich text editors (that will usually be output in web pages) that now needs to be displayed at a CLI or mailed out in plaintext.
The Good — Like the previous example, it’s very tolerant.
- [bB][rR] accepts the tag in any case (<br> and <BR> and even <bR>)
- \/? accepts the normal or self-closing form (<br> and <br/>)
- The optional space (matched by " ?") allows a space before the self-closing form (<br/> and <br />)
The Bad — Thank goodness it's only a two-letter tag, because I'd hate to find I'd ever written [bB][lL][iI][nN][kK] (for more than one reason).
Here’s a less-goofy, functionally-equivalent rewrite:
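For comparison, GNU sed can express the same case-insensitive replacement with its I modifier (GNU-specific; I'm substituting a bare newline for the \r\n here):

```shell
# Match <br>, <br/>, and <br />, in any capitalization, and replace each
# with a newline.
printf 'line one<BR />line two' | sed -E 's@<br ?/?>@\n@gI'
```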
This isn’t a terrible practice, but let’s call it a regex smell: never roll your own parser for any file format you can’t explain in one breath.
This turns the $input string, which is one row of a comma-separated-value (CSV) file, into an array, $columns.
The Good — By the time I wrote this, I'd already learned (probably the hard way) that sometimes CSVs have whitespace you don't expect. The pattern \s* will accept zero or more spaces on either side of the comma.
The Bad — Writing your own CSV parser is fun and easy while the CSV is small and written by a programmer. But lord help you if it comes out of Excel or is written by anyone but you. Here’s a better idea, use the CSV parser the language provides:
This has a few advantages:
- The language's CSV parser understands quoting and embedded commas, which my regex never will.
- The per-column cleanup stays explicit. In this example it only needs to trim whitespace. But in the real app, some columns are user IDs that need strtolower, some are numbers that need intval or floatval, and some are keywords that can only be one of a few values.
Here's that same bad idea, except now it's a security risk.
This code takes input from a user and throws out any HTML tags. It assumes tags start with a <, have some content that isn't > (matched by [^>]+), and end with >. The most obvious flaw is that a > can legally appear inside a quoted attribute value, so a perfectly valid tag gets chopped at the first >, and the regex leaves the tail of the tag behind as stray text.
But the broader problem is that parsing HTML is difficult, and the stakes here are very high. This code needs to make sure users can’t inject potentially dangerous HTML into site content, so trusting myself to a hokey regular expression was a terrible idea.
I should have been using PHP's built-in strip_tags(), which doesn't have these problems.
When in doubt, look for a language feature or a well-written library to parse any complicated data format, especially when that data is coming from untrusted (or just non-technical) sources.
This example was probably written the day I found PHP's list construct.
The Bad — This code splits up detecting the type of row you’re parsing from actually parsing it. It’s like the poster child for don’t-repeat-yourself, since the two regexes are deceptively similar but subtly, brokenly, different.
The regex at line 2:
- ^ matches from the beginning of the row
- [\w]+ matches one-or-more alphanumeric characters
- = matches a literal equals sign
- [\w]+ matches one-or-more alphanumeric characters
- $ matches to the end of the row
The square brackets are totally unnecessary; this is equivalent and cleaner:
Example Matches: max_connections=100 (\w includes underscore).
Example Misses: max_connections = 100 (\w doesn't match spaces).
The regex at line 3 then splits the string around an equal sign with zero or more spaces (\s*) on the left or right. The problem is, the regex on line 2 would fail if there were any spaces anywhere on the row.
This uses just one smarter regex:
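Bash can sketch the same one-regex approach with [[ =~ ]] and BASH_REMATCH; POSIX character classes stand in for \w and \s, which bash's ERE engine doesn't understand:

```shell
line='max_connections = 100'
if [[ $line =~ ^([[:alnum:]_]+)[[:space:]]*=[[:space:]]*([[:alnum:]_]+)$ ]]; then
    # BASH_REMATCH[1] and BASH_REMATCH[2] hold the two captured groups.
    echo "key=${BASH_REMATCH[1]} value=${BASH_REMATCH[2]}"
fi
```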
The new regular expression:
- ^ matches from the beginning of the row
- (\w+) captures one-or-more alphanumeric characters into $matches[1]
- \s*=\s* matches (and doesn't capture) a = with zero or more spaces on either side
- (\w+) captures one-or-more alphanumeric characters into $matches[2]
- $ matches to the end of the row
Hope you've enjoyed this cringe-inducing trip down memory lane. If you'd like to learn more about writing good regular expressions, you should try my regular expressions tutorial.
After working at Yahoo! for six years as an IT Architect, I spent a year in the Technology & Operations TechLearn team. We created courses to teach Yahoo! employees how to use customized or proprietary technologies.
Part of the gig was talking to the best and brightest engineers, and seeing what parts of the infrastructure gave them trouble, and how we could help with training or documentation.
So I’m having this conversation with a very bright service engineer, and he says “Look, I’m sure these video-on-demand courses you guys build are great, but I’m telling you that I’ve never sat through one, and if I had, I probably wouldn’t remember anything from it. If I’m going to learn a technology, I need to do something with it. My hands have a better memory than I do.”
This wasn’t a novel sentiment, but the way he expressed it got in my head. So when I founded Wingtip Labs, it became a succinct way to express our mission:
The first course in the “My Hands Remember” series will be My Hands Remember MySQL. If you want to hear about it when we launch, please sign up for our Product Newsletter using the form on the right.
SimpleDB (SDB) has been great to us, but the PHP API that Amazon provides fetches data in batches. Makes sense, except they expose that annoying implementation detail in a way that makes my calling code ~10x longer than it has to be. This is a story about how I fixed that, with a PHP 5 feature called Iterators.
SimpleDB is a non-relational database accessed through an HTTP API for relatively small pieces of information (no value can be bigger than 1Kb). I wouldn’t call it a NoSQL exactly, it has a SELECT statement with SQL-like keywords—minus JOINs or sub-queries of any kind.
SimpleDB fetches results over HTTP, and breaks large result sets into chunks: typically 100 records or 1Mb, whichever comes first. Then it provides a NextToken identifier that you can use to ask for the next chunk. Your code ends up looking like:
That's 21 lines of boilerplate and one line of Do a thing with $user. There's an outer do loop to make sure we're going through all the chunks (as long as we get a NextToken) and an inner for loop to handle the rows in the current chunk. What a nuisance!
Worse yet, when you’re initially writing the app and test data fits in one chunk, you’re likely to forget the outer loop at least once. That’s just human nature.
What can we do about it? Iterate!
A PHP Iterator is a data structure that the calling code interacts with just one element at a time. Inside the Iterator you can cache or parse or even generate values as they're requested, as long as you expose a way to start or re-start (->rewind()), test that there's an element to fetch (->valid()), fetch an element (->current()), and move to the next element (->next()). Unlike an array, you don't have to hold all the elements in memory at once, because the exposed surface has no way to back up or jump arbitrarily ahead. And best of all, the calling code can still use the familiar foreach control to walk the elements (it uses those methods so you don't have to):
If that solves a problem for you, the simple one-file class is available on GitHub.
Most importantly, I wanted to get rid of the double-loop (outer loop for chunks, inner loop for rows) pattern in my code.
Secondly, I really hate dealing with the SimpleXML response direct from the AWSSDKforPHP. So SDBSelectIterator has a built-in parser that turns these SimpleXML responses into a more PHP idiomatic associative array. The parser is doing some work to figure out whether each attribute should be a number, parsed as a JSON object (into more associative arrays) or just returned as a string. This is helpful, since SimpleDB treats everything as a string.
Thirdly, you can supply your own parser (and I encourage you to do so) for the data you’re receiving. Personally, I do things like populate default values for missing attributes, rename the primary key, anticipate which columns should always be arrays (even if they have zero or one entries for this row), and parse data types into more useful representations (e.g., the app carries any time values as epoch and leaves output formatting to the View, but we store as ‘YYYY-MM-DD HH:MM:SS’ which string-sorts nicely). You could even pass in a real Object factory.
You can check out the code in full on GitHub, but we’ll look at some snippets here.
We're going to keep using the example of selecting user information. Here's the calling code that creates the value a user cares about.
Obviously, first that runs the constructor:
You'll notice that we're completely encapsulating Amazon's SimpleDB API. (In a future version it would be wise to use dependency injection here.) The main work the constructor does is to initialize a bunch of pointers to help keep our place (notably position tells us where we are in this batch, and total_position keeps count across batches) and to initialize the SimpleDB API. We also execute the query so that the calling code can find out immediately if there's an error or zero results, instead of waiting for the first valid call to fail.
I'm not going to reproduce the query method here, but suffice it to say it takes the hit of the 23 lines of code at the top of this post. That method uses private variables to keep track of NextToken, and downloads and parses one entire batch into the private variable result_batch.
Now when we actually start consuming the Iterator in the foreach loop, it executes the methods valid,
then current
then next
The foreach loop doesn't have to know about batches; the next method will fetch the next batch when it runs out of elements in the current batch. From an application perspective, every 100th call to next takes a little longer, but is otherwise indistinguishable.
To catch problems with SimpleDB service or our query, we can check that the returned Iterator is valid, and extract error messages:
I added the next_valid method to be able to "peek" at whether the current element I was processing was going to be the last element. It's not generally necessary, but it helped me when I was copying data between SimpleDB domains with the provided batch_put_attributes, which can only take 25 items at a time. (The batch_friendly_sdb_parse parser is also on GitHub.)
You can also directly call the valid method to detect queries with zero results.
To check out the code, or hit me with the “loving mallet of correction”, stop by GitHub. To see it in action, check out our Regular Expressions tutorial which uses it extensively.
Wingtip Labs makes a Regular Expressions Tutorial that has given hundreds of programmers the chance to learn thousands of regular expressions.
Most people beat the first few levels with no trouble: match literal text, use | for alternatives, use [] for character ranges. But pretty soon you're matching log lines by date, or validating IP addresses, and–because this is a learning tool–people start to make mistakes.
Teaching people to correct those mistakes is hands-down the best part of my job.
It turns out the mistakes people make while learning regular expressions follow a Pareto distribution: 80% of people make the same 20% of mistakes. So if we can anticipate a relatively small number of mistakes, we can help a large number of students.
I built the first generation of mistake-following-clues (we call them Oops Advice) based on watching friends and family play the game. And, because mistakes follow the 80:20 rule, that body of advice was pretty widely useful. For example, ~half of students will put spaces around a pipe the first time they use it, just like my Dad did.
But this week, I sat down and doubled the number of Oops Advice patterns in the app, by harvesting our error logs. Here’s how.
First, I needed to get the mass of data into a form where my pattern-matching unconscious could be helpful; to see student progress, and to see a lot of it all at once.
Each row is a student who has created an account to get 10 free levels. In the progress-at-a-glance diagram, the 10 big boxes for each user are levels (green if they’ve ever beaten it), the smaller boxes inside are one attempt (the levels are generated dynamically so most students will play some levels a few times), and the pixels inside are solutions they tried. A yellow is an abandon (they saw the problem but never tried a solution), red is a failing regex, and green is a winning regex.
So from the glance data, I can see which levels are giving people trouble, and even how much trouble it’s giving them (reds and yellows). Then I can dig in either by student or by level to see if I can diagnose what advice to give.
Let’s look at one specific error. We’re seeing one user spend ~4 minutes solving level 32:
What can we tell from this attempt? The student left out the space between the month Jun and the second numeric pattern!
In isolation, that's an unfortunate typo. In aggregate, 1 in 7 students who try that level make that exact mistake. 14%!
“Questions are places in your mind where answers fit.
If you haven’t asked the question, the answer has nowhere to go.”
— Clayton Christensen paraphrased by Jason Fried
A tutorial has a leg up on other applications. I know what my students’ intent is: they want to write a regex that matches all the green rows, doesn’t match any of the red rows. I can interpret every regex they submit as reaching toward that one goal, and provide the right advice just after a student makes that mistake.
We use a pretty simple JavaScript data structure to encode Oops Clues. Here’s the one for that mistake on that level:
The pattern attribute is a regex that acts on the regex the student submitted. Here I'm looking for three letters not followed by a space. And the advice is extremely contextual, to the problem the student is trying to solve in this level, and to their specific error.
If I had been hunting for these mistakes by grepping /var/log/httpd/error_log, that would have been agony.
If you liked this post, you should definitely see the regular expressions tutorial in action.
Why? I built a regular expressions tutorial, and part of what makes it amazing is that it uses real world examples. This was a chance to find more examples to incorporate.
In this post, I’ll be using regular expressions to dredge up some information about the project:
Today you'll see some practical uses for regular expressions with find and egrep.
Let's ask find to tell us all the file names, then have wc count them.
Here, we're still asking find for a list of files, but instead of counting the file names, we're using xargs to pass the filenames to wc, and asking wc to count the number of lines in each file.
The call to grep addresses a caveat for really big projects: xargs can't hand an infinitely large set of arguments to wc. For large output from find, wc will output several interim subtotals.
All told, the project contains 1,576,197 lines of code. Wait, that's not right – not everything in the project is code. Let's look at file types.
What are the most popular file suffixes in the project?
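The pipeline itself, reconstructed as a sketch (run it from the repository root):

```shell
# Histogram of file suffixes, most common first.
find . -type f | egrep -o "\.[a-z]+$" | sort | uniq -c | sort -nr
```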
Let’s break that command down a little.
- find . -type f is getting the names of all the files throughout the repository.
- egrep -o "\.[a-z]+$" returns only the part of the file name that matches this regular expression for the file suffix. More on this in a moment.
- sort sorts all these file suffixes alphabetically.
- uniq -c counts all the unique suffixes (but needs them to be sorted).
- sort -nr sorts again, but from most occurrences to least (numerically, reversed).
The regular expression \.[a-z]+$ that we passed to egrep breaks down as:
- \. a literal dot. (Without the backslash, . matches any character.)
- [a-z] any lower case letter
- + repeat the match to my left ("any lower case letter") one or more times
- $ the end of the file name. (There can't be any more text after the last letter.)
The output tells us that there are 610 .php files. It also jogs my memory that we used .inc files for PHP library code that couldn't be called by Apache directly. So we need to count another 375 .inc files.
Now let’s get the line count for all files ending .php or .inc.
Here we're using the -regex functionality of find. The -E flag turns on "extended" regular expressions.
Let's break down the regular expression .+\.(php|inc) that we passed to find:
- .+ Any number of any character. The -regex flag takes a pattern that matches the whole file name, so we use this pattern to match "whatever" the file name starts with. The . means any character, the + means "one or more of that thing to my left."
- \. Literally a dot. Without the backslash, this would mean "one more of anything."
- (php|inc) One of php or inc.
143,509 lines is at least the right order of magnitude, but when I ran it without the grep statement, I could see that it was double-counting some files that exist, unmodified, in different ongoing branches. So let's tighten up the count by only counting lines of text in PHP files in project trunks.
This regular expression we're using with find matches:
- .* Zero or more of any character.
- /trunk/ The literal text /trunk/. In other words, the path contains a folder trunk somewhere.
- .* Zero or more of any character. This means /trunk/ can be anywhere in the path.
- \. Literally a dot.
- (php|inc) One of php or inc.
For paths that contain a folder trunk, and filenames that end with .php or .inc, I have 45,682 lines of code.
Personally, I always use PHP’s Perl-compatible regular expression functions, which all begin “preg_”.
In this command, we use xargs to pass all the file names as arguments to grep. This grep command outputs all the lines in all those files that contain "preg_".
Let’s break down that command:
- find -E . -regex '.*/trunk/.*\.(php|inc)$' -type f Get all the files ending .php or .inc.
- xargs egrep -o "preg_[a-z_]+" -h Find the full preg_ function name. (More on the regex below.) The -o option causes egrep to return only the part of the line that matches the expression. The -h option suppresses the file name you found it in.
- sort | uniq -c | sort -nr Count how many times each function appears, and sort descending.
And the regular expression we hand to egrep:
- preg_ - The literal text preg_
- [a-z_] - Any lower case letter, or an underscore
- + - Repeat "a to z, or an underscore" one or more times. This will stop matching when it gets to the ( around the arguments.
Next time we'll talk about those functions in detail and how I used them.
If you’d like to learn more about regular expressions, you should try Wingtip Labs’ regular expressions tutorial.