The mysqlslap utility helps you visualize how your database performance will improve with more hardware, new tuning, or different indexes.
The trouble is, it’s designed to use fake data. You can tune the fake data to look increasingly like your real data, and that will help you get a feel for changes like bigger hardware. But fake data teaches you almost nothing for deep changes like altering your schema, adding new indexes, or tweaking memory and cache parameters.
So we’re going to get completely authentic data from your database so you can plan and test performance changes with higher confidence.
This procedure is really ideal if you have a second server about the same size as the one in production. You absolutely do not want to run this on a production server; mysqlslap will try to contain the changes to a throwaway database (named mysqlslap), but it isn't very bright and bad things are incredibly likely to happen. Run this process on a lab server, or a cloud server, or be prepared to do a full data restore when you're finished.
Log into MySQL and turn on the innocently named “general log.” Don’t let the name fool you, this records every connect, disconnect, and query (write and read) occurring on the server. This is a bad idea if your server is anywhere near capacity, because it adds a disk hit to every single action on the server.
Now run some traffic through your system that you want to simulate. You could take a test account through a tour of your product’s important features, collect real traffic for a day, etc. When you’re done, turn the general log off:
Copy the general log file into your home directory and change its permissions. (It was created with strong permissions since it could contain sensitive data.)
Now get these files ready for mysqlslap. The general log file contains more than just queries.
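A representative sample in the general log's format (the thread ID, user, and queries here are illustrative stand-ins, not the real excerpt):

```
110815 14:22:33    57 Connect   webapp@localhost on prod_db
                   57 Query     SELECT * FROM users WHERE id = 42
                   57 Query     SELECT * FROM carts WHERE user_id = 42
                   57 Quit
```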
These four records show a user (really a PHP app) connecting, running two SELECT queries, and disconnecting. mysqlslap only needs the queries, so we'll create a new slap.sql with only the Query rows, and without the log columns before the SQL statements.
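At the shell, that extraction can be a single sed pass (a sketch: the column layout varies a little between MySQL versions, and a query that itself contains the word Query would be over-trimmed, so eyeball the output):

```shell
# Print only lines that contain a Query entry, stripping everything up to
# and including the word "Query" and the whitespace after it.
sed -nE 's/^.*[[:space:]]Query[[:space:]]+//p' general.log > slap.sql
```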
We’re also going to re-execute these queries repeatedly, both in parallel (to load the server) and sequentially (to get multiple runs of the data for statistical significance, and to see “warm cache” performance). Repeating INSERTs in tables with unique primary keys is a problem, so we’ll replace all INSERT INTO with REPLACE INTO, which prevents key collisions (but can cause trouble with constraints or cascading deletes).
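With GNU sed, that swap is a one-line in-place edit (assuming every INSERT starts its line, which is true for statements harvested from the general log):

```shell
# REPLACE INTO acts like INSERT INTO, except a row that would collide on a
# unique key is overwritten instead of raising a duplicate-key error.
sed -i 's/^INSERT INTO/REPLACE INTO/' slap.sql
```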
You might have noticed that the SQL statements in slap.sql don't have the customary ; at the end. That's actually OK: by default, mysqlslap considers each line to be a SQL statement, so slap.sql is fine. But you will need to make each of the CREATE TABLE statements in create.sql into one line, so open it in your favorite text editor and join each multi-line statement into a single row.
If you use vim, it's easy: just search for tables with /CREATE TABLE, then use J to join lines until you have the entire statement in one row.
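If vim isn't your editor, awk can do the joining mechanically (a sketch that assumes each statement ends with a ; at the end of a line, which is how mysqldump writes them):

```shell
# Strip leading indentation, then glue lines together with spaces until we
# see a line ending in ";", which closes the statement.
awk '{gsub(/^[[:space:]]+/, ""); printf "%s%s", $0, (/;$/ ? "\n" : " ")}' \
    create.sql > create-oneline.sql
```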
mysqlslap will create a new database (named mysqlslap) and set it up using create.sql (from our backup). Then it will run the queries in slap.sql (collected from the general log) as many times as we dictate.
Run the whole shebang once:
Run create.sql once, then slap.sql 10 times (single-threaded, one after another). This gives us more test data and helps eliminate outliers.
Run create.sql once, then have 10 concurrent clients each run slap.sql 10 times (100 times total):
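Pulling those flags together, the concurrent run might look like this (a sketch using mysqlslap's standard options; adjust credentials and file paths to your setup):

```shell
# --create runs once to build the schema in the throwaway mysqlslap database;
# --query is the workload file; --concurrency is the number of simulated
# clients; --iterations is how many times each client replays the file.
mysqlslap --user=root --password \
  --create=create.sql \
  --query=slap.sql \
  --concurrency=10 \
  --iterations=10
```

Drop --concurrency (or set it to 1) for the sequential runs.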
These numbers don’t get interesting until you have a change that lets you collect before and after slaps.
I'm considering turning on sync-binlog, which can prevent a master server crash from causing some transactions to never replicate out when the master comes back online, but it's notorious for its performance penalty. So I'll run a slap, then make the proposed change, then rerun the slap.
This tells me that turning on sync-binlog is likely to increase my run times by about 70% (the average run jumped from 3 seconds to more than 5 seconds). I can combine that with other knowledge (from the humble top to sophisticated tools like vmstat and iostat) to see if my existing hardware can take that kind of performance hit, or if I need faster disk or some other preventative planning.
That outcome might be completely different for an application with a higher read/write ratio, or a database that fits entirely in memory, or with a bigger write cache on my RAID card. But, because I fed mysqlslap my own data, I have high confidence that I understand the consequences of the change I'm planning.
Some DBAs go their whole career using just one or two storage engines. In this article, we’ll take a peek at four remarkable storage engines you might have overlooked, all of which ship with MySQL 5.5.
Different applications create different data with different needs. Some applications require consistency and crash safety, others require speed, others require vast storage and can accept slow queries. MySQL’s pluggable storage engines let DBAs pick a storage model that fits the data while the application continues to use the same MySQL client libraries and SQL statements.
Most MySQL developers choose between MyISAM and InnoDB. MyISAM is fast, and has more than a decade of developer and DBA goodwill behind it. InnoDB brings transaction support for full ACID compliance, and includes row-level locks and caching that can make it even faster than MyISAM under concurrent load.
Ultimately both these storage engines make wise compromises for the vast majority of everyday workloads. But when you have an extraordinary problem, you might turn to one of these:
The Memory storage engine is a good fit for super fast access to information that is either ephemeral (like session management) or exists elsewhere (like a temporary table). Memory tables have no disk-based permanence; if MySQL crashes or the server reboots, that data is gone. Note that you can get better performance than Memory tables, along with higher scale and permanence, if you make the jump to MySQL Cluster—but that’s a little more involved than switching storage engines.
The Archive engine is ideal for tables that you append lots of data to, but hardly ever read, like logs. It only supports INSERT and SELECT; it does not support UPDATE, REPLACE, or DELETE, or even ORDER BY. INSERTed rows are compressed before being appended to the table on disk. The engine provides no row cache and no indexes; SELECT statements always perform a complete table scan, uncompressing the table as they go. As a result, the data takes up very little space, but will be very slow to retrieve.
The CSV engine stores data in a comma-separated value file on disk. A CSV file provides an easy compatibility point to share this data with other systems (like Excel). The benefit of using the CSV engine (instead of import/export options like load data infile and select into outfile) is that the underlying CSV file is kept continuously up to date, while being manipulated with standard SQL statements.
Blackhole is a storage engine that doesn’t actually store anything. SELECT, UPDATE, and DELETE always return 0 rows. Any INSERT succeeds, but the data is thrown away. It’s commonly used in building a Replication Relay: the relay copies events from the master’s binlog to its own binlog, then downstream slaves can read from the relay’s binlog. But no one cares if the relay applied the master’s changes to itself, so the relay can use the Blackhole engine to dramatically reduce I/O. You can get hands-on experience building a relay with the blackhole engine in our online MySQL Replication course.
Thanks for reading! If you’d like to get a little better at MySQL every week, you should sign up for our MySQL Tip of the Week mailing list. Not only will it make you smarter, we periodically send subscribers discounts for our courses!
Raw HTML is a low-level language, and it's starting to bum me out. I'm working on a project that has me writing a large number of relatively simply marked up pages. (We'll see the structure below.) In this post, I'm going to implement a Domain-Specific Markup Language, using PHP. It'll use nouns relevant to my subject as I write it, and output good old HTML when it's time to render to a user.
(It looks like I’m not the first person to use the name DSML, but I promise to keep this article a little less academic by adding code samples and not using the word “indeed.”)
When I fire up my editor, my programming language knows nothing about my problem. One way to construct a program is to build up your language, to make it easier to express your problem in your problem’s language. When Ivan Jovanovic says programming languages are simply not powerful enough, he means it’s your job to make them more powerful, by shaping the primitives to fit your problem domain.
Likewise, HTML is full of wonderful primitives to mark up just about any human written knowledge. But I have a concrete problem that is a tiny specialized subset of human written knowledge, and so do you. Building a DSML is a way to build up HTML into the domain I’m working in. And, because HTML is the lingua franca of display, we’ll convert to it when it’s time to display.
I’m writing a course on MySQL administration. The largest volume of actual lines-typed-into-editors in the project is the content of the lessons. Every lesson is an HTML page, with one ordered list of steps, and every step has one or more tests that we perform to make sure the student did the work correctly.
Here’s how we used to write the lessons:
This presents a few problems:
- When I type </li></ol>, what did I just close? Where am I?
- Is <pre class='cli'> superior to <cli>? What do I get in return for 10 more characters?
Here's what I want to write:
This is starting to look like XML, without accidentally becoming XHTML. The joy of my DSML is that I’m writing a language that knows what I mean, not that I want a new layer of quoting rules.
Let's start with the easy work: converting the <steps> list and the <step> elements back into <ol> and <li>s. I'm using the QueryPath library. It's a very similar API to jQuery, and because I do the transform server-side, I can provide well-formed pages to clients without JavaScript (like search engine spiders).
Of course, nobody’s perfect, so let’s add some rules to catch operator error:
While I'm writing lessons, my warn() function adds bold red error messages to the top of the parsed document. In production, warn() will quietly log them.
Those were simple replacements; you could do that with a regular expression and some duct tape. Let's make this <cli> tag fix my problems with HTML's <pre>: I want to put the closing </pre> tag on its own line, without showing the student an empty line at the bottom of the code block. In other words, I want it to work like this:
But let me edit it like this:
Here’s the code that does it:
In lessons, when we introduce new terms, the student can hover over them to get a Bootstrap Popover that loads the definition from our glossary, dynamically. Here’s how that used to look in our code:
Here’s how I want it to look:
And here’s how we do it:
Now we get warnings about terms we haven't written glossary entries for (and we don't call attention to them, to avoid embarrassment in front of students). Tagging glossary entries is easier (so we'll do it more). And we're free to make dramatic changes to the way we present glossary terms without touching a zillion lesson files. For example, we could switch to the <dfn> tag instead of <abbr>.
HTML is pretty great, but I wouldn't want to write in it.
Got comments? Head over to Reddit
I was setting up replication on a new MySQL server, which starts with turning on binary logging by editing /etc/my.cnf. Of course, I was logged in as a low-privilege user, and /etc/my.cnf is owned by root, and I don't have write privilege to it.
Typically, I'd run sudo vi /etc/my.cnf. That works, but it wasn't a good long-term fit here. I'm writing a hands-on MySQL course and I want to give students all the access they need to administer the MySQL database, but not access to, say, turn the lab server into a BitTorrent seed at my expense.
As an administrator, I need to control which files my users can edit with elevated privileges.
In the old sudo vi /etc/my.cnf world, I would need an entry in /etc/sudoers like:
There are several problems for administrators here. The most serious is that you can use vi to launch other commands (with ! in command mode):
There's a fix for that: I can change my /etc/sudoers entry to:
Now I have a new problem: some people don’t love vi. I don’t want to be in the business of telling you which editor you can run, I want to be in the business of telling you which files you can modify.
And heaven forbid I end up with a (# of editors) x (# of files) matrix I have to keep current in sudoers. Blerg.
Instead, I can authorize students to edit specific files using whatever editor they want (more on that below) with this entry in /etc/sudoers:
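The entry leans on sudo's built-in sudoedit pseudo-command; it looks something like this (student is a stand-in for your user or group):

```
student ALL = (root) sudoedit /etc/my.cnf
```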
Most importantly to me as a user, I get to use whatever editor I want. There's a system-wide default, but I can override it for myself.
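sudoedit picks the editor from the SUDO_EDITOR environment variable first, then VISUAL, then EDITOR, so the override is a single export:

```shell
export SUDO_EDITOR=vim   # or emacs, nano, whatever you like
```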
I can run that every time I log in, but I'd rather append it to my ~/.bashrc. The other bonus is that my editor is running as me. That means all the effort I put into my kickin' ~/.vimrc, my favorite syntax highlighters, my favorite plugins, all follow me even when I escalate privilege. You don't get that with sudo vi; you get root's crappy preferences.
sudoedit actually doesn't let you edit the file directly. Instead, it creates a copy in /tmp that only you have access to.
You can see more about the special copy with :! ls -l % in vi (the % expands to the file currently being edited).
You can see (at the bottom) that there's a new file in /tmp whose name is based on my.cnf but with some extra characters in the middle to prevent collisions. It's owned by the low-privilege user, and only that user can read/write it.
When you exit, sudoedit overwrites the original. (Protip: sudoedit does not update the real file every time you write changes to the temp file. It waits until you exit your editor.)
Why not just sudo $FAVORITE_EDITOR? sudoedit lets the admin tighten sudoers with a "least privilege" model, while still letting the user choose which editor to use. And sudoedit preserves all your editor customizations; sudo $EDITOR doesn't.
If you're not using a least privilege model for your users, or if you don't customize your editor, sudoedit is probably not right for you. But if you're like me, this is gonna make your day.
Today, I’m going to share with you some of the worst regular expressions I’ve used in real projects, and the lessons I wish I could share with the n00b me that wrote this crap.
What makes my regex tutorial great is that it uses real world examples. So I decided to build some new levels using material pulled from a large (~45,000 lines-of-code) multi-year PHP project I recently worked on.
This is the second article looking at that project. The first article, Using Regular Expressions to Nose Around a Large PHP Project, used find and grep to get a sense of the project's size and how it used regex (matching, replacing, splitting, etc.).
This article explores some of the worst regular expressions I contributed to that project, with the wisdom only hindsight can provide.
This takes a user-provided string $search_term, and tests whether it ends with soft.
The Good — This code tries to accept any reasonable input, since I have no idea whether users will provide Soft or SOFT or soft or SoFt.
The Bad — I don't need strtolower($search_term) to manage capitalization. I should be using the case-insensitive regex flag, i.
This code is equivalent and simpler:
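The same lesson holds outside PHP. At the shell, grep's -i flag does the case folding (the input string here is just an illustration):

```shell
# -E enables extended regexes, -i makes the match case-insensitive.
echo 'MicroSOFT' | grep -Ei 'soft$'   # prints: MicroSOFT
```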
This one’s the same sin, only more forehead-slappy:
That code turns <br> tags in the input into plaintext-friendly carriage-return line-feeds. The project uses this when we've accepted input from rich text editors (that will usually be output in web pages) that now needs to be displayed at a CLI or mailed out in plaintext.
The Good — Like the previous example, it’s very tolerant.
- [bB][rR] accepts the tag in any case (<br> and <BR> and even <bR>)
- \/? accepts the normal or self-closing form (<br> and <br/>)
- The optional space (matched by " ?") allows a space before the self-closing form (<br/> and <br />)
The Bad — Thank goodness it's only a two-letter tag, because I'd hate to find I'd ever written [bB][lL][iI][nN][kK] (for more than one reason).
Here’s a less-goofy, functionally-equivalent rewrite:
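For comparison, GNU sed can express the same case-insensitive replacement with its I modifier (GNU-specific; I'm substituting a bare newline for the \r\n here):

```shell
# Match <br>, <br/>, and <br />, in any capitalization, and replace each
# with a newline.
printf 'line one<BR />line two' | sed -E 's@<br ?/?>@\n@gI'
```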
This isn’t a terrible practice, but let’s call it a regex smell: never roll your own parser for any file format you can’t explain in one breath.
This turns the $input string, which is one row of a comma-separated-value (CSV) file, into an array, $columns.
The Good — By the time I wrote this, I'd already learned (probably the hard way) that sometimes CSVs have whitespace you don't expect. The pattern \s* will accept zero or more spaces on either side of the comma.
The Bad — Writing your own CSV parser is fun and easy while the CSV is small and written by a programmer. But lord help you if it comes out of Excel or is written by anyone but you. Here’s a better idea, use the CSV parser the language provides:
This has a few advantages:
- The language's CSV parser understands quoting and embedded commas, which my regex never will.
- The per-column cleanup stays explicit. In this example it only needs to trim whitespace. But in the real app, some columns are user IDs that need strtolower, some are numbers that need intval or floatval, and some are keywords that can only be one of a few values.
Here's that same bad idea, except now it's a security risk.
This code takes input from a user and throws out any HTML tags. It assumes tags start with a <, have some content that isn't > (matched by [^>]+), and end with >. The most obvious flaw is that a > can legally appear inside a quoted attribute value, so a perfectly valid tag gets chopped at the first >, and the regex leaves the tail of the tag behind as stray text.
But the broader problem is that parsing HTML is difficult, and the stakes here are very high. This code needs to make sure users can’t inject potentially dangerous HTML into site content, so trusting myself to a hokey regular expression was a terrible idea.
I should have been using PHP's built-in strip_tags(), which doesn't have these problems.
When in doubt, look for a language feature or a well-written library to parse any complicated data format, especially when that data is coming from untrusted (or just non-technical) sources.
This example was probably written the day I found PHP's list construct.
The Bad — This code splits up detecting the type of row you’re parsing from actually parsing it. It’s like the poster child for don’t-repeat-yourself, since the two regexes are deceptively similar but subtly, brokenly, different.
The regex at line 2:
- ^ matches from the beginning of the row
- [\w]+ matches one-or-more alphanumeric characters
- = matches a literal equals sign
- [\w]+ matches one-or-more alphanumeric characters
- $ matches to the end of the row
The square brackets are totally unnecessary; this is equivalent and cleaner:
Example Matches: max_connections=100 (\w includes underscore).
Example Misses: max_connections = 100 (\w doesn't match spaces).
The regex at line 3 then splits the string around an equal sign with zero or more spaces (\s*) on the left or right. The problem is, the regex on line 2 would fail if there were any spaces anywhere on the row.
This uses just one smarter regex:
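Bash can sketch the same one-regex approach with [[ =~ ]] and BASH_REMATCH; POSIX character classes stand in for \w and \s, which bash's ERE engine doesn't understand:

```shell
line='max_connections = 100'
if [[ $line =~ ^([[:alnum:]_]+)[[:space:]]*=[[:space:]]*([[:alnum:]_]+)$ ]]; then
    # BASH_REMATCH[1] and BASH_REMATCH[2] hold the two captured groups.
    echo "key=${BASH_REMATCH[1]} value=${BASH_REMATCH[2]}"
fi
```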
The new regular expression:
- ^ matches from the beginning of the row
- (\w+) captures one-or-more alphanumeric characters into $matches[1]
- \s*=\s* matches (and doesn't capture) a = with zero or more spaces on either side
- (\w+) captures one-or-more alphanumeric characters into $matches[2]
- $ matches to the end of the row
Hope you've enjoyed this cringe-inducing trip down memory lane. If you'd like to learn more about writing good regular expressions, you should try my regular expressions tutorial.
After working at Yahoo! for six years as an IT Architect, I spent a year in the Technology & Operations TechLearn team. We created courses to teach Yahoo! employees how to use customized or proprietary technologies.
Part of the gig was talking to the best and brightest engineers, and seeing what parts of the infrastructure gave them trouble, and how we could help with training or documentation.
So I’m having this conversation with a very bright service engineer, and he says “Look, I’m sure these video-on-demand courses you guys build are great, but I’m telling you that I’ve never sat through one, and if I had, I probably wouldn’t remember anything from it. If I’m going to learn a technology, I need to do something with it. My hands have a better memory than I do.”
This wasn’t a novel sentiment, but the way he expressed it got in my head. So when I founded Wingtip Labs, it became a succinct way to express our mission:
The first course in the “My Hands Remember” series will be My Hands Remember MySQL. If you want to hear about it when we launch, please sign up for our Product Newsletter using the form on the right.
SimpleDB (SDB) has been great to us, but the PHP API that Amazon provides fetches data in batches. Makes sense, except they expose that annoying implementation detail in a way that makes my calling code ~10x longer than it has to be. This is a story about how I fixed that, with a PHP 5 feature called Iterators.
SimpleDB is a non-relational database accessed through an HTTP API for relatively small pieces of information (no value can be bigger than 1Kb). I wouldn’t call it a NoSQL exactly, it has a SELECT statement with SQL-like keywords—minus JOINs or sub-queries of any kind.
SimpleDB fetches results over HTTP, and breaks large result sets into chunks: typically 100 records or 1Mb, whichever comes first. Then it provides a NextToken identifier that you can use to ask for the next chunk. Your code ends up looking like:
That's 21 lines of boilerplate and one line of Do a thing with $user. There's an outer do loop to make sure we're going through all the chunks (as long as we get a NextToken) and an inner for loop to handle the rows in the current chunk. What a nuisance!
Worse yet, when you’re initially writing the app and test data fits in one chunk, you’re likely to forget the outer loop at least once. That’s just human nature.
What can we do about it? Iterate!
A PHP Iterator is a data structure that the calling code interacts with just one element at a time. Inside the Iterator you can cache or parse or even generate values as they're requested, as long as you expose a way to start or re-start (->rewind()), test that there's an element to fetch (->valid()), fetch an element (->current()), and move to the next element (->next()). Unlike an array, you don't have to hold all the elements in memory at once, because the exposed surface has no way to back up or jump arbitrarily ahead. And best of all, the calling code can still use the familiar foreach control to walk the elements (it uses those methods so you don't have to):
If that solves a problem for you, the simple one-file class is available on GitHub.
Most importantly, I wanted to get rid of the double-loop (outer loop for chunks, inner loop for rows) pattern in my code.
Secondly, I really hate dealing with the SimpleXML response direct from the AWSSDKforPHP. So SDBSelectIterator has a built-in parser that turns these SimpleXML responses into a more PHP idiomatic associative array. The parser is doing some work to figure out whether each attribute should be a number, parsed as a JSON object (into more associative arrays) or just returned as a string. This is helpful, since SimpleDB treats everything as a string.
Thirdly, you can supply your own parser (and I encourage you to do so) for the data you’re receiving. Personally, I do things like populate default values for missing attributes, rename the primary key, anticipate which columns should always be arrays (even if they have zero or one entries for this row), and parse data types into more useful representations (e.g., the app carries any time values as epoch and leaves output formatting to the View, but we store as ‘YYYY-MM-DD HH:MM:SS’ which string-sorts nicely). You could even pass in a real Object factory.
You can check out the code in full on GitHub, but we’ll look at some snippets here.
We're going to keep using the example of selecting user information. Here's the calling code that creates the value a user cares about.
Obviously, first that runs the constructor:
You'll notice that we're completely encapsulating Amazon's SimpleDB API. (In a future version it would be wise to use dependency injection here.) The main work the constructor does is to initialize a bunch of pointers to help keep our place (notably position tells us where we are in this batch, and total_position keeps count across batches) and to initialize the SimpleDB API. We also execute the query so that the calling code can find out immediately if there's an error or zero results, instead of waiting for the first valid call to fail.
I'm not going to reproduce the query method here, but suffice it to say it takes the hit of the 23 lines of code at the top of this post. That method uses private variables to keep track of NextToken, and downloads and parses one entire batch into the private variable result_batch.
Now when we actually start consuming the Iterator in the foreach loop, it executes the methods valid,
then current
then next
The foreach loop doesn't have to know about batches; the next method will fetch the next batch when it runs out of elements in the current batch. From an application perspective, every 100th call to next takes a little longer, but is otherwise indistinguishable.
To catch problems with SimpleDB service or our query, we can check that the returned Iterator is valid, and extract error messages:
I added the next_valid method to be able to "peek" at whether the current element I was processing was going to be the last element. It's not generally necessary, but it helped me when I was copying data between SimpleDB domains with the provided batch_put_attributes, which can only take 25 items at a time. (The batch_friendly_sdb_parse parser is also on GitHub.)
You can also directly call the valid method to detect queries with zero results.
To check out the code, or hit me with the “loving mallet of correction”, stop by GitHub. To see it in action, check out our Regular Expressions tutorial which uses it extensively.
Wingtip Labs makes a Regular Expressions Tutorial that has given hundreds of programmers the chance to learn thousands of regular expressions.
Most people beat the first few levels with no trouble: match literal text, use | for alternatives, use [] for character ranges. But pretty soon you're matching log lines by date, or validating IP addresses, and–because this is a learning tool–people start to make mistakes.
Teaching people to correct those mistakes is hands-down the best part of my job.
It turns out the mistakes people make while learning regular expressions follow a Pareto distribution: 80% of people make the same 20% of mistakes. So if we can anticipate a relatively small number of mistakes, we can help a large number of students.
I built the first generation of mistake-following-clues (we call them Oops Advice) based on watching friends and family play the game. And, because mistakes follow the 80:20 rule, that body of advice was pretty widely useful. For example, ~half of students will put spaces around a pipe the first time they use it, just like my Dad did.
But this week, I sat down and doubled the number of Oops Advice patterns in the app, by harvesting our error logs. Here’s how.
First, I needed to get the mass of data into a form where my pattern-matching unconscious could be helpful; to see student progress, and to see a lot of it all at once.
Each row is a student who has created an account to get 10 free levels. In the progress-at-a-glance diagram, the 10 big boxes for each user are levels (green if they’ve ever beaten it), the smaller boxes inside are one attempt (the levels are generated dynamically so most students will play some levels a few times), and the pixels inside are solutions they tried. A yellow is an abandon (they saw the problem but never tried a solution), red is a failing regex, and green is a winning regex.
So from the glance data, I can see which levels are giving people trouble, and even how much trouble it’s giving them (reds and yellows). Then I can dig in either by student or by level to see if I can diagnose what advice to give.
Let’s look at one specific error. We’re seeing one user spend ~4 minutes solving level 32:
What can we tell from this attempt? The student left out the space between the month Jun and the second numeric pattern!
In isolation, that's an unfortunate typo. In aggregate, 1 in 7 students who try that level make that exact mistake. 14%!
“Questions are places in your mind where answers fit.
If you haven’t asked the question, the answer has nowhere to go.”
— Clayton Christensen paraphrased by Jason Fried
A tutorial has a leg up on other applications. I know what my students’ intent is: they want to write a regex that matches all the green rows, doesn’t match any of the red rows. I can interpret every regex they submit as reaching toward that one goal, and provide the right advice just after a student makes that mistake.
We use a pretty simple JavaScript data structure to encode Oops Clues. Here’s the one for that mistake on that level:
The pattern attribute is a regex that acts on the regex the student submitted. Here I'm looking for three letters not followed by a space. And the advice is extremely contextual, to the problem the student is trying to solve in this level, and to their specific error.
If I had been hunting for these mistakes by grepping /var/log/httpd/error_log, that would have been agony.
If you liked this post, you should definitely see the regular expressions tutorial in action.
Why? I built a regular expressions tutorial, and part of what makes it amazing is that it uses real world examples. This was a chance to find more examples to incorporate.
In this post, I’ll be using regular expressions to dredge up some information about the project:
Today you'll see some practical uses for regular expressions with find and egrep.
Let's ask find to tell us all the file names, then have wc count them.
Here, we're still asking find for a list of files, but instead of counting the file names, we're using xargs to pass the filenames to wc, and asking wc to count the number of lines in each file.
The call to grep addresses a caveat for really big projects: xargs can't hand an infinitely large set of arguments to wc. For large output from find, wc will output several interim subtotals.
All told, the project contains 1,576,197 lines of code. Wait, that's not right – not everything in the project is code. Let's look at file types.
What are the most popular file suffixes in the project?
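The pipeline itself, reconstructed as a sketch (run it from the repository root):

```shell
# Histogram of file suffixes, most common first.
find . -type f | egrep -o "\.[a-z]+$" | sort | uniq -c | sort -nr
```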
Let’s break that command down a little.
- find . -type f is getting the names of all the files throughout the repository.
- egrep -o "\.[a-z]+$" returns only the part of the file name that matches this regular expression for the file suffix. More on this in a moment.
- sort sorts all these file suffixes alphabetically.
- uniq -c counts all the unique suffixes (but needs them to be sorted).
- sort -nr sorts again, but from most occurrences to least (numerically, reversed).
The regular expression \.[a-z]+$ that we passed to egrep breaks down as:
- \. a literal dot. (Without the backslash, . matches any character.)
- [a-z] any lower case letter
- + repeat the match to my left ("any lower case letter") one or more times
- $ the end of the file name. (There can't be any more text after the last letter.)
The output tells us that there are 610 .php files. It also jogs my memory that we used .inc files for PHP library code that couldn't be called by Apache directly. So we need to count another 375 .inc files.
Now let’s get the line count for all files ending .php or .inc.
Here we're using the -regex functionality of find. The -E flag turns on "extended" regular expressions.
Let's break down the regular expression .+\.(php|inc) that we passed to find:
- .+ Any number of any character. The -regex flag takes a pattern that matches the whole file name, so we use this pattern to match "whatever" the file name starts with. The . means any character, the + means "one or more of that thing to my left."
- \. Literally a dot. Without the backslash, this would mean "one more of anything."
- (php|inc) One of php or inc.
143,509 lines is at least the right order of magnitude, but when I ran it without the grep statement, I could see that it was double-counting some files that exist, unmodified, in different ongoing branches. So let's tighten up the count by only counting lines of text in PHP files in project trunks.
This regular expression we're using with find matches:
- .* Zero or more of any character.
- /trunk/ The literal text /trunk/. In other words, the path contains a folder trunk somewhere.
- .* Zero or more of any character. This means /trunk/ can be anywhere in the path.
- \. Literally a dot.
- (php|inc) One of php or inc.
For paths that contain a folder trunk, and filenames that end with .php or .inc, I have 45,682 lines of code.
Personally, I always use PHP’s Perl-compatible regular expression functions, which all begin “preg_”.
In this command, we use xargs to pass all the file names as arguments to grep. This grep command outputs all the lines in all those files that contain "preg_".
Let’s break down that command:
- find -E . -regex '.*/trunk/.*\.(php|inc)$' -type f Get all the files ending .php or .inc.
- xargs egrep -o "preg_[a-z_]+" -h Find the full preg_ function name. (More on the regex below.) The -o option causes egrep to return only the part of the line that matches the expression. The -h option suppresses the file name you found it in.
- sort | uniq -c | sort -nr Count how many times each function appears, and sort descending.
And the regular expression we hand to egrep:
- preg_ - The literal text preg_
- [a-z_] - Any lower case letter, or an underscore
- + - Repeat "a to z, or an underscore" one or more times. This will stop matching when it gets to the ( around the arguments.
Next time we'll talk about those functions in detail and how I used them.
If you’d like to learn more about regular expressions, you should try Wingtip Labs’ regular expressions tutorial.