Wingtip Labs Blog

A PHP Iterator for Amazon SimpleDB

I’m using Amazon SimpleDB to store all student progress information in my Regular Expressions tutorial. All the data I discuss in the article Why I Love My Error Logs (and You Should, Too) is in Simple DB: student records, all the solutions students have tried, and a few summary tables for data I’d just GROUP BY to get in a relational store (e.g., total time on site, fastest time to complete a level).

SimpleDB (SDB) has been great to us, but the PHP API that Amazon provides fetches data in batches. Makes sense, except they expose that annoying implementation detail in a way that makes my calling code ~10x longer than it has to be. This is a story about how I fixed that, with a PHP 5 feature called Iterators.

SimpleDB is a non-relational database accessed through an HTTP API for relatively small pieces of information (no value can be bigger than 1Kb). I wouldn’t call it a NoSQL exactly, it has a SELECT statement with SQL-like keywords—minus JOINs or sub-queries of any kind.

SimpleDB fetches results over HTTP, and breaks large result sets into chunks: typically 100 records or 1Mb, whichever comes first. Then it provides a NextToken identifier that you can use to ask for the next chunk. Your code ends up looking like:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
// Old way
$sdb = new AmazonSDB();
$select_expression = 'SELECT * FROM `expressions_users`';
$next_token = null;
do {
  if ($next_token) {
      $response = $sdb->select($select_expression, array(
          'NextToken' => $next_token,
      ));
  } else {
      $response = $sdb->select($select_expression);
  }

  $rows = $response->body->Item();
  foreach($rows as $row){
      $user = parse_user($row);
      //Do a thing with $user
  }

  $next_token = isset($response->body->SelectResult->NextToken)
      ? (string) $response->body->SelectResult->NextToken
      : null;
} while ($next_token);

That’s 21 lines of boilerplate and one line of Do a thing with $user. There’s an outer do loop to make sure we’re going through all the chunks (as long as we get a NextToken) and an inner for loop to handle the rows in the current chunk. What a nuisance!

Worse yet, when you’re initially writing the app and test data fits in one chunk, you’re likely to forget the outer loop at least once. That’s just human nature.

What can we do about it? Iterate!

Iterators in PHP

A PHP Iterator is a data structure that the calling code interacts with just one element at a time. Inside the Iterator you can cache or parse or even generate values as they’re requested, as long as you expose a way to start or re-start (->rewind()) test that there’s an element to fetch (->valid()) fetch an element (->current()) and move to the next element (->next()). Unlike an array, you don’t have to hold all the elements in memory at once, because the exposed surface has no way to back up or jump arbitrarily ahead. And best of all, the calling code can still use the familiar foreach control to walk the elements (it uses those methods so you don’t have to):

1
2
3
4
5
// New way
$user_iterator = new SDBSelectIterator("SELECT * FROM `expressions_users`", "parse_users");
foreach ($user_iterator as $user){
  //Do a thing with $user
}

If that solves a problem for you, the simple one-file class is available on GitHub.

Key Features of SDBSelectIterator

Most importantly, I wanted to get rid of the double-loop (outer loop for chunks, inner loop for rows) pattern in my code.

1
2
3
4
5
// New way
$user_iterator = new SDBSelectIterator("SELECT * FROM `expressions_users`", "parse_users");
foreach ($user_iterator as $user){
  //Do a thing with $user
}

Secondly, I really hate dealing with the SimpleXML response direct from the AWSSDKforPHP. So SDBSelectIterator has a built-in parser that turns these SimpleXML responses into a more PHP idiomatic associative array. The parser is doing some work to figure out whether each attribute should be a number, parsed as a JSON object (into more associative arrays) or just returned as a string. This is helpful, since SimpleDB treats everything as a string.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
// Old values for a user
CFSimpleXML::__set_state(array(
   'Name' => 'jwadhams',
   'Attribute' => 
  array (
    0 => 
    CFSimpleXML::__set_state(array(
       'Name' => 'first_seen',
       'Value' => '2011-01-05 19:35:43',
    )),
    1 => 
    CFSimpleXML::__set_state(array(
       'Name' => 'level_attempts',
       'Value' => '526',
    )), 
...

//New values for a user
array (
  'primary_key' => 'jwadhams',
  'first_seen' => '2011-01-05 19:35:43',
  'level_attempts' => 526,
 ...

Thirdly, you can supply your own parser (and I encourage you to do so) for the data you’re receiving. Personally, I do things like populate default values for missing attributes, rename the primary key, anticipate which columns should always be arrays (even if they have zero or one entries for this row), and parse data types into more useful representations (e.g., the app carries any time values as epoch and leaves output formatting to the View, but we store as ‘YYYY-MM-DD HH:MM:SS’ which string-sorts nicely). You could even pass in a real Object factory.

1
2
3
4
5
array (
  'user_id' => 'jwadhams',
  'first_seen' => 1294256143,
  'level_attempts' => 526,
...

How it Works

You can check out the code in full on GitHub, but we’ll look at some snippets here.

We’re going to keep using the example of selecting user information. Here’s the calling code that creates value a user cares about.

1
2
3
4
5
$get_users = "SELECT * FROM `" . EXP_DOMAIN_USERS . "` WHERE first_seen is not null ORDER BY first_seen DESC "; //Newest users first
$users = new SDBSelectIterator($get_users, "parse_exp_user_cache");
foreach($users as $user){
  //Display a user row in a table
}

Obviously, first that runs the constructor:

1
2
3
4
5
6
7
8
9
10
11
12
public function __construct($query, $parser = false) {
    if(!class_exists('AmazonSDB')){
      require('AWSSDKforPHP/sdk.class.php');
  }
  $this->position = 0;
  $this->total_position = 0;
  $this->query = $query;
  $this->next_token = null;
  $this->parser = $parser;
  $this->sdb = new AmazonSDB();
  $this->query();
}

You’ll notice that we’re completely encapsulating Amazon’s SimpleDB API. (In a future version it would be wise to use dependency injection here.) The main work the constructor does is to initialize a bunch of pointers to help keep our place (notably position tells us where we are in this batch, and total_position keeps count across batches) and initializing the SimpleDB API. We also execute the query so that the calling code can find out immediately if there’s an error or zero results, instead of waiting for the first valid call to fail.

I’m not going to reproduce the query method here, but suffice it to say it takes the hit of the 23 lines of code at the top of this post. That method uses private variables to keep track of NextToken, and downloads and parses one entire batch into the private variable result_batch.

Now when we actually start consuming the Iterator in the foreach loop, it executes the methods valid

1
2
3
4
5
6
7
function valid() {
  //var_dump(__METHOD__);
  if(!$this->result_batch or !is_array($this->result_batch)){
      return false;
  }
  return isset($this->result_batch[$this->position]);
}

then current

1
2
3
4
function current() {
    //var_dump(__METHOD__);
    return $this->result_batch[$this->position];
}

then next

1
2
3
4
5
6
7
8
function next() {
    //var_dump(__METHOD__);
    $this->position += 1;
    $this->total_position += 1;
  if(!isset($this->result_batch[$this->position]) && $this->next_token){
      $this->query();
  }
}

The foreach loop doesn’t have to know about batches, the next method will fetch the next batch when it runs out of elements in the current batch. From an application perspective, every 100th call to next takes a little longer, but is otherwise indistinguishable.

Helper Methods

To catch problems with SimpleDB service or our query, we can check that the returned Iterator is valid, and extract error messages:

1
2
3
4
5
6
7
$users = new SDBSelectIterator($get_users, "parse_exp_user_cache");

if(!$users->isOK()){
  echo "<div class='error'>" . $users->error_message() . "</div>";
}

foreach ...

I added the next_valid method to be able to “peek” at whether the current element I was processing was going to be the last element. It’s not generally necessary, but it helped me when I was copying data between SimpleDB domains with the provided batch_put_attributes, which can only take 25 items at a time. (The batch_friendly_sdb_parse parser is also on GitHub.)

1
2
3
4
5
6
7
8
9
$old_domain_data = new SDBSelectIterator("SELECT * FROM `old`", "batch_friendly_sdb_parse");
$rows = array();
foreach($old_domain_data as $row){
  $rows = array_merge($rows, $row);
  if(count($rows) >= 25 or !$old_domain_data->next_valid()){
      $overwrite_stats = $sdb->batch_put_attributes($new_domain, $rows);
      $rows = array();
  }
}

You can also directly call the valid method to detect queries with zero results.

1
2
3
4
5
6
7
8
$user_iterator = new SDBSelectIterator("SELECT * FROM `expressions_users`", "parse_users");
if($user_iterator->valid()){
  foreach ($user_iterator as $user){
      //Do a thing with each $user
  }
} else {
  echo "Sorry, no users.";
}

To check out the code, or hit me with the “loving mallet of correction”, stop by GitHub. To see it in action, check out our Regular Expressions tutorial which uses it extensively.