Domain-Specific Markup Language

HTML is the worst markup language, except all those others that have been tried.

Raw HTML is a low-level language, and it’s starting to bum me out. I’m working on a project that has me writing a large number of relatively simply marked up pages. (We’ll see the structure below.) In this post, I’m going to implement a Domain-Specific Markup Language, using PHP. It’ll use nouns relevant to my subject as I write it, and output good old HTML when it’s time to render to a user.

(It looks like I’m not the first person to use the name DSML, but I promise to keep this article a little less academic by adding code samples and not using the word “indeed.”)

What does low-level mean?

When I fire up my editor, my programming language knows nothing about my problem. One way to construct a program is to build up your language, to make it easier to express your problem in your problem’s language. When Ivan Jovanovic says programming languages are simply not powerful enough, he means it’s your job to make them more powerful, by shaping the primitives to fit your problem domain.

Likewise, HTML is full of wonderful primitives to mark up just about any human written knowledge. But I have a concrete problem that is a tiny specialized subset of human written knowledge, and so do you. Building a DSML is a way to build up HTML into the domain I’m working in. And, because HTML is the lingua franca of display, we’ll convert to it when it’s time to display.

Getting Concrete

I’m writing a course on MySQL administration. The largest volume of actual lines-typed-into-editors in the project is the content of the lessons. Every lesson is an HTML page, with one ordered list of steps, and every step has one or more tests that we perform to make sure the student did the work correctly.

Here’s how we used to write the lessons:

<ol id='steps'>
  <li>
      Put your left foot in
      <ol class='step-outcomes'>
          <li>Left foot is in front of you</li>
          <li>Balanced on right foot</li>
      </ol>
  </li>
  <li>
      Take your left foot out
      <pre class='cli'>
student@server $ out /dev/left-foot</pre>
      <ol class='step-outcomes'>
          <li>Left foot is behind you</li>
          <li>Balanced on right foot</li>
      </ol>
  </li>
</ol>

This presents a few problems:

Local neighborhoods can be difficult to navigate. What does this string tell you: </li></ol> ? What did I just close? Where am I?
In what way is <pre class='cli'> superior to <cli>? What do I get in return for 12 more characters?
PRE blocks make attractive indentation futile.

How does a DSML help?

Here’s what I want to write:

<steps>
  <step>
      Put your left foot in
      <tests>
          <test>Left foot is in front of you</test>
          <test>Balanced on right foot</test>
      </tests>
  </step>
  <step>
      Take your left foot out
      <cli>
          student@server $ out /dev/left-foot
      </cli>
      <tests>
          <test>Left foot is behind you</test>
          <test>Balanced on right foot</test>
      </tests>
  </step>
</steps>

This is starting to look like XML, without accidentally becoming XHTML. The joy of my DSML is that I’m writing a language that knows what I mean, not that I want a new layer of quoting rules.

Rewriting Custom Tags into HTML

Let’s start with the easy work: converting the <steps> list and the <step> elements back into <ol> and <li>s. I’m using the QueryPath library. Its API is very similar to jQuery’s, and because I do the transform server-side, I can provide well-formed pages to clients without JavaScript (like search engine spiders).

<?php
require 'QueryPath/QueryPath.php';

$qp = htmlqp($dsml_file); //The DSML in the previous code block

foreach($qp->find(":root steps > step") as $step){
  $content = $step->innerHTML();
  $step->replaceWith("<li>" . $content . "</li>");
}

foreach($qp->find(":root steps:first") as $elm){
  $content = $elm->innerHTML();
  $elm->replaceWith("<ol id='steps'>" . $content . "</ol>");
}

$qp->writeHTML();
?>
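
The listing above leaves <tests> and <test> untouched, and they need the same treatment to become <ol class='step-outcomes'> and <li>. Here is a minimal sketch of that step. It uses PHP's built-in DOM extension instead of QueryPath so it runs on its own, and the names rename_tags() and tests_to_html() are hypothetical helpers, not part of QueryPath or the project.

```php
<?php
// Hypothetical helper: replace every <$from> element with a <$to> element,
// keeping its children and adding the given attributes.
function rename_tags(DOMDocument $doc, $from, $to, array $attrs = array()){
  // getElementsByTagName() returns a live list, so copy it before editing.
  $nodes = iterator_to_array($doc->getElementsByTagName($from));
  foreach($nodes as $old){
    $new = $doc->createElement($to);
    foreach($attrs as $name => $value){
      $new->setAttribute($name, $value);
    }
    while($old->firstChild){
      $new->appendChild($old->firstChild); // moves the child, does not copy it
    }
    $old->parentNode->replaceChild($new, $old);
  }
}

// Hypothetical helper: convert the <tests>/<test> tags in a DSML fragment.
function tests_to_html($dsml){
  libxml_use_internal_errors(true); // libxml reports unknown tags as errors
  $doc = new DOMDocument();
  $doc->loadHTML("<div>" . $dsml . "</div>"); // wrapper gives us one root to read back
  libxml_clear_errors();

  rename_tags($doc, 'test', 'li');
  rename_tags($doc, 'tests', 'ol', array('class' => 'step-outcomes'));

  // Serialize only the children of the wrapper div.
  $wrapper = $doc->getElementsByTagName('div')->item(0);
  $html = '';
  foreach($wrapper->childNodes as $child){
    $html .= $doc->saveHTML($child);
  }
  return $html;
}

echo tests_to_html("<tests><test>Balanced on right foot</test></tests>");
```

In the QueryPath version, the equivalent would be two more foreach loops in the style of the ones above, one for "test" and one for "tests".<test>B</test></tests>");
assert(strpos($out, '<tests>') === false);
assert(strpos($out, '<test>') === false);
assert(substr_count($out, '<li>') === 2);
assert(strpos($out, 'step-outcomes') !== false);
</test>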

Adding Application Logic and Error Checking

Of course, nobody’s perfect, so let’s add some rules to catch operator error:

<?php
if($qp->find(":root steps")->size() > 0){ //We already translated steps:first into an ol; find() always returns an object, so count the matches
  warn("You have more than one <steps> collection.");
}

if($qp->find(":root step")->size() > 0){ //We already translated any step that is a direct child of steps
  warn("You have <step> elements outside of the <steps> container.");
}
?>

While I’m writing lessons, my warn() function adds bold red error messages to the top of the parsed document. In production, warn() will quietly log them.
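
The post doesn't show warn() itself, so here is one way it might look. Treat it as a sketch under stated assumptions: the DSML_DEBUG constant, the dsml_warnings global, and render_warnings() are all hypothetical names chosen for illustration.

```php
<?php
// Assumption: a DSML_DEBUG constant separates authoring from production.
if(!defined('DSML_DEBUG')){ define('DSML_DEBUG', true); }

$GLOBALS['dsml_warnings'] = array();

// Collect the message while authoring; log it quietly in production.
function warn($message){
  if(DSML_DEBUG){
    $GLOBALS['dsml_warnings'][] = $message;
  }else{
    error_log("DSML: " . $message);
  }
}

// Build the bold red block that goes at the top of the parsed document.
function render_warnings(){
  $html = '';
  foreach($GLOBALS['dsml_warnings'] as $message){
    // Escape the message so a tag name like <steps> is displayed, not parsed.
    $html .= "<p style='color:red;font-weight:bold'>"
           . htmlspecialchars($message) . "</p>\n";
  }
  return $html;
}

warn("You have more than one <steps> collection.");
echo render_warnings();
```

Collecting messages first and rendering them once means the warnings can be prepended after all the tag rewrites have run.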

Tags that are Smarter than HTML

Those were simple replacements; you could do them with a regular expression and some duct tape. Let’s make this <cli> tag fix my problems with HTML’s <pre>:

I want to be able to indent the content for easier editing.
I want the </pre> tag on its own line, without showing the student an empty line at the bottom of the code block.

In other words, I want it to work like this:

<li>
  <pre class='cli'>
student@server $ out /dev/left-foot</pre>
</li>

But let me edit it like this:

<li>
  <cli>
      student@server $ out /dev/left-foot
  </cli>
</li>

Here’s the code that does it:

<?php
foreach($qp->find(":root cli") as $elm){
  $content = $elm->innerHTML();

  //Accept Windows or Unixy EOL
  $content_array = preg_split('/(\r\n|\r|\n)/', $content);

  //Get rid of whitespace on left and right.
  $content_array = array_map("trim", $content_array);

  //Get rid of trailing empty lines
  while($content_array && end($content_array) === ""){ array_pop($content_array); } //Stop when the array is empty, or an empty <cli> loops forever

  //Reassemble with uniform EOL
  $content = implode("\r\n", $content_array);
  $elm->replaceWith("<pre class='cli'>" . $content . "</pre>");
}
?>
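
One side effect of trimming every line is that indentation inside the block is lost, which matters if a command wraps onto a continuation line. A possible alternative, sketched here as a standalone function that is not part of the original code, removes only the indentation the lines have in common. The name dedent_cli() is hypothetical, and the sketch assumes the block is indented with one kind of whitespace (all spaces or all tabs).

```php
<?php
// Hypothetical alternative to trimming each line: remove only the leading
// whitespace shared by every non-blank line, so relative indentation survives.
function dedent_cli($content){
  $lines = preg_split('/\r\n|\r|\n/', $content);

  // Drop whitespace at the end of each line; leading whitespace is handled below.
  $lines = array_map('rtrim', $lines);

  // Drop blank lines at the start and end of the block.
  while($lines && $lines[0] === ''){ array_shift($lines); }
  while($lines && end($lines) === ''){ array_pop($lines); }

  // Find the smallest indent among the non-blank lines.
  $indent = null;
  foreach($lines as $line){
    if($line === ''){ continue; }
    $width = strlen($line) - strlen(ltrim($line));
    if($indent === null || $width < $indent){ $indent = $width; }
  }

  // Cut that many characters from the front of each non-blank line.
  foreach($lines as $i => $line){
    if($line !== ''){ $lines[$i] = substr($line, $indent); }
  }
  return implode("\n", $lines);
}

$sample = "\n    mysql> SELECT 1\n        -> FROM dual;\n";
echo dedent_cli($sample);
```

With the sample above, the first line loses its four spaces of editor indentation and the continuation line keeps the four extra spaces that set it apart.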

DSML my Users Care About

In lessons, when we introduce new terms, the student can hover over them to get a Bootstrap Popover that loads the definition from our glossary, dynamically. Here’s how that used to look in our code:

Now edit my.cnf:
<pre>
$ <abbr href='/glossary/sudoedit'>sudoedit</abbr> /etc/my.cnf</pre>

Here’s how I want it to look:

Now edit my.cnf:
<cli>
  $ <explain>sudoedit</explain> /etc/my.cnf
</cli>

And here’s how we do it:

<?php
function term_to_url($title){
  $title = strtolower($title);
  $title = preg_replace('/[ \t\r\n]+/', '-', $title);
  $title = preg_replace('/[^a-z0-9\-_]/', '', $title);
  return $title;
}

foreach($qp->find(":root explain") as $elm){
  $content = $elm->innerHTML();

  $url = '/glossary/' . term_to_url($elm->text());
  if(file_exists('..' . $url)){
      $elm->replaceWith("<abbr href='$url'>" . $content . "</abbr>");
  }else{
      warn("No glossary entry for $url");
      $elm->replaceWith($content);
  }
}
?>

Now we get warnings about terms we haven’t written glossary entries for (and we don’t call attention to them, to avoid embarrassment in front of students). Tagging glossary entries is easier (so we’ll do it more). And we’re free to make dramatic changes to the way we present glossary terms without touching a zillion lesson files. For example:

We could slipstream all the definitions into data attributes instead of fetching via AJAX.
We could paste all the definitions as numbered footnotes on the page.
We could switch the HTML we emit to the browser to use the <dfn> tag instead of <abbr>.
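
Because every lesson goes through the same transform, a presentation change like the last one happens in one place. As a rough sketch (render_term() and its $tag parameter are hypothetical, not from the project), the element used for a glossary term could be chosen by a single argument:

```php
<?php
// Hypothetical helper: build the HTML for one glossary term.
// $tag picks the element; lesson files never need to change.
function render_term($content, $url, $tag = 'abbr'){
  // Only allow the two tags discussed in the post; fall back to <abbr>.
  if(!in_array($tag, array('abbr', 'dfn'), true)){ $tag = 'abbr'; }

  // Escape the URL because it is placed inside a quoted attribute.
  $href = htmlspecialchars($url, ENT_QUOTES);
  return "<$tag href='$href'>" . $content . "</$tag>";
}

echo render_term('sudoedit', '/glossary/sudoedit', 'dfn');
```

The <explain> loop would then call $elm->replaceWith(render_term($content, $url)) and the choice of tag would live in one default value.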

Now go forth, and HTML no more.

HTML is pretty great, but I wouldn’t want to write in it.

A DSML can get you closer to your problem domain, not just in code, but in presentation.
A DSML can free you to write content without bogging you down in implementation details.
A DSML can even make it easier to develop and update features spread across content.
