Beyond XML: The Future of Extensible Metaformats

Yesterday I discussed some of the issues with XML. Today I’ll be taking a look at three of the potential alternatives that may improve on the current situation.


YAML Ain’t Markup Language. To quote the spec, it is a “human-friendly, cross language, Unicode based data serialization language”. While mainly designed with so-called agile languages in mind, it can also be used with more traditional languages. The format is more data-centric than document-driven. Even so, one of its primary design goals is good readability. It was heavily influenced by RFC 2822 (Internet Message Format), so it looks a lot like mail headers. It features built-in support for lists, hashes (i.e. dictionaries or associative arrays), and common data types. It also lets an element have multiple parents through anchors (&) and aliases (*), which makes cross-references possible. A few more minor features make working with the format easier. Here’s a small example of a YAML document lifted from the spec:

invoice: 34843
date   : 2001-01-23
bill-to: &id001
    given  : Chris
    family : Dumars
    address:
        lines: |
            458 Walkman Dr.
            Suite #292
        city    : Royal Oak
        state   : MI
        postal  : 48046
ship-to: *id001
product:
    - sku         : BL394D
      quantity    : 4
      description : Basketball
      price       : 450.00
    - sku         : BL4438H
      quantity    : 1
      description : Super Hoop
      price       : 2392.00
tax  : 251.42
total: 4443.52
comments:
    Late afternoon is best.
    Backup contact is Nancy
    Billsmer @ 338-4338.

While YAML seems like a solid solution with implementations written for many languages, there are some issues, like the surprising lack of momentum in the software development community and the controversial use of significant whitespace.
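To make the anchor and alias idea a bit more concrete, here is a minimal Python sketch of what a loader might hand back for the bill-to/ship-to part of the invoice. The dictionaries are written by hand and are not the output of any particular YAML library; the assumption is that mappings become dicts and that an alias resolves to the very same object as its anchor, which is how many loaders behave.

```python
# Hand-written stand-in for a loaded YAML document (assumption: mappings
# become dicts and an alias resolves to the same object as its anchor).
bill_to = {"given": "Chris", "family": "Dumars"}

invoice = {
    "invoice": 34843,
    "bill-to": bill_to,  # corresponds to the node anchored with &id001
    "ship-to": bill_to,  # corresponds to the alias *id001, the same node
}

# One node with two parents: a change made through one key
# is visible through the other.
invoice["ship-to"]["given"] = "Christopher"
print(invoice["bill-to"]["given"])
```

So the alias is a real cross-reference to a single node rather than a copy of its contents.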


The JavaScript Object Notation (JSON) is basically a subset of JavaScript used to statically describe data. With the release of YAML 1.2 it is also a subset of YAML, which means every JSON document is also a valid YAML document. Its focus is primarily simplicity and readability. While it is trivial to parse JSON in a JavaScript environment, its simplicity also makes it quite easy to parse in other languages. Parsers exist for most popular modern development platforms. Being so easily accessible in browsers has earned it considerable support and momentum in modern web development, where it often already replaces XML completely in the AJAX stack. Here’s a small snippet lifted from the JSON Wikipedia page:

{
     "firstName": "John",
     "lastName": "Smith",
     "address": {
         "streetAddress": "21 2nd Street",
         "city": "New York",
         "state": "NY",
         "postalCode": 10021
     },
     "phoneNumbers": {
         "home": "212 555-1234",
         "fax": "646 555-4567"
     }
}
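To give an idea of how little work parsing takes outside the browser, here is a short sketch using the json module from Python's standard library. The variable names (text, person, round_trip) are my own; the document is the Wikipedia snippet with its enclosing braces.

```python
import json

# The snippet from above as a complete JSON document.
text = """
{
    "firstName": "John",
    "lastName": "Smith",
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": 10021
    },
    "phoneNumbers": {
        "home": "212 555-1234",
        "fax": "646 555-4567"
    }
}
"""

# Objects become dicts, strings become str, numbers become int or float.
person = json.loads(text)
print(person["address"]["city"])
print(person["phoneNumbers"]["fax"])

# Serializing back to text is just as short.
round_trip = json.dumps(person, indent=4)
```

There is no schema, no handler classes and no configuration involved, which is a good part of the appeal.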

JSON is currently quite popular, though in certain cases it may be hampered by its simplicity. YAML’s compatibility with JSON may provide a convenient upgrade path, should a more sophisticated format become necessary.
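One case where that simplicity shows is cross-references: JSON has nothing comparable to YAML's anchors and aliases. The sketch below, again using Python's standard json module with made-up names (contact, order), shows a shared object being silently turned into two independent copies by a round trip.

```python
import json

# One contact object referenced from two places.
contact = {"given": "Chris", "family": "Dumars"}
order = {"bill-to": contact, "ship-to": contact}

same_before = order["bill-to"] is order["ship-to"]

# JSON can only write the contact out twice, so the link is lost on reload.
restored = json.loads(json.dumps(order))
same_after = restored["bill-to"] is restored["ship-to"]
equal_after = restored["bill-to"] == restored["ship-to"]

print(same_before, same_after, equal_after)
```

The values survive, but the fact that both entries were one and the same object does not. If that matters for your data, you have to invent your own ID convention on top of JSON or move to a format that supports references.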


Most of you may know Groovy as a JVM scripting language. I have also already blogged about using Groovy to replace especially painful parts of XML.

Groovy features a MarkupBuilder that lets you create what is basically an XML DOM tree in memory using a slightly more fluent syntax. Have a look:

  car(name:'HSV Maloo', make:'Holden', year:2006) {
    record(type:'speed', 'Production Pickup Truck with speed of 271kph')
  }

Some might consider this just syntactic sugar but it really goes a long way.
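For comparison, here is roughly the same tree built with a conventional element API, in this case Python's xml.etree.ElementTree. This is only meant to show the kind of ceremony the builder syntax saves; it is not how the Groovy MarkupBuilder works internally.

```python
import xml.etree.ElementTree as ET

# The same <car> element with a nested <record>, built step by step.
car = ET.Element("car", {"name": "HSV Maloo", "make": "Holden", "year": "2006"})
record = ET.SubElement(car, "record", {"type": "speed"})
record.text = "Production Pickup Truck with speed of 271kph"

# Serialize the in-memory tree to an XML string.
xml_text = ET.tostring(car, encoding="unicode")
print(xml_text)
```

In the builder version the nesting of the code mirrors the nesting of the document, whereas here you have to wire up parents and children by hand.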

But I think Groovy can also be used as a first-class configuration language. There are already several projects out there that use Groovy scripts for tasks that have traditionally been in the firm grip of XML. One such example is Gant, which is basically just a thin wrapper for writing Ant files. Of course nowadays, everyone is using Maven instead, but there’s a neat tool for that too: Gradle is a build system configured using a Groovy DSL, while employing Apache Ivy and Maven under the hood. Take a look at this example from the official Gradle documentation:

usePlugin 'java'

sourceCompatibility = 1.5
version = '1.0'
manifest.mainAttributes(
    'Implementation-Title': 'Gradle Quickstart',
    'Implementation-Version': version
)

repositories {
    mavenCentral()
}

dependencies {
    compile group: 'commons-collections', name: 'commons-collections', version: '3.2'
    testCompile group: 'junit', name: 'junit', version: '4.+'
}

test {
    options.systemProperties['property'] = 'value'
}

uploadArchives {
    repositories {
       flatDir(dirs: file('repos'))
    }
}

A lot easier on the eyes, while still providing interoperability with “legacy” systems like Maven.

The power and expressiveness of Groovy make it really easy to create domain-specific languages like these. Such a special-purpose language might not always be desirable, especially when interoperability is a key concern. But for certain applications these DSLs might be a better solution than any general-purpose format.

On the whole, these are interesting times we live in and I’m curious to see what the future holds in store. As I said before, XML is probably gonna be here for quite a while, but there are some compelling alternatives out there. What are your thoughts on the legacy of XML?

  • Posted at 6:17 pm, August 11, 2009

    These are interesting ways to serialize tree-like data structures. However, XML was never just about serialization to text; it is also about the DOM, SAX, parsing, transformation, extensibility (through namespaces), validation, etc.

    If all you need is a quick way to serialize and unserialize simple, isolated tree structures, JSON or YAML work pretty well and XML is probably going to be overkill. If on the other hand you are looking to process hundreds of MB of tree like data, you might want to rely on XML and efficient ways to process that instead of essentially reinventing the wheel on top of some really primitive parsing library. There’s a lot of XML middleware, frameworks, and support that simply doesn’t exist for JSON, YAML, etc. XML may be tedious to edit but then you probably shouldn’t be doing that manually to begin with.

  • Posted at 6:26 pm, August 11, 2009

    It depends on what your use of these various markups and representations is for. Again, they all can be used wrongly and badly. JSON is particularly weak when representing complex document structures. YAML runs into the same issues people have with XML, even though it is a bit less wordy, and groovy is primarily a scripting language.

    The reason XML is overused is that it is great as a B2B data format, as well as being a good document language. Take a look at DocBook for a good sample of document-centric markup. If you are communicating semantic meaning, then JSON starts to fall apart beyond three levels; even two are difficult to comprehend at times.

    So, the alternatives all have their downsides. The lesson is basically to use the correct tool for the correct job. I personally would rather use an XML B2B format than, say, EDI, which is still widely used.

  • Konstantin Komissarchik
    Posted at 1:08 am, August 12, 2009

    I completely agree with the premise of this blog post. XML is pretty good as document markup language, but it is terrible for representing structured data in human-readable form. Sure you could put an editing front-end to hide the ugliness (heck, that’s what I’ve been doing for the better part of the last two years), but at that point why are we even using plain text? Various binary XML encodings are ridiculously more efficient on disk, on the wire and in the amount of processing power it takes to read and write them.

    XML became the standard for representing structured data because it was at the right place at the right time, not because it is the best solution for the job.

    I played around some with JSON recently and I am pretty encouraged about where that’s going. Encoded data is more readable. Just needs to shed some of its JavaScript roots. For instance, dropping quotes around item names would improve readability even further. YAML is an interesting effort, but significant whitespace is a non-starter for me.

  • Posted at 2:58 pm, August 12, 2009

    It often seems to me that if you put on appropriately colored/filtered glasses that all these things look almost exactly the same. I suppose that’s because the same information can be rendered in an infinite number of ways, each with its advantages and disadvantages. For humans what matters most is that the deep structure semantics are easily recognizable from the surface syntax. For machines what matters most is that the deep structure semantics themselves are well represented because in the end, that’s the actual information that needs to be manipulated. As long as there’s a well defined mapping between the important deep structure and the irrelevant surface syntax, life is good for the machine. This is why I think things like Xtext and Oslo/M are gaining ground. The obsession with surface syntax, which is irrelevant to manipulating the data programmatically, has obscured the more fundamental issue of representing deep semantic structure well.

  • Posted at 2:59 pm, August 12, 2009

    The solutions you propose all have in common that while you are able to change the abstract syntax (i.e. the concepts) you stick to one generic notation. That can be a positive thing, if you rarely read or edit such files. But if you do that often it’s very beneficial to have a tailored notation as well. That’s when external DSLs come into play. Frameworks like Xtext allow you to define both the abstract syntax as well as the notation very easily. In addition you get language-specific validation and IDE support. Also the EMF programming model is far nicer compared to DOM-trees and the like.

  • Posted at 2:07 am, November 23, 2009

    you might also want to look at vtd-xml, the latest and most advanced XML processing API available today