XML: Still No Silver Bullet

August 10, 2009

The XML format has done a lot in the last decade to reduce some of the pains of legacy formats and to encourage application interoperability. Having a standardized syntax and object model makes the development process a lot easier. But I still feel that there are some severe shortcomings, both in the general format itself and in the concrete implementations of XML dialects, that I want to discuss in this blog post.

XML is a markup language

As the name Extensible Markup Language implies, the language is first and foremost a markup language. That means that the language annotates a body of text. So if you were to strip out all the markup from an XML document, you should still end up with legible and comprehensible text. This is certainly true for (X)HTML and DocBook. It might be slightly harder to read and understand, but all the crucial information is right there in plain text. The markup just adds semantic or presentational meta-information, e.g. which bits of text are headers or quotes.

In many cases XML is misused as a general data format. This often means that there is no actual text (character data) whatsoever in the resulting files. Then why use a markup language in the first place? The Eclipse plugin.xml format, for example, is a markup-only format (save for some extension point schemas).
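To illustrate the difference, here is a minimal Python sketch that strips all markup from two documents using the standard library's ElementTree. Both documents are made up for illustration: the first is an XHTML-like fragment, the second a hypothetical data-only document whose element and attribute names are my own invention.

```python
import xml.etree.ElementTree as ET

# A tiny XHTML-like fragment: the markup annotates real text.
MARKUP_DOC = "<p>XML is a <em>markup</em> language.</p>"

# A hypothetical data-only document: all information lives in attributes.
DATA_DOC = '<extension point="example.point"><item id="a" value="1"/></extension>'


def strip_markup(xml_text: str) -> str:
    """Return only the character data of a document, with all tags removed."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext()).strip()


print(repr(strip_markup(MARKUP_DOC)))  # 'XML is a markup language.'
print(repr(strip_markup(DATA_DOC)))    # ''
```

The first document survives the stripping as a readable sentence. The second one leaves nothing behind, because there was never any text to annotate.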

Attribute and element dichotomy

Another gripe with XML is when to use attributes vs. elements. There are at least three options that at first glance seem equally plausible. For example:

Name as PCDATA


ACME Inc.

Name as extra element, PCDATA


ACME Inc.

Name as attribute

I have seen these three styles mixed within the same document for almost exactly the same fields. Should a general-purpose data interchange format really be that hard to get right, or at least consistent?
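To make the cost of this inconsistency concrete, here is a small Python sketch of what a consumer has to do when all three styles may show up. The element name company and the child element name are assumptions for illustration; only the value ACME Inc. is taken from the examples above.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; only "ACME Inc." comes from the post.
AS_PCDATA = "<company>ACME Inc.</company>"
AS_CHILD = "<company><name>ACME Inc.</name></company>"
AS_ATTRIBUTE = '<company name="ACME Inc."/>'


def company_name(xml_text: str) -> str:
    """Extract the company name, whichever of the three styles was used."""
    root = ET.fromstring(xml_text)
    if "name" in root.attrib:             # name as attribute
        return root.attrib["name"]
    child = root.find("name")
    if child is not None:                 # name as extra element, PCDATA
        return (child.text or "").strip()
    return (root.text or "").strip()      # name as PCDATA


for doc in (AS_PCDATA, AS_CHILD, AS_ATTRIBUTE):
    print(company_name(doc))
```

Three code paths for one field is not a lot, but it multiplies quickly once every field in a format can be encoded in any of these ways.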

Awkward mapping of common constructs

Because XML is implicitly a tree structure, structures involving cyclic references, or multiple cross-references in general, are not easily mapped to an XML-compatible form. Object-oriented models often contain back-references, or reference a single object from different places in the object graph. Although there is some support for such constructs in the form of ID/IDREF attributes, it is neither supported by all parsers nor widely known.
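In practice, shared references often end up being resolved by hand. The following Python sketch shows that pattern under some assumptions: the document, its element names and the dept attribute are hypothetical, and the id attributes are treated as identifiers purely by convention, without a DTD declaring them as ID/IDREF.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two employees share one department, referenced by id.
DOC = """
<org>
  <department id="d1" name="Engineering"/>
  <employee name="Alice" dept="d1"/>
  <employee name="Bob" dept="d1"/>
</org>
"""

root = ET.fromstring(DOC)

# Build the id -> element index by hand.
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}

# Resolve each reference explicitly to get from an employee to its department.
departments = {
    emp.get("name"): by_id[emp.get("dept")].get("name")
    for emp in root.iter("employee")
}
print(departments)  # {'Alice': 'Engineering', 'Bob': 'Engineering'}
```

The tree only stores the string d1 three times. That two employees point at the same object is knowledge that lives in the consuming code, not in the document structure.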

Another common data type that is painful to describe in XML is the associative map. An example from the Eclipse plugin.xml:

Compare that to a simple properties file format:

aboutImage=eclipse_lg.gif
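As a rough sketch of the overhead, the following Python snippet converts a map encoded as XML into the properties form. The XML shape (property elements with name and value attributes inside a product element) is an assumption on my part, and the appName entry is made up; only the aboutImage entry comes from the example above.

```python
import xml.etree.ElementTree as ET

# Assumed XML encoding of a map; the second entry is hypothetical.
XML_MAP = """
<product>
  <property name="aboutImage" value="eclipse_lg.gif"/>
  <property name="appName" value="Example"/>
</product>
"""


def xml_to_dict(xml_text: str) -> dict:
    """Collect name/value attribute pairs into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {p.get("name"): p.get("value") for p in root.iter("property")}


def to_properties(mapping: dict) -> str:
    """Render a dictionary in key=value properties form, one entry per line."""
    return "\n".join(f"{key}={value}" for key, value in mapping.items())


settings = xml_to_dict(XML_MAP)
print(to_properties(settings))
```

Every entry in the XML form repeats the element name and both attribute names, while the properties form needs just one character of syntax per entry.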

The third type of commonly found data that is hard to put into XML is relational data. That kind of data is traditionally found in RDBMSs, CSV files and, unfortunately, spreadsheets. While it is fairly trivial to create a corresponding XML format, the result is often needlessly verbose and repetitive, since all elements usually have the same attributes.
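The repetition is easy to demonstrate. This Python sketch serializes the same rows once as CSV and once as XML with one element per row and one attribute per column. The sample rows and the element names rows and row are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample table.
ROWS = [
    {"id": "1", "name": "Alice", "city": "Berlin"},
    {"id": "2", "name": "Bob", "city": "Hamburg"},
    {"id": "3", "name": "Carol", "city": "Munich"},
]


def to_csv(rows: list) -> str:
    """Column names appear once, in the header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_xml(rows: list) -> str:
    """Column names are repeated as attribute names on every row element."""
    root = ET.Element("rows")
    for row in rows:
        ET.SubElement(root, "row", row)
    return ET.tostring(root, encoding="unicode")


csv_text = to_csv(ROWS)
xml_text = to_xml(ROWS)
print(len(csv_text), len(xml_text))
```

In the CSV output each column name occurs exactly once, while the XML output repeats every column name on every row, so the gap grows with the number of rows.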

Readability

By far the biggest gripe I have with XML on a day-to-day basis is that it is really hard for me, as a human, to parse. Between the angle brackets bunched up against the element names, the endless repetition and the escaped entities, it seems like this format was not really designed with legibility in mind. Another reason could be related to my first point: in regular markup languages the tags contribute only a relatively small percentage of the overall content. The majority is plain text, so the tags are few and far between. In markup-only languages, however, there’s a much higher density of markup elements. In my opinion this repetition lowers the signal-to-noise ratio significantly, making these documents much harder to read.

These are the main problems that I currently see with XML. I concede that the common, extensible meta format was a huge step forward, and that for some problem domains it is a pretty good fit. I also realize that XML is gonna be here for quite a while, but I think it’s time to stop resting on these laurels and see how we can address these problems in the future.

Tomorrow I’ll be looking at some of the alternatives that might be potential successors to XML.