Eclipse Yoxos Services Downloads Blogs About
Home > Blogs >

Posts Tagged ‘xml’

on Aug 11th, 2009Beyond XML: The Future of Extensible Metaformats

Yesterday I discussed some of the issues with XML. Today I’ll be taking a look at three of the potential alternatives that may improve on the current situation.

YAML

YAML Ain’t Markup Language. To quote the spec, it is a “human-friendly, cross language, Unicode based data serialization language”. While mainly designed with so-called agile languages in mind, it can also be used with more traditional languages. The format is more data-centric than document-driven. Even so, one of its primary design goals is good readability. It was heavily influenced by RFC 2822 (Internet Message Format). That means it looks a lot like mail headers. It features built-in support for lists, hashes (i.e. dictionaries or associative array), and common data types. It also allows elements to have multiple parents, which also allows cross-references. There are some more minor features that make working with the format easier. Here’s a small example of a YAML document lifted from the spec:

invoice: 34843
date   : 2001-01-23
bill-to: &id001
    given  : Chris
    family : Dumars
    address:
        lines: |
            458 Walkman Dr.
            Suite #292
        city    : Royal Oak
        state   : MI
        postal  : 48046
ship-to: *id001
product:
    - sku         : BL394D
      quantity    : 4
      description : Basketball
      price       : 450.00
    - sku         : BL4438H
      quantity    : 1
      description : Super Hoop
      price       : 2392.00
tax  : 251.42
total: 4443.52
comments:
    Late afternoon is best.
    Backup contact is Nancy
    Billsmer @ 338-4338.

While YAML seems like a solid solution with implementations written for many languages, there are some issues, like the surprising lack of momentum in the software development community and the controversial use of significant whitespace.

JSON

The JavaScript Object Notation is basically a subset of JavaScript used to statically describe data. With the release of YAML 1.2 it is also a subset of YAML, which means every JSON document is a YAML file. Its focus is primarily simplicity and readability. While it is trivial to parse JSON in a JavaScript environment, its simplicity makes it also quite easy parse in other languages. Parsers for most popular modern development platforms exist. Being so easily accessible in browsers has earned it quite some support and momentum in modern web development, already often completely replacing XML in the AJAX stack. Here’s a small snippet lifted from the JSON Wikipedia page:

{
     "firstName": "John",
     "lastName": "Smith",
     "address": {
         "streetAddress": "21 2nd Street",
         "city": "New York",
         "state": "NY",
         "postalCode": 10021
     },
     "phoneNumbers": {
         "home": "212 555-1234",
         "fax": "646 555-4567"
     }
}

JSON is currently quite popular, though it may in certain cases be hampered by its simpleness. Maybe YAML forward compatibility can provide an convenient upgrade path, should a more sophisiticated format be necessary.

Groovy

Most of you may know Groovy as a JVM scripting language. I have also already blogged about using Groovy to replace especially painful parts of XML.

Groovy features a MarkupBuilder that let’s you create what is basically an XML DOM tree in memory using a slightly more fluent syntax. Have a look:

  car(name:'HSV Maloo', make:'Holden', year:2006) {
    country('Australia')
    record(type:'speed', 'Production Pickup Truck with speed of 271kph')
  }

Some might consider this just syntactic sugar but it really goes a long way.

But I think Groovy can also be used as a first class configuration language. There are already several projects out there that use groovy scripts for tasks that have traditionally been in the firm grip of XML. One such example is gant, which is basically just a thin wrapper to write ant files. Of course nowadays, everyone is using maven instead, but there’s also a neat tool for that: Gradle is build system configured using a Groovy DSL, while employing Apache ivy and maven under the hood. Take a look at this example from the official gradle documentation:

usePlugin 'java'
 
sourceCompatibility = 1.5
version = '1.0'
manifest.mainAttributes(
    'Implementation-Title': 'Gradle Quickstart',
    'Implementation-Version': version
)
 
repositories {
    mavenCentral()
}
 
dependencies {
    compile group: 'commons-collections', name: 'commons-collections', version: '3.2'
    testCompile group: 'junit', name: 'junit', version: '4.+'
}
 
test {
    options.systemProperties['property'] = 'value'
}
 
uploadArchives {
    repositories {
       flatDir(dirs: file('repos'))
    }
}

A lot easier on the eyes, while still providing interoperability with “legacy” systems like maven.

The power and expressiveness of groovy make it really easy to create domain-specific languages like these. Such a special purpose language might not always be desirable, especially when interoperability is a key concern. But for certain applications these DSLs might be a better solution than any general purpose format.

On the whole, these are interesting times we live in and I’m curious to see what the future holds in store. As I said before, XML is probably gonna be here for quite a while, but there are some compelling alternatives out there. What are your thoughts on the legacy of XML?

on Aug 10th, 2009XML: Still No Silver Bullet

The XML format has done a lot in the last decade to reduce some of the pains of legacy formats and to encourage application interoperability. Having a standardized syntax and object model makes the development process a lot easier. But I still feel that there are some severe shortcomings when it comes to the general format itself and and the concrete implementation of XML dialects that I want to discuss in this blog post.

XML is a markup language

As the name Extensible Markup Language implies, the language is first and foremost a markup language. That means that the language annotates a body of text. So if you were to strip out all the markup from an XML document, it should still end up with legible and comprehensible. This is certainly true for (X)HTML and DocBook. It might be slightly harder to read and understand, but all the crucial information is right there in plain text. The markup just adds semantic or presentational meta-information, e.g. which bits of text are headers or quotes.

In many cases XML is misused as a general data format. This often means that there is no actual text (character data) whatsoever in the resulting files. Then why use a markup-language in the first place? The Eclipse plugin.xml format for example is a markup only format (save some extension points schemas).

Attribute and Element Dichotomy

Another gripe with XML is when to use attributes vs. elements. There are at least three options that at first glance seem equally plausible. For example:

Name as PCDATA

<customer>
ACME Inc.
</customer>

Name as extra element, PCDATA

<customer>
<name>ACME Inc.</name>
</customer>

Name as attribute

<customer name="ACME Inc." />

I have seen these three styles mixed within the same document for almost exactly the same fields. Should a general purpose data interchange format really be that hard to get right, or at least consistent?

Awkward mapping of common constructs

Because xml implicitly is a tree structure, all structures involving cyclic references or multiple cross-references in general are not easily mapped to an XML compatible form. Object-oriented models can often contain back-references, or reference a single object from different places in the object graph. Although there is some support for such constructs in the form of ID/IDREF attributes, this is neither supported by all parsers nor even publicly widespread information.

Another common data type that is painful to describe in XML is the associative map. An example from the Eclipse plugin.xml:

<property name="aboutImage" value="eclipse_lg.gif" />

Compare that to a simple properties file format:

aboutImage=eclipse_lg.gif

The third type of commonly found data that is hard to put into XML is relational data. That kind of data is traditionally found in RDBMS, CSV files and unfortunately spreadsheets. While it is fairly trivial to create a corresponding xml format, the result is often needlessly verbose and repetitive, since all elements usually have the same attributes.

Readability

By far the biggest gripe I have with xml on a day to day basis is that it is really hard to parse for me as a human. Between the angle brackets bunched up against the element names, the endless repetition and escaped entities it seems like this format was not really designed with a legibility in mind. Another reason could be related to my first point: In regular markup languages the tags only contribute a relatively small percentage of the overall content. The majority is plain text, so the tags are few and far between. In markup-only languages however, there’s a much higher density of markup elements. In my opinion this redundant repetition lowers the signal to noise ratio significantly, making these documents much harder to read.

These are the main problems that I currently see with XML. I concede that the common, extensible meta format was a huge step forward, and that for some problem domains it is a pretty good fit. I also realize that XML is gonna be here for quite a while, but I think it’s time to stop resting on these laurels and see how we can address these problems in the future.

Tomorrow I’ll be looking at the some of the alternatives that might be potential successors to XML.

© EclipseSource 2008 - 2011