Persistent Trees in git, Clojure and CouchDB
This is a tale of three images. I found these images while investigating the internals of several different applications. There are some really neat software projects emerging at the moment, and as a developer I always find it interesting to take a look at the implementation details, because there is often a lot to be learned. It’s not always something you might need right now, but maybe a few years down the line you may be confronted with a similar problem. Plus - in my opinion - knowing a bit about the internals of a program helps reasoning about its behaviour.
Exhibit A: git repositories
Let’s start with the first image. This one is taken from Scott Chacon’s talk “Getting Git”. (Actually I had to mix slides 138 and 142 together to better fit the blog format.) I’ll try not to go to much into the details of git here, listen to the talk instead of you want to know more. The thing I do want to talk about though is git’s tree structures. These tree structures describe the project contents (both directory and file) for one specific commit. These trees are completely immutable (this is ensured by SHA-1 hashes). A new commit creates what amounts to a completely new tree. But usually the changes in each commit are only a fraction of the whole tree. Since a lot of the sub-trees of the original tree have not changed (and are immutable), these can be safely recycled. Only files and folders that have changed need to be added. Note that this means parent nodes of changed nodes have to be created as well, recursively up to the tree root. But even though the tree itself is new (as evidenced by the new root) the additional space requirements are quite low, on the order of approximately O(log n). This makes for extremely compact repository format, and an astonishingly simple one. Another thing I want to point out that the only element that needs to be mutated is the reference to the root of the tree, everything else is immutable.
Exhibit B: Clojure’s persistent data structures
Next up is a picture lifted from Rich Hickey’s “Clojure Concurrency” talk. If you are interested in figuring out ways to deal with the impending multi-core age, I highly recommend you take a look at this talk. The concepts and techniques are not exclusive to Clojure, so even if you are not into Lisp’s you might want to watch it.
The way clojure deals with concurrency (in a nutshell) is that every core data structure (list, trees, hash maps, etc) is immutable. This has one big perceived drawback: Immutable data structures incur a huge overhead because of all the copying involved in creating new instances. But this is only a problem if the immutability is implemented naively. Analysis shows that usually only a part of a data structure is actually modified at any given time. This again allows recycling large parts of the the “old” data structure and only creating part of it new, and just pointing to the bits of the old structure that have not changed. As these are immutable as well, it is completely safe to do so. In the picture the two green boxes are the root nodes of two data structures sharing large parts of their trees.
Since a program that only deals with static, immutable data is pretty boring, there are mechanisms for introducing changes in a program. These mechanisms are called references, and these do mutate. But the runtime ensures that these changes happen in a controlled and well behaved manner, e.g. in the form of Software Transactional Memory (STM) or agents. But everything else is essentially immutable. Note that this has the neat side effect that reading process are inherently thread-safe since nothing can suddenly change while you are looking at it, i.e. no more ConcurrentModificationExceptions.
Exhibit C: CouchDB’s append-only file format
The third picture is from a blog post by Ricky Ho titled “NoSQL patterns”. CouchDB has made quite a few waves recently, so I wanted to see what makes it tick under the hood. In keeping with the Erlang philosophy of robustness and failure tolerance, couchdb uses an append-only file format. This means that all data written is not modified any further and there is thus little chance for corruption. CouchDB uses a B+ Tree to index its entries, so again we have a tree structure. And because the tree entries live inside an append only file, these trees are immutable as well. When making a change to the database, the new data is written along with all changes to nodes in the B+ Tree. Since the B+ Tree is very flat, that means even for databases with millions of entries the number of updated tree nodes is fairly small. The most recent root of the index tree can always be found at the very end of the file, and thus the database file is read from the end backwards. If any kind of failure occurs while writing changes, the software can see that the last entry is corrupt and can seek further back until it finds a “good” entry and proceed from there. Another interesting benefit is that reading processes don’t block writing processes and vice versa. A reading process can just find the latest root and then work from there. Because all references go backward in the file, and everything is immutable there is no contention. Meanwhile a writing process can happily append at the end of the file without disrupting any readers.
There are probably more examples of this pattern to be found. The oldest reference I could find is “The Design and Implementation of a Log-Structured File System” written by Mendel Rosenblum and John K. Ousterhout in 1991. The most recent file system effort seems to be NILFS. The flash-based SSDs that are currently entering the market are also rumored to use something along these lines internally. I also suspect that a few of the traditional DBMS might use similar data structures under the hood
Although all the applications do very different things and come from completely different backgrounds, they share a common data structure underneath. Unfortunately I have not been able to find a well published moniker for this pattern. If anyone does, please let me know.
Update: As FuncFan pointed out in the comments the moniker I was looking for was “Purely functional trees”. Thanks for the quick reply!
Immutable, recycling trees
All three systems have these immutable trees in some way way or form that allow sharing structure. In git these are tree objects, in Clojure they are called persistent data structures, and in CouchDB they are part of the internal index tree.
To allow for changes in an otherwise immutable world, all systems allow for mutable reference constructs. All change is encapsulated in these references, which are called exactly that in both git and Clojure. In CouchDB the story is a little different. Here the “reference” to the latest tree is simply the end of the database file
More common ground
Apart from these “primary” characteristics there are other shared features among these three systems, that are a direct result of the underlying data structure.
Because data may be shared between data structures, it is often not safe to delete all children when a root node is no longer needed. Instead a garbage collection mechanism is needed to free unused structures. Git does have a special command that does exactly that, Clojure uses the Garbage Collection in the JVM implicitly, and CouchDB offers a compact operation, which removes old versions and tree indices.
Because no structure is changed, it is trivially easy to keep old versions around, and since a lot of data is shared it is usually fairly cheap in terms of memory to do so. Versioning is the primary application of git, so no big surprises on that end. In Clojure it is also fairly easy to add versioning for things like an “undo” buffer: Just keep a list of the old objects around. CouchDB also offers some lightweight versioning out-of-box, but it is mostly used for replication. But it should be fairly simple to add more sophisticated versioning features.
The immutability properties of all three system make reasoning about concurrent changes and processes a lot simpler. Reading is trivial because all that is read is set in stone, so to speak. Writing is made easier by the fact that there is only a single point - the reference - that is modified to effect a specific change. This critical point can then be guarded with traditional synchronization primitives, without having to worry about the rest of the data structure.
I have to admit I found it a bit eerie seeing a single pattern pop up in so many different places. It may be just an idea whose time has come (similar to Garbage Collection a few years back). Or it may be just that now we simply have enough memory and computing resources at our disposal, which allows us to never have to delete something from the record, but instead only add incrementally. And oftentimes the history of data is just as interesting as the data itself. The benefits of immutability may also be a good way to tame the concurrency beast in the years to come.
I hope you enjoyed this comparative trip into the internals of these software systems. Again, check out the two talks and the blog post, they are well worth the time.