Creating an effective archive with labels

October 11, 2012 | 4 min Read

Paper is tree-based

Developer documentation, if available, is a great thing. It is even better if you can actually find the piece of information you are looking for. Most documentation I have seen is organized in one big hierarchical tree. It’s impossible to find anything until you have intimate knowledge of the tree itself. The problem is often that the information itself is not structured enough to naturally fit in a single hierarchical tree, but it is pressed into it because it appears to be the only way to structure things.

The idea that information is best structured in single-dimensional hierarchies comes from a world where it was expensive to store objects in more than one place. Everything that is paper-based or contains other physical entities belongs in this category. Unfortunately, this idea has been burned rather strongly into the post-modern human brain, and only slowly are people realizing that this is an artificial border that is no longer relevant in the information age. In my last blog entry 10 Principles on Electronic self-management I roughly outlined the archive. Here I want to elaborate and generalize this idea.

Labels deal with unstructured information

Many applications nowadays allow their documents to apply labels, sometimes called tags.

Labels have simple characteristics:

  • A page/document can have multiple labels
  • Multiple pages/documents can have the same labels
  • You can browse by labels, or sometimes search for them

For creating a label-based archive your application needs another feature:

  • Browse combined tags, i.e. find all documents that have both tags rap and osgi

This feature is available in applications like Evernote, Google mail and Confluence. Some systems that provide tag clouds are missing this feature, though.

Creating an effective archive with labels

A label-based archive works fine with mostly unstructured information like documentation snippets or notes, but for a better understanding we’ll start with something very structured. Consider a classical paper-based address book. All entries are organized alphabetically, but more importantly, they are organized by single letters.

  • B - Steve Ballmer
  • C - Tim Cook

Now let’s move this address book to the electronic world, and forget address book applications for the moment. Both “Steve Ballmer” and “Tim Cook” are documents now, and we can label them. “Steve Ballmer” gets a B, but because it’s convenient he also gets an S. “Tim Cook” gets C and T.

  • Steve Ballmer - B, S
  • Tim Cook - T, C

When your address book grows, and you have 50 or more entries under B, it will still be easy to find Steve Ballmer, because the combination of S and B will remain rare. Not unique, but rare enough.

This approach alone is rather powerful, so it’s tempting to move other things to this document database. Let’s organize bookmarks, too. But wait a moment. Before you start labeling “Steve Ballmers Blog” with S and B, it’s time to categorize address book entries by adding the label addressbook.

  • Steve Ballmer - B, **S,addressbook**
  • Tim Cook - T, **C,addressbook**

Now you can add your bookmarks.

  • Steve Ballmers Blog - B, Sbookmarks
  • Tim Cooks Blog - T, C, B, bookmarks

Applying labels to somewhat structured information

Look at the hierarchical tree on the left. Imagine that this tree names only the folders, and each folder contains one or more documents with the actual information. It is a small but quite typical tree in any wiki of any mid-sized development project.

Let’s see how we can add labels to the documents in that tree. It appears obvious that client, server and database get their own categories. A document “Installing the client on Linux” gets the labels client, I, L. Also, if supported systems include Ubuntu and Redhat, the labels U and R are added.

faq is also a strong candidate for its own category label. After careful examination the team might decide that the installation and uninstallation topics are addressed well enough by I and U. Also, the homeless document that hides under “Server Migration” might get the labels server and M.

Should the team feel that migration is almost worth its own category, but does not want to commit to it, it’s a good idea to label the documents in question with both migration and M.

I use this archiving method for 500+ documents and generally find what I want with at most 3 intersecting tags. I don’t see any scaling issues for the next few hundred documents. What comes after that?