Reducing CouchDB disk space consumption

Apache CouchDB offers high availability, excellent throughput and scalability. These goals were achieved using immutable data structures – but they have a price: disk space. CouchDB was designed under the assumption that disk space is cheap. Though it is indeed getting cheaper and cheaper, it is not infinite. Here’s a tip to reduce CouchDB database files’ disk consumption.

When a CouchDB document is updated, the new version gets a new revision number, while the old one stays in the database. To get rid of those old revisions, database compaction can be used:

curl -H "Content-Type: application/json" -X POST http://localhost:5984/testdb/_compact

This call removes the bodies of all document revisions except the most recent one from the database file. Since CouchDB 1.2, compaction can also be configured to run automatically.
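To see what compaction actually saves, the file size can be read from the database info before and after the call. The following is a minimal Python sketch, assuming a CouchDB instance at http://localhost:5984 and the testdb database from the example above. The helper names are made up for illustration. The size field is assumed to be called disk_size on CouchDB 1.x and sizes.file on later releases, so check what your version returns.

```python
import json
import urllib.request


def compact_request(base_url, db):
    """Build the POST request that triggers compaction for one database."""
    return urllib.request.Request(
        f"{base_url}/{db}/_compact",
        data=b"",
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def file_size_bytes(db_info):
    """Extract the on-disk file size from a database info document.

    Assumption: CouchDB 1.x reports "disk_size", later releases
    report the same figure under "sizes" -> "file".
    """
    if "sizes" in db_info:
        return db_info["sizes"]["file"]
    return db_info["disk_size"]


def fetch_db_info(base_url, db):
    """GET the database info document (needs a reachable server)."""
    with urllib.request.urlopen(f"{base_url}/{db}") as resp:
        return json.load(resp)
```

Passing compact_request("http://localhost:5984", "testdb") to urllib.request.urlopen starts the compaction. It runs in the background, so the size reported by file_size_bytes(fetch_db_info(...)) only drops once compact_running in the database info is false again.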

However, there is more that can be done. The following table shows how much a CouchDB database file grows when a single new revision of a document (100 bytes of raw data) is added, depending on how many revisions the document already has:

Number of revisions   Disk space needed per revision (KByte)
1                     4
180                   8
230                   12
300                   16
400                   20
880                   40
1123                  44
2109                  44
2800                  44

As you can see in the table, the size consumed by one revision is not constant but grows with the number of revisions, and reaches a saturation point at about revision 1000.

The curious part is that once a new revision needs 44 KByte, it keeps needing 44 KByte even after compaction has deleted the old revisions. Even if the document is deleted and created again, each revision still uses that much space.

The reason for this behavior is that with each document update, a list of all previous revision numbers is saved along with the document. CouchDB uses the list to find common ancestors during replication. Compaction does not purge this list. Since a revision number consumes about 40 bytes, the disk space needed per revision increases with the number of revisions. By default, CouchDB keeps at most 1000 revision numbers per document, which explains the saturation.
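As a back-of-the-envelope check of this explanation, the size of the revision list alone can be estimated as follows. This is a rough sketch: the 40 bytes per revision number and the default limit of 1000 are the figures from the text, and the function name is made up for illustration. It ignores block granularity and any other per-document overhead, so it will not reproduce the measured table exactly.

```python
BYTES_PER_REV_ID = 40      # approximate size of one revision number (from the text)
DEFAULT_REVS_LIMIT = 1000  # CouchDB's default revisions limit


def rev_list_bytes(revisions, revs_limit=DEFAULT_REVS_LIMIT,
                   bytes_per_rev=BYTES_PER_REV_ID):
    """Estimate the bytes the revision list adds to each new revision."""
    # The list never holds more entries than the revisions limit.
    stored = min(revisions, revs_limit)
    return stored * bytes_per_rev


for n in (1, 180, 400, 1000, 2800):
    print(n, rev_list_bytes(n))
```

The estimate tops out at 40,000 bytes, which is in line with the observed growth from 4 KByte to 44 KByte per revision. With a limit of 10, rev_list_bytes(2800, revs_limit=10) gives only 400 bytes.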

To reduce the size of the list of old revisions, CouchDB offers the per-database parameter _revs_limit (revisions limit), which limits how many revision numbers are kept in each document's history. By default it is set to 1000. It can be changed by issuing an HTTP PUT command:

curl -X PUT -d "10" http://localhost:5984/testdb/_revs_limit

Note that reducing the revisions limit increases the risk of getting conflicts during replication. Therefore you should only do that if you replicate often (before a document has received more new revisions than the limit allows, 10 in the example above) or if you don’t use replication at all (thanks to Robert for the hint!).
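If the limit is set from a script rather than by hand, it is worth reading the value back afterwards. Here is a minimal Python sketch of both steps, again assuming a server at http://localhost:5984 and the testdb database. The helper names are hypothetical, and the check that rejects limits below 1 is a precaution added here, not something CouchDB requires of the client.

```python
import urllib.request


def set_revs_limit_request(base_url, db, limit):
    """Build the PUT request that sets the revisions limit of a database."""
    if limit < 1:
        # Guard added for this sketch: a limit below 1 makes no sense.
        raise ValueError("revisions limit must be at least 1")
    return urllib.request.Request(
        f"{base_url}/{db}/_revs_limit",
        data=str(limit).encode("ascii"),  # the body is just the number
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def get_revs_limit(base_url, db):
    """GET the current revisions limit (needs a reachable server)."""
    with urllib.request.urlopen(f"{base_url}/{db}/_revs_limit") as resp:
        return int(resp.read().decode("utf-8").strip())
```

Passing set_revs_limit_request("http://localhost:5984", "testdb", 10) to urllib.request.urlopen applies the change, and get_revs_limit should then return 10. Depending on the server's security settings, both calls may need admin credentials, which this sketch leaves out.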

Setting the revisions limit and using regular CouchDB compaction helped us to keep the disk space consumed by CouchDB databases at bay.

2 Responses to “Reducing CouchDB disk space consumption”

  1. Robert Newson says:

    You should explain the downside of reducing revs limit. Namely, by expunging older revision values, couchdb will not be able to find common ancestors when replicating, which will force it to introduce conflicts. If you never replicate or if you always replicate before making 10 edits, then this doesn’t matter. That couchdb does not keep all previous revisions is a compromise between guaranteeing eventual consistency (a core feature and promise) and reality (where disk space, while often the cheapest resource, is not unbounded).

    Users should consider _revs_limit an expert level feature and not, as they might conclude with a casual reading of the above, as just some silly thing couchdb does to waste disk space.

    B.

  2. Hi Robert,
    My post was not meant to be a rant against CouchDB for wasting disk space carelessly. The revisions list is there for a good reason, as you explained well.

    In fact, most CouchDB users will never care about those few extra bytes because they store large documents and update rarely. However, in those corner cases where you store small documents and update often, IT operations will soon knock at your door and ask why this strange new database system you insisted on using never cleans up old data. And why do 100 bytes of raw data need 50 times as much space on disk? In those situations you need a good explanation, and sometimes even a solution. That’s what I wanted to give.

    I added a section about the consequences of changing the _revs_limit to the post. Thanks for the hint.

    Best regards

    Tillmann


Published: Jul 11th, 2012
Categories: EclipseSource News