Reducing CouchDB disk space consumption

Apache CouchDB offers high availability, excellent throughput and scalability. These goals were achieved using immutable data structures – but they have a price: disk space. CouchDB was designed under the assumption that disk space is cheap. Though it is indeed getting cheaper and cheaper, it is not infinite. Here’s a tip to reduce CouchDB database files’ disk consumption.

When a CouchDB document is updated, the new version gets a new revision number, while the old one stays in the database. To get rid of those old revisions, database compaction can be used:

curl -H "Content-Type: application/json" -X POST http://localhost:5984/testdb/_compact

This call removes the bodies of all revisions except the most recent one from the database file. Since CouchDB 1.2, compaction can also be configured to run automatically.
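To see what compaction actually saves, the database size can be compared before and after the call. The following is a minimal sketch, assuming a CouchDB 1.x-style database info document with `disk_size` and `compact_running` fields printed without whitespace (newer releases report sizes under different field names). The helper names `db_info` and `compact_and_wait` and the `DB_URL` variable are made up for this example.

```shell
# Assumed database URL; adjust host, port and database name for your setup.
DB_URL="http://localhost:5984/testdb"

# Print the database info document (contains the size fields).
db_info() {
  curl -s "$DB_URL"
}

# Trigger compaction, then poll once per second until it has finished.
# Assumption: the info document contains "compact_running":true while
# compaction is still in progress.
compact_and_wait() {
  curl -s -H "Content-Type: application/json" -X POST "$DB_URL/_compact"
  while db_info | grep -q '"compact_running":true'; do
    sleep 1
  done
}
```

Calling `db_info`, then `compact_and_wait`, then `db_info` again shows the size before and after compaction.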

However, there is more that can be done. The following table shows how much a CouchDB database file grows when a single revision of a document (100 bytes of raw data) is added, depending on how many revisions the document already has:

Number of revisions    Disk space needed per revision (KByte)
1                      4
180                    8
230                    12
300                    16
400                    20
880                    40
1123                   44
2109                   44
2800                   44

As you can see in the table, the size consumed by one revision is not constant but grows with the number of revisions, and reaches a saturation point at about revision 1000.

The curious part is that once a new revision needs 44 KByte, it keeps needing 44 KByte even after compaction has deleted the old revisions. Even if the document is deleted and created again, each revision still uses that much space.

The reason is that with each document update, a list of all previous revision numbers is saved along with the document. CouchDB uses the list to find common ancestors during replication. Compaction does not purge this list. Since a revision number consumes about 40 bytes, the disk space consumed per revision increases with the number of revisions. By default, CouchDB stores at most 1000 revisions, which explains the saturation.
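The numbers in the table can be roughly reproduced with a little arithmetic. This is only a back-of-the-envelope model, assuming that the 4 KByte of the first revision is a fixed base cost and that every stored revision number adds about 40 bytes; the function name `estimate_kb` is made up for this example.

```shell
# Rough model of the disk space one new revision needs, in KByte.
# Assumptions: 4 KByte base cost (first row of the table), about 40 bytes
# per stored revision number, and the default revision limit of 1000.
BASE_KB=4
BYTES_PER_REV=40
REVS_LIMIT=1000

estimate_kb() {
  revs=$1
  # The stored list never grows beyond the revision limit.
  if [ "$revs" -gt "$REVS_LIMIT" ]; then
    revs=$REVS_LIMIT
  fi
  echo $((BASE_KB + revs * BYTES_PER_REV / 1000))
}

estimate_kb 400    # prints 20
estimate_kb 2800   # prints 44
```

The estimate is not exact for every row, but it shows why the cost levels off at 44 KByte: 1000 revision numbers of about 40 bytes each add up to roughly 40 KByte on top of the base cost.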

To reduce the size of the list of old revisions, CouchDB offers a per-database parameter _revs_limit (revisions limit), which limits the number of revisions tracked for each document in that database. By default it is set to 1000. It can be changed by issuing an HTTP PUT request:

curl -X PUT -d "10" http://localhost:5984/testdb/_revs_limit

Note that reducing the revisions limit increases the risk of getting conflicts during replication. Therefore you should only lower it if you replicate often (with a limit of 10, before a document has received 10 new revisions) or if you don’t use replication at all (thanks to Robert for the hint!).
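The current limit can be read back with an HTTP GET on the same URL, which is handy for checking a database before and after the change. Here is a small sketch that wraps both calls; the helper names `get_revs_limit` and `set_revs_limit` and the `DB_URL` variable are made up for this example, and the input check is just a precaution against typos.

```shell
# Assumed database URL; adjust host, port and database name for your setup.
DB_URL="http://localhost:5984/testdb"

# Print the current revisions limit of the database.
get_revs_limit() {
  curl -s -X GET "$DB_URL/_revs_limit"
}

# Set a new revisions limit; refuses anything that is not a plain number.
set_revs_limit() {
  case "$1" in
    ''|*[!0-9]*)
      echo "usage: set_revs_limit <positive integer>" >&2
      return 1
      ;;
  esac
  curl -s -X PUT -d "$1" "$DB_URL/_revs_limit"
}
```

For example, `get_revs_limit` followed by `set_revs_limit 10` and another `get_revs_limit` shows the old and the new value.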

Setting the revisions limit and using regular CouchDB compaction helped us to keep the disk space consumed by CouchDB databases at bay.