How to finally delete documents in CouchDB

[Image: a document shredder. Via CC from Sh4rp_i]

Documents in Apache CouchDB are usually not really deleted but only marked as such. In use cases with many document insertions and deletions, this considerably increases disk space consumption and hurts performance. This post shows a practical way to get rid of deleted documents and make your database fast and efficient again.

Deletion is not deletion

Usually, documents in Apache CouchDB are deleted through its HTTP document API, with an HTTP DELETE request. Due to the append-only design of the underlying B-tree, the document is not removed but only marked as deleted: a new revision of the document is created (a so-called tombstone), containing only its ID, its revision, and a field _deleted set to true.
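For illustration, such a deletion round trip over the HTTP API might look like this (the document ID and revision values are made up):

```
DELETE /original_database/mydoc?rev=1-967a00dff5e02add41819138abb3284d HTTP/1.1

{"ok": true, "id": "mydoc", "rev": "2-eec205a9d413992850a6e32678485900"}
```

Fetching the new revision afterwards would return only the tombstone, a body along the lines of {"_id": "mydoc", "_rev": "2-...", "_deleted": true}.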

As a consequence, even after a document is deleted, CouchDB keeps its old revisions plus the tombstone. Database compaction removes the old revision bodies, but the tombstone remains, along with a list of old revision numbers (to allow replication). Deleted documents thus increase the size of the B-tree, permanently occupy disk space, and slow down view regeneration.
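Compaction is triggered per database through the HTTP API, for example (a sketch; depending on your setup, admin credentials may be required):

```
POST /original_database/_compact HTTP/1.1
Content-Type: application/json
```

CouchDB answers with {"ok": true} and compacts in the background. Note that this reclaims the space of old revision bodies only; the tombstones stay.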

The purge

CouchDB offers an alternative to deletion called purge. A purge ultimately removes all references to a document from CouchDB. However, it is a problematic operation: purges are immediate and final (there is no recovery from an accidental purge), and they are not replicated. In a production environment with database replication enabled, purging would compromise the consistency of the replicas. Hence, the official CouchDB documentation discourages the use of purge for our use case.
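For reference, a purge is issued by POSTing a map of document IDs to the revisions to be purged (values made up for illustration); this is exactly the operation we avoid in replicated setups:

```
POST /original_database/_purge HTTP/1.1
Content-Type: application/json

{"mydoc": ["2-eec205a9d413992850a6e32678485900"]}
```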

Periodic relocation

The CouchDB documentation proposes a different approach to getting rid of deleted documents: periodically switch to a new database once all entries in the old one have expired, and then delete the old one.

While this is a convenient solution for some use cases, it is difficult if there is data that never expires – this data would have to be migrated to the new database, which requires additional programming.

Periodic cleanup

We used a different approach that takes over the existing data without programming and with only a little user interaction. Filtered replication is used to create a replica of the current database without the deleted documents. Once the replication is finished, only a short production downtime is needed to delete the original database and move the replica into its place.

Step by step

  1. Create filter
    Open Futon on the original_database. Create a new document _design/filters in the database:

    {
       "_id": "_design/filters",
       "filters": {
           "deletedfilter": "function(doc, req) { return !doc._deleted; };"
       }
    }

    This code filters out all deleted documents.

  2. Create replication document
    Create a new document in the _replicator database:
    {
       "_id": "replicateCleanup",
       "source": "original_database",
       "target": "http://admin:password@localhost:5984/original_database_replica",
       "create_target": true,
       "filter": "filters/deletedfilter",
       "owner": "admin",
       "continuous": true
    }

    When this document is created, CouchDB immediately creates a replica database named original_database_replica and starts copying all documents that are not deleted.

  3. Wait until replication is finished
    Check http://localhost:5984/_utils/status.html and wait until the progress reaches 100%
  4. Shut down everything
    Stop all applications that access the database
  5. Switch databases
    • Check http://localhost:5984/_utils/status.html again: the progress should be 100%
    • Stop CouchDB
    • Delete the original_database database and its views
    • Rename original_database_replica to original_database
  6. Start CouchDB and the applications again
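For those who prefer the HTTP API over Futon, the steps above can be sketched as follows (host, credentials, and names as in the documents above; the file locations in step 5 assume a CouchDB 1.x installation, where each database lives in a single .couch file in the data directory):

```
# Step 1: create the filter design document
PUT /original_database/_design/filters
{"filters": {"deletedfilter": "function(doc, req) { return !doc._deleted; };"}}

# Step 2: create the replication document
PUT /_replicator/replicateCleanup
{"source": "original_database",
 "target": "http://admin:password@localhost:5984/original_database_replica",
 "create_target": true,
 "filter": "filters/deletedfilter",
 "continuous": true}

# Step 3: poll the replication progress
GET /_active_tasks

# Step 5: after stopping CouchDB, switch the files on disk:
#   delete original_database.couch (and its view files), then
#   rename original_database_replica.couch to original_database.couch
```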

Note that if your original database has other replicas (for load balancing or failover), these replicas need to be replaced during the downtime as well.

Conclusion

We presented a convenient way to finally get rid of deleted documents in Apache CouchDB. No programming is needed; everything can be done with Futon alone. The approach requires periodic user interaction, but most of the work can happen while CouchDB is up and running, which minimizes production downtime.

4 Responses to “How to finally delete documents in CouchDB”

  1. cory says:

    I’m new to couch and doing some reading, but wouldn’t using a combination of delete+purge work well?

    1. DELETE
    2. Wait for compaction
    3. cron/daemon runs and purges all tombstoned documents that have been compacted

    Seems like that should be part of core. Some sort of configurable auto purge.

  2. Will Holley says:

    This approach works so long as documents are not being updated in the source database after you kick off the replication. In that scenario, a document could be replicated to the target before it is deleted but then the deletion operation would not propagate (because it would not pass the filter).

    There is a workaround using a validate_doc_update function on the replication target which rejects deleted documents which aren’t already known to the database. Something like:

    function(newDoc, oldDoc, userCtx) {
       // any update to an existing doc is OK
       if (oldDoc) {
          return;
       }

       // reject tombstones for docs we don't know about
       if (newDoc["_deleted"]) {
          throw({forbidden: "Deleted document rejected"});
       }
    }

    The workflow would then be:

    1. Add the above VDU to the target db
    2. Set up a continuous replication from the source to target
    3. When replication is complete (/ pending changes is within your tolerance), switch your application to point to the target db
    4. When you confirm there are no more changes to propagate from the source, cancel the replication

    You could also do this without a continuous replication – in step 4 just run another “top up” one-shot replication instead.

  3. @cory:
    Your approach should work fine if you do not use replication.
    However, if the database in question is replicated, you might run into trouble.

    Let’s assume you have database A, which is replicated to B.
    1. Document D is created in A
    2. Document D is replicated to B
    3. Document D is deleted in A
    4. Due to a temporary network problem, the deletion is not replicated to B
    5. The cron job runs and purges D from A
    Since purges are not replicated, D will never be deleted from B. You end up with inconsistencies between A and B.

    I like the idea of having built-in purge jobs in CouchDB. However, I doubt this would make it into CouchDB's core, since it compromises replication. Additionally, the tombstones are only a problem for specific use cases of CouchDB (those with many deletions), which are probably not so common for NoSQL databases.

  4. @Will:
    You actually have a point here. Document deletions in the original_database during the phase where the replica is built up (between steps 2 and 3) are indeed filtered out and not replicated.

    So if it cannot be guaranteed from the application side that no deletions happen between steps 2 and 3, your proposal to filter with validate_doc_update is the better choice.


Published in: EclipseSource News
