How to finally delete documents in CouchDB
April 20, 2015 | 3 min read
Documents in Apache CouchDB are usually not really deleted but merely marked as such. In use cases with many document insertions and deletions, this considerably affects disk space consumption and performance. This post shows a practical way to get rid of deleted documents and make your database fast and efficient again.
Deletion is not deletion
Usually, documents in Apache CouchDB are deleted through its HTTP document API with an HTTP DELETE request. Due to the append-only design of the underlying B-Tree, the document is not deleted but only marked as such: a new revision of the document is created (a so-called tombstone), containing only its ID, its revision and a field _deleted which is set to true.
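To make this concrete, here is a minimal sketch of that behaviour over the HTTP API, written in Python with the requests library; the database example_db and document example_doc are placeholders, adjust to your setup:

```
import requests

# Minimal sketch: delete a document, then look at its tombstone.
# "example_db" and "example_doc" are placeholder names.
DB = "http://localhost:5984/example_db"

doc = requests.get(f"{DB}/example_doc").json()                 # current revision
result = requests.delete(f"{DB}/example_doc",
                         params={"rev": doc["_rev"]}).json()   # returns the tombstone revision

# A plain GET now answers 404, but the tombstone itself can still be fetched:
tombstone = requests.get(f"{DB}/example_doc",
                         params={"rev": result["rev"]}).json()
print(tombstone)   # {'_id': 'example_doc', '_rev': '2-...', '_deleted': True}
```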
As a consequence, even after a document is deleted, CouchDB keeps its old revisions plus the tombstone. Database compaction removes the old revisions, but the tombstone remains, as well as a list of old revision numbers (to allow replication). The deleted document increases the size of the B-Tree, permanently occupies disk space and slows down view regeneration.
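Compaction itself is a single request per database; a sketch, assuming admin credentials on a local example_db:

```
import requests

# Queue compaction of the database; CouchDB runs it in the background.
resp = requests.post(
    "http://localhost:5984/example_db/_compact",
    auth=("admin", "password"),                    # assumed admin credentials
    headers={"Content-Type": "application/json"},  # required by CouchDB
)
print(resp.json())   # {'ok': True} once the compaction task is accepted
```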
The purge
CouchDB offers an alternative to deletion called purge. A purge ultimately removes all references to a document from CouchDB. However, it is a problematic command: purges are immediate and final (there is no recovery from an accidental purge) and they are not replicated. In a production environment with database replication enabled, purging would compromise the consistency of the replicas. Hence the official CouchDB documentation discourages the use of purge for our use case.
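For completeness, this is roughly what a purge request looks like (a sketch only; the database name, document ID and revision are placeholders, and the warnings above apply: a purge is immediate, final and not replicated):

```
import requests

# Danger zone: removes every trace of the given revisions from the database.
resp = requests.post(
    "http://localhost:5984/example_db/_purge",
    auth=("admin", "password"),                        # assumed admin credentials
    json={"example_doc": ["2-<tombstone-revision>"]},  # document ID -> revisions to purge
)
print(resp.json())   # reports the revisions that were actually purged
```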
Periodic relocation
The CouchDB documentation proposes a different approach to get rid of deleted documents. The idea is to switch periodically to a new database once all entries in the current one have expired, and then delete the old one.
While this is a convenient solution for some use cases, it is difficult if there is data that never expires: this data would have to be migrated to the new database, which requires additional programming.
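As a rough sketch of that rotation idea (all names and credentials below are assumptions): writes go to a database named after the current period, and an old database is dropped wholesale once everything in it has expired:

```
import datetime
import requests

auth = ("admin", "password")   # assumed admin credentials
base = "http://localhost:5984"

# Create (or keep using) this month's database ...
current = f"events_{datetime.date.today():%Y_%m}"
requests.put(f"{base}/{current}", auth=auth)

# ... and drop an old, fully expired one, tombstones and all.
requests.delete(f"{base}/events_2015_01", auth=auth)   # placeholder name
```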
Periodic cleanup
We used a different approach that allows existing data to be taken over without programming and with only a little user interaction. Filtered replication is used to create a replica of the current database without the deleted documents. Once the replication is finished, only a short production downtime is needed to delete the original database and move the replica into its place.
Step by step
Create filter
Open Futon on the original_database and create a new document _design/filters in the database:

```
{
  "_id": "_design/filters",
  "filters": {
    "deletedfilter": "function(doc, req) { return !doc._deleted; };"
  }
}
```

This code filters out all deleted documents.
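You can sanity-check the filter before starting the replication, for example against the _changes feed (a sketch; deleted documents should no longer show up in the results):

```
import requests

# The _changes feed accepts the same design-document filter.
changes = requests.get(
    "http://localhost:5984/original_database/_changes",
    params={"filter": "filters/deletedfilter"},
).json()

# Without the filter, deleted documents appear with a "deleted": true flag;
# with it, none should be left.
print(len(changes["results"]), "undeleted documents in the changes feed")
```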
Create replication document
Create a new document in the _replicator database:

```
{
  "_id": "replicateCleanup",
  "source": "original_database",
  "target": "https://admin:password@localhost:5984/original_database_replica",
  "create_target": true,
  "filter": "filters/deletedfilter",
  "owner": "admin",
  "continuous": true
}
```

When this document is created, CouchDB immediately creates a replica database named original_database_replica and starts copying all documents that are not deleted.
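If you prefer the HTTP API over Futon, the same replication document can be created with a PUT to _replicator, and the progress can also be polled from _active_tasks (a sketch, assuming admin credentials on localhost):

```
import requests

auth = ("admin", "password")   # assumed admin credentials
base = "http://localhost:5984"

replication_doc = {
    "source": "original_database",
    "target": "https://admin:password@localhost:5984/original_database_replica",
    "create_target": True,
    "filter": "filters/deletedfilter",
    "owner": "admin",
    "continuous": True,
}
requests.put(f"{base}/_replicator/replicateCleanup", auth=auth, json=replication_doc)

# Poll the running replication task(s) for their progress.
for task in requests.get(f"{base}/_active_tasks", auth=auth).json():
    if task.get("type") == "replication":
        print(task.get("replication_id"), task.get("progress"), "%")
```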
Wait until replication is finished
Check https://localhost:5984/_utils/status.html and wait until the progress is 100%.
Shut down everything
Stop all applications that can access the database.
Switch databases
- Check https://localhost:5984/_utils/status.html again: progress should be 100%
- Stop CouchDB
- Delete the original_database database and its views
- Rename original_database_replica to original_database (a file-level sketch follows this list)
- Check again
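The delete-and-rename step happens at the file level while CouchDB is stopped. A sketch of what that can look like on a CouchDB 1.x single-node setup; the data directory and the .<dbname>_design layout of the view indexes are assumptions, so check database_dir and view_index_dir in your configuration first:

```
import os
import shutil

DATA_DIR = "/var/lib/couchdb"   # assumed data directory -- check your local.ini

# Remove the original database file and its view index directory ...
os.remove(os.path.join(DATA_DIR, "original_database.couch"))
shutil.rmtree(os.path.join(DATA_DIR, ".original_database_design"), ignore_errors=True)

# ... and move the cleaned replica into its place.
shutil.move(os.path.join(DATA_DIR, "original_database_replica.couch"),
            os.path.join(DATA_DIR, "original_database.couch"))
```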
Start CouchDB and applications again
Note that if your original database has other replicas (for load balancing or failover), these replicas need to be replaced as well during the downtime.
Conclusion
We presented a convenient alternative for finally getting rid of deleted documents in Apache CouchDB. No programming is needed; everything can be done in Futon alone. It does involve periodic user interaction, but most of the work can happen while CouchDB is up and running, which minimizes production downtime.