How to finally delete documents in CouchDB

[Image: Document shredder. Image via CC from Sh4rp_i]

Documents in Apache CouchDB are usually not really deleted but only marked as deleted. In use cases with many document insertions and deletions, this considerably increases disk usage and hurts performance. This post shows a practical way to get rid of deleted documents and make your database fast and efficient again.

Deletion is not deletion

Documents in Apache CouchDB are usually deleted through its HTTP document API, with an HTTP DELETE request. Because the underlying B-tree is append-only, the document is not removed but only marked as deleted: a new revision of the document is created (a so-called tombstone), containing only its ID, its revision and a field _deleted which is set to true.

As a consequence, even after a document is deleted, CouchDB keeps the old revisions of the document plus its tombstone. Database compaction removes the old revisions, but the tombstone remains, as does a list of old revision numbers (to allow replication). The deleted document increases the size of the B-tree, occupies disk space permanently and slows down view regeneration.
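
To make this concrete, here is a minimal sketch of what such a delete looks like and what is left behind afterwards. The helper names buildDeleteRequest and tombstoneFor, the sample document and the revision strings are made up for illustration; only the DELETE request with a rev parameter and the shape of the tombstone come from the description above.

```javascript
// Sketch only: helper names and sample values are hypothetical.

// Build the HTTP DELETE request for a document (DELETE /db/docid?rev=...).
function buildDeleteRequest(baseUrl, db, doc) {
  return {
    method: "DELETE",
    url: baseUrl + "/" + encodeURIComponent(db) + "/" +
         encodeURIComponent(doc._id) + "?rev=" + encodeURIComponent(doc._rev)
  };
}

// Model the tombstone that remains: only ID, new revision and _deleted flag.
function tombstoneFor(doc, newRev) {
  return { _id: doc._id, _rev: newRev, _deleted: true };
}

const sampleDoc = { _id: "invoice-17", _rev: "1-abc", amount: 42 };
const deleteRequest = buildDeleteRequest(
  "http://localhost:5984", "original_database", sampleDoc);
const tombstone = tombstoneFor(sampleDoc, "2-def");

console.log(deleteRequest.method, deleteRequest.url);
console.log(JSON.stringify(tombstone));
```

The payload of the document (amount in this sample) is gone, but the tombstone itself stays in the database.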

The purge

CouchDB offers an alternative to deletion which is called purge. A purge removes all references to a document from CouchDB for good. However, it is a problematic command: purges are immediate and final (there is no recovery from an accidental purge), and they are not replicated. In a production environment with database replication enabled, purging would compromise the consistency of the replicas. Hence the official CouchDB documentation discourages the use of purge for our use case.
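
For reference only, a purge is a POST to the database's _purge endpoint with a JSON body that maps document IDs to the revisions to remove. The sketch below just assembles such a request without sending it. The helper name buildPurgeRequest and the sample ID and revision are assumptions for illustration, and for the reasons given above we do not recommend purge for this use case.

```javascript
// Sketch only: builds a purge request but does not send it.
// buildPurgeRequest and the sample values are hypothetical.
function buildPurgeRequest(baseUrl, db, revsById) {
  return {
    method: "POST",
    url: baseUrl + "/" + encodeURIComponent(db) + "/_purge",
    headers: { "Content-Type": "application/json" },
    // Body format: { "<doc id>": ["<rev>", ...], ... }
    body: JSON.stringify(revsById)
  };
}

const purgeRequest = buildPurgeRequest(
  "http://localhost:5984",
  "original_database",
  { "invoice-17": ["2-def"] }
);

console.log(purgeRequest.method, purgeRequest.url);
console.log(purgeRequest.body);
```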

Periodic relocation

The CouchDB documentation proposes a different approach to get rid of deleted documents. The idea is to switch periodically to a new database and to delete the old one once all entries in it have expired.

While this is a convenient solution for some use cases, it is difficult if there is data that never expires: this data would have to be migrated to the new database, which requires additional programming.

Periodic cleanup

We used a different approach that carries over existing data without programming and with only a little user interaction. Filtered replication is used to create a replica of the current database without the deleted documents. Once the replication is finished, only a short production downtime is needed to delete the original database and move the replica into its place.

Step by step

  1. Create filter
    Open Futon on the original_database. Create a new document _design/filters in the database:

    {
       "_id": "_design/filters",
       "filters": {
           "deletedfilter": "function(doc, req) { return !doc._deleted; };"
       }
    }

    This code filters out all deleted documents.

  2. Create replication document
    Create a new document in the _replicator database:

    {
       "_id": "replicateCleanup",
       "source": "original_database",
       "target": "http://admin:password@localhost:5984/original_database_replica",
       "create_target": true,
       "filter": "filters/deletedfilter",
       "owner": "admin",
       "continuous": true
    }

    When this document is created, CouchDB immediately creates a replica database named original_database_replica and starts copying all documents which are not deleted.

  3. Wait until replication is finished
    Check http://localhost:5984/_utils/status.html and wait until progress is 100%.
  4. Shut down everything
    Stop all applications that can access the database.
  5. Switch databases
    • Check http://localhost:5984/_utils/status.html again: progress should be 100%
    • Stop CouchDB
    • Delete the original_database database and its views
    • Rename original_database_replica to original_database
  6. Start CouchDB and applications again
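
To see what the filter from step 1 lets through, the same predicate can be run locally against a few sample documents. This is only an illustration of the filter logic, not of the replicator itself; the sample documents and revision strings are invented.

```javascript
// Same predicate as "deletedfilter" in _design/filters.
const deletedfilter = function (doc, req) { return !doc._deleted; };

// Hypothetical content of the source database: two live docs, one tombstone.
const sourceDocs = [
  { _id: "a", _rev: "1-x", value: 1 },
  { _id: "b", _rev: "2-y", _deleted: true },
  { _id: "c", _rev: "3-z", value: 3 }
];

// Only documents for which the filter returns true would be replicated.
const replicatedIds = sourceDocs
  .filter(function (doc) { return deletedfilter(doc, {}); })
  .map(function (doc) { return doc._id; });

console.log(replicatedIds.join(", ")); // prints "a, c"
```

The tombstone for "b" never reaches the replica, which is why the replica ends up smaller than the original.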

Note that if your original database has other replicas (for load balancing or failover), those replicas need to be replaced as well during the downtime.
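
If you prefer to script the progress check from steps 3 and 5 instead of watching the status page, the same information should be available as JSON from http://localhost:5984/_active_tasks. The sketch below only evaluates such a task list. The field names type, doc_id and progress are an assumption about what your CouchDB version reports, so check the actual output first. The helper names and the sample data are made up.

```javascript
// Sketch only: assumes replication tasks carry "type", "doc_id" and "progress".

// Return the progress of the replication started by the given
// _replicator document, or null if no such task is listed.
function replicationProgress(activeTasks, replicationDocId) {
  const task = activeTasks.find(function (t) {
    return t.type === "replication" && t.doc_id === replicationDocId;
  });
  return task ? task.progress : null;
}

// The switch in step 5 should only happen at 100%.
function safeToSwitch(activeTasks, replicationDocId) {
  return replicationProgress(activeTasks, replicationDocId) === 100;
}

// Hypothetical snapshots of the task list at two points in time.
const tasksRunning = [
  { type: "replication", doc_id: "replicateCleanup", progress: 87 }
];
const tasksDone = [
  { type: "replication", doc_id: "replicateCleanup", progress: 100 }
];

console.log(safeToSwitch(tasksRunning, "replicateCleanup"));
console.log(safeToSwitch(tasksDone, "replicateCleanup"));
```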

Conclusion

We presented a convenient way to finally get rid of deleted documents in Apache CouchDB. No programming is needed; it can be done with Futon alone. It involves periodic user interaction, but most of the work happens while CouchDB is up and running, which minimizes production downtime.

4 Comments
  • cory
    Posted at 6:16 am, April 24, 2015

    I’m new to couch and doing some reading, but wouldn’t using a combination of delete+purge work well?

    1. DELETE
    2. Wait for compaction
    3. cron/daemon runs and purges all tombstoned documents that have been compacted

    Seems like that should be part of core. Some sort of configurable auto purge.

  • Will Holley
    Posted at 8:40 am, April 24, 2015

    This approach works so long as documents are not being updated in the source database after you kick off the replication. In that scenario, a document could be replicated to the target before it is deleted but then the deletion operation would not propagate (because it would not pass the filter).

    There is a workaround using a validate_doc_update function on the replication target which rejects deleted documents which aren’t already known to the database. Something like:

    function(newDoc, oldDoc, userCtx) {
      // any update to an existing doc is OK
      if (oldDoc) {
        return;
      }

      // reject tombstones for docs we don't know about
      if (newDoc["_deleted"]) {
        throw({forbidden: "Deleted document rejected"});
      }
    }

    The workflow would then be:

    1. Add the above VDU to the target db
    2. Set up a continuous replication from the source to target
    3. When replication is complete (/ pending changes is within your tolerance), switch your application to point to the target db
    4. When you confirm there are no more changes to propagate from the source, cancel the replication

    You could also do this without a continuous replication – in step 4 just run another “top up” one-shot replication instead.
