  1. An Introduction to Using Custom Timestamps in CQL3

    One interesting feature of Cassandra that’s exposed in CQL is the ability to apply a custom timestamp to a mutation. To understand the impact of this, we first need to dive into Cassandra’s internals a bit. Once we understand how reads and writes are handled, we can start to explore potential uses for custom timestamps.

    First, let’s discuss storage. At the most basic level, each piece of data is stored in a Cell. A Cell has these properties:

    protected final CellName name;
    protected final ByteBuffer value;
    protected final long timestamp;
    

    On a write, the timestamp stored here can optionally be provided by the client (via the USING TIMESTAMP clause, in microseconds since the Unix epoch) or generated automatically by the server. Every write has a timestamp associated with it. Because of how Cassandra stores data on disk, there can be multiple Cells for a given column name, and the timestamps determine which one is the most current. This applies to deletions as well as inserted values; a deletion is stored as a tombstone with its own timestamp. I suggest reading this doc from the Cassandra wiki to learn more.
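
    As a rough illustration of the rule described above, here is a minimal Python sketch of picking the current Cell by timestamp. The `Cell` tuple and `reconcile` function are hypothetical names for this example, not Cassandra's actual classes, and the sketch ignores how ties between equal timestamps are broken.

```python
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    name: str
    value: Optional[bytes]  # None stands in for a tombstone in this sketch
    timestamp: int          # microseconds since the Unix epoch


def reconcile(cells):
    """Return the cell with the highest timestamp (last write wins)."""
    return max(cells, key=lambda cell: cell.timestamp)


# Three versions of the same column, as they might exist on disk.
versions = [
    Cell("last_visited", b"2013-12-25", 1388000000000000),
    Cell("last_visited", b"2013-12-26", 1388100000000000),
    Cell("last_visited", None, 1388050000000000),  # a delete between the two writes
]

current = reconcile(versions)
print(current.value)  # b'2013-12-26'
```

    The tombstone loses here only because its timestamp is older than the second write, not because deletes are treated as less important.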

    There is another implementation detail of Cassandra that’s extremely important: in most cases, inserts are exactly the same as updates. There is no differentiation because everything is effectively an insert. Data is never updated in place; it’s simply written, and the versions are merged on reads.
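
    To make "written and merged on reads" concrete, here is a small Python sketch under simplifying assumptions: each write is modeled as a dictionary fragment of `column -> (value, timestamp)`, and `merge_on_read` is a hypothetical helper, not part of any Cassandra API.

```python
def merge_on_read(fragments):
    """Merge write fragments into one row, keeping the newest value per column."""
    row = {}
    for fragment in fragments:
        for column, (value, timestamp) in fragment.items():
            if column not in row or timestamp > row[column][1]:
                row[column] = (value, timestamp)
    return {column: value for column, (value, _) in row.items()}


# An "insert" and a later "update" are both just appended fragments.
fragments = [
    {"admin": (False, 100), "last_visited": ("2013-12-25", 100)},  # insert
    {"last_visited": ("2013-12-26", 200)},                          # update
]
print(merge_on_read(fragments))
# {'admin': False, 'last_visited': '2013-12-26'}

# An "update" with no prior insert still produces a row.
print(merge_on_read([{"last_visited": ("2013-12-26", 200)}]))
# {'last_visited': '2013-12-26'}
```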

    Now that we understand how Cassandra uses the timestamp to resolve which is the “correct” data value, we can start to think of ways to make custom timestamps useful. The one which we’ve started using at SHIFT is writing deletions into the future.

    Why would we want this? Because a deletion written 30 seconds into the future shadows any mutation with an earlier timestamp, so writes that arrive during the next 30 seconds with an ordinary, current timestamp are effectively ignored. This can be used as an extremely cheap lockout mechanism.
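
    Here is one way the timestamp arithmetic might look in Python. The helper names `future_timestamp` and `is_shadowed` are made up for this sketch, and it assumes a tombstone also wins when the timestamps are equal.

```python
import time

MICROS_PER_SECOND = 1_000_000


def future_timestamp(seconds_ahead, now=None):
    """Return a microsecond timestamp `seconds_ahead` seconds after `now`.

    `now` is in seconds since the Unix epoch and defaults to the current time.
    """
    if now is None:
        now = time.time()
    return int(now * MICROS_PER_SECOND) + seconds_ahead * MICROS_PER_SECOND


def is_shadowed(write_timestamp, tombstone_timestamp):
    """A write is hidden when it is not newer than the tombstone (assumption: ties go to the tombstone)."""
    return write_timestamp <= tombstone_timestamp


tombstone = future_timestamp(30)        # deletion stamped 30 seconds ahead
write_now = future_timestamp(0)         # an ordinary write made right away
print(is_shadowed(write_now, tombstone))  # True
```

    A value computed this way is what would go after USING TIMESTAMP in the delete statement.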

    Let’s consider an example. We have a group_membership table, which lets us view all the groups a particular user is in. We also store whether the user is an admin, and when they last visited the group.

    create table group_membership (
        user_id int,
        group_id int,
        admin boolean,
        last_visited timestamp,
        primary key (user_id, group_id)
    );
    

    What happens when we want to remove the user from the group? We issue a delete. What if we want to update the last_visited timestamp? The update looks like this:

    cqlsh:test> update group_membership set last_visited = '2013-12-26' where user_id = 1 and group_id = 1;
    

    As we mentioned before, this update is effectively the same as an insert. I performed the above query on an empty table, and yet the data is there:

    cqlsh:test> select * from group_membership;
    
     user_id | group_id | admin | last_visited
    ---------+----------+-------+--------------------------
           1 |        1 |  null | 2013-12-26 00:00:00-0800
    

    This behavior is convenient for the sake of fast writes, but can be problematic when race conditions are introduced. For example, consider this series of events:

    1. Membership is read (thread 1)
    2. Membership is deleted (thread 2)
    3. Membership last_visited is updated (thread 1)

    The net result of this will be a record for the user and group, even though the user was removed. With a relational DB, the update would do nothing if it was issued after the delete. With Cassandra we have to be careful.

    At this point, if all we do to check for the user’s membership in a group is look for the existence of a record, we end up with a false positive. The solution? Delete the membership into the future. Our sequence looks like this:

    -- the future!
    delete from group_membership using timestamp 1388101196179000 where user_id = 1 and group_id = 1;
    
    1. Membership is read (thread 1)
    2. Membership is deleted one minute into the future (thread 2)
    3. Membership last_visited is updated (thread 1, but it doesn’t matter as long as it’s within the 1-minute window)
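
    The two sequences can be compared with a small Python simulation. This is only a model of the timestamp rule, not a test against a real cluster: `membership_exists` is a hypothetical function, writes are `(kind, timestamp)` pairs in microseconds, and a delete is assumed to hide every update whose timestamp is not newer than it.

```python
def membership_exists(writes):
    """Return True if any update is newer than the newest delete."""
    newest_delete = max(
        (timestamp for kind, timestamp in writes if kind == "delete"),
        default=-1,
    )
    return any(
        kind == "update" and timestamp > newest_delete
        for kind, timestamp in writes
    )


SECOND = 1_000_000
now = 1388101136179000  # one minute before the timestamp used in the delete above

joined = ("update", now - 10 * SECOND)         # the original membership row
late_update = ("update", now + 5 * SECOND)     # thread 1 updates last_visited

# Delete stamped with the current time: the late update brings the row back.
print(membership_exists([joined, ("delete", now), late_update]))  # True

# Delete stamped one minute ahead: the late update is shadowed.
print(membership_exists([joined, ("delete", now + 60 * SECOND), late_update]))  # False

# An update that lands after the window still gets through.
very_late = ("update", now + 61 * SECOND)
print(membership_exists([joined, ("delete", now + 60 * SECOND), very_late]))  # True
```

    The last case is the remaining race: the window only needs to be longer than the time a thread can plausibly spend between its read and its write.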

    We have not completely removed the opportunity for a race condition, but we’ve made it extremely unlikely. This is a faster alternative to locking, since locking requires a read before write.

    If you’re a Python user and using cqlengine, you may find it useful to know that we’re working on adding custom timestamp support to the next cqlengine release. We’ll put out a blog post going over that functionality when it’s released.
