Friday, 19 November 2010

Timestamps in Cassandra

A couple of weeks ago, I had several hair-splitting moments trying to understand timestamps in Cassandra.

Of course, I did my best to come to it with a clear mind, but like many, I was already polluted with SQL. Cassandra has nothing to do with SQL. You'd be wise to lock that monster in the basement before getting into Cassandra...
I did my best to avoid traps. I brought my best analytical attitude. I proceeded methodically, step by step. I thought nothing could happen to me. Boy, did I confuse being anal with being accurate! I made my own assumptions and found nothing in the documentation to invalidate them. This threw me into orbit faster than Superman on steroids. I started a lengthy discussion on the Cassandra mailing list. Yeah, the 'Ground Control to Major Tom' kind.
Let this post save you some time.
I am not going to cover Cassandra concepts here, as this has already been done very well by Ronald Mathies. I will only give brief reminders.
Error #1: A timestamp would be a column in SQL
Timestamps are part of the Cassandra column triplet: (name, value, timestamp). Describing columns as triplets is convenient, but it can be misleading: the timestamp is not a third piece of data, it is metadata about the (name, value) pair.

Timestamps are often used in SQL tables to keep track of a record status. For example, they can be used to monitor the status of a user in an application:
   USER_ID  TIMESTAMP   STATUS
   003341   001000506   REGISTERED
   003341   001214566   CONFIRMED
   003341   001394230   LOGGED_IN
   003341   002004447   LOGGED_OUT

In the above SQL-like table, the timestamp is a SQL-like column. In Cassandra, there is no such thing as a SQL-like column for timestamps. A timestamp in Cassandra is validity information for the Cassandra (name, value) pair.

A Cassandra name can be considered the equivalent of a SQL-like column. For each row, a corresponding value and timestamp are defined and registered. Cassandra timestamps have no SQL equivalent; there is no corresponding concept. Do not plan to use Cassandra timestamps as data in your application.

If you need to implement a SQL-like timestamp in your application, you will need to implement a Cassandra column like this:
   NAME  = 'MyTimeStamp'
   VALUE = 001000506, 001214566... // Row values
For each SQL-like timestamp row entry, one would set a Cassandra timestamp value too.
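To make the distinction concrete, here is a minimal in-memory sketch (the names and values are illustrative, not a real Cassandra API): the application-level timestamp lives in the column's value, while the Cassandra timestamp is separate metadata.

```python
from collections import namedtuple

# Toy model of a Cassandra column: the timestamp is metadata attached
# to the (name, value) pair, not a column of its own.
Column = namedtuple("Column", ["name", "value", "timestamp"])

# A SQL-like timestamp must be stored as ordinary column *data* (value),
# while the Cassandra timestamp remains write-version metadata.
app_column = Column(name="MyTimeStamp", value="001000506", timestamp=1290150000000)

print(app_column.value)      # application data: the SQL-like timestamp
print(app_column.timestamp)  # Cassandra metadata: validity/version info
```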

Error #2: Multiple Cassandra column value instances with different timestamps result in multiple entries
No. Just plain no. In the above example, multiple instances of a Cassandra column triplet with different Cassandra timestamp values would look like this:
   NAME           VALUE      TIMESTAMP
   'MyTimeStamp'  001000506  001000500 
   'MyTimeStamp'  001000506  001000503
   'MyTimeStamp'  001000506  001000510 
   'MyTimeStamp'  001000506  001000521
In this case, Cassandra does not register 4 entries, but only one: the one with the highest timestamp value, 001000521.
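A toy reconciliation function makes the rule obvious (this is a sketch of the behaviour, not Cassandra's actual code): among several writes to the same column, only the one with the highest timestamp survives.

```python
# The four writes from the example above, as (name, value, timestamp) triplets.
writes = [
    ("MyTimeStamp", "001000506", 1000500),
    ("MyTimeStamp", "001000506", 1000503),
    ("MyTimeStamp", "001000506", 1000510),
    ("MyTimeStamp", "001000506", 1000521),
]

def reconcile(columns):
    # Highest timestamp wins; all earlier versions are discarded.
    return max(columns, key=lambda col: col[2])

name, value, timestamp = reconcile(writes)
print(timestamp)  # 1000521 -- the single surviving entry
```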

Error #3: I can use any type of value for Cassandra timestamps
In theory, yes you can. In practice, no, it is not a good idea. Cassandra implements timestamps to facilitate idempotency. An operation is idempotent if applying it multiple times does not change the final result. For example, I can add zero to a number as many times as I want; the result will always be that number.

In Cassandra, setting a column row value multiple times with the same timestamp does not change the final row value. Most applications will use now() for their timestamp value, but it is not mandatory. For the sake of integrity, whatever values you use for timestamps should preserve the idempotent property.

The motivation behind this is that, from time to time, Cassandra needs to re-submit a set of batch mutations on a node. The final result should be the same as if the batch was applied once. The same applies to your own batch mutations that you may want to re-apply at the application level.
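A small sketch of this idempotency, assuming a simple "write only if the timestamp is newer" store (all names are illustrative): re-submitting the same timestamped batch leaves the store exactly as it was.

```python
# Toy store mapping a column name to its surviving (value, timestamp).
store = {}

def apply_mutation(store, name, value, timestamp):
    current = store.get(name)
    # Write only if this version is strictly newer than what is stored.
    if current is None or timestamp > current[1]:
        store[name] = (value, timestamp)

batch = [("MyColumn", "MyValue", 1000500)]
for name, value, ts in batch:        # first submission
    apply_mutation(store, name, value, ts)
once = dict(store)

for name, value, ts in batch:        # Cassandra re-submits the same batch
    apply_mutation(store, name, value, ts)

print(store == once)  # True -- idempotent: re-applying changed nothing
```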

Error #4: Cassandra secretly registers some kind of history about each column's entries
No. The only historical information stored by Cassandra during a write is the value of the timestamp of the corresponding column value. Any previous column value (i.e., Cassandra value-timestamp pair) is simply overwritten if its timestamp is older. The content is lost forever. No safety net. No second chance. No resurrection.
This is not to be confused with 'tombstoning', which happens when users explicitly delete previously created 'records'. Instead of performing the wipe out immediately, Cassandra flags the 'record' as 'to be deleted later'. It remains in a limbo state for some time, before being removed forever. However, the record content is lost as soon as it is flagged for delete. No safety net. No second chance. No resurrection.
You could say that Cassandra stores data on a big heap, but you only have access to the crust. And, every write with a higher Cassandra timestamp value results in an extra layer on the heap. Cassandra automatically deletes whatever is below the crust, synchronously or asynchronously.

Error #5: Concurrent writing operations in Cassandra are deterministic
No, they are not; only reads are. Let's imagine a system of 11 Cassandra nodes operating at quorum. The quorum value is 11 DIV 2 + 1 = 6. Now imagine two nodes, A and B, that want to write to the same column at the same time:
   'MyColumn'  'MyValueA'  001000500
   'MyColumn'  'MyValueB'  001000500 
Who is going to win? The first that manages to write to 6 nodes? No. Notice the tie in the timestamps. In this case, the value breaks the timestamp tie. If column values sort in alphabetical order, node B will win, because 'MyValueB' is greater than 'MyValueA', even if A manages to perform its 6 writes before node B, for a given timestamp.
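The tie-break can be sketched as comparing (timestamp, value) pairs (a toy model of the rule, not Cassandra's implementation): whichever pair is greater wins, regardless of which write landed on more replicas first.

```python
def reconcile(a, b):
    """Return the surviving column between two competing writes."""
    name_a, value_a, ts_a = a
    name_b, value_b, ts_b = b
    # Compare timestamps first; the value breaks timestamp ties.
    return a if (ts_a, value_a) > (ts_b, value_b) else b

write_a = ("MyColumn", "MyValueA", 1000500)
write_b = ("MyColumn", "MyValueB", 1000500)

# The outcome is the same whichever write "arrives" first.
print(reconcile(write_a, write_b)[1])  # MyValueB
print(reconcile(write_b, write_a)[1])  # MyValueB
```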
We could play semantics around the word 'deterministic'. What I mean is that you have no guarantee that what a node writes will become the new crust of the heap for a given timestamp (I am not talking about different timestamps). Even a re-read would not help you verify this: you never know what write, with that same timestamp, could happen in between.

Hence, concurrent writing is not 'locally' deterministic in Cassandra. You can't rely on what you just wrote to guess anything about the status of Cassandra content.
Concurrent reads are deterministic (assuming quorum mode). Why? Because Cassandra needs at least 6 coherent reads before returning a value to the user, for a given timestamp value (again, I am not talking about situations with different timestamp values). One way or the other, any potential timestamp tie will have been resolved the same way on at least 6 nodes before a column value is returned to the user. This is nothing more than an expression of determinism.
Error #6: Cassandra takes the current system clock value to automatically fabricate timestamps
No. The end user always has to provide the Cassandra column timestamp value when performing a write. It can be the current value of the system clock or any other value, though the latter is not recommended. A client library built on top of Cassandra may do this for you automatically.

Keep in mind the idempotent property requirement described earlier in this post. 
Error #7: Cassandra manages timestamp ties on SuperColumn writes atomically
No. That's a tricky one. If two nodes write to the same SuperColumn (for example) using the same timestamp value, the result can be a mix of the column values provided by node A and node B. Some of B's columns can beat A's, and some of A's columns can beat B's. In other words, the crust can be a composite of A's values and B's values.
However, one does not need to delete columns one by one in SuperColumns. One can delete a whole SuperColumn (or rows) at once. This operation will happen atomically.
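A toy merge illustrates the composite crust (a sketch of the per-sub-column rule, with made-up column names): each sub-column is reconciled independently, so with equal timestamps the survivor can mix both writers.

```python
def merge(write_a, write_b):
    """Reconcile two SuperColumn writes sub-column by sub-column."""
    merged = {}
    for name in set(write_a) | set(write_b):
        candidates = [w[name] for w in (write_a, write_b) if name in w]
        # For each sub-column: highest timestamp wins, value breaks ties.
        merged[name] = max(candidates, key=lambda col: (col[1], col[0]))
    return merged

ts = 1000500
a = {"c1": ("A1", ts), "c2": ("A2", ts)}
b = {"c1": ("B1", ts), "c2": ("02", ts)}  # '02' sorts below 'A2'

result = merge(a, b)
print(result["c1"][0], result["c2"][0])  # B1 A2 -- a composite of A and B
```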
Error #8: Column inserts in SuperColumns are atomic/deterministic for a given timestamp

No. Let's imagine two nodes that want to add columns to a super column. Node A wants to add column CA1. Node B wants to add columns CB1 and CB2. CB2 will always be there, but you can't be sure who is going to win between CA1 and CB1.

So what is the solution?

If you are going to have concurrent writes, you will need a lock/synchronization system to tackle the issues described in this post.

Lamport's Bakery algorithm has been suggested. Another suggestion is to have nodes register update requests in a queue as unique (UUID, timestamp) pairs, and have a single 'thread' sort these and process them one by one.
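The queue-based suggestion can be sketched like this (a minimal single-process model; the store, names, and values are all illustrative): writers enqueue (UUID, timestamp)-tagged mutations, and one lone worker sorts and applies them serially, so no two writes ever race.

```python
import queue
import uuid

updates = queue.Queue()

def submit(name, value, timestamp):
    # Tag every update request with a unique (UUID, timestamp) pair.
    updates.put((uuid.uuid4(), timestamp, name, value))

# Two concurrent writers with a timestamp tie, as in Error #5.
submit("MyColumn", "MyValueA", 1000500)
submit("MyColumn", "MyValueB", 1000500)

# The single worker drains the queue...
pending = []
while not updates.empty():
    uid, ts, name, value = updates.get()
    pending.append((ts, uid, name, value))

# ...then sorts by (timestamp, UUID) so ties break the same way every
# time, and applies the mutations one by one: no concurrent races.
store = {}
for ts, uid, name, value in sorted(pending):
    store[name] = (value, ts)

print(store["MyColumn"])
```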

A big thanks to Peter Schuller for clarifying these issues. This post was written just as Cassandra 0.7.0 beta 3 came out.