Written by Tom Wilkie
Acunu have been involved in some pretty ambitious projects lately, and we've found one of the most challenging shortcomings of Cassandra to be its limited support for very high capacity nodes. Cassandra can currently store only ~500GB of data per node before you run into problems, so when you have a project to store ~100TB, you'll need over 600 nodes (with replication factor 3).

One way to reduce this cost is to increase the amount of data stored per node. At Acunu we're aiming for 10TB per node, reducing the number of nodes in capacity-bound clusters by over 10x. In this blog post I want to examine some of the current limitations of Cassandra, and how Acunu Reflex addresses them.
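
The node counts above follow from simple arithmetic. Here is a minimal sketch; the helper name `nodes_needed` is made up for illustration, and it ignores the free-space headroom a real cluster needs, which is one reason the practical figure ends up above the bare minimum:

```python
GB = 10**9
TB = 10**12

def nodes_needed(dataset_bytes, replication_factor, capacity_per_node_bytes):
    """Smallest node count that can hold every replica of the dataset."""
    total_bytes = dataset_bytes * replication_factor
    # Integer ceiling division, so a partly filled node still counts as a node.
    return -(-total_bytes // capacity_per_node_bytes)

# 100TB of data at replication factor 3, as in the example above.
nodes_at_500gb = nodes_needed(100 * TB, 3, 500 * GB)
nodes_at_10tb = nodes_needed(100 * TB, 3, 10 * TB)
print(nodes_at_500gb, nodes_at_10tb)
```

With 500GB per node the minimum is 600 nodes; with 10TB per node it is 30.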

(1) Bootstrap / decommission time

The first problem with larger nodes is that the time it takes to bootstrap (or decommission) a node is proportional to the amount of data on that node. For example, even in ideal conditions copying 10TB over a gigabit Ethernet link will take about 28 hours, and in reality it can easily take over 4x longer!
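
As a rough check on that figure, the calculation can be sketched as below. The throughput values are assumptions, not numbers from any measurement: 125MB/s is the raw line rate of gigabit Ethernet, and 100MB/s is a guess at usable throughput after protocol overhead, which is the rate that reproduces the ~28 hour estimate.

```python
TB = 10**12
MB = 10**6

def transfer_hours(data_bytes, throughput_bytes_per_sec):
    """Hours needed to copy data_bytes at a sustained throughput."""
    return data_bytes / throughput_bytes_per_sec / 3600

# Raw gigabit line rate: 1 Gbit/s = 125 MB/s (no overhead at all).
hours_at_line_rate = transfer_hours(10 * TB, 125 * MB)
# Assumed usable throughput of ~100 MB/s.
hours_at_usable_rate = transfer_hours(10 * TB, 100 * MB)

print(f"{hours_at_line_rate:.1f}h at line rate, {hours_at_usable_rate:.1f}h at 100MB/s")
```

Either way the copy takes about a day, and doubling the data on the node doubles the time.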

There's not much you can do about bootstrap (you're network bound most of the time), but decommission is the interesting case. I say this because the time when you want to bootstrap a new node in a rush is when an old node has failed. If you had a technique to decommission that failed node (and re-replicate its data to other nodes) very quickly, this would solve that problem, and you could bootstrap a new node at your leisure.

Our solution is called 'Virtual Nodes'. Virtual Nodes is a technique where instead of having one token per node, as Cassandra does at the moment, Acunu Reflex can have many. This allows each node to own many small regions of keys around the ring. This in turn means that whenever you need to stream data to a node (say, for bootstrapping or decommission) we can involve every node in the cluster in this operation, drastically reducing the load and increasing the performance. What this means is that when a node fails, instead of having to re-replicate the data between a small set of nodes (as with Cassandra at the moment), we can re-replicate the data to every node, making the process over 10x faster. This removes one of the limiting factors of high capacity Cassandra nodes.
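
To make the effect concrete, here is a toy simulation, not Acunu's or Cassandra's implementation. It assumes randomly placed tokens and a simple placement rule (each range is replicated on the next distinct nodes clockwise around the ring); the function names are invented for this sketch. It counts how many surviving nodes share data with a failed node and can therefore take part in re-replication.

```python
import random

def build_ring(num_nodes, tokens_per_node, seed=0):
    """Return a sorted list of (token, node) pairs with random tokens."""
    rng = random.Random(seed)
    ring = [(rng.random(), node)
            for node in range(num_nodes)
            for _ in range(tokens_per_node)]
    ring.sort()
    return ring

def replicas(ring, index, rf):
    """Walk clockwise from ring[index], collecting the first rf distinct nodes."""
    owners = []
    i = index
    while len(owners) < rf:  # terminates as long as num_nodes >= rf
        node = ring[i % len(ring)][1]
        if node not in owners:
            owners.append(node)
        i += 1
    return owners

def streaming_peers(ring, failed, rf):
    """Surviving nodes that hold a replica of some range the failed node held."""
    peers = set()
    for index in range(len(ring)):
        owners = replicas(ring, index, rf)
        if failed in owners:
            peers.update(n for n in owners if n != failed)
    return peers

# 20 nodes, replication factor 3, node 0 fails.
single_token = streaming_peers(build_ring(20, 1), failed=0, rf=3)
many_tokens = streaming_peers(build_ring(20, 256), failed=0, rf=3)
print(len(single_token), "peers with one token per node")
print(len(many_tokens), "peers with 256 tokens per node")
```

With one token per node, only the failed node's immediate ring neighbours hold its data, so a handful of machines carry the whole recovery. With many tokens per node, almost every other machine in the cluster holds some of it and the work is spread across all of them.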

More information about virtual nodes can be found in this email thread and JIRA tickets:

http://mail-archives.apache.org/mod_mbox/cassandra-dev/201203.mbox/%3CCADjM4zvCm97jSiPhn59WsA1k5bgR-SOqcKsBo+_ZMpyRC69DOg@mail.gmail.com%3E

https://issues.apache.org/jira/browse/CASSANDRA-4119

(2) Repair / anti-entropy

In the theme of looking for operations which take time proportional to the amount of data stored on a node, we come to repair. This is the process of reconciling the differences between nodes in your cluster. The current technique requires a complete scan of the data to build a Merkle Tree, which is then used to compare the differences between the nodes, such that one only transfers data which has changed.
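
The idea can be sketched in a few lines. This is a simplified illustration rather than Cassandra's actual code: it assigns keys to leaves by hashing them into a fixed number of buckets (Cassandra builds its tree over token ranges), the number of leaves must be a power of two, and all names are invented for the example. Note that `leaf_hashes` has to read every row, which is the full scan described above.

```python
import hashlib

def leaf_hashes(rows, num_leaves):
    """Scan every row (the expensive part) and fold it into its bucket's hash."""
    buckets = [hashlib.sha256() for _ in range(num_leaves)]
    for key in sorted(rows):
        index = int(hashlib.md5(key.encode()).hexdigest(), 16) % num_leaves
        buckets[index].update(f"{key}={rows[key]};".encode())
    return [b.digest() for b in buckets]

def build_tree(leaves):
    """Return the tree as a list of levels: leaves first, root last."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([hashlib.sha256(prev[i] + prev[i + 1]).digest()
                       for i in range(0, len(prev), 2)])
    return levels

def differing_leaves(tree_a, tree_b):
    """Descend from the root, following only subtrees whose hashes differ."""
    top = len(tree_a) - 1
    if tree_a[top][0] == tree_b[top][0]:
        return []  # roots match, so the replicas are in sync
    suspects = [0]
    for level in range(top - 1, -1, -1):
        suspects = [child
                    for parent in suspects
                    for child in (2 * parent, 2 * parent + 1)
                    if tree_a[level][child] != tree_b[level][child]]
    return suspects

# Two replicas that differ in a single row.
replica_a = {f"key{i}": "v1" for i in range(1000)}
replica_b = dict(replica_a)
replica_b["key42"] = "v2"

tree_a = build_tree(leaf_hashes(replica_a, 16))
tree_b = build_tree(leaf_hashes(replica_b, 16))
out_of_sync = differing_leaves(tree_a, tree_b)
print("buckets to stream:", out_of_sync)
```

Only the one bucket containing `key42` is flagged, so only that bucket's rows need to cross the network, even though both replicas had to read all 1000 rows to find that out.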

While this is very efficient in network IO, it takes a long time to scan all your data when you have many TB per node, and is very heavy on disk IO, generally causing quite a large performance impact. What's more, there is the requirement to run repair once every 10 days in Cassandra, so that tombstones are synced across the cluster and data doesn't come back from the dead. Combining these factors means the more data you have per node, the more chance there is you will need to run repair on more than one node simultaneously, greatly impacting cluster performance. Eventually, with node capacities high enough, you will be in the situation where repair takes longer than 10 days to run!

We are building a technology we call "Incremental Repair", which removes the need to do this expensive local IO when building the Merkle Tree. This means repair operations only take time and IO proportional to the number of changes between the nodes, and have very little performance impact.

(3) Memory limits    

Cassandra keeps all its indexes and Bloom filters in memory, in the Java heap. The long-standing recommendation when using Cassandra is to not run a Java heap bigger than 8GB, or you will run into unpredictable Java GC pauses. These two factors combine to limit the size of the dataset you can run with open source Cassandra before either running out of heap space or making your heap so large that you have unpredictable performance.
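
To get a feel for the numbers, the sketch below estimates Bloom filter memory alone, using the standard sizing formula for an optimally configured filter (bits per key = -ln(p) / (ln 2)^2). The 1KB average row size and 1% false-positive rate are illustrative assumptions, not Cassandra defaults, and the index memory mentioned above comes on top of this.

```python
import math

GB = 10**9
TB = 10**12

def bloom_filter_bytes(num_keys, false_positive_rate):
    """Memory for an optimally sized Bloom filter holding num_keys keys."""
    bits_per_key = -math.log(false_positive_rate) / (math.log(2) ** 2)
    return num_keys * bits_per_key / 8

AVG_ROW_BYTES = 1000      # assumed average row size
FALSE_POSITIVE = 0.01     # assumed target false-positive rate

for capacity in (500 * GB, 10 * TB):
    rows = capacity // AVG_ROW_BYTES
    filter_gb = bloom_filter_bytes(rows, FALSE_POSITIVE) / GB
    print(f"{capacity / TB:>4.1f}TB per node -> {filter_gb:.1f}GB of Bloom filters")

filter_gb_500gb = bloom_filter_bytes(500 * GB // AVG_ROW_BYTES, FALSE_POSITIVE) / GB
filter_gb_10tb = bloom_filter_bytes(10 * TB // AVG_ROW_BYTES, FALSE_POSITIVE) / GB
```

Under these assumptions a 500GB node needs well under 1GB for its Bloom filters, while a 10TB node needs around 12GB, which on its own is more than the recommended 8GB heap.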

Castle, Acunu's big data storage engine, stores its indexes and Bloom filters off the Java heap and allows Cassandra to run on machines with much more RAM, with much more predictable performance. Castle will easily scale to tens of terabytes of storage per node with little or no degradation in performance.

For more information, watch our talk about Castle.

(4) Bigger disks    

If you want to store tens of terabytes per node in Cassandra, you're going to want to store it on large, inexpensive SATA disks. What's more, if you are reducing the number of machines in your cluster from 100s to 10s, you do not want a disk failure to take out a whole machine, which will drastically impact the performance of the cluster as a whole. These two issues mean you want to run Cassandra with local replication between your disks, most likely with RAID.

However, RAID on large disks can take a very long time to recover from a disk failure. And during this time, your node is suffering heavy IO and reduced performance. Castle has technologies optimised for large SATA disks, such that rebuilds are over 10x faster than with RAID.

For more information, see our previous blog post on faster disk rebuilds.

Conclusion

In this blog post, I've laid out some of the reasons why Cassandra does not work well on nodes with terabytes of storage, and how Acunu solves these problems. With Acunu Reflex, you can drastically reduce the number of physical machines you need to store terabytes of data, without sacrificing performance or reliability. Fewer machines means less capital expenditure, less operational cost, and less operational complexity.
 

