One way to reduce this cost is to increase the amount of data stored per node. At Acunu we're aiming for 10TB per node, cutting the number of nodes in capacity-bound clusters by over 10x. In this blog post I want to examine some of the current limitations of Cassandra, and how Acunu Reflex addresses them.
(1) Bootstrap / decommission time
There's not much you can do about bootstrap (you're network-bound most of the time), but decommission is the interesting case. The time you want to bootstrap a new node in a rush is when an old node has failed - but if you could decommission that failed node (re-replicating its data to other nodes) very quickly, the urgency disappears, and you could bootstrap a replacement at your leisure.
Our solution is called 'Virtual Nodes': instead of assigning one token per node, as Cassandra does at the moment, Acunu Reflex can assign many, so each node owns many small regions of keys around the ring. This in turn means that whenever data needs to be streamed to or from a node (say, for bootstrap or decommission), every node in the cluster can take part in the operation, spreading the load and increasing throughput. So when a node fails, instead of re-replicating its data among a small set of neighbours (as Cassandra does today), we can re-replicate it across every node, making the process over 10x faster. This removes one of the limiting factors of high-capacity Cassandra nodes.
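To make that concrete, here is a toy model of token ownership on a consistent-hash ring (ours, purely illustrative - not Reflex code). It assumes a simplified replication rule where each failed range is re-replicated via the next node clockwise, and the node and token counts are made up:

```java
import java.util.*;

// Toy model of token ownership on a consistent-hash ring, illustrating
// why many tokens per node spreads re-replication load. The numbers and
// the single-successor replication rule are illustrative assumptions.
public class VNodeSketch {
    public static void main(String[] args) {
        int nodes = 20;
        int tokensPerNode = 64;          // virtual nodes: many tokens each
        Random rng = new Random(42);

        // token -> owning node; the sorted map forms the ring
        TreeMap<Long, Integer> ring = new TreeMap<>();
        for (int n = 0; n < nodes; n++)
            for (int t = 0; t < tokensPerNode; t++)
                ring.put(rng.nextLong(), n);

        // If node 0 fails, each of its ranges is re-replicated via the
        // next node clockwise. Count how many distinct nodes take part.
        Set<Integer> helpers = new TreeSet<>();
        for (Map.Entry<Long, Integer> e : ring.entrySet()) {
            if (e.getValue() != 0) continue;
            Map.Entry<Long, Integer> next = ring.higherEntry(e.getKey());
            if (next == null) next = ring.firstEntry();  // wrap around
            helpers.add(next.getValue());
        }
        helpers.remove(0);
        System.out.println("Nodes involved in re-replication: " + helpers.size());
    }
}
```

With one token per node, only a handful of ring neighbours would show up as helpers; with 64 tokens per node, nearly the whole cluster does, so each node streams only a small fraction of the failed node's data.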
More information about virtual nodes can be found in this email thread and the associated JIRA tickets.
(2) Repair / anti-entropy
Cassandra repairs data using an anti-entropy protocol: each replica scans its data to build a Merkle tree (a hash tree summarising its key ranges), and replicas then exchange and compare these trees to find the ranges that differ. While this is very efficient in network IO, it takes a long time to scan all your data when you have many TBs per node, and it is very heavy on disk IO, generally causing quite a large performance impact. What's more, Cassandra requires repair to run at least once every 10 days (the default gc_grace period), so that tombstones are synced across the cluster and deleted data doesn't come back from the dead. Combine these factors and the more data you have per node, the greater the chance you will need to run repair on more than one node simultaneously, greatly impacting cluster performance. Eventually, with node capacities high enough, you reach the point where repair takes longer than 10 days to run!
We are building a technology we call "Incremental Repair", which removes the need for this expensive local IO when building the Merkle tree. Repair operations then take time and IO proportional to the number of changes between the nodes, and have very little performance impact.
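As a rough sketch of the general idea (simplified, with invented names and a toy hash - not our actual implementation): maintain the Merkle tree on the write path, so each write updates one leaf and the O(log n) path above it, and at repair time walk two trees top-down, descending only where they differ.

```java
import java.util.List;

// Illustrative sketch: a tiny Merkle tree kept up to date on every write,
// so comparing two replicas touches only the subtrees that differ.
public class IncrementalMerkle {
    private final long[] tree;   // implicit binary tree, leaves at the end
    private final int leaves;

    public IncrementalMerkle(int leaves) {
        this.leaves = leaves;               // assume a power of two
        this.tree = new long[2 * leaves];
    }

    // Cheap stand-in for a real hash function.
    private static long mix(long a, long b) {
        long h = a * 0x9E3779B97F4A7C15L ^ b;
        return h ^ (h >>> 31);
    }

    // Called on every write: update one leaf and rehash the O(log n) path
    // to the root, instead of rescanning all data at repair time.
    public void update(int leaf, long valueHash) {
        int i = leaves + leaf;
        tree[i] = valueHash;
        for (i /= 2; i >= 1; i /= 2)
            tree[i] = mix(tree[2 * i], tree[2 * i + 1]);
    }

    // Repair walks both trees from the root (node 1), descending only where
    // hashes differ, and collects the leaf ranges that need streaming.
    public static void diff(IncrementalMerkle a, IncrementalMerkle b,
                            int node, List<Integer> out) {
        if (a.tree[node] == b.tree[node]) return;      // subtrees identical
        if (node >= a.leaves) { out.add(node - a.leaves); return; }
        diff(a, b, 2 * node, out);
        diff(a, b, 2 * node + 1, out);
    }
}
```

The repair cost is then proportional to the number of differing ranges rather than to the total amount of data scanned.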
(3) Memory limits
Cassandra keeps per-SSTable structures such as index samples and Bloom filters on the Java heap, and these grow with the amount of data stored, so large nodes mean large heaps and long, unpredictable GC pauses. Castle, Acunu's big data storage engine, stores its indexes and Bloom filters off the Java heap, allowing Cassandra to run on machines with much more RAM and with much more predictable performance. Castle scales easily to tens of terabytes of storage per node with little or no degradation in performance.
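As an illustration of why off-heap matters (this sketch is ours and runs inside the JVM; Castle itself is a native storage engine, not Java code): a Bloom filter whose bits live in a direct ByteBuffer adds nothing to the Java heap, so it neither forces a huge heap nor gives the garbage collector more to trace.

```java
import java.nio.ByteBuffer;

// Illustrative off-heap Bloom filter: the bit array lives in a direct
// ByteBuffer outside the GC-managed heap, so it doesn't count against
// heap limits or add GC pressure, however large it grows.
public class OffHeapBloomFilter {
    private final ByteBuffer bits;   // allocated outside the Java heap
    private final long numBits;
    private final int hashes;

    public OffHeapBloomFilter(long numBits, int hashes) {
        this.numBits = numBits;
        this.hashes = hashes;
        this.bits = ByteBuffer.allocateDirect((int) (numBits / 8 + 1));
    }

    // Toy hash, seeded to simulate k independent hash functions.
    private long hash(long key, int seed) {
        long h = key ^ (seed * 0x9E3779B97F4A7C15L);
        h ^= h >>> 33; h *= 0xFF51AFD7ED558CCDL; h ^= h >>> 33;
        return h;
    }

    public void add(long key) {
        for (int i = 0; i < hashes; i++) {
            long bit = Long.remainderUnsigned(hash(key, i + 1), numBits);
            int idx = (int) (bit / 8);
            bits.put(idx, (byte) (bits.get(idx) | (1 << (bit % 8))));
        }
    }

    public boolean mightContain(long key) {
        for (int i = 0; i < hashes; i++) {
            long bit = Long.remainderUnsigned(hash(key, i + 1), numBits);
            if ((bits.get((int) (bit / 8)) & (1 << (bit % 8))) == 0) return false;
        }
        return true;
    }
}
```

A single direct ByteBuffer is capped at 2GB, so a real implementation would shard across several buffers; the point here is only where the memory lives.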
For more information, watch our talk about Castle.
(4) Bigger disks
Storing 10TB per node means filling it with large SATA disks. However, RAID on large disks can take a very long time to recover from a disk failure, and while the rebuild runs the node suffers heavy IO and reduced performance. Castle includes technology optimised for large SATA disks, making rebuilds over 10x faster than with RAID.
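Some back-of-envelope arithmetic shows why single-disk rebuilds hurt. The figures below are illustrative assumptions, not measurements, and the spread-rebuild model is a generic one (rebuild work divided across all surviving disks), not a description of Castle's internals:

```java
// Back-of-envelope rebuild-time arithmetic. All figures are illustrative
// assumptions, not measurements of Castle or any particular RAID setup.
public class RebuildMath {
    public static void main(String[] args) {
        double diskTB = 3.0;              // capacity of the failed disk
        double diskMBps = 100.0;          // sustained rate of one SATA disk
        int disksInNode = 24;

        // Classic RAID: the whole rebuild is bottlenecked on writing the
        // single replacement disk at its sustained rate.
        double raidHours = diskTB * 1e6 / diskMBps / 3600;

        // If rebuild work is instead spread so every surviving disk reads
        // and writes a small slice, each handles ~1/(N-1) of the total.
        double spreadHours = raidHours / (disksInNode - 1);

        System.out.printf("RAID rebuild:   %.1f hours%n", raidHours);
        System.out.printf("Spread rebuild: %.1f hours%n", spreadHours);
    }
}
```

With these numbers a classic rebuild takes over eight hours of saturated IO on one disk, while spreading the work across 23 survivors brings it down to roughly 20 minutes.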
For more information, see our previous blog post on faster disk rebuilds.