Written by Richard Low
Cassandra is designed to run in large clusters where node failure is expected to be a common occurrence. It may be a temporary network failure, or maybe a PSU has failed and someone has to go onsite to fix it. Whatever the cause or duration of the issue, Cassandra can cope with the failure and end users will not notice.
However, there can still be downtime. If power fails in your only data center then necessarily the cluster will be down. In this post I'll estimate the probability of failure of a Cassandra cluster. In particular, I want to see how the addition of Virtual Nodes affects it. Enabling Virtual Nodes already offers many benefits, but its effect on uptime is not immediately obvious. As you can tell from this post's title, Virtual Nodes helps - at the end I'll say how much for a certain scenario.
Cassandra failure modes
There are many different scenarios that could lead to catastrophe in your data center, but they fall into two broad categories: those that are strongly correlated and those that aren't. Correlated failures are like the example above - power failure - but could also be user error or network failure. Cassandra's multi-data center support helps with the power failure case, and operational procedures can help minimize the chance of other correlated outages.
Weakly correlated failures are local to a node: things like PSU failure, disk failure or a maintenance reboot fit into this category. There will still be some correlation - the disks may belong to the same batch so may fail at similar times, or when one node fails it puts more load on the others, making them more likely to fail.
Virtual Nodes doesn't help with correlated failures, so I will only consider the uncorrelated modes. The first thing to do is to define what I mean by a cluster failure and see how node failures can cause one.
I consider a cluster to have failed if any amount of the data cannot be read or written at a given time. Maybe you can read 99% of the data, but the 1% that is unavailable is still likely to cause big problems.
This situation arises when all replicas for any key are unavailable. (I am assuming the most failure tolerant but least consistent case: consistency level one.) The main parameter affecting this is the replication factor, which I will call f. There must be at least f concurrent node failures for all replicas of some key to be unavailable. We need to understand a bit about how Cassandra replication works to know which nodes need to fail to cause data to be unavailable.
Cassandra replication 101
Consider inserting a value for key k. This key needs to be inserted on f nodes. Two components of Cassandra tell you which nodes these are: the partitioner and replication strategy. The partitioner maps k to one particular host. Almost everyone uses the random partitioner, which uses hashing to spread arbitrary key distributions evenly across the nodes. Each partition (a partition is the data stored by a single virtual node, which becomes the same as a real node if Virtual Nodes is not used) is identified by a number, its token. The ordered collection of tokens is known as the ring. The partition with the smallest token larger than the hash is chosen. So for a ring with tokens 10, 20 and 30, the partitioner returns the second host for a key with hash 15.
This host is then fed into the replication strategy, which returns the remaining f-1 replica hosts - although it is important to note that there is nothing special about the host the partitioner chose; all f replicas are equal.
In a single data center environment, the replication strategy known as the simple strategy will be used. This simply chooses the next f-1 nodes around the ring.
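To make this concrete, here is a minimal Python sketch of the partitioner lookup plus the simple strategy, using the ring from the example above. This is just an illustration, not Cassandra's actual code; the host names are made up.

```python
from bisect import bisect_right

def find_replicas(key_hash, ring, f):
    """Return the f replica nodes for a key hash.

    ring is a sorted list of (token, node) pairs. The partitioner picks
    the partition with the smallest token larger than the hash (wrapping
    around), and the simple strategy adds the next f-1 nodes clockwise.
    """
    tokens = [token for token, _ in ring]
    start = bisect_right(tokens, key_hash) % len(ring)  # wrap past the largest token
    return [ring[(start + i) % len(ring)][1] for i in range(f)]

# The example from the text: tokens 10, 20 and 30; a key hashing to 15
# lands on the second host, and with f=2 the third host is also a replica.
ring = [(10, "host1"), (20, "host2"), (30, "host3")]
print(find_replicas(15, ring, f=2))  # ['host2', 'host3']
```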
This means that for k to be unavailable, these specific f nodes must fail simultaneously. This is really good - failures can only occur if f adjacent nodes fail simultaneously. In fact, as long as every f'th node is up your cluster will be available. So Cassandra can actually tolerate more than f simultaneous failures and still be up.
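As a quick illustration of the "every f'th node" point, here is a hypothetical availability check, again assuming the simple strategy and no Virtual Nodes:

```python
def cluster_available(n, f, down):
    """Without Virtual Nodes, data is unavailable only if some run of
    f consecutive ring positions is entirely down."""
    return not any(all((i + j) % n in down for j in range(f))
                   for i in range(n))

# 6 nodes, f=2: three nodes down (more than f) but no two adjacent,
# so every key still has a live replica.
print(cluster_available(6, 2, down={0, 2, 4}))  # True
# Two adjacent failures, though, take some data offline.
print(cluster_available(6, 2, down={2, 3}))     # False
```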
Distribution factor and virtual nodes
It is now helpful to introduce the concept of distribution factor. Unlike replication factor, this is a property of each node rather than a global property: it says how many nodes (real nodes, that is) the node shares data with. Without Virtual Nodes, the discussion above can be summarised by saying that the distribution factor of every node is equal to the replication factor f.
If a node has distribution factor d, any f nodes failing out of those d cause a cluster failure. If d=f, f specific nodes must fail for the cluster to be considered failed. For larger d with constant f, failure becomes more likely since the choice of nodes is less restrictive. The smaller d is (with constant f) the less the chance of a cluster failure.
Virtual Nodes, however, means that each real node has many virtual nodes. If the number of virtual nodes is large compared to the number of nodes then it is very likely that all nodes will share data with all other nodes. This means d=n (for n nodes), the largest possible. On the face of it this is bad - failure is more likely with Virtual Nodes, since if any f nodes fail, the cluster will be considered failed.
The above is true: in a given time period you are more likely to have a complete cluster outage when using Virtual Nodes. However, this isn't what most people care about. Most people care about uptime. If your cluster is unavailable for 1 second in a given year that is pretty good and users are unlikely to notice. Therefore, the time the cluster is down for is what we actually care about - this is what is being asked for when people request 99.999% uptime, for example.
In the event of a serious node failure, the time that it takes to create another replica of the lost data is crucial. If a node has failed and then another one fails shortly after, data unavailability is more likely. We need to know the rebuild time to estimate the downtime. For anyone unlucky enough to have to run it, 'nodetool removetoken' is the Cassandra command that performs this rebuild.
The rebuild involves re-replicating (copying) the data that lived on the failed node, to ensure there are enough replicas of it somewhere else. This can involve all distribution factor nodes, since they can copy in parallel. Without Virtual Nodes (d=f), the node that assumes the token of the dead node receives all the data, so that node is the bottleneck. With Virtual Nodes, however, all nodes send and receive data in parallel, so the rebuild is n times faster.
We're almost ready to state exactly what we want to know the probability of. We just need one last concept: replica sets. A replica set is a set of nodes that hold copies of some common data. Without Virtual Nodes, there are n replica sets, each consisting of f adjacent nodes in the ring. With Virtual Nodes, every set of f nodes is a replica set, so there are n choose f in total.
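To get a feel for how fast that number grows, here is a quick sketch using Python's math.comb; n = 100 is just an example cluster size, and it assumes enough virtual nodes that every set of f nodes really does share some data.

```python
from math import comb

n, f = 100, 3  # example cluster size and replication factor

without_vnodes = n         # one replica set of f adjacent nodes per ring position
with_vnodes = comb(n, f)   # any f of the n nodes form a replica set

print(without_vnodes)  # 100
print(with_vnodes)     # 161700
```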
Now the condition for cluster failure can be made more precise: the cluster fails if all nodes in any replica set fail within the rebuild time. This means there are two competing effects: high distribution factor makes a failure more likely but also makes the rebuild time smaller. Which one wins depends on the combinatorics of the replica sets.
The idea is to work out the probability of all nodes in a replica set failing, then multiply by the number of replica sets. This overcounts because nodes are in more than one replica set, but since the probability of any node failing is small anyway it is approximately correct.
For small times, the failure probability of a node scales linearly with the length of time. The rebuild time t is the window we are interested in, so the probability of a given node failing within it is proportional to t. All replica sets contain f nodes, so the probability of all of them failing within time t is proportional to t^f. To find the probability of failure per fixed unit of time we divide by t, so the overall probability is proportional to t^(f-1).
To compare the probabilities with and without Virtual Nodes, let's set t to be the rebuild time without Virtual Nodes. The per-replica-set failure probability is then proportional to t^(f-1) without Virtual Nodes and (t/n)^(f-1) with, since the rebuild is n times faster.
Then we need to multiply by the number of replica sets, since any of these can fail to cause an outage. There are n replica sets without Virtual Nodes and (n choose f) with. So the overall probabilities are proportional to

n * t^(f-1)

without Virtual Nodes and

(n choose f) * (t/n)^(f-1)

with. The ratio with to without is

[(n choose f) * (t/n)^(f-1)] / [n * t^(f-1)] = (n choose f) / n^f ≈ 1/f!

where the approximation uses (n choose f) ≈ n^f / f! for n much larger than f.
This means Virtual Nodes decreases the failure probability by a factor of f! - replication factor factorial. Commonly, replication factor 3 is used, so your expected downtime is 3! = 6 times less with Virtual Nodes!
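To check the arithmetic, here is a small sketch with example numbers (n = 100 nodes; the shared constant of proportionality is dropped since it cancels in the ratio):

```python
from math import comb, factorial

n, f = 100, 3   # example cluster size and replication factor
t = 1.0         # rebuild time without Virtual Nodes, in arbitrary units

# Overall failure probabilities, up to a shared constant of proportionality.
p_without = n * t ** (f - 1)                # n replica sets, rebuild time t
p_with = comb(n, f) * (t / n) ** (f - 1)    # n choose f sets, rebuild time t/n

print(p_with / p_without)   # 0.1617 for n = 100...
print(1 / factorial(f))     # ...approaching 1/3! = 0.1667 as n grows
```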
In this post, we've seen that, even though a complete cluster outage is more likely with Virtual Nodes, the rebuild time is sufficiently shorter that the total cluster outage time will actually be lower. The key to this is distributed rebuild, which only becomes an option with Virtual Nodes; it more than counteracts the increased likelihood of an outage.
This, together with all the other benefits of Virtual Nodes, is available in Acunu Reflex v3.