Written by Sam Overton
For distribution of data between many hosts in a cluster, Cassandra uses a technique called consistent hashing. In Cassandra's version of consistent hashing, each host is assigned a single "token" in a "hash-space", which determines the portion of data that host will store.
Fig 1. Partitioning of hash-space between 4 hosts in a cluster
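To make this concrete, here is a minimal sketch in Python of a ring with one token per host. The host names are illustrative; the MD5 hash and 2^127 token space are modelled on Cassandra's RandomPartitioner.

```python
import bisect
import hashlib

# Token space of Cassandra's RandomPartitioner (keys are hashed with MD5).
RING_SIZE = 2 ** 127

def token(key: str) -> int:
    """Hash a key onto the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

class Ring:
    def __init__(self):
        self.tokens = []   # sorted list of host tokens
        self.owners = {}   # token -> host

    def add_host(self, host, host_token):
        bisect.insort(self.tokens, host_token)
        self.owners[host_token] = host

    def host_for(self, key):
        # A key belongs to the host owning the first token at or after the
        # key's hash, wrapping around to the start of the ring.
        i = bisect.bisect_left(self.tokens, token(key)) % len(self.tokens)
        return self.owners[self.tokens[i]]

ring = Ring()
for n, host in enumerate(["host1", "host2", "host3", "host4"]):
    # Evenly spaced tokens give each of the 4 hosts exactly 1/4 of the ring.
    ring.add_host(host, n * RING_SIZE // 4)

print(ring.host_for("some-row-key"))
```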
There are some disadvantages with this method: when adding a host to a balanced cluster (where each host stores 1/N of the data), the cluster becomes unbalanced, because the new host can only take a contiguous portion of a range from one existing host. For the cluster to remain balanced, the new host would have to take an equal amount of data from every other host, so that each host ends up storing 1/(N+1) of the data.
The recommended workaround is to double the number of hosts when scaling up, or halve it when scaling down. The alternative is to move the tokens of all the other hosts so that they become balanced again. These may be viable options for small clusters, but they become unwieldy at scale.
Fig 2. Balancing a cluster when adding hosts requires moving tokens (top) or doubling the number of hosts (bottom)
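Continuing the sketch above, the imbalance is easy to demonstrate: a fifth host with a single token can only split one existing host's range.

```python
def ownership(ring):
    """Fraction of the ring owned by each host (a host owns the interval
    from the previous token up to each of its own tokens)."""
    fractions = {}
    for i, t in enumerate(ring.tokens):
        size = (t - ring.tokens[i - 1]) % RING_SIZE  # i == 0 wraps around
        host = ring.owners[t]
        fractions[host] = fractions.get(host, 0.0) + size / RING_SIZE
    return fractions

# host5's token splits host2's range: those two now own 1/8 each, while the
# other three hosts still own 1/4 -- not the balanced 1/5 per host.
ring.add_host("host5", RING_SIZE // 8)
print(ownership(ring))
```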
These problems make for considerable operations complexity. Cassandra administrators must concern themselves with the careful selection of tokens to ensure balanced load, and must manually rebalance when adding hosts. One way to address these issues is with virtual nodes. A consistent hashing scheme which makes use of virtual nodes allows each host to be responsible for multiple, non-contiguous ranges of the data. Use of virtual nodes has the following benefits:
- A host joining the cluster assumes responsibility for an even portion of the data, taken evenly from each other host
- If a host fails, the load will be spread evenly across other hosts in the cluster
- Rebuilding a dead host is faster, because it involves streaming data from every other host in the cluster
- The number of partitions assigned to a host can be chosen according to its capacity, allowing for heterogeneous hardware
Virtual node schemes
At Acunu we wanted to bring virtual nodes support to Cassandra to alleviate some of these operations headaches, so I spent some time looking at the different approaches we could take. The main variants I found fall broadly into three categories:
Random token assignment
Assign each host T random tokens. Each token assigns a partition to its host, where the partition is defined by the interval between that token and the previous token on the ring. When a host joins the ring it is assigned T new random tokens, each of which transfers a portion of an existing partition to the new host.
This method is used by libketama for distribution among memcached servers, and is a generalisation of Cassandra's existing distribution method, which is the special case T=1.
Fig 3. Partitioning by random token assignment
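A sketch of this scheme, reusing the Ring and ownership helpers from above. T = 256 is purely an illustrative choice:

```python
import random

def add_host_with_vnodes(ring, host, T=256):
    # Each host takes T random tokens, so it owns T non-contiguous slices.
    for _ in range(T):
        ring.add_host(host, random.randrange(RING_SIZE))

vring = Ring()
for host in ["host1", "host2", "host3", "host4"]:
    add_host_with_vnodes(vring, host)

# Each of the joining host's T tokens splits some existing range, so it takes
# a little data from many hosts instead of a large slice from one.
add_host_with_vnodes(vring, "host5")
print(ownership(vring))  # each host owns roughly 1/5
```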
Fixed partition assignment
Divide the hash-space into Q evenly sized partitions, and assign Q/N partitions per host (where N is the number of hosts). When a host joins the ring, it steals partitions evenly from the other hosts such that there are still Q/N partitions per host. When a host leaves the ring it distributes its partitions evenly to the other hosts.
This method is used by Dynamo (described in the paper as Strategy 3) and by Voldemort.
Fig 4. Fixed partition assignment
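A sketch of the join step under this scheme. The greedy "steal one partition at a time from the most-loaded host" rule is an illustrative simplification, not Dynamo's or Voldemort's exact algorithm:

```python
Q = 1024  # fixed number of partitions, chosen up front (illustrative)

def rebalance_join(assignment, new_host):
    """Move partitions onto new_host until loads are as even as possible."""
    loads = {new_host: 0}
    for h in assignment.values():
        loads[h] = loads.get(h, 0) + 1
    target = len(assignment) // len(loads)
    while loads[new_host] < target:
        donor = max(loads, key=loads.get)
        # Move one partition from the most-loaded host to the new host.
        part = next(p for p, h in assignment.items() if h == donor)
        assignment[part] = new_host
        loads[donor] -= 1
        loads[new_host] += 1

# Start with 4 hosts holding Q/4 partitions each (round-robin assignment).
assignment = {p: f"host{p % 4 + 1}" for p in range(Q)}
rebalance_join(assignment, "host5")
counts = {}
for h in assignment.values():
    counts[h] = counts.get(h, 0) + 1
print(counts)  # roughly Q/5 partitions per host
```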
Automatic sharding
Assign each host one token, so that each host has a single partition (defined by the token intervals). When the data written to a partition exceeds a threshold, a new token is introduced in the interval corresponding to that partition, splitting it into two. The new partition is assigned to a host with a lower data load.
This method is similar to the sharding performed by Bigtable or Mongo auto-sharding.
Fig 5. Automatic sharding: when a partition contains too much data it is split into smaller partitions
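A sketch of the splitting mechanism, reusing RING_SIZE from the first sketch. The 64 MB threshold, the midpoint split, and the least-loaded placement rule are illustrative simplifications (Bigtable, for instance, splits on actual key boundaries):

```python
MAX_PARTITION_BYTES = 64 * 2 ** 20  # illustrative 64 MB threshold

class Partition:
    def __init__(self, lo, hi, host, size=0):
        self.lo, self.hi, self.host, self.size = lo, hi, host, size

def write(partitions, load, part, nbytes):
    """Record nbytes written to part, splitting it if it grows too large."""
    part.size += nbytes
    load[part.host] += nbytes
    if part.size > MAX_PARTITION_BYTES:
        # Split at the midpoint; the new half goes to the least-loaded host.
        mid = (part.lo + part.hi) // 2
        new_host = min(load, key=load.get)
        moved = part.size // 2
        partitions.append(Partition(mid, part.hi, new_host, moved))
        part.hi, part.size = mid, part.size - moved
        load[part.host] -= moved
        load[new_host] += moved

load = {"host1": 0, "host2": 0, "host3": 0}
partitions = [Partition(0, RING_SIZE, "host1")]
for _ in range(200):                                # 200 x 1 MB writes
    write(partitions, load, partitions[0], 2 ** 20)
print(len(partitions), load)  # partition count grows with data written
```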
In order to evaluate these schemes we need some idea of the implementation differences and how they might affect the performance of a cluster. The main differences between the above schemes are in partition size and number of partitions, and how these two things change with cluster size and data size.
Number of partitions
As part of the cluster metadata we must store the mapping from each partition to the host on which it is stored. In Cassandra this placement mapping is gossiped between hosts: every host must be capable of routing a client request to the correct replica in one hop, and so must have full awareness of data placement. If our partitioning scheme results in more partitions, then this placement metadata will be larger, will put more load on Cassandra's gossip system, and will use more memory to store.
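A back-of-envelope sketch makes the difference between the schemes concrete. All the numbers here are illustrative: N hosts, T tokens per host, Q fixed partitions, B bytes stored, and a 64 MB maximum partition for automatic sharding.

```python
N, T, Q, B = 100, 256, 1024, 100 * 2 ** 40   # 100 hosts storing 100 TB

entries_random = N * T                  # one entry per token: grows with hosts
entries_fixed = Q                       # constant, chosen up front
entries_sharding = B // (64 * 2 ** 20)  # one per max-size chunk: grows with data

print(entries_random, entries_fixed, entries_sharding)
# -> 25600 1024 1638400: auto-sharding's metadata dwarfs the others at scale
```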
Partition size
In Cassandra there are processes which operate on a whole partition at once. An example is repair, which must calculate digests for all of the data stored in a partition so that they can be compared with the data on another replica. Other streaming operations, such as bootstrapping a new host into the cluster, will also read a whole partition. There are disadvantages to having very large partitions: if one of these processes fails or encounters an error then it must be retried, and the larger the partition, the more work is wasted. Since bootstrapping a new host can take many hours, an error in the transfer of one partition is very costly. With smaller partitions we can also stagger these operations to balance the impact on other hosts in the cluster.
On the other hand, we cannot make partitions too small: reading many small partitions can result in the disks spending most of their time seeking between partitions instead of transferring data. We want partitions to be large enough that reading them results in sequential disk access rather than random disk access.
Table 1 shows how these two metrics scale with both data size (B bytes) and cluster size (N hosts) for each scheme.

Table 1. Scaling of number of partitions and partition size

| Scheme             | Number of partitions | Partition size |
|--------------------|----------------------|----------------|
| Random             | O(N)                 | O(B/N)         |
| Fixed              | O(1)                 | O(B)           |
| Automatic sharding | O(B)                 | O(1)           |
First let's look at the Automatic sharding scheme. Partition size is constant, since the scheme splits partitions when they reach a pre-defined maximum size. From the criteria described above this sounds ideal, since we can keep our partitions in the Goldilocks sweet-spot between not-too-big and not-too-small. The drawback is that the number of partitions (and so our placement metadata size) scales linearly with data size. This means that irrespective of how many hosts we have, the more data we store, the more memory and network bandwidth we use just to store placement information. Ideally we want Cassandra to be infinitely scalable, so this is a big problem.
The Fixed strategy has the opposite problem. Placement metadata is constant, since there is a fixed number of partitions: irrespective of the cluster size or the amount of data we store, we always use the same amount of resources to handle placement. The drawback is that partition size increases linearly with data size. To hit the sweet-spot of partition size we would need a good idea, before storing any data, of how much data the cluster will eventually hold, so that we can choose the number of partitions to divide it into. Get this wrong and we have problems later: choose too many partitions (or store too little data) and they are too small; choose too few (or store more data than expected) and they are too big. With this scheme we would just be replacing Cassandra's operations complexity with a new set of operational concerns, and putting artificial limits on our scalability and agility.
The Random strategy sits between the other two schemes. Placement metadata scales linearly with the number of hosts in the cluster, not with data size. This is acceptable, because adding hosts also adds memory and network bandwidth to the cluster, so the increased cost of storing and communicating placement information is accounted for. Partition size increases with data size but decreases as hosts are added, so as we store more data and partitions grow, we can add capacity and partition size stays steady. The natural scale-out growth of the cluster therefore manages partition size automatically, without any operator intervention.
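A quick sketch of that steady state, assuming an average partition size of B/(N·T) under the random scheme, with T = 256 again an illustrative value:

```python
def avg_partition_mb(data_bytes, n_hosts, tokens_per_host=256):
    # Average partition size B / (N * T), expressed in megabytes.
    return data_bytes / (n_hosts * tokens_per_host) / 2 ** 20

# Doubling the data while doubling the hosts keeps partitions the same size.
for tb, hosts in [(10, 10), (20, 20), (40, 40)]:
    size = avg_partition_mb(tb * 2 ** 40, hosts)
    print(f"{tb} TB on {hosts} hosts -> {size:.0f} MB per partition")
```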
In this blog post I have discussed the rationale behind our design choice (random token assignment) for Cassandra virtual node support. More information can be found in the Apache Cassandra JIRA.
In the next instalment I will look at what properties this design has for load balancing and distribution, based on some simulation and early testing.
If you are interested in using or testing virtual nodes, check out Acunu Reflex v3 (contact us for the beta).