Cassandra under heavy write load, Part i
By Richard Low
9 Mar 2011
Category:
Technical Articles
Recently I’ve been investigating how Cassandra performs (and comparing a vanilla install to our distribution). This is the first in a series of posts on my findings, in which I’ll discuss the impact on read performance of a period of intense writes. The headline is that although point reads are not adversely affected, range reads can be slowed down by two orders of magnitude in the immediate aftermath of a 30 minute write burst, an effect taking one hour to clear up.
Inserts
Cassandra is a heavily write-optimised system that can process hundreds of thousands of writes per second. Writes are queued up in memory (and written to a commit log for durability) and flushed periodically to disk into an SSTable. All writes produce sequential I/O, which is what gives Cassandra its high throughput. In the background, Cassandra merges SSTables together into larger SSTables (a process called ”compaction”). If left long enough with few writes, Cassandra will compact all data into a few large SSTables. But in practice there will always be many SSTables lying around. The testing began by inserting 100 million rows into Cassandra 0.7.3, each containing 1 column with 50 bytes of random data. I inserted at the maximum possible rate with 8 client threads. The raw data size on disk was about 13 GB. This took 32 minutes on a 4 disk (1 commit log, 3 data), 4 core box with 6 GB RAM. This equates to about 52,000 keys per second. Since I want to perform range queries later on, I used the ByteOrderedPartitioner.
Reads
To read the value associated to a key, the particular SSTable containing that key needs to be identified. One way to do this is to query each SSTable in turn to see if the key is in it, but that would require at least one disk I/O per SSTable, which severely restricts the read throughput. Cassandra gets around this problem by using Bloom filters. A Bloom filter is a memory efficient data structure that can answer membership queries with high accuracy, and yet is usually small enough to fit in memory (a naive implementation performs very poorly on disk); with one filter per SSTable, the correct SSTable can be located without any disk I/O, which enables Cassandra’s reads to be served with approximately one I/O per query. Immediately after the inserts described above, there were 108 SSTables in the data directory. To show the Bloom filters working, I performed random point gets, choosing rows and columns randomly, ensuring that all requests existed. Cassandra performed 71 lookups per second. Querying all 108 SSTables would take at least 0.4 seconds, so this proves the Bloom filters are helping. This is assuming each of the three disks can serve 100 I/Os per second (a seek takes about 10ms). The read rate increased to 185 per second once Cassandra had finished compacting. This is close to optimal – the system can only serve 300 random I/Os per second. Two reasons why the performance was slower immediately after the writes are that compactions were ongoing, limiting the available I/Os for reading, and that the number of false positive checks required will be higher for more SSTables.
Range Queries
However, Bloom filters do not work for range queries. A Bloom filter cannot answer the question ‘is any key within the range a to b present?’. Without storing some other data structure in memory, a range query will require every SSTable to be queried. This results in a blow up in the number of I/Os by the number of SSTables, so we should expect that after lots of sustained writes, range queries will be slow. Straight after finishing the inserts, I performed some row range queries (a get_range_slices call, choosing a random start row key andread limiting the query to two rows). The throughput was about 0.4 queries per second, which is low, but close to what we should expect given that the data is fragmented over 108 SSTables: the maximum possible is 2.8 per second, ignoring the ongoing compaction work. I then scheduled a minute of range queries every five minutes, to see how the performance changed as compaction progresses. The results are in the graph below. The performance doesn’t pick up until compaction is mostly complete, which takes almost one hour. Even then, when Cassandra has stopped compacting, there were eight SSTables remaining and the range rate was still only 6 per second. I triggered a major compaction (which took 16 minutes) to put all the data in one SSTable. Now 45 range queries were served a second – about eight times before the major compaction. This is much more respectable, but still quite a bit under the 300 I/Os per second available, which indicates Cassandra is performing multiple I/Os per query.

Write spikes bad for business
So, a vanilla implementation of Cassandra gives reasonable range query performance only after it’s been left alone long enough to complete its compactions, and in particular it will perform horribly in the aftermath of a period of sustained writes. Of course this is not a desirable situation as the end user experience will suffer in the meantime; let us know if this is something you are experiencing. This is an instance of the kind of tuning inconvenience Acunu’s distribution of Cassandra eliminates. In the next post we’ll reveal some details about what we’re doing to deal with this specific problem.
Hardware and software specs
This was run on a 4 core (8 including hyperthreaded cores) Xeon E5620 2.4 GHz system with 6 GB RAM. Cassandra had a 3 GB heap, one dedicated commit log disk and three data disks in RAID0. Using CentOS 5.5 with OpenJDK java version 1.6.0_17. Cassandra version was 0.7.3.