Log file systems seem to be enjoying a resurgence in popularity, fuelled by the new storage companies and projects marrying SSDs and append-only copy-on-write (CoW) B-trees. What’s it all about?
The argument typically goes like this: SSDs are great at small random reads (4KB pages), but need to be written in large blocks, and append-only CoW B-trees are great at this (see the toy picture to the left).
At first glance, this seems true — append-only CoW B-trees and log-structured SSDs seem like a great match, offering full versioning, fast updates and great read performance. A log-structured CoW update requires two things: (1) a walk down the tree to find the path, and (2) a rewrite of that path. (1) is likely to be random IO unless the path is cached, and the aim of the log file system is to make (2) completely sequential. Examples include ZFS, Btrfs, Nimble Storage, RethinkDB and CouchDB.
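To make (1) and (2) concrete, here is a minimal Python sketch of an append-only CoW update, under simplifying assumptions: node splits and rebalancing are omitted, and the names (Node, log, insert) are illustrative rather than taken from any of the systems above. Every node is immutable; an insert copies the root-to-leaf path and appends the new nodes to the end of a log.

```python
log = []  # the append-only log; list indices stand in for device offsets

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # separator keys (internal) or entry keys (leaf)
        self.children = children  # log offsets of children (internal nodes only)
        self.values = values      # values (leaf nodes only)

def append(node):
    """Step (2): appending a node is a purely sequential write."""
    log.append(node)
    return len(log) - 1           # the node's new address

def insert(root_off, key, value):
    """Copy the root-to-leaf path with the update applied; return the new root."""
    node = log[root_off]          # step (1): a random read, unless cached
    if node.children is None:     # leaf: rewrite it with the new entry added
        entries = dict(zip(node.keys, node.values))
        entries[key] = value
        keys = sorted(entries)
        return append(Node(keys, values=[entries[k] for k in keys]))
    # internal node: recurse into the covering child, then rewrite this node too
    i = next((j for j, k in enumerate(node.keys) if key < k), len(node.keys))
    child = insert(node.children[i], key, value)
    return append(Node(node.keys,
                       children=node.children[:i] + [child] + node.children[i+1:]))

root = append(Node([], values=[]))  # an empty tree
root = insert(root, "a", 1)
root = insert(root, "b", 2)         # old roots still reference the old versions
```

Note that every insert appends a whole path of nodes: this is exactly the property that makes writes sequential, and (as we’ll see below) also the source of the space blowup.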
What’s wrong with a normal B-tree?
Let’s start by considering why in-place updates to B-trees fail to give good performance on SSDs. The figure below shows what happens to a fresh Intel X25-M SSD (*) under three different workloads: (1) write a random 4KB buffer to a random 4KB-aligned offset, (2) as (1) but with 512KB buffers and 512KB-aligned offsets, and (3) as (1) but choosing sequential offsets. Sequential writes are great, even after overwriting the device several times. Random performance is very different – at 4KB, the performance is just terrible, but something interesting happens with 512KB buffers. The device’s stated capacity is 160GB, and once the cumulative volume written reaches roughly this point, performance drops off dramatically. The take-away message is this: to get consistently high performance from this device, we need to do something else. B-trees, or any other random-write-intensive data structure, won’t work.
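For the curious, here is a much-simplified, synchronous sketch of those three workloads; the real runs used Linux AIO direct to the device with queue depth 32 (see the footnote below), and DEVICE is a hypothetical placeholder, so only point this at a device whose contents you are happy to destroy.

```python
import os, random, time

DEVICE = "/dev/sdX"            # hypothetical placeholder: this WILL overwrite data
DEVICE_SIZE = 160 * 10**9      # the drive's stated capacity

def write_workload(buf_size, sequential, total_bytes):
    """Write total_bytes in buf_size chunks at aligned offsets; return bytes/sec."""
    fd = os.open(DEVICE, os.O_WRONLY)
    buf = b"\0" * buf_size
    written, start = 0, time.time()
    while written < total_bytes:
        if sequential:
            off = written % DEVICE_SIZE                 # advance linearly, wrapping
        else:
            off = random.randrange(DEVICE_SIZE // buf_size) * buf_size  # aligned random
        os.pwrite(fd, buf, off)
        written += buf_size
    os.close(fd)
    return written / (time.time() - start)

# (1) 4KB random, (2) 512KB random, (3) 4KB sequential
for size, seq in [(4096, False), (512 * 1024, False), (4096, True)]:
    print(size, "sequential" if seq else "random", write_workload(size, seq, 10**9))
```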
The reason for the drop-off once the write volume reaches the device capacity is quite complex, and depends on the internal structure of the device — if you’re interested, read this great MSR tech report for a simulation-based analysis of different SSD architectures. The basic reason is that although the flash memory chips have a 512KB erase block, most SSDs implement an internal log structure (the magic ‘flash translation layer’, or FTL) for several reasons, most notably because the bandwidth of the individual memory chips is relatively low, and to enable wear leveling and error correction. This often makes the effective logical erase block size much larger, typically hundreds of MBs for recent MLC devices. The result is that writes are at the mercy of the device’s FTL, which is the part manufacturers keep quiet and closed.
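To see why, here is a toy write-amplification model: an assumption-laden sketch of a generic log-structured FTL with greedy garbage collection, not any vendor’s actual design. All host writes are appended to an open erase block; once no erased blocks remain, the FTL must clean a block by relocating its still-live pages, so each random host write costs extra physical writes.

```python
import random

PAGES_PER_BLOCK = 128    # e.g. a 512KB erase block holding 4KB pages
NUM_BLOCKS = 64
LOGICAL_PAGES = int(NUM_BLOCKS * PAGES_PER_BLOCK * 0.9)   # 10% over-provisioning

valid = [set() for _ in range(NUM_BLOCKS)]   # live logical pages per erase block
where = {}                                   # logical page -> block holding its data
free = list(range(1, NUM_BLOCKS))            # erased blocks, ready for writing
cur, used = 0, 0                             # the open block, and pages used in it
flash_writes = 0                             # physical page writes, incl. GC copies

def gc():
    """Greedy cleaning: erase the block with the fewest live pages, relocating them."""
    global cur, used
    victim = min((b for b in range(NUM_BLOCKS) if b != cur),
                 key=lambda b: len(valid[b]))
    movers = list(valid[victim])
    valid[victim].clear()
    cur, used = victim, 0                    # erase the victim; make it the open block
    for lp in movers:
        write(lp)                            # every relocation is an extra flash write

def write(lp):
    """Append logical page lp to the log, cleaning when the device runs out of room."""
    global cur, used, flash_writes
    while used == PAGES_PER_BLOCK:           # open block is full: find another
        if free:
            cur, used = free.pop(), 0
        else:
            gc()
    if lp in where:
        valid[where[lp]].discard(lp)         # invalidate the stale copy, wherever it is
    valid[cur].add(lp)
    where[lp] = cur
    used += 1
    flash_writes += 1

for lp in range(LOGICAL_PAGES):              # fill the device once
    write(lp)
flash_writes, host_writes = 0, LOGICAL_PAGES * 2
for _ in range(host_writes):                 # steady state: small random overwrites
    write(random.randrange(LOGICAL_PAGES))
print("write amplification: %.1fx" % (flash_writes / host_writes))
```

With less spare space, or a weaker cleaning policy, the amplification grows quickly; large sequential writes largely avoid it, because they tend to invalidate whole erase blocks at a time.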
(*) Model Number: INTEL SSDSA2M160G2GC, Firmware Revision: 2CV102M, writes use Linux AIO direct to device with queue depth 32.
So, are log file systems perfectly suited to SSDs thanks to the copy-on-write B-tree? Not so fast. There are two major problems: space blowup and garbage collection.
The CoW B-tree has a potentially big space blowup: to rewrite a 16-byte key/value pair in a tree of depth 3 with 256KB blocks, you may have to do 3x256KB of random reads and then write 768KB of data. In practice, some of these nodes will be cached (avoiding the reads, and letting several updates share one path rewrite), but for random updates to large datasets, this is pretty close to what happens. Even if you don’t care about space utilisation, when the device is full you’ll be writing, on average, a lot of data per small random update, which means you’re no longer fast at writing. Unfortunately, other than heuristic tweaks or giving your machine gigantic amounts of RAM, this is an inherent problem for append-only CoW indexes.
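The arithmetic, spelled out (worst case, with nothing cached):

```python
BLOCK = 256 * 1024     # B-tree node size: 256KB
DEPTH = 3              # root-to-leaf path length
ENTRY = 16             # one key/value pair, in bytes

reads  = DEPTH * BLOCK
writes = DEPTH * BLOCK
print(reads // 1024, "KB of random reads to find the path")   # 768 KB
print(writes // 1024, "KB written to rewrite the path")       # 768 KB
print(writes // ENTRY, "x write amplification")               # 49152x per 16-byte update
```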
The classic Achilles heel of a log file system is garbage collection (cleaning) — recovering invalidated (e.g. overwritten) blocks in order to reclaim sufficiently large contiguous regions of free space so that future writes can be efficient. Very few guarantees are known for garbage collection in log file systems, particularly when the system does not experience idle time, or is under low free space conditions. To make matters worse, the space blowup described above means that CoW trees generate a lot of extra work for the garbage collector — at a 50x space blowup, the garbage collector has to work 50x harder to keep ahead of the input stream.
Soules et al. (2003) compare the metadata efficiency of a versioning file system using both CoW B-trees and a structure (CVFS) based on the multi-version B-tree (MVBT). They find that, in many cases, the size of the CoW metadata index exceeds the dataset size itself. In one trace, the versioned data occupies 123GB, yet the CoW metadata requires 152GB while the CVFS metadata requires only 4GB, a saving of 97%. For more on this topic, see this post (which reports 2GB of log data per 50MB of index data), Bonwick’s blog entry on ZFS space maps, this discussion of WAFL free space, and Section 5 of our arXiv paper.
Stratified B-trees dominate CoW B-trees, with or without log file systems. They can be written without append-only logs and heuristic-based garbage collectors. The stratified B-tree is the first data structure to offer provably optimal performance for full versioning (allowing updates in far less than 1 IO per update on average); it uses asymptotically optimal O(N) space, offers an optimal range of trade-offs between updates and queries, and can generally avoid random IO for both updates and range queries. In particular, one construction offers updates two orders of magnitude faster than CoW B-trees, and can answer range queries for any version mostly using sequential IO, again giving much better results than CoW B-trees.
So, yes — append-only CoW B-trees can perform significantly better on SSDs than on rotating disks (the random reads needed for updates are much cheaper on flash), but stratified B-trees also use only large sequential writes for updates (and indeed perform almost no random reads on updates, on average). With 128KB blocks and 128-byte entries, we find that stratified B-trees perform roughly one to two orders of magnitude faster than CoW B-trees, on both disks and SSDs. The paper contains some preliminary prototype performance results. Stay tuned for some real-world results from the kernel store.
Plug: Acunu’s technology enables a new breed of Big Data applications. If you need fast, reliable, solid, versioned Cassandra, try our beta.