Written by Irit Katriel
A rich theory has been developed to explain and compare the complexity of algorithms for data sets that do not fit in a computer's main memory [1].  Yet this model fails to predict how today's Flash SSD devices behave.
The theory is based on the Disk Access Model (DAM), a computational model that abstracts so-called external memory as a block device with simple properties: data is read from and written to the device in blocks of a certain size B, and the complexity of an algorithm is equal to the number of blocks read or written. Be it a cache, a hard disk or a magnetic tape, all that changes is the block size, B. The holy grail is then an algorithm that works well for all B (and does not need to know the value of B). Such an algorithm (called cache oblivious) does not require tuning, and should work well for all levels of the memory hierarchy simultaneously. The DAM has been successful in designing IO-efficient algorithms for sorting, indexing, and more on a variety of storage devices.

However, the DAM fails to predict how today's flash SSDs behave. SSDs show different performance characteristics which are due to the physical properties of flash memory devices: Reading is faster than writing, and while sequential writes are significantly faster than random writes, all reads are fast.

Several researchers (e.g. [2], [3], [4]) noted the inadequacy of existing models which treat reads and writes symmetrically, and suggested to tweak the DAM by dividing B into Br for reads and Bw for writes, where Br is small and Bw is large. The goal is still to keep the number of blocks read or written low. Thus, the model stresses that batching writes is more important than batching reads. So, for example [5]: The B-Tree is replaced by the (Br,Bw)-Tree, which is a Bw-Tree whose nodes have Br-Trees embedded in them; it is read in Br-size blocks and written in Bw-size blocks. And for sorting, the model points to merge-based schemes where at every point in time we read data from several streams but write only to one.

Almost-sequential writes
Fix some chunk size b, and consider these three write patterns, wrapped around to overwrite the disk several times:

  • Sequential: Write b-byte chunks sequentially.
  • Regular Gaps: Write b-byte chunks with a b-byte gap between every two consecutive chunks.
  • Random Gaps: Write b-byte chunks, and after every chunk flip a coin to decide whether to leave a gap as above.
The DAM would consider these write sequences to be equivalent up to constant factors. If pressed to choose, it would predict that for b < 2B, the Regular Gaps workload is slowest, the Sequential workload is twice as fast and Random Gaps is somewhere in the middle.

Here are the results we got with b=64KB on an Intel X25-M, 160GB SSD (whose erase unit, Bw, is 512KB). What we actually see is that the Random Gaps performance is qualitatively different from the Regular Gaps and Sequential workloads, which, far from being separated by a factor of 2, are pretty much identical.


What's going on?
The randomness in the above Random Gaps workload is the obvious culprit for the 'cliff', but the situation is more subtle than that, as the following experiment reveals. We split the device into chunks, and write those chunks in random order. Then, once we have written the entire device once, we rewrite it three more times in the same order. In Python:

writeOrder = range(diskSize/chunkSize)
random.shuffle(writeOrder)
for i in range(4):
    for offset in writeOrder: 
        print "Write", offsets*chunkSize, chunkSize



 Here is the performance with chunkSize=64KB.

Despite the randomness there is no cliff, and the reason is this: the first time we write, the Flash Translation Layer assigns sequential physical addresses to the random sequence we provide. When we rewrite in the same random order, the existing mapping from logical to physical addresses still gives a sequential sequence, which can be rewritten efficiently. Any write pattern that repeats itself works well, which explains why Sequential and Regular Gaps workloads had similar performance -- they are both repetitive.

To summarise: Whether we write sequentially, write every other chunk, or repeat the same random permutation of the chunks in the logical address space, our writes translate into a sequential write pattern in the physical address space. But if we randomise our writes even a little bit between re-writings of the device, as we did in the Random Gaps workload, we see a sharp drop in performance as soon as we have written the device's capacity and the effects of write amplification and constant changes in logical-to-physical address mapping begin to show up.

Conclusion
Simplified models help us understand reality by focusing on what is important and ignoring what is not. A computational model enables us to compare algorithms in terms of those characteristics that the model deems to be important predictors of efficiency. It also guides the algorithm designer towards the schemes that will work well.

Our simple tests show that the DAM and its derivatives are not very good predictors of the performance of write sequences on SSDs. We found a scheme to efficiently use SSDs for the Acunu storage core only once we freed ourselves of the ideas that the DAM planted in our heads (more on that in a future article).
 


Comments


Comments are closed.