under #blue #lsmtk

Triangular Compaction in lsmtk

lsmtk, short for the log-structured-merge-toolkit is a new log-structured data structure intended to serve as a proving ground for a new compaction algorithm called triangular compaction. Triangular compaction makes the observation that moving data through multiple levels of an LSM-tree at once will be more efficient that moving data one level at a time.

Existing LSM-trees move data one level at a time through exponentially larger levels, which practically guarantees that write amplification will be equal to the number of levels multiplied by the growth factor between levels. To move one level requires reading 1MiB from level i and 10MiB from level j to write 11MiB to level j. For 7 levels, the write amplification will be at least 77x.

Moving to a Graph

lsmtk implements an LSM-tree by way of an LSM-graph. An LSM-graph uses a strongly-connected-components algorithm to find the LSM-tree structure. This allows lsmtk to be used for batch ingest; the sorted string tables (ssts) can be generated elsewhere and ingested/compacted into the LSM-graph. This overlap is handled by traditional LSM-trees by placing newly ingested files in level 0. The graph will create “knots” in the tree where overlapping files exist.

Triangular Compaction

Triangular compaction is at the heart of lsmtk. When examining the LSM-graph, the algorithm considers every strongly-connected-component (typically a single SST) and then projects the transitive closure of its overlap at the next n levels above said component. If the component is at level i, then levels i to i + n will be considered. The algorithm selects the best-case from among all triangles and levels in order to maximize the ratio of sum([i, i+n-1]) to sum([i, i+n]). The compaction with the highest ratio will be selected.

The intuition for why this works can be seen in this Python script:

TARGET_FILE_SIZE = 4 * 1024**2
LSM_LEVEL_0_NUM_FILES = 8
LSM_NUM_LEVELS = 11
LSM_LOWER_LEVEL_NUM_FILES = LSM_LEVEL_0_NUM_FILES
LSM_UPPER_LEVEL_NUM_FILES = LSM_LOWER_LEVEL_NUM_FILES * 2**ceil(LSM_NUM_LEVELS) # raw

sums = 0
ingest = 0
for i in range(2**ceil(LSM_NUM_LEVELS)):
    bits = 0
    while i and (1 << bits) & i:
        bits += 1
    sums += (1 << bits) * TARGET_FILE_SIZE * LSM_LEVEL_0_NUM_FILES
    ingest += TARGET_FILE_SIZE * LSM_LEVEL_0_NUM_FILES
WRITE_AMPLIFICATION = sums / ingest

This works out to an average write amplification of 6.5.

Evaluation

Here we take a look at a workload that’s quite adversarial. I generated 100GiB of random keys and values where each key was 8 to 16 bytes of uniformly random ASCII data, and each value was 64 to 256 bytes of uniformly random ASCII data. I picked uniformly at random data because it represents a worst case for LSM trees. Any skew in the data would be likely to increase the effectiveness of the algorithm.

All benchmarks are run on a Lenovo Thinkpad X280 with Core i7-8550U CPU, \unit{8}{\giga\byte} of RAM, and Samsung SSD 980 PRO 2TB hard drive running Ubuntu 22.04. The A/C adapter was connected and power governor set to performance mode for all trials.

First, we take a look at the rate of ingest. We set this to be rate limited to 8MiB/s of ingest. I once saw a RocksDB analysis on Twitter that said 10MiB/s was approximately what to expect from an SSD once compaction overhead for filling the disk was taken into account. This graph shows the number of bytes ingested per second. Once 1GiB was ingested the LSM-tree was left to its own devices until quiescence (about 5-10 minutes).

Bytes Ingested

Second, we take a look at the number of bytes written by compaction. This is determined by the ingest rate. In the prototype, the LSM-graph is constructed from scratch each and every time, which lead to the single compaction thread becoming CPU bottlenecked. In a future version, multiple threads of compaction will be supported at once, and the graph computation will be incrementally updated.

Bytes Written by Compaction

Taking the ratio of the second graph to the first tells us our write amplification. For the start of the experiment, write amplification is above the theoretical 6.5x amplification, but rapidly settles into something empirically less.

Write Amplfication

Status of lsmtk

lsmtk is a prototype that needs to adopt existing good ideas in the LSM space. A future release will adopt the following:

Compaction of tombstones. Currently not supported.
Grandparent overlap bytes, generalized to make the outputs of compaction adhere to a triangular shape.
Concurrent compaction threads for non-overlapping triangles.
Back-pressure against ingest
Gathering of nodes from Level0 that overlap the range of keys from levels 1 to n.

lsmtk is available via the blue project on GitHub.