Introduction

Log-Structured Merge-Tree (LSM-Tree) is an append-only, key-value data structure that provides high write throughput. Write throughput is the amount of data that can be written to a storage system over a period of time. LSM Tree is used by various databases including Apache Cassandra, LevelDB, and RocksDB just to name a few.

On a high level, LSM appends all incoming data then uses merge sort to handle deduplication and deletions. Underlying LSM are a few in-memory and on-disk data structures which will be explained in the following sections.

Sorted String Table (SSTable)

SSTable is a disk-based data structure consisting of a sorted, immutable sequence of key-value pairs. The sortedness allows for efficient data retrieval using algorithms like binary search.

Source: ScyllaDB

Initially, fresh write logs are buffered to an in-memory data structure called a Memtable. Once the Memtable reaches a configurable threshold, data is sorted and flushed to disk to become a new SSTable. In the diagram above, transactions are also appended to a Commit log or Write Ahead Log (WAL) file to recover from a crash and ensure durability. As more SSTables are added and a certain threshold is reached, the SSTables are merged via compaction to form a larger SSTable. Note that updates to a given key will just append the new key value. Older entries will eventually be removed during compaction.

Compaction

Compaction can be thought of as garbage collection for the LSM tree. It removes keys that are duplicated and/or marked for deletion. Under the hood, it utilizes a multiway merge algorithm to merge multiple SSTables into a new SSTable.

In most implementations, a dedicated background thread is used to perform compaction. There are many ways to customize how and when compaction gets triggered with different tradeoffs. Below are two classic compaction strategies.

Leveled Compaction

Source: RocksDB

Each level is a sorted run consisting of multiple SSTables
When the run of level $i$ is full, it will flush and merge with the run of level $i+1$
Good read performance and lower space amplification since no duplicate keys are present in each level

Sized-Tiered Compaction

Source: Alibaba Cloud

Organizes SSTables into sorted runs based on their size
Every level must accumulate $T$ runs before they are sort-merged
When the number of runs $\geq T$, the whole level is merged to become a larger SSTable and flushed to the next tier