Choosing between B-tree and LSM-tree storage engines is one of the most consequential decisions in database system design. Yet engineers often make this choice based on folklore rather than quantitative analysis. The reality is that both structures offer distinct performance characteristics that can be modeled mathematically, allowing precise predictions about their behavior under specific workloads.
B-trees have dominated relational databases for decades because they provide balanced read and write performance with predictable latency. LSM-trees emerged from log-structured file systems and gained prominence through systems like LevelDB, RocksDB, and Cassandra. They sacrifice some read performance for dramatically improved write throughput—but understanding exactly how much requires rigorous analysis.
This article develops a quantitative framework for comparing these storage engines. We derive write amplification factors from first principles, analyze read performance across different access patterns, and construct a workload characterization model. The goal is to replace intuition with mathematics, giving you the tools to make defensible engineering decisions based on your actual performance requirements.
Write Amplification Math
Write amplification measures the ratio of bytes written to storage versus bytes written by the application. For any storage engine that guarantees durability, this ratio exceeds one—the question is by how much and under what conditions.
For a B-tree with branching factor B and uniform random inserts, the worst-case write amplification approaches O(B) per logical write: changing a single entry forces the engine to rewrite an entire page of roughly B entries. Each insert may also trigger a page split, and maintaining balance requires writing modified pages from leaf to root. With typical page sizes of 4KB-16KB and branching factors around 100-500, a single key-value pair insertion can trigger multiple page writes. In practice, write coalescing amortizes each page flush over many updates, so the steady-state write amplification for B-trees is approximately 2-4x for in-place updates, assuming reasonable page utilization.
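As a rough illustration, the following sketch models page-level B-tree write amplification under two assumed regimes: a flush for every insert (the worst case) and coalesced flushes that absorb many updates before a page is written. The page size, entry size, and flush counts are illustrative assumptions, not measurements of any particular engine.

```python
# Back-of-the-envelope model of B-tree write amplification (illustrative only).

def btree_write_amp(page_size: int, entry_size: int,
                    entries_per_flush: float, pages_per_flush: float = 1.0) -> float:
    """Bytes written to storage per byte of logical application data."""
    bytes_to_storage = page_size * pages_per_flush
    bytes_of_data = entry_size * entries_per_flush
    return bytes_to_storage / bytes_of_data

# Worst case: one 100-byte entry per 8 KiB page flush -> amplification near B.
print(btree_write_amp(8192, 100, entries_per_flush=1))                        # ~82x
# Coalesced: ~30 updates absorbed per flush, plus occasional parent-page writes.
print(btree_write_amp(8192, 100, entries_per_flush=30, pages_per_flush=1.2))  # ~3.3x
```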
LSM-trees achieve their write advantage through sequential I/O. New writes go to an in-memory buffer (memtable), then flush sequentially to disk as immutable sorted runs. The write amplification comes from compaction—merging sorted runs to bound read amplification. For a leveled compaction strategy with size ratio T between levels and L total levels, the write amplification is bounded by O(T × L): each byte may be rewritten up to T times at each of L levels, roughly T/2 times on average. With typical values of T=10 and L=5-7, measured write amplification commonly falls in the 10-30x range, since much of the data never migrates all the way to the deepest level.
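The same arithmetic for leveled compaction, as a minimal sketch: one write for the memtable flush plus, on average, roughly T/2 rewrites at each level, with T per level as the worst-case bound. The values of T and L are the illustrative ones above.

```python
# Leveled-compaction write amplification model (illustrative assumptions).

def lsm_write_amp(size_ratio_t: int, levels_l: int, per_level_factor: float = 0.5) -> float:
    flush = 1.0                                   # memtable -> level 0
    per_level = size_ratio_t * per_level_factor   # average rewrites per level
    return flush + per_level * levels_l

for levels in (5, 6, 7):
    print(levels, lsm_write_amp(10, levels))      # 26.0, 31.0, 36.0
# Measured systems often report less than the model predicts because much of
# the data never migrates all the way down to the deepest level.
```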
This appears counterintuitive—LSM-trees have higher write amplification yet better write throughput. The resolution lies in I/O patterns. B-tree writes are random, limited by disk seek time or flash write latency. LSM-tree writes are sequential, approaching raw media bandwidth. On spinning disks, sequential throughput exceeds random by 100x or more. On SSDs, the gap narrows but remains significant due to write unit alignment and internal parallelism.
The mathematical insight: LSM-tree write amplification of 20x with sequential I/O often outperforms B-tree write amplification of 3x with random I/O. The crossover point depends on storage media characteristics, making this a quantifiable engineering decision rather than a philosophical one.
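To make the crossover concrete, a small sketch divides device bandwidth by write amplification to estimate application-visible write throughput. The bandwidth figures are assumptions chosen for illustration, not benchmark results.

```python
# Effective write throughput = usable device bandwidth / write amplification.
# All MB/s figures below are illustrative assumptions.

def effective_write_throughput(device_mb_per_s: float, write_amp: float) -> float:
    return device_mb_per_s / write_amp

# Spinning disk: ~2 MB/s random (B-tree pattern) vs. ~150 MB/s sequential (LSM pattern).
print(effective_write_throughput(2, 3))      # B-tree: ~0.7 MB/s of user data
print(effective_write_throughput(150, 20))   # LSM:    ~7.5 MB/s of user data

# SSD: ~400 MB/s random vs. ~2000 MB/s sequential -- the gap narrows.
print(effective_write_throughput(400, 3))    # B-tree: ~133 MB/s
print(effective_write_throughput(2000, 20))  # LSM:    ~100 MB/s
```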
Takeaway: Write amplification alone doesn't determine write performance—the I/O pattern matters equally, and LSM-trees trade higher amplification for sequential access that better utilizes storage bandwidth.
Read Performance Trade-offs
Point queries in B-trees require traversing from root to leaf, yielding O(log_B N) I/O operations for N keys. With branching factors of 100+ and modern caching, this typically means 1-3 disk accesses for billion-key datasets. The performance is consistent and predictable—each query follows the same algorithmic path regardless of key distribution or recent write activity.
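The depth arithmetic is easy to check directly; the key count and branching factors below are the ones quoted above.

```python
import math

def btree_lookup_ios(num_keys: int, branching_factor: int) -> int:
    """Page reads for an uncached root-to-leaf traversal: ceil(log_B N)."""
    return math.ceil(math.log(num_keys, branching_factor))

print(btree_lookup_ios(1_000_000_000, 100))   # 5 page reads uncached
print(btree_lookup_ios(1_000_000_000, 500))   # 4 page reads uncached
# With the root and internal levels cached, only the last 1-3 reads touch disk.
```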
LSM-tree point queries face a different complexity. In the worst case, a key might not exist, requiring examination of every level. With L levels, this suggests O(L) I/O operations per query. However, Bloom filters transform this picture dramatically. A Bloom filter with false positive rate p reduces expected I/O to approximately 1 + p × (L-1) for a key residing in the deepest level, the worst case for an existing key. With p=0.01 and L=6, this averages just 1.05 disk accesses—competitive with B-trees.
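The expectation is just as easy to compute; the sketch below reproduces the 1.05 figure and also shows the non-existent-key case, where the expected reads drop to roughly p × L.

```python
def lsm_expected_point_ios(levels_l: int, false_positive_p: float) -> float:
    """Expected reads for a key in the deepest level, with Bloom filters above it."""
    return 1.0 + false_positive_p * (levels_l - 1)

print(lsm_expected_point_ios(6, 0.01))   # 1.05 reads, as in the text
print(0.01 * 6)                          # ~0.06 reads for a key that does not exist
```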
Range scans reveal starker differences. B-trees store keys in sorted order within pages, making range scans sequential after initial tree traversal. The cost is proportional to result set size plus initial seek. LSM-trees must merge results from multiple sorted runs during range scans, increasing I/O and CPU overhead. The merge cost is O(L) per returned key in naive implementations, though optimizations like seek-based scanning reduce this for sparse ranges.
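To see where the merge overhead comes from, here is a minimal sketch of the k-way merge a range scan performs across sorted runs. It uses Python's heapq.merge in place of the per-run iterators a real engine would maintain, and the shadowing rule (the newest run wins for duplicate keys) is a simplifying assumption.

```python
import heapq

def lsm_range_scan(runs, lo, hi):
    """Yield (key, value) pairs in [lo, hi) merged across sorted runs.
    runs[0] is assumed to be the newest run; newer entries shadow older ones."""
    def cursor(idx, run):
        # A real engine would seek within the run; filtering stands in here.
        return ((key, idx, value) for key, value in run if lo <= key < hi)

    merged = heapq.merge(*(cursor(idx, run) for idx, run in enumerate(runs)))
    last_key = None
    for key, _idx, value in merged:
        if key != last_key:              # first (newest) version of each key wins
            last_key = key
            yield key, value

runs = [[("a", "1"), ("c", "3")],        # newer run
        [("b", "2"), ("c", "old")]]      # older run
print(list(lsm_range_scan(runs, "a", "z")))   # [('a', '1'), ('b', '2'), ('c', '3')]
```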
Block caches (page caches) affect both structures but interact differently with their access patterns. B-tree caches benefit from temporal locality in both index pages and leaf pages. LSM-tree caches face the challenge of multiple sorted runs containing overlapping key ranges—caching a block from one run doesn't help queries that hit different runs.
The quantitative framework suggests: for read-heavy workloads with point queries, properly tuned LSM-trees with Bloom filters approach B-tree performance. For range-heavy workloads, B-trees maintain a structural advantage proportional to the number of LSM-tree levels.
Takeaway: Bloom filters close the point query gap between LSM-trees and B-trees, but range scans remain structurally more expensive in LSM-trees due to the merge overhead across sorted runs.
Workload Characterization
Effective storage engine selection requires characterizing workloads along multiple dimensions. The primary axes are: write intensity (operations per second and total volume), read patterns (point queries versus range scans, hit rate), and key distribution (uniform, skewed, sequential).
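These axes can be captured in a small workload descriptor; the field names below are assumptions made for this sketch rather than any engine's configuration surface.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Workload:
    writes_per_sec: float                  # sustained write intensity
    point_reads_per_sec: float             # point-query rate
    range_scan_bytes_per_sec: float        # bytes returned by range scans
    read_hit_rate: float                   # fraction of reads served from cache
    key_distribution: Literal["uniform", "skewed", "sequential"]
```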
Write intensity determines whether write amplification dominates total system cost. For workloads exceeding 10,000 writes per second sustained, the I/O pattern advantage of LSM-trees typically outweighs their higher amplification factor. Below this threshold, B-trees' simpler architecture and more predictable latency often win. The crossover point varies with storage media—faster media shifts it higher.
Read patterns interact with LSM-tree level count. High read-to-write ratios favor B-trees, but the relationship is non-linear. An LSM-tree sized to keep the most recent data in memory can achieve sub-millisecond read latencies despite structural overhead. A useful latency model incorporates memtable size, level-zero file count, and the expected working-set size.
Key distribution profoundly affects both structures. Sequential key insertion can cause B-tree page splits to cascade along the rightmost path, increasing write amplification in engines that lack rightmost-split optimizations. LSM-trees handle sequential keys efficiently since each sorted run remains dense. Conversely, highly skewed read patterns (hotspots) favor B-trees due to cache effectiveness—frequently accessed pages remain resident.
A practical decision framework emerges: calculate the ratio R of range scan bytes to point query count. When R exceeds approximately 1000 bytes per point query, B-trees dominate. For write-intensive workloads with R below this threshold and adequate memory for Bloom filters, LSM-trees provide superior throughput. The exact threshold depends on storage characteristics, available memory, and latency requirements—but the framework provides a quantitative starting point for analysis.
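A sketch of that decision rule follows; the 1000-byte and 10,000-write thresholds are the figures quoted above and should be re-derived for your own storage media, memory budget, and latency targets.

```python
def suggest_engine(writes_per_sec: float,
                   point_reads_per_sec: float,
                   range_scan_bytes_per_sec: float,
                   r_threshold_bytes: float = 1000.0,
                   write_threshold_per_sec: float = 10_000.0) -> str:
    """Coarse first-pass recommendation from the ratio R and write intensity."""
    r = range_scan_bytes_per_sec / max(point_reads_per_sec, 1.0)
    if r > r_threshold_bytes:
        return "b-tree"        # range-scan heavy: LSM merge overhead dominates
    if writes_per_sec > write_threshold_per_sec:
        return "lsm-tree"      # write intensive with modest scan volume
    return "either: model both against your media's sequential/random ratio"

print(suggest_engine(50_000, 20_000, 5_000_000))   # -> lsm-tree
```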
Takeaway: Workload selection reduces to measurable parameters—write rate, range scan volume, and key distribution—that can be plugged into performance models rather than relying on general rules of thumb.
The choice between B-trees and LSM-trees is not a matter of which structure is "better" but which structure better fits your specific access patterns. The mathematics are clear: LSM-trees optimize for write throughput through sequential I/O at the cost of read complexity; B-trees provide balanced performance with predictable latency.
Building a performance model for your workload requires measuring three quantities: sustained write rate, the ratio of range scans to point queries, and key distribution characteristics. These inputs, combined with your storage media's sequential versus random I/O ratio, determine the optimal choice.
The deeper principle extends beyond this specific comparison: storage engine selection is a quantitative optimization problem. The techniques developed here—amplification analysis, I/O pattern modeling, workload characterization—apply to any storage engine comparison. Replace intuition with measurement, and folklore with mathematics.