How does the LSM tree work

LORY
6 min read · Feb 11, 2024

And why you may not need it.

The story

“My system has a performance problem at the database level, and my boss asked me to look into Apache Cassandra and ScyllaDB,” my architect friend said.

“But what kind of problem? Read or write, I/O-bound or CPU-bound?” I asked.

“Hard to say. Based on the business logic, it is both read-heavy and write-heavy: we need high-speed writes, scalable storage, fast search, and a reporting feature,” he said.

“Well, then, have you tried hot-cold data separation, considered a data retention policy, or archived or partitioned away the outdated data?” I asked.

“E.g. keep 6–12 months of hot data with B-Tree indexes for searching, and create a summary table to reduce counting or grouping; if full-text search is required, just index the data into Elasticsearch or Solr; for cold data (more than 1 year old), replicate it asynchronously into a reporting DB, since the year-end report runs fewer than 3 times a year anyway,” I said.

“Nope. Our boss wants to store everything, with fast reads and fast writes, and he said an LSM tree-based database could solve all these problems. That’s why he asked me to look into it and pick one,” he said.

“Sorry, bro. LSM is not that simple. It is like another CAP theorem: ask your boss to choose only 2.” (The details are in the RUM conjecture.)

And anyway, LSM will not give you everything. Let me explain.

LSM tree overview

