[HN Gopher] Understanding LSM Trees: What Powers Write-Heavy Dat...
___________________________________________________________________
Understanding LSM Trees: What Powers Write-Heavy Databases
Author : branko_d
Score : 33 points
Date : 2022-02-09 05:37 UTC (17 hours ago)
(HTM) web link (yetanotherdevblog.com)
(TXT) w3m dump (yetanotherdevblog.com)
| saurik wrote:
| I don't understand why every distributed project has gravitated
| to write-optimized data structures; like, if you look at all of
| the decentralized SQL offerings, they are now all built around
| LSM trees... does no one have extremely large transactional (so
| not bulk data-analysis OLAP, but standard OLTP work) datasets
| that are more read than write (ones where raw replication to
| balance the I/O or CPU load of queries isn't viable due to the
| resulting space explosion, so you want the kind of autosharding
| these solutions offer)? I guess I am just confused, as I would have
| expected write-oriented workloads to be more rare and
| specialized, mostly operating in the regime of transient shopping
| carts (Amazon's original use case for Dynamo) or click logs or
| whatever, as I would have thought most projects (maybe a social
| network or a chat client) would have numerous reads (maybe even
| tens of thousands) for every write.
| ot wrote:
| LSM trees have all kinds of tricks to make reads fast too, like
| Bloom filters to rule out levels that don't have the key.
| RocksDB, an LSM engine, powers what is probably the largest
| MySQL deployment in the world, for OLTP workloads [1].
|
| The real killer feature is not write efficiency, but the
| sequential write pattern, as opposed to the small random writes
| used by B-trees. On flash, this makes a significant difference
| for endurance.
|
| EDIT: And I forgot to mention that it is a lot easier to do
| compression on immutable data structures (SSTables) than on
| mutable ones, so LSMs are generally more space-efficient too.
|
| [1] https://vldb.org/pvldb/vol13/p3217-matsunobu.pdf
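|
| A rough sketch of how that lookup path works, in toy Python
| (TinyBloom and Level are invented for illustration; this is not
| RocksDB's actual API): each on-disk level keeps a Bloom filter,
| and a point read skips any level whose filter says the key is
| definitely absent.
|
|     import hashlib
|
|     class TinyBloom:
|         # Toy Bloom filter: k hashed probes into an m-bit array.
|         def __init__(self, m=8192, k=4):
|             self.m, self.k = m, k
|             self.bits = bytearray(m // 8)
|
|         def _probes(self, key):
|             for i in range(self.k):
|                 h = hashlib.blake2b(key.encode(),
|                                     salt=bytes([i]) * 16).digest()
|                 yield int.from_bytes(h[:8], "big") % self.m
|
|         def add(self, key):
|             for p in self._probes(key):
|                 self.bits[p // 8] |= 1 << (p % 8)
|
|         def might_contain(self, key):
|             return all(self.bits[p // 8] & (1 << (p % 8))
|                        for p in self._probes(key))
|
|     class Level:
|         # One on-disk level: a dict stands in for its SSTables,
|         # plus a Bloom filter built over the keys it holds.
|         def __init__(self, entries):
|             self.data = dict(entries)
|             self.bloom = TinyBloom()
|             for k in self.data:
|                 self.bloom.add(k)
|
|     def lookup(key, memtable, levels):
|         if key in memtable:                # newest data wins
|             return memtable[key]
|         for level in levels:               # newest level first
|             if not level.bloom.might_contain(key):
|                 continue                   # definitely absent: no I/O
|             if key in level.data:          # stands in for a disk read
|                 return level.data[key]
|         return None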
| saurik wrote:
| Every time I look into it, though, these tricks aren't making
| reads truly fast; they are just mitigating the impact of LSM
| trees to the extent possible, and a B-tree still ends up both
| better for reads and not even that much worse for writes (due
| to the write amplification from LSM trees; with B-trees this
| is only an issue if you are making truly random writes, which
| I believe is the worst case for LSM trees anyway).
| ot wrote:
| If you're not doing random writes (that is, random keys), an
| LSM has no write amplification; it essentially becomes a
| sorted list of SSTables.
|
| I can't think of any OLTP application where the writes are
| not random (in the sense of relative position in the key
| space). Either your primary keys are, or at least all your
| indexes are.
|
| Write amplification is not an issue if sequential writes
| are orders of magnitude faster than random writes, which is
| usually the case, so even with that LSMs come out ahead.
|
| Also, if your application is read-heavy, you can tune the
| LSM parameters to have fewer levels (at the cost of more
| compaction traffic). With B-trees, you have to chase a
| fixed depth of roughly log_{fanout}(N) pointers on every
| lookup.
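|
| To put rough numbers on that trade-off, a back-of-the-envelope
| sketch in Python (made-up sizes, leveled compaction assumed,
| not tied to any particular engine): the size ratio T trades
| read-path levels against compaction traffic.
|
|     # ~1 TiB of data on top of a 256 MiB first level. A bigger
|     # size ratio T means fewer levels to check on reads, but a
|     # byte is rewritten roughly T times per level it passes
|     # through, so write amplification grows with T.
|     def levels_needed(total, first_level, T):
|         size, levels = first_level, 1
|         while size < total:
|             size *= T
|             levels += 1
|         return levels
|
|     total, first = 1 << 40, 256 << 20
|     for T in (4, 10, 20):
|         n = levels_needed(total, first, T)
|         print(f"ratio {T:>2}: ~{n} levels, "
|               f"write amp very roughly {T * (n - 1)}x")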
| bob1029 wrote:
| > The real killer feature is not write efficiency, but the
| sequential write pattern, as opposed to the small random
| writes used by B-trees. On flash, this makes a significant
| difference for endurance.
|
| Sequential writes _with_ some degree of request batching can
| add orders of magnitude to system throughput. I use a
| ringbuffer abstraction to serialize my writers and handle
| batches of up to 4096 transactions at a time. Everything in
| the batch gets compressed and written as 1 chunk to an
| append-only log. The benefits of processing multiple requests
| at a time are very hard to overstate. Saturating the write
| performance of virtually any storage device is trivial with
| this approach. It also carries back to magnetic storage
| really well, and even tape if you were so inclined to go down
| that path.
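|
| A minimal sketch of that pattern in Python (names and framing
| invented; the real thing would also signal each waiting producer
| once its batch is on disk): one writer thread drains pending
| transactions in batches, compresses each batch, and appends it
| as a single length-prefixed chunk, so the device only ever sees
| large sequential writes.
|
|     import queue, struct, zlib
|
|     MAX_BATCH = 4096
|
|     def writer_loop(pending: "queue.Queue[bytes]", log_path: str):
|         with open(log_path, "ab") as log:
|             while True:
|                 batch = [pending.get()]     # block for first item
|                 while len(batch) < MAX_BATCH:
|                     try:                    # opportunistic batching
|                         batch.append(pending.get_nowait())
|                     except queue.Empty:
|                         break
|                 # frame each record, then compress the whole batch
|                 payload = b"".join(struct.pack("<I", len(r)) + r
|                                    for r in batch)
|                 chunk = zlib.compress(payload)
|                 log.write(struct.pack("<I", len(chunk)) + chunk)
|                 log.flush()  # plus os.fsync() if every batch must
|                              # be durable before acknowledging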
| ot wrote:
| The batching comes implicitly in LSMs: for the WAL, the
| writes are committed only when the WAL is synced, so the
| amount of batching is user-controlled. When writing
| SSTables (either from memtables or from compaction) the
| write sizes are controlled by the write buffer size.
| kragen wrote:
| Social networks and chat clients can deal with eventually
| consistent reads instead of consistent reads, so you can run
| your database with a master, which handles only writes, a
| standby master for failover, and 58 asynchronously updated
| readslaves. There are 307 different ways to scale out read
| performance horizontally as long as eventual consistency is
| good enough for your application: memcached, Redis, Varnish,
| Fastly (built on Varnish), etc. Some of these don't even have
| cache invalidation, relying on timeouts.
|
| This approach can get you pretty far before you need sharding,
| auto- or otherwise, but the master's write bandwidth is the
| bottleneck. Taken to the extreme this realization leads you to
| ridiculous places like Kafka/Samza.
|
| I'm trying to come up with an example of an OLTP application
| where the ratio of reads to writes within the same transaction
| needs to be large. I'm sure they must exist but I can't think
| of one.
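|
| The read/write split is simple enough to sketch (hypothetical
| connection objects, not any particular driver): writes funnel
| through the primary, reads round-robin over the asynchronously
| updated replicas, and the application tolerates briefly stale
| reads.
|
|     import itertools
|
|     class ReadWriteSplitter:
|         def __init__(self, primary, replicas):
|             self.primary = primary
|             self.replicas = itertools.cycle(replicas)
|
|         def write(self, sql, params=()):
|             # every write goes to the single primary
|             return self.primary.execute(sql, params)
|
|         def read(self, sql, params=()):
|             # reads spread over replicas; results may lag slightly
|             return next(self.replicas).execute(sql, params)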
| tommiegannert wrote:
| > An SSTable will consist of multiple sorted files called
| segments. [---] Recall that LSM trees only perform sequential
| writes.
|
| Assuming either that you only have one table, or that file
| systems are magic things that don't make that argument moot.
|
| > Once the compaction process has written a new segment for the
| input segments, the old segment files are deleted.
|
| ... and the assumption of only a single table isn't enough! Thank
| you, file system developers, for making SSTables sound so easy.
|
| > Bloom filter
|
| I did some research a couple of weeks ago on approximate
| membership query (AMQ) data structures. Bloom filters are from
| 1970, Quotient filters arrived in 1984, then Pagh improved Bloom
| filters in 2004, and we got Cuckoo filters in 2014 [1]. Has there
| been progress since? Did I miss any key achievement in-between?
|
| [1] https://en.wikipedia.org/wiki/Cuckoo_filter
| 0xCMP wrote:
| The first thing I thought reading this was "there is something
| missing here", because you obviously can't wait until the
| memtable is full to reply OK to the requester. The RocksDB docs
| explain that they use a WAL + manifest log, and I assume others
| do something similar:
| https://github.com/facebook/rocksdb/wiki/RocksDB-Overview#3-...
|
| I wish it had at least been mentioned offhand that this clearly
| isn't the whole solution, and that a WAL or similar is used to
| make sure losing the memtable doesn't lose the data.
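|
| The ordering is easy to sketch (made-up record format, nothing
| like RocksDB's actual WAL): append and fsync to the write-ahead
| log first, then update the memtable and acknowledge, so a crash
| that wipes the memtable loses nothing that was acknowledged.
|
|     import json, os
|
|     class TinyLSMWriter:
|         def __init__(self, wal_path):
|             self.wal = open(wal_path, "ab")
|             self.memtable = {}  # flushed to an SSTable when full,
|                                 # after which the WAL can be trimmed
|
|         def put(self, key, value):
|             rec = json.dumps({"k": key, "v": value}).encode() + b"\n"
|             self.wal.write(rec)
|             self.wal.flush()
|             os.fsync(self.wal.fileno())  # durable before we say OK
|             self.memtable[key] = value
|             return "OK"                  # safe to ack the client now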
| spullara wrote:
| There is one nit I would like to point out where they are talking
| about a Bloom filter. They should replace "value is present" with
| "value may be present". The worst-case scenario is a false
| positive, which is generally tuned to occur about 1% of the time.
| Whatever that false-positive rate (R) is, though, your tail
| latency will definitely be affected at around the p(100-R)
| percentile.
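|
| For reference, the usual approximation for that rate is
| p ~= (1 - e^(-k*n/m))^k; a quick Python check with illustrative
| numbers (10 bits per key, 7 hash functions) lands near the 1%
| figure above.
|
|     import math
|
|     def false_positive_rate(n_keys, bits_per_key, k):
|         m = n_keys * bits_per_key   # total bits in the filter
|         return (1 - math.exp(-k * n_keys / m)) ** k
|
|     # ~0.8% of lookups for absent keys pay for a wasted read,
|     # which is where the p(100-R) tail-latency hit comes from.
|     print(false_positive_rate(1_000_000, 10, 7))  # ~0.0082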
| bradengroom wrote:
| Author here! Thanks for sharing, and thanks for all of your
| suggestions in the comments. I'll try to find some time to apply
| the feedback amid my current parental leave.
___________________________________________________________________
(page generated 2022-02-09 23:00 UTC)