[HN Gopher] SplinterDB: High performance embedded key-value store
___________________________________________________________________
SplinterDB: High performance embedded key-value store
Author : ridruejo
Score : 52 points
Date : 2022-05-26 07:44 UTC (2 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| bufferoverflow wrote:
| You call it "high performance" and provide no benchmarks?
| dilyevsky wrote:
| Paper has it:
| https://www.usenix.org/system/files/atc20-conway.pdf but yeah,
| if you check out the list of limitations it looks more like a
| research project at this stage. Pretty interesting architecture
| overall though.
| bufferoverflow wrote:
| The numbers look very good actually.
|
| I don't care if it's a research project. If it doesn't crash,
| doesn't corrupt data, and delivers performance, it's useful.
|
| I'd want to see performance against Redis and KeyDB.
| dilyevsky wrote:
| Well you should read the limitations... I _think_ they are
| actually cheating by not calling fsync at all which makes
| writes not durable. This is different in rocks/pebble and
| friends.
|
| > I'd want to see performance against Redis and KeyDB.
|
| I think this is an apples-to-oranges comparison, as neither of
| these provides durability by default, and if you enable it
| redis had terrible performance last I checked. Redis also needs
| to fit the whole dataset in memory.
| ajhconway wrote:
| Hi, research lead for SplinterDB here.
|
| SplinterDB does make all writes durable and in fact has
| its own user-level cache which generally performs writes
| directly to disk (using O_DIRECT for example).
|
| Like RocksDB's default behavior (no fsyncs on the log),
| it does not immediately sync writes to its log when they
| happen. It waits to sync in batches, so that writes may
| not be immediately durable, but logging is more
| efficient. This is a slightly stronger default durability
| guarantee than RocksDB's, and we intend to make this
| configurable.
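The batching described above can be sketched in a few lines. This is a minimal illustration of the general group-commit idea, not SplinterDB's or RocksDB's actual logging code: the `BatchedLog` class, its `batch_size` parameter, and the record format are all hypothetical, and real systems may batch by elapsed time or bytes rather than by write count.

```python
import os
import tempfile


class BatchedLog:
    """Append-only log that fsyncs once per batch instead of once per write.

    Hypothetical sketch: the names and the count-based policy are assumptions.
    """

    def __init__(self, path, batch_size=4):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.batch_size = batch_size
        self.pending = 0     # records appended but not yet fsynced
        self.sync_calls = 0  # number of fsyncs issued so far

    def append(self, record: bytes):
        # The write reaches the OS immediately, but is only guaranteed
        # to survive a crash after the next sync().
        os.write(self.fd, record + b"\n")
        self.pending += 1
        if self.pending >= self.batch_size:
            self.sync()

    def sync(self):
        """Force everything appended so far onto stable storage."""
        if self.pending:
            os.fsync(self.fd)
            self.sync_calls += 1
            self.pending = 0

    def close(self):
        self.sync()
        os.close(self.fd)


def demo():
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    log = BatchedLog(path, batch_size=4)
    for i in range(10):
        log.append(f"put k{i} v{i}".encode())
    unsynced_before_close = log.pending
    log.close()
    return log.sync_calls, unsynced_before_close
```

With ten appends and a batch size of four, the log is synced after the fourth and eighth records, so two records sit unsynced until `close()` issues the final sync. Those two records are the window in which an acknowledged write could still be lost on a crash, which is the tradeoff the thread is debating.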
| otterley wrote:
| I'm a little confused. If you don't ensure data is
| committed to storage (log or otherwise) before acking the
| write request, how can you call it durable?
|
| If it's not truly 100% durable by default, it's best not
| to suggest that it is. Experience says people will use
| the default settings and then become very cross if they
| lose data. It undermines trust and is harmful to
| reputation.
| ajhconway wrote:
| With many workloads, there's a tradeoff between the
| granularity of durability and the overall performance.
|
| If a workload has many small writes (some of our product
| workloads do), then syncing each write can cause write
| amplification and massively affect overall throughput and
| latency. Suppose I do a 100B write: this causes a 4KiB
| page write to sync, which is roughly 40x write amp. Suddenly a
| 2GiB/sec SSD can effectively only write about 50MiB/sec.
| Similarly, the per-write latency goes from <5us to 10us
| (with the fastest Optane SSDs) or 150us (with flash
| SSDs).
|
| So storage systems tend to offer a range of durability
| guarantees. Some systems have a special sync operation
| for applications to ensure that all writes are durable.
|
| RocksDB offers a fairly weak guarantee by default too,
| writing to the write-ahead-log (WAL), but not performing
| fsyncs
| (https://github.com/facebook/rocksdb/wiki/WAL-Performance).
| They make a similar write amplification argument too
| (https://github.com/facebook/rocksdb/wiki/WAL-Performance#wri...).
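The write amplification arithmetic in the comment above can be checked directly. The sketch below uses only the figures quoted there (a 100-byte write, a 4 KiB page, a 2 GiB/sec device); the function names are made up for illustration, and the model assumes every write syncs exactly one full page.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def write_amplification(payload_bytes, page_bytes=4 * KIB):
    """Bytes physically written per payload byte when each write syncs one full page."""
    return page_bytes / payload_bytes


def effective_throughput(device_bytes_per_sec, amplification):
    """Payload bytes per second the device can absorb at a given amplification."""
    return device_bytes_per_sec / amplification


amp = write_amplification(100)
payload_rate = effective_throughput(2 * GIB, amp)
print(f"write amplification: {amp:.2f}x")
print(f"effective throughput: {payload_rate / MIB:.1f} MiB/s")
```

The exact ratio is 4096 / 100 = 40.96, which the comment rounds to 40x, and 2 GiB/sec divided by that ratio comes to 50 MiB/sec of useful payload.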
| dilyevsky wrote:
| I missed the use of direct io and the comment about fsync
| threw me off, thanks. Very impressive then!
| tyingq wrote:
| Ah, that's helpful, and explains why it exists:
|
| _"Three novel ideas contribute to the high performance of
| SplinterDB: the STB-tree, a new compaction policy that
| exposes more concurrency, and a concurrent memtable and user-
| level cache that removes scalability bottlenecks. All three
| components are designed to enable the CPU to drive high IOPS
| without wasting cycles."_
|
| _"At the heart of SplinterDB is the STB-tree, a novel data
| structure that combines ideas from log-structured merge tree
| and B-trees. The STB-tree adapts the idea of size-tiering
| (also known as fragmentation) from key-value stores such as
| Cassandra and PebblesDB and applies them to B-trees to reduce
| write amplification by reducing the number of times a data
| item is re-written during compaction."_
| stingraycharles wrote:
| Yeah, I would appreciate a benchmark against its main
| alternative, rocksdb. I know benchmarks are typically
| manufactured and not too representative of real-world load,
| but at least a ballpark figure would be nice to know what we're
| talking about here.
|
| Their main website is at https://splinterdb.org/ by the way,
| for those interested. Also no benchmarks there. :)
| ridruejo wrote:
| The paper referenced in the other comment includes a
| benchmark against RocksDB
| https://news.ycombinator.com/item?id=31515765
| killingtime74 wrote:
| Why would I pick this over SQLite?
| necubi wrote:
| Totally different use-cases. This is an embedded key value
| store, not an RDBMS. You would use this in place of e.g.,
| LevelDB or RocksDB, potentially as the storage layer of a
| database.
| axblount wrote:
| There's always the venerable:
|
|     CREATE TABLE kv (
|         k TEXT PRIMARY KEY,
|         v TEXT NOT NULL
|     );
|
| Even if sqlite is technically an RDBMS, I think it's a
| legitimate comparison. Is SplinterDB worth giving up sqlite's
| reliability and feature set?
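For readers who want to try the SQLite-as-KV idea from the comment above, here is a minimal sketch using Python's standard `sqlite3` module and the same `kv` table. The `SqliteKV` wrapper and its method names are hypothetical, and the sketch says nothing about how its performance compares with SplinterDB or RocksDB.

```python
import sqlite3


class SqliteKV:
    """Tiny key-value wrapper over the kv(k, v) table from the thread."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT NOT NULL)"
        )

    def put(self, key, value):
        with self.conn:  # commits on success, rolls back on exception
            self.conn.execute(
                "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value)
            )

    def get(self, key):
        row = self.conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        return None if row is None else row[0]

    def delete(self, key):
        with self.conn:
            self.conn.execute("DELETE FROM kv WHERE k = ?", (key,))

    def scan(self, start, end):
        """Return (key, value) pairs with start <= key < end, in key order."""
        cur = self.conn.execute(
            "SELECT k, v FROM kv WHERE k >= ? AND k < ? ORDER BY k", (start, end)
        )
        return cur.fetchall()


store = SqliteKV()
store.put("user:2", "bob")
store.put("user:1", "alice")
store.put("user:1", "alicia")  # overwrites the earlier value
store.put("zone:1", "eu")
```

The primary key index gives ordered range scans for free, which covers the basic get, put, delete, and scan surface that an embedded key-value store typically exposes.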
| necubi wrote:
| This is much lower-level than sqlite. In fact, you could
| use this as the storage layer for a SQL DB. See, e.g.,
| MyRocks[0] which is a MySQL backend that uses RocksDB as
| the storage layer.
|
| In other words, you'd use this when you just need a
| persistent KV store and want to build the higher level
| semantics according to your application's needs.
|
| [0] http://myrocks.io/
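To make "build the higher level semantics" concrete, here is a rough sketch of layering table rows on top of an ordered key-value interface. Everything in it is an assumption for illustration: `OrderedKV` is an in-memory stand-in rather than SplinterDB's or RocksDB's API, and the `table/primary-key` encoding is a simplified scheme, not how MyRocks actually lays out its keys.

```python
import json


class OrderedKV:
    """In-memory stand-in for an ordered embedded KV store (hypothetical)."""

    def __init__(self):
        self._data = {}

    def put(self, key: bytes, value: bytes):
        self._data[key] = value

    def get(self, key: bytes):
        return self._data.get(key)

    def scan_prefix(self, prefix: bytes):
        # A real store would seek to the prefix; sorting is fine for a sketch.
        for key in sorted(self._data):
            if key.startswith(prefix):
                yield key, self._data[key]


class Table:
    """Row-oriented table layered on top of the KV store."""

    def __init__(self, kv, name):
        self.kv = kv
        self.prefix = name.encode() + b"/"

    def _key(self, pk: int) -> bytes:
        # Zero-padded so that byte order matches numeric order.
        return self.prefix + f"{pk:010d}".encode()

    def insert(self, pk, row: dict):
        self.kv.put(self._key(pk), json.dumps(row).encode())

    def get(self, pk):
        raw = self.kv.get(self._key(pk))
        return None if raw is None else json.loads(raw)

    def rows(self):
        """Yield all rows of this table in primary-key order."""
        for _, raw in self.kv.scan_prefix(self.prefix):
            yield json.loads(raw)


kv = OrderedKV()
users = Table(kv, "users")
orders = Table(kv, "orders")
users.insert(10, {"name": "bob"})
users.insert(2, {"name": "alice"})
orders.insert(1, {"user": 2, "total": 30})
```

Both tables share one key space, and a table scan is just a prefix scan, which is why an ordered KV store is a workable foundation for a SQL storage engine.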
| 4khilles wrote:
| > In other words, you'd use this when you just need a
| persistent KV store and want to build the higher level
| semantics according to your application's needs.
|
| Why can't you use SQLite for this use case? I believe FDB
| uses SQLite as an embedded KV store.
| tjungblut wrote:
| Check out the limitations first: no fsync and no data recovery
| make this of very little use. I wonder what makes you write a KV
| store without these from the start.
| adra wrote:
| Like 99% of the use cases redis/memcached should be used for?
| bufferoverflow wrote:
| Redis has optional durability
|
| https://redis.io/docs/manual/persistence/
| tjungblut wrote:
| Why write anything to disk if you can just store it in an
| array?
| capableweb wrote:
| Just one example: sharing data between processes
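The point about sharing data between processes can be shown with any file-backed store. The sketch below uses SQLite from the standard library as a stand-in (it makes no claim about SplinterDB's multi-process support): the parent writes a key to a file, and a separate Python process, which shares no memory with the parent, reads it back. The helper name and file layout are hypothetical.

```python
import os
import sqlite3
import subprocess
import sys
import tempfile

# Source code for the child process: open the file and print one value.
READER = """
import sqlite3, sys
conn = sqlite3.connect(sys.argv[1])
row = conn.execute("SELECT v FROM kv WHERE k = ?", (sys.argv[2],)).fetchone()
print(row[0])
"""


def write_then_read_in_child(key, value):
    path = os.path.join(tempfile.mkdtemp(), "shared.db")
    conn = sqlite3.connect(path)
    with conn:  # commit so the data is visible to other processes
        conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT NOT NULL)")
        conn.execute("INSERT INTO kv (k, v) VALUES (?, ?)", (key, value))
    conn.close()
    # The child is a fresh interpreter; the file is the only thing it shares.
    result = subprocess.run(
        [sys.executable, "-c", READER, path, key],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

An in-process array disappears with the process that owns it, whereas the file lets an unrelated process pick up the same data.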
___________________________________________________________________
(page generated 2022-05-28 23:00 UTC)