[HN Gopher] What's the big deal about embedded key-value databases?
___________________________________________________________________
What's the big deal about embedded key-value databases?
Author : eatonphil
Score : 96 points
Date : 2022-08-23 15:58 UTC (7 hours ago)
(HTM) web link (notes.eatonphil.com)
(TXT) w3m dump (notes.eatonphil.com)
| eis wrote:
| A few more entries that might be of interest:
|
| * DynamoDB and the Dynamo KV store
| * LMDB (embedded kv)
| * Dgraph (distributed graph db) and its embedded kv store BadgerDB
| didgetmaster wrote:
| I am building a general-purpose data management system called
| Didgets (https://didgets.com/) that extensively uses KV stores
| that I invented. Since it was primarily designed to be a file
| system replacement, I used them for attaching contextual
| metadata tags to file objects.
|
| My whole container started to look like a sparsely populated
| relational table where every row/column intersection could have
| multiple values (e.g. a photo could have a tag for every person
| in the picture attached). I started experimenting with using the
| KV stores as columns to form regular relational tables.
|
| It turns out that it was relatively easy and was extremely fast.
| I started building tables with 50+ million rows and many columns
| and performing queries against them. Benchmarking the system
| against other databases revealed that it was very fast (and
| didn't need separate indexes to accomplish this).
|
| Here is a video showing how it does a bunch of queries 10x faster
| than the same data stored in a highly indexed table in Postgres:
| https://www.youtube.com/watch?v=OVICKCkWMZE
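|
| A minimal sketch of the "KV stores as columns" idea described above.
| This is not Didgets' actual implementation: the names Column, add,
| rows_with, and query are hypothetical, and a plain scan stands in for
| whatever lookup structure the real system uses. Each column is its own
| map from row ID to a set of values, so one cell can hold several
| values and a sparse column stores nothing for rows without a value.

```python
from collections import defaultdict


class Column:
    """One column kept as a KV map: row id -> set of values (hypothetical)."""

    def __init__(self):
        self.cells = defaultdict(set)

    def add(self, row_id, value):
        # A cell may hold many values, e.g. one tag per person in a photo.
        self.cells[row_id].add(value)

    def rows_with(self, value):
        # Plain scan over the column; no separate index is maintained here.
        return {row_id for row_id, values in self.cells.items() if value in values}


def query(columns, **conditions):
    """Return the row ids that satisfy every column=value condition."""
    matches = [columns[name].rows_with(value) for name, value in conditions.items()]
    return set.intersection(*matches) if matches else set()


people = Column()
place = Column()
people.add(1, "alice")
people.add(1, "bob")    # photo 1 has two people tagged
people.add(2, "alice")
place.add(1, "paris")
place.add(2, "rome")

columns = {"people": people, "place": place}
result = query(columns, people="alice", place="paris")
```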
| atmin wrote:
| No mention of SQLite as an embedded SQL database?
| eatonphil wrote:
| This post is about key-value stores.
|
| While FoundationDB uses SQLite, I didn't otherwise think of
| SQLite as being relevant here. :)
| morelisp wrote:
| "Time is a flat circle." - someone at Sleepycat, probably.
| rad_gruchalski wrote:
| This is a good read. By the way, Kafka Streams is also built on
| top of RocksDB. Not strictly a database but relevant to a certain
| extent.
| Xeoncross wrote:
| I highly recommend people comfortable with Go check out the
| building blocks at https://github.com/thomasjungblut/go-sstables
|
| This codebase shows how SSTables, WAL, memtables, recordio,
| skiplists, segment files, and other storage engine components
| work in a digestible way. Includes a demo database showing how it
| all comes together to make a RocksDB / LevelDB competitor (not
| really).
| tristan957 wrote:
| I work on a storage engine at $dayJob. We have created a
| connector for MongoDB, although for a very ancient version. We
| are currently working with $cloudProvider to use our storage
| engine in their cloud DBaaS offerings.
|
| This field is pretty interesting when you're talking about
| performance vs space amp vs write amp vs read amp.
| aviramha wrote:
| Great article! One cool thing about RocksDB is that it's even
| used in other KV databases, such as Redis on Flash:
| https://redis.com/blog/hood-redis-enterprise-flash-database-...
| eatonphil wrote:
| Yup, FB's ZippyDB [0] is another example mentioned in the
| article.
|
| [0] https://engineering.fb.com/2021/08/06/core-data/zippydb/
|
| Edit: I've added Redis Enterprise Flash to the list now.
| Thanks!
| dboreham wrote:
| The article misses the point. All data storage and query
| systems end up architected in layers. Upper layers deal with
| higher abstractions (objects, rows, whatever). Lower layers
| deal with simpler functions, closer to the hardware. The upper
| layers are consumers of the lower layers. This is where
| "embedded KV stores" like LevelDB, RocksDB, etc come from. They
| began as the embedded storage layer for some bigger thing.
| Every product you think of as a database or document store is
| built like this, including MySQL and PostgreSQL and Oracle.
| Such a storage layer, shipped as an independent library, is how
| you (or anyone) builds your own database-ish thing. That's what
| the article should say.
|
| The list of examples is odd. For instance, MongoRocks is cited
| for using RocksDB, but actual stock MongoDB uses Wired Tiger,
| which isn't mentioned.
|
| Disclosure: I played a part in the late-beginning of this space
| when Netscape funded Sleepycat to develop BerkeleyDB. dbm and
| ndbm existed beforehand, but BerkeleyDB used in LDAP servers is
| I think the genesis point for this pattern as it exists today.
| eatonphil wrote:
| If there's a difference between what you wrote and what I
| wrote I'm missing it.
|
| But you're also welcome to write your own post. :)
| morelisp wrote:
| I do feel like there's a historical perspective missing
| from the article which the GP touches on. Embedded KV
| stores aren't new (although some of the algorithms behind
| the current crop certainly are). They used to dominate
| "backend" software development until their popularity waned
| as the world got obsessed with "model the domain, damn the
| computation cost" (because all resources were doubling or
| more yearly) followed by "we'll just distribute it".
|
| The need for parallelism killed the first approach and the
| cost of increasingly complex reduce steps killed the
| second. Now we're back to "how much can we fit in RAM on a
| local machine" and it turns out, if you can still bang bits
| for smart key formats, a hell of a lot.
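|
| One way to read "bang bits for smart key formats", as a sketch under
| assumptions: pack the fields of a composite key as fixed-width
| big-endian integers, so that byte-wise key order matches the logical
| (user_id, timestamp) order and a range scan over sorted keys answers
| "everything for this user". The field layout and the names make_key,
| parse_key, and scan_user are illustrative only; a sorted list stands
| in for an ordered KV store.

```python
import bisect
import struct

KEY_FORMAT = ">IQ"  # big-endian: 4-byte user id, then 8-byte timestamp


def make_key(user_id: int, timestamp: int) -> bytes:
    # Big-endian unsigned fields sort byte-wise in the same order as the numbers.
    return struct.pack(KEY_FORMAT, user_id, timestamp)


def parse_key(key: bytes):
    return struct.unpack(KEY_FORMAT, key)


def scan_user(sorted_keys, user_id: int):
    """Return all (user_id, timestamp) pairs for one user via a range scan."""
    lo = bisect.bisect_left(sorted_keys, make_key(user_id, 0))
    hi = bisect.bisect_right(sorted_keys, make_key(user_id, 2**64 - 1))
    return [parse_key(k) for k in sorted_keys[lo:hi]]


# A sorted list of keys stands in for the ordered keyspace of a KV store.
keys = sorted(
    make_key(u, t) for u, t in [(2, 50), (1, 300), (1, 100), (3, 7), (1, 200)]
)
user1 = scan_user(keys, 1)
```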
| galaxyLogic wrote:
| > Upper layers deal with higher abstractions (objects, rows,
| whatever)
|
| Right, I'm waiting for a standard for the level above relational
| databases, which is object databases. I know there are several
| ones already and there are Object-Relational mapping layers.
|
| I think the key point there is that Object databases are a
| level ABOVE relational databases. They are not "better" but
| they deal with the higher level of objects rather than
| "tables", just like relational databases can be seen as a level
| above key-value stores.
|
| I would like Object databases to become better and easier to
| use and more standardized.
|
| I think there is value in being able to see both levels, the
| objects, and the relational data that makes up the objects.
| morelisp wrote:
| Neither objects nor relations are "above" the other. You
| can map them in a vacuous mathematical sense, but it's a
| massively leaky abstraction in either direction.
| eatonphil wrote:
| Some concrete examples:
|
| 1. Yugabyte's relational query layer sits on top of a
| document store (DocDB):
| https://www.yugabyte.com/blog/how-we-built-a-high-performanc....
|
| 2. You can put documents in a PostgreSQL JSON(B) column.
| nicholasjarnold wrote:
| > They began as the embedded storage layer for some bigger
| thing.
|
| I immediately thought of Kafka's streaming query stuff when I
| read the headline (ksqlDB). I'm not sure if that's the origin
| story of RocksDB, but it's the storage engine underlying that
| streaming query tooling in Kafka's ecosystem.
| eis wrote:
| TiKV is not an embedded key-value store, it is distributed.
| eatonphil wrote:
| Thanks! Fixed and attributed you at the end.
| x3n0ph3n3 wrote:
| My team has a use-case that involves a precomputed RocksDB
| database saved on an AWS EFS volume that is mounted on a lambda
| with 100's-1000's of invocations per second. It allows for some
| extremely fast querying of relatively static data. Another
| process is responsible for periodically updating the database and
| writing it back to the EFS volume.
| samsquire wrote:
| With Rockset's converged indexes and an SQL query optimiser you
| can build an SQL database.
|
| https://rockset.com/blog/converged-indexing-the-secret-sauce...
|
| Rockset's converged indexes + denormalisation means you can have
| fast querying.
| mprovost wrote:
| I feel like this is missing any mention of the history of KV
| stores. Unix came with an embedded database (dbm) from the early
| days (1979) [0] which was rewritten at Berkeley into the more
| popular bdb in the 80s. [1] Sendmail was one of the more common
| programs that used it. And then when djb built his replacement
| for sendmail, qmail, he invented cdb. [2]
|
| [0] https://en.wikipedia.org/wiki/DBM_(computing)
|
| [1] https://en.wikipedia.org/wiki/Berkeley_DB
|
| [2] https://cr.yp.to/cdb.html
| LAC-Tech wrote:
| When I read about event sourcing, my mind immediately went to how
| that would map to a K/V database. Has anyone done this in
| production?
|
| Also - no mention of LMDB? RocksDB and LMDB feel like the ones
| that stand out in that field - LevelDB definitely had a
| reputation for corrupting data.
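|
| This does not answer the production question, but here is a rough
| sketch of how event sourcing might map onto a K/V store: the key is
| the stream name plus a zero-padded sequence number, the value is the
| serialized event, and current state is rebuilt by scanning a key
| prefix in order. EventStore, append, replay, and balance are
| hypothetical names, and a dict with sorted keys stands in for a real
| ordered store such as RocksDB or LMDB.

```python
import json


class EventStore:
    """Append-only event log on top of a key-value map (illustrative only)."""

    def __init__(self):
        self.kv = {}  # stand-in for an embedded KV store with ordered keys

    def append(self, stream: str, event: dict) -> int:
        prefix = f"{stream}/".encode()
        seq = sum(1 for key in self.kv if key.startswith(prefix))
        # Zero-padding keeps lexicographic key order equal to sequence order.
        key = prefix + f"{seq:020d}".encode()
        self.kv[key] = json.dumps(event).encode()
        return seq

    def replay(self, stream: str):
        prefix = f"{stream}/".encode()
        for key in sorted(self.kv):
            if key.startswith(prefix):
                yield json.loads(self.kv[key])


def balance(events) -> int:
    """Fold a stream of account events into the current balance."""
    total = 0
    for event in events:
        if event["type"] == "deposited":
            total += event["amount"]
        elif event["type"] == "withdrew":
            total -= event["amount"]
    return total


store = EventStore()
store.append("acct-1", {"type": "deposited", "amount": 100})
store.append("acct-1", {"type": "withdrew", "amount": 30})
store.append("acct-10", {"type": "deposited", "amount": 5})
current = balance(store.replay("acct-1"))
```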
| effnorwood wrote:
| adammarples wrote:
| Plug for my python dict wrapper
| https://github.com/adammarples/rocksdbdict
| Adiqq wrote:
| Honestly, I'm still not sure why I would use something like
| RocksDB instead of, or in addition to, plain
| PostgreSQL/MongoDB/Redis instances.
|
| I don't work with a lot of data, but my decisions are typically
| based on basic factors and purpose:
|
| PostgreSQL - SQL, structured data, cannot scale horizontally
|
| MongoDB - NoSQL, unstructured data
|
| Redis - key-value, distributed cache
|
| I get it that you can replace the storage engine and
| theoretically get more performance, but in practice compatibility
| and standardization are more important, because a lot of products
| (including third-party ones) will already use
| PostgreSQL/MongoDB/Redis, so it's a no-brainer to use them as well
| for your solution.
|
| However, for me to pick RocksDB or some other new, shiny
| database/storage engine, there would have to be more compelling
| reasons.
| jzelinskie wrote:
| Unless you are building a database, these embedded KV store
| libraries are less likely to be the best solution for the job. If
| you are considering them for an app that isn't a database, you
| should also take a long, hard look at SQLite first.
|
| What's also interesting is the trend of newer distributed
| "database systems" like Vitess[0] or SpiceDB[1] that forego
| embedded KV stores and instead reuse existing SQL databases as
| their "embedded database". Vitess leverages MySQL and SpiceDB
| leverages MySQL, PostgreSQL, CockroachDB, or Spanner. Systems
| built this way get to leverage many high-level features from
| existing database systems such that they can focus on
| innovating in even higher-level functionality. In the case of
| Vitess, it's scaling, distributing, and schema management of
| MySQL. In the case of SpiceDB, it's building a database
| specifically optimized for querying access control data in a
| way that can coordinate with causality across multiple
| services.
|
| [0]: https://github.com/vitessio/vitess
|
| [1]: https://github.com/authzed/spicedb
| zarzavat wrote:
| In your list RocksDB is most like Redis, but even faster
| because the data doesn't have to leave the process.
|
| Think of it as a high performance sports car like a Ferrari.
| It's not good at taking the kids to school or buying groceries.
| But if you need to prioritise performance at the expense of all
| other considerations then it's exactly what you need.
| Xeoncross wrote:
| Like S3 or Redis, RocksDB is much more performant when you
| don't need the query engine and want to have highly compact
| storage with fast lookups and high write throughput.
|
| Storage engines are different levels of complexity based on the
| query requirements. Simple K/V stores can run circles around
| Postgres/MySQL as long as you don't need the extra features.
| rajko_rad wrote:
| Two more examples to check out: Yugabyte also persists with
| RocksDB:
| https://www.yugabyte.com/blog/how-we-built-a-high-performanc...
|
| And this is very cool, distributed SQLite with FDB:
| https://univalence.me/posts/mvsqlite
| eatonphil wrote:
| Thank you, edited to include Yugabyte!
| ramoz wrote:
| We should see a rise in embedded KV popularity in correlation with
| ML applications. Storing embeddings in something like LevelDB, in
| a format such as FlatBuffers, offers a high-performance solution
| for online prediction (i.e. for mapping business values to their
| embedding format on the fly to send off to some model for
| inference).
| jupp0r wrote:
| Would that be on mobile devices for offline usage? I'm thinking
| that for typical backend use cases one would use a dedicated
| key value store service, right?
| ramoz wrote:
| This would depend on your requirements and type of inference.
| Say you need to compute inference across 1000's of
| content/documents/images every second or so, out of some
| corpus of millions-billions, then having a kv store on
| disk/SSD (NVMe) might be far more efficient & cheaper (in
| terms of grabbing those embeddings to conduct a downstream ML
| task). How you update the corpus matters too -- a lot of
| embedding spaces need to be updated in aggregate.
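|
| A small sketch of the embedding-lookup pattern, with loud
| assumptions: Python's standard-library dbm.dumb stands in for LevelDB,
| and raw packed float32 values stand in for FlatBuffers. The keys
| ("sku-123") and the helper names encode and decode are made up for
| illustration.

```python
import dbm.dumb
import os
import tempfile
from array import array


def encode(vector) -> bytes:
    # Pack floats as contiguous float32 values (native byte order).
    return array("f", vector).tobytes()


def decode(raw: bytes):
    out = array("f")
    out.frombytes(raw)
    return list(out)


# Write phase: an offline job stores one embedding per business key.
path = os.path.join(tempfile.mkdtemp(), "embeddings")
with dbm.dumb.open(path, "c") as db:
    db[b"sku-123"] = encode([0.5, -1.0, 0.25])
    db[b"sku-456"] = encode([2.0, 0.0, 4.0])

# Read phase: the serving process maps a business key to its vector
# in-process, then hands the vector to a model for inference.
with dbm.dumb.open(path, "r") as db:
    looked_up = decode(db[b"sku-123"])
```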
| lacker wrote:
| IMO it's just confusing to call both, say, RocksDB and MySQL
| "databases". They sit at different levels of the stack and it is
| easier to just think of them as entirely different things, your
| "SQL database" and your "storage engine". So your stack looks
| like
|
|     Application
|         |
|       MySQL
|         |
|      RocksDB
|         |
|     Filesystem
|
| In general the MySQL layer is doing all the convenient stuff for
| application developers like supporting different queries and
| datatypes. The RocksDB layer is optimizing for performance
| metrics like throughput and reliability and just treats data as
| sequences of bytes.
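|
| To make the layering concrete, here is a toy sketch: a lower layer
| that only stores byte strings, and an upper layer that adds named
| columns, a primary key, and filtering on top of it. This is not how
| MySQL or MyRocks actually encodes rows; ByteStore, Table, the
| "table/pk" key format, and the JSON row encoding are all assumptions
| made for illustration.

```python
import json


class ByteStore:
    """Lower layer: maps bytes to bytes and knows nothing about rows."""

    def __init__(self):
        self._data = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes):
        return self._data.get(key)

    def scan(self, prefix: bytes):
        # Ordered iteration over keys sharing a prefix.
        for key in sorted(self._data):
            if key.startswith(prefix):
                yield key, self._data[key]


class Table:
    """Upper layer: rows with named columns, encoded into the byte store."""

    def __init__(self, store: ByteStore, name: str, primary_key: str):
        self.store = store
        self.prefix = f"{name}/".encode()
        self.primary_key = primary_key

    def _key(self, pk) -> bytes:
        return self.prefix + str(pk).encode()

    def insert(self, row: dict) -> None:
        self.store.put(self._key(row[self.primary_key]), json.dumps(row).encode())

    def get(self, pk):
        raw = self.store.get(self._key(pk))
        return None if raw is None else json.loads(raw)

    def select(self, where=lambda row: True):
        rows = []
        for _, raw in self.store.scan(self.prefix):
            row = json.loads(raw)
            if where(row):
                rows.append(row)
        return rows


store = ByteStore()
users = Table(store, "users", "id")
users.insert({"id": 1, "name": "alice", "age": 34})
users.insert({"id": 2, "name": "bob", "age": 27})
over_30 = users.select(lambda row: row["age"] > 30)
```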
| tomhallett wrote:
| 100% agreed. TIL that mysql uses RocksDB under the hood.
|
| Here's another example of a realtime database which uses
| RocksDB under the hood:
| https://rockset.com/blog/how-we-use-rocksdb-at-rockset/
| eatonphil wrote:
| As far as I'm aware, MySQL does not use RocksDB under the
| hood by default. MyRocks is a distribution of MySQL that uses
| RocksDB.
| moralestapia wrote:
| Yeah, weird comment from GP. By the time RocksDB was born,
| MySQL was already going to prom.
| ruw1090 wrote:
| Close, but in database years it was actually already in
| its mid life crisis.
| icelancer wrote:
| Only if you configure it that way. Same as MyISAM/InnoDB/etc.
| [deleted]
| lcnPylGDnU4H9OF wrote:
| Actually, this helps a lot. I'd never heard of RocksDB and I'm
| barely familiar with InnoDB and hopefully I am not wrong to
| compare the two.
| jeffbee wrote:
| I think the use of bare RocksDB is more common than the use of
| MyRocks.
| NetOpWibby wrote:
| You should add RethinkDB! I moved to it from MongoDB years ago.
| jeffbee wrote:
| RethinkDB is utterly defunct as a project, has not had a
| substantive release in years, and in my experience just flat
| out doesn't work. And let's don't even discuss Mongo. Asking
| yourself to choose between these is like selecting your
| favorite brand of thumbtack to step on.
| gqewogpdqa wrote:
| Lol. When did you last use MongoDB and why is it a thumbtack?
| NetOpWibby wrote:
| RethinkDB still works well for me /shrug
| orthecreedence wrote:
| Are you still using it? How is the pace going on the community-
| supported version? I stopped using it after the company folded,
| but I do kind of miss it. Definitely one of the more
| interesting designs, and light years beyond what MongoDB was at
| the time.
| NetOpWibby wrote:
| I'm definitely still using it, via rethinkdb-ts (npm
| package). I even forked it to make it work with Deno.
|
| The built-in Data Explorer is a must-have for me and idk of
| any other database that has something similar.
| eis wrote:
| There are plenty of data explorers for other databases,
| especially SQL DBs. I don't think it being built into the
| DB should be a make-it-or-break-it feature.
|
| I used RethinkDB back in the day because it was the first
| DB that had pretty good replication and sharding - it was
| zero effort. I found the functional programming model
| strange: some stuff got executed locally, other parts
| remotely, and it was not very straightforward when things
| didn't go as planned.
|
| By the time the RethinkDB company folded, CockroachDB
| emerged and has been my go-to distributed DB since.
| eatonphil wrote:
| No I don't think that's relevant. They implement their own
| btree it seems [0].
|
| They don't use a key-value store library.
|
| I know it's a bit of a fine line. But I'm talking about
| standalone libraries people embed across different
| applications/databases. That's what RocksDB/LevelDB/Pebble are.
|
| [0]
| https://github.com/rethinkdb/rethinkdb/tree/v2.4.x/src/btree
| tristan957 wrote:
| HSE[0] is another storage engine to throw on the pile.
|
| [0]: https://github.com/hse-project/hse
| NonNefarious wrote:
| The term is "key/value."
| mdzn wrote:
| The article says that Consul or etcd are designed to always be
| up, but it's actually quite the opposite. They both leverage Raft
| for maintaining consensus and thus optimize for consistency at
| the cost of availability in case of network partitions. See CAP
| theorem.
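|
| The trade-off can be shown with simple quorum arithmetic. This is a
| sketch of the majority rule that Raft relies on, not code from Consul
| or etcd, and the function names are made up: a write commits only if
| a majority of the cluster is reachable, so the minority side of a
| network partition stops accepting writes instead of diverging.

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


def can_commit(cluster_size: int, reachable: int) -> bool:
    """A write commits only if a majority of nodes can acknowledge it."""
    return reachable >= quorum(cluster_size)


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can be lost while the cluster still accepts writes."""
    return cluster_size - quorum(cluster_size)


# A 5-node cluster split 3/2 by a network partition:
majority_side = can_commit(5, 3)
minority_side = can_commit(5, 2)
```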
| cloudhead wrote:
| All distributed databases are designed to "always be up",
| that's the point of making them distributed, otherwise a single
| instance is fine.
| morelisp wrote:
| There are reasons to distribute DBs that do not need to be up
| constantly, e.g. distributing work (transactions or queries)
| across more resources than are available on one machine; or
| to bring a replica closer to some other service to reduce
| latency.
|
| Kafka Streams is the first kind; the source-of-truth storage
| is HA (as HA as the Kafka topics it's backed with at least)
| but can only be queried with high consistency when the
| consumer is active, and it goes down for rebalances when you
| scale out or fail over (and in many operational setups also
| when you upgrade).
|
| For an example of the second kind, see Fly.io's Litestream
| explanation:
| https://fly.io/blog/all-in-on-sqlite-litestream/.
|
| That being said, I think the etcd etc. examples are just
| meant to be in contrast to stock Redis or Memcache, which
| offer very little HA support, generally just failover with
| minimal consistency guarantee.
| kefir wrote:
| Apache Ignite 3 also uses RocksDB as a pluggable storage engine:
| https://www.gridgain.com/resources/blog/apache-ignite-3-alph...
| eatonphil wrote:
| Thanks! Adding this.
___________________________________________________________________
(page generated 2022-08-23 23:00 UTC)