[HN Gopher] Umbra: A Disk-Based System with In-Memory Performanc...
___________________________________________________________________
Umbra: A Disk-Based System with In-Memory Performance [pdf]
Author : itunpredictable
Score : 31 points
Date : 2024-05-02 16:06 UTC (6 hours ago)
(HTM) web link (www.cidrdb.org)
(TXT) w3m dump (www.cidrdb.org)
| epistasis wrote:
| This is a Database System, if you're checking the comments to
| understand what type of system this is about. The paper appears
| in _10th Annual Conference on Innovative Data Systems Research_ ,
| and appearing in that context makes it clear.
| hinkley wrote:
| I still maintain that the existence of in memory databases has
| two main sources: scalability bottlenecks in GC, and storage
| latency falling behind network latency and staying there.
|
| If general purpose programming languages could store the data
| efficiently in main memory, the feature set of in memory
| databases is not so high that you can't roll your own
| incrementally. But your GC times are going to go nuts, and you'll
| go off the rails.
|
| If the speed of light governed data access, you'd collect your
| data locally and let the operating system decide which hot paths
| to keep in memory versus storage.
|
| The last time network was faster than disk was the 1980s, and we
| got things like process migration systems (Sprite). Those
| evaporated once the pendulum swung back.
| slaymaker1907 wrote:
| It depends on what is meant by in-memory database. The most
| useful kind IMO is the one which actually saves everything to
| disk, but is not designed like a traditional RDBMS in that it
| assumes everything, including indices, can be saved in memory.
| Therefore, you don't need a complicated buffer pool system and
| you don't need to touch disk at all after startup to service
| read queries. The simplest approach to such a database is
| just to MMAP a file.
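|
| A minimal sketch of the "just mmap a file" idea, assuming a Unix-like
| OS and Go's standard syscall package. The names mapFile, roundTrip and
| demoPath are hypothetical and exist only for this illustration; a
| real system would also need msync for durability and handling for
| I/O faults, neither of which is shown here.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// mapFile (hypothetical helper) creates or opens path, sizes it to size
// bytes, and maps it read/write with MAP_SHARED so that stores to the
// returned slice reach the file through the OS page cache.
func mapFile(path string, size int) ([]byte, func() error, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		return nil, nil, err
	}
	defer f.Close() // the mapping remains valid after the descriptor is closed
	if err := f.Truncate(int64(size)); err != nil {
		return nil, nil, err
	}
	data, err := syscall.Mmap(int(f.Fd()), 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
	if err != nil {
		return nil, nil, err
	}
	unmap := func() error { return syscall.Munmap(data) }
	return data, unmap, nil
}

// roundTrip writes msg through the mapping, unmaps it, and then reads
// the same bytes back with ordinary file I/O.
func roundTrip(path, msg string) (string, error) {
	data, unmap, err := mapFile(path, 4096)
	if err != nil {
		return "", err
	}
	copy(data, msg) // plain memory write, no explicit write() call
	if err := unmap(); err != nil {
		return "", err
	}
	raw, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	return string(raw[:len(msg)]), nil
}

// demoPath returns a scratch file location for the demo.
func demoPath() string {
	return os.TempDir() + "/mmap-sketch.db"
}

func main() {
	path := demoPath()
	defer os.Remove(path)
	got, err := roundTrip(path, "hello")
	fmt.Println(got, err)
}
```

| After startup, reads are ordinary slice accesses and the OS decides
| which pages stay resident, which is the property described above.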
|
| This kind of workload is probably the most common in all of
| software development for the past couple of decades given how
| plentiful RAM is as well as most applications having some need
| for storing persistent data.
| hinkley wrote:
| I always feel a little weird using memcached because it has
| never once crashed on us but when it goes down we have a bad
| time with circuit breakers.
|
| We only have problems with memcached when we create them
| ourselves. Disk backing store would soften that considerably.
| lanstin wrote:
| memcached type API with RocksDB backing store is pretty
| good. Honestly, at this point hasn't everyone written some
| in memory DB with various methods to persist and various
| consistency models as a result? At this point the magic is
| in client routing to the appropriate shard without having
| to redeploy to change the shard configuration and to allow
| multi-remote callers to access the data and still get
| access without disk load or mutex/locking around the data.
|
| I have a thing where the shards each have sub-shards in
| process and 1 go-routine per sub shard; communication
| to/from the remote callers is via channels to a per-request
| go-routine (or whatever it is gRPC does) and the main
| sub-shard go-routine has no locking on itself. Just a big hash
| map and a DLL to implement an LRU so I have a hard cap on
| memory usage, and no allocations for lookups or mutations
| (just creations).
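|
| A rough sketch of that layout, under stated assumptions: one
| goroutine owns a hash map plus a doubly-linked list (Go's
| container/list) acting as an LRU, and callers reach it only through
| a channel, so the cache itself needs no mutex. All names here (lru,
| request, result, runSubShard) are hypothetical, the cap is counted
| in entries rather than bytes, and this version does not attempt to
| avoid allocations on the request path.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is the payload stored in each list element.
type entry struct {
	key   string
	value []byte
}

// lru is a capacity-bounded cache: a hash map for lookup plus a
// doubly-linked list ordered from most to least recently used.
// It has no mutex because only one goroutine is meant to touch it.
type lru struct {
	capacity int
	items    map[string]*list.Element
	order    *list.List
}

func newLRU(capacity int) *lru {
	return &lru{capacity: capacity, items: make(map[string]*list.Element), order: list.New()}
}

// get returns the value and marks the key as most recently used.
func (c *lru) get(key string) ([]byte, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// put inserts or updates a key and evicts the least recently used
// entry once the capacity is exceeded.
func (c *lru) put(key string, value []byte) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

// request is one operation sent to the goroutine that owns a sub-shard.
// A nil value means "get"; a non-nil value means "put".
type request struct {
	key   string
	value []byte
	reply chan result
}

type result struct {
	value []byte
	ok    bool
}

// runSubShard owns its cache exclusively and serves requests one at a
// time until the channel is closed.
func runSubShard(capacity int, reqs <-chan request) {
	cache := newLRU(capacity)
	for r := range reqs {
		if r.value != nil {
			cache.put(r.key, r.value)
			r.reply <- result{ok: true}
			continue
		}
		v, ok := cache.get(r.key)
		r.reply <- result{value: v, ok: ok}
	}
}

func main() {
	reqs := make(chan request)
	go runSubShard(2, reqs)
	reply := make(chan result, 1)
	reqs <- request{key: "a", value: []byte("1"), reply: reply}
	<-reply
	reqs <- request{key: "a", reply: reply}
	r := <-reply
	fmt.Println(string(r.value), r.ok)
}
```

| Serializing every operation through one goroutine trades some
| per-key parallelism for a data structure with no locks at all.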
| anotherguy0 wrote:
| Since you mentioned MMAP: "Are You Sure You Want to Use MMAP
| in Your Database Management System?"
| https://www.cidrdb.org/cidr2022/papers/p13-crotty.pdf
| riku_iki wrote:
| The dataset was 20x larger than available RAM in that
| case, so it makes sense that OS caching was useless and
| only induced overhead.
|
| Another potential issue was that they compared their mmap
| code to fio O_DIRECT code, which is not really a clean
| experiment; fio itself could just be much more optimized.
| avmich wrote:
| I suspect John Ousterhout
| https://web.stanford.edu/~ouster/cgi-bin/papers/ramcloud.pdf
| would have other ideas.
| CurtHagenlocher wrote:
| You can see additional papers from the same group at
| https://umbra-db.com/#publications
| VHRanger wrote:
| But no github!
| sakras wrote:
| Unfortunately this is standard for TUM's database group.
| Their previous database, HyPer, was similarly cutting-edge,
| but was closed source under a proprietary license. Last I
| heard it got sold to Tableau.
| iamlucaswolf wrote:
| Umbra was recently spun out as CedarDB [1].
|
| And Hyper is alive and well at Salesforce/Tableau! The team
| working on it is still in large parts the original Hyper
| team from TUM. You can actually download Hyper (as a binary
| with language bindings) and play around with it [2] for
| non-commercial use cases.
|
| If you think Hyper/Umbra is cool, the TUM database group
| has lots of other very interesting projects going on at the
| moment. LingoDB [3] pushes the database-as-a-compiler idea
| to the extreme by implementing query optimization and query
| compilation in MLIR. LingoDB is open-source. Also Viktor Leis,
| who stands behind (among many other things) Hyper's Morsel
| scheduling and ART indexes as well as Umbra's buffer
| management, recently started a very
| interesting project [4] to heavily co-design the DBMS
| together with the OS in a unikernel approach. Really
| interesting stuff!
|
| Disclaimer: I work on Hyper. Views are my own.
|
| [1]: https://cedardb.com/
| [2]: https://tableau.github.io/hyper-db/docs/
| [3]: https://www.lingo-db.com/
| [4]: https://www.cs.cit.tum.de/dis/research/cumulus/
| beoberha wrote:
| Obligatory link to Neumann's presentation for the CMU DB lecture
| series
|
| https://m.youtube.com/watch?v=pS2_AJNIxzU
| zX41ZdbW wrote:
| Umbra in ClickBench:
| https://github.com/ClickHouse/ClickBench/pull/161
|
| The initial submission didn't reproduce successfully due to a
| segmentation fault in an attempt to restart it after data
| loading. But after some changes, it started to work and showed
| exceptionally good results.
| threeseed wrote:
| Benchmarks: https://benchmark.clickhouse.com
|
| So compared against PostgreSQL and MariaDB, it is definitely
| significantly faster.
|
| On par with lower-end Snowflake.
___________________________________________________________________
(page generated 2024-05-02 23:00 UTC)