[HN Gopher] Umbra: A Disk-Based System with In-Memory Performanc...
       ___________________________________________________________________
        
       Umbra: A Disk-Based System with In-Memory Performance [pdf]
        
       Author : itunpredictable
       Score  : 31 points
       Date   : 2024-05-02 16:06 UTC (6 hours ago)
        
 (HTM) web link (www.cidrdb.org)
 (TXT) w3m dump (www.cidrdb.org)
        
       | epistasis wrote:
       | This is a Database System, if you're checking the comments to
       | understand what type of system this is about. The paper appears
       | in _10th Annual Conference on Innovative Data Systems Research_ ,
       | and appearing in that context makes it clear.
        
       | hinkley wrote:
       | I still maintain that the existence of in memory databases has
       | two main sources: scalability bottlenecks in GC, and storage
       | latency falling behind network latency and staying there.
       | 
       | If general purpose programming languages could store the data
       | efficiently in main memory, the feature set of in memory
       | databases is not so high that you can't roll your own
       | incrementally. But your GC times are going to go nuts, and you'll
       | go off the rails.
       | 
       | If the speed of light governed data access, you'd collect your
       | data locally and let the operating system decide which hot paths
       | to keep in memory versus storage.
       | 
       | The last time network was faster than disk was the 1980s, and we
       | got things like process migration systems (Sprite). Those
       | evaporated once the pendulum swung back.
        
         | slaymaker1907 wrote:
         | It depends on what is meant by in-memory database. The most
         | useful kind IMO is the one which actually saves everything to
         | disk, but is not designed like a traditional RDBMS in that it
         | assumes everything, including indices, can be saved in memory.
         | Therefore, you don't need a complicated buffer pool system and
         | you don't need to touch disk at all after startup to service
         | read queries. The simplest approach to such a database is
         | just to MMAP a file.
         | 
         | This kind of workload is probably the most common in all of
         | software development for the past couple of decades given how
         | plentiful RAM is as well as most applications having some need
         | for storing persistent data.
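
The "just MMAP a file" approach above can be sketched as follows. This is a minimal illustration, not anything taken from Umbra or the linked paper: the file name `table.bin`, the fixed-width int64 record layout, and the helpers `write_records` and `read_record` are all hypothetical. It assumes the whole file fits comfortably in RAM, which is the case the comment describes.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout: one little-endian signed 64-bit integer per record.
RECORD = struct.Struct("<q")


def write_records(path, values):
    """Persist the values as fixed-width records in a flat file."""
    with open(path, "wb") as f:
        for v in values:
            f.write(RECORD.pack(v))


def read_record(mm, index):
    """Decode record `index` straight out of the mapping, with no read() call."""
    return RECORD.unpack_from(mm, index * RECORD.size)[0]


# Build a tiny "table" in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "table.bin")
write_records(path, [10, 20, 30, 40])

with open(path, "rb") as f:
    # Map the whole file read-only. There is no buffer pool here: the OS page
    # cache decides which pages stay resident after they are first touched.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    third = read_record(mm, 2)
    count = len(mm) // RECORD.size
    mm.close()
```

Once the pages are resident, lookups like `read_record(mm, 2)` are served from memory. The trade-off is that the OS, not the database, controls eviction and write-back.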
        
           | hinkley wrote:
           | I always feel a little weird using memcached because it has
           | never once crashed on us but when it goes down we have a bad
           | time with circuit breakers.
           | 
           | We only have problems with memcached when we create them
           | ourselves. Disk backing store would soften that considerably.
        
             | lanstin wrote:
             | memcached type API with RocksDB backing store is pretty
             | good. Honestly, at this point hasn't everyone written some
             | in memory DB with various methods to persist and various
             | consistency models as a result? At this point the magic is
             | in client routing to the appropriate shard without having
             | to redeploy to change the shard configuration and to allow
             | multi-remote callers to access the data and still get
             | access without disk load or mutex/locking around the data.
             | 
             | I have a thing where the shards each have sub-shards in
             | process and 1 go-routine per sub-shard; communication
             | to/from the remote callers is via channels to a per-request
             | go-routine (or whatever it is gRPC does) and the main
             | sub-shard go-routine has no locking on itself. Just a big hash
             | map and a DLL to implement an LRU so I have a hard cap on
             | memory usage, and no allocations for lookups or mutations
             | (just creations).
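
The "big hash map and a DLL to implement an LRU" with a hard memory cap can be sketched like this. The comment describes a Go implementation with one go-routine per sub-shard. The sketch below is in Python and models only the data structure that a single owner would hold, so it needs no locking. The names `Node` and `LRUCache` are hypothetical, and the cap is expressed as an entry count rather than bytes, which is an assumption.

```python
class Node:
    """Doubly linked list node; only created when a new key is inserted."""
    __slots__ = ("key", "value", "prev", "next")

    def __init__(self, key=None, value=None):
        self.key = key
        self.value = value
        self.prev = None
        self.next = None


class LRUCache:
    """Hash map for O(1) lookup plus a doubly linked list for recency order."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.table = {}
        self.head = Node()  # sentinel on the most recently used side
        self.tail = Node()  # sentinel on the least recently used side
        self.head.next = self.tail
        self.tail.prev = self.head

    def _unlink(self, node):
        node.prev.next = node.next
        node.next.prev = node.prev

    def _push_front(self, node):
        node.next = self.head.next
        node.prev = self.head
        self.head.next.prev = node
        self.head.next = node

    def get(self, key, default=None):
        node = self.table.get(key)
        if node is None:
            return default
        # A hit only relinks an existing node; no new node is created.
        self._unlink(node)
        self._push_front(node)
        return node.value

    def put(self, key, value):
        node = self.table.get(key)
        if node is not None:
            # Mutation of an existing key reuses its node.
            node.value = value
            self._unlink(node)
            self._push_front(node)
            return
        if len(self.table) >= self.capacity:
            # Enforce the hard cap by evicting the least recently used entry.
            victim = self.tail.prev
            self._unlink(victim)
            del self.table[victim.key]
        node = Node(key, value)
        self.table[key] = node
        self._push_front(node)


cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # touch "a", so "b" becomes least recently used
cache.put("c", 3)  # cache is full, so "b" is evicted
```

In the design the comment describes, each sub-shard would own one such structure and receive requests over a channel, which is why the structure itself never needs a mutex.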
        
           | anotherguy0 wrote:
           | Since you mentioned MMAP: "Are You Sure You Want to Use MMAP
           | in Your Database Management System?"
           | https://www.cidrdb.org/cidr2022/papers/p13-crotty.pdf
        
             | riku_iki wrote:
             | The dataset was 20x larger than available RAM in that
             | case, so it makes sense that OS caching was useless and
             | only induced overhead.
             | 
             | Another potential issue was that they compared their mmap
             | code to fio with O_DIRECT, which is not a clean
             | experiment: fio could simply be much better optimized.
        
         | avmich wrote:
         | I suspect John Ousterhout would have other ideas:
         | https://web.stanford.edu/~ouster/cgi-bin/papers/ramcloud.pdf
        
       | CurtHagenlocher wrote:
       | You can see additional papers from the same group at
       | https://umbra-db.com/#publications
        
         | VHRanger wrote:
         | But no github!
        
           | sakras wrote:
           | Unfortunately this is standard for TUM's database group.
           | Their previous database, HyPer, was similarly cutting-edge,
           | but was closed source under a proprietary license. Last I
           | heard it got sold to Tableau.
        
             | iamlucaswolf wrote:
             | Umbra was recently spun out as CedarDB [1].
             | 
             | And Hyper is alive and well at Salesforce/Tableau! The team
             | working on it is still in large parts the original Hyper
             | team from TUM. You can actually download Hyper (as a binary
             | with language bindings) and play around with it [2] for
             | non-commercial use cases.
             | 
             | If you think Hyper/Umbra is cool, the TUM database group
             | has lots of other very interesting projects going on at the
             | moment. LingoDB [3] pushes the database-as-a-compiler idea
             | to the extreme by implementing query optimization and
             | query compilation in MLIR. LingoDB is open-source.
             | Also, Viktor Leis, who stands behind (among many
             | other things) Hyper's Morsel scheduling and ART indexes as
             | well as Umbra's buffer management, recently started a very
             | interesting project [4] to heavily co-design the DBMS
             | together with the OS in a unikernel approach. Really
             | interesting stuff!
             | 
             | Disclaimer: I work on Hyper. Views are my own.
             | 
             | [1]: https://cedardb.com/
             | [2]: https://tableau.github.io/hyper-db/docs/
             | [3]: https://www.lingo-db.com/
             | [4]: https://www.cs.cit.tum.de/dis/research/cumulus/
        
       | beoberha wrote:
       | Obligatory link to Neumann's presentation for the CMU DB lecture
       | series
       | 
       | https://m.youtube.com/watch?v=pS2_AJNIxzU
        
       | zX41ZdbW wrote:
       | Umbra in ClickBench:
       | https://github.com/ClickHouse/ClickBench/pull/161
       | 
       | The initial submission didn't reproduce successfully due to a
       | segmentation fault in an attempt to restart it after data
       | loading. But after some changes, it started to work and showed
       | exceptionally good results.
        
         | threeseed wrote:
         | Benchmarks: https://benchmark.clickhouse.com
         | 
         | So compared against PostgreSQL and MariaDB, it is definitely
         | significantly faster.
         | 
         | On par with lower-end Snowflake.
        
       ___________________________________________________________________
       (page generated 2024-05-02 23:00 UTC)