[HN Gopher] Differential storage: A key building block for a Duc...
___________________________________________________________________
Differential storage: A key building block for a DuckDB-based data
warehouse
Author : vgt
Score : 83 points
Date : 2024-03-12 16:15 UTC (6 hours ago)
(HTM) web link (motherduck.com)
(TXT) w3m dump (motherduck.com)
| richieartoul wrote:
| This is pretty cool. I'm a bit of a noob when it comes to stuff
| like FUSE. There is a bunch of commentary in the blog post about
| how when DuckDB does X, the Differential Storage implementation
| does Y. If the differential storage implementation just exposes
| itself as a filesystem though, how does it know what DuckDB is
| doing at the application layer? For example, how does it know
| from the filesystem layer that DuckDB is performing a snapshot?
| vegetable26 wrote:
| edit: author here
|
| Our current FUSE implementation is specific to DuckDB and does
| make (as well as validate) some assumptions about DuckDB's
| access patterns to the underlying files. In this case - we know
| that currently DuckDB always truncate the WAL on successful
| checkpoint - which triggers a Differential Storage snapshot
| under the hood.
|
| We are working on future-proofing this setup by removing our
| reliance on some of these assumptions and having our FUSE
| implementation act much more like a generic FS moving forward.
|
| Though in general the truncation of the WAL is still usually a
| good time to snapshot the database, as the truncation means
| that the only "state" needed to reconstruct the current
| database is completely captured in the database file.
| richieartoul wrote:
| Makes sense, very cool. Thanks!
| skadamat wrote:
| Curious why FUSE over NFS?
| zilti wrote:
| Performance
| klysm wrote:
| I absolutely hate working with NFS - the impedance mismatch of
| what it does with the API it provides is terrible
| cmollis wrote:
| sounds a bit like what Iceberg does with writes on parquet.
| vegetable26 wrote:
| edit: author here
|
| Yah that's a good point. We use a similar copy-on-write
| implementation as many other storage systems (such as Iceberg)
| that offer time-travel, branching, etc... where you have
| snapshots represented by pointers to immutable snapshot-layer
| or delta files (or however else you want to call them).
|
| In our case, we wanted to provide customers with a very
| similar/seamless experience of using DuckDB on MotherDuck as
| they would with local DuckDB. To provide this parity we needed
| the ability to extend DuckDB's native storage to support these
| capabilities (time-travel, branching, etc...). This led us to
| implementing Differential Storage.
| conradev wrote:
| I'm curious how their snapshotting and object storage
| implementation stacks up against SQLite's offering:
|
| https://sqlite.org/cloudsqlite/doc/trunk/www/index.wiki
|
| SQLite's object-per-page approach means you can fetch portions of
| the database as needed from a bucket, whereas the DuckDB
| snapshotting approach seems to imply that you'd need to fetch the
| whole database.
| TheGuyWhoCodes wrote:
| I did something similar when I worked on an internal timeseries
| DB with optimization for long term and short term storage.
|
| The EFS thing is a bit tricky. Once your credits are done it's
| super slow, so I bet they do provisioned but that's expensive.
|
| The other thing was that multiple instances trying to reach the
| same EFS endpoint at the same time ,for the first time, will fail
| so you need to stagger it or retry and the docker mount plugin
| was a bit iffy.
|
| That said cool concept.
___________________________________________________________________
(page generated 2024-03-12 23:01 UTC)