[HN Gopher] Craziest thing I ever used SQLite for: partial file ...
___________________________________________________________________
Craziest thing I ever used SQLite for: partial file deduplication
Author : ics
Score : 194 points
Date : 2023-03-26 18:00 UTC (4 hours ago)
(HTM) web link (sqlite.org)
(TXT) w3m dump (sqlite.org)
| hultner wrote:
| Very cool and probably a fun exercise, but I would probably put
| the data on a ZFS volume with dedupe instead, which, from reading
| this implementation, is pretty much the same thing from my
| layman's perspective. I could also add compression to the same
| dataset.
| traceroute66 wrote:
| > I would probably put the data on a ZFS volume with dedupe
| instead
|
| But doesn't ZFS dedupe eat RAM for breakfast? Double-digit GB of
| RAM per TB of data, IIRC?
| giantrobot wrote:
| It's usually along the lines of 1GB per TB. Some factors can
| affect that number, but I've found it pretty accurate. Note
| that's 1GB of RAM that ends up wired to ZFS and never usable
| by the rest of the system.
| tripleo1 wrote:
| ZFS is fun, but it sometimes locks up for unexplained reasons
| [deleted]
| rurban wrote:
| Using a proper hash table with a bloom filter would save you the
| useless pass through a b-tree, though. Much faster and much less
| memory.
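|
| A rough sketch of that approach in Python (hypothetical names, not
| the article's code; in a real run the exact table could live on
| disk, with the Bloom filter screening out most lookups into it):
|
|     import hashlib
|
|     class Bloom:
|         def __init__(self, m=1 << 24, k=3):
|             self.m, self.k, self.bits = m, k, bytearray(m // 8)
|         def _positions(self, data):
|             for i in range(self.k):
|                 h = hashlib.blake2b(data, digest_size=8,
|                                     salt=bytes([i])).digest()
|                 yield int.from_bytes(h, "big") % self.m
|         def add(self, data):
|             for p in self._positions(data):
|                 self.bits[p // 8] |= 1 << (p % 8)
|         def maybe_contains(self, data):
|             return all(self.bits[p // 8] & (1 << (p % 8))
|                        for p in self._positions(data))
|
|     def dedupe_blocks(paths, block_size=4096):
|         bloom, seen, dupes = Bloom(), {}, []
|         for path in paths:
|             with open(path, "rb") as f:
|                 for i, block in enumerate(
|                         iter(lambda: f.read(block_size), b"")):
|                     digest = hashlib.sha256(block).digest()
|                     # The filter answers "definitely new" cheaply;
|                     # only possible repeats hit the exact table.
|                     if bloom.maybe_contains(digest) and digest in seen:
|                         dupes.append((path, i, seen[digest]))
|                     else:
|                         bloom.add(digest)
|                         seen[digest] = (path, i)
|         return dupes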
| pornel wrote:
| With a b-tree you can do even better. Instead of hashing entire
| files ahead of time, you can avoid hashing more than the
| minimum prefix required to conclude a file is unique, by
| building the b-tree lazily. I don't think it can be done with
| sqlite though, as it requires hashing the files while
| traversing the b-tree.
|
| https://lib.rs/crates/dupe-krill#readme-nerding-out-about-th...
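|
| A simplified sketch of the lazy idea (my reading, not dupe-krill's
| actual data structure): partition candidates by size, then keep
| splitting each group by the hash of the next chunk; a file that
| ends up alone in its group never needs the rest of it hashed.
|
|     import hashlib, os
|     from collections import defaultdict
|
|     CHUNK = 1 << 16  # 64 KiB
|
|     def lazy_duplicate_groups(paths):
|         by_size = defaultdict(list)
|         for p in paths:
|             by_size[os.path.getsize(p)].append(p)
|         # Only same-sized files can possibly be duplicates.
|         pending = [(g, 0) for g in by_size.values() if len(g) > 1]
|         duplicates = []
|         while pending:
|             group, offset = pending.pop()
|             buckets = defaultdict(list)
|             for p in group:
|                 with open(p, "rb") as f:
|                     f.seek(offset)
|                     chunk = f.read(CHUNK)
|                 key = hashlib.sha256(chunk).hexdigest() if chunk else "eof"
|                 buckets[key].append(p)
|             for key, bucket in buckets.items():
|                 if len(bucket) < 2:
|                     continue                   # unique prefix: stop hashing
|                 elif key == "eof":
|                     duplicates.append(bucket)  # identical all the way through
|                 else:
|                     pending.append((bucket, offset + CHUNK))
|         return duplicates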
| [deleted]
| JackSlateur wrote:
| Looks like https://github.com/markfasheh/duperemove
| philsnow wrote:
| Oh, hey Mark
|
| (Not (purely) a reference to the "oh hey Mark" meme; I'm
| vaguely acquainted with Mark from LUG days)
| [deleted]
| evmar wrote:
| Coincidentally, I just saw that the "USENIX Test of Time Award"
| for 2023 went to an analysis of real-world files for deduplication
| purposes. I found Figure 1 particularly interesting in that, in
| practice, partial deduplication didn't save much over whole-file
| deduplication.
|
| https://www.usenix.org/conferences/test-of-time-awards
| emmelaich wrote:
| You'd expect that in general, but for developers updating
| content, partial deduplication would often win.
| anyfoo wrote:
| Oh sweet. I've definitely used sqlite (in a zsh script) for
| _file_ deduplication. Very simple, mostly just rows consisting of
| a path and a file content hash.
|
| But _partial_ file deduplication is something else...
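|
| A minimal sketch of that kind of whole-file pass (Python rather
| than zsh, hypothetical table and column names):
|
|     import hashlib, os, sqlite3
|
|     def find_duplicate_files(root, db="dedupe.db"):
|         con = sqlite3.connect(db)
|         con.execute("CREATE TABLE IF NOT EXISTS files "
|                     "(path TEXT PRIMARY KEY, hash TEXT)")
|         for dirpath, _, names in os.walk(root):
|             for name in names:
|                 path = os.path.join(dirpath, name)
|                 # (For huge files you'd feed the hash in chunks.)
|                 with open(path, "rb") as f:
|                     digest = hashlib.sha256(f.read()).hexdigest()
|                 con.execute("INSERT OR REPLACE INTO files VALUES (?, ?)",
|                             (path, digest))
|         con.commit()
|         # Any hash that appears more than once marks a duplicate set.
|         return con.execute(
|             "SELECT hash, group_concat(path, ':') FROM files "
|             "GROUP BY hash HAVING count(*) > 1").fetchall()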
| [deleted]
| MithrilTuxedo wrote:
| This sounds like it could be bolted onto/into rsync on the server
| side to present filesystems larger than the server can actually
| store.
| fbdab103 wrote:
| Does anyone else have any other unorthodox use cases? I love
| SQLite, and am always happy to ram this square peg into a round
| hole.
| mastax wrote:
| That's really cool. There are a bunch of tools that will let you
| symlink or hard link deduplicate files, but being able to do
| block-level dedupes simply by leaning on the filesystem is nice.
|
| It sometimes feels like games are made to thwart this type of
| thing. They often use packfiles, basically filesystems within
| files optimized to look up assets quickly. They may also have
| allowed an optimized data layout back when consoles had slow
| spinning hard drives. The upshot is that a tiny patch inserting a
| line of code in a script may shift hundreds of megabytes of other
| data in the packfiles, causing the block hashes to no longer
| match up. Do any filesystems model inserts in some way?
| I'm pretty sure Steam updates can handle situations like that. I
| frequently see updates which download a tiny amount (kilobytes)
| but write a huge amount to disk (gigabytes), and I can't think of
| any other cause (assuming developers aren't using hilariously
| uncompressed assets).
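|
| The offset problem is easy to see with fixed-size block hashes
| (toy numbers, nothing to do with Steam's actual chunking):
|
|     import hashlib, os
|
|     def block_hashes(data, block=4096):
|         return [hashlib.sha256(data[i:i + block]).hexdigest()
|                 for i in range(0, len(data), block)]
|
|     original = os.urandom(1 << 20)          # a 1 MiB "packfile"
|     patched = original[:100] + b"tiny patch\n" + original[100:]
|
|     a, b = block_hashes(original), block_hashes(patched)
|     same = sum(x == y for x, y in zip(a, b))
|     # Typically prints 0 of 256: an 11-byte insert shifted every block.
|     print(f"{same} of {len(a)} blocks unchanged")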
| mikepurvis wrote:
| Yes, Steamworks encourages developers to align assets to
| certain boundaries to help ensure efficient patching later; see
| the documentation here:
| https://partner.steamgames.com/doc/sdk/uploading
| mastax wrote:
| Interesting, it looks like Steam doesn't have any special
| handling for inserts/offsets and just works on 1MB blocks. The
| extra disk I/O probably just comes from rewriting large files:
|
| > Next - to update the pack file on a client device,
| SteamPipe builds the new version alongside the old version.
| When all new files are built, it then "commits" the update by
| deleting old files and moving the new files in. What this
| means is that to update a 25 GB pack file, SteamPipe will
| always build a new, 25GB file. If the update only requires 10
| bytes of changes to that file, SteamPipe will need to copy
| almost the entire 25GB from the old file to the new file.
| Depending on the client storage hardware, this can be a very
| slow process.
| bob1029 wrote:
| I think SQLite + WAL could be a compelling solution for the
| game patching scenario.
|
| There are already SQLite clustering technologies that
| utilize WAL frame shipping in order to replicate data. I
| don't see why a Steam patch couldn't just be a range of WAL
| frames to be applied to a client's database.
| beecafe wrote:
| What about using content-defined chunking? Then inserting a few
| lines shouldn't cause all the subsequent chunks to shift.
| viraptor wrote:
| That's not quite right. The contents of the file would still
| shift. You'd still be able to deduplicate most chunks between
| multiple files... but only in your copies, because the
| filesystem still uses constant-size chunks.
| riceart wrote:
| But I think they're talking about doing content-defined chunking
| of a pack or archive file - inserting data should only affect
| the chunks containing the insertion, plus one. And since it's
| content-defined, those chunks are necessarily not constant
| size.
|
| I.e., this is how backup tools like Arq or Duplicacy work.
| panzi wrote:
| Some games just add an update pak that overrides the changed
| files, without changing the original pak file. So the installation
| size of such games might grow a lot on updates, wasting a lot of
| space by keeping the old files, but at least the files are
| immutable and rollbacks of updates are theoretically possible.
| sgtnoodle wrote:
| That's neat! In retrospect, did the database make the problem
| more tractable from an algorithmic or practical standpoint, or
| was it mostly just using the tools you're familiar with? If I
| were to approach the same problem, I likely would have tried to
| keep all the data in RAM and serialize it out to a file for
| incremental runs.
| nonethewiser wrote:
| I have used SQLite to deduplicate CSVs when they couldn't all fit
| into memory.
|
| I imagine Dask would be a better tool for this job, but I wasn't
| familiar with it at the time, and SQLite let me deduplicate
| using Python with no external dependencies.
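|
| A minimal sketch of that pattern (hypothetical file and table
| names): rows stream through an on-disk table with a uniqueness
| constraint, so memory use stays flat however big the CSV is.
|
|     import csv, sqlite3
|
|     def dedupe_csv(src, dst, scratch="scratch.db"):
|         con = sqlite3.connect(scratch)        # on disk, not in RAM
|         con.execute("CREATE TABLE IF NOT EXISTS seen "
|                     "(row TEXT PRIMARY KEY)")
|         with open(src, newline="") as fin, \
|              open(dst, "w", newline="") as fout:
|             reader, writer = csv.reader(fin), csv.writer(fout)
|             for row in reader:
|                 key = "\x1f".join(row)        # unit-separator join
|                 try:
|                     con.execute("INSERT INTO seen VALUES (?)", (key,))
|                     writer.writerow(row)      # first occurrence
|                 except sqlite3.IntegrityError:
|                     pass                      # duplicate row, skip
|         con.commit()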
| qsort wrote:
| IME a database does not algorithmically improve anything, but it
| does put at your disposal some very efficient data structures
| that are definitely nontrivial to replicate. This is especially
| the case if you have to work with datasets that don't fit in
| memory!
|
| An added bonus is that you can use SQL.
| zerkten wrote:
| A friend works for a financial company that gets applicants
| from .NET and Linux backgrounds. He observed that the .NET
| devs almost always thought about solving these problems with a
| database and SQL, whereas the Linux people would solve them with
| pipelines of commands.
|
| Both approaches were valid in his environment, since they were a
| big Oracle and SQL Server shop while also having high Linux
| usage. I must ask him if PowerShell has changed things at all.
| theamk wrote:
| Definitely the latter.
|
| There are many other products which implement "keep track of
| partial file contents to deduplicate", for example casync, borg,
| and restic.
|
| None of them use sqlite for metadata storage, which kinda makes
| sense - in the author's schema, each block uses a (fileid,
| block_index, hash) tuple, which is 2/3 wasted space, as the
| fileids are mostly the same (if files are big) and block_index is
| always increasing. A custom data structure would be both faster
| and more space-efficient.
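|
| For reference, roughly the kind of schema being described, plus
| the query that finds shareable blocks (my reconstruction, not the
| author's actual code):
|
|     import sqlite3
|
|     con = sqlite3.connect("blocks.db")
|     con.executescript("""
|     CREATE TABLE IF NOT EXISTS blocks (
|         file_id     INTEGER NOT NULL,  -- repeated for every block
|         block_index INTEGER NOT NULL,  -- always just 0, 1, 2, ...
|         hash        INTEGER NOT NULL,  -- content hash of the block
|         PRIMARY KEY (file_id, block_index)
|     );
|     CREATE INDEX IF NOT EXISTS blocks_by_hash ON blocks(hash);
|     """)
|     # Blocks whose hash occurs more than once can be shared.
|     dupes = con.execute(
|         "SELECT hash, count(*) FROM blocks "
|         "GROUP BY hash HAVING count(*) > 1").fetchall()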
| [deleted]
| maxyurk wrote:
| https://en.m.wikipedia.org/wiki/Content-addressable_storage
| groestl wrote:
| Also relevant in the context of partial file deduplication (to
| get robustness wrt inserts and deletions):
| https://en.wikipedia.org/wiki/Rolling_hash
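|
| A toy content-defined chunker along those lines (simple polynomial
| rolling hash; the window size and cut mask are made-up parameters,
| not any particular tool's): boundaries depend only on nearby
| bytes, so an insertion only disturbs the chunk or two it lands in.
|
|     import hashlib, os
|
|     WINDOW, MASK = 48, (1 << 13) - 1   # ~8 KiB average chunk
|     BASE, MOD = 257, (1 << 61) - 1
|     TOP = pow(BASE, WINDOW, MOD)
|
|     def chunks(data):
|         h, start, out = 0, 0, []
|         for i, byte in enumerate(data):
|             h = (h * BASE + byte) % MOD          # slide window right
|             if i >= WINDOW:
|                 h = (h - data[i - WINDOW] * TOP) % MOD
|             # Cut where the window hash hits the magic bit pattern.
|             if (h & MASK) == 0 and i + 1 - start >= WINDOW:
|                 out.append(data[start:i + 1])
|                 start = i + 1
|         out.append(data[start:])
|         return out
|
|     data = os.urandom(1 << 20)
|     edited = data[:5000] + b"one inserted line\n" + data[5000:]
|     a = {hashlib.sha256(c).hexdigest() for c in chunks(data)}
|     b = {hashlib.sha256(c).hexdigest() for c in chunks(edited)}
|     # Typically all but one or two chunk hashes survive the insert.
|     print(f"{len(a & b)} of {len(a)} chunk hashes unchanged")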
| megous wrote:
| I used a similar trick to stuff 14 Linux distros onto a 6 GiB SD
| card image using the btrfs filesystem. The distros share a lot of
| common files, so this works well to save space. (btrfs also
| supports the data-block-sharing CoW feature, like APFS.)
|
| https://xnux.eu/p-boot-demo/
|
| I had to write a custom tool to do it efficiently during image
| creation.
| itsthecourier wrote:
| You, savage. Kudos.
| tripleo1 wrote:
| Slightly off topic -> there was a project I saw on GitHub that
| claimed to support system administration using relational
| tables. Something like everything in SQL. I thought it might be a
| cool idea.
| coppsilgold wrote:
| osquery <https://github.com/osquery/osquery>
| leetbulb wrote:
| "There is no file path, because macOS lets you look up files by
| id. The hash values are cryptographic hashes truncated to 64 bits
| and reinterpreted as integers."
|
| Is the author implying that APFS or HFS uses this method to
| calculate the file ID? I am unable to find any information
| regarding this. From what I understand, w.r.t. APFS, the file ID
| is a combination of the inode OID and genID.
| cerved wrote:
| I assumed they are talking about inodes
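|
| The 64-bit truncation from the quoted passage would look roughly
| like this (the article doesn't say which hash it uses or how the
| sign is handled, so those parts are assumptions):
|
|     import hashlib
|
|     def hash64(block: bytes) -> int:
|         # Truncate a cryptographic hash to 64 bits, then reinterpret
|         # it in the signed range SQLite's INTEGER stores natively.
|         u = int.from_bytes(hashlib.sha256(block).digest()[:8], "big")
|         return u - (1 << 64) if u >= (1 << 63) else u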
| [deleted]
___________________________________________________________________
(page generated 2023-03-26 23:00 UTC)