[HN Gopher] SQLite Archive Files (2018)
       ___________________________________________________________________
        
       SQLite Archive Files (2018)
        
       Author : alfonsodev
       Score  : 159 points
       Date   : 2021-09-27 09:11 UTC (13 hours ago)
        
 (HTM) web link (www.sqlite.org)
 (TXT) w3m dump (www.sqlite.org)
        
       | reacweb wrote:
        | When I read "If the input X is incompressible, then a copy of X
        | is returned", I worry that this is broken. If I archive a file,
        | then extract it from the archive, I cannot be sure to obtain
        | the same file. If the file is already compressed to begin
        | with, it will be (wrongly) decompressed at the end.
       | 
       | Maybe I am wrong. I didn't know this tool. My brief review of the
       | documentation leads me to believe that it has an obvious problem.
        
         | kybernetikos wrote:
         | I think your concern is dealt with in the immediately following
         | section:
         | 
         | >The Y parameter is the compressed content (the output from a
         | prior call to sqlar_compress()) and SZ is the original
         | uncompressed size of the input X that generated Y. If SZ is
         | less than or equal to the size of Y, that indicates that no
         | compression occurred, and so sqlar_uncompress(Y,SZ) returns a
         | copy of Y.
         | 
         | The format stores both the possibly compressed blob _and_ the
         | original size. From those two pieces of information it can
         | always return the correct original file.
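          | 
          | A minimal sketch of that rule against the documented sqlar
          | schema (name, mode, mtime, sz, data); the CASE just spells
          | out what sqlar_uncompress() does internally:
          | 
          |     SELECT name,
          |            CASE WHEN sz <= length(data)
          |                 THEN data                       -- stored as-is
          |                 ELSE sqlar_uncompress(data, sz) -- zlib data
          |            END AS contents
          |       FROM sqlar;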
        
           | reacweb wrote:
           | Ok, I was wrong. My review was too brief.
        
         | mbreese wrote:
         | _> If SZ is less than or equal to the size of Y, that indicates
         | that no compression occurred, and so sqlar_uncompress(Y,SZ)
         | returns a copy of Y._
         | 
         | This is pretty common for compression tools. If the input is
         | incompressible (or not compressible by X%), then the original
          | data is stored. In this case, the code checks the stored
          | size against the uncompressed size; if the stored data is
          | not smaller, it was kept uncompressed and uncompress is a
          | no-op.
          | 
          | My takeaway is that there is a zlib compression function
          | built into SQLite, which can be pretty handy.
         | 
         | Another benefit I can see is that because the SQLite database
         | has a flexible schema, you could add new features to the
         | archive while maintaining backwards compatibility. For example,
         | if you wanted to add a SHA1 hash to each record, you should be
         | able to, while still allowing older tools to read the updated
         | file.
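          | 
          | A sketch of the SHA1 idea, assuming the sha1() function
          | from SQLite's ext/misc/sha1.c extension is available (older
          | readers would simply ignore the extra column):
          | 
          |     ALTER TABLE sqlar ADD COLUMN sha1 BLOB;
          |     UPDATE sqlar
          |        SET sha1 = sha1(sqlar_uncompress(data, sz));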
        
           | lifthrasiir wrote:
           | Note that this kind of fallback is also prevalent in
           | _compression stream formats_ including zlib which sqlar uses,
            | so the archive format doesn't need to reimplement the same
           | fallback. The only reason it might be useful is the
           | opportunistic support for random access for uncompressed
           | data.
        
           | formerly_proven wrote:
            | It's still a really annoying design; one should explicitly
            | communicate whether and which compression method was used
            | for a given blob of data. Another DB column would have
            | been a very easy way to do that here.
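            | 
            | Something like this hypothetical variant of the schema
            | (the method column and its values are illustrative, not
            | part of sqlar):
            | 
            |     CREATE TABLE sqlar(
            |       name   TEXT PRIMARY KEY,
            |       mode   INT,
            |       mtime  INT,
            |       sz     INT,
            |       method TEXT,  -- e.g. 'raw', 'deflate', 'zstd'
            |       data   BLOB
            |     );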
        
       | m_ke wrote:
       | Would love something like that for storing large image datasets
       | for computer vision. Storing embeddings, predictions and metadata
       | in a contiguous format with compression support, ANN indexing
       | support and SQL would be amazing.
        
         | noxer wrote:
         | There is an SQLite compression and encryption "add-on" which
         | seems to do what you want. Check the official website. Hint:
         | You have to buy a license to use it legally.
        
           | [deleted]
        
       | kybernetikos wrote:
       | Given how crazy the zip file format is, and the claim that sqlite
       | is faster than the filesystem for small files
       | (https://www.sqlite.org/fasterthanfs.html) this seems pretty
       | reasonable to me.
       | 
        | In particular, development repositories with many, many small
        | source files often have horrendously slow copying/deleting
        | behaviour (particularly on Windows) even on fast disks. I
        | wonder if sqlite archive files would be a better way to store
        | them.
        
         | proto-n wrote:
          | I wonder if the particular slowness on Windows could be
          | related to how closing file handles is slow there because
          | of blocking antivirus checks. Alas, I can't recall where I
          | read about this; it was some kind of "a few things I
          | learned over many years of programming" blog post.
        
           | noxer wrote:
            | That's why you turn AVs off, either completely or at least
            | by excluding your own code folder. It won't ever do
            | anything useful there, and even worse, I've had it
            | delete/quarantine my own binaries in the past because it
            | somehow got triggered. Also, by default it happily uploads
            | all your debug exe files to MS and executes them in their
            | sandbox environment. Absolutely unwanted behavior if you
            | ask me.
        
           | Isthatablackgsd wrote:
            | This is true. Even moving folders from one drive to
            | another will trigger it. I download pictures and videos
            | weekly (on average 100 images a week). Often my browser
            | will hang when trying to save a file to the folder
            | because it is waiting for Windows Security to finish its
            | scan. Windows is proactive (sometimes too much) with
            | security because of the history of how we handle security
            | updates and the many, many (thousands of) unpatched
            | Windows installs out there infected with malware and
            | ransomware.
            | 
            | Fortunately it has an exclusion list, so you can add any
            | file, folder, file type, or process.
        
           | marcodiego wrote:
            | I saw Windows take many seconds to compile a hello world.
            | Compiling it for the Arduino took even longer, probably
            | because more files needed to be generated.
        
       | noxer wrote:
       | I wish they would build compression directly into SQLite. I use
       | SQLite as a log store mostly dumping JSON data in it. Due to the
       | lack of compression the DB is probably 10 times the size it could
       | be.
        
         | OskarS wrote:
         | I mean, they do: [1]. It's an extension, true, probably because
         | they don't want to force zlib as a dependency. Still, very
         | straightforward to build with and use, and as you can see from
         | the code, not hard to implement either.
         | 
         | [1]: https://sqlite.org/src/file?name=ext/misc/compress.c
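          | 
          | A rough usage sketch from the shell, assuming the extension
          | is built as a loadable library (the table is made up):
          | 
          |     .load ./compress
          |     CREATE TABLE log(id INTEGER PRIMARY KEY, body BLOB);
          |     INSERT INTO log(body)
          |       VALUES (compress('{"event":"login","user":42}'));
          |     SELECT uncompress(body) FROM log;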
        
           | kevinmgranger wrote:
           | Compression at the _database_ level would be more useful. If
           | you have a large text dataset but need to use the Full Text
           | Search feature, I don't think you can compress the index.
        
             | noxer wrote:
              | It exists, but it's not built in and not free.
             | 
             | See my comment
             | https://news.ycombinator.com/item?id=28670543
        
         | phiresky wrote:
          | You can also try sqlite-zstd [1], which is an sqlite
          | extension that allows transparent compression of individual
          | rows by automatically training dictionaries based on groups
          | of data rows.
         | 
         | Disclaimer: I made it and it's not production ready
         | 
         | [1] https://github.com/phiresky/sqlite-zstd
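          | 
          | From memory, usage looks roughly like this (check the
          | README for the exact configuration keys; this is not a
          | stable API):
          | 
          |     SELECT zstd_enable_transparent(
          |       '{"table": "events", "column": "data",
          |         "compression_level": 19,
          |         "dict_chooser": "''a''"}');
          |     -- train dictionaries / compress rows in the background:
          |     SELECT zstd_incremental_maintenance(null, 1);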
        
         | Cthulhu_ wrote:
         | You could probably pass the data through a gzip filter before
         | storage if you don't need the data to be indexable while in the
         | database. I know, that's an extra step, but in most programming
         | languages it's fairly straightforward.
         | 
         | And to be honest, if you need to search through or index JSON
         | blobs in a database you need to reconsider your design.
        
           | nicoburns wrote:
           | > And to be honest, if you need to search through or index
           | JSON blobs in a database you need to reconsider your design.
           | 
           | As a quick-and-dirty logging solution it makes quite a lot of
           | sense. You can add whatever fields you want to your
           | structured logs, and if you find you need to search on a
           | particular field, just add an index for that field.
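            | 
            | In SQLite that can be an ordinary expression index over
            | json_extract (names illustrative):
            | 
            |     CREATE INDEX log_user
            |         ON log(json_extract(body, '$.user_id'));
            | 
            |     -- uses the index because the expression matches:
            |     SELECT body FROM log
            |      WHERE json_extract(body, '$.user_id') = 42;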
        
           | mbreese wrote:
           | I would imagine SQLite would do a lot of random access IO,
           | which would make this difficult even without indexing.
           | 
           | But there is a gzip "variant" that we use in genomics called
           | bgzip that supports this. It's basically a chunked gzip file
           | (multiple gzip records concatenated together) with an extra
           | flag (in the gzip header) for the uncompressed length of each
           | chunk. Using this information, you can do random IO on the
           | compressed file while only uncompressing the chunks
           | necessary. I'm sure other compression formats have the same
           | support.
        
             | masklinn wrote:
             | > I would imagine SQLite would do a lot of random access
             | IO, which would make this difficult even without indexing.
             | 
             | To individual blobs? Seems unlikely.
             | 
              | Postgres automatically compresses large blobs; it's
              | part of the TOAST system:
             | 
             | > The TOAST management code is triggered only when a row
             | value to be stored in a table is wider than
             | TOAST_TUPLE_THRESHOLD bytes (normally 2 kB). The TOAST code
             | will compress and/or move field values out-of-line until
             | the row value is shorter than TOAST_TUPLE_TARGET bytes
             | (also normally 2 kB, adjustable) or no more gains can be
             | had.
             | 
             | So if a TOAST-able value is larger than
             | TOAST_TUPLE_THRESHOLD it'll get compressed (using pglz by
             | default, from 14.0 LZ4 will finally be supported), and if
             | it's still above TOAST_TUPLE_TARGET it'll get stored out-
             | of-line (otherwise it gets stored inline).
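              | 
              | For reference, the per-column knobs (Postgres syntax;
              | SET COMPRESSION is the new 14.0 part):
              | 
              |     ALTER TABLE blobs
              |       ALTER COLUMN body SET COMPRESSION lz4;
              |     -- or force out-of-line storage with no compression:
              |     ALTER TABLE blobs
              |       ALTER COLUMN body SET STORAGE EXTERNAL;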
        
               | dunham wrote:
               | Interesting. Sqlite moves some data off-page (out of the
               | btree node) if the row is bigger than a certain threshold
               | (the beginning of it is still stored in-page). I wonder
               | if this could be hooked to compress the overflow data.
        
               | mbreese wrote:
               | Not the filesystem... to the database file itself. I
               | thought that was what we were talking about.
               | 
               | The blobs in the sqlite archive are compressed, but one
               | of the critiques was that the compression was better when
               | you used multiple files to build the compression
                | dictionary (zip vs tarball).
               | 
               | And now I realize that I probably replied to the wrong
               | comment...
        
           | noxer wrote:
           | >You could probably pass the data through a gzip filter
           | before storage
           | 
            | That pretty much voids all the benefit of using SQLite.
           | 
           | >And to be honest, if you need to search through or index
           | JSON blobs in a database you need to reconsider your design.
           | 
            | This is absolutely intentional: the log contains
            | information to manually re-play certain events in case of
            | a fatal error, mostly for debugging purposes or
            | performance benchmarks.
            | 
            | This means 99.99% of the data will never be needed and
            | will eventually be discarded.
            | 
            | But when I need to re-play data, I need to be able to
            | query for exactly the data I need. Each JSON "blob"
            | (which is actually stored as TEXT in SQLite) has a few
            | additional fields that quickly filter off 95% or more of
            | the unwanted data. The rest SQLite simply reads
            | completely, to further separate wanted from unwanted
            | entries.
            | 
            | With a DB size of around 100GB I get the data I need by
            | reading 5GB or less. That's perfectly fine for my use
            | case; even if I let the DB grow quite a lot larger it
            | would still be fine.
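            | 
            | Schematically my setup looks like this (names are
            | illustrative):
            | 
            |     CREATE TABLE log(
            |       ts   INTEGER,  -- cheap filter columns
            |       kind TEXT,
            |       body TEXT      -- the JSON payload
            |     );
            | 
            |     SELECT body FROM log
            |      WHERE ts BETWEEN :t0 AND :t1      -- filter columns
            |        AND kind = 'order'               -- drop ~95%
            |        AND body LIKE '%"retry":true%';  -- scan the rest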
        
       | parhamn wrote:
        | Very cool idea! I'm a bit torn on whether the format should
        | concern itself with compression. It seems like a useful
        | general blob container strategy; might it be prudent to leave
        | compression to the consumer?
        
         | genocidicbunny wrote:
         | I've hand-rolled something very similar to this, and I just
         | used an extra 'tag' column that stored some extra info, often
         | the compression type.
         | 
         | FWIW, at the bottom of the linked page it says that you can
         | skip the sqlar_compress/uncompress functions and just roll your
         | own:
         | 
         | > The code above is for the general case. For the special case
         | of an SQLite Archive that only stores uncompressed or
         | uncompressible content (this might come up, for example, in an
         | SQLite Archive that stores only JPEG, GIF, and/or PNG images)
         | then the content can be inserted into and extracted from the
         | database without using the sqlar_compress() and
         | sqlar_uncompress() functions, and the sqlar.c extension is not
         | required.
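          | 
          | Concretely, for that special case you can insert raw blobs
          | yourself; as long as sz equals the blob length, readers
          | treat the data as uncompressed (mode value illustrative):
          | 
          |     INSERT INTO sqlar(name, mode, mtime, sz, data)
          |     VALUES ('photo.jpg', 420,  -- 420 = octal 0644
          |             strftime('%s','now'),
          |             length(:blob), :blob);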
        
           | kybernetikos wrote:
           | That seems like a nice effect of storing the uncompressed
           | size along with the blob, and using that to control
           | decompression. You can always choose to store uncompressed
           | blobs, and the size will always be the same so sqlite will
           | not attempt to decompress the blob. It's not quite the same
           | as rolling your own compression, but it's pretty easy to see
           | how the format could be extended to support that.
        
             | genocidicbunny wrote:
             | Using the size to trigger compression vs no-op is clever,
             | but doesn't help when you have different compression
             | methods that you want to use.
             | 
              | For example, you might be using this as an archive
              | format for a game, and you may want to store the
              | texture data using a different compression method than
              | the AI scripts or the audio data. At that point, you
              | definitely need another piece of data to clue you in. A
              | tag field is useful then, though I've also seen
              | filename/file-extension-based methods.
        
       | say_it_as_it_is wrote:
       | S3Lite?
        
       | cxr wrote:
        | For an honest assessment, difficulty of implementation--and,
        | accordingly, lack of diversity in implementations--should be
        | listed in the "Disadvantages" section.
       | 
       | (It's interesting that applications against censorship are
       | brought up. Difficulty of implementation has consequences here,
       | too. In order to effectively use SQLite as an archive format, the
       | receiving end will need the SQLite software. By comparison, it's
        | pretty trivial to craft a polyglot file that is both plain
        | text and HTML and is self-extracting, assuming no software on
        | the other end beyond ubiquitous commodity web browsers.
        | Always bet on text.)
        
         | bastawhiz wrote:
         | Sorry, how exactly do you create a "self extracting" archive
         | file with only text and HTML?
        
           | [deleted]
        
           | [deleted]
        
       | PostThisTooFast wrote:
       | I don't see why this is better than simply implementing a table
       | like this yourself.
        
       | ComputerGuru wrote:
       | I think the title could use a (2018) appended to it, just so no
       | one thinks this is a new thing that will be pushed on them or
       | something.
        
         | dang wrote:
         | Ok, added. Thanks!
        
       | edwintorok wrote:
       | Would be interesting if this was expanded to support more
       | compression formats. E.g. zstd. Gzip is quite an old format and
       | zstd is a lot quicker to decompress.
        
         | chrisseaton wrote:
         | I don't think it uses the gzip format does it? It just uses
         | deflate.
        
           | lifthrasiir wrote:
           | It's actually the zlib format. The linked post says DEFLATE,
           | but the linked source code correctly mentions zlib instead.
        
           | masklinn wrote:
           | It doesn't use the gzip file format, but it uses the gzip
           | compression format (which is just DEFLATE).
        
         | rini17 wrote:
          | Yes, I did double-check: there's no column to indicate the
          | compression format, and I have no idea how they intend to
          | extend it in the future. Maybe just autodetect from the
          | contents? However, that could fail if you want to store
          | already-deflated files.
          | 
          | I'd also like a (nullable) mimetype column. It can be
          | handy, for example, to store the encoding for text files.
          | 
          | On the other side, some people call for file permissions...
          | I'm not a fan; too system-dependent.
        
           | mark-r wrote:
           | Since compression can produce any bit combination with equal
           | probability, the only way to reliably auto-detect from the
           | contents is to put some kind of fixed flag at the front. At
           | that point you get more flexibility by making it a new column
           | in the table instead. It doesn't seem like it would be hard
           | to extend this with a compression type and mime type if you
           | need those.
           | 
              | I question the need for multiple compression types,
              | though; the difference between the default and the best
              | won't be great enough to be worth the hassle,
              | especially if it leaves you open to a patent troll.
        
             | edwintorok wrote:
             | `zstd` and `zlib` as the only 2 supported formats would be
             | enough for me (zstd for state of the art, fast compression
             | and decompression, with choice of high compression ratios
             | if needed; and zlib for backward
             | compatibility/portability), but others may disagree and
                | want their own favourite compression format there.
                | OTOH zstd already has multi-file capability built in,
                | so you don't gain much by using sqlite (unless zstd
                | were to adopt sqlite as its storage format...).
        
               | edwintorok wrote:
               | Actually looks like if you pass multiple files to zstd
               | it'll compress them individually, it won't create an
               | archive, so tar or sqlite format would still be useful
               | there.
        
       | lifthrasiir wrote:
       | I'm generally supportive of SQLite's NIH syndrome---which is
       | normally bad, but it can work if the trade-off is well researched
       | and the resulting product is of high quality---but this one is
       | not. Specifically sqlar is a _worse_ replacement of ZIP.
       | 
       | It lacks pretty much every feature of modern compressed archive
        | formats: filesystem and custom metadata beyond simple st_mode,
        | solid compression, metadata compression, encryption, integrity
        | checks and so on. Therefore it can only be legitimately
       | compared with ZIP, which does support custom metadata, very bad
       | encryption and partial integrity check (via zlib) and only lacks
       | the guaranteed encoding for file names. Even ignoring other
       | formats it is not without a problem: for example the compression
       | mode (DEFLATE vs. uncompressed) is implicitly indicated by `sz =
       | length(data)` and I don't think it is a good idea. If I were
       | designing sqlar and didn't want to spare an additional field I
       | would have instead set sz to something negative so that it never
       | collides with the compressed case (of course, if I had a chance I
       | would just add a separate field instead). Pretty disappointing
       | given other tools from the SQLite ecosystem.
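        | 
        | To illustrate the alternative encoding I mean (a hypothetical
        | variant, not what sqlar actually does):
        | 
        |     SELECT name,
        |            CASE WHEN sz < 0 THEN data  -- negative sz: as-is
        |                 ELSE my_uncompress(data, sz)  -- hypothetical
        |            END
        |       FROM sqlar;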
        
         | m_eiman wrote:
         | Implicitly indicated compression is also a "feature" of LZJB
         | which we've used for firmware update files. Figuring out what
         | was going on and fixing the issue when a bootloader suddenly
         | started rejecting new firmware updates was interesting and
         | annoying.
         | 
          | The updates were processed in chunks, and only rarely does
          | a chunk not compress at all - which meant it took a long
          | time before the problem happened the first time. It's
          | always fun to find issues in code that has "always" worked
          | and suddenly doesn't any longer.
         | 
         | https://en.wikipedia.org/wiki/LZJB
        
         | mbreese wrote:
         | But given that we are talking about using a SQLite database as
         | a container, these features could in theory be added in the
         | future while maintaining the ability for older versions to
         | still be able to read the newer files.
         | 
         | I think that's still pretty impressive.
        
         | bityard wrote:
          | I don't think anybody is suggesting that SQLite archives are
          | a suitable replacement for ZIP or tar + compressor of choice
          | for general-purpose compression duties.
         | 
         | The docs for this seem to suggest that this extension is most
         | useful when your application already heavily depends on SQLite
         | and you need a place to store some files but don't want to
         | introduce further OS-specific dependencies into your
         | application.
         | 
         | This is a low-effort solution to a rather specific problem.
         | Which is why it's not part of SQLite proper. The extension
         | itself is 124 lines of C including comments for heaven's sake.
        
           | lifthrasiir wrote:
           | The extension only deals with the compression part. The
           | actual operation happens in the shell which is intermingled
           | with general archive support. And this suggests the most
           | important fact: SQLite already supports an ordinary ZIP file
           | via the zipfile virtual table [1]. Both zipfile and sqlar
            | extensions are enabled when compiled with zlib, and it is
            | not even the case that sqlar came first (they were added
            | at roughly the same time AFAIK). Therefore if you have
            | SQLite and want cheap file-archive support, you can just
            | use zip instead of the inferior sqlar.
           | 
           | [1] https://sqlite.org/zipfile.html
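            | 
            | For example, reading an ordinary .zip through the virtual
            | table needs nothing sqlar-specific:
            | 
            |     SELECT name, sz, data
            |       FROM zipfile('archive.zip');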
        
         | chousuke wrote:
         | To me, this feature looks more like a convenience wrapper than
         | anything else. Maybe it actually was useful for something and
         | got implemented for that reason.
         | 
         | You pretty much get everything it needs "for free" with the
         | standard SQLite machinery. I can't imagine the code
         | implementing this is a large burden at all.
        
         | tptacek wrote:
         | ZIP encryption is an anti-feature, and you should probably
         | never, ever use it. The problem is that the ZIP ecosystem is
         | full of implementations of horrible insecure cipher
         | constructions (even the modern implementations may support them
         | for backwards compatibility). "Horrible" here means "breakable
         | with a Perl script".
         | 
         | We relied on ZIP encryption at a previous firm, because no
         | client of ours could ever get PGP to work. But we had to train
         | everyone to use 7z, and getting files back from clients was
         | fraught (they might just click the "encrypt" button in
         | whichever random ZIP program they used).
         | 
         | If you want to encrypt an archive file, you should do it with a
         | dedicated encryption program that you trust (Filippo Valsorda's
         | `age` is a good bet, since it has good implementations in both
         | Go and Rust, so it's very easy to integrate into designs, and
         | also has a very clean, simple command line interface). You
         | should not lean on ZIP's supposed AES support.
         | 
         | The rest of this critique of `sqlar` might be valid! But I
         | can't let the encryption thing pass unchallenged.
        
           | lifthrasiir wrote:
           | I think I made it clear that ZIP encryption is "very bad" :-)
        
             | tptacek wrote:
             | Totally fair! I think I'm mostly pushing back on the idea
             | that archive formats with built-in encryption are a good
             | idea.
        
               | lifthrasiir wrote:
                | That's reasonable. Given that compression software is
                | likely what typical people would use for encryption,
                | though, I'd like to see an actually good encrypted
                | format design in a commonly used archival format. I'd
                | even be more than happy if that were just `age` plus
                | some carefully selected headers.
        
           | nextaccountic wrote:
           | To be fair, they said that ZIP had very bad encryption.
        
         | zzo38computer wrote:
         | > Specifically sqlar is a worse replacement of ZIP.
         | 
         | I mostly agree. However, it does have some advantages too, as
         | mentioned in the linked article.
         | 
         | (An alternative way to query archives by SQL would be to
         | implement a virtual table; in this case it can work with any
         | archive format. In the case of ZIP format, it already does
         | this.)
         | 
         | > filesystem and custom metadata besides from simple st_mode
         | 
         | Since it is a SQLite database, you can add additional tables
         | etc, so it isn't a problem.
         | 
         | > solid compression, metadata compression
         | 
          | I think that keeping archiving and compression separate is
          | good; these things would be handled by a separate program
          | for archiving and another for compression.
         | 
         | However, in some circumstances, solid compression is
         | undesirable (as mentioned before in HN). In some formats, solid
         | compression is optional, but you might want an index, and there
         | will need to be some way to find parts of the file by index.
         | This is why in my own experimental concatenatable compression
         | format, you can add an optional key frame index.
         | 
         | > encryption
         | 
         | Also encryption should be a separate program from archive and
         | compression too, I should think.
         | 
         | (However, there are extensions for SQLite for doing encryption,
         | so it is possible to use those if wanted.)
         | 
         | > the compression mode (DEFLATE vs. uncompressed) is implicitly
         | indicated by `sz = length(data)` and I don't think it is a good
         | idea
         | 
          | I agree; it doesn't look good to me either.
         | 
         | (Although, the Free Hero Mesh code page file does something
         | similar, but in that case the uncompressed sizes of lumps is
         | always exactly 2048 bytes anyways, so there is no need to store
         | a separate uncompressed size.)
        
           | lifthrasiir wrote:
           | > Since it is a SQLite database, you can add additional
           | tables etc, so it isn't a problem.
           | 
            | You can add additional tables, but you can't add support
            | for those tables to past versions. This is especially
            | true because sqlar has no version field specified.
           | 
           | > I think that using separate archiving and compression is
           | good, which would do these things. A separate program should
           | be used for archive and for compression.
           | 
           | I've heard this argument a lot and I think it is completely
           | false. Archival and compression should go together, otherwise
           | you would get suboptimal results.
           | 
           | Archival then compression (i.e. the traditional tarballs)
           | will disable random access to individual files _and_
            | metadata. This hurts performance in pretty much every use
            | case except one-shot decompression. In particular this
           | problem completely prohibits using archival then compression
           | in container formats. You can reenable them by using random-
           | accessible compression formats, but they are not common and
           | less efficient than formats without that constraint.
           | 
           | Compression then archival is much better than archival then
           | compression because it essentially mimics the combined
           | compression-archival format (I have used .bz2.tar in the
           | past), but this mode is unpopular so you need to manually
            | decompress them. And it is harder to use different
            | compression algorithms for different file formats; you
            | would have to put that info in the file name or the
            | metadata, and that would be no different from
            | compression-archival formats.
           | 
            | Some claim that the need for solid compression demonstrates
           | the problem of combined compression-archival formats. This is
           | not true and even misleading. Archival then compression
           | commonly results in _worse_ compression than solid compressed
           | archives even with the same algorithm. The gist of solid
           | compression is a shared context for related files, so unless
            | the archiver somehow reorders files it can't compete with
           | combined archivers with solid compression. And if your
           | archiver is doing that reordering, it is no longer decoupled
           | from the compressor so it would be better to combine them. In
           | addition to this, there are also alternatives to solid
           | compression with different trade-offs, for example learned
           | shared preset dictionaries (even zlib supports this, although
           | unpopular).
           | 
           | > Also encryption should be a separate program from archive
           | and compression too, I should think.
           | 
           | Yeah, I'm personally fine without encryption as it is really
            | hard to get it right (for example, you should be prepared
            | for metadata encryption and key stretching). I've listed
            | it partly for completeness and partly because it is
            | probably the most approachable encryption software for
            | end users.
           | 
           | > (However, there are extensions for SQLite for doing
           | encryption, so it is possible to use those if wanted.)
           | 
           | And I think the official encryption extension [1] is
           | proprietary.
           | 
           | [1] https://www.sqlite.org/see/
        
         | slaymaker1907 wrote:
          | You forgot the biggest limitation of sqlar, which is the lack
          | of support for large files (max is 4GB IIRC). I think this is
         | really unfortunate since a sqlite based archive file format
         | could be really neat and easily support a lot of the features
         | you mention. Additionally, I think it is really cool to have
         | the ability to inspect the file format using SQL.
         | 
         | If sqlar were actually more usable, it would also be really
         | neat for application formats since you could have additional
         | metadata for your application in addition to the archive.
         | Instead of having some magic MANIFEST file for Java jar files,
         | that data could be stored in a separate table with its own
         | indices.
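          | 
          | e.g. a jar-style manifest could just live in a side table
          | next to the archive table (sketch, names made up):
          | 
          |     CREATE TABLE manifest(
          |       key   TEXT PRIMARY KEY,
          |       value TEXT
          |     );
          |     INSERT INTO manifest
          |     VALUES ('Main-Class', 'com.example.App');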
        
         | banana_giraffe wrote:
          | As I understand things, sqlar is more of a thought experiment
          | or proof-of-concept type thing, not a "we've got a better
          | solution than zip files."
         | 
         | https://youtu.be/8y_ABXwYtuc?t=2135
         | 
         | It's interesting, and maybe has some use cases, but not a
         | generic solution. I've certainly used a hand-rolled version of
         | this idea before to handle specific scenarios where it made
         | sense.
        
         | OskarS wrote:
         | To add to that list: when you do a .tar.gz, the compression
         | algorithm works across file boundaries (very useful when you
         | have a lot of small files), but neither sqlar nor .zip does
         | this.
         | 
         | I think the idea of using SQLite as an application file format
         | is excellent, and it's great that it can serve as a general
         | store of files. As a general purpose compressed archive format,
         | however, I agree it leaves a lot to be desired.
        
           | lifthrasiir wrote:
           | Yeah, I even think a _full file system_ based on SQLite
           | format is possible; pretty much all ingredients are there,
           | just waiting to be cooked. But it should have been cooked in
           | the first place.
        
             | tobias3 wrote:
             | Already there. E.g.
             | https://github.com/guardianproject/libsqlfs but I'd also
             | count something like s3ql where the data is stored as s3
             | objects.
        
             | brandmeyer wrote:
             | I don't think so. An important missing element in SQLite's
             | format is larger block support. Modern filesystems do this
             | via extents. SQLite has a strictly fixed block size and no
             | support for forming larger blocks from contiguous runs of
             | basic blocks.
             | 
             | Another missing element is that SQLite's support for large
             | blobs is via linked lists of pages. That's fine for
             | sequential access, but unacceptable for random access.
             | 
             | For a database to be usable as a filesystem backend, it
             | must at least be composable with itself without sacrificing
                | performance. Let's say you built a kernel module or
                | FUSE module using sqlite as the backend. How well
                | would sqlite itself perform on a file in this
                | filesystem?
        
               | lifthrasiir wrote:
               | > Another missing element is that SQLite's support for
               | large blobs is via linked lists of pages.
               | 
                | This surprised me, and you were correct: there is no
                | guarantee that related payload overflow pages are
                | consecutively ordered. I thought extents could be
                | implemented with those pages, but it would be pretty
                | bad for random access as-is. In fact, one of the
                | reasons I thought SQLite could function as a file
                | system was built-in support for random-accessible
                | BLOBs, and some more research revealed that this
                | support was based on the fact that SQLite loads the
                | entire BLOB into memory. Oops.
        
           | conradev wrote:
            | That is a trade-off in that you can no longer randomly
            | access a single file without decompressing the entire
            | stream.
           | 
           | That is often desired for file archives. For use cases like
           | genomics, though, reads from random offsets are common, hence
           | the bgzf format used by BAM files which is a sequence of gzip
           | chunks and an index. Lower compression ratio, but faster
           | seeking.
        
           | masklinn wrote:
           | > To add to that list: when you do a .tar.gz, the compression
           | algorithm works across file boundaries (very useful when you
           | have a lot of small files), but neither sqlar nor .zip does
           | this.
           | 
           | That's mentioned both in the article (3.2) and GP ("solid
           | compression" is the term of art for a single compression
           | stream for the entire archive).
        
       ___________________________________________________________________
       (page generated 2021-09-27 23:02 UTC)