[HN Gopher] Hosting SQLite databases on GitHub Pages or any stat...
       ___________________________________________________________________
        
       Hosting SQLite databases on GitHub Pages or any static file hoster
        
       Author : phiresky
       Score  : 1657 points
       Date   : 2021-05-02 16:43 UTC (1 days ago)
        
 (HTM) web link (phiresky.github.io)
 (TXT) w3m dump (phiresky.github.io)
        
       | Arnavion wrote:
       | I wrote a similar thing in Rust for a Factorio mod manager. Mods
       | are hosted on the remote HTTP server as ZIP files, and the mod
       | manager needs a single `info.json` file from the ZIP for the mod
       | metadata. So the mod manager avoids downloading the whole mod and
       | then unpacking it by building a file abstraction that uses HTTP
       | range queries to download just the chunks it needs. For ZIP files
       | the directory is stored at the end at an unknown offset, so the
       | read pattern is to gradually seek backwards from the end until
       | you find the start of the directory, then find the file entry,
       | then seek and read the file.
       | 
       | I didn't fiddle with the window sizes like the submitted article
       | (the chunk is fixed to 8KiB), but I did optimize it so that
       | reading chunk N+1 of the file reused the response reader of chunk
       | N rather than make a new request. Furthermore I keep an LRU cache
       | of only the last three chunks in memory, because the ZIP files
       | are each only read once.
       | 
       | [1]: https://github.com/Arnavion/fac-
       | rs/blob/2d2622a1c9934719ce65...
       | 
       | [2]: https://github.com/Arnavion/fac-
       | rs/blob/2d2622a1c9934719ce65...
        
       | CyberRabbi wrote:
       | Over high latency links this would be virtually unusable. Why not
       | just download the entire database into memory over XHR on page
       | load? SQLite databases of pure data usually aren't over 10MB in
       | size.
        
         | petters wrote:
         | > SQLite databases of pure data usually aren't over 10MB in
         | size
         | 
         | Why do you think this?
        
           | [deleted]
        
         | mcculley wrote:
         | That is hilariously wrong for a lot of use cases. I will find
         | this very handy for some SQLite databases I have that are
         | several GBs in size. I am looking right now at using this
         | contribution.
        
           | CyberRabbi wrote:
           | It's not hilariously wrong or wrong at all that over high
           | latency links this would be virtually unusable.
           | 
           | It's certainly possible that people are using SQLite
           | databases with sizes on the order of gigabytes but in my
           | experience those are the exception not the rule.
        
             | tehbeard wrote:
             | Over high latency links most anything interesting is
             | virtually unusable, so stop using it as a high horse to
             | stand upon.
        
         | simonw wrote:
         | The demo here uses a 670MB database file.
        
         | formerly_proven wrote:
         | This particular demo page actually makes queries against an
         | almost 700 MB large (fills one CD!) SQLite database. Because
         | the amount of data read is almost negligible (few hundred kB),
         | performance is limited by latency (as you say). However, high-
         | latency links also tend to be slower, so downloading the entire
         | database a-priori would almost always be much slower.
         | 
         | For example, on a 1 megabit/s link with 300 ms RTT, one example
         | would take about 2 seconds for the data transfer itself while
         | spending another 3 seconds or so on waiting. Downloading the
         | entire file would take around _an hour and a half_.
         | 
         | For your 10 MB database, transferring it as a whole would take
         | 80 seconds. Assuming this solution instead needs to read e.g.
         | 250 kB (taking 2 seconds to transfer), it could still bounce
         | around 250 times to the database before those 10 MB are fully
         | downloaded. (This would be a really odd query, since it would
         | only read on average two pages per read request)
        
           | CyberRabbi wrote:
           | Right but that is an artificially created demo by the author
           | to justify the solution being presented (no offense). The
           | question is how common are ~GB large SQLite databases in the
           | real world relative to databases that are ~MB large?
           | 
           | In my experience SQLite databases of millions of rows of raw
           | tabular data tend to compress very well into dozens of
           | megabytes. Indeed SQLite is often touted as a file format for
           | applications.
        
             | dkarras wrote:
             | Even if you compress it, you need it in client memory which
             | can also be a dealbreaker. Some people might need it for
             | some projects, I know I did.
        
               | CyberRabbi wrote:
               | > you need it in client memory which can also be a
               | dealbreaker
               | 
               | Most workstations have GBs of available memory. If not
               | you can dump it in indexeddb as a raw data store.
               | 
               | I never disputed that it would be useful for some use
               | cases. I only said it would be unusable with high latency
               | links. If you have a low latency link and aren't running
               | very complex queries with lots of random seeks, then this
               | should work fine for you.
        
             | rakoo wrote:
             | I'm running magnetico
             | (https://github.com/boramalper/magnetico) on my VPS. I
             | currently have an index of 1.6M magnet links stored in a
             | 5GB database.
             | 
             | SQLite is most interesting not when the database is small,
             | but when there are very few writes and all you do is
             | reading. You can also look at https://datasette.io/ and see
             | how SQLite is perfect for representing a lot of datasets
             | and querying them
        
             | brandmeyer wrote:
             | My team has a few TB of data in SQLite files that are
             | themselves dozens of GB each.
             | 
             | We're using them as a replacement for leveldb's sstables,
             | but with the structure of full SQL. It is highly effective.
        
               | CyberRabbi wrote:
               | Do you think your team's usage of SQLite is
               | representative of the average SQLite user?
        
               | detaro wrote:
               | Where has it been suggested that this is the best
               | solution for "the average SQLite user", instead of a tool
               | you can use if it fits your requirements? To take your
               | 10MB number, the article _starts_ by mentioning you can
               | probably just download the entire thing if you aren 't
               | above that exact same number.
        
               | CyberRabbi wrote:
               | I made two claims:
               | 
               | > this would be unusable over high latency links.
               | 
               | That is objectively true
               | 
               | > SQLite databases of pure data usually aren't over 10MB
               | in size.
               | 
               | No one here has refuted this point.
               | 
               | Any other counterargument is addressing a claim I did not
               | make.
        
               | nl wrote:
               | It's not anymore unusable over high latency links than
               | most website. Also worth noting that the caching is very
               | smart, so once things are downloaded it's very fast.
               | 
               | But most high latency links are very slow (so downloading
               | large databases is a horrible experience) and (more
               | importantly) are often priced by the size of downloads.
        
               | CyberRabbi wrote:
               | > It's not anymore unusable over high latency links than
               | most website.
               | 
               | That's false. Not all web applications suffer equally
               | from high latency links. Depends on how reliant the web
               | application is on independent requests. Making one
               | request and receiving a single bulk download is much less
               | bad than making many dependent requests on a high latency
               | link.
        
               | brandmeyer wrote:
               | This is the fundamental flaw of 80% thinking. The fact
               | that SQLite continues to reach for more users is what has
               | made it such a successful general-purpose tool.
        
               | CyberRabbi wrote:
               | You didn't answer the question.
               | 
               | > The fact that SQLite continues to reach for more users
               | is what has made it such a successful general-purpose
               | tool.
               | 
               | I never disputed this. You're responding to a straw man.
        
         | mythrwy wrote:
         | I have been using SQLite databases for a few user application
         | that has been running for close to a decade now. They are
         | usually about 1.5GB.
         | 
         | BTW, SQLite has a (theoretical?) max size of 140TB! (or so I've
         | read)
        
           | CyberRabbi wrote:
           | What do you think is the size of the average SQLite database?
        
             | mythrwy wrote:
             | Given their widespread uses on phones, the overall average
             | is likely very small. KBs even.
             | 
             | A more relevant question might be: what would the average
             | size of SQLite databases for web type (or even this
             | specific use case) applications. I don't know, but 10s or
             | 100s of MBs might not be a bad guess.
        
               | _flux wrote:
               | However, would those 10s or 100s of MBs be something web
               | sites would just like to share in their entirety to the
               | client to query as they wish? At least many commercial
               | services would prefer to keep most of the data secret and
               | even gain insight from the access requests.
               | 
               | But for a more open or a hobbyist project where
               | minimizing amounts of data transfer is less important
               | than minimizing the amount of doing more work (server-
               | side code), then this seems like a decent solution.
               | 
               | It is also worth reminding that this solution only
               | practically works for read-only databases.
        
         | detaro wrote:
         | And the article _starts_ by mentioning that you can download
         | the entire file if it 's not too large. And then goes on to
         | present a solution for larger files. What more answer to "Why
         | not just download the entire database" do you expect?
        
       | bob1029 wrote:
       | Modeling the DOM in SQL... Further evidence that anything we can
       | imagine has some stable representation in third normal form. Is
       | it fast? Maybe not. But is it correct? Provably.
        
         | cma wrote:
         | There was a group that rewrote the Chromium DOM itself in a
         | data oriented design (which learns from database design and
         | sort of applies to cache utilization and locality) and got a 6X
         | speedup in some places:
         | 
         | https://meetingcpp.com/mcpp/slides/2018/Data-oriented%20desi...
        
           | bob1029 wrote:
           | This paper is basically what we do, except we have a big
           | container type, aptly named "Domain.cs". Inside, you will
           | find a public List<T> of every type. We decided to emulate
           | SQL throughout the vertical (i.e. each List is a table) in
           | order to make mapping to SQL a trivial affair. None of our
           | Domain types contains any complex type as a property.
           | Everything can go 1:1 to SQL. We use LINQ (or SQL) to produce
           | projections as appropriate.
           | 
           | There isn't actually just 1 big domain instance either. It's
           | more like one per user session, and then a global instance.
           | 
           | The impact this had on reducing complexity and bugs is
           | incredible. I haven't seen a null ref exception in a long
           | time. Also, being able to dump your entire universe to disk
           | by serializing a single object is really nice.
        
       | bambax wrote:
       | This is incredibly clever and fun.
       | 
       | That said, all you need to query an SQLite database on the server
       | side is a simple PHP script (as opposed to a running db server),
       | and most static hosting providers offer PHP support.
        
       | crazygringo wrote:
       | This is hilariously clever.
       | 
       | Using the "Range" HTTP header to _read_ chunks of the database
       | file absolutely works!
       | 
       | But to be clear, there's no _write_ equivalent, is there? You can
       | 't use "Range" with a PUT request.
        
         | NoInkling wrote:
         | Even if there was, I can't imagine your everyday static host
         | ever supporting it.
        
         | anonydsfsfs wrote:
         | The write equivalent would be the PATCH method using a
         | "message/byteranges" body: https://tools.ietf.org/id/draft-
         | wright-http-partial-upload-0...
        
           | crazygringo wrote:
           | Wow that's fascinating, thanks. That would actually turn HTTP
           | into a kind of random-access filesystem, if adopted.
           | 
           | It's amazing but also slightly terrifying in the knowledge
           | that then someone's going to write an SMB-over-HTTP
           | connector.
        
         | TimTheTinker wrote:
         | TA was clear - there's no way to write, since static file
         | hosting doesn't support dynamic write to begin with.
         | 
         | However, I imagine a service to support your scenario could be
         | written in a standard back-end server language like Go or JS.
         | The challenges involved would be much greater, however -- how
         | to handle concurrency in particular. I suspect one would do
         | better to just run PostgreSQL behind a web API.
        
           | csomar wrote:
           | That's basically re-inventing the Database but on the client
           | side. We have gone a long way but we are closer to having the
           | server side just as a data store.
        
       | ianlevesque wrote:
       | This is easily the most clever web programming hack I've seen
       | this year. Bravo. I had seen this used for video or audio of
       | course but it never occurred to me you could use it for
       | databases. There are probably a ton of other formats this is good
       | for too.
        
         | Valgrim wrote:
         | I wonder if this could be used to serve dynamic maps.
        
           | blacha wrote:
           | This is pretty much exactly what we do to serve
           | aerial/satellite imagery maps.
           | 
           | We convert the imagery into Cloud optimised geo tiffs and
           | store them in S3 https://www.cogeo.org/ then the browser can
           | request the tiles directly from S3.
           | 
           | Even the big imagery providers are now storing their imagery
           | as COGs, eg
           | https://registry.opendata.aws/sentinel-2-l2a-cogs/
        
           | durkie wrote:
           | I believe this is protomaps approach: re-encode the mbtiles
           | (sqlite-based ) format in to something that can be requested
           | with a http range request and thus served from a single dumb
           | webserver that doesn't need to understand sqlite or mbtiles
           | parsing
        
             | bdon wrote:
             | This is the approach I took with
             | http://github.com/protomaps/pmtiles , though it's optimized
             | for the very specific use case of going from Z/X/Y integer
             | coordinates to binary blobs, and takes shortcuts to
             | accomplish that (fixed-width keys and root index page)
        
               | config_yml wrote:
               | Funny enough I was looking into mbtiles serverless
               | solutions before I went to bed, now I start my day
               | browsing HN and I find this clever solution. Love HN for
               | this type of stuff :)
        
       | sroussey wrote:
       | The innovation here is getting sql.js to use http and range
       | requests for file access rather than all being in memory.
       | 
       | I wonder when people using next.js will start using this for
       | faster builds for larger static sites?
        
         | jhgb wrote:
         | Microsoft Access Cloud Edition, basically?
        
           | nbevans wrote:
           | Sort of. Access had a "Forms" feature that let you create
           | basic GUIs on top of your database. Also, the OP's project is
           | (currently) only providing a read-only view of the SQLite
           | database. Adding write support is possible but will be far
           | less impressive to the HN crowd because SQLITE_BUSY will rear
           | its ugly head ;-)
        
             | jhgb wrote:
             | I was mostly referring to the shared file access approach.
        
         | tyingq wrote:
         | Would also be great to add (efficient) search to a static blog.
        
           | nanna wrote:
           | Definitely. Just need to add a layer to the static site
           | generator for it to populate the SQLite DB, right?
        
             | intergalplan wrote:
             | I'd also version the DB in the URL, else you could end up
             | changing the file out from under someone who's already got
             | the page loaded, with who-knows-what results depending on
             | how different the file is. You could just prefix a head to
             | every range request to check for changes, but that adds
             | overhead and doesn't actually completely close the gap, so
             | it'd still be possible to read a file different from the
             | one you intended. Cost is more disk usage, depending on how
             | many copies you keep around, but at least keeping the most
             | recent "old" version seems reasonable unless you're skating
             | _really_ close to the quota on whatever system you 're
             | using.
        
               | M2Ys4U wrote:
               | Requests could use the If-Unmodified-Since header to
               | guard against DB changes without the overhead of a
               | separate HEAD request.
               | 
               | If the underlying DB has been changed then the server
               | should respond with 412 Precondition Failed.
        
           | edoceo wrote:
           | yea, sqlite FTS5 has been pretty amazing for quick search
           | solutions (but I use english only)
        
         | misterdata wrote:
         | See also https://github.com/bittorrent/sqltorrent, same trick
         | but using BitTorrent
        
           | phiresky wrote:
           | Yeah, that was one of the inspirations for this. That one
           | does not work in the browser though, would be a good project
           | to do that same thing but with sqlite in wasm and integrated
           | with WebTorrent instead of a native torrent program.
           | 
           | I actually did also implement a similar thing fetching data
           | on demand from WebTorrent (and in turn helping to host the
           | data yourself by being on the website):
           | https://phiresky.github.io/tv-show-ratings/ That uses a
           | protobufs split into a hashmap instead of SQLite though.
        
             | westurner wrote:
             | This looks pretty efficient. Some chains can be interacted
             | with without e.g. web3.js? LevelDB indexes aren't SQLite.
             | 
             | Datasette is one application for views of _read-only_
             | SQLite dbs with out-of-band replication.
             | https://github.com/simonw/datasette
             | 
             | There are a bunch of *-to-sqlite utilities in corresponding
             | dogsheep project.
             | 
             | Arrow JS for 'paged' browser client access to DuckDB might
             | be possible and faster but without full SQLite SQL
             | compatibility and the SQLite test suite.
             | https://arrow.apache.org/docs/js/
             | 
             | https://duckdb.org/ :
             | 
             | > _Direct Parquet & CSV querying_
             | 
             | In-browser notebooks like Pyodide and Jyve have local
             | filesystem access with the new "Filesystem Access API", but
             | downloading/copying all data to the browser for every run
             | of a browser-hosted notebook may not be necessary.
             | https://web.dev/file-system-access/
        
               | hfmuehleisen wrote:
               | DuckDB can directly & selectively query Parquet files
               | over HTTP/S3 as well. See here for examples: https://gith
               | ub.com/duckdb/duckdb/blob/6c7c9805fdf1604039ebed...
        
         | mvanaltvorst wrote:
         | I'm curious, in what manner could this method speed up Next.js
         | builds? That's all done locally, which negates the effect of
         | HTTP range requests, right?
        
           | city41 wrote:
           | I'm guessing they mean rather than build a static Next site
           | that generates 10k+ pages (or whatever large means in the
           | given context), it instead creates one page that just queries
           | the data from the client.
           | 
           | I have one Next static site that has about 20k pages and
           | takes about 20 minutes to build and deploy. I think that's an
           | acceptable build time. But I do know of other people around
           | the net who have mentioned having sites with 20k-ish pages
           | taking an hour+ to build. For them I could see the desire to
           | try this sqlite trick.
        
             | gitgud wrote:
             | Generating 20k pages in 20mins is impressive, 16 pages a
             | second on average.
             | 
             | In my experience, it can take a couple of minutes just to
             | deploy 20 pages, but that could just be the overhead of
             | Typescript and SASS compilation too...
        
               | nicoburns wrote:
               | Hugo claims <1ms a page. Which would mean 20k pages in
               | under 20 seconds. 20k pages in 20 mins is not fast!
        
               | gawin wrote:
               | Confirmed: Build a 30k pages site with Hugo in 53
               | seconds, including grabbing JSON content from Contentful
               | and deploy on AWS S3.
               | 
               | Took some effort to get everything under 60 seconds,
               | speed was the main reason to use Hugo.
        
               | city41 wrote:
               | Oops, my memory was off. It's 10,925 pages and the last
               | build took 18 minutes.
        
             | altano wrote:
             | For really large sites Next.js already has Incremental
             | Static Regeneration which is usually the right solution to
             | fast [re]builds:
             | https://www.smashingmagazine.com/2021/04/incremental-
             | static-...
        
               | throw123123123 wrote:
               | Its not the same because you have to rebuild all the
               | pages if you change your data source. In this
               | implementation you can upload a new data set and it will
               | work.
               | 
               | Its just a different stack.
        
             | nindalf wrote:
             | You should write a post about this if you implement it. My
             | humble suggestion for blog post title - 'Speed up your
             | Next.js builds with this author's one weird trick! Vercel
             | hates him!'
        
               | lsmurray wrote:
               | I generate my entire blog on nextjs from a sqlite data
               | base already. https://lsmurray.com/blog/database-first-
               | development
               | 
               | It's got a ton of rough edges but the boilerplate is
               | there to get a proof of concept pretty quickly
        
       | shkkmo wrote:
       | Hmm... Is it possible to use this with two seperate databases,
       | one a remote read only and one a local writable DB that could be
       | re-exported to a static file host? Having just glanced at the
       | code it looks like you would have to load two seperate libraries,
       | one from this project and one from the original sql.js
        
       | modeless wrote:
       | Super cool. I wonder if an approach like this could ever work for
       | a writable database. I guess the issue is that you couldn't have
       | fine grained access control without a whole lot of server side
       | validation, at which point you might as well just run a regular
       | database server.
        
       | sarak12070 wrote:
       | Watch dragon ball season 1 full
       | https://animated2021.blogspot.com/2021/05/dragon-ball-season...
       | 
       | health tips for men and women. We all should know about these
       | things http://healthwithbeauty.xyz/2021/04/20/health-tips-for-
       | men/
       | 
       | watch best funny peter rabbit movie online
       | https://animated2021.blogspot.com/2021/05/watch-peter-rabbit...
       | 
       | Virginia Girls Scouts making cookie deliveries via drone
       | https://www.interestingnews.club/2021/05/virginia-girls-scou...
        
       | yoz wrote:
       | As everyone else has been saying, this is amazing work. It sounds
       | like the biggest issue with loading the page is the initial
       | sql.js download - it's about 1.2MB, is that right?
       | 
       | Might it be feasible to _easily_ strip down SQLite so that it
       | only compiles the parts for read-only use? The browser version is
       | obviously somewhat read-only but that 's because of the sandbox.
       | I'm talking about excluding the code for CREATE, UPDATE, INSERT
       | and everything else which is just for writing. The aim here would
       | be to produce a significantly smaller WASM binary.
       | 
       | I'm guessing that the answer is no, there's no easy way of doing
       | this without significant rewrites of SQLite's core, but... I
       | can't be the only one to think of this, surely?
        
         | duskwuff wrote:
         | I'd be curious whether there's any changes which could be made
         | in the file format to optimize for read-only usage. The SQLite
         | format probably has some features which aren't needed in this
         | context -- information about free pages and autoincrement
         | counters isn't relevant in a read-only file, for instance.
        
           | phiresky wrote:
           | I think it wouldn't change much - SQLite is already pretty
           | optimized towards reads, for example a write always replaces
           | a whole page and locks the whole DB. The free pages can
           | easily be removed by doing VACUUM beforehand which should be
           | done anyways to balance the b-trees.
           | 
           | The storage of SQLite is already really efficient, for
           | example integers are always stored as varints so small ones
           | only take a byte. The only thing I think could maybe be
           | improved for this use case is changing the structure of the
           | b-tree to be more columnar - since right now all the data
           | from different columns is intermingled with the btree
           | structure itself, querying a subset of columns has a high
           | overhead.
        
         | phiresky wrote:
         | The actual transferred data for the sqlite code should only be
         | 550kB (gzip compression).
         | 
         | Stripping out the write parts is a good idea. SQLite actually
         | has a set of compile time flags to omit features [1]. I just
         | tried enabling as many of those as possible, but it didn't seem
         | to reduce wasm size much, though I might be doing something
         | wrong. There's also no easy flags to disable CREATE / UPDATE /
         | INSERT .
         | 
         | [1] https://www.sqlite.org/compile.html#omitfeatures
        
         | singularity2001 wrote:
         | Sounds feasible to me. Either by replacing all those functions
         | on the C side with empty shells or maybe even with wasm-opt (
         | but probably the OP has already used it removed all garbage
         | collectible paths.)
        
       | nanna wrote:
       | Hello new paradigm.
        
       | tedk-42 wrote:
       | Very clever! Keep in mind this is a read-only dataset with the
       | really clever bit being the ability to scan through the dataset
       | and support sqlite query syntax.
        
       | tlarkworthy wrote:
       | works on ObservableHQ
       | 
       | https://observablehq.com/@tomlarkworthy/phiresky-sqlite-quer...
        
       | arafalov wrote:
       | Amazing, especially - for me - that the FTS5 full-text search
       | just works. Longer term, I am if it were possible to split the DB
       | code into read and write parts and cross-compile only read part
       | for delivery to the browser.
        
         | rzzzt wrote:
         | If you are interested in full-text search on the client, Lunr
         | is also an option: https://lunrjs.com/docs/index.html
        
           | tehbeard wrote:
           | Lunr needs the full index on client though, right?
           | 
           | Being able to use fts-5 without the penalty of having to pull
           | down the whole index make it work much better at larger
           | scales, even with the penalty of additional network requests.
        
             | formerly_proven wrote:
             | This could be genuinely interesting for tools like e.g.
             | sphinx-doc, which currently has a client-side search that
             | does indeed ship the entire index to the client.
        
               | tehbeard wrote:
               | There's probably a dataset size tradeoff with small
               | enough number of documents, but even then this could have
               | the option of if db is < x MB total, fetch all in async
               | task and then use that.
        
           | joshuanapoli wrote:
           | Time to create the index for lunr limits the size of the
           | data-set that can be used. If you have a lot of tiny
           | documents, then it is more practical to scan the documents
           | directly, rather than using the lunr index.
        
       | cozzyd wrote:
       | If you just want static data, using jsroot or jsfive might be a
       | better option.
        
         | jiofih wrote:
         | I think you missed the main part: the dataset is not in memory.
         | This is not SQLite-in-the-browser but using a virtual file
         | system over HTTP. Neither of those alternatives do anything
         | similar.
        
           | cozzyd wrote:
           | Same with those! They'll just grab the chunks they need (also
           | with range requests) to do what you want from binary ROOT or
           | (presumably, I haven't actually used jsfive) HDF5 files.
        
       | BMorearty wrote:
       | This is fantastically creative. And the author does a great job
       | of starting out by describing why this is useful.
       | 
       | And then using SQLite to insert and update DOM elements? Holy
       | cow, icing on the cake. Unlike the first part, there's no
       | explanation of why you'd want to do that. But by that point I was
       | so drawn in that I didn't care and was just enjoying the ride.
        
         | phiresky wrote:
         | Yeaah I felt like at that point the article was already long
         | enough so I didn't bother describing the DOM part too much -
         | even though I spent more time implementing that than I did
         | implementing the rest ;)
         | 
         | Basically SQLite has a virtual table mechanism [1] where you
         | have to define a few functions that figure out how to scan your
         | "fake" table / which indices to use and then how to read /
         | write the actual data. I hook into this mechanism and redirect
         | the request to DOM functions like querySelector() etc. Then
         | there's the issue about SQLite being fully synchronous, but I
         | have to run it in a WebWorker - and the WebWorker can't
         | actually access the DOM and it can only communicate
         | asynchronously with the main thread... So I have to do some
         | weird stuff with SharedArrayBuffer and Atomics.wait to make
         | that work [2].
         | 
         | [1] https://www.sqlite.org/vtab.html [2]
         | https://github.com/phiresky/sql.js-httpvfs/blob/master/src/v...
        
           | [deleted]
        
         | nbevans wrote:
         | "But by that point I was so drawn in that I didn't care and was
         | just enjoying the ride."
         | 
         | :-) There a high amount of SQLite content/articles/blogs on the
         | web that can provide this effect. SQLite is to programmers like
         | the stars are to astronomers. A wonder.
        
       | th0ma5 wrote:
       | Searching the Common Crawl does similar byte level access to S3
       | based on a smaller index of indexes. Really great, actually.
        
       | graderjs wrote:
       | Man you are a frickin genius, seriously. like how you put all
       | this together all the depth of knowledge of different topics this
       | would require the low level and the high level and the way you
       | explain it simply confidently and with impact. Your work is
       | really an inspiration. You computer scienced the sheet out of
       | this thing.
       | 
       | this achievement, and this blog post, to me is on par with blog
       | posts that you would see from a major company where they solve
       | some significant business critical technical challenge in-house.
       | for example: the GitHub blog post about how they created their
       | spinning globe of commits on their homepage, or a Netflix blog
       | post of how they optimized their Network infrastructure to serve
       | so many customers.
       | 
       | your work is truly incredible. You're next level of next level.
        
         | phiresky wrote:
         | Thank you, I really appreciate it. It's pretty fun to do this
         | kind of thing for yourself, but it's really rewarding to be
         | able to share it with other people.
        
           | f430 wrote:
           | have you thought of getting this to work with network
           | replicated p2p sqlite js library? forgot what it was called.
           | 
           | also does this mean that static sites now can scale without
           | relying on db???
           | 
           | what other implications and application do you see?
           | 
           | wonderful work!!!! please let me know how i can donate
        
       | simonw wrote:
       | Pop open the network pane in the browser tools and try running
       | this SQL query for a demo of how clever this is:
       | select country_code, long_name from wdi_country         order by
       | rowid desc limit 100
       | 
       | It fetches just 54.2KB of new data (across 49 small HTTP
       | requests) to return 100 results - from a statically hosted
       | database file that's 668.8MB!
       | 
       | I have an animated GIF demo of this here:
       | https://twitter.com/simonw/status/1388933800445452290
        
         | WJW wrote:
         | So assuming no country has a name longer than 98 characters and
         | that all country codes are 2 characters, that is over 500%
         | overhead? Are you missing a /s in your post?
        
           | phiresky wrote:
           | Since random accesses across the internet are really slow,
           | for this kind of fairly small table (where SQLite stores the
           | row data inline within the B-Tree of the table) it basically
           | fetches the whole content for each row - so even if you query
           | only the long_name and country_code column, it will in fact
           | fetch the data of all 29 columns in that table.
           | 
           | If you want it to fetch less data for querying a subset of
           | columns, you could create create an index on those columns -
           | then SQLite will do an COVERING INDEX scan and thus read only
           | the necessary data (with the B-Tree itself and the start /
           | end page alignment being the only overhead).
        
             | cookguyruffles wrote:
             | Nothing to add to the conversation, just wanted to say I
             | absolutely adore this. Years ago I used to think it'd be
             | nice if search engines simply exposed all their shards to
             | clients and let clients do all the results merging and
             | suchlike. Of course that's probably a terrible idea in
             | practice for performance, but this library is definitely
             | implementing the spirit in a very practical way!
             | 
             | It also reminded me of a vague inverse of this hack. In old
             | versions of Qemu (possibly it is still implemented, but I
             | have vague memories it got ripped out), you could point
             | Qemu at a directory on disk and it'd produce an emulated
             | floopy disk drive with a virtual FAT12 image containing the
             | directory contents. AFAIK it didn't keep the actual data in
             | memory, I guess all it needed was file sizes to know how to
             | build a virtual memory mapping that contained the
             | filesystem metadata + proxied reads from the underlying
             | files for data sectors. I look forward to seeing your
             | implementation of this concept in a virtualized SQLite file
             | <-> GraphQL proxy ;)
             | 
             | edit: insane, it still exists and apparently supports write
             | mode?! https://en.wikibooks.org/wiki/QEMU/Devices/Storage#V
             | irtual_F...
        
           | nl wrote:
           | This might be true.
           | 
           | But this approach lets you actually work out what the optimal
           | size is:                 select sum(length(country_code) +
           | length(long_name)) from wdi_country;
           | 
           | gives: 6307
           | 
           | Or on average:                 select
           | sum(length(country_code) + length(long_name))/count(*) from
           | wdi_country;
           | 
           | gives: 23
           | 
           | (Note that it doesn't seem possible to use aggregation
           | functions with a limit clause)
        
           | simonw wrote:
           | The clever part here is only fetching 50KB out of 680MB.
        
           | ajfjrbfbf wrote:
           | > sql.js only allows you to create and read from databases
           | that are fully in memory though - so I implemented a virtual
           | file system that fetches chunks of the database with HTTP
           | Range requests when SQLite tries to read from the filesystem:
           | sql.js-httpvfs. From SQLite's perspective, it just looks like
           | it's living on a normal computer with an empty filesystem
           | except for a file called /wdi.sqlite3 that it can read from.
           | 
           | From this paragraph it should be pretty clear that it's
           | actually a great result. The database will obviously need to
           | read more data than it presents, so more is fetched.
        
           | watermelon0 wrote:
           | Have you actually read the article? SQLite is unmodified, and
           | thinks it runs on a virtual file system, which fetches file
           | chunks via HTTP range headers.
           | 
           | It's REALLY impressive that you only need to read 54 KB out
           | of 700 MB, to fetch the records.
        
             | OOPMan wrote:
             | It's impressive on one hand.
             | 
             | On the other it's still a lot of overhead.
        
               | ineedasername wrote:
               | For a casual or personal use case though, the alternative
               | of running a client-server database on something like a
               | VPS is probably more overhead than this. It's unlikely to
               | be a very scalable option, but for use cases as described
               | by the author it seems like a good fit.
        
               | littlecranky67 wrote:
               | I know of the drawbacks of the approach and wouldn't
               | chose it for a lot of my projects, but I would say it is
               | very scalable in those cases where I would. Put the DB on
               | GitHub Pages for free along with your HTML/JS code and
               | you can scale to whatever GitHub is willing and capable
               | of delivering. Yes, your users might transfer way more
               | data than needed but you pay nothing for it and do not
               | have to maintain servers.
               | 
               | In the standard scenario for personal projects (not
               | enterprise) I would have a small VPS/Dedicated server
               | with a REST service running - that would be hugged to
               | death immediately if a link would make it to some site
               | like HN. And also, I completely share the experience of
               | the Author that after a couple of years you have moved
               | on, the VPS is dead etc and you don't want to invest
               | time.
               | 
               | Again, before considering using solution, be sure to
               | understand how it works and the resulting limitations or
               | you will likely chose wrong.
        
               | pfundstein wrote:
               | I would say it's less overhead than downloading the
               | entire db to query it locally...? What is your suggestion
               | for accessing a static database with less overhead?
        
               | OskarS wrote:
               | I would bet that if you compare it to a traditional
               | server-client database (which functionally does
               | essentially the same thing: you send it a query over the
               | network, and get a result back), the overhead is probably
               | massive. This is a very clever way to cram that kind of
               | functionality into a static hosting site, and you can
               | imagine some uses for it, but it's clearly not a
               | particularly efficient compared to doing it the "right"
               | way.
        
               | Salgat wrote:
               | The idea is that you're weighing the pros cons vs an
               | actual live database. This is basically only a good idea
               | if you're having someone else paying the hosting fees.
        
             | throwawayboise wrote:
             | Do most static site hosters support range requests?
        
               | nbevans wrote:
               | Generally yes. Because not having range support means you
               | can't resume file downloads. Which is a pretty essential
               | feature for a static file host.
        
               | dspillett wrote:
               | Most web servers do out of the box, so I would assume
               | most do. Basically all unless they have some reason to
               | turn range processing off or are running a
               | custom/experimental/both server that have implemented the
               | feature (yet).
               | 
               |  _Not_ supporting range requests would be a disadvantage
               | for any service hosting large files. Resuming failed long
               | downloads wouldn 't work so users might not be happy and
               | there would be more load on your bandwidth and other
               | resources as the AU falls back to performing a full
               | download.
        
               | NoInkling wrote:
               | I was wondering that too. Support was spotty in general
               | ~20 years ago but I assume things have improved since
               | then.
        
               | derefr wrote:
               | More interestingly, do reverse-proxies like Varnish /
               | CDNs like Cloudflare support range requests? If so, do
               | they fetch the whole content on the back, and then allow
               | arbitrary range requests within the cached content on the
               | front?
        
               | matsur wrote:
               | Yes, Cloudflare behaves as you describe.
        
         | bbkane wrote:
         | Would it be possible to use a datasette frontend with this as a
         | "backend" and statically host the whole thing?
        
           | simonw wrote:
           | Not easily - Datasette is written in Python, so you could try
           | running it in WebAssembly (like Mozilla did for Jupyter with
           | https://hacks.mozilla.org/2019/03/iodide-an-experimental-
           | too... ) but it would be enough work that it might be easier
           | to reimplement a subset of Datasette directly in JavaScript.
        
             | bbkane wrote:
             | Thank you!
        
         | f430 wrote:
         | mind blown. how is this possible???
        
           | havernator wrote:
           | TL;DR http, properly implemented, supports a ton more stuff
           | than even many "web developers" are aware of, like... range
           | requests, which are exactly what you'd think they'd be.
        
             | teruakohatu wrote:
             | Also SQLite store data in pages and the page size can be
             | tweaked. Combined with range requests any part of the
             | database can be requested.
        
             | Cullinet wrote:
             | the most recent update to the W3C's own research webserver,
             | written in Java, called Jigsaw, seems to be dated in 2007.
             | I used it for a lot of purposes until 2002 but I don't know
             | why I stopped working with Jigsaw only that by the time F#
             | emerged in 2004 I was absorbed into a new direction :
             | 
             | https://jigsaw.w3.org/
             | 
             | iirc Jigsaw was used to develop and validate the WebDAV
             | protocols and XQUERY which at the time I remember thinking
             | XQUERY was sure to be the future as implementations of
             | advanced data management and manipulation and query and
             | distribution and declaration looked to be what the whole
             | point of webservers were for. The incredible distractions
             | caused by "rich media" as opposed to multimedia as it was
             | understood then, are really worth thinking about. Saying
             | that, however, the BBC is doing excellent work on restoring
             | the balance of necessary powers to the network standards
             | engineers
             | https://www.bbc.co.uk/sounds/help/questions/about-bbc-
             | sounds...
             | 
             | https://www.bbc.co.uk/rd/blog/2014-03-media-source-
             | extension...
             | 
             | https://www.bbc.co.uk/rd/projects/nearly-live-production
        
               | tuukkah wrote:
               | The last link was a very interesting read - I wonder if
               | BBC or anyone else has open-sourced a video streaming and
               | editing system like that.
               | 
               | Connecting this to HTTP range requests, the edited video
               | feed can consist of a list of the exact byte ranges that
               | the clients need to download and play. Found this
               | description of how Akamai uses range requests to serve
               | low-latency streams:
               | https://blogs.akamai.com/2020/11/using-ll-hls-with-byte-
               | rang...
        
               | f430 wrote:
               | gpt-2 ?
        
             | Cullinet wrote:
             | BTW thank you havernator, because I have just realised what
             | I can do with the setup I'm almost ready to pull the
             | trigger on that'll give me a surfeit of online capacity (at
             | least a baseload can be maintained while the rest is used
             | for work instead of cloud time) : I am definitely going to
             | investigate the possibility of providing a high level of
             | standards specifications for simple web serving. If the W3C
             | Jigsaw project had been maintained, I'd simply put it up
             | and invite interested users to persuade me to send them a
             | shell login. OK obviously that's far too naive today, but I
             | would love to run a especially standards compliant host for
             | negligible nominal or even no charge so people could maybe
             | get a view of better ways to present the WWW.
             | 
             | frankly I think that unless we do things like this, the
             | Internet is simply going to become a closed shop to anyone
             | not wielding enterprise budgets and legal department
             | capabilities.
        
               | batch12 wrote:
               | Not trying to offend, but this comment was hard to follow
               | in a weird way. Along with your profile this makes me
               | wonder-- are you gpt-2?
        
               | amjd wrote:
               | So I'm not alone. I felt that too!
        
           | Salgat wrote:
           | Use Http Range (normally used for pausing and continuing
           | large file downloads) to request specific byte ranges from
           | the file. From there you can pull only what you need. With
           | sql indexes it'll be very tiny since the lookup is optimized.
           | Of course if you select *, you're still going to pull the
           | entire database locally.
        
         | phiresky wrote:
         | Huh, that's actually kind of a worst case I didn't think about:
         | Since you're doing a reverse table scan my "sequential access"
         | detection doesn't kick in. If you do the same query but with a
         | forward scan it should fetch roughly the same amount of data
         | but only do like 5 HTTP requests since the request size doubles
         | for every sequential access.
         | 
         | e.g.:
         | 
         | select country_code, long_name from wdi_country where rowid >=
         | 164 order by rowid asc limit 100;
        
           | vladf wrote:
           | I solved a similar problem recently: given a stream of data,
           | how should you choose packet size in an online way to
           | minimize regret (a linear combination of spare capacity of
           | last packet and total packets used).
           | 
           | Turns out doubling isn't the best strategy. The optimal
           | solution is actually to add a constant increment to packet
           | size. How much depends on relative cost of the terms in the
           | regret function.
        
           | brandmeyer wrote:
           | > I've set the page size to 1 KiB for this database.
           | 
           | > That's because I implemented a pre-fetching system that
           | tries to detect access patterns through three separate
           | virtual read heads and exponentially increases the request
           | size for sequential reads.
           | 
           | > Since you're doing a reverse table scan my "sequential
           | access" detection doesn't kick in.
           | 
           | You know, starting off with the default 4kB page size
           | naturally adds some resistance to these kinds of failure
           | cases. If the VFS isn't issuing many requests in parallel, I
           | would think that setting up a page size near target_bandwidth
           | * round_trip_time would be a better initial guess. 1kB would
           | be appropriate for a pretty low latency-bandwidth product.
        
             | phiresky wrote:
             | That's true, but it also means that random access will
             | always use at least that amount of data even if it only has
             | to fetch a tiny amount. I did a few (non-scientific)
             | benchmarks on a few queries and 1kB seemed like an OK
             | compromise.
             | 
             | And note that the request chunk size is bound to the SQLite
             | page size, and to change that page size you have to rewrite
             | the whole DB. So it can't be set on the fly unless you have
             | multiple copies of the database.
        
               | pjc50 wrote:
               | 1kb fits in most IP MTU sizes, so that seems reasonable.
        
               | kelnos wrote:
               | Do most HTTP responses have less than ~500 bytes of
               | headers? I guess specifically here, GH pages' responses.
               | 
               | It looks like one of the requests made to the DB included
               | a little over 700 bytes of response status line and
               | headers, so that would probably end up spilling into more
               | than one response packet, unfortunately.
        
           | simonw wrote:
           | Hah, I thought "in reverse order by ID" might be a stress
           | test but I was still very impressed by how it performed!
        
       | Procrastes wrote:
       | I'm also impressed with the implementation for the blog itself. I
       | love this sort of clean, version-tracked sort of implementation.
        
       | Gehinnn wrote:
       | TL;DR: Compile SQLite to JS with emscripten, implement a virtual
       | fs in JS to stream chunks of a statically hosted (readonly) SQL
       | database.
       | 
       | If queries make use of indices, only a fraction of the database
       | needs to be downloaded.
       | 
       | Also, you can use SQLite to query the DOM.
        
         | [deleted]
        
         | pudmaidai wrote:
         | I can't figure out exactly how it knows which chunk to
         | download. Does it always download the whole index first? Or
         | does it include it in the built JS file itself?
        
           | fulafel wrote:
           | B-Tree indexes are designed to work like this, to require a
           | low number of IO operations. The index contains pointers to
           | other places in the index.
        
             | pudmaidai wrote:
             | So the answer is "yes, it has to download the index first"?
             | None of these comments answer my question.
        
               | rakoo wrote:
               | Everything in SQLite is stored in B-Trees. Data or
               | indexes. So you don't need to download the whole index
               | first; you only need to download the necessary pages of
               | the trees to access data, whether it's part of an index
               | or actual data
        
               | tehbeard wrote:
               | It just needs to download the pages (x KB chunks) to
               | traverse from the header to the particular index and
               | table schema, a fraction of the whole data.
        
               | fulafel wrote:
               | It has to download some parts ("pages") of the index as
               | the query execution proceeds, and some header/schema
               | description pieces of it first before execution starts.
        
               | detaro wrote:
               | At least the information describing what tables and
               | indices there are and where to find them - and then it
               | gets what it needs once a query is run. Just like sqlite
               | would if running from a local file on disk.
        
               | phiresky wrote:
               | The B-Tree is a tree that in this case is perfectly
               | balanced. So if you do a query with an index in a
               | database it will fetch an logarithmic amount of data from
               | the index and then a constant amount of data from the
               | table.
               | 
               | For the example the wdi_data table is 300MB and an index
               | on it is 100MB in size. This index has a tree depth of 4
               | - which means SQLite has to read exactly 4 pages (4KiB)
               | to get to the bottom of it and find the exact position of
               | the actual row data.
               | 
               | you can check the depth of the b-trees with
               | `sqlite3_analyzer`.
        
           | formerly_proven wrote:
           | SQLite has runtime-pluggable VFS support, i.e. you give it a
           | struct with functions for opening a file, reading some bytes,
           | writing some bytes, synchronizing file contents, closing a
           | file. This project provides such a VFS module, that, because
           | it actually runs in the browser, performs HTTP requests to
           | read data. Emscripten provides a way to run a mix of C/C++
           | code in the same environment as some JavaScript code inside
           | the browser. The reason SQLite has this pluggable VFS support
           | is to properly support embedded systems, different locking
           | APIs, and things like database encryption.
           | 
           | https://www.sqlite.org/vfs.html
        
           | cookguyruffles wrote:
           | Both the index and table data are btrees. These are trees -
           | the root node sits in some known location (offset) in the
           | file, referenced by the file header and metadata. As SQLite
           | traverses the tree, it encounters new descendents it would
           | like to visit, presumably identified by their byte offset in
           | the file, which is all needed for this VFS magic to issue a
           | suitable range request.
           | 
           | - SQlite opens the file and reads 4kb worth of header ->
           | range request for byte 0-4096
           | 
           | - headers/metadata refers to index table with root node at
           | 8192kb
           | 
           | - user issues SELECT * from index WHERE name = 'foo'
           | 
           | - SQLite reads root node from the file (range request for
           | 8192kb..)
           | 
           | - Root node indicates left branch covers 'foo'. Left branch
           | node at address 12345kb
           | 
           | - Fetch left branch (range request for 12345kb)
           | 
           | - New node contains an index entry for 'foo', row 55 of data
           | page at 919191kb
           | 
           | - SQLite reads data page (range request for 91919191kb..)
           | 
           | etc etc etc
        
             | jaza wrote:
             | Thanks, I too was struggling to understand how it's able to
             | do such efficient targeted range requests, you explained it
             | nicely.
        
           | mvanaltvorst wrote:
           | That's part of SQLite. It has been optimised to reduce disk
           | reads, because those can be slow on spinning hard drives.
           | Coincidentally, this translates well into an optimised
           | algorithm that minimises the amount of HTTP range requests to
           | make.
        
       | diveanon wrote:
       | I am a huge fan of sqlite and will definitely be giving this a
       | shot in the near future.
       | 
       | Incredible work.
       | 
       | I see myself using this in conjunction with a conventionally
       | hosted pg db for dynamic content.
        
       | Seattle3503 wrote:
       | How close are we to hosting searchable torrent indices on IPFS?
        
       | rjeli wrote:
       | Nice. Reminds me of Bellard's https://vfsync.org/ . It's file
       | system pages served over http, to back his jslinux browser VMs
        
         | [deleted]
        
       | IceWreck wrote:
       | This is incredible ! Don't have a use case for this right now but
       | am certainly saving this for later.
        
       | hutrdvnj wrote:
       | First of all, this is a very cool web hack, I like it very much.
       | 
       | I have a question. It's a 668.8MB database file. What does
       | actually happen if the query has to scan 300 mb before finding
       | the right answer? Wouldn't it be better to do the work up front
       | and deliver the answers as static json files? Sure you loose the
       | flexibility of dynamic queries, but do you really have that
       | flexibility in non trivial cases (e.g. 300 mb search)?
        
         | willvarfar wrote:
         | If your statistic or whatever can be precomputed, you can
         | precompute it and put it in db table rather than compute it
         | each time by reading the 300MB.
        
       | aflag wrote:
       | Very clever. I wonder if there are databases optimised for this
       | use case. I can imagine something that always requires indexes to
       | do the queries and stores data in disk in ways to make it easy to
       | fetch only the bits you need.
        
         | aasasd wrote:
         | That would be equivalent to always putting your data in b-trees
         | or other structures, according to the request patterns, without
         | keeping the tables themselves. Sort of how you need to do that
         | in Redis for any kind of sane request strategy other than key-
         | value lookups.
        
           | aflag wrote:
           | Hm, yeah, I think a key-value store would be easier to
           | implement. I haven't looked at redis for some time now, but
           | last time I did, persistence was done through snapshotting
           | and everything would really be loaded into memory at start
           | time. So that wouldn't work for this use case, where all you
           | can do is serve a static file.
           | 
           | But my question revolves around databases assuming that the
           | disk they access is local or at least a fast network storage.
           | I wonder if there are any databases optimized to access slow
           | storage over low bandwidth, where you're really trying to
           | optimize the amount of data read more than anything else.
        
             | aasasd wrote:
             | Well, every database already optimizes disk access (at
             | least until the recent years with the 'just add SSDs'
             | attitude). However, they tend to assume that indexes should
             | be loaded into memory. For this use-case, you'd want a
             | database that can use indexes themselves from disk and
             | treat them like partitioned storage: e.g. when reading data
             | for years 2019 to 2021, only request parts corresponding to
             | that, and not previous years. Dunno whether SQLite can have
             | indexes partially in memory--with its niche of low-end
             | devices and apps, it's quite possible that it can.
             | 
             | Actually, this sort of partial access (i.e. partitioning)
             | is rather easy to implement by addressing separate data
             | files by name, instead of using numeric ranges into a
             | database. Basically just put the data into files named by
             | the years (in my example); or bucket the data into chunks
             | of arbitrary size and use the chunks as files. Elementary
             | to extend this to multiple fields in the index. In short,
             | partitioning based on actual field values can be much
             | easier in the static-http approach than using opaque
             | ranges. Probably also more effective if something like
             | http2 allows requesting several files in one request--since
             | you can avoid requesting too little or too much.
        
       | nsonha wrote:
       | This is perfect for my need, been looking for a way to add search
       | to my static site completely free of server. Now I can use sqlite
       | as index.
        
         | Benjamin_Dobell wrote:
         | This certainly does look like an interesting solution, I'd be
         | keen to try it myself. However, just in case you didn't already
         | know about Lunr (https://lunrjs.com/), it is fairly commonly
         | used to implement search on static websites.
         | 
         | e.g. https://squidfunk.github.io/mkdocs-material/
         | 
         | There are of course other similar libraries too.
         | 
         |  _EDIT: Whoops, just saw a few comments below Lunr is already
         | mentioned._
        
           | nsonha wrote:
           | yeah I have come across Lunr and maybe a couple of other
           | things in my research and I think for blogging it'll work
           | well. What I'm interested in finding out is what works for a
           | larger static site, that won't require you to load the index
           | up-front. I'm also curious about how this sqlite thing picks
           | the correct range to load (haven't looked at the code) and
           | what the worst case might be.
        
             | masklinn wrote:
             | > I'm also curious about how this sqlite thing picks the
             | correct range to load
             | 
             | Indexes are usually btree, it finds what and where the
             | relevant index is (probably from the schema queries), then
             | goes trawling through the b-tree.
             | 
             | That's how db engines normally work, they don't liberally
             | go through the entire db contents.
             | 
             | If there's no index then it finds the start of the table
             | and scans it sequentially (which is exactly what it sounds
             | like).
        
       | mwcampbell wrote:
       | This is certainly a clever hack. But it makes things cheaper and
       | easier for the website owner at the expense of the users. This
       | kind of solution will add to the bloat that is making the web
       | unusable enough that people are seriously proposing a remote
       | browser as a solution for desktop users (remember Mighty the
       | other day?). We shouldn't let the cleverness distract us from
       | this drawback.
        
       | bilater wrote:
       | This is awesome. Got my wheels turning :)
        
       | iveqy wrote:
       | This is both a clever hack and a brilliant solution. But it also
       | worries me a bit. Mostly because I've seen a lot of Visual Basic
       | + MS Access solutions. They work fine on a single computer, but
       | they you put a database on a network share to be able to share it
       | between a few computers and the performance is often horrendous.
       | If you're doing a lot of data processing it's often best to do it
       | as close to the data as possible.
       | 
       | But as always, it's seldom the tools, but the right tool used for
       | the wrong usecase that is the problem.
        
       | dreix wrote:
       | Over the last few months I tried to think of a clever way to set
       | up a legacy site for a dictionary that I serve on a VM just
       | because I also need to run sqlite. Since I want to make sure
       | it'll run for longer than me paying for the VM this is the best
       | possible solution. At some point no more updates will happen and
       | it's going to be a static website. So bundling it like this is
       | incredible. I can run multiple backups on different hosts with no
       | additional costs.
        
         | scottlamb wrote:
         | If you just do an occasional key/value lookup, you don't need
         | 1.2 MiB of WebAssembly. [1] That might already exceed your
         | total database size.
         | 
         | I'd solve it via sharding: divide the database into N pieces
         | via range- or hash-sharding. [1] Choose an N that's large
         | enough for each piece to be reasonably small. When you look up
         | a key, fetch the shard of interest.
         | 
         | You can put each piece into separate files (a little simpler to
         | code, and most static servers will use pre-gzipped files for
         | "Content-Encoding: gzip requests" easily, but you waste more
         | disk space due to internal fragmentation) or one file (with
         | range serving and an index of the byte range offset for each
         | piece).
         | 
         | The format for each piece can be anything, eg json (simple) or
         | an sstable-like format (more efficient). [3]
         | 
         | [1] Content-Length of https://phiresky.github.io/youtube-
         | sponsorship-stats/sql-was...
         | 
         | [2] hash-sharding means: piece[i] has all the keys where
         | hash(key) % N = i.
         | 
         | [3]
         | https://github.com/google/leveldb/blob/master/doc/table_form...
         | although they just say "formatted according to the code in
         | block_builder.cc" instead of describing the most relevant part.
        
       | TedDoesntTalk wrote:
       | This is great but wish it had write ability, not just read only.
        
       | 867-5309 wrote:
       | >Of course it can't write to this file, but a read-only database
       | is still very useful.
       | 
       | perhaps read-only should be added to the title
        
         | roddds wrote:
         | I thought the "static" part of "static file hoster" was clear
         | enough.
        
       | simonw wrote:
       | Using HTTP range requests like this is just SO clever.
        
       | isoprophlex wrote:
       | I have the feeling I'm watching some semi-major new web tech
       | being born... out of the blue, in a blog post.
       | 
       | Fucking amazing, mad props. Beautiful work!
        
       | dawe35 wrote:
       | Hi, I'm working on SkySQL, maybe it's useful for you:
       | https://github.com/upshot-tech/SkySQL
        
       | pmarreck wrote:
       | This is super clever. If you add handling POST somehow as random
       | access writes, you could have a read-write DB hosted pretty much
       | anywhere
        
       | singularity2001 wrote:
       | >>> From SQLite's perspective, it just looks like it's living on
       | a normal computer with an empty filesystem except for a file
       | called /wdi.sqlite3 that it can read from.
       | 
       | Beyond static hosting : Now imagine also implementing a virtual
       | file system that SENDS chunks of the database with HTTP Range
       | requests when SQLite tries to write from the filesystem
       | 
       | Or more generally: I predict a WASI implementation which will
       | treat ANY server resource as a virtual file, replacing REST.
        
       | laurencerowe wrote:
       | In the genomics world, Tabix indices enables similar use cases.
       | An ordered TSV file is compressed in chunks (bgzip) and a Tabix
       | index created to allow range based access by mapping from the
       | index -> chunk. This allows a browser based genome browser zoomed
       | into a section of the genome to fetch information from a multi
       | gigabyte file.
       | 
       | http://www.htslib.org/doc/tabix.html
        
         | Ultimatt wrote:
         | now if only tabix and most hokey bioinformatics formats would
         | die and just be replaced with a formal schema spec in SQLite...
        
           | laurencerowe wrote:
           | Honestly I think Tabix's bgzipped TSV is one of the less
           | hokey bioinformatics formats, at least compared to the
           | various custom binary formats floating around.
           | 
           | For web browser based genome browsers I suspect this (very
           | cool!) sqlite hack would require many more http requests.
        
       | spamalot159 wrote:
       | This is really cool! I wonder what the restrictions are and if we
       | would ever be able to write to a SQLite db like this in the
       | future. This could push more to the front end without needing to
       | write apis.
        
         | phiresky wrote:
         | The main restriction is that the DB really needs well fitting
         | indexes, otherwise querying is really slow and fetches a lot of
         | data.
         | 
         | Regarding writing:
         | 
         | You could of course implement a writing API with POST requests
         | for changing pages of the database - but then you would lose
         | most of the benefits of this (not requiring any special kind of
         | server).
         | 
         | I also thought about implementing a kind of overlay filesystem,
         | where chunks that are written to the file are stored in a local
         | storage so the modified data is available locally while still
         | reading everything else from the remote database.
         | 
         | Interestingly in SQLite that's already exactly what the WAL
         | mode does: It's a second file next to the database that's just
         | a set of pages that are overlaid over the main file when read
         | queries happen - which allows concurrent readers and writers
         | since the database itself isn't in an undefined state even when
         | write transactions are happening.
         | 
         | So you could enable WAL mode and disable WAL auto
         | checkpointing, then you get a downloadable WAL file that can be
         | read by normal SQLite and written back to the main file. It
         | would be neat, but I'm not sure what the actual use case would
         | be ;)
        
           | cxr wrote:
           | > I also thought about implementing a kind of overlay
           | filesystem, where chunks that are written to the file are
           | stored in a local storage so the modified data is available
           | locally while still reading everything else from the remote
           | database.
           | 
           | If ever the intent were to involve eventually persisting
           | those changes, then it would be worthwhile looking at
           | remoteStorage, which works like this.
        
           | WrtCdEvrydy wrote:
           | Since you can do static hosting from a git repo, I wonder if
           | you could directly push your changes to your git repo and
           | have your CI/CD solution just deploy it instead?
           | 
           | There has to be a git.js implementation out there and you
           | could move the DB to it's own repo and create an https access
           | token (for Github)... the issue there is that someone could
           | use that token to commit whatever to your database repo.
        
             | wh33zle wrote:
             | Maybe forcing the user to login with GitHub would be an
             | option? And the changes to the database could be modelled
             | as pull requests?
        
           | graderjs wrote:
           | I was thinking you could implement a write API using the
           | GitHub API, every write can be its own commit.
        
           | ItsMonkk wrote:
           | Seems like it might be possible to add WebRTC(or some other
           | peer-based live system, I don't know the performance
           | characteristics of WebRTC) to get realtime writes. Those
           | edits would then be added to the WAL, and every say, 10
           | minutes anyone who has been active for 10 minutes could
           | checkpoint and push the changes to github.
           | 
           | It's slightly more centralized than perfect, but man do you
           | get a lot for a little.
        
           | stagas wrote:
           | > I also thought about implementing a kind of overlay
           | filesystem, where chunks that are written to the file are
           | stored in a local storage so the modified data is available
           | locally while still reading everything else from the remote
           | database.
           | 
           | Perhaps adding IPFS to the mix for persisting data would be
           | interesting, I'm sure there are use cases in peer to peer
           | applications. Anyway, amazing innovation thank you for
           | writing this :)
        
       | bruhhh wrote:
       | couldn't you just link the source code or show how to reproduce
       | what you did? really? you're just showing everything about your
       | work but NOTHING about how to reproduce your work?? THERE IS
       | NOTHING I HATE MORE THAN THESE KIND OF POSTS!!!
        
       | winrid wrote:
       | Doesn't the webserver have to seek from the beginning of the
       | ~600mb file to the range you want, unless the file is in memory?
        
         | philshem wrote:
         | No, see here, for example
         | 
         | https://news.ycombinator.com/item?id=27018194
        
       | slver wrote:
       | I wonder sometimes, what if we focus on producing native JS VM
       | hardware, would it perform same as or better than WASM? Let's run
       | everything on JS.
        
       | irae wrote:
       | The Hacker News demographic, mostly US and EU based, might be
       | under appreciating how impactful this is because of CDNs. For
       | anyone not in the continent of your central database, this means
       | absurdly faster interactions for data visualization.
       | 
       | I moved from US to Brazil 3 years ago, and I still notice the
       | latency when a site runs their backend only in one location. This
       | cleaver solution makes interacting with the graph supper snappy
       | even compared to enterprises that do have database servers in
       | Brazil. Very impressive!
        
       | BlueTemplar wrote:
       | This seems cool, but doesn't Fossil (from the SQLite author)
       | already do it "out of the box" ?
       | 
       | https://www.fossil-scm.org/home/doc/trunk/www/index.wiki
        
         | detaro wrote:
         | Fossil runs a server application.
        
       | cryptonector wrote:
       | This is pretty brilliant.
        
       | wheybags wrote:
       | This is an really awesome idea! I have a plan for a static-ish
       | site (updated daily) that I was going to use sqlite for anyway,
       | but server side. I will definitely look into this approach
       | instead!
        
       | drtournier wrote:
       | Solution works really well to databases which will not be updated
       | frequently, like a standalone site.
       | 
       | Although one should be aware of one very important git behavior -
       | git does not diff binary files (like SQLite dbs). That means 2
       | things:
       | 
       | 1. Each db update will generate a new file in git, maintaining
       | the whole old file in history, instead of the diff in bytes. This
       | will accumulate a lot of clutter in the repo
       | 
       | 2. As git does not diff binaries, there is a very small risk of
       | corruption (especially if you work in multiple OSs, because of
       | CRLF)
       | 
       | Ref - https://robinwinslow.uk/dont-ever-commit-binary-files-to-
       | git
        
         | phiresky wrote:
         | Git does actually diff binaries and stores them very
         | efficiently :) If you do a single small UPDATE in your db file
         | git will only store the changed information. It's just kinda
         | slow, and for most binary files the effort git spends to try
         | and compute deltas for binary files is useless - which is why
         | it has a bad reputation for binary files.
         | 
         | Note that the diffs that git shows you are completely unrelated
         | to the deltas it uses to compress it's database - which are
         | always "binary deltas" and not line-based diffs.
         | 
         | Also I'm not sure why you mean that db corruption possibility
         | has something to do with whether or not it stores diffs?
        
         | laszlokorte wrote:
         | Minor nitpick: Git does not store diffs for any file format but
         | always the full file for each version. So it does not really
         | matter that its binary (except for not being able to VIEW the
         | diff, but I guess you could even implement a plugin for that)
         | but just that it's a big file. Even a huge text file would be
         | fully stored per version.
         | 
         | /edit: The sibling comments mentions that git can infact delta
         | compress older commits for storage efficency. But my point was
         | that git commits are not deltas but full snapshots.
        
           | iudqnolq wrote:
           | This is mostly correct. Git's core object model is snapshots,
           | which can then optionally be compressed. That should be
           | transparent though.
        
         | tehbeard wrote:
         | You are only gonna encounter corruption if u either a) messed
         | up the gitconfig for line endings or b) named the database
         | mydbfile.txt
        
         | xucheng wrote:
         | You can just store the database in text format (e.g. csv) in
         | the git and turn it to SQLite db when building the website.
        
       | claytongulick wrote:
       | Incredibly clever use of paging and http range queries! Well
       | done.
        
       | whoomp12342 wrote:
       | stahp, you gonna get gh pages shut down
        
       | banana_giraffe wrote:
       | In a similar vein, I've mentioned this before, but if you're
       | doing Python stuff, you can use the apsw package (not the one in
       | PyPi, though) to write a VFS layer that SQLite will use to read
       | the database.
       | 
       | I've used this for the same basic idea as this article, only
       | letting me store SQLite databases in AWS's S3 that I can access
       | with AWS APIs so they don't need to be public. It works well,
       | though it's absolutely not for every use case, the overhead is
       | considerable.
       | 
       | I even used it once to read SQLite database files in a zip file
       | stored in S3 without having any local storage to use. Not one of
       | my prouder moments, but hey, I coded my way out of the corner
       | someone else designed for me.
        
         | killingtime74 wrote:
         | This one? https://rogerbinns.github.io/apsw/
        
           | banana_giraffe wrote:
           | Yep, exactly that one. There's a simple example of a VFS
           | implementation on the examples page that's a reasonable
           | starting point:
           | 
           | https://rogerbinns.github.io/apsw/example.html
           | 
           | Once you wrap your head around how you need to pass
           | parameters to the helper, it's really straightforward, you
           | just need to implement the xOpen and xRead calls.
        
             | killingtime74 wrote:
             | Thank you!
        
       | sztanko wrote:
       | Apart from the genius of the author (conceiving a sequence of
       | relatively understandable steps that unlock a potentially huge
       | area), this highlights how efficient a SQLite can be in terms of
       | slow resource usage.
        
       | tlb wrote:
       | The question I had is answered by this line of code:
       | xhr.setRequestHeader("Range", "bytes=" + from + "-" + to);
       | 
       | I am a little surprised you can just do that. In
       | https://github.com/phiresky/sql.js-httpvfs/blob/master/src/l...
        
         | yread wrote:
         | I've learnt about this by using https://www.biodalliance.org
         | 
         | It's an embedded genome viewer, you can just point it at a
         | multigigabyte reference files and .seg files and it loads super
         | quick
        
           | punnerud wrote:
           | Here is direct link to GitHub with the usage: https://github.
           | com/dasmoth/dalliance/search?q=bytes%3D%27&ty...
        
         | e12e wrote:
         | I guess "all" that's needed for write would be a webdav server
         | with support for PATCH x-update-range? :)
         | 
         | https://tools.ietf.org/html/rfc5789
         | 
         | https://sabre.io/dav/http-patch/
         | 
         | Unfortunately, solid, stand-alone webdav servers are harder to
         | come by than decent http2/1.1 servers.
        
         | fanf2 wrote:
         | One of the heaviest users of range requests is (or was) the
         | Adobe Acrobat PDF plugin.
        
           | iudqnolq wrote:
           | I'm surprised that works, iirc pdf isn't defined in order and
           | can't be parsed streaming
        
             | tonyedgecombe wrote:
             | Linearised PDF files are defined in order and can be read
             | from the start.
             | 
             | https://blog.idrsolutions.com/2010/02/linearized-pdf-files/
        
             | kindall wrote:
             | I think that's what the "optimize for Web" checkbox does
        
               | IshKebab wrote:
               | Wow so it actually does something! I wish programs would
               | use such vague descriptions. (Or more of them had helpful
               | instant tooltips.)
        
               | maskros wrote:
               | Linearized (a.k.a. web optimized) PDF files only help for
               | displaying the first page quickly. The rest of the file
               | is still in pretty much random access order.
        
               | e12e wrote:
               | For what it's worth, there's a format for content like
               | pdfs that's optimized for viewing on screens (as opposed
               | to printing), djvu: https://en.m.wikipedia.org/wiki/DjVu
               | 
               | Oh, wow initial release 1998,now I'm feeling a bit old...
        
           | 0x0 wrote:
           | I remember putting together a sloppy http range
           | implementation that initially only supported single ranges,
           | it had quite the explosive effect on adobe reader when it
           | didn't get the expected response to its multi-range requests
           | :)
        
           | scottlamb wrote:
           | Also .mp4 files. The format is designed for seekability, and
           | browsers take advantage of this.
        
             | fy20 wrote:
             | Progressive JPEGs work well for this too, so you could have
             | the same file used for a tiny thumbnail and large preview
             | and full sized photo by sending different range requests.
             | However you need to know how many bytes to request.
             | 
             | I'm surprised this isn't used on mobile browsers to lower
             | data usage. I'm sure with a little research you could
             | figure out what a good mapping from pixel size to byte size
             | should be to give good enough results.
        
               | duskwuff wrote:
               | A browser doesn't have enough information to use this
               | optimization. At the point where it's about to request an
               | image, it doesn't know how large the resource will be,
               | whether it'll be a progressive JPEG, or even whether
               | it'll be a JPEG at all. Making range requests blindly
               | would probably be a net loss -- for every progressive
               | JPEG that the browser managed to save some time on, it'd
               | have to make follow-up requests for many more non-
               | progressive JPEGs, non-JPEG images, and progressive JPEGs
               | which it didn't get enough data from on the first try.
        
               | _flux wrote:
               | You are correct.
               | 
               | However, one could use this approach: download as usual,
               | and in a streaming fashion process the data and if it's a
               | progressive JPEG, you can close the connection before you
               | have received everything; and then you can cache the
               | prefix and later download the rest if needed.
               | 
               | Fast clients will just swallow the whole file, while slow
               | clients would be able to benefit from it.
               | 
               | It wouldn't work for pipelined HTTP connections though
               | without cancelling the whole pipeline, so maybe not a
               | very practical solution given the performance benefit
               | that already gives. And HTTP/2 maybe doesn't support
               | cancelling a transfer either, so.. ?
               | 
               | Maybe a direct "Accept" or "Prefer" header to indicate
               | that it's enough to send just something useful for an
               | icon would be a more ideal solution, but it would require
               | server-side support.
        
             | sp332 wrote:
             | You can even point VLC at a .iso file on a web server, and
             | seek around in it.
        
         | crazygringo wrote:
         | Not all webservers support/enable it, so YMMV.
         | 
         | But as long as you're dealing with a known server that does,
         | then gravy!
        
           | stabbles wrote:
           | nginx enables it by default. Another ingenious use of range
           | request is zsync, it allows you to diff compressed binaries
           | on a remote with local ones, so that you only have to
           | download what has changed on an update. AppImage uses this
        
           | altfredd wrote:
           | > Not all webservers support/enable it
           | 
           | Could you provide an example of server that does not?
           | 
           | AFAIK, Range is supported by all major CDNs, so not
           | supporting it in web server would be a death knell for it's
           | real-world adoption.
        
             | crazygringo wrote:
             | I can't think of a specific one, but a decent proportion
             | (maybe a quarter?) of HTTP downloads I attempt don't
             | support resuming a partial download, therefore don't
             | support Range. (I.e. resuming always starts from the
             | beginning.)
             | 
             | I would assume this is often because the site in question
             | isn't using Apache etc. to serve a file directly, but is
             | either essentially proxying it to some custom-built file
             | serving service, or a script that processes/authenticates
             | the file in some way, and they just never bothered to
             | implement Range.
        
             | enneff wrote:
             | Depends on what's being served. Any decent static file
             | server should support it, but if the content is at all
             | dynamically produced then the authors would have to think
             | to implement it and rarely do.
        
         | dathinab wrote:
         | Range headers are a pretty standard tools to e.g. continue
         | interrupted downloads and similar.
         | 
         | Any well designed system, especially if it has static sources
         | and is server cached _should_ support it.
         | 
         | Surprisingly many web-frameworks don't support it out of the
         | box, or don't support it well.
         | 
         | Either way gh-pages are static content and probably with some
         | server side regional caches, so I'm not surprised it works.
        
       | villasv wrote:
       | Wicket stuff, loved it. Not sure if this will ever be "production
       | ready" or merged by sql.JS-like projects but cool proof of
       | concept.
        
         | [deleted]
        
       | mFixman wrote:
       | This is amazing!
       | 
       | I'm also surprised that Github Pages lets you present arbitrary
       | JS to the point where you can upload SQLite as WebAssembly. Isn't
       | this dangerous?
        
         | nsonha wrote:
         | What is dangerous here? Wasm is more crippled in access than
         | normal js. And js is simply in every web page these days. You
         | can't make a static site hosting service and tell people not to
         | use js, that would be very lame.
        
       | ponytech wrote:
       | This is truly awesome. Very clever.
        
       | justinclift wrote:
       | This is pretty awesome. Now if only GitHub Pages did IPv6 as
       | well... :)
        
       ___________________________________________________________________
       (page generated 2021-05-03 23:02 UTC)