[HN Gopher] Hosting SQLite databases on GitHub Pages or any stat...
___________________________________________________________________
Hosting SQLite databases on GitHub Pages or any static file hoster
Author : phiresky
Score : 1657 points
Date : 2021-05-02 16:43 UTC (1 days ago)
(HTM) web link (phiresky.github.io)
(TXT) w3m dump (phiresky.github.io)
| Arnavion wrote:
| I wrote a similar thing in Rust for a Factorio mod manager. Mods
| are hosted on the remote HTTP server as ZIP files, and the mod
| manager needs a single `info.json` file from the ZIP for the mod
| metadata. So the mod manager avoids downloading the whole mod and
| then unpacking it by building a file abstraction that uses HTTP
| range queries to download just the chunks it needs. For ZIP files
| the directory is stored at the end at an unknown offset, so the
| read pattern is to gradually seek backwards from the end until
| you find the start of the directory, then find the file entry,
| then seek and read the file.
|
| I didn't fiddle with the window sizes like the submitted article
| (the chunk is fixed to 8KiB), but I did optimize it so that
| reading chunk N+1 of the file reused the response reader of chunk
| N rather than make a new request. Furthermore I keep an LRU cache
| of only the last three chunks in memory, because the ZIP files
| are each only read once.
|
| [1]: https://github.com/Arnavion/fac-
| rs/blob/2d2622a1c9934719ce65...
|
| [2]: https://github.com/Arnavion/fac-
| rs/blob/2d2622a1c9934719ce65...
| CyberRabbi wrote:
| Over high latency links this would be virtually unusable. Why not
| just download the entire database into memory over XHR on page
| load? SQLite databases of pure data usually aren't over 10MB in
| size.
| petters wrote:
| > SQLite databases of pure data usually aren't over 10MB in
| size
|
| Why do you think this?
| [deleted]
| mcculley wrote:
| That is hilariously wrong for a lot of use cases. I will find
| this very handy for some SQLite databases I have that are
| several GBs in size. I am looking right now at using this
| contribution.
| CyberRabbi wrote:
| It's not hilariously wrong or wrong at all that over high
| latency links this would be virtually unusable.
|
| It's certainly possible that people are using SQLite
| databases with sizes on the order of gigabytes but in my
| experience those are the exception not the rule.
| tehbeard wrote:
| Over high latency links most anything interesting is
| virtually unusable, so stop using it as a high horse to
| stand upon.
| simonw wrote:
| The demo here uses a 670MB database file.
| formerly_proven wrote:
| This particular demo page actually makes queries against an
| almost 700 MB large (fills one CD!) SQLite database. Because
| the amount of data read is almost negligible (few hundred kB),
| performance is limited by latency (as you say). However, high-
| latency links also tend to be slower, so downloading the entire
| database a-priori would almost always be much slower.
|
| For example, on a 1 megabit/s link with 300 ms RTT, one example
| would take about 2 seconds for the data transfer itself while
| spending another 3 seconds or so on waiting. Downloading the
| entire file would take around _an hour and a half_.
|
| For your 10 MB database, transferring it as a whole would take
| 80 seconds. Assuming this solution instead needs to read e.g.
| 250 kB (taking 2 seconds to transfer), it could still bounce
| around 250 times to the database before those 10 MB are fully
| downloaded. (This would be a really odd query, since it would
| only read on average two pages per read request)
| CyberRabbi wrote:
| Right but that is an artificially created demo by the author
| to justify the solution being presented (no offense). The
| question is how common are ~GB large SQLite databases in the
| real world relative to databases that are ~MB large?
|
| In my experience SQLite databases of millions of rows of raw
| tabular data tend to compress very well into dozens of
| megabytes. Indeed SQLite is often touted as a file format for
| applications.
| dkarras wrote:
| Even if you compress it, you need it in client memory which
| can also be a dealbreaker. Some people might need it for
| some projects, I know I did.
| CyberRabbi wrote:
| > you need it in client memory which can also be a
| dealbreaker
|
| Most workstations have GBs of available memory. If not
| you can dump it in indexeddb as a raw data store.
|
| I never disputed that it would be useful for some use
| cases. I only said it would be unusable with high latency
| links. If you have a low latency link and aren't running
| very complex queries with lots of random seeks, then this
| should work fine for you.
| rakoo wrote:
| I'm running magnetico
| (https://github.com/boramalper/magnetico) on my VPS. I
| currently have an index of 1.6M magnet links stored in a
| 5GB database.
|
| SQLite is most interesting not when the database is small,
| but when there are very few writes and all you do is
| reading. You can also look at https://datasette.io/ and see
| how SQLite is perfect for representing a lot of datasets
| and querying them
| brandmeyer wrote:
| My team has a few TB of data in SQLite files that are
| themselves dozens of GB each.
|
| We're using them as a replacement for leveldb's sstables,
| but with the structure of full SQL. It is highly effective.
| CyberRabbi wrote:
| Do you think your team's usage of SQLite is
| representative of the average SQLite user?
| detaro wrote:
| Where has it been suggested that this is the best
| solution for "the average SQLite user", instead of a tool
| you can use if it fits your requirements? To take your
| 10MB number, the article _starts_ by mentioning you can
| probably just download the entire thing if you aren 't
| above that exact same number.
| CyberRabbi wrote:
| I made two claims:
|
| > this would be unusable over high latency links.
|
| That is objectively true
|
| > SQLite databases of pure data usually aren't over 10MB
| in size.
|
| No one here has refuted this point.
|
| Any other counterargument is addressing a claim I did not
| make.
| nl wrote:
| It's not anymore unusable over high latency links than
| most website. Also worth noting that the caching is very
| smart, so once things are downloaded it's very fast.
|
| But most high latency links are very slow (so downloading
| large databases is a horrible experience) and (more
| importantly) are often priced by the size of downloads.
| CyberRabbi wrote:
| > It's not anymore unusable over high latency links than
| most website.
|
| That's false. Not all web applications suffer equally
| from high latency links. Depends on how reliant the web
| application is on independent requests. Making one
| request and receiving a single bulk download is much less
| bad than making many dependent requests on a high latency
| link.
| brandmeyer wrote:
| This is the fundamental flaw of 80% thinking. The fact
| that SQLite continues to reach for more users is what has
| made it such a successful general-purpose tool.
| CyberRabbi wrote:
| You didn't answer the question.
|
| > The fact that SQLite continues to reach for more users
| is what has made it such a successful general-purpose
| tool.
|
| I never disputed this. You're responding to a straw man.
| mythrwy wrote:
| I have been using SQLite databases for a few user application
| that has been running for close to a decade now. They are
| usually about 1.5GB.
|
| BTW, SQLite has a (theoretical?) max size of 140TB! (or so I've
| read)
| CyberRabbi wrote:
| What do you think is the size of the average SQLite database?
| mythrwy wrote:
| Given their widespread uses on phones, the overall average
| is likely very small. KBs even.
|
| A more relevant question might be: what would the average
| size of SQLite databases for web type (or even this
| specific use case) applications. I don't know, but 10s or
| 100s of MBs might not be a bad guess.
| _flux wrote:
| However, would those 10s or 100s of MBs be something web
| sites would just like to share in their entirety to the
| client to query as they wish? At least many commercial
| services would prefer to keep most of the data secret and
| even gain insight from the access requests.
|
| But for a more open or a hobbyist project where
| minimizing amounts of data transfer is less important
| than minimizing the amount of doing more work (server-
| side code), then this seems like a decent solution.
|
| It is also worth reminding that this solution only
| practically works for read-only databases.
| detaro wrote:
| And the article _starts_ by mentioning that you can download
| the entire file if it 's not too large. And then goes on to
| present a solution for larger files. What more answer to "Why
| not just download the entire database" do you expect?
| bob1029 wrote:
| Modeling the DOM in SQL... Further evidence that anything we can
| imagine has some stable representation in third normal form. Is
| it fast? Maybe not. But is it correct? Provably.
| cma wrote:
| There was a group that rewrote the Chromium DOM itself in a
| data oriented design (which learns from database design and
| sort of applies to cache utilization and locality) and got a 6X
| speedup in some places:
|
| https://meetingcpp.com/mcpp/slides/2018/Data-oriented%20desi...
| bob1029 wrote:
| This paper is basically what we do, except we have a big
| container type, aptly named "Domain.cs". Inside, you will
| find a public List<T> of every type. We decided to emulate
| SQL throughout the vertical (i.e. each List is a table) in
| order to make mapping to SQL a trivial affair. None of our
| Domain types contains any complex type as a property.
| Everything can go 1:1 to SQL. We use LINQ (or SQL) to produce
| projections as appropriate.
|
| There isn't actually just 1 big domain instance either. It's
| more like one per user session, and then a global instance.
|
| The impact this had on reducing complexity and bugs is
| incredible. I haven't seen a null ref exception in a long
| time. Also, being able to dump your entire universe to disk
| by serializing a single object is really nice.
| bambax wrote:
| This is incredibly clever and fun.
|
| That said, all you need to query an SQLite database on the server
| side is a simple PHP script (as opposed to a running db server),
| and most static hosting providers offer PHP support.
| crazygringo wrote:
| This is hilariously clever.
|
| Using the "Range" HTTP header to _read_ chunks of the database
| file absolutely works!
|
| But to be clear, there's no _write_ equivalent, is there? You can
| 't use "Range" with a PUT request.
| NoInkling wrote:
| Even if there was, I can't imagine your everyday static host
| ever supporting it.
| anonydsfsfs wrote:
| The write equivalent would be the PATCH method using a
| "message/byteranges" body: https://tools.ietf.org/id/draft-
| wright-http-partial-upload-0...
| crazygringo wrote:
| Wow that's fascinating, thanks. That would actually turn HTTP
| into a kind of random-access filesystem, if adopted.
|
| It's amazing but also slightly terrifying in the knowledge
| that then someone's going to write an SMB-over-HTTP
| connector.
| TimTheTinker wrote:
| TA was clear - there's no way to write, since static file
| hosting doesn't support dynamic write to begin with.
|
| However, I imagine a service to support your scenario could be
| written in a standard back-end server language like Go or JS.
| The challenges involved would be much greater, however -- how
| to handle concurrency in particular. I suspect one would do
| better to just run PostgreSQL behind a web API.
| csomar wrote:
| That's basically re-inventing the Database but on the client
| side. We have gone a long way but we are closer to having the
| server side just as a data store.
| ianlevesque wrote:
| This is easily the most clever web programming hack I've seen
| this year. Bravo. I had seen this used for video or audio of
| course but it never occurred to me you could use it for
| databases. There are probably a ton of other formats this is good
| for too.
| Valgrim wrote:
| I wonder if this could be used to serve dynamic maps.
| blacha wrote:
| This is pretty much exactly what we do to serve
| aerial/satellite imagery maps.
|
| We convert the imagery into Cloud optimised geo tiffs and
| store them in S3 https://www.cogeo.org/ then the browser can
| request the tiles directly from S3.
|
| Even the big imagery providers are now storing their imagery
| as COGs, eg
| https://registry.opendata.aws/sentinel-2-l2a-cogs/
| durkie wrote:
| I believe this is protomaps approach: re-encode the mbtiles
| (sqlite-based ) format in to something that can be requested
| with a http range request and thus served from a single dumb
| webserver that doesn't need to understand sqlite or mbtiles
| parsing
| bdon wrote:
| This is the approach I took with
| http://github.com/protomaps/pmtiles , though it's optimized
| for the very specific use case of going from Z/X/Y integer
| coordinates to binary blobs, and takes shortcuts to
| accomplish that (fixed-width keys and root index page)
| config_yml wrote:
| Funny enough I was looking into mbtiles serverless
| solutions before I went to bed, now I start my day
| browsing HN and I find this clever solution. Love HN for
| this type of stuff :)
| sroussey wrote:
| The innovation here is getting sql.js to use http and range
| requests for file access rather than all being in memory.
|
| I wonder when people using next.js will start using this for
| faster builds for larger static sites?
| jhgb wrote:
| Microsoft Access Cloud Edition, basically?
| nbevans wrote:
| Sort of. Access had a "Forms" feature that let you create
| basic GUIs on top of your database. Also, the OP's project is
| (currently) only providing a read-only view of the SQLite
| database. Adding write support is possible but will be far
| less impressive to the HN crowd because SQLITE_BUSY will rear
| its ugly head ;-)
| jhgb wrote:
| I was mostly referring to the shared file access approach.
| tyingq wrote:
| Would also be great to add (efficient) search to a static blog.
| nanna wrote:
| Definitely. Just need to add a layer to the static site
| generator for it to populate the SQLite DB, right?
| intergalplan wrote:
| I'd also version the DB in the URL, else you could end up
| changing the file out from under someone who's already got
| the page loaded, with who-knows-what results depending on
| how different the file is. You could just prefix a head to
| every range request to check for changes, but that adds
| overhead and doesn't actually completely close the gap, so
| it'd still be possible to read a file different from the
| one you intended. Cost is more disk usage, depending on how
| many copies you keep around, but at least keeping the most
| recent "old" version seems reasonable unless you're skating
| _really_ close to the quota on whatever system you 're
| using.
| M2Ys4U wrote:
| Requests could use the If-Unmodified-Since header to
| guard against DB changes without the overhead of a
| separate HEAD request.
|
| If the underlying DB has been changed then the server
| should respond with 412 Precondition Failed.
| edoceo wrote:
| yea, sqlite FTS5 has been pretty amazing for quick search
| solutions (but I use english only)
| misterdata wrote:
| See also https://github.com/bittorrent/sqltorrent, same trick
| but using BitTorrent
| phiresky wrote:
| Yeah, that was one of the inspirations for this. That one
| does not work in the browser though, would be a good project
| to do that same thing but with sqlite in wasm and integrated
| with WebTorrent instead of a native torrent program.
|
| I actually did also implement a similar thing fetching data
| on demand from WebTorrent (and in turn helping to host the
| data yourself by being on the website):
| https://phiresky.github.io/tv-show-ratings/ That uses a
| protobufs split into a hashmap instead of SQLite though.
| westurner wrote:
| This looks pretty efficient. Some chains can be interacted
| with without e.g. web3.js? LevelDB indexes aren't SQLite.
|
| Datasette is one application for views of _read-only_
| SQLite dbs with out-of-band replication.
| https://github.com/simonw/datasette
|
| There are a bunch of *-to-sqlite utilities in corresponding
| dogsheep project.
|
| Arrow JS for 'paged' browser client access to DuckDB might
| be possible and faster but without full SQLite SQL
| compatibility and the SQLite test suite.
| https://arrow.apache.org/docs/js/
|
| https://duckdb.org/ :
|
| > _Direct Parquet & CSV querying_
|
| In-browser notebooks like Pyodide and Jyve have local
| filesystem access with the new "Filesystem Access API", but
| downloading/copying all data to the browser for every run
| of a browser-hosted notebook may not be necessary.
| https://web.dev/file-system-access/
| hfmuehleisen wrote:
| DuckDB can directly & selectively query Parquet files
| over HTTP/S3 as well. See here for examples: https://gith
| ub.com/duckdb/duckdb/blob/6c7c9805fdf1604039ebed...
| mvanaltvorst wrote:
| I'm curious, in what manner could this method speed up Next.js
| builds? That's all done locally, which negates the effect of
| HTTP range requests, right?
| city41 wrote:
| I'm guessing they mean rather than build a static Next site
| that generates 10k+ pages (or whatever large means in the
| given context), it instead creates one page that just queries
| the data from the client.
|
| I have one Next static site that has about 20k pages and
| takes about 20 minutes to build and deploy. I think that's an
| acceptable build time. But I do know of other people around
| the net who have mentioned having sites with 20k-ish pages
| taking an hour+ to build. For them I could see the desire to
| try this sqlite trick.
| gitgud wrote:
| Generating 20k pages in 20mins is impressive, 16 pages a
| second on average.
|
| In my experience, it can take a couple of minutes just to
| deploy 20 pages, but that could just be the overhead of
| Typescript and SASS compilation too...
| nicoburns wrote:
| Hugo claims <1ms a page. Which would mean 20k pages in
| under 20 seconds. 20k pages in 20 mins is not fast!
| gawin wrote:
| Confirmed: Build a 30k pages site with Hugo in 53
| seconds, including grabbing JSON content from Contentful
| and deploy on AWS S3.
|
| Took some effort to get everything under 60 seconds,
| speed was the main reason to use Hugo.
| city41 wrote:
| Oops, my memory was off. It's 10,925 pages and the last
| build took 18 minutes.
| altano wrote:
| For really large sites Next.js already has Incremental
| Static Regeneration which is usually the right solution to
| fast [re]builds:
| https://www.smashingmagazine.com/2021/04/incremental-
| static-...
| throw123123123 wrote:
| Its not the same because you have to rebuild all the
| pages if you change your data source. In this
| implementation you can upload a new data set and it will
| work.
|
| Its just a different stack.
| nindalf wrote:
| You should write a post about this if you implement it. My
| humble suggestion for blog post title - 'Speed up your
| Next.js builds with this author's one weird trick! Vercel
| hates him!'
| lsmurray wrote:
| I generate my entire blog on nextjs from a sqlite data
| base already. https://lsmurray.com/blog/database-first-
| development
|
| It's got a ton of rough edges but the boilerplate is
| there to get a proof of concept pretty quickly
| shkkmo wrote:
| Hmm... Is it possible to use this with two seperate databases,
| one a remote read only and one a local writable DB that could be
| re-exported to a static file host? Having just glanced at the
| code it looks like you would have to load two seperate libraries,
| one from this project and one from the original sql.js
| modeless wrote:
| Super cool. I wonder if an approach like this could ever work for
| a writable database. I guess the issue is that you couldn't have
| fine grained access control without a whole lot of server side
| validation, at which point you might as well just run a regular
| database server.
| sarak12070 wrote:
| Watch dragon ball season 1 full
| https://animated2021.blogspot.com/2021/05/dragon-ball-season...
|
| health tips for men and women. We all should know about these
| things http://healthwithbeauty.xyz/2021/04/20/health-tips-for-
| men/
|
| watch best funny peter rabbit movie online
| https://animated2021.blogspot.com/2021/05/watch-peter-rabbit...
|
| Virginia Girls Scouts making cookie deliveries via drone
| https://www.interestingnews.club/2021/05/virginia-girls-scou...
| yoz wrote:
| As everyone else has been saying, this is amazing work. It sounds
| like the biggest issue with loading the page is the initial
| sql.js download - it's about 1.2MB, is that right?
|
| Might it be feasible to _easily_ strip down SQLite so that it
| only compiles the parts for read-only use? The browser version is
| obviously somewhat read-only but that 's because of the sandbox.
| I'm talking about excluding the code for CREATE, UPDATE, INSERT
| and everything else which is just for writing. The aim here would
| be to produce a significantly smaller WASM binary.
|
| I'm guessing that the answer is no, there's no easy way of doing
| this without significant rewrites of SQLite's core, but... I
| can't be the only one to think of this, surely?
| duskwuff wrote:
| I'd be curious whether there's any changes which could be made
| in the file format to optimize for read-only usage. The SQLite
| format probably has some features which aren't needed in this
| context -- information about free pages and autoincrement
| counters isn't relevant in a read-only file, for instance.
| phiresky wrote:
| I think it wouldn't change much - SQLite is already pretty
| optimized towards reads, for example a write always replaces
| a whole page and locks the whole DB. The free pages can
| easily be removed by doing VACUUM beforehand which should be
| done anyways to balance the b-trees.
|
| The storage of SQLite is already really efficient, for
| example integers are always stored as varints so small ones
| only take a byte. The only thing I think could maybe be
| improved for this use case is changing the structure of the
| b-tree to be more columnar - since right now all the data
| from different columns is intermingled with the btree
| structure itself, querying a subset of columns has a high
| overhead.
| phiresky wrote:
| The actual transferred data for the sqlite code should only be
| 550kB (gzip compression).
|
| Stripping out the write parts is a good idea. SQLite actually
| has a set of compile time flags to omit features [1]. I just
| tried enabling as many of those as possible, but it didn't seem
| to reduce wasm size much, though I might be doing something
| wrong. There's also no easy flags to disable CREATE / UPDATE /
| INSERT .
|
| [1] https://www.sqlite.org/compile.html#omitfeatures
| singularity2001 wrote:
| Sounds feasible to me. Either by replacing all those functions
| on the C side with empty shells or maybe even with wasm-opt (
| but probably the OP has already used it removed all garbage
| collectible paths.)
| nanna wrote:
| Hello new paradigm.
| tedk-42 wrote:
| Very clever! Keep in mind this is a read-only dataset with the
| really clever bit being the ability to scan through the dataset
| and support sqlite query syntax.
| tlarkworthy wrote:
| works on ObservableHQ
|
| https://observablehq.com/@tomlarkworthy/phiresky-sqlite-quer...
| arafalov wrote:
| Amazing, especially - for me - that the FTS5 full-text search
| just works. Longer term, I am if it were possible to split the DB
| code into read and write parts and cross-compile only read part
| for delivery to the browser.
| rzzzt wrote:
| If you are interested in full-text search on the client, Lunr
| is also an option: https://lunrjs.com/docs/index.html
| tehbeard wrote:
| Lunr needs the full index on client though, right?
|
| Being able to use fts-5 without the penalty of having to pull
| down the whole index make it work much better at larger
| scales, even with the penalty of additional network requests.
| formerly_proven wrote:
| This could be genuinely interesting for tools like e.g.
| sphinx-doc, which currently has a client-side search that
| does indeed ship the entire index to the client.
| tehbeard wrote:
| There's probably a dataset size tradeoff with small
| enough number of documents, but even then this could have
| the option of if db is < x MB total, fetch all in async
| task and then use that.
| joshuanapoli wrote:
| Time to create the index for lunr limits the size of the
| data-set that can be used. If you have a lot of tiny
| documents, then it is more practical to scan the documents
| directly, rather than using the lunr index.
| cozzyd wrote:
| If you just want static data, using jsroot or jsfive might be a
| better option.
| jiofih wrote:
| I think you missed the main part: the dataset is not in memory.
| This is not SQLite-in-the-browser but using a virtual file
| system over HTTP. Neither of those alternatives do anything
| similar.
| cozzyd wrote:
| Same with those! They'll just grab the chunks they need (also
| with range requests) to do what you want from binary ROOT or
| (presumably, I haven't actually used jsfive) HDF5 files.
| BMorearty wrote:
| This is fantastically creative. And the author does a great job
| of starting out by describing why this is useful.
|
| And then using SQLite to insert and update DOM elements? Holy
| cow, icing on the cake. Unlike the first part, there's no
| explanation of why you'd want to do that. But by that point I was
| so drawn in that I didn't care and was just enjoying the ride.
| phiresky wrote:
| Yeaah I felt like at that point the article was already long
| enough so I didn't bother describing the DOM part too much -
| even though I spent more time implementing that than I did
| implementing the rest ;)
|
| Basically SQLite has a virtual table mechanism [1] where you
| have to define a few functions that figure out how to scan your
| "fake" table / which indices to use and then how to read /
| write the actual data. I hook into this mechanism and redirect
| the request to DOM functions like querySelector() etc. Then
| there's the issue about SQLite being fully synchronous, but I
| have to run it in a WebWorker - and the WebWorker can't
| actually access the DOM and it can only communicate
| asynchronously with the main thread... So I have to do some
| weird stuff with SharedArrayBuffer and Atomics.wait to make
| that work [2].
|
| [1] https://www.sqlite.org/vtab.html [2]
| https://github.com/phiresky/sql.js-httpvfs/blob/master/src/v...
| [deleted]
| nbevans wrote:
| "But by that point I was so drawn in that I didn't care and was
| just enjoying the ride."
|
| :-) There a high amount of SQLite content/articles/blogs on the
| web that can provide this effect. SQLite is to programmers like
| the stars are to astronomers. A wonder.
| th0ma5 wrote:
| Searching the Common Crawl does similar byte level access to S3
| based on a smaller index of indexes. Really great, actually.
| graderjs wrote:
| Man you are a frickin genius, seriously. like how you put all
| this together all the depth of knowledge of different topics this
| would require the low level and the high level and the way you
| explain it simply confidently and with impact. Your work is
| really an inspiration. You computer scienced the sheet out of
| this thing.
|
| this achievement, and this blog post, to me is on par with blog
| posts that you would see from a major company where they solve
| some significant business critical technical challenge in-house.
| for example: the GitHub blog post about how they created their
| spinning globe of commits on their homepage, or a Netflix blog
| post of how they optimized their Network infrastructure to serve
| so many customers.
|
| your work is truly incredible. You're next level of next level.
| phiresky wrote:
| Thank you, I really appreciate it. It's pretty fun to do this
| kind of thing for yourself, but it's really rewarding to be
| able to share it with other people.
| f430 wrote:
| have you thought of getting this to work with network
| replicated p2p sqlite js library? forgot what it was called.
|
| also does this mean that static sites now can scale without
| relying on db???
|
| what other implications and application do you see?
|
| wonderful work!!!! please let me know how i can donate
| simonw wrote:
| Pop open the network pane in the browser tools and try running
| this SQL query for a demo of how clever this is:
| select country_code, long_name from wdi_country order by
| rowid desc limit 100
|
| It fetches just 54.2KB of new data (across 49 small HTTP
| requests) to return 100 results - from a statically hosted
| database file that's 668.8MB!
|
| I have an animated GIF demo of this here:
| https://twitter.com/simonw/status/1388933800445452290
| WJW wrote:
| So assuming no country has a name longer than 98 characters and
| that all country codes are 2 characters, that is over 500%
| overhead? Are you missing a /s in your post?
| phiresky wrote:
| Since random accesses across the internet are really slow,
| for this kind of fairly small table (where SQLite stores the
| row data inline within the B-Tree of the table) it basically
| fetches the whole content for each row - so even if you query
| only the long_name and country_code column, it will in fact
| fetch the data of all 29 columns in that table.
|
| If you want it to fetch less data for querying a subset of
| columns, you could create create an index on those columns -
| then SQLite will do an COVERING INDEX scan and thus read only
| the necessary data (with the B-Tree itself and the start /
| end page alignment being the only overhead).
| cookguyruffles wrote:
| Nothing to add to the conversation, just wanted to say I
| absolutely adore this. Years ago I used to think it'd be
| nice if search engines simply exposed all their shards to
| clients and let clients do all the results merging and
| suchlike. Of course that's probably a terrible idea in
| practice for performance, but this library is definitely
| implementing the spirit in a very practical way!
|
| It also reminded me of a vague inverse of this hack. In old
| versions of Qemu (possibly it is still implemented, but I
| have vague memories it got ripped out), you could point
| Qemu at a directory on disk and it'd produce an emulated
| floopy disk drive with a virtual FAT12 image containing the
| directory contents. AFAIK it didn't keep the actual data in
| memory, I guess all it needed was file sizes to know how to
| build a virtual memory mapping that contained the
| filesystem metadata + proxied reads from the underlying
| files for data sectors. I look forward to seeing your
| implementation of this concept in a virtualized SQLite file
| <-> GraphQL proxy ;)
|
| edit: insane, it still exists and apparently supports write
| mode?! https://en.wikibooks.org/wiki/QEMU/Devices/Storage#V
| irtual_F...
| nl wrote:
| This might be true.
|
| But this approach lets you actually work out what the optimal
| size is: select sum(length(country_code) +
| length(long_name)) from wdi_country;
|
| gives: 6307
|
| Or on average: select
| sum(length(country_code) + length(long_name))/count(*) from
| wdi_country;
|
| gives: 23
|
| (Note that it doesn't seem possible to use aggregation
| functions with a limit clause)
| simonw wrote:
| The clever part here is only fetching 50KB out of 680MB.
| ajfjrbfbf wrote:
| > sql.js only allows you to create and read from databases
| that are fully in memory though - so I implemented a virtual
| file system that fetches chunks of the database with HTTP
| Range requests when SQLite tries to read from the filesystem:
| sql.js-httpvfs. From SQLite's perspective, it just looks like
| it's living on a normal computer with an empty filesystem
| except for a file called /wdi.sqlite3 that it can read from.
|
| From this paragraph it should be pretty clear that it's
| actually a great result. The database will obviously need to
| read more data than it presents, so more is fetched.
| watermelon0 wrote:
| Have you actually read the article? SQLite is unmodified, and
| thinks it runs on a virtual file system, which fetches file
| chunks via HTTP range headers.
|
| It's REALLY impressive that you only need to read 54 KB out
| of 700 MB, to fetch the records.
| OOPMan wrote:
| It's impressive on one hand.
|
| On the other it's still a lot of overhead.
| ineedasername wrote:
| For a casual or personal use case though, the alternative
| of running a client-server database on something like a
| VPS is probably more overhead than this. It's unlikely to
| be a very scalable option, but for use cases as described
| by the author it seems like a good fit.
| littlecranky67 wrote:
| I know of the drawbacks of the approach and wouldn't
| chose it for a lot of my projects, but I would say it is
| very scalable in those cases where I would. Put the DB on
| GitHub Pages for free along with your HTML/JS code and
| you can scale to whatever GitHub is willing and capable
| of delivering. Yes, your users might transfer way more
| data than needed but you pay nothing for it and do not
| have to maintain servers.
|
| In the standard scenario for personal projects (not
| enterprise) I would have a small VPS/Dedicated server
| with a REST service running - that would be hugged to
| death immediately if a link would make it to some site
| like HN. And also, I completely share the experience of
| the Author that after a couple of years you have moved
| on, the VPS is dead etc and you don't want to invest
| time.
|
| Again, before considering using solution, be sure to
| understand how it works and the resulting limitations or
| you will likely chose wrong.
| pfundstein wrote:
| I would say it's less overhead than downloading the
| entire db to query it locally...? What is your suggestion
| for accessing a static database with less overhead?
| OskarS wrote:
| I would bet that if you compare it to a traditional
| server-client database (which functionally does
| essentially the same thing: you send it a query over the
| network, and get a result back), the overhead is probably
| massive. This is a very clever way to cram that kind of
| functionality into a static hosting site, and you can
| imagine some uses for it, but it's clearly not a
| particularly efficient compared to doing it the "right"
| way.
| Salgat wrote:
| The idea is that you're weighing the pros cons vs an
| actual live database. This is basically only a good idea
| if you're having someone else paying the hosting fees.
| throwawayboise wrote:
| Do most static site hosters support range requests?
| nbevans wrote:
| Generally yes. Because not having range support means you
| can't resume file downloads. Which is a pretty essential
| feature for a static file host.
| dspillett wrote:
| Most web servers do out of the box, so I would assume
| most do. Basically all unless they have some reason to
| turn range processing off or are running a
| custom/experimental/both server that have implemented the
| feature (yet).
|
| _Not_ supporting range requests would be a disadvantage
| for any service hosting large files. Resuming failed long
| downloads wouldn 't work so users might not be happy and
| there would be more load on your bandwidth and other
| resources as the AU falls back to performing a full
| download.
| NoInkling wrote:
| I was wondering that too. Support was spotty in general
| ~20 years ago but I assume things have improved since
| then.
| derefr wrote:
| More interestingly, do reverse-proxies like Varnish /
| CDNs like Cloudflare support range requests? If so, do
| they fetch the whole content on the back, and then allow
| arbitrary range requests within the cached content on the
| front?
| matsur wrote:
| Yes, Cloudflare behaves as you describe.
| bbkane wrote:
| Would it be possible to use a datasette frontend with this as a
| "backend" and statically host the whole thing?
| simonw wrote:
| Not easily - Datasette is written in Python, so you could try
| running it in WebAssembly (like Mozilla did for Jupyter with
| https://hacks.mozilla.org/2019/03/iodide-an-experimental-
| too... ) but it would be enough work that it might be easier
| to reimplement a subset of Datasette directly in JavaScript.
| bbkane wrote:
| Thank you!
| f430 wrote:
| mind blown. how is this possible???
| havernator wrote:
| TL;DR http, properly implemented, supports a ton more stuff
| than even many "web developers" are aware of, like... range
| requests, which are exactly what you'd think they'd be.
| teruakohatu wrote:
| Also SQLite store data in pages and the page size can be
| tweaked. Combined with range requests any part of the
| database can be requested.
| Cullinet wrote:
| the most recent update to the W3C's own research webserver,
| written in Java, called Jigsaw, seems to be dated in 2007.
| I used it for a lot of purposes until 2002 but I don't know
| why I stopped working with Jigsaw only that by the time F#
| emerged in 2004 I was absorbed into a new direction :
|
| https://jigsaw.w3.org/
|
| iirc Jigsaw was used to develop and validate the WebDAV
| protocols and XQUERY which at the time I remember thinking
| XQUERY was sure to be the future as implementations of
| advanced data management and manipulation and query and
| distribution and declaration looked to be what the whole
| point of webservers were for. The incredible distractions
| caused by "rich media" as opposed to multimedia as it was
| understood then, are really worth thinking about. Saying
| that, however, the BBC is doing excellent work on restoring
| the balance of necessary powers to the network standards
| engineers
| https://www.bbc.co.uk/sounds/help/questions/about-bbc-
| sounds...
|
| https://www.bbc.co.uk/rd/blog/2014-03-media-source-
| extension...
|
| https://www.bbc.co.uk/rd/projects/nearly-live-production
| tuukkah wrote:
| The last link was a very interesting read - I wonder if
| BBC or anyone else has open-sourced a video streaming and
| editing system like that.
|
| Connecting this to HTTP range requests, the edited video
| feed can consist of a list of the exact byte ranges that
| the clients need to download and play. Found this
| description of how Akamai uses range requests to serve
| low-latency streams:
| https://blogs.akamai.com/2020/11/using-ll-hls-with-byte-
| rang...
| f430 wrote:
| gpt-2 ?
| Cullinet wrote:
| BTW thank you havernator, because I have just realised what
| I can do with the setup I'm almost ready to pull the
| trigger on that'll give me a surfeit of online capacity (at
| least a baseload can be maintained while the rest is used
| for work instead of cloud time) : I am definitely going to
| investigate the possibility of providing a high level of
| standards specifications for simple web serving. If the W3C
| Jigsaw project had been maintained, I'd simply put it up
| and invite interested users to persuade me to send them a
| shell login. OK obviously that's far too naive today, but I
| would love to run a especially standards compliant host for
| negligible nominal or even no charge so people could maybe
| get a view of better ways to present the WWW.
|
| frankly I think that unless we do things like this, the
| Internet is simply going to become a closed shop to anyone
| not wielding enterprise budgets and legal department
| capabilities.
| batch12 wrote:
| Not trying to offend, but this comment was hard to follow
| in a weird way. Along with your profile this makes me
| wonder-- are you gpt-2?
| amjd wrote:
| So I'm not alone. I felt that too!
| Salgat wrote:
| Use Http Range (normally used for pausing and continuing
| large file downloads) to request specific byte ranges from
| the file. From there you can pull only what you need. With
| sql indexes it'll be very tiny since the lookup is optimized.
| Of course if you select *, you're still going to pull the
| entire database locally.
| phiresky wrote:
| Huh, that's actually kind of a worst case I didn't think about:
| Since you're doing a reverse table scan my "sequential access"
| detection doesn't kick in. If you do the same query but with a
| forward scan it should fetch roughly the same amount of data
| but only do like 5 HTTP requests since the request size doubles
| for every sequential access.
|
| e.g.:
|
| select country_code, long_name from wdi_country where rowid >=
| 164 order by rowid asc limit 100;
| vladf wrote:
| I solved a similar problem recently: given a stream of data,
| how should you choose packet size in an online way to
| minimize regret (a linear combination of spare capacity of
| last packet and total packets used).
|
| Turns out doubling isn't the best strategy. The optimal
| solution is actually to add a constant increment to packet
| size. How much depends on relative cost of the terms in the
| regret function.
| brandmeyer wrote:
| > I've set the page size to 1 KiB for this database.
|
| > That's because I implemented a pre-fetching system that
| tries to detect access patterns through three separate
| virtual read heads and exponentially increases the request
| size for sequential reads.
|
| > Since you're doing a reverse table scan my "sequential
| access" detection doesn't kick in.
|
| You know, starting off with the default 4kB page size
| naturally adds some resistance to these kinds of failure
| cases. If the VFS isn't issuing many requests in parallel, I
| would think that setting up a page size near target_bandwidth
| * round_trip_time would be a better initial guess. 1kB would
| be appropriate for a pretty low latency-bandwidth product.
| phiresky wrote:
| That's true, but it also means that random access will
| always use at least that amount of data even if it only has
| to fetch a tiny amount. I did a few (non-scientific)
| benchmarks on a few queries and 1kB seemed like an OK
| compromise.
|
| And note that the request chunk size is bound to the SQLite
| page size, and to change that page size you have to rewrite
| the whole DB. So it can't be set on the fly unless you have
| multiple copies of the database.
| pjc50 wrote:
| 1kb fits in most IP MTU sizes, so that seems reasonable.
| kelnos wrote:
| Do most HTTP responses have less than ~500 bytes of
| headers? I guess specifically here, GH pages' responses.
|
| It looks like one of the requests made to the DB included
| a little over 700 bytes of response status line and
| headers, so that would probably end up spilling into more
| than one response packet, unfortunately.
| simonw wrote:
| Hah, I thought "in reverse order by ID" might be a stress
| test but I was still very impressed by how it performed!
| Procrastes wrote:
| I'm also impressed with the implementation for the blog itself. I
| love this sort of clean, version-tracked sort of implementation.
| Gehinnn wrote:
| TL;DR: Compile SQLite to JS with emscripten, implement a virtual
| fs in JS to stream chunks of a statically hosted (readonly) SQL
| database.
|
| If queries make use of indices, only a fraction of the database
| needs to be downloaded.
|
| Also, you can use SQLite to query the DOM.
| [deleted]
| pudmaidai wrote:
| I can't figure out exactly how it knows which chunk to
| download. Does it always download the whole index first? Or
| does it include it in the built JS file itself?
| fulafel wrote:
| B-Tree indexes are designed to work like this, to require a
| low number of IO operations. The index contains pointers to
| other places in the index.
| pudmaidai wrote:
| So the answer is "yes, it has to download the index first"?
| None of these comments answer my question.
| rakoo wrote:
| Everything in SQLite is stored in B-Trees. Data or
| indexes. So you don't need to download the whole index
| first; you only need to download the necessary pages of
| the trees to access data, whether it's part of an index
| or actual data
| tehbeard wrote:
| It just needs to download the pages (x KB chunks) to
| traverse from the header to the particular index and
| table schema, a fraction of the whole data.
| fulafel wrote:
| It has to download some parts ("pages") of the index as
| the query execution proceeds, and some header/schema
| description pieces of it first before execution starts.
| detaro wrote:
| At least the information describing what tables and
| indices there are and where to find them - and then it
| gets what it needs once a query is run. Just like sqlite
| would if running from a local file on disk.
| phiresky wrote:
| The B-Tree is a tree that in this case is perfectly
| balanced. So if you do a query with an index in a
| database it will fetch an logarithmic amount of data from
| the index and then a constant amount of data from the
| table.
|
| For the example the wdi_data table is 300MB and an index
| on it is 100MB in size. This index has a tree depth of 4
| - which means SQLite has to read exactly 4 pages (4KiB)
| to get to the bottom of it and find the exact position of
| the actual row data.
|
| you can check the depth of the b-trees with
| `sqlite3_analyzer`.
| formerly_proven wrote:
| SQLite has runtime-pluggable VFS support, i.e. you give it a
| struct with functions for opening a file, reading some bytes,
| writing some bytes, synchronizing file contents, closing a
| file. This project provides such a VFS module, that, because
| it actually runs in the browser, performs HTTP requests to
| read data. Emscripten provides a way to run a mix of C/C++
| code in the same environment as some JavaScript code inside
| the browser. The reason SQLite has this pluggable VFS support
| is to properly support embedded systems, different locking
| APIs, and things like database encryption.
|
| https://www.sqlite.org/vfs.html
| cookguyruffles wrote:
| Both the index and table data are btrees. These are trees -
| the root node sits in some known location (offset) in the
| file, referenced by the file header and metadata. As SQLite
| traverses the tree, it encounters new descendents it would
| like to visit, presumably identified by their byte offset in
| the file, which is all needed for this VFS magic to issue a
| suitable range request.
|
| - SQlite opens the file and reads 4kb worth of header ->
| range request for byte 0-4096
|
| - headers/metadata refers to index table with root node at
| 8192kb
|
| - user issues SELECT * from index WHERE name = 'foo'
|
| - SQLite reads root node from the file (range request for
| 8192kb..)
|
| - Root node indicates left branch covers 'foo'. Left branch
| node at address 12345kb
|
| - Fetch left branch (range request for 12345kb)
|
| - New node contains an index entry for 'foo', row 55 of data
| page at 919191kb
|
| - SQLite reads data page (range request for 91919191kb..)
|
| etc etc etc
| jaza wrote:
| Thanks, I too was struggling to understand how it's able to
| do such efficient targeted range requests, you explained it
| nicely.
| mvanaltvorst wrote:
| That's part of SQLite. It has been optimised to reduce disk
| reads, because those can be slow on spinning hard drives.
| Coincidentally, this translates well into an optimised
| algorithm that minimises the amount of HTTP range requests to
| make.
| diveanon wrote:
| I am a huge fan of sqlite and will definitely be giving this a
| shot in the near future.
|
| Incredible work.
|
| I see myself using this in conjunction with a conventionally
| hosted pg db for dynamic content.
| Seattle3503 wrote:
| How close are we to hosting searchable torrent indices on IPFS?
| rjeli wrote:
| Nice. Reminds me of Bellard's https://vfsync.org/ . It's file
| system pages served over http, to back his jslinux browser VMs
| [deleted]
| IceWreck wrote:
| This is incredible ! Don't have a use case for this right now but
| am certainly saving this for later.
| hutrdvnj wrote:
| First of all, this is a very cool web hack, I like it very much.
|
| I have a question. It's a 668.8MB database file. What does
| actually happen if the query has to scan 300 mb before finding
| the right answer? Wouldn't it be better to do the work up front
| and deliver the answers as static json files? Sure you loose the
| flexibility of dynamic queries, but do you really have that
| flexibility in non trivial cases (e.g. 300 mb search)?
| willvarfar wrote:
| If your statistic or whatever can be precomputed, you can
| precompute it and put it in db table rather than compute it
| each time by reading the 300MB.
| aflag wrote:
| Very clever. I wonder if there are databases optimised for this
| use case. I can imagine something that always requires indexes to
| do the queries and stores data in disk in ways to make it easy to
| fetch only the bits you need.
| aasasd wrote:
| That would be equivalent to always putting your data in b-trees
| or other structures, according to the request patterns, without
| keeping the tables themselves. Sort of how you need to do that
| in Redis for any kind of sane request strategy other than key-
| value lookups.
| aflag wrote:
| Hm, yeah, I think a key-value store would be easier to
| implement. I haven't looked at redis for some time now, but
| last time I did, persistence was done through snapshotting
| and everything would really be loaded into memory at start
| time. So that wouldn't work for this use case, where all you
| can do is serve a static file.
|
| But my question revolves around databases assuming that the
| disk they access is local or at least a fast network storage.
| I wonder if there are any databases optimized to access slow
| storage over low bandwidth, where you're really trying to
| optimize the amount of data read more than anything else.
| aasasd wrote:
| Well, every database already optimizes disk access (at
| least until the recent years with the 'just add SSDs'
| attitude). However, they tend to assume that indexes should
| be loaded into memory. For this use-case, you'd want a
| database that can use indexes themselves from disk and
| treat them like partitioned storage: e.g. when reading data
| for years 2019 to 2021, only request parts corresponding to
| that, and not previous years. Dunno whether SQLite can have
| indexes partially in memory--with its niche of low-end
| devices and apps, it's quite possible that it can.
|
| Actually, this sort of partial access (i.e. partitioning)
| is rather easy to implement by addressing separate data
| files by name, instead of using numeric ranges into a
| database. Basically just put the data into files named by
| the years (in my example); or bucket the data into chunks
| of arbitrary size and use the chunks as files. Elementary
| to extend this to multiple fields in the index. In short,
| partitioning based on actual field values can be much
| easier in the static-http approach than using opaque
| ranges. Probably also more effective if something like
| http2 allows requesting several files in one request--since
| you can avoid requesting too little or too much.
| nsonha wrote:
| This is perfect for my need, been looking for a way to add search
| to my static site completely free of server. Now I can use sqlite
| as index.
| Benjamin_Dobell wrote:
| This certainly does look like an interesting solution, I'd be
| keen to try it myself. However, just in case you didn't already
| know about Lunr (https://lunrjs.com/), it is fairly commonly
| used to implement search on static websites.
|
| e.g. https://squidfunk.github.io/mkdocs-material/
|
| There are of course other similar libraries too.
|
| _EDIT: Whoops, just saw a few comments below Lunr is already
| mentioned._
| nsonha wrote:
| yeah I have come across Lunr and maybe a couple of other
| things in my research and I think for blogging it'll work
| well. What I'm interested in finding out is what works for a
| larger static site, that won't require you to load the index
| up-front. I'm also curious about how this sqlite thing picks
| the correct range to load (haven't looked at the code) and
| what the worst case might be.
| masklinn wrote:
| > I'm also curious about how this sqlite thing picks the
| correct range to load
|
| Indexes are usually btree, it finds what and where the
| relevant index is (probably from the schema queries), then
| goes trawling through the b-tree.
|
| That's how db engines normally work, they don't liberally
| go through the entire db contents.
|
| If there's no index then it finds the start of the table
| and scans it sequentially (which is exactly what it sounds
| like).
| mwcampbell wrote:
| This is certainly a clever hack. But it makes things cheaper and
| easier for the website owner at the expense of the users. This
| kind of solution will add to the bloat that is making the web
| unusable enough that people are seriously proposing a remote
| browser as a solution for desktop users (remember Mighty the
| other day?). We shouldn't let the cleverness distract us from
| this drawback.
| bilater wrote:
| This is awesome. Got my wheels turning :)
| iveqy wrote:
| This is both a clever hack and a brilliant solution. But it also
| worries me a bit. Mostly because I've seen a lot of Visual Basic
| + MS Access solutions. They work fine on a single computer, but
| they you put a database on a network share to be able to share it
| between a few computers and the performance is often horrendous.
| If you're doing a lot of data processing it's often best to do it
| as close to the data as possible.
|
| But as always, it's seldom the tools, but the right tool used for
| the wrong usecase that is the problem.
| dreix wrote:
| Over the last few months I tried to think of a clever way to set
| up a legacy site for a dictionary that I serve on a VM just
| because I also need to run sqlite. Since I want to make sure
| it'll run for longer than me paying for the VM this is the best
| possible solution. At some point no more updates will happen and
| it's going to be a static website. So bundling it like this is
| incredible. I can run multiple backups on different hosts with no
| additional costs.
| scottlamb wrote:
| If you just do an occasional key/value lookup, you don't need
| 1.2 MiB of WebAssembly. [1] That might already exceed your
| total database size.
|
| I'd solve it via sharding: divide the database into N pieces
| via range- or hash-sharding. [1] Choose an N that's large
| enough for each piece to be reasonably small. When you look up
| a key, fetch the shard of interest.
|
| You can put each piece into separate files (a little simpler to
| code, and most static servers will use pre-gzipped files for
| "Content-Encoding: gzip requests" easily, but you waste more
| disk space due to internal fragmentation) or one file (with
| range serving and an index of the byte range offset for each
| piece).
|
| The format for each piece can be anything, eg json (simple) or
| an sstable-like format (more efficient). [3]
|
| [1] Content-Length of https://phiresky.github.io/youtube-
| sponsorship-stats/sql-was...
|
| [2] hash-sharding means: piece[i] has all the keys where
| hash(key) % N = i.
|
| [3]
| https://github.com/google/leveldb/blob/master/doc/table_form...
| although they just say "formatted according to the code in
| block_builder.cc" instead of describing the most relevant part.
| TedDoesntTalk wrote:
| This is great but wish it had write ability, not just read only.
| 867-5309 wrote:
| >Of course it can't write to this file, but a read-only database
| is still very useful.
|
| perhaps read-only should be added to the title
| roddds wrote:
| I thought the "static" part of "static file hoster" was clear
| enough.
| simonw wrote:
| Using HTTP range requests like this is just SO clever.
| isoprophlex wrote:
| I have the feeling I'm watching some semi-major new web tech
| being born... out of the blue, in a blog post.
|
| Fucking amazing, mad props. Beautiful work!
| dawe35 wrote:
| Hi, I'm working on SkySQL, maybe it's useful for you:
| https://github.com/upshot-tech/SkySQL
| pmarreck wrote:
| This is super clever. If you add handling POST somehow as random
| access writes, you could have a read-write DB hosted pretty much
| anywhere
| singularity2001 wrote:
| >>> From SQLite's perspective, it just looks like it's living on
| a normal computer with an empty filesystem except for a file
| called /wdi.sqlite3 that it can read from.
|
| Beyond static hosting : Now imagine also implementing a virtual
| file system that SENDS chunks of the database with HTTP Range
| requests when SQLite tries to write from the filesystem
|
| Or more generally: I predict a WASI implementation which will
| treat ANY server resource as a virtual file, replacing REST.
| laurencerowe wrote:
| In the genomics world, Tabix indices enables similar use cases.
| An ordered TSV file is compressed in chunks (bgzip) and a Tabix
| index created to allow range based access by mapping from the
| index -> chunk. This allows a browser based genome browser zoomed
| into a section of the genome to fetch information from a multi
| gigabyte file.
|
| http://www.htslib.org/doc/tabix.html
| Ultimatt wrote:
| now if only tabix and most hokey bioinformatics formats would
| die and just be replaced with a formal schema spec in SQLite...
| laurencerowe wrote:
| Honestly I think Tabix's bgzipped TSV is one of the less
| hokey bioinformatics formats, at least compared to the
| various custom binary formats floating around.
|
| For web browser based genome browsers I suspect this (very
| cool!) sqlite hack would require many more http requests.
| spamalot159 wrote:
| This is really cool! I wonder what the restrictions are and if we
| would ever be able to write to a SQLite db like this in the
| future. This could push more to the front end without needing to
| write apis.
| phiresky wrote:
| The main restriction is that the DB really needs well fitting
| indexes, otherwise querying is really slow and fetches a lot of
| data.
|
| Regarding writing:
|
| You could of course implement a writing API with POST requests
| for changing pages of the database - but then you would lose
| most of the benefits of this (not requiring any special kind of
| server).
|
| I also thought about implementing a kind of overlay filesystem,
| where chunks that are written to the file are stored in a local
| storage so the modified data is available locally while still
| reading everything else from the remote database.
|
| Interestingly in SQLite that's already exactly what the WAL
| mode does: It's a second file next to the database that's just
| a set of pages that are overlaid over the main file when read
| queries happen - which allows concurrent readers and writers
| since the database itself isn't in an undefined state even when
| write transactions are happening.
|
| So you could enable WAL mode and disable WAL auto
| checkpointing, then you get a downloadable WAL file that can be
| read by normal SQLite and written back to the main file. It
| would be neat, but I'm not sure what the actual use case would
| be ;)
| cxr wrote:
| > I also thought about implementing a kind of overlay
| filesystem, where chunks that are written to the file are
| stored in a local storage so the modified data is available
| locally while still reading everything else from the remote
| database.
|
| If ever the intent were to involve eventually persisting
| those changes, then it would be worthwhile looking at
| remoteStorage, which works like this.
| WrtCdEvrydy wrote:
| Since you can do static hosting from a git repo, I wonder if
| you could directly push your changes to your git repo and
| have your CI/CD solution just deploy it instead?
|
| There has to be a git.js implementation out there and you
| could move the DB to it's own repo and create an https access
| token (for Github)... the issue there is that someone could
| use that token to commit whatever to your database repo.
| wh33zle wrote:
| Maybe forcing the user to login with GitHub would be an
| option? And the changes to the database could be modelled
| as pull requests?
| graderjs wrote:
| I was thinking you could implement a write API using the
| GitHub API, every write can be its own commit.
| ItsMonkk wrote:
| Seems like it might be possible to add WebRTC(or some other
| peer-based live system, I don't know the performance
| characteristics of WebRTC) to get realtime writes. Those
| edits would then be added to the WAL, and every say, 10
| minutes anyone who has been active for 10 minutes could
| checkpoint and push the changes to github.
|
| It's slightly more centralized than perfect, but man do you
| get a lot for a little.
| stagas wrote:
| > I also thought about implementing a kind of overlay
| filesystem, where chunks that are written to the file are
| stored in a local storage so the modified data is available
| locally while still reading everything else from the remote
| database.
|
| Perhaps adding IPFS to the mix for persisting data would be
| interesting, I'm sure there are use cases in peer to peer
| applications. Anyway, amazing innovation thank you for
| writing this :)
| bruhhh wrote:
| couldn't you just link the source code or show how to reproduce
| what you did? really? you're just showing everything about your
| work but NOTHING about how to reproduce your work?? THERE IS
| NOTHING I HATE MORE THAN THESE KIND OF POSTS!!!
| winrid wrote:
| Doesn't the webserver have to seek from the beginning of the
| ~600mb file to the range you want, unless the file is in memory?
| philshem wrote:
| No, see here, for example
|
| https://news.ycombinator.com/item?id=27018194
| slver wrote:
| I wonder sometimes, what if we focus on producing native JS VM
| hardware, would it perform same as or better than WASM? Let's run
| everything on JS.
| irae wrote:
| The Hacker News demographic, mostly US and EU based, might be
| under appreciating how impactful this is because of CDNs. For
| anyone not in the continent of your central database, this means
| absurdly faster interactions for data visualization.
|
| I moved from US to Brazil 3 years ago, and I still notice the
| latency when a site runs their backend only in one location. This
| cleaver solution makes interacting with the graph supper snappy
| even compared to enterprises that do have database servers in
| Brazil. Very impressive!
| BlueTemplar wrote:
| This seems cool, but doesn't Fossil (from the SQLite author)
| already do it "out of the box" ?
|
| https://www.fossil-scm.org/home/doc/trunk/www/index.wiki
| detaro wrote:
| Fossil runs a server application.
| cryptonector wrote:
| This is pretty brilliant.
| wheybags wrote:
| This is an really awesome idea! I have a plan for a static-ish
| site (updated daily) that I was going to use sqlite for anyway,
| but server side. I will definitely look into this approach
| instead!
| drtournier wrote:
| Solution works really well to databases which will not be updated
| frequently, like a standalone site.
|
| Although one should be aware of one very important git behavior -
| git does not diff binary files (like SQLite dbs). That means 2
| things:
|
| 1. Each db update will generate a new file in git, maintaining
| the whole old file in history, instead of the diff in bytes. This
| will accumulate a lot of clutter in the repo
|
| 2. As git does not diff binaries, there is a very small risk of
| corruption (especially if you work in multiple OSs, because of
| CRLF)
|
| Ref - https://robinwinslow.uk/dont-ever-commit-binary-files-to-
| git
| phiresky wrote:
| Git does actually diff binaries and stores them very
| efficiently :) If you do a single small UPDATE in your db file
| git will only store the changed information. It's just kinda
| slow, and for most binary files the effort git spends to try
| and compute deltas for binary files is useless - which is why
| it has a bad reputation for binary files.
|
| Note that the diffs that git shows you are completely unrelated
| to the deltas it uses to compress it's database - which are
| always "binary deltas" and not line-based diffs.
|
| Also I'm not sure why you mean that db corruption possibility
| has something to do with whether or not it stores diffs?
| laszlokorte wrote:
| Minor nitpick: Git does not store diffs for any file format but
| always the full file for each version. So it does not really
| matter that its binary (except for not being able to VIEW the
| diff, but I guess you could even implement a plugin for that)
| but just that it's a big file. Even a huge text file would be
| fully stored per version.
|
| /edit: The sibling comments mentions that git can infact delta
| compress older commits for storage efficency. But my point was
| that git commits are not deltas but full snapshots.
| iudqnolq wrote:
| This is mostly correct. Git's core object model is snapshots,
| which can then optionally be compressed. That should be
| transparent though.
| tehbeard wrote:
| You are only gonna encounter corruption if u either a) messed
| up the gitconfig for line endings or b) named the database
| mydbfile.txt
| xucheng wrote:
| You can just store the database in text format (e.g. csv) in
| the git and turn it to SQLite db when building the website.
| claytongulick wrote:
| Incredibly clever use of paging and http range queries! Well
| done.
| whoomp12342 wrote:
| stahp, you gonna get gh pages shut down
| banana_giraffe wrote:
| In a similar vein, I've mentioned this before, but if you're
| doing Python stuff, you can use the apsw package (not the one in
| PyPi, though) to write a VFS layer that SQLite will use to read
| the database.
|
| I've used this for the same basic idea as this article, only
| letting me store SQLite databases in AWS's S3 that I can access
| with AWS APIs so they don't need to be public. It works well,
| though it's absolutely not for every use case, the overhead is
| considerable.
|
| I even used it once to read SQLite database files in a zip file
| stored in S3 without having any local storage to use. Not one of
| my prouder moments, but hey, I coded my way out of the corner
| someone else designed for me.
| killingtime74 wrote:
| This one? https://rogerbinns.github.io/apsw/
| banana_giraffe wrote:
| Yep, exactly that one. There's a simple example of a VFS
| implementation on the examples page that's a reasonable
| starting point:
|
| https://rogerbinns.github.io/apsw/example.html
|
| Once you wrap your head around how you need to pass
| parameters to the helper, it's really straightforward, you
| just need to implement the xOpen and xRead calls.
| killingtime74 wrote:
| Thank you!
| sztanko wrote:
| Apart from the genius of the author (conceiving a sequence of
| relatively understandable steps that unlock a potentially huge
| area), this highlights how efficient a SQLite can be in terms of
| slow resource usage.
| tlb wrote:
| The question I had is answered by this line of code:
| xhr.setRequestHeader("Range", "bytes=" + from + "-" + to);
|
| I am a little surprised you can just do that. In
| https://github.com/phiresky/sql.js-httpvfs/blob/master/src/l...
| yread wrote:
| I've learnt about this by using https://www.biodalliance.org
|
| It's an embedded genome viewer, you can just point it at a
| multigigabyte reference files and .seg files and it loads super
| quick
| punnerud wrote:
| Here is direct link to GitHub with the usage: https://github.
| com/dasmoth/dalliance/search?q=bytes%3D%27&ty...
| e12e wrote:
| I guess "all" that's needed for write would be a webdav server
| with support for PATCH x-update-range? :)
|
| https://tools.ietf.org/html/rfc5789
|
| https://sabre.io/dav/http-patch/
|
| Unfortunately, solid, stand-alone webdav servers are harder to
| come by than decent http2/1.1 servers.
| fanf2 wrote:
| One of the heaviest users of range requests is (or was) the
| Adobe Acrobat PDF plugin.
| iudqnolq wrote:
| I'm surprised that works, iirc pdf isn't defined in order and
| can't be parsed streaming
| tonyedgecombe wrote:
| Linearised PDF files are defined in order and can be read
| from the start.
|
| https://blog.idrsolutions.com/2010/02/linearized-pdf-files/
| kindall wrote:
| I think that's what the "optimize for Web" checkbox does
| IshKebab wrote:
| Wow so it actually does something! I wish programs would
| use such vague descriptions. (Or more of them had helpful
| instant tooltips.)
| maskros wrote:
| Linearized (a.k.a. web optimized) PDF files only help for
| displaying the first page quickly. The rest of the file
| is still in pretty much random access order.
| e12e wrote:
| For what it's worth, there's a format for content like
| pdfs that's optimized for viewing on screens (as opposed
| to printing), djvu: https://en.m.wikipedia.org/wiki/DjVu
|
| Oh, wow initial release 1998,now I'm feeling a bit old...
| 0x0 wrote:
| I remember putting together a sloppy http range
| implementation that initially only supported single ranges,
| it had quite the explosive effect on adobe reader when it
| didn't get the expected response to its multi-range requests
| :)
| scottlamb wrote:
| Also .mp4 files. The format is designed for seekability, and
| browsers take advantage of this.
| fy20 wrote:
| Progressive JPEGs work well for this too, so you could have
| the same file used for a tiny thumbnail and large preview
| and full sized photo by sending different range requests.
| However you need to know how many bytes to request.
|
| I'm surprised this isn't used on mobile browsers to lower
| data usage. I'm sure with a little research you could
| figure out what a good mapping from pixel size to byte size
| should be to give good enough results.
| duskwuff wrote:
| A browser doesn't have enough information to use this
| optimization. At the point where it's about to request an
| image, it doesn't know how large the resource will be,
| whether it'll be a progressive JPEG, or even whether
| it'll be a JPEG at all. Making range requests blindly
| would probably be a net loss -- for every progressive
| JPEG that the browser managed to save some time on, it'd
| have to make follow-up requests for many more non-
| progressive JPEGs, non-JPEG images, and progressive JPEGs
| which it didn't get enough data from on the first try.
| _flux wrote:
| You are correct.
|
| However, one could use this approach: download as usual,
| and in a streaming fashion process the data and if it's a
| progressive JPEG, you can close the connection before you
| have received everything; and then you can cache the
| prefix and later download the rest if needed.
|
| Fast clients will just swallow the whole file, while slow
| clients would be able to benefit from it.
|
| It wouldn't work for pipelined HTTP connections though
| without cancelling the whole pipeline, so maybe not a
| very practical solution given the performance benefit
| that already gives. And HTTP/2 maybe doesn't support
| cancelling a transfer either, so.. ?
|
| Maybe a direct "Accept" or "Prefer" header to indicate
| that it's enough to send just something useful for an
| icon would be a more ideal solution, but it would require
| server-side support.
| sp332 wrote:
| You can even point VLC at a .iso file on a web server, and
| seek around in it.
| crazygringo wrote:
| Not all webservers support/enable it, so YMMV.
|
| But as long as you're dealing with a known server that does,
| then gravy!
| stabbles wrote:
| nginx enables it by default. Another ingenious use of range
| request is zsync, it allows you to diff compressed binaries
| on a remote with local ones, so that you only have to
| download what has changed on an update. AppImage uses this
| altfredd wrote:
| > Not all webservers support/enable it
|
| Could you provide an example of server that does not?
|
| AFAIK, Range is supported by all major CDNs, so not
| supporting it in web server would be a death knell for it's
| real-world adoption.
| crazygringo wrote:
| I can't think of a specific one, but a decent proportion
| (maybe a quarter?) of HTTP downloads I attempt don't
| support resuming a partial download, therefore don't
| support Range. (I.e. resuming always starts from the
| beginning.)
|
| I would assume this is often because the site in question
| isn't using Apache etc. to serve a file directly, but is
| either essentially proxying it to some custom-built file
| serving service, or a script that processes/authenticates
| the file in some way, and they just never bothered to
| implement Range.
| enneff wrote:
| Depends on what's being served. Any decent static file
| server should support it, but if the content is at all
| dynamically produced then the authors would have to think
| to implement it and rarely do.
| dathinab wrote:
| Range headers are a pretty standard tools to e.g. continue
| interrupted downloads and similar.
|
| Any well designed system, especially if it has static sources
| and is server cached _should_ support it.
|
| Surprisingly many web-frameworks don't support it out of the
| box, or don't support it well.
|
| Either way gh-pages are static content and probably with some
| server side regional caches, so I'm not surprised it works.
| villasv wrote:
| Wicket stuff, loved it. Not sure if this will ever be "production
| ready" or merged by sql.JS-like projects but cool proof of
| concept.
| [deleted]
| mFixman wrote:
| This is amazing!
|
| I'm also surprised that Github Pages lets you present arbitrary
| JS to the point where you can upload SQLite as WebAssembly. Isn't
| this dangerous?
| nsonha wrote:
| What is dangerous here? Wasm is more crippled in access than
| normal js. And js is simply in every web page these days. You
| can't make a static site hosting service and tell people not to
| use js, that would be very lame.
| ponytech wrote:
| This is truly awesome. Very clever.
| justinclift wrote:
| This is pretty awesome. Now if only GitHub Pages did IPv6 as
| well... :)
___________________________________________________________________
(page generated 2021-05-03 23:02 UTC)