[HN Gopher] Static.wiki - read-only Wikipedia using a 43GB SQLit...
___________________________________________________________________
Static.wiki - read-only Wikipedia using a 43GB SQLite file
Author : segfall
Score : 106 points
Date : 2021-07-30 21:50 UTC (1 hours ago)
(HTM) web link (static.wiki)
(TXT) w3m dump (static.wiki)
| suprbeast wrote:
| We put this on an e-reader once. Onboard wikipedia, it was cool.
| It was 2011.
| dleslie wrote:
| Two immediate issues:
|
| 1. Doesn't work if Privacy Badger is active.
|
| 2. Nim (Programming Language) is not indexed.
| chadcatlett wrote:
| I noticed if you go directly to the page for Nim, it works.
|
| http://static.wiki/en/Nim_(programming_language)
| twotwotwo wrote:
| That's super cool.
|
| The autocomplete trickiness is fun. Looks (from links in GitHub)
| like it's using full-text search and currently needs to scan a
| lot to rank results. Doing just prefix search might be a cheap
| half-solution to autocomplete, and maybe layering on hacks around
| stopwords etc. can get you a bit further.
|
| For another angle, enwiki's full compressed list of titles
| (enwiki-20210720-all-titles-in-ns0.gz) is 86MB. Grabbing the
| whole dang thing (in the background?) and doing something with it
| client-side is a different sort of approach from what this page
| is going for but not unthinkable.
| dom96 wrote:
| Really cool! I think I see a bug: searching "Nim" shows "Nim
| (programming language)" but it's not clickable.
| punnerud wrote:
| Related post 89 days ago:
| https://news.ycombinator.com/item?id=27016630 (Hosting SQLite
| databases on GitHub Pages or any static file hoster)
| voldemort1968 wrote:
| There's also https://www.kiwix.org/en/ which is great to take on
| long offline flights, browsing through wikipedia.
| mongol wrote:
| Perfect if stuck on a remote deserted island! Must be the closest
| existing thing to the Hitchhiker's guide?
| yreg wrote:
| I installed a reader with a Wikipedia dump (without multimedia)
| on the original iPhone. At the time (at least in my region)
| WiFi was quite rare and it really did feel like the
| Hitchhiker's guide and incredibly powerful.
|
| I only kept it for a few weeks though since it consumed almost
| all of the space. (And I imagine it was probably my language
| mutation, the english Wikipedia must have been too big
| already.)
| r00fus wrote:
| Hitchhikers Guide to Wikipedia? HHGTWP?
| wyldfire wrote:
| Towel not included.
| nynx wrote:
| This is fantastic. I knew that we'd see things like this once
| that hosting sqlite databases project was posted.
| arianon wrote:
| Shocked to find out that this is very much a static website, it's
| "merely" downloading a small portion of the 43GB SQLite file in
| question with HTTP Range requests, and then it uses a WASM-
| compiled copy of SQLite to query the requested data. Very
| impressive.
| k__ wrote:
| Didn't know you could pump 43GB of data into a browser.
| simonw wrote:
| It's the same trick that was described here - it's absolutely
| brilliant: https://phiresky.github.io/blog/2021/hosting-sqlite-
| database...
| JoeyBananas wrote:
| wouldn't this be vulnerable to DOS attacks? I can make the
| database run arbitrarily long and complicated queries
| CodesInChaos wrote:
| The database runs on the client. It can't do anything you
| couldn't do through any other http client.
| simonw wrote:
| Since the work all happens in your browser, the only victim
| of a long complicated query would be you own browse and the
| S3 bandwidth bill of the person hosting the database (since
| you'd end up sucking down a lot of data).
|
| But if you want to jack up someone's S3 bandwidth bill
| there are already easier ways to do it!
| floatingatoll wrote:
| It's a static file, CDN it.
| simlevesque wrote:
| Many CDNs cannot cache 43GB files.
|
| Cloudflare's limit is 10GB, 20GB for Cloudfront, 150GB
| for Akamai and Azure.
| max1cc wrote:
| The search seems to be broken right now, but it looks like you
| can access whatever page you want via URL (e.g.
| http://static.wiki/en/Computer_Science)
| sgt wrote:
| thgttg.sqlite3
| simonw wrote:
| I'm nervous on your behalf for your S3 bill, have you done napkin
| maths for how much this will cost to host, especially given the
| relative inefficiency of the autocomplete queries?
|
| A trick that might help: create a separate SQLite table
| containing just the article titles to autocomplete against. That
| might give you queries that use smaller ranges of the file.
| sitkack wrote:
| I built a wikipedia reader on the ipad years ago, I used two
| sqlite databases, one for the article titles using sqlite FTS
| then I compressed the articles themselves and merged the
| compressed chunks into a set of files < 2G, the title database
| had the offset into the compressed blobs. Only including the
| most popular pages, it was just under 10G.
| simonw wrote:
| That's really clever. Were you storing the article text as
| compressed data in BLOB columns in a SQLite table?
| dom96 wrote:
| Why do people even use S3 if that's a risk? I really don't get
| this, do most just not worry about accidentally wasting money
| or do they just not understand this risk?
| simonw wrote:
| This is a risk any time you host anything on the internet on
| any system that charges you for storage, bandwidth or CPU
| usage. It's fine to take that risk on as long as you're aware
| of it and keep an eye on your spend.
|
| In this case I'm nervous because 43GB is a lot larger than
| most projects!
| slunk wrote:
| It's one thing if you have a source of revenue so justify
| your S3 costs. My interpretation of the parent commentator's
| concern is that this person has opened up a multi gigabyte S3
| file publicly and sent lots of traffic its way (via hacker
| news) for what appears to be a passion project.
| max_ wrote:
| He should seed a torrent file
| Weryj wrote:
| IPFS?
| ignoramous wrote:
| > _A trick that might help: create a separate SQLite table
| containing just the article titles to autocomplete against._
|
| Another one: build a directed-acyclic-word-graph [0][1] of just
| the page-titles and download it to browser's localStorage for
| prefix-based searches.
|
| [0] https://github.com/smhanov/dawg
|
| [1] http://stevehanov.ca/blog/?id=115
| Weryj wrote:
| Get a FusionIO drive off ebay and host the DB yourself, if you
| need, stick the computer behind a static ip VPN.
| capableweb wrote:
| Seems to be hosted at Surge,sh which is hosted at Digital
| Ocean, so seems this is not OPs problem. $
| nslookup static.wiki Server: 8.8.8.8
| Address: 8.8.8.8#53 Non-authoritative answer:
| static.wiki canonical name = na-west1.surge.sh. Name:
| na-west1.surge.sh Address: 188.166.132.94
| $ whois 188.166.132.94 % This is the RIPE
| Database query service. % The objects are in RPSL
| format. % % The RIPE Database is subject to
| Terms and Conditions. % See
| http://www.ripe.net/db/support/db-terms-conditions.pdf
| % Note: this output has been filtered. % To
| receive output for a database update, use the "-B" flag.
| % Information related to '188.166.132.0 - 188.166.135.255'
| % Abuse contact for '188.166.132.0 - 188.166.135.255' is
| 'abuse@digitalocean.com' inetnum:
| 188.166.132.0 - 188.166.135.255 netname:
| DIGITALOCEAN country: NL admin-c:
| PT7353-RIPE tech-c: PT7353-RIPE status:
| ASSIGNED PA mnt-by: digitalocean
| created: 2019-04-17T14:02:52Z last-modified:
| 2019-04-17T14:02:52Z source: RIPE
| person: Network Operations address: 101
| Ave of the Americas, 10th Floor address: New
| York, NY, 10013 address: United States of
| America phone: +13478756044 nic-hdl:
| PT7353-RIPE mnt-by: digitalocean
| created: 2015-03-11T16:37:07Z last-modified:
| 2019-04-17T14:37:51Z source: RIPE # Filtered
| org: ORG-DOI2-RIPE % This query was
| served by the RIPE Database Query Service version 1.101
| (HEREFORD)
|
| But yes, it is wise to run storage/transfer heavy websites on
| something that doesn't charge for traffic like AWS and most
| other infrastructure providers do.
| simonw wrote:
| That's just the site itself (the HTML and JavaScript). The
| 43GB SQLite database is hosted on S3.
| [deleted]
___________________________________________________________________
(page generated 2021-07-30 23:00 UTC)