[HN Gopher] Static.wiki - read-only Wikipedia using a 43GB SQLite file
       ___________________________________________________________________
        
       Static.wiki - read-only Wikipedia using a 43GB SQLite file
        
       Author : segfall
       Score  : 106 points
       Date   : 2021-07-30 21:50 UTC (1 hour ago)
        
 (HTM) web link (static.wiki)
 (TXT) w3m dump (static.wiki)
        
       | suprbeast wrote:
       | We put this on an e-reader once, back in 2011. Onboard
       | Wikipedia was cool.
        
       | dleslie wrote:
       | Two immediate issues:
       | 
       | 1. Doesn't work if Privacy Badger is active.
       | 
       | 2. Nim (Programming Language) is not indexed.
        
         | chadcatlett wrote:
         | I noticed if you go directly to the page for Nim, it works.
         | 
         | http://static.wiki/en/Nim_(programming_language)
        
       | twotwotwo wrote:
       | That's super cool.
       | 
       | The autocomplete trickiness is fun. Looks (from links in GitHub)
       | like it's using full-text search and currently needs to scan a
       | lot to rank results. Doing just prefix search might be a cheap
       | half-solution to autocomplete, and maybe layering on hacks around
       | stopwords etc. can get you a bit further.
       | 
       | For another angle, enwiki's full compressed list of titles
       | (enwiki-20210720-all-titles-in-ns0.gz) is 86MB. Grabbing the
       | whole dang thing (in the background?) and doing something with it
       | client-side is a different sort of approach from what this page
       | is going for but not unthinkable.
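       | 
       | A minimal sketch of the prefix-search idea (the table and
       | column names here are mine, not the site's): with a plain
       | index on the title column, "starts with X" becomes a tiny
       | range scan, which also keeps the HTTP Range reads small.
       | 
       |     // Build an index-friendly prefix query (illustrative
       |     // names; assumes a non-empty ASCII prefix). Upper bound:
       |     // bump the prefix's last character, so 'que' matches
       |     // everything in ['que', 'quf').
       |     function prefixQuery(prefix: string) {
       |       const upper =
       |         prefix.slice(0, -1) +
       |         String.fromCharCode(
       |           prefix.charCodeAt(prefix.length - 1) + 1);
       |       return {
       |         sql: `SELECT title FROM titles
       |               WHERE title >= ? AND title < ?
       |               ORDER BY title LIMIT 10`,
       |         params: [prefix, upper],
       |       };
       |     }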
        
       | dom96 wrote:
       | Really cool! I think I see a bug: searching "Nim" shows "Nim
       | (programming language)" but it's not clickable.
        
       | punnerud wrote:
       | Related post 89 days ago:
       | https://news.ycombinator.com/item?id=27016630 (Hosting SQLite
       | databases on GitHub Pages or any static file hoster)
        
       | voldemort1968 wrote:
       | There's also https://www.kiwix.org/en/ which is great to take
       | on long flights for browsing Wikipedia offline.
        
       | mongol wrote:
       | Perfect if stuck on a remote deserted island! Must be the closest
       | existing thing to the Hitchhiker's guide?
        
         | yreg wrote:
         | I installed a reader with a Wikipedia dump (without
         | multimedia) on the original iPhone. At the time (at least
         | in my region) WiFi was quite rare, and it really did feel
         | like the Hitchhiker's Guide: incredibly powerful.
         | 
         | I only kept it for a few weeks, though, since it consumed
         | almost all of the space. (And I imagine that only worked
         | because it was the edition in my language; the English
         | Wikipedia must already have been too big.)
        
         | r00fus wrote:
         | Hitchhikers Guide to Wikipedia? HHGTWP?
        
         | wyldfire wrote:
         | Towel not included.
        
       | nynx wrote:
       | This is fantastic. I knew that we'd see things like this once
       | that hosting sqlite databases project was posted.
        
       | arianon wrote:
         | Shocked to find out that this is very much a static
         | website: it's "merely" downloading a small portion of the
         | 43GB SQLite file in question with HTTP Range requests, and
         | then using a WASM-compiled copy of SQLite to query the
         | requested data. Very impressive.
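         | 
         | The mechanism, roughly (a sketch, not the site's actual
         | code, and the .db URL below is made up): the WASM SQLite
         | build routes its page reads through a virtual file system
         | that answers with HTTP Range requests instead of disk reads.
         | 
         |     // Fetch an arbitrary byte slice of the remote database
         |     // file via an HTTP Range request.
         |     async function readDbBytes(
         |       url: string, offset: number, length: number,
         |     ): Promise<Uint8Array> {
         |       const res = await fetch(url, {
         |         headers: {
         |           Range: `bytes=${offset}-${offset + length - 1}`,
         |         },
         |       });
         |       if (res.status !== 206)
         |         throw new Error("server ignored Range header");
         |       return new Uint8Array(await res.arrayBuffer());
         |     }
         | 
         |     // e.g. read the 100-byte SQLite header without touching
         |     // the other 43GB:
         |     //   await readDbBytes("http://static.wiki/en.db", 0, 100);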
        
         | k__ wrote:
         | Didn't know you could pump 43GB of data into a browser.
        
         | simonw wrote:
         | It's the same trick that was described here - it's absolutely
         | brilliant: https://phiresky.github.io/blog/2021/hosting-sqlite-
         | database...
        
           | JoeyBananas wrote:
           | Wouldn't this be vulnerable to DoS attacks? I can make
           | the database run arbitrarily long and complicated queries.
        
             | CodesInChaos wrote:
             | The database runs on the client. It can't do anything you
             | couldn't do through any other http client.
        
             | simonw wrote:
             | Since the work all happens in your browser, the only
             | victims of a long, complicated query would be your own
             | browser and the S3 bandwidth bill of the person hosting
             | the database (since you'd end up sucking down a lot of
             | data).
             | 
             | But if you want to jack up someone's S3 bandwidth bill
             | there are already easier ways to do it!
        
               | floatingatoll wrote:
               | It's a static file, CDN it.
        
               | simlevesque wrote:
               | Many CDNs cannot cache 43GB files.
               | 
               | Cloudflare's limit is 10GB, CloudFront's is 20GB, and
               | Akamai's and Azure's are 150GB.
        
       | max1cc wrote:
       | The search seems to be broken right now, but it looks like you
       | can access whatever page you want via URL (e.g.
       | http://static.wiki/en/Computer_Science)
        
       | sgt wrote:
       | thgttg.sqlite3
        
       | simonw wrote:
       | I'm nervous on your behalf about your S3 bill. Have you done
       | napkin maths for how much this will cost to host, especially
       | given the relative inefficiency of the autocomplete queries?
       | 
       | A trick that might help: create a separate SQLite table
       | containing just the article titles to autocomplete against. That
       | might give you queries that use smaller ranges of the file.
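       | 
       | Something in this direction, say (illustrative schema; it
       | assumes an existing articles table with a title column, and
       | db stands in for whatever handle your SQLite build gives
       | you):
       | 
       |     // Assumed handle to the database being built.
       |     declare const db: { exec(sql: string): void };
       | 
       |     // Run once when building the database. WITHOUT ROWID
       |     // clusters rows in the primary-key B-tree, so a prefix
       |     // scan over titles reads only neighbouring pages.
       |     const buildTitlesTable = `
       |       CREATE TABLE titles (title TEXT PRIMARY KEY)
       |         WITHOUT ROWID;
       |       INSERT INTO titles SELECT title FROM articles;
       |     `;
       |     db.exec(buildTitlesTable);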
        
         | sitkack wrote:
         | I built a Wikipedia reader on the iPad years ago. I used
         | two SQLite databases: one for the article titles, using
         | SQLite FTS. I then compressed the articles themselves and
         | merged the compressed chunks into a set of files < 2GB; the
         | title database held the offset into the compressed blobs.
         | Including only the most popular pages, it came to just
         | under 10GB.
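         | 
         | Roughly this shape (a web-flavoured sketch of the same
         | layout; the real thing was a native app, and all names
         | here are invented):
         | 
         |     // The title database maps a title to where its
         |     // compressed body lives.
         |     interface ArticleLoc {
         |       chunk: string;  // e.g. "articles-003.bin", kept < 2GB
         |       offset: number; // byte offset of the compressed blob
         |       size: number;   // compressed length in bytes
         |     }
         | 
         |     // Read just that slice of the chunk file and inflate it.
         |     async function loadArticle(loc: ArticleLoc) {
         |       const end = loc.offset + loc.size - 1;
         |       const res = await fetch(loc.chunk, {
         |         headers: { Range: `bytes=${loc.offset}-${end}` },
         |       });
         |       const body = res.body!
         |         .pipeThrough(new DecompressionStream("gzip"));
         |       return await new Response(body).text();
         |     }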
        
           | simonw wrote:
           | That's really clever. Were you storing the article text as
           | compressed data in BLOB columns in a SQLite table?
        
         | dom96 wrote:
         | Why do people even use S3 if that's a risk? I really don't
         | get this: do most people just not worry about accidentally
         | wasting money, or do they simply not understand the risk?
        
           | simonw wrote:
           | This is a risk any time you host anything on the internet on
           | any system that charges you for storage, bandwidth or CPU
           | usage. It's fine to take that risk on as long as you're aware
           | of it and keep an eye on your spend.
           | 
           | In this case I'm nervous because 43GB is a lot larger than
           | most projects!
        
           | slunk wrote:
           | It's one thing if you have a source of revenue to justify
           | your S3 costs. My interpretation of the parent commenter's
           | concern is that this person has opened up a multi-gigabyte
           | S3 file publicly and sent lots of traffic its way (via
           | Hacker News) for what appears to be a passion project.
        
         | max_ wrote:
         | He should seed it as a torrent
        
           | Weryj wrote:
           | IPFS?
        
         | ignoramous wrote:
         | > _A trick that might help: create a separate SQLite table
         | containing just the article titles to autocomplete against._
         | 
         | Another one: build a directed acyclic word graph [0][1] of
         | just the page titles and download it to the browser's
         | localStorage for prefix-based searches.
         | 
         | [0] https://github.com/smhanov/dawg
         | 
         | [1] http://stevehanov.ca/blog/?id=115
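         | 
         | As a simplified stand-in for the DAWG (a DAWG additionally
         | merges shared suffixes to shrink the structure), a plain
         | trie gives the same prefix lookups:
         | 
         |     type TrieNode = {
         |       children: Map<string, TrieNode>;
         |       terminal: boolean;
         |     };
         | 
         |     const newNode = (): TrieNode =>
         |       ({ children: new Map(), terminal: false });
         | 
         |     // Add one title to the trie.
         |     function insert(root: TrieNode, word: string): void {
         |       let n = root;
         |       for (const ch of word) {
         |         if (!n.children.has(ch))
         |           n.children.set(ch, newNode());
         |         n = n.children.get(ch)!;
         |       }
         |       n.terminal = true;
         |     }
         | 
         |     // Collect up to `limit` titles starting with `prefix`.
         |     function complete(
         |       root: TrieNode, prefix: string, limit = 10,
         |     ): string[] {
         |       let n: TrieNode | undefined = root;
         |       for (const ch of prefix) n = n?.children.get(ch);
         |       const out: string[] = [];
         |       const walk = (nd: TrieNode, acc: string): void => {
         |         if (out.length >= limit) return;
         |         if (nd.terminal) out.push(prefix + acc);
         |         for (const [ch, child] of nd.children)
         |           walk(child, acc + ch);
         |       };
         |       if (n) walk(n, "");
         |       return out;
         |     }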
        
         | Weryj wrote:
         | Get a FusionIO drive off eBay and host the DB yourself; if
         | you need to, stick the computer behind a static-IP VPN.
        
         | capableweb wrote:
         | Seems to be hosted at Surge,sh which is hosted at Digital
         | Ocean, so seems this is not OPs problem.                   $
         | nslookup static.wiki                  Server:  8.8.8.8
         | Address: 8.8.8.8#53                  Non-authoritative answer:
         | static.wiki canonical name = na-west1.surge.sh.         Name:
         | na-west1.surge.sh         Address: 188.166.132.94
         | $ whois 188.166.132.94                  % This is the RIPE
         | Database query service.         % The objects are in RPSL
         | format.         %         % The RIPE Database is subject to
         | Terms and Conditions.         % See
         | http://www.ripe.net/db/support/db-terms-conditions.pdf
         | % Note: this output has been filtered.         %       To
         | receive output for a database update, use the "-B" flag.
         | % Information related to '188.166.132.0 - 188.166.135.255'
         | % Abuse contact for '188.166.132.0 - 188.166.135.255' is
         | 'abuse@digitalocean.com'                  inetnum:
         | 188.166.132.0 - 188.166.135.255         netname:
         | DIGITALOCEAN         country:        NL         admin-c:
         | PT7353-RIPE         tech-c:         PT7353-RIPE         status:
         | ASSIGNED PA         mnt-by:         digitalocean
         | created:        2019-04-17T14:02:52Z         last-modified:
         | 2019-04-17T14:02:52Z         source:         RIPE
         | person:         Network Operations         address:        101
         | Ave of the Americas, 10th Floor         address:        New
         | York, NY, 10013         address:        United States of
         | America         phone:          +13478756044         nic-hdl:
         | PT7353-RIPE         mnt-by:         digitalocean
         | created:        2015-03-11T16:37:07Z         last-modified:
         | 2019-04-17T14:37:51Z         source:         RIPE # Filtered
         | org:            ORG-DOI2-RIPE                  % This query was
         | served by the RIPE Database Query Service version 1.101
         | (HEREFORD)
         | 
         | But yes, it is wise to run storage/transfer heavy websites on
         | something that doesn't charge for traffic like AWS and most
         | other infrastructure providers do.
        
           | simonw wrote:
           | That's just the site itself (the HTML and JavaScript). The
           | 43GB SQLite database is hosted on S3.
        
           | [deleted]
        
       ___________________________________________________________________
       (page generated 2021-07-30 23:00 UTC)