hngopher.com

       [HN Gopher] Show HN: Static.wiki - read-only Wikipedia using a 4...
       ___________________________________________________________________
        
       Show HN: Static.wiki - read-only Wikipedia using a 43GB SQLite file
        
       Author : segfall
       Score  : 733 points
       Date   : 2021-07-30 21:50 UTC (1 days ago)
        
 (HTM) web link (static.wiki)
 (TXT) w3m dump (static.wiki)
        
       | jart wrote:
       | [Hit #1 on HN hosting 43gb Wikipedia SQLite DB on AWS]
       | https://justine.lol/hugemistake.gif
       | 
       | Each time I post a link to Hacker News that becomes popular it
       | ends up being at least a $100 in bandwidth bills from Google
       | Cloud, and I've got a simple tiny unbloated website. This poor
       | guy is going to go bankrupt. The Internet dark age is like Ma
       | Bell all over again when we'd get those $500 long distance bills
       | for dialing into BBSs.
        
       | cafxx wrote:
       | [nit] you should really run analyze before vacuum, not the other
       | way around (https://github.com/segfall/static-
       | wiki/blob/6d455de149c24e2a...)
        
       | yukinon wrote:
       | There's a bug when the target article is titled the same as the
       | disambiguation page. Search for "Earthbound", and try clicking on
       | the article for the video game. It'll bring you right back to the
       | disambiguation page.
        
       | geenat wrote:
       | Cool example of minimalism.
       | 
       | But hate to be that guy- if you're querying a database, it is not
       | static.
        
         | peterkelly wrote:
         | That's what's so brilliant about this. The database query
         | happens on the client, not on the server. The server side is
         | entirely static hosting.
        
         | dmd wrote:
         | Nope, it is - unless you think "loading html pages from a
         | remote web server" is also not static.
         | 
         | The point is there's no _computation_ going on on the server,
         | behind the scenes; your browser is doing all the work of
         | knowing what to fetch, NOT the remote end, with the magic of
         | range requests.
        
       | suprbeast wrote:
       | We put this on an e-reader once. Onboard wikipedia, it was cool.
       | It was 2011.
        
         | spiritplumber wrote:
         | cool! was it able to update itself or no?
        
       | NelsonMinar wrote:
       | I've been using static copies of Wikipedia for years; they're
       | great to have when traveling. I mostly use the Kiwix app. Their
       | ZIM format database for all English is 83 GB; 43 GB for one
       | without pictures. Compares nicely to the 43 GB here.
       | 
       | I once spent an hour in the Louvre learning about Rubens
       | astonishing paintings of Marie de' Medici thanks to Kiwix.
       | Without it I was just like "huh, nice paintings, I wonder what
       | they are?" They're incredible and I owe that moment of kismet to
       | Wikipedia, Kiwix, and my mobile phone.
       | https://en.wikipedia.org/wiki/Marie_de%27_Medici_cycle
        
         | elcritch wrote:
         | That's brilliant! Art museums are fun but there's so much
         | context missing for most paintings.
        
         | mohiit wrote:
         | I remember having it on my ipod touch way back in 2012 when it
         | was only around 7gb.
        
       | ctenb wrote:
       | If wikipedia used this approach they could stop asking for money
       | for server upkeep.
        
         | brokensegue wrote:
         | wikipedia keeps getting updated. this approach is static. it's
         | really easy to serve a static wikipedia lots of different ways.
        
       | dleslie wrote:
       | Two immediate issues:
       | 
       | 1. Doesn't work if Privacy Badger is active.
       | 
       | 2. Nim (Programming Language) is not indexed.
        
         | Retr0id wrote:
         | More generally, it appears that any link containing brackets is
         | broken/missing.
        
           | qwertox wrote:
           | This appears to be the issue.
        
         | qwertox wrote:
         | http://static.wiki/en/Gardena does not contain a link to
         | http://static.wiki/en/Gardena_(company) yet
         | https://en.wikipedia.org/wiki/Gardena does.
        
         | chadcatlett wrote:
         | I noticed if you go directly to the page for Nim, it works.
         | 
         | http://static.wiki/en/Nim_(programming_language)
        
         | JadeNB wrote:
         | > 2. Nim (Programming Language) is not indexed.
         | 
         | You and dom96 (https://news.ycombinator.com/item?id=28013376)
         | both noticed this issue, just two minutes apart. I guess
         | there's some sort of inclination among HN readers testing out a
         | static copy of Wikipedia to search for Nim? :-)
        
           | johnisgood wrote:
           | dom96 is a core developer of the language, and author of Nim
           | in Action, so it makes sense for him to look it up. I don't
           | know the OP.
        
       | twotwotwo wrote:
       | That's super cool.
       | 
       | The autocomplete trickiness is fun. Looks (from links in GitHub)
       | like it's using full-text search and currently needs to scan a
       | lot to rank results. Doing just prefix search might be a cheap
       | half-solution to autocomplete, and maybe layering on hacks around
       | stopwords etc. can get you a bit further.
       | 
       | For another angle, enwiki's full compressed list of titles
       | (enwiki-20210720-all-titles-in-ns0.gz) is 86MB. Grabbing the
       | whole dang thing (in the background?) and doing something with it
       | client-side is a different sort of approach from what this page
       | is going for but not unthinkable.
        
       | dom96 wrote:
       | Really cool! I think I see a bug: searching "Nim" shows "Nim
       | (programming language)" but it's not clickable.
        
         | segfall wrote:
         | Yep, you're right. The wikitext -> markdown conversion strips
         | out some links, noticeably on the disambiguation pages, for an
         | unclear reason.
        
       | punnerud wrote:
       | Related post 89 days ago:
       | https://news.ycombinator.com/item?id=27016630 (Hosting SQLite
       | databases on GitHub Pages or any static file hoster)
        
       | dnamlin wrote:
       | I plowed the 40 GiB database file into sqlite_zstd_vfs [1] which
       | reduced it 75% to 10 GiB. This plug-in also supports HTTP access
       | [2] in the spirit of phiresky's, however, I don't have a
       | WebAssembly build of mine yet, so it's a library for
       | desktop/command-line apps for now. You can try it out on Linux or
       | macOS x86-64:                 pip3 install genomicsqlite
       | genomicsqlite https://f000.backblazeb2.com/file/mlin-
       | public/static.wiki/en.zstd.db "select text from wiki_articles
       | where title = 'SQLite'"
       | 
       | ("genomicsqlite" is the CLI for my Genomics Extension [3], which
       | is built around these Zstandard compression & web layers.)
       | 
       | [1] https://github.com/mlin/sqlite_zstd_vfs
       | 
       | [2] https://github.com/mlin/sqlite_web_vfs
       | 
       | [3] https://mlin.github.io/GenomicSQLite
       | 
       | EDITS: I expanded on this comment in this gist
       | https://gist.github.com/mlin/ee20d7c5156baf9b12518961f36590c...
       | 
       | If you want to download the whole en.zstd.db, then please kindly
       | get it from zenodo (which doesn't support HTTP range requests,
       | but is free): https://zenodo.org/record/5149677
        
         | jart wrote:
         | Great work but why not compress with deflate if you are serving
         | http requests since then you could directly copy the database
         | content to the wire as gzip encoded responses.
        
           | dnamlin wrote:
           | With sqlite_zstd_vfs the data are compressed beforehand and
           | stored that way "at rest", so web responses are directly
           | copied to the wire controlled by HTTP range headers,
           | similarly to the OP. They need to be decompressed by a client
           | library sitting between SQLite and the wire.
        
       | voldemort1968 wrote:
       | There's also https://www.kiwix.org/en/ which is great to take on
       | long offline flights, browsing through wikipedia.
        
         | rycomb wrote:
         | And someone keeps a relatively up-to-date .slob file with the
         | entire wikipedia (that can be read with Aard2), which I find
         | more useful than kiwix!
         | 
         | (A link to it can be found in the GitHub page with Aard
         | dictionaries)
        
           | ce4 wrote:
           | Aard2 Android: https://github.com/itkach/aard2-android
           | 
           | Aard2 web: https://github.com/itkach/aard2-web
           | 
           | Dictionaries:
           | https://github.com/itkach/slob/wiki/Dictionaries
        
             | exikyut wrote:
             | files:  enwiki-20210616*       size:   18.9 GB
             | 
             | How'd they manage _that_...
        
           | zelienople wrote:
           | Seriously? They couldn't make the content something standard
           | like a .zip file? Or a zip file with a SQL database in it?
           | Bloody monkeys always have to go and create a new standard
           | for everything. They will argue it's for performance or
           | something, but it's really an exercise in ego. "Oh, mine's
           | better!" No, it's not. What is better is what works with
           | everything else in the world.
           | 
           | And that index! We couldn't capitalize the proper name of
           | Wikipedia? And do you not understand there are visually
           | impaired people in the world who can't read light red text on
           | a white background? What is wrong with people?
           | 
           | And the terminology! What in the fuckety is Aard2? What is a
           | .slob file? Why is it that everyone uses obscure acronyms or
           | whatever for everything and never explains anything? Ego
           | again, probably. Let's create a language that no one else
           | understands to make ourselves feel superior! No, you are not
           | superior, you are just turning language into thick vegetable
           | soup.
           | 
           | Didn't you people go to university, or did you not pay
           | attention when they taught you that the first reference is
           | always completely spelled out and explained and you can use
           | the acronyms (or whatever they are) after that. It's only
           | common sense that you don't spout gibberish and nonsense
           | words without explaining then in the first instance, unless
           | your goal is to confuse everyone for the purpose of making
           | yourself feel superior.
           | 
           | And the Google sheet and the flat index, what a hodgepodge of
           | nonsense. What the heck does it mean? What is the difference
           | between the different versions? Who knows?
           | 
           | I have to go now. My eyes are bleeding from trying to read
           | light red text on a bright white background. I'll check back
           | again once I get a Braille laptop screen.
        
         | lmm wrote:
         | Does it actually work? I tried and it would always give up
         | partway through the downloads (and most of the times when I ask
         | in these threads it turns out the person recommending it
         | doesn't actually use it, they just think it's a cool idea...)
        
       | mongol wrote:
       | Perfect if stuck on a remote deserted island! Must be the closest
       | existing thing to the Hitchhiker's guide?
        
         | phil294 wrote:
         | I'm not trying to kill the fun here, but since this thing
         | relies on a working internet connection, it would probably be
         | pretty useless on a remote island
        
           | dTal wrote:
           | You could host it locally easily enough!
        
             | jtsiskin wrote:
             | Yes, but you could also host a normal Postgres db easily
             | enough
        
               | dTal wrote:
               | This is entirely static, so I think it's easier.
        
           | anigbrowl wrote:
           | You can put a 43gb database on a $30 SD card and still have
           | room for a bunch of movies.
        
             | [deleted]
        
           | phkahler wrote:
           | When SpaceX sends people to Mars they will want a local
           | Wikipedia among many other things.
        
         | yreg wrote:
         | I installed a reader with a Wikipedia dump (without multimedia)
         | on the original iPhone. At the time (at least in my region)
         | WiFi was quite rare and it really did feel like the
         | Hitchhiker's guide and incredibly powerful.
         | 
         | I only kept it for a few weeks though since it consumed almost
         | all of the space. (And I imagine it was probably my language
         | mutation, the english Wikipedia must have been too big
         | already.)
        
         | exikyut wrote:
         | _Sigh_ obligatory XKCD https://xkcd.com/548/
        
         | pronoiac wrote:
         | You might look at Kiwix, and the data sources they've indexed:
         | https://www.kiwix.org/en/
         | 
         | (Oddly, it's crashing at launch on my iPad right now)
        
           | [deleted]
        
         | r00fus wrote:
         | Hitchhikers Guide to Wikipedia? HHGTWP?
        
         | wyldfire wrote:
         | Towel not included.
        
       | andrepd wrote:
       | How does this compare to https://www.kiwix.org/en/
        
       | [deleted]
        
       | nynx wrote:
       | This is fantastic. I knew that we'd see things like this once
       | that hosting sqlite databases project was posted.
        
       | arianon wrote:
       | Shocked to find out that this is very much a static website, it's
       | "merely" downloading a small portion of the 43GB SQLite file in
       | question with HTTP Range requests, and then it uses a WASM-
       | compiled copy of SQLite to query the requested data. Very
       | impressive.
        
         | k__ wrote:
         | Didn't know you could pump 43GB of data into a browser.
        
           | wmf wrote:
           | The trick is that the browser is not downloading all 43 GB,
           | just the parts it needs.
        
         | simonw wrote:
         | It's the same trick that was described here - it's absolutely
         | brilliant: https://phiresky.github.io/blog/2021/hosting-sqlite-
         | database...
        
           | scrame wrote:
           | Wow! The end where they use sql to manipulate the DOM I think
           | just blew a fuse in my brain. "Brilliant" feels like an
           | understatement.
        
           | MuffinFlavored wrote:
           | On insert from Netlify:
           | 
           | > [error: RuntimeError: abort(Error: server uses gzip or
           | doesn't have length). Build with -s ASSERTIONS=1 for more
           | info.]
           | 
           | On insert from Github Pages:
           | 
           | > [error: Error: SQLite: ReferenceError: SharedArrayBuffer is
           | not defined] Your browser might either be too old to support
           | SharedArrayBuffer, or too new and have some Spectre
           | protections enabled that don't work on GitHub Pages since
           | they don't allow setting the necessary isolation headers. Try
           | going to the Netlify mirror of this blog for the DOM demos.
        
           | JoeyBananas wrote:
           | wouldn't this be vulnerable to DOS attacks? I can make the
           | database run arbitrarily long and complicated queries
        
             | chrisseaton wrote:
             | > I can make the database run arbitrarily long and
             | complicated queries
             | 
             | Do you understand that the database runs on your computer?
             | You can only DOS yourself.
        
             | carlhjerpe wrote:
             | In a smart webserver this could mean not a whole lot more
             | than a memcpy, and between you and the CDN it probably
             | won't be too high latency.
             | 
             | One would have to version the SQLite DB by version to avoid
             | corruption by cache invalidation.
        
               | rkeene2 wrote:
               | My static file HTTP server called "filed" [0] will
               | satisfy a request in as few as 1 system call (no memcpy
               | involved -- the kernel reads the file and sends the
               | buffer to the NIC), by using sendfile(2). Most other
               | webservers do a bit more work, like open the file for
               | every request.
               | 
               | [0] http://filed.rkeene.org/
        
             | CodesInChaos wrote:
             | The database runs on the client. It can't do anything you
             | couldn't do through any other http client.
        
             | simonw wrote:
             | Since the work all happens in your browser, the only victim
             | of a long complicated query would be you own browse and the
             | S3 bandwidth bill of the person hosting the database (since
             | you'd end up sucking down a lot of data).
             | 
             | But if you want to jack up someone's S3 bandwidth bill
             | there are already easier ways to do it!
        
               | floatingatoll wrote:
               | It's a static file, CDN it.
        
               | simlevesque wrote:
               | Many CDNs cannot cache 43GB files.
               | 
               | Cloudflare's limit is 10GB, 20GB for Cloudfront, 150GB
               | for Akamai and Azure.
        
               | sltkr wrote:
               | It should be trivial to split it up into 1GB chunks or
               | whatever. In fact, if you only request single pages, you
               | could split the database up into pagesize-sized files.
               | This is a lot of files, but at least it avoids the need
               | to do range requests at all. You probably want to
               | increase the pagesize a bit though (e.g. maybe 256 KiB
               | instead of the default 4 KiB?)
        
               | floatingatoll wrote:
               | Range requests are the star, so I would vote keep them? I
               | mean, technically, the CDN could represent each 4kb block
               | _of_ the file as an individual URL, so that you do range
               | requests by filename instead of by .. range request ..
               | but at some point I definitely think RRs are more sane.
        
               | rzzzt wrote:
               | Probably both, compute file name and relative offset from
               | the original offset. (A minor complication is requests
               | crossing file boundaries, in which case you have to do
               | two requests instead of one.)
        
               | milankragujevic wrote:
               | Isn't Cloudflare's limit 512 MB? (except for enterprise
               | plan which has no price defined where it's 5GB)
               | 
               | https://support.cloudflare.com/hc/en-
               | us/articles/200172516-U...
        
               | floatingatoll wrote:
               | TIL, for whatever it's worth :) Thanks!
        
               | phiresky wrote:
               | You can chunk the file into e.g. 2MB chunks. The CDN can
               | then cache all or the most commonly used ones. That's
               | what I did in the original blog post to be able to host
               | it on GitHub Pages.
        
               | dmos62 wrote:
               | I would look into caching range requests. Simpler than
               | pre-chunking or caching the whole database.
        
               | jlg23 wrote:
               | If I get the essence of what you are saying... "the only
               | victims would be both ends of the communication"? ;)
        
               | quickthrower2 wrote:
               | What is said is this site isn't any more susceptible to
               | DOS than any other site. Of all the ways of architecting
               | a site this is probably the least vulnerable.
        
       | chovybizzass wrote:
       | Searched for "porn" -- got this:
       | http://static.wiki/en/Progressive_outer_retinal_necrosis
        
       | andyxor wrote:
       | this is awesome, seems to be a good use case for using IPFS
       | decentralized storage, either for copies of the big sqllite db
       | dump file, or (as an alternative) for the individual wiki
       | articles "snapshots"
       | 
       | i think it would be easier to keep it in sync with original wiki
       | using the second option, where each article in static.wiki is
       | synced with original wikipedia pages and updated individually,
       | vs. updating the huge sqllite dump on every change in the
       | originals
       | 
       | then the sqllite will be only holding metadata and links to
       | static (and may be version controlled) versions of individual
       | static pages/article snapshots, so the size will be much smaller
       | and easier to distribute on ipfs
       | 
       | Edit: looks like there is similar effort in the works, to put
       | wiki on ipfs https://blog.ipfs.io/24-uncensorable-wikipedia/
        
       | max1cc wrote:
       | The search seems to be broken right now, but it looks like you
       | can access whatever page you want via URL (e.g.
       | http://static.wiki/en/Computer_Science)
        
       | graderjs wrote:
       | That's awesome man. I didn't know you could host Wikipedia on
       | your own client / own back-end. Seems like this means if someone
       | could built a better interface and performance for Wikipedia they
       | could begin to steal some of their traffic. Is it not?
        
       | sgt wrote:
       | thgttg.sqlite3
        
         | schwartzworld wrote:
         | hhgttg
        
       | segfall wrote:
       | OP here. Happy to take questions, with the caveat that my work
       | here is mostly glue.
       | 
       | Some background: https://github.com/segfall/static-wiki
       | 
       | The datasets I generated:
       | https://www.kaggle.com/segfall/markdownlike-wikipedia-dumps-...
       | 
       | The original author of this approach:
       | https://news.ycombinator.com/item?id=28013514
        
         | johnwheeler wrote:
         | Is the db in memory?
        
           | segfall wrote:
           | sql.js is in memory, but the database file is only partially
           | loaded from the static file host.
        
           | [deleted]
        
           | exikyut wrote:
           | No, as other comments have noted it's stored in S3.
        
         | segfall wrote:
         | For added fun, you can try switching between different
         | databases:
         | 
         | http://static.wiki/en/United_States
         | 
         | http://static.wiki/zh/Mei Guo
         | 
         | http://static.wiki/fr/United_States
         | 
         | http://static.wiki/enwikiquote/United_States
         | 
         | http://static.wiki/simple/United_States
        
         | ddtaylor wrote:
         | Is there a way to get this simple theme / stylesheet to work
         | for the regular Wikipedia? I really like it.
        
       | hutrdvnj wrote:
       | Since it doesn't require any backend, a IPFS mirror would be
       | great.
        
       | andymoe wrote:
       | Reminds me that at one point each page of the Apple Wiki Server
       | that shipped with OS X Server back in the day used one SQLite
       | file per page. Worked ok-ish.
        
       | mwidell wrote:
       | I remember having the iPhone 2G back in 2007, and that someone
       | created a way to store a 5GB wikipedia dump (the complete
       | wikipedia at that time, but without photos) on it, so you could
       | have the whole Wikipedia in your pocket. At the time it felt
       | pretty amazing.
        
         | spiritplumber wrote:
         | Don't panic!
        
       | phiresky wrote:
       | Author of the referenced blog (and library) here. This is great!
       | 
       | The full text search engine in SQLite is sadly not really good
       | for this - one reason is that it uses standard B-Trees, another
       | is that it forces storing all token positions if you want BM25
       | sorting, which is a huge overhead for articles as long as
       | Wikipedia's.
       | 
       | But that doesn't mean full text search isn't possible in a very
       | efficient manner with statically hosted data! I wrote a proof of
       | concept of making the Rust-based tantivy library work in the same
       | way, which has a lot of internal things that can make the index
       | much smaller and more efficient than SQLite's. It's also >10x
       | faster in creating the search index.
       | 
       | Here's the demo also for Wikipedia:
       | https://demo.phiresky.xyz/tmp-ytccrzsovkcjoylr/dist/index.ht...
       | I'm not sure if it's more efficient than the SQlite version in
       | this form, but it definitely has more upward potential and is
       | more fun to work with.
       | 
       | And the corresponding draft PR: https://github.com/tantivy-
       | search/tantivy/pull/1067
       | 
       | I sadly haven't gotten around to working on it more and writing
       | an article about it.
       | 
       | Other people are also working on using this stuff to make Sci-hub
       | and LibGen more available by using this in combination with IPFS
       | for distributed and "uncensorable" hosting which is pretty
       | awesome.
       | 
       | Edit: Just realized that the OP demo only searches in article
       | titles, while mine searches in full article contents by default.
       | You can search in only the titles in my demo by querying
       | `title:harry title:potter`
        
         | dheera wrote:
         | Yeah I was just thinking that. Why not just a static
         | filesystem? Just mount a ext4 image full of .html files and
         | browse away. It would be remarkably efficient, likely much more
         | efficient than SQLite.
        
         | segfall wrote:
         | Thank you, Phiresky. My little side project only exists because
         | of your work.
        
           | ignoramous wrote:
           | Congrats on your side project! This wonderful discussion
           | exists because of your work.
        
         | neocodesoftware wrote:
         | how are the xml dumps generated?
        
         | peterhunt wrote:
         | You can also get efficient FTS with this method if you
         | implement indexing in user space and avoid BM25. The Lucene
         | practical scoring function works well with this method in my
         | experience:
         | https://www.elastic.co/guide/en/elasticsearch/guide/current/...
        
           | piyh wrote:
           | Is there a dumbed down version of this indexing conversation
           | for someone who understands b-trees, but BM25 or user space
           | indexing?
        
             | peterhunt wrote:
             | Well, this post is in the context of an e2e encrypted DB,
             | but it's subject to the same constraints:
             | https://medium.com/@ZeroDB_/scalable-full-text-search-
             | over-e...
             | 
             | If you understand btrees you understand the hardest part
             | already :)
             | 
             | Basically, you need to design a search index that examines
             | the fewest DB pages in order to find the result. The Lucene
             | scoring method stores a mapping of term -> document[]
             | sorted in relevance order. The main idea is that you can
             | examine only the first n documents for each term in the
             | search query in order to find the most relevant search
             | results. Picking n is sort of tricky, but if your index is
             | stored in this way it's possible to fulfill a large % of
             | queries efficiently without downloading the whole index.
             | 
             | Here's a little Python implementation of what I mean by a
             | "user space implementation". Note that it's a toy but it
             | performs pretty well on some demo sklearn data sets: https:
             | //gist.github.com/petehunt/724eeb77189332db101ad7b0db8...
        
         | walrus01 wrote:
         | How does it do for performance if you throw the whole 43GB into
         | RAM? Plenty of very affordable workstation systems out there
         | today gently used with 128GB in them.
        
           | londons_explore wrote:
           | You still have to download it all... Which is a barrier for
           | most...
        
             | tw04 wrote:
             | As someone who was very recently on about as bad of a dsl
             | connection you could get... What? Outside of metered
             | cellular connections this is available to basically
             | everyone. Even the remotest parts of Africa have at least a
             | handful of unmetered connections within a days drive and a
             | USB drive.
        
             | tjoff wrote:
             | For English it states 90 GB uncompressed, doesn't say
             | compressed size but that doesn't sound much larger than a
             | large game. In the context I don't see it as a barrier.
        
             | unknownOrigin wrote:
             | That's only a size of regular modern video game... the
             | downloads of which are pretty mainstream these days. Saying
             | that a 50 gig download is a barrier "for most" is
             | definitely not true.
        
             | exdsq wrote:
             | Gears of War 5 on the Xbox is 133GB with all its updates. I
             | think we're at the point where 100GB is a high but
             | reasonable request for people nowadays, at least in cities
             | in the West.
        
           | shp0ngle wrote:
           | what? the point of this is that it seeks just parts of the
           | files through http ranges... not all 43 GB... just try the
           | page? or am I missing something?
        
             | unixhero wrote:
             | You're missing his pragmatic and creative solution to
             | leapfrog the problem altogether.
        
               | shp0ngle wrote:
               | ahhh I misread his comments.
               | 
               | Sure, RAM is always faster.
        
               | sonograph wrote:
               | > Sure, RAM is always faster.
               | 
               | Well, why not shove it all into L2 cache and leapfrog
               | RAM?
               | 
               | /first joke I ever made on HN
        
               | trollied wrote:
               | Jokes are funny...
        
               | bawolff wrote:
               | Which problem?
               | 
               | Yes putting it in ram will be faster than downloading it
               | over the network. I don't think anyone ever doubted that.
               | 
               | I imagine people who wanted to download it would just use
               | kiwix. The potential to download isn't the part that
               | makes this project interesting.
        
               | IshKebab wrote:
               | A pragmatic solution that requires users to have 43GB of
               | RAM? Right.
        
               | AussieWog93 wrote:
               | To put it in perspective, 43GB of RAM to browse the
               | entirety of Wikipedia doesn't sound _that_ ridiculous
               | when popular Electron-based IM programs use 1GB+.
        
               | jggonz wrote:
               | I'm going to upvote this comment because not everyone in
               | the world can afford $300 for 64GB+ of RAM. The solution
               | to not dump everything into RAM is a great one.
               | 
               | Fortunately, I've had a fancy Silicon Valley job for 15+
               | years which now lets me afford many GBs of RAM if I
               | wanted to but it was a different story in my 20s. I
               | relied on free or almost free hardware to learn. And that
               | free hardware usually lagged in specs.
        
               | walrus01 wrote:
               | Yeah I agree with you not everyone can afford it, but
               | lots of young people build $1800 gaming PCs. If you want
               | a powerful Linux workstation desktop PC for fun projects
               | like this you can get a Dell t5610 with two 8 core xeons
               | and 128GB of ram on eBay for $650, then add your own $100
               | SSD.
               | 
               | I would not recommend running out and buying a new
               | current gen Ryzen board+cpu or core i7-whatever and 128GB
               | of new ram at ddr4-4200 prices.
        
               | azeirah wrote:
               | I mean sure it's for enthusiasts, but my 3 year old pc
               | already has 32GB ram. It's definitely not out of the
               | question
        
               | unixhero wrote:
               | 2021 called, memory is cheap.
        
               | linkpuff wrote:
               | Is it though? We are in the middle of a chip crisis where
               | prices are raising to record levels...
        
               | unixhero wrote:
               | Yes it is. In the grand scheme of things, looking at how
               | far we've come DDR4 modules are cheap dollar per
               | gigabyte.
        
               | andi999 wrote:
               | Notebook vendors didn't get this memo.
        
               | Eremotherium wrote:
               | Don't buy a garbage laptop with soldered RAM and just
               | upgrade it yourself?
        
               | andi999 wrote:
               | Good solution. Which one that allows for 128GB do you
               | recommend?
        
               | unixhero wrote:
               | A workstation. Any professional one.
        
               | IshKebab wrote:
               | And how many users have more than 16GB of RAM?
        
               | jbverschoor wrote:
               | 20 applications x 50GB = 1tb of memory.
        
               | Quekid5 wrote:
               | o... kay?
               | 
               | You're not wrong, but who's talking about 20
               | applications?
        
               | jbverschoor wrote:
               | Look at the amount of applications and web-application
               | you have open at a given point.
               | 
               | If this is such a good idea, let's just all use iscsi and
               | mount the internet.
        
               | IgorPartola wrote:
               | On phones too?
        
               | gsich wrote:
               | Phones are not first class devices.
        
               | exdsq wrote:
               | It's around PS300 for 64GB of Corsair RAM - you need a
               | motherboard that supports it but you can go cheap and you
               | don't need a great processor. You could probably put this
               | together for PS600? It's not exactly crazy if this was
               | for an office or something.
        
         | tored wrote:
         | I have been thinking of using FTS5 with SQLite to search
         | emails. Not close of 43 GB ofc, so I would probably not have
         | any major performance problems, but still is FTS5 any good? Or
         | should I look into other solutions?
        
           | psanford wrote:
           | The nice thing about FTS5 is you can have full text search up
           | and running on a dataset in a couple of minutes. So give it
           | try and see if it is sufficient for your use case. If it is,
           | great! If not, you've not wasted much time.
        
           | OJFord wrote:
           | Have you seen 'notmuch'? I was honestly shocked how good/fast
           | it is. It uses a C (++?) lib called 'xapian' to provide the
           | actual index/search capability.
        
           | phiresky wrote:
           | If you're talking about a normal local SQlite DB and your
           | dataset is less than maybe 100GB of plaintext then SQLite FTS
           | will work fine regarding performance.
        
       | [deleted]
        
       | simonw wrote:
       | I'm nervous on your behalf for your S3 bill, have you done napkin
       | maths for how much this will cost to host, especially given the
       | relative inefficiency of the autocomplete queries?
       | 
       | A trick that might help: create a separate SQLite table
       | containing just the article titles to autocomplete against. That
       | might give you queries that use smaller ranges of the file.
        
         | segfall wrote:
         | Great catch, simonw (and everyone else in this thread).
         | 
         | I've reviewed some other static hosts and settled on wasabi for
         | now. When my S3 bill rolls in, I'll reply to this comment with
         | the damage. A scary number will help others avoid my oversight.
        
           | jinseokim wrote:
           | It seems to be 403 happens.
           | 
           | Quick test: https://s3.us-
           | east-2.wasabisys.com/static.wiki/en.db
        
         | sitkack wrote:
         | I built a wikipedia reader on the ipad years ago, I used two
         | sqlite databases, one for the article titles using sqlite FTS
         | then I compressed the articles themselves and merged the
         | compressed chunks into a set of files < 2G, the title database
         | had the offset into the compressed blobs. Only including the
         | most popular pages, it was just under 10G.
        
           | simonw wrote:
           | That's really clever. Were you storing the article text as
           | compressed data in BLOB columns in a SQLite table?
        
             | sitkack wrote:
             | I realize I wasn't clear. I used 2 sqlite databases just to
             | store the article titles and offsets into binary files
             | (<2GB) that had chunks of compressed data. Each chunk
             | contained 10s-100s of stories, so that the decompressed
             | around was around 20megs I believe. I had planned to store
             | everything in sqlite, but there were too many issues. The
             | FTS extension to sqlite is excellent, it made whole classes
             | of problems literally melt away.
             | 
             | Recalling now, the article text itself, I stored in groups
             | of articles concatenated together so that that data
             | compression could take advantage of similar texts. This
             | data was just Lua data structures, with the article title
             | and article body in an array. So I knew which chunk it was
             | in, but I had to decompress a whole chunk at a time and
             | then iterate through the list of decompressed articles to
             | find the specific one. I was trying to tune search time,
             | decompression time and compression ratio. I had a handful
             | of parameters in the job that would take the xml dumps and
             | create the sqlite and compressed blobs. Memory consumption
             | would spike as a chunk was decompressed, but after a call
             | to the Lua garbage collector after pulling the article out
             | of the decompression buffer, everything was fine again.
             | 
             | This was late 2010 or so. The job that processed the XML
             | into database+bin data took over an hour to run on my
             | laptop.
             | 
             | If I did it again, I'd use a compression library that was
             | tuned for decompression and used an external dictionary.
             | For the index, I'd use fst [1]
             | 
             | The project stalled out in parsing the wikitext, there are
             | now libraries that can handle the parsing. I have
             | considered resurrecting it, with jpeg2000 or other image
             | codecs, one could include the images as well. The system
             | did _not_ have FTS over the article text itself. I had
             | planned to use some sort tree of bloom filters, something
             | similar to [2].
             | 
             | I was really trying to keep the total footprint of all the
             | data under 10G or so it could fit on the majority of
             | devices. Now with larger flash sizes, better codecs and
             | better compression libraries, the functionality and
             | performance could be greatly enhanced.
             | 
             | I learned a ton from this project, I really recommend
             | getting deep into something, but also working within your
             | limits and having fallback plans when something looks
             | insurmountable. There were plenty of times where I thought
             | I was a dead end, only to say, "what if ..." the next day,
             | run a quick experiment and be on to the next insurmountable
             | roadblock.
             | 
             | [1] https://github.com/BurntSushi/fst
             | 
             | [2] https://arxiv.org/abs/2002.03057
        
         | dom96 wrote:
         | Why do people even use S3 if that's a risk? I really don't get
         | this, do most just not worry about accidentally wasting money
         | or do they just not understand this risk?
        
           | RhodesianHunter wrote:
           | Every system has risks. You just need to be aware of them
           | going in.
           | 
           | S3 is excellent at many things, serving a ton of public
           | traffic isn't one of them. Usually people put a CDN in front
           | of it for that use case.
        
           | simonw wrote:
           | This is a risk any time you host anything on the internet on
           | any system that charges you for storage, bandwidth or CPU
           | usage. It's fine to take that risk on as long as you're aware
           | of it and keep an eye on your spend.
           | 
           | In this case I'm nervous because 43GB is a lot larger than
           | most projects!
        
             | remram wrote:
             | I think the point is that not every host charges for
             | bandwidth or CPU usage, making S3 particularly risky.
        
               | RhodesianHunter wrote:
               | There's no CPU charge, only per-request charges.
               | 
               | Bandwidth is free into S3 or from S3 to another AWS
               | service so it's particularly handy for that use case.
        
               | remram wrote:
               | For S3, yes.
        
           | slunk wrote:
           | It's one thing if you have a source of revenue so justify
           | your S3 costs. My interpretation of the parent commentator's
           | concern is that this person has opened up a multi gigabyte S3
           | file publicly and sent lots of traffic its way (via hacker
           | news) for what appears to be a passion project.
        
         | max_ wrote:
         | He should seed a torrent file
        
           | thrower123 wrote:
           | I would like a copy of that sqlite database to play with.
           | 
           | Having that all preprocessed would save a bunch of work
           | compared to getting it from Wikipedia itself
        
           | Weryj wrote:
           | IPFS?
        
             | warpech wrote:
             | Second that. I have no hands down experience with it, but
             | I've seen seemingly more complex apps deployed on IPFS
             | https://docs.ipfs.io/concepts/usage-ideas-examples/
        
           | wmf wrote:
           | Torrents aren't great for random access which this app is
           | doing.
        
             | dTal wrote:
             | But at least we could get a copy without incurring a
             | massive s3 bill.
        
               | rudian wrote:
               | Can we really? For torrent to work you still need someone
               | to host it in full at all times. Who will host a 40GB
               | file that will be out of date in a week? The website
               | provider will, so we're back to square one.
        
               | dTal wrote:
               | No, you don't. You just need N people to host 1/N of the
               | file some of the time. The only person who needs to
               | upload it in full is the original creator.
        
             | magila wrote:
             | Bittorrent can do random access just fine, especially with
             | v2 which enables content verification in 16KB chunks.
             | Clients tend to be throughput oriented which leads to sub-
             | optimal latency, but that's not a limitation of the
             | protocol.
        
               | wmf wrote:
               | I guess just fine is a matter of opinion. Connection
               | setup and piece bitmap tracking seems like more overhead
               | than HTTP range requests.
        
               | magila wrote:
               | Connection establishment in bittorrent might be
               | marginally higher overhead than HTTP, but not by much and
               | the cost is highly amortized if you're doing many range
               | requests in the same torrent. There is indeed more
               | overhead from HAVE messages and such, but that's just
               | because bittorrent is doing a lot more than HTTP to
               | enable P2P. It's still not that bad as a percentage of
               | the payload traffic.
        
         | ignoramous wrote:
         | > _A trick that might help: create a separate SQLite table
         | containing just the article titles to autocomplete against._
         | 
         | Another one: build a directed-acyclic-word-graph [0][1] of just
         | the page-titles and download it to browser's localStorage for
         | prefix-based searches.
         | 
         | In our usage, ~5M entries take ~18MB. Wikipedia has ~6.3M
         | articles [2].
         | 
         | [0] https://github.com/smhanov/dawg
         | 
         | [1] http://stevehanov.ca/blog/?id=115
         | 
         | [2] https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
        
           | mkl wrote:
           | A version of [1] with the pictures working: https://web.archi
           | ve.org/web/20120211141906/http://stevehanov...
        
           | Bayart wrote:
           | Thanks for making me aware of a new data structure !
        
         | Weryj wrote:
         | Get a FusionIO drive off ebay and host the DB yourself, if you
         | need, stick the computer behind a static ip VPN.
        
           | gjs278 wrote:
           | the drivers for fusion-io are maintained by the community.
           | sandisk practically abandoned iodrive by kernel 4.20
           | 
           | I would suggest an intel 900p instead, more space and not too
           | much more expensive. very similar latencies as well. ive used
           | both.
        
         | capableweb wrote:
         | Seems to be hosted at Surge,sh which is hosted at Digital
         | Ocean, so seems this is not OPs problem.                   $
         | nslookup static.wiki                  Server:  8.8.8.8
         | Address: 8.8.8.8#53                  Non-authoritative answer:
         | static.wiki canonical name = na-west1.surge.sh.         Name:
         | na-west1.surge.sh         Address: 188.166.132.94
         | $ whois 188.166.132.94                  % This is the RIPE
         | Database query service.         % The objects are in RPSL
         | format.         %         % The RIPE Database is subject to
         | Terms and Conditions.         % See
         | http://www.ripe.net/db/support/db-terms-conditions.pdf
         | % Note: this output has been filtered.         %       To
         | receive output for a database update, use the "-B" flag.
         | % Information related to '188.166.132.0 - 188.166.135.255'
         | % Abuse contact for '188.166.132.0 - 188.166.135.255' is
         | 'abuse@digitalocean.com'                  inetnum:
         | 188.166.132.0 - 188.166.135.255         netname:
         | DIGITALOCEAN         country:        NL         admin-c:
         | PT7353-RIPE         tech-c:         PT7353-RIPE         status:
         | ASSIGNED PA         mnt-by:         digitalocean
         | created:        2019-04-17T14:02:52Z         last-modified:
         | 2019-04-17T14:02:52Z         source:         RIPE
         | person:         Network Operations         address:        101
         | Ave of the Americas, 10th Floor         address:        New
         | York, NY, 10013         address:        United States of
         | America         phone:          +13478756044         nic-hdl:
         | PT7353-RIPE         mnt-by:         digitalocean
         | created:        2015-03-11T16:37:07Z         last-modified:
         | 2019-04-17T14:37:51Z         source:         RIPE # Filtered
         | org:            ORG-DOI2-RIPE                  % This query was
         | served by the RIPE Database Query Service version 1.101
         | (HEREFORD)
         | 
         | But yes, it is wise to run storage/transfer heavy websites on
         | something that doesn't charge for traffic like AWS and most
         | other infrastructure providers do.
        
           | simonw wrote:
           | That's just the site itself (the HTML and JavaScript). The
           | 43GB SQLite database is hosted on S3.
        
           | [deleted]
        
         | mayli wrote:
         | Correct, downloading this 40GB file could easily leads to huge
         | s3 bills due to traffic. That's 1$ for a full download.
        
           | banana_giraffe wrote:
           | $2, if my math is right.
           | 
           | As pointed out elsewhere, there's some basic work being done
           | to prevent everyone from grabbing the .db file, but it's easy
           | enough to bypass.
           | 
           | Also, they're probably seeing non-zero API call charges. S3
           | only charges $0.0004 per 1000 calls, but when you start to
           | make lots of calls at scale, it can really add up. Still, for
           | this site, it's probably not that bad. If my simple tests are
           | a fair judge, it works out to something like $20 a month for
           | 100k lookups.
           | 
           | I've mentioned elsewhere on HN I used a similar technique
           | with Python to query SQLite databases in S3. It works well,
           | but there was an edge case on that project that resulted in
           | millions of requests to S3 per invocation. It didn't break
           | the bank, but it did push the project into the unprofitable
           | range till we fixed that case.
        
           | Asdrubalini wrote:
           | I tried downloading the file directly from curl [0] but it
           | seems like it blocks non-partial GET requests, so downloading
           | it is not really straightaway.
           | 
           | [0] https://s3.amazonaws.com/static.wiki/db/en.db
        
             | llacb47 wrote:
             | curl -v "https://s3.amazonaws.com/static.wiki/db/en.db" -H
             | "Referer: http://static.wiki/" -H "Range: bytes=0-"
             | 
             | Above works but for the sake of OP's wallet I would suggest
             | that you do not download the entire database from his S3
             | bucket.
        
               | infogulch wrote:
               | Instead download it from kaggle which is offering to host
               | it for free: https://www.kaggle.com/segfall/markdownlike-
               | wikipedia-dumps-...
        
       | lanevorockz wrote:
       | Maybe that would be a nice way to start Wikipedia alternatives,
       | sijce it became a form of political activism the trustworthiness
       | of Wikipedia fell down significantly.
        
       | Scaevolus wrote:
       | You should compress the individual article texts with zlib. Right
       | now, looking up an article text for a decent size article
       | ("Virus") takes 10 round-trips-- 5 appear to be btree traversals,
       | but another 5 (!) are to read in the page content, as chunks of
       | 8K 8K 16K 32K 64K. Compressing the article might cut one round
       | trip off that
       | 
       | Increasing the sqlite page size to a much larger size like 64KB
       | or 128KB may improve average latency too, since fewer btree pages
       | would need to be fetched to find the offset of the desired row,
       | and most page contents could be read with one fetch. This matters
       | more when fetches over the network take ~100ms.
        
       | harlanji wrote:
       | Bonito. I want and yet have problems downloading even a 350mb
       | file on my carrier, tethered via 2 different devices. Being able
       | to DL something like this should be a requirement to be called an
       | Internet carrier. Non-mobile is not an option for some, eg. low
       | income unemployed and homeless. Maybe they've been a software
       | engineer for 20 years and can reach unreachable markets. (No
       | doubt it's more profitable to spy and steal from the vulnerable,
       | if God allows).
       | 
       | What if a customer needs their records and can not afford to get
       | them and they lose their life or suffer significantly? Who pays
       | the balance?
       | 
       | They didn't need use the Internet? Okay, well they did use it and
       | they do during pandemic times.
        
       | 8note wrote:
       | I find that most wikipedia pages I'd want to look at are utterly
       | unintelligible with communication conventions youd need to look
       | up separately to understand the article.
       | 
       | I can't use an offline Wikipedia without that coursework
        
       | Camillo wrote:
       | Read-only Wikipedia is just "Wikipedia" for 99.99% of users. I
       | can't remember the last time I've even considered clicking the
       | edit button.
        
         | quenix wrote:
         | Give it a try! It's strangely fulfilling.
        
           | foxpurple wrote:
           | Editing an existing page is usually pretty easy and conflict
           | free. Creating a new page is close to impossible and strictly
           | guarded.
           | 
           | It's almost not worth trying for a beginner because you will
           | almost certainly be rejected. The moderators require multiple
           | approved sources of information for a page to be created.
           | These pages must be reputable sources (IE not blog posts,
           | forum posts, or other wikis), and must not have any kind of
           | commercial link to the subject. So if you create a page about
           | a product and you state that it has feature X, you can not
           | use the companies own page that talks about having X as a
           | reference.
           | 
           | So basically you can't make a page about something until a
           | major news website writes about it or you can reference it in
           | some formal study.
           | 
           | I tried to create a few pages. One of them being about a file
           | format and found that I could not use the companies website
           | with the PDF describing the spec as a reference for the
           | format because there is a commercial link between the file
           | format and the source linked.
        
             | quenix wrote:
             | This is so painfully true. I've tried to submit two
             | articles so far to AfC (articles for creation) and both
             | were rejected. It's puzzling to me because I often see
             | extremely low quality articles (with just one sentence, for
             | example), published onto mainspace.
        
               | thealig wrote:
               | Oh interesting. I've started editing a bit on Wikipedia
               | recently, and seems like there are a lot of potential
               | articles to improve upon. The deletionism issue reminds
               | me of this article https://www.gwern.net/In-Defense-Of-
               | Inclusionism
        
       ___________________________________________________________________
       (page generated 2021-07-31 23:02 UTC)