[HN Gopher] Show HN: Static.wiki - read-only Wikipedia using a 4...
___________________________________________________________________
Show HN: Static.wiki - read-only Wikipedia using a 43GB SQLite file
Author : segfall
Score : 733 points
Date : 2021-07-30 21:50 UTC (1 days ago)
(HTM) web link (static.wiki)
(TXT) w3m dump (static.wiki)
| jart wrote:
| [Hit #1 on HN hosting 43gb Wikipedia SQLite DB on AWS]
| https://justine.lol/hugemistake.gif
|
| Each time I post a link to Hacker News that becomes popular it
| ends up being at least a $100 in bandwidth bills from Google
| Cloud, and I've got a simple tiny unbloated website. This poor
| guy is going to go bankrupt. The Internet dark age is like Ma
| Bell all over again when we'd get those $500 long distance bills
| for dialing into BBSs.
| cafxx wrote:
| [nit] you should really run analyze before vacuum, not the other
| way around (https://github.com/segfall/static-
| wiki/blob/6d455de149c24e2a...)
| yukinon wrote:
| There's a bug when the target article is titled the same as the
| disambiguation page. Search for "Earthbound", and try clicking on
| the article for the video game. It'll bring you right back to the
| disambiguation page.
| geenat wrote:
| Cool example of minimalism.
|
| But hate to be that guy- if you're querying a database, it is not
| static.
| peterkelly wrote:
| That's what's so brilliant about this. The database query
| happens on the client, not on the server. The server side is
| entirely static hosting.
| dmd wrote:
| Nope, it is - unless you think "loading html pages from a
| remote web server" is also not static.
|
| The point is there's no _computation_ going on on the server,
| behind the scenes; your browser is doing all the work of
| knowing what to fetch, NOT the remote end, with the magic of
| range requests.
| suprbeast wrote:
| We put this on an e-reader once. Onboard wikipedia, it was cool.
| It was 2011.
| spiritplumber wrote:
| cool! was it able to update itself or no?
| NelsonMinar wrote:
| I've been using static copies of Wikipedia for years; they're
| great to have when traveling. I mostly use the Kiwix app. Their
| ZIM format database for all English is 83 GB; 43 GB for one
| without pictures. Compares nicely to the 43 GB here.
|
| I once spent an hour in the Louvre learning about Rubens
| astonishing paintings of Marie de' Medici thanks to Kiwix.
| Without it I was just like "huh, nice paintings, I wonder what
| they are?" They're incredible and I owe that moment of kismet to
| Wikipedia, Kiwix, and my mobile phone.
| https://en.wikipedia.org/wiki/Marie_de%27_Medici_cycle
| elcritch wrote:
| That's brilliant! Art museums are fun but there's so much
| context missing for most paintings.
| mohiit wrote:
| I remember having it on my ipod touch way back in 2012 when it
| was only around 7gb.
| ctenb wrote:
| If wikipedia used this approach they could stop asking for money
| for server upkeep.
| brokensegue wrote:
| wikipedia keeps getting updated. this approach is static. it's
| really easy to serve a static wikipedia lots of different ways.
| dleslie wrote:
| Two immediate issues:
|
| 1. Doesn't work if Privacy Badger is active.
|
| 2. Nim (Programming Language) is not indexed.
| Retr0id wrote:
| More generally, it appears that any link containing brackets is
| broken/missing.
| qwertox wrote:
| This appears to be the issue.
| qwertox wrote:
| http://static.wiki/en/Gardena does not contain a link to
| http://static.wiki/en/Gardena_(company) yet
| https://en.wikipedia.org/wiki/Gardena does.
| chadcatlett wrote:
| I noticed if you go directly to the page for Nim, it works.
|
| http://static.wiki/en/Nim_(programming_language)
| JadeNB wrote:
| > 2. Nim (Programming Language) is not indexed.
|
| You and dom96 (https://news.ycombinator.com/item?id=28013376)
| both noticed this issue, just two minutes apart. I guess
| there's some sort of inclination among HN readers testing out a
| static copy of Wikipedia to search for Nim? :-)
| johnisgood wrote:
| dom96 is a core developer of the language, and author of Nim
| in Action, so it makes sense for him to look it up. I don't
| know the OP.
| twotwotwo wrote:
| That's super cool.
|
| The autocomplete trickiness is fun. Looks (from links in GitHub)
| like it's using full-text search and currently needs to scan a
| lot to rank results. Doing just prefix search might be a cheap
| half-solution to autocomplete, and maybe layering on hacks around
| stopwords etc. can get you a bit further.
|
| For another angle, enwiki's full compressed list of titles
| (enwiki-20210720-all-titles-in-ns0.gz) is 86MB. Grabbing the
| whole dang thing (in the background?) and doing something with it
| client-side is a different sort of approach from what this page
| is going for but not unthinkable.
| dom96 wrote:
| Really cool! I think I see a bug: searching "Nim" shows "Nim
| (programming language)" but it's not clickable.
| segfall wrote:
| Yep, you're right. The wikitext -> markdown conversion strips
| out some links, noticeably on the disambiguation pages, for an
| unclear reason.
| punnerud wrote:
| Related post 89 days ago:
| https://news.ycombinator.com/item?id=27016630 (Hosting SQLite
| databases on GitHub Pages or any static file hoster)
| dnamlin wrote:
| I plowed the 40 GiB database file into sqlite_zstd_vfs [1] which
| reduced it 75% to 10 GiB. This plug-in also supports HTTP access
| [2] in the spirit of phiresky's, however, I don't have a
| WebAssembly build of mine yet, so it's a library for
| desktop/command-line apps for now. You can try it out on Linux or
| macOS x86-64: pip3 install genomicsqlite
| genomicsqlite https://f000.backblazeb2.com/file/mlin-
| public/static.wiki/en.zstd.db "select text from wiki_articles
| where title = 'SQLite'"
|
| ("genomicsqlite" is the CLI for my Genomics Extension [3], which
| is built around these Zstandard compression & web layers.)
|
| [1] https://github.com/mlin/sqlite_zstd_vfs
|
| [2] https://github.com/mlin/sqlite_web_vfs
|
| [3] https://mlin.github.io/GenomicSQLite
|
| EDITS: I expanded on this comment in this gist
| https://gist.github.com/mlin/ee20d7c5156baf9b12518961f36590c...
|
| If you want to download the whole en.zstd.db, then please kindly
| get it from zenodo (which doesn't support HTTP range requests,
| but is free): https://zenodo.org/record/5149677
| jart wrote:
| Great work but why not compress with deflate if you are serving
| http requests since then you could directly copy the database
| content to the wire as gzip encoded responses.
| dnamlin wrote:
| With sqlite_zstd_vfs the data are compressed beforehand and
| stored that way "at rest", so web responses are directly
| copied to the wire controlled by HTTP range headers,
| similarly to the OP. They need to be decompressed by a client
| library sitting between SQLite and the wire.
| voldemort1968 wrote:
| There's also https://www.kiwix.org/en/ which is great to take on
| long offline flights, browsing through wikipedia.
| rycomb wrote:
| And someone keeps a relatively up-to-date .slob file with the
| entire wikipedia (that can be read with Aard2), which I find
| more useful than kiwix!
|
| (A link to it can be found in the GitHub page with Aard
| dictionaries)
| ce4 wrote:
| Aard2 Android: https://github.com/itkach/aard2-android
|
| Aard2 web: https://github.com/itkach/aard2-web
|
| Dictionaries:
| https://github.com/itkach/slob/wiki/Dictionaries
| exikyut wrote:
| files: enwiki-20210616* size: 18.9 GB
|
| How'd they manage _that_...
| zelienople wrote:
| Seriously? They couldn't make the content something standard
| like a .zip file? Or a zip file with a SQL database in it?
| Bloody monkeys always have to go and create a new standard
| for everything. They will argue it's for performance or
| something, but it's really an exercise in ego. "Oh, mine's
| better!" No, it's not. What is better is what works with
| everything else in the world.
|
| And that index! We couldn't capitalize the proper name of
| Wikipedia? And do you not understand there are visually
| impaired people in the world who can't read light red text on
| a white background? What is wrong with people?
|
| And the terminology! What in the fuckety is Aard2? What is a
| .slob file? Why is it that everyone uses obscure acronyms or
| whatever for everything and never explains anything? Ego
| again, probably. Let's create a language that no one else
| understands to make ourselves feel superior! No, you are not
| superior, you are just turning language into thick vegetable
| soup.
|
| Didn't you people go to university, or did you not pay
| attention when they taught you that the first reference is
| always completely spelled out and explained and you can use
| the acronyms (or whatever they are) after that. It's only
| common sense that you don't spout gibberish and nonsense
| words without explaining then in the first instance, unless
| your goal is to confuse everyone for the purpose of making
| yourself feel superior.
|
| And the Google sheet and the flat index, what a hodgepodge of
| nonsense. What the heck does it mean? What is the difference
| between the different versions? Who knows?
|
| I have to go now. My eyes are bleeding from trying to read
| light red text on a bright white background. I'll check back
| again once I get a Braille laptop screen.
| lmm wrote:
| Does it actually work? I tried and it would always give up
| partway through the downloads (and most of the times when I ask
| in these threads it turns out the person recommending it
| doesn't actually use it, they just think it's a cool idea...)
| mongol wrote:
| Perfect if stuck on a remote deserted island! Must be the closest
| existing thing to the Hitchhiker's guide?
| phil294 wrote:
| I'm not trying to kill the fun here, but since this thing
| relies on a working internet connection, it would probably be
| pretty useless on a remote island
| dTal wrote:
| You could host it locally easily enough!
| jtsiskin wrote:
| Yes, but you could also host a normal Postgres db easily
| enough
| dTal wrote:
| This is entirely static, so I think it's easier.
| anigbrowl wrote:
| You can put a 43gb database on a $30 SD card and still have
| room for a bunch of movies.
| [deleted]
| phkahler wrote:
| When SpaceX sends people to Mars they will want a local
| Wikipedia among many other things.
| yreg wrote:
| I installed a reader with a Wikipedia dump (without multimedia)
| on the original iPhone. At the time (at least in my region)
| WiFi was quite rare and it really did feel like the
| Hitchhiker's guide and incredibly powerful.
|
| I only kept it for a few weeks though since it consumed almost
| all of the space. (And I imagine it was probably my language
| mutation, the english Wikipedia must have been too big
| already.)
| exikyut wrote:
| _Sigh_ obligatory XKCD https://xkcd.com/548/
| pronoiac wrote:
| You might look at Kiwix, and the data sources they've indexed:
| https://www.kiwix.org/en/
|
| (Oddly, it's crashing at launch on my iPad right now)
| [deleted]
| r00fus wrote:
| Hitchhikers Guide to Wikipedia? HHGTWP?
| wyldfire wrote:
| Towel not included.
| andrepd wrote:
| How does this compare to https://www.kiwix.org/en/
| [deleted]
| nynx wrote:
| This is fantastic. I knew that we'd see things like this once
| that hosting sqlite databases project was posted.
| arianon wrote:
| Shocked to find out that this is very much a static website, it's
| "merely" downloading a small portion of the 43GB SQLite file in
| question with HTTP Range requests, and then it uses a WASM-
| compiled copy of SQLite to query the requested data. Very
| impressive.
| k__ wrote:
| Didn't know you could pump 43GB of data into a browser.
| wmf wrote:
| The trick is that the browser is not downloading all 43 GB,
| just the parts it needs.
| simonw wrote:
| It's the same trick that was described here - it's absolutely
| brilliant: https://phiresky.github.io/blog/2021/hosting-sqlite-
| database...
| scrame wrote:
| Wow! The end where they use sql to manipulate the DOM I think
| just blew a fuse in my brain. "Brilliant" feels like an
| understatement.
| MuffinFlavored wrote:
| On insert from Netlify:
|
| > [error: RuntimeError: abort(Error: server uses gzip or
| doesn't have length). Build with -s ASSERTIONS=1 for more
| info.]
|
| On insert from Github Pages:
|
| > [error: Error: SQLite: ReferenceError: SharedArrayBuffer is
| not defined] Your browser might either be too old to support
| SharedArrayBuffer, or too new and have some Spectre
| protections enabled that don't work on GitHub Pages since
| they don't allow setting the necessary isolation headers. Try
| going to the Netlify mirror of this blog for the DOM demos.
| JoeyBananas wrote:
| wouldn't this be vulnerable to DOS attacks? I can make the
| database run arbitrarily long and complicated queries
| chrisseaton wrote:
| > I can make the database run arbitrarily long and
| complicated queries
|
| Do you understand that the database runs on your computer?
| You can only DOS yourself.
| carlhjerpe wrote:
| In a smart webserver this could mean not a whole lot more
| than a memcpy, and between you and the CDN it probably
| won't be too high latency.
|
| One would have to version the SQLite DB by version to avoid
| corruption by cache invalidation.
| rkeene2 wrote:
| My static file HTTP server called "filed" [0] will
| satisfy a request in as few as 1 system call (no memcpy
| involved -- the kernel reads the file and sends the
| buffer to the NIC), by using sendfile(2). Most other
| webservers do a bit more work, like open the file for
| every request.
|
| [0] http://filed.rkeene.org/
| CodesInChaos wrote:
| The database runs on the client. It can't do anything you
| couldn't do through any other http client.
| simonw wrote:
| Since the work all happens in your browser, the only victim
| of a long complicated query would be you own browse and the
| S3 bandwidth bill of the person hosting the database (since
| you'd end up sucking down a lot of data).
|
| But if you want to jack up someone's S3 bandwidth bill
| there are already easier ways to do it!
| floatingatoll wrote:
| It's a static file, CDN it.
| simlevesque wrote:
| Many CDNs cannot cache 43GB files.
|
| Cloudflare's limit is 10GB, 20GB for Cloudfront, 150GB
| for Akamai and Azure.
| sltkr wrote:
| It should be trivial to split it up into 1GB chunks or
| whatever. In fact, if you only request single pages, you
| could split the database up into pagesize-sized files.
| This is a lot of files, but at least it avoids the need
| to do range requests at all. You probably want to
| increase the pagesize a bit though (e.g. maybe 256 KiB
| instead of the default 4 KiB?)
| floatingatoll wrote:
| Range requests are the star, so I would vote keep them? I
| mean, technically, the CDN could represent each 4kb block
| _of_ the file as an individual URL, so that you do range
| requests by filename instead of by .. range request ..
| but at some point I definitely think RRs are more sane.
| rzzzt wrote:
| Probably both, compute file name and relative offset from
| the original offset. (A minor complication is requests
| crossing file boundaries, in which case you have to do
| two requests instead of one.)
| milankragujevic wrote:
| Isn't Cloudflare's limit 512 MB? (except for enterprise
| plan which has no price defined where it's 5GB)
|
| https://support.cloudflare.com/hc/en-
| us/articles/200172516-U...
| floatingatoll wrote:
| TIL, for whatever it's worth :) Thanks!
| phiresky wrote:
| You can chunk the file into e.g. 2MB chunks. The CDN can
| then cache all or the most commonly used ones. That's
| what I did in the original blog post to be able to host
| it on GitHub Pages.
| dmos62 wrote:
| I would look into caching range requests. Simpler than
| pre-chunking or caching the whole database.
| jlg23 wrote:
| If I get the essence of what you are saying... "the only
| victims would be both ends of the communication"? ;)
| quickthrower2 wrote:
| What is said is this site isn't any more susceptible to
| DOS than any other site. Of all the ways of architecting
| a site this is probably the least vulnerable.
| chovybizzass wrote:
| Searched for "porn" -- got this:
| http://static.wiki/en/Progressive_outer_retinal_necrosis
| andyxor wrote:
| this is awesome, seems to be a good use case for using IPFS
| decentralized storage, either for copies of the big sqllite db
| dump file, or (as an alternative) for the individual wiki
| articles "snapshots"
|
| i think it would be easier to keep it in sync with original wiki
| using the second option, where each article in static.wiki is
| synced with original wikipedia pages and updated individually,
| vs. updating the huge sqllite dump on every change in the
| originals
|
| then the sqllite will be only holding metadata and links to
| static (and may be version controlled) versions of individual
| static pages/article snapshots, so the size will be much smaller
| and easier to distribute on ipfs
|
| Edit: looks like there is similar effort in the works, to put
| wiki on ipfs https://blog.ipfs.io/24-uncensorable-wikipedia/
| max1cc wrote:
| The search seems to be broken right now, but it looks like you
| can access whatever page you want via URL (e.g.
| http://static.wiki/en/Computer_Science)
| graderjs wrote:
| That's awesome man. I didn't know you could host Wikipedia on
| your own client / own back-end. Seems like this means if someone
| could built a better interface and performance for Wikipedia they
| could begin to steal some of their traffic. Is it not?
| sgt wrote:
| thgttg.sqlite3
| schwartzworld wrote:
| hhgttg
| segfall wrote:
| OP here. Happy to take questions, with the caveat that my work
| here is mostly glue.
|
| Some background: https://github.com/segfall/static-wiki
|
| The datasets I generated:
| https://www.kaggle.com/segfall/markdownlike-wikipedia-dumps-...
|
| The original author of this approach:
| https://news.ycombinator.com/item?id=28013514
| johnwheeler wrote:
| Is the db in memory?
| segfall wrote:
| sql.js is in memory, but the database file is only partially
| loaded from the static file host.
| [deleted]
| exikyut wrote:
| No, as other comments have noted it's stored in S3.
| segfall wrote:
| For added fun, you can try switching between different
| databases:
|
| http://static.wiki/en/United_States
|
| http://static.wiki/zh/Mei Guo
|
| http://static.wiki/fr/United_States
|
| http://static.wiki/enwikiquote/United_States
|
| http://static.wiki/simple/United_States
| ddtaylor wrote:
| Is there a way to get this simple theme / stylesheet to work
| for the regular Wikipedia? I really like it.
| hutrdvnj wrote:
| Since it doesn't require any backend, a IPFS mirror would be
| great.
| andymoe wrote:
| Reminds me that at one point each page of the Apple Wiki Server
| that shipped with OS X Server back in the day used one SQLite
| file per page. Worked ok-ish.
| mwidell wrote:
| I remember having the iPhone 2G back in 2007, and that someone
| created a way to store a 5GB wikipedia dump (the complete
| wikipedia at that time, but without photos) on it, so you could
| have the whole Wikipedia in your pocket. At the time it felt
| pretty amazing.
| spiritplumber wrote:
| Don't panic!
| phiresky wrote:
| Author of the referenced blog (and library) here. This is great!
|
| The full text search engine in SQLite is sadly not really good
| for this - one reason is that it uses standard B-Trees, another
| is that it forces storing all token positions if you want BM25
| sorting, which is a huge overhead for articles as long as
| Wikipedia's.
|
| But that doesn't mean full text search isn't possible in a very
| efficient manner with statically hosted data! I wrote a proof of
| concept of making the Rust-based tantivy library work in the same
| way, which has a lot of internal things that can make the index
| much smaller and more efficient than SQLite's. It's also >10x
| faster in creating the search index.
|
| Here's the demo also for Wikipedia:
| https://demo.phiresky.xyz/tmp-ytccrzsovkcjoylr/dist/index.ht...
| I'm not sure if it's more efficient than the SQlite version in
| this form, but it definitely has more upward potential and is
| more fun to work with.
|
| And the corresponding draft PR: https://github.com/tantivy-
| search/tantivy/pull/1067
|
| I sadly haven't gotten around to working on it more and writing
| an article about it.
|
| Other people are also working on using this stuff to make Sci-hub
| and LibGen more available by using this in combination with IPFS
| for distributed and "uncensorable" hosting which is pretty
| awesome.
|
| Edit: Just realized that the OP demo only searches in article
| titles, while mine searches in full article contents by default.
| You can search in only the titles in my demo by querying
| `title:harry title:potter`
| dheera wrote:
| Yeah I was just thinking that. Why not just a static
| filesystem? Just mount a ext4 image full of .html files and
| browse away. It would be remarkably efficient, likely much more
| efficient than SQLite.
| segfall wrote:
| Thank you, Phiresky. My little side project only exists because
| of your work.
| ignoramous wrote:
| Congrats on your side project! This wonderful discussion
| exists because of your work.
| neocodesoftware wrote:
| how are the xml dumps generated?
| peterhunt wrote:
| You can also get efficient FTS with this method if you
| implement indexing in user space and avoid BM25. The Lucene
| practical scoring function works well with this method in my
| experience:
| https://www.elastic.co/guide/en/elasticsearch/guide/current/...
| piyh wrote:
| Is there a dumbed down version of this indexing conversation
| for someone who understands b-trees, but BM25 or user space
| indexing?
| peterhunt wrote:
| Well, this post is in the context of an e2e encrypted DB,
| but it's subject to the same constraints:
| https://medium.com/@ZeroDB_/scalable-full-text-search-
| over-e...
|
| If you understand btrees you understand the hardest part
| already :)
|
| Basically, you need to design a search index that examines
| the fewest DB pages in order to find the result. The Lucene
| scoring method stores a mapping of term -> document[]
| sorted in relevance order. The main idea is that you can
| examine only the first n documents for each term in the
| search query in order to find the most relevant search
| results. Picking n is sort of tricky, but if your index is
| stored in this way it's possible to fulfill a large % of
| queries efficiently without downloading the whole index.
|
| Here's a little Python implementation of what I mean by a
| "user space implementation". Note that it's a toy but it
| performs pretty well on some demo sklearn data sets: https:
| //gist.github.com/petehunt/724eeb77189332db101ad7b0db8...
| walrus01 wrote:
| How does it do for performance if you throw the whole 43GB into
| RAM? Plenty of very affordable workstation systems out there
| today gently used with 128GB in them.
| londons_explore wrote:
| You still have to download it all... Which is a barrier for
| most...
| tw04 wrote:
| As someone who was very recently on about as bad of a dsl
| connection you could get... What? Outside of metered
| cellular connections this is available to basically
| everyone. Even the remotest parts of Africa have at least a
| handful of unmetered connections within a days drive and a
| USB drive.
| tjoff wrote:
| For English it states 90 GB uncompressed, doesn't say
| compressed size but that doesn't sound much larger than a
| large game. In the context I don't see it as a barrier.
| unknownOrigin wrote:
| That's only a size of regular modern video game... the
| downloads of which are pretty mainstream these days. Saying
| that a 50 gig download is a barrier "for most" is
| definitely not true.
| exdsq wrote:
| Gears of War 5 on the Xbox is 133GB with all its updates. I
| think we're at the point where 100GB is a high but
| reasonable request for people nowadays, at least in cities
| in the West.
| shp0ngle wrote:
| what? the point of this is that it seeks just parts of the
| files through http ranges... not all 43 GB... just try the
| page? or am I missing something?
| unixhero wrote:
| You're missing his pragmatic and creative solution to
| leapfrog the problem altogether.
| shp0ngle wrote:
| ahhh I misread his comments.
|
| Sure, RAM is always faster.
| sonograph wrote:
| > Sure, RAM is always faster.
|
| Well, why not shove it all into L2 cache and leapfrog
| RAM?
|
| /first joke I ever made on HN
| trollied wrote:
| Jokes are funny...
| bawolff wrote:
| Which problem?
|
| Yes putting it in ram will be faster than downloading it
| over the network. I don't think anyone ever doubted that.
|
| I imagine people who wanted to download it would just use
| kiwix. The potential to download isn't the part that
| makes this project interesting.
| IshKebab wrote:
| A pragmatic solution that requires users to have 43GB of
| RAM? Right.
| AussieWog93 wrote:
| To put it in perspective, 43GB of RAM to browse the
| entirety of Wikipedia doesn't sound _that_ ridiculous
| when popular Electron-based IM programs use 1GB+.
| jggonz wrote:
| I'm going to upvote this comment because not everyone in
| the world can afford $300 for 64GB+ of RAM. The solution
| to not dump everything into RAM is a great one.
|
| Fortunately, I've had a fancy Silicon Valley job for 15+
| years which now lets me afford many GBs of RAM if I
| wanted to but it was a different story in my 20s. I
| relied on free or almost free hardware to learn. And that
| free hardware usually lagged in specs.
| walrus01 wrote:
| Yeah I agree with you not everyone can afford it, but
| lots of young people build $1800 gaming PCs. If you want
| a powerful Linux workstation desktop PC for fun projects
| like this you can get a Dell t5610 with two 8 core xeons
| and 128GB of ram on eBay for $650, then add your own $100
| SSD.
|
| I would not recommend running out and buying a new
| current gen Ryzen board+cpu or core i7-whatever and 128GB
| of new ram at ddr4-4200 prices.
| azeirah wrote:
| I mean sure it's for enthusiasts, but my 3 year old pc
| already has 32GB ram. It's definitely not out of the
| question
| unixhero wrote:
| 2021 called, memory is cheap.
| linkpuff wrote:
| Is it though? We are in the middle of a chip crisis where
| prices are raising to record levels...
| unixhero wrote:
| Yes it is. In the grand scheme of things, looking at how
| far we've come DDR4 modules are cheap dollar per
| gigabyte.
| andi999 wrote:
| Notebook vendors didn't get this memo.
| Eremotherium wrote:
| Don't buy a garbage laptop with soldered RAM and just
| upgrade it yourself?
| andi999 wrote:
| Good solution. Which one that allows for 128GB do you
| recommend?
| unixhero wrote:
| A workstation. Any professional one.
| IshKebab wrote:
| And how many users have more than 16GB of RAM?
| jbverschoor wrote:
| 20 applications x 50GB = 1tb of memory.
| Quekid5 wrote:
| o... kay?
|
| You're not wrong, but who's talking about 20
| applications?
| jbverschoor wrote:
| Look at the amount of applications and web-application
| you have open at a given point.
|
| If this is such a good idea, let's just all use iscsi and
| mount the internet.
| IgorPartola wrote:
| On phones too?
| gsich wrote:
| Phones are not first class devices.
| exdsq wrote:
| It's around PS300 for 64GB of Corsair RAM - you need a
| motherboard that supports it but you can go cheap and you
| don't need a great processor. You could probably put this
| together for PS600? It's not exactly crazy if this was
| for an office or something.
| tored wrote:
| I have been thinking of using FTS5 with SQLite to search
| emails. Not close of 43 GB ofc, so I would probably not have
| any major performance problems, but still is FTS5 any good? Or
| should I look into other solutions?
| psanford wrote:
| The nice thing about FTS5 is you can have full text search up
| and running on a dataset in a couple of minutes. So give it
| try and see if it is sufficient for your use case. If it is,
| great! If not, you've not wasted much time.
| OJFord wrote:
| Have you seen 'notmuch'? I was honestly shocked how good/fast
| it is. It uses a C (++?) lib called 'xapian' to provide the
| actual index/search capability.
| phiresky wrote:
| If you're talking about a normal local SQlite DB and your
| dataset is less than maybe 100GB of plaintext then SQLite FTS
| will work fine regarding performance.
| [deleted]
| simonw wrote:
| I'm nervous on your behalf for your S3 bill, have you done napkin
| maths for how much this will cost to host, especially given the
| relative inefficiency of the autocomplete queries?
|
| A trick that might help: create a separate SQLite table
| containing just the article titles to autocomplete against. That
| might give you queries that use smaller ranges of the file.
| segfall wrote:
| Great catch, simonw (and everyone else in this thread).
|
| I've reviewed some other static hosts and settled on wasabi for
| now. When my S3 bill rolls in, I'll reply to this comment with
| the damage. A scary number will help others avoid my oversight.
| jinseokim wrote:
| It seems to be 403 happens.
|
| Quick test: https://s3.us-
| east-2.wasabisys.com/static.wiki/en.db
| sitkack wrote:
| I built a wikipedia reader on the ipad years ago, I used two
| sqlite databases, one for the article titles using sqlite FTS
| then I compressed the articles themselves and merged the
| compressed chunks into a set of files < 2G, the title database
| had the offset into the compressed blobs. Only including the
| most popular pages, it was just under 10G.
| simonw wrote:
| That's really clever. Were you storing the article text as
| compressed data in BLOB columns in a SQLite table?
| sitkack wrote:
| I realize I wasn't clear. I used 2 sqlite databases just to
| store the article titles and offsets into binary files
| (<2GB) that had chunks of compressed data. Each chunk
| contained 10s-100s of stories, so that the decompressed
| around was around 20megs I believe. I had planned to store
| everything in sqlite, but there were too many issues. The
| FTS extension to sqlite is excellent, it made whole classes
| of problems literally melt away.
|
| Recalling now, the article text itself, I stored in groups
| of articles concatenated together so that that data
| compression could take advantage of similar texts. This
| data was just Lua data structures, with the article title
| and article body in an array. So I knew which chunk it was
| in, but I had to decompress a whole chunk at a time and
| then iterate through the list of decompressed articles to
| find the specific one. I was trying to tune search time,
| decompression time and compression ratio. I had a handful
| of parameters in the job that would take the xml dumps and
| create the sqlite and compressed blobs. Memory consumption
| would spike as a chunk was decompressed, but after a call
| to the Lua garbage collector after pulling the article out
| of the decompression buffer, everything was fine again.
|
| This was late 2010 or so. The job that processed the XML
| into database+bin data took over an hour to run on my
| laptop.
|
| If I did it again, I'd use a compression library that was
| tuned for decompression and used an external dictionary.
| For the index, I'd use fst [1]
|
| The project stalled out in parsing the wikitext, there are
| now libraries that can handle the parsing. I have
| considered resurrecting it, with jpeg2000 or other image
| codecs, one could include the images as well. The system
| did _not_ have FTS over the article text itself. I had
| planned to use some sort tree of bloom filters, something
| similar to [2].
|
| I was really trying to keep the total footprint of all the
| data under 10G or so it could fit on the majority of
| devices. Now with larger flash sizes, better codecs and
| better compression libraries, the functionality and
| performance could be greatly enhanced.
|
| I learned a ton from this project, I really recommend
| getting deep into something, but also working within your
| limits and having fallback plans when something looks
| insurmountable. There were plenty of times where I thought
| I was a dead end, only to say, "what if ..." the next day,
| run a quick experiment and be on to the next insurmountable
| roadblock.
|
| [1] https://github.com/BurntSushi/fst
|
| [2] https://arxiv.org/abs/2002.03057
| dom96 wrote:
| Why do people even use S3 if that's a risk? I really don't get
| this, do most just not worry about accidentally wasting money
| or do they just not understand this risk?
| RhodesianHunter wrote:
| Every system has risks. You just need to be aware of them
| going in.
|
| S3 is excellent at many things, serving a ton of public
| traffic isn't one of them. Usually people put a CDN in front
| of it for that use case.
| simonw wrote:
| This is a risk any time you host anything on the internet on
| any system that charges you for storage, bandwidth or CPU
| usage. It's fine to take that risk on as long as you're aware
| of it and keep an eye on your spend.
|
| In this case I'm nervous because 43GB is a lot larger than
| most projects!
| remram wrote:
| I think the point is that not every host charges for
| bandwidth or CPU usage, making S3 particularly risky.
| RhodesianHunter wrote:
| There's no CPU charge, only per-request charges.
|
| Bandwidth is free into S3 or from S3 to another AWS
| service so it's particularly handy for that use case.
| remram wrote:
| For S3, yes.
| slunk wrote:
| It's one thing if you have a source of revenue so justify
| your S3 costs. My interpretation of the parent commentator's
| concern is that this person has opened up a multi gigabyte S3
| file publicly and sent lots of traffic its way (via hacker
| news) for what appears to be a passion project.
| max_ wrote:
| He should seed a torrent file
| thrower123 wrote:
| I would like a copy of that sqlite database to play with.
|
| Having that all preprocessed would save a bunch of work
| compared to getting it from Wikipedia itself
| Weryj wrote:
| IPFS?
| warpech wrote:
| Second that. I have no hands down experience with it, but
| I've seen seemingly more complex apps deployed on IPFS
| https://docs.ipfs.io/concepts/usage-ideas-examples/
| wmf wrote:
| Torrents aren't great for random access which this app is
| doing.
| dTal wrote:
| But at least we could get a copy without incurring a
| massive s3 bill.
| rudian wrote:
| Can we really? For torrent to work you still need someone
| to host it in full at all times. Who will host a 40GB
| file that will be out of date in a week? The website
| provider will, so we're back to square one.
| dTal wrote:
| No, you don't. You just need N people to host 1/N of the
| file some of the time. The only person who needs to
| upload it in full is the original creator.
| magila wrote:
| Bittorrent can do random access just fine, especially with
| v2 which enables content verification in 16KB chunks.
| Clients tend to be throughput oriented which leads to sub-
| optimal latency, but that's not a limitation of the
| protocol.
| wmf wrote:
| I guess just fine is a matter of opinion. Connection
| setup and piece bitmap tracking seems like more overhead
| than HTTP range requests.
| magila wrote:
| Connection establishment in bittorrent might be
| marginally higher overhead than HTTP, but not by much and
| the cost is highly amortized if you're doing many range
| requests in the same torrent. There is indeed more
| overhead from HAVE messages and such, but that's just
| because bittorrent is doing a lot more than HTTP to
| enable P2P. It's still not that bad as a percentage of
| the payload traffic.
| ignoramous wrote:
| > _A trick that might help: create a separate SQLite table
| containing just the article titles to autocomplete against._
|
| Another one: build a directed-acyclic-word-graph [0][1] of just
| the page-titles and download it to browser's localStorage for
| prefix-based searches.
|
| In our usage, ~5M entries take ~18MB. Wikipedia has ~6.3M
| articles [2].
|
| [0] https://github.com/smhanov/dawg
|
| [1] http://stevehanov.ca/blog/?id=115
|
| [2] https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
| mkl wrote:
| A version of [1] with the pictures working: https://web.archi
| ve.org/web/20120211141906/http://stevehanov...
| Bayart wrote:
| Thanks for making me aware of a new data structure !
| Weryj wrote:
| Get a FusionIO drive off ebay and host the DB yourself, if you
| need, stick the computer behind a static ip VPN.
| gjs278 wrote:
| the drivers for fusion-io are maintained by the community.
| sandisk practically abandoned iodrive by kernel 4.20
|
| I would suggest an intel 900p instead, more space and not too
| much more expensive. very similar latencies as well. ive used
| both.
| capableweb wrote:
| Seems to be hosted at Surge,sh which is hosted at Digital
| Ocean, so seems this is not OPs problem. $
| nslookup static.wiki Server: 8.8.8.8
| Address: 8.8.8.8#53 Non-authoritative answer:
| static.wiki canonical name = na-west1.surge.sh. Name:
| na-west1.surge.sh Address: 188.166.132.94
| $ whois 188.166.132.94 % This is the RIPE
| Database query service. % The objects are in RPSL
| format. % % The RIPE Database is subject to
| Terms and Conditions. % See
| http://www.ripe.net/db/support/db-terms-conditions.pdf
| % Note: this output has been filtered. % To
| receive output for a database update, use the "-B" flag.
| % Information related to '188.166.132.0 - 188.166.135.255'
| % Abuse contact for '188.166.132.0 - 188.166.135.255' is
| 'abuse@digitalocean.com' inetnum:
| 188.166.132.0 - 188.166.135.255 netname:
| DIGITALOCEAN country: NL admin-c:
| PT7353-RIPE tech-c: PT7353-RIPE status:
| ASSIGNED PA mnt-by: digitalocean
| created: 2019-04-17T14:02:52Z last-modified:
| 2019-04-17T14:02:52Z source: RIPE
| person: Network Operations address: 101
| Ave of the Americas, 10th Floor address: New
| York, NY, 10013 address: United States of
| America phone: +13478756044 nic-hdl:
| PT7353-RIPE mnt-by: digitalocean
| created: 2015-03-11T16:37:07Z last-modified:
| 2019-04-17T14:37:51Z source: RIPE # Filtered
| org: ORG-DOI2-RIPE % This query was
| served by the RIPE Database Query Service version 1.101
| (HEREFORD)
|
| But yes, it is wise to run storage/transfer heavy websites on
| something that doesn't charge for traffic like AWS and most
| other infrastructure providers do.
| simonw wrote:
| That's just the site itself (the HTML and JavaScript). The
| 43GB SQLite database is hosted on S3.
| [deleted]
| mayli wrote:
| Correct, downloading this 40GB file could easily leads to huge
| s3 bills due to traffic. That's 1$ for a full download.
| banana_giraffe wrote:
| $2, if my math is right.
|
| As pointed out elsewhere, there's some basic work being done
| to prevent everyone from grabbing the .db file, but it's easy
| enough to bypass.
|
| Also, they're probably seeing non-zero API call charges. S3
| only charges $0.0004 per 1000 calls, but when you start to
| make lots of calls at scale, it can really add up. Still, for
| this site, it's probably not that bad. If my simple tests are
| a fair judge, it works out to something like $20 a month for
| 100k lookups.
|
| I've mentioned elsewhere on HN I used a similar technique
| with Python to query SQLite databases in S3. It works well,
| but there was an edge case on that project that resulted in
| millions of requests to S3 per invocation. It didn't break
| the bank, but it did push the project into the unprofitable
| range till we fixed that case.
| Asdrubalini wrote:
| I tried downloading the file directly from curl [0] but it
| seems like it blocks non-partial GET requests, so downloading
| it is not really straightaway.
|
| [0] https://s3.amazonaws.com/static.wiki/db/en.db
| llacb47 wrote:
| curl -v "https://s3.amazonaws.com/static.wiki/db/en.db" -H
| "Referer: http://static.wiki/" -H "Range: bytes=0-"
|
| Above works but for the sake of OP's wallet I would suggest
| that you do not download the entire database from his S3
| bucket.
| infogulch wrote:
| Instead download it from kaggle which is offering to host
| it for free: https://www.kaggle.com/segfall/markdownlike-
| wikipedia-dumps-...
| lanevorockz wrote:
| Maybe that would be a nice way to start Wikipedia alternatives,
| sijce it became a form of political activism the trustworthiness
| of Wikipedia fell down significantly.
| Scaevolus wrote:
| You should compress the individual article texts with zlib. Right
| now, looking up an article text for a decent size article
| ("Virus") takes 10 round-trips-- 5 appear to be btree traversals,
| but another 5 (!) are to read in the page content, as chunks of
| 8K 8K 16K 32K 64K. Compressing the article might cut one round
| trip off that
|
| Increasing the sqlite page size to a much larger size like 64KB
| or 128KB may improve average latency too, since fewer btree pages
| would need to be fetched to find the offset of the desired row,
| and most page contents could be read with one fetch. This matters
| more when fetches over the network take ~100ms.
| harlanji wrote:
| Bonito. I want and yet have problems downloading even a 350mb
| file on my carrier, tethered via 2 different devices. Being able
| to DL something like this should be a requirement to be called an
| Internet carrier. Non-mobile is not an option for some, eg. low
| income unemployed and homeless. Maybe they've been a software
| engineer for 20 years and can reach unreachable markets. (No
| doubt it's more profitable to spy and steal from the vulnerable,
| if God allows).
|
| What if a customer needs their records and can not afford to get
| them and they lose their life or suffer significantly? Who pays
| the balance?
|
| They didn't need use the Internet? Okay, well they did use it and
| they do during pandemic times.
| 8note wrote:
| I find that most wikipedia pages I'd want to look at are utterly
| unintelligible with communication conventions youd need to look
| up separately to understand the article.
|
| I can't use an offline Wikipedia without that coursework
| Camillo wrote:
| Read-only Wikipedia is just "Wikipedia" for 99.99% of users. I
| can't remember the last time I've even considered clicking the
| edit button.
| quenix wrote:
| Give it a try! It's strangely fulfilling.
| foxpurple wrote:
| Editing an existing page is usually pretty easy and conflict
| free. Creating a new page is close to impossible and strictly
| guarded.
|
| It's almost not worth trying for a beginner because you will
| almost certainly be rejected. The moderators require multiple
| approved sources of information for a page to be created.
| These pages must be reputable sources (IE not blog posts,
| forum posts, or other wikis), and must not have any kind of
| commercial link to the subject. So if you create a page about
| a product and you state that it has feature X, you can not
| use the companies own page that talks about having X as a
| reference.
|
| So basically you can't make a page about something until a
| major news website writes about it or you can reference it in
| some formal study.
|
| I tried to create a few pages. One of them being about a file
| format and found that I could not use the companies website
| with the PDF describing the spec as a reference for the
| format because there is a commercial link between the file
| format and the source linked.
| quenix wrote:
| This is so painfully true. I've tried to submit two
| articles so far to AfC (articles for creation) and both
| were rejected. It's puzzling to me because I often see
| extremely low quality articles (with just one sentence, for
| example), published onto mainspace.
| thealig wrote:
| Oh interesting. I've started editing a bit on Wikipedia
| recently, and seems like there are a lot of potential
| articles to improve upon. The deletionism issue reminds
| me of this article https://www.gwern.net/In-Defense-Of-
| Inclusionism
___________________________________________________________________
(page generated 2021-07-31 23:02 UTC)