[HN Gopher] Local LLMs versus offline Wikipedia
___________________________________________________________________
Local LLMs versus offline Wikipedia
Author : EvanHahn
Score : 97 points
Date : 2025-07-19 16:49 UTC (6 hours ago)
(HTM) web link (evanhahn.com)
(TXT) w3m dump (evanhahn.com)
| vFunct wrote:
| Why not both?
|
| LLM+Wikipedia RAG
| loloquwowndueo wrote:
| Because an old laptop can't run a local LLM in a reasonable
| amount of time.
| NitpickLawyer wrote:
| 0.6b - 1.5b models are surprisingly good for RAG, and should
| work reasonably well even on old toasters. Then there's gemma
| 3n which runs fine-ish even on mobile phones.
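The retrieval half of that LLM+Wikipedia RAG setup is cheap even on an old machine. A toy sketch (stdlib only, with made-up document snippets) that scores articles by simple word overlap and pastes the winner into the prompt a small local model would then complete:

```python
# Toy retrieval step of a RAG loop: score articles by word overlap,
# then hand the best match to a small local model as context.
docs = {
    "Bread": "Bread is baked from a dough of flour and water.",
    "Steel": "Steel is an alloy of iron and carbon.",
}

def retrieve(question, docs):
    """Return the title of the doc sharing the most words with the question."""
    q = set(question.lower().split())
    return max(docs, key=lambda t: len(q & set(docs[t].lower().split())))

question = "how is steel made from iron"
best = retrieve(question, docs)
prompt = f"Context: {docs[best]}\n\nQuestion: {question}\nAnswer:"
print(best)  # Steel
# A 0.6B-1.5B model would now be asked to complete `prompt`.
```

Real setups swap the overlap score for an embedding search, but the shape of the loop is the same: retrieve, stuff the context into the prompt, generate.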
| ozim wrote:
| Most people who can nag about old laptops on HN can afford a
| newer one but are as cheap as Scrooge McDuck.
| mlnj wrote:
| FYI: non-Western countries exist.
| folkrav wrote:
| Eh, even just "countries that are not the US" would be a
| correct statement. US tech salaries are just in an entirely
| different ballpark from what most companies outside the US
| can offer. I'm in Canada, I make good money (as far as
| Canadian salaries go), but nowhere near "buy an expensive
| laptop whenever" money.
| lblume wrote:
| It may also come down to laptops being produced and sold
| mostly by US companies, which means that the general fact
| of most items (e.g. produce) being much more expensive in
| the US compared to, say, Europe doesn't really apply.
| ozim wrote:
| People from those countries who can nag on HN and know what
| HN is are most likely still better off than most of their
| fellow countrymen.
| moffkalast wrote:
| Now this is an avengers level threat.
| simonw wrote:
| This is a sensible comparison.
|
| My "help reboot society with the help of my little USB stick"
| thing was a throwaway remark to the journalist at a random point
| in the interview, I didn't anticipate them using it in the
| article! https://www.technologyreview.com/2025/07/17/1120391/how-
| to-r...
|
| A bunch of people have pointed out that downloading Wikipedia
| itself onto a USB stick is sensible, and I agree with them.
|
| Wikipedia dumps default to MySQL, so I'd prefer to convert that
| to SQLite and get SQLite FTS working.
|
| 1TB or larger USB sticks are readily available these days, so
| it's not like there's a space shortage to worry about.
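A minimal sketch of the SQLite FTS idea (hypothetical table and column names, and assuming your Python's SQLite build includes the FTS5 extension):

```python
import sqlite3

# In-memory DB for illustration; a real conversion would stream
# articles out of the MySQL dump into this schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE articles USING fts5(title, body)")
conn.executemany(
    "INSERT INTO articles (title, body) VALUES (?, ?)",
    [
        ("SQLite", "SQLite is a small, fast, self-contained SQL database engine."),
        ("MySQL", "MySQL is a widely used client-server relational database."),
    ],
)
# Full-text query; bm25() ranks matches by relevance.
rows = conn.execute(
    "SELECT title FROM articles WHERE articles MATCH ? ORDER BY bm25(articles)",
    ("engine",),
).fetchall()
print(rows)  # [('SQLite',)]
```

With the full dump loaded this way, a single file on the USB stick gives you indexed full-text search with no server running.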
| cyanydeez wrote:
| The real value would be in both of them. The LLM is good for
| refining/interpreting questions or longer-form progress issues,
| and the wiki would be the actual information for each component
| of whatever you're trying to do.
|
| But neither is sufficient for modern technology beyond pointing
| to a starting point.
| antonkar wrote:
| A bit related: AI companies distilled the whole Web into LLMs to
| make computers smart, so why can't humans do the same to make the
| best possible new Wikipedia, with some copyrighted bits, to make
| kids supersmart?
|
| Why are kids worse off than AI companies and have to bum around?)
| horseradish7k wrote:
| we did that and still do. people just don't buy encyclopedias
| that much nowadays
| antonkar wrote:
| Imagine taking the whole Web and removing the spam, duplicates,
| and bad explanations.
|
| It would be the free new Wikipedia+ for learning anything in the
| best way possible, with the best graphs, interactive widgets,
| etc.
|
| Something LLMs get for free but humans for some reason don't.
|
| In some places it is possible to use copyrighted materials for
| education, if not directly for profit.
| literalAardvark wrote:
| Love it when Silicon Valley reinvents encyclopedias
| dcc wrote:
| One important distinction is that the strength of LLMs isn't just
| in storing or retrieving knowledge like Wikipedia, it's in
| comprehension.
|
| LLMs will return faulty or imprecise information at times, but
| what they can do is understand vague or poorly formed questions
| and help guide a user toward an answer. They can explain complex
| ideas in simpler terms, adapt responses based on the user's level
| of understanding, and connect dots across disciplines.
|
| In a "rebooting society" scenario, that kind of interactive
| comprehension could be more valuable. You wouldn't just have a
| frozen snapshot of knowledge, you'd have a tool that can help
| people use it, even if they're starting with limited background.
| progval wrote:
| An unreliable computer treated as a god by a pre-information-
| age society sounds like a Star Trek episode.
| bryanrasmussen wrote:
| hey, generally everything worked pretty well in those
| societies, it was only the people who didn't fit in who had a
| brief painful headache and then died!
| bigyabai wrote:
| Or the plot to _2001_ if you managed to stay awake long
| enough.
| gretch wrote:
| Definitely sounds like a plausible and fun episode.
|
| On the other hand, real history is filled with all sorts of
| things being treated as a god that were much worse than an
| "unreliable computer". For example, a lot of the time it was
| just a human with malice.
|
| So how bad could it really get?
| fzeroracer wrote:
| In a 'rebooting society' doomsday scenario you're assuming that
| our language and understanding would persist. An LLM would
| essentially be a blackbox that you cannot understand or
| decipher, and would be doubly prone to hallucinations and
| issues when interacting with it using a language it was not
| trained on. Wikipedia is something you could gradually
| untangle, especially if the downloaded version also contained
| associated images.
| lblume wrote:
| I would not subscribe to your certainty. With LLMs, even
| empty or nonsensical prompts yield answers, however faulty
| they may be. Based on its level of comprehension and ability
| to generalize between languages I would not be too surprised
| to see LLMs being able to communicate on a very superficial
| level in a language not part of the training data.
| Furthermore, the compression ratio seems to be much better
| with LLMs compared to Wikipedia, considering the generality
| of questions one can pose to e.g. Qwen that Wikipedia cannot
| answer even when knowing how to navigate the site properly.
| It could also come down to the classic dichotomy between
| symbolic expert systems and connectionist neural networks
| which has historically and empirically been decisively won by
| the latter.
| cyanydeez wrote:
| Which means you'd still want Wikipedia, as the imprecision will
| get in the way of real progress beyond the basics.
| belter wrote:
| > LLMs will return faulty or imprecise information at times,
| but what they can do is understand vague or poorly formed
| questions and help guide a user toward an answer.
|
| - "'Pray, Mr. Babbage, if you put into the machine wrong
| figures, will the right answers come out?' "
| ianmcgowan wrote:
| A tangent - sounds like
| https://en.wikipedia.org/wiki/The_Book_of_Koli - a key plot
| component is a chatty Sony AI music player. A little YA, but a
| fun read..
| gonzobonzo wrote:
| Indeed. Ideally, you don't want to trust other people's
| summaries of sources, but you want to look at the sources
| yourself, often with a critical eye. This is one of the things
| that everyone gets taught in school, everyone says they agree
| with, and then just about no one does (and at times, people
| will outright disparage the idea). Once out of school, tertiary
| sources get treated as if they're completely reliable.
|
| I've found using LLM's to be a good way of getting an idea of
| where the current historiography of a topic stands, and which
| sources I should dive into. Conversely, I've been disappointed
| by the number of Wikipedia editors who become outright hostile
| when you say that Wikipedia is unreliable and that people often
| need to dive into the sources to get a better understanding of
| things. There have been some Wikipedia articles I've come
| across that have been so unreliable that people who didn't look
| at other sources would have been greatly misled.
| ranger_danger wrote:
| > LLMs will return faulty or imprecise information at times
|
| To be fair, so do humans and wikipedia.
| spankibalt wrote:
| Wikipedia-snapshots without the most important meta layers, i. e.
| a) the article's discussion pages and related archives, as well
| as b) the version history, would be useless to me as critical
| contexts might be/are missing... especially with regards to LLM-
| augmented text analysis. Even when just focusing on the standout-
| lemmata.
| pinkmuffinere wrote:
| I'm a massive Wikipedia fan, have a lot of it downloaded
| locally on my phone, binge read it before bed, etc. Even so, I
| rarely go through talk pages or version history unless I'm
| contributing something. What would you see in an article that
| motivates you to check out the meta layers?
| nine_k wrote:
| Try any article on a controversial issue.
| pinkmuffinere wrote:
| I guess if I know it's controversial then I don't need the
| talk page, and if I don't then I wouldn't think to check
| asacrowflies wrote:
| Any article with social or political controversy... Try
| Gamergate. Or any of the presidents' pages since at least
| Bush lol
| spankibalt wrote:
| > "I'm a massive Wikipedia fan, have a lot of it downloaded
| locally on my phone, binge read it before bed, etc."
|
| Me too, albeit these days I'm more interested in its
| underrated capabilities to foster teaching of e-governance
| and democracy/participation.
|
| > "What would you see in an article that motivates you to
| check out the meta layers?"
|
| Generally: How the lemma came to be, how it developed, any
| contentious issues around it, and how it compares to
| tangential lemmata under the same topical umbrella,
| especially with regards to working groups/SIGs (e. g.
| philosophy, history), and their specific methods and
| methodologies, as well as relevant authors.
|
| With regards to contentious issues, one obviously gets a look
| into what the hot-button issues of the day are, as well as
| (comparatives of) internal political issues in different wiki
| projects (incl. scandals, e. g. the right-wing/fascist
| infiltration and associated revisionism and negationism in
| the Croatian wiki [1]). Et cetera.
|
| I always look at the talk pages. And since I mentioned it
| before: Albeit I have almost no use for LLMs in my private
| life, running a Wiki, or a set of articles within, through an
| LLM-ified text analysis engine sounds certainly interesting.
|
| 1. [https://en.wikipedia.org/wiki/Denial_of_the_genocide_of_S
| erb...]
| wangg wrote:
| Wouldn't Wikipedia compress a lot more than LLMs? Are these
| uncompressed sizes?
| Philpax wrote:
| Yes, they're uncompressed. For reference,
| `enwiki-20250620-pages-articles-multistream.xml.bz2` is
| 25,176,364,573 bytes; you could get that lower with better
| compression. You can do partial reads from multistream bz2,
| though, which is handy.
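The partial-read trick works because a multistream file is just independent bz2 streams concatenated. A stdlib sketch, with made-up contents standing in for the real dump:

```python
import bz2

# A multistream file is independent bz2 streams glued together.
# Build a tiny two-stream "file" in memory as a stand-in.
stream_a = bz2.compress(b"articles 1..100")
stream_b = bz2.compress(b"articles 101..200")
blob = stream_a + stream_b

# The real dump ships with an index of stream byte offsets; with it
# you can jump straight to the stream holding the article you want.
offset = len(stream_a)  # pretend the index told us this
decomp = bz2.BZ2Decompressor()
chunk = decomp.decompress(blob[offset:])  # decompress only stream B
print(chunk)  # b'articles 101..200'
```

So you never need to decompress the whole 25 GB to pull out one article, only the stream it lives in.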
| GuB-42 wrote:
| The downloads are (presumably) already compressed.
|
| And there are strong ties between LLMs and compression. LLMs
| work by predicting the next token. The best compression
| algorithms work by predicting the next token and encoding the
| difference between the predicted token and the actual token in
| a space-efficient way. So in a sense, a LLM trained on
| Wikipedia is kind of a compressed version of Wikipedia.
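That prediction/compression link can be shown with a toy model: an ideal entropy coder spends about -log2(p) bits on a symbol the model assigned probability p, so a better predictor yields a shorter encoding. A small stdlib sketch (the two hand-written predictors are illustrative stand-ins for a real model):

```python
import math

def cost_in_bits(text, predict):
    """Ideal code length if each char costs -log2(p) bits under the model."""
    return sum(-math.log2(predict(text[:i], ch)) for i, ch in enumerate(text))

text = "abababababababab"

# Baseline: a uniform model over the two symbols (knows nothing).
uniform = lambda ctx, ch: 0.5

# "Trained" model: after an 'a', a 'b' is very likely, and vice versa.
def alternating(ctx, ch):
    if not ctx:
        return 0.5
    expected = "b" if ctx[-1] == "a" else "a"
    return 0.9 if ch == expected else 0.1

print(cost_in_bits(text, uniform))      # 16.0 bits
print(cost_in_bits(text, alternating))  # ~3.3 bits
```

The same 16 characters cost a fifth as many bits under the better predictor, which is the sense in which a model trained on Wikipedia acts as a compressed Wikipedia.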
| haunter wrote:
| I thought this would be about training a local LLM with an
| offline downloaded copy of Wikipedia
| s1mplicissimus wrote:
| Upvoted this because I like the lighthearted, honest approach.
| meander_water wrote:
| One thing to note is that the quality of LLM output is related to
| the quality and depth of the input prompt. If you don't know what
| to ask (likely in the apocalypse scenario), then that info is
| locked away in the weights.
|
| On the other hand, with Wikipedia, you can just read and search
| everything.
| badsectoracula wrote:
| I've found this amusing because right now i'm downloading
| `wikipedia_en_all_maxi_2024-01.zim` so i can use it with an LLM
| with pages extracted using `libzim` :-P. AFAICT the zim files
| have the pages as HTML and the file i'm downloading is ~100GB.
|
| (reason: trying to cross-reference the _tons_ of downloaded games
| on my HDD - for which i only have titles, as i never bothered to
| do any further categorization over the years aside from the place
| i got them from - with wikipedia articles - assuming they have one
| - to organize them in genres, some info, etc and after some
| experimentation it turns out an LLM - specifically a quantized
| Mistral Small 3.2 - can make some sense of the chaos while being
| fast enough to run from scripts via a custom llama.cpp program)
___________________________________________________________________
(page generated 2025-07-19 23:00 UTC)