[HN Gopher] Local LLMs versus offline Wikipedia
       ___________________________________________________________________
        
       Local LLMs versus offline Wikipedia
        
       Author : EvanHahn
       Score  : 97 points
       Date   : 2025-07-19 16:49 UTC (6 hours ago)
        
 (HTM) web link (evanhahn.com)
 (TXT) w3m dump (evanhahn.com)
        
       | vFunct wrote:
       | Why not both?
       | 
       | LLM+Wikipedia RAG
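        | 
        | A minimal sketch of what that could look like (untested; the
        | "articles" FTS table, database file, model file and the
        | llama-cpp-python call are all assumptions, not anything from
        | the thread):
        | 
        |   # RAG sketch: pull matching snippets from a local Wikipedia
        |   # SQLite FTS5 index, then answer with a local GGUF model.
        |   import re
        |   import sqlite3
        |   from llama_cpp import Llama  # pip install llama-cpp-python
        | 
        |   db = sqlite3.connect("wikipedia.db")   # hypothetical index
        |   llm = Llama(model_path="qwen2.5-1.5b-instruct-q4.gguf",
        |               n_ctx=4096)
        | 
        |   def answer(question, k=3):
        |       # Reduce the question to bare keywords so FTS5 query
        |       # syntax doesn't choke on punctuation.
        |       query = " OR ".join(re.findall(r"\w+", question))
        |       rows = db.execute(
        |           "SELECT title, snippet(articles, 1, '', '', '...', 64)"
        |           " FROM articles WHERE articles MATCH ?"
        |           " ORDER BY rank LIMIT ?",
        |           (query, k),
        |       ).fetchall()
        |       context = "\n\n".join(f"{t}:\n{s}" for t, s in rows)
        |       out = llm.create_chat_completion(messages=[
        |           {"role": "system",
        |            "content": "Answer using only this context:\n" + context},
        |           {"role": "user", "content": question},
        |       ])
        |       return out["choices"][0]["message"]["content"]
        | 
        |   print(answer("How do you make soap from wood ash?"))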
        
         | loloquwowndueo wrote:
          | Because an old laptop can't run a local LLM in reasonable
          | time.
        
           | NitpickLawyer wrote:
           | 0.6b - 1.5b models are surprisingly good for RAG, and should
           | work reasonably well even on old toasters. Then there's gemma
           | 3n which runs fine-ish even on mobile phones.
        
           | ozim wrote:
            | Most people who nag about old laptops on HN can afford a
            | newer one but are as cheap as Scrooge McDuck.
        
             | mlnj wrote:
             | FYI: non-Western countries exist.
        
               | folkrav wrote:
               | Eh, even just "countries that are not the US" would be a
                | correct statement. US tech salaries are in an entirely
                | different ballpark from what most companies outside the US
               | can offer. I'm in Canada, I make good money (as far as
               | Canadian salaries go), but nowhere near "buy an expensive
               | laptop whenever" money.
        
               | lblume wrote:
               | It may also come down to laptops being produced and sold
               | mostly by US companies, which means that the general fact
               | of most items (e.g. produce) being much more expensive in
               | the US compared to, say, Europe doesn't really apply.
        
               | ozim wrote:
                | People from those countries who can nag on HN and know
                | what HN is are most likely still better off than most
                | of their fellow countrymen.
        
         | moffkalast wrote:
         | Now this is an avengers level threat.
        
       | simonw wrote:
       | This is a sensible comparison.
       | 
       | My "help reboot society with the help of my little USB stick"
       | thing was a throwaway remark to the journalist at a random point
       | in the interview, I didn't anticipate them using it in the
       | article! https://www.technologyreview.com/2025/07/17/1120391/how-
       | to-r...
       | 
       | A bunch of people have pointed out that downloading Wikipedia
       | itself onto a USB stick is sensible, and I agree with them.
       | 
       | Wikipedia dumps default to MySQL, so I'd prefer to convert that
       | to SQLite and get SQLite FTS working.
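        | 
        | A rough, untested sketch of that conversion, starting from the
        | XML dump rather than the SQL one (file names and schema here
        | are just placeholders):
        | 
        |   # Stream pages out of the XML dump and index them in FTS5.
        |   import bz2
        |   import sqlite3
        |   import xml.etree.ElementTree as ET
        | 
        |   def iter_pages(dump_path):
        |       # Walk <page> elements without loading the whole dump;
        |       # namespace handling is deliberately sloppy here.
        |       with bz2.open(dump_path, "rb") as f:
        |           for _, elem in ET.iterparse(f):
        |               if elem.tag.endswith("}page"):
        |                   title = text = None
        |                   for child in elem.iter():
        |                       if child.tag.endswith("}title"):
        |                           title = child.text
        |                       elif child.tag.endswith("}text"):
        |                           text = child.text
        |                   if title and text:
        |                       yield title, text
        |                   elem.clear()  # keep memory bounded
        | 
        |   db = sqlite3.connect("wikipedia.db")
        |   db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS articles "
        |              "USING fts5(title, body)")
        |   with db:
        |       db.executemany(
        |           "INSERT INTO articles (title, body) VALUES (?, ?)",
        |           iter_pages("enwiki-latest-pages-articles"
        |                      "-multistream.xml.bz2"),
        |       )
        |   # Then e.g.: SELECT title FROM articles
        |   #            WHERE articles MATCH 'soap' ORDER BY rank LIMIT 10;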
       | 
        | USB sticks of 1TB or more are readily available these days, so
        | there's no space shortage to worry about.
        
         | cyanydeez wrote:
          | The real value would be in both of them: the LLM is good for
          | refining/interpreting questions or working through longer-
          | form problems, and the wiki would be the actual information
          | for each component of whatever you're trying to do.
          | 
          | But neither is sufficient for modern technology beyond
          | pointing to a starting point.
        
       | antonkar wrote:
        | A bit related: AI companies distilled the whole Web into LLMs
        | to make computers smart, so why can't humans do the same to
        | make the best possible new Wikipedia, with some copyrighted
        | bits, to make kids supersmart?
        | 
        | Why are kids treated worse than AI companies and have to bum
        | around?
        
         | horseradish7k wrote:
          | We did that and still do. People just don't buy
          | encyclopedias that much nowadays.
        
           | antonkar wrote:
            | Imagine taking the whole Web and removing the spam,
            | duplicates, and bad explanations.
            | 
            | It would be the free new Wikipedia+: the best possible way
            | to learn anything, with the best graphs, interactive
            | widgets, etc.
            | 
            | That's what LLMs get for free but humans for some reason
            | don't.
            | 
            | In some places it is possible to use copyrighted materials
            | for education, as long as it isn't directly for profit.
        
             | literalAardvark wrote:
             | Love it when Silicon Valley reinvents encyclopedias
        
       | dcc wrote:
       | One important distinction is that the strength of LLMs isn't just
       | in storing or retrieving knowledge like Wikipedia, it's in
       | comprehension.
       | 
       | LLMs will return faulty or imprecise information at times, but
       | what they can do is understand vague or poorly formed questions
       | and help guide a user toward an answer. They can explain complex
       | ideas in simpler terms, adapt responses based on the user's level
       | of understanding, and connect dots across disciplines.
       | 
       | In a "rebooting society" scenario, that kind of interactive
       | comprehension could be more valuable. You wouldn't just have a
       | frozen snapshot of knowledge, you'd have a tool that can help
       | people use it, even if they're starting with limited background.
        
         | progval wrote:
         | An unreliable computer treated as a god by a pre-information-
         | age society sounds like a Star Trek episode.
        
           | bryanrasmussen wrote:
           | hey generally everything worked pretty good in those
           | societies, it was only people who didn't fit in who had a
           | brief painful headache and then died!
        
           | bigyabai wrote:
           | Or the plot to _2001_ if you managed to stay awake long
           | enough.
        
           | gretch wrote:
           | Definitely sounds like a plausible and fun episode.
           | 
            | On the other hand, real history is filled with all sorts
            | of things being treated as a god that were much worse than
            | an "unreliable computer". For example, a lot of times it's
            | just a human with malice.
            | 
            | So how bad could it really get?
        
         | fzeroracer wrote:
         | In a 'rebooting society' doomsday scenario you're assuming that
         | our language and understanding would persist. An LLM would
         | essentially be a blackbox that you cannot understand or
         | decipher, and would be doubly prone to hallucinations and
         | issues when interacting with it using a language it was not
         | trained on. Wikipedia is something you could gradually
         | untangle, especially if the downloaded version also contained
         | associated images.
        
           | lblume wrote:
           | I would not subscribe to your certainty. With LLMs, even
           | empty or nonsensical prompts yield answers, however faulty
           | they may be. Based on its level of comprehension and ability
           | to generalize between languages I would not be too surprised
           | to see LLMs being able to communicate on a very superficial
           | level in a language not part of the training data.
           | Furthermore, the compression ratio seems to be much better
           | with LLMs compared to Wikipedia, considering the generality
           | of questions one can pose to e.g. Qwen that Wikipedia cannot
           | answer even when knowing how to navigate the site properly.
           | It could also come down to the classic dichotomy between
           | symbolic expert systems and connectionist neural networks
           | which has historically and empirically been decisively won by
           | the latter.
        
         | cyanydeez wrote:
          | Which means you'd still want Wikipedia, as the imprecision
          | will get in the way of real progress beyond the basics.
        
         | belter wrote:
         | > LLMs will return faulty or imprecise information at times,
         | but what they can do is understand vague or poorly formed
         | questions and help guide a user toward an answer.
         | 
         | - "'Pray, Mr. Babbage, if you put into the machine wrong
         | figures, will the right answers come out?' "
        
         | ianmcgowan wrote:
         | A tangent - sounds like
         | https://en.wikipedia.org/wiki/The_Book_of_Koli - a key plot
          | component is a chatty Sony AI music player. A little YA, but
          | a fun read.
        
         | gonzobonzo wrote:
         | Indeed. Ideally, you don't want to trust other people's
         | summaries of sources, but you want to look at the sources
         | yourself, often with a critical eye. This is one of the things
          | that everyone gets taught in school, everyone says they agree
         | with, and then just about no one does (and at times, people
         | will outright disparage the idea). Once out of school, tertiary
         | sources get treated as if they're completely reliable.
         | 
          | I've found using LLMs to be a good way of getting an idea of
         | where the current historiography of a topic stands, and which
         | sources I should dive into. Conversely, I've been disappointed
         | by the number of Wikipedia editors who become outright hostile
         | when you say that Wikipedia is unreliable and that people often
         | need to dive into the sources to get a better understanding of
         | things. There have been some Wikipedia articles I've come
         | across that have been so unreliable that people who didn't look
          | at other sources would have been greatly misled.
        
         | ranger_danger wrote:
         | > LLMs will return faulty or imprecise information at times
         | 
         | To be fair, so do humans and wikipedia.
        
       | spankibalt wrote:
        | Wikipedia snapshots without the most important meta layers,
        | i.e. a) the articles' discussion pages and related archives,
        | as well as b) the version history, would be useless to me, as
        | critical contexts might be (or are) missing... especially with
        | regard to LLM-augmented text analysis. Even when just focusing
        | on the standout lemmata.
        
         | pinkmuffinere wrote:
         | I'm a massive Wikipedia fan, have a lot of it downloaded
         | locally on my phone, binge read it before bed, etc. Even so, I
         | rarely go through talk pages or version history unless I'm
         | contributing something. What would you see in an article that
         | motivates you to check out the meta layers?
        
           | nine_k wrote:
           | Try any article on a controversial issue.
        
             | pinkmuffinere wrote:
             | I guess if I know it's controversial then I don't need the
             | talk page, and if I don't then I wouldn't think to check
        
           | asacrowflies wrote:
            | Any article with social or political controversy... try
            | Gamergate. Or any of the presidents' pages since at least
            | Bush, lol.
        
           | spankibalt wrote:
           | > "I'm a massive Wikipedia fan, have a lot of it downloaded
           | locally on my phone, binge read it before bed, etc."
           | 
            | Me too, though these days I'm more interested in its
            | underrated capabilities for fostering the teaching of
            | e-governance and democracy/participation.
           | 
           | > "What would you see in an article that motivates you to
           | check out the meta layers?"
           | 
           | Generally: How the lemma came to be, how it developed, any
           | contentious issues around it, and how it compares to
           | tangential lemmata under the same topical umbrella,
           | especially with regards to working groups/SIGs (e. g.
           | philosophy, history), and their specific methods and
           | methodologies, as well as relevant authors.
           | 
           | With regards to contentious issues, one obviously gets a look
           | into what the hot-button issues of the day are, as well as
           | (comparatives of) internal political issues in different wiki
           | projects (incl. scandals, e. g. the right-wing/fascist
           | infiltration and associated revisionism and negationism in
           | the Croatian wiki [1]). Et cetera.
           | 
            | I always look at the talk pages. And since I mentioned it
            | before: although I have almost no use for LLMs in my
            | private life, running a wiki, or a set of articles within
            | one, through an LLM-ified text analysis engine certainly
            | sounds interesting.
           | 
           | 1. [https://en.wikipedia.org/wiki/Denial_of_the_genocide_of_S
           | erb...]
        
       | wangg wrote:
        | Wouldn't Wikipedia compress a lot more than LLMs? Are these
        | uncompressed sizes?
        
         | Philpax wrote:
         | Yes, they're uncompressed. For reference,
         | `enwiki-20250620-pages-articles-multistream.xml.bz2` is
         | 25,176,364,573 bytes; you could get that lower with better
         | compression. You can do partial reads from multistream bz2,
         | though, which is handy.
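          | 
          | A sketch of such a partial read (untested): the companion
          | *-multistream-index.txt.bz2 lists "offset:page_id:title"
          | lines, and each offset points at a self-contained bz2 stream
          | of roughly 100 pages inside the big file, so you can
          | decompress just the stream you need.
          | 
          |   import bz2
          | 
          |   def read_stream(dump_path, offset, next_offset):
          |       # The slice between two consecutive offsets from the
          |       # index file is one complete bz2 stream.
          |       with open(dump_path, "rb") as f:
          |           f.seek(offset)
          |           raw = f.read(next_offset - offset)
          |       return bz2.BZ2Decompressor().decompress(raw).decode("utf-8")
          | 
          |   # Offsets below are made up for illustration; real ones
          |   # come from the index file.
          |   xml = read_stream(
          |       "enwiki-20250620-pages-articles-multistream.xml.bz2",
          |       offset=654_321, next_offset=1_234_567)
          |   print(xml[:500])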
        
         | GuB-42 wrote:
         | The downloads are (presumably) already compressed.
         | 
         | And there are strong ties between LLMs and compression. LLMs
         | work by predicting the next token. The best compression
         | algorithms work by predicting the next token and encoding the
         | difference between the predicted token and the actual token in
          | a space-efficient way. So in a sense, an LLM trained on
         | Wikipedia is kind of a compressed version of Wikipedia.
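          | 
          | A toy illustration of that link (the bigram "model" below is
          | a stand-in, and fitting it on the very text it then scores
          | is a shortcut a real coder couldn't take): an ideal
          | arithmetic coder spends about -log2(p) bits on a token the
          | model assigned probability p, so a better next-token
          | predictor means a smaller compressed size.
          | 
          |   import math
          |   from collections import Counter, defaultdict
          | 
          |   text = "the cat sat on the mat because the cat liked the mat"
          |   tokens = text.split()
          | 
          |   # Count which token follows which, as a crude predictor.
          |   follows = defaultdict(Counter)
          |   for prev, cur in zip(tokens, tokens[1:]):
          |       follows[prev][cur] += 1
          | 
          |   bits = 0.0
          |   for prev, cur in zip(tokens, tokens[1:]):
          |       counts = follows[prev]
          |       p = counts[cur] / sum(counts.values())
          |       bits += -math.log2(p)  # ideal code length for this token
          | 
          |   print(f"{bits:.1f} bits for {len(tokens) - 1} predicted tokens")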
        
       | haunter wrote:
       | I thought this would be about training a local LLM with an
       | offline downloaded copy of Wikipedia
        
       | s1mplicissimus wrote:
       | Upvoted this because I like the lighthearted, honest approach.
        
       | meander_water wrote:
       | One thing to note is that the quality of LLM output is related to
       | the quality and depth of the input prompt. If you don't know what
       | to ask (likely in the apocalypse scenario), then that info is
       | locked away in the weights.
       | 
       | On the other hand, with Wikipedia, you can just read and search
       | everything.
        
       | badsectoracula wrote:
        | I've found this amusing because right now I'm downloading
        | `wikipedia_en_all_maxi_2024-01.zim` so I can use it with an
        | LLM, with pages extracted using `libzim` :-P. AFAICT the zim
        | files have the pages as HTML, and the file I'm downloading is
        | ~100GB.
        | 
        | (Reason: I'm trying to cross-reference the _tons_ of games
        | downloaded on my HDD - for which I only have titles, as I
        | never bothered with any further categorization over the years
        | other than the place I got them from - with Wikipedia articles
        | - assuming they have one - to organize them by genre, add some
        | info, etc. After some experimentation it turns out an LLM -
        | specifically a quantized Mistral Small 3.2 - can make some
        | sense of the chaos while being fast enough to run from scripts
        | via a custom llama.cpp program.)
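        | 
        | For what it's worth, the extraction side of that can stay
        | small. A rough sketch (untested, assuming the python-libzim
        | reader API; the entry path is a guess and varies between ZIM
        | files):
        | 
        |   from libzim.reader import Archive  # pip install libzim
        | 
        |   zim = Archive("wikipedia_en_all_maxi_2024-01.zim")
        |   entry = zim.get_entry_by_path("A/Elite_(video_game)")
        |   html = bytes(entry.get_item().content).decode("utf-8")
        |   # Strip tags / truncate before handing this to the LLM.
        |   print(html[:500])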
        
       ___________________________________________________________________
       (page generated 2025-07-19 23:00 UTC)