[HN Gopher] How large are large language models?
       ___________________________________________________________________
        
       How large are large language models?
        
       Author : rain1
       Score  : 218 points
       Date   : 2025-07-02 10:39 UTC (12 hours ago)
        
 (HTM) web link (gist.github.com)
 (TXT) w3m dump (gist.github.com)
        
       | unwind wrote:
       | Meta: The inclusion of the current year ("(2025)") in the title
        | is strange; even though it's in the actual title of the linked-to
       | post, repeating it here makes me look around for the time machine
       | controls.
        
       | dale_glass wrote:
       | How big are those in terms of size on disk and VRAM size?
       | 
       | Something like 1.61B just doesn't mean much to me since I don't
       | know much about the guts of LLMs. But I'm curious about how that
       | translates to computer hardware -- what specs would I need to run
       | these? What could I run now, what would require spending some
       | money, and what I might hope to be able to run in a decade?
        
         | mjburgess wrote:
          | At 1 byte/param that's ~1.6GB (fp8); at 2 bytes (fp16) that's
          | ~3.2GB -- but there are other space costs beyond loading the
          | parameters for the GPU. So a rule of thumb is ~4x the parameter
          | count in GB. So round up, 2B -> 2*4 = 8GB VRAM.
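          | 
          | A sketch of that rule of thumb in Python (the ~4 GB per billion
          | params is a rough heuristic, not a measured number):
          | 
          |     def vram_estimate_gb(params_billions, gb_per_billion=4):
          |         # ~4 GB/billion leaves headroom for the KV cache,
          |         # activations and runtime overhead on top of weights
          |         return params_billions * gb_per_billion
          | 
          |     print(vram_estimate_gb(2))   # ~8 GB, as above
          |     print(vram_estimate_gb(70))  # ~280 GB: multi-GPU territory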
        
         | loudmax wrote:
         | Most of these models have been trained using 16-bit weights. So
         | a 1 billion parameter model takes up 2 gigabytes.
         | 
         | In practice, models can be quantized to smaller weights for
         | inference. Usually, the performance loss going from 16 bit
         | weights to 8 bit weights is very minor, so a 1 billion
         | parameter model can take 1 gigabyte. Thinking about these
         | models in terms of 8-bit quantized weights has the added
         | benefit of making the math really easy. A 20B model needs 20G
         | of memory. Simple.
         | 
         | Of course, models can be quantized down even further, at
         | greater cost of inference quality. Depending on what you're
         | doing, 5-bit weights or even lower might be perfectly
         | acceptable. There's some indication that models that have been
         | trained on lower bit weights might perform better than larger
         | models that have been quantized down. For example, a model that
         | was trained using 4-bit weights might perform better than a
         | model that was trained at 16 bits, then quantized down to 4
         | bits.
         | 
         | When running models, a lot of the performance bottleneck is
         | memory bandwidth. This is why LLM enthusiasts are looking for
          | GPUs with the most possible VRAM. Your computer might have 128G
         | of RAM, but your GPU's access to that memory is so constrained
         | by bandwidth that you might as well run the model on your CPU.
          | Running a model on the CPU can be done; it's just much slower,
          | because the computation is so parallel and that's exactly what
          | GPUs are built for.
         | 
          | Today's higher end consumer grade GPUs have up to 24G of
          | dedicated VRAM (the newer Nvidia RTX 5090 has 32G of VRAM and
          | they're like $2k). The dedicated VRAM on a GPU has a memory
          | bandwidth of about 1 TB/s. Apple's M-series of ARM-based CPUs
          | have 512 GB/s of bandwidth, and they're one of the most popular
          | ways of being able to run larger LLMs on consumer hardware.
          | AMD's new "Strix Halo" CPU+GPU chips have up to 128G of unified
          | memory, with a memory bandwidth of about 256 GB/s.
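          | 
          | A rough way to see why bandwidth dominates (batch size 1, dense
          | model, ignoring the KV cache and any compute overlap; all of
          | these are simplifying assumptions):
          | 
          |     # each generated token has to stream every weight from memory
          |     def tokens_per_sec(model_gb, bandwidth_gb_per_s):
          |         return bandwidth_gb_per_s / model_gb
          | 
          |     print(tokens_per_sec(20, 1000))  # ~1 TB/s  -> ~50 tok/s
          |     print(tokens_per_sec(20, 256))   # 256 GB/s -> ~13 tok/s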
         | 
         | Reddit's r/LocalLLaMA is a reasonable place to look to see what
         | people are doing with consumer grade hardware. Of course, some
         | of what they're doing is bonkers so don't take everything you
         | see there as a guide.
         | 
         | And as far as a decade from now, who knows. Currently, the top
         | silicon fabs of TSMC, Samsung, and Intel are all working flat-
         | out to meet the GPU demand from hyperscalers rolling out
         | capacity (Microsoft Azure, AWS, Google, etc). Silicon chip
         | manufacturing has traditionally followed a boom/bust cycle. But
         | with geopolitical tensions, global trade barriers, AI-driven
         | advances, and whatever other black swan events, what the next
         | few years will look like is anyone's guess.
        
       | OtherShrezzing wrote:
       | >None of this document was not written by AI
       | 
       | I think in these scenarios, articles should include the prompt
       | and generating model.
        
         | oc1 wrote:
         | You are absolutely right! The AI slop is getting out of
         | control.
        
         | WesolyKubeczek wrote:
         | I don't think the author knows that double negatives in English
         | in a sentence like this cancel, not reinforce, each other.
        
         | kylecazar wrote:
         | I thought this was an accidental double negative by the author
         | -- trying to declare they wrote it themselves.
         | 
          | There are some signs it was possibly written by a non-native
          | speaker.
        
         | rain1 wrote:
         | I have corrected that. It was supposed to say "None of this
         | document was written by AI."
         | 
         | Thank you for spotting the error.
        
           | OtherShrezzing wrote:
           | Understood, thanks for updating it!
        
       | mjburgess wrote:
        | Deepseek v3 is ~670Bn params, which is ~1.4TB physical.
       | 
       | All digitized books ever written/encoded compress to a few TB.
       | The public web is ~50TB. I think a usable zip of all english
       | electronic text publicly available would be on O(100TB). So we're
       | at about 1% of that in model size, and we're in a diminishing-
       | returns area of training -- ie., going to >1% has not yielded
       | improvements (cf. gpt4.5 vs 4o).
       | 
       | This is why compute spend is moving to inference time with
       | "reasoning" models. It's likely we're close to diminshing returns
       | on inference-time compute now too, hence agents whereby (mostly,)
       | deterministic tools are supplementing information /capability
       | into the system.
       | 
       | I think to get any more value out of this model class, we'll be
       | looking at domain-specific specialisation beyond instruction
       | fine-tuning.
       | 
       | I'd guess targeting 1TB inference-time VRAM would be a reasonable
       | medium-term target for high quality open source models -- that's
       | within the reach of most SMEs today. That's about 250bn params.
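        | 
        | (One way to get that number, back of the envelope: fp16 weights
        | are 2 bytes/param, and reserving roughly half the memory for KV
        | cache and runtime overhead gives 1TB / (2 x 2 bytes) ~= 250bn
        | params.)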
        
         | account-5 wrote:
         | > All digitized books ever written/encoded compress to a few
         | TB. The public web is ~50TB. I think a usable zip of all
         | english electronic text publicly available would be on
         | O(100TB).
         | 
          | Where are you getting these numbers from? Interested to see how
         | that's calculated.
         | 
         | I read somewhere, but cannot find the source anymore, that all
         | written text prior to this century was approx 50MB. (Might be
         | misquoted as don't have source anymore).
        
           | WesolyKubeczek wrote:
           | Maybe prior to the prior century, and even then I smell a lot
           | of bullshit. I mean, just look at the Project Gutenberg. Even
           | plaintext only, even compressed.
        
             | bravesoul2 wrote:
             | Even Shakespeare alone needs 4 floppy disks.
        
           | kmm wrote:
           | Perhaps that's meant to be 50GB (and that still seems like a
           | serious underestimation)? Just the Bible is already 5MB.
        
             | _Algernon_ wrote:
             | English Wikipedia without media alone is ~24 GB
             | _compressed_.
             | 
             | https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
        
               | kmm wrote:
               | I don't see how the size of Wikipedia has any bearing on
               | the 50MB figure given for pre-20th century literature by
               | the parent.
        
           | mjburgess wrote:
           | Anna's Archive full torrent is O(1PB), project gutenberg is
           | O(1TB), many AI training torrents are reported in the O(50TB)
           | range.
           | 
           | Extract just the plain text from that (+social media, etc.),
           | remove symbols outside of a 64 symbol alphabet (6 bits) and
           | compress. "Feels" to me around a 100TB max for absolutely
           | everything.
           | 
           | Either way, full-fat LLMs are operating at 1-10% of this
           | scale, depending how you want to estimate it.
           | 
           | If you run a more aggressive filter on that 100TB, eg., for a
           | more semantic dedup, there's a plausible argument for
           | "information" in english texts available being ~10TB -- then
           | we're running close to 20% of that in LLMs.
           | 
           | If we take LLMs to just be that "semantic compression
           | algorithm", and supposing the maximum useful size of an LLM
           | is 2TB, then you could run the argument that everything
           | "salient" ever written is <10TB.
           | 
           | Taking LLMs to be running at close-to 50% "everything useful"
            | rather than 1% would be an explanation of why training has
           | capped out.
           | 
           | I think the issue is at least as much to do with what we're
           | using LLMs for -- ie., instruction fine-tuning requires some
           | more general (proxy/quasi-) semantic structures in LLMs and I
           | think you only need O(1%) of "everything ever written" to
            | capture these. So it wouldn't really matter how much more we
           | added, instruction-following LLMs don't really need it.
        
           | TeMPOraL wrote:
           | > _I read somewhere, but cannot find the source anymore, that
           | all written text prior to this century was approx 50MB.
            | (Might be misquoted as don't have source anymore)._
           | 
           | 50 MB feels too low, unless the quote meant text up until the
            | _20th century_, in which case it feels much more believable.
           | In terms of text production and publishing, we're still
           | riding an exponent, so a couple orders of magnitude increase
           | between 1899 and 2025 is not surprising.
           | 
           | (Talking about S-curves is all the hotness these days, but I
           | feel it's usually a way to avoid understanding what
           | exponential growth means - if one assumes we're past the
           | inflection point, one can wave their hands and pretend the
           | change is linear, and continue to not understand it.)
        
             | ben_w wrote:
             | Even by the start of the 20th century, 50 MB is definitely
             | far too low.
             | 
              | Any given English translation of the Bible is by itself
             | something like 3-5 megabytes of ASCII; the complete works
             | of Shakespeare are about 5 megabytes; and I think (back of
             | the envelope estimate) you'd get about the same again for
             | what Arthur Conan Doyle wrote before 1900.
             | 
             | I can just about believe there might have been only ten
             | _thousand_ Bible-or-Shakespeare sized books (plus all the
             | court documents, newspapers, etc. that add up to that)
             | worldwide by 1900, but not _ten_.
             | 
             | Edit: I forgot about encyclopaedias, by 1900 the
             | Encyclopaedia Britannica was almost certainly more than 50
             | MB all by itself.
        
             | jerf wrote:
             | 50MB feels like "all the 'ancient' text we have" maybe, as
             | measured by the size of the original content and not
             | counting copies. A quick check at Alice in Wonderland puts
             | it at 163kB in plain text. About 300 of those gets us to
             | 50MB. There's way more than 300 books of similar size from
             | the 19th century. They may not all be digitized and freely
             | available, but you can fill libraries with even existing
             | 19th century texts, let alone what may be lost by now.
             | 
             | Or it may just be someone bloviating and just being
             | wrong... I think even ancient texts could exceed that
             | number, though perhaps not by an order of magnitude.
        
           | bravesoul2 wrote:
           | I reckon a prolific writer could publish a million words in
           | their career.
           | 
            | Most people who blog could write 1k words a day. That's a
           | million in 3 years. So not crazy numbers here.
           | 
            | That's 5MB. Maybe you meant 50GB. I'd hazard 50TB.
        
         | smokel wrote:
         | Simply add images and video, and these estimates start to sound
         | like the "640 KB should be enough for everyone".
         | 
         | After that, make the robots explore and interact with the world
         | by themselves, to fetch even more data.
         | 
         | In all seriousness, adding image and interaction data will
         | probably be enormously useful, even for generating text.
        
           | netcan wrote:
           | Like both will be done. Idk what the roi is on adding video
           | data to the text models, but it's presumably lower than text.
           | 
           | There are just a lot of avenues to try at this point.
        
             | llSourcell wrote:
              | No, it's not lower than text; it's higher ROI than text for
              | understanding the physics of the world, which is exactly
              | what video is better at than text when it comes to
              | training data.
        
               | AstroBen wrote:
               | Does that transfer, though? I'm not sure we can expect
               | its ability to approximate physics in video form would
               | transfer to any other mode (text, code, problem solving
               | etc)
        
               | ricopags wrote:
               | depends on the hyperparams but one of the biggest
               | benefits of a latent space is transfer between modalities
        
         | generalizations wrote:
         | > has not yielded improvements (cf. gpt4.5 vs 4o).
         | 
         | FWIW there is a huge difference between 4.5 and 4o.
        
         | charcircuit wrote:
         | >The public web is ~50TB
         | 
         | Did you mean to type EB?
        
           | gosub100 wrote:
           | Only if you included all images and video
        
         | andrepd wrote:
         | > 50TB
         | 
         | There's no way the entire Web fits in 400$ worth of hard
         | drives.
        
           | AlienRobot wrote:
           | Text is small.
        
           | flir wrote:
           | Nah, Common Crawl puts on 250TB a month.
           | 
           | Maybe text only, though...
        
         | fouc wrote:
         | Maybe you're thinking of Library of Congress when you say
         | ~50TB? Internet is definitely larger..
        
         | rain1 wrote:
         | This is kind of related to the jack morris post
         | https://blog.jxmo.io/p/there-are-no-new-ideas-in-ai-only he
         | discusses how the big leaps in LLMs have mostly come - not so
         | much from new training methods or arch. changes as such - but
          | from the ability of new archs. to ingest _more_ data.
        
         | layer8 wrote:
         | Just a nitpick, but please don't misuse big O notation like
         | that. Any fixed storage amount is O(100TB).
        
       | christianqchung wrote:
       | This is a bad article. Some of the information is wrong, and it's
       | missing lots of context.
       | 
       | For example, it somehow merged Llama 4 Maverick's custom Arena
       | chatbot version with Behemoth, falsely claiming that the former
       | is stopping the latter from being released. It also claims 40B of
       | internet text data is 10B tokens, which seems a little odd. Llama
       | 405B was also trained on more than 15 trillion tokens[1], but the
       | post claims only 3.67 trillion for some reason. It also doesn't
       | mention Mistral large for some reason, even though it's the first
       | good European 100B+ dense model.
       | 
       | >The MoE arch. enabled larger models to be trained and used by
       | more people - people without access to thousands of
       | interconnected GPUs
       | 
       | You still need thousands of GPUs to train a MoE model of any
       | actual use. This is true for inference in the sense that it's
       | faster I guess, but even that has caveats because MoE models are
       | less powerful than dense models of the same size, though the
       | trade-off has apparently been worth it in many cases. You also
       | didn't need thousands of GPUs to do inference before, even for
       | the largest models.
       | 
       | The conclusion is all over the place, and has lots of just weird
       | and incorrect implications. The title is about how big LLMs are,
       | why is there such a focus on token training count? Also no
       | mention of quantized size. This is a bad AI slop article (whoops,
       | turns out the author accidentally said it was AI generated, so
       | it's a bad human slop article).
       | 
       | [1] https://ai.meta.com/blog/meta-llama-3-1/
        
         | rain1 wrote:
         | I can correct mistakes.
         | 
         | > it somehow merged Llama 4 Maverick's custom Arena chatbot
         | version with Behemoth
         | 
         | I can clarify this part. I wrote 'There was a scandal as
         | facebook decided to mislead people by gaming the lmarena
         | benchmark site - they served one version of llama-4 there and
         | released a different model' which is true.
         | 
         | But it is inside the section about the llama 4 model behemoth.
         | So I see how that could be confusing/misleading.
         | 
         | I could restructure that section a little to improve it.
         | 
         | > Llama 405B was also trained on more than 15 trillion
         | tokens[1],
         | 
         | You're talking about Llama 405B instruct, I'm talking about
          | Llama 405B base. Of course the instruct model has been trained
         | on more tokens.
         | 
         | > why is there such a focus on token training count?
         | 
         | I tried to include the rough training token count for each
         | model I wrote about - plus additional details about training
         | data mixture if available. Training data is an important part
         | of an LLM.
        
       | fossa1 wrote:
       | It's ironic: for years the open-source community was trying to
       | match GPT-3 (175B dense) with 30B-70B models + RLHF + synthetic
       | data--and the performance gap persisted.
       | 
       | Turns out, size really did matter, at least at the base model
       | level. Only with the release of truly massive dense (405B) or
       | high-activation MoE models (DeepSeek V3, DBRX, etc) did we start
       | seeing GPT-4-level reasoning emerge outside closed labs.
        
       | stared wrote:
       | If you want it visually, here's a chart of total parameters as a
       | function of year: https://app.charts.quesma.com/s/rmyk38
        
         | rain1 wrote:
         | This is really awesome. Thank you for creating that. I included
         | a screenshot and link to the chart with credit to you in a
         | comment to my post.
        
           | stared wrote:
           | I am happy you like it!
           | 
           | If you like darker color scheme, here it is:
           | 
           | https://app.charts.quesma.com/s/f07qji
           | 
           | And active vs total:
           | 
           | https://app.charts.quesma.com/s/4bsqjs
        
         | rain1 wrote:
         | I think that one thing that this chart makes visually very
          | clear is the point I made about GPT-3 being such a huge leap, and
         | there being a long gap before anybody was able to match it.
        
       | ljoshua wrote:
       | Less a technical comment and more just a mind-blown comment, but
       | I still can't get over _just how much data_ is compressed into
       | and available in these downloadable models. Yesterday I was on a
       | plane with no WiFi, but had gemma3:12b downloaded through Ollama.
       | Was playing around with it and showing my kids, and we fired
       | history questions at it, questions about recent video games, and
       | some animal fact questions. It wasn't perfect, but holy cow the
       | breadth of information that is embedded in an 8.1 GB file is
       | incredible! Lossy, sure, but a pretty amazing way of compressing
       | all of human knowledge into something incredibly contained.
        
         | ljlolel wrote:
         | How big is Wikipedia text? Within 3X that size with 100%
         | accuracy
        
           | phkahler wrote:
           | Google AI response says this for compressed size of
           | wikipedia:
           | 
           | "The English Wikipedia, when compressed, currently occupies
           | approximately 24 GB of storage space without media files.
           | This compressed size represents the current revisions of all
           | articles, but excludes media files and previous revisions of
           | pages, according to Wikipedia and Quora."
           | 
           | So 3x is correct but LLMs are lossy compression.
        
         | rain1 wrote:
         | It's extremely interesting how powerful a language model is at
         | compression.
         | 
         | When you train it to be an assistant model, it's better at
         | compressing assistant transcripts than it is general text.
         | 
          | There is an eval which I have a lot of interest in and
         | respect for
         | https://huggingface.co/spaces/Jellyfish042/UncheatableEval
         | called UncheatableEval, which tests how good of a language
          | model an LLM is by applying it to a range of compression tasks.
         | 
         | This task is essentially impossible to 'cheat'. Compression is
         | a benchmark you cannot game!
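          | 
          | For a rough sense of how that works, here is a minimal sketch
          | (assuming the Hugging Face transformers library; "gpt2" and
          | "sample.txt" are just stand-ins). It scores a short text by its
          | cross-entropy under the model, which is the size an arithmetic
          | coder driven by the model could compress it to:
          | 
          |     import math, torch
          |     from transformers import AutoModelForCausalLM, AutoTokenizer
          | 
          |     name = "gpt2"  # stand-in; any causal LM works the same way
          |     tok = AutoTokenizer.from_pretrained(name)
          |     model = AutoModelForCausalLM.from_pretrained(name).eval()
          | 
          |     text = open("sample.txt").read()  # short text, fits context
          |     ids = tok(text, return_tensors="pt").input_ids
          | 
          |     with torch.no_grad():
          |         # mean cross-entropy per predicted token, in nats
          |         loss = model(ids, labels=ids).loss.item()
          | 
          |     bits = loss * (ids.shape[1] - 1) / math.log(2)
          |     print(bits / len(text.encode("utf-8")), "bits per byte")
          | 
          | Lower bits per byte means the model predicts the text better,
          | and the only way to score well on fresh text is to actually be
          | a better language model.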
        
           | MPSimmons wrote:
           | Agreed. It's basically lossy compression for everything it's
           | ever read. And the quantization impacts the lossiness, but
           | since a lot of text is super fluffy, we tend not to notice as
           | much as we would when we, say, listen to music that has been
           | compressed in a lossy way.
        
             | entropicdrifter wrote:
             | It's a bit like if you trained a virtual band to play any
             | song ever, then told it to do its own version of the songs.
             | Then prompted it to play whatever specific thing you
             | wanted. It won't be the same because it _kinda_ remembers
              | the right thing _sorta_, but it's also winging it.
        
           | soulofmischief wrote:
           | Knowledge is learning relationships by decontextualizing
           | information into generalized components. Application of
           | knowledge is recontextualizing these components based on the
           | problem at hand.
           | 
           | This is essentially just compression and decompression. It's
           | just that with prior compression techniques, we never tried
           | leveraging the inherent relationships encoded in a compressed
           | data structure, because our compression schemes did not
           | leverage semantic information in a generalized way and thus
           | did not encode very meaningful relationships other than "this
           | data uses the letter 'e' quite a lot".
           | 
           | A lot of that comes from the sheer amount of data we throw at
           | these models, which provide enough substrate for semantic
           | compression. Compare that to common compression schemes in
           | the wild, where data is compressed in isolation without
           | contributing its information to some model of the world. It
           | turns out that because of this, we've been leaving _a lot_ on
           | the table with regards to compression. Another factor has
            | been the speed/efficiency tradeoff. GPUs have allowed us to
           | put a lot more into efficiency, and the expectations that
           | many language models only need to produce text as fast as it
           | can be read by a human means that we can even further
           | optimize for efficiency over speed.
           | 
           | Also, shout out to Fabrice Bellard's ts_zip, which leverages
           | LLMs to compress text files. https://bellard.org/ts_zip/
        
         | exe34 wrote:
         | Wikipedia is about 24GB, so if you're allowed to drop 1/3 of
         | the details and make up the missing parts by splicing in random
         | text, 8GB doesn't sound too bad.
         | 
         | To me the amazing thing is that you can tell the model to do
         | something, even follow simple instructions in plain English,
         | like make a list or write some python code to do $x, that's the
         | really amazing part.
        
           | bbarnett wrote:
           | Not to mention, _Language Modeling is Compression_
           | https://arxiv.org/pdf/2309.10668
           | 
           | So text wikipedia at 24G would easily hit 8G with many
           | standard forms of compression, I'd think. If not better. And
           | it would be 100% accurate, full text and data. Far more
           | usable.
           | 
           | It's so easy for people to not realise how _massive_ 8GB
           | really is, in terms of text. Especially if you use ascii
           | instead of UTF.
        
             | horsawlarway wrote:
             | The 24G is the compressed number.
             | 
             | They host a pretty decent article here:
             | https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
             | 
             | The relevant bit:
             | 
             | > As of 16 October 2024, the size of the current version
             | including all articles compressed is about 24.05 GB without
             | media.
        
               | bbarnett wrote:
               | Nice link, thanks.
               | 
                | Well, I'll fall back to a different position, and say
                | one is lossy, the other not.
        
           | Nevermark wrote:
           | It blows my mind that I can ask for 50 synonyms, instantly
           | get a great list with great meaning summaries.
           | 
           | Then ask for the same list sorted and get that nearly
            | instantly.
           | 
           | These models have a short time context for now, but they
           | already have _a huge "working memory" relative to us_.
           | 
           | It is very cool. And indicative that vastly smarter models
           | are going to be achieved fairly easily, with new insight.
           | 
           | Our biology has had to ruthlessly work within our
           | biological/ecosystem energy envelope, and with the limited
           | value/effort returned by a pre-internet pre-vast economy.
           | 
           | So biology has never been able to scale. Just get marginally
           | more efficient and effective within tight limits.
           | 
           | Suddenly, (in historical, biological terms), energy
           | availability limits have been removed, and limits on the
           | value of work have compounded and continue to do so.
           | Unsurprising that those changes suddenly unlock easily
           | achieved vast untapped room for cognitive upscaling.
        
             | Wowfunhappy wrote:
             | > These models [...] have a huge "working memory" relative
             | to us. [This is] indicative that vastly smarter models are
             | going to be achieved fairly easily, with new insight.
             | 
             | I don't think your second sentence logically follows from
             | the first.
             | 
             | Relative to us, these models:
             | 
             | - Have a much larger working memory.
             | 
             | - Have much more limited logical reasoning skills.
             | 
             | To some extent, these models are able to use their superior
             | working memories to compensate for their limited reasoning
             | abilities. This can make them very useful tools! But there
             | may well be a ceiling to how far that can go.
             | 
             | When you ask a model to "think about the problem step by
             | step" to improve its reasoning, you are basically just
             | giving it more opportunities to draw on its huge memory
             | bank and try to put things together. But humans are able to
             | reason with orders of magnitude less training data. And by
             | the way, we are out of new training data to give the
             | models.
        
               | antonvs wrote:
               | > Have much more limited logical reasoning skills.
               | 
               | Relative to the best humans, perhaps, but I seriously
               | doubt this is true in general. Most people I work with
               | couldn't reason nearly as well through the questions I
               | use LLMs to answer.
               | 
               | It's also worth keeping in mind that having a different
               | approach to reasoning is not necessarily equivalent to a
               | worse approach. Watch out for cherry-picking the cons of
               | its approach and ignoring the pros.
        
               | exe34 wrote:
               | > Relative to the best humans,
               | 
               | For some reason, the bar for AI is always against the
               | best possible human, right now.
        
               | exe34 wrote:
               | > But humans are able to reason with orders of magnitude
               | less training data.
               | 
               | Common belief, but false. You start learning from inside
               | the womb. The data flow increases exponentially when you
               | open your eyes and then again when you start manipulating
               | things with your hands and mouth.
               | 
               | > When you ask a model to "think about the problem step
               | by step" to improve its reasoning, you are basically just
               | giving it more opportunities to draw on its huge memory
               | bank and try to put things together.
               | 
               | We do the same with children. At least I did it to my
               | classmates when they asked me for help. I'd give them a
               | hint, and ask them to work it out step by step from
               | there. It helped.
        
               | Wowfunhappy wrote:
               | > Common belief, but false. You start learning from
               | inside the womb. The data flow increases exponentially
               | when you open your eyes and then again when you start
               | manipulating things with your hands and mouth.
               | 
               | But you don't get data equal to _the entire internet_ as
               | a child!
               | 
               | > We do the same with children. At least I did it to my
               | classmates when they asked me for help. I'd give them a
               | hint, and ask them to work it out step by step from
               | there. It helped.
               | 
               | And I do it with my students. I still think there's a
               | difference in kind between when I listen to my students
               | (or other adults) reason through a problem, and when I
               | look at the output of an AI's reasoning, but I admittedly
               | couldn't tell you what that is, so point taken. I still
               | think the AI is relying far more heavily on its knowledge
               | base.
        
               | jacobr1 wrote:
               | There seems to be lots of mixed data points on this, but
               | to some extent there is knowledge encoded into the
               | evolutionary base state of the new human brain. Probably
               | not directly as knowledge, but "primed" to quickly to
               | establish relevant world models and certain types of
               | reasoning with a small number of examples.
        
               | oceanplexian wrote:
               | Your field of vision is equivalent to something like 500
               | Megapixels. And assume it's uncompressed because it's not
               | like your eyeballs are doing H.264.
               | 
               | Given vision and the other senses, I'd argue that your
               | average toddler has probably trained on more sensory
               | information than the largest LLMs ever built long before
               | they learn to talk.
        
               | jacobr1 wrote:
               | > And by the way, we are out of new training data to give
               | the models.
               | 
               | Only easily accessible text data. We haven't really
               | started using video at scale yet for example. It looks
               | like data for specific tasks goes really far too ... for
               | example agentic coding interactions aren't something that
               | has generally been captured on the internet. But
               | capturing interactions with coding agents, in combination
               | with the base-training of existing programming knowledge
               | already captured is resulting in significant performance
                | increases. The amount of specialized data we might need
               | to gather or synthetically generate is perhaps orders of
                | magnitude less than presumed with pure supervised
               | learning systems. And for other applications like
               | industrial automation or robotics we've barely started
               | capturing all the sensor data that lives in those
               | systems.
        
               | Nevermark wrote:
               | My response completely acknowledged their current
               | reasoning limits.
               | 
               | But in evolutionary time frames, clearly those limits are
               | lifting extraordinarily quickly. By many orders of
               | magnitude.
               | 
               | And the point I made, that our limits were imposed by
                | harsh biological energy and reward limits, vs. today's
               | models (and their successors) which have access to
               | relatively unlimited energy, and via sharing value with
               | unlimited customers, unlimited rewards, stands.
               | 
               | It is a much simpler problem to improve digital cognition
               | in a global ecosystem of energy production, instant
               | communication and global application, than it was for
               | evolution to improve an individual animals cognition in
               | the limited resources of local habitats and their
               | inefficient communication of advances.
        
         | Workaccount2 wrote:
         | I don't like the term "compression" used with transformers
         | because it gives the wrong idea about how they function. Like
         | that they are a search tool glued onto a .zip file, your
         | prompts are just fancy search queries, and hallucinations are
         | just bugs in the recall algo.
         | 
         | Although strictly speaking they have lots of information in a
         | small package, they are F-tier compression algorithms because
         | the loss is bad, unpredictable, and undetectable (i.e. a human
         | has to check it). You would almost never use a transformer in
         | place of any other compression algorithm for typical data
         | compression uses.
        
           | Wowfunhappy wrote:
           | A .zip is lossless compression. But we also have plenty of
           | lossy compression algorithms. We've just never been able to
           | use lossy compression on text.
        
             | Workaccount2 wrote:
             | >We've just never been able to use lossy compression on
             | text.
             | 
             | ...and we still can't. If your lawyer sent you your case
             | files in the form of an LLM trained on those files, would
             | you be comfortable with that? Where is the situation you
             | would compress text with an LLM over a standard compression
             | algo? (Other than to make an LLM).
             | 
             | Other lossy compression targets known superfluous
             | information. MP3 removes sounds we can't really hear, and
             | JPEG works by grouping uniform color pixels into single
             | chunks of color.
             | 
             | LLM's kind of do their own thing, and the data you get back
             | out of them is correct, incorrect, or dangerously incorrect
             | (i.e. is plausible enough to be taken as correct), with no
             | algorithmic way to discern which is which.
             | 
             | So while yes, they do compress data and you can measure it,
             | the output of this "compression algorithm" puts in it the
             | same family as a "randomly delete words and thesaurus long
             | words into short words" compression algorithms. Which I
             | don't think anyone would consider to compress their
             | documents.
        
               | esafak wrote:
               | People summarize (compress) documents with LLMs all day.
               | With legalese the application would be to summarize it in
               | layman's terms, while retaining the original for legal
               | purposes.
        
               | Workaccount2 wrote:
               | Yes, and we all know (ask teachers) how reliable those
               | summaries are. They are _randomly_ lossy, which makes
               | them unsuitable for any serious work.
               | 
               | I'm not arguing that LLMs don't compress data, I am
               | arguing that they are technically compression tools, but
               | not colloquially compression tools, and the overlap they
               | have with colloquial compression tools is almost zero.
        
               | menaerus wrote:
               | At this moment LLMs are used for much of the serious work
               | across the globe so perhaps you will need to readjust
               | your line of thinking. There's nothing inherently better
               | or more trustworthy to have a person compile some
               | knowledge than, let's say, a computer algorithm in this
               | case. I place my bets on the latter to have better
               | output.
        
               | esafak wrote:
               | > They are randomly lossy, which makes them unsuitable
               | for any serious work.
               | 
               | Ask ten people and they'll give ten different summaries.
               | Are humans unsuitable too?
        
               | Workaccount2 wrote:
               | Yes, which is why we write things down, and when those
               | archives become too big we use lossless compression on
               | them, because we cannot tolerate a compression tool that
               | drops the street address of a customer or even worse,
               | hallucinates a slightly different one.
        
               | Wowfunhappy wrote:
               | But lossy compression algorithms for e.g. movies and
               | music are also non-deterministic.
               | 
               | I'm not making an argument about whether the compression
               | is good or useful, just like I don't find 144p bitrate
               | starved videos particularly useful. But it doesn't seem
               | so unlike other types of compression to me.
        
               | antonvs wrote:
               | > LLM's kind of do their own thing, and the data you get
               | back out of them is correct, incorrect, or dangerously
               | incorrect (i.e. is plausible enough to be taken as
               | correct), with no algorithmic way to discern which is
               | which.
               | 
               | Exactly like information from humans, then?
        
               | tshaddox wrote:
               | > If your lawyer sent you your case files in the form of
               | an LLM trained on those files, would you be comfortable
               | with that?
               | 
               | If the LLM-based compression method was well-understood
               | and demonstrated to be reliable, I wouldn't oppose it on
               | principle. If my lawyer didn't know what they were doing
               | and threw together some ChatGPT document transfer system,
               | of course I wouldn't trust it, but I also wouldn't trust
               | my lawyer if they developed their own DCT-based lossy
               | image compression algorithm.
        
           | angusturner wrote:
           | There is an excellent talk by Jack Rae called "compression
           | for AGI", where he shows (what I believe to be) a little
           | known connection between transformers and compression;
           | 
            | In this view, LLMs are SOTA lossless compression
            | algorithms, where the model weights don't count towards
            | the description length. Sounds crazy but it's true.
        
             | Workaccount2 wrote:
             | A transformer that doesn't hallucinate (or knows what is a
             | hallucination) would be the ultimate compression algorithm.
             | But right now that isn't a solved problem, and it leaves
             | the output of LLMs too untrustworthy to use over what are
             | colloquially known as compression algorithms.
        
               | Nevermark wrote:
               | It is still task related.
               | 
               | Compressing a comprehensive command line reference via
               | model might introduce errors and drop some options.
               | 
               | But for many people, especially new users, referencing
               | commands, and getting examples, via a model would
                | deliver many times the value.
               | 
               | Lossy vs. lossless are fundamentally different, but so
               | are use cases.
        
             | swyx wrote:
             | his talk here https://www.youtube.com/watch?v=dO4TPJkeaaU
             | 
             | and his last before departing for Meta Superintelligence
             | https://www.youtube.com/live/U-fMsbY-
             | kHY?si=_giVEZEF2NH3lgxI...
        
         | Wowfunhappy wrote:
         | How does this compare to, say, the compression ratio of a
         | lossless 8K video and a 240p Youtube stream of the same video?
        
         | agumonkey wrote:
         | Intelligence is compression some say
        
           | Nevermark wrote:
           | Very much so!
           | 
           | The more and faster a "mind" can infer, the less it needs to
           | store.
           | 
            | Think how many fewer facts a symbolic system that can perform
           | calculus needs to store, vs. an algebraic, or just arithmetic
           | system, to cover the same numerical problem solving space.
           | Many orders of magnitude less.
           | 
           | The same goes for higher orders of reasoning. General or
           | specific subject related.
           | 
           | And higher order reasoning vastly increases capabilities
           | extending into new novel problem spaces.
           | 
           | I think model sizes may temporarily drop significantly, after
           | every major architecture or training advance.
           | 
           | In the long run, "A circa 2025 maxed M3 Ultra Mac Studio is
           | all you need!" (/h? /s? Time will tell.)
        
             | agumonkey wrote:
             | I don't know who else took notes by diffing their own
             | assumptions with lectures / talks. There was a notion of
             | what's really new compared to previous conceptual state,
             | what adds new information.
        
           | goatlover wrote:
           | How well does that apply to robotics or animal intelligence?
           | Manipulating the real world is more fundamental to human
           | intelligence than compressing text.
        
             | ToValueFunfetti wrote:
             | Under the predictive coding model (and I'm sure some
             | others), animal intelligence is also compression. The idea
             | is that the early layers of the brain minimize how
             | surprising incoming sensory signals are, so the later
             | layers only have to work with truly entropic signal. But it
             | has non-compression-based intelligence within those more
             | abstract layers.
        
               | goatlover wrote:
               | I just wonder if neuroscientists use that kind of model.
        
           | penguin_booze wrote:
           | I don't know why, but I was reminded of Douglas Hofstadter's
           | talk: Analogy is cognition:
           | https://www.youtube.com/watch?v=n8m7lFQ3njk&t=964s.
        
           | tshaddox wrote:
           | Some say that. But what I value even more than compression is
           | the ability to create new ideas which do not in any way exist
           | in the set of all previously-conceived ideas.
        
             | benreesman wrote:
             | I'm toying with the phrase "precedented originality" as a
             | way to describe the optimal division of labor when I work
             | with Opus 4 running hot (which is the first one where I
             | consistently come out ahead by using it). That model at
             | full flog seems to be very close to the asymptote for the
             | LLM paradigm on coding: they've really pulled out all the
             | stops (the temperature is so high it makes trivial
             | typographical errors, it will discuss just about anything,
             | it will churn for 10, 20, 30 seconds to first token via
             | API).
             | 
              | It's good enough that it has changed my mind about the
             | fundamental utility of LLMs for coding in non-Javascript
             | complexity regimes.
             | 
              | But it's still not an expert programmer, not by a million
             | miles, there is no way I could delegate my job to it (and
             | keep my job). So there's some interesting boundary that's
             | different than I used to think.
             | 
              | I think it's in the vicinity of "how much precedent exists
             | for this thought or idea or approach". The things I bring
             | to the table in that setting have precedent too, but much
             | more tenuously connected to like one clear precedent on
             | e.g. GitHub, because if the thing I need was on GitHub I
             | would download it.
        
           | hamilyon2 wrote:
           | Crystallized intelligence is. I am not sure about fluid
           | intelligence.
        
             | antisthenes wrote:
             | Fluid intelligence is just how quickly you acquire
             | crystallized intelligence.
             | 
             | It's the first derivative.
        
               | agumonkey wrote:
               | Talking about that, people designed a memory game, dual n
                | back, which allegedly improves fluid intelligence.
        
         | dgrabla wrote:
         | Back in the '90s, we joked about putting "the internet" on a
         | floppy disk. It's kind of possible now.
        
           | Lu2025 wrote:
           | Yeah, those guys managed to steal the internet.
        
         | Nevermark wrote:
         | It is truly incredible.
         | 
         | One factor, is the huge redundancies pervasive in our
         | communication.
         | 
         | (1) There are so many ways to say the same thing, that (2) we
         | have to add even more words to be precise at all. Without a
         | verbal indexing system we (3) spend many words just setting up
         | context for what we really want to say. And finally, (4) we
         | pervasively add a great deal of intentionally non-informational
         | creative and novel variability, and mood inducing color, which
         | all require even more redundancy to maintain reliable
         | interpretation, in order to induce our minds to maintain
         | attention.
         | 
         | Our minds are active resistors of plain information!
         | 
         | All four factors add so much redundancy, it's probably fair to
         | say most of our communication (by bits, characters, words,
          | etc.; maybe 95%, 98%, or more!) is pure redundancy.
         | 
          | Another helpful compressor is that many facts are among a few
         | "reasonably expected" alternative answers. So it takes just a
         | little biasing information to encode the right option.
         | 
         | Finally, the way we reason seems to be highly common across
         | everything that matters to us. Even though we have yet to
         | identify and characterize this informal human logic. So once
         | that is modeled, that itself must compress a lot of relations
         | significantly.
         | 
         | Fuzzy Logic was a first approximation attempt at modeling human
         | "logic". But has not been very successful.
         | 
         | Models should eventually help us uncover that "human logic", by
         | analyzing how they model it. Doing so may let us create even
         | more efficient architectures. Perhaps significantly more
         | efficient, and even provide more direct non-gradient/data based
         | "thinking" design.
         | 
         | Nevertheless, the level of compression is astounding!
         | 
          | We are far less complicated cognitive machines than we imagine!
         | Scary, but inspiring too.
         | 
         | I personally believe that common PCs of today, maybe even high
         | end smart phones circa 2025, will be large enough to run future
         | super intelligence when we get it right, given internet access
         | to look up information.
         | 
         | We have just begun to compress artificial minds.
        
         | nico wrote:
         | For reference (according to Google):
         | 
         | > The English Wikipedia, as of June 26, 2025, contains over 7
         | million articles and 63 million pages. The text content alone
         | is approximately 156 GB, according to Wikipedia's statistics
         | page. When including all revisions, the total size of the
         | database is roughly 26 terabytes (26,455 GB)
        
           | sharkjacobs wrote:
           | better point of reference might be pages-articles-
           | multistream.xml.bz2 (current pages without edit/revision
           | history, no talk pages, no user pages) which is 20GB
           | 
           | https://en.wikipedia.org/wiki/Wikipedia:Database_download#Wh.
           | ..?
        
             | inopinatus wrote:
             | this is a much more deserving and reliable candidate for
             | any labels regarding the breadth of human knowledge.
        
           | mapt wrote:
           | What happens if you ask this 8gb model "Compose a realistic
           | Wikipedia-style page on the Pokemon named Charizard"?
           | 
           | How close does it come?
        
           | pcrh wrote:
           | Wikipedia itself describes its size as ~25GB without media
           | [0]. And it's probably more accurate and with broader
           | coverage in multiple languages compared to the LLM downloaded
           | by the GP.
           | 
           | https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
        
             | pessimizer wrote:
             | Really? I'd assume that an LLM would deduplicate Wikipedia
             | into something much smaller than 25GB. That's its only job.
        
               | crazygringo wrote:
               | > _That 's its only job._
               | 
               | The vast, vast majority of LLM knowledge is not found in
               | Wikipedia. It is definitely not its only job.
        
         | tomkaos wrote:
          | Same thing with image models. A 4 GB Stable Diffusion model
          | can draw and represent anything humanity knows.
        
           | alternatex wrote:
           | How about a full glass of wine? Filled to the brim.
        
         | stronglikedan wrote:
         | I've been doing the AI course on Brilliant lately, and it's
         | mindblowing the techniques that they come up with to compress
         | the data.
        
         | thecosas wrote:
         | A neat project you (and others) might want to check out:
         | https://kiwix.org/
         | 
         | Lots of various sources that you can download locally to have
         | available offline. They're even providing some pre-loaded
         | devices in areas where there may not be reliable or any
         | internet access.
        
         | swyx wrote:
         | the study of language models from an information
         | theory/compression POV is a small field but increasingly impt
         | for efficiency/scaling - we did a discussion about this today
         | https://www.youtube.com/watch?v=SWIKyLSUBIc&t=2269s
        
         | divbzero wrote:
         | The _Encyclopaedia Britannica_ has about 40,000,000 words [1]
         | or about 0.25 GB if you assume 6 bytes per word. It's
         | impressive but not outlandish that an 8.1 GB file could encode
         | a large swath of human information.
         | 
         | [1]: https://en.wikipedia.org/wiki/Encyclopaedia_Britannica
        
         | holoduke wrote:
          | Yea. Same for an 8GB Stable Diffusion image generator. Sure not
         | the best quality. But there is so much information inside.
        
         | tasuki wrote:
         | 8.1 GB is a lot!
         | 
         | It is 64,800,000,000 bits.
         | 
         | I can imagine 100 bits sure. And 1,000 bits why not. 10,000 you
         | lose me. A million? That sounds like a lot. Now 64 million
         | would be a number I can't well imagine. And this is a thousand
         | times 64 million!
        
         | ysofunny wrote:
         | they're an upgraded version of self-executable zip files that
          | compress knowledge like mp3 compresses music, without knowing
          | exactly wtf either music or knowledge is
         | 
         | the self-execution is the interactive chat interface.
         | 
         | wikipedia gets "trained" (compiled+compressed+lossy) into an
         | executable you can chat with, you can pass this through another
         | pretrained A.I. than can talk out the text or transcribe it.
         | 
          | I think writing compilers is now officially a defunct skill,
          | kept for historical and conservation purposes more than
          | anything else; but I don't like saying "conservation", it's a
          | bad framing; I'd rather say "legacy connectivity", which is a
          | form of continuity or backwards compatibility.
        
         | mr_toad wrote:
         | I will never tire of pointing out that machine learning models
         | are compression algorithms, not compressed data.
        
           | inopinatus wrote:
           | I kinda made an argument the other day that they are high-
           | dimensional lossy decompression algorithms, which might be
           | the same difference but looking the other way through the
           | lens.
        
       | simonw wrote:
       | > There were projects to try to match it, but generally they
       | operated by fine tuning things like small (70B) llama models on a
       | bunch of GPT-3 generated texts (synthetic data - which can result
       | in degeneration when AI outputs are fed back into AI training
       | inputs).
       | 
       | That parenthetical doesn't quite work for me.
       | 
       | If synthetic data always degraded performance, AI labs wouldn't
       | use synthetic data. They use it because it helps them train
       | _better_ models.
       | 
       | There's a paper that shows that if you very deliberately train a
        | model on its own output in a loop you can get worse performance.
       | That's not what AI labs using synthetic data actually do.
       | 
       | That paper gets a lot of attention because the schadenfreude of
       | models destroying themselves through eating their own tails is
       | irresistible.
        
         | rybosome wrote:
         | Agreed, especially when in this context of training a smaller
         | model on a larger model's outputs. Distillation is generally
         | accepted as an effective technique.
         | 
         | This is exactly what I did in a previous role, fine-tuning
         | Llama and Mistral models on a mix of human and GPT-4 data for a
         | domain-specific task. Adding (good) synthetic data definitely
         | increased the output quality for our tasks.
        
           | rain1 wrote:
           | Yes but just purely in terms of entropy, you can't make a
           | model better than GPT-4 by training it on GPT-4 outputs. The
           | limit you would converge towards is GPT-4.
        
             | simonw wrote:
             | A better way to think about synthetic data is to consider
             | code. With code you can have an LLM generate code with
             | tests, then confirm that the code compiles and the tests
             | pass. Now you have semi-verified new code you can add to
             | your training data, and training on that will help you get
             | better results for code even though it was generated by a
             | "less good" LLM.
        
       | 1vuio0pswjnm7 wrote:
       | 1. "raw text continuation engine"
       | 
       | https://gist.github.com/rain-1/cf0419958250d15893d8873682492...
       | 
       | 2. "superintelligence"
       | 
       | https://en.m.wikipedia.org/wiki/Superintelligence
       | 
       | "Meta is uniquely positioned to deliver superintelligence to the
       | world."
       | 
       | https://www.cnbc.com/2025/06/30/mark-zuckerberg-creating-met...
       | 
       | Is there any difference between 1 and 2
       | 
       | Yes. One is purely hypothetical
        
       | angusturner wrote:
       | I wish people would stop parroting the view that LLMs are lossy
       | compression.
       | 
       | There is kind of a vague sense in which this metaphor holds, but
       | there is a much more interesting and rigorous fact about LLMs
       | which is that they are also _lossless_ compression algorithms.
       | 
       | There are at least two senses in which this is true:
       | 
        | 1. You can use an LLM to losslessly compress any piece of text
        | at a cost (in bits) that approaches the negative log-likelihood
        | of that text under the model, using arithmetic coding. A sender
        | and receiver both need a
       | copy of the LLM weights.
       | 
        | 2. You can use an LLM plus SGD (i.e. the training code) as a
       | lossless compression algorithm, where the communication cost is
       | area under the training curve (and the model weights don't count
        | towards description length!); see Jack Rae, "compression for AGI".
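        | 
        | To make point 1 concrete, here's a toy sketch where a character
        | bigram model stands in for the LLM (a real arithmetic coder gets
        | within a couple of bits of this total): the achievable compressed
        | size is just the sum of -log2 p over the text, so the better the
        | model predicts, the fewer bits you pay.
        | 
        |     import math
        |     from collections import Counter
        | 
        |     # sender and receiver are assumed to share this "model"
        |     corpus = open("shared_corpus.txt").read()
        |     pair_counts = Counter(zip(corpus, corpus[1:]))
        |     ctx_counts = Counter(corpus[:-1])
        | 
        |     def p(nxt, prev, vocab=256):
        |         # add-one smoothed bigram probability
        |         return ((pair_counts[(prev, nxt)] + 1)
        |                 / (ctx_counts[prev] + vocab))
        | 
        |     msg = "hello world"
        |     bits = sum(-math.log2(p(b, a)) for a, b in zip(msg, msg[1:]))
        |     print(f"~{bits:.1f} bits for {len(msg)} chars")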
        
         | actionfromafar wrote:
         | Re 1 - classical compression is also extremely effective if
         | both sender and receiver have access to the same huge
         | dictionary.
        
       | kamranjon wrote:
       | This is somehow missing the Gemma and Gemini series of models
       | from Google. I also think that not mentioning the T5 series of
       | models is strange from a historical perspective because they sort
       | of pioneered many of the concepts in transfer learning and kinda
       | kicked off quite a bit of interest in this space.
        
         | rain1 wrote:
         | The Gemma models are too small to be included in this list.
         | 
         | You're right the T5 stuff is very important historically but
         | they're below 11B and I don't have much to say about them.
         | Definitely a very interesting and important set of models
         | though.
        
           | tantalor wrote:
           | > too small
           | 
           | Eh?
           | 
           | * Gemma 1 (2024): 2B, 7B
           | 
           | * Gemma 2 (2024): 2B, 9B, 27B
           | 
           | * Gemma 3 (2025): 1B, 4B, 12B, 27B
           | 
           | This is the same range as some Llama models which you do
           | mention.
           | 
           | > important historically
           | 
           | Aren't you trying to give a historical perspective? What's
           | the point of this?
        
           | kamranjon wrote:
           | Since you included GPT-2, everything from Google including T5
           | would qualify for the list I would think.
        
       | lukeschlather wrote:
       | This is a really nice writeup.
       | 
       | That said, there's an unstated assumption here that these truly
       | large language models are the most interesting thing. The big
       | players have been somewhat quiet but my impression from the
       | outside is that OpenAI let a little bit leak with their behavior.
       | They built an even larger model and it turned out to be
       | disappointing so they quietly discontinued it. The most powerful
       | frontier reasoning models may actually be smaller than the
       | largest publicly available models.
        
       | bobsmooth wrote:
       | There's got to be tons of books that remain undigitized that can
       | be mined for training data, hasn't there?
        
       ___________________________________________________________________
       (page generated 2025-07-02 23:00 UTC)