[HN Gopher] How large are large language models?
___________________________________________________________________
How large are large language models?
Author : rain1
Score : 218 points
Date : 2025-07-02 10:39 UTC (12 hours ago)
(HTM) web link (gist.github.com)
(TXT) w3m dump (gist.github.com)
| unwind wrote:
| Meta: The inclusion of the current year ("(2025)") in the title
| is strange, even though it's in the actual title of the linked-to
| post, repeating it here makes me look around for the time machine
| controls.
| dale_glass wrote:
| How big are those in terms of size on disk and VRAM size?
|
| Something like 1.61B just doesn't mean much to me since I don't
| know much about the guts of LLMs. But I'm curious about how that
| translates to computer hardware -- what specs would I need to run
| these? What could I run now, what would require spending some
| money, and what I might hope to be able to run in a decade?
| mjburgess wrote:
| At 1 byte/param that's 1.6GB (fp8); at 2 bytes (fp16) that's
| 3.2GB -- but there are other space costs beyond loading the
| parameters onto the GPU. So a rule of thumb is ~4x the
| parameter count in GB. So round up: 2B -> 2*4 = 8GB VRAM.
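|
| A minimal sketch of that rule of thumb in Python (the 4x
| multiplier is just the rough heuristic above, not a measured
| figure):
|
|     def vram_estimate_gb(params_billions: float) -> float:
|         # weights alone are ~1 GB per billion params at 8-bit,
|         # ~2 GB at fp16; the 4x also covers activations, KV cache
|         # and other runtime overhead
|         return 4 * params_billions
|
|     print(vram_estimate_gb(2))  # -> 8 (GB), the 2B example above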
| loudmax wrote:
| Most of these models have been trained using 16-bit weights. So
| a 1 billion parameter model takes up 2 gigabytes.
|
| In practice, models can be quantized to smaller weights for
| inference. Usually, the performance loss going from 16 bit
| weights to 8 bit weights is very minor, so a 1 billion
| parameter model can take 1 gigabyte. Thinking about these
| models in terms of 8-bit quantized weights has the added
| benefit of making the math really easy. A 20B model needs 20G
| of memory. Simple.
|
| Of course, models can be quantized down even further, at
| greater cost of inference quality. Depending on what you're
| doing, 5-bit weights or even lower might be perfectly
| acceptable. There's some indication that models that have been
| trained on lower bit weights might perform better than larger
| models that have been quantized down. For example, a model that
| was trained using 4-bit weights might perform better than a
| model that was trained at 16 bits, then quantized down to 4
| bits.
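|
| A minimal sketch of that math in Python (weights only; the bit
| widths are the common quantization levels mentioned above):
|
|     BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "q5": 5, "q4": 4}
|
|     def weights_size_gb(params_billions: float, fmt: str) -> float:
|         # size of the weights alone, ignoring KV cache and
|         # other runtime overhead
|         return params_billions * BITS_PER_WEIGHT[fmt] / 8
|
|     for fmt in BITS_PER_WEIGHT:
|         print(fmt, weights_size_gb(20, fmt), "GB")
|     # a 20B model: fp16 40.0, int8 20.0, q5 12.5, q4 10.0 GB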
|
| When running models, a lot of the performance bottleneck is
| memory bandwidth. This is why LLM enthusiasts are looking for
| GPUs with the most possible VRAM. Your computer might have 128G
| of RAM, but your GPU's access to that memory is so constrained
| by bandwidth that you might as well run the model on your CPU.
| Running a model on the CPU can be done, it's just much slower
| because the computation is so parallel and CPUs have far less
| parallel throughput and memory bandwidth than GPUs.
|
| Today's higher-end consumer-grade GPUs typically have up to 24G
| of dedicated VRAM (the new Nvidia RTX 5090 has 32G of VRAM and
| they're like $2k). The dedicated VRAM on a GPU has a memory
| bandwidth of about 1 TB/s. Apple's M-series of ARM-based CPUs
| have up to 512 GB/s of bandwidth, and they're one of the most
| popular ways of being able to run larger LLMs on consumer
| hardware. AMD's new "Strix Halo" CPU+GPU chips have up to 128G
| of unified memory, with a memory bandwidth of about 256 GB/s.
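|
| A very rough first-order sketch of why bandwidth dominates
| (assumes single-stream decoding reads all the weights once per
| generated token; ignores batching, KV cache reads and compute
| limits):
|
|     def rough_tokens_per_sec(weights_gb: float, bandwidth_gb_s: float) -> float:
|         return bandwidth_gb_s / weights_gb
|
|     # the 8-bit 20B example above (~20 GB of weights):
|     print(rough_tokens_per_sec(20, 1000))  # ~50 tok/s at ~1 TB/s VRAM
|     print(rough_tokens_per_sec(20, 256))   # ~13 tok/s at ~256 GB/s unified memory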
|
| Reddit's r/LocalLLaMA is a reasonable place to look to see what
| people are doing with consumer grade hardware. Of course, some
| of what they're doing is bonkers so don't take everything you
| see there as a guide.
|
| And as far as a decade from now, who knows. Currently, the top
| silicon fabs of TSMC, Samsung, and Intel are all working flat-
| out to meet the GPU demand from hyperscalers rolling out
| capacity (Microsoft Azure, AWS, Google, etc). Silicon chip
| manufacturing has traditionally followed a boom/bust cycle. But
| with geopolitical tensions, global trade barriers, AI-driven
| advances, and whatever other black swan events, what the next
| few years will look like is anyone's guess.
| OtherShrezzing wrote:
| >None of this document was not written by AI
|
| I think in these scenarios, articles should include the prompt
| and generating model.
| oc1 wrote:
| You are absolutely right! The AI slop is getting out of
| control.
| WesolyKubeczek wrote:
| I don't think the author knows that double negatives in English
| in a sentence like this cancel, not reinforce, each other.
| kylecazar wrote:
| I thought this was an accidental double negative by the author
| -- trying to declare they wrote it themselves.
|
| There are some signs it's possibly written by a non-native
| speaker.
| rain1 wrote:
| I have corrected that. It was supposed to say "None of this
| document was written by AI."
|
| Thank you for spotting the error.
| OtherShrezzing wrote:
| Understood, thanks for updating it!
| mjburgess wrote:
| Deepseek v3 is ~670Bn params, which is ~1.4TB physical.
|
| All digitized books ever written/encoded compress to a few TB.
| The public web is ~50TB. I think a usable zip of all english
| electronic text publicly available would be on O(100TB). So we're
| at about 1% of that in model size, and we're in a diminishing-
| returns area of training -- ie., going to >1% has not yielded
| improvements (cf. gpt4.5 vs 4o).
|
| This is why compute spend is moving to inference time with
| "reasoning" models. It's likely we're close to diminishing
| returns on inference-time compute now too, hence agents, whereby
| (mostly) deterministic tools supply additional
| information/capability to the system.
|
| I think to get any more value out of this model class, we'll be
| looking at domain-specific specialisation beyond instruction
| fine-tuning.
|
| I'd guess targeting 1TB inference-time VRAM would be a reasonable
| medium-term target for high quality open source models -- that's
| within the reach of most SMEs today. That's about 250bn params.
| account-5 wrote:
| > All digitized books ever written/encoded compress to a few
| TB. The public web is ~50TB. I think a usable zip of all
| english electronic text publicly available would be on
| O(100TB).
|
| Where you getting these numbers from? Interested to see how
| that's calculated.
|
| I read somewhere, but cannot find the source anymore, that all
| written text prior to this century was approx 50MB. (Might be
| misquoting it, as I don't have the source anymore.)
| WesolyKubeczek wrote:
| Maybe prior to the prior century, and even then I smell a lot
| of bullshit. I mean, just look at the Project Gutenberg. Even
| plaintext only, even compressed.
| bravesoul2 wrote:
| Even Shakespeare alone needs 4 floppy disks.
| kmm wrote:
| Perhaps that's meant to be 50GB (and that still seems like a
| serious underestimation)? Just the Bible is already 5MB.
| _Algernon_ wrote:
| English Wikipedia without media alone is ~24 GB
| _compressed_.
|
| https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
| kmm wrote:
| I don't see how the size of Wikipedia has any bearing on
| the 50MB figure given for pre-20th century literature by
| the parent.
| mjburgess wrote:
| Anna's Archive full torrent is O(1PB), project gutenberg is
| O(1TB), many AI training torrents are reported in the O(50TB)
| range.
|
| Extract just the plain text from that (+social media, etc.),
| remove symbols outside of a 64 symbol alphabet (6 bits) and
| compress. "Feels" to me around a 100TB max for absolutely
| everything.
|
| Either way, full-fat LLMs are operating at 1-10% of this
| scale, depending how you want to estimate it.
|
| If you run a more aggressive filter on that 100TB, eg., for a
| more semantic dedup, there's a plausible argument for
| "information" in english texts available being ~10TB -- then
| we're running close to 20% of that in LLMs.
|
| If we take LLMs to just be that "semantic compression
| algorithm", and supposing the maximum useful size of an LLM
| is 2TB, then you could run the argument that everything
| "salient" ever written is <10TB.
|
| Taking LLMs to be running at close to 50% of "everything useful"
| rather than 1% would be an explanation of why training has
| capped out.
|
| I think the issue has at least as much to do with what we're
| using LLMs for -- i.e., instruction fine-tuning requires some
| more general (proxy/quasi-) semantic structures in LLMs, and I
| think you only need O(1%) of "everything ever written" to
| capture these. So it wouldn't really matter how much more we
| added; instruction-following LLMs don't really need it.
| TeMPOraL wrote:
| > _I read somewhere, but cannot find the source anymore, that
| all written text prior to this century was approx 50MB. (Might
| be misquoting it, as I don't have the source anymore.)_
|
| 50 MB feels too low, unless the quote meant text up until the
| _20th century_ , in which case it feels much more believable.
| In terms of text production and publishing, we're still
| riding an exponent, so a couple orders of magnitude increase
| between 1899 and 2025 is not surprising.
|
| (Talking about S-curves is all the hotness these days, but I
| feel it's usually a way to avoid understanding what
| exponential growth means - if one assumes we're past the
| inflection point, one can wave their hands and pretend the
| change is linear, and continue to not understand it.)
| ben_w wrote:
| Even by the start of the 20th century, 50 MB is definitely
| far too low.
|
| Any given English translation of Bible is by itself
| something like 3-5 megabytes of ASCII; the complete works
| of Shakespeare are about 5 megabytes; and I think (back of
| the envelope estimate) you'd get about the same again for
| what Arthur Conan Doyle wrote before 1900.
|
| I can just about believe there might have been only ten
| _thousand_ Bible-or-Shakespeare sized books (plus all the
| court documents, newspapers, etc. that add up to that)
| worldwide by 1900, but not _ten_.
|
| Edit: I forgot about encyclopaedias, by 1900 the
| Encyclopaedia Britannica was almost certainly more than 50
| MB all by itself.
| jerf wrote:
| 50MB feels like "all the 'ancient' text we have" maybe, as
| measured by the size of the original content and not
| counting copies. A quick check at Alice in Wonderland puts
| it at 163kB in plain text. About 300 of those gets us to
| 50MB. There's way more than 300 books of similar size from
| the 19th century. They may not all be digitized and freely
| available, but you can fill libraries with even existing
| 19th century texts, let alone what may be lost by now.
|
| Or it may just be someone bloviating and just being
| wrong... I think even ancient texts could exceed that
| number, though perhaps not by an order of magnitude.
| bravesoul2 wrote:
| I reckon a prolific writer could publish a million words in
| their career.
|
| Most people who blog could write 1k words a day. That's a
| million in 3 years. So not crazy numbers here.
|
| That's 5MB. Maybe you meant 50GB. I'd hazard 50TB.
| smokel wrote:
| Simply add images and video, and these estimates start to sound
| like the "640 KB should be enough for everyone".
|
| After that, make the robots explore and interact with the world
| by themselves, to fetch even more data.
|
| In all seriousness, adding image and interaction data will
| probably be enormously useful, even for generating text.
| netcan wrote:
| Like both will be done. Idk what the roi is on adding video
| data to the text models, but it's presumably lower than text.
|
| There are just a lot of avenues to try at this point.
| llSourcell wrote:
| no, it's not lower than text, it's higher ROI than text for
| understanding the physics of the world, which is exactly
| what videos are better at than text when it comes to
| training data
| AstroBen wrote:
| Does that transfer, though? I'm not sure we can expect
| its ability to approximate physics in video form would
| transfer to any other mode (text, code, problem solving
| etc)
| ricopags wrote:
| depends on the hyperparams but one of the biggest
| benefits of a latent space is transfer between modalities
| generalizations wrote:
| > has not yielded improvements (cf. gpt4.5 vs 4o).
|
| FWIW there is a huge difference between 4.5 and 4o.
| charcircuit wrote:
| >The public web is ~50TB
|
| Did you mean to type EB?
| gosub100 wrote:
| Only if you included all images and video
| andrepd wrote:
| > 50TB
|
| There's no way the entire Web fits in $400 worth of hard
| drives.
| AlienRobot wrote:
| Text is small.
| flir wrote:
| Nah, Common Crawl puts on 250TB a month.
|
| Maybe text only, though...
| fouc wrote:
| Maybe you're thinking of Library of Congress when you say
| ~50TB? Internet is definitely larger..
| rain1 wrote:
| This is kind of related to the jack morris post
| https://blog.jxmo.io/p/there-are-no-new-ideas-in-ai-only where
| he discusses how the big leaps in LLMs have mostly come - not so
| much from new training methods or architecture changes as such -
| but from the ability of new architectures to ingest _more_ data.
| layer8 wrote:
| Just a nitpick, but please don't misuse big O notation like
| that. Any fixed storage amount is O(100TB).
| christianqchung wrote:
| This is a bad article. Some of the information is wrong, and it's
| missing lots of context.
|
| For example, it somehow merged Llama 4 Maverick's custom Arena
| chatbot version with Behemoth, falsely claiming that the former
| is stopping the latter from being released. It also claims 40B of
| internet text data is 10B tokens, which seems a little odd. Llama
| 405B was also trained on more than 15 trillion tokens[1], but the
| post claims only 3.67 trillion for some reason. It also doesn't
| mention Mistral Large, even though it's the first
| good European 100B+ dense model.
|
| >The MoE arch. enabled larger models to be trained and used by
| more people - people without access to thousands of
| interconnected GPUs
|
| You still need thousands of GPUs to train a MoE model of any
| actual use. This is true for inference in the sense that it's
| faster I guess, but even that has caveats because MoE models are
| less powerful than dense models of the same size, though the
| trade-off has apparently been worth it in many cases. You also
| didn't need thousands of GPUs to do inference before, even for
| the largest models.
|
| The conclusion is all over the place, and has lots of just weird
| and incorrect implications. The title is about how big LLMs are,
| why is there such a focus on token training count? Also no
| mention of quantized size. This is a bad AI slop article (whoops,
| turns out the author accidentally said it was AI generated, so
| it's a bad human slop article).
|
| [1] https://ai.meta.com/blog/meta-llama-3-1/
| rain1 wrote:
| I can correct mistakes.
|
| > it somehow merged Llama 4 Maverick's custom Arena chatbot
| version with Behemoth
|
| I can clarify this part. I wrote 'There was a scandal as
| facebook decided to mislead people by gaming the lmarena
| benchmark site - they served one version of llama-4 there and
| released a different model' which is true.
|
| But it is inside the section about the llama 4 model behemoth.
| So I see how that could be confusing/misleading.
|
| I could restructure that section a little to improve it.
|
| > Llama 405B was also trained on more than 15 trillion
| tokens[1],
|
| You're talking about Llama 405B instruct, I'm talking about
| Llama 405B base. Of course the instruct model has been trained
| on more tokens.
|
| > why is there such a focus on token training count?
|
| I tried to include the rough training token count for each
| model I wrote about - plus additional details about training
| data mixture if available. Training data is an important part
| of an LLM.
| fossa1 wrote:
| It's ironic: for years the open-source community was trying to
| match GPT-3 (175B dense) with 30B-70B models + RLHF + synthetic
| data--and the performance gap persisted.
|
| Turns out, size really did matter, at least at the base model
| level. Only with the release of truly massive dense (405B) or
| high-activation MoE models (DeepSeek V3, DBRX, etc) did we start
| seeing GPT-4-level reasoning emerge outside closed labs.
| stared wrote:
| If you want it visually, here's a chart of total parameters as a
| function of year: https://app.charts.quesma.com/s/rmyk38
| rain1 wrote:
| This is really awesome. Thank you for creating that. I included
| a screenshot and link to the chart with credit to you in a
| comment to my post.
| stared wrote:
| I am happy you like it!
|
| If you like darker color scheme, here it is:
|
| https://app.charts.quesma.com/s/f07qji
|
| And active vs total:
|
| https://app.charts.quesma.com/s/4bsqjs
| rain1 wrote:
| I think one thing this chart makes visually very clear is
| the point I made about GPT-3 being such a huge leap, and
| there being a long gap before anybody was able to match it.
| ljoshua wrote:
| Less a technical comment and more just a mind-blown comment, but
| I still can't get over _just how much data_ is compressed into
| and available in these downloadable models. Yesterday I was on a
| plane with no WiFi, but had gemma3:12b downloaded through Ollama.
| Was playing around with it and showing my kids, and we fired
| history questions at it, questions about recent video games, and
| some animal fact questions. It wasn't perfect, but holy cow the
| breadth of information that is embedded in an 8.1 GB file is
| incredible! Lossy, sure, but a pretty amazing way of compressing
| all of human knowledge into something incredibly contained.
| ljlolel wrote:
| How big is Wikipedia text? Within 3X that size with 100%
| accuracy
| phkahler wrote:
| Google AI response says this for compressed size of
| wikipedia:
|
| "The English Wikipedia, when compressed, currently occupies
| approximately 24 GB of storage space without media files.
| This compressed size represents the current revisions of all
| articles, but excludes media files and previous revisions of
| pages, according to Wikipedia and Quora."
|
| So 3x is correct but LLMs are lossy compression.
| rain1 wrote:
| It's extremely interesting how powerful a language model is at
| compression.
|
| When you train it to be an assistant model, it's better at
| compressing assistant transcripts than it is general text.
|
| There is an eval which I have a lot of interest in and
| respect for,
| https://huggingface.co/spaces/Jellyfish042/UncheatableEval
| called UncheatableEval, which tests how good a language
| model an LLM is by applying it to a range of compression tasks.
|
| This task is essentially impossible to 'cheat'. Compression is
| a benchmark you cannot game!
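|
| A minimal sketch of the underlying measure (not the actual eval
| code; "gpt2" is just a placeholder model):
|
|     import math, torch
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     name = "gpt2"  # any causal LM
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModelForCausalLM.from_pretrained(name).eval()
|
|     def bits_per_byte(text: str) -> float:
|         ids = tok(text, return_tensors="pt").input_ids
|         with torch.no_grad():
|             # mean cross-entropy (nats) per predicted token
|             loss = model(ids, labels=ids).loss
|         total_bits = loss.item() * (ids.shape[1] - 1) / math.log(2)
|         return total_bits / len(text.encode("utf-8"))
|
| The lower the bits per byte on fresh text, the better the
| language model is as a compressor.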
| MPSimmons wrote:
| Agreed. It's basically lossy compression for everything it's
| ever read. And the quantization impacts the lossiness, but
| since a lot of text is super fluffy, we tend not to notice as
| much as we would when we, say, listen to music that has been
| compressed in a lossy way.
| entropicdrifter wrote:
| It's a bit like if you trained a virtual band to play any
| song ever, then told it to do its own version of the songs.
| Then prompted it to play whatever specific thing you
| wanted. It won't be the same because it _kinda_ remembers
| the right thing _sorta_ , but it's also winging it.
| soulofmischief wrote:
| Knowledge is learning relationships by decontextualizing
| information into generalized components. Application of
| knowledge is recontextualizing these components based on the
| problem at hand.
|
| This is essentially just compression and decompression. It's
| just that with prior compression techniques, we never tried
| leveraging the inherent relationships encoded in a compressed
| data structure, because our compression schemes did not
| leverage semantic information in a generalized way and thus
| did not encode very meaningful relationships other than "this
| data uses the letter 'e' quite a lot".
|
| A lot of that comes from the sheer amount of data we throw at
| these models, which provides enough substrate for semantic
| compression. Compare that to common compression schemes in
| the wild, where data is compressed in isolation without
| contributing its information to some model of the world. It
| turns out that because of this, we've been leaving _a lot_ on
| the table with regards to compression. Another factor has
| been the speed/efficiency tradeoff. GPUs have allowed us to
| put a lot more into efficiency, and the expectation that
| many language models only need to produce text as fast as it
| can be read by a human means that we can optimize even
| further for efficiency over speed.
|
| Also, shout out to Fabrice Bellard's ts_zip, which leverages
| LLMs to compress text files. https://bellard.org/ts_zip/
| exe34 wrote:
| Wikipedia is about 24GB, so if you're allowed to drop 1/3 of
| the details and make up the missing parts by splicing in random
| text, 8GB doesn't sound too bad.
|
| To me the amazing thing is that you can tell the model to do
| something, even follow simple instructions in plain English,
| like make a list or write some python code to do $x, that's the
| really amazing part.
| bbarnett wrote:
| Not to mention, _Language Modeling is Compression_
| https://arxiv.org/pdf/2309.10668
|
| So text wikipedia at 24G would easily hit 8G with many
| standard forms of compression, I'd think. If not better. And
| it would be 100% accurate, full text and data. Far more
| usable.
|
| It's so easy for people to not realise how _massive_ 8GB
| really is, in terms of text. Especially if you use ascii
| instead of UTF.
| horsawlarway wrote:
| The 24G is the compressed number.
|
| They host a pretty decent article here:
| https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
|
| The relevant bit:
|
| > As of 16 October 2024, the size of the current version
| including all articles compressed is about 24.05 GB without
| media.
| bbarnett wrote:
| Nice link, thanks.
|
| Well, I'll fall back to the position that one is lossy, the
| other not.
| Nevermark wrote:
| It blows my mind that I can ask for 50 synonyms, instantly
| get a great list with great meaning summaries.
|
| Then ask for the same list sorted and get that nearly
| instantly.
|
| These models have a short time context for now, but they
| already have _a huge "working memory" relative to us_.
|
| It is very cool. And indicative that vastly smarter models
| are going to be achieved fairly easily, with new insight.
|
| Our biology has had to ruthlessly work within our
| biological/ecosystem energy envelope, and with the limited
| value/effort returned by a pre-internet pre-vast economy.
|
| So biology has never been able to scale. Just get marginally
| more efficient and effective within tight limits.
|
| Suddenly, (in historical, biological terms), energy
| availability limits have been removed, and limits on the
| value of work have compounded and continue to do so.
| Unsurprising that those changes suddenly unlock easily
| achieved vast untapped room for cognitive upscaling.
| Wowfunhappy wrote:
| > These models [...] have a huge "working memory" relative
| to us. [This is] indicative that vastly smarter models are
| going to be achieved fairly easily, with new insight.
|
| I don't think your second sentence logically follows from
| the first.
|
| Relative to us, these models:
|
| - Have a much larger working memory.
|
| - Have much more limited logical reasoning skills.
|
| To some extent, these models are able to use their superior
| working memories to compensate for their limited reasoning
| abilities. This can make them very useful tools! But there
| may well be a ceiling to how far that can go.
|
| When you ask a model to "think about the problem step by
| step" to improve its reasoning, you are basically just
| giving it more opportunities to draw on its huge memory
| bank and try to put things together. But humans are able to
| reason with orders of magnitude less training data. And by
| the way, we are out of new training data to give the
| models.
| antonvs wrote:
| > Have much more limited logical reasoning skills.
|
| Relative to the best humans, perhaps, but I seriously
| doubt this is true in general. Most people I work with
| couldn't reason nearly as well through the questions I
| use LLMs to answer.
|
| It's also worth keeping in mind that having a different
| approach to reasoning is not necessarily equivalent to a
| worse approach. Watch out for cherry-picking the cons of
| its approach and ignoring the pros.
| exe34 wrote:
| > Relative to the best humans,
|
| For some reason, the bar for AI is always against the
| best possible human, right now.
| exe34 wrote:
| > But humans are able to reason with orders of magnitude
| less training data.
|
| Common belief, but false. You start learning from inside
| the womb. The data flow increases exponentially when you
| open your eyes and then again when you start manipulating
| things with your hands and mouth.
|
| > When you ask a model to "think about the problem step
| by step" to improve its reasoning, you are basically just
| giving it more opportunities to draw on its huge memory
| bank and try to put things together.
|
| We do the same with children. At least I did it to my
| classmates when they asked me for help. I'd give them a
| hint, and ask them to work it out step by step from
| there. It helped.
| Wowfunhappy wrote:
| > Common belief, but false. You start learning from
| inside the womb. The data flow increases exponentially
| when you open your eyes and then again when you start
| manipulating things with your hands and mouth.
|
| But you don't get data equal to _the entire internet_ as
| a child!
|
| > We do the same with children. At least I did it to my
| classmates when they asked me for help. I'd give them a
| hint, and ask them to work it out step by step from
| there. It helped.
|
| And I do it with my students. I still think there's a
| difference in kind between when I listen to my students
| (or other adults) reason through a problem, and when I
| look at the output of an AI's reasoning, but I admittedly
| couldn't tell you what that is, so point taken. I still
| think the AI is relying far more heavily on its knowledge
| base.
| jacobr1 wrote:
| There seems to be lots of mixed data points on this, but
| to some extent there is knowledge encoded into the
| evolutionary base state of the new human brain. Probably
| not directly as knowledge, but "primed" to quickly
| establish relevant world models and certain types of
| reasoning with a small number of examples.
| oceanplexian wrote:
| Your field of vision is equivalent to something like 500
| Megapixels. And assume it's uncompressed because it's not
| like your eyeballs are doing H.264.
|
| Given vision and the other senses, I'd argue that your
| average toddler has probably trained on more sensory
| information than the largest LLMs ever built long before
| they learn to talk.
| jacobr1 wrote:
| > And by the way, we are out of new training data to give
| the models.
|
| Only easily accessible text data. We haven't really
| started using video at scale yet for example. It looks
| like data for specific tasks goes really far too ... for
| example agentic coding interactions aren't something that
| has generally been captured on the internet. But
| capturing interactions with coding agents, in combination
| with the base-training of existing programming knowledge
| already captured is resulting in significant performance
| increases. The amount of specialized data we might need
| to gather or synthetically generate is perhaps orders of
| magnitude less than presumed with pure supervised
| learning systems. And for other applications like
| industrial automation or robotics we've barely started
| capturing all the sensor data that lives in those
| systems.
| Nevermark wrote:
| My response completely acknowledged their current
| reasoning limits.
|
| But in evolutionary time frames, clearly those limits are
| lifting extraordinarily quickly. By many orders of
| magnitude.
|
| And the point I made, that our limits were imposed by
| harsh biological energy and reward limits, vs. today's
| models (and their successors) which have access to
| relatively unlimited energy, and via sharing value with
| unlimited customers, unlimited rewards, stands.
|
| It is a much simpler problem to improve digital cognition
| in a global ecosystem of energy production, instant
| communication and global application, than it was for
| evolution to improve an individual animal's cognition in
| the limited resources of local habitats and their
| inefficient communication of advances.
| Workaccount2 wrote:
| I don't like the term "compression" used with transformers
| because it gives the wrong idea about how they function. Like
| that they are a search tool glued onto a .zip file, your
| prompts are just fancy search queries, and hallucinations are
| just bugs in the recall algo.
|
| Although strictly speaking they have lots of information in a
| small package, they are F-tier compression algorithms because
| the loss is bad, unpredictable, and undetectable (i.e. a human
| has to check it). You would almost never use a transformer in
| place of any other compression algorithm for typical data
| compression uses.
| Wowfunhappy wrote:
| A .zip is lossless compression. But we also have plenty of
| lossy compression algorithms. We've just never been able to
| use lossy compression on text.
| Workaccount2 wrote:
| >We've just never been able to use lossy compression on
| text.
|
| ...and we still can't. If your lawyer sent you your case
| files in the form of an LLM trained on those files, would
| you be comfortable with that? Where is the situation you
| would compress text with an LLM over a standard compression
| algo? (Other than to make an LLM).
|
| Other lossy compression targets known superfluous
| information. MP3 removes sounds we can't really hear, and
| JPEG works by grouping uniform color pixels into single
| chunks of color.
|
| LLM's kind of do their own thing, and the data you get back
| out of them is correct, incorrect, or dangerously incorrect
| (i.e. is plausible enough to be taken as correct), with no
| algorithmic way to discern which is which.
|
| So while yes, they do compress data and you can measure it,
| the output of this "compression algorithm" puts it in the
| same family as a "randomly delete words and thesaurus long
| words into short words" compression algorithm. Which I
| don't think anyone would consider using to compress their
| documents.
| esafak wrote:
| People summarize (compress) documents with LLMs all day.
| With legalese the application would be to summarize it in
| layman's terms, while retaining the original for legal
| purposes.
| Workaccount2 wrote:
| Yes, and we all know (ask teachers) how reliable those
| summaries are. They are _randomly_ lossy, which makes
| them unsuitable for any serious work.
|
| I'm not arguing that LLMs don't compress data, I am
| arguing that they are technically compression tools, but
| not colloquially compression tools, and the overlap they
| have with colloquial compression tools is almost zero.
| menaerus wrote:
| At this moment LLMs are used for much of the serious work
| across the globe so perhaps you will need to readjust
| your line of thinking. There's nothing inherently better
| or more trustworthy to have a person compile some
| knowledge than, let's say, a computer algorithm in this
| case. I place my bets on the latter to have better
| output.
| esafak wrote:
| > They are randomly lossy, which makes them unsuitable
| for any serious work.
|
| Ask ten people and they'll give ten different summaries.
| Are humans unsuitable too?
| Workaccount2 wrote:
| Yes, which is why we write things down, and when those
| archives become too big we use lossless compression on
| them, because we cannot tolerate a compression tool that
| drops the street address of a customer or even worse,
| hallucinates a slightly different one.
| Wowfunhappy wrote:
| But lossy compression algorithms for e.g. movies and
| music are also non-deterministic.
|
| I'm not making an argument about whether the compression
| is good or useful, just like I don't find 144p bitrate
| starved videos particularly useful. But it doesn't seem
| so unlike other types of compression to me.
| antonvs wrote:
| > LLM's kind of do their own thing, and the data you get
| back out of them is correct, incorrect, or dangerously
| incorrect (i.e. is plausible enough to be taken as
| correct), with no algorithmic way to discern which is
| which.
|
| Exactly like information from humans, then?
| tshaddox wrote:
| > If your lawyer sent you your case files in the form of
| an LLM trained on those files, would you be comfortable
| with that?
|
| If the LLM-based compression method was well-understood
| and demonstrated to be reliable, I wouldn't oppose it on
| principle. If my lawyer didn't know what they were doing
| and threw together some ChatGPT document transfer system,
| of course I wouldn't trust it, but I also wouldn't trust
| my lawyer if they developed their own DCT-based lossy
| image compression algorithm.
| angusturner wrote:
| There is an excellent talk by Jack Rae called "compression
| for AGI", where he shows (what I believe to be) a little
| known connection between transformers and compression;
|
| In one view, you can view LLMs as SOTA lossless compression
| algorithms, where the weights don't count towards the
| description length. Sounds crazy but it's true.
| Workaccount2 wrote:
| A transformer that doesn't hallucinate (or knows what is a
| hallucination) would be the ultimate compression algorithm.
| But right now that isn't a solved problem, and it leaves
| the output of LLMs too untrustworthy to use over what are
| colloquially known as compression algorithms.
| Nevermark wrote:
| It is still task related.
|
| Compressing a comprehensive command line reference via
| model might introduce errors and drop some options.
|
| But for many people, especially new users, referencing
| commands, and getting examples, via a model would
| deliver many times the value.
|
| Lossy vs. lossless are fundamentally different, but so
| are use cases.
| swyx wrote:
| his talk here https://www.youtube.com/watch?v=dO4TPJkeaaU
|
| and his last before departing for Meta Superintelligence
| https://www.youtube.com/live/U-fMsbY-
| kHY?si=_giVEZEF2NH3lgxI...
| Wowfunhappy wrote:
| How does this compare to, say, the compression ratio of a
| lossless 8K video and a 240p Youtube stream of the same video?
| agumonkey wrote:
| Intelligence is compression some say
| Nevermark wrote:
| Very much so!
|
| The more and faster a "mind" can infer, the less it needs to
| store.
|
| Think how many fewer facts a symbolic system that can perform
| calculus needs to store, vs. an algebraic, or just arithmetic
| system, to cover the same numerical problem solving space.
| Many orders of magnitude less.
|
| The same goes for higher orders of reasoning. General or
| specific subject related.
|
| And higher order reasoning vastly increases capabilities
| extending into new novel problem spaces.
|
| I think model sizes may temporarily drop significantly, after
| every major architecture or training advance.
|
| In the long run, "A circa 2025 maxed M3 Ultra Mac Studio is
| all you need!" (/h? /s? Time will tell.)
| agumonkey wrote:
| I don't know who else took notes by diffing their own
| assumptions with lectures / talks. There was a notion of
| what's really new compared to previous conceptual state,
| what adds new information.
| goatlover wrote:
| How well does that apply to robotics or animal intelligence?
| Manipulating the real world is more fundamental to human
| intelligence than compressing text.
| ToValueFunfetti wrote:
| Under the predictive coding model (and I'm sure some
| others), animal intelligence is also compression. The idea
| is that the early layers of the brain minimize how
| surprising incoming sensory signals are, so the later
| layers only have to work with truly entropic signal. But it
| has non-compression-based intelligence within those more
| abstract layers.
| goatlover wrote:
| I just wonder if neuroscientists use that kind of model.
| penguin_booze wrote:
| I don't know why, but I was reminded of Douglas Hofstadter's
| talk: Analogy is cognition:
| https://www.youtube.com/watch?v=n8m7lFQ3njk&t=964s.
| tshaddox wrote:
| Some say that. But what I value even more than compression is
| the ability to create new ideas which do not in any way exist
| in the set of all previously-conceived ideas.
| benreesman wrote:
| I'm toying with the phrase "precedented originality" as a
| way to describe the optimal division of labor when I work
| with Opus 4 running hot (which is the first one where I
| consistently come out ahead by using it). That model at
| full flog seems to be very close to the asymptote for the
| LLM paradigm on coding: they've really pulled out all the
| stops (the temperature is so high it makes trivial
| typographical errors, it will discuss just about anything,
| it will churn for 10, 20, 30 seconds to first token via
| API).
|
| It's good enough that it has changed my mind about the
| fundamental utility of LLMs for coding in non-Javascript
| complexity regimes.
|
| But it's still not an expert programmer, not by a million
| miles, there is no way I could delegate my job to it (and
| keep my job). So there's some interesting boundary that's
| different than I used to think.
|
| I think its in the vicinity of "how much precedent exists
| for this thought or idea or approach". The things I bring
| to the table in that setting have precedent too, but much
| more tenuously connected to like one clear precedent on
| e.g. GitHub, because if the thing I need was on GitHub I
| would download it.
| hamilyon2 wrote:
| Crystallized intelligence is. I am not sure about fluid
| intelligence.
| antisthenes wrote:
| Fluid intelligence is just how quickly you acquire
| crystallized intelligence.
|
| It's the first derivative.
| agumonkey wrote:
| Talking about that, people designed a memory game, dual n
| back, which allegedly improve fluid intelligence.
| dgrabla wrote:
| Back in the '90s, we joked about putting "the internet" on a
| floppy disk. It's kind of possible now.
| Lu2025 wrote:
| Yeah, those guys managed to steal the internet.
| Nevermark wrote:
| It is truly incredible.
|
| One factor, is the huge redundancies pervasive in our
| communication.
|
| (1) There are so many ways to say the same thing, that (2) we
| have to add even more words to be precise at all. Without a
| verbal indexing system we (3) spend many words just setting up
| context for what we really want to say. And finally, (4) we
| pervasively add a great deal of intentionally non-informational
| creative and novel variability, and mood inducing color, which
| all require even more redundancy to maintain reliable
| interpretation, in order to induce our minds to maintain
| attention.
|
| Our minds are active resistors of plain information!
|
| All four factors add so much redundancy, it's probably fair to
| say most of our communication (by bits, characters, words,
| etc.) is pure redundancy -- maybe 95%, 98%, or more!
|
| Another helpful compressor: many facts are among a few
| "reasonably expected" alternative answers. So it takes just a
| little biasing information to encode the right option.
|
| Finally, the way we reason seems to be highly common across
| everything that matters to us. Even though we have yet to
| identify and characterize this informal human logic. So once
| that is modeled, that itself must compress a lot of relations
| significantly.
|
| Fuzzy Logic was a first approximation attempt at modeling human
| "logic", but it has not been very successful.
|
| Models should eventually help us uncover that "human logic", by
| analyzing how they model it. Doing so may let us create even
| more efficient architectures. Perhaps significantly more
| efficient, and even provide more direct non-gradient/data based
| "thinking" design.
|
| Nevertheless, the level of compression is astounding!
|
| We are far less complicated cognitive machines than we imagine!
| Scary, but inspiring too.
|
| I personally believe that common PCs of today, maybe even high
| end smart phones circa 2025, will be large enough to run future
| super intelligence when we get it right, given internet access
| to look up information.
|
| We have just begun to compress artificial minds.
| nico wrote:
| For reference (according to Google):
|
| > The English Wikipedia, as of June 26, 2025, contains over 7
| million articles and 63 million pages. The text content alone
| is approximately 156 GB, according to Wikipedia's statistics
| page. When including all revisions, the total size of the
| database is roughly 26 terabytes (26,455 GB)
| sharkjacobs wrote:
| better point of reference might be pages-articles-
| multistream.xml.bz2 (current pages without edit/revision
| history, no talk pages, no user pages) which is 20GB
|
| https://en.wikipedia.org/wiki/Wikipedia:Database_download#Wh.
| ..?
| inopinatus wrote:
| this is a much more deserving and reliable candidate for
| any labels regarding the breadth of human knowledge.
| mapt wrote:
| What happens if you ask this 8gb model "Compose a realistic
| Wikipedia-style page on the Pokemon named Charizard"?
|
| How close does it come?
| pcrh wrote:
| Wikipedia itself describes its size as ~25GB without media
| [0]. And it's probably more accurate and with broader
| coverage in multiple languages compared to the LLM downloaded
| by the GP.
|
| https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
| pessimizer wrote:
| Really? I'd assume that an LLM would deduplicate Wikipedia
| into something much smaller than 25GB. That's its only job.
| crazygringo wrote:
| > _That 's its only job._
|
| The vast, vast majority of LLM knowledge is not found in
| Wikipedia. It is definitely not its only job.
| tomkaos wrote:
| Same thing with image models. A 4 GB Stable Diffusion model
| can draw and represent anything humanity knows.
| alternatex wrote:
| How about a full glass of wine? Filled to the brim.
| stronglikedan wrote:
| I've been doing the AI course on Brilliant lately, and it's
| mindblowing the techniques that they come up with to compress
| the data.
| thecosas wrote:
| A neat project you (and others) might want to check out:
| https://kiwix.org/
|
| Lots of various sources that you can download locally to have
| available offline. They're even providing some pre-loaded
| devices in areas where there may not be reliable or any
| internet access.
| swyx wrote:
| the study of language models from an information
| theory/compression POV is a small field but increasingly
| important for efficiency/scaling - we did a discussion about
| this today
| https://www.youtube.com/watch?v=SWIKyLSUBIc&t=2269s
| divbzero wrote:
| The _Encyclopaedia Britannica_ has about 40,000,000 words [1]
| or about 0.25 GB if you assume 6 bytes per word. It's
| impressive but not outlandish that an 8.1 GB file could encode
| a large swath of human information.
|
| [1]: https://en.wikipedia.org/wiki/Encyclopaedia_Britannica
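|
| For reference, the arithmetic (word count and bytes-per-word as
| assumed above):
|
|     words = 40_000_000
|     print(words * 6 / 1e9, "GB")  # 0.24 GB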
| holoduke wrote:
| Yea. Same for an 8GB stable diffusion image generator. Sure, not
| the best quality. But there is so much information inside.
| tasuki wrote:
| 8.1 GB is a lot!
|
| It is 64,800,000,000 bits.
|
| I can imagine 100 bits sure. And 1,000 bits why not. 10,000 you
| lose me. A million? That sounds like a lot. Now 64 million
| would be a number I can't well imagine. And this is a thousand
| times 64 million!
| ysofunny wrote:
| they're an upgraded version of self-executable zip files that
| compress knowledge like mp3 compresses music, without knowing
| exactly wtf either music or knowledge is
|
| the self-execution is the interactive chat interface.
|
| wikipedia gets "trained" (compiled+compressed+lossy) into an
| executable you can chat with, you can pass this through another
| pretrained A.I. that can talk out the text or transcribe it.
|
| I think writing compilers is now officially a defunct skill,
| kept for historical and conservation purposes more than
| anything else; but I don't like saying "conservation", it's a
| bad framing, I'd rather say "legacy connectivity", which is a
| form of continuity or backwards compatibility
| mr_toad wrote:
| I will never tire of pointing out that machine learning models
| are compression algorithms, not compressed data.
| inopinatus wrote:
| I kinda made an argument the other day that they are high-
| dimensional lossy decompression algorithms, which might be
| the same difference but looking the other way through the
| lens.
| simonw wrote:
| > There were projects to try to match it, but generally they
| operated by fine tuning things like small (70B) llama models on a
| bunch of GPT-3 generated texts (synthetic data - which can result
| in degeneration when AI outputs are fed back into AI training
| inputs).
|
| That parenthetical doesn't quite work for me.
|
| If synthetic data always degraded performance, AI labs wouldn't
| use synthetic data. They use it because it helps them train
| _better_ models.
|
| There's a paper that shows that if you very deliberately train a
| model on its own output in a loop you can get worse performance.
| That's not what AI labs using synthetic data actually do.
|
| That paper gets a lot of attention because the schadenfreude of
| models destroying themselves through eating their own tails is
| irresistible.
| rybosome wrote:
| Agreed, especially when in this context of training a smaller
| model on a larger model's outputs. Distillation is generally
| accepted as an effective technique.
|
| This is exactly what I did in a previous role, fine-tuning
| Llama and Mistral models on a mix of human and GPT-4 data for a
| domain-specific task. Adding (good) synthetic data definitely
| increased the output quality for our tasks.
| rain1 wrote:
| Yes but just purely in terms of entropy, you can't make a
| model better than GPT-4 by training it on GPT-4 outputs. The
| limit you would converge towards is GPT-4.
| simonw wrote:
| A better way to think about synthetic data is to consider
| code. With code you can have an LLM generate code with
| tests, then confirm that the code compiles and the tests
| pass. Now you have semi-verified new code you can add to
| your training data, and training on that will help you get
| better results for code even though it was generated by a
| "less good" LLM.
| 1vuio0pswjnm7 wrote:
| 1. "raw text continuation engine"
|
| https://gist.github.com/rain-1/cf0419958250d15893d8873682492...
|
| 2. "superintelligence"
|
| https://en.m.wikipedia.org/wiki/Superintelligence
|
| "Meta is uniquely positioned to deliver superintelligence to the
| world."
|
| https://www.cnbc.com/2025/06/30/mark-zuckerberg-creating-met...
|
| Is there any difference between 1 and 2
|
| Yes. One is purely hypothetical
| angusturner wrote:
| I wish people would stop parroting the view that LLMs are lossy
| compression.
|
| There is kind of a vague sense in which this metaphor holds, but
| there is a much more interesting and rigorous fact about LLMs
| which is that they are also _lossless_ compression algorithms.
|
| There are at least two senses in which this is true:
|
| 1. You can use an LLM to losslessly compress any piece of text at
| a cost that approaches the negative log-likelihood of that text
| under the model, using arithmetic coding. A sender and receiver
| both need a copy of the LLM weights.
|
| 2. You can use an LLM plus SGD (i.e. the training code) as a
| lossless compression algorithm, where the communication cost is
| the area under the training curve (and the model weights don't
| count towards description length!). See Jack Rae, "compression
| for AGI".
| actionfromafar wrote:
| Re 1 - classical compression is also extremely effective if
| both sender and receiver have access to the same huge
| dictionary.
| kamranjon wrote:
| This is somehow missing the Gemma and Gemini series of models
| from Google. I also think that not mentioning the T5 series of
| models is strange from a historical perspective because they sort
| of pioneered many of the concepts in transfer learning and kinda
| kicked off quite a bit of interest in this space.
| rain1 wrote:
| The Gemma models are too small to be included in this list.
|
| You're right the T5 stuff is very important historically but
| they're below 11B and I don't have much to say about them.
| Definitely a very interesting and important set of models
| though.
| tantalor wrote:
| > too small
|
| Eh?
|
| * Gemma 1 (2024): 2B, 7B
|
| * Gemma 2 (2024): 2B, 9B, 27B
|
| * Gemma 3 (2025): 1B, 4B, 12B, 27B
|
| This is the same range as some Llama models which you do
| mention.
|
| > important historically
|
| Aren't you trying to give a historical perspective? What's
| the point of this?
| kamranjon wrote:
| Since you included GPT-2, everything from Google including T5
| would qualify for the list I would think.
| lukeschlather wrote:
| This is a really nice writeup.
|
| That said, there's an unstated assumption here that these truly
| large language models are the most interesting thing. The big
| players have been somewhat quiet but my impression from the
| outside is that OpenAI let a little bit leak with their behavior.
| They built an even larger model and it turned out to be
| disappointing so they quietly discontinued it. The most powerful
| frontier reasoning models may actually be smaller than the
| largest publicly available models.
| bobsmooth wrote:
| There's got to be tons of books that remain undigitized that can
| be mined for training data, haven't there?
___________________________________________________________________
(page generated 2025-07-02 23:00 UTC)