[HN Gopher] Using a LLM to compress text
___________________________________________________________________
Using a LLM to compress text
Author : alexmolas
Score : 94 points
Date : 2024-05-03 07:51 UTC (2 days ago)
(HTM) web link (o565.com)
(TXT) w3m dump (o565.com)
| pronoiac wrote:
| I expected general-purpose compressors to do better, but I was
| wrong. For the whole document:
|
|     174355  pg11.txt
|      60907  pg11.txt.gz-9
|      58590  pg11.txt.zstd-9
|      54164  pg11.txt.xz-9
|      25360  [from blog post]
| EgoIncarnate wrote:
| Which model did you use?
| Intralexical wrote:
| IDK why, but BZIP2 seems to do somewhat better than other
| compression algorithms for natural language text:
|
|     $ curl https://www.gutenberg.org/cache/epub/11/pg11.txt | bzip2 --best | wc
|         246    1183   48925
|
| Also, ZSTD goes all the way up to `--ultra -22` plus
| `--long=31` (4GB window -- irrelevant here since the file fits
| in the default 8MB window anyway).
| pastage wrote:
| You can use a preshared dictionary with bzip2 and zstd, so you
| can get that down a lot by using different dictionaries
| depending on certain rules. I don't know if it helps with
| literature, but I had great success sending databases of free
| text like that. In the end it was easier to just use one
| dictionary for everything and just skip the rules.
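|
| A minimal sketch of the zstd side of this in Python, assuming
| the python-zstandard package and made-up file names:
|
|     import zstandard as zstd
|
|     # Train a shared dictionary from sample documents (e.g. the
|     # kind of free-text database records described above).
|     samples = [open(p, "rb").read()
|                for p in ("rec1.txt", "rec2.txt", "rec3.txt")]
|     dict_data = zstd.train_dictionary(112_640, samples)
|
|     # Both sides hold dict_data; only the payload is sent.
|     cctx = zstd.ZstdCompressor(dict_data=dict_data, level=19)
|     dctx = zstd.ZstdDecompressor(dict_data=dict_data)
|
|     payload = cctx.compress(open("new_record.txt", "rb").read())
|     assert dctx.decompress(payload) == \
|         open("new_record.txt", "rb").read()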
| immibis wrote:
| And here's the opposite - using gzip as a (merely adequate)
| "large" language model: https://aclanthology.org/2023.findings-
| acl.426.pdf
|
| By the way, if you want to see how well gzip actually models
| language, take any gzipped file, flip a few bits, and unzip it.
| If it gives you a checksum error, ignore that. You might have
| to unzip in a streaming way so that it can't tell the checksum
| is wrong until it's already printed the wrong data that you
| want to see.
| anonu wrote:
| Shouldn't you factor in the size of the compression tools
| needed?
| TeMPOraL wrote:
| Not if you're interested in compressing more than one file -
| compression tools are a fixed, one-time cost in size.
| sp332 wrote:
| For the prize, yes. But it's interesting to see that even a
| much larger decompressor might not necessarily buy you a
| smaller file.
| Legend2440 wrote:
| LLMs blow away traditional compressors because they are very
| good predictors, and prediction and compression are the same
| operation.
|
| You can convert any predictor into a lossless compressor by
| feeding the output probabilities into an entropy coding
| algorithm. LLMs can get compression ratios as high as 95% (0.4
| bits per character) on English text.
|
| https://arxiv.org/html/2404.09937v1
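|
| A minimal sketch of that conversion in Python (not the blog
| post's code): a toy bigram model stands in for the LLM, and
| the sum of -log2 P(actual next char) is the size an ideal
| entropy coder would approach given those predictions.
|
|     import math
|     from collections import Counter, defaultdict
|
|     def make_bigram_predictor(corpus, alphabet):
|         # Laplace-smoothed bigram counts; a stand-in for an
|         # LLM's next-token probabilities.
|         counts = defaultdict(Counter)
|         for a, b in zip(corpus, corpus[1:]):
|             counts[a][b] += 1
|         def prob(prev_char, next_char):
|             c = counts[prev_char]
|             return (c[next_char] + 1) / (sum(c.values()) + len(alphabet))
|         return prob
|
|     text = open("pg11.txt", encoding="utf-8").read()
|     prob = make_bigram_predictor(text, set(text))
|
|     # Ideal code length: -log2 of the probability the model gave
|     # to each character that actually occurred next.
|     bits = sum(-math.log2(prob(a, b))
|                for a, b in zip(text, text[1:]))
|     print(bits / 8 / len(text), "bytes per character")
|
| The better the predictor, the higher those probabilities and
| the fewer bits; plugging an LLM's probabilities into the same
| construction is what gets the ratios quoted above.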
| Der_Einzige wrote:
| The 1-1 correspondence between prediction and compression is
| one of the most counterintuitive and fascinating things in
| all of AI to me.
| mjburgess wrote:
| It's only equivalent for a very narrow sense of 'prediction',
| namely modelling conditional probability distributions over
| known data.
|
| There's no sense, for example, in which deriving a
| prediction about the nature of reality from a novel
| scientific theory is 'compression'.
|
| eg., suppose we didn't know a planet existed, and we looked
| at orbital data. There's no sense in which compressing that
| data would indicate another planet existed.
|
| It's a great source of confusion that people think AI/ML
| systems are 'predicting' _novel distributions_ of
| observations (science), vs., novel observations of _the
| same_ distribution (statistics).
|
| It should be more obvious that the latter is just
| compression, since it's just taking a known distribution of
| data and replacing it with a derivative optimal value.
|
| Science predicts novel distributions based on theories,
| ie., it says the world is other than we previously
| supposed.
| palmtree3000 wrote:
| Sure it is! If we were trying to compress an archive of
| orbital data, one way to do it would be "initial
| positions + periodic error correction". If you have the
| new planet, your errors will be smaller and can be
| represented in less space at the same precision.
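|
| A toy residual-coding sketch of that idea in Python, with
| entirely made-up numbers rather than real orbital data: the
| better model leaves smaller residuals, which take fewer bits
| at the same precision.
|
|     observed = [10.00, 20.10, 30.40]
|     pred_8p  = [10.30, 19.60, 31.10]  # model without the planet
|     pred_9p  = [10.02, 20.08, 30.43]  # model with the new planet
|     res_8p = [round(o - p, 2) for o, p in zip(observed, pred_8p)]
|     res_9p = [round(o - p, 2) for o, p in zip(observed, pred_9p)]
|     print(res_8p)  # [-0.3, 0.5, -0.7]: larger residuals
|     print(res_9p)  # [-0.02, 0.02, -0.03]: smaller, cheaper to code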
| mjburgess wrote:
| This is assuming the theory.
|
| A statistical model of orbits, without a theory of
| gravity, is less compressed when you assume more objects.
| Take all the apparent positions of objects in the sky,
| {(object, x1, x2, t),...}. Find a statistical model of
| each point at t+1, so y = (o, x1, x2, t+1). There is no
| sense in which you're deriving a new object in the sky
| from this statistical model -- it is only a compression
| of observable orbits.
|
| When you say, "if you have the new planet", you're
| changing the data generating process (theory) to produce
| a new distribution of points {(o', x1', x2', t'), ...} to
| include an unseen object. You're then comparing two _data
| generating models_ (two theories) for their simplicity.
| You're not comparing the associative models.
|
| Call the prior theory 8-planets, so 8P generates x1,x2,t;
| and the new theory 9P which generates x1',x2',t'
|
| You're then making a conditional error distribution when
| comparing two rival theories. The 9P theory will minimize
| this error.
|
| But in no sense can the 9P theory be derived from the
| initial associative statistical distribution. You are,
| based on theory (science, knowledge, etc.), choosing to
| add a planet, vs. eg., correcting for measurement error,
| modifying Newton's laws, changing the angle of the earth
| wrt the solar system... or one of _an infinite number_ of
| theories which all produce the same error minimization.
|
| The sense of "prediction" that science uses (via Popper
| et al.) is deriving the existence of novel phenomena that
| do not follow from prior observable distributions.
| Legend2440 wrote:
| It doesn't matter how your predictor works, whether it's
| a scientific theory, a statistical model, or a magic
| oracle. You can always perfectly convert its predictions
| into compression using entropy coding. The conversion
| process is black-box.
| Intralexical wrote:
| The practical way to do this is to train a custom dictionary with
| a normal compression algorithm.
| clay_the_ripper wrote:
| It does seem like this is a possible method to test if an LLM has
| your data in it.
|
| People have found other ways to do that of course, but this is
| pretty clever.
| mvkel wrote:
| Not necessarily. This also uncovers the weakness of the NYT
| lawsuit.
|
| Imagine in your corpus of training data is the following:
|
| - bloga.com: "I read in the NYT that 'it rains cats and dogs
| twice per year'"
|
| - blogb.com: "according to the NYT, 'cats and dogs level rain
| occurs 2 times per year.'"
|
| - newssite.com: "cats and dogs rain events happen twice per
| year, according to the New York Times"
|
| Now, you chat with an LLM trained on this data, asking it "how
| many times per year does it rain cats and dogs?"
|
| "According to the New York Times, it rains cats and dogs twice
| per year."
|
| The NYT content itself was never in the training data; however,
| it -is- mentioned a lot across various commoncrawl-approved
| sources, and therefore gets a higher probability association
| with the next token.
|
| Zoom that out to full articles quoted throughout the web, and
| you get false positives.
| refulgentis wrote:
| They were getting huge chunks of NYT articles out, verbatim.
| I remember being stunned. Then I remember finding out there
| was some sort of trick to it that made it seem sillier.
| KaoruAoiShiho wrote:
| Was it that NYT articles are routinely pirated on reddit
| comments and the like?
| viraptor wrote:
| Does it matter? What's the legal view on "I downloaded
| some data which turned out to be copied from a copyrighted
| source, and it was probably trivial to figure that out,
| then trained the LLM on it"? I mean, they work on data
| processing - of course they would expect that if someone
| responds with 10 paragraphs in reporting style under a
| link to the NYT... that's just the article.
| KaoruAoiShiho wrote:
| I dunno, I'm not a lawyer, it might matter.
| evilduck wrote:
| I genuinely don't know the answer but I can see it being
| more complicated than "OpenAI purposefully acquired and
| trained on NYT articles".
|
| If Stack Overflow collects a bunch of questions and
| comments and exposes them as a big dataset licensed as
| Creative Commons, but it actually contains quite a bit of
| copyrighted content, whose responsibility is it to
| validate copyright violations in that data? If I use
| something licensed as CC in good faith and it turns out
| the provider or seller of that content had no right to
| relicense it, am I culpable? Is this just a new lawsuit
| where I can seek damages for the lawsuit I just lost?
| blackle wrote:
| There is a way to do this same compression by utilizing the raw
| probability distributions that the LLM produces as output.
| Fabrice Bellard has some experiments with doing this with
| transformers: https://bellard.org/nncp/
|
| The idea is that if you can produce an accurate probability
| distribution over the next bit/byte/token, then you can compress
| things with an entropy compressor (Huffman coding, range
| coding, asymmetric numeral systems, etc.). This comment is too
| small of a space to explain fully how they work, but it suffices
| to say that pretty much every good compression algorithm models
| probability distributions in some way.
| FooBarBizBazz wrote:
| Exactly this. The way to do this is to use the LLM as the
| statistical model inside an arithmetic coder. You use its next-
| token probabilities to encode the next token. The KL divergence
| between the LLM's probabilities, and the empirical
| probabilities in the corpus that you actually compress, is the
| compression inefficiency.
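|
| In symbols (a standard identity, not something from the blog
| post): if p is the empirical next-token distribution and q is
| the LLM's, the expected code length per token under an ideal
| entropy coder is the cross entropy
|
|     H(p, q) = H(p) + D_KL(p || q)
|
| so the KL term is exactly the overhead above the entropy
| limit H(p).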
| ilaksh wrote:
| Did I miss it, or did he completely leave out the name of the LLM
| model he used?
| number6 wrote:
| > I'll use llama.cpp via its python bindings.
|
| llama.cpp also has a link to the model
| gliptic wrote:
| llama.cpp isn't a model though.
| mmoskal wrote:
| A problem with this is that the inference has to be 100%
| deterministic. For example, if you change the tiling order of a
| matmul you may get slightly different results. It's easy to
| imagine people changing the tiling in llama.cpp or other
| libraries. You are very unlikely to get bit-exact results from
| different libraries.
|
| Interestingly (though maybe not relevant here) you can also get
| different results from multi-user inference systems depending on
| what other requests are in the batch. It's possible to avoid this
| but I'm pretty sure most systems don't.
|
| The "slightly different" bit of course makes it worse - it will
| work 99% of the time.
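|
| A minimal illustration of the root cause in plain Python (not
| llama.cpp-specific): floating-point addition is not
| associative, so a different reduction or tiling order can
| change logits in the last bits, which is enough to flip a
| token's rank and break the decoder.
|
|     xs = [0.1, 0.2, 0.3]
|     print(sum(xs))            # 0.6000000000000001
|     print(sum(reversed(xs)))  # 0.6 -- same numbers, other order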
| david_draco wrote:
| You could probably set the seed and temperature and store these.
| Running with a few variations and keeping the best would
| probably work even better, at the expense of more computation.
| jxy wrote:
| There was this: Compression Represents Intelligence Linearly,
| https://arxiv.org/html/2404.09937v1
| KTibow wrote:
| On the topic of very strong compression:
| http://prize.hutter1.net/
| personjerry wrote:
| I mean, sure this is compression in the sense that I can send you
| a tiny "compressed text" and all you need is this multi-terabyte
| model to decompress it!
| markisus wrote:
| But we don't usually count the decompression program size when
| evaluating compression ratio. Eg 7-Zip is about 1 MB, but you
| don't think about that when evaluating particular 7z files.
| Filligree wrote:
| We would if it's a multi-gigabyte program the receiver
| doesn't have installed.
| viraptor wrote:
| Maybe not multi-gigabyte, but in a new system/phone in a
| year, you're basically guaranteed to find at least one tiny
| model. We may even get some "standard" model everyone can
| reliably use as a reference.
| Filligree wrote:
| At that point it would be useful. Although, I wonder if
| it wouldn't make more sense to train one specifically for
| the job. Current LLMs can predict HTML, sure, but they're
| large and slow for the task.
| markisus wrote:
| Yeah sometimes compression ratio is not the right question
| when there are other practical concerns like disk space or
| user experience.
|
| But I do want to point out that almost everyone installs at
| least one multigigabyte file to decompress other files, and
| that is the OS.
| vintermann wrote:
| We do when it's the Hutter prize, otherwise it's easy to
| cheat.
|
| But sure, it's a constant factor, so if you compress enough
| data you can always ignore it.
| Legend2440 wrote:
| There's an existing idea called shared dictionary compression.
| Everybody pre-agrees on some statistical priors about the data,
| and you use them to improve the compression ratio.
|
| This is just the gigascale version of that.
| anonu wrote:
| A URL to the text is basically compression.
| recursive wrote:
| In some contexts, although it has an extra dependency, which is
| the web server.
| Intralexical wrote:
| You can already pre-train compression on text without using an
| LLM:
|
|     $ curl https://www.gutenberg.org/cache/epub/11/pg11.txt > text.txt
|     $ split -n 500 text.txt trainpart.
|
| Using a normal compression algorithm:
|
|     $ zstd --train trainpart.* -o dictionary
|     Save dictionary of size 112640 into file dictionary
|     $ zstd -vD dictionary text.txt
|     *** Zstandard CLI (64-bit) v1.5.5, by Yann Collet ***
|     text.txt : 15.41% ( 170 KiB => 26.2 KiB, text.txt.zst)
|
| For this example, ZSTD warns that the dictionary training set is
| 10X-100X too small to be efficient. Realistically, I guess you'd
| train it over e.g. the entire Gutenberg library. Then you can
| distribute specific books to people who already have the
| dictionary.
|
| Or:
|
|     $ curl -L https://archive.org/download/completeworksofl1920carr/completeworksofl1920carr_hocr_searchtext.txt.gz | gzip -d | sed -E 's/\s+/ /g' > FullTextsSample.txt
|     $ zstd -v -19 --patch-from FullTextsSample.txt text.txt
|     text.txt : 16.50% ( 170 KiB => 28.1 KiB, text.txt.zst)
|
| Not sure how much performance would drop for realistic use. But
| there are also some knobs you can tune.
|
| Refer to:
|
| https://github.com/facebook/zstd/#dictionary-compression-how...
|
| https://github.com/facebook/zstd/wiki/Zstandard-as-a-patchin...
|
|     $ man zstd
|
| - The dictionary occupies only kilobytes or megabytes of
| storage, instead of gigabytes or terabytes.
|
| - The dictionary can be re-trained for specific data at
| negligible cost.
|
| - Compression and decompression are deterministic by default.
|
| - It doesn't take a large amount of GPU resources to compress
| or decompress.
|
| - This is actually designed to do this.
| Alifatisk wrote:
| I am so glad I read your comment, so fascinating!
| Terretta wrote:
| Interesting you're showing 15% - 16%, and the LLM technique
| showed 15%.*
|
| (To your point, one of those measures isn't including gigabytes
| of LLM in its size savings, as if it's part of the .exe size
| instead.)
|
| * EDIT to link to discussion further down:
| https://news.ycombinator.com/item?id=40245530
| Intralexical wrote:
| > Interesting you're showing 15% - 16%, and the LLM technique
| showed 15%.*
|
| Yeah. But I don't think it's hinting at any fundamental
| theoretical limit.
|
| Both the LLM and my examples were trained on data _including_
| the full text of Alice in Wonderland, which we're
| "compressing". Probably many copies of it, for the LLM. In
| theory they should both be able to reach 0% (or very close).
|
| So both the blog post and my examples are a bit silly-- Like
| "losslessly compressing" an image by diffing it with a lossy
| JPEG, then claiming a higher compression ratio than PNG/JPXL
| because the compression program is a 1TB binary that bundles
| Sloot-style blurry copies of every known image.
|
| In fact, by just adding `--maxdict=1MB` to my first example,
| `zstd -D` gets down to 13.5%. Probably lower with further
| tweaking. And adding an explicit `cat text.txt >>
| FullTextsSample.txt` brings `zstd --patch-from` down to...
| Uh. 0.02%. 40 bytes total. ...And probably around a third of
| that is headers and checksum... So... Yeah. A bit silly.
|
| I think a better comparison should probably:
|
| - Have a clean separation between training data, and data to
| be compressed. Usually the compressed data should be similar
| to, but not included in, the training data.
|
| - Use the _same_ training data for both the LLM and
| conventional compressor.
|
| - Include the dictionary/model size. And compare methods at
| the same dictionary/model size.
|
| Also, as an aside, the method in the blog post could probably
| also get smaller by storing token probability ranks for most
| of its current explicit letters.
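|
| A sketch of that last suggestion in Python (not the blog
| post's code); `next_token_probs` is a hypothetical stand-in
| for an LLM call returning a dict of token -> probability over
| the full vocabulary:
|
|     def rank_encode(tokens, next_token_probs):
|         # Store each token's rank in the model's sorted
|         # prediction list instead of the token itself; rank 0
|         # means the top guess was right. Low ranks dominate,
|         # so they entropy-code cheaply.
|         ranks = []
|         for i, tok in enumerate(tokens):
|             probs = next_token_probs(tokens[:i])
|             order = sorted(probs, key=probs.get, reverse=True)
|             ranks.append(order.index(tok))
|         return ranks  # feed to a conventional entropy coder
|
|     def rank_decode(ranks, next_token_probs):
|         tokens = []
|         for r in ranks:
|             probs = next_token_probs(tokens)
|             order = sorted(probs, key=probs.get, reverse=True)
|             tokens.append(order[r])
|         return tokens
|
| Decoding only round-trips if both sides run the exact same
| model deterministically, which is the concern raised elsewhere
| in this thread.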
| tednoob wrote:
| Is this method used during training? Seems to me there could be a
| point to only backpropagate when the model is wrong?
| zwaps wrote:
| I mean, this is implicit in backpropagation: you need to
| store gradients anyway, but if you get to a zero loss then
| you are just done.
| leodriesch wrote:
| The model is always wrong, since it predicts a probability
| distribution over all possible tokens, but the target has 100%
| probability for one token and 0 for all others.
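|
| A tiny illustration with made-up numbers: with a one-hot
| target, the cross-entropy loss reduces to -log of the
| probability the model put on the correct token, so it is zero
| only if that probability is exactly 1.
|
|     import math
|
|     probs = {"cat": 0.6, "dog": 0.3, "fish": 0.1}  # prediction
|     target = "cat"                                  # actual token
|     loss = -math.log(probs[target])
|     print(loss)  # ~0.51; 0.0 only if probs["cat"] == 1.0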
| Tiberium wrote:
| Am I missing something, or is there no repo to try it out?
| throwaway81523 wrote:
| A compression method something like this figured into Vernor
| Vinge's 1988-era short story "The Blabber".
| KaoruAoiShiho wrote:
| Can this be used to save on token cost in LLM inferencing?
___________________________________________________________________
(page generated 2024-05-05 23:01 UTC)