[HN Gopher] Using a LLM to compress text
       ___________________________________________________________________
        
       Using a LLM to compress text
        
       Author : alexmolas
       Score  : 94 points
       Date   : 2024-05-03 07:51 UTC (2 days ago)
        
 (HTM) web link (o565.com)
 (TXT) w3m dump (o565.com)
        
       | pronoiac wrote:
        | I expected general-purpose compressors to do better, but I was
        | wrong. For the whole document:
        | 
        |             174355  pg11.txt
        |              60907  pg11.txt.gz-9
        |              58590  pg11.txt.zstd-9
        |              54164  pg11.txt.xz-9
        |              25360  [from blog post]
        
         | EgoIncarnate wrote:
         | Which model did you use?
        
         | Intralexical wrote:
          | IDK why, but BZIP2 seems to do somewhat better than other
          | compression algorithms for natural language text:
          | 
          |     $ curl https://www.gutenberg.org/cache/epub/11/pg11.txt | bzip2 --best | wc
          |           246    1183   48925
          | 
          | Also, ZSTD goes all the way up to `--ultra -22` plus
          | `--long=31` (4 GB window -- irrelevant here, since the file fits
          | in the default 8 MB anyway).
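          | 
          | (For reference, a quick Python sketch that reproduces this kind
          | of comparison with only the standard library -- zstd isn't in
          | the stdlib, so it covers gzip, bzip2 and xz only:)
          | 
          |     import bz2, gzip, lzma, urllib.request
          | 
          |     # Fetch Alice in Wonderland and compare stdlib compressors.
          |     url = "https://www.gutenberg.org/cache/epub/11/pg11.txt"
          |     text = urllib.request.urlopen(url).read()
          | 
          |     print("raw    ", len(text))
          |     print("gzip -9", len(gzip.compress(text, compresslevel=9)))
          |     print("bzip2  ", len(bz2.compress(text, compresslevel=9)))
          |     print("xz     ", len(lzma.compress(text, preset=9)))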
        
           | pastage wrote:
            | You can use a preshared dictionary with bzip2 and zstd, so you
            | can get that down a lot by using different dictionaries
            | depending on certain rules. I don't know if it helps with
            | literature, but I had great success sending databases with
            | free text like that. In the end it was easier to just use one
            | dictionary for everything and skip the rules.
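            | 
            | (A minimal sketch of the preshared-dictionary idea, assuming
            | the third-party `zstandard` Python bindings; the sample and
            | message filenames are placeholders:)
            | 
            |     import zstandard as zstd
            | 
            |     # Train a dictionary on samples both sides already have.
            |     samples = [open(p, "rb").read()
            |                for p in ["rows1.txt", "rows2.txt", "rows3.txt"]]
            |     dictionary = zstd.train_dictionary(112_640, samples)
            | 
            |     # The sender ships only the small compressed payload...
            |     cctx = zstd.ZstdCompressor(level=19, dict_data=dictionary)
            |     payload = cctx.compress(open("message.txt", "rb").read())
            | 
            |     # ...and the receiver decodes it with the same dictionary.
            |     dctx = zstd.ZstdDecompressor(dict_data=dictionary)
            |     original = dctx.decompress(payload)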
        
         | immibis wrote:
         | And here's the opposite - using gzip as a (merely adequate)
         | "large" language model: https://aclanthology.org/2023.findings-
         | acl.426.pdf
         | 
         | By the way, if you want to see how well gzip actually models
         | language, take any gzipped file, flip a few bits, and unzip it.
         | If it gives you a checksum error, ignore that. You might have
         | to unzip in a streaming way so that it can't tell the checksum
         | is wrong until it's already printed the wrong data that you
         | want to see.
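          | 
          | (A small Python sketch of that experiment, assuming a local
          | pg11.txt.gz; the offsets are rough guesses to land inside the
          | deflate payload rather than the header or trailer:)
          | 
          |     import random, zlib
          | 
          |     # Flip one bit in a gzip file, then stream-decompress and
          |     # keep whatever came out before the stream falls apart.
          |     data = bytearray(open("pg11.txt.gz", "rb").read())
          |     pos = random.randrange(64, len(data) - 8)
          |     data[pos] ^= 1 << random.randrange(8)
          | 
          |     d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # gzip framing
          |     out = bytearray()
          |     try:
          |         for i in range(0, len(data), 4096):
          |             out += d.decompress(bytes(data[i:i + 4096]))
          |     except zlib.error:
          |         pass  # corruption usually kills the stream eventually
          |     print(out.decode("latin-1", "replace")[:2000])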
        
         | anonu wrote:
         | Shouldn't you factor in the size of the compression tools
         | needed?
        
           | TeMPOraL wrote:
           | Not if you're interested in compressing more than one file -
           | compression tools are a fixed, one-time cost in size.
        
           | sp332 wrote:
           | For the prize, yes. But it's interesting to see that even a
           | much larger decompressor might not necessarily buy you a
           | smaller file.
        
         | Legend2440 wrote:
         | LLMs blow away traditional compressors because they are very
         | good predictors, and prediction and compression are the same
         | operation.
         | 
         | You can convert any predictor into a lossless compressor by
         | feeding the output probabilities into an entropy coding
          | algorithm. LLMs can get compression ratios as high as 95% (0.4
          | bits per character) on English text.
         | 
         | https://arxiv.org/html/2404.09937v1
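          | 
          | (A sketch of that conversion in Python, under the simplifying
          | assumption of an order-0 letter-frequency "predictor" and a
          | local pg11.txt; an ideal entropy coder spends -log2 p(symbol)
          | bits per symbol, so an LLM's next-token probabilities would
          | slot into predict() the same way:)
          | 
          |     import math
          |     from collections import Counter
          | 
          |     text = open("pg11.txt", encoding="utf-8").read()
          |     counts = Counter(text)
          |     total = sum(counts.values())
          | 
          |     def predict(ch):
          |         # Stand-in for p(next symbol | context).
          |         return counts[ch] / total
          | 
          |     bits = sum(-math.log2(predict(ch)) for ch in text)
          |     print(f"{bits / 8 / 1024:.1f} KiB, {bits / len(text):.2f} bits/char")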
        
           | Der_Einzige wrote:
           | The 1-1 correspondence between prediction and compression is
           | one of the most counterintuitive and fascinating things in
           | all of AI to me.
        
             | mjburgess wrote:
              | It's only equivalent for a very narrow sense of
              | 'prediction', namely modelling conditional probability
              | distributions over known data.
             | 
             | There's no sense, for example, in which deriving a
             | prediction about the nature of reality from a novel
              | scientific theory is 'compression'.
             | 
             | eg., suppose we didn't know a planet existed, and we looked
             | at orbital data. There's no sense in which compressing that
             | data would indicate another planet existed.
             | 
             | It's a great source of confusion that people think AI/ML
             | systems are 'predicting' _novel distributions_ of
             | observations (science), vs., novel observations of _the
             | same_ distribution (statistics).
             | 
             | It should be more obvious that the latter is just
             | compression, since it's just taking a known distribution of
             | data and replacing it with a derivative optimal value.
             | 
             | Science predicts novel distributions based on theories,
             | ie., it says the world is other than we previously
             | supposed.
        
               | palmtree3000 wrote:
               | Sure it is! If we were trying to compress an archive of
               | orbital data, one way to do it would be "initial
               | positions + periodic error correction". If you have the
               | new planet, your errors will be smaller and can be
               | represented in less space at the same precision.
        
               | mjburgess wrote:
               | This is assuming the theory.
               | 
               | A statistical model of orbits, without a theory of
               | gravity, is less compressed when you assume more objects.
               | Take all the apparent positions of objects in the sky,
               | {(object, x1, x2, t),...}. Find a statistical model of
               | each point at t+1, so y = (o, x1, x2, t+1). There is no
               | sense in which you're deriving a new object in the sky
               | from this statistical model -- it is only a compression
               | of observable orbits.
               | 
               | When you say, "if you have the new planet", you're
               | changing the data generating process (theory) to produce
               | a new distribution of points {(o' x1', x2', t'), ...} to
               | include an unseen object. You're then comparing two _data
               | generating models_ (two theories) for their simplicity.
                | You're not comparing the associative models.
               | 
               | Call the prior theory 8-planets, so 8P generates x1,x2,t;
               | and the new theory 9P which generates x1',x2',t'
               | 
               | You're then making a conditional error distribution when
               | comparing two rival theories. The 9P theory will minimize
               | this error.
               | 
               | But in no sense can the 9P theory be derived from the
               | initial associative statistical distribution. You are,
                | based on theory (science, knowledge, etc.), choosing to
                | add a planet, vs., e.g., correcting for measurement error,
                | modifying Newton's laws, changing the angle of the Earth
                | wrt the solar system... or one of _an infinite number_ of
                | theories which all produce the same error minimization.
               | 
               | The sense of "prediction" that science uses (via Popper
               | et al.) is deriving the existence of novel phenomena that
               | do not follow from prior observable distributions.
        
               | Legend2440 wrote:
               | It doesn't matter how your predictor works, whether it's
               | a scientific theory, a statistical model, or a magic
               | oracle. You can always perfectly convert its predictions
               | into compression using entropy coding. The conversion
               | process is black-box.
        
       | Intralexical wrote:
       | The practical way to do this is to train a custom dictionary with
       | a normal compression algorithm.
        
       | clay_the_ripper wrote:
       | It does seem like this is a possible method to test if an LLM has
       | your data in it.
       | 
       | People have found other ways to do that of course, but this is
       | pretty clever.
        
         | mvkel wrote:
         | Not necessarily. This also uncovers the weakness of the NYT
         | lawsuit.
         | 
         | Imagine in your corpus of training data is the following:
         | 
         | - bloga.com: "I read in the NYT that 'it rains cats and dogs
         | twice per year'"
         | 
          | - blogb.com: "according to the NYT, 'cats and dogs level rain
          | occurs 2 times per year.'"
         | 
         | - newssite.com: "cats and dogs rain events happen twice per
         | year, according to the New York Times"
         | 
         | Now, you chat with an LLM trained on this data, asking it "how
         | many times per year does it rain cats and dogs?"
         | 
         | "According to the New York Times, it rains cats and dogs twice
         | per year."
         | 
          | NYT content was never in the training data; however, it -is-
          | mentioned a lot across CommonCrawl-approved sources, and
          | therefore gets a higher next-token probability association.
         | 
         | Zoom that out to full articles quoted throughout the web, and
         | you get false positives.
        
           | refulgentis wrote:
            | They were getting huge chunks of NYT articles out, verbatim.
           | I remember being stunned. Then I remember finding out there
           | was some sort of trick to it that made it seem sillier.
        
             | KaoruAoiShiho wrote:
             | Was it that NYT articles are routinely pirated on reddit
             | comments and the like?
        
               | viraptor wrote:
               | Does it matter? What's the legal view on "I downloaded
               | some data which turns out to be copied from a copyrighted
               | source and it was probably trivial to figure it out, then
               | trained the LLM on it"? I mean, they work on data
               | processing - of course they would expect that if someone
               | responds with 10 paragraphs in reporting style, under a
               | link to NYT... that's just the article.
        
               | KaoruAoiShiho wrote:
               | I dunno, I'm not a lawyer, it might matter.
        
               | evilduck wrote:
               | I genuinely don't know the answer but I can see it being
               | more complicated than "OpenAI purposefully acquired and
               | trained on NYT articles".
               | 
                | If Stack Overflow collects a bunch of questions and
                | comments and exposes them as a big dataset licensed as
                | Creative Commons, but it actually contains quite a bit of
               | copyrighted content, whose responsibility is it to
               | validate copyright violations in that data? If I use
               | something licensed as CC in good faith and it turns out
               | the provider or seller of that content had no right to
               | relicense it, am I culpable? Is this just a new lawsuit
               | where I can seek damages for the lawsuit I just lost?
        
       | blackle wrote:
       | There is a way to do this same compression by utilizing the raw
       | probability distributions that the LLM produces as output.
       | Fabrice Bellard has some experiments with doing this with
       | transformers: https://bellard.org/nncp/
       | 
        | The idea is that if you can produce an accurate probability
        | distribution over the next bit/byte/token, then you can compress
        | things with an entropy coder (Huffman coding, range coding,
        | asymmetric numeral systems, etc). This comment is too
       | small of a space to explain fully how they work, but it suffices
       | to say that pretty much every good compression algorithm models
       | probability distributions in some way.
        
         | FooBarBizBazz wrote:
         | Exactly this. The way to do this is to use the LLM as the
         | statistical model inside an arithmetic coder. You use its next-
         | token probabilities to encode the next token. The KL divergence
         | between the LLM's probabilities, and the empirical
         | probabilities in the corpus that you actually compress, is the
         | compression inefficiency.
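          | 
          | (A minimal sketch of that construction in Python: an exact
          | arithmetic coder driven by a next-symbol model. model() is a
          | stand-in for an LLM's next-token distribution -- here just a
          | fixed toy distribution -- and the Fractions keep it short;
          | real coders use fixed-precision integer arithmetic:)
          | 
          |     import math
          |     from fractions import Fraction
          | 
          |     def model(prefix):
          |         # Stand-in for p(next symbol | prefix).
          |         return [("a", Fraction(1, 2)), ("b", Fraction(1, 4)),
          |                 ("c", Fraction(1, 8)), (" ", Fraction(1, 8))]
          | 
          |     def intervals(probs):
          |         # Slice [0, 1) into one sub-interval per symbol.
          |         lo, out = Fraction(0), {}
          |         for sym, p in probs:
          |             out[sym] = (lo, lo + p)
          |             lo += p
          |         return out
          | 
          |     def encode(text):
          |         low, width = Fraction(0), Fraction(1)
          |         for i, ch in enumerate(text):
          |             lo, hi = intervals(model(text[:i]))[ch]
          |             low, width = low + width * lo, width * (hi - lo)
          |         # Any n-bit fraction inside [low, low + width) names the
          |         # text; n is roughly -log2(width) = sum of -log2 p.
          |         nbits = (2 * width.denominator // width.numerator).bit_length()
          |         return math.ceil(low * 2**nbits), nbits
          | 
          |     def decode(code, nbits, length):
          |         x, out = Fraction(code, 2**nbits), ""
          |         for _ in range(length):
          |             for sym, (lo, hi) in intervals(model(out)).items():
          |                 if lo <= x < hi:
          |                     out, x = out + sym, (x - lo) / (hi - lo)
          |                     break
          |         return out
          | 
          |     msg = "abba cab"
          |     code, nbits = encode(msg)
          |     assert decode(code, nbits, len(msg)) == msg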
        
       | ilaksh wrote:
       | Did I miss it, or did he completely leave out the name of the LLM
       | model he used?
        
         | number6 wrote:
         | > I'll use llama.cpp via its python bindings.
         | 
         | llama.cpp also has a link to the model
        
           | gliptic wrote:
           | llama.cpp isn't a model though.
        
       | mmoskal wrote:
        | A problem with this is that the inference has to be 100%
        | deterministic. For example, if you change the tiling order of a
        | matmul you may get slightly different results. It's easy to
        | imagine people changing tiling in llama.cpp or other libraries.
        | You are very unlikely to get bit-exact results from different
        | libraries.
       | 
       | Interestingly (though maybe not relevant here) you can also get
       | different results from multi-user inference systems depending on
       | what other requests are in the batch. It's possible to avoid this
       | but I'm pretty sure most systems don't.
       | 
       | The "slightly different" bit of course makes it worse - it will
       | work 99% of the time.
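        | 
        | (A toy Python illustration of why this bites: floating-point
        | addition isn't associative, so summing the same numbers in a
        | different order -- as a different matmul tiling would -- can give
        | a slightly different result:)
        | 
        |     import random
        | 
        |     random.seed(0)
        |     xs = [random.gauss(0, 1) for _ in range(100_000)]
        |     # Same numbers, three summation orders, usually three
        |     # slightly different answers.
        |     print(sum(xs))
        |     print(sum(reversed(xs)))
        |     print(sum(sorted(xs)))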
        
         | david_draco wrote:
          | You could probably set the seed and temperature and store
          | these. Running with a few variations and keeping the best would
          | probably work even better, at the expense of more computation.
        
       | jxy wrote:
       | There was this: Compression Represents Intelligence Linearly,
       | https://arxiv.org/html/2404.09937v1
        
       | KTibow wrote:
       | On the topic of very strong compression:
       | http://prize.hutter1.net/
        
       | personjerry wrote:
       | I mean, sure this is compression in the sense that I can send you
       | a tiny "compressed text" and all you need is this multi-terabyte
       | model to decompress it!
        
         | markisus wrote:
         | But we don't usually count the decompression program size when
         | evaluating compression ratio. Eg 7-Zip is about 1 MB, but you
         | don't think about that when evaluating particular 7z files.
        
           | Filligree wrote:
           | We would if it's a multi-gigabyte program the receiver
           | doesn't have installed.
        
             | viraptor wrote:
             | Maybe not multi-gigabyte, but in a new system/phone in a
             | year, you're basically guaranteed to find at least one tiny
             | model. We may even get some "standard" model everyone can
             | reliably use as a reference.
        
               | Filligree wrote:
               | At that point it would be useful. Although, I wonder if
               | it wouldn't make more sense to train one specifically for
               | the job. Current LLMs can predict HTML, sure, but they're
               | large and slow for the task.
        
             | markisus wrote:
             | Yeah sometimes compression ratio is not the right question
             | when there are other practical concerns like disk space or
             | user experience.
             | 
             | But I do want to point out that almost everyone installs at
             | least one multigigabyte file to decompress other files, and
             | that is the OS.
        
           | vintermann wrote:
           | We do when it's the Hutter prize, otherwise it's easy to
           | cheat.
           | 
           | But sure, it's a constant factor, so if you compress enough
           | data you can always ignore it.
        
         | Legend2440 wrote:
         | There's an existing idea called shared dictionary compression.
         | Everybody pre-agrees on some statistical priors about the data,
         | and you use them to improve the compression ratio.
         | 
         | This is just the gigascale version of that.
        
       | anonu wrote:
        | A URL to the text is basically compression.
        
         | recursive wrote:
         | In some contexts, although it has an extra dependency, which is
         | the web server.
        
       | Intralexical wrote:
        | You can already pre-train compression on text without using an
        | LLM:
        | 
        |     $ curl https://www.gutenberg.org/cache/epub/11/pg11.txt > text.txt
        |     $ split -n 500 text.txt trainpart.
        | 
        | Using a normal compression algorithm:
        | 
        |     $ zstd --train trainpart.* -o dictionary
        |     Save dictionary of size 112640 into file dictionary
        | 
        |     $ zstd -vD dictionary text.txt
        |     *** Zstandard CLI (64-bit) v1.5.5, by Yann Collet ***
        |     text.txt             : 15.41%   (   170 KiB =>   26.2 KiB, text.txt.zst)
       | 
       | For this example, ZSTD warns that the dictionary training set is
       | 10X-100X too small to be efficient. Realistically, I guess you'd
       | train it over E.G. the entire Gutenberg library. Then you can
       | distribute specific books to people who already have the
       | dictionary.
       | 
        | Or:
        | 
        |     $ curl -L https://archive.org/download/completeworksofl1920carr/completeworksofl1920carr_hocr_searchtext.txt.gz |
        |         gzip -d |
        |         sed -E 's/\s+/ /g' > FullTextsSample.txt
        | 
        |     $ zstd -v -19 --patch-from FullTextsSample.txt text.txt
        |     text.txt             : 16.50%   (   170 KiB =>   28.1 KiB, text.txt.zst)
       | 
       | Not sure how much performance would drop for realistic use. But
       | there are also some knobs you can tune.
       | 
       | Refer to:
       | 
       | https://github.com/facebook/zstd/#dictionary-compression-how...
       | 
       | https://github.com/facebook/zstd/wiki/Zstandard-as-a-patchin...
       | $ man zstd
       | 
       | - Dictionary occupies only kilobytes or megabytes of storage,
       | instead of gigabytes or terabytes.
       | 
        | - Dictionary can be re-trained for specific data at negligible
        | cost.
       | 
       | - Compression and decompression are deterministic by default.
       | 
        | - Doesn't take a large amount of GPU resources to compress or
        | decompress.
       | 
       | - This is actually designed to do this.
        
         | Alifatisk wrote:
          | I am so glad I read your comment, so fascinating!
        
         | Terretta wrote:
         | Interesting you're showing 15% - 16%, and the LLM technique
         | showed 15%.*
         | 
         | (To your point, one of those measures isn't including gigabytes
         | of LLM in its size savings, as if it's part of the .exe size
         | instead.)
         | 
         | * EDIT to link to discussion further down:
         | https://news.ycombinator.com/item?id=40245530
        
           | Intralexical wrote:
           | > Interesting you're showing 15% - 16%, and the LLM technique
           | showed 15%.*
           | 
           | Yeah. But I don't think it's hinting at any fundamental
           | theoretical limit.
           | 
           | Both the LLM and my examples were trained on data _including_
            | the full text of Alice in Wonderland, which we're
           | "compressing". Probably many copies of it, for the LLM. In
           | theory they should both be able to reach 0% (or very close).
           | 
           | So both the blog post and my examples are a bit silly-- Like
           | "losslessly compressing" an image by diffing it with a lossy
           | JPEG, then claiming a higher compression ratio than PNG/JPXL
           | because the compression program is a 1TB binary that bundles
           | Sloot-style blurry copies of every known image.
           | 
           | In fact, by just adding `--maxdict=1MB` to my first example,
           | `zstd -D` gets down to 13.5%. Probably lower with further
           | tweaking. And adding an explicit `cat text.txt >>
           | FullTextsSample.txt` brings `zstd --patch-from` down to...
           | Uh. 0.02%. 40 bytes total. ...And probably around a third of
           | that is headers and checksum... So... Yeah. A bit silly.
           | 
           | I think a better comparison should probably:
           | 
           | - Have a clean separation between training data, and data to
           | be compressed. Usually the compressed data should be similar
           | to, but not included in, the training data.
           | 
           | - Use the _same_ training data for both the LLM and
           | conventional compressor.
           | 
           | - Include the dictionary/model size. And compare methods at
           | the same dictionary/model size.
           | 
           | Also, as an aside, the method in the blog post could probably
           | also get smaller by storing token probability ranks for most
           | of its current explicit letters.
        
       | tednoob wrote:
       | Is this method used during training? Seems to me there could be a
       | point to only backpropagate when the model is wrong?
        
         | zwaps wrote:
          | I mean, this is implicit in backpropagation: you need to
          | store gradients anyway, but if you get to a zero loss then you
          | are just done.
        
         | leodriesch wrote:
          | The model is always wrong, since it predicts a probability
          | distribution over all possible tokens, but the target has 100%
          | probability for one token and 0 for all others.
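          | 
          | (A tiny Python sketch of that point: the cross-entropy loss
          | against a one-hot target is -log p(correct token), which is
          | zero only if the model puts probability 1.0 on it. The toy
          | distribution below is made up:)
          | 
          |     import math
          | 
          |     probs = {"cat": 0.7, "dog": 0.2, "car": 0.1}  # p(next token)
          |     target = "cat"                                # one-hot target
          |     loss = -math.log(probs[target])
          |     print(loss)  # ~0.357; > 0 unless probs[target] == 1.0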
        
       | Tiberium wrote:
       | Am I missing something, or is there no repo to try it out?
        
       | throwaway81523 wrote:
       | A compression method something like this figured into Vernor
       | Vinge's 1988-era short story "The Blabber".
        
       | KaoruAoiShiho wrote:
       | Can this be used to save on token cost in LLM inferencing?
        
       ___________________________________________________________________
       (page generated 2024-05-05 23:01 UTC)