[HN Gopher] Ts_zip: Text Compression Using Large Language Models
___________________________________________________________________
Ts_zip: Text Compression Using Large Language Models
Author : Deeg9rie9usi
Score : 110 points
Date : 2023-08-16 20:08 UTC (2 hours ago)
(HTM) web link (bellard.org)
(TXT) w3m dump (bellard.org)
| refulgentis wrote:
| Intriguing promise, but I can't find technical details or source,
| only binary executables. Anyone else have luck?
|
| The "(and hopefully decompress)" would usually make me think it
| was like when people would "compress" sequences into emoji using
| ChatGPT...but its bellard
| sterlind wrote:
| How does this work? I know that LLMs predict the log-likelihood
| of the next token in a sequence.. so I guess you initialize the
| LLM from scratch, say that each token of the input text is the
| Nth most likely possibility, and record what N is for that token?
| And then you do some combination of run-length encoding and
| Huffman trees or something to record the Ns?
|
| Since this is not open-source, but seems simple to do, perhaps
| I'll make my own.
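|
| A minimal Python sketch of that rank idea (next_token_probs is a
| hypothetical stand-in for a real LLM forward pass, and this is not
| necessarily how ts_zip itself works):
|
|     # Turn each token into "how surprising was it" -- its rank in the
|     # model's prediction -- then hand the mostly-small integers to a
|     # generic entropy coder (Huffman, arithmetic, ...).
|     def ranks_from_tokens(tokens, next_token_probs):
|         ranks, context = [], []
|         for tok in tokens:
|             probs = next_token_probs(context)   # dict: token -> probability
|             ordering = sorted(probs, key=probs.get, reverse=True)
|             ranks.append(ordering.index(tok))   # 0 if it was the top guess
|             context.append(tok)
|         return ranks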
| AceJohnny2 wrote:
| This is hilarious.
|
| In compression contest circles, there's long been this concept of
| "AI compression" (long before the current LLM wave), considered
| the upper limit of compression algorithms. This is based on the
| idea of the Dictionary... IIRC, some compression systems include
| the Dictionary as part of the decompressor, to save space in the
| payload, but then that Dictionary is fixed and may be suboptimal
| for the data. Others include it in the payload, so that it's
| optimal for the data; most do the latter.
|
| But you could achieve incredible compression of, say, The
| Complete Works Of Shakespeare if you knew your peer knew about
| The Complete Works Of Shakespeare (had a rich enough dictionary)
| and just send them that designator as the payload! Or, you could
| send them a particular high-resolution picture of a wheat field
| with cypresses under a blue sky with an appropriately specific
| "description".
|
| What Fabrice did is go ahead and put this theoretical idea into
| practice. And that's hilarious.
| [deleted]
| sterlind wrote:
| Notably, I think this would fail on the Wikipedia compression
| contest because that counts the size of the decompression
| program (model in this case), so that it measures Kolmogorov
| complexity. And RWKV is surely too big. A neat thing to think
| about though, and maybe a winning idea if a more space-
| efficient model can be found.
| fsiefken wrote:
| Impressive! But an order of magnitude slower than non-LLM
| compression techniques.
|
| I tried to compare it with the results from the top 9 of this
| enwik8 compression test by Matt Mahoney:
| https://www.mattmahoney.net/dc/text.html
|
| The durilca compressor by Dmitry Shkarin (what happened to him?)
| has been the fastest compressor in the top since its debut in
| 2006.
|
| The original size, enwik8, is 10^8 bytes. The compressed size is
| the compressed enwik8 plus the size of a zip archive containing
| the decompressor.
|
| The bpb value of rwkv_430M on enwik8 is 0.948 bpb, the 7B models
| will be lower, perhaps around 0.800. So if my calculations are
| correct, the LLMs can perform 50% better than the best
| performing conventional compressors excluding the zipped LLM
| decompressor code (I am unsure about the size).
|
| The bpb ratio for each existing program can be calculated as:
| bpb = (compressed enwik8 + decompressor size) / 10^8 x 8
|
| nncp v3.1: compressed size is not provided, so we cannot
| calculate bpb for nncp v3.1 using enwik8.
|
| cmix v19: bpb = (14,837,987 + 223,485) / 10^8 x 8 ≈ 1.205
|
| tensorflow-compress v4: bpb = (15,905,037 + 55,283) / 10^8 x 8
| ≈ 1.272
|
| cmix-hp 10 Jun 2021: compressed size is not provided, so we
| cannot calculate bpb for cmix-hp using enwik8.
|
| fast-cmix: compressed size is not provided, so we cannot
| calculate bpb for fast-cmix using enwik8.
|
| starlit 31 May 2021: bpb = 15,215,107 / 10^8 x 8 ≈ 1.217
|
| phda9 1.8: bpb = (15,010,414 + 42,944) / 10^8 x 8 ≈ 1.205
|
| paq8px_v206fix1: bpb = (15,849,084 + 402,949) / 10^8 x 8 ≈ 1.265
|
| durilca'kingsize: bpb = (16,209,167 + 407,477) / 10^8 x 8 ≈ 1.317
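|
| Or, in code form (using the two entries above whose enwik8 sizes
| are fully listed):
|
|     def bpb(compressed_bytes, decompressor_bytes=0, original_bytes=10**8):
|         # bits per byte = total compressed bits / original size in bytes
|         return (compressed_bytes + decompressor_bytes) * 8 / original_bytes
|
|     print(round(bpb(14_837_987, 223_485), 3))  # cmix v19 -> 1.205
|     print(round(bpb(15_215_107), 3))           # starlit  -> 1.217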
| londons_explore wrote:
| > The bpb value of rwkv_430M on enwik8 is 0.948 bpb
|
| This is an unfair test, because the rwkv_430M model was almost
| certainly trained on wikipedia. Testing using training data is
| a big no-no in the ML world - you get unrealistically good
| results.
|
| To test this properly, it needs to be run on a bit of newly
| written text which is nowhere on the internet. Obviously
| writing 100 MB of human written text which is not already on
| the internet is rather a big task...
| etiam wrote:
| Newly declassified material maybe? Or just newly digitized?
| MaximilianEmel wrote:
| "The same exact GPU model ... must be used for compression and
| decompression."
|
| This seems like a massive issue for actual use. Are there really
| not some workarounds to get deterministic output?
| slimsag wrote:
| Unfortunately, GPU (and even CPU SIMD) floating-point math is
| riddled with cross-platform determinism issues; hardware
| manufacturers intentionally trade determinism for faster math
| operations in general, because although the behavior of floating
| point is defined in IEEE 754, every operation introduces rounding
| error.
|
| Compiler optimizations (and remember, GPU drivers each use
| their own compiler behind the scenes to translate to their
| actual hardware architecture) can alter rounding errors of each
| operation, and parallel execution - which differs from hardware
| to hardware - also affects it.
|
| Some APIs (Cuda?) let you disable all optimizations and there
| are ways to get cross-platform determinism, but in general it's
| much much slower if you want bit-for-bit equality across
| different hardware.
|
| SPIR-V/Vulkan, for example [0], only define an error range in
| ULPs for some operations - not bit-for-bit equality.
|
| [0]
| https://registry.khronos.org/vulkan/specs/1.2-extensions/htm...
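|
| A tiny Python illustration of why bit-exactness is fragile: float
| addition isn't associative, so any change in evaluation order
| (different SIMD width, different GPU reduction tree) can flip
| low-order bits:
|
|     a, b, c = 1e16, -1e16, 1.0
|     print((a + b) + c)   # 1.0
|     print(a + (b + c))   # 0.0 -- the 1.0 is absorbed before it can matter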
| johndough wrote:
| Reproducible results across GPUs are theoretically possible,
| but it would take some effort to implement and be slower. At
| least the primitive operations (addition, multiplication, etc.)
| are there: https://docs.nvidia.com/cuda/floating-
| point/index.html
| tysam_and wrote:
| The reason this works is that LLMs are trained to minimize the
| empirical risk over a large set of data
| (https://en.wikipedia.org/wiki/Empirical_risk_minimization), which
| is where the log-likelihood-based measure comes from, oftentimes
| represented simply as the perplexity.
|
| This is precisely because LLMs learn a representation of the nth-
| order Markov Chain, which by its nature is extraordinarily
| sparse. This allows for excellent compression, and I would assume
| (and have for a while, I suppose) the nearly-linear nature of
| LLMs allows for excellent interpolation between these very sparse
| states.
|
| This lets us 'break' the hard combinatorial problem into one that
| can be solved softly, albeit with the fuzzy error that comes with
| any estimator.
|
| Since computers cannot tractably compress Markov chains past a
| certain point, this allows LLMs some opportunity to outpace
| traditional methods for compression, at least in theory.
|
| Also, this method technically cheats since you need the language
| model to decompress, which counts towards your dictionary size.
| In a truly unbounded case, I would be interested to see which
| method wins. I'm assuming it's the LLM.
|
| Apologies for any factual inaccuracies or such, I am still
| learning many of these things. Many thanks and much love! <3
| :))))
| blovescoffee wrote:
| If you want to learn something cool and related, check out
| Autoencoders (or VAEs). They effectively compress information
| by forming representations of some data.
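|
| A minimal PyTorch-style sketch of the idea (toy sizes, untrained,
| just to show the bottleneck structure):
|
|     import torch.nn as nn
|
|     class TinyAutoencoder(nn.Module):
|         """Squeeze 784-dim inputs through an 8-dim bottleneck and back."""
|         def __init__(self, dim=784, bottleneck=8):
|             super().__init__()
|             self.encoder = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
|                                          nn.Linear(128, bottleneck))
|             self.decoder = nn.Sequential(nn.Linear(bottleneck, 128), nn.ReLU(),
|                                          nn.Linear(128, dim))
|
|         def forward(self, x):
|             z = self.encoder(x)      # the "compressed" representation
|             return self.decoder(z)   # the reconstruction
|
|     # Training minimizes nn.MSELoss() between forward(x) and x.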
| tysam_and wrote:
| Indeed. They are extremely cool! <3 :) I think in a way,
| every neural network works on compression! Though AEs & VAEs
| are a bit more blatant in how they do it. It's just always
| turned around a little, here and there. I have a pet theory
| that no neural network can work without compression (even
| generative ones that have constantly increasing layer depths!
| :D :)))) )
| jandrese wrote:
| I guess they've worked out to always get exactly the same text
| back? Hopefully this won't be a repeat of the photocopiers that
| used a similar compression technique and occasionally substituted
| letters in the copy.
|
| https://www.bbc.com/news/technology-23588202
| cbhl wrote:
| The author, Fabrice Bellard, is well-known for his work on QEMU
| (among other things), but I'd treat the rest of the posts as
| "idea sharing" at an early stage. For example, while BPG
| (Better Portable Graphics) didn't take off, the core idea
| (image formats based on video codecs) can be found in WebP
| (VP8), AVIF (AV1), and HEIF (HEVC/H.265).
|
| For this, I'd look at comparable work in the audio codec space,
| including Lyra (https://arxiv.org/abs/2102.09660) and
| SoundStream (https://arxiv.org/abs/2107.03312).
|
| For text compression, I think it would be novel if you could
| get a lossy but semantically equivalent decompression that was
| resilient to the exact hardware used for inference. I don't
| think that's what happened here, though, given the requirement
| for "exact same GPU and model".
| wood_spirit wrote:
| Fabrice is also well known for creating FFmpeg and for
| discovering an algorithm for computing digits of pi:
| https://en.m.wikipedia.org/wiki/Fabrice_Bellard
|
| He has also held the 'world record' for data compression
| since 2019 with nncp http://mattmahoney.net/dc/text.html#1085
| OGWhales wrote:
| JBIG2 is what the NSO group exploited as part of their zero-
| click iMessage exploit. Google zero has a good write up if
| anyone hasn't heard about it yet--it's super cool:
|
| https://googleprojectzero.blogspot.com/2021/12/a-deep-dive-i...
| ks2048 wrote:
| If you can predict text very well, you can still build a good
| _lossless_ compressor - you just have to encode the _errors_ the
| predictor makes. Fewer errors, less data to store.
| wmf wrote:
| Yeah, using a probabilistic predictor to do lossless
| compression has been known for a few decades.
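|
| The link is direct: an ideal entropy coder spends about -log2(p)
| bits on a symbol the model assigned probability p, so the
| compressed size is just the model's cumulative "surprise". A toy
| Python sketch with an order-0 frequency model (a better predictor,
| e.g. an LLM, lowers the total further):
|
|     import math
|     from collections import Counter
|
|     def ideal_compressed_bits(text):
|         # P(char) = its frequency in the text; an arithmetic coder
|         # driven by this model gets within a couple bits of this total.
|         counts, n = Counter(text), len(text)
|         return sum(-math.log2(counts[ch] / n) for ch in text)
|
|     msg = "the cat sat on the mat"
|     print(len(msg) * 8, round(ideal_compressed_bits(msg)))  # raw vs. ideal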
| albertzeyer wrote:
| > The same exact GPU model and program versions must be used for
| compression and decompression.
|
| This makes the whole approach really difficult to use in
| practice, as with basically any method that relies on 100%
| reproducible floating-point numbers.
|
| What are ways around that? Why do the quantized variants actually
| have the same problem?
| londons_explore wrote:
| Bellard still does great work... But no longer releases anything
| opensource.
|
| In turn, that makes it far less interesting to me.
|
| Especially since the majority of what he releases is more of a
| cool tech demo than a product in its own right. A closed-source
| tech demo really isn't of much use to anyone who doesn't have the
| source code to extend it.
| [deleted]
| DSingularity wrote:
| Why did he stop?
| londons_explore wrote:
| I can't speak for him...
|
| But I stopped open sourcing some of my work when I got
| aggressive users demanding I support them for free. I
| realised I was pouring hours into triaging bugs, explaining
| why PRs weren't good enough, fixing other people's corner
| cases, and settling disputes, without really getting much
| back. Sure, my projects were getting a following and there
| were lots of happy users, but the actual benefit to me of
| that happening was negative.
|
| One middle ground I used for a while was releasing my
| projects as a tar file only, not a git repo. Someone else can
| then make a git repo and maintain the project. Looks like
| Bellard did the same for some projects.
| londons_explore wrote:
| Code is a tool. Often you build a tool to complete a task.
|
| Open Sourcing the tools lets others do _their_ tasks. But
| you run the risk of becoming an unpaid toolmaker and unpaid
| tool maintainer, and getting none of your own tasks done.
|
| In a way, GitHub's biggest open-source committers are the 'gig
| economy' workers of the tech world: paid just the occasional few
| bucks in donations for work that would otherwise be worth
| hundreds of thousands of dollars if done on a salary.
| dakotasmith wrote:
| I recently came across Curio
| (https://github.com/dabeaz/curio) & really appreciated
| their FAQ section on contributing. I plan to use it in any
| future open source projects:
|
| Q: Can I contribute?
|
| A: Curio is not a community-based project seeking
| developers or maintainers. However, having it work reliably
| is important. If you've found a bug or have an idea for
| making it better, please file an issue.
| hn_throwaway_99 wrote:
| My hope is that one day I'll be reincarnated and the new me will
| produce 1/1000 of the output of Fabrice Bellard. Whenever I see
| new posts from him I just think " _How_ can he do so much? "
| dekhn wrote:
| The SI unit for productivity is measured in millibellards
| [deleted]
| optimalsolver wrote:
| Reminder that compression = intelligence:
|
| https://mattmahoney.net/dc/rationale.html
|
| ...which leads to the theory of mind known as "compressionism":
|
| https://ceur-ws.org/Vol-1419/paper0045.pdf
| burtonator wrote:
| There must be a relationship here with Shannon's estimate that
| English has about 1 bit of entropy per character, i.e. that it is
| highly redundant and 'easy' to predict.
|
| A highly advanced AI could compress the text and predict the
| next sequences easily.
|
| This seems like a direct connection like electricity and
| magnetism.
|
| And maybe that's why English needs to be about 1 bit because
| we're not very intelligent.
| [deleted]
| YeGoblynQueenne wrote:
| >> Reminder that compression = intelligence:
|
| If I conclude from this that you mean that gzip is intelligent,
| will I be accused of bad faith?
| ks2048 wrote:
| I would say compression and intelligence are closely related,
| not equal. For example, zstd is a pretty good compressor, but
| does any one think it should be called a pretty good "AI"?
| willis936 wrote:
| Someone in sales with a product that uses zstd.
| dragonwriter wrote:
| Wasn't there a paper recently about a fairly simple wrapper
| around a common compression algorithm beating LLMs on
| classification tasks?
| jmholla wrote:
| I thought that was about LLMs being trained on compressed
| data. But I might be thinking about a different paper.
| dragonwriter wrote:
| Gzip + kNN for text classification:
|
| https://aclanthology.org/2023.findings-acl.426.pdf
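|
| The recipe in that paper is roughly: use gzip'd lengths to build a
| normalized compression distance, then classify with k-nearest
| neighbours. A rough Python sketch (1-NN for brevity; the toy
| labels and examples here are made up):
|
|     import gzip
|
|     def clen(s):
|         return len(gzip.compress(s.encode()))
|
|     def ncd(a, b):
|         # normalized compression distance
|         ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
|         return (cab - min(ca, cb)) / max(ca, cb)
|
|     def classify(query, labeled):   # labeled: list of (text, label)
|         return min(labeled, key=lambda tl: ncd(query, tl[0]))[1]
|
|     train = [("the striker scored a late winner", "sport"),
|              ("the central bank raised interest rates", "finance")]
|     print(classify("midfielder injured before the match", train))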
| junipertea wrote:
| Not LLM, just BERT, also did not actually outperform it.
|
| source: https://kenschutte.com/gzip-knn-paper/
| londons_explore wrote:
| Read the paper...
|
| Zstd, while it might be better than gzip or bzip, is still a
| very poor compressor compared to an ideal compressor (which
| hasn't yet been discovered).
|
| That is why zstd acts like a rather bad AI. Note that if you
| wanted to use zstd as an AI, you would patch the checksum checks
| out of the source code, and you would then feed it a file to
| decompress (The cat sat on the mat), followed by a few bytes of
| random noise.
|
| A great compressor would output: The cat sat on the mat. It
| was comfortable, so he then lay down to sleep.
|
| A medium compressor would output: The cat sat on the mat. bat
| cat cat mat sat bat.
|
| A terrible compressor would output: The cat sat on the mat.
| D7s"/r %we
|
| See how each is using knowledge at a different level to
| generate a completion. Notice also how those few bytes
| generate different amounts of output depending on the
| compressor's level of world understanding, and therefore its
| compression ratio.
| astrange wrote:
| > A terrible compressor would output: The cat sat on the
| mat. Dsr %we3 9T23 }PS{D:rg!@ !jvPSdP$
|
| LLMs are sort of unable to do this because they use a fixed
| tokenizer instead of raw bytes. That means they won't
| output binary garbage even early on (and it saves a lot of
| memory), but it may hurt learning things like capitalization,
| rhyming, etc. that we think are obvious.
| londons_explore wrote:
| Even if you trained an LLM with a simplified tokenizer
| that simply had 256 tokens, one for each possible byte
| value, you would see the same result.
| bbstats wrote:
| I would say compression has some congruency with intelligence.
| tysam_and wrote:
| No. Compression == information, not intelligence. This is a
| common mistake that people make, and is a popular expression
| making the rounds right now, but it is incorrect.
|
| Intelligence _may arise_ from information. Information
| _necessarily arises_ from compression.
|
| It is very similar, to me, to the concept of humors
| (https://en.wikipedia.org/wiki/Humorism). Empirically, it was
| observed that these things had some correlation, and they
| occurred together, but they are not directly causative.
| Similarly with the theory of spontaneous generation
| (https://en.wikipedia.org/wiki/Spontaneous_generation), which
| was another theory spawned (heh, as it were), from casual
| causal correlation (again, as it were).
|
| This can be shown rather easily when you show that the time-
| dynamics of an exemplar system with high information diverge
| from the time dynamics of a system with high intelligence, i.e.
| there is a structural component to the information embedded in
| the system such that the temporal aspects of applied
| information are somehow necessary for intelligence.
|
| I hope to release some work at least tangentially related to
| this within the next few years (though of course it is a bit
| more high-flying and depends on some other in-flight work). If
| we are to attempt to move towards AGI, I really think we need
| to stick with the mathematical basics. Neural networks,
| although they've been made out to be really complicated, are
| actually quite simple in terms of the concepts powering them, I
| believe. That's something I'm currently working on putting
| together and trying to distill to its essence. That will also
| be at least 1-2 years out, and likely before any temporally-
| related work, all going well. In my experience, this all is
| just a personal belief from a great deal of time and thought
| with the minutiae of neural networks. That said, I could perhaps
| be biased with some notion of simplicity, since I've worked
| with them long enough for some concepts to feel comfortably
| second-nature.
| derefr wrote:
| I think what the GP meant is not that _compressed data_ has
| intelligence, but rather that a _compressor_ is necessarily
| "intelligent" in correlation to the degree that it compresses
| well on arbitrary novel datasets it wasn't tuned for. At
| least insofar as we define "intelligence" with measures like
| IQ, that relate to "number of examples of a pattern required
| to notice the pattern."
|
| To compress data, you have to derive some interesting insight
| about that data that allows the data to be projected into a
| lower-dimensional, lower-entropy embedding. (In pop-
| neurological terms, you have to "make a neuron" that
| represents a concept, so that you can re-phrase your dataset
| into a star-schema expressed in its connections to that
| concept.)
|
| The more _different kinds of things_ a single compressor can
| do that for -- without the compressor itself being made
| larger -- the more "intelligent" the compressor is, no?
| bbstats wrote:
| OTOH true intelligence can still poorly compress things.
| tysam_and wrote:
| Any ERM-based method will have this flaw (if it is a flaw
| -- I'd possibly consider it a tradeoff) -- it is the nature
| of compression itself. I would argue, I think, that this
| disrupts the 'information' layer of the hierarchy, not the
| 'intelligence' layer, though that is just my personal 2 cents
| of course.
| leumassuehtam wrote:
| Interesting application for the RWKV family of models.
| lwneal wrote:
| A man goes to prison, and the first night while he's lying in
| bed, he hears someone yell out, "44!", followed by laughter from
| the other prisoners.
|
| Puzzled, he lay back down, but then he heard someone else yell
| out, "72!", followed by even more laughter.
|
| "What's going on?" he asked his cellmate.
|
| "Well, we've all heard every joke so many times, we've given them
| each a number to make it easier."
|
| "Oh," he says, "can I try?"
|
| "Sure, go ahead."
|
| So, he yells out "102!" and the place goes nuts. People are
| whooping and laughing in a hysteria. He looks at his cellmate
| rolling on the ground laughing.
|
| "Wow, good joke, huh?"
|
| "Yeah! We ain't never heard that one before!"
| [deleted]
| ducktective wrote:
| Man fears AGI, AGI fears Fabrice Bellard.
| [deleted]
| optimalsolver wrote:
| Woman inherits the Earth.
| elil17 wrote:
| Idea: lossy text compression, where the LLM replaces unlikely
| words with likelier synonyms.
| GaggiX wrote:
| This is actually a very interesting idea; I hope someone
| implements it somehow. It would be fun to read the same text with
| stronger and stronger lossy compression.
| dekhn wrote:
| up-goer five: explained using only the ten hundred words
| people use the most often
| freedmand wrote:
| Was it run on any text that was not feasibly training data for
| the LLMs? It wouldn't be a fair comparison otherwise
| NobodyNada wrote:
| Every lossless "compression" algorithm makes some strings
| shorter and other strings longer, such that the _average_ space
| savings against all possible strings is neutral at best. This
| is a fundamental consequence of the way information works --
| there's no way to losslessly map (say) 8 bits of input into 7
| bits of output.
|
| The goal of compression, then, is to make common strings
| shorter while making uncommon strings longer. You can think of,
| say, UTF-8 as a simple compression algorithm: it would take 21
| bits per character to encode all Unicode code points with a
| fixed-width encoding, but UTF-8 takes advantage of the fact
| that the most commonly used characters have a lot of leading
| zeroes (at least in English). So the characters of the English
| alphabet require only 8 bits to encode, but uncommon characters
| require up to 32 bits.
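|
| You can see that variable-width behaviour directly in Python:
|
|     for ch in ["e", "é", "€", "𝄞"]:
|         print(ch, len(ch.encode("utf-8")), "bytes")
|     # e -> 1, é -> 2, € -> 3, 𝄞 -> 4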
|
| Thus, I would expect an LLM-based compression algorithm to do
| well on strings that were common in its training data, and make
| strings that were uncommon or absent slightly longer. If it did
| not do that, it would not be a lossless compression algorithm.
| freedmand wrote:
| I don't disagree with anything you said.
|
| I'm saying more that if the compression algorithm is
| benchmarking against "Alice in Wonderland" and has consumed
| the entirety of "Alice in Wonderland" in training the LLM
| (along with popular paragraphs and sentences quoted
| elsewhere), then it might do particularly well at reciting
| lines from that book and thus be able to compress it
| extremely well. I'd be more interested in seeing the
| compression algorithm's performance on new or unreleased
| works that would have no way of being training data.
|
| As an extreme hypothetical, I could make a compression
| algorithm that is a table mapping an ID to an entire book and
| fill it with all the popular works. "Alice in Wonderland"
| would be replaced with a single short identifier string and
| achieve a ~0.001% compression ratio. An unseen work would be
| replaced with an <unknown> ID followed by the entire work and
| be slightly bigger. Then, I benchmark only the popular works
| and show insanely impressive results!
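|
| That extreme hypothetical, in Python (a lookup-table "compressor"
| whose dictionary is the benchmark set; the snippet of text here is
| just a stand-in):
|
|     # The dictionary *is* the training set.
|     KNOWN_WORKS = {1: "Alice was beginning to get very tired of ..."}
|
|     def compress(text):
|         for work_id, body in KNOWN_WORKS.items():
|             if body == text:
|                 return b"\x01" + work_id.to_bytes(4, "big")  # ~5 bytes
|         return b"\x00" + text.encode()          # unknown: 1 byte *bigger*
|
|     def decompress(blob):
|         if blob[:1] == b"\x01":
|             return KNOWN_WORKS[int.from_bytes(blob[1:5], "big")]
|         return blob[1:].decode()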
|
| I have no doubt the LLM compressor would do really well on
| unseen works based on what you said above, but it's not a
| fair look at its performance to run it on works it may have
| been explicitly trained on.
| thomasahle wrote:
| Exactly. This is indeed always a pitfall when benchmarking
| LLM-based techniques. The enwik8 dataset they use, for example,
| is for sure in the training data.
|
| To know how the method performs on novel data, the authors have
| to come up with entirely new datasets, since anything already
| existing must be assumed probably contaminated.
| makapuf wrote:
| But doesn't the size in benchmarks include the size of the
| binary decoder? That way the embedded trained data is accounted
| for (preventing a plain copy of Wikipedia from being included
| in the decoder).
| loeg wrote:
| I don't think the compressed size statistics on this
| webpage include the size of the LLM needed to decode. Some
| of these inputs are only a few hundred kB -- LLMs absolutely
| dwarf that.
| londons_explore wrote:
| There are plenty of usecases for very slow and very good text
| compression.
|
| Imagine you run a big email server for millions of users.
|
| Any email that hasn't been accessed for a month is probably a
| good candidate for compressing with this method.
|
| In the unlikely event the user wishes to read an old email again,
| I'm sure they won't mind waiting 0.1 seconds for 10 kilobytes of
| email (i.e. a screenful) to decompress.
| otabdeveloper4 wrote:
| Old email needs to be indexed for search, and those indexes
| might be as large as the original text.
| CharlesW wrote:
| > _Any email that hasn 't been accessed for a month is probably
| a good candidate for compressing with this method._
|
| You don't think "the same exact GPU model and program versions
| must be used for compression and decompression" rules it out as
| more than an interesting experiment, especially given that
| storage is cheap and continues to get cheaper?
| londons_explore wrote:
| I think they mean the exact same model data (for the GPU). I
| don't think it requires the exact same model of GPU.
| CharlesW wrote:
| I'm also reading it as "the same exact GPU [LLM] and
| program versions must be used for compression and
| decompression".
| eklitzke wrote:
| As soon as someone uses a POP or IMAP client to fetch all of
| their mail your server is going to catch on fire. Even if you
| had POP/IMAP disabled, the computation costs to decode data are
| enormous with this method, which is precisely why people don't
| use the most maximally compressing codecs for anything other
| than true archival applications (e.g. database backups which,
| unlike old email, really are expected to be accessed almost
| never).
| londons_explore wrote:
| Sounds like a good reason to just ship the compressed data to
| the customer and say "decompress it yourself - this bellard
| guy makes a decompressor!"
| etiam wrote:
| _That_ sounds like a recipe for really poor relations with
| the sizeable fraction of customers who are more concerned
| with getting their data back than delighted to be
| introduced to a cutting-edge text compression method...
| londons_explore wrote:
| A full IMAP sync of a 20 gigabyte inbox from gmail takes ~12
| hours.
|
| That's 400 kilobytes per second. And much of that is image
| attachments and stuff - perhaps only 10 kilobytes/second of
| text.
|
| Maybe Google already does this?
| ttoinou wrote:
| Aren't the LLMs supposed to have "seen" those text corpora
| before? I would expect that a very small prompt could generate
| the whole requested text.
___________________________________________________________________
(page generated 2023-08-16 23:00 UTC)