[HN Gopher] Large Text Compression Benchmark
       ___________________________________________________________________
        
       Large Text Compression Benchmark
        
       Author : redeux
       Score  : 63 points
       Date   : 2024-09-18 11:36 UTC (2 days ago)
        
 (HTM) web link (www.mattmahoney.net)
 (TXT) w3m dump (www.mattmahoney.net)
        
       | hyperpape wrote:
       | It's worth noting that the benchmark has not been updated as
       | frequently for the past several years, and some versions of
       | compressors are quite far behind the current implementations
       | (http://www.mattmahoney.net/dc/text.html#history).
       | 
        | In the one instance I double-checked (zstd), I don't recall it
        | making a massive difference, but it did make one (IIRC, the
        | current version compressed slightly smaller than what was listed
        | in the benchmark).
        
         | pella wrote:
         | agree,
         | 
         | - in the benchmark "zstd 0.6.0" ( Apr 13, 2016 )
         | 
         | - vs. latest zstd v1.5.6 ( Mar 30, 2024
         | https://github.com/facebook/zstd/releases )
        
           | londons_explore wrote:
           | Only really worth updating if the results are meaningfully
           | different though.
           | 
           | Since zstd's format is fixed, I doubt there will be massive
           | changes in compression ratio.
        
             | namibj wrote:
              | Ehhh, it's the decompressor that's specified, and there's
              | considerable wiggle room, à la trellis-coded modulation in
              | radio line coding, due to the interplay of the dictionary
              | coder with the entropy coder. Basically, there are many
              | points where the dictionary coder has to make non-obvious
              | decisions about how to code a sequence, and what matters is
              | that the whole content ends up with the least entropy, not
              | the least length when fed to the entropy coder.
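              | 
              | To make the least-entropy-vs-least-length point concrete, a
              | toy order-0 sketch (hypothetical token streams, nothing
              | zstd-specific): the longer stream entropy-codes smaller
              | because its histogram is more skewed.
              | 
              |     import math
              |     from collections import Counter
              | 
              |     def entropy_bits(symbols):
              |         # ideal order-0 size: sum of -log2(p)
              |         counts = Counter(symbols)
              |         n = len(symbols)
              |         return sum(-c * math.log2(c / n)
              |                    for c in counts.values())
              | 
              |     # two hypothetical outputs a match finder could emit
              |     short_parse = list("abcdefgabcdefgxyz")   # 17 symbols
              |     long_parse = list("ab" * 12)              # 24 symbols
              | 
              |     print(entropy_bits(short_parse))  # ~55.5 bits
              |     print(entropy_bits(long_parse))   # 24.0 bits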
             | 
             | For simple codings like the TLV bitstream in QR codes (the
             | L field bit length depends on the T and the QR code size
             | (i.e., the max bitstream length possible))[0], you can
             | afford to solve for all possibilities via e.g. dynamic
             | programming with some mild heuristics to terminate search
             | branches that can't possibly code shorter due to the TL
             | overhead.
             | 
              | But with an entropy coder like zstd's FSE/tANS, that's not
              | remotely as trivial. Making a choice will change the
              | symbol/byte histogram for the entropy coder's input
              | sequence, which, if it still differs after tANS
              | quantization, can change the length of the coded bitstream.
              | The problem is the non-local effect that any individual
              | coding decision has on a symbol's coding size.
             | 
             | BTW, friendly reminder that all-caps URLs will code shorter
             | into QR codes, so the code will have larger pixels or be
             | smaller.
             | 
              | [0]: there are 5 main modes (i.e., T values): ([0-9])+
              | coded by stuffing 3 digits into 10 bits and trailing 2/1
              | digits into 7/4 bits, respectively;
              | ([0-9]|[A-Z]|[:space:]|[\$\%*\\+\-\/\\.\:])+ coded by
              | stuffing 2 characters into 11 bits; ISO-8859-1 coded by
              | stuffing one character into 8 bits; Kanji by stuffing each
              | into 13 bits; and an ECI mode for iirc generic Unicode
              | codepoints. I think the other 3 code points in T are used
              | for legacy barcode FNC1 and FNC2, as well as an end-of-
              | content marker.
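              | 
              | As a rough sketch of the per-segment data-bit arithmetic
              | above (ignoring the 4-bit T field and the version-dependent
              | L field):
              | 
              |     def numeric_bits(n_digits):
              |         # 3 digits -> 10 bits; trailing 2 -> 7, 1 -> 4
              |         q, r = divmod(n_digits, 3)
              |         return q * 10 + {0: 0, 1: 4, 2: 7}[r]
              | 
              |     def alphanumeric_bits(n_chars):
              |         # 2 chars -> 11 bits; a trailing single char -> 6
              |         q, r = divmod(n_chars, 2)
              |         return q * 11 + r * 6
              | 
              |     print(numeric_bits(8))        # 27
              |     print(alphanumeric_bits(11))  # 61, e.g. "HTTPS://A.B"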
        
             | adgjlsfhk1 wrote:
             | I think the compression ratio will be a bit better, but the
             | compression and decompression time will be a bunch better.
        
           | shaicoleman wrote:
           | - zstd v0.6.0 -22 --ultra: 215,674,670 bytes
           | 
           | - zstd v1.5.6 -22 --ultra: 213,893,168 bytes (~0.826%
           | smaller)
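            | 
            | (If anyone wants to reproduce this without building the two
            | CLIs: a rough sketch against the python-zstandard bindings,
            | which bundle one libzstd release, so it has to be run once
            | per version; "enwik9" is the 1 GB benchmark file.)
            | 
            |     import time
            |     import zstandard
            | 
            |     data = open("enwik9", "rb").read()
            |     cctx = zstandard.ZstdCompressor(level=22)
            |     t0 = time.perf_counter()
            |     out = cctx.compress(data)
            |     elapsed = time.perf_counter() - t0
            |     print(len(out), "bytes in", round(elapsed, 1), "s")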
        
             | adgjlsfhk1 wrote:
             | how does the speed compare?
        
       | pmayrgundter wrote:
       | The very notable thing here is that the best method uses a
       | Transformer, and no other entry does
        
         | wmf wrote:
         | Text compression and text generation are both based on modeling
         | so the best approach to text modeling works for both.
        
           | sfink wrote:
           | Not if part of your definition of "best" includes
           | codebook/model size. The best text generator will use a
           | massive model. That won't help compression beyond a certain
           | point, since the cost of lookup indexes grows with the size
           | of the model. "Monkey #384714...872 typewriter output"
           | doesn't help when the monkey number is longer than the input.
        
         | londons_explore wrote:
          | Transformers run well on GPUs or other hardware accelerators.
          | This benchmark doesn't allow GPUs.
          | 
          | That makes it more of a "can I use unsuitable hardware to get
          | the job done fast and accurately enough" challenge, rather than
          | a pure math puzzle of how to encode data with fewer bytes.
          | 
          | I suspect that's why there is only one Transformer entry, and
          | to me it raises the question of whether the rules should be
          | updated to allow GPUs now that they are fairly commonplace.
        
           | HybridCurve wrote:
           | I think it might be a moot point since the transformer run
           | times scale very poorly and the algorithm has a symmetric run
           | time.
        
       | pama wrote:
        | It would be nice to also have a competition of this type where,
        | within reasonable limits, the size of the compressor does not
        | matter, and the material to be compressed is hidden and varied
        | over time. For example: up to 10 GB compressor size, and the
        | dataset is a different random chunk of fineweb every week.
        
         | stephantul wrote:
          | That's an interesting idea. I wonder whether it would actually
          | lead to radically different solutions, but with bigger
          | codebooks.
        
       | nick238 wrote:
        | Double your compression ratio for the low, low price of 100,000x
        | slower decompression (zstd: 215MB, 2.2 ns/byte vs. nncp: 106MB,
        | 230 us/byte)!
       | 
       | The neural network architectures are technically impressive, but
       | unless there's some standard compression dictionary that works
       | for everything (so the training/compression costs amortize down
       | to nil), and silicon architecture dramatically changes to
       | compute-on-memory, I don't know if this would ever take off.
       | Lossy compression would probably provide huge advantages, but
       | then you need to be domain specific, and can't slap it on
       | anything.
        
         | lxgr wrote:
         | One interesting application could be instant messaging over
         | extremely bandwidth constrained paths. I wouldn't be surprised
         | if Apple were doing something like this for their satellite-
         | based iMessage implementation.
         | 
         | Of course it could also just be a very large "classical" shared
         | dictionary (zstd and brotli can work in that mode, for
         | example).
        
           | londons_explore wrote:
            | Unfortunately, while you want encryption on your messages,
            | the required authentication signature eclipses the message
            | size.
           | 
           | Without signing the messages (and just using a stream
           | cipher), you could do it, but an adversary can send garbage
           | pretending to be you, which I don't think is acceptable in
           | the modern world.
           | 
            | The message headers (message length, from, to) also start to
            | dominate, and compressing them is hard (usually depending on
            | various hardware tricks like differing CDMA keys).
        
             | lxgr wrote:
             | For such low-bandwidth applications, you'll usually want to
             | do the key exchange while you have a faster connection, as
             | well as build a dictionary of expected recipients etc.
             | 
             | Once you have a pre-exchanged symmetric key pair and IVs
             | etc., encryption can be done with zero overhead, and you
             | can choose your own trade-off between authentication
             | security and message size. Even 4 bytes go a long way (an
             | attacker would need to send 2^31 messages over a very
             | bandwidth-constrained channel to impersonate you), and 8
             | bytes make it safe enough for practically all applications.
             | 
             | That way, you can keep the authentication overhead very
             | small (on the order of a few bytes), similar to how it's
             | done for e.g. SRTP.
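              | 
              | A toy sketch of the truncated-tag idea (not Apple's or
              | SRTP's actual construction; assumes the symmetric key was
              | already exchanged over the fast link):
              | 
              |     import hmac, hashlib
              | 
              |     def tag(key, msg, tag_len=4):
              |         # truncated MAC: tag_len bytes of overhead
              |         t = hmac.new(key, msg, hashlib.sha256)
              |         return t.digest()[:tag_len]
              | 
              |     def seal(key, ciphertext, tag_len=4):
              |         return ciphertext + tag(key, ciphertext, tag_len)
              | 
              |     def verify(key, sealed, tag_len=4):
              |         body, got = sealed[:-tag_len], sealed[-tag_len:]
              |         expect = tag(key, body, tag_len)
              |         if not hmac.compare_digest(got, expect):
              |             raise ValueError("bad tag")
              |         return body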
        
         | jeffbee wrote:
         | It does seem as though the training cost of these models ought
         | to be included in their times, because the compressors are
         | useless for any purpose other than compressing English
         | Wikipedia.
        
           | gwern wrote:
           | In these cases, the models are so small, and have to run on
           | very limited CPU, that the training time is going to be
           | fairly negligible. The top-listed program, nncp, is just
           | 0.06b parameters! https://bellard.org/nncp/nncp_v2.pdf#page=4
        
           | gliptic wrote:
           | The training cost is included because these NN compressors
           | are running the training while they are compressing. They are
           | general compressors, although the specific variants submitted
           | here may be tuned a little to Wikipedia (e.g. the pre-
           | processors).
        
         | Xcelerate wrote:
         | > unless there's some standard compression dictionary that
         | works for everything
         | 
         | There is, at least for anything worth compressing. It's called
         | the halting sequence. And while it exists, it's also
         | uncomputable unfortunately haha.
        
           | Vecr wrote:
            | Apparently there are ways of getting the first few binary
            | digits, but it's useless at that length.
        
             | Xcelerate wrote:
             | Not useless. That's most of mathematics right there.
        
         | abecedarius wrote:
         | The point of the contest was not compression tech for
         | communication and storage. It was a belief that minimizing log
         | loss on large corpora (i.e. message length) was key to AI
         | progress. That should remind you of some pretty hot more-recent
         | developments.
         | 
         | Here was someone else with a similar pov in the 2000s:
         | https://danburfoot.net/research.html
        
           | optimalsolver wrote:
           | The issue with this paradigm (see also: Hutter Prize) was its
           | insistence on lossless compression.
           | 
           | Intelligence is just as much about knowing what to throw away
           | as what to keep. Any nontrivial cognitive system operating in
           | a nontrivial physical environment will necessarily have a
           | lossy model of the world it's embedded in.
           | 
           | Also of interest, compressionism, a theory of mind based on
           | data compression:
           | 
           | https://ceur-ws.org/Vol-1419/paper0045.pdf
        
             | sfink wrote:
             | I tend to agree, but is that really a fatal flaw? A lossy
             | compression scheme that gets it 90% right only has to
             | encode the delta for the remaining 10%. Which is a big
             | cost, sure, but the alternative is evaluating how important
             | the thrown-away stuff is, and that evaluation is rife with
             | subjective value judgements. There are no right answers,
             | only differing flavors of wrong ones. It's the question of
             | what is important to generalize over, and nobody is ever
             | going to agree on that.
        
             | sdenton4 wrote:
             | You can turn any lossy compression scheme into a lossless
             | scheme by encoding the error... The lossy scheme should be
             | aiming for low error, making the error encoding
             | progressively smaller as the lossy scheme improves.
             | 
              | What's harder to deal with from a measurement perspective
              | is semantic equivalence: calling some kinds of errors
              | zero-cost, but not having a great way to categorize exactly
              | what was lost. But it's kinda what you want for really
              | extreme compression: the content is equivalent at a high
              | level, but may be a very different byte stream.
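              | 
              | (A toy illustration of the lossy-plus-residual
              | construction; the quantizer stands in for any lossy codec:)
              | 
              |     def lossy(samples):
              |         # stand-in for a real lossy codec
              |         return [16 * round(s / 16) for s in samples]
              | 
              |     def encode(samples):
              |         approx = lossy(samples)
              |         # a better lossy model => residuals closer to 0
              |         # => cheaper to entropy-code
              |         resid = [s - a for s, a in zip(samples, approx)]
              |         return approx, resid
              | 
              |     def decode(approx, resid):
              |         return [a + r for a, r in zip(approx, resid)]
              | 
              |     x = [3, 100, 47, 255, 18]
              |     assert decode(*encode(x)) == x   # lossless round trip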
        
             | vintermann wrote:
             | > Intelligence is just as much about knowing what to throw
             | away as what to keep.
             | 
              | Sure, but you can see it as two steps. Decide what doesn't
              | matter _at all_, and throw it away. Then compress the rest
              | losslessly (according to how surprising it is, basically
              | the only way).
             | 
             | I think a theory of mind based on data compression is
             | backwards, for this reason. When you have "data", you've
             | already decided what's important. Every time you combine
             | "A" and "B" into a set, you have already decided
             | 
             | 1. That they are similar in some way you care about
             | 
             | 2. That they're _distinct_ in some way you care about
             | (otherwise, it would be the same element!)
             | 
             | ... and you do this every time you even add two numbers, or
              | do anything remotely more interesting with "data". There
             | is no "data" without making a-scientific statements about
             | what's important, what's interesting, what's meaningful,
             | what matters etc.
        
             | abecedarius wrote:
             | Your issue applies equally to GPT-3 (and afaik its
             | successors): the log likelihood loss which the GPT base
             | model was trained on is exactly the "lossless compression"
             | length of the training set (up to trivial arithmetic-coding
             | overhead).
             | 
             | The question I'd raise is why the route that got traction
             | used more ad-hoc regularization schemes instead of the
             | (decompressor + compressed) length from this contest and
             | the book I linked.
        
         | GaggiX wrote:
         | Read: https://opus-codec.org/demo/opus-1.5/
        
         | londons_explore wrote:
          | Imagine if we didn't have dedicated hardware for, say, caching
          | - every CPU instruction or bit of data would have to be read
          | from SSD on every use.
         | 
         | We'd see a 100,000x slowdown I expect.
         | 
         | But dedicated hardware (RAM, various cache levels) solved this.
         | 
         | We could do the same for neural net compression if we needed
          | to. But the question is: do enough people care enough about a
          | 2x data size reduction?
        
           | wmf wrote:
           | Dedicated hardware isn't magic because there's an intrinsic
           | cost that cannot be reduced. In the case of neural nets you
           | have to store the weights somewhere and you have to perform
           | the multiplications.
        
         | geysersam wrote:
          | Depends on the application! For long-term storage of rarely
          | accessed text it might make sense.
        
         | ezekiel68 wrote:
         | This is a good point, and yet -- as high as position 24 (out of
         | 208) in the list, we have "bsc-m03 0.4.0" at only ~160ns per
         | byte compressed (compared with most above it being two or three
         | orders of magnitude slower). Below position 24, there's one in
         | the 20s of ns per byte. Looks like the cost was in the size of
         | the compression program. Not sure if there was any "cheating"
         | going on, hunting for benchmark scores with these two famous
         | test bodies of text.
         | 
         | By comparison, the top contender clocks in at ~240,000 ns per
         | byte (making your point).
        
       | sfink wrote:
       | I'd kind of like to see some sample LLM entries in the list (as
       | in, at the bottom of the list). Not because they'd ever be
       | competitive, but because it would underscore how they burn
       | massive amounts of space for the model with little concern for
       | how much of it is "necessary" for predicting text.
       | 
       | (I'm not sure how you would optimally use an LLM for a lossless
       | compression task. I guess you could convert the input to a stream
       | of single-bit "yes, it guessed right" tokens and multi-bit "no,
       | the actual token is X" tokens, and then compress that stream with
       | a more traditional entropy encoder? That way the LLM decompressor
       | would always be working off a valid prefix. But there are
       | probably more clever ways.)
        
         | Sesse__ wrote:
         | > I'm not sure how you would optimally use an LLM for a
         | lossless compression task.
         | 
          | You have the model output probabilities for each token, and
          | then use arithmetic coding driven by that model. (This uses
          | optimally few bits per token, under the assumption that your
          | model is right.) This is how all the winning entries work,
          | AFAIK, except they frequently work per-bit instead of
          | per-token.
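          | 
          | In code, the size such a scheme approaches (skipping the actual
          | arithmetic coder, whose overhead is a few bits for the whole
          | stream) is just the summed log loss; "model" here is any
          | hypothetical next-token predictor:
          | 
          |     import math
          | 
          |     def ideal_size_bits(tokens, model):
          |         # model(prefix) -> dict of next-token probabilities
          |         # (hypothetical predictor API, summing to 1)
          |         bits = 0.0
          |         for i, tok in enumerate(tokens):
          |             p = model(tokens[:i])[tok]
          |             bits += -math.log2(p)   # arithmetic-coding cost
          |         return bits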
        
           | sfink wrote:
           | Oh duh, that makes a lot more sense. Thanks!
        
       ___________________________________________________________________
       (page generated 2024-09-20 23:00 UTC)