[HN Gopher] Large Text Compression Benchmark
___________________________________________________________________
Large Text Compression Benchmark
Author : redeux
Score : 63 points
Date : 2024-09-18 11:36 UTC (2 days ago)
(HTM) web link (www.mattmahoney.net)
(TXT) w3m dump (www.mattmahoney.net)
| hyperpape wrote:
| It's worth noting that the benchmark has not been updated as
| frequently over the past several years, and some of the listed
| compressor versions are quite far behind the current releases
| (http://www.mattmahoney.net/dc/text.html#history).
|
| For the one instance I double-checked (zstd), I don't recall it
| making a massive difference, but it did make one (iirc, the
| current version produced slightly smaller output than what was
| listed in the benchmark).
| pella wrote:
| agree,
|
| - in the benchmark: "zstd 0.6.0" (Apr 13, 2016)
|
| - vs. latest zstd v1.5.6 (Mar 30, 2024,
| https://github.com/facebook/zstd/releases)
| londons_explore wrote:
| Only really worth updating if the results are meaningfully
| different though.
|
| Since zstd's format is fixed, I doubt there will be massive
| changes in compression ratio.
| namibj wrote:
| Ehhh, it's the decompressor that's specified, and there's
| considerable wiggle room a'la Trellis modulation in radio
| line coding, due to the interplay of the dictionary coder
| with the entropy coder. Basically, there are many moments
| the dictionary coder has to make non-obvious decisions
| about how to code a sequence, and what matters is that the
| whole content overall has the least entropy, not the least
| length when fed to the entropy coder.
|
| For simple codings like the TLV bitstream in QR codes (the
| L field bit length depends on the T and the QR code size
| (i.e., the max bitstream length possible))[0], you can
| afford to solve for all possibilities via e.g. dynamic
| programming with some mild heuristics to terminate search
| branches that can't possibly code shorter due to the TL
| overhead.
|
| But with an entropy coder like zstd's FSE/tANS, that's not
| remotely as trivial. Making a choice changes the symbol/byte
| histogram of the entropy coder's input sequence, which, if it
| still differs after tANS quantization, can change the length
| of the coded bitstream. The problem is the non-local effect
| that any individual parsing decision has on every symbol's
| coding size.
|
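| To make that concrete, a toy sketch (my own, not zstd's actual
| parser, and it ignores the cost of coding offsets/lengths): two
| legal LZ-style parsings of the same string produce different
| token histograms, and hence different ideal entropy-coded sizes.
|
|      import math
|      from collections import Counter
|
|      def ideal_bits(tokens):
|          # Shannon bound for an entropy coder matched to
|          # the token stream's own histogram.
|          hist = Counter(tokens)
|          n = len(tokens)
|          return sum(-c * math.log2(c / n)
|                     for c in hist.values())
|
|      # Two legal parsings of "aabaabaa":
|      literals = list("aabaabaa")             # bytes only
|      with_match = ["a", "a", "b", "m(3,5)"]  # reuse "aabaa"
|
|      print(round(ideal_bits(literals), 2))    # ~6.49
|      print(round(ideal_bits(with_match), 2))  # 6.0
|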
| BTW, friendly reminder that all-caps URLs will code shorter
| into QR codes, so the code will have larger pixels or be
| smaller.
|
| [0]: there are 5 main modes (i.e., T values): ([0-9])+ coded by
| stuffing 3 digits into 10 bits and trailing 2/1 digits into 7/4
| bits, respectively; ([0-9]|[A-Z]|[:space:]|[\$\%*\\+\-\/\\.,\:])+
| coded by stuffing 2 characters into 11 bits; ISO-8859-1 coded by
| stuffing one character into 8 bits; Kanji by stuffing each
| character into 13 bits; and an ECI mode for iirc generic Unicode
| codepoints. I think the other 3 code points in T are used for
| legacy barcode FNC1 and FNC2, as well as an end-of-content
| marker.
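|
| As a rough sketch of why all-caps helps (using the per-mode bit
| costs above; segment headers are ignored, and the 6-bit cost for
| an odd trailing alphanumeric character plus the example URL are
| my additions):
|
|      ALNUM = set("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
|                  " $%*+-./:")
|
|      def alnum_bits(s):
|          # 2 chars -> 11 bits, odd trailing char -> 6 bits
|          pairs, odd = divmod(len(s), 2)
|          return pairs * 11 + odd * 6
|
|      def byte_bits(s):
|          return len(s) * 8   # ISO-8859-1 mode
|
|      url = "HTTPS://EXAMPLE.COM/PATH"
|      assert all(c in ALNUM for c in url)
|      print(alnum_bits(url))          # 132 bits
|      print(byte_bits(url.lower()))   # 192 bits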
| adgjlsfhk1 wrote:
| I think the compression ratio will be a bit better, but the
| compression and decompression time will be a bunch better.
| shaicoleman wrote:
| - zstd v0.6.0 -22 --ultra: 215,674,670 bytes
|
| - zstd v1.5.6 -22 --ultra: 213,893,168 bytes (~0.826%
| smaller)
| adgjlsfhk1 wrote:
| how does the speed compare?
| pmayrgundter wrote:
| The very notable thing here is that the best method uses a
| Transformer, and no other entry does.
| wmf wrote:
| Text compression and text generation are both based on modeling,
| so the best approach to text modeling works for both.
| sfink wrote:
| Not if part of your definition of "best" includes
| codebook/model size. The best text generator will use a
| massive model. That won't help compression beyond a certain
| point, since the cost of lookup indexes grows with the size
| of the model. "Monkey #384714...872 typewriter output"
| doesn't help when the monkey number is longer than the input.
| londons_explore wrote:
| Transformers run well on GPU's or other hardware accelerators.
| This benchmark doesn't allow GPU's.
|
| That makes it more of a "can I use unsuitable hardware to get
| the job done fast and accurately enough" challenge, rather than
| a pure math puzzle of how to encode data with fewer bytes.
|
| I suspect that's why there is only 1 Transformer entry, and to
| me raises the question whether the rules should be updated to
| allow GPU's now they are fairly commonplace.
| HybridCurve wrote:
| I think it might be a moot point since the transformer run
| times scale very poorly and the algorithm has a symmetric run
| time.
| pama wrote:
| It would be nice to also have a competition of this type where,
| within reasonable limits, the size of the compressor does not
| matter and the material to be compressed is hidden and varied
| over time. For example: up to 10GB compressor size, and the
| dataset is a different random chunk of fineweb every week.
| stephantul wrote:
| That's an interesting idea. I wonder whether it would actually
| lead to radically different solutions, but with bigger
| codebooks.
| nick238 wrote:
| Double your compression ratio for the low, low price of 100,000x
| slower decompression (zstd: 215 MB, 2.2 ns/byte vs. nncp: 106 MB,
| 230 us/byte)!
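|
| A quick back-of-envelope check of those numbers (a sketch; the
| sizes are the approximate enwik9 results quoted above):
|
|      zstd_ns, nncp_ns = 2.2, 230_000   # per byte
|      zstd_mb, nncp_mb = 215.7, 106.0   # compressed MB
|      print(nncp_ns / zstd_ns)   # ~104,545x slower
|      print(zstd_mb / nncp_mb)   # ~2.03x smaller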
|
| The neural network architectures are technically impressive, but
| unless there's some standard compression dictionary that works
| for everything (so the training/compression costs amortize down
| to nil), and silicon architecture dramatically changes to
| compute-on-memory, I don't know if this would ever take off.
| Lossy compression would probably provide huge advantages, but
| then you need to be domain specific, and can't slap it on
| anything.
| lxgr wrote:
| One interesting application could be instant messaging over
| extremely bandwidth constrained paths. I wouldn't be surprised
| if Apple were doing something like this for their satellite-
| based iMessage implementation.
|
| Of course it could also just be a very large "classical" shared
| dictionary (zstd and brotli can work in that mode, for
| example).
| londons_explore wrote:
| Unfortunately, while you want to have encryption on your
| messages, the required authentication signature eclipses the
| message size.
|
| Without signing the messages (and just using a stream
| cipher), you could do it, but an adversary can send garbage
| pretending to be you, which I don't think is acceptable in
| the modern world.
|
| The message headers (message length, from, to) also start to
| dominate, and compressing them is hard (usually it depends on
| various hardware tricks like differing CDMA keys).
| lxgr wrote:
| For such low-bandwidth applications, you'll usually want to
| do the key exchange while you have a faster connection, as
| well as build a dictionary of expected recipients etc.
|
| Once you have pre-exchanged symmetric keys and IVs etc.,
| encryption can be done with zero overhead, and you can choose
| your own trade-off between authentication security and message
| size. Even 4 bytes go a long way (an attacker would need to send
| on the order of 2^31 messages over a very bandwidth-constrained
| channel to impersonate you), and 8 bytes make it safe enough for
| practically all applications.
|
| That way, you can keep the authentication overhead very
| small (on the order of a few bytes), similar to how it's
| done for e.g. SRTP.
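|
| A minimal sketch of the truncated-tag idea (authentication only;
| key exchange, encryption, and replay protection are out of
| scope, and the key and tag size here are assumptions for
| illustration):
|
|      import hashlib, hmac
|
|      KEY = b"exchanged earlier over a fast link"
|      TAG_LEN = 4   # ~2^31 forgeries needed on average
|
|      def seal(msg: bytes) -> bytes:
|          tag = hmac.new(KEY, msg, hashlib.sha256).digest()
|          return msg + tag[:TAG_LEN]   # 4 bytes overhead
|
|      def open_sealed(pkt: bytes) -> bytes:
|          msg, tag = pkt[:-TAG_LEN], pkt[-TAG_LEN:]
|          good = hmac.new(KEY, msg,
|                          hashlib.sha256).digest()[:TAG_LEN]
|          if not hmac.compare_digest(tag, good):
|              raise ValueError("authentication failed")
|          return msg
|
|      pkt = seal(b"omw, eta 10 min")
|      assert open_sealed(pkt) == b"omw, eta 10 min"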
| jeffbee wrote:
| It does seem as though the training cost of these models ought
| to be included in their times, because the compressors are
| useless for any purpose other than compressing English
| Wikipedia.
| gwern wrote:
| In these cases, the models are so small, and have to run on
| very limited CPU, that the training time is going to be
| fairly negligible. The top-listed program, nncp, is just
| 0.06b parameters! https://bellard.org/nncp/nncp_v2.pdf#page=4
| gliptic wrote:
| The training cost is included because these NN compressors
| are running the training while they are compressing. They are
| general compressors, although the specific variants submitted
| here may be tuned a little to Wikipedia (e.g. the pre-
| processors).
| Xcelerate wrote:
| > unless there's some standard compression dictionary that
| works for everything
|
| There is, at least for anything worth compressing. It's called
| the halting sequence. And while it exists, it's also
| uncomputable unfortunately haha.
| Vecr wrote:
| Apparently there are ways of getting the first few binary
| digits, but it's useless at that length.
| Xcelerate wrote:
| Not useless. That's most of mathematics right there.
| abecedarius wrote:
| The point of the contest was not compression tech for
| communication and storage. It was a belief that minimizing log
| loss on large corpora (i.e. message length) was key to AI
| progress. That should remind you of some pretty hot more-recent
| developments.
|
| Here was someone else with a similar pov in the 2000s:
| https://danburfoot.net/research.html
| optimalsolver wrote:
| The issue with this paradigm (see also: Hutter Prize) was its
| insistence on lossless compression.
|
| Intelligence is just as much about knowing what to throw away
| as what to keep. Any nontrivial cognitive system operating in
| a nontrivial physical environment will necessarily have a
| lossy model of the world it's embedded in.
|
| Also of interest, compressionism, a theory of mind based on
| data compression:
|
| https://ceur-ws.org/Vol-1419/paper0045.pdf
| sfink wrote:
| I tend to agree, but is that really a fatal flaw? A lossy
| compression scheme that gets it 90% right only has to
| encode the delta for the remaining 10%. Which is a big
| cost, sure, but the alternative is evaluating how important
| the thrown-away stuff is, and that evaluation is rife with
| subjective value judgements. There are no right answers,
| only differing flavors of wrong ones. It's the question of
| what is important to generalize over, and nobody is ever
| going to agree on that.
| sdenton4 wrote:
| You can turn any lossy compression scheme into a lossless
| scheme by encoding the error... The lossy scheme should be
| aiming for low error, making the error encoding
| progressively smaller as the lossy scheme improves.
|
| What's harder to deal with from a measurement perspective is
| semantic equivalence: calling some kinds of errors zero-cost,
| without having a great way to characterize what, exactly, is
| being lost. But that's kinda what you want for really extreme
| compression: the content is equivalent at a high level, but may
| be a very different byte stream.
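|
| A toy version of that construction (my own sketch): a lossy
| quantizer becomes lossless once the residuals it discards are
| stored alongside it, and the residuals shrink as the lossy
| stage improves.
|
|      def quantize(xs, step=8):       # the lossy stage
|          return [step * round(x / step) for x in xs]
|
|      def encode(xs, step=8):
|          approx = quantize(xs, step)
|          resid = [x - a for x, a in zip(xs, approx)]
|          return approx, resid        # resid is small
|
|      def decode(approx, resid):
|          return [a + r for a, r in zip(approx, resid)]
|
|      xs = [3, 250, 17, 96, 100, 103]
|      assert decode(*encode(xs)) == xs   # lossless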
| vintermann wrote:
| > Intelligence is just as much about knowing what to throw
| away as what to keep.
|
| Sure, but you can see it as two steps. Decide what doesn't
| matter _at all_ and throw it away. Then compress the rest
| losslessly (according to how surprising it is, which is
| basically the only way).
|
| I think a theory of mind based on data compression is
| backwards, for this reason. When you have "data", you've
| already decided what's important. Every time you combine
| "A" and "B" into a set, you have already decided
|
| 1. That they are similar in some way you care about
|
| 2. That they're _distinct_ in some way you care about
| (otherwise, it would be the same element!)
|
| ... and you do this every time you even add two numbers, or
| do anything remotely more interesting with "data". There
| is no "data" without making a-scientific statements about
| what's important, what's interesting, what's meaningful,
| what matters etc.
| abecedarius wrote:
| Your issue applies equally to GPT-3 (and afaik its
| successors): the log likelihood loss which the GPT base
| model was trained on is exactly the "lossless compression"
| length of the training set (up to trivial arithmetic-coding
| overhead).
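|
| Concretely (a tiny sketch with made-up per-token probabilities):
| summing -log2 p over the tokens gives both the base-2 log loss
| and the number of bits an arithmetic coder driven by the model
| would need.
|
|      import math
|      probs = [0.5, 0.25, 0.125, 0.25]  # p(token | prefix)
|      bits = sum(-math.log2(p) for p in probs)
|      print(bits)   # 8.0 bits = log loss = code length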
|
| The question I'd raise is why the route that got traction
| used more ad-hoc regularization schemes instead of the
| (decompressor + compressed) length from this contest and
| the book I linked.
| GaggiX wrote:
| Read: https://opus-codec.org/demo/opus-1.5/
| londons_explore wrote:
| Imagine if we didn't have dedicated hardware for say caching -
| every CPU instruction or bit of data has to be read from SSD
| every use.
|
| We'd see a 100,000x slowdown I expect.
|
| But dedicated hardware (RAM, various cache levels) solved this.
|
| We could do the same for neural net compression if we needed
| to. But the question is: do enough people care enough about a
| 2x data size reduction?
| wmf wrote:
| Dedicated hardware isn't magic because there's an intrinsic
| cost that cannot be reduced. In the case of neural nets you
| have to store the weights somewhere and you have to perform
| the multiplications.
| geysersam wrote:
| Depends on the application! For long-term storage of rarely
| accessed text, it might make sense.
| ezekiel68 wrote:
| This is a good point, and yet -- as high as position 24 (out of
| 208) in the list, we have "bsc-m03 0.4.0" at only ~160ns per
| byte compressed (compared with most above it being two or three
| orders of magnitude slower). Below position 24, there's one in
| the 20s of ns per byte. Looks like the cost was in the size of
| the compression program. Not sure if there was any "cheating"
| going on, hunting for benchmark scores with these two famous
| test bodies of text.
|
| By comparison, the top contender clocks in at ~240,000 ns per
| byte (making your point).
| sfink wrote:
| I'd kind of like to see some sample LLM entries in the list (as
| in, at the bottom of the list). Not because they'd ever be
| competitive, but because it would underscore how they burn
| massive amounts of space for the model with little concern for
| how much of it is "necessary" for predicting text.
|
| (I'm not sure how you would optimally use an LLM for a lossless
| compression task. I guess you could convert the input to a stream
| of single-bit "yes, it guessed right" tokens and multi-bit "no,
| the actual token is X" tokens, and then compress that stream with
| a more traditional entropy encoder? That way the LLM decompressor
| would always be working off a valid prefix. But there are
| probably more clever ways.)
| Sesse__ wrote:
| > I'm not sure how you would optimally use an LLM for a
| lossless compression task.
|
| You have the model output probabilities for each token, and
| then use arithmetic coding driven by that model. (This uses
| optimally few bits per token, under the assumption that your
| model is right.) This is how all the winning entries work,
| AFAIK, except they frequently work per-bit instead of per-
| token.
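|
| A minimal sketch of that scheme (my own toy, exact but
| impractically slow): a tiny adaptive count model stands in for
| the LLM, and an exact arithmetic coder over rationals turns its
| probabilities into a bit count.
|
|      import math
|      from fractions import Fraction
|
|      class CountModel:
|          # Stand-in for an LLM: adaptive symbol counts.
|          def __init__(self, alphabet):
|              self.alphabet = list(alphabet)
|              self.counts = {s: 1 for s in self.alphabet}
|
|          def dist(self):
|              total = sum(self.counts.values())
|              return [(s, Fraction(self.counts[s], total))
|                      for s in self.alphabet]
|
|      def encode(symbols, model):
|          low, width = Fraction(0), Fraction(1)
|          for sym in symbols:
|              c = Fraction(0)
|              for s, p in model.dist():
|                  if s == sym:
|                      low += width * c
|                      width *= p
|                      break
|                  c += p
|              model.counts[sym] += 1
|          # shortest binary fraction in [low, low + width)
|          k = 0
|          while Fraction(1, 2 ** k) > width:
|              k += 1
|          return math.ceil(low * 2 ** k), k   # k-bit code
|
|      def decode(code, k, n, model):
|          value = Fraction(code, 2 ** k)
|          low, width = Fraction(0), Fraction(1)
|          out = []
|          for _ in range(n):
|              target = (value - low) / width
|              c = Fraction(0)
|              for s, p in model.dist():
|                  if c <= target < c + p:
|                      out.append(s)
|                      low += width * c
|                      width *= p
|                      model.counts[s] += 1
|                      break
|                  c += p
|          return "".join(out)
|
|      msg = "abracadabra"
|      code, k = encode(msg, CountModel("abcdr"))
|      out = decode(code, k, len(msg), CountModel("abcdr"))
|      assert out == msg
|      print(len(msg), "symbols ->", k, "bits")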
| sfink wrote:
| Oh duh, that makes a lot more sense. Thanks!
___________________________________________________________________
(page generated 2024-09-20 23:00 UTC)