[HN Gopher] Dictionary Compression
___________________________________________________________________
Dictionary Compression
Author : todsacerdoti
Score : 30 points
Date : 2022-03-01 18:11 UTC (4 hours ago)
(HTM) web link (kevincox.ca)
(TXT) w3m dump (kevincox.ca)
| bob1029 wrote:
| I've been wondering about this because of a unique business
| system I've been working on... Hypothetically, if you have a
| situation where you are _not allowed_ to delete any prior
| business data (i.e. an append-only event log), what would stop
| you from directly using prior log entries as the basis for some
| sort of incremental/recursive compression dictionary?
|
| Put differently, is there any way to leverage scenarios where you
| have a practically-unlimited dictionary size and willingness to
| retain all data forever? Or, am I playing with some basic
| information theory equation the wrong way in my mind?
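|
| The minimal version of what I'm picturing, using zlib's
| preset-dictionary support (the entries here are made up):
|
|   import zlib
|
|   # Made-up prior log entry and a new entry to compress.
|   history = b'{"event":"order.created","sku":"A-1001"}'
|   entry = b'{"event":"order.created","sku":"B-2002"}'
|
|   c = zlib.compressobj(zdict=history)
|   packed = c.compress(entry) + c.flush()
|
|   # Both sides must agree on the dictionary; since the log is
|   # append-only, the decompressor can rebuild it from prior
|   # entries instead of shipping it.
|   d = zlib.decompressobj(zdict=history)
|   assert d.decompress(packed) + d.flush() == entry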
| [deleted]
| throwaway889900 wrote:
| What you're thinking of is a journaling archiver. Something
| like zpaq.
| lifthrasiir wrote:
| You can have an adaptive dictionary, favoring recently seen
| entries. The dictionary would have to be rebuilt periodically
| in practice, but it has the advantage of being able to cope
| with the ever-changing ("non-stationary") data distribution.
| Sometimes the dictionary doesn't even need to be stored and can
| instead be derived as the data gets decompressed; this is such
| a big deal that many compression algorithms indeed omit the
| dictionary and build it adaptively instead.
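|
| A minimal sketch of the rebuild, using Python's zlib (entries
| are made up). A decompressor that keeps the same window derives
| the dictionary from already-decoded data, i.e. nothing extra
| needs to be stored:
|
|   import zlib
|   from collections import deque
|
|   # Keep only recent entries; seed it so the zdict is never empty.
|   window = deque([b'{"event":"seed"}'], maxlen=64)
|
|   def pack(entry: bytes) -> bytes:
|       # Rebuild the dictionary from recent history on each call;
|       # zlib only honors the last 32 KiB (its deflate window).
|       z = zlib.compressobj(zdict=b"".join(window)[-32768:])
|       out = z.compress(entry) + z.flush()
|       window.append(entry)
|       return out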
| Scene_Cast2 wrote:
| I wonder what the lower bound on the entropy is.
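|
| Back of the envelope: with no model beyond "a set of k
| five-letter strings", you'd need log2 C(26^5, k) bits.
| Assuming the full 12,972-word guess list:
|
|   from math import comb, log2
|
|   N = 26 ** 5   # all possible five-letter strings
|   k = 12972     # assumed: size of the full Wordle guess list
|   print(log2(comb(N, k)) / 8)   # ~18,000 bytes, structure-free
|
| Anything below that is the compressor exploiting the structure
| of English words.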
| lifthrasiir wrote:
| cmix version 19 [1] compressed the wordle wordlist (using the
| golf.horse dataset [2] as the canonical version) into 9,714
| bytes in about three minutes and ~15 GB of memory. This
| obviously excludes the decompressor size; still, my golf.horse
| entry implies an information entropy of at most 11.7 KB.
|
| [1] https://github.com/byronknoll/cmix
|
| [2] http://golf.horse/wordle/list
| mpais wrote:
| paq8px v206 [1] with option "-8art" compresses it to 9,145
| bytes using ~2GB of memory in 82s, mostly due to a better
| pre-training stage.
|
| [1] https://github.com/hxim/paq8px
| _moof wrote:
| _> I would have expected at least some simple domain-specific
| compression._
|
| Why? The site is static, so it's cached and browsers get a 304
| when they come back. For new players the server compresses the
| whole thing in flight, not just the word list.
|
| Doing some fancy domain-specific compression on the word list
| would have added unnecessary complexity (read: places for bugs to
| creep in) and taken time/effort away from more important things.
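|
| Roughly, the returning-player path is plain HTTP revalidation
| (a sketch; the URL is a stand-in):
|
|   import requests
|
|   url = "https://example.com/wordle/main.js"   # stand-in URL
|
|   # First visit: 200 OK, the whole bundle (word list included)
|   # gzip'd in flight by the server.
|   first = requests.get(url)
|
|   # Return visit: the browser revalidates; an unchanged file
|   # yields a bodyless 304, so nothing is recompressed or resent.
|   again = requests.get(url,
|       headers={"If-None-Match": first.headers.get("ETag")})
|   print(again.status_code)   # 304 while the cached copy is valid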
| marcellus23 wrote:
| Yeah, this seems like a classic case of optimization for
| optimization's sake.
| gabagool wrote:
| This is really cool and I love that people are using Wordle to
| explore and explain computer science concepts (to people like
| me).
|
| _> The speed-up is significant, about 20x faster. This makes
| sense because instead of scanning half of all words on average,
| you only need to scan half of the words with the same first
| letter. I'll assume this isn't exactly 13x due to letter
| distribution._
|
| Extremely minor, but I think one would still expect a 26x
| speedup, right? You go from searching half of all words to half
| of 1/26th of all words. Obviously there are more words starting
| with S than X and all, but my question is just in theory.
| naniwaduni wrote:
| Right. The effect of letter distribution, however, is that
| you're more likely to be searching a bucket with more words in
| it. So getting somewhat less than 26x speedup makes sense.
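|
| Concretely (stand-in word list):
|
|   from collections import Counter
|
|   words = ["slate", "stone", "stare", "shine", "crane", "xylyl"]
|   n = Counter(w[0] for w in words)   # bucket sizes by first letter
|   N = len(words)
|
|   # A full scan costs N/2 on average. With buckets, a uniformly
|   # random query lands in bucket i with probability n_i/N and
|   # scans n_i/2, so the expected speedup is N^2 / sum(n_i^2),
|   # which is at most 26 and hit only by a uniform distribution.
|   print(N * N / sum(c * c for c in n.values()))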
| lifthrasiir wrote:
| For reference, it seems that the most publicized attempt at
| compressing the Wordle dictionary is the Game Boy Wordle clone
| [1]. There are also several other independent attempts,
| including the golf.horse Kolmogorov complexity competition [2]
| (where my entry is incidentally in first place ;-).
|
| [1] https://news.ycombinator.com/item?id=30402407
|
| [2] http://golf.horse/wordle/
| iib wrote:
| Isn't this best done with a trie-based approach over the
| dictionary (as in a lexical dictionary)? Am I missing something
| here? Is there some additional compression opportunity due to
| the fact that all the words are the same length?
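|
| (Roughly this; note that with fixed-length words the trie
| doesn't even need end-of-word markers:)
|
|   def build(words):
|       # Nested-dict trie: shared prefixes are stored only once.
|       root = {}
|       for w in words:
|           node = root
|           for ch in w:
|               node = node.setdefault(ch, {})
|       return root
|
|   def contains(root, word):
|       # Every word is exactly five letters, so walking five
|       # edges decides membership with no end-of-word marker.
|       node = root
|       for ch in word:
|           if ch not in node:
|               return False
|           node = node[ch]
|       return True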
| lifthrasiir wrote:
| Tries are generally a good choice if you need to access the
| word list without decompressing it first. If you are looking
| for the best compression possible, though, there may be more
| latent contexts that you don't realize exist but can still
| leverage.
___________________________________________________________________
(page generated 2022-03-01 23:02 UTC)