[HN Gopher] Dictionary Compression
       ___________________________________________________________________
        
       Dictionary Compression
        
       Author : todsacerdoti
       Score  : 30 points
       Date   : 2022-03-01 18:11 UTC (4 hours ago)
        
 (HTM) web link (kevincox.ca)
 (TXT) w3m dump (kevincox.ca)
        
       | bob1029 wrote:
        | I've been wondering about this for a unique business system I've
        | been working on... Hypothetically, if you have a situation where
        | you are _not allowed_ to delete any prior business data (i.e. an
        | append-only event log), what would stop you from directly using
        | prior log entries as the basis for some sort of incremental or
        | recursive compression dictionary?
        | 
        | Put differently, is there any way to leverage scenarios where you
        | have a practically-unlimited dictionary size and a willingness to
        | retain all data forever? Or am I applying some basic information
        | theory the wrong way in my head?
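        | 
        | Concretely, something like this toy sketch is what I have in
        | mind, using only Python's stdlib zlib and its preset-dictionary
        | (zdict) parameter, with made-up log entries; note that DEFLATE
        | can only see the last 32 KB of whatever dictionary you hand it:
        | 
        |   import zlib
        | 
        |   # Hypothetical append-only log (stand-in for real entries).
        |   log = [b'{"user": %d, "act": "login"}' % i
        |          for i in range(100)]
        | 
        |   # Use (up to) the last 32 KB of prior entries as the preset
        |   # dictionary; DEFLATE cannot look back further than that.
        |   zdict = b"".join(log)[-32768:]
        | 
        |   entry = b'{"user": 100, "act": "logout"}'
        |   co = zlib.compressobj(zdict=zdict)
        |   packed = co.compress(entry) + co.flush()
        | 
        |   # A reader rebuilds the identical dictionary from the log it
        |   # has already decoded, so the dictionary is never stored.
        |   do = zlib.decompressobj(zdict=zdict)
        |   assert do.decompress(packed) == entry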
        
         | throwaway889900 wrote:
         | What you're thinking of is a journaling archiver. Something
         | like zpaq.
        
         | lifthrasiir wrote:
          | You can have an adaptive dictionary, favoring recently seen
          | entries. The dictionary would have to be rebuilt periodically
          | in practice, but it has the advantage of being able to cope
          | with an ever-changing ("non-stationary") data distribution.
          | Sometimes the dictionary doesn't even need to be stored, and
          | can instead be derived as the data gets decompressed; this is
          | such a big deal that many compression algorithms omit a stored
          | dictionary entirely and build one adaptively instead.
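          | 
          | A toy sketch of that adaptive scheme (stdlib zlib, made-up
          | messages): both sides rebuild the dictionary from the entries
          | they have already processed, so it is never transmitted:
          | 
          |   import zlib
          | 
          |   WINDOW = 32768  # DEFLATE's maximum usable dictionary size
          | 
          |   def current_dict(seen):
          |       # Favor recent entries: the newest bytes sit closest
          |       # to the edge of the compression window.
          |       return b"".join(seen)[-WINDOW:]
          | 
          |   def make(ctor, zd):
          |       # Empty dict just means "no preset dictionary yet".
          |       return ctor(zdict=zd) if zd else ctor()
          | 
          |   def send(entries):
          |       seen, out = [], []
          |       for e in entries:
          |           co = make(zlib.compressobj, current_dict(seen))
          |           out.append(co.compress(e) + co.flush())
          |           seen.append(e)
          |       return out
          | 
          |   def receive(blobs):
          |       seen = []
          |       for b in blobs:
          |           do = make(zlib.decompressobj, current_dict(seen))
          |           seen.append(do.decompress(b))
          |       return seen
          | 
          |   msgs = [b'{"event":"tick","n":%d}' % i for i in range(50)]
          |   assert receive(send(msgs)) == msgs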
        
       | Scene_Cast2 wrote:
       | I wonder what the lower bound on the entropy is.
        
         | lifthrasiir wrote:
          | cmix version 19 [1] compressed the Wordle word list (using the
          | golf.horse dataset [2] as the canonical version) into 9,714
          | bytes in about three minutes, using ~15 GB of memory. That
          | figure excludes the decompressor size; my golf.horse entry,
          | which does include its own decompressor, still bounds the
          | information content at no more than about 11.7 KB.
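          | 
          | As a crude baseline (my own back-of-the-envelope, assuming the
          | commonly cited 12,972-word full list): if the words were an
          | arbitrary subset of all 26^5 five-letter strings, just naming
          | the subset would take log2 C(26^5, 12972) bits:
          | 
          |   import math
          | 
          |   N = 26 ** 5  # all possible five-letter strings
          |   K = 12972    # full Wordle list (answers + allowed guesses)
          | 
          |   # Bits to specify an arbitrary K-element subset of N items.
          |   bits = math.log2(math.comb(N, K))
          |   print(bits / 8 / 1024)  # ~17.9 KiB
          | 
          | Compressors get under 10 KB because the real words are far
          | from a random subset, so there is structure left to model
          | beyond "some set of five-letter strings".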
         | 
         | [1] https://github.com/byronknoll/cmix
         | 
         | [2] http://golf.horse/wordle/list
        
           | mpais wrote:
           | paq8px v206 [1] with option "-8art" compresses it to 9,145
           | bytes using ~2GB of memory in 82s, mostly due to a better
           | pre-training stage.
           | 
           | [1] https://github.com/hxim/paq8px
        
       | _moof wrote:
       | _> I would have expected at least some simple domain-specific
       | compression._
       | 
       | Why? The site is static, so it's cached and browsers get a 304
       | when they come back. For new players the server compresses the
       | whole thing in flight, not just the word list.
       | 
       | Doing some fancy domain-specific compression on the word list
       | would have added unnecessary complexity (read: places for bugs to
       | creep in) and taken time/effort away from more important things.
        
         | marcellus23 wrote:
         | Yeah, this seems like a classic case of optimization for
         | optimization's sake.
        
       | gabagool wrote:
       | This is really cool and I love that people are using Wordle to
       | explore and explain computer science concepts (to people like
       | me).
       | 
       |  _> The speed-up is significant, about 20x faster. This makes
       | sense because instead of scanning half of all words on average,
       | you only need to scan half of the words with the same first
       | letter. I'll assume this isn't exactly 13x due to letter
       | distribution._
       | 
        | Extremely minor, but I think one would still expect a 26x
        | speedup, right? You go from searching half of all words to half
        | of 1/26th of all words. Obviously there are more words starting
        | with S than with X, but my question is just about the theory.
        
         | naniwaduni wrote:
         | Right. The effect of letter distribution, however, is that
         | you're more likely to be searching a bucket with more words in
         | it. So getting somewhat less than 26x speedup makes sense.
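          | 
          | A quick way to see it with toy numbers (not the real Wordle
          | letter distribution): the expected fraction of the list you
          | scan is proportional to the sum of squared bucket
          | probabilities, which is smallest when the buckets are uniform:
          | 
          |   # Speedup of bucket-scan over plain linear scan (the
          |   # average "half the list" factors cancel): 1 / sum(p^2).
          |   def speedup(p):
          |       return 1.0 / sum(x * x for x in p)
          | 
          |   uniform = [1 / 26] * 26
          |   print(speedup(uniform))  # exactly 26.0
          | 
          |   # Made-up skew: one popular first letter (think 's').
          |   skewed = [0.12] + [0.88 / 25] * 25
          |   print(speedup(skewed))   # ~22, already below 26x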
        
       | lifthrasiir wrote:
        | For reference, probably the most publicized attempt at
        | compressing the Wordle dictionary is the Game Boy Wordle clone
        | [1]. There are also several other independent attempts, including
        | the golf.horse Kolmogorov complexity competition [2] (where my
        | entry is incidentally in first place ;-).
       | 
       | [1] https://news.ycombinator.com/item?id=30402407
       | 
       | [2] http://golf.horse/wordle/
        
         | iib wrote:
            | Isn't this best done with a trie-based approach (dictionary
            | as in lexical dictionary)? Am I missing something here? Is
            | there some additional compression opportunity due to the fact
            | that all the words are the same length?
        
           | lifthrasiir wrote:
            | Tries are generally a good choice if you need to access the
            | word list without first decompressing it. If you are looking
            | for the best compression possible, though, there may be more
            | latent contexts that you don't realize exist but that a
            | compressor can still exploit.
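            | 
            | For illustration (not how any of the linked entries actually
            | store the list), a minimal fixed-length trie in Python:
            | since every word has length five, every valid path ends at
            | depth five and no end-of-word marker is needed:
            | 
            |   def build_trie(words):
            |       root = {}
            |       for w in words:
            |           node = root
            |           for ch in w:
            |               node = node.setdefault(ch, {})
            |       return root
            | 
            |   def contains(trie, word):  # assumes len(word) == 5
            |       node = trie
            |       for ch in word:
            |           node = node.get(ch)
            |           if node is None:
            |               return False
            |       return True
            | 
            |   t = build_trie(["crane", "crate", "slate"])
            |   assert contains(t, "crate") and not contains(t, "cratz")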
        
       ___________________________________________________________________
       (page generated 2022-03-01 23:02 UTC)