[HN Gopher] Shannon's Source Coding Theorem (2020)
       ___________________________________________________________________
        
       Shannon's Source Coding Theorem (2020)
        
       Author : ElFitz
       Score  : 90 points
       Date   : 2023-06-18 13:38 UTC (9 hours ago)
        
 (HTM) web link (mbernste.github.io)
 (TXT) w3m dump (mbernste.github.io)
        
       | ElFitz wrote:
        | I was curious as to whether or not there is a way to determine the
       | actual amount of information in a given file (say, an image), and
       | thus a theoretical limit to lossless compression, which is
       | apparently called "Kolmogorov Complexity" [1], and this came up.
       | 
       | [1]: https://en.m.wikipedia.org/wiki/Kolmogorov_complexity
        
         | contravariant wrote:
         | Entropy doesn't make much sense for a single instance. There
         | trivially exists a codec that will translate any particular
          | file into any other. To get a real idea of how much information
          | is in a file you'll need some statistical model of how it may
          | have been generated (thankfully the model doesn't have to be
         | perfect).
         | 
         | You will also want to be a bit careful what assumptions you
         | allow this model to make. In a very real sense a hash is mostly
         | enough information to recover a file if you know it exists
         | somewhere, as torrents and magnet links prove.
        
         | supriyo-biswas wrote:
         | The formula for entropy and the source coding theorem discussed
         | in the article describe exactly that.
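          | 
          | A rough sketch of that calculation in Python -- byte-level
          | empirical entropy under an i.i.d. assumption, so only a crude
          | estimate of the lossless limit (the file name is a placeholder):
          | 
          |     import math
          |     from collections import Counter
          | 
          |     def entropy_bits_per_byte(data: bytes) -> float:
          |         # Shannon entropy H = -sum(p * log2(p)) over byte counts
          |         n = len(data)
          |         return -sum((c / n) * math.log2(c / n)
          |                     for c in Counter(data).values())
          | 
          |     data = open("some_file.bin", "rb").read()  # placeholder
          |     h = entropy_bits_per_byte(data)
          |     print(f"{h:.3f} bits/byte")
          |     print(f"i.i.d. lower bound: ~{h * len(data) / 8:.0f} bytes")
          | 
          | Real files have correlations that this per-byte model ignores,
          | so it's only a first approximation.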
        
           | waveBidder wrote:
           | beyond, in fact, since they're lossy formats
        
         | canjobear wrote:
         | If you have a probability distribution over patterns you are
         | likely to find in a file, then the Shannon entropy gives you
         | the best losslessly compressed filesize you can achieve.
         | 
         | Kolmogorov complexity is the length of the shortest computer
         | program in some programming language that can generate your
         | file. It gives the same number across different programming
          | languages, up to an unknown and likely very large additive
         | constant. It's a more universal notion of compressibility than
         | Shannon entropy, because it doesn't depend on a probability
         | distribution, but it is uncomputable.
         | 
          | The concepts are related because any programming language can
          | be seen as corresponding to a probability distribution such
          | that p(x) \propto 2^-(length of the shortest program that
          | generates x).
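          | 
          | A toy illustration of that first point, in Python (the
          | three-symbol source is made up):
          | 
          |     import math
          | 
          |     p = {"A": 0.5, "B": 0.25, "C": 0.25}             # source
          |     H = -sum(q * math.log2(q) for q in p.values())   # 1.5 bits
          | 
          |     # a prefix-free code that meets the bound for this source
          |     code = {"A": "0", "B": "10", "C": "11"}
          |     avg = sum(p[s] * len(code[s]) for s in p)        # 1.5 bits
          |     print(H, avg)
          | 
          | No lossless code for this source can average fewer bits per
          | symbol than H; that's the source coding theorem in miniature.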
        
         | exhaze wrote:
         | A practical way I used to grok this is to try to zip up an mp4
         | or a jpeg. You'll notice the size doesn't really change. That's
         | because these formats already applied entropy coding
         | compression techniques and have compressed the information very
         | close to the Shannon entropy limit.
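          | 
          | Something like this shows it quickly (the file path is just a
          | placeholder):
          | 
          |     import zlib
          | 
          |     raw = open("photo.jpg", "rb").read()   # already compressed
          |     packed = zlib.compress(raw, 9)
          |     print(len(raw), len(packed), len(packed) / len(raw))
          |     # for JPEG/MP4 input the ratio is typically close to 1.0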
        
           | CorrectHorseBat wrote:
            | JPEG XL supports reversible JPEG transcoding in ~20% less
            | space, which wouldn't be possible if JPEG were very close to
            | the Shannon entropy limit.
        
         | [deleted]
        
         | KRAKRISMOTT wrote:
         | Run your file through every compression algorithm (that doesn't
         | utilize too much prior knowledge i.e. not LLMs) that you can
         | get your hands on. After a certain point it won't be able to
         | get compressed further. It's a good approximation for the limit
         | I think.
        
           | jsenn wrote:
           | This might give a decent estimate of the entropy of the file,
           | but it is a poor approximation of the Kolmogorov complexity
           | in general.
           | 
           | All major compression libraries that I know of essentially
           | work by "factoring out" statistical regularities in the input
           | string--i.e. repeating patterns. But many strings have
           | regularity that can't be captured statistically. For example,
           | the sequence 1, 2, 3, 4, ... has an entropy rate of 1,
           | because all digits and all subsequences of digits occur with
           | equal frequency, so most modern compression libraries are
           | going to be unable to compress it to any significant degree
           | [1]. However, it obviously has a very compact representation
           | as a program in your favourite language.
           | 
           | If you want to go down a bit of a rabbit hole, this video
           | from a researcher in this field gives an overview, as well as
           | some proposed methods for approximating Kolmogorov Complexity
           | in a reasonable way (even though in a technical sense
           | approximating it with known, fixed bounds is I believe
           | impossible): https://www.youtube.com/watch?v=HXM3BUXsY4g.
           | This paper also discusses a theory that decomposes KC into a
           | statistical part (Shannon entropy) and a "computational
           | part": https://csc.ucdavis.edu/~cmg/compmech/pubs/CalcEmergTi
           | tlePag....
           | 
           | [1]: Note that if you encode it as ASCII or even N-bit
           | integers there _will_ be significant redundancy in the
            | encoding. To properly test it you'd have to encode the
           | numbers in a packed binary encoding with no padding.
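            | 
            | A rough way to see the gap (not a rigorous test -- per the
            | footnote, the fixed 4-byte width below still leaves padding
            | redundancy that zlib can exploit):
            | 
            |     import struct, zlib
            | 
            |     # 1, 2, 3, ... packed as fixed-width 4-byte integers
            |     data = b"".join(struct.pack("<I", i)
            |                     for i in range(1, 200001))
            |     print(len(data), len(zlib.compress(data, 9)))
            | 
            | zlib removes some redundancy, but its output stays orders of
            | magnitude larger than the two lines of code that regenerate
            | the data exactly -- and those two lines bound its Kolmogorov
            | complexity from above.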
        
           | ElFitz wrote:
           | I wouldn't have thought of that.
           | 
           | I was more interested in learning about the possible theory
           | behind such a limit, to see if we had a way to actually know
           | how far such optimisations _can_ go. Basically, for any given
           | file, what would be the absolute best we could hope for
           | (without resorting to losing information).
        
             | oh_sigh wrote:
             | Not to be pedantic, but for any given file the answer is 0%
             | compression, due to the pigeonhole principle.
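              | 
              | (The counting behind that: there are 2^n distinct n-bit
              | files, but only 2^0 + 2^1 + ... + 2^(n-1) = 2^n - 1 strings
              | shorter than n bits, so no lossless scheme can shrink every
              | file.)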
        
               | ElFitz wrote:
               | Not at all! Having basically no knowledge in that area,
               | aside from a vague understanding of the purpose and uses
               | of information theory, it is more than welcome!
               | 
                | For those interested but unfamiliar with it:
               | 
               | - https://eprint.iacr.org/2021/1333.pdf
               | 
               | - https://matt.might.net/articles/why-infinite-or-
               | guaranteed-f...
        
             | MauranKilom wrote:
             | That question makes pretty little sense without fixing some
             | constraints like "what other files exist" and "what
              | language are you allowed to use to reproduce them".
             | 
             | In a world where every single file is exactly "abcd", it
             | takes zero bits to represent the file - you don't need any
             | information to know that the file contains "abcd" because
             | all files do.
             | 
             | In theory, the language you use to output the file could
             | have a one-bit coding '0' for "produce the file 'there are
             | comments on HN'" (and all other commands starting with
             | '1...'). So in the world we live in, the smallest a file
             | can be compressed to is 1 bit - if you just use the right
             | language.
             | 
             | None of these are practical in general, but the point is
             | that Kolmogorov Complexity is not meaningful without
             | specifying more assumptions/priors.
        
               | ElFitz wrote:
               | Absolutely, that is part of what I have come to
                | understand while diving into the topic: that I first
               | needed to answer quite a few other questions for that one
               | to even make sense.
               | 
                | Thank you for the examples; they are much clearer and
                | more straightforward than most of those I've come across.
        
               | guimplen wrote:
                | This is a misconception. The main result of Kolmogorov
                | complexity theory is that there is precisely such a
                | "universal prior", which comes simply from the notion of
                | computability.
        
       | teleforce wrote:
        | Shannon is to engineering what Einstein is to science. His
        | works, hobbies, papers and theses laid out the foundations of
        | modern engineering, including digital, computer and communication
        | systems, as well as early forms of AI such as chess engines and
        | language models [1]. Kleinrock once noted that he had to pursue
        | the then-exotic field of queueing theory, which led to packet
        | switching and then the Internet, because most of the popular
        | problems in his research field had already been solved by Shannon
        | (who was also a member of Kleinrock's doctoral committee).
       | 
       | [1]How Claude Shannon Helped Kick-start Machine Learning:
       | 
       | https://www.spectrum.ieee.org/claude-shannon-information-the...
        
         | 082349872349872 wrote:
         | I'm fond of his "mind-reading machine":
         | https://literateprograms.org/mind_reading_machine__awk_.html
         | 
         | (compare Agassi/Becker for the value of single-bit predictions)
        
           | metadat wrote:
           | Wow, what a fascinating concept and read. It's humbling to
           | consider and marvel at what early technology pioneers like
            | Claude Shannon envisioned and accomplished with a minuscule
            | sliver of the computational power that surrounds us today.
           | 
           | I couldn't resist submitting it, especially considering it
           | has not yet been discussed on HN*:
           | 
           | https://news.ycombinator.com/item?id=36385198
           | 
           | (*A different link covering the same subject was submitted 6
           | years ago but never caught on:
           | https://news.ycombinator.com/item?id=14450232)
        
       | nico wrote:
       | In the example in the article, if you want to communicate a
       | series of 5 dice rolls, you can send a binary symbol of the
       | number of each roll (1-6), 5 times. Using 3-bit coding for each
        | number, it would take 15 bits (5 numbers in the sequence, 3 bits
       | each) to communicate a message
       | 
        | But there's an alternative way: instead of coding each number
        | individually, you can code the sequence of 5 numbers as a whole,
        | of which there are 6*5*4*3*2 = 6!/1! = 720 in this case, which can
        | be encoded in binary with 8 bits
       | 
       | So instead of 15 bits, you can use 8
        
         | rododendro wrote:
         | All sequences of 5 numbers in the range [6] should be 6 * 6 *
         | ... * 6 = 6^5, which would give 12.92 bits, so 13. The only way
          | to actually do better than sending every symbol would be run-
          | length encoding or other variable-length encodings, which
          | fortunately do not need to be longer (on average) than the
          | entropy of the distribution.
        
           | contravariant wrote:
            | You don't need variable length; you just need to send the
            | data in bigger blocks. Count all 6^(5n) ways to throw 5 dice n
            | times, order them lexicographically and send the index in
            | 5n*log2(6) bits, rounding up.
           | 
           | Your inefficiency is just the rounding error, which quickly
           | becomes trivial. In theory some block sizes should even
           | coincidentally be close to a whole number of bits.
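            | 
            | A sketch of that index trick in Python (the helper name is
            | just for illustration):
            | 
            |     import math
            | 
            |     def rolls_to_index(rolls):
            |         # lexicographic rank of the sequence, base 6
            |         index = 0
            |         for r in rolls:              # r in 1..6
            |             index = index * 6 + (r - 1)
            |         return index
            | 
            |     rolls = [3, 1, 6, 6, 2] * 4      # n = 4 blocks of 5 rolls
            |     idx = rolls_to_index(rolls)
            |     bits = math.ceil(len(rolls) * math.log2(6))   # 52 bits
            |     print(idx, bits, idx < 2 ** bits)             # it fits
            | 
            | Twenty rolls take 52 bits instead of 60 with a flat 3 bits
            | per roll, and the per-roll overhead shrinks as blocks grow.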
        
           | nico wrote:
           | You are right, it's 6^5 to include sequences with repeating
           | symbols
           | 
            | In any case, you don't need to index the whole number space
           | to transmit all the information
           | 
           | Like you say, you can use 13 bits instead of 15
           | 
           | And using variable length encoding you can reduce it even
           | more
           | 
           | But what is the distribution in this case?
        
             | rododendro wrote:
              | Well, in this case the distribution would simply be a
              | uniform distribution over [1:6], which has entropy 2.58 bits
              | per symbol. That is also the most entropic distribution on a
              | set of size 6. Like others have said, in a real-world
              | scenario you can't actually know the underlying distribution
              | of whatever file you are trying to compress. But since you
              | want to compress it, you are telling the program that the
              | file has interesting information (otherwise it would be just
              | noise), so grammar-based encoding would probably not be
              | best; lz77 or lz78 would perform better. (For example, if
              | you know more about the file you could do even better by
              | transforming it first into some frequency domain and then
              | compressing.)
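              | 
              | A quick check of those entropy numbers in Python (the
              | skewed distribution is made up):
              | 
              |     import math
              | 
              |     def H(p):
              |         return -sum(q * math.log2(q) for q in p if q > 0)
              | 
              |     print(H([1/6] * 6))   # ~2.585 bits: fair die, maximal
              |     p_skew = [0.5, 0.3, 0.1, 0.05, 0.03, 0.02]
              |     print(H(p_skew))      # ~1.8 bits: a loaded die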
        
               | nico wrote:
               | Fascinating. Thank you.
               | 
               | Why do you need to know more about the file to transform
               | to frequency domain?
        
               | rododendro wrote:
               | Because no information is lost if you simply change
                | domain, but you may spot more patterns faster or more
                | easily.
               | For example (even if a bit dated)
               | https://ieeexplore.ieee.org/abstract/document/1614066
               | 
               | There is a lot of research in this field because of the
                | IoT fad/trend. If you want to know more, I would suggest
                | simply picking a paper from IEEE, deep diving into the
                | terms and terminology, and skipping the proofs of most
                | algorithms/theorems.
        
         | AnimalMuppet wrote:
         | > 720, which can be encoded in binary with 8 bits
         | 
         | Even if 720 were the right number, that takes 10 bits, not 8.
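          | 
          | (2^9 = 512 < 720 <= 1024 = 2^10, so an index into 720 outcomes
          | needs ceil(log2(720)) = 10 bits; 6^5 = 7776 outcomes need
          | ceil(log2(7776)) = 13.)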
        
           | nico wrote:
           | Yes, embarrassing mistake
           | 
           | Still 10 is less than 15
           | 
           | Additionally, as someone else pointed out, it would be 6^5
           | sequences, which would need 13 bits
           | 
           | Still 13 is less than 15
           | 
           | But the interesting thing is, you can then index the most
           | commonly used sequences of sequences and compress by a few
           | more bits
           | 
           | You can keep the process going until you have a 2 bit index
           | for the most commonly used sequences (of sequences (of
           | sequences... ))
        
             | AnimalMuppet wrote:
             | Sorry to keep nitpicking, but: In this case, you can't
              | compress further. These are dice rolls; there _aren't_
             | "most commonly used sequences" (if the dice are any good).
        
               | nico wrote:
               | Yes, however, the moment you choose a message size, there
               | is always only a finite set of ways to arrange the
               | symbols in those positions
        
       | activatedgeek wrote:
       | Another excellent book that I recently read was "Information
       | Theory: A Tutorial Introduction" by James V. Stone [1]. This book
       | is a fairly light read and builds up everything from scratch
       | which was very nice.
       | 
        | Shannon's original paper [2] is often touted as being very
        | parseable, but it is always enlightening to read the same basics
        | many times from multiple sources, especially for something as
        | foundational as source coding (and broader information theory),
        | presented in a more modern language.
        | 
        | I also finally discovered the meaning of "differential" in
        | differential entropy [3], the continuous counterpart of entropy
        | for discrete probability masses, something I had always swept
        | under the rug.
       | 
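        | (For reference, the differential entropy of a density f is
        | h(X) = -∫ f(x) log f(x) dx, by analogy with the discrete
        | H(X) = -Σ p(x) log p(x).)
        | 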
        | [1]: https://www.librarything.com/work/16165301/book/242295611
        | [2]: https://ieeexplore.ieee.org/document/6773024
        | [3]: https://sanyamkapoor.com/kb/differential-entropy
        
       ___________________________________________________________________
       (page generated 2023-06-18 23:01 UTC)