[HN Gopher] The $5000 Compression Challenge (2001)
___________________________________________________________________
The $5000 Compression Challenge (2001)
Author : ekiauhce
Score : 102 points
Date : 2024-11-23 21:10 UTC (1 days ago)
(HTM) web link (www.patrickcraig.co.uk)
(TXT) w3m dump (www.patrickcraig.co.uk)
| andrei-akopian wrote:
| I thought he would try to net a profit through probability by
| requesting that the file be regenerated, hoping to get a
| compressible one within 5000/100 = 50 tries. (Though I don't
| think that would work.)
| its-summertime wrote:
| > I think that I did not make a mistake except to believe that
| you truly intended to attempt to compress the data I was sending
| you.
|
| Weird phrase to hear from a digital carnival scam artist.
| Supermancho wrote:
| > It's not my fault that a file system uses up more space storing
| the same amount of data in two files rather than a single file.
|
| It's not interesting to see people play word games with each
| other in the HN comments, so why is it interesting to see this
| kind of disingenuous behavior? Trolls have always existed. I
| don't get what's interesting about it.
| recursive wrote:
| The act of offering a cash prize to prove you wrong kind of
| makes you look like a jerk. It's assumed that you will use
| every loophole at your disposal. Seeing the situation reversed
| so elegantly makes me root for the challenger.
| ncruces wrote:
| Who's the bigger troll?
|
| Putting up the initial challenge, especially with the
| accompanying commentary, is just as much trolling as that.
| stavros wrote:
| As Patrick said, if he had gotten a 1000-byte file and sent a
| 999-byte file/decompressor file back, nobody would have counted
| the inode data as a part of the solution. Why count it now?
| Supermancho wrote:
| Because the solution specifically was for 1 file submitted, 2
| files back (compressed and decompressor). This topic is the
| next-level trolling that went on. The summary and the newsgroup
| FAQ explained why these file metagames aren't useful. They
| obscure the lack of a compression strategy (e.g. if the inode
| equivalent was 2ft of tape) by nature of dispersing
| information into the fs.
| stavros wrote:
| > Because the solution specifically was for 1 file
| submitted, 2 files back (compressed and decompressor).
|
| Patrick:
|
| I meant can I send you a decompressor and several
| compressed files whose total file size is less than the
| original uncompressed file and from which I can regenerate
| the original uncompressed file
|
| Mike:
|
| Sure.
| Timwi wrote:
| I do agree that this last sentence (the one you quoted) is very
| unfortunate and doesn't match the wit of the rest. A better
| closing remark might have been something along the lines of, "I
| never claimed to have performed any compression. I only claimed
| to have met the terms of the challenge. I asked if I can send
| multiple files and the response was 'sure'. I provided the
| files; they meet the total file size criterion; and the script
| correctly reconstructs the original files."
| fxtentacle wrote:
| This guy clearly failed because he didn't actually do any
| compression; he just abused the filesystem to store parts of the
| data and then tried to argue that metadata was not data...
|
| But FYI someone else actually managed to compress that exact same
| data: https://jasp.net/tjnn/2018/1.xhtml
| recursive wrote:
| That argument was made, and addressed in the emails. How are
| you defining compression? Actual compression doesn't seem to
| have been a requirement in the challenge.
| maxmcd wrote:
| Is there any more information about this solution?
| stavros wrote:
| If Mike didn't want multiple files, he should have said no to
| multiple files. The fact that he wanted $100 per try shows that
| he just got outhustled at his own hustle. He should have been
| more particular about the rules, and upheld the ones he himself
| set.
| PaulHoule wrote:
| Should have charged $100 a file.
| Timwi wrote:
| He clearly failed at doing any actual compression, but he did
| not fail at the challenge. He satisfied every single condition.
|
| He obviously knows that. He knows exactly what compression is
| and that his solution is not it. He showed (successfully) that
| meeting the challenge as it was worded did not require
| compression, which the setter of the challenge didn't realize.
| elahieh wrote:
| I'm rather skeptical of this claim that the data was compressed
| in 2018 because there is no further information, apart from a
| hash value given.
|
| If it's a true claim they must have identified some "non-
| random" aspect of the original data, and then they could have
| given more info.
| johnfn wrote:
| This is hilarious, but Mike's behavior boils my blood. To switch
| the rules when fairly beaten is just so scummy!
|
| Of course Patrick's solution occurred to me immediately once he
| started talking about splitting up the file. :)
|
| Then I got to wondering if it would be possible to compress a
| large enough file and save a few bytes, in the spirit of what
| Mike actually wanted. For instance, if you tried encrypting with
| a bunch of random keys, you'll find some such that the result
| ends with n 0s (say), so then your result could be something like
| encrypted + key + n. Typically you'd expect the length of the
| encryption key to be roughly the same as the length of the run
| of 0s, so you wouldn't actually net any savings. But I think if
| you searched long enough, you might "get lucky" and find a key
| that generated more 0s than the length of the key, so you could
| theoretically win the challenge. Of course you'd have to
| somehow embed a program which could decrypt and reassemble
| everything in a tiny amount of space, but it still doesn't seem
| impossible like Mike seems to suggest. Perhaps just requires an
| intractable amount of computing power. :)
|
| I'm not sure if any of that made sense. I think I'm using the
| wrong terms for things.
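|
| A toy sketch of that search in Python, with a hash-based
| keystream standing in for a real cipher (the key size, data
| size, and trial count are made up for illustration):
|
|   import hashlib, os
|
|   def keystream(key: bytes, n: int) -> bytes:
|       # hash-based stand-in for a stream cipher keystream
|       out, ctr = b"", 0
|       while len(out) < n:
|           out += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
|           ctr += 1
|       return out[:n]
|
|   def trailing_zero_bits(b: bytes) -> int:
|       bits = 0
|       for byte in reversed(b):
|           if byte == 0:
|               bits += 8
|           else:
|               bits += (byte & -byte).bit_length() - 1
|               break
|       return bits
|
|   data = os.urandom(4096)                 # the "random file"
|   best_run, best_key = -1, None
|   for trial in range(1 << 12):            # brute-force a small key space
|       key = trial.to_bytes(2, "big")
|       ct = bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))
|       z = trailing_zero_bits(ct)
|       if z > best_run:
|           best_run, best_key = z, key
|
|   # The trailing zeros could be dropped and re-padded on decompression,
|   # but you expect best_run to be about log2(number of keys tried) bits,
|   # i.e. roughly what storing the key costs, so there is no net saving.
|   print(best_run, "trailing zero bits for key", best_key.hex())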
| crazygringo wrote:
| I remember reading from a previous thread how yes you can do
| something like that (or lots of other techniques) but they will
| only work on a small percentage of files.
|
| That sometimes the smallest decompressor (including the key, in
| your example) will result in overall less space, but usually it
| will result in more (because the smallest key that generates 10
| zeros happens to be 14 digits long, etc.).
|
| And so ultimately the more interesting question becomes about
| those probabilities, and therefore what is the right entry
| price for the bet to be even?
| permo-w wrote:
| this sounds like a Nigerian Prince scammer set-up
| ccleve wrote:
| I wonder if it would have been possible to win the challenge
| legitimately?
|
| If a randomly-generated file happened to contain some redundancy
| through sheer chance, you could hand-craft a compressor to take
| advantage of it. This compressor would not work in general for
| random data, but it could work for this one particular case.
|
| It's a bet worth taking, because the payoff, 50:1 ($5,000 to
| $100), is pretty good. Play the game 50 times and you might get a
| file you could compress.
|
| The challenge, then, would be for the person offering the bet to
| generate a really random file that contained no such redundancy.
| That might not be easy.
| dmurray wrote:
| I think you can make some argument about why this isn't
| possible at 50:1 odds.
|
| A plausible "decompressor" is at least, say, 30 or 100 bytes,
| so the random file needs to have 30 bytes less entropy than you
| expected, which happens with probability X where X << 1/50. Sum
| over the whole domain of reasonable decompressors, and you
| still don't get there.
|
| This argument could do with more rigor, but I think it's
| correct. Give me 100 million to 1 odds, though, and I'll take
| my chances trying to brute force a compressor.
| lambdaone wrote:
| This is actually an extremely interesting question. 'Weak'
| files that are more easily compressible than others certainly
| exist, but with low probability.
|
| For example, the all-zeros file is a member of the set of all
| random 3 megabyte files, and it would certainly be possible
| to compress that, if by great good fortune you were lucky
| enough to receive it - albeit something that is unlikely to
| ever happen in the possible lifetime of the universe, given
| current physical theories.
|
| Is it possible to quantify this idea of a 'weak' file more
| accurately?
| ccleve wrote:
| I know very little about this, but a little googling
| suggests that the measure you're looking for is entropy,
| which has a mathematical definition:
| https://en.wikipedia.org/wiki/Entropy_(information_theory)
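|
| For the byte-level version of that measure, a minimal sketch
| (this only captures order-0 symbol statistics, so a file can
| score 8.0 bits/byte and still have exploitable structure):
|
|   import collections, math, os
|
|   def byte_entropy(data: bytes) -> float:
|       # empirical Shannon entropy in bits per byte (8.0 looks fully random)
|       n = len(data)
|       counts = collections.Counter(data)
|       return -sum(c / n * math.log2(c / n) for c in counts.values())
|
|   print(byte_entropy(os.urandom(1 << 20)))   # ~7.9998 for random bytes
|   print(byte_entropy(b"aaaabbbb" * 1000))    # 1.0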
| l33t7332273 wrote:
| One thing you can do, as the other commenter pointed out,
| is consider entropy of the file.
|
| However, this restriction is stronger than the challenge
| needs. We don't actually need a file with low entropy; in
| fact I claim that weak files exist with entropy 8 - epsilon
| (where 8 bits per byte is the maximum entropy value), for
| each epsilon > 0.
|
| What we actually need is a sufficiently large chunk in a
| file to have low entropy. The largeness is in absolute
| terms, not relative terms.
|
| A very simple example would be taking a very large file with
| maximum entropy and adding 200 0's to the end. This would not
| decrease the entropy of the file much, but it gives an opening
| to a compression algorithm that should be able to save ~100
| bytes.
| kevinventullo wrote:
| Note that if this large chunk occurs in the middle of the
| file, then you will need extra space to encode that
| position. For example, a random bit string of length 2^n
| is decently likely to have a run of n zeroes. But this
| doesn't help you because you need n bits just to encode
| _where_ that run happens.
| l33t7332273 wrote:
| But storing an index for a file of length 2^n takes only
| n bits
| kittoes wrote:
| What if we didn't even attempt to compress the file?
| Theoretically, there is a random number generator + seed
| combination that would output the same bytes as the source
| file.
| changoplatanero wrote:
| Neat idea but chances are the length of the seed is equal to
| the length of the original file.
| guepe wrote:
| There is a polynomial expression that will match the file.
| I wonder if expressing it would be compressible to a smaller
| size than the original file? [edit] I'm wrong - not all
| sequences can be expressed with a polynomial.
| l33t7332273 wrote:
| This reminds me of a data compression scheme I came up
| with once:
|
| Treat an n-bit file as a polynomial over the finite field
| with two elements. Now, there are some irreducible
| polynomials over this field, but many polynomials have
| factors of x and of (x+1). Factor the polynomial into
| P(x) * x^a * (x+1)^b and just collect these terms, storing
| only P, a, and b.
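|
| A minimal sketch of that scheme, treating bit i of the file as
| the coefficient of x^i (for random data a and b are almost
| always 0 or 1, so nothing is really saved; it just makes the
| idea concrete):
|
|   def factor_out(p: int):
|       # strip factors of x: divisible by x iff the constant term is 0
|       a = 0
|       while p and not (p & 1):
|           p >>= 1
|           a += 1
|       # strip factors of (x+1): P(1) == 0 iff the bit count is even
|       b = 0
|       while p > 1 and bin(p).count("1") % 2 == 0:
|           q, r = 0, 0
|           for i in range(p.bit_length() - 1, -1, -1):  # synthetic division
|               r ^= (p >> i) & 1
|               if i:
|                   q |= r << (i - 1)
|           p = q
|           b += 1
|       return p, a, b   # original polynomial = P(x) * x^a * (x+1)^b
|
|   P, a, b = factor_out(int.from_bytes(b"example data", "big"))
|   print(a, b, P.bit_length())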
| crazygringo wrote:
| That's not how seeds work. Seeds are tiny.
|
| Actually this would work perfectly _if_ you knew it was
| generated in a single pass by a known random number
| generator and you had tons of time to brute force it.
|
| If the file were generated by a natural source of entropy
| then forget it.
|
| Or even if modified in a trivial way like adding 1 to every
| byte.
| Retr0id wrote:
| Somewhere (discussed on HN) someone devised a "better-than-
| perfect" compressor. Most inputs get compressed (smaller than
| input), except for one input that does not. That one input is
| cryptographically impossible to find - or something along those
| lines.
|
| Unfortunately I can't find the article I'm describing here,
| maybe someone else can? It was a long time ago so I might be
| misrepresenting it slightly.
| spencerflem wrote:
| That's how all compressors work, in that likely files (eg.
| ASCII, obvious patterns, etc) become smaller and unlikely
| files become bigger.
| Dylan16807 wrote:
| > likely files (eg. ASCII, obvious patterns, etc) become
| smaller
|
| Likely files for a real human workload are like that, but
| if "most inputs" is talking about the set of all possible
| files, then that's a whole different ball game and "most
| inputs" will compress very badly.
|
| > unlikely files become bigger
|
| Yes, but when compressors can't find useful patterns they
| generally only increase size by a small fraction of a
| percent. There aren't files that get significantly bigger.
| PaulHoule wrote:
| In some cases it can be certain: ASCII encoded in the
| usual 8 bits has fat to trim even if it is random in that
| space.
| Dylan16807 wrote:
| You could design the opposite, a system where ultra-rare
| files get compressed a lot but most don't compress.
|
| But as stated, what you said is not possible. To compress by
| a single byte, at most 1 in 256 files can be compressible, or
| you'd have multiple different inputs compressing down to the
| same output.
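|
| The counting behind that claim, for files of a fixed length
| (the length here is arbitrary):
|
|   n = 4                                 # file length in bytes
|   inputs = 2 ** (8 * n)                 # all possible n-byte files
|   outputs = 2 ** (8 * (n - 1))          # all possible (n-1)-byte files
|   # A lossless compressor is injective, so at most this fraction of
|   # n-byte inputs can land on an (n-1)-byte output (allowing even
|   # shorter outputs only nudges this up to about 1/255):
|   print(outputs / inputs)               # 0.00390625 == 1/256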
| hyperpape wrote:
| It's amazing that this commentary on the value of prediction
| markets was written in 2001.
| benchmarkist wrote:
| Take random bytes from /dev/urandom and encrypt them. There is
| no way any compressor can reduce the size of that file. I'm
| pretty sure there is no way the second player can win this game.
| Actually, the encryption part isn't even necessary. There is no
| compressor that can reduce the size of a random stream of bytes.
| cortesoft wrote:
| There is no general purpose compressor that can achieve
| compression on all random streams of bytes... but there is a
| possibility to compress a specific random stream of bytes, if
| you get lucky
| lambdaone wrote:
| Absolutely! Can this probability be quantified?
| benchmarkist wrote:
| You'd have to define what you mean by luck which would be
| equivalent to choosing some pattern that your compressor
| can recognize in a random stream of bytes and then
| computing the probability that pattern appeared in some
| random byte sequence. It's not a winnable game in any
| practical sense because whatever compressor you choose the
| probability of getting a sequence from your opponent that
| will conform to the patterns recognizable by your
| compressor is essentially 0.
| Retric wrote:
| But you don't need to compress every possible file to make
| playing such a game a good idea.
|
| Suppose you win if you shave 1 bit off a sample bit stream and
| you lose if that fails. Your compressor looks at the first 2
| bits: if the stream starts with 11, the output starts with a
| single 1; otherwise the output is a 0 followed by the original
| bit stream. Now you have a 1 in 4 chance of compressing a random
| string by 1 bit and a 3 in 4 chance of making it longer by 1 bit.
|
| It's ultimately not about compression but what odds are worth
| it.
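|
| A sketch of that scheme on bit strings (encoding and decoding
| are unambiguous because the first output bit says which case
| applied):
|
|   def encode(bits: str) -> str:
|       if bits.startswith("11"):
|           return "1" + bits[2:]    # one bit shorter, 1-in-4 chance
|       return "0" + bits            # one bit longer, 3-in-4 chance
|
|   def decode(bits: str) -> str:
|       return "11" + bits[1:] if bits[0] == "1" else bits[1:]
|
|   import random
|   s = "".join(random.choice("01") for _ in range(16))
|   assert decode(encode(s)) == s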
| quuxplusone wrote:
| The encryption part might actually be harmful: for (silly)
| example, if you encrypt using the Bacon cipher, which turns
| every letter into five letters all of which are either "a" or
| "b". :) You'd need to pick an encryption method that doesn't
| expand the text at all, not even via a header or footer block.
| Better to stick with bytes generated uniformly at random, in
| which case you are correct.
|
| [1] - https://en.wikipedia.org/wiki/Bacon%27s_cipher
| omoikane wrote:
| The original email thread was from 2001, and it gets posted to HN
| periodically:
|
| https://news.ycombinator.com/from?site=patrickcraig.co.uk
|
| For another compression challenge that is still ongoing, try
| "500000EUR Prize for Compressing Human Knowledge" (also known as
| "Hutter Prize"):
|
| http://prize.hutter1.net/
|
| https://news.ycombinator.com/item?id=37502329 - Hutter Prize for
| compressing human knowledge (2023-09-13, 215 comments)
| vlovich123 wrote:
| I have a fundamental problem with the Hutter prize stating that
| intelligence is related to compression & then sponsoring a
| prize for lossless compression. Intelligence is related to
| lossy compression. Lossless is mainly a mechanistic act.
| echoangle wrote:
| Isn't the intelligence in lossless compression shown by the
| scheme you use? Applying the algorithm is the easy part; the
| proof of intelligence is inventing the algorithm which
| compresses.
| AlotOfReading wrote:
| Hutter put up the prize as a way of getting people to
| approximate his AIXI algorithm, which he already considers
| to be the pinnacle of artificial intelligence. That's
| also why lossy compression isn't interesting. It'd be an
| approximation of an approximation and not particularly
| useful for Hutter's purpose.
| chriswarbo wrote:
| > his AIXI algorithm, which he already considers to be
| the pinnacle of artificially intelligence
|
| This paper from Hutter in 2015
| https://arxiv.org/abs/1510.04931 considers AIXI to be
| "subjective", since the choice of Universal Turing
| Machine is left unspecified:
|
| > We show that Legg-Hutter intelligence and thus balanced
| Pareto optimality is entirely subjective, and that every
| policy is Pareto optimal in the class of all computable
| environments. This undermines all existing optimality
| properties for AIXI. While it may still serve as a gold
| standard for AI, our results imply that AIXI is a
| relative theory, dependent on the choice of the UTM.
| _hark wrote:
| Say we have some dataset composed of D bytes.
|
| Next, say I find some predictive model of the data M, where M
| is composed of N bytes. Furthermore, let us say that the
| entropy of the dataset under the model is H bytes.
|
| Then, if N + H < D, my model compresses the data.
|
| It doesn't matter if the model is deterministic or
| probabilistic: a probabilistic model can be used to
| (losslessly) compress a dataset with entropy coding.
|
| One more argument for compression being equivalent to
| intelligence: Across many fields of statistical machine
| learning, there are generalization bounds which have the
| form:
|
| test error <= train error + model complexity
|
| That is, we don't expect to do any worse on new (test) data,
| than the sum of the train error and the model complexity
| (smallest compressed size). Notice that if you interpret the
| train error as the entropy of the data (i.e. error under a
| _cross entropy loss_ ), then the models which satisfy the
| statistical generalization bound correspond to those which
| best compress the data. In other words: the model which
| produces the shortest description of the data is the one
| which is expected to generalize best.
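|
| A small numeric sketch of the N + H < D bookkeeping with a toy
| byte-frequency model (the 256-byte model size is made up; a
| real entropy coder gets within a few bytes of H):
|
|   import collections, math
|
|   data = b"the quick brown fox jumps over the lazy dog " * 200
|   D = len(data)                     # raw size in bytes
|   N = 256                           # toy model: one byte per symbol frequency
|   counts = collections.Counter(data)
|   H = sum(-c * math.log2(c / D) for c in counts.values()) / 8   # bytes
|   print(D, round(N + H), N + H < D)   # True: the model compresses the data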
| dooglius wrote:
| This seems to assume that there is a tractable way to
| encode H efficiently, but this seems very difficult given a
| model that is focused on understanding the content. Ex: I
| can easily write a program that can do basic arithmetic,
| but given say a bitmap scan of elementary school math
| materials, such a program gives me no easy way to compress
| that; rather something generic like PNG (that does not know
| or understand the material) will far outperform.
| _hark wrote:
| Great point. This points to the related issue: what do we
| want to compress? Do we want to compress "the answer",
| here the arithmetic expression's solution, or do we want
| to compress the image?
|
| You can formalize this with rate--distortion theory, by
| defining a distortion function that says what your
| objective is. That implies a well-defined complexity
| relative to that distortion function.
|
| Okay to be clear, I've written a paper on exactly this
| topic which will be announced in a week or so. So you
| won't find anything on the subject yet, haha. But I use
| _almost_ exactly this example.
| Jerrrry wrote:
| >Okay to be clear, I've written a paper on exactly this
| topic which will be announced in a week or so. So you
| won't find anything on the subject yet, haha. But I use
| almost exactly this example.
|
| I would use floating-point arithmetic as the
| example/analogy: one trades off accuracy/precision for
| exactness/correctness in the extremities. Answers
| near the more granular representations will only be
| able to be represented by their nearest value. If this is
| foreseeable, the floating-point implementation can be
| adjusted to change where the floating point's
| "extremities" are.
|
| I used the above analogy and the following when
| articulating the magnitude of near-losslessness that
| large LLMs have managed to attain, especially when all
| of humanity's corpus is compressed into a USB flash
| drive; the Kolmogorov complexity remitted/captured is
| similar to a masterpiece like the Mona Lisa having to be
| described in X many brush strokes to achieve Y amount of
| fidelity.
| vlovich123 wrote:
| > It doesn't matter if the model is deterministic or
| probabilistic: a probabilistic model can be used to
| (losslessly) compress a dataset with entropy coding.
|
| But if you can choose to lose information you can obviously
| achieve a higher compression score. That's literally what
| optical & auditory compression exploits. Indeed, we know
| people generally don't memorize the entire Wikipedia
| article. Rather they convert what they learn into some
| internally consistent story that they can then recite at
| any time & each time they recite it it's even differently
| worded (maybe memorizing some facts that help solidify the
| story).
|
| Again, I have no problem with compression and decompression
| being equated to intelligence provided _both_ are allowed
| to be lossy (or at least one facet of intelligence). That's
| because you get to inject structure into the stored
| representation that may not otherwise exist in the original
| data and you get to choose how to hydrate that
| representation. That's why LZMA isn't "more intelligent"
| than ZIP - the algorithm itself is "smarter" at compression
| but you're not getting to AGI by working on a better LZMA.
|
| It's also why H264 and MP3 aren't intelligent either. While
| compression is lossy, decompression is deterministic. That's
| why we can characterize LLMs as "more intelligent" than
| LZMA even though LZMA compresses losslessly better than
| LLMs.
| _hark wrote:
| I agree with you in spirit. I just thought you might be
| interested in some of the technical details regarding the
| relationship between compression and generalization!
|
| I'll have a paper out next week which makes your point
| precise, using the language of algorithmic rate--
| distortion theory (lossy compression applied to
| algorithmic information theory + neural nets).
|
| I think another way of understanding this is through the
| "Value Equivalence Principle", which points out that if
| we are learning a model of our environment, we don't want
| to model everything in full detail, we only want to model
| things which affect our value function, which determines
| how we will act. The value function, in this sense,
| implies a distortion function that we can define lossy
| compression relative to.
| misiek08 wrote:
| I read "decompressor and compressed file", not "fileS". There
| is only one person correct here.
| stavros wrote:
| You should read the rest of the thread.
| moonchild wrote:
| suppose the data were generated by a csprng with 256 bits of
| state. then in fact there are just slightly more than 256 bits of
| entropy there (prng state plus the length of the generated file).
| only, you'd have a snail's chance in hell of actually finding
| that seed
| echoangle wrote:
| If he really wanted to make sure he kept the money, he could
| just use a hardware RNG. I don't know how common they were in
| 2001 but you could probably even generate some practically
| unpredictable random numbers using radio noise.
| stavros wrote:
| Sorry, if you're trying to hustle people by charging $100 per
| try, and you don't catch the sleight of hand in the "multiple
| files" question and accept it, then you were beaten at your own
| game, fair and square.
| l33t7332273 wrote:
| I feel like if the FAQ requires not using filename shenanigans,
| then the sleight of hand was illegal the whole way.
| stavros wrote:
| He didn't use filenames, he used files, and if that were
| illegal, Mike shouldn't have accepted it.
| PaulKeeble wrote:
| I would not have accepted multiple files nor scripting. Scripting
| offers up the entire contents of the Linux operating system, and
| this could allow you to use operating system files as partial
| data. To really test compression/decompression the rules need to
| be a lot tighter. The challenge was met and beaten because the
| criteria were bad. It leaves open a number of dishonest
| approaches.
|
| The decompression program needs to be constrained to itself and
| the singular file to decompress, and cannot utilise anything
| other than basic operating system functions for the purpose of
| reading and writing the constrained files. Any bytes the program
| accesses otherwise should be counted as part of the total.
|
| It's very hard to constrain this without some form of cheating
| being possible, especially if you have some years to set it up.
| echoangle wrote:
| > Scripts offers up the entire contents of the linux operating
| system and this could allow you to use operating system files
| as partial data.
|
| I also thought about this, but when thinking more about the
| argument by the challenge creator, I understood that the data
| in the OS doesn't help you.
|
| If you're given a file with 100 bits, there are 2^100
| possibilities. If you want to compress to 99 bits, there are
| only 2^99 possible compressed outputs. So when generating and
| trying to compress every 100 bit file, you're not going to be
| able to compress half of them. The external data doesn't help
| you unless you get to modify it after receiving the file to
| compress. If you can get a patch merged into the Linux kernel,
| you could use that data, but if the data is fixed, you can't
| reference the data with less data than the original file (for
| some cases at least).
|
| Edit: actually, when thinking about it, you could use the OS as
| another few bits of out-of-band information you get for free,
| which makes it possible again. If you have 8 different OSes to
| choose from, you could use their differences to compress any
| file. At least theoretically; in practice the 3 bits you gain
| probably aren't enough to make a difference.
| cozzyd wrote:
| My decompressor:
|
| curl ftp://path/to/original.dat
| PaulKeeble wrote:
| One possibility I considered is that your script could use an
| "apt install" to provide out-of-band data; you could put the
| information you need into the operating system from the
| universe repositories or your own package server.
| echoangle wrote:
| The computer used for the test probably doesn't have
| network connectivity, otherwise you could just download the
| data with curl.
| echoangle wrote:
| I didn't quite get the method used to "compress" the data from
| the article; maybe this rephrasing helps someone:
|
| You basically split the file every time you encounter a specific
| character, and your decompressor just combines all the files it
| finds, putting back the character you split by. If you split at
| every "X" character, which might occur 1000 times in the file,
| the decompressor only needs a small script which joins all the
| files and puts an "X" between them, which is less than 1000
| bytes. The "trick" is storing the locations of the Xs you
| removed in the file sizes of the individual files.
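|
| A rough sketch of that trick (the delimiter byte and filenames
| here are illustrative; the real exchange used shell tools). The
| dropped delimiters aren't stored anywhere except in where the
| file boundaries fall:
|
|   DELIM = b"5"                  # any byte that occurs often in the data
|
|   def split(original: bytes, prefix: str) -> int:
|       chunks = original.split(DELIM)
|       for i, chunk in enumerate(chunks):
|           with open(f"{prefix}.{i:04d}", "wb") as f:
|               f.write(chunk)
|       return len(chunks)
|
|   def join(prefix: str, count: int) -> bytes:
|       parts = [open(f"{prefix}.{i:04d}", "rb").read() for i in range(count)]
|       return DELIM.join(parts)  # every boundary puts one delimiter back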
| avmich wrote:
| I particularly like this:
|
| > It's not my fault that a file system uses up more space storing
| the same amount of data in two files rather than a single file.
| teo_zero wrote:
| Mike supports $SPORT team the Compressors. He's so sure they are
| unbeatable that he accepts 50:1 bets against them. Patrick bets
| $100 that they won't win this year's championship; Mike accepts.
| Later, the Compressors announce financial troubles and can't pay
| the fee to enter the championship, which is then won by another
| team. Patrick claims his $5000. Mike refuses to pay, saying that
| the Compressors have not been beaten, and the championship result
| does not disprove his claim that the Compressors are unbeatable,
| which was the spirit of the bet since the beginning. Mike offers
| to refund the $100. Patrick holds that the terms of the bet were
| met and he deserves the $5000.
| nojvek wrote:
| I'd take this bet.
|
| > With this offer, you can tune your algorithm to my data.
|
| One can generate a 1GB file or 10GB file. It is highly likely
| that there is some form of a repeatable pattern in there to shave
| off 50-100 bytes by sliding window search.
|
| Then the decompressor is essentially: at this index, expand this
| pattern. The compressed file excludes the range of the pattern.
|
| One may not always get such a pattern, but on multiple tries it's
| very feasible. Payout just needs one win in 50 tries.
|
| You could generate a 100GB file. The bigger the file the higher
| the chance of repeating pattern.
|
| The challenge is won if compressed_file + decompressor is at
| least one byte smaller than the original file.
|
| One could have a self-executing decompressor to save some file
| overhead bytes.
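|
| A sketch of that sliding-window search (for truly random data
| the longest repeat in an n-byte file is only on the order of
| 2*log256(n) bytes by a birthday argument, so a 50-100 byte
| repeat would need astronomically more data; this just shows the
| mechanics):
|
|   import os
|
|   def find_repeat(data: bytes, window: int):
|       seen = {}
|       for i in range(len(data) - window + 1):
|           w = data[i:i + window]
|           if w in seen:
|               return seen[w], i, w      # both positions and the pattern
|           seen[w] = i
|       return None
|
|   data = os.urandom(1 << 20)            # 1 MiB of random bytes
|   print(find_repeat(data, 4))           # 4-byte repeats: common at this size
|   print(find_repeat(data, 8))           # 8-byte repeats: usually none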
| Jerrrry wrote:
| good compression = pattern eventually
|
| maximum compression = indistinguishable from random data
| crazygringo wrote:
| As a general principle, absolutely.
|
| In practice, I wonder what size of file we're talking about
| that would result in net compression on random data 50% of the
| time?
|
| I have no intuition whether it's 1 GB, 1 TB, 1 PB, or beyond.
| dang wrote:
| Related. Others?
|
| _The $5000 Compression Challenge_ -
| https://news.ycombinator.com/item?id=9163782 - March 2015 (168
| comments)
|
| _The $5000 Compression Challenge_ -
| https://news.ycombinator.com/item?id=5025211 - Jan 2013 (61
| comments)
|
| _The $5000 Compression Challenge_ -
| https://news.ycombinator.com/item?id=4616704 - Oct 2012 (119
| comments)
| trod1234 wrote:
| This gets reposted quite often.
|
| The crux of the ongoing debate over the post has to do with the
| ambiguity of the stated requirements.
|
| For there to be any rational discussion, you need to have proper
| unique definitions for what is being communicated. Without a
| definition (identity) no system of truth seeking will provide a
| definitive truth.
|
| Most notably, "compression" has many different meanings currently
| in practical use. For example many people use the word
| compression and simply mean that the uncompressed size is more
| than the compressed size; lossy vs lossless doesn't matter when
| it's good enough, all that matters is that it's been reduced, and
| an example of that might be H264 video.
|
| Other more rigorous individuals will use it to mean entropy
| encoding, which has a strict definition, part of which is
| lossless, and in these cases they often view Shannon's theorems
| as gospel, and at that point they often stop thinking and may not
| try to sharpen their teeth on a seemingly impossible task.
|
| These theorems are often very nuanced, with the bounds very
| theoretical or still undiscovered, though still seemingly
| correct. When someone is told something is impossible, they don't
| fully apply themselves to try to find a solution. It's a worn
| path that leads nowhere, or so they imagine.
|
| This mindset of learned helplessness prevents a nuanced intuitive
| understanding of such problems and any advances stagnate. People
| take the impossibility at face value and overgeneralize it
| (without that nuanced understanding); this is a fallacy, and it
| can be very difficult to recognize when there is no simple way
| to refute it on its face.
|
| Here is something to consider. Shannon's source coding theorem
| relies on the average uncertainty of probabilities as a measure
| of entropy (information). Some exploration is needed to
| understand what makes this true, or where it fails. It has been
| my experience that only a very few exceptional people ever
| attempt this process.
|
| The main question that jumps out in my mind, since compression
| has been something I've been interested in for years, is: can an
| average entropy in statistics accurately tell us whether this is
| true in systems with probabilities and distributions that are a
| mix of mathematical chaos but also contain regularity within
| their domains? Statistics can be unwieldy beasts, and act as
| regular sources of misunderstanding and falsehood without a very
| rigorous approach.
|
| For a similar problem, people may find this article particularly
| relevant with regards to the chaotic 3-body problem in
| astrophysics.
|
| https://www.aanda.org/articles/aa/full_html/2024/09/aa49862-...
| Xcelerate wrote:
| There's another interesting loophole to these general compression
| challenges. A universal Turing machine will necessarily compress
| some number of strings (despite the fact that almost all strings
| are incompressible). The set of compressible strings varies
| depending on the choice of UTM, and if your UTM is fixed, you're
| out of luck for random data. But if the UTM is unspecified, then
| there exist an infinite number of UTMs that will compress any
| specific string.
|
| To some extent, this resembles the approach of "hiding the data
| in the decompressor". But the key difference here is that you can
| make it less obvious by selecting a particular programming
| language capable of universal computation, and it is the _choice
| of language_ that encodes the missing data. For example, suppose
| we have ~17k programming languages to choose from--the language
| selection itself encodes about 14 bits (log2(17000)) of
| information.
|
| If there are m bits of truly random data to compress and n
| choices of programming languages capable of universal
| computation, then as n/m approaches infinity, the probability of
| at least one language being capable of compressing an arbitrary
| string approaches 1. This ratio is likely unrealistically large
| for any amount of random data more than a few bits in length.
|
| There's also the additional caveat that we're assuming the set of
| compressible strings is algorithmically independent for each
| choice of UTM. This certainly isn't the case. The invariance
| theorem states that ∀x, |K_U(x) - K_V(x)| < c for UTMs U and V,
| where K is Kolmogorov complexity and c is a constant that depends
| only on U and V. So in our case, c is effectively the size of the
| largest minimal transpiler between any two programming languages
| that we have to choose from, and I'd imagine c is quite small.
|
| Oh, and this all requires computing the Kolmogorov complexity of
| a string of random data. Which we can't, because that's
| uncomputable.
|
| Nevertheless it's an interesting thought experiment. I'm curious
| what the smallest value of m is such that we could realistically
| compress a random string of length 2^m given the available
| programming languages out there. Unfortunately, I imagine it's
| probably like 6 bits, and no one is going to give you an award
| for compressing 6 bits of random data.
___________________________________________________________________
(page generated 2024-11-24 23:00 UTC)