[HN Gopher] The $5000 Compression Challenge (2001)
       ___________________________________________________________________
        
       The $5000 Compression Challenge (2001)
        
       Author : ekiauhce
       Score  : 102 points
       Date   : 2024-11-23 21:10 UTC (1 days ago)
        
 (HTM) web link (www.patrickcraig.co.uk)
 (TXT) w3m dump (www.patrickcraig.co.uk)
        
       | andrei-akopian wrote:
       | I thought he would try to net a profit through probability by
       | requesting regenerated files, hoping to get a compressible one
       | within 5000/100 = 50 tries. (Though I don't think that would
       | work.)
        
       | its-summertime wrote:
       | > I think that I did not make a mistake except to believe that
       | you truly intended to attempt to compress the data I was sending
       | you.
       | 
       | Weird phrase to hear from a digital carnival scam artist.
        
       | Supermancho wrote:
       | > It's not my fault that a file system uses up more space storing
       | the same amount of data in two files rather than a single file.
       | 
       | It's not interesting to see people play word games with each
       | other in the HN comments, so why is it interesting to see this
       | kind of disingenuous behavior? Trolls have always existed. I
       | don't get what's interesting about it.
        
         | recursive wrote:
         | The act of offering a cash prize to prove you wrong kind of
         | makes you look like a jerk. It's assumed that you will use
         | every loophole at your disposal. Seeing the situation reversed
         | so elegantly makes me root for the challenger.
        
         | ncruces wrote:
         | Who's the bigger troll?
         | 
         | Putting up the initial challenge, especially with the
         | accompanying commentary, is just as much trolling as that.
        
         | stavros wrote:
         | As Patrick said, if he had gotten a 1000-byte file and sent
         | back a 999-byte file plus decompressor, nobody would have
         | counted the inode data as part of the solution. Why count it
         | now?
        
           | Supermancho wrote:
           | Because the solution specifically was for 1 file submitted, 2
           | files back (compressed and decompressor). This topic is the
           | next-level trolling that went on. The summary and FAQ
           | explained why these file metagames aren't useful. They
           | obscure the lack of a compression strategy (e.g. if the inode
           | equivalent were 2 ft of tape) by dispersing information into
           | the fs.
        
             | stavros wrote:
             | > Because the solution specifically was for 1 file
             | submitted, 2 files back (compressed and decompressor).
             | 
             | Patrick:
             | 
             | I meant can I send you a decompressor and several
             | compressed files whose total file size is less than the
             | original uncompressed file and from which I can regenerate
             | the original uncompressed file
             | 
             | Mike:
             | 
             | Sure.
        
         | Timwi wrote:
         | I do agree that this last sentence (the one you quoted) is very
         | unfortunate and doesn't match the wit of the rest. A better
         | closing remark might have been something along the lines of, "I
         | never claimed to have performed any compression. I only claimed
         | to have met the terms of the challenge. I asked if I can send
         | multiple files and the response was 'sure'. I provided the
         | files; they meet the total file size criterion; and the script
         | correctly reconstructs the original file."
        
       | fxtentacle wrote:
       | This guy clearly failed because he didn't actually do any
       | compression, he just abused the filesystem to store parts of the
       | data and then tried to argue that metadata was not data...
       | 
       | But FYI someone else actually managed to compress that exact same
       | data: https://jasp.net/tjnn/2018/1.xhtml
        
         | recursive wrote:
         | That argument was made, and addressed in the emails. How are
         | you defining compression? Actual compression doesn't seem to
         | have been a requirement in the challenge.
        
         | maxmcd wrote:
         | Is there any more information about this solution?
        
         | stavros wrote:
         | If Mike didn't want multiple files, he should have said no to
         | multiple files. The fact that he wanted $100 per try shows that
         | he just got outhustled at his own hustle. He should have been
         | more particular about the rules, and upheld the ones he himself
         | set.
        
           | PaulHoule wrote:
           | Should have charged $100 a file.
        
         | Timwi wrote:
         | He clearly failed at doing any actual compression, but he did
         | not fail at the challenge. He satisfied every single condition.
         | 
         | He obviously knows that. He knows exactly what compression is
         | and that his solution is not it. He showed (successfully) that
         | meeting the challenge as it was worded did not require
         | compression, which the setter of the challenge didn't realize.
        
         | elahieh wrote:
         | I'm rather skeptical of this claim that the data was compressed
         | in 2018 because there is no further information, apart from a
         | hash value given.
         | 
         | If it's a true claim they must have identified some "non-
         | random" aspect of the original data, and then they could have
         | given more info.
        
       | johnfn wrote:
       | This is hilarious, but Mike's behavior makes my blood boil. To
       | switch the rules when fairly beaten is just so scummy!
       | 
       | Of course Patrick's solution occurred to me immediately once he
       | started talking about splitting up the file. :)
       | 
       | Then I got to wondering if it would be possible to compress a
       | large enough file and save a few bytes, in the spirit of what
       | Mike actually wanted. For instance, if you tried encrypting with
       | a bunch of random keys, you'll find some such that the result
       | ends with n 0s (say), so then your result could be something like
       | encrypted + key + n. Typically you'd expect the length of the
       | encryption key to be roughly the number of 0s you gain, so you
       | wouldn't actually net any savings. But I think if you searched
       | long enough, you might "get lucky" and find a key that generated
       | more 0s than the length of the key, so you
       | could theoretically win the challenge. Of course you'd have to
       | somehow embed a program which could decrypt and reassemble
       | everything in a tiny amount of space, but it still doesn't seem
       | impossible like Mike seems to suggest. Perhaps just requires an
       | intractable amount of computing power. :)
       | 
       | I'm not sure if any of that made sense. I think I'm using the
       | wrong terms for things.
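       | 
       | A toy sketch of that key search (purely illustrative: the
       | counter-mode keystream and the tiny 12-bit key space are my own
       | assumptions, and a real attempt would need astronomically more
       | compute):
       | 
       |   import hashlib, os
       | 
       |   data = os.urandom(256)        # stand-in for the challenge file
       | 
       |   def keystream(key, n):
       |       out = b""
       |       ctr = 0
       |       while len(out) < n:       # simple counter-mode keystream
       |           out += hashlib.sha256(key.to_bytes(4, "big") +
       |                                 ctr.to_bytes(4, "big")).digest()
       |           ctr += 1
       |       return out[:n]
       | 
       |   def trailing_zero_bytes(b):
       |       return len(b) - len(b.rstrip(b"\x00"))
       | 
       |   best = 0
       |   for key in range(2 ** 12):
       |       ks = keystream(key, len(data))
       |       enc = bytes(x ^ y for x, y in zip(data, ks))
       |       best = max(best, trailing_zero_bytes(enc))
       | 
       |   # You save `best` zero bytes but must also ship the key, so
       |   # the net gain is essentially never positive for random data.
       |   print("best run of trailing zeros:", best, "bytes")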
        
         | crazygringo wrote:
         | I remember reading from a previous thread how yes you can do
         | something like that (or lots of other techniques) but they will
         | only work on a small percentage of files.
         | 
         | That sometimes the smallest decompressor (including the key, in
         | your example) will result in overall less space, but usually it
         | will result in more (because the smallest key that generates 10
         | zeros happens to be 14 digits long, etc.).
         | 
         | And so ultimately the more interesting question becomes about
         | those probabilities, and therefore what entry price makes the
         | bet break even?
        
       | permo-w wrote:
       | this sounds like a Nigerian Prince scammer set-up
        
       | ccleve wrote:
       | I wonder if it would have been possible to win the challenge
       | legitimately?
       | 
       | If a randomly-generated file happened to contain some redundancy
       | through sheer chance, you could hand-craft a compressor to take
       | advantage of it. This compressor would not work in general for
       | random data, but it could work for this one particular case.
       | 
       | It's a bet worth taking, because the payoff, 50:1 ($5,000 to
       | $100), is pretty good. Play the game 50 times and you might get a
       | file you could compress.
       | 
       | The challenge, then, would be for the person offering the bet to
       | generate a really random file that contained no such redundancy.
       | That might not be easy.
        
         | dmurray wrote:
         | I think you can make some argument about why this isn't
         | possible at 50:1 odds.
         | 
         | A plausible "decompressor" is at least, say, 30 or 100 bytes,
         | so the random file needs to have 30 bytes less entropy than you
         | expected, which happens with probability X where X << 1/50. Sum
         | over the whole domain of reasonable decompressors, and you
         | still don't get there.
         | 
         | This argument could do with more rigor, but I think it's
         | correct. Give me 100 million to 1 odds, though, and I'll take
         | my chances trying to brute force a compressor.
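         | 
         | A back-of-the-envelope version of that sum (the 30-byte
         | decompressor and the 2^64 candidate budget are arbitrary
         | assumptions):
         | 
         |   # A d-byte decompressor only wins if its payload is at least
         |   # d+1 bytes shorter than the original; there are roughly
         |   # 256^(n-d-1) such payloads versus 256^n equally likely
         |   # inputs, so one decompressor wins with p ~ 256^-(d+1).
         |   d = 30
         |   p_single = 256.0 ** -(d + 1)
         |   # even 2^64 candidate decompressors leave the odds nowhere
         |   # near the 1-in-50 needed to break even at $100 a try
         |   print(p_single, 2 ** 64 * p_single)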
        
           | lambdaone wrote:
           | This is actually an extremely interesting question. 'Weak'
           | files that are more easily compressible than others certainly
           | exist, but with low probability.
           | 
           | For example, the all-zeros file is a member of the set of all
           | random 3 megabyte files, and it would certainly be possible
           | to compress that, if by great good fortune you were lucky
           | enough to receive it - albeit something that is unlikely to
           | ever happen in the possible lifetime of the universe, given
           | current physical theories.
           | 
           | Is it possible to quantify this idea of a 'weak' file more
           | accurately?
        
             | ccleve wrote:
             | I know very little about this, but a little googling
             | suggests that the measure you're looking for is entropy,
             | which has a mathematical definition:
             | https://en.wikipedia.org/wiki/Entropy_(information_theory)
        
             | l33t7332273 wrote:
             | One thing you can do, as the other commenter pointed out,
             | is consider entropy of the file.
             | 
             | However, this restriction is stronger than the challenge
             | needs. We don't actually need a file with low entropy; in
             | fact, I claim that weak files exist with entropy
             | 8 - epsilon (where 8 bits per byte is the maximum) for
             | every epsilon > 0.
             | 
             | What we actually need is a sufficiently large chunk in a
             | file to have low entropy. The largeness is in absolute
             | terms, not relative terms.
             | 
             | A very simple example would be taking a very large file
             | with maximum entropy and adding 200 0's to the end. This
             | would not decrease the entropy of the file much, but it
             | admits a compression algorithm that should be able to
             | save ~100 bytes.
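             | 
             | A minimal sketch of that case (my own illustration,
             | reading the 200 0's as zero bytes; the 2-byte count
             | header is an arbitrary choice):
             | 
             |   import os
             |   original = os.urandom(1_000_000) + b"\x00" * 200
             | 
             |   body = original.rstrip(b"\x00")
             |   count = len(original) - len(body)
             |   compressed = count.to_bytes(2, "big") + body
             | 
             |   def decompress(blob):
             |       n = int.from_bytes(blob[:2], "big")
             |       return blob[2:] + b"\x00" * n
             | 
             |   assert decompress(compressed) == original
             |   # ~198 bytes saved, before counting the decompressor
             |   print(len(original) - len(compressed))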
        
               | kevinventullo wrote:
               | Note that if this large chunk occurs in the middle of the
               | file, then you will need extra space to encode that
               | position. For example, a random bit string of length 2^n
               | is decently likely to have a run of n zeroes. But this
               | doesn't help you because you need n bits just to encode
               | _where_ that run happens.
        
               | l33t7332273 wrote:
               | But storing an index for a file of length 2^n takes only
               | n bits
        
         | kittoes wrote:
         | What if we didn't even attempt to compress the file?
         | Theoretically, there is a random number generator + seed
         | combination that would output the same bytes as the source
         | file.
        
           | changoplatanero wrote:
           | Neat idea but chances are the length of the seed is equal to
           | the length of the original file.
        
             | guepe wrote:
             | There is a polynomial expression that will match the file.
             | I wonder if expressing it would be compressible to a lower
             | size than original file? [edit] I'm wrong - not all
             | sequences can be expressed with a polynomial.
        
               | l33t7332273 wrote:
               | This reminds me of a data compression scheme I came up
               | with once:
               | 
               | Treat an n-bit file as a polynomial over the finite
               | field of characteristic 2, GF(2). Now, some polynomials
               | over this field are irreducible, but many have factors
               | of x and of (x+1). Factor the polynomial as
               | P(x) x^a (x+1)^b and just collect these terms, storing
               | only P, a, and b.
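               | 
               | A rough sketch of this scheme (my own code, not a
               | claim about how well it works; the file is read as an
               | integer whose bit i is the coefficient of x^i, and a
               | and b name the two exponents):
               | 
               |   import os
               | 
               |   def div_x_plus_1(p):
               |       # long division by (x+1) over GF(2);
               |       # requires p to have an even popcount
               |       q = 0
               |       while p.bit_length() >= 2:
               |           s = p.bit_length() - 2
               |           q |= 1 << s
               |           p ^= 0b11 << s
               |       return q
               | 
               |   def factor(bits):
               |       a = 0
               |       while bits and not bits & 1:
               |           bits >>= 1          # factor of x
               |           a += 1
               |       b = 0
               |       while bits and bin(bits).count("1") % 2 == 0:
               |           bits = div_x_plus_1(bits)
               |           b += 1              # factor of (x+1)
               |       return bits, a, b
               | 
               |   def rebuild(P, a, b):
               |       for _ in range(b):
               |           P = (P << 1) ^ P    # multiply by (x+1)
               |       return P << a           # multiply by x^a
               | 
               |   n = int.from_bytes(os.urandom(64), "big")
               |   assert rebuild(*factor(n)) == n
               |   # for random data a and b are almost always tiny, so
               |   # storing (P, a, b) saves essentially nothing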
        
             | crazygringo wrote:
             | That's not how seeds work. Seeds are tiny.
             | 
             | Actually this would work perfectly _if_ you knew it was
             | generated in a single pass by a known random number
             | generator and you had tons of time to brute force it.
             | 
             | If the file were generated by a natural source of entropy
             | then forget it.
             | 
             | Or even if modified in a trivial way like adding 1 to every
             | byte.
        
         | Retr0id wrote:
         | Somewhere (discussed on HN) someone devised a "better-than-
         | perfect" compressor. Most inputs get compressed (smaller than
         | input), except for one input that does not. That one input is
         | cryptographically impossible to find - or something along those
         | lines.
         | 
         | Unfortunately I can't find the article I'm describing here,
         | maybe someone else can? It was a long time ago so I might be
         | misrepresenting it slightly.
        
           | spencerflem wrote:
           | That's how all compressors work, in that likely files (eg.
           | ASCII, obvious patterns, etc) become smaller and unlikely
           | files become bigger.
        
             | Dylan16807 wrote:
             | > likely files (eg. ASCII, obvious patterns, etc) become
             | smaller
             | 
             | Likely files for a real human workload are like that, but
             | if "most inputs" is talking about the set of all possible
             | files, then that's a whole different ball game and "most
             | inputs" will compress very badly.
             | 
             | > unlikely files become bigger
             | 
             | Yes, but when compressors can't find useful patterns they
             | generally only increase size by a small fraction of a
             | percent. There aren't files that get significantly bigger.
        
             | PaulHoule wrote:
             | In some cases it can be certain: ASCII encoded in the
             | usual 8 bits has fat to trim even if it is random within
             | that space.
        
           | Dylan16807 wrote:
           | You could design the opposite, a system where ultra-rare
           | files get compressed a lot but most don't compress.
           | 
           | But as stated, what you said is not possible. To compress by
           | a single byte, only 1/256 files can be compressible, or you'd
           | have multiple different inputs compressing down to the same
           | output.
        
       | hyperpape wrote:
       | It's amazing that this commentary on the value of prediction
       | markets was written in 2001.
        
       | benchmarkist wrote:
       | Take random bytes from /dev/urandom, encrypt it. There is no way
       | any compressor can reduce the size of that file. I'm pretty sure
       | there is no way the second player can win this game. Actually,
       | the encryption part isn't even necessary. There is no compressor
       | that can reduce the size of a random stream of bytes.
        
         | cortesoft wrote:
         | There is no general purpose compressor that can achieve
         | compression on all random streams of bytes... but there is a
         | possibility to compress a specific random stream of bytes, if
         | you get lucky
        
           | lambdaone wrote:
           | Absolutely! Can this probability be quantified?
        
             | benchmarkist wrote:
             | You'd have to define what you mean by luck which would be
             | equivalent to choosing some pattern that your compressor
             | can recognize in a random stream of bytes and then
             | computing the probability that pattern appeared in some
             | random byte sequence. It's not a winnable game in any
             | practical sense because whatever compressor you choose the
             | probability of getting a sequence from your opponent that
             | will conform to the patterns recognizable by your
             | compressor is essentially 0.
        
         | Retric wrote:
         | But you don't need to compress every possible file to make
         | playing such a game a good idea.
         | 
         | Suppose you win if you shave 1 bit off a sample bit stream and
         | lose otherwise. Your compressor looks at the first 2 bits: if
         | the stream starts with 11, output 1 followed by the rest;
         | otherwise output 0 followed by the whole stream. Now you have
         | a 1 in 4 chance of compressing a random string by 1 bit and a
         | 3 in 4 chance of making it longer by 1 bit.
         | 
         | It's ultimately not about compression but what odds are worth
         | it.
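         | 
         | A literal sketch of that scheme on "0"/"1" strings (toy code
         | of my own, just to make the bookkeeping concrete):
         | 
         |   import random
         | 
         |   def compress(bits):
         |       if bits.startswith("11"):
         |           return "1" + bits[2:]   # 1 bit shorter, 1/4 of inputs
         |       return "0" + bits           # 1 bit longer, 3/4 of inputs
         | 
         |   def decompress(bits):
         |       if bits.startswith("1"):
         |           return "11" + bits[1:]
         |       return bits[1:]
         | 
         |   s = "".join(random.choice("01") for _ in range(32))
         |   assert decompress(compress(s)) == s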
        
         | quuxplusone wrote:
         | The encryption part might actually be harmful: for (silly)
         | example, if you encrypt using the Bacon cipher, which turns
         | every letter into five letters all of which are either "a" or
         | "b". :) You'd need to pick an encryption method that doesn't
         | expand the text at all, not even via a header or footer block.
         | Better to stick with bytes generated uniformly at random, in
         | which case you are correct.
         | 
         | [1] - https://en.wikipedia.org/wiki/Bacon%27s_cipher
        
       | omoikane wrote:
       | The original email thread was from 2001, and it gets posted to HN
       | periodically:
       | 
       | https://news.ycombinator.com/from?site=patrickcraig.co.uk
       | 
       | For another compression challenge that is still ongoing, try
       | "500000EUR Prize for Compressing Human Knowledge" (also known as
       | "Hutter Prize"):
       | 
       | http://prize.hutter1.net/
       | 
       | https://news.ycombinator.com/item?id=37502329 - Hutter Prize for
       | compressing human knowledge (2023-09-13, 215 comments)
        
         | vlovich123 wrote:
         | I have a fundamental problem with the Hutter prize stating that
         | intelligence is related to compression & then sponsoring a
         | prize for lossless compression. Intelligence is related to
         | lossy compression. Lossless is mainly a mechanistic act.
        
           | echoangle wrote:
           | Isn't the intelligence in lossless compression the scheme
           | you use? Applying the algorithm is the easy part; the proof
           | of intelligence is inventing the algorithm that compresses.
        
             | AlotOfReading wrote:
             | Hutter put up the prize as a way of getting people to
             | approximate his AIXI algorithm, which he already considers
             | to be the pinnacle of artificial intelligence. That's
             | also why lossy compression isn't interesting. It'd be an
             | approximation of an approximation and not particularly
             | useful for Hutter's purpose.
        
               | chriswarbo wrote:
               | > his AIXI algorithm, which he already considers to be
               | the pinnacle of artificially intelligence
               | 
               | This paper from Hutter in 2015
               | https://arxiv.org/abs/1510.04931 considers AIXI to be
               | "subjective", since the choice of Universal Turing
               | Machine is left unspecified:
               | 
               | > We show that Legg-Hutter intelligence and thus balanced
               | Pareto optimality is entirely subjective, and that every
               | policy is Pareto optimal in the class of all computable
               | environments. This undermines all existing optimality
               | properties for AIXI. While it may still serve as a gold
               | standard for AI, our results imply that AIXI is a
               | relative theory, dependent on the choice of the UTM.
        
           | _hark wrote:
           | Say we have some dataset composed of D bytes.
           | 
           | Next, say I find some predictive model of the data M, where M
           | is composed of N bytes. Furthermore, let us say that the
           | entropy of the dataset under the model is H bytes.
           | 
           | Then, if N + H < D, my model compresses the data.
           | 
           | It doesn't matter if the model is deterministic or
           | probabilistic: a probabilistic model can be used to
           | (losslessly) compress a dataset with entropy coding.
           | 
           | One more argument for compression being equivalent to
           | intelligence: Across many fields of statistical machine
           | learning, there are generalization bounds which have the
           | form:
           | 
           | test error <= train error + model complexity
           | 
           | That is, we don't expect to do any worse on new (test) data,
           | than the sum of the train error and the model complexity
           | (smallest compressed size). Notice that if you interpret the
           | train error as the entropy of the data (i.e. error under a
           | _cross entropy loss_ ), then the models which satisfy the
           | statistical generalization bound correspond to those which
           | best compress the data. In other words: the model which
           | produces the shortest description of the data is the one
           | which is expected to generalize best.
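           | 
           | A small numeric sketch of N + H < D (the order-0 byte model
           | and the one-byte-per-parameter cost are crude assumptions,
           | just to show the bookkeeping):
           | 
           |   from collections import Counter
           |   from math import log2
           | 
           |   data = b"the quick brown fox jumps over a lazy dog " * 200
           |   D = len(data)
           | 
           |   counts = Counter(data)
           |   N = len(counts)              # toy stand-in for model size
           |   p = {b: c / D for b, c in counts.items()}
           |   # ideal entropy-coded size in bytes under the model
           |   H = sum(-log2(p[b]) for b in data) / 8
           | 
           |   print(D, round(N + H), N + H < D)   # model + code < raw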
        
             | dooglius wrote:
             | This seems to assume that there is a tractable way to
             | encode H efficiently, but this seems very difficult given a
             | model that is focused on understanding the content. Ex: I
             | can easily write a program that can do basic arithmetic,
             | but given say a bitmap scan of elementary school math
             | materials, such a program gives me no easy way to compress
             | that; rather something generic like PNG (that does not know
             | or understand the material) will far outperform.
        
               | _hark wrote:
               | Great point. This points to the related issue: what do we
               | want to compress? Do we want to compress "the answer",
               | here the arithmetic expression's solution, or do we want
               | to compress the image?
               | 
               | You can formalize this with rate--distortion theory, by
               | defining a distortion function that says what your
               | objective is. That implies a well-defined complexity
               | relative to that distortion function.
               | 
               | Okay to be clear, I've written a paper on exactly this
               | topic which will be announced in a week or so. So you
               | won't find anything on the subject yet, haha. But I use
               | _almost_ exactly this example.
        
               | Jerrrry wrote:
               | >Okay to be clear, I've written a paper on exactly this
               | topic which will be announced in a week or so. So you
               | won't find anything on the subject yet, haha. But I use
               | almost exactly this example.
               | 
               | I would use floating-point arithmetic as the
               | example/analogy: one trades off accuracy/precision for
               | exactness/correctness out at the extremities. Answers
               | near the more granular representations will only be
               | able to be represented by their nearest value. If this
               | is foreseeable, the floating-point implementation can
               | be adjusted to change where its "extremities" are.
               | 
               | I used the above analogy and the following when
               | articulating the magnitude of near-losslessness that
               | large LLMs have managed to attain, especially when all
               | of humanity's corpus is compressed into a USB flash
               | drive; the Kolmogorov complexity captured is similar to
               | a masterpiece like the Mona Lisa having to be described
               | in X many brush strokes to achieve Y amount of
               | fidelity.
        
             | vlovich123 wrote:
             | > It doesn't matter if the model is deterministic or
             | probabilistic: a probabilistic model can be used to
             | (losslessly) compress a dataset with entropy coding.
             | 
             | But if you can choose to lose information you can obviously
             | achieve a higher compression score. That's literally what
             | optical & auditory compression exploits. Indeed, we know
             | people generally don't memorize the entire Wikipedia
             | article. Rather they convert what they learn into some
             | internally consistent story that they can then recite at
             | any time & each time they recite it it's even differently
             | worded (maybe memorizing some facts that help solidify the
             | story).
             | 
             | Again, I have no problem with compression and decompression
             | being equated to intelligence provided _both_ are allowed
             | to be lossy (or at least one facet of intelligence). That
             | 's because you get to inject structure into the stored
             | representation that may not otherwise exist in the original
             | data and you get to choose how to hydrate that
             | representation. That's why LZMA isn't "more intelligent"
             | than ZIP - the algorithm itself is "smarter" at compression
             | but you're not getting to AGI by working on a better LZMA.
             | 
             | It's also why H264 and MP3 aren't intelligent either. While
              | compression is lossy, decompression is deterministic. That's
             | why we can characterize LLMs as "more intelligent" than
             | LZMA even though LZMA compresses losslessly better than
             | LLMs.
        
               | _hark wrote:
               | I agree with you in spirit. I just thought you might be
               | interested in some of the technical details regarding the
               | relationship between compression and generalization!
               | 
               | I'll have a paper out next week which makes your point
               | precise, using the language of algorithmic rate--
               | distortion theory (lossy compression applied to
               | algorithmic information theory + neural nets).
               | 
               | I think another way of understanding this is through the
               | "Value Equivalence Principle", which points out that if
               | we are learning a model of our environment, we don't want
               | to model everything in full detail, we only want to model
               | things which affect our value function, which determines
               | how we will act. The value function, in this sense,
               | implies a distortion function that we can define lossy
               | compression relative to.
        
       | misiek08 wrote:
       | I read "decompressor and compressed file", not "fileS". There
       | is only one person correct here.
        
         | stavros wrote:
         | You should read the rest of the thread.
        
       | moonchild wrote:
       | suppose the data were generated by a csprng with 256 bits of
       | state. then in fact there are just slightly more than 256 bits of
       | entropy there (prng state plus the length of the generated file).
       | only, you'd have a snail's chance in hell of actually finding
       | that seed
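       | 
       | A toy version of that point (16-bit seed space so the brute
       | force actually finishes; at 256 bits it is hopeless, as you
       | say):
       | 
       |   import random
       | 
       |   def generate(seed, n):
       |       return random.Random(seed).randbytes(n)
       | 
       |   target = generate(12345, 64)     # the "random" file
       | 
       |   seed = next(s for s in range(2 ** 16)
       |               if generate(s, 64) == target)
       |   print(seed)   # the whole file "compresses" to one small seed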
        
         | echoangle wrote:
         | If he really wanted to make sure he kept the money, he could
         | just use a hardware RNG. I don't know how common they were in
         | 2001 but you could probably even generate some practically
         | unpredictable random numbers using radio noise.
        
       | stavros wrote:
       | Sorry, if you're trying to hustle people by charging $100 per
       | try, don't catch the sleight of hand in the "multiple files"
       | question, and accept, you were beaten at your own game, fair
       | and square.
        
         | l33t7332273 wrote:
         | I feel like if the FAQ requires not using filename shenanigans
         | then the sleight of hand was illegal the whole way.
        
           | stavros wrote:
           | He didn't use filenames, he used files, and if that were
           | illegal, Mike shouldn't have accepted it.
        
       | PaulKeeble wrote:
       | I would not have accepted multiple files or scripting. Scripting
       | offers up the entire contents of the Linux operating system, and
       | this could allow you to use operating system files as partial
       | data. To really test compression/decompression the rules need to
       | be a lot tighter. The challenge was met and beaten because the
       | criteria were bad. It leaves open a number of dishonest
       | approaches.
       | 
       | The decompression program needs to be constrained to itself and
       | the singular file to decompress and can not utilise anything
       | other than basic operating system functions for the purpose of
       | reading and writing the constrained files. Any bytes the program
       | accesses otherwise can be used as part of the total count.
       | 
       | It's very hard to constrain this without some form of cheating
       | being possible, especially if you have some years to set it up.
        
         | echoangle wrote:
         | > Scripts offers up the entire contents of the linux operating
         | system and this could allow you to use operating system files
         | as partial data.
         | 
         | I also thought about this, but when thinking more about the
         | argument by the challenge creator, I understood that the data
         | in the OS doesn't help you.
         | 
         | If you're given a file with 100 bits, there are 2^100
         | possibilities. If you want to compress to 99 bits, there are
         | only 2^99 possible compressed outputs. So when generating and
         | trying to compress every 100 bit file, you're not going to be
         | able to compress half of them. The external data doesn't help
         | you unless you get to modify it after receiving the file to
         | compress. If you can get a patch merged into the Linux kernel,
         | you could use the data, but if the data is fixed, you can't
         | reference the data with less data than the original file (for
         | some cases at least).
         | 
         | Edit: actually, when thinking about it: you could actually use
         | the OS as another few bits of out of band information you get
         | for free, which makes it possible again. If you have 8
         | different OS to choose from, you could use their differences to
         | compress any file. At least theoretically, the 3 bits you gain
         | practically probably aren't enough to make a difference.
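         | 
         | The counting in miniature (nothing specific to the challenge,
         | just the pigeonhole arithmetic behind the argument):
         | 
         |   n = 100
         |   inputs = 2 ** n                 # equally likely n-bit files
         |   outputs = sum(2 ** k for k in range(n))   # shorter strings
         |   # = 1: even allowing every shorter length, the outputs can't
         |   # cover the inputs, so some file must fail to compress
         |   print(inputs - outputs)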
        
           | cozzyd wrote:
           | My decompressor:
           | 
           | curl ftp://path/to/original.dat
        
           | PaulKeeble wrote:
           | One possibility I considered is that your script could use an
           | "apt install" to provide out of band data, you could put the
           | information you need into the operating system from the
           | universe repositories/your own package server.
        
             | echoangle wrote:
             | The computer used for the test probably doesn't have
             | network connectivity, otherwise you could just download the
             | data with curl.
        
       | echoangle wrote:
       | I didn't quite get the method used to "compress" the data from
       | the article, maybe this rephrasing helps someone:
       | 
       | You basically split the file every time you encounter a specific
       | character, and your decompressor just rejoins all the files it
       | finds with the character you split by. If you split at every "X"
       | character, which might occur 1000 times in the file, the
       | decompressor only needs a small script which joins all the files
       | and puts an "X" between them, which is less than 1000 bytes. The
       | "trick" is storing the locations of the Xs you removed in the
       | file sizes of the individual files.
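       | 
       | A bare-bones sketch of that trick (illustrative only: this is
       | not Patrick's actual script, and the separator byte and file
       | names are made up):
       | 
       |   import os
       | 
       |   SEP = b"5"                       # the byte that gets dropped
       | 
       |   def split(data, prefix):
       |       names = []
       |       for i, chunk in enumerate(data.split(SEP)):
       |           name = "%s.%04d" % (prefix, i)
       |           with open(name, "wb") as f:
       |               f.write(chunk)
       |           names.append(name)
       |       return names
       | 
       |   def join(names):
       |       return SEP.join(open(n, "rb").read() for n in names)
       | 
       |   data = os.urandom(100_000)
       |   parts = split(data, "piece")
       |   assert join(parts) == data
       |   # the parts hold fewer bytes than the original, but only
       |   # because the file system now stores the separator positions
       |   # as file boundaries (plus names, inodes, directory entries)
       |   print(sum(os.path.getsize(n) for n in parts), len(data))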
        
       | avmich wrote:
       | I particularly like this:
       | 
       | > It's not my fault that a file system uses up more space storing
       | the same amount of data in two files rather than a single file.
        
       | teo_zero wrote:
       | Mike supports $SPORT team the Compressors. He's so sure they are
       | unbeatable that he accepts 50:1 bets against them. Patrick bets
       | $100 that they won't win this year's championship; Mike accepts.
       | Later, the Compressors announce financial troubles and can't pay
       | the fee to enter the championship, which is then won by another
       | team. Patrick claims his $5000. Mike refuses to pay, saying that
       | the Compressors have not been beaten, and the championship result
       | does not disprove his claim that the Compressors are unbeatable,
       | which was the spirit of the bet from the beginning. Mike offers
       | to refund the $100. Patrick holds that the terms of the bet were
       | met and he deserves the $5000.
        
       | nojvek wrote:
       | I'd take this bet.
       | 
       | > With this offer, you can tune your algorithm to my data.
       | 
       | One can generate a 1GB file or 10GB file. It is highly likely
       | that there is some form of a repeatable pattern in there to shave
       | off 50-100 bytes by sliding window search.
       | 
       | Then the decompressor is essentially - at this index, expand this
       | pattern. The compressed file excludes the range of pattern.
       | 
       | One may not always get such a pattern, but on multiple tries it's
       | very feasible. Payout just needs one win in 50 tries.
       | 
       | You could generate a 100GB file. The bigger the file the higher
       | the chance of repeating pattern.
       | 
       | The challenge is won if compressed_file + decompressor is at
       | least one byte smaller than the original file.
       | 
       | One could have a self-executing decompressor to save some file
       | overhead bytes.
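       | 
       | A quick sketch of that sliding-window search (my own code; it
       | mainly shows how short the longest repeat in uniformly random
       | data actually is):
       | 
       |   import os
       | 
       |   def has_repeat(data, k):
       |       seen = set()
       |       for i in range(len(data) - k + 1):
       |           chunk = data[i:i + k]
       |           if chunk in seen:
       |               return True
       |           seen.add(chunk)
       |       return False
       | 
       |   data = os.urandom(1_000_000)
       |   k = 1
       |   while has_repeat(data, k + 1):
       |       k += 1
       |   # typically 4 or 5 bytes for a 1 MB file: far too short to
       |   # pay for the decompressor plus the offsets that point at it
       |   print("longest repeated pattern:", k, "bytes")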
        
         | Jerrrry wrote:
         | good compression = pattern eventually
         | 
         | maximum compression = indistinguishable from random data
        
         | crazygringo wrote:
         | As a general principle, absolutely.
         | 
         | In practice, I wonder what size of file we're talking about
         | that would result in net compression on random data 50% of the
         | time?
         | 
         | I have no intuition whether it's 1 GB, 1 TB, 1 PB, or beyond.
        
       | dang wrote:
       | Related. Others?
       | 
       |  _The $5000 Compression Challenge_ -
       | https://news.ycombinator.com/item?id=9163782 - March 2015 (168
       | comments)
       | 
       |  _The $5000 Compression Challenge_ -
       | https://news.ycombinator.com/item?id=5025211 - Jan 2013 (61
       | comments)
       | 
       |  _The $5000 Compression Challenge_ -
       | https://news.ycombinator.com/item?id=4616704 - Oct 2012 (119
       | comments)
        
       | trod1234 wrote:
       | This gets reposted quite often.
       | 
       | The crux of the ongoing debate over the post has to do with the
       | ambiguity of the stated requirements.
       | 
       | For there to be any rational discussion, you need to have proper
       | unique definitions for what is being communicated. Without a
       | definition (identity) no system of truth seeking will provide a
       | definitive truth.
       | 
       | Most notably, "compression" has many different meanings currently
       | in practical use. For example many people use the word
       | compression and simply mean that the uncompressed size is more
       | than the compressed size; lossy vs lossless doesn't matter when
       | it's good enough; all that matters is it's been reduced, and an
       | example of that might be H264 video.
       | 
       | Other more rigorous individuals will use it to mean entropy
       | encoding, which has a strict definition, part of which is
       | lossless, and in these cases they often view Shannon's theorems
       | as gospel, and at that point they often stop thinking and may not
       | try to sharpen their teeth on a seemingly impossible task.
       | 
       | These theorems are often very nuanced, with the bounds very
       | theoretical or still undiscovered, though still seemingly
       | correct. When someone is told something is impossible, they don't
       | fully apply themselves to try to find a solution. It's a worn path
       | that leads nowhere, or so they imagine.
       | 
       | This mindset of learned helplessness prevents a nuanced intuitive
       | understanding of such problems and any advances stagnate. People
       | take the impossibility at face value, and overgeneralize it
       | (without that nuanced understanding); this is a fallacy, and it
       | can be very difficult to recognize when there is no simple way
       | to refute it on its face.
       | 
       | Here is something to consider. Shannon's source coding theorem
       | relies on the average uncertainty of probabilities as a measure
       | of entropy (information). Some exploration is needed to
       | understand what makes this true, or where it fails. It has been
       | my experience that only a very few exceptional people ever
       | attempt this process.
       | 
       | The main question that jumps out in my mind since compression has
       | been a thing I've been interested in for years; is, can an
       | average entropy in statistics tell us accurately whether this is
       | true in systems with probabilities and distributions that are a
       | mix of mathematical chaos but also contain regularity within
       | their domains? Statistics can be unwieldy beasts, and act as
       | regular sources of misunderstanding and falsehood without a very
       | rigorous approach.
       | 
       | For a similar problem, people may find this article particularly
       | relevant with regards to the chaotic 3-body problem in
       | astrophysics.
       | 
       | https://www.aanda.org/articles/aa/full_html/2024/09/aa49862-...
        
       | Xcelerate wrote:
       | There's another interesting loophole to these general compression
       | challenges. A universal Turing machine will necessarily compress
       | some number of strings (despite the fact that almost all strings
       | are incompressible). The set of compressible strings varies
       | depending on the choice of UTM, and if your UTM is fixed, you're
       | out of luck for random data. But if the UTM is unspecified, then
       | there exist an infinite number of UTMs that will compress any
       | specific string.
       | 
       | To some extent, this resembles the approach of "hiding the data
       | in the decompressor". But the key difference here is that you can
       | make it less obvious by selecting a particular programming
       | language capable of universal computation, and it is the _choice
       | of language_ that encodes the missing data. For example, suppose
       | we have ~17k programming languages to choose from--the language
       | selection itself encodes about 14 bits (log2(17000)) of
       | information.
       | 
       | If there are m bits of truly random data to compress and n
       | choices of programming languages capable of universal
       | computation, then as n/m approaches infinity, the probability of
       | at least one language being capable of compressing an arbitrary
       | string approaches 1. This ratio is likely unrealistically large
       | for any amount of random data more than a few bits in length.
       | 
       | There's also the additional caveat that we're assuming the set of
       | compressible strings is algorithmically independent for each
       | choice of UTM. This certainly isn't the case. The invariance
       | theorem states that ∀x |K_U(x) - K_V(x)| < c for UTMs U and V,
       | where K is Kolmogorov complexity and c is a constant that depends
       | only on U and V. So in our case, c is effectively the size of the
       | largest minimal transpiler between any two programming languages
       | that we have to choose from, and I'd imagine c is quite small.
       | 
       | Oh, and this all requires computing the Kolmogorov complexity of
       | a string of random data. Which we can't, because that's
       | uncomputable.
       | 
       | Nevertheless it's an interesting thought experiment. I'm curious
       | what the smallest value of m is such that we could realistically
       | compress a random string of length 2^m given the available
       | programming languages out there. Unfortunately, I imagine it's
       | probably like 6 bits, and no one is going to give you an award
       | for compressing 6 bits of random data.
        
       ___________________________________________________________________
       (page generated 2024-11-24 23:00 UTC)