[HN Gopher] Paradoxical Compression with VDF
       ___________________________________________________________________
        
       Paradoxical Compression with VDF
        
       Author : rdpintqogeogsaa
       Score  : 76 points
       Date   : 2021-10-03 13:16 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | Arkanosis wrote:
        | Another implementation of paradoxical compression, which,
        | applied recursively, also allows for infinite compression
        | (i.e. down to 0 bytes) of any file:
       | 
       | https://cs.fit.edu/~mmahoney/compression/barf.html
       | 
       | Of course, there's a catch.
        
         | messe wrote:
          | Given that barf is already using the filename to indicate the
          | compression, why not just move the final byte of the file into
          | the filename instead? Something like,
          | 
          |     head -c -1 "$file" > "$file$(tail -c1 "$file")"
         | 
         | EDIT: Obviously this has the corner case of null bytes
        
         | chrisseaton wrote:
          | Is he just moving bytes from the file into the file name? He
          | should count the increased size of the file name in the size of
          | the compressed file, otherwise he's not counting all of the
          | essential data you need for the decompression.
          | 
          | You can decompress a gzip file without the file name (think
          | about tar using a pipe).
        
           | karmanyaahm wrote:
           | I'm pretty sure barf is a joke?
        
       | slaymaker1907 wrote:
        | Paradoxical compression really isn't that bad of a problem in
        | practice, since you can just dedicate a single flag bit (in
        | practice, a byte) at the start of the data to indicate whether a
        | file is compressed or not. This guarantees that a file is no more
        | than 1 byte larger than it would be "uncompressed" (in that case
        | we just prepend a byte to indicate we have not used our regular
        | compression algorithm).
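        | 
        | A minimal sketch of that flag-byte convention (Python, with zlib
        | standing in for whatever codec you actually use; the names are
        | just illustrative):
        | 
        |     import zlib
        | 
        |     def compress(data: bytes) -> bytes:
        |         packed = zlib.compress(data)
        |         if len(packed) < len(data):
        |             return b'\x01' + packed   # flag: payload compressed
        |         return b'\x00' + data         # flag: stored as-is (+1 byte)
        | 
        |     def decompress(blob: bytes) -> bytes:
        |         flag, payload = blob[:1], blob[1:]
        |         if flag == b'\x01':
        |             return zlib.decompress(payload)
        |         return payload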
        
         | anderskaseorg wrote:
          | It sounds like you misunderstand what "paradoxical compression"
          | means. It does not refer to the problem where compression
          | sometimes increases the size. A (purported) paradoxical
          | compression algorithm is one that sometimes decreases and
          | _never_ increases the size.
        
           | IncRnd wrote:
            | This is not purely paradoxical compression; it cheats in
            | this area. You might want to read the GitHub page that was
            | linked as the article.
        
       | willvarfar wrote:
       | Off the top of my head, how about:
       | 
        | Compression: gzip the file and tack a crypto hash on the end.
        | Discard if not shorter and store the original instead.
        | 
        | Decompression: if it has a gzip header and the crypto hash at the
        | end matches, unzip. Else return the source as is.
       | 
       | Have I just paradoxically compressed something?
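        | 
        | A rough literal reading of that in Python (SHA-256 as the crypto
        | hash, the 1f 8b magic bytes as the "gzip header" check; just a
        | sketch):
        | 
        |     import gzip, hashlib
        | 
        |     TAG = 32  # bytes of SHA-256 appended after the gzip stream
        | 
        |     def compress(data: bytes) -> bytes:
        |         packed = gzip.compress(data)
        |         cand = packed + hashlib.sha256(packed).digest()
        |         # discard if not shorter; store the original instead
        |         return cand if len(cand) < len(data) else data
        | 
        |     def decompress(blob: bytes) -> bytes:
        |         body, tag = blob[:-TAG], blob[-TAG:]
        |         ok = (body.startswith(b'\x1f\x8b')
        |               and hashlib.sha256(body).digest() == tag)
        |         if ok:
        |             return gzip.decompress(body)
        |         return blob  # else return the source as is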
        
         | [deleted]
        
         | im3w1l wrote:
          | Well, let's analyze it using the game from the paper. I believe
          | an attacker could do the following:
         | 
         | 1. Take something compressible like x0=AAAA...
         | 
         | 2. Compress to get y0.
         | 
          | 3. Take x1=y0. Compress to get y1. Now most likely y0=y1,
          | meaning that decompressing y1 doesn't give x1, so you lose. But
          | if y1 != y0, then
         | 
         | 4. Take x{i+1} = y{i} and compress. Eventually there will be a
         | collision.
         | 
         | Preventing this is why the author's scheme is so complex.
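          | 
          | Roughly, in code (using the hypothetical compress/decompress
          | from the parent comment):
          | 
          |     x = b'A' * 1000          # x0: something very compressible
          |     seen = {x}
          |     while True:
          |         y = compress(x)      # y_i
          |         if decompress(y) != x:
          |             print('round trip broke, attacker wins')
          |             break
          |         if y in seen:
          |             # two distinct inputs mapped to the same output, so
          |             # at least one of them cannot decompress correctly
          |             print('collision found, attacker wins')
          |             break
          |         seen.add(y)
          |         x = y                # x_{i+1} = y_i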
        
         | svara wrote:
          | That doesn't work, since there's no way to distinguish a
          | compressed file from an uncompressed file that just happens to
          | have the same contents. The hash doesn't help with that, since
          | it has collisions; it's the same pigeonhole issue.
         | 
         | Most files in the space of all possible files will be
         | essentially random and not compress, and adding a header will
         | make them larger.
        
           | willvarfar wrote:
           | A compressed file is distinguished by being a valid gzip file
           | with a valid hash. Such a file won't compress a second time,
           | so the length will remain unchanged. Then, decompressing
           | twice will also generate the same output.
        
             | comex wrote:
             | Then what happens if the user starts with a file that
             | happens to be a valid gzip file with a valid hash,
             | compresses it twice, and decompresses it twice?
        
           | nuccy wrote:
            | Ok, then what if we save some space for the single-bit
            | indicator (compressed/not compressed) by following a simple
            | algorithm:
            | 
            | 1. Look for at least two identical pieces, as long as
            | possible, at arbitrary bit offsets from the beginning.
            | 
            | 2. Record the value and length (common to all pieces) and the
            | offset of each piece.
            | 
            | 3. Create and prepend a header with the following fields:
            | compressed_flag, piece length, piece count, piece value,
            | list of piece offsets, content.
            | 
            | Obviously if the file is very small this approach will not
            | work. But assuming that we know the file size (which is the
            | loophole here), we can pre-calculate a threshold: above it
            | the file is considered to carry a header, below it the
            | content is saved uncompressed and headerless.
            | 
            | P.S. To remove the loophole, streams of data should be
            | considered rather than files.
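            | 
            | (The header from step 3, sketched in Python with an invented
            | fixed-width layout, just to make the bookkeeping concrete:)
            | 
            |     import struct
            | 
            |     def pack(piece: bytes, offsets: list, rest: bytes) -> bytes:
            |         # compressed_flag, piece length, piece count
            |         hdr = struct.pack('>BII', 1, len(piece), len(offsets))
            |         offs = b''.join(struct.pack('>Q', o) for o in offsets)
            |         return hdr + piece + offs + rest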
        
         | detaro wrote:
          | Fails if fed its own output again: if that output was
          | compressed before, it now goes down the uncompressed path.
        
           | willvarfar wrote:
            | Compressing file A generates compressed file B.
            | 
            | Compressing file B fails to make it shorter, so the generated
            | file C is actually identical to B.
            | 
            | We've just compressed twice. Uncompressing twice generates A
            | again.
        
             | jonas21 wrote:
             | The problem is if you start with B. Then you compress it to
             | get C.
             | 
             | Since B == C, when you uncompress C, you get A, which is
             | not what you started with.
        
             | detaro wrote:
              | I'm talking about compressing/decompressing that once. Then
              | input != output, so formally it's not a working
              | compression.
        
       | emilfihlman wrote:
       | >Of course I am slightly cheating here, but the cheat is
       | conceptually interesting (and fun). The idea is that paradoxical
       | compression might have only a small density of problematic cases
       | (inputs where the algorithm misbehaves, e.g. by expanding the
       | length or not decompressing to the original input) and tools from
       | the field of cryptography can be used to "push away" these cases
       | so that they mathematically exist but nobody can find one with
       | nonnegligible probability. In essence, this is transforming the
       | question from "is it mathematically possible?" to "can I claim to
       | do it and not being detected as a fraud?". This change of model
       | makes paradoxical compression possible.
       | 
       | Isn't this trivially wrong?
       | 
        | If you have a bitstring s and you compress it to c, where the
        | length of s is larger than the length of c, compressing c again
        | produces an issue.
        | 
        | For example, if you compress a bitstring to just 1, how do you
        | compress 1? To 0? And how do you compress 0 then? We can use this
        | principle to find incompressible bitstrings of any length
        | anywhere for any algorithm.
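        | 
        | (The counting behind that, as a quick sanity check:)
        | 
        |     # 2**n strings of exactly n bits, but only 2**n - 1 strings
        |     # of fewer than n bits, so no injective compressor can
        |     # shorten every n-bit string.
        |     n = 16
        |     assert sum(2**k for k in range(n)) == 2**n - 1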
       | 
        | An idea pops into my head regarding this, though: in signal
        | analysis there's a concept called noise shaping (which matches
        | OP's ideas to a great degree). The idea is to push any noise out
        | of some band and into another; it's super neat actually.
       | 
       | Could this idea be expanded such that there is some singular
       | monster string that expands to an absurd size, while others stay
       | the same or are smaller? How about multiple but still small in
       | density?
       | 
       | My intuition sadly says no because of what I started this comment
       | with. It's trivial to find strings that aren't compressible.
       | 
       | Please let me know if and where my ideas on this go wrong!
        
       | im3w1l wrote:
        | Very neat, but my preference would be for it to very rarely
        | expand the file, instead of very rarely decompressing wrongly.
        | Then you could actually use it in practice. It would be pretty
        | straightforward to change it as far as I can tell.
        | 
        | Edit: If c+1 would wrap around, you could fall back on encoding
        | as C0(x) || enc(0) || Mac(C0(x) || enc(0)).
        
         | labawi wrote:
          | If the MAC and construction are solid, then wrong decompression
          | is avoidable, both theoretically, practically, and by design,
          | in many different ways, except for the one you mention: the
          | rare collision.
          | 
          | There likely won't be enough computation, ever, to trigger an
          | accidental collision, even if it was a galactic Bitcoin
          | equivalent.
          | 
          | I'd only be worried about deliberate collisions, if the
          | construction isn't sufficiently hard.
        
         | assbuttbuttass wrote:
         | Isn't it just a normal compression algorithm then?
        
           | im3w1l wrote:
           | No, a normal compression algorithm will often (as opposed to
           | very rarely) expand the output when given a random input.
        
             | assbuttbuttass wrote:
             | I don't think you could expect this to do any better,
             | though. Apparently this algorithm uses DEFLATE under the
             | hood
        
           | goldenkey wrote:
           | Most compression algos have a simple flag to indicate an
           | embedded file for cases where compression is not workable.
           | All the same principles still apply, it's just that the
           | compression algo is now meta of sorts.
        
       | unnouinceput wrote:
       | Quote: "A lossless compression algorithm that can actually reduce
       | the size of at least one input file must also necessarily expand
       | the size of at least one other possible input file. This is a
       | direct application of the pigeonhole principle. A compression
       | algorithm that claims to never increase the size of any input,
       | while at the same time reducing the size of at least some inputs
       | (it's not the trivial do-nothing compression that, by definition,
       | preserves the length of the input), is called paradoxical
       | compression."
       | 
        | This paradoxical compression, as stated above, is incomplete.
        | You can get every input file to compress without having to
        | expand any other file, but you would expand time instead. Let me
        | explain: any set of data can be expressed using mathematical
        | functions/algorithms that can be saved as one file, very small
        | compared with the original data, but to do so you'll need to
        | invest a lot of time to find those functions/algorithms. One
        | practical example would be a mission to Alpha Centauri which
        | might want to have Earth's internet on board but no physical
        | space for all that data. Let's say it takes 10 years to prepare
        | the mission; in that case you'd also use those 10 years to
        | compress, for argument's sake, Google's big index (or the NSA
        | Utah center if you like that one), and you'll have on board, on
        | a single thumb stick, the Internet version from 10 years before
        | launch.
        | 
        | Edit: Since some people have problems understanding this extreme
        | compression, I am trying to clarify this further. This method is
        | unique to that set of data. It can be applied to only that data,
        | only one time, and that's it. It is not a general set of
        | algorithms that lets you compress anything else; it will fail if
        | not applied to that data. It is not a classic compression; it is
        | more, if you like the term, a data generator. It will run a
        | program that at the end of its run will generate that Internet
        | version you started with 10 years ago. But to do that you'll
        | need to research how to do that data generation, hence the
        | extreme time it takes to achieve. In practice I'd envision a lot
        | of people each getting their slice of data and writing one
        | function, and the program running those functions one after
        | another, generating the data.
        
         | anonymoushn wrote:
         | This is incorrect. If you would like to encode every string
         | that's 10 gigabytes long to a string that's at least 1 byte
         | shorter, then you don't have enough strings that are at least 1
         | byte shorter. There are fewer of them by a factor of about 256,
         | and each one of them can decode to only one 10-gigabyte-long
         | string. This is explained in the article.
        
         | beecafe wrote:
         | No, you can't. Consider applying your compression algo twice,
         | or N times. If it gets smaller each time, then eventually you
         | could compress anything to a single bit.
        
           | unnouinceput wrote:
            | You can't apply this method twice. Please understand this
            | particular compression type is unique to a single given fixed
            | set of data. That's it! It's not a general compression
            | method; it's custom-made for that data only. And it takes
            | time to research it to achieve that extreme compression.
        
             | anonymoushn wrote:
             | It doesn't work. If it did work, you could apply it
             | separately to every string containing 100 bits and end up
             | with 2^100 separate strings containing fewer than 100 bits,
             | each of which decodes to a unique 100-bit string.
        
               | unnouinceput wrote:
               | You're still missing the point. Please read the edit I
               | made to root comment.
        
               | anonymoushn wrote:
               | You're still wrong. A data generator is a program that
               | runs on some computer. Programs are strings with lengths.
               | 
               | If you'd like to use a purpose-built computer or decoder
               | for each output, then you need to include the complexity
               | of the computer or decoder in the compressed size.
               | Otherwise you could write a decoder such as "when given
               | any input, output the contents of wikipedia" and claim
               | that it compresses a certain text to 0 bytes.
        
             | beecafe wrote:
              | If your compression algorithm worked for just one fixed set
              | of data, then you could just store that fixed data in the
              | decompressor and achieve infinite compression of the data
              | down to 0 bits.
        
         | gus_massa wrote:
         | You mean something like http://prize.hutter1.net/ ?
         | 
          | The problem is that once you fix the architecture (for example
          | an x64-compatible processor) and you try to compress every
          | possible 1GB file like in my link, for some of them you will
          | get a size of decompressor+data that is very small, especially
          | if you have a lot of time. This is essentially the challenge.
          | The articles in Wikipedia have a lot of internal structure, so
          | if you are smart and/or have a lot of time you can get a very
          | small size.
          | 
          | Anyway, once you have fixed the architecture, it is possible
          | that an evil person looks at the details of the architecture
          | and makes a file that is so weird that the size of
          | decompressor+data is bigger, no matter how smart you are or
          | how much time you have for the brute-force approach.
          | 
          | The Pigeonhole Principle guarantees that the evil person can
          | always succeed, even though the problematic file may not be
          | very interesting and may be very difficult to find.
          | 
          | (Actually, any file of cryptographically random numbers will
          | probably work. A file of pseudorandom numbers is not enough,
          | because with enough patience you may find the algorithm, seed,
          | and offset and use those as the compressed version. The digits
          | of pi, skipping the first million, are a bad idea for the same
          | reason.)
        
           | unnouinceput wrote:
           | Yes, something like that, except in my case you'd need a lot
           | more than 50 hours to generate back the data.
           | 
            | Also, I agree with your points; this assumes every
            | participant is honest when researching their slice, as per my
            | edit.
        
             | gus_massa wrote:
              | With your method, you will be able to compress most
              | sensible files but not _all_ files. The main point of this
              | article is that with any method, you cannot compress _all_
              | files.
              | 
              | In the case of the ship, they will be able to store a very
              | compressed version of Wikipedia, but they will not be able
              | to store a file of the same size that records the last bit
              | of the central pixel of a CCD camera pointed at the sun,
              | filtered with the von Neumann trick:
              | https://en.wikipedia.org/wiki/Fair_coin#Fair_results_from_a_...
        
         | aaaaaaaaaaab wrote:
         | Your idea is information-theoretically unsound. You cannot
         | create information out of thin air.
        
       | madars wrote:
        | Very nice! A tl;dr with spoilers -- for a typical compression
        | algorithm C(x), applying it over and over again (C(x), C(C(x)),
        | C(C(C(x))), ...) we would quickly arrive at an input for which
        | the compressed length increases. Here the author proposes a C(x)
        | which either equals x (no compression), or encodes a normal
        | compression (e.g. gzip) plus a MAC'd counter. Applying C(x) to
        | the latter case again would simply increment the counter, so an
        | adversary who only issues a bounded number of compression
        | requests gets "stuck" on a chain that contains no contradiction.
        | (The attacker could try to get to y = C(x) with a large counter
        | value without the compressor's help, but that requires predicting
        | the MAC, which is infeasible.) The requirement for C and D to
        | share a secret (here, a MAC key) can be removed by replacing the
        | MAC with a VDF proof that you spent O(counter) sequential time
        | producing the compressed output, so a computationally bounded
        | attacker can't overflow the counter (in practice, a 128-bit
        | counter requiring ~2^128 time) and disprove the paradox.
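        | 
        | A minimal sketch of the MAC'd-counter idea as described above
        | (not the author's actual code: zlib stands in for gzip,
        | HMAC-SHA256 for the MAC, the byte layout is invented, and the
        | VDF part is omitted, so C and D share a key):
        | 
        |     import hmac, hashlib, struct, zlib
        | 
        |     KEY = b'shared between compressor and decompressor'
        |     MAGIC = b'PC01'
        | 
        |     def mac(blob):
        |         return hmac.new(KEY, blob, hashlib.sha256).digest()[:16]
        | 
        |     def encode(body, ctr):
        |         blob = MAGIC + struct.pack('>Q', ctr) + body
        |         return blob + mac(blob)
        | 
        |     def parse(x):
        |         if len(x) < 28 or not x.startswith(MAGIC):
        |             return None
        |         blob, tag = x[:-16], x[-16:]
        |         if not hmac.compare_digest(tag, mac(blob)):
        |             return None
        |         return blob[12:], struct.unpack('>Q', blob[4:12])[0]
        | 
        |     def C(x):
        |         p = parse(x)
        |         if p is not None:                  # already one of our
        |             return encode(p[0], p[1] + 1)  # outputs: bump counter
        |         y = encode(zlib.compress(x), 0)
        |         return y if len(y) < len(x) else x  # never expand
        | 
        |     def D(x):
        |         p = parse(x)
        |         if p is None:
        |             return x
        |         body, ctr = p
        |         if ctr > 0:
        |             return encode(body, ctr - 1)
        |         return zlib.decompress(body)
        | 
        | With this, D(C(x)) == x and len(C(x)) <= len(x) for any input an
        | attacker can feasibly construct; the failure cases all require
        | forging the MAC (or, in the keyless VDF version, ~2^128
        | sequential work to overflow the counter).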
        
       | DemocracyFTW wrote:
       | > Paradoxical compression is mathematically impossible. This
       | repository contains a description of how to nonetheless achieve
       | it, as well as a demo implementation.
       | 
       | That's _AWESOME_ show me!
       | 
       | > Of course I am slightly cheating here
       | 
       |  _mkay..._ :-(
        
       ___________________________________________________________________
       (page generated 2021-10-03 23:01 UTC)