[HN Gopher] Paradoxical Compression with VDF
___________________________________________________________________
Paradoxical Compression with VDF
Author : rdpintqogeogsaa
Score : 76 points
Date : 2021-10-03 13:16 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| Arkanosis wrote:
| Another implementation of paradoxical compression, which, applied
| recursively, also allows for infinite compression (i.e. down to 0
| bytes) of any file:
|
| https://cs.fit.edu/~mmahoney/compression/barf.html
|
| Of course, there's a catch.
| messe wrote:
| Given that barf is already using the filename to indicate the
| compression, why not just move the final byte of the file into
| the filename instead? Something like:
|
|   head -c -1 "$file" > "$file$(tail -c1 "$file")"
|
| EDIT: Obviously this has the corner case of null bytes
| chrisseaton wrote:
| Is he just moving bytes from the file into the file name? He
| should count the increased size of the file name in the size of
| the compressed file; otherwise he's not counting all of the
| essential data you need for decompression.
|
| You can decompress a gzip file without the file name (think
| about tar using a pipe.)
| karmanyaahm wrote:
| I'm pretty sure barf is a joke?
| slaymaker1907 wrote:
| Paradoxical compression really isn't that bad of a problem in
| practice since you can just dedicate a single bit at the start of
| the data to indicate whether a file is compressed or not. This
| guarantees that a file is no more than 1 byte larger than it
| would be "uncompressed" (since in this case we just add on a byte
| to indicate we have not used our regular compression algorithm).
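|
| A minimal sketch of that flag-byte idea (hypothetical Python, not
| code from the linked repo): the first byte records whether the
| payload is DEFLATE output or the raw input, so the result is
| never more than 1 byte larger than the input.
|
|   import zlib
|
|   def compress(data: bytes) -> bytes:
|       z = zlib.compress(data, 9)
|       if len(z) < len(data):
|           return b"\x01" + z   # 0x01: payload is DEFLATE-compressed
|       return b"\x00" + data    # 0x00: payload stored verbatim
|
|   def decompress(blob: bytes) -> bytes:
|       flag, payload = blob[0], blob[1:]
|       return zlib.decompress(payload) if flag == 1 else payload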
| anderskaseorg wrote:
| It sounds like you misunderstand what "paradoxical compression"
| means. It does not refer to the problem where compression
| sometimes increases the size. A (purported) paradoxical
| compression algorithm is one that sometimes decreases and
| _never_ increases the size.
| IncRnd wrote:
| This is not purely paradoxical compression; it cheats in this
| area. You might want to read the GitHub page that was linked as
| the article.
| willvarfar wrote:
| Off the top of my head, how about:
|
| Compression: gzip the file and tack a crypto hash on the end.
| Discard the result if it's not shorter and store the original
| instead.
|
| Decompression: if it has a gzip header and the crypto hash at the
| end matches, unzip. Else return the source as-is.
|
| Have I just paradoxically compressed something?
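|
| (A minimal sketch of this scheme, in hypothetical Python, with
| SHA-256 standing in for the crypto hash; the replies below show
| where it still breaks.)
|
|   import gzip, hashlib
|
|   def compress(data: bytes) -> bytes:
|       candidate = gzip.compress(data)
|       candidate += hashlib.sha256(candidate).digest()
|       # keep the original if the tagged gzip isn't shorter
|       return candidate if len(candidate) < len(data) else data
|
|   def decompress(blob: bytes) -> bytes:
|       body, tag = blob[:-32], blob[-32:]
|       if (body[:2] == b"\x1f\x8b"
|               and hashlib.sha256(body).digest() == tag):
|           return gzip.decompress(body)
|       return blob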
| [deleted]
| im3w1l wrote:
| Well let's analyze it using the game from the paper. An
| attacker could do the following I believe:
|
| 1. Take something compressible like x0=AAAA...
|
| 2. Compress to get y0.
|
| 3. Take x1=y0. Compress to get y1. Now most likely y1=y0, meaning
| that decompressing y1 gives x0 rather than x1, so you lose. But
| if y1 != y0, then
|
| 4. Take x{i+1} = y{i} and compress. Eventually there will be a
| collision.
|
| Preventing this is why the author's scheme is so complex.
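|
| That game, as a sketch in hypothetical Python (compress and
| decompress being whatever scheme is under attack):
|
|   def attack(compress, decompress, x=b"A" * 1000, max_steps=1000):
|       # feed the compressor its own output until the pigeonhole bites
|       for _ in range(max_steps):
|           y = compress(x)
|           if len(y) > len(x):
|               return ("output grew", x)
|           if decompress(y) != x:
|               return ("round trip failed", x)
|           x = y              # next input is the previous output
|       return ("survived", x)
|
| Plugged into the gzip-plus-hash sketch above, this already fails
| at the second step: the second "compression" is a no-op, so
| decompressing its output yields the original AAAA... run rather
| than the gzip blob it was fed.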
| svara wrote:
| That doesn't work, since there's no way to distinguish a
| compressed file from an uncompressed file that just happens to
| have the same contents. The hash doesn't help with that, since
| it has collisions; it's the same pigeonhole issue.
|
| Most files in the space of all possible files will be
| essentially random and not compress, and adding a header will
| make them larger.
| willvarfar wrote:
| A compressed file is distinguished by being a valid gzip file
| with a valid hash. Such a file won't compress a second time,
| so the length will remain unchanged. Then, decompressing
| twice will also generate the same output.
| comex wrote:
| Then what happens if the user starts with a file that
| happens to be a valid gzip file with a valid hash,
| compresses it twice, and decompresses it twice?
| nuccy wrote:
| Ok, then what if we save some space for the single bit
| indicator (compressed/non compressed) by following a simple
| algorithm:
|
| 1. Look for at least two identical pieces, as long as possible,
| at arbitrary bit offsets from the beginning.
|
| 2. Record the value and length (common to all pieces) and the
| offset of each piece.
|
| 3. Create and prepend a header with the following fields:
| compressed_flag, piece length, piece count, piece value, piece
| offsets list, content
|
| Obviously if the file is very small this approach will not
| work. But assuming that we know the file size (which is the
| loophole here), we can pre-calculate a threshold size above
| which content is stored with a header, and below which the
| uncompressed, headerless content is stored as-is.
|
| P.S. To remove that loophole, streams of data should be
| considered instead of files.
| detaro wrote:
| Fails if fed its own output again: output that was compressed
| before now falls into the uncompressed case.
| willvarfar wrote:
| Compressing file A generates compressed file B.
|
| Compressing file B fails to make it shorter, so the generated
| file C is actually identical to B.
|
| We've just compressed twice. Uncompressing twice generates A
| again.
| jonas21 wrote:
| The problem is if you start with B. Then you compress it to
| get C.
|
| Since B == C, when you uncompress C, you get A, which is
| not what you started with.
| detaro wrote:
| I'm talking about compressing/decompressing that once. Then
| input != output, thus formally it's not a working compression
| scheme.
| emilfihlman wrote:
| >Of course I am slightly cheating here, but the cheat is
| conceptually interesting (and fun). The idea is that paradoxical
| compression might have only a small density of problematic cases
| (inputs where the algorithm misbehaves, e.g. by expanding the
| length or not decompressing to the original input) and tools from
| the field of cryptography can be used to "push away" these cases
| so that they mathematically exist but nobody can find one with
| nonnegligible probability. In essence, this is transforming the
| question from "is it mathematically possible?" to "can I claim to
| do it and not being detected as a fraud?". This change of model
| makes paradoxical compression possible.
|
| Isn't this trivially wrong?
|
| If you have a bitstring s and you compress it to c, where the
| length of s is larger than the length of c, compressing c again
| produces an issue.
|
| For example, if you compress a bitstring to just 1, how do you
| compress 1? To 0? And how do you compress 0 then? We can use this
| principle to find incompressible bitstrings of any length
| anywhere for any algorithm.
|
| An idea pops into my head regarding this, though: in signal
| analysis there's a concept called noise shaping (which matches
| OP's ideas to a great degree); the idea is to push any noise out
| of one band and into another, and it's super neat actually.
|
| Could this idea be expanded such that there is some singular
| monster string that expands to an absurd size, while others stay
| the same or are smaller? How about multiple but still small in
| density?
|
| My intuition sadly says no because of what I started this comment
| with. It's trivial to find strings that aren't compressible.
|
| Please let me know if and where my ideas on this go wrong!
| im3w1l wrote:
| Very neat, but my preference would be for it to very rarely
| expand the file, instead of very rarely decompressing
| incorrectly. Then you could actually use it in practice. It
| would be pretty straightforward to change it as far as I can
| tell.
|
| Edit: If c+1 would wrap around, you could fall back on encoding
| as C0(x) || enc(0) || Mac(C0(x) || enc(0)).
| labawi wrote:
| If the MAC and construction are solid, then wrong decompression
| isn't achievable, theoretically, practically, or by design, in
| any way except the one you mention - the rare collision.
|
| There likely won't be enough computation, ever, to trigger an
| accidental collision, even with a galactic Bitcoin-mining
| equivalent.
|
| I'd only be worried about deliberate collisions, if the
| construction isn't sufficiently hard.
| assbuttbuttass wrote:
| Isn't it just a normal compression algorithm then?
| im3w1l wrote:
| No, a normal compression algorithm will often (as opposed to
| very rarely) expand the output when given a random input.
| assbuttbuttass wrote:
| I don't think you could expect this to do any better,
| though. Apparently this algorithm uses DEFLATE under the
| hood
| goldenkey wrote:
| Most compression algos have a simple flag to indicate an
| embedded file for cases where compression is not workable.
| All the same principles still apply, it's just that the
| compression algo is now meta of sorts.
| unnouinceput wrote:
| Quote: "A lossless compression algorithm that can actually reduce
| the size of at least one input file must also necessarily expand
| the size of at least one other possible input file. This is a
| direct application of the pigeonhole principle. A compression
| algorithm that claims to never increase the size of any input,
| while at the same time reducing the size of at least some inputs
| (it's not the trivial do-nothing compression that, by definition,
| preserves the length of the input), is called paradoxical
| compression."
|
| This paradoxical compression, as stated above, is incomplete.
| You can have all input files compressed without having to expand
| any other file, but you would expand time instead. Let me
| explain: any set of data can be expressed using mathematical
| functions/algorithms that can be saved as one file, very small
| compared with the original data, but to do so you'll need to
| invest a lot of time to find those functions/algorithms. One
| practical example would be a mission to Alpha Centauri which
| might want to have Earth's internet on board but has no physical
| space for all that data. Let's say it takes 10 years to prepare
| the mission; in that case you'd also use those 10 years to
| compress, for argument's sake, Google's big index (or the NSA
| Utah data center if you like that one), and you'll have on
| board, on a single thumb stick, the Internet version from 10
| years before launch.
|
| Edit: Since some people have problems understanding this extreme
| compression, I am trying to clarify this further. This method is
| unique to that set of data. It can be applied only to that data,
| only one time, and that's it. It is not a general set of
| algorithms with which you can compress anything else; it will
| fail if not applied to that data. It is not classic compression;
| it is more, if you like the term, a data generator. It will run
| a program that at the end of its run will generate that Internet
| version you started with 10 years ago. But to do that you'll
| need to research how to do that data generation, hence the
| extreme time it takes to achieve. In practice I'd envision a lot
| of people each getting their slice of data and doing one
| function, and the program running those functions one after
| another, generating the data.
| anonymoushn wrote:
| This is incorrect. If you would like to encode every string
| that's 10 gigabytes long to a string that's at least 1 byte
| shorter, then you don't have enough strings that are at least 1
| byte shorter. There are fewer of them by a factor of about 256,
| and each one of them can decode to only one 10-gigabyte-long
| string. This is explained in the article.
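|
| (The counting is easy to check with a small size instead of 10
| GB; a throwaway Python snippet, using 4-byte "files":)
|
|   n = 4                                       # bytes per "file"
|   exact_n = 256 ** n                          # exactly n bytes
|   shorter = sum(256 ** k for k in range(n))   # 0..n-1 bytes
|   print(exact_n, shorter, exact_n / shorter)  # just over 255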
| beecafe wrote:
| No, you can't. Consider applying your compression algo twice,
| or N times. If it gets smaller each time, then eventually you
| could compress anything to a single bit.
| unnouinceput wrote:
| You can't apply this method twice. Please understand this
| particular compression type is unique to a single given fixed
| set of data. That's it! It's not a general compression method;
| it's custom made for that data only. And it takes time to
| research it to achieve that extreme compression.
| anonymoushn wrote:
| It doesn't work. If it did work, you could apply it
| separately to every string containing 100 bits and end up
| with 2^100 separate strings containing fewer than 100 bits,
| each of which decodes to a unique 100-bit string.
| unnouinceput wrote:
| You're still missing the point. Please read the edit I
| made to root comment.
| anonymoushn wrote:
| You're still wrong. A data generator is a program that
| runs on some computer. Programs are strings with lengths.
|
| If you'd like to use a purpose-built computer or decoder
| for each output, then you need to include the complexity
| of the computer or decoder in the compressed size.
| Otherwise you could write a decoder such as "when given
| any input, output the contents of wikipedia" and claim
| that it compresses a certain text to 0 bytes.
| beecafe wrote:
| If your compression algorithm worked for just one fixed set of
| data, then you could just store that fixed data in the
| decompressor and achieve infinite compression of the data to 0
| bits.
| gus_massa wrote:
| You mean something like http://prize.hutter1.net/ ?
|
| The problem is that once you fix the architecture (for example
| an x64-compatible processor) and you try to compress every
| possible 1 GB file, like in my link, for some of them you will
| get a decompressor+data size that is very small, especially if
| you have a lot of time. This is essentially the challenge. The
| articles in Wikipedia have a lot of internal structure, so if
| you are smart and/or have a lot of time you can get a very small
| size.
|
| Anyway, once you have fixed the architecture, it is possible
| that an evil person looks at the details of the architecture and
| makes a file that is so weird that the size of decompressor+data
| is bigger, no matter how smart you are or how much time you have
| to try the brute-force approach.
|
| The pigeonhole principle guarantees that the evil person can
| always succeed, even though the problematic file may not be very
| interesting and may be very difficult to find.
|
| (Actually, any file of cryptographically random numbers will
| probably work. A file of pseudorandom numbers is not enough,
| because with enough patience you may find the algorithm, seed
| and offset and use that as the compressed version. The digits of
| pi, skipping the first million, are also a bad idea for the same
| reason.)
| unnouinceput wrote:
| Yes, something like that, except in my case you'd need a lot
| more than 50 hours to generate back the data.
|
| Also, I agree with your points; this assumes every participant
| is honest when researching their slice, as per my edit.
| gus_massa wrote:
| With your method, you will be able to compress most sensible
| files but not _all_ files. The main point of this article is
| that with any method, you cannot compress _all_ files.
|
| In the case of the ship, they will be able to store a very
| compressed version of Wikipedia, but they will not be able to
| store a file of the same size that records the last bit of the
| central pixel of a CCD camera that is pointed at the sun,
| filtered with the von Neumann trick:
| https://en.wikipedia.org/wiki/Fair_coin#Fair_results_from_a_...
| aaaaaaaaaaab wrote:
| Your idea is information-theoretically unsound. You cannot
| create information out of thin air.
| madars wrote:
| Very nice! A tl;dr with spoilers -- for a typical compression
| algorithm C(x), applying it over and over again C(x), C(C(x)),
| C(C(C(x))), ... we would quickly arrive at an input for which the
| compressed length increases. Here the author proposes C(x) which
| either equals x (no compression), or encodes a normal compression
| (e.g. gzip) plus a MAC'd counter. Applying C(x) on the latter
| case again would simply increment the counter, so an adversary
| who only issues a bounded number of compression requests gets
| "stuck" on a chain that contains no contradiction. (The attacker
| could try to get to y = C(x) with a large counter value without
| compressor's help but that requires predicting the MAC which is
| infeasible.) The requirement for C and D to share a secret (here,
| a MAC key) can be removed by replacing MAC with a VDF proof that
| you spent O(counter) sequential time producing the compressed
| output, so a computationally bounded attacker can't overflow the
| counter (in practice, 128 bit counter requiring ~2^128 time) and
| disprove the paradox.
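|
| A rough sketch of the MAC-and-counter variant described above,
| in hypothetical Python (the key, framing and field sizes are
| invented here; the actual repo goes further and swaps the shared
| MAC key for a VDF):
|
|   import hmac, hashlib, zlib
|
|   KEY = b"secret shared by compressor and decompressor"
|   MAGIC, CTR, TAG = b"PC", 16, 32   # magic, counter and MAC sizes
|
|   def _mac(data):
|       return hmac.new(KEY, data, hashlib.sha256).digest()
|
|   def _valid(blob):
|       return (blob.startswith(MAGIC) and len(blob) >= 2 + CTR + TAG
|               and hmac.compare_digest(blob[-TAG:], _mac(blob[:-TAG])))
|
|   def _frame(ctr, payload):
|       body = MAGIC + ctr.to_bytes(CTR, "big") + payload
|       return body + _mac(body)
|
|   def compress(x):
|       if _valid(x):   # one of our own outputs: just bump the counter
|           ctr = int.from_bytes(x[2:2 + CTR], "big")
|           return _frame(ctr + 1, x[2 + CTR:-TAG])   # same length
|       y = _frame(0, zlib.compress(x, 9))
|       return y if len(y) < len(x) else x            # never expands
|
|   def decompress(y):
|       if _valid(y):
|           ctr = int.from_bytes(y[2:2 + CTR], "big")
|           if ctr > 0:                   # undo one counter bump
|               return _frame(ctr - 1, y[2 + CTR:-TAG])
|           return zlib.decompress(y[2 + CTR:-TAG])
|       return y                          # stored verbatim: return as-is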
| DemocracyFTW wrote:
| > Paradoxical compression is mathematically impossible. This
| repository contains a description of how to nonetheless achieve
| it, as well as a demo implementation.
|
| That's _AWESOME_ show me!
|
| > Of course I am slightly cheating here
|
| _mkay..._ :-(
___________________________________________________________________
(page generated 2021-10-03 23:01 UTC)