[HN Gopher] Sorted Integer Compression
       ___________________________________________________________________
        
       Sorted Integer Compression
        
       Author : GordonS
       Score  : 55 points
       Date   : 2021-06-18 10:40 UTC (12 hours ago)
        
 (HTM) web link (ayende.com)
 (TXT) w3m dump (ayende.com)
        
       | ot wrote:
       | > For my purpose, I want to get as high a compression rate as
       | possible, and I need to store just the list of integers
       | 
       | Then PFor(Delta) is not the best option, that class of algorithms
       | is intentionally a trade-off between compression rate and
       | decoding speed, as it's meant to store posting lists in inverted
       | indexes.
       | 
       | If you want higher compression, a better option would be Binary
       | Interpolative Coding [1], which has some of the highest
       | compression rate among the schemes that are relatively simple to
       | implement. Then you can go down the rabbit hole and start to use
       | entropy compression, complex modeling, etc... But there are
       | diminishing returns.
       | 
       | > The problem with PFor, however, is that if you have a single
       | large outlier, you'll waste a lot of bits unnecessarily.
       | 
        | This is not true: the author is describing binary packing of
        | deltas and mistaking it for PFor. What PForDelta means is:
       | 
       | - P -> Patched: Outliers are stored separately in a list of
       | "exceptions", so that does not affect the bit size of the rest of
       | the block
       | 
       | - For -> Frame of Reference: the min value of the block is
       | subtracted from the values in block, which are then encoded with
       | the minimum number of bits necessary for the adjusted maximum
       | (this is just binary packing)
       | 
       | - Delta -> Deltas are encoded instead of values, only used if
       | your sequence is monotonic. If it is _strictly_ monotonic, you
       | can encode v[i+1] - v[i] - 1 and save a little more.
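        | 
        | Roughly, those three pieces in code (a Python sketch of the
        | idea, not anyone's actual implementation; the 90th-percentile
        | cutoff for exceptions is just an illustrative choice):
        | 
        |   def encode_block(values):
        |       # Delta: strictly increasing values -> gaps minus 1
        |       gaps = [values[0]] + [y - x - 1 for x, y
        |                             in zip(values, values[1:])]
        |       # Frame of Reference: subtract the block minimum
        |       base = min(gaps)
        |       adj = [g - base for g in gaps]
        |       # choose a width that covers ~90% of the block; the
        |       # rest become "exceptions" patched in separately
        |       widths = sorted(v.bit_length() for v in adj)
        |       w = widths[(9 * (len(widths) - 1)) // 10]
        |       low = [v & ((1 << w) - 1) for v in adj]   # bit-packed
        |       exceptions = [(i, v >> w) for i, v in enumerate(adj)
        |                     if v.bit_length() > w]
        |       return base, w, low, exceptions
        | 
        | Decoding reverses each step: unpack the low bits, patch the
        | exceptions back over their slots, add base, then prefix-sum
        | the gaps.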
       | 
       | > This encoder will [...] Then we check the top and bottom halves
       | of the batch, to see if we'll get a benefit from recording them
       | separately.
       | 
       | What this is describing is an algorithm for partitioning lists in
       | order to optimize for the overall binary packing compression
       | rate; but it is greedy, and optimal algorithms exist based on
        | dynamic programming [2].
       | 
       | EDIT: Forgot to mention that everything I wrote above is also
       | explained in the paper cited in the post [3], which is indeed a
       | very good survey of this stuff.
       | 
       | [1] https://link.springer.com/article/10.1023/A:1013002601898
       | 
       | [2] https://dl.acm.org/doi/10.1145/1871437.1871592
       | 
       | [3] https://arxiv.org/pdf/1209.2137.pdf
        
         | recursive wrote:
         | In a sorted list, all the outliers are grouped at the beginning
         | or the end. It's hard to understand how storing the single big
         | delta is using any bits unnecessarily.
        
           | ot wrote:
           | The outliers the article is talking about here are in the
            | *deltas*. If your sorted list is
           | 
           | 0, 1, 2, 3, 1000, 1001, 1002, 1003, ...
           | 
           | your deltas (assuming strict monotonicity) are
           | 
           | 0, 0, 0, 0, 996, 0, 0, 0 ...
           | 
           | and that 996 breaks a run that you could otherwise encode in
           | 0 bits (plus block metadata).
        
             | danbruc wrote:
             | Only if you naively use the number of bits required for the
             | largest delta for all deltas. If your goal is good
             | compression, you will use a variable length encoding,
             | probably based on the frequency so that it is independent
             | of the magnitude of the deltas. Why should a hundred times
             | one compress any better than a hundred times a million?
             | Arithmetic coding would probably work equally well.
        
               | ot wrote:
               | "Naively"? Binary packing encoders are still competitive
               | with the state of the art :) Which notably hasn't
               | included variable-length encodings for quite a while.
               | 
               | Variable-length coding requires that each integer somehow
               | encodes its width. That can be very wasteful. Block-based
               | methods exploit the fact that consecutive deltas usually
               | require a similar number of bits, so it's better to waste
                | a bit on each integer and amortize the block bitwidth
                | than to encode each width independently.
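                | 
                | Toy comparison (Python, numbers made up): 128 deltas
                | that all need 9 bits cost 16 bits each as byte-aligned
                | varints, but 9 bits each plus one shared width byte
                | with binary packing.
                | 
                |   import random
                | 
                |   deltas = [random.randrange(256, 512) for _ in range(128)]
                | 
                |   # byte-aligned varint: 7 data bits per byte
                |   varint_bits = sum(8 * ((d.bit_length() + 6) // 7)
                |                     for d in deltas)
                |   # block packing: one shared width byte + payload
                |   block_bits = 8 + len(deltas) * max(d.bit_length()
                |                                      for d in deltas)
                |   print(varint_bits, block_bits)   # 2048 vs 1160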
        
               | danbruc wrote:
               | So 1000, 1001, 1000, 1002, 1000, 1003, 1000, 1004
               | requiring 80 bits is good if you could get it to 16 bits
               | plus a Huffman tree which would admittedly eat up the 64
               | bits difference in this tiny example? Storing the delta
               | is essentially predicting that x[i + 1] = x[i] and
               | storing the difference to that prediction. One could do
               | better by having a better prediction, for example x[i +
                | 1] = x[i] + median_gap_size, which would bring the
                | prediction error down to 0, 1, 0, 2, 0, 3, 0, 4 in my
                | example; that could be encoded in 24 bits of
                | fixed-length integers plus the median gap size. This
                | could probably be further improved by adaptively
                | predicting the gap size. Making the prediction errors
                | small and independent of the gap size makes it a much
                | better choice to use short fixed-length integers in
                | the common case and switch to longer integers for
                | outliers, reserving one short code as an escape code.
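                | 
                | In code, the prediction step is just (Python sketch of
                | the same toy example):
                | 
                |   import statistics
                | 
                |   gaps = [1000, 1001, 1000, 1002, 1000, 1003, 1000, 1004]
                |   med = statistics.median_low(gaps)       # 1000
                |   errors = [g - med for g in gaps]        # 0,1,0,2,0,3,0,4
                |   plain = len(gaps) * max(g.bit_length()
                |                           for g in gaps)      # 80 bits
                |   fixed = len(errors) * max(e.bit_length()
                |                             for e in errors)  # 24 bits
                |   print(plain, fixed, med)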
               | 
               | In the end the result is probably heavily dependent on
               | the exact structure of the integer lists you want to
               | compress but I do not see how fixed length deltas can be
               | competitive on arbitrary inputs.
        
               | ot wrote:
               | I would suggest you actually read the whole thread before
               | commenting, as it started by explaining how outliers *do
               | not* make the rest of the list encode with 4 bits per
               | integer.
               | 
               | > I do not see how fixed length deltas can be competitive
               | on arbitrary inputs.
               | 
               | I also already explained this, most lists in practical
               | applications exhibit local properties that are exploited
               | by blocking.
               | 
               | Using Huffman/arithmetic coding generally doesn't work
               | well because the cost of storing the probability tables
               | does not break even with the savings. There was a recent
               | line of work [1] that attempted this with ANS, a faster
               | cousin of arithmetic coding, and the results were
               | honestly underwhelming considering the complexity
               | involved.
               | 
               | [1] https://dl.acm.org/doi/abs/10.1145/3159652.3159663
               | 
               | > In the end the result is probably heavily dependent on
               | the exact structure of the integer lists
               | 
               | Yes, I did not claim that this works with any list, for
               | any encoding scheme you can construct adversarial cases
               | that do not compress well, that's just information
               | theory. But that applies also to the schemes you're
               | floating.
        
               | danbruc wrote:
               | You are misunderstanding me, I am not primarily talking
               | about outliers. I am saying if the deltas are big in
               | magnitude but concentrated on only a few different
               | values, then encoding the deltas directly is wasteful.
               | Whether you then use entropy coding, which also deals
               | with outliers for free, or whether you use something more
               | sophisticated than deltas, for example double deltas or
               | subtracting the minimum delta, was not really my point.
               | Maybe it was a bit confusing that I first brought this up
               | in the context of outliers.
        
       | touisteur wrote:
       | I've been having loads of fun and success with roaring bitmaps
       | recently, to create this type of index for large data.
        
       | glebm wrote:
       | Elias-Fano would probably be a better option here
        
         | ot wrote:
          | EF doesn't compress that well if your data exhibits
          | regularities: it uses a number of bits that depends only on
          | the sequence's upper bound and length.
         | 
         | There are ways to make it compress better though, like
         | splitting the sequence into chunks, optimizing for the overall
         | compressed size.
        
         | js8 wrote:
         | Yeah, and somebody has even done that:
         | https://www.antoniomallia.it/sorted-integers-compression-wit...
        
       | 7373737373 wrote:
       | There is the Hutter Prize for English text compression
       | (http://prize.hutter1.net/), I wish there was an equivalent for
       | the https://oeis.org database:
       | https://github.com/void4/notes/issues/72
        
       | pklausler wrote:
       | The Burrows-Wheeler transform
       | (https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transf...)
       | might be handy for this problem case; encode the deltas with
       | something like UTF-8 first.
        
       | PaulHoule wrote:
       | This book has a nice treatment of that kind of compression:
       | 
       | https://www.amazon.com/Managing-Gigabytes-Compressing-Multim...
       | 
        | For instance you might be keeping track of facts like
        | 
        |   the word "the" is contained in document 1
        |   the word "john" is contained in document 1
        |   the word "the" is contained in document 2
        |   ...
        |   the word "john" is contained in document 12
       | 
       | and you code the gaps; the word "the" appears in every document
       | and the gap is always 1, but the gap for "john" is 11. With a
       | variable-sized encoding you use fewer bits for smaller gaps --
       | with that kind of encoding you don't have to make "the" be a
       | stopword because you can afford to encode all the postings.
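        | 
        | In other words (Python sketch, toy postings lists):
        | 
        |   postings = {"the":  list(range(1, 13)),   # every document
        |               "john": [1, 12]}              # rare word
        | 
        |   def d_gaps(doc_ids):
        |       # first id, then the distance to each following id
        |       return [doc_ids[0]] + [b - a for a, b
        |                              in zip(doc_ids, doc_ids[1:])]
        | 
        |   print(d_gaps(postings["the"]))    # [1, 1, 1, ..., 1]
        |   print(d_gaps(postings["john"]))   # [1, 11]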
        
         | tgv wrote:
          | The Standard MIDI file spec had something similar back in the 80s.
         | They code gaps between successive events with increasingly
         | longer byte sequences. Never mind that they waste a whole bit
         | to do that.
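          | 
          | That is the classic variable-length quantity: 7 payload bits
          | per byte, high bit set on every byte except the last (Python
          | sketch):
          | 
          |   def vlq(n):
          |       # big-endian 7-bit groups; continuation flag in the
          |       # top bit of every byte but the final one
          |       out = [n & 0x7F]
          |       n >>= 7
          |       while n:
          |           out.append((n & 0x7F) | 0x80)
          |           n >>= 7
          |       return bytes(reversed(out))
          | 
          |   assert vlq(0) == b"\x00"
          |   assert vlq(128) == b"\x81\x00"   # where the extra byte kicks in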
        
         | ot wrote:
          | That book is really dated (1994), but unfortunately there
          | aren't many better options.
         | 
         | It is unfortunate because the understanding of the field has
         | improved dramatically over the past decades, and explanations
         | from a more modern perspective would benefit even novices.
        
         | atombender wrote:
         | That book is great, and relies on a combination of Golomb-Rice,
         | delta coding, and arithmetic coding. I wonder how this compares
         | with the OP's compression.
        
           | PaulHoule wrote:
           | ... it doesn't just "rely on" that stuff, but it explains the
           | fundamentals of them really well!
        
       | NohatCoder wrote:
       | For anyone who wants really good compression of data with a
       | particular characteristic, a very useful approach is to first
       | shrink or reformat the data based on that characteristic, then
       | pass the result through a general purpose compressor.
       | 
       | For this case that might mean convert the data into deltas, then
       | write those deltas in a byte-oriented variable length integer
       | fashion. The general purpose compressor can then pick up most of
       | what you have left on the table by not being super careful with
       | the balancing of the chosen integer format, and it might find
       | some other useful patterns in the data without you having to
       | worry about those.
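        | 
        | A sketch of that pipeline in Python (with zlib standing in for
        | whatever general-purpose compressor you prefer):
        | 
        |   import random, zlib
        | 
        |   values = sorted(random.sample(range(1 << 32), 100_000))
        | 
        |   def varint(n):               # LEB128-style, 7 bits per byte
        |       out = bytearray()
        |       while True:
        |           more = n > 0x7F
        |           out.append((n & 0x7F) | (0x80 if more else 0))
        |           n >>= 7
        |           if not more:
        |               return bytes(out)
        | 
        |   deltas = b"".join(varint(b - a) for a, b
        |                     in zip([0] + values, values))
        |   raw = b"".join(v.to_bytes(4, "little") for v in values)
        |   print(len(zlib.compress(raw, 9)),
        |         len(zlib.compress(deltas, 9)))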
       | 
       | Theoretically a single compressor crafted purely for the exact
       | characteristics of the data at hand would be able to do a better
       | job. But in practice the skill and time required to beat the
       | hybrid approach tends to be a showstopper.
        
         | jhgb wrote:
         | > For this case that might mean convert the data into deltas,
         | then write those deltas in a byte-oriented variable length
         | integer fashion.
         | 
         | Calculating a trend line and storing deltas off of that trend
         | line might be mildly better for some series. Having an AR model
         | and storing errors of the model might work even better,
         | although the obvious (and common) problem is which AR model of
         | the many possible ones you go for.
        
           | NohatCoder wrote:
           | If you know the specifics of your data, maybe. But try to err
            | on the side of simpler models: every detail you add slows
           | your code down, and the difference after compression is often
           | small.
        
           | ballenf wrote:
           | Or simply a delta off the last delta (or mean of last n
           | deltas, I guess), so you can do the compression in a stream.
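            | 
            | A streaming delta-of-delta pass is tiny (Python sketch):
            | 
            |   def delta_of_delta(stream):
            |       prev, prev_gap = 0, 0
            |       for x in stream:
            |           gap = x - prev
            |           yield gap - prev_gap   # residual vs "same gap again"
            |           prev, prev_gap = x, gap
            | 
            |   print(list(delta_of_delta([10, 20, 30, 41, 51, 61])))
            |   # -> [10, 0, 0, 1, -1, 0]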
        
             | jhgb wrote:
             | That might also be equivalent to one of the ARMA models,
             | although I'd have to do the math to prove that (and to
             | derive which one corresponds to what you're describing).
             | 
             | Edit: It seems that if you detrend the data, the "delta off
             | the last delta", if I understand it correctly, is an MA(1)
             | model. I think. Maybe. My math is wobbly.
        
             | wallacoloo wrote:
             | If I understand you & GP, these are both just particular
             | instantiations of Linear Predictive Coding [1]. For both of
             | these approaches, you'd probably get similar or better
             | results just throwing your data into an off-the-shelf LPC,
             | and with less effort.
             | 
             | [1]
             | https://en.m.wikipedia.org/wiki/Linear_predictive_coding
        
       | raphlinus wrote:
       | It reminds me of the classic interview problem, attributed to
       | Google: sort 1 million 32 bit integers (provided as a stream)
        | using 2 megabytes of memory. Most people's first instinct is
        | that it can't be done, since you need almost 4 megabytes just
        | to store the input; but of course it can be: you keep the
        | sorted list in compressed form.
       | 
       | I haven't found a really good writeup, most of the hits on Google
       | (SO, earlier HN stories) fail to get the answer.
        
         | sltkr wrote:
         | I think there are two difficulties here.
         | 
          | 1. How to even represent 1,000,000 sorted integers in only
         | 2 megabytes, which means you can use (slightly more than) 2
         | bytes per integer. The central observation is that the sum of
         | the deltas is no more than 4294967296. 4294967296 / 1000000 =
         | 4295 which is about 13 bits, so it seems like it should be
         | possible, but it's not easy.
         | 
          | For example, if you use a standard variable-length encoding
          | where the highest bit indicates continuation, you would have
          | 7 data bits per byte; but the input could contain 743,665
          | deltas of 128 and 256,335 deltas of 16384, which would
          | require 743,665*2 + 256,335*3 = 2,256,335 bytes, slightly
          | over 2 megabytes.
          | 
          | If instead you use the first two bits of the initial byte to
          | encode the length (so that K bytes encode 8*K - 2 bits), you
          | could have 740,749 deltas of 64 and 259,251 deltas of 16384,
          | for a total of 740,749*2 + 259,251*3 = 2,259,251 bytes,
          | slightly worse even.
         | 
         | With 1 continuation bit per nibble the math similarly doesn't
         | work out. So this is starting to look, if not impossible, at
         | least very hard.
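          | 
          | (Quick check of that arithmetic in Python:)
          | 
          |   def varint_bytes(n):          # 7 data bits per byte
          |       return max(1, -(-n.bit_length() // 7))
          | 
          |   total = (743_665 * varint_bytes(128)
          |            + 256_335 * varint_bytes(16_384))
          |   print(total)                  # 2256335, just over 2 MB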
         | 
          | 2. Imagine that you could represent 1,000,000 sorted
          | integers in 2 megabytes somehow. Then the next problem is to
          | actually create such a sorted sequence from the input.
          | 
          |   - Insertion sort works, but O(N^2) time complexity is not
          |     great with a million elements.
          |   - Quick sort doesn't work, since you'd start with an
          |     unsorted array which you cannot represent in memory.
          |   - Heap sort doesn't work because it requires random access
          |     to the array, which doesn't work with a variable-length
          |     encoding.
          |   - Merge sort works in the beginning, but you need
          |     temporary space equal to the size of your input, so
          |     towards the end you're in trouble.
         | 
         | I think you could make merge sort work if the requirement is
         | "output the elements in sorted order" rather than actually
         | constructing a sorted representation of the array. In that
         | case, you could create sorted sequences of the first 500,000,
         | 250,000, 125,000 etc. elements, and do a 20-way merge in the
         | end, which is O(N log N) overall.
         | 
         | This is still somewhat tricky because the average delta of e.g.
         | 500,000 elements can be twice as large as for the overall
         | array, so you might need slightly more space to encode it, so
         | we would need a really efficient compression scheme to make
         | this work.
         | 
         | All in all I'm gravitating towards: this problem isn't
         | generally solvable under the given constraints. With 3
         | megabytes I think the scheme outlined above works.
        
           | noctune wrote:
           | The entropy of 1e6 sorted 32 bit integers is
            | log2((2^32 + 1e6 - 1) choose 1e6) bits, or about 1.61 MiB, so it
           | should be possible. Really difficult, obviously, but
           | possible.
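            | 
            | (To check that number without materializing the huge
            | binomial, go through log-gamma:)
            | 
            |   from math import lgamma, log
            | 
            |   def log2_choose(n, k):
            |       return (lgamma(n + 1) - lgamma(k + 1)
            |               - lgamma(n - k + 1)) / log(2)
            | 
            |   bits = log2_choose(2**32 + 10**6 - 1, 10**6)
            |   print(bits / 8 / 2**20)     # ~1.61 MiB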
        
           | raphlinus wrote:
           | It's doable. First, you use a Golomb coding for the deltas.
           | Tuning the parameters just right, you can get it to fit
           | inside your budget with a little room to spare.
           | 
           | Then, to get reasonably good time complexity, you use that
           | extra space as a scratch buffer - read in a chunk of input,
           | sort in place (heapsort is good for worst-case time
           | complexity), and merge with the compressed stream. It's not
           | simple to implement, but it does work.
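            | 
            | A Rice coder (the power-of-two flavor of Golomb) for the
            | gaps is only a few lines; this is just a sketch with k
            | picked by hand, not the "tuned just right" version:
            | 
            |   def rice(gaps, k):
            |       # quotient in unary, then k remainder bits
            |       return "".join("1" * (g >> k) + "0"
            |                      + format(g & ((1 << k) - 1), f"0{k}b")
            |                      for g in gaps)
            | 
            |   # the mean gap for 1M values in [0, 2^32) is ~4295, so
            |   # k around 12 is sensible; cost is roughly
            |   # k + 1 + mean_gap / 2**k bits per value, i.e. ~14 bits,
            |   # about 1.75 MB for a million values
            |   print(len(rice([4000, 5000, 100, 9000], 12)))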
        
             | GistNoesis wrote:
             | Can you explain how you merge a compressed sorted stream
             | with the sorted chunk, to obtain a bigger compressed stream
             | (without extra memory, or quadratic complexity) ?
             | 
             | I can see a solution with memory usage : size of(
             | compressed sorted stream ) + size of( compressed sorted
             | chunk ) + size of(compressed stream merge output)
             | 
             | This would work but it means you need to fit your
             | compressed sorted stream in less than a single MB instead
             | of 2MB (and I'm ignoring memory fragmentation problems)
        
               | rjmunro wrote:
               | One method is to move the compressed output from the top
               | to the bottom of the memory with each cycle:
               | 
               | A = Existing compressed sorted chunk; B = Compressed
               | merge output; C = New sorted chunk
               | 
               | Start with:
               | 
               | AAAAAAAAAAAAAAAAAAAAAAAAAAA-----CCCCCC
               | 
               | Merge into B, start at top of A and working your way
               | down:
               | 
               | AAAAAAAAAAAAAAAAAAAAAA-----BBBBBCCCC--
               | 
               | Finish this iteration like:
               | 
               | --------BBBBBBBBBBBBBBBBBBBBBBBB------
               | 
               | Then refill C and sort it:
               | 
               | --------BBBBBBBBBBBBBBBBBBBBBBBBCCCCCC
               | 
               | Merge B back into A, but forwards this time:
               | 
               | AAAAAAAAAA------BBBBBBBBBBBBBBBB---CCC
               | 
               | AAAAAAAAAAAAAAAAAAAAAAAAAAA-----------
        
               | GistNoesis wrote:
               | Thanks, I got it. The main trick is the standard
               | "interview" trick of writing from right to left.
               | 
                | I'm not sure your solution works without memory
                | moving, though, because in your second step, if the
                | values of A are larger than the values of C, the
                | growing B will hit A, which is not shrinking.
                | 
                | To convince myself, I wrote out the following
                | sequence.
               | 
               | 00000224446666666666888888_11133333355555
               | 
               | BytesOfTheCompressedStream_NewSortedChunk________________
               | 
               | BytesOfTheCompressedStream_NewSortedChunk________weNsetyB
               | 
               | OfTheCompressedStream_SortedChunk________________weNsetyB
               | 
               | OfTheCompressedStream_SortedChunk________detroSfOweNsetyB
               | 
               | TheCompressedStream_Chunk________________detroSfOweNsetyB
               | 
               | TheCompressedStream_Chunk_______detroSehTdetroSfOweNsetyB
               | 
               | CompressedStream_Chunk__________detroSehTdetroSfOweNsetyB
               | 
               | Here notice that we hit the existing streams so we need
               | to memmove
               | 
               | CompressedStream_Chunk_esserpmoCdetroSehTdetroSfOweNsetyB
               | 
               | dStream_Chunk__________esserpmoCdetroSehTdetroSfOweNsetyB
               | 
               | dStream_Chunk____knuhCdesserpmoCdetroSehTdetroSfOweNsetyB
               | 
               | Stream_____maertSknuhCdesserpmoCdetroSehTdetroSfOweNsetyB
               | 
               | ___________maertSknuhCdesserpmoCdetroSehTdetroSfOweNsetyB
               | 
               | Then reverse it in place and memmove
        
               | raphlinus wrote:
               | There are a bunch of ways to do this. One of the more
               | straightforward is to use a temporary buffer big enough
               | to hold both the new values and the existing stream, and
               | start by shifting the existing stream to be right aligned
               | (using memmove or the equivalent). Then read bits from
               | the old stream while writing bits into the new stream.
               | Using Elias-Fano coding (which I believe is an instance
               | of Golomb coding), the write position will never overtake
               | the read position, so the merge can be done in-place.
        
               | GistNoesis wrote:
               | Thanks. I think I understood it now :
               | https://news.ycombinator.com/item?id=27553076
               | 
                | In my solution, though, when the extra temporary
                | buffer is very small, in the worst case we need to do
                | sizeof(compressedStream)/sizeof(temporaryBuffer)
                | memmoves, which can potentially make the algorithm
                | quadratic.
        
           | ot wrote:
            | Encoding-wise, Elias-Fano requires 2 + ceil(log(2^32 / n))
            | bits per integer (15 bits for n = 1e6), so it would
            | provably fit in the memory budget.
           | 
           | Algorithm-wise, yeah I can only think of quadratic ones. But
           | if in the question the bottleneck is memory, maybe that's
           | fine? An optimization would be to do selection sort, but in
           | each round select the next group of integers that share the
           | top 32 - ceil(log(2^32 / n)) bits, as that is a cluster in
           | the EF representation, and then sort the lower bits, which
           | can be done in-place in O(k log k) where k is the group size.
           | So if your groups are large enough you get some speedup.
           | 
           | EDIT:
           | 
           | Ah I think that the trick is to always use half of the memory
           | left to read and sort.
           | 
           | - Read 256k integers (1MB), sort in-place, construct EF
           | representation (~500kB)
           | 
           | - Read 128k, sort, and produce another EF (~280kB)
           | 
           | - Keep halving
           | 
           | You do a logarithmic number of rounds, so it's O(n log n). At
           | the end you're left with log(n) EF sets, which fit in less
           | than 2MB (in fact there's some space left, and you could save
           | a few rounds) and you can merge them with a log(n)-way heap
           | into a stream.
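            | 
            | For completeness, the EF representation itself is easy to
            | write down (Python sketch; the exact low-bit width varies
            | a little between presentations):
            | 
            |   def elias_fano(values, universe):
            |       n = len(values)
            |       l = max(0, (universe // n).bit_length() - 1)
            |       low = [v & ((1 << l) - 1) for v in values]
            |       # high parts become positions of the set bits in a
            |       # bitmap of n + (universe >> l) bits
            |       high = [(v >> l) + i for i, v in enumerate(values)]
            |       size_bits = n * l + n + (universe >> l)
            |       return low, high, size_bits
            | 
            |   vals = list(range(0, 2**32, 4294))[:10**6]
            |   print(elias_fano(vals, 2**32)[2] / 8 / 2**20)  # ~1.67 MiB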
        
         | tyingq wrote:
         | Guido van Rossum's answer, of course with Python:
         | 
         | http://neopythonic.blogspot.com/2008/10/sorting-million-32-b...
         | 
         | It sounds like "break up into smaller sorted sets, then do a
         | merge sort". Does the Google interview question allow for temp
         | file storage?
        
           | raphlinus wrote:
           | There are two versions of the question. One allows external
           | storage (Guido's answer and the top hit on SO), the other
           | does not.
        
         | tastyface wrote:
         | Er... is it just me, or is this a terrible interview question?
         | Assuming no external storage, I can imagine spending a few
         | hours working out this problem for fun, but to do it in 45
         | minutes on a whiteboard seems exceptionally difficult if you're
         | not familiar with it already. (Or is this one of those "let's
         | talk it through, you don't need to arrive at a perfect answer"
         | deals?)
         | 
         | I could be wrong. Maybe I should just sit down and try it, but
         | timed problems stress me out in general, even if nothing's at
         | stake.
        
         | ChrisLomont wrote:
         | When that problem came out I solved it quickly (it was 1
          | million 8-digit numbers sorted in 1MB of RAM, which I think
         | is the original problem that gained notoriety) with basically
         | arithmetic compression of the delta stream, since the number of
         | possible states sits decently inside the required RAM.
         | 
          | I also saw I could sort it using zero bytes of storage,
          | using a lot of code :)
         | 
         | Think nested switches that eventually print a lot of numbers.
        
       ___________________________________________________________________
       (page generated 2021-06-18 23:02 UTC)