[HN Gopher] Sorted Integer Compression
___________________________________________________________________
Sorted Integer Compression
Author : GordonS
Score : 55 points
Date : 2021-06-18 10:40 UTC (12 hours ago)
(HTM) web link (ayende.com)
(TXT) w3m dump (ayende.com)
| ot wrote:
| > For my purpose, I want to get as high a compression rate as
| possible, and I need to store just the list of integers
|
| Then PFor(Delta) is not the best option, that class of algorithms
| is intentionally a trade-off between compression rate and
| decoding speed, as it's meant to store posting lists in inverted
| indexes.
|
| If you want higher compression, a better option would be Binary
| Interpolative Coding [1], which has some of the highest
| compression rate among the schemes that are relatively simple to
| implement. Then you can go down the rabbit hole and start to use
| entropy compression, complex modeling, etc... But there are
| diminishing returns.
|
| > The problem with PFor, however, is that if you have a single
| large outlier, you'll waste a lot of bits unnecessarily.
|
| This is not true: the author is describing binary packing of
| deltas and mistaking it for PFor. What PForDelta means is:
|
| - P -> Patched: Outliers are stored separately in a list of
| "exceptions", so that does not affect the bit size of the rest of
| the block
|
| - For -> Frame of Reference: the min value of the block is
| subtracted from the values in the block, which are then encoded
| with the minimum number of bits necessary for the adjusted
| maximum (this is just binary packing)
|
| - Delta -> Deltas are encoded instead of values, only used if
| your sequence is monotonic. If it is _strictly_ monotonic, you
| can encode v[i+1] - v[i] - 1 and save a little more.
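|
| A minimal Python sketch of the For / binary packing step (the
| function name and bit-string representation are mine, not the
| post's or the papers'); real PFor would additionally move
| outliers to an exceptions list instead of letting them widen
| every slot:
|
|     def pack_block(block):
|         # Frame of Reference: subtract the block minimum
|         base = min(block)
|         adjusted = [v - base for v in block]
|         # minimum bit width that fits the adjusted maximum
|         width = max(adjusted).bit_length() or 1
|         fmt = "0{}b".format(width)
|         bits = "".join(format(v, fmt) for v in adjusted)
|         return base, width, bits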
|
| > This encoder will [...] Then we check the top and bottom halves
| of the batch, to see if we'll get a benefit from recording them
| separately.
|
| What this is describing is an algorithm for partitioning lists in
| order to optimize for the overall binary packing compression
| rate; but it is greedy, and optimal algorithms exist based on
| dynamic programming [2]
|
| EDIT: Forgot to mention that everything I wrote above is also
| explained in the paper cited in the post [3], which is indeed a
| very good survey of this stuff.
|
| [1] https://link.springer.com/article/10.1023/A:1013002601898
|
| [2] https://dl.acm.org/doi/10.1145/1871437.1871592
|
| [3] https://arxiv.org/pdf/1209.2137.pdf
| recursive wrote:
| In a sorted list, all the outliers are grouped at the beginning
| or the end. It's hard to understand how storing the single big
| delta is using any bits unnecessarily.
| ot wrote:
| The outliers the article is talking about here are in the
| *deltas*. If your sorted list is
|
| 0, 1, 2, 3, 1000, 1001, 1002, 1003, ...
|
| your deltas (assuming strict monotonicity) are
|
| 0, 0, 0, 0, 996, 0, 0, 0 ...
|
| and that 996 breaks a run that you could otherwise encode in
| 0 bits (plus block metadata).
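|
| In Python terms (an illustrative snippet, not from the article),
| the strictly monotonic delta transform is just:
|
|     vals = [0, 1, 2, 3, 1000, 1001, 1002, 1003]
|     deltas = [vals[0]]
|     deltas += [b - a - 1 for a, b in zip(vals, vals[1:])]
|     # -> [0, 0, 0, 0, 996, 0, 0, 0]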
| danbruc wrote:
| Only if you naively use the number of bits required for the
| largest delta for all deltas. If your goal is good
| compression, you will use a variable length encoding,
| probably based on the frequency so that it is independent
| of the magnitude of the deltas. Why should a hundred times
| one compress any better than a hundred times a million?
| Arithmetic coding would probably work equally well.
| ot wrote:
| "Naively"? Binary packing encoders are still competitive
| with the state of the art :) Which notably hasn't
| included variable-length encodings for quite a while.
|
| Variable-length coding requires that each integer somehow
| encodes its width. That can be very wasteful. Block-based
| methods exploit the fact that consecutive deltas usually
| require a similar number of bits, so it's better to waste
| a bit on each integer, but amortize the block bitwidth,
| than having to encode each width independently.
| danbruc wrote:
| So 1000, 1001, 1000, 1002, 1000, 1003, 1000, 1004
| requiring 80 bits is good if you could get it to 16 bits
| plus a Huffman tree which would admittedly eat up the 64
| bits difference in this tiny example? Storing the delta
| is essentially predicting that x[i + 1] = x[i] and
| storing the difference to that prediction. One could do
| better by having a better prediction, for example x[i +
| 1] = x[i] + median_gap_size which would bring the
| prediction error down to 0, 1, 0, 2, 0, 3, 0, 4 in my
| example, which could be encoded in 24 bits using fixed length
| integers plus the median gap size. This could probably be
| further improved by adaptively predicting the gap size.
| Making the prediction errors small and independent of the
| gap size makes the use of short fixed length integers in
| the common case and switch to longer integers for
| outliers by reserving one short code as an escape code a
| much better choice.
|
| In the end the result is probably heavily dependent on
| the exact structure of the integer lists you want to
| compress but I do not see how fixed length deltas can be
| competitive on arbitrary inputs.
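|
| A rough Python sketch of that idea, taking the reference gap to
| be 1000 (the most common delta in this example; variable names
| are illustrative):
|
|     deltas = [1000, 1001, 1000, 1002, 1000, 1003, 1000, 1004]
|     ref = 1000                           # typical gap size
|     errors = [d - ref for d in deltas]   # [0, 1, 0, 2, 0, 3, 0, 4]
|     # 3-bit fixed-length codes cover these errors: 24 bits + ref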
| ot wrote:
| I would suggest you actually read the whole thread before
| commenting, as it started by explaining how outliers *do
| not* make the rest of the list encode with 4 bits per
| integer.
|
| > I do not see how fixed length deltas can be competitive
| on arbitrary inputs.
|
| I also already explained this, most lists in practical
| applications exhibit local properties that are exploited
| by blocking.
|
| Using Huffman/arithmetic coding generally doesn't work
| well because the cost of storing the probability tables
| does not break even with the savings. There was a recent
| line of work [1] that attempted this with ANS, a faster
| cousin of arithmetic coding, and the results were
| honestly underwhelming considering the complexity
| involved.
|
| [1] https://dl.acm.org/doi/abs/10.1145/3159652.3159663
|
| > In the end the result is probably heavily dependent on
| the exact structure of the integer lists
|
| Yes, I did not claim that this works with any list, for
| any encoding scheme you can construct adversarial cases
| that do not compress well, that's just information
| theory. But that applies also to the schemes you're
| floating.
| danbruc wrote:
| You are misunderstanding me, I am not primarily talking
| about outliers. I am saying if the deltas are big in
| magnitude but concentrated on only a few different
| values, then encoding the deltas directly is wasteful.
| Whether you then use entropy coding, which also deals
| with outliers for free, or whether you use something more
| sophisticated than deltas, for example double deltas or
| subtracting the minimum delta, was not really my point.
| Maybe it was a bit confusing that I first brought this up
| in the context of outliers.
| touisteur wrote:
| I've been having loads of fun and success with roaring bitmaps
| recently, to create this type of index for large data.
| glebm wrote:
| Elias-Fano would probably be a better option here
| ot wrote:
| EF doesn't compress that well if your data exhibits
| regularities: it uses exactly a number of bits that depends only
| on the sequence upper bound and length.
|
| There are ways to make it compress better though, like
| splitting the sequence into chunks, optimizing for the overall
| compressed size.
| js8 wrote:
| Yeah, and somebody has even done that:
| https://www.antoniomallia.it/sorted-integers-compression-wit...
| 7373737373 wrote:
| There is the Hutter Prize for English text compression
| (http://prize.hutter1.net/), I wish there was an equivalent for
| the https://oeis.org database:
| https://github.com/void4/notes/issues/72
| pklausler wrote:
| The Burrows-Wheeler transform
| (https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transf...)
| might be handy for this problem case; encode the deltas with
| something like UTF-8 first.
| PaulHoule wrote:
| This book has a nice treatment of that kind of compression:
|
| https://www.amazon.com/Managing-Gigabytes-Compressing-Multim...
|
| For instance you might keep track of facts like
|
|     the word "the" is contained in document 1
|     the word "john" is contained in document 1
|     the word "the" is contained in document 2
|     ...
|     the word "john" is contained in document 12
|
| and you code the gaps; the word "the" appears in every document
| and the gap is always 1, but the gap for "john" is 11. With a
| variable-sized encoding you use fewer bits for smaller gaps --
| with that kind of encoding you don't have to make "the" be a
| stopword because you can afford to encode all the postings.
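|
| A toy version of that gap transform (illustrative Python, not
| the book's code):
|
|     postings = {"the": [1, 2, 3, 4, 5], "john": [1, 12]}
|     gaps = {w: [d[0]] + [b - a for a, b in zip(d, d[1:])]
|             for w, d in postings.items()}
|     # gaps["the"]  -> [1, 1, 1, 1, 1]   (always 1, cheap to code)
|     # gaps["john"] -> [1, 11]           (one gap of 11)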
| tgv wrote:
| The Standard MIDI file spec had something similar back in the 80s.
| They code gaps between successive events with increasingly
| longer byte sequences. Never mind that they waste a whole bit
| to do that.
| ot wrote:
| That book is really dated (1994), but unfortunately there
| aren't many better options.
|
| It is unfortunate because the understanding of the field has
| improved dramatically over the past decades, and explanations
| from a more modern perspective would benefit even novices.
| atombender wrote:
| That book is great, and relies on a combination of Golomb-Rice,
| delta coding, and arithmetic coding. I wonder how this compares
| with the OP's compression.
| PaulHoule wrote:
| ... it doesn't just "rely on" that stuff, but it explains the
| fundamentals of them really well!
| NohatCoder wrote:
| For anyone who wants really good compression of data with a
| particular characteristic, a very useful approach is to first
| shrink or reformat the data based on that characteristic, then
| pass the result through a general purpose compressor.
|
| For this case that might mean convert the data into deltas, then
| write those deltas in a byte-oriented variable length integer
| fashion. The general purpose compressor can then pick up most of
| what you have left on the table by not being super careful with
| the balancing of the chosen integer format, and it might find
| some other useful patterns in the data without you having to
| worry about those.
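|
| Something along these lines, as a minimal sketch (zlib stands in
| for "any general purpose compressor"; the helper names are mine):
|
|     import zlib
|
|     def varint(n):   # 7 data bits per byte, high bit = "more"
|         out = bytearray()
|         while True:
|             out.append((n & 0x7F) | (0x80 if n > 0x7F else 0))
|             n >>= 7
|             if n == 0:
|                 return bytes(out)
|
|     def compress_sorted(values):
|         deltas = [values[0]]
|         deltas += [b - a for a, b in zip(values, values[1:])]
|         raw = b"".join(varint(d) for d in deltas)
|         return zlib.compress(raw)   # mops up leftover patterns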
|
| Theoretically a single compressor crafted purely for the exact
| characteristics of the data at hand would be able to do a better
| job. But in practice the skill and time required to beat the
| hybrid approach tends to be a showstopper.
| jhgb wrote:
| > For this case that might mean convert the data into deltas,
| then write those deltas in a byte-oriented variable length
| integer fashion.
|
| Calculating a trend line and storing deltas off of that trend
| line might be mildly better for some series. Having an AR model
| and storing errors of the model might work even better,
| although the obvious (and common) problem is which AR model of
| the many possible ones you go for.
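|
| For the trend-line variant, a quick sketch (least squares via
| numpy; none of this is from the thread):
|
|     import numpy as np
|
|     vals = np.array([3, 10, 16, 24, 31, 37, 45])
|     idx = np.arange(len(vals))
|     slope, intercept = np.polyfit(idx, vals, 1)   # fit a line
|     pred = np.round(slope * idx + intercept).astype(int)
|     residuals = vals - pred   # small errors around the trend
|     # store slope, intercept and the residuals instead of vals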
| NohatCoder wrote:
| If you know the specifics of your data, maybe. But try to err
| on the side of simpler models, every detail you add slows
| your code down, and the difference after compression is often
| small.
| ballenf wrote:
| Or simply a delta off the last delta (or mean of last n
| deltas, I guess), so you can do the compression in a stream.
| jhgb wrote:
| That might also be equivalent to one of the ARMA models,
| although I'd have to do the math to prove that (and to
| derive which one corresponds to what you're describing).
|
| Edit: It seems that if you detrend the data, the "delta off
| the last delta", if I understand it correctly, is an MA(1)
| model. I think. Maybe. My math is wobbly.
| wallacoloo wrote:
| If I understand you & GP, these are both just particular
| instantiations of Linear Predictive Coding [1]. For both of
| these approaches, you'd probably get similar or better
| results just throwing your data into an off-the-shelf LPC,
| and with less effort.
|
| [1]
| https://en.m.wikipedia.org/wiki/Linear_predictive_coding
| raphlinus wrote:
| It reminds me of the classic interview problem, attributed to
| Google: sort 1 million 32 bit integers (provided as a stream)
| using 2 megabytes of memory. Most people's first instinct is that
| it can't be done, since you need almost 4 megabytes to store the
| input, but of course it can be: you keep the sorted list in
| compressed form.
|
| I haven't found a really good writeup, most of the hits on Google
| (SO, earlier HN stories) fail to get the answer.
| sltkr wrote:
| I think there are two difficulties here.
|
| 1. How to even represent 1,000,000 sorted integers in only
| 2 megabytes, which means you can use (slightly more than) 2
| bytes per integer. The central observation is that the sum of
| the deltas is no more than 4294967296. 4294967296 / 1000000 =
| 4295 which is about 13 bits, so it seems like it should be
| possible, but it's not easy.
|
| For example, if you use a standard variable length encoding
| where the highest bit indicates continuation, you would have 7
| data bits per byte, but you could have 743,665 deltas of 128 and
| 256,335 deltas of 16384, which would require 743,665*2 + 256,335*3
| = 2,256,335 bytes, which is slightly over 2 megabytes.
|
| If you use the first two bits of the initial byte to encode the
| length (so that K bytes encode 8*K - 2 bits), you could have
| 740,749 deltas of 64 and 259,251 deltas of 16384, for a total of
| 259251*3 + 740749*2 = 2,259,251 bytes, slightly worse even.
|
| With 1 continuation bit per nibble the math similarly doesn't
| work out. So this is starting to look, if not impossible, at
| least very hard.
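|
| A quick way to reproduce those byte counts (an illustrative
| helper, not part of the original argument):
|
|     def varint_bytes(d):   # 7 data bits/byte, 1 continuation bit
|         return max(1, -(-d.bit_length() // 7))
|
|     total = (743_665 * varint_bytes(128)
|              + 256_335 * varint_bytes(16_384))
|     print(total)   # 2,256,335 bytes, just over 2 MB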
|
| 2. Imagine that you could represent 1,000,000 sorted integers
| in 2 megabytes somehow. Then the next problem is to actually
| create such a sorted sequence from the input.
|
| - Insertion sort works, but O(N^2) time complexity is not great
| with a million elements.
|
| - Quick sort doesn't work since you'd start with an unsorted
| array which you cannot represent in memory.
|
| - Heap sort doesn't work because it requires random access to
| the array, which doesn't work with a variable-length encoding.
|
| - Merge sort works in the beginning, but you need temporary
| space equal to the size of your input, so towards the end
| you're in trouble.
|
| I think you could make merge sort work if the requirement is
| "output the elements in sorted order" rather than actually
| constructing a sorted representation of the array. In that
| case, you could create sorted sequences of the first 500,000,
| 250,000, 125,000 etc. elements, and do a 20-way merge in the
| end, which is O(N log N) overall.
|
| This is still somewhat tricky because the average delta of e.g.
| 500,000 elements can be twice as large as for the overall
| array, so you might need slightly more space to encode it; we
| would need a really efficient compression scheme to make this
| work.
|
| All in all I'm gravitating towards: this problem isn't
| generally solvable under the given constraints. With 3
| megabytes I think the scheme outlined above works.
| noctune wrote:
| The entropy of 1e6 sorted 32 bit integers is
| log2((2^32+1e6-1) choose 1e6) bits or about 1.61 MiB, so it
| should be possible. Really difficult, obviously, but
| possible.
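|
| One way to evaluate that bound (using lgamma to avoid computing
| the huge exact binomial; illustrative, not noctune's code):
|
|     from math import lgamma, log
|
|     def log2_comb(n, k):
|         lg = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
|         return lg / log(2)
|
|     bits = log2_comb(2**32 + 10**6 - 1, 10**6)
|     print(bits / 8 / 2**20)   # ~1.61 MiB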
| raphlinus wrote:
| It's doable. First, you use a Golomb coding for the deltas.
| Tuning the parameters just right, you can get it to fit
| inside your budget with a little room to spare.
|
| Then, to get reasonably good time complexity, you use that
| extra space as a scratch buffer - read in a chunk of input,
| sort in place (heapsort is good for worst-case time
| complexity), and merge with the compressed stream. It's not
| simple to implement, but it does work.
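|
| A bare-bones Rice coder for the gaps, to give a feel for the
| budget (the parameter choice below is mine, not raphlinus's):
|
|     def rice_encode(gap, k=12):   # k ~ log2(mean gap of ~4295)
|         q, r = gap >> k, gap & ((1 << k) - 1)
|         return "1" * q + "0" + format(r, "0{}b".format(k))
|
|     # The unary quotients sum to at most 2**32 / 2**k bits over
|     # the whole stream, so the total is ~14 bits per value,
|     # comfortably inside the 2 MB budget.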
| GistNoesis wrote:
| Can you explain how you merge a compressed sorted stream
| with the sorted chunk, to obtain a bigger compressed stream
| (without extra memory, or quadratic complexity)?
|
| I can see a solution with memory usage: sizeof(compressed
| sorted stream) + sizeof(compressed sorted chunk) +
| sizeof(compressed merge output)
|
| This would work but it means you need to fit your
| compressed sorted stream in less than a single MB instead
| of 2MB (and I'm ignoring memory fragmentation problems)
| rjmunro wrote:
| One method is to move the compressed output from the top
| to the bottom of the memory with each cycle:
|
| A = Existing compressed sorted chunk; B = Compressed
| merge output; C = New sorted chunk
|
| Start with:
|
| AAAAAAAAAAAAAAAAAAAAAAAAAAA-----CCCCCC
|
| Merge into B, starting at the top of A and working your way
| down:
|
| AAAAAAAAAAAAAAAAAAAAAA-----BBBBBCCCC--
|
| Finish this iteration like:
|
| --------BBBBBBBBBBBBBBBBBBBBBBBB------
|
| Then refill C and sort it:
|
| --------BBBBBBBBBBBBBBBBBBBBBBBBCCCCCC
|
| Merge B back into A, but forwards this time:
|
| AAAAAAAAAA------BBBBBBBBBBBBBBBB---CCC
|
| AAAAAAAAAAAAAAAAAAAAAAAAAAA-----------
| GistNoesis wrote:
| Thanks, I got it. The main trick is the standard
| "interview" trick of writing from right to left.
|
| I'm not sure your solution works though without memory
| moving, because in your second step if the values of A
| are larger than the values of C, the growing B will hit
| the non decreasing set A.
|
| To convince myself, I wrote out the following sequence.
|
| 00000224446666666666888888_11133333355555
|
| BytesOfTheCompressedStream_NewSortedChunk________________
|
| BytesOfTheCompressedStream_NewSortedChunk________weNsetyB
|
| OfTheCompressedStream_SortedChunk________________weNsetyB
|
| OfTheCompressedStream_SortedChunk________detroSfOweNsetyB
|
| TheCompressedStream_Chunk________________detroSfOweNsetyB
|
| TheCompressedStream_Chunk_______detroSehTdetroSfOweNsetyB
|
| CompressedStream_Chunk__________detroSehTdetroSfOweNsetyB
|
| Here notice that we hit the existing streams so we need
| to memmove
|
| CompressedStream_Chunk_esserpmoCdetroSehTdetroSfOweNsetyB
|
| dStream_Chunk__________esserpmoCdetroSehTdetroSfOweNsetyB
|
| dStream_Chunk____knuhCdesserpmoCdetroSehTdetroSfOweNsetyB
|
| Stream_____maertSknuhCdesserpmoCdetroSehTdetroSfOweNsetyB
|
| ___________maertSknuhCdesserpmoCdetroSehTdetroSfOweNsetyB
|
| Then reverse it in place and memmove
| raphlinus wrote:
| There are a bunch of ways to do this. One of the more
| straightforward is to use a temporary buffer big enough
| to hold both the new values and the existing stream, and
| start by shifting the existing stream to be right aligned
| (using memmove or the equivalent). Then read bits from
| the old stream while writing bits into the new stream.
| Using Elias-Fano coding (which I believe is an instance
| of Golomb coding), the write position will never overtake
| the read position, so the merge can be done in-place.
| GistNoesis wrote:
| Thanks. I think I understood it now :
| https://news.ycombinator.com/item?id=27553076
|
| In my solution though, when the extra temporary buffer is
| very small, in the worst case we need to make
| sizeof(compressedStream)/sizeof(temporaryBuffer) memmoves,
| which can potentially make the algorithm quadratic.
| ot wrote:
| Encoding-wise, Elias-Fano requires exactly 2 + ceil(log(2^32
| / n)) bits per element, so it would provably fit in the memory
| budget.
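|
| Concretely, a back-of-the-envelope check (not ot's code):
|
|     from math import ceil, log2
|
|     n, u = 10**6, 2**32
|     per_elem = 2 + ceil(log2(u / n))   # 2 + 13 = 15 bits
|     print(n * per_elem / 8 / 2**20)    # ~1.79 MiB, inside 2 MB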
|
| Algorithm-wise, yeah I can only think of quadratic ones. But
| if in the question the bottleneck is memory, maybe that's
| fine? An optimization would be to do selection sort, but in
| each round select the next group of integers that share the
| top 32 - ceil(log(2^32 / n)) bits, as that is a cluster in
| the EF representation, and then sort the lower bits, which
| can be done in-place in O(k log k) where k is the group size.
| So if your groups are large enough you get some speedup.
|
| EDIT:
|
| Ah I think that the trick is to always use half of the memory
| left to read and sort.
|
| - Read 256k integers (1MB), sort in-place, construct EF
| representation (~500kB)
|
| - Read 128k, sort, and produce another EF (~280kB)
|
| - Keep halving
|
| You do a logarithmic number of rounds, so it's O(n log n). At
| the end you're left with log(n) EF sets, which fit in less
| than 2MB (in fact there's some space left, and you could save
| a few rounds) and you can merge them with a log(n)-way heap
| into a stream.
| tyingq wrote:
| Guido van Rossum's answer, of course with Python:
|
| http://neopythonic.blogspot.com/2008/10/sorting-million-32-b...
|
| It sounds like "break up into smaller sorted sets, then do a
| merge sort". Does the Google interview question allow for temp
| file storage?
| raphlinus wrote:
| There are two versions of the question. One allows external
| storage (Guido's answer and the top hit on SO), the other
| does not.
| tastyface wrote:
| Er... is it just me, or is this a terrible interview question?
| Assuming no external storage, I can imagine spending a few
| hours working out this problem for fun, but to do it in 45
| minutes on a whiteboard seems exceptionally difficult if you're
| not familiar with it already. (Or is this one of those "let's
| talk it through, you don't need to arrive at a perfect answer"
| deals?)
|
| I could be wrong. Maybe I should just sit down and try it, but
| timed problems stress me out in general, even if nothing's at
| stake.
| ChrisLomont wrote:
| When that problem came out I solved it quickly (it was 1
| million 8 digit numbers sorted in 1MB of RAM, which I think
| is the original problem that gained notoriety) with basically
| arithmetic compression of the delta stream, since the number of
| possible states sits decently inside the required RAM.
|
| I also saw I can sort it using zero bytes of storage, with a
| lot of code :)
|
| Think nested switches that eventually print a lot of numbers.
___________________________________________________________________
(page generated 2021-06-18 23:02 UTC)