[HN Gopher] An Empirical Evaluation of Columnar Storage Formats ...
___________________________________________________________________
An Empirical Evaluation of Columnar Storage Formats [pdf]
Author : eatonphil
Score : 40 points
Date : 2024-05-15 18:29 UTC (4 hours ago)
(HTM) web link (www.vldb.org)
(TXT) w3m dump (www.vldb.org)
| jauntywundrkind wrote:
| Nice to see the methodology here. Ideally LanceDB's Lance v2 and
| Nimble would also both be represented. It feels like there's huge
| appetite to do better than Parquet; ideally work like this would
| help inform where we go next.
|
| https://blog.lancedb.com/lance-v2/
|
| https://github.com/facebookincubator/nimble
| apavlo wrote:
| Lance v2 looks interesting. I like their meta-data + container
| story. Lacking SOTA encoding schemes though.
|
| There is also Vortex (https://github.com/fulcrum-so/vortex).
| That has modern encoding schemes that we want to use.
|
| BtrBlocks (https://github.com/maxi-k/btrblocks) from the
| Germans is another Parquet alternative.
|
| Nimble (formerly Alpha) is a complicated story. We worked with
| the Velox team for over a year to open-source and extend it.
| But plans got stymied by legal. This was in collaboration with
| Meta + CWI + Nvidia + Voltron. We decided to go a separate path
| because Nimble code has no spec/docs. Too tightly coupled with
| Velox/Folly.
|
| Given that, we are working on a new file format. We hope to
| share our ideas/code later this year.
| jauntywundrkind wrote:
| Honored to have your reply, wonderful view of the scene,
| thanks Andy.
|
| 2c remark, zero horses in this race: I was surprised how few
| encodings were in Nimble at release. The skeleton
| superficially seemed fine I guess, I don't know, but not much
| meat on the bones. Without nice interesting optimized
| encodings, the container for them doesn't feel compelling.
| But also starting with some inarguably clear options makes some
| kind of sense too; it's a kind of tactic.
|
| They claim they're trying to figure out a path to decoupling
| from Velox/Folly, so hopefully that can come about. I tend to
| believe so, godspeed.
|
| The "implementation not specification" does seem really scary
| though, isn't how we usually get breakout industry-changimg
| successes.
|
| I wish I had the savvy to contrast Lance (v2) vs Nimble a little
| better. Both seem to be containerizing systems, allowing streams
| to define their own encodings. Your comment about meta-data +
| encodings makes me feel like there are dimensions to the puzzle I
| haven't identified yet (mostly after chugging through VeloxCon
| talks).
|
| (Thanks for everything Andy, you're doing the good work
| (practicing and informing). Very, very excited to see y'all's
| alternative!!)
| 0cf8612b2e1e wrote:
| > Third, faster and cheaper storage devices mean that it is
| better to use faster decoding schemes to reduce computation costs
| than to pursue more aggressive compression to save I/O bandwidth.
| Formats should not apply general-purpose block compression by
| default because the bandwidth savings do not justify the
| decompression overhead.
|
| Not sure I agree with that. Have a situation right now where I am
| bottlenecked by IO and not compute.
| RhodesianHunter wrote:
| Is this because you're using some sort of network backed
| storage like EBS?
| epistasis wrote:
| This is extremely common in genomics settings, and in the past
| I have spent far more time allocating disk iops, network
| bandwidth, and memory amounts for various pipeline stages than
| I have on CPUs in this space. Muck up and launch 30x as many
| processes as your compute node has cores, and it's fairly
| fixable, but muck up the RAM allocation and disk IO and you may
| not be able
| to fix it in any reasonable time. And if you misallocate your
| network storage, that can bring the entire cluster to a halt,
| not just a few nodes.
| jltsiren wrote:
| I think the idea is that you should design tools and
| pipelines to take advantage of current hardware. Individual
| nodes have more CPU cores, more RAM, and more and faster
| local storage than they used to. Instead of launching many
| small jobs that compete for shared resources, you should have
| large jobs that run the entire pipeline locally, using
| network and network storage only when it's unavoidable.
| epistasis wrote:
| That is exactly right, and optimizing for the current
| distribution of hardware is always the case; however most
| interesting problems still do not fit on a single node. For
| example, large LLMs whose training data, or sometimes even the
| model itself, do not fit on a single node. Lots of the
| same principles of allocation show up again.
| jltsiren wrote:
| You mentioned genomics, and that's a field where problems
| have not grown much over time. You may have more of them,
| but individual problems are about the same size as
| before. Most problems have a natural size that depends on
| the size of the genome. Genomics tools never really
| embraced distributed computing, because there was no need
| for the added complexity.
| apavlo wrote:
| > Have a situation right now where I am bottlenecked by IO and
| not compute.
|
| Can you describe your use-case? Are you reading from NVMe or
| S3?
| zX41ZdbW wrote:
| My point is nearly the opposite. Data formats should apply
| lightweight compression, such as lz4, by default because it
| could be beneficial even if the data is read from RAM.
|
| I have made a presentation about it:
| https://presentations.clickhouse.com/meetup53/optimizations/
|
| Actually, it depends on the ratio between memory speed, the
| number of memory channels, CPU speed, and the number of CPU
| cores.
|
| But there are cases when compression by default does not make
| sense. For example, it is pointless to apply lossless
| compression for embeddings.
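|
| As a rough illustration of that last point, a sketch (assuming
| the python-lz4 and numpy packages; not taken from the paper):
| random float32 embeddings are essentially incompressible, while
| a shuffled low-cardinality ID column still shrinks noticeably.
|
|     import lz4.frame
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|
|     # 1M float32 "embedding" values: near-random mantissa bits
|     embeddings = rng.random(1_000_000, dtype=np.float32).tobytes()
|
|     # 1M int64 IDs drawn from only 16 distinct values
|     ids = rng.integers(0, 16, 1_000_000, dtype=np.int64).tobytes()
|
|     for name, data in [("embeddings", embeddings), ("ids", ids)]:
|         compressed = lz4.frame.compress(data)
|         print(name, round(len(data) / len(compressed), 2))
|
| Expect a ratio close to 1 for the embeddings and a clearly
| higher one for the IDs, which is the asymmetry being described.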
| Galanwe wrote:
| Last I checked you can't get much better than 1.5GB/s per
| core with LZ4 (from RAM), up to a maximum ratio < 3:1, and
| multicore decompression is not really possible unless you
| manually tweak the compression.
|
| The benchmarks above that number are usually misleading, because
| they assume no dependence between blocks, which is nuts. In real
| scenarios, blocks need to be parsed, they depend on their
| previous blocks, and you need to carry that context around.
|
| My RAM can deliver close to 20GB/s, and my SSD 7GB/s, and
| that is all commodity hardware.
|
| Meaning unless you have quite slow disks, you're better off
| without compression.
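|
| A minimal way to sanity-check those numbers on your own machine
| (a sketch assuming the python-lz4 package; absolute figures vary
| a lot with CPU and with how compressible the data is):
|
|     import lz4.frame
|     import os
|     import time
|
|     # ~256 MiB of half-random, half-zero data (roughly 2:1 ratio)
|     raw = b"".join(os.urandom(512) + b"\x00" * 512
|                    for _ in range(262144))
|     compressed = lz4.frame.compress(raw)
|
|     start = time.perf_counter()
|     for _ in range(10):
|         lz4.frame.decompress(compressed)
|     elapsed = time.perf_counter() - start
|
|     print("ratio:", round(len(raw) / len(compressed), 2))
|     print("decompression GB/s:", 10 * len(raw) / elapsed / 1e9)
|
| Comparing that single-core figure against your measured RAM and
| SSD bandwidth is exactly the trade-off being argued about here.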
| riku_iki wrote:
| > Last I checked you can't get much better than 1.5GB/s per
| core with LZ4
|
| you can partition your dataset and process each partition on a
| separate core, which will produce some massive XX or even XXX
| GB/s?
|
| > up to a maximum ratio < 3:1
|
| this obviously depends on your data pattern. If it is some
| low-cardinality IDs, they can be compressed at a ratio of 100
| easily.
| Galanwe wrote:
| > you can partition your dataset and process each
| partition on separate core, which will produce some
| massive XX or even XXX GB/s?
|
| Yes, but as I mentioned:
|
| > multicore decompression is not really possible unless
| you manually tweak the compression
|
| That is, there is no stable implementation out there that
| does it. You will have to do that manually and painfully.
| In which case, you're opening the doors for exotic/niche
| compression/decompression, and there are better
| alternatives than LZ4 if you're in the niche market.
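|
| Roughly the kind of manual work being described, as a sketch
| (assuming the python-lz4 package): you have to cut the data into
| independently compressed chunks yourself, and only then can
| separate cores decompress them in parallel.
|
|     import lz4.frame
|     import os
|     from multiprocessing import Pool
|
|     def compress_partitioned(data, chunk_size=4 * 1024 * 1024):
|         # Each chunk becomes an independent LZ4 frame, so it can
|         # be decompressed without context from its neighbours.
|         return [lz4.frame.compress(data[i:i + chunk_size])
|                 for i in range(0, len(data), chunk_size)]
|
|     if __name__ == "__main__":
|         raw = b"".join(os.urandom(512) + b"\x00" * 512
|                        for _ in range(65536))  # 64 MiB
|         chunks = compress_partitioned(raw)
|         with Pool() as pool:
|             parts = pool.map(lz4.frame.decompress, chunks)
|         assert b"".join(parts) == raw
|
| None of that parallelism comes for free from the format; the
| chunking scheme and its metadata are yours to maintain.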
|
| > this is obviously depends on your data pattern. If it
| is some low cardinality IDs, they can be compressed by
| ratio 100 easily.
|
| Everything is possible in theory. Yet we have to agree on
| what is a reasonable expectation. A compression factor of
| around 3:1 is, from my experience, what you would get at a
| reasonable compression speed on reasonably distributed data.
| riku_iki wrote:
| > Yes, but as I mentioned > multicore decompression is
| not really possible unless you manually tweak the
| compression
|
| I don't understand your point. Decompression will be applied to
| separate partitions using separate cores, the same way as
| compression.
|
| > Yet we have to agree on what is a reasonable
| expectation. A compression factor of around 3:1 is, from
| my experience
|
| well, my prod database is compressed at a ratio of 7 (many
| hundreds of billions of IDs).
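|
| For what it's worth, that kind of ratio is easy to reproduce
| when the IDs are clustered, e.g. sorted so that each value
| repeats in long runs. A sketch (assuming the python-lz4 and
| numpy packages), not a claim about any particular database:
|
|     import lz4.frame
|     import numpy as np
|
|     # 1M int64 IDs, each distinct value repeated 1000 times in a
|     # row, as you would see in a sorted/partitioned column
|     ids = np.repeat(np.arange(1000, dtype=np.int64), 1000).tobytes()
|
|     compressed = lz4.frame.compress(ids)
|     print(round(len(ids) / len(compressed)))  # well above 100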
| miohtama wrote:
| Try Blosc, it's faster than memcpy:
|
| https://www.blosc.org/pages/blosc-in-depth/
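|
| The Python bindings make it quick to try; a sketch (assuming the
| python-blosc and numpy packages; "faster than memcpy" is Blosc's
| own benchmark claim and depends heavily on the shuffle filter
| and the data layout):
|
|     import blosc
|     import numpy as np
|
|     ids = np.arange(1_000_000, dtype=np.int64).tobytes()
|
|     # typesize tells Blosc to byte-shuffle the int64 lanes before
|     # compressing, which is where much of its advantage comes from
|     packed = blosc.compress(ids, typesize=8, cname="lz4")
|     assert blosc.decompress(packed) == ids
|     print(round(len(ids) / len(packed), 1))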
| SPascareli13 wrote:
| Love this paper! I read it after watching Dr. Pavlo's lectures on
| YouTube.
___________________________________________________________________
(page generated 2024-05-15 23:00 UTC)