[HN Gopher] An Empirical Evaluation of Columnar Storage Formats ...
       ___________________________________________________________________
        
       An Empirical Evaluation of Columnar Storage Formats [pdf]
        
       Author : eatonphil
       Score  : 40 points
       Date   : 2024-05-15 18:29 UTC (4 hours ago)
        
 (HTM) web link (www.vldb.org)
 (TXT) w3m dump (www.vldb.org)
        
       | jauntywundrkind wrote:
        | Nice to see the methodology here. Ideally LanceDB's Lance v2
        | and Nimble would also be represented. It feels like there's
        | huge appetite to do better than Parquet; work like this would
        | help inform where we go next.
       | 
       | https://blog.lancedb.com/lance-v2/
       | 
       | https://github.com/facebookincubator/nimble
        
         | apavlo wrote:
          | Lance v2 looks interesting. I like their metadata +
          | container story, but it lacks SOTA encoding schemes.
         | 
         | There is also Vortex (https://github.com/fulcrum-so/vortex).
         | That has modern encoding schemes that we want to use.
         | 
         | BtrBlocks (https://github.com/maxi-k/btrblocks) from the
         | Germans is another Parquet alternative.
         | 
          | Nimble (formerly Alpha) is a complicated story. We worked
          | with the Velox team for over a year, in collaboration with
          | Meta + CWI + Nvidia + Voltron, to open-source and extend
          | it, but the plans got stymied by legal. We decided to go a
          | separate path because the Nimble code has no spec/docs and
          | is too tightly coupled with Velox/Folly.
         | 
         | Given that, we are working on a new file format. We hope to
         | share our ideas/code later this year.
        
           | jauntywundrkind wrote:
            | Honored to have your reply; wonderful view of the scene.
            | Thanks, Andy.
           | 
            | 2c remark, zero horses in this race: I was surprised how
            | few encodings were in Nimble at release. The skeleton
            | superficially seemed fine, but there wasn't much meat on
            | the bones. Without interesting, optimized encodings, the
            | container for them doesn't feel compelling. Then again,
            | starting with a few inarguable, clear options makes some
            | sense as a tactic.
           | 
            | They claim they're trying to figure out a path to
            | decoupling from Velox/Folly, so hopefully that can come
            | about. I tend to believe it will; godspeed.
           | 
            | The "implementation, not specification" approach does
            | seem really scary though; it isn't how we usually get
            | breakout, industry-changing successes.
           | 
            | I wish I had the savvy to contrast Lance (v2) vs Nimble a
            | little better. Both seem to be containerizing systems,
            | allowing streams to define their own encodings. Your
            | comment about metadata + encodings makes me feel like
            | there are dimensions to the puzzle I haven't identified
            | yet (mostly after chugging through VeloxCon talks).
           | 
            | (Thanks for everything Andy, you're doing the good work
            | (practicing and informing). Very, very excited to see
            | y'all's alternative!)
        
       | 0cf8612b2e1e wrote:
        | > Third, faster and cheaper storage devices mean that it is
        | > better to use faster decoding schemes to reduce computation
        | > costs than to pursue more aggressive compression to save
        | > I/O bandwidth. Formats should not apply general-purpose
        | > block compression by default because the bandwidth savings
        | > do not justify the decompression overhead.
        | 
        | Not sure I agree with that. I have a situation right now
        | where I am bottlenecked by IO, not compute.
        
         | RhodesianHunter wrote:
         | Is this because you're using some sort of network backed
         | storage like EBS?
        
         | epistasis wrote:
         | This is extremely common in genomics settings, and in the past
         | I have spent far more time allocating disk iops, network
         | bandwidth, and memory amounts for various pipeline stages than
          | I have on CPUs in this space. Muck up and launch 30x as
          | many processes as your compute node has cores, and it's
          | fairly fixable, but muck up the RAM allocation and disk IO
          | and you may not be able to fix it in any reasonable time.
          | And if you misallocate your
         | network storage, that can bring the entire cluster to a halt,
         | not just a few nodes.
        
           | jltsiren wrote:
           | I think the idea is that you should design tools and
           | pipelines to take advantage of current hardware. Individual
           | nodes have more CPU cores, more RAM, and more and faster
           | local storage than they used to. Instead of launching many
           | small jobs that compete for shared resources, you should have
           | large jobs that run the entire pipeline locally, using
           | network and network storage only when it's unavoidable.
        
             | epistasis wrote:
              | That is exactly right, and optimizing for the current
              | distribution of hardware always applies; however, most
              | interesting problems still do not fit on a single node.
              | For example, large LLMs whose training data, or
              | sometimes even the model itself, does not fit on a
              | single node. Lots of the same allocation principles
              | show up again.
        
               | jltsiren wrote:
               | You mentioned genomics, and that's a field where problems
               | have not grown much over time. You may have more of them,
               | but individual problems are about the same size as
               | before. Most problems have a natural size that depends on
               | the size of the genome. Genomics tools never really
               | embraced distributed computing, because there was no need
               | for the added complexity.
        
         | apavlo wrote:
         | > Have a situation right now where I am bottlenecked by IO and
         | not compute.
         | 
         | Can you describe your use-case? Are you reading from NVMe or
         | S3?
        
         | zX41ZdbW wrote:
          | My point is nearly the opposite: data formats should apply
          | lightweight compression, such as LZ4, by default, because
          | it can be beneficial even when the data is read from RAM.
         | 
         | I have made a presentation about it:
         | https://presentations.clickhouse.com/meetup53/optimizations/
         | 
          | In practice, it depends on the balance between memory
          | speed, the number of memory channels, CPU speed, and the
          | number of CPU cores.
          | 
          | But there are cases where compression by default does not
          | make sense. For example, it is pointless to apply lossless
          | compression to embeddings.
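          | 
          | A rough way to measure that tradeoff, assuming the lz4
          | Python package (the sample data below is hypothetical and
          | very compressible; real columns will vary):
          | 
          |     import time
          |     import lz4.frame
          | 
          |     # Hypothetical, highly repetitive column data.
          |     data = b"some_repeated_id_0001" * 1_000_000
          |     compressed = lz4.frame.compress(data)
          | 
          |     start = time.perf_counter()
          |     restored = lz4.frame.decompress(compressed)
          |     elapsed = time.perf_counter() - start
          | 
          |     assert restored == data
          |     print(f"ratio {len(data) / len(compressed):.1f}:1")
          |     print(f"{len(data) / elapsed / 1e9:.2f} GB/s decode")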
        
           | Galanwe wrote:
           | Last I checked you can't get much better than 1.5GB/s per
           | core with LZ4 (from RAM), up to a maximum ratio < 3:1, and
           | multicore decompression is not really possible unless you
           | manually tweak the compression.
           | 
            | Benchmarks claiming more than that are usually
            | misleading, because they assume no dependence between
            | blocks, which is nuts. In real scenarios, blocks need to
            | be parsed, depend on their previous blocks, and you need
            | to carry that context around.
           | 
           | My RAM can deliver close to 20GB/s, and my SSD 7GB/s, and
           | that is all commodity hardware.
           | 
           | Meaning unless you have quite slow disks, you're better off
           | without compression.
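            | 
            | A rough way to sanity-check that with a toy model (all
            | numbers below are hypothetical inputs, not measurements):
            | 
            |     # Effective uncompressed GB/s delivered by a scan.
            |     def scan_gbps(disk_gbps, ratio, decode_gbps, cores):
            |         # Limited either by the device (expressed in
            |         # uncompressed bytes) or by decode throughput.
            |         return min(disk_gbps * ratio,
            |                    decode_gbps * cores)
            | 
            |     # Single-stream decode, the case described above:
            |     print(scan_gbps(7, 3, 1.5, 1))   # -> 1.5
            |     # vs. ~7 GB/s for an uncompressed scan off the SSD.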
        
             | riku_iki wrote:
             | > Last I checked you can't get much better than 1.5GB/s per
             | core with LZ4
             | 
              | you can partition your dataset and process each
              | partition on a separate core, which will produce some
              | massive XX or even XXX GB/s?
             | 
             | > up to a maximum ratio < 3:1
             | 
              | this obviously depends on your data pattern. If it is
              | some low-cardinality IDs, they can easily be compressed
              | at a 100:1 ratio.
        
               | Galanwe wrote:
               | > you can partition your dataset and process each
               | partition on separate core, which will produce some
               | massive XX or even XXX GB/s?
               | 
               | Yes, but as I mentioned:
               | 
               | > multicore decompression is not really possible unless
               | you manually tweak the compression
               | 
                | That is, there is no stable implementation out there
                | that does it. You will have to do it manually and
                | painfully. In which case, you're opening the door to
                | exotic/niche compression/decompression, and there are
                | better alternatives than LZ4 if you're in that niche
                | market.
               | 
               | > this is obviously depends on your data pattern. If it
               | is some low cardinality IDs, they can be compressed by
               | ratio 100 easily.
               | 
                | Everything is possible in theory, but we have to
                | agree on what a reasonable expectation is. A
                | compression factor of around 3:1 is, in my
                | experience, what you would get at a reasonable
                | compression speed on reasonably distributed data.
        
               | riku_iki wrote:
                | > Yes, but as I mentioned: multicore decompression
                | > is not really possible unless you manually tweak
                | > the compression
                | 
                | I don't understand your point. Decompression will be
                | applied on separate partitions using separate cores,
                | the same way as compression.
               | 
               | > Yet we have to agree on what is a reasonable
               | expectation. A compression factor of around 3:1 is, from
               | my experience
               | 
                | well, my prod database is compressed at a 7:1 ratio
                | (many hundreds of billions of IDs).
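                | 
                | A minimal sketch of that partitioned approach,
                | assuming the lz4 Python package (partition count and
                | contents are hypothetical):
                | 
                |     import lz4.frame
                |     from multiprocessing import Pool
                | 
                |     # Partitions are compressed independently, so
                |     # each one can be decoded on its own core.
                |     parts = [(b"id_%04d " % i) * 500_000
                |              for i in range(16)]
                |     packed = [lz4.frame.compress(p) for p in parts]
                | 
                |     if __name__ == "__main__":
                |         with Pool(8) as pool:
                |             out = pool.map(lz4.frame.decompress,
                |                            packed)
                |         assert out == parts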
        
         | miohtama wrote:
          | Try Blosc; it's faster than memcpy.
         | 
         | https://www.blosc.org/pages/blosc-in-depth/
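          | 
          | A minimal python-blosc sketch on a hypothetical NumPy
          | column of low-cardinality 64-bit IDs:
          | 
          |     import numpy as np
          |     import blosc
          | 
          |     ids = np.random.randint(0, 1000, size=10_000_000,
          |                             dtype=np.int64)
          | 
          |     blosc.set_nthreads(8)  # codec-level multithreading
          |     packed = blosc.compress(ids.tobytes(), typesize=8,
          |                             cname="lz4",
          |                             shuffle=blosc.SHUFFLE)
          |     out = np.frombuffer(blosc.decompress(packed),
          |                         dtype=np.int64)
          | 
          |     assert (out == ids).all()
          |     print(f"ratio {ids.nbytes / len(packed):.1f}:1")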
        
       | SPascareli13 wrote:
        | Love this paper! I read it after watching Dr. Pavlo's
        | lectures on YouTube.
        
       ___________________________________________________________________
       (page generated 2024-05-15 23:00 UTC)