[HN Gopher] Nimble: A new columnar file format by Meta [video]
       ___________________________________________________________________
        
       Nimble: A new columnar file format by Meta [video]
        
       Author : aduffy
       Score  : 44 points
       Date   : 2024-04-10 20:00 UTC (3 hours ago)
        
 (HTM) web link (www.youtube.com)
 (TXT) w3m dump (www.youtube.com)
        
       | mempko wrote:
       | I would love to see support in Apache Arrow to read this format.
       | Parquet is already supported.
        
       | CharlesW wrote:
       | I learned that "Nimble" is the new name for "Alpha", discussed in
       | this 2023 report:
       | https://www.cidrdb.org/cidr2023/papers/p77-chattopadhyay.pdf
       | 
       | Here's an excerpt that may save some folks a click or three...
       | 
       | > _" While storing analytical and ML tables together in the data
       | lakehouse is beneficial from a management and integration
       | perspective, it also imposes some unique challenges. For example,
       | it is increasingly common for ML tables to outgrow analytical
       | tables by up to an order of magnitude. ML tables are also
       | typically much wider, and tend to have tens of thousands of
       | features usually stored as large maps._
       | 
       | > _" As we executed on our codec convergence strategy for ORC, it
       | gradually exposed significant weaknesses in the ORC format
       | itself, especially for ML use cases. The most pressing issue with
       | the DWRF format was metadata overhead; our ML use cases needed a
       | very large number of features (typically stored as giant maps),
       | and the DWRF map format, albeit optimized, had too much metadata
       | overhead. Apart from this, DWRF had several other limitations
       | related to encodings and stripe structure, which were very
       | difficult to fix in a backward-compatible way. Therefore, we
       | decided to build a new columnar file format that addresses the
       | needs of the next generation data stack; specifically, one that
       | is targeted from the onset towards ML use cases, but without
       | sacrificing any of the analytical needs._
       | 
       | > _" The result was a new format we call Alpha. Alpha has several
       | notable characteristics that make it particularly suitable for
       | mixed Analytical and ML training use cases. It has a custom
       | serialization format for metadata that is significantly faster to
       | decode, especially for very wide tables and deep maps, in
       | addition to more modern compression algorithms. It also provides
       | a richer set of encodings and an adaptive encoding algorithm that
       | can smartly pick the best encoding based on historical data
       | patterns, through an encoding history loopback database. Alpha
       | requires fewer streams per column for many common data types,
       | making read coalescing much easier and saving I/Os, especially
       | for HDDs. Alpha was written in modern C++ from scratch in a way
       | that allows it to be extended easily in the future._
       | 
       | > _" Alpha is being deployed in production today for several
       | important ML training applications and showing 2-3x better
       | performance than ORC on decoding, with comparable encoding
       | performance and file size."_
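
For readers who want a concrete picture of "pick the best encoding", here is a minimal sketch. It is not Alpha's or Nimble's algorithm: the excerpt says the real selection draws on historical data patterns via an encoding history database, whereas this toy simply estimates a cost for each candidate on the current values and keeps the cheapest. The encoding names, the cost formulas, and `pick_encoding` are all hypothetical.

```python
from typing import Callable, Dict, List


def plain_cost(values: List[int]) -> int:
    # Toy model: one unit per stored value.
    return len(values)


def rle_cost(values: List[int]) -> int:
    # Toy model: each run costs two units (the value and its run length).
    runs = 0
    prev = object()  # sentinel that never equals a real value
    for v in values:
        if v != prev:
            runs += 1
            prev = v
    return 2 * runs


def dictionary_cost(values: List[int]) -> int:
    # Toy model: one unit per distinct value, plus a quarter unit per index.
    return len(set(values)) + len(values) // 4


# Hypothetical candidate set; a real format would have many more encodings.
CANDIDATES: Dict[str, Callable[[List[int]], int]] = {
    "plain": plain_cost,
    "rle": rle_cost,
    "dictionary": dictionary_cost,
}


def pick_encoding(values: List[int]) -> str:
    # Choose the candidate with the lowest estimated cost for this data.
    return min(CANDIDATES, key=lambda name: CANDIDATES[name](values))
```

Under these made-up costs, a constant column favors `rle`, a column of all-distinct values favors `plain`, and a column that alternates between a few values favors `dictionary`.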
        
         | 0cf8612b2e1e wrote:
         | Alpha has got to be one of the worst names I have ever heard
         | for a new product. Did they want to make it impossible to find?
        
           | __MatrixMan__ wrote:
           | How could a company called Meta be so shortsighted?
        
             | santoshalper wrote:
             | Well played.
        
           | isodev wrote:
           | Alpha was also the name of the virtual assistant owned by the
           | bad guy in Extrapolations.
           | 
           | https://www.imdb.com/title/tt13821126/
        
             | jeffcox wrote:
             | Yes, but before that he was helping Zordon and the Power
             | Rangers.
             | 
             | https://www.imdb.com/title/tt0106064/
        
       | khaledh wrote:
       | Fwiw, the name clashes with Nim's package manager nimble:
       | https://github.com/nim-lang/nimble
        
       | jauntywundrkind wrote:
       | There's already been some interesting column format optimization
       | work at Meta, as their Velox execution engine team worked with
       | Apache Arrow to align their columnar formats. This talk is
       | actually happening at VeloxCon, so there's got to be some
       | awareness! https://engineering.fb.com/2024/02/20/developer-
       | tools/velox-... https://news.ycombinator.com/item?id=39454763
       | 
       | I wonder how much overlap, if any, there is here, and whether it is
       | intentional or accidental. Ah, "return efficient Velox vectors" is
       | on the list, but there still seems likely to be some overlap in
       | encoding strategies, etc.
       | 
       | The four main points seem to be: a) encoding metadata is part of
       | the stream rather than fixed metadata, b) nulls are just another
       | encoding, c) no stripe footer; only stream locations are in the
       | footer, d) FlatBuffers! Shout out to FlatBuffers, wasn't expecting
       | to see them making a comeback!
       | 
       | I do wish there were a lot more diagrams/slides. There's four
       | bullet points, and Yoav Helfman talks to them, but there's not a
       | ton of showing what he's talking about.
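
Since the talk has few diagrams, here is one way to picture the "nulls as an encoding" idea from point (b). This is a sketch under assumptions, not Nimble's actual layout: the `NullableEncoded` type and both functions are hypothetical. The idea shown is that nullability wraps an inner stream of densely packed non-null values together with a per-row presence flag, so a column without nulls would not need the wrapper at all.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NullableEncoded:
    # Hypothetical wrapper encoding: one presence flag per row, plus the
    # non-null values packed densely (these could be encoded further).
    present: List[bool]
    values: List[int]


def encode_nullable(rows: List[Optional[int]]) -> NullableEncoded:
    # Split the column into a presence stream and a dense value stream.
    present = [r is not None for r in rows]
    values = [r for r in rows if r is not None]
    return NullableEncoded(present=present, values=values)


def decode_nullable(enc: NullableEncoded) -> List[Optional[int]]:
    # Walk the presence flags, consuming one dense value per present row.
    dense = iter(enc.values)
    return [next(dense) if p else None for p in enc.present]
```

Round-tripping `[1, None, 3]` through `encode_nullable` and `decode_nullable` returns the original list, and the dense stream holds only `[1, 3]`.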
        
       | gigatexal wrote:
       | Hmm, another contender in the open table format space. Nice.
        
       ___________________________________________________________________
       (page generated 2024-04-10 23:00 UTC)