[HN Gopher] Nimble: A new columnar file format by Meta [video]
___________________________________________________________________
Nimble: A new columnar file format by Meta [video]
Author : aduffy
Score : 44 points
Date : 2024-04-10 20:00 UTC (3 hours ago)
(HTM) web link (www.youtube.com)
(TXT) w3m dump (www.youtube.com)
| mempko wrote:
| I would love to see support in Apache Arrow to read this format.
| Parquet is already supported.
| CharlesW wrote:
| I learned that "Nimble" is the new name for "Alpha", discussed in
| this 2023 report:
| https://www.cidrdb.org/cidr2023/papers/p77-chattopadhyay.pdf
|
| Here's an excerpt that may save some folks a click or three...
|
| > _" While storing analytical and ML tables together in the data
| lakehouse is beneficial from a management and integration
| perspective, it also imposes some unique challenges. For example,
| it is increasingly common for ML tables to outgrow analytical
| tables by up to an order of magnitude. ML tables are also
| typically much wider, and tend to have tens of thousands of
| features usually stored as large maps._
|
| > _" As we executed on our codec convergence strategy for ORC, it
| gradually exposed significant weaknesses in the ORC format
| itself, especially for ML use cases. The most pressing issue with
| the DWRF format was metadata overhead; our ML use cases needed a
| very large number of features (typically stored as giant maps),
| and the DWRF map format, albeit optimized, had too much metadata
| overhead. Apart from this, DWRF had several other limitations
| related to encodings and stripe structure, which were very
| difficult to fix in a backward-compatible way. Therefore, we
| decided to build a new columnar file format that addresses the
| needs of the next generation data stack; specifically, one that
| is targeted from the onset towards ML use cases, but without
| sacrificing any of the analytical needs._
|
| > _" The result was a new format we call Alpha. Alpha has several
| notable characteristics that make it particularly suitable for
| mixed Analytical and ML training use cases. It has a custom
| serialization format for metadata that is significantly faster to
| decode, especially for very wide tables and deep maps, in
| addition to more modern compression algorithms. It also provides
| a richer set of encodings and an adaptive encoding algorithm that
| can smartly pick the best encoding based on historical data
| patterns, through an encoding history loopback database. Alpha
| requires fewer streams per column for many common data types,
| making read coalescing much easier and saving I/Os, especially
| for HDDs. Alpha was written in modern C++ from scratch in a way
| that allows it to be extended easily in the future._
|
| > _" Alpha is being deployed in production today for several
| important ML training applications and showing 2-3x better
| performance than ORC on decoding, with comparable encoding
| performance and file size."_
| 0cf8612b2e1e wrote:
| Alpha has got to be one of the worst names I have ever heard
| for a new product. Did they want to make it impossible to find?
| __MatrixMan__ wrote:
| How could a company called Meta be so shortsighted?
| santoshalper wrote:
| Well played.
| isodev wrote:
| Alpha was also the name of the virtual assistant owned by the
| bad guy in Extrapolations.
|
| https://www.imdb.com/title/tt13821126/
| jeffcox wrote:
| Yes, but before that he was helping Zordon and the Power
| Rangers.
|
| https://www.imdb.com/title/tt0106064/
| khaledh wrote:
| Fwiw, the name clashes with Nim's package manager nimble:
| https://github.com/nim-lang/nimble
| jauntywundrkind wrote:
| There's already been some interesting column format optimization
| work at Meta, as their Velox execution engine team worked with
| Apache Arrow to align their columnar formats. This talk is
| actually happening at VeloxCon, so there's got to be some
| awareness!
| https://engineering.fb.com/2024/02/20/developer-tools/velox-...
| https://news.ycombinator.com/item?id=39454763
|
| I wonder how much overlap there is here, if any, and whether it
| was intentional or accidental. Ah, "return efficient Velox
| vectors" is on the list, but there still seems likely to be some
| overlap in encoding strategies etc.
|
| The four main points seem to be: a) encoding metadata as part of
| the stream rather than fixed metadata, b) nulls are just another
| encoding, c) no stripe footer, only stream locations are in the
| footer, d) FlatBuffers! Shout out to FlatBuffers, wasn't expecting
| to see them making a comeback!
|
| I do wish there were a lot more diagrams/slides. There's four
| bullet points, and Yoav Helfman talks to them, but there's not a
| ton of showing what he's talking about.
| gigatexal wrote:
| Hmm, another contender in the open table format space. Nice.
___________________________________________________________________
(page generated 2024-04-10 23:00 UTC)