[HN Gopher] Velox: Meta's Unified Execution Engine [pdf]
       ___________________________________________________________________
        
       Velox: Meta's Unified Execution Engine [pdf]
        
       Author : luu
       Score  : 59 points
       Date   : 2024-03-24 03:45 UTC (1 day ago)
        
 (HTM) web link (www.eecs.umich.edu)
 (TXT) w3m dump (www.eecs.umich.edu)
        
       | pvg wrote:
       | A thread from late 2022:
       | https://news.ycombinator.com/item?id=32673873
        
       | jauntywundrkind wrote:
       | Python's Substrait seems like the biggest/most-used
       | competitor-ish out there. I'd love some compare & contrast; my
       | sense is that Substrait has a smaller ambition: it wants to be a
       | language for talking about execution rather than a full-on
       | optimization/execution engine.
       | https://github.com/substrait-io/substrait
       | 
       | (Edit: ah, there's a recent talk discussing PyVelox trying to get
       | Substrait integration.
       | https://www.youtube.com/watch?v=l_kHxkGkNRg#t=18m22s . However
       | there's also discussion about the un-maintainedness of some of
       | the current Substrait work here; unclear status.
       | https://github.com/facebookincubator/velox/issues/8895)
       | 
       | We can also see from the Apache Arrow DataFusion discussion that
       | they too see themselves as a bit of a Velox competitor.
       | https://github.com/apache/arrow-datafusion/discussions/6441
       | 
       | It's cool to see this space mature. I like that even Velox sees
       | that Apache Arrow (which underlies Apache Arrow DataFusion too)
       | is industry-standard tech that they ought to work with.
       | https://engineering.fb.com/2024/02/20/developer-tools/velox-...
       | 
       | There's a solid Influx post that talks through some of how they
       | are composing the assorted technologies to build their next-gen
       | 3.0, which I find helpful for getting a sense of how all the
       | pieces of a modern high-performance data engine slot together.
       | https://www.influxdata.com/blog/flight-datafusion-arrow-parq...
        
         | kristjansson wrote:
         | I think you're right - Substrait wants to sit above something
         | like Velox. The closest comparison is probably Databricks
         | Photon[1], but that's proprietary.
         | 
         | [1]: https://www.databricks.com/product/photon
        
       | zX41ZdbW wrote:
       | Many ideas look like they were influenced by ClickHouse, and some
       | are direct copies. I'm surprised they didn't provide references
       | to ClickHouse, where the implementations are proven in production
       | in the first place.
        
         | gaogao wrote:
         | Could you be specific about which ideas you think were
         | influenced by ClickHouse specifically and not Presto or DuckDB
         | or Spark?
        
       | redskyluan wrote:
       | Velox could be a competitor to DataFusion. It is more focused on
       | the execution engine and could be great to integrate into other
       | high-performance databases.
       | 
       | Databases will be split into pieces and rebuilt!
        
         | sakras wrote:
         | Yes, this has been an up-and-coming theme in the data science
         | world. Arrow for the data format, Ibis for the API,
         | Acero/Velox/DataFusion/DuckDB/Polars for execution, Substrait
         | for the query plan representation, etc.
        
       | sgt101 wrote:
       | I wonder how many of these FAANG projects really get used where
       | they are built. I went for an interview at a FAANG years ago to
       | work on a very big consumer product (when it was in relative
       | infancy) and expected to find a hyper-tech data backend to
       | use... they told me that they were using MySQL.
       | 
       | I didn't get the job so maybe they were just joking around with
       | me - but the general despair that they evinced about their data
       | situation makes me wonder!
        
         | bezosdontpipme wrote:
         | I can neither confirm nor deny that S3's global bucket database
         | is actually just MySQL (with a lil bit of special sauce)
        
           | sgt101 wrote:
           | tbh my general response to all data questions is "use
           | postgres". It does happen that someone comes back with a good
           | reason why that would be a bad idea, but it's not frequent!
           | 
           | MySQL == Oracle now... so bad on theological grounds.
        
             | astrange wrote:
             | You can use MariaDB.
        
           | nonrandomstring wrote:
           | And why ever not? It's a perfectly good solution, no?
           | 
           | What the GP alludes to is interesting though - mythologising
           | of organisations, brands and names.
           | 
           | Spend enough time with "famous" people, "big names", centres
           | of power and prominence and you quickly see everyone is just
           | ordinary dudes doing ordinary things with ordinary gear. But
           | for some reason there's fuck loads of money and attention,
           | and sometimes cloying paranoia and adulation floating around.
           | 
           | Sure, right out on the periphery are a noble few who play
           | with particle accelerators, spaceships and bunker
           | supercomputers. But then, that's just a day job too.
           | 
           | True genius/exceptionalism is rare and found in
           | unexpected places. The rest is conjured out of thin air by
           | marketing and PR people, the press, and commentators. They
           | are the ones who need the big legend.
        
           | influx wrote:
           | Yeah, but I bet you the S3 Keymap isn't MySQL....
        
         | ipsum2 wrote:
         | Facebook/Meta uses MySQL, but with a completely different
         | engine (MyRocks) and sharding techniques.
         | 
         | YouTube uses MySQL, but they've also rewritten major portions
         | for scalability (Vitess).
         | 
         | Just because a company is using a technology you've heard of
         | doesn't mean it's what you expect.
        
           | riku_iki wrote:
           | > YouTube uses MySQL, but they've also rewritten major
           | portions for scalability (Vitess).
           | 
           | I imagine this is some very old info (like 10 years old) and
           | could have changed since then?
        
         | philjohn wrote:
         | At Meta they probably don't get built unless they're impactful,
         | and they're not impactful if they're not used in production to
         | solve a real pain point.
        
         | kgp7 wrote:
         | This is being actively used at Meta in production across
         | several engines; the paper makes explicit references to this.
        
       | sakras wrote:
       | My general take is that while the idea of composability is good,
       | the implementations of these things are just frankly not of high
       | quality. Velox/Acero in particular are all plagued by what I've
       | come to call "Java syndrome", where everything is written as
       | idiomatic Java but with C++ syntax. Virtual methods,
       | std::shared_ptr galore (in lieu of garbage collection), random
       | heap allocations, etc. As a result these systems tend to be
       | bloated and significantly slower than they need to be.
       | 
       | DuckDB is good though, and I predict its quality of
       | implementation will keep "monolithic databases" relevant for a
       | while longer.
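
To make the "Java syndrome" complaint above concrete, here is a minimal C++ sketch of the two styles. It is purely illustrative and is not code from Velox, Acero, or DuckDB; the names `Value`, `IntValue`, `box_column`, `sum_boxed`, and `sum_flat` are hypothetical. The first style boxes every cell behind a `std::shared_ptr` and reads it through a virtual call; the second keeps the column in one contiguous buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// "Java-style" C++ (hypothetical example): every cell is its own
// heap-allocated object, owned by a shared_ptr and read via a virtual call.
struct Value {
    virtual ~Value() = default;
    virtual std::int64_t get() const = 0;
};

struct IntValue final : Value {
    explicit IntValue(std::int64_t v) : v_(v) {}
    std::int64_t get() const override { return v_; }
    std::int64_t v_;
};

// Box a flat column: one heap allocation per row.
std::vector<std::shared_ptr<Value>> box_column(
    const std::vector<std::int64_t>& column) {
    std::vector<std::shared_ptr<Value>> boxed;
    boxed.reserve(column.size());
    for (std::int64_t v : column) {
        boxed.push_back(std::make_shared<IntValue>(v));
    }
    return boxed;
}

// Summing the boxed column costs a pointer chase plus a virtual
// dispatch for every row.
std::int64_t sum_boxed(const std::vector<std::shared_ptr<Value>>& column) {
    std::int64_t total = 0;
    for (const auto& cell : column) {
        total += cell->get();
    }
    return total;
}

// Columnar style: one contiguous buffer, no per-row allocation or
// dispatch, so the loop is easy for the compiler to vectorize.
std::int64_t sum_flat(const std::vector<std::int64_t>& column) {
    std::int64_t total = 0;
    for (std::int64_t v : column) {
        total += v;
    }
    return total;
}
```

Both functions compute the same sum; the difference is only in memory layout and dispatch cost, which is the kind of overhead the comment is describing. How much it matters in a real engine depends on where the virtual calls sit (per row versus per batch), which this sketch does not try to measure.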
        
       ___________________________________________________________________
       (page generated 2024-03-25 23:00 UTC)