[HN Gopher] Ballista: Distributed compute platform implemented i...
___________________________________________________________________
Ballista: Distributed compute platform implemented in Rust using
Apache Arrow
Author : Kinrany
Score : 185 points
Date : 2021-01-18 17:50 UTC (5 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| habitue wrote:
| It saddens me a little bit that a nascent protocol like arrow
| flight is using grpc + protobufs behind the scenes with some
| hacks on top to skip the cost of deserializing protobufs. It
| seems like a really common belief that protobufs have so much
| engineering time behind them and are cross-language that it's a
| no brainer to implement your new protocol on top of them.
|
| In reality, all the engineering and optimization time is behind
| the implementations for the google internal languages, and even
| the python protobuf implementation is pretty bad.
|
| Protobuf makes some _stunningly_ bad decisions, like using
| varints, so you shouldn't make the immediate assumption that
| "google has tons of great engineers, google uses protobuf for
| everything internally, therefore, protobuf is a good foundation
| to build my new thing on top of".
|
| In reality, path dependence and the (amazing) internal tooling
| ecosystem at google both play a huge part of why they use
| protobuf so extensively.
|
| (Grpc is a little overly complicated to be a universal
| recommendation, but I could believe it's a good choice for Arrow
| Flight. But it seems like they didn't do grpc + arrow or grpc +
| flatbuffer + arrow in the hopes that "dumb" grpc + protobuf
| implementations would be able to still benefit. In my opinion,
| grpc implementations are so coupled, there's no reason to make
| this unnecessary concession to protobufs)
| sitkack wrote:
| What you describe is truly an
| https://github.com/TimelyDataflow/abomonation
| jfim wrote:
| I'm not sure I'd call varints a stunningly bad decision. It
| seems more like the kind of tradeoff one would make if storage
| and network transfer costs are considered to be more important
| than the serialization/deserialization speed.
|
| That being said, the fact that that particular tradeoff was
| considered to be good for Google doesn't mean it actually is,
| or that it's applicable to one's application.
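
For readers unfamiliar with the encoding being debated, here is a minimal sketch of a base-128 varint in the style of the protobuf wire format. The function names `encode_varint` and `decode_varint` are made up for illustration and are not taken from any protobuf library; the sketch only handles non-negative integers.

```python
from typing import Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = n & 0x7F          # take the lowest 7 bits
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # top bit set: more bytes follow
        else:
            out.append(low7)         # top bit clear: last byte
            return bytes(out)


def decode_varint(buf: bytes) -> Tuple[int, int]:
    """Return (value, bytes_consumed); every byte needs a branch on its top bit."""
    value = 0
    shift = 0
    for i, byte in enumerate(buf):
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, i + 1
        shift += 7
    raise ValueError("truncated varint")
```

Small values stay small (300 takes two bytes instead of a fixed eight), which is the storage and network saving jfim describes. The price is the per-byte loop and branch in the decoder, which is the CPU cost habitue objects to.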
| habitue wrote:
| You're right, it's probably fair to say it's a stunningly bad
| tradeoff for most applications most of the time, given we
| have fast compression like snappy & brotli available now and
| cloud costs are heavily weighted towards CPU costing more
| than storage & network transfer
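
The alternative habitue points at can be sketched as fixed-width fields followed by a general-purpose compressor. This is only an illustration under stated assumptions: `zlib` from the Python standard library stands in for snappy or brotli, the payload is hypothetical, and nothing here measures the CPU cost on either side of the tradeoff.

```python
import zlib

# Hypothetical payload: 1000 small integers, the case where varints save the most.
values = [i % 100 for i in range(1000)]

# Fixed-width encoding: always 8 bytes per value, decodable with fixed-stride reads.
fixed = b"".join(v.to_bytes(8, "little") for v in values)

# A general-purpose compressor removes the padding zeros after the fact.
compressed = zlib.compress(fixed)

# Decoding is one decompress call followed by fixed-stride slicing.
restored = zlib.decompress(compressed)
decoded = [
    int.from_bytes(restored[i:i + 8], "little")
    for i in range(0, len(restored), 8)
]
```

Whether this beats varints in practice depends on the data and the compressor, so it is worth measuring on a real workload rather than assuming.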
| xiphias2 wrote:
| I guess it was good before compression. Nowadays working with
| protobufs is painful inside Google as well, but at least it's
| supported by everything.
|
| Most of what the CPUs at Google are doing is just copying
| fields from one protobuffer to another.
| cogman10 wrote:
| I think that's a fair description of pretty much every
| webapp :D
|
| Most of them are doing nothing more than copying data from
| the database into an http stream.
| ithkuil wrote:
| Well, all CPUs do is copy memory from one memory
| location to another, perform some arithmetic, and do some
| conditional jumps :-)
| cogman10 wrote:
| pop pop jump jump oh what a crunch it is.
|
| (sung to this tune
| https://www.youtube.com/watch?v=iENQXIQ8wH0 )
|
| Side note: It really is incredible what happened in the
| early days of computing when memory and computation were
| limited. How much care was taken in the precise layout of
| memory or even the timing of a calculation was insane.
| xiphias2 wrote:
| Most of the arithmetic is checking optional fields if
| they are empty or not before copying :)
| munk-a wrote:
| Don't forget the validation!
|
| Webapps are dumb middleware that pipes data from the
| database into an http stream - but it needs to determine
| which database calls to invoke and sanitize all the
| incoming junk.
| xiphias2 wrote:
| Sure, but when you are using C++, protobufs are not the
| best way to store data...but I guess it could be worse.
| philsnow wrote:
| At least when I was there, proto (de)serialization consumed
| the plurality of cpu cycles, but not the majority.
|
| It isn't really the majority these days, is it?
| ampdepolymerase wrote:
| What are some production ready alternatives to gRPC that have
| both a pleasant developer experience and great performance?
| Apache Avro? Apache Thrift?
| jeffnappi wrote:
| GraphQL is one.
|
| Here's an article making the argument for it
| https://blog.spaceuptech.com/posts/why-we-moved-from-grpc-
| to...
| e12e wrote:
| I think that if you honestly consider GraphQL a better fit
| than gRPC, you probably should never have considered gRPC
| to begin with...
|
| And much as we're considering GraphQL for some services at
| work... I'm not sure I buy it as an RPC framework. I
| suppose it has about the same appeal as SOAP for that
| purpose.
| habitue wrote:
| I think for this kind of high performance stuff, grpc is a
| reasonable choice. For ergonomics though, http + json is fine
| for many/most applications and there is a lot more widely
| available tooling for it than there is for grpc.
|
| It's very possible that will change over time
|
| (My implicit assumption here is that a project like Arrow
| Flight wants a cross-language, widely used foundation for
| their protocol, and there's not a ton of things that fit that
| bill. But depending on your application's needs, implementing
| a language-specific rpc system is perfectly acceptable, and
| may have even better ergonomics. Rust and Python both have a
| plethora of mono-lingual rpc frameworks)
| e12e wrote:
| Cap'n'proto?
|
| https://capnproto.org/
| hilbertseries wrote:
| gRPC has a pleasant developer experience? This is news to me.
| riku_iki wrote:
| Maybe it is relatively pleasant when compared to
| alternatives.
| ampdepolymerase wrote:
| It doesn't.
| [deleted]
| speps wrote:
| > using Apache Arrow MEMORY MODEL
|
| Probably got cut because of maximum title length but important
| nonetheless.
| superbcarrot wrote:
| It has Rust in the title, that will be enough.
| DSingularity wrote:
| lol
| davesque wrote:
| Not sure what that omitted portion would have clarified. Apache
| Arrow, at its core, is a memory model spec. Also, it appears
| that this project is using the official Rust library which is
| developed in the same repo as all the other language
| implementations. So the simple statement that they're using
| Apache Arrow seems adequate.
| eb0la wrote:
| Author is the same guy that wrote Arrow Rust library ;-)
| davesque wrote:
| So he is, hah. Well there you go :).
| vasi26ro wrote:
| I have less than a year as a software tester and I have a web
| startup. I am earning around 300 EUR a month, but the amount of
| knowledge that I have gained is amazing. Even a freelance business
| is trying to sign a contract with me to establish themselves as
| an authority in the startup world. Don't worry, just try
| secondcoming wrote:
| Hopefully this is the beginning of the end for JVM use in data-
| centric applications like this. I'm not particularly bothered if
| it's Rust or C++
| georgewfraser wrote:
| This kind of data infrastructure is a great use case for Rust. A
| lot of data infrastructure is memory-bound, so saving the memory
| overhead of GC is a huge win.
|
| The use of Arrow to support multiple programming languages is
| also a great concept. Other distributed computing engines have
| ended up tied to the JVM (Spark, Presto, Kafka) as a way of
| avoiding serialization/deserialization costs when you go across a
| language boundary. Arrow is a really elegant solution, as long as
| you're willing to batch up operations.
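
The idea behind a shared columnar memory model can be illustrated without Arrow itself. The sketch below uses only the Python standard library (`array` and `memoryview`), so it is an analogy, not the Arrow API: a column lives in one contiguous buffer, a consumer reads it through a view with no copy or re-encoding, and work is done one batch at a time. The column name `prices` is hypothetical.

```python
from array import array

# A column of 64-bit integers stored contiguously, one buffer per column.
prices = array("q", [10, 20, 30, 40])

# A consumer takes a view over the same memory; no bytes are copied or re-encoded.
view = memoryview(prices)

# Batched operation: one pass over the whole column instead of per-row calls.
total = sum(view)

# A write by the producer is visible through the view, showing the buffer is shared.
prices[0] = 11
shared = view[0] == 11
```

Arrow applies the same principle across process and language boundaries by standardizing the buffer layout, which is why batching matters: the fixed cost of handing over a buffer is paid once per batch, not once per value.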
| MrPowers wrote:
| Databricks recently rebuilt Spark in C++ in a project called
| "Delta Engine" to overcome the JVM limitations you're pointing
| out. You're right, Rust is a great way to sidestep the dreaded
| JVM GC pauses.
| sitkack wrote:
| At the same time the JVM is getting better memory tracking
| analysis and incremental pauseless collectors (C4, ZGC,
| Shenandoah, G1 improvements)
|
| https://blogs.oracle.com/javamagazine/understanding-the-
| jdks...
| eb0la wrote:
| Blog post that started the project. Worth reading.
| https://andygrove.io/2019/07/announcing-ballista/
| andygrove wrote:
| There is also a more recent blog post which perhaps led to the
| project being posted here (I am guessing).
|
| https://andygrove.io/2021/01/ballista-2021/
| lumost wrote:
| I think it's telling on the state of rust in 2021 that this
| project can't compile fully for the latest rust versions.
| Maintaining these types of frameworks in the early days is an
| intensive and often thankless job; having your language leave you
| behind is a near guaranteed way to kill off your project, not to
| mention introduce the obvious "I tried using library X and hit
| compilation issue Y" type issues.
|
| I'm curious to see how this evolves as there are a number of
| motivated folks working on similar efforts such as Vega. I for
| one would welcome a mature rust based distributed compute
| platform.
| jasonpbecker wrote:
| > With the release of Apache Arrow 3.0.0 there are many
| breaking changes in the Rust implementation and as a result it
| has been necessary to comment out much of the code in this
| repository and gradually get each Rust module working again
| with the 3.0.0 release.
|
| This appears to be an issue with Arrow's implementation hitting
| a new major version and the Rust libraries not yet being
| compatible with the newest versions of Arrow. That's not
| something specific to the Rust ecosystem. It's not like a new
| version of Rust broke this project.
|
| But even if it had, maintenance is always hard and the health
| of a project is better measured by how long it takes to be
| working with new, major, stable versions after widespread
| community adoption of those new, major, stable versions.
|
| I don't know if Arrow 3.0 is the most commonly used
| implementation-- it may not have even reached that milestone.
| nevi-me wrote:
| Arrow 3.0 will be released likely in the next week. The Rust
| implementation has a lot of changes because we've had to make
| public-facing changes, mostly for performance benefits.
| jasonpbecker wrote:
| Thanks! This is helpful context, and supports my notion.
| andygrove wrote:
| The project uses stable Rust. Which version are you trying to
| compile with?
| frankmcsherry wrote:
| I think maybe they were confused by this text (which I agree
| has nothing to do with Rust itself breaking):
|
| > With the release of Apache Arrow 3.0.0 there are many
| breaking changes in the Rust implementation and as a result
| it has been necessary to comment out much of the code in this
| repository and gradually get each Rust module working again
| with the 3.0.0 release.
| andygrove wrote:
| Ah, yes, that makes sense. I can see how this could have
| been misread.
___________________________________________________________________
(page generated 2021-01-18 23:00 UTC)