[HN Gopher] Apache Arrow 4.0
       ___________________________________________________________________
        
       Apache Arrow 4.0
        
       Author : kylebarron
       Score  : 127 points
       Date   : 2021-05-05 15:57 UTC (7 hours ago)
        
 (HTM) web link (arrow.apache.org)
 (TXT) w3m dump (arrow.apache.org)
        
       | phissenschaft wrote:
       | Great to see Ballista in arrow
       | https://github.com/apache/arrow/pull/9723
        
         | [deleted]
        
         | michael_j_ward wrote:
         | Just a heads up - ballista has been donated to apache arrow,
         | but they also broke the Rust arrow libraries into separate
         | repos [0][1] in a break from the mono-repo model. Read more at
         | the announcement[3] or the merge request[4].
         | 
         | [0] https://github.com/apache/arrow-datafusion
         | 
         | [1] https://github.com/apache/arrow-rs
         | 
         | [2] https://arrow.apache.org/blog/2021/05/04/rust-dev-workflow/
         | 
         | [3] https://github.com/apache/arrow/pull/10096
        
       | skrebbel wrote:
       | I don't understand much about this apache/java data streaming
       | ecosystem (ETL, Kafka, Cassandra, they're all buzzword bingo to
       | me and i don't know what it all means), but maybe someone here
       | can translate this to simpler application programmer terms?
       | 
       | I read the overview, and I'm not sure yet, but is this like an
       | in-memory database that runs inside your process? Like, sqlite
       | without disk persistence, or Erlang ETS, but then columnar?
       | 
       | I can't completely tell from the overview whether it's about the
       | data format or the querying capability. A columnar ETS
       | alternative would be splendid indeed!
        
         | tracyhenry wrote:
         | Here's a talk by Arrow's lead maintainer Wes Mckinney:
         | https://youtu.be/OtIU7HsHCE8?t=2731 which I think gave a good
         | overview on the high-level motivations behind Apache Arrow.
        
         | cookguyruffles wrote:
         | I'm similarly confused. It seems to be a family of table
         | encodings that sacrifice encoding simplicity for compactness.
         | PyArrow implements parquet files for example, but also I think
         | feather? PR for this project is a mess. Front page should be a
         | bullet point list of deliverables rather than aspirations
         | nobody understands
        
           | vletal wrote:
           | Come on, it's on the top of the front page
           | https://arrow.apache.org/
        
             | cookguyruffles wrote:
             | You don't understand, I'm already a user of PyArrow and it
             | doesn't match that page at all. It handles Parquet and
             | Feather, right? They don't appear on the home page.
             | Clicking the "specifications" link instead starts to talk
             | about Flatbuffers. What's going on?
        
               | papercrane wrote:
               | Arrow is the in-memory format, PyArrow supports loading
               | and saving that data as Parquet and Feather formatted
               | files.
        
               | nerdponx wrote:
               | Parquet and Feather are on-disk file formats. Arrow is an
               | in-memory format.
               | 
               | Parquet is not Arrow, but they work well together, in
               | that one can easily be (de)serialized to the other.
               | 
               | Feather uses the Arrow IPC format internally.
        
           | Tostino wrote:
           | Their FAQ page at least answered some of that for me. Have
           | the same feeling.
           | 
           | http://arrow.apache.org/faq
        
         | T-R wrote:
         | Data science does a lot of SQL-like and linear-algebra-like
         | transformations over a lot of data, and needs it to be
         | reasonably performant. This means you want to do things like
         | minimize overhead of indexing into data, and use things like
         | SIMD instructions/GPU or parallelize work. To do this, you
         | generally want your data in column-major format - organized as
         | objects of arrays, rather than arrays of objects. Dataframe
         | libraries like Pandas (which uses optimized linear algebra
         | libraries like BLAS/LAPACK under the hood, via numpy) and the
         | Spark Dataframe API are for working with columnar data and
         | getting performance via SIMD or parallelization, respectively.
         | 
         | Generally people start off by doing these computations in a
         | series of batch jobs (an "ETL pipeline", orchestrated with
         | something like Airflow), to transform data into whatever shape
         | they ultimately want it in; streaming technologies like Spark
         | Streaming and Kafka can help with incrementally adding new rows
         | to your data, rather than recomputing the whole thing every
         | batch-job run.
         | 
         | Whenever you want to involve multiple systems or multiple
         | libraries in your dataframe transformations, there's
         | potentially a lot of computational overhead in serializing the
         | dataframes or just converting them between memory
         | representations. Arrow is a standardized format, spearheaded by
         | the person who wrote Pandas, that attempts to match the in-
         | memory representation, so that whether you're passing the data
         | between libraries in-memory or writing a file for some other
         | system to read, no unnecessary transformations need to happen
         | to work on the data.
        
           | 6gvONxR4sf7o wrote:
           | > linear-algebra-like transformations
           | 
           | > To do this, you generally want your data in column-major
           | format
           | 
           | I'd argue that the basic element of linear algebra is matrix
           | vector multiplication, which I figured was best done row-
           | major. Column major is great in other data use cases, but
           | 'linear-algebra-like, therefore column major' doesn't feel
           | right.
        
             | BenoitP wrote:
             | I don't know about linear algebra, but column major lets
             | you compress thus:
             | 
             | * Dictionary encoding: US,US,US,US,FR ->
             | US:0,FR:1;0,0,0,0,1
             | 
             | * Run-length encoding: 0,0,0,0,1 -> 4x0,1x1
             | 
             | * Delta encoding: 0,1,2,3,4 -> 5x'+1'
             | 
             | * Storing the min and max for a chunk
             | 
             | Basically: exploit the data type to compress it.
             | 
             | Which enables very fast filtering and projections. (And now
             | that the IO bottleneck has been managed you can do your
             | gigantic logistic regression)
        
             | dmlorenzetti wrote:
             | It sounds like you're thinking about the mat-vec operation
             | in terms of "Grab one row of the matrix, take the dot-
             | product with the vector, and repeat for each row of the
             | matrix."
             | 
             | But it's also possible to think of it as "Grab one element
             | of the vector, use it to scale the corresponding col of the
             | matrix, and repeat, summing results." Both are efficient
             | means of finding the result, and both have block-level
             | versions that play nicely with the machine cache.
             | 
             | Meanwhile, linear algebra also often involves finding
             | vector norms, and scaling vectors, and so on, and the way
             | we usually set up tables means that the vectors of interest
             | are generally columns of the data tables.
        
               | T-R wrote:
               | This is what I was trying to get at - using column
               | vectors gives good cache locality and lets you use SIMD
               | for "multiply all of these by this scalar" for each
               | column, and then for "sum all of these" for the resulting
               | rows. I'd imagine it could also let you optimize
               | multiplications into things like bit-shifts with minimal
               | overhead as well, though I have no idea if that's done in
               | practice. Maybe only tangentially related, but I feel
               | like this talk on Halide[0] is really illustrative of the
               | general concepts.
               | 
               | As others have mentioned, for some operations it can also
               | save you from loading whole columns that aren't relevant
               | for your transformation. The compression point in the
               | sibling comment is definitely also relevant, especially
               | for serialization. A whole lot of reasons to use column
               | vectors.
               | 
               | Using "column-major" here might've been terminology
               | abuse; sorry for the confusion.
               | 
               | [0] https://www.youtube.com/watch?v=3uiEyEKji0M
        
             | andylei wrote:
             | "column" here refers to a type of data. let's say you have
             | a bunch of records of purchases. one column would be price,
             | another column would be quantity.
             | 
             | if you're doing a linear algebra like transformation, you
             | want to do it on all the prices or all the quantities, and
             | a linear algebra library expects a big array of numbers,
             | which is why you have to transform your records into an
             | array of prices and an array of quantities.
             | 
             | "column" here refers to properties of objects, and not rows
             | vs columns with in an array of number
        
         | vletal wrote:
         | Long story short: Apache Arrow defines a format for (tabular)
         | data to allow efficient computation and easier interop and
         | sharing data between different frameworks.
        
         | vmind wrote:
         | Arrow itself is a standardised in-memory columnar data
         | representation. The benefit of this is you can then send data
         | between processes without needing to be concerned with
         | serialise/deserialise. There's then a growing ecosystem around
         | this, e.g. Flight for making the sending of data easier,
         | DataFusion for querying, etc.
        
           | aynyc wrote:
           | Can you persist Arrow-format data to disk? I see a lot of
           | interests in it, but I can't figure out the use cases. For
           | example, let's say I have ton (xx TB) of well-structured
           | objects on S3. I want to run query via Spark/Presto on the
           | data. I still need to deserialize the data from ORC/PARQUET
           | into ARROW right? The advantage with ARROW here is if
           | Spark/Presto can use this format to pass the data between
           | worker nodes, the query would be faster because we don't need
           | to deserialize/serialize when passing data between nodes? If
           | yes, how do I utilize the format in Spark/Presto?
        
             | lmeyerov wrote:
             | You can: it is serializable and self-describing. However,
             | unless "disk" is super fast and thus more likely memory,
             | and your data is ephemeral, you probably shouldn't.
             | Instead, we've been happier as parquet/orc: tunable
             | compression, nicer multi-part / parallel readers, and a bit
             | more stable.
             | 
             | There _is_ feather for persistence, but you don 't need it:
             | just as how you can stream binary arrow buffers to
             | processes, you can write raw arrow to disk. In theory it
             | might give some teams in some setups parallel read/write
             | speedups, but we've been exploring other paths there, e.g.,
             | 90+GB/s per node via GDS https://pavilion.io/nvidia . I'm
             | not aware of feather efforts targeting that kind of perf
             | but would be curious!
             | 
             | To utilize w/ spark.. it already does underneath ;-) an
             | increasing flow is something like spark filter -> gpu
             | compute+ai, where the transfer is spark cpu rdd -> arrow
             | (spark-native) -> rapids/tensorflow
             | 
             | Edit: Arrow dev _does_ seem more active than parquet /orc
             | (and a lot of their dev is _by_ arrow devs!), so give it
             | another couple of years, and I can see arrow being stable
             | enough that you can persist data with less fear of having
             | to reprocess older files and having most of the compression
             | features you'd want!
        
               | aynyc wrote:
               | Got ya. We are sticking with Par/Orc for now, we are
               | running into the scenario where size of the data is going
               | up, query SLA is going down. At some point, we will need
               | to look at other technology to reduce cost without
               | sacrificing performance.
        
               | lmeyerov wrote:
               | Yep. I may have been unclear, they work well together:
               | we'll do a gpu parquet reader that returns an arrow
               | dataframe that our ETL pipeline then transforms into
               | visual depictions of the correlations+relationships in
               | people's datasets. Stuff on disk is nice stable formats,
               | stuff across our API boundaries & compute frameworks is
               | arrow.
        
               | aynyc wrote:
               | Interesting design! How big is your data per scan?
        
               | lmeyerov wrote:
               | it varies.. a lot of our users look at say 50kb files for
               | quick small and targeted visual sessions , but when doing
               | something like a log dump analysis, we are working on TB
               | files and 1-2 GB per streaming part is good. CPU arrow
               | people like to do say 10KB-1MB per record batch, but GPU
               | land is a lot faster by thinking in terms of bandwidth,
               | and so 500MB-10GB per contiguous part, depending on GPU
               | memory and working set size. likewise, depends on how
               | compressed it is, as you ultimately care how much it
               | uncompresses into for the downstream memory pressure.
               | hope that makes sense!
        
         | SatvikBeri wrote:
         | As one example, I use Arrow in production to share DataFrames
         | between Python and Julia processes. My previous method was
         | saving to disk and loading, this reduces the time from ~2
         | minutes to ~5 seconds.
        
           | nerdponx wrote:
           | You're actually sharing the memory between the processes? I'd
           | love to know how that works.
        
             | SatvikBeri wrote:
             | I actually don't know very much about how it works (which
             | is a testament to the usability of the libraries!)
             | 
             | I use PyCall to use Python from Julia, and PyJulia to do
             | the reverse. PyCall & PyJulia have functionality to easily
             | share arrays. PyCall is pretty seamless, PyJulia is a bit
             | more work but still solid.
             | 
             | To share a DataFrame, I convert it to Arrow, get a
             | bytearray representation of that, use PyCall/PyJulia to
             | access the array from a different process, and reinterpret
             | it as Arrow data within that process.
             | 
             | PyCall: https://github.com/JuliaPy/PyCall.jl
             | 
             | PyJulia: https://pyjulia.readthedocs.io/en/latest/
        
         | kthejoker2 wrote:
         | Can I just pushback on this misuse of the term "buzzword
         | bingo"? By all means, admit you're not familiar with them, but
         | don't cast that as some kind of culture jamming point of pride.
         | 
         | ETL is a design pattern.
         | 
         | Kafka and Cassandra are tools.
         | 
         | Arrow is a data format.
         | 
         | Sorry, will cop to being completely defensive here, but don't
         | call my baby ugly.
        
           | [deleted]
        
           | nxpnsv wrote:
           | This is actually a really a useful breakdown.
        
           | [deleted]
        
           | zarathustreal wrote:
           | >don't cast that as some kind of culture jamming point of
           | pride.
           | 
           | >Sorry, will cop to being completely defensive here, but
           | don't call my baby ugly.
           | 
           | Not trying to be antagonistic here for what it's worth - I
           | see your statement as far more guilty of casting your
           | understanding of this technology as a culture jamming point
           | of pride than the GP.
           | 
           | Furthermore the implication that "people who unironically
           | refer to this process as ETL" is the dominant culture (and
           | statements expressing the irony of giving the most common
           | form of computation a specific meaningless acronym are
           | necessarily "culture jamming") is not correct in my
           | experience but YMMV
        
             | kthejoker2 wrote:
             | I don't know your world, but this is like telling a
             | carpenter that hammers, jigsaws, and mahogany are "buzzword
             | bingo", it's condescending and disrespectful.
             | 
             | Dominant culture has nothing to do with it, it's about
             | respecting boundaries. You not knowing something isn't an
             | indictment on the something.
        
             | cwyers wrote:
             | ETL is a term that dates back to the 70s, and is familiar
             | to anyone who is doing any medium to large scale database
             | management. I'm not saying that everyone has to know about
             | relational databases, but to treat the _relational
             | database_ and its attendant terms as out of the mainstream
             | and "buzzwords" is just weird.
        
         | yazaddaruvala wrote:
         | As I understand it, Arrow is about the data format, to best
         | optimize query capabilities in any language.
         | 
         | Think, "CapnProto" or "Protobufs" but for querying data rather
         | than only data transfer.
         | 
         | In theory Arrow will enable users to "trivially" create high
         | performance SQLite type querying in any programming language.
         | Arrow doesn't help with any other features of SQLite tho.
         | 
         | Maybe one day Arrow will also target on-disc
         | analytics/persistence but for now it delegates to Parquet.
        
           | lmeyerov wrote:
           | Protobufs is a great comparison point and exactly what we
           | upgraded from. It's like protobuf but specialized for rich
           | data tables and, for those living outside google's
           | proprietary layers, modern sensibilities. This brings wins
           | like self-describing (no .proto, it optionally includes a
           | schema), streaming, columnar by default, etc. By
           | standardizing the notion of a table (and chunk within it, a
           | record batch), it plugs into compute/io frameworks that work
           | on tables: spark, pandas, etc. Most compute/db framework that
           | gives you lists/tables will likely have out of the box
           | support for it, and even when not explicit, via tools like
           | turbodc for SQL DBs.
           | 
           | Arrow has compute libs, but we don't use them (yet). More as
           | interop for compute engines that are fast (ex: rapids for
           | gpu) or feature rich (ex: pandas for chunks). Likewise, for
           | I/O, interop for good formats there: on-the-fly arrow ->
           | persistent parquet|orc.
           | 
           | It's enabling us to do cool stuff like fast interop between
           | our cpu<>gpu code, and in the latest initiative, crossing
           | even process & language runtime boundaries w/ zero-copy.
        
         | derriz wrote:
         | No it's not a database - it's much more low level than that.
         | 
         | It's a chunked columnar storage format for large data sets.
         | Think of storage layout for a table of data. With arrow, the
         | table is first split row-wize into chunks (sometimes/often just
         | one chunk), then each column of each chunk is stored as an
         | array. The underlying arrays layout supports vectorized
         | operations.
         | 
         | Querying is largely independent of arrow itself which is just
         | the memory format. But the format was designed to support
         | efficient querying. If used as a disk format, for example, you
         | can efficiently load a subset of columns without touching the
         | entire file. And if you're lucky and/or you've chunked the data
         | appropriately you may be able to skip entire chunks.
         | 
         | The format is also language agnostic - as long as the language
         | is python or C++ :) - and allows zero-copy passing of data
         | across the language barrier assuming a shared memory model.
         | This zero-copy feature is important when dealing with large in-
         | memory data-sets.
         | 
         | Unfortunately, until the entire Python data-science ecosystem
         | is re-written from scratch, the application for arrow will
         | largely for library writers and plumbing.
         | 
         | Yes, as a data-scientist, you can easily turn your arrow table
         | into a Pandas Dataframe or a numpy array but you run the risk
         | of expensive copies occurring (actually a copy is guaranteed to
         | happen if the table has more than one chunk) which sort-of
         | defeats the whole purpose. And since to do anything useful with
         | the data you're going to have to perform this conversion - as
         | most of the Python data-science ecosystem is built on numpy and
         | pandas - the format is not particularly interesting to data-
         | science users, I feel.
         | 
         | it their storage formats, I believe, will continue to dominate
         | for columnar storage in the Python world for the foreseeable
         | future.
        
           | amitport wrote:
           | Note that the pandas team pushed arrow forward and are
           | planning support for it.
           | 
           | https://pandas.pydata.org/docs/development/roadmap.html#
        
         | adrianmonk wrote:
         | ETL: That data over there is in that format, and that other
         | data over in that other place is in that other format, and I
         | want it all over here in this format.
         | 
         | So I will extract it (get it out of those other places),
         | transform it (put it in my format), and load it (store it
         | here).
         | 
         | The term has been around for decades and traditionally is used
         | by database people, but it can apply to any process that does
         | this.
         | 
         | A concrete example might be if you run a business where you
         | sell used books through Amazon and eBay. They each have their
         | own format for info about the status of a product you have
         | listed (whether it has sold, which shipping option the buyer
         | chose, whether payment was received, etc.), but you might want
         | to have that data from both sources in one place so you can see
         | a dashboard or analyze it.
        
       | yazaddaruvala wrote:
       | Is anyone familiar enough to know if Arrow is also targeting
       | usage by libraries like Lucene?
        
         | innagadadavida wrote:
         | Lucene among other things implements an inverted index -
         | basically tracks frequency of a word across different
         | documents. In my opinion, it is already in a columnar like
         | format and highly optimized for the use case and won't see any
         | benefit from changing on disk formats.
        
           | yazaddaruvala wrote:
           | > it is already in a columnar like format and highly
           | optimized for the use case
           | 
           | I mean it has more complex data structures (FST) than just
           | columnar, but yeah for doc values and such I agree, that is
           | exactly why I'm curious if Arrow is targeting that usecase,
           | and will be competitively "highly optimized".
           | 
           | > and won't see any benefit from changing on disk formats.
           | 
           | I'm less interested in Lucene actually migrating to Arrow
           | (although if it reduces tech-debt they should look into it),
           | I'm most interested in if Arrow will help future Lucene-like
           | libraries get implemented with competitive performance.
           | 
           | Also since Arrow version=X is cross-language compatible, it
           | would be amazing to be able to create "Lucene" indexes (or
           | segments) in Java (perhaps for legacy reasons), then use Rust
           | or Go to query the data.
        
       | poorman wrote:
       | Love the progress being made!
        
       | xbar wrote:
       | Here's a neat story about how an Apple M1 Macbook enjoyed 3x the
       | performance compared to an Apple Intel Macbook using a (hassle to
       | compile) Apache Arrow test instantiation.
       | 
       | https://uwekorn.com/2021/01/11/apache-arrow-on-the-apple-m1....
        
       | dmitrykoval wrote:
       | Good progress overall, especially on the Rust side, I wonder if
       | C++ and Rust would at some point follow the same roadmap when it
       | comes higher-level compute features or rather deviate and develop
       | at their own pace.
       | 
       | Special kudos to the Rust team for Parquet predicates pushdown
       | feature.
        
       | oregontechninja wrote:
       | It's a binary data format, supporting trees, tables, lists, and
       | even blobs. Never used it, I already have sqlite.
        
         | vlmutolo wrote:
         | Two major and significant differences from sqlite:
         | 
         | No persistent storage. Arrow is meant to be used for in-memory
         | queries.
         | 
         | Column-major storage. This enables more efficient data-science-
         | like queries, such as univariate statistics on columns.
        
       | mkoubaa wrote:
       | Really great to see how fast this thing took flight!
        
       | liminal wrote:
       | Is Arrow good for text data or does the columnar format lose its
       | benefits when dealing with lots of arbitrary length strings?
        
       | rubatuga wrote:
       | Finally has ARM builds for pyarrow!
        
       ___________________________________________________________________
       (page generated 2021-05-05 23:02 UTC)