[HN Gopher] Apache Arrow 4.0
___________________________________________________________________
Apache Arrow 4.0
Author : kylebarron
Score : 127 points
Date : 2021-05-05 15:57 UTC (7 hours ago)
(HTM) web link (arrow.apache.org)
(TXT) w3m dump (arrow.apache.org)
| phissenschaft wrote:
| Great to see Ballista in arrow
| https://github.com/apache/arrow/pull/9723
| [deleted]
| michael_j_ward wrote:
| Just a heads up - ballista has been donated to apache arrow,
| but they also broke the Rust arrow libraries into separate
| repos [0][1] in a break from the mono-repo model. Read more at
| the announcement[2] or the merge request[3].
|
| [0] https://github.com/apache/arrow-datafusion
|
| [1] https://github.com/apache/arrow-rs
|
| [2] https://arrow.apache.org/blog/2021/05/04/rust-dev-workflow/
|
| [3] https://github.com/apache/arrow/pull/10096
| skrebbel wrote:
| I don't understand much about this apache/java data streaming
| ecosystem (ETL, Kafka, Cassandra, they're all buzzword bingo to
| me and i don't know what it all means), but maybe someone here
| can translate this to simpler application programmer terms?
|
| I read the overview, and I'm not sure yet, but is this like an
| in-memory database that runs inside your process? Like, sqlite
| without disk persistence, or Erlang ETS, but then columnar?
|
| I can't completely tell from the overview whether it's about the
| data format or the querying capability. A columnar ETS
| alternative would be splendid indeed!
| tracyhenry wrote:
| Here's a talk by Arrow's lead maintainer Wes McKinney:
| https://youtu.be/OtIU7HsHCE8?t=2731 which I think gave a good
| overview on the high-level motivations behind Apache Arrow.
| cookguyruffles wrote:
| I'm similarly confused. It seems to be a family of table
| encodings that sacrifice encoding simplicity for compactness.
| PyArrow implements parquet files for example, but also I think
| feather? PR for this project is a mess. Front page should be a
| bullet point list of deliverables rather than aspirations
| nobody understands
| vletal wrote:
| Come on, it's on the top of the front page
| https://arrow.apache.org/
| cookguyruffles wrote:
| You don't understand, I'm already a user of PyArrow and it
| doesn't match that page at all. It handles Parquet and
| Feather, right? They don't appear on the home page.
| Clicking the "specifications" link instead starts to talk
| about Flatbuffers. What's going on?
| papercrane wrote:
| Arrow is the in-memory format, PyArrow supports loading
| and saving that data as Parquet and Feather formatted
| files.
| nerdponx wrote:
| Parquet and Feather are on-disk file formats. Arrow is an
| in-memory format.
|
| Parquet is not Arrow, but they work well together, in
| that one can easily be (de)serialized to the other.
|
| Feather uses the Arrow IPC format internally.
| Tostino wrote:
| Their FAQ page at least answered some of that for me. Have
| the same feeling.
|
| http://arrow.apache.org/faq
| T-R wrote:
| Data science does a lot of SQL-like and linear-algebra-like
| transformations over a lot of data, and needs it to be
| reasonably performant. This means you want to do things like
| minimize overhead of indexing into data, and use things like
| SIMD instructions/GPU or parallelize work. To do this, you
| generally want your data in column-major format - organized as
| objects of arrays, rather than arrays of objects. Dataframe
| libraries like Pandas (which uses optimized linear algebra
| libraries like BLAS/LAPACK under the hood, via numpy) and the
| Spark Dataframe API are for working with columnar data and
| getting performance via SIMD or parallelization, respectively.
|
| Generally people start off by doing these computations in a
| series of batch jobs (an "ETL pipeline", orchestrated with
| something like Airflow), to transform data into whatever shape
| they ultimately want it in; streaming technologies like Spark
| Streaming and Kafka can help with incrementally adding new rows
| to your data, rather than recomputing the whole thing every
| batch-job run.
|
| Whenever you want to involve multiple systems or multiple
| libraries in your dataframe transformations, there's
| potentially a lot of computational overhead in serializing the
| dataframes or just converting them between memory
| representations. Arrow is a standardized format, spearheaded by
| the person who wrote Pandas, that attempts to match the in-
| memory representation, so that whether you're passing the data
| between libraries in-memory or writing a file for some other
| system to read, no unnecessary transformations need to happen
| to work on the data.
| 6gvONxR4sf7o wrote:
| > linear-algebra-like transformations
|
| > To do this, you generally want your data in column-major
| format
|
| I'd argue that the basic element of linear algebra is matrix
| vector multiplication, which I figured was best done row-
| major. Column major is great in other data use cases, but
| 'linear-algebra-like, therefore column major' doesn't feel
| right.
| BenoitP wrote:
| I don't know about linear algebra, but column major lets
| you compress thus:
|
| * Dictionary encoding: US,US,US,US,FR ->
| US:0,FR:1;0,0,0,0,1
|
| * Run-length encoding: 0,0,0,0,1 -> 4x0,1x1
|
| * Delta encoding: 0,1,2,3,4 -> 5x'+1'
|
| * Storing the min and max for a chunk
|
| Basically: exploit the data type to compress it.
|
| Which enables very fast filtering and projections. (And now
| that the IO bottleneck has been managed you can do your
| gigantic logistic regression)
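
The four column-level tricks listed above can be sketched in a few lines of plain Python. This is only an illustration of the ideas, not how Arrow or Parquet implement them; all function names are made up for the example, and the delta encoder stores the first value separately, so five values yield four "+1" deltas.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with a small integer code plus a lookup table."""
    table = {}
    codes = []
    for v in values:
        # A new value gets the next free code; a known value reuses its code.
        codes.append(table.setdefault(v, len(table)))
    return table, codes


def run_length_encode(values):
    """Collapse runs of equal values into (count, value) pairs."""
    return [(len(list(group)), value) for value, group in groupby(values)]


def delta_encode(values):
    """Keep the first value, then store only the difference to the previous one."""
    if not values:
        return None, []
    return values[0], [b - a for a, b in zip(values, values[1:])]


def chunk_stats(values, chunk_size):
    """Record (min, max) per chunk so a filter can skip chunks that cannot match."""
    return [
        (min(values[i:i + chunk_size]), max(values[i:i + chunk_size]))
        for i in range(0, len(values), chunk_size)
    ]


table, codes = dictionary_encode(["US", "US", "US", "US", "FR"])
runs = run_length_encode(codes)
first, deltas = delta_encode([0, 1, 2, 3, 4])
stats = chunk_stats([3, 9, 4, 20, 25, 22], chunk_size=3)
```

All of these only pay off because a column holds values of one type next to each other; interleaving them with other fields, as a row layout does, would break up the runs.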
| dmlorenzetti wrote:
| It sounds like you're thinking about the mat-vec operation
| in terms of "Grab one row of the matrix, take the dot-
| product with the vector, and repeat for each row of the
| matrix."
|
| But it's also possible to think of it as "Grab one element
| of the vector, use it to scale the corresponding col of the
| matrix, and repeat, summing results." Both are efficient
| means of finding the result, and both have block-level
| versions that play nicely with the machine cache.
|
| Meanwhile, linear algebra also often involves finding
| vector norms, and scaling vectors, and so on, and the way
| we usually set up tables means that the vectors of interest
| are generally columns of the data tables.
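
A small sketch of the two views of mat-vec described above, in plain Python with hypothetical function names. Real libraries use blocked, vectorized kernels; this only shows that both traversal orders give the same result.

```python
def matvec_by_rows(matrix, vector):
    """Row view: one dot product per row of the matrix."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def matvec_by_columns(columns, vector):
    """Column view: scale each column by its vector element and accumulate."""
    n_rows = len(columns[0]) if columns else 0
    result = [0] * n_rows
    for column, x in zip(columns, vector):
        for i, a in enumerate(column):
            result[i] += a * x
    return result


rows = [[1, 2], [3, 4], [5, 6]]           # 3x2 matrix stored row by row
columns = [list(c) for c in zip(*rows)]   # same matrix stored column by column
vector = [10, 1]

by_rows = matvec_by_rows(rows, vector)
by_columns = matvec_by_columns(columns, vector)
```

In the column version the inner loop walks one contiguous column at a time, which is the access pattern a columnar layout makes cheap.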
| T-R wrote:
| This is what I was trying to get at - using column
| vectors gives good cache locality and lets you use SIMD
| for "multiply all of these by this scalar" for each
| column, and then for "sum all of these" for the resulting
| rows. I'd imagine it could also let you optimize
| multiplications into things like bit-shifts with minimal
| overhead as well, though I have no idea if that's done in
| practice. Maybe only tangentially related, but I feel
| like this talk on Halide[0] is really illustrative of the
| general concepts.
|
| As others have mentioned, for some operations it can also
| save you from loading whole columns that aren't relevant
| for your transformation. The compression point in the
| sibling comment is definitely also relevant, especially
| for serialization. A whole lot of reasons to use column
| vectors.
|
| Using "column-major" here might've been terminology
| abuse; sorry for the confusion.
|
| [0] https://www.youtube.com/watch?v=3uiEyEKji0M
| andylei wrote:
| "column" here refers to a type of data. let's say you have
| a bunch of records of purchases. one column would be price,
| another column would be quantity.
|
| if you're doing a linear algebra like transformation, you
| want to do it on all the prices or all the quantities, and
| a linear algebra library expects a big array of numbers,
| which is why you have to transform your records into an
| array of prices and an array of quantities.
|
| "column" here refers to properties of objects, and not rows
| vs columns within an array of numbers
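
To make the purchases example concrete, here is a minimal sketch using only the standard library's `array` module. The records and values are invented for illustration, and this is not Arrow's actual memory layout, only the same "one array per property" idea.

```python
from array import array

# Row-oriented: a list of records, one dict per purchase.
purchases = [
    {"price": 9.5, "quantity": 2},
    {"price": 20.0, "quantity": 1},
    {"price": 4.25, "quantity": 4},
]

# Column-oriented: one contiguous typed array per property.
prices = array("d", (p["price"] for p in purchases))        # float64 values
quantities = array("q", (p["quantity"] for p in purchases))  # int64 values

# A per-column computation only touches the arrays it needs.
total_quantity = sum(quantities)
revenue = sum(price * qty for price, qty in zip(prices, quantities))
```

Summing the quantities never reads a price, which is the property that numeric libraries and columnar formats rely on.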
| vletal wrote:
| Long story short: Apache Arrow defines a format for (tabular)
| data to allow efficient computation and easier interop and
| sharing data between different frameworks.
| vmind wrote:
| Arrow itself is a standardised in-memory columnar data
| representation. The benefit of this is you can then send data
| between processes without needing to be concerned with
| serialise/deserialise. There's then a growing ecosystem around
| this, e.g. Flight for making the sending of data easier,
| DataFusion for querying, etc.
| aynyc wrote:
| Can you persist Arrow-format data to disk? I see a lot of
| interest in it, but I can't figure out the use cases. For
| example, let's say I have a ton (xx TB) of well-structured
| objects on S3. I want to run query via Spark/Presto on the
| data. I still need to deserialize the data from ORC/PARQUET
| into ARROW right? The advantage with ARROW here is if
| Spark/Presto can use this format to pass the data between
| worker nodes, the query would be faster because we don't need
| to deserialize/serialize when passing data between nodes? If
| yes, how do I utilize the format in Spark/Presto?
| lmeyerov wrote:
| You can: it is serializable and self-describing. However,
| unless "disk" is super fast and thus more likely memory,
| and your data is ephemeral, you probably shouldn't.
| Instead, we've been happier with parquet/orc: tunable
| compression, nicer multi-part / parallel readers, and a bit
| more stable.
|
| There _is_ feather for persistence, but you don't need it:
| just as how you can stream binary arrow buffers to
| processes, you can write raw arrow to disk. In theory it
| might give some teams in some setups parallel read/write
| speedups, but we've been exploring other paths there, e.g.,
| 90+GB/s per node via GDS https://pavilion.io/nvidia . I'm
| not aware of feather efforts targeting that kind of perf
| but would be curious!
|
| To utilize w/ spark.. it already does underneath ;-) an
| increasing flow is something like spark filter -> gpu
| compute+ai, where the transfer is spark cpu rdd -> arrow
| (spark-native) -> rapids/tensorflow
|
| Edit: Arrow dev _does_ seem more active than parquet/orc
| (and a lot of their dev is _by_ arrow devs!), so give it
| another couple of years, and I can see arrow being stable
| enough that you can persist data with less fear of having
| to reprocess older files and having most of the compression
| features you'd want!
| aynyc wrote:
| Got ya. We are sticking with Par/Orc for now, we are
| running into the scenario where size of the data is going
| up, query SLA is going down. At some point, we will need
| to look at other technology to reduce cost without
| sacrificing performance.
| lmeyerov wrote:
| Yep. I may have been unclear, they work well together:
| we'll do a gpu parquet reader that returns an arrow
| dataframe that our ETL pipeline then transforms into
| visual depictions of the correlations+relationships in
| people's datasets. Stuff on disk is nice stable formats,
| stuff across our API boundaries & compute frameworks is
| arrow.
| aynyc wrote:
| Interesting design! How big is your data per scan?
| lmeyerov wrote:
| it varies.. a lot of our users look at say 50kb files for
| quick small and targeted visual sessions, but when doing
| something like a log dump analysis, we are working on TB
| files and 1-2 GB per streaming part is good. CPU arrow
| people like to do say 10KB-1MB per record batch, but GPU
| land is a lot faster by thinking in terms of bandwidth,
| and so 500MB-10GB per contiguous part, depending on GPU
| memory and working set size. likewise, depends on how
| compressed it is, as you ultimately care how much it
| uncompresses into for the downstream memory pressure.
| hope that makes sense!
| SatvikBeri wrote:
| As one example, I use Arrow in production to share DataFrames
| between Python and Julia processes. My previous method was
| saving to disk and loading, this reduces the time from ~2
| minutes to ~5 seconds.
| nerdponx wrote:
| You're actually sharing the memory between the processes? I'd
| love to know how that works.
| SatvikBeri wrote:
| I actually don't know very much about how it works (which
| is a testament to the usability of the libraries!)
|
| I use PyCall to use Python from Julia, and PyJulia to do
| the reverse. PyCall & PyJulia have functionality to easily
| share arrays. PyCall is pretty seamless, PyJulia is a bit
| more work but still solid.
|
| To share a DataFrame, I convert it to Arrow, get a
| bytearray representation of that, use PyCall/PyJulia to
| access the array from a different process, and reinterpret
| it as Arrow data within that process.
|
| PyCall: https://github.com/JuliaPy/PyCall.jl
|
| PyJulia: https://pyjulia.readthedocs.io/en/latest/
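
The "get the bytes, then reinterpret them as typed data" step can be illustrated within a single Python process using `memoryview`. This is only a stand-in under stated assumptions: it does not use Arrow, PyCall, or PyJulia, and the buffer here is a plain `bytearray` rather than memory shared across runtimes.

```python
from array import array

# Producer side: a typed column and its raw bytes.
column = array("d", [1.5, 2.5, 3.5])
raw = bytearray(column.tobytes())   # stand-in for the buffer handed to the other side

# Consumer side: view the same bytes as float64 values without copying them.
view = memoryview(raw).cast("d")
values_seen = list(view)

# Writing through the view changes the underlying buffer, which shows
# that the view and the buffer refer to the same memory.
view[0] = 10.0
roundtrip = array("d")
roundtrip.frombytes(raw)
```

Because the consumer agrees with the producer on the layout (here: packed float64), no parsing or conversion step is needed. A standardized layout is what lets two different runtimes make that agreement.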
| kthejoker2 wrote:
| Can I just push back on this misuse of the term "buzzword
| bingo"? By all means, admit you're not familiar with them, but
| don't cast that as some kind of culture jamming point of pride.
|
| ETL is a design pattern.
|
| Kafka and Cassandra are tools.
|
| Arrow is a data format.
|
| Sorry, will cop to being completely defensive here, but don't
| call my baby ugly.
| [deleted]
| nxpnsv wrote:
| This is actually a really useful breakdown.
| [deleted]
| zarathustreal wrote:
| >don't cast that as some kind of culture jamming point of
| pride.
|
| >Sorry, will cop to being completely defensive here, but
| don't call my baby ugly.
|
| Not trying to be antagonistic here for what it's worth - I
| see your statement as far more guilty of casting your
| understanding of this technology as a culture jamming point
| of pride than the GP.
|
| Furthermore the implication that "people who unironically
| refer to this process as ETL" is the dominant culture (and
| statements expressing the irony of giving the most common
| form of computation a specific meaningless acronym are
| necessarily "culture jamming") is not correct in my
| experience but YMMV
| kthejoker2 wrote:
| I don't know your world, but this is like telling a
| carpenter that hammers, jigsaws, and mahogany are "buzzword
| bingo", it's condescending and disrespectful.
|
| Dominant culture has nothing to do with it, it's about
| respecting boundaries. You not knowing something isn't an
| indictment on the something.
| cwyers wrote:
| ETL is a term that dates back to the 70s, and is familiar
| to anyone who is doing any medium to large scale database
| management. I'm not saying that everyone has to know about
| relational databases, but to treat the _relational
| database_ and its attendant terms as out of the mainstream
| and "buzzwords" is just weird.
| yazaddaruvala wrote:
| As I understand it, Arrow is about the data format, to best
| optimize query capabilities in any language.
|
| Think, "CapnProto" or "Protobufs" but for querying data rather
| than only data transfer.
|
| In theory Arrow will enable users to "trivially" create high
| performance SQLite type querying in any programming language.
| Arrow doesn't help with any other features of SQLite tho.
|
| Maybe one day Arrow will also target on-disc
| analytics/persistence but for now it delegates to Parquet.
| lmeyerov wrote:
| Protobufs is a great comparison point and exactly what we
| upgraded from. It's like protobuf but specialized for rich
| data tables and, for those living outside google's
| proprietary layers, modern sensibilities. This brings wins
| like self-describing (no .proto, it optionally includes a
| schema), streaming, columnar by default, etc. By
| standardizing the notion of a table (and chunk within it, a
| record batch), it plugs into compute/io frameworks that work
| on tables: spark, pandas, etc. Most compute/db frameworks that
| give you lists/tables will likely have out of the box
| support for it, and even when not explicit, via tools like
| turbodbc for SQL DBs.
|
| Arrow has compute libs, but we don't use them (yet). More as
| interop for compute engines that are fast (ex: rapids for
| gpu) or feature rich (ex: pandas for chunks). Likewise, for
| I/O, interop for good formats there: on-the-fly arrow ->
| persistent parquet|orc.
|
| It's enabling us to do cool stuff like fast interop between
| our cpu<>gpu code, and in the latest initiative, crossing
| even process & language runtime boundaries w/ zero-copy.
| derriz wrote:
| No it's not a database - it's much more low level than that.
|
| It's a chunked columnar storage format for large data sets.
| Think of storage layout for a table of data. With arrow, the
| table is first split row-wise into chunks (sometimes/often just
| one chunk), then each column of each chunk is stored as an
| array. The underlying array layout supports vectorized
| operations.
|
| Querying is largely independent of arrow itself which is just
| the memory format. But the format was designed to support
| efficient querying. If used as a disk format, for example, you
| can efficiently load a subset of columns without touching the
| entire file. And if you're lucky and/or you've chunked the data
| appropriately you may be able to skip entire chunks.
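
A toy version of the layout described above: the table is split row-wise into chunks, each chunk stores one list per column, and per-chunk min/max statistics let a query skip chunks. The function names and the statistics dictionary are assumptions made for this sketch, not Arrow's or Parquet's actual structures.

```python
def to_chunks(columns, chunk_size):
    """Split a dict of equal-length columns row-wise into chunks.

    Each chunk keeps one list per column plus (min, max) statistics.
    """
    n_rows = len(next(iter(columns.values())))
    chunks = []
    for start in range(0, n_rows, chunk_size):
        data = {name: col[start:start + chunk_size] for name, col in columns.items()}
        stats = {name: (min(col), max(col)) for name, col in data.items()}
        chunks.append({"data": data, "stats": stats})
    return chunks


def select_where_at_least(chunks, select, filter_column, threshold):
    """Return values of `select` for rows where filter_column >= threshold."""
    out, scanned = [], 0
    for chunk in chunks:
        _, chunk_max = chunk["stats"][filter_column]
        if chunk_max < threshold:
            continue  # no row in this chunk can match, so skip it entirely
        scanned += 1
        keep = chunk["data"][filter_column]
        values = chunk["data"][select]  # the other columns are never touched
        out.extend(v for v, k in zip(values, keep) if k >= threshold)
    return out, scanned


table = {
    "day": [1, 2, 3, 4, 5, 6],
    "price": [5, 7, 6, 40, 45, 42],
    "note": ["a", "b", "c", "d", "e", "f"],
}
chunks = to_chunks(table, chunk_size=3)
days, scanned = select_where_at_least(chunks, "day", "price", 30)
```

The query reads two of the three columns and one of the two chunks. Whether a chunk can be skipped depends on how the data happens to be ordered, which is the "if you're lucky" part.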
|
| The format is also language agnostic - as long as the language
| is python or C++ :) - and allows zero-copy passing of data
| across the language barrier assuming a shared memory model.
| This zero-copy feature is important when dealing with large in-
| memory data-sets.
|
| Unfortunately, until the entire Python data-science ecosystem
| is re-written from scratch, the application for arrow will
| largely be for library writers and plumbing.
|
| Yes, as a data-scientist, you can easily turn your arrow table
| into a Pandas Dataframe or a numpy array but you run the risk
| of expensive copies occurring (actually a copy is guaranteed to
| happen if the table has more than one chunk) which sort-of
| defeats the whole purpose. And since to do anything useful with
| the data you're going to have to perform this conversion - as
| most of the Python data-science ecosystem is built on numpy and
| pandas - the format is not particularly interesting to data-
| science users, I feel.
|
| Numpy/pandas and their storage formats, I believe, will continue to dominate
| for columnar storage in the Python world for the foreseeable
| future.
| amitport wrote:
| Note that the pandas team pushed arrow forward and are
| planning support for it.
|
| https://pandas.pydata.org/docs/development/roadmap.html#
| adrianmonk wrote:
| ETL: That data over there is in that format, and that other
| data over in that other place is in that other format, and I
| want it all over here in this format.
|
| So I will extract it (get it out of those other places),
| transform it (put it in my format), and load it (store it
| here).
|
| The term has been around for decades and traditionally is used
| by database people, but it can apply to any process that does
| this.
|
| A concrete example might be if you run a business where you
| sell used books through Amazon and eBay. They each have their
| own format for info about the status of a product you have
| listed (whether it has sold, which shipping option the buyer
| chose, whether payment was received, etc.), but you might want
| to have that data from both sources in one place so you can see
| a dashboard or analyze it.
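
The bookseller example maps onto three small functions. Everything specific here is hypothetical: the CSV and JSON layouts, the field names, and the status values are invented for illustration and do not reflect what Amazon or eBay actually export.

```python
import csv
import io
import json

# Hypothetical exports; real marketplace formats differ.
AMAZON_CSV = "sku,status,ship_option\nB001,Shipped,Standard\nB002,Pending,Expedited\n"
EBAY_JSON = '[{"itemId": "E9", "state": "SOLD", "shipping": "economy"}]'


def extract():
    """Read the raw records out of each source format."""
    amazon = list(csv.DictReader(io.StringIO(AMAZON_CSV)))
    ebay = json.loads(EBAY_JSON)
    return amazon, ebay


def transform(amazon, ebay):
    """Map both source layouts onto one common record shape."""
    rows = [
        {"source": "amazon", "id": r["sku"], "sold": r["status"] == "Shipped",
         "shipping": r["ship_option"].lower()}
        for r in amazon
    ]
    rows += [
        {"source": "ebay", "id": r["itemId"], "sold": r["state"] == "SOLD",
         "shipping": r["shipping"].lower()}
        for r in ebay
    ]
    return rows


def load(rows, store):
    """Append the unified rows to the destination (a list stands in for a database)."""
    store.extend(rows)
    return store


warehouse = load(transform(*extract()), [])
sold_ids = [row["id"] for row in warehouse if row["sold"]]
```

Once both sources share one shape, the dashboard query at the end no longer needs to know where a row came from.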
| yazaddaruvala wrote:
| Is anyone familiar enough to know if Arrow is also targeting
| usage by libraries like Lucene?
| innagadadavida wrote:
| Lucene among other things implements an inverted index -
| basically tracks frequency of a word across different
| documents. In my opinion, it is already in a columnar like
| format and highly optimized for the use case and won't see any
| benefit from changing on disk formats.
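
For readers unfamiliar with the term, a bare-bones inverted index with term frequencies looks roughly like this. It is a sketch of the concept only; Lucene's real index uses far more elaborate, compressed structures, and the function name and sample documents are made up.

```python
from collections import Counter, defaultdict


def build_inverted_index(documents):
    """Map each term to {doc_id: how often the term occurs in that document}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index


docs = {
    1: "arrow is a columnar format",
    2: "parquet is a columnar file format",
    3: "arrow arrow arrow",
}
index = build_inverted_index(docs)
arrow_postings = index["arrow"]
columnar_docs = sorted(index["columnar"])
```

Each term's postings are stored together, which is why the layout is described above as already being columnar-like.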
| yazaddaruvala wrote:
| > it is already in a columnar like format and highly
| optimized for the use case
|
| I mean it has more complex data structures (FST) than just
| columnar, but yeah for doc values and such I agree, that is
| exactly why I'm curious if Arrow is targeting that usecase,
| and will be competitively "highly optimized".
|
| > and won't see any benefit from changing on disk formats.
|
| I'm less interested in Lucene actually migrating to Arrow
| (although if it reduces tech-debt they should look into it),
| I'm most interested in if Arrow will help future Lucene-like
| libraries get implemented with competitive performance.
|
| Also since Arrow version=X is cross-language compatible, it
| would be amazing to be able to create "Lucene" indexes (or
| segments) in Java (perhaps for legacy reasons), then use Rust
| or Go to query the data.
| poorman wrote:
| Love the progress being made!
| xbar wrote:
| Here's a neat story about how an Apple M1 Macbook enjoyed 3x the
| performance compared to an Apple Intel Macbook using a (hassle to
| compile) Apache Arrow test instantiation.
|
| https://uwekorn.com/2021/01/11/apache-arrow-on-the-apple-m1....
| dmitrykoval wrote:
| Good progress overall, especially on the Rust side, I wonder if
| C++ and Rust would at some point follow the same roadmap when it
| comes to higher-level compute features or rather deviate and develop
| at their own pace.
|
| Special kudos to the Rust team for Parquet predicates pushdown
| feature.
| oregontechninja wrote:
| It's a binary data format, supporting trees, tables, lists, and
| even blobs. Never used it, I already have sqlite.
| vlmutolo wrote:
| Two major and significant differences from sqlite:
|
| No persistent storage. Arrow is meant to be used for in-memory
| queries.
|
| Column-major storage. This enables more efficient data-science-
| like queries, such as univariate statistics on columns.
| mkoubaa wrote:
| Really great to see how fast this thing took flight!
| liminal wrote:
| Is Arrow good for text data or does the columnar format lose its
| benefits when dealing with lots of arbitrary length strings?
| rubatuga wrote:
| Finally has ARM builds for pyarrow!
___________________________________________________________________
(page generated 2021-05-05 23:02 UTC)