[HN Gopher] Apache Arrow 3.0
___________________________________________________________________
Apache Arrow 3.0
Author : kylebarron
Score : 244 points
Date : 2021-02-03 19:43 UTC (3 hours ago)
(HTM) web link (arrow.apache.org)
(TXT) w3m dump (arrow.apache.org)
| mbyio wrote:
| I'm surprised they are still making breaking changes, and they
| plan to make more (they are already working on a 4.0).
| reilly3000 wrote:
| I didn't see any breaking changes in the release notes but I
| may have missed them. Maybe they don't use SemVer?
| lidavidm wrote:
| Arrow uses SemVer, but the library and the data format are
| versioned separately:
| https://arrow.apache.org/docs/format/Versioning.html
| offtop5 wrote:
| Does no one do load testing anymore? Anyone got a working mirror?
| Thaxll wrote:
| Last time I worked in ETL it was with Hadoop; looks like a lot
| has happened.
| macksd wrote:
| There's actually a lot of overlap between Hadoop and Arrow's
| origins - a lot of the projects that integrated early and the
| founding contributors had been in the larger Hadoop ecosystem.
| It's a very good sign IMO that you can hardly tell anymore -
| very diverse community and wide adoption!
| georgewfraser wrote:
| Arrow is _the most important_ thing happening in the data
| ecosystem right now. It's going to allow you to run your choice
| of execution engine, on top of your choice of data store, as
| though they were designed to work together. It will mostly be
| invisible to users; the key thing that needs to happen is that
| all the producers and consumers of batch data need to adopt Arrow
| as the common interchange format.
|
| BigQuery recently implemented the storage API, which allows you
| to read BQ tables, in parallel, in Arrow format:
| https://cloud.google.com/bigquery/docs/reference/storage
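|
| A rough sketch of what that looks like from Python (assuming the
| google-cloud-bigquery and google-cloud-bigquery-storage client
| libraries; the project, dataset and query here are made up):
|
|     from google.cloud import bigquery, bigquery_storage
|
|     bq = bigquery.Client(project="my-project")
|     bqstorage = bigquery_storage.BigQueryReadClient()
|     # to_arrow() pulls the result through the Storage API as
|     # Arrow record batches and hands back a pyarrow.Table.
|     tbl = bq.query(
|         "SELECT * FROM `my-project.my_dataset.events`"
|     ).to_arrow(bqstorage_client=bqstorage)
|     print(tbl.schema)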
|
| Snowflake has adopted Arrow as the in-memory format for their
| JDBC driver, though to my knowledge there is still no way to
| access data in _parallel_ from Snowflake, other than to export to
| S3.
|
| As Arrow spreads across the ecosystem, users are going to start
| discovering that they can store data in one system and query it
| in another, at full speed, and it's going to be amazing.
| wesm wrote:
| Microsoft is also on top of this with their Magpie project
|
| http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
|
| "A common, efficient serialized and wire format across data
| engines is a transformational development. Many previous
| systems and approaches (e.g., [26, 36, 38, 51]) have observed
| the prohibitive cost of data conversion and transfer,
| precluding optimizers from exploiting inter-DBMS performance
| advantages. By contrast, in-memory data transfer cost between a
| pair of Arrow-supporting systems is effectively zero. Many
| major, modern DBMSs (e.g., Spark, Kudu, AWS Data Wrangler,
| SciDB, TileDB) and data-processing frameworks (e.g., Pandas,
| NumPy, Dask) have or are in the process of incorporating
| support for Arrow and ArrowFlight. Exploiting this is key for
| Magpie, which is thereby free to combine data from different
| sources and cache intermediate data and results, without
| needing to consider data conversion overhead."
| data_ders wrote:
| way cool! Is magpie end-user facing anywhere yet? We were
| using the azureml-dataprep library for a while which seems
| similar but not all of magpie
| breck wrote:
| Arrow is definitely one of the top 10 new things I'm most
| excited about in the data science space, but not sure I'd call
| it _the_ most important thing. ;)
|
| It is pretty awesome, however, particularly for folks like me
| that are often hopping between Python/R/JavaScript. I've
| definitely got it on the roadmap for all my data science
| libraries.
|
| Btw, Arquero from that UW lab looks really neat as well, and is
| supporting Arrow out of the gate
| (https://github.com/uwdata/arquero).
| tristanz wrote:
| Agreed! Thank you Arrow community for such a great project.
| It's a long road but it opens up tremendous potential for
| efficient data systems that talk to one another. The future
| looks bright with so many vendors backing Arrow independently
| and Wes McKinney founding Ursa Labs and now Ursa Computing.
| https://ursalabs.org/blog/ursa-computing/
| rubicon33 wrote:
| Wait so you're telling me I could store data as a PDF file, and
| access it easily / quickly as SQL?
| sethhochberg wrote:
| If you found/wrote an adapter to translate your structured PDF
| into Arrow's format, yes - the idea is that you can wire up
| anything that can produce Arrow data to anything that can
| consume Arrow data.
| staticassertion wrote:
| I'm kinda confused. Is that not the case for literally
| everything? "You can send me data of format X, all I ask is
| that you be able to produce format X" ?
|
| I'm assuming that I'm missing something fwiw, not trying to
| diminish the value.
| humbleMouse wrote:
| The difference is that Arrow's mapping behind the scenes
| enables automatic translation to any implemented "plugin"
| available in the user's implementation of Arrow. You can
| extend Arrow's format to make it automatically map to
| whatever you want, basically.
| [deleted]
| shafiemukhre wrote:
| True. Arrow is awesome, and Dremio is using it as its built-in
| memory format as well. I tried it and it is incredibly fast. The
| future of the data ecosystem is gonna be amazing
| waynesonfire wrote:
| Uhh.. maybe. It's a serde that's trying to be cross-language /
| platform.
|
| I guess it also offers some APIs to process the data so you can
| minimize serde operations. But, I dunno. It's been hard to
| understand the benefit of the library, and the posts here don't
| help.
| chrisweekly wrote:
| For those wondering what a SerDe is:
| https://docs.serde.rs/serde/
| waynesonfire wrote:
| I meant serialize / deserialize in the literal sense.
| nevi-me wrote:
| The term likely predates the Rust implementation. SerDe is
| Serializer & Deserializer, which could be any framework or
| tool that allows the serialization and deserialization of
| data.
|
| I first came across the concept in Apache Hive.
| TuringTest wrote:
| If it works as a universal intermediate exchange language, it
| could help standardize connections among disparate systems.
|
| When you have N systems, it takes on the order of N^2
| translators (one per ordered pair) to build direct connections
| to transfer data between them; but it only takes N translators
| if all of them can talk the same exchange language. With 10
| systems, that's 90 direct translators versus 10.
| waynesonfire wrote:
| can you define what a translator is? I don't understand
| the complexity you're constructing. I have N systems and
| they talk protobuf. What's the problem?
| TuringTest wrote:
| By a translator, I mean a library that allows accessing
| data from different subsystems (either languages or OS
| processes).
|
| In this case, the advantages are that 1) Arrow is
| language agnostic, so it's likely that it can be used as
| a native library in your program and 2) it doesn't copy
| data to make it accessible to another process, so it
| saves a lot of marshalling / unmarshalling steps
| (assuming both sides use data in tabular format, which is
| typical of data analysis contexts).
| [deleted]
| wesm wrote:
| There's no serde by design (aside from inspecting a tiny
| piece of metadata indicating the location of each constituent
| block of memory). So data processing algorithms execute
| directly against the Arrow wire format without any
| deserialization.
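|
| A minimal pyarrow sketch of what that means in practice (the
| file name is made up): memory-map an Arrow IPC file and the
| columns are usable in place, with no decode pass:
|
|     import pyarrow as pa
|
|     # Nothing is parsed or copied here; the Table's columns
|     # point straight into the memory-mapped pages.
|     source = pa.memory_map("data.arrow", "r")
|     table = pa.ipc.open_file(source).read_all()
|     print(table.num_rows, table.schema)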
| waynesonfire wrote:
| Of course there is. There is always deserialization. The
| data format is most definitely not native to the CPU.
| wesm wrote:
| I challenge you to have a closer look at the project.
|
| Deserialization by definition requires bytes or bits to
| be relocated from their position in the wire protocol to
| other data structures which are used for processing.
| Arrow does not require any bytes or bits to be relocated.
| So if a "C array of doubles" is not native to the CPU,
| then I don't know what is.
| throwaway894345 wrote:
| Perhaps "zero-copy" is a more precise or well-defined
| term?
| waynesonfire wrote:
| CPUs come in many flavors. One area where they differ is
| in the way that bytes of a word are represented in
| memory. Two common formats are Big Endian and Little
| Endian. This is an example where a "C array of doubles"
| would be incompatible and some form of deserialization
| would be needed.
|
| My understanding is that an apache arrow library provides
| an API to manipulate the format in a platform agnostic
| way. But to claim that it eliminates deserialization is
| false.
| mumblemumble wrote:
| It's not just a serde. One of its key use cases is
| eliminating serde.
| waynesonfire wrote:
| I just don't believe you. My CPU doesn't understand Apache
| Arrow 3.0.
| mumblemumble wrote:
| So, there are several components to Arrow. One of them
| transfers data using IPC, and naturally needs to
| serialize. The other uses shared memory, which eliminates
| the need for serde.
|
| Sadly, the latter isn't (yet) well supported anywhere but
| Python and C++. If you can/do use it, though, data are
| just kept as arrays in memory. Which is exactly what
| the CPU wants to see.
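|
| For reference, the IPC side of that looks roughly like this in
| pyarrow (a sketch with made-up column names), writing a record
| batch stream into a buffer that another process could receive
| or map:
|
|     import pyarrow as pa
|
|     batch = pa.record_batch(
|         [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
|         names=["id", "label"])
|
|     # Write the batch in the Arrow IPC stream format...
|     sink = pa.BufferOutputStream()
|     writer = pa.ipc.new_stream(sink, batch.schema)
|     writer.write_batch(batch)
|     writer.close()
|     buf = sink.getvalue()
|
|     # ...and read it back; the arrays reference `buf` directly
|     # rather than being decoded into new data structures.
|     for b in pa.ipc.open_stream(buf):
|         print(b.num_rows)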
| kristjansson wrote:
| Not GP post, but it might have been better stated as
| 'eliminating serde overhead'. Arrow's RPC serialization
| [1] is basically Protobuf, with a whole lot of hacks to
| eliminate copies on both ends of the wire. So it's still
| 'serde', but markedly more efficient for large blocks of
| tabular-ish data.
|
| [1]: https://arrow.apache.org/docs/format/Flight.html
| wesm wrote:
| > Arrow's serialization is Protobuf
|
| Incorrect. Only Arrow Flight embeds the Arrow wire format
| in a Protocol Buffer, but the Arrow protocol itself does
| not use Protobuf.
| kristjansson wrote:
| Apologies, off base there. Edited with a pointer to
| Flight :)
| jrevels wrote:
| Excited to see this release's official inclusion of the pure
| Julia Arrow implementation [1]!
|
| It's so cool to be able to mmap Arrow memory and natively manipulate
| it from within Julia with virtually no performance overhead.
| Since the Julia compiler can specialize on the layout of Arrow-
| backed types at runtime (just as it can with any other type), the
| notion of needing to build/work with a separate "compiler for
| fast UDFs" is rendered obsolete.
|
| It feels pretty magical when two tools like this compose so well
| without either being designed with the other in mind - a
| testament to the thoughtful design of both :) mad props to Jacob
| Quinn for spearheading the effort to revive/restart Arrow.jl and
| get the package into this release.
|
| [1] https://github.com/JuliaData/Arrow.jl
| jayd16 wrote:
| Can someone dig into the pros and cons of the columnar aspect of
| Arrow? To some degree there are many other data transfer formats
| but this one seems to promote its columnar orientation.
|
| Things like eg. protobuffers support hierarchical data which
| seems like a superset of columns. Is there a benefit to a column
| based format? Is it an enforced simplification to ensure greater
| compatibility or is there some other reason?
| aldanor wrote:
| - Selective/lazy access (e.g. I have 100 columns but I want to
| quickly pull out/query just 2 of them).
|
| - Improved compression (e.g. a column of timestamps).
|
| - Flexible schemas being easy to manage (e.g. adding more
| columns, or optional columns).
|
| - Vectorization/SIMD-friendly.
| wging wrote:
| Redshift's explanation is pretty good. Among other things, if
| you only need a few columns there are entire blocks of data you
| don't need to touch at all.
| https://docs.aws.amazon.com/redshift/latest/dg/c_columnar_st...
|
| It's truly magical when you scope down a SELECT to the columns
| you need and see a query go blazing fast. Or maybe I'm easily
| impressed.
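|
| The same effect is easy to see with pyarrow and a Parquet file
| (file and column names made up): only the requested columns'
| blocks get read at all:
|
|     import pyarrow.parquet as pq
|
|     # Reads just two column chunks out of a possibly very wide file
|     table = pq.read_table("events.parquet",
|                           columns=["user_id", "ts"])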
| mumblemumble wrote:
| A columnar format is almost always what you want for analytical
| workloads, because their access patterns tend to iterate ranges
| of rows but select only a few columns at random.
|
| About the only thing protocol buffers has in common is that
| it's a standardized binary format. The use case is largely non-
| overlapping, though. Protobuf is meant for transmitting
| monolithic datagrams, where the entire thing will be
| transmitted and then decoded as a monolithic blob. It's also,
| out of the box, not the best for efficiently transmitting
| highly repetitive data. Column-oriented formats cut down on
| some repetition of metadata, and also tend to be more
| compressible because similar data tends to get clumped
| together.
|
| Coincidentally, Arrow's format for transmitting data over a
| network, Arrow Flight, uses protocol buffers as its messaging
| format. Though the payload is still blocks of column-oriented
| data, for efficiency.
| tomnipotent wrote:
| This is intended for analytical workloads where you're often
| doing things that can benefit from vectorization (like SIMD).
| It's much faster to SUM(X) when all values of X are neatly laid
| out in-memory.
|
| It also has the added benefit of eliminating serialization and
| deserialization of data between processes - a Python process
| can now write to memory which is read by a C++ process that's
| doing windowed aggregations, which are then written over the
| network to another Arrow compatible service that just copies
| the data as-is from the network into local memory and resumes
| working.
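|
| As a tiny illustration with pyarrow's compute kernels (the
| values here are made up):
|
|     import pyarrow as pa
|     import pyarrow.compute as pc
|
|     x = pa.array([1.5, 2.0, 3.25, 4.0])
|     # The kernel runs over one contiguous buffer of doubles,
|     # which is what makes SIMD-style execution straightforward.
|     print(pc.sum(x).as_py())   # 10.75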
| waynesonfire wrote:
| > It also has the added benefit of eliminating serialization
| and deserialization of data between processes
|
| Is that accurate? It still has to deserialize from apache
| arrow format to whatever the cpu understands.
| ianmcook wrote:
| The Arrow Feather format is an on-disk representation of
| Arrow memory. To read a Feather file, Arrow just copies it
| byte for byte from disk into memory. Or Arrow can memory-
| map a Feather file so you can operate on it without reading
| the whole file into memory.
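|
| A rough sketch with pyarrow (file and column names made up);
| Feather v2 is just the Arrow IPC file format on disk, so
| reading, updating a column, and writing back looks like:
|
|     import pyarrow.feather as feather
|     import pyarrow.compute as pc
|
|     # Memory-maps the file rather than reading it all into RAM
|     table = feather.read_table("measurements.feather",
|                                memory_map=True)
|
|     # Arrow data is immutable, so "add 1 to a column" produces
|     # a new array and a new table, which we write back out.
|     idx = table.schema.get_field_index("value")
|     bumped = table.set_column(idx, "value",
|                               pc.add(table.column("value"), 1))
|     feather.write_feather(bumped, "measurements_plus_one.feather")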
| waynesonfire wrote:
| That's exactly how I read every data format.
|
| The advantage you describe is in the operations that can
| be performed against the data. It would be nice to see what
| this API looks like and how it compares to flatbuffers /
| pq.
|
| To help me understand this benefit, can you talk through
| what it's like to add 1 to each record and write it back
| to disk?
| zten wrote:
| The important part to focus on is _between processes_.
|
| Consider Spark and PySpark. The Python bits of Spark are in
| a sidecar process to the JVM running Spark. If you ask
| PySpark to create a DataFrame from Parquet data, it'll
| instruct the Java process to load the data. Its in-memory
| form will be Arrow. Now, if you want to manipulate that
| data in PySpark using Python-only libraries, prior to the
| adoption of Arrow it used to serialize and deserialize the
| data between processes on the same host. With Arrow, this
| process is simplified -- however, I'm not sure if it's
| simplified by exchanging bytes that don't require
| serialization/deserialization between the processes or by
| literally sharing memory between the processes. The docs do
| mention zero-copy shared memory.
| [deleted]
| georgewfraser wrote:
| The main benefit of a columnar representation in _memory_ is
| it's more cache friendly for a typical analytical workload. For
| example, if I have a dataframe:
|
|     (A int, B int, C int, D int)
|
| And I write:
|
|     A + B
|
| In a columnar representation, all the As are next to each
| other, and all the Bs are next to each other, so the process of
| (A and B in memory) => (A and B in CPU registers) => (addition)
| => (A + B result back to memory) will be a lot more efficient.
|
| In a row-oriented representation like protobuf, all your C and
| D values are going to get dragged into the CPU registers
| alongside the A and B values that you actually want.
|
| Column-oriented representation is also more friendly to SIMD
| CPU instructions. You can still use SIMD with a row-oriented
| representation, but you have to use gather-scatter operations
| which makes the whole thing less efficient.
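|
| In pyarrow terms, that example is roughly (a sketch with
| made-up values):
|
|     import pyarrow as pa
|     import pyarrow.compute as pc
|
|     t = pa.table({"A": [1, 2, 3], "B": [10, 20, 30],
|                   "C": [7, 7, 7], "D": [9, 9, 9]})
|     # Only the contiguous A and B buffers are touched; the C and
|     # D columns never have to come through the cache at all.
|     print(pc.add(t.column("A"), t.column("B")))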
| skratlo wrote:
| Yay, another ad-tech support engine from Apache, great
| [deleted]
| dang wrote:
| If curious see also
|
| 2020 https://news.ycombinator.com/item?id=23965209
|
| 2018 (a bit) https://news.ycombinator.com/item?id=17383881
|
| 2017 https://news.ycombinator.com/item?id=15335462
|
| 2017 https://news.ycombinator.com/item?id=15594542 rediscussed
| recently https://news.ycombinator.com/item?id=25258626
|
| 2016 https://news.ycombinator.com/item?id=11118274
|
| Also: related from a couple weeks ago
| https://news.ycombinator.com/item?id=25824399
|
| related from a few months ago
| https://news.ycombinator.com/item?id=24534274
|
| related from 2019 https://news.ycombinator.com/item?id=21826974
| humbleMouse wrote:
| I worked at a large company a few years ago on a team
| implementing this. It's super cool and works great. Definitely
| where the future is headed
| mushufasa wrote:
| Can someone ELI5 what problems are best solved by apache arrow?
| hbcondo714 wrote:
| The press release on the first version of Apache Arrow
| discussed here 5 years ago is actually a good introduction IMO:
|
| https://news.ycombinator.com/item?id=11118274
| Diederich wrote:
| Rather curious myself.
|
| https://en.wikipedia.org/wiki/Apache_Arrow was interesting, but
| I think many of us would benefit from a broader, problem
| focused description of Arrow from someone in the know.
| foobarian wrote:
| To really grok why this is useful, put yourself in the shoes of
| a data warehouse user or administrator. This is a farm of
| machines with disks that hold data too big to have in a typical
| RDBMS. Because the data is so big and stored spread out across
| machines, a whole parallel universe of data processing tools
| and practices exists that run various computations over shards
| of the data locally on each server, and merge the results etc.
| (Map-reduce originally from Google was the first famous
| example). Then there is a cottage industry of wrappers around
| this kind of system to let you use SQL to query the data, make
| it faster, let you build cron jobs and pipelines with
| dependencies, etc.
|
| Now, so far all these tools did not really have a common
| interchange format for data, so there was a lot of wheel
| reinvention and incompatibility. Got file system layer X on
| your 10PB cluster? Can't use it with SQL engine Y. And I guess
| this is where Arrow comes in, where if everyone uses it then
| interop will get a lot better and each individual tool that
| much more useful.
|
| Just my naive take.
| adgjlsfhk1 wrote:
| The big thing is that it is one of the first standardized,
| cross-language binary data formats. CSV is an OK text format,
| but parsing it is really slow because of string escaping. The
| files it produces are also pretty big since it's text.
|
| Arrow is really fast to parse (up to 1000x faster than CSV),
| supports data compression, has enough data types to be useful, and
| deals with metadata well. The closest competitor is probably
| protobuf, but protobuf is a total pain to parse.
| MrPowers wrote:
| CSV is a file format and Arrow is an in-memory data format.
|
| The CSV vs Parquet comparison makes more sense. Conflating
| Arrow / Parquet is a pet peeve of Wes:
| https://news.ycombinator.com/item?id=23970586
| nolta wrote:
| To be fair, arrow absorbed parquet-cpp, so a little
| confusion is to be expected.
| aldanor wrote:
| The closest competitor would be HDF5, not Protobuf.
| makapuf wrote:
| Seems nice. How does it compare to hdf5?
| BadInformatics wrote:
| HDF5 is pretty terrible as a wire format, so it's not a 1-1
| comparison to Arrow. Generally people are not going to be
| saving Arrow data to disk either (though you can with the
| IPC format), but serializing to a more compact
| representation like Parquet.
| neovintage wrote:
| The premise around Arrow is that when you want to share data
| with another system, or even between processes on the same
| machine, most of the compute time is spent serializing and
| deserializing data. Arrow removes that step by defining a
| common columnar format that can be used in many different
| programming languages. There's more to Arrow than just the
| format, with things that make working with data even easier,
| like better over-the-wire transfers (Arrow Flight). How would
| this manifest for your customers using your applications?
| They'd likely see speeds increase. Arrow makes a lot of sense
| when working with lots of data in analytical or data science
| use cases.
| RobinL wrote:
| Yes.
|
| A second important point is the recognition that data tooling
| often re-implements the same algorithms again and again,
| often in ways which are not particularly optimised, because
| the in-memory representation of data is different between
| tools. Arrow offers the potential to do this once, and do it
| well. That way, future data analysis libraries (e.g. a
| hypothetical pandas 2) can concentrate on good API design
| without having to re-invent the wheel.
|
| And a third is that Arrow allows data to be chunked and
| batched (within a particular tool), meaning that computations
| can be streamed through memory rather than the whole
| dataframe needing to be stored in memory. A little bit like
| how Spark partitions data and sends it to different nodes for
| computation, except all on the same machine. This also
| enables parallelisation by default. With the growing core
| count of modern CPUs, this means Arrow is likely to be
| extremely fast.
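|
| Concretely, the streaming part looks something like this in
| pyarrow (a sketch, assuming an Arrow IPC stream file with a
| made-up name and column): batches are processed one at a time
| instead of materializing the whole dataframe:
|
|     import pyarrow as pa
|     import pyarrow.compute as pc
|
|     total = 0
|     with pa.OSFile("big_dataset.arrows", "rb") as f:
|         reader = pa.ipc.open_stream(f)
|         idx = reader.schema.get_field_index("amount")
|         for batch in reader:        # one RecordBatch at a time
|             total += pc.sum(batch.column(idx)).as_py()
|     print(total)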
| ianmcook wrote:
| Re this second point: Arrow opens up a great deal of
| language and framework flexibility for data engineering-
| type tasks. Pre-Arrow, common kinds of data warehouse ETL
| tasks like writing Parquet files with explicit control over
| column types, compression, etc. often meant you needed to
| use Python, probably with PySpark, or maybe one of the
| other Spark API languages. With Arrow now there are a bunch
| more languages where you can code up tasks like this, with
| consistent results. Less code switching, lower complexity,
| less cognitive overhead.
| BenoitP wrote:
| > most of the compute time spent is in serializing and
| deserializing data.
|
| This is to be viewed in light of how hardware evolves now. CPU
| compute power is no longer growing as much (at least for
| individual cores).
|
| But one thing that's still doubling on a regular basis is
| memory capacity of all kinds (RAM, SSD, etc) and bandwidth of
| all kinds (PCIe lanes, networking, etc). This divide is
| getting large and will only continue to increase.
|
| Which brings me to my main point:
|
| You can't be serializing/deserializing data on the CPU. What
| you want is to have the CPU coordinate the SSD to copy chunks
| directly - and as-is - to the NIC/app/etc.
|
| Short of having your RAM doing compute work*, you would be
| leaving performance on the table.
|
| ----
|
| * Which is starting to appear
| (https://www.upmem.com/technology/), but that's not quite
| there yet.
| jeffbee wrote:
| An interesting perspective on the future of computer
| architecture but it doesn't align well with my experience.
| CPUs are easier to build and although a lot of ink has been
| spilled about the end of Moore's Law, it remains the case
| that we are still on Moore's curve for number of
| transistors, and since about 15 years ago we are now also
| on the same slope for # of cores per CPU. We also still
| enjoy increasing single-thread performance, even if not at
| the rates of past innovation.
|
| DRAM, by contrast, is currently stuck. We need materials
| science breakthroughs to get beyond the capacitor aspect
| ratio challenge. RAM is still cheap but as a systems
| architect you should get used to the idea that the amount
| of DRAM per core will fall in the future, by amounts that
| might surprise you.
| jedbrown wrote:
| This is backward -- this sort of serialization is
| overwhelmingly bottlenecked on bandwidth (not CPU). (Multi-
| core) compute improvements have been outpacing bandwidth
| improvements for decades and have not stopped.
| Serialization is a bottleneck because compute is fast/cheap
| and bandwidth is precious. This is also reflected in the
| relative energy to move bytes being increasingly larger
| than the energy to do some arithmetic on those bytes.
| MrPowers wrote:
| Exactly. Some specific examples.
|
| Read a Parquet file into a Pandas DataFrame. Then read the
| Pandas DataFrame into a Spark DataFrame. Spark & Pandas are
| using the same Arrow memory format, so no serde is needed.
|
| See the "Standardization Saves" diagram here:
| https://arrow.apache.org/overview/
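|
| A sketch of that path (assuming PySpark 3.x with its Arrow-based
| transfer enabled; file and column names made up):
|
|     import pyarrow.parquet as pq
|     from pyspark.sql import SparkSession
|
|     spark = (SparkSession.builder
|              .config("spark.sql.execution.arrow.pyspark.enabled",
|                      "true")
|              .getOrCreate())
|
|     pdf = pq.read_table("events.parquet").to_pandas()
|     # With the Arrow path enabled, the pandas -> Spark transfer
|     # moves columnar Arrow batches instead of pickled rows.
|     sdf = spark.createDataFrame(pdf)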
| pindab0ter wrote:
| I hope you don't mind me asking dumb questions, but how
| does this differ from the role that say Protocol Buffers
| fills? To my ears they both facilitate data exchange. Are
| they comparable in that sense?
| arjunnarayan wrote:
| protobufs still get encoded and decoded by each client
| when loaded into memory. arrow is a little bit more like
| "flatbuffers, but designed for common data-intensive
| columnar access patterns"
| hackcasual wrote:
| Arrow does actually use flatbuffers for metadata storage.
| poorman wrote:
| https://github.com/apache/arrow/tree/master/format
| hawk_ wrote:
| is that at the cost of the ability to do schema
| evolution?
| rcoveson wrote:
| Better to compare it to Cap'n Proto instead. Arrow data
| is already laid out in a usable way. For example, an
| Arrow column of int64s is an 8-byte aligned memory region
| of size 8*N bytes (plus a bit vector for nullity), ready
| for random access or vectorized operations.
|
| Protobuf, on the other hand, would encode those values as
| variable-width integers. This saves a lot of space, which
| might be better for transfer over a network, but means
| that writers have to take a usable in-memory array and
| serialize it, and readers have to do the reverse on their
| end.
|
| Think of Arrow as standardized shared memory using
| struct-of-arrays layout, Cap'n Proto as standardized
| shared memory using array-of-structs layout, and Protobuf
| as a lightweight purpose-built compression algorithm for
| structs.
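|
| You can poke at that layout directly from pyarrow (a small
| sketch):
|
|     import pyarrow as pa
|
|     arr = pa.array([1, 2, None, 4], type=pa.int64())
|     validity, data = arr.buffers()
|     # `data` holds the int64 values back to back, 8 bytes each
|     # (padded for alignment); `validity` is a bitmap with one
|     # bit per value marking nulls.
|     print(validity.size, data.size)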
| jeffbee wrote:
| Protobuf provides the fixed64 type and when combined with
| `packed` (the default in proto3, optional in proto2)
| gives you a linear layout of fixed-size values. You would
| not get natural alignment from protobuf's wire format if
| you read it from an arbitrary disk or net buffer; to get
| alignment you'd need to move or copy the vector.
| Protobuf's C++ generated code provides RepeatedField that
| behaves in most respects like std::vector, but in as much
| as protobuf is partly a wire format and partly a library,
| users are free to ignore the library and use whatever
| code is most convenient to their application.
|
| TL;DR variable-width numbers in protobuf are optional.
| cornstalks wrote:
| > _Think of Arrow as standardized shared memory using
| struct-of-arrays layout, Cap'n Proto as standardized
| shared memory using array-of-structs layout_
|
| I just want to say thank you for this part of the
| sentence. I understand struct-of-arrays vs array-of-
| structs, and now I finally understand what the heck Arrow
| is.
| poorman wrote:
| Not only in between processes, but also in between languages
| in a single process. In this POC I spun up a Python
| interpreter in a Go process and pass the Arrow data buffer
| between processes in constant time.
| https://github.com/nickpoorman/go-py-arrow-bridge
| didip wrote:
| Question, doesn't Parquet already do that?
| ianmcook wrote:
| From https://arrow.apache.org/faq/: "Parquet files cannot
| be directly operated on but must be decoded in large
| chunks... Arrow is an in-memory format meant for direct and
| efficient use for computational purposes. Arrow data is...
| laid out in natural format for the CPU, so that data can be
| accessed at arbitrary places at full speed."
| snicker7 wrote:
| Yes. But parquet is now based on Apache Arrow.
| ianmcook wrote:
| Parquet is not based on Arrow. The Parquet libraries are
| built into Arrow, but the two projects are separate and
| Arrow is not a dependency of Parquet.
| CrazyPyroLinux wrote:
| This is not the greatest answer, but one anecdote: I am looking
| at it as an alternative to MS SSIS for moving batches of data
| around between databases.
| [deleted]
| RobinL wrote:
| I wrote a bit about this from the perspective of a data
| scientist here:
| https://www.robinlinacre.com/demystifying_arrow/
|
| I cover some of the use cases, but more importantly try and
| explain how it all fits together, justifying why - as another
| commenter has said - it's the most important thing happening
| in the data ecosystem right now.
|
| I wrote it because I'd heard a lot about Arrow, and even used
| it quite a lot, but realised I hadn't really understood what it
| was!
| juntan wrote:
| In Perspective (https://github.com/finos/perspective), we use
| Apache Arrow as a fast, cross-language/cross-network data
| encoding that is extremely useful for in-browser data
| visualization and analytics. Some benefits:
|
| - super fast read/write compared to CSV & JSON (Perspective and
| Arrow share an extremely similar column encoding scheme, so we
| can memcpy Arrow columns into Perspective wholesale instead of
| reading a dataset iteratively).
|
| - the ability to send Arrow binaries as an ArrayBuffer between
| a Python server and a WASM client, which guarantees
| compatibility and removes the overhead of JSON
| serialization/deserialization.
|
| - because Arrow columns are strictly typed, there's no need to
| infer data types - this helps with speed and correctness.
|
| - Compared to JSON/CSV, Arrow binaries have a super compact
| encoding that reduces network transport time.
|
| For us, building on top of Apache Arrow (and using it wherever
| we can) reduces the friction of passing around data between
| clients, servers, and runtimes in different languages, and
| allows larger datasets to be efficiently visualized and
| analyzed in the browser context.
| prepend wrote:
| I recently found it useful for the dumbest reason. A dataset
| was about 3GB as a CSV and 20MB as a parquet file created and
| consumed by arrow. The file also worked flawlessly across
| different environments and languages.
|
| So it's a good transport tool. It also happens to be fast to
| load and query, but I only used it because of the compact way
| it stores data without any hoops to jump through.
|
| Of course one might say that it's stupid to try to pass around
| a 3GB or 20MB file, but in my case I needed to do that.
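|
| For anyone curious, that conversion is only a few lines of
| pyarrow (names made up; the size ratio obviously depends on the
| data and the compression codec):
|
|     import pyarrow.csv as pv
|     import pyarrow.parquet as pq
|
|     table = pv.read_csv("data.csv")
|     # Columnar compression and encoding are where most of the
|     # size win over text CSV comes from.
|     pq.write_table(table, "data.parquet", compression="snappy")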
| IshKebab wrote:
| When you want to process large amounts of in-memory tabular
| data from different languages.
|
| You can save it to disk too using Apache Parquet but I
| evaluated Parquet and it is very immature. Extremely incomplete
| documentation and lots of Arrow features are just not supported
| in Parquet unfortunately.
| Lucasoato wrote:
| Do you mean the Parquet format? I don't think Parquet is
| immature, it is used in so many enterprise environments, it's
| is one of the few columnar file format for batch analysis and
| processing. It preforms so well... But I'm curious to know
| your opinion on this, so feel free to add some context to
| your position!
| IshKebab wrote:
| Yeah I do. For example, Apache Arrow supports in-memory
| compression, but Parquet does not support that. I had to
| look through the code to find that out, and I found many
| instances of basically `throw "Not supported"`. And yeah as
| I said the documentation is just non-existent.
|
| If you are already using Arrow, or you absolutely must use
| a columnar file format then it's probably a good option.
| BadInformatics wrote:
| Is that a problem in the Parquet format or in PyArrow? My
| understanding is that Parquet is primarily meant for on-
| disk storage (hence the default on-disk compression), so
| you'd read into Arrow for in-memory compression or IPC.
| liminal wrote:
| Would really love to see first-class support for
| JavaScript/TypeScript for data visualization purposes. The
| columnar format would naturally lend itself to an Entity-
| Component style architecture with TypedArrays.
| BadInformatics wrote:
| https://arrow.apache.org/docs/js/ has existed for a while and
| uses typed arrays under the hood. It's a bit of a chunky
| dependency, but if you're at the point where that level of
| throughput is required bundle size is probably not a big deal.
| liminal wrote:
| I was mostly looking at this:
| https://arrow.apache.org/docs/status.html
| BadInformatics wrote:
| Not sure I follow, that page indicates that JS support is
| pretty good for all but the more obscure features (e.g.
| decimals) and doesn't mention data visualization at all?
| Anyhow, I've successfully used
| https://github.com/vega/vega-loader-arrow for in-browser
| plots before, and Observable has a fair few notebooks
| showing how to use the JS API (e.g.
| https://observablehq.com/@theneuralbit/introduction-to-
| apach...)
| nevi-me wrote:
| Have you seen https://github.com/finos/perspective? Though they
| removed the JS library a few months ago in favour of a WASM
| build of the C++ library.
| anonyfox wrote:
| So if I understand this correctly from an application developers
| perspective:
|
| - for OLTP tasks, something row-based like SQLite is great:
| small to medium amounts of data, mixed reading/writing with
| transactions
|
| - for OLAP tasks, Arrow looks great: large amounts of data,
| faster querying (DataFusion) and more compact data files with
| Parquet.
|
| Basically prevent the operational database from growing too
| large, offload older data to arrow/parquet. Did I get this
| correct?
|
| Additionally there seem to be further benefits like sharing
| arrow/parquet with other consumers.
|
| Sounds convincing, I just have two very specific questions:
|
| - if I load a ~2GB collection of items into Arrow and query it
| with DataFusion, how much slower will this perform in
| comparison to my current Rust code that holds a large Vec in
| memory and "queries" via iter/filter?
|
| - if I want to move data from SQLite to a more permanent
| Parquet "archive" file, is there a better way than recreating
| the whole file or writing additional files, like appending?
|
| Really curious, could find no hints online so far to get an idea.
___________________________________________________________________
(page generated 2021-02-03 23:00 UTC)