[HN Gopher] Apache Arrow Flight SQL: Accelerating Database Access
___________________________________________________________________
Apache Arrow Flight SQL: Accelerating Database Access
Author : ionychal
Score : 82 points
Date : 2022-02-16 14:55 UTC (8 hours ago)
(HTM) web link (arrow.apache.org)
(TXT) w3m dump (arrow.apache.org)
| carlineng wrote:
| Mark Raasveldt and Hannes Muhleisen at CWI wrote about these
| challenges in their 2017 paper Don't Hold My Data Hostage [0].
| It's great to see progress on this front.
|
| [0] https://ir.cwi.nl/pub/26415
| eatonphil wrote:
| tldr; it's a new wire protocol that columnar databases can opt
| into that is better optimized for columnar data than ODBC/JDBC.
|
| Key quotes: While it can be directly used for
| database access, it is not a direct replacement for JDBC/ODBC.
| Instead, Flight SQL serves as a concrete wire protocol/driver
| implementation that can support a JDBC/ODBC driver and reduces
| implementation burden on databases. Flight SQL means
| database servers can implement a standard interface that is
| designed around Apache Arrow and columnar data from the start.
| Just like how Arrow provides a standard in-memory format, Flight
| SQL saves developers from having to design and implement an
| entirely new wire protocol. As mentioned, Flight already
| implements features like encryption on the wire and
| authentication of requests, which databases do not need to re-
| implement.
| gavinray wrote:
| This is true, but to clarify and expand a bit, you add support
| for FlightSQL to some datasource by adding Protobuf RPC
| handlers to it. This doesn't necessarily have to run in-
| process, the FlightSQL RPC service might run side-by-side.
|
| The datasource also doesn't necessarily have to be a database.
| It just needs to implement the RPC methods. So you could
| implement your own data-access layer over any datasource (flat
| file, an external API, a regular DBMS, etc) by providing an
| implementation for "get all schemas", "get all tables",
| "execute a query", etc.
|
| Some useful quotes: > The idea is that the
| driver is a "one-size-fits-all" driver -- a user or tool vendor
| only needs to supply a generic driver that can connect to an
| infinite number of databases. This is even future-proof -- if a
| new database comes out, it can work with existing tools as long
| as an Arrow Flight SQL endpoint is provided. In fact, by adding
| an Arrow Flight SQL endpoint they would automatically enable
| JDBC connectivity too. > Not only will Arrow Flight
| SQL reduce the technical burden on applications and users, but
| it leverages Arrow, which means it will provide better
| performance. ... having a single reference JDBC driver allows
| any data source that adds an Arrow Flight SQL endpoint to get
| JDBC "for free" as an onramp. So the selling point is add an
| Arrow Flight SQL endpoint to your data source and automatically
| get JDBC connectivity.
|
| I have been experimenting in my free time with building a
| platform that autogenerates GraphQL CRUD API's on top of
| arbitrary datasources.
|
| It lets you do federated/distributed queries and cross-
| databsource joins.
|
| So you can connect it to a Postgres DB, a MySQL DB, a MongoDB,
| and a CSV file and then run a query that joins and filters data
| across them in a single operation.
|
| I am using Apache Calcite for this, but am interested in
| potentially offloading the query plan execution using
| FlightSQL/Substrait[0] for better performance, since I am
| targeting OLTP workloads and it's latency-sensitive.
|
| https://github.com/GavinRay97/GraphQLCalcite
|
| [0] https://substrait.io/
| lmeyerov wrote:
| Exciting, remember whiteboard conversations about this years ago,
| and benchmarking early ideas like Turbodc :)
|
| I could not find: Is there a list of conformant DBs & their
| implementation statuses? Been curious both for Flight and Flight
| SQL....
| lidavidm wrote:
| There is not a list (though right now it is zero,
| unfortunately); Dremio has been doing some work around this
| (and contributed the spec and implementation) though I can't
| speak for their plans.
|
| Just a minor distinction - Flight is lower-level and might be
| used to build a wire protocol for a database, but a "database
| supporting Flight" still requires something like Flight SQL to
| specify how the database expects to serve requests over Flight.
| pauldix wrote:
| This is great and makes a ton of sense as a refinement to their
| work on Flight previously. InfluxDB IOx already supports Flight,
| but we'll almost certainly be updating to support Flight SQL
| before we launch. We've been thinking about adding Postgres wire
| protocol support, but this would be even better if enough
| downstream clients and projects end up adding SQL Flight support.
|
| Great work to the Arrow team!
| pauldix wrote:
| Oh, and for anyone interested in pitching in on the Rust
| implementation, there's an issue logged here along with some
| discussion: https://github.com/apache/arrow-rs/issues/1323
| cogman10 wrote:
| I'm not tuned into Arrow all that much. I've read some of the
| about and stuff, but the code examples (to my eye) look really
| complex and complicated. [1]
|
| Could someone point me to a more glossy "arrow flight sql for
| dummies" examples? What I'm gleaning from this (or am I wrong?)
| is you could use a JDBC driver + arrow jdbc client and write...
| SQL? Or is it something a lot different?
|
| Is this the sort of thing where you could just add a plugin to
| postgres and be arrowified or something?
|
| [1]
| https://github.com/apache/arrow/blob/release-7.0.0/java/flig...
| lidavidm wrote:
| There are several community members actively working on making
| things more accessible in general. (Flight SQL definitely needs
| some polish here.)
|
| For Arrow itself, you could start with the Cookbook [1] (and
| feel free to file issues if there are things that are unclear,
| or that you would like to learn more about [2])
|
| The example there is very long, but much of it is repetitive
| (that said, it could certainly be improved). But to your
| specific question, what you would eventually do is write a
| plugin or proxy for Postgres (or another database) implementing
| these RPC methods. Then you could submit SQL queries using the
| Flight SQL client (which calls those RPC methods internally) to
| get back columnar data. You could wrap this all in a JDBC
| driver as well (which is being actively worked on [3]), in
| which case you'd be mostly ignorant of what's going on
| underneath; the benefit there would (hopefully) be faster data
| access due to the lower layers being columnar (and zero-copy
| where possible), and enabling parallelism due to Flight
| (Postgres might not support this for instance, but other
| databases could, if they could expose internal parallelism or
| partitioning). Hope that helps.
|
| [1]: https://arrow.apache.org/cookbook/ [2]:
| https://github.com/apache/arrow-cookbook/ [3]:
| https://github.com/apache/arrow/pull/12254
| cogman10 wrote:
| Ok, I think I'm getting it. Then some of the remaining work
| (it sounds like) is getting that flight sql proxy/plugin
| setup for 3rd party datastores but once that is done you
| could hook up and run either flight directly through the
| flight client or via ODBC/JDBC wrappers that make it feel a
| little more sqly.
___________________________________________________________________
(page generated 2022-02-16 23:00 UTC)