[HN Gopher] Run SQL on CSV, Parquet, JSON, Arrow, Unix Pipes and...
___________________________________________________________________
Run SQL on CSV, Parquet, JSON, Arrow, Unix Pipes and Google Sheet
Author : houqp
Score : 161 points
Date : 2022-09-24 15:59 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| mmastrac wrote:
| The one thing everyone here is missing so far is that it's a Rust
| binary, distributed on PyPI. That's brilliant.
| jonahx wrote:
| Can you explain the advantages of this vs cargo?
| proto_lambda wrote:
| cargo is not a binary distribution.
| houqp wrote:
| Most users already have pip installed, so they won't need to
| install a rust toolchain.
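The pattern discussed here (and in simonw's linked post) is shipping a native executable inside a platform-tagged wheel, fronted by a thin Python launcher that locates the bundled binary and exec's it. A minimal sketch of such a launcher follows; the names, the `bin/` layout, and the function signatures are hypothetical illustrations, not ROAPI's actual packaging code:

```python
import os
import sys
import sysconfig
from pathlib import Path


def bundled_binary_name(base: str = "columnq") -> str:
    """Name of the native binary shipped inside the wheel.

    Windows wheels carry an .exe suffix; other platforms ship a bare
    executable selected at install time by the wheel's platform tag.
    """
    suffix = ".exe" if sys.platform == "win32" else ""
    return base + suffix


def launch(argv: list[str]) -> None:
    """Replace the current process with the bundled binary (sketch only)."""
    binary = Path(__file__).parent / "bin" / bundled_binary_name()
    os.execv(str(binary), [str(binary), *argv])


# pip picks the wheel whose tag matches the local platform, e.g.:
print(sysconfig.get_platform())  # e.g. "linux-x86_64" or "macosx-11.0-arm64"
```

The upshot, as houqp notes, is that users need only `pip install`; no Rust toolchain is ever invoked on their machine.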
| simonw wrote:
| I wrote a bit about this pattern here:
| https://simonwillison.net/2022/May/23/bundling-binary-tools-...
| einpoklum wrote:
| You can get a statically-linked binary release from GitHub
| which depends on nothing (I think).
| samwillis wrote:
| I'm all in on using PyPI for binary distribution. Couple that
| with a Python venv and you have a brilliant system for
| per-project dependencies.
|
| I created this project for distributing Node via PyPI:
| https://pypi.org/project/nodejs-bin/
| henrydark wrote:
| It is pretty cool. py-spy has also been doing this for a few
| years
|
| https://github.com/benfred/py-spy
| playingalong wrote:
| Bye bye jq and your awful query syntax.
| gavinray wrote:
| 1) roapi is built with some wicked cool tech
|
| 2) the author once answered some questions I posted on
| Datafusion, so they're cool in my book
|
| Those are my anecdotes.
| tootie wrote:
| AWS Athena offers something similar. You can build tables off of
| structured text files (like log files) in S3 and run SQL queries.
| ramraj07 wrote:
| What's the performance like though?
| bachmeier wrote:
| As I commented on a recent similar discussion, these tools can't
| be used for update or insert. As useful as querying might be,
| it's terribly misleading to claim to "run SQL" if you can't
| change the data, since that's such a critical part of an SQL
| database.
| TAForObvReasons wrote:
| The title is an editorialization. The project is very careful
| to emphasize that it is for reading data:
|
| > Create full-fledged APIs for slowly moving datasets without
| writing a single line of code.
|
| Even the name of the project "ROAPI" has "read only" in the
| name.
| gavinray wrote:
| Question: I've built something that supports full CRUD, and
| queries that span multiple data sources with optimization and
| pushdown.
|
| What kind of headline would make you want to read/try such a
| thing?
|
| (I'm planning on announcing it + releasing code on HN but have
| never done so before)
| porker wrote:
| Show HN: Read and update Arrow, Parquet and xxxx files using
| SQL
| gavinray wrote:
| It works on databases and arbitrary data sources too though
| tomrod wrote:
| 90% of SQL usage, or more, is select in slowly changing data
| contexts.
| andygrove wrote:
| I think it is worth pointing out that this tool does support
| querying Delta Lake (the author of ROAPI is also a major
| contributor to the native Rust implementation of Delta Lake).
| Delta Lake certainly supports transactions, so ROAPI can query
| transactional data, although the writes would not go through
| ROAPI.
| mgradowski wrote:
| What you're really saying is that the database presented in OP
| is not useful because it only handles DQL.
|
| 1. SQL can be thought of as being composed of several smaller
| languages: DDL, DQL, DML, DCL.
|
| 2. columnq-cli is only a CLI to a query engine, not a database.
| As such, it only supports DQL by design.
|
| 3. I have the impression that outside of data engineering/DBA,
| people are rarely taught the distinction between OLTP and OLAP
| workloads [1]. The latter often utilizes immutable data
| structures (e.g. columnar storage with column compression), or
| provides limited DML support, see e.g. the limitations of the
| DELETE statement in ClickHouse [2], or the list of supported
| DML statements in Amazon Athena [3]. My point -- as much as
| this tool is useless for transactional workloads, it is
| perfectly capable of some analytical workloads.
|
| [1] Opinion, not a fact.
|
| [2] https://clickhouse.com/docs/en/sql-
| reference/statements/dele...
|
| [3] https://docs.aws.amazon.com/athena/latest/ug/functions-
| opera...
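The DDL/DQL/DML split in the comment above can be made concrete with a few statements against an in-memory database; a read-only query engine like columnq-cli implements only the last category. (sqlite3 is used here purely for illustration; it is not what columnq uses underneath.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the schema
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT)")

# DML: mutate data -- the part a read-only query engine omits by design
conn.executemany("INSERT INTO events (kind) VALUES (?)",
                 [("click",), ("view",), ("click",)])

# DQL: read data -- the only category a tool like columnq-cli needs
rows = conn.execute(
    "SELECT kind, COUNT(*) FROM events GROUP BY kind ORDER BY kind"
).fetchall()
print(rows)  # [('click', 2), ('view', 1)]
```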
| ebfe1 wrote:
| This is cool... It reminded me of several similar tools that
| have popped up on HN every now and then, so I did a quick
| search:
|
| clickhouse-local - https://news.ycombinator.com/item?id=22457767
|
| q - https://news.ycombinator.com/item?id=27423276
|
| textql - https://news.ycombinator.com/item?id=16781294
|
| simpql - https://news.ycombinator.com/item?id=25791207
|
| We need a benchmark, I think ;)
| tanin wrote:
| Shameless plug. A desktop app: https://superintendent.app
| skybrian wrote:
| Looks like it also supports SQLite for input, but not for output.
| That might be a nice addition.
| whimsicalism wrote:
| Trino can do this as well.
| cube2222 wrote:
| This looks really cool! Using DataFusion underneath means it's
| probably blazing fast.
|
| If you like this, I recommend taking a look at OctoSQL[0], which
| I'm the author of.
|
| It's plenty fast, and it's easy to add new data sources as
| external plugins.
|
| It can also handle endless streams of data natively, so you can
| do running groupings on e.g. tailed JSON logs.
|
| Additionally, it's able to push down predicates to the database
| below, so if you're selecting 10 rows from a 1 billion row table,
| it'll just get those 10 rows instead of getting them all and
| filtering in memory.
|
| [0]: https://github.com/cube2222/octosql
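Predicate pushdown, as described in the comment above, can be sketched in a few lines: instead of fetching every row and filtering locally, the engine rewrites the filter into the source's own query language so only matching rows cross the wire. This is a simplified illustration with sqlite3 standing in for the remote database; OctoSQL's real planner is far more involved, and the function below is invented for the sketch.

```python
import sqlite3

# A "remote" table with many rows, standing in for the 1-billion-row case.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE big (id INTEGER, val TEXT)")
source.executemany("INSERT INTO big VALUES (?, ?)",
                   [(i, f"row{i}") for i in range(100_000)])


def scan_with_pushdown(predicate_sql: str, params: tuple):
    """Push the WHERE clause down to the source instead of filtering here.

    The naive alternative -- SELECT * then filter in memory -- would
    materialize all 100,000 rows before discarding nearly all of them.
    """
    query = f"SELECT id, val FROM big WHERE {predicate_sql}"
    return source.execute(query, params).fetchall()


# Only the 10 matching rows are materialized, not the whole table.
rows = scan_with_pushdown("id < ?", (10,))
print(len(rows))  # 10
```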
___________________________________________________________________
(page generated 2022-09-24 23:00 UTC)