[HN Gopher] Dsq: Commandline tool for running SQL queries agains...
___________________________________________________________________
Dsq: Commandline tool for running SQL queries against JSON, CSV,
Parquet, etc.
Author : eatonphil
Score : 233 points
Date : 2022-01-11 15:20 UTC (7 hours ago)
(HTM) web link (datastation.multiprocess.io)
(TXT) w3m dump (datastation.multiprocess.io)
| houqp wrote:
| Neat! I have also built a similar project in Rust
| https://github.com/roapi/roapi/tree/main/columnq-cli :)
| eatonphil wrote:
| Nice project!
| [deleted]
| nonbirithm wrote:
| There really ought to be a standard, language-agnostic way to
| query nested config file formats by identifier by now. For all
| the flak XML gets, at least it has XPath. YAML has YAML Path,
| but it's only supported by a single Python lib.
| emmanueloga_ wrote:
| Augeas comes to mind [1], although it seems like a pretty niche
| project (since you mention a "standard" ...).
|
| 1: http://augeas.net
| newman314 wrote:
| dsq references a benchmark done by q
| (https://github.com/harelba/q/blob/master/test/BENCHMARK.md) that
| indicates that octosql is significantly slower.
|
| However, octosql's GH repo claims otherwise.
|
| Does anyone have any real world experience that they can share on
| these tools?
| eatonphil wrote:
| Yeah, frankly, the q benchmark isn't the best, even though dsq
| doesn't compare terribly in it. It isn't well documented,
| exercises a very limited amount of functionality, and isn't
| very rigorous from what I can see. That said, the caching q
| does is likely very solid (and not something dsq does).
|
| The biggest risk (both in terms of SQL compliance and
| performance) I think with octosql (and cube2222 is here
| somewhere to disagree with me if I'm wrong) is that they have
| their own entire SQL engine whereas textql, q and dsq use
| SQLite. But q is also in Python whereas textql, octosql, and
| dsq are in Go.
|
| In the next few weeks I'll be posting some benchmarks that I
| hope are a little fairer (or at least well-documented and
| reproducible). Though of course it would be appropriate to have
| independent benchmarks too since I now have a dog in the fight.
|
| On a tangent, once the go-duckdb binding [0] matures I'd love
| to offer duckdb as an alternative engine flag within dsq (and
| DataStation). Would be neat to see.
|
| [0] https://github.com/marcboeker/go-duckdb
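A minimal sketch of the shared approach described above for textql, q, and dsq: load the rows into an in-memory SQLite table and let SQLite's engine handle the SQL. This is a simplified illustration in Python, not any of those tools' actual code:

```python
import json
import sqlite3

# Sample input in the array-of-objects shape these tools expect.
rows = json.loads('[{"name": "alice", "age": 40}, {"name": "bob", "age": 30}]')

# Create a table from the keys of the first object and insert every row.
db = sqlite3.connect(":memory:")
cols = list(rows[0].keys())
db.execute("CREATE TABLE t (%s)" % ", ".join(cols))
db.executemany(
    "INSERT INTO t VALUES (%s)" % ", ".join("?" for _ in cols),
    [tuple(r[c] for c in cols) for r in rows],
)

# SQLite's engine now provides the full query language for free.
result = db.execute("SELECT name FROM t WHERE age > 35").fetchall()
print(result)  # [('alice',)]
```

The trade-off eatonphil points at: you inherit SQLite's well-tested SQL compliance, but you pay the cost of loading everything into the database before the first query runs.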
| em500 wrote:
| Another interesting comparison would be clickhouse-local:
|
| https://altinity.com/blog/2019/6/11/clickhouse-local-the-
| pow...
| eatonphil wrote:
| Awesome post, thanks for the link. I had no clue Clickhouse
| could do this. Shows off some more CLI tools in there too.
|
| The big issue with ClickHouse is the incredibly non-
| standard SQL dialect and the random bugs that remain. It's
| an amazing project for analytics but you definitely have to
| be willing to hack around its SQL language (I say this as a
| massive fan of ClickHouse).
|
| I wonder: does this mean I can embed ClickHouse in
| arbitrary software as a library? I'd be curious to provide
| that as an option in dsq.
| em500 wrote:
| > I wonder: does this mean I can embed ClickHouse in
| arbitrary software as a library? I'd be curious to
| provide that as an option in dsq.
|
| Not sure, but it's Apache licensed, so you likely can
| make it work if you want to. But realize that
| clickhouse(-local) is much heavier than sqlite / duckdb
| based solutions: the compiled binary is around 200mb
| iirc.
| eatonphil wrote:
| Ah yeah that's pretty large.
| cube2222 wrote:
| Hey there! OctoSQL author here.
|
| I've released OctoSQL v0.4.0 recently[0] which is a grand
| rewrite and is 2-3 orders of magnitude faster than OctoSQL
| was before. It's also much more robust overall with static
| typing and the plugin system for easy extension. The q
| benchmarks are over a year old and haven't been updated to
| reflect that yet.
|
| Take a look at the README[1] for more details.
|
| My benchmarks should be reproducible, you can find the script
| in the benchmarks/ repo directory.
|
| Btw, since we're already talking, @eatonphil I'd appreciate you
| updating your benchmarks to reflect these changes.
|
| As far as the custom query engine goes - yes, there are both
| pros and cons to that. In the case of OctoSQL I'm building
| something that's much more dynamic - a full-blown dataflow
| engine - and can subscribe to multiple datasources to
| dynamically update queries as source data changes. This also
| means it can support streaming datasources. That is not
| possible with the other systems. It also means I don't have
| to load _everything_ into a SQLite database before querying -
| I can optimize which columns I need to even read.
|
| OctoSQL also lets you work with actual databases in your
| queries - like PostgreSQL or MySQL - and pushes predicates
| down to them, so it doesn't have to dump whole database
| tables. That's useful if you need to do cross-db joins, or
| JSON-file-with-database joins.
|
| As far as SQL compliance goes it gets hairy in the details -
| as usual. The overall dialect is based on MySQL, as I'm using
| a parser based on vitess's, but I don't support some syntax,
| and add original syntax too (type assertions, temporal
| extensions, object and type notation).
|
| Stay tuned for a lot more supported datasources, as the
| plugin system lets me work on that much quicker.
|
| [0]:https://github.com/cube2222/octosql/releases/tag/v0.4.0
|
| [1]:https://github.com/cube2222/octosql#readme
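A rough sketch of the predicate pushdown idea described above: fold supported filters into the query sent to the source database, so entire tables never have to be dumped to the client. The function and names here are invented for illustration and are not OctoSQL's internals:

```python
# Hypothetical pushdown helper: given a table, a column projection, and
# simple (column, op, value) predicates, build the SQL that the source
# database (e.g. PostgreSQL) will execute on its side.

def build_remote_query(table, columns, predicates):
    """Return the query string sent to the remote database, with the
    predicates pushed down into its WHERE clause."""
    sql = "SELECT %s FROM %s" % (", ".join(columns), table)
    if predicates:
        clauses = ["%s %s %s" % (col, op, repr(val)) for col, op, val in predicates]
        sql += " WHERE " + " AND ".join(clauses)
    return sql

# Only two columns are fetched and the filter runs inside the database,
# so non-matching rows never cross the wire.
q = build_remote_query("users", ["name", "age"], [("age", ">", 35)])
print(q)  # SELECT name, age FROM users WHERE age > 35
```

The same shape explains the column-pruning point: because the engine knows which columns a query touches, it can project them at the source instead of reading everything.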
| eatonphil wrote:
| Neat! I'll give your script a shot.
| eatonphil wrote:
| I shared the first version of this in a Show HN last month [0].
| The big update since then is that it now supports loading
| multiple files and doing SQL joins on them. You can see examples
| of that in this post.
|
| The repo is here [1] and the README has a comparison against some
| of the other great tools in the space like q, octosql, and
| textql.
|
| [0] https://news.ycombinator.com/item?id=29643835
|
| [1] https://github.com/multiprocessio/dsq
| data_ders wrote:
| way cool Phil! I've been struggling to grok the value prop of
| DataStation for a while now, but dsq crystallizes it for me!
| eatonphil wrote:
| Interesting! And thanks! It has definitely not been
| straightforward to explain. Mostly my fault for not working on
| tutorials, docs, and videos. Thankfully DataStation
| performance, features, and install time are now in a pretty
| good spot, so I am going to be focusing more on the education
| side.
| avmich wrote:
| LINQ went the other way - a language for queries against (among
| other things) relational databases, different than SQL.
|
| I wonder what, say, jq would look like working with tables in an
| RDBMS.
| rahimiali wrote:
| If you're curious how they've adapted SQL to query JSON, here is
| the trick: The JSON "Must be an array of objects. Nested object
| fields are ignored".
| chrisweekly wrote:
| See also "lnav" -- https://lnav.org -- for an awesome CLI
| powertool with embedded SQLite.
| cschneid wrote:
| Thank you for this, it's a tool I didn't realize I was missing.
| da39a3ee wrote:
| I really like this idea. jq is great too, but the jq language
| just seems to be too different from anything else for me, so it
| only works if I use it every day, which I don't. But I guess an
| issue is that I'll still need to use jq in order to get my data
| into the suitable array-of-objects format for dsq?
| eatonphil wrote:
| For now, yeah. Eventually I'd like to support pre-transforms or
| something within dsq itself.
| wolfi1 wrote:
| just for clarification: one needs datastation in order for dsq to
| run? so dsq is not self-contained? (I'm not familiar with the Go
| eco system)
| eatonphil wrote:
| Good question! No you don't need to install DataStation to run
| dsq. dsq just imports Go libraries from DataStation during
| compile-time.
|
| dsq is a totally standalone binary.
| thriftwy wrote:
| I dream of a database which would use CSV (with metadata/indices
| in CSV comment lines) as its storage. You can even use commented-
| out padding to fit data to blocks, etc.
|
| Imagine when you don't have to dump/convert data because you can
| always open it with OpenOffice.
| bachmeier wrote:
| What I've done in the past is:
|
| Column 1 is the creation timestamp
|
| Column 2 is the modification timestamp
|
| On read, you update if the creation timestamp exists but the
| modification timestamp is later, otherwise you insert. Your
| app(s) can do a simple file append for writes. You even have
| the full version history at every point in time.
|
| I did a lot of looking but didn't find a command-line tool that
| automated this process. It works fine for small projects of
| e.g. 100,000 records. Wouldn't work well for things like a
| notes app, because you'd be storing every modification as a new
| entry.
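The append-only scheme bachmeier describes can be sketched in a few lines: column 1 (creation timestamp) acts as the record's identity, column 2 (modification timestamp) versions it, and a read keeps only the newest version of each record. A minimal illustration, not the commenter's actual code:

```python
import csv
import io

# Append-only CSV: writes are simple file appends; edits append a new
# row with the same creation timestamp and a later modification time.
raw = io.StringIO(
    "created,modified,note\n"
    "100,100,first draft\n"
    "200,200,second record\n"
    "100,150,first draft edited\n"  # later modification of record 100
)

# On read, keep the latest modification per creation timestamp.
latest = {}
for row in csv.DictReader(raw):
    key = row["created"]
    if key not in latest or row["modified"] > latest[key]["modified"]:
        latest[key] = row

notes = sorted(r["note"] for r in latest.values())
print(notes)  # ['first draft edited', 'second record']
```

Because old rows are never rewritten, the file also doubles as a full version history, which is exactly why it bloats for write-heavy workloads like a notes app.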
| remcinerney wrote:
| That's interesting. I'm trying to imagine your workflow, and
| thinking about what serverless SQL platforms like Amazon Athena
| let you do now - i.e., you can more or less dump CSV files in
| blob storage and query them. Is that what you meant?
| LittlePeter wrote:
| In PostgreSQL you can use file_fdw extension:
|
| https://www.postgresql.org/docs/current/file-fdw.html
| bachmeier wrote:
| I think the parent comment was about reading and writing csv
| files. The documentation you linked says "Access to data
| files is currently read-only."
| LittlePeter wrote:
| I missed that. You are right.
| mr_toad wrote:
| PrestoDB. Or one of the many SaaS offerings like Athena,
| BigQuery or USQL.
| cerved wrote:
| I'm not sure I share that dream...
|
| CSV is so horribly non-standardized and such a pain to parse.
| JSON seems a much more suitable candidate.
| sundarurfriend wrote:
| recfiles [1] are a sort of poor man's plaintext database that
| you can edit with any text editor. I found manually entering
| the data mildly annoying and repetitive, but visidata [2]
| supports the format, so I've been meaning to learn to use that
| for easier data entry and for tabular visualization.
|
| [1]
| https://www.gnu.org/software/recutils/manual/recutils.html#I...
| [2] https://www.visidata.org/
|
| Bonus: A nice concise intro that someone wrote last week:
| https://chrismanbrown.gitlab.io/28.html
| laumars wrote:
| AWS Athena can do this. Dump your CSVs in S3 and query them in
| Athena.
|
| Personally I use sqlite for smaller datasets and wrap that
| around a CSV importer.
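The "sqlite wrapped around a CSV importer" approach above can be sketched with Python's standard library alone (a simplified illustration, not the commenter's actual setup):

```python
import csv
import io
import sqlite3

# Read the CSV header to define the table, bulk-insert the rows, then
# query with ordinary SQL.
raw = io.StringIO("city,pop\nparis,2100000\noslo,700000\n")
reader = csv.reader(raw)
header = next(reader)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE csvdata (%s)" % ", ".join(header))
db.executemany("INSERT INTO csvdata VALUES (?, ?)", reader)

# CSV values arrive as text, so CAST before numeric comparisons.
big = db.execute(
    "SELECT city FROM csvdata WHERE CAST(pop AS INTEGER) > 1000000"
).fetchall()
print(big)  # [('paris',)]
```

The CAST is the gotcha of this approach: everything a CSV importer loads is a string unless you declare column types or convert explicitly.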
| thriftwy wrote:
| Why dump CSVs when you can outright store in them?
| laumars wrote:
| Sorry, I don't understand your question. But just in case
| this answers it:
|
| S3 is cloud storage on AWS. Athena can work directly off
| the CSVs stored on S3.
|
| Where I said "dump" it was just a colourful way of saying
| "transfer your files to...". I appreciate "dump" can also
| mean different things with databases, so maybe that wasn't
| the best choice of word on my part. Sorry for any confusion
| there.
___________________________________________________________________
(page generated 2022-01-11 23:00 UTC)