[HN Gopher] DuckDB as the New jq
___________________________________________________________________
DuckDB as the New jq
Author : pgr0ss
Score : 93 points
Date : 2024-03-21 18:19 UTC (4 hours ago)
(HTM) web link (www.pgrs.net)
(TXT) w3m dump (www.pgrs.net)
| sshine wrote:
| Very cool!
|
| I am also a big fan of jq.
|
| And I think using DuckDB and SQL probably makes a lot of sense in
| a lot of cases.
|
| But I think the examples are heavily geared towards problems that
| are better solved in SQL.
|
| The ideal jq examples are combinations of filter (select), map
| (map) and concat (.[]).
|
| For example, finding the right download link:
|
|     $ curl -s https://api.github.com/repos/go-gitea/gitea/releases/latest \
|       | jq -r '.assets[] | .browser_download_url | select(endswith("linux-amd64"))'
|     https://github.com/go-gitea/gitea/releases/download/v1.15.7/gitea-1.15.7-linux-amd64
|
| Or extracting the KUBE_CONFIG of a DigitalOcean Kubernetes
| cluster from Terraform state:
|
|     $ jq -r '.resources[]
|         | select(.type == "digitalocean_kubernetes_cluster")
|         | .instances[].attributes.kube_config[].raw_config' \
|       terraform.tfstate
|     apiVersion: v1
|     kind: Config
|     clusters:
|     - cluster:
|         certificate-authority-data: ...
|         server: https://...k8s.ondigitalocean.com
|     ...
| pgr0ss wrote:
| I think that's a fair point. Unnesting arrays in SQL can be
| annoying. Here is your first example with duckdb:
|     duckdb -c \
|       "select * from ( \
|          select unnest(assets)->>'browser_download_url' as url \
|          from read_json('https://api.github.com/repos/go-gitea/gitea/releases/latest') \
|        ) \
|        where url like '%linux-amd64'"
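| For comparison, the same endswith filter is just as easy from plain
| Python. A minimal sketch against a stand-in payload (not the live
| GitHub API response):

```python
import json

# Stand-in for the GitHub releases API response (illustrative only).
payload = json.loads("""
{
  "assets": [
    {"browser_download_url": "https://example.com/gitea-1.15.7-darwin-10.12-amd64"},
    {"browser_download_url": "https://example.com/gitea-1.15.7-linux-amd64"}
  ]
}
""")

# Equivalent of: .assets[] | .browser_download_url | select(endswith("linux-amd64"))
urls = [
    asset["browser_download_url"]
    for asset in payload["assets"]
    if asset["browser_download_url"].endswith("linux-amd64")
]
print(urls)
```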
| hprotagonist wrote:
| i've been using simonw's sqlite-utils (https://sqlite-
| utils.datasette.io/en/stable/) for this sort of thing; given
| structured json or jsonl, you can throw data at an in-memory
| sqlite database and query away: https://sqlite-
| utils.datasette.io/en/stable/cli.html#queryin...
| NortySpock wrote:
| In a similar vein, I have found Benthos to be an incredible
| swiss-army-knife for transforming data and shoving it either into
| (or out of) a message bus, webhook, or a database.
|
| https://www.benthos.dev/
| krembo wrote:
| How does this differ from Filebeat?
| pletnes wrote:
| Worth noting that both jq and duckdb can be used from python and
| from the command line. Both are very useful data tools!
| haradion wrote:
| I've found Nushell (https://www.nushell.sh/) to be really handy
| for ad-hoc data manipulation (and a decent enough general-purpose
| shell).
| HellsMaddy wrote:
| Jq tip: Instead of `sort_by(.count) | reverse`, you can do
| `sort_by(-.count)`
| philsnow wrote:
| only if you're sure that .count is never null:
|     $ echo '[{"a": {"count": null}}]' | jq -c 'sort_by(-.count)'
|     jq: error (at <stdin>:1): null (null) cannot be negated
|
|     $ echo '[{"a": {"count": null}}]' | jq -c 'sort_by(.count) | reverse'
|     [{"a":{"count":null}}]
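| A null-safe variant is to coalesce before negating, e.g.
| `sort_by(-(.count // 0))` in jq. The same idea in plain Python
| (a sketch with made-up rows, not data from the thread):

```python
# Descending sort that tolerates missing or null counts by
# coalescing them to 0 before negating.
rows = [{"count": 3}, {"count": None}, {"count": 7}]

rows_desc = sorted(rows, key=lambda r: -(r.get("count") or 0))
print(rows_desc)
```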
| jeffbee wrote:
| I tried this and it just seems to add bondage and discipline that
| I don't need on top of what is, in practice, an extremely chaotic
| format.
|
| Example: trying to pick one field out of 20000 large JSON files
| that represent local property records.
|
|     % duckdb -json -c "select apn.apnNumber from read_json('*')"
|     Invalid Input Error: JSON transform error in file "052136400500",
|     in record/value 1: Could not convert string
|     'fb1b1e68-89ee-11ea-bc55-0242ad1302303' to INT128
|
| Well, I didn't want that converted. I just want to ignore it.
| This has been my experience overall. DuckDB is great if there is
| a logical schema, not as good as jq when the corpus is just data
| soup.
| hu3 wrote:
| Related, clickhouse local cli command is a speed demon to parse
| and query JSON and other formats such as CSV:
|
| - "The world's fastest tool for querying JSON files"
| https://clickhouse.com/blog/worlds-fastest-json-querying-too...
|
| - "Show HN: ClickHouse-local - a small tool for serverless data
| analytics" https://news.ycombinator.com/item?id=34265206
| mightybyte wrote:
| I'll second this. Clickhouse is amazing. I was actually using
| it today to query some CSV files. I had to refresh my memory on
| the syntax so if anyone is interested:
|
|     clickhouse local -q "SELECT foo, sum(bar) FROM file('foobar.csv', CSV) \
|       GROUP BY foo FORMAT Pretty"
|
| Way easier than opening in Excel and creating a pivot table
| which was my previous workflow.
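| For anyone without clickhouse installed, the same group-by-sum over
| a CSV can be sketched with the Python stdlib (foobar.csv and its
| foo/bar columns are hypothetical, taken from the query above):

```python
import csv
import io
from collections import defaultdict

# Stand-in for foobar.csv (columns foo, bar).
data = io.StringIO("foo,bar\na,1\nb,2\na,3\n")

# Equivalent of: SELECT foo, sum(bar) FROM file(...) GROUP BY foo
totals = defaultdict(int)
for row in csv.DictReader(data):
    totals[row["foo"]] += int(row["bar"])

print(dict(totals))
```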
|
| Here's a list of the different input and output formats that it
| supports.
|
| https://clickhouse.com/docs/en/interfaces/formats
| hermitcrab wrote:
| If you want a very visual way to transform JSON/XML/CSV/Excel etc.
| in a pipeline, it might also be worth looking at Easy Data
| Transform.
| mritchie712 wrote:
| You can also query (public) Google Sheets [0]:
|
|     SELECT * FROM read_csv_auto(
|         'https://docs.google.com/spreadsheets/export?format=csv&id=1GuEPkwjdICgJ31Ji3iUoarirZNDbPxQj_kf7fd4h4Ro',
|         normalize_names=True);
|
| 0 - https://x.com/thisritchie/status/1767922982046015840?s=20
| dudus wrote:
| DuckDB parses JSON using yyjson internally.
|
| https://github.com/ibireme/yyjson
___________________________________________________________________
(page generated 2024-03-21 23:01 UTC)