[HN Gopher] DuckDB as the New jq
       ___________________________________________________________________
        
       DuckDB as the New jq
        
       Author : pgr0ss
       Score  : 93 points
       Date   : 2024-03-21 18:19 UTC (4 hours ago)
        
 (HTM) web link (www.pgrs.net)
 (TXT) w3m dump (www.pgrs.net)
        
       | sshine wrote:
       | Very cool!
       | 
       | I am also a big fan of jq.
       | 
       | And I think using DuckDB and SQL probably makes a lot of sense in
       | a lot of cases.
       | 
        | But I think the examples are skewed toward problems that are
        | better solved in SQL anyway.
       | 
       | The ideal jq examples are combinations of filter (select), map
       | (map) and concat (.[]).
       | 
        | For example, finding the right download link:
        | 
        |     $ curl -s https://api.github.com/repos/go-gitea/gitea/releases/latest \
        |         | jq -r '.assets[]
        |                 | .browser_download_url
        |                 | select(endswith("linux-amd64"))'
        |     https://github.com/go-gitea/gitea/releases/download/v1.15.7/gitea-1.15.7-linux-amd64
        | 
        | Or extracting the KUBE_CONFIG of a DigitalOcean Kubernetes
        | cluster from Terraform state:
        | 
        |     $ jq -r '.resources[]
        |              | select(.type == "digitalocean_kubernetes_cluster")
        |              | .instances[].attributes.kube_config[].raw_config' \
        |         terraform.tfstate
        |     apiVersion: v1
        |     kind: Config
        |     clusters:
        |     - cluster:
        |         certificate-authority-data: ...
        |         server: https://...k8s.ondigitalocean.com
        |     ...
        
         | pgr0ss wrote:
         | I think that's a fair point. Unnesting arrays in SQL can be
          | annoying. Here is your first example with duckdb:
          | 
          |     duckdb -c \
          |       "select * from ( \
          |          select unnest(assets)->>'browser_download_url' as url \
          |          from read_json('https://api.github.com/repos/go-gitea/gitea/releases/latest') \
          |        ) \
          |        where url like '%linux-amd64'"
        
       | hprotagonist wrote:
        | i've been using simonw's sqlite-utils
        | (https://sqlite-utils.datasette.io/en/stable/) for this sort of
        | thing; given structured json or jsonl, you can throw data at an
        | in-memory sqlite database and query away:
        | https://sqlite-utils.datasette.io/en/stable/cli.html#queryin...
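[Editor's note] The workflow this comment describes can be sketched with nothing but Python's standard-library sqlite3 module, which is what sqlite-utils wraps; the table name and sample rows below are made up for illustration:

```python
import json
import sqlite3

# Hypothetical JSON rows standing in for real structured input data.
rows = json.loads('[{"name": "a", "count": 3}, {"name": "b", "count": 7}]')

# Throw the data at an in-memory SQLite database, then query away with SQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (name TEXT, count INTEGER)")
db.executemany("INSERT INTO items (name, count) VALUES (:name, :count)", rows)

total = db.execute("SELECT sum(count) FROM items").fetchone()[0]
```

sqlite-utils automates the schema-creation step shown here by inferring column types from the JSON.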
        
       | NortySpock wrote:
       | In a similar vein, I have found Benthos to be an incredible
       | swiss-army-knife for transforming data and shoving it either into
       | (or out of) a message bus, webhook, or a database.
       | 
       | https://www.benthos.dev/
        
         | krembo wrote:
          | How does this differ from filebeat?
        
       | pletnes wrote:
       | Worth noting that both jq and duckdb can be used from python and
       | from the command line. Both are very useful data tools!
        
       | haradion wrote:
       | I've found Nushell (https://www.nushell.sh/) to be really handy
       | for ad-hoc data manipulation (and a decent enough general-purpose
       | shell).
        
       | HellsMaddy wrote:
       | Jq tip: Instead of `sort_by(.count) | reverse`, you can do
       | `sort_by(-.count)`
        
         | philsnow wrote:
         | only if you're sure that .count is never null:
          |     $ echo '[{"a": {"count": null}}]' | jq -c 'sort_by(-.count)'
          |     jq: error (at <stdin>:1): null (null) cannot be negated
          |     $ echo '[{"a": {"count": null}}]' | jq -c 'sort_by(.count) | reverse'
          |     [{"a":{"count":null}}]
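[Editor's note] For comparison, the same null-count pitfall and a null-safe descending sort can be sketched in plain Python; the sample records are made up, and ranking on a tuple keeps None counts at the end instead of raising:

```python
import json

# Records with a possibly-null "count", mirroring the jq example above.
records = json.loads('[{"count": 3}, {"count": null}, {"count": 7}]')

# -record["count"] would raise TypeError for None, much as jq refuses to
# negate null. Sorting descending on (has_count, count) avoids that and
# pushes the null records to the end.
ordered = sorted(
    records,
    key=lambda r: (r["count"] is not None, r["count"] or 0),
    reverse=True,
)
```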
        
       | jeffbee wrote:
       | I tried this and it just seems to add bondage and discipline that
       | I don't need on top of what is, in practice, an extremely chaotic
       | format.
       | 
       | Example: trying to pick one field out of 20000 large JSON files
       | that represent local property records.
       | 
        |     % duckdb -json -c "select apn.apnNumber from read_json('*')"
        |     Invalid Input Error: JSON transform error in file "052136400500",
        |     in record/value 1: Could not convert string
        |     'fb1b1e68-89ee-11ea-bc55-0242ad1302303' to INT128
       | 
       | Well, I didn't want that converted. I just want to ignore it.
       | This has been my experience overall. DuckDB is great if there is
       | a logical schema, not as good as jq when the corpus is just data
       | soup.
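[Editor's note] One lenient alternative for this kind of data soup is to skip schema inference entirely and pull the one field out in plain Python; a sketch (the apn.apnNumber path mirrors the comment above, and silently skipping records that don't fit is the point):

```python
import glob
import json

def apn_numbers(pattern):
    """Collect apn.apnNumber from each JSON file matching the glob,
    ignoring files or records that don't fit instead of failing."""
    results = []
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                record = json.load(f)
        except (OSError, ValueError):
            continue  # unreadable or malformed file: ignore it
        if not isinstance(record, dict):
            continue  # top level isn't an object: ignore it
        apn = record.get("apn")
        if isinstance(apn, dict) and apn.get("apnNumber") is not None:
            results.append(apn["apnNumber"])
    return results
```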
        
       | hu3 wrote:
       | Related, clickhouse local cli command is a speed demon to parse
       | and query JSON and other formats such as CSV:
       | 
       | - "The world's fastest tool for querying JSON files"
       | https://clickhouse.com/blog/worlds-fastest-json-querying-too...
       | 
       | - "Show HN: ClickHouse-local - a small tool for serverless data
       | analytics" https://news.ycombinator.com/item?id=34265206
        
         | mightybyte wrote:
         | I'll second this. Clickhouse is amazing. I was actually using
         | it today to query some CSV files. I had to refresh my memory on
         | the syntax so if anyone is interested:
         | 
          |     clickhouse local -q "SELECT foo, sum(bar) FROM file('foobar.csv', CSV)
          |       GROUP BY foo FORMAT Pretty"
         | 
         | Way easier than opening in Excel and creating a pivot table
         | which was my previous workflow.
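[Editor's note] For comparison, the same foo/sum(bar) group-by can be sketched with Python's csv module; the file contents below are hypothetical, matching the layout implied by the command above:

```python
import csv
import io
from collections import defaultdict

# Hypothetical contents of foobar.csv.
data = "foo,bar\nx,1\ny,2\nx,3\n"

# Equivalent of: SELECT foo, sum(bar) FROM file('foobar.csv', CSV) GROUP BY foo
totals = defaultdict(int)
for row in csv.DictReader(io.StringIO(data)):
    totals[row["foo"]] += int(row["bar"])
```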
         | 
         | Here's a list of the different input and output formats that it
         | supports.
         | 
         | https://clickhouse.com/docs/en/interfaces/formats
        
       | hermitcrab wrote:
       | if you want a very visual way to transform JSON/XML/CSV/Excel etc
       | in a pipeline it might also be worth looking at Easy Data
       | Transform.
        
       | mritchie712 wrote:
        | You can also query (public) Google Sheets [0]:
        | 
        |     SELECT *
        |     FROM read_csv_auto(
        |       'https://docs.google.com/spreadsheets/export?format=csv&id=1GuEPkwjdICgJ31Ji3iUoarirZNDbPxQj_kf7fd4h4Ro',
        |       normalize_names=True);
       | 
       | 0 - https://x.com/thisritchie/status/1767922982046015840?s=20
        
       | dudus wrote:
        | DuckDB parses JSON using yyjson internally.
       | 
       | https://github.com/ibireme/yyjson
        
       ___________________________________________________________________
       (page generated 2024-03-21 23:01 UTC)