# DuckDB as the New jq

March 21, 2024

Recently, I've been interested in the DuckDB project (like SQLite, but geared towards data applications). One of its amazing features is that many data importers are included without requiring extra dependencies. This means it can natively read and parse JSON as a database table, among many other formats.

I work extensively with JSON day to day, and I often reach for jq when exploring documents. I love jq, but I find it hard to use. The syntax is super powerful, but I have to study the docs any time I want to do anything beyond just selecting fields.

Once I learned that DuckDB could read JSON files directly into memory, I realized I could use it for many of the tasks where I'm currently using jq. In contrast to the complicated and custom jq syntax, I'm very familiar with SQL and use it almost daily.

Here's an example. First, we fetch some sample JSON to play around with. I used the GitHub API to grab the repository information from the golang org:

```shell
% curl 'https://api.github.com/orgs/golang/repos' > repos.json
```

Now, as a sample question to answer, let's get some stats on the types of open source licenses used. The JSON structure looks like this:

```json
[
  {
    "id": 1914329,
    "name": "gddo",
    "license": {
      "key": "bsd-3-clause",
      "name": "BSD 3-Clause \"New\" or \"Revised\" License",
      ...
    },
    ...
  },
  {
    "id": 11440704,
    "name": "glog",
    "license": {
      "key": "apache-2.0",
      "name": "Apache License 2.0",
      ...
    },
    ...
  },
  ...
]
```
This might not be the best way, but here is what I cobbled together after searching and reading some docs for how to do this in jq:

```shell
% cat repos.json | jq \
  'group_by(.license.key) | map({license: .[0].license.key, count: length}) | sort_by(.count) | reverse'
[
  {
    "license": "bsd-3-clause",
    "count": 23
  },
  {
    "license": "apache-2.0",
    "count": 5
  },
  {
    "license": null,
    "count": 2
  }
]
```

And here is what it looks like in DuckDB using SQL:

```shell
% duckdb -c \
  "select license->>'key' as license, count(*) as count \
  from 'repos.json' \
  group by 1 \
  order by count desc"
+--------------+-------+
|   license    | count |
|   varchar    | int64 |
+--------------+-------+
| bsd-3-clause |    23 |
| apache-2.0   |     5 |
|              |     2 |
+--------------+-------+
```

For me, this SQL is much simpler, and I was able to write it without looking at any docs. The only tricky part is querying nested JSON with the `->>` operator. The syntax is the same as the PostgreSQL JSON functions, however, so I was already familiar with it.

And if we do need the output in JSON, there's a DuckDB flag for that:

```shell
% duckdb -json -c \
  "select license->>'key' as license, count(*) as count \
  from 'repos.json' \
  group by 1 \
  order by count desc"
[{"license":"bsd-3-clause","count":23},
{"license":"apache-2.0","count":5},
{"license":null,"count":2}]
```

We can even still pretty-print with jq at the end, after using DuckDB to do the heavy lifting:

```shell
% duckdb -json -c \
  "select license->>'key' as license, count(*) as count \
  from 'repos.json' \
  group by 1 \
  order by count desc" \
  | jq
[
  {
    "license": "bsd-3-clause",
    "count": 23
  },
  {
    "license": "apache-2.0",
    "count": 5
  },
  {
    "license": null,
    "count": 2
  }
]
```

JSON is just one of the many ways of importing data into DuckDB. This same approach would work for CSVs, Parquet, Excel files, etc. And I could choose to create tables and persist locally, but often I'm just interrogating data and don't need the persistence.
Read more about DuckDB's great JSON support in this blog post: Shredding Deeply Nested JSON, One Vector at a Time