[HN Gopher] DuckDB-Wasm: Efficient analytical SQL in the browser
___________________________________________________________________
DuckDB-Wasm: Efficient analytical SQL in the browser
Author : ankoh
Score : 172 points
Date : 2021-10-29 14:58 UTC (8 hours ago)
(HTM) web link (duckdb.org)
(TXT) w3m dump (duckdb.org)
| typingmonkey wrote:
| Does DuckDB support multi-tab usage? How big is the wasm file
| that must be loaded?
| ankoh wrote:
| The WebAssembly module is 1.6 - 1.8 MB brotli-compressed,
| depending on the Wasm feature set. We're currently
| investigating ways to reduce this to around 1 MB. We also use
| streaming instantiation, which means the WebAssembly module is
| compiled while it is still being downloaded. But it will still
| hurt a bit more than a 40 KB library.
|
| Regarding multi-tab usage: Not today. The available filesystem
| apis make it difficult to implement this right now. We're
| looking into ways to make DuckDB-Wasm persistent but we can
| only read in this release.
| domoritz wrote:
| On https://shell.duckdb.org/versus, we have a comparison with
| related libraries. The Wasm bundle is currently 1.8 MB, but it
| can be instantiated while it's streaming in.
|
| The size probably makes it prohibitive to use DuckDB when your
| dataset is small and download size matters but we hope that
| future improvements in WebAssembly can get the size down.
| [deleted]
| timwis wrote:
| Interesting.. Would this be effective at loading a remote CSV
| file with a million rows, then performing basic GROUP BY COUNTs
| on it so I can render bar charts?
|
| I've been thinking of using absurd-sql for it since I saw
| https://news.ycombinator.com/item?id=28156831 last week
| texodus wrote:
| I contribute to https://perspective.finos.org/ , which
| supports all of this and quite a lot more.
|
| Here's a 1,000,000-row example I just threw together for you:
|
| https://bl.ocks.org/texodus/3802a8671fa77399c7842fd0deffe925
|
| and a CSV example, so you can try yours right now:
|
| https://bl.ocks.org/texodus/02d8fd10aef21b19d6165cf92e43e668
| ankoh wrote:
| It depends.
|
| Querying CSV files is particularly painful over the network
| since we still have to read everything for a full scan.
|
| With Parquet, at least, you would only have to read the columns
| of the group-by keys and aggregate arguments.
|
| Try it out and share your experiences with us!
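|
| For illustration, a rough sketch of that pattern using the
| native Python package (file and column names are made up); in
| DuckDB-Wasm the same SQL can be run in the shell against a
| remote Parquet URL:
|
|     import duckdb
|
|     con = duckdb.connect()  # in-memory database
|
|     # One-time conversion: a full scan of the CSV is unavoidable here.
|     con.execute("""
|         COPY (SELECT * FROM read_csv_auto('events.csv'))
|         TO 'events.parquet' (FORMAT 'parquet')
|     """)
|
|     # The aggregate for the bar chart only needs the 'category' column.
|     counts = con.execute("""
|         SELECT category, count(*) AS n
|         FROM 'events.parquet'
|         GROUP BY category
|         ORDER BY n DESC
|     """).fetchdf()
|     print(counts)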
| elmolino89 wrote:
| Not really DuckDB-Wasm question but DuckDB:
|
| I got a data sets probably not suitable for loading into a memory
| table (close to 1000M rows CSV). I did split it into 20M rows
| chunks, read one by one into a DuckDB temporary table and
| exported as parquet.
|
| SELECT using glob prefix.*.parquet where mycolumb=foobar does
| work but can be a bit faster. Apart from sorting the input to
| parquet CSVs, what can he done? The CSV chunks were already
| sorted.
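|
| For reference, a rough sketch of this kind of pipeline (paths
| and column names are hypothetical):
|
|     import duckdb
|     import glob
|
|     con = duckdb.connect()
|
|     # Convert each CSV chunk into its own Parquet file
|     # (assumes the parquet/ output directory already exists).
|     for i, chunk in enumerate(sorted(glob.glob('chunks/part_*.csv'))):
|         con.execute(f"""
|             COPY (SELECT * FROM read_csv_auto('{chunk}'))
|             TO 'parquet/part_{i:04d}.parquet' (FORMAT 'parquet')
|         """)
|
|     # Query all chunks at once with a glob; the WHERE filter can skip
|     # row groups via min/max statistics when files are sorted on that column.
|     rows = con.execute("""
|         SELECT count(*)
|         FROM 'parquet/part_*.parquet'
|         WHERE mycolumn = 'foobar'
|     """).fetchall()
|     print(rows)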
| fnord77 wrote:
| this looks so cool. is there a demo page pre-loaded with
| tables so people can try out queries right away?
| 1egg0myegg0 wrote:
| If you head over to the shell demo, you can run a query like
| the one below!
|
| https://shell.duckdb.org/
|
| select * from 'https://shell.duckdb.org/data/tpch/0_01/parquet/
| orders.parqu...' limit 10;
| aynyc wrote:
| I wish it supported ORC.
| crimsoneer wrote:
| I'm still not sure I "get" the use case for DuckDB. From what I
| understand, it's like a nifty in-memory SQL engine, but why is
| that better than just running Postgres or Microsoft SQL Server
| locally, where your data structures and tables and stuff have a
| lot more permanence?
|
| Like, my workflow is either I query an existing remote corporate
| DB and do my initial data munging there, or I get given a data
| dump that I either work on directly in Pandas, or add to a local
| DB and do a little more cleaning there. Not at all clear how
| DuckDB would help.
| nicoburns wrote:
| DuckDB is columnar, so in theory a lot faster than Postgres or
| SQL Server for analytical workloads.
|
| DuckDB is to Clickhouse, TimescaleDB, Redshift, etc as SQLite
| is to Postgres, MySQL, SQL Server.
| deadliftpro wrote:
| Where did you get the idea that SQL Server does not support
| columnar? That claim is wrong.
| wenc wrote:
| I don't think the OP said that SQL Server doesn't support
| columnar, only that by analogy SQL Server is primarily a
| row store (which for most of its history was true).
|
| Columnar technology in SQL Server only became usable in SQL
| 2016 (it existed in 2012 but was too restrictive -- I know
| because I tried to use it).
|
| In 2016 and above, you can either create a columnar index
| (non clustered column store) or convert an entire table
| into a columnar table (clustered column store). The
| technology is actually pretty impressive and I've used it
| in production where I have row stores coexisting with
| column stores within the same database.
| Dayshine wrote:
| The simple answer no one else seems to have mentioned: SQLite
| has quite a low limit on the number of columns it supports,
| which is a problem for data analytics, which often prefers
| wide over long.
| 1egg0myegg0 wrote:
| Check out this post for some comparisons with Pandas.
|
| https://duckdb.org/2021/05/14/sql-on-pandas.html
|
| DuckDB is often faster than Pandas, and it can handle larger
| than memory data. Plus, if you already know SQL, you don't have
| to become a Pandas expert to be productive in Python data
| munging. Pandas is still good, but now you can mix and match
| with SQL!
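|
| A small sketch of the mix-and-match (DataFrame and column names
| are made up; DuckDB finds the in-scope DataFrame by name):
|
|     import duckdb
|     import pandas as pd
|
|     df = pd.DataFrame({
|         'city': ['Berlin', 'Berlin', 'Amsterdam'],
|         'amount': [10.0, 20.0, 5.0],
|     })
|
|     # SQL directly over the Pandas DataFrame, no copy into a database.
|     result = duckdb.query("""
|         SELECT city, sum(amount) AS total
|         FROM df
|         GROUP BY city
|         ORDER BY total DESC
|     """).to_df()
|     print(result)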
| catawbasam wrote:
| Not just in-memory. It's pretty convenient if you have a set of
| Parquet files with common schema. Fairly snappy and doesn't
| have to fit in memory.
| jamesrr39 wrote:
| I'm using duckdb for querying parquet files as well. It's an
| awesome tool, so nice to just "look into" parquet files with
| SQL.
| deshpand wrote:
| Many enterprises are coming up with patterns where they
| replicate the data from the database (say Redshift) into
| Parquet files (a data lake?) and direct more traffic,
| including analytical workloads, onto the Parquet files.
|
| duckdb will be very useful here, instead of having to use
| Redshift Spectrum or whatever.
| deshpand wrote:
| I work heavily with pandas and dask (when you want to use
| multiple cores), using parquet files for storage. We see a lot
| of benefits in selectively bringing DuckDB into the mix. For
| instance, the joins are extremely slow with both pandas and
| dask and require a lot of memory. That's a situation where
| using duckdb reduces the memory needs and speeds things up a
| lot.
|
| And we may not want to upload the data into postgres or another
| database. We can just work with parquet files and run in-
| process queries.
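|
| A minimal sketch of pushing such a join down into DuckDB (file
| and column names are hypothetical); the Parquet inputs stay on
| disk and only the join result is materialized as a DataFrame:
|
|     import duckdb
|
|     joined = duckdb.query("""
|         SELECT o.order_id, o.amount, c.segment
|         FROM 'orders.parquet'    AS o
|         JOIN 'customers.parquet' AS c USING (customer_id)
|     """).to_df()
|     print(joined.head())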
| mytherin wrote:
| DuckDB developer here. DuckDB is a regular RDBMS that has
| persistent ACID storage, but is tuned towards analytical
| workloads, i.e. read-heavy workloads with aggregates that
| require full scans of the data. Any data you write to tables is
| stored persistently on disk, and not all your data needs to fit
| in memory either.
|
| Our tagline is "SQLite for analytics", as DuckDB is an in-
| process database system similar to SQLite that is geared
| towards these types of workloads.
|
| DuckDB has a flexible query engine, and also has support for
| directly running SQL queries (in parallel!) on top of Pandas
| [1] and Parquet [2] without requiring the data to be imported
| into the system.
|
| [1] https://duckdb.org/2021/05/14/sql-on-pandas.html
|
| [2] https://duckdb.org/2021/06/25/querying-parquet.html
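|
| A minimal sketch of that in-process usage from Python (file and
| table names are made up): a single persistent database file,
| plus SQL directly over a Parquet file:
|
|     import duckdb
|
|     con = duckdb.connect('analytics.duckdb')  # persistent file on disk
|     con.execute("CREATE TABLE visits AS SELECT * FROM 'visits.parquet'")
|     top = con.execute("""
|         SELECT page, count(*) AS hits
|         FROM visits
|         GROUP BY page
|         ORDER BY hits DESC
|         LIMIT 10
|     """).fetchall()
|     print(top)
|     con.close()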
| brandmeyer wrote:
| Maybe this is a silly question: Why is the A/B choice between
| a row-major database and a column-major database, instead of
| between row-major tables and column-major tables within a
| flexible database?
|
| What's stopping the other leading brands from implementing
| columnar storage, queries, and such with a COLUMN MAJOR table
| attribute?
| dgudkov wrote:
| SQL calculations on columnar data are quite different from
| those in row-based databases, so it's effectively a different
| database engine. You can take advantage of a columnar data
| store in several ways, because it usually employs a form of
| vocabulary (dictionary) compression. For instance, obtaining
| the distinct values of a field in a columnar DB is much faster
| because it's typically just the vocabulary of the field, so it
| doesn't even require a full table scan. Many other columnar
| computations, such as filtering or aggregation, can be done on
| compressed data without decompression.
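|
| A toy illustration of the vocabulary idea in isolation (using
| pyarrow's dictionary encoding, not DuckDB internals): once a
| column is dictionary-encoded, its distinct values are simply
| the dictionary, with no row-by-row scan needed.
|
|     import pyarrow as pa
|
|     column = pa.array(['red', 'blue', 'red', 'red', 'green', 'blue'])
|     encoded = column.dictionary_encode()
|
|     print(encoded.dictionary)  # ['red', 'blue', 'green'] -- distinct values
|     print(encoded.indices)     # small integer codes into the dictionary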
| mytherin wrote:
| Some databases do offer both, but it is much more involved
| than just changing the storage model. The entire query
| execution model needs to adapt to columnar execution. You
| can simulate a column store model in a row-store database
| by splitting a table into a series of single-column tables,
| but the performance benefits you will capture are much
| smaller than a system that is designed and optimized for
| column store execution.
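|
| A toy illustration of that simulation in a row store (SQLite as
| the stand-in; table and column names are made up). The
| per-column tables avoid reading unused columns, but the
| row-at-a-time executor still limits the benefit:
|
|     import sqlite3
|
|     con = sqlite3.connect(':memory:')
|     con.execute("CREATE TABLE sales_amount (rid INTEGER PRIMARY KEY, amount REAL)")
|     con.execute("CREATE TABLE sales_region (rid INTEGER PRIMARY KEY, region TEXT)")
|     con.executemany("INSERT INTO sales_amount VALUES (?, ?)",
|                     [(1, 10.0), (2, 20.0), (3, 5.0)])
|     con.executemany("INSERT INTO sales_region VALUES (?, ?)",
|                     [(1, 'EU'), (2, 'US'), (3, 'EU')])
|
|     # Each "column" is its own table; queries join back on the row id.
|     print(con.execute("""
|         SELECT r.region, sum(a.amount)
|         FROM sales_amount a JOIN sales_region r USING (rid)
|         GROUP BY r.region
|     """).fetchall())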
| loxias wrote:
| To me, the use case is really obvious: when you reached for
| SQLite but now want something with Moar Powah.
|
| Now I've reduced it to a previously answered question: what's
| the use case for SQLite? ;)
| https://www.sqlite.org/whentouse.html
|
| That being said, I don't see the point, and I shudder at the
| idea of a web page's javascript doing anything which _needs_
| noticeable amounts of CPU, but I'm a non-standard user...
| kf6nux wrote:
| I don't fully get the use case either, but it's in a different
| category than Postgres or Microsoft SQL because it runs in the
| browser and can be made part of your web app.
| ankoh wrote:
| DuckDB-wasm is targeting the browser so it's not directly
| competing with Pandas (that's the job of native DuckDB).
|
| It's targeting use cases where you want to push analytical
| computation away from servers into the client (browser).
|
| Let me sketch two examples:
|
| A) You have your data sitting in S3 and the user-specific
| data is in a browser-manageable area.
|
| (E.g. this paper from Tableau Research actually states dataset
| sizes that should fall into that category:
| https://research.tableau.com/sites/default/files/get_real.pd...)
|
| In that scenario, you could eliminate your central server if
| your clients are smart enough.
|
| B) You are talking about larger dataset sizes (GB) and want
| to explore them ad-hoc in your browser.
|
| Uploading them is unrealistic and installing additional
| software is no longer ad-hoc enough.
| sedatesteak wrote:
| I got excited about duckdb recently too. Used it yesterday for a
| new project at work and immediately ran into a not implemented
| exception for my (awful) column naming structure and discovered
| there is no pivot function.
|
| Otherwise, it's great, but obviously still a wip.
|
| For those wondering: I have a helper function for SOQL queries
| for Salesforce that follows the structure object.field.
|
| Referring to tablealias.[column.name], or using quotes instead
| of brackets, was a no-go.
| sgarrity wrote:
| How does this compare/relate to https://jlongster.com/future-sql-
| web (if at all)?
| ankoh wrote:
| The author outlines many problems that you'll run into when
| implementing a persistent storage backend using the current
| browser APIs.
|
| We faced many of them ourselves but paused any further work on
| an IndexedDB backend due to the lack of synchronous IndexedDB
| APIs (e.g. check the warning here:
| https://developer.mozilla.org/en-US/docs/Web/API/IDBDatabase...).
| He bypasses this issue using SharedArrayBuffers, which would
| lock DuckDB-Wasm to cross-origin-isolated sites. (See the
| "Multithreading" section in our blog post.)
|
| We might be able to lift this limitation in the future but this
| has some far-reaching implications affecting the query
| execution of DuckDB itself.
|
| To the best of my knowledge, there's just no way to do
| synchronous persistency efficiently right now that wont lock
| you to a browser or cross-origin-isolation. But this will be
| part of our ongoing research.
| munro wrote:
| Cool! This is the first time I'm hearing about DuckDB, which is
| exciting as I heavily use SQLite. These benchmarks show it's
| 6-15 times faster than sql.js (SQLite) [1], along with another
| small benchmark I found [2]. I usually just slather on indexes
| in SQLite though, so indexed queries may not stand up as well;
| and it might not be as fast on storage, since this compares
| in-memory performance (I think?), but I'll give it a spin!
|
| Gonna throw this out there: main thing I'm looking for from an
| embedded DB is better on disk compression; I've been toying with
| RocksDB, but it's hard to tune optimally & it's really too low
| level for my needs.
|
| [1] In IPython:
|
|     import numpy as np
|     duckdb = [0.855, 0.179, 0.151, 0.197, 0.086, 0.319, 0.236,
|               0.351, 0.276, 0.194, 0.086, 0.137, 0.377]
|     sqlite = [8.441, 1.758, 0.384, 1.965, 1.294, 2.677, 4.126,
|               1.238, 1.080, 5.887, 1.194, 0.453, 1.272]
|     print((np.quantile(sqlite, q=[0.1, 0.9]) /
|            np.quantile(duckdb, q=[0.1, 0.9])).round())
| [2] https://uwekorn.com/2019/10/19/taking-duckdb-for-a-spin.html
| sitkack wrote:
| Do you need compression to get more bandwidth or are you trying
| to save money on storage costs?
|
| DuckDB is a column based db, so you are going to see a
| throughput increase for queries that only use a handful of
| columns.
| [deleted]
| obeliskora wrote:
| There was a neat post https://news.ycombinator.com/item?id=27016630
| a while ago about using SQLite on static pages with large
| datasets that don't have to be loaded entirely. Does DuckDB do
| something similar with Arrow/Parquet files or its own format?
| ankoh wrote:
| Yes we do! DuckDB-Wasm can read files using HTTP range
| requests, very similarly to sql.js-httpvfs from phiresky.
|
| The blog post contains a few examples of how this can be used,
| for example, to partially query Parquet files over the network.
|
| E.g. just visit shell.duckdb.org and enter:
|
| select * from 'https://shell.duckdb.org/data/tpch/0_01/parquet/
| orders.parqu...' limit 10;
| 1egg0myegg0 wrote:
| NIIIICE! Data twitter was pretty excited about that cool
| SQLite trick - now you can turn it up a notch!
| tomrod wrote:
| Is data twitter == #datatwitter, like Econ Twitter is
| #econtwitter?
|
| If so, I have another cool community to follow!
| obeliskora wrote:
| It would be really cool to load DuckDB files too. sql.js-
| httpvfs seems convenient because it works on everything in the
| database, so you don't have to create indexes or set up keys
| and constraints in the client.
| ankoh wrote:
| I agree! DuckDB-Wasm can already open DuckDB database files
| in the browser the very same way.
| obeliskora wrote:
| That's really neat! Can you control the cache too?
| ankoh wrote:
| DuckDB-Wasm uses a traditional buffer manager and evicts
| pages using a combination of FIFO + LRU (to distinguish
| sequential scans from hot pages like the Parquet metadata).
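|
| A toy sketch of one way FIFO and LRU can be combined (not
| necessarily DuckDB-Wasm's exact policy): new pages sit in a
| FIFO probation queue, and pages touched a second time are
| promoted to an LRU segment, so a long sequential scan cannot
| flush the hot set.
|
|     from collections import OrderedDict
|
|     class FifoLruCache:
|         def __init__(self, capacity):
|             self.capacity = capacity
|             self.fifo = OrderedDict()  # probation: evicted in insertion order
|             self.lru = OrderedDict()   # protected: evicted in recency order
|
|         def get(self, page_id, load_page):
|             if page_id in self.lru:        # hot page: refresh recency
|                 self.lru.move_to_end(page_id)
|                 return self.lru[page_id]
|             if page_id in self.fifo:       # second touch: promote to LRU
|                 self.lru[page_id] = self.fifo.pop(page_id)
|                 return self.lru[page_id]
|             page = load_page(page_id)      # miss: load and put on probation
|             self.fifo[page_id] = page
|             self._evict_if_needed()
|             return page
|
|         def _evict_if_needed(self):
|             while len(self.fifo) + len(self.lru) > self.capacity:
|                 if self.fifo:
|                     self.fifo.popitem(last=False)  # oldest probation page
|                 else:
|                     self.lru.popitem(last=False)   # least recently used hot page
|
|     # Usage: cache = FifoLruCache(64); cache.get(7, read_block_from_disk)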
| joos2010kj wrote:
| Awesome!
| pantsforbirds wrote:
| DuckDB is one of my favorite projects I've stumbled on
| recently. I've had multiple use cases pop up where I wanted to
| do some pandas-type work but something SQLite-like was a better
| fit, so it's really come in handy for me.
| xnx wrote:
| Similar(?): https://sql.js.org/ (SQLite in wasm)
| domoritz wrote:
| Yes but DuckDB is optimized for analytics (columnar data and
| vectorized computation). Take a look at the comparison in
| https://shell.duckdb.org/versus.
| [deleted]
| pantsforbirds wrote:
| Anyone have a good benchmark comparing DuckDB to
| Parquet/Avro/ORC, etc.? Super curious to see how some of those
| workflows might compare. Obviously at scale it's going to be
| different, but using a single Parquet file/dataset as a DB
| replacement isn't an uncommon thing in DS/ML work.
| mytherin wrote:
| Why compare DuckDB to Parquet when you can use DuckDB and
| Parquet [1] :)
|
| [1] https://duckdb.org/2021/06/25/querying-parquet.html
| tomnipotent wrote:
| Does DuckDB also use a PAX-like format like Parquet? Without
| going into code, the best I could find with a little googlefu
| is the HyPer/Data Blocks paper - is this a relevant read?
| mytherin wrote:
| DuckDB's storage format has similar advantages to the Parquet
| storage format (e.g. individual columns can be read, partitions
| can be skipped, etc.) but it is different because DuckDB's
| format is designed to do more than Parquet files.
|
| Parquet files are intended to store data from a single
| table and they are intended to be written-once, where you
| write the file and then never change it again. If you want
| to change anything in a Parquet file you re-write the file.
|
| DuckDB's storage format is intended to store an entire
| database (multiple tables, views, sequences, etc), and is
| intended to support ACID operations on those structures,
| such as insertions, updates, deletes, and even altering
| tables in the form of adding/removing columns or altering
| types of columns without rewriting the entire table or the
| entire file.
|
| Tables are partitioned into row groups much like Parquet,
| but unlike Parquet the individual columns of those row
| groups are divided into fixed-size blocks so that
| individual columns can be fetched from disk. The fixed-size
| blocks ensure that the file will not suffer from
| fragmentation as the database is modified.
|
| The storage is still a work in progress, and we are
| currently actively working on adding more support for
| compression and other goodies, as well as stabilizing the
| storage format so that we can maintain backwards
| compatibility between versions.
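|
| A small sketch of those in-place operations against a single
| database file, using the Python client (file and table names
| are made up; exact ALTER support depends on the DuckDB
| version):
|
|     import duckdb
|
|     con = duckdb.connect('example.duckdb')  # one file holds the whole database
|     con.execute("CREATE TABLE measurements (id INTEGER, value DOUBLE)")
|     con.execute("INSERT INTO measurements VALUES (1, 0.5), (2, 1.5)")
|
|     # Schema and data changes without rewriting the whole file:
|     con.execute("ALTER TABLE measurements ADD COLUMN label VARCHAR")
|     con.execute("UPDATE measurements SET label = 'ok' WHERE value < 1.0")
|     con.execute("DELETE FROM measurements WHERE id = 2")
|
|     print(con.execute("SELECT * FROM measurements").fetchall())
|     con.close()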
___________________________________________________________________
(page generated 2021-10-29 23:00 UTC)