[HN Gopher] DuckDB - An in-process SQL OLAP database management ...
___________________________________________________________________
DuckDB - An in-process SQL OLAP database management system
Author : freilanzer
Score : 98 points
Date : 2023-02-10 15:48 UTC (7 hours ago)
(HTM) web link (duckdb.org)
(TXT) w3m dump (duckdb.org)
| samwillis wrote:
| DuckDB is awesome!
|
 | Last year I was working on something using SQLite, where users
 | could perform analytical queries that would scan the entire 3GB
 | db and generate aggregates. It would take at least 45 seconds
 | to do the queries.
|
| I did a dump of the db and imported to DuckDB. The same queries
| now only take 1.5 seconds with exactly the same SQL.
|
| Obviously there is a trade off, inserts are slower on DuckDB. But
| for a low write, analytical read app it's perfect.
|
| I tried to use the SQLite connector but struggled to get it
| working. Need to circle back and have another go.
|
| With their SQLite and Postgres connectors, as a Django dev I
| would love an app that lets you run specific queries on your DB
| via DuckDB almost transparently. Would be awesome for analytical
| dashboards.
| mcdonje wrote:
| Yeah, I see it as the sqlite of the reporting world. Especially
| with columnar stores and parquet support.
| dunefox wrote:
| I'd need a MySQL/MariaDB connector for it to be really useful
| for me. But still, it looks great.
| everybodyknows wrote:
| Anyone tried using the Go driver? It has a few open bugs:
|
| https://github.com/marcboeker/go-duckdb/issues
|
| And a third-party effort, only:
|
| https://duckdb.org/docs/api/overview
| kagitac wrote:
 | I've found the CGO boundary to be quite slow for large result
 | sets, and have taken to just running commands that SELECT and
 | COPY to files on the system, then reading those files.
| out_of_step wrote:
| DuckDB has an awesome integration with Apache Arrow and very
| useful implementations in R and Python. Great for easy larger-
| than-memory analyses
|
| https://duckdb.org/2021/12/03/duck-arrow.html
| vgt wrote:
| DuckDB is such a breath of fresh air!
|
 | We at MotherDuck are working very closely with the DuckDB folks
 | to build a DuckDB-based cloud service. I'm talking to various
 | folks in the industry about the details in 1:1s. Feel free to
 | reach out to tino at motherduck.com.
|
| (co-founder and head of Produck at MotherDuck)
| lopkeny12ko wrote:
| I appreciate the clarity on the explicitly unsupported use cases
| in "When to not use DuckDB."
|
| There are so many infrastructure products, especially database
| products for some reason, where the marketing team takes control
| of the messaging away from engineers, and push outlandish claims
| on how their new DB is faster than all the competition, can
| support any workload, can scale infinitely, etc.
| SnowflakeOnIce wrote:
| DuckDB is terrific. I'm bullish on its potential for simplifying
| many big data pipelines. Particularly, it's plausible that DuckDB
| + Parquet could be used on a large SMP machine (32+ cores and
| 128GB+ memory) to deal with data munging for 100s of gigabytes to
| several terabytes, all from SQL, without dealing with Hadoop,
| Spark, Ray, etc.
|
| I have successfully used DuckDB like above for preparing an ML
| dataset from about 100GB of input.
|
| DuckDB is undergoing rapid development these days. There have
| been format-breaking changes and bugs that could lose data. I
| would not yet trust DuckDB for long-term storage or archival
| purposes. Parquet is a better choice for that.
| nerdponx wrote:
| As far as I can tell, DuckDB is an alternative to "data frame"
 | libraries like data.table, Polars, Pandas, etc. Is that the
| case? What makes DuckDB a better choice than, say, Polars?
| corford wrote:
| This blog post offers a nice summary:
| https://motherduck.com/blog/six-reasons-duckdb-slaps/
| nerdponx wrote:
| The blog post doesn't really make a comparison between
| DuckDB and data frame libraries. It mentions that the
| DuckDB Python bindings can interoperate with Pandas, but it
| doesn't really explain why you would use DuckDB _instead
| of_ Pandas, or Polars (which is both faster and more
| portable than Pandas).
| simonw wrote:
| Don't Polars and Pandas both require your entire data to
| fit in memory?
| syntaxfree wrote:
 | It's a drop-in alternative to SQLite that's column-
 | oriented/OLAP. I've been profiling entire projects in
| production switching between SQLite and duckdb (no clear
| conclusions yet)
| nerdponx wrote:
| I suppose that leads to a broader question: when should you
| use an in-memory database, and when should you use a data
| frame library? The distinction between the two seems to be
| getting blurry (which maybe is a good thing).
| IanCal wrote:
| Very blurry. The answer now is just "whichever is easier
| for the small part of the task right now". Since duckdb
| happily talks arrow, you can use pandas for part of it,
| quickly do some SQL where that is easier (with no data
| copying) then switch back to pandas for something. You
| don't really have to choose which one to use any more.
| benjaminwootton wrote:
| SQL is easier and more natural to work with than Pandas.
| RobinL wrote:
| I'd love to hear any real world experiences of anyone who's
| tried to run jobs that would usually require a spark cluster on
| a single machine with loads of cores and memory.
|
| How big can you go, and how does speed compare to Spark? (I'm
| guessing significantly faster from my experience using Duckdb
| on smaller machines)
| wenc wrote:
| I have a single machine EC2 instance with 32 cores and 240GB
| memory and about 200 GB of partitioned Parquet files. I use
| DuckDB and Python with complex SQL (window functions,
| inequality joins, quantile functions etc) to extract data
| from this data.
|
| Because it's a single machine (no distributed cluster) DuckDB
| can heavily parallelize and vectorize. I don't know if I can
| give you perf numbers but complex analytic queries over the
| entire dataset regularly finish in 1-2 mins (not scientific
| since I'm not telling what kinds of queries I'm running).
|
| I've used Spark SQL and DuckDB overall is just more
| ergonomic, less boilerplate and is much faster since it is so
| lightweight.
|
| Granted DuckDB can only process data on one machine (whereas
 | Spark can scale out indefinitely by adding machines) but most
| data sets I work with fit on a single beefy machine.
|
| Distributed computing -- most of the time, you ain't gonna
| need it.
|
| It's like StackOverflow: it serves 2B requests a month but
| only runs on a few on-prem servers. Most people think this is
| impossible but you can actually do a lot with very few
| machines if you're smart about it. Same with data. Big data
| is overrated.
| RobinL wrote:
| Thanks - very interesting
| jakewins wrote:
| I used it as glue for a monthly job, last week, processing
| ~1B event records, juggling them a bit and pushing them as 2M
| parquet segment files to S3.
|
 | Mixed experience; I would definitely not put it in a system
 | that isn't OK with crashing frequently.
|
| It segfaults on Alpine (argh, C++!), and force exits the
| whole NodeJS process when it gets unexpected HTTP responses
| from S3.
|
| In an archive run of 2M pushes it'll crash the process 4-5
| times.
|
| Overall still really, really like it, but learned to not
| trust it
| tracker1 wrote:
| Node + AWS for data ingress has been pretty painful in my
| experience (mostly dynamo feeds from large csv (whois
| database)). In the end, rewrote in C# (core 2) and it was
| able to complete more reliably. I'm guessing that go and
| rust would also be better. I like node, really like JS, but
| I just think that maybe the AWS libraries aren't that great
| in the space. It would run for 3-5 hours, then just blow up
| unexpectedly, even with plenty of memory overhead, and not
| really bandwidth limited, with appropriate retries and
| slowdown for dynamo rejections.
|
| If I never have to write ETL pipelines again, I won't be
| upset about it.
| Master_Odin wrote:
 | There was a bug, recently fixed in Node 16.17+, that was
 | causing hard Node process crashes when doing stuff with S3,
 | and I think it had to do with receiving multiple packets at
 | once or something.
| benjaminwootton wrote:
| DuckDB is a relational OLAP store. If you want to do
| transformations on relational data using SQL then I think
| nowadays you would look at the modern data stack and do it with
| DBT.
|
| If you have genuinely big and unstructured data then of course
| you need a cluster and would reach for Spark.
|
| If you have smallish data then maybe DuckDB has a role because
| working with SQL is nicer than Pandas. But a lot of time you
| actually need the complexity of Pandas to do the transformation
| you need.
|
| DuckDB is neat but I still can't quite convince myself of a
| killer use case.
| 89vision wrote:
| > working with SQL is nicer than Pandas
|
| Really? I prefer working with dataframe apis. You get a nice
| sql-like paradigm plus all the control structures of the
| runtime.
| neeleshs wrote:
| I am not sure I understand the first comment very well. Are
 | you saying that instead of duckdb we should use the modern
 | data stack? DBT and DuckDB don't seem to conflict; they can
 | work together. FWIW, I think the only important breakthrough
 | in the "modern data stack" is really dbt. The rest, nothing
 | modern about it.
| krimpenrik wrote:
 | Recently tried the GUI tool for DuckDB, forgot what it's
 | called, something like 'Tab', and was quite disappointed. I
 | feel duckdb needs a good tool like sqliteviewer to really take
 | off.
| 0cf8612b2e1e wrote:
| DBeaver supports duckdb.
| bwanab wrote:
| There's a nice intro to DuckDB for julia developers by Bogumil
| Kaminski, the creator of DataFrames.jl here:
| https://juliazoid.com/welcome-to-duckdb-3c4e75f50b97.
|
| Interesting since in some ways, as he points out, it's in direct
| competition with DataFrames for use cases, but he gives it a very
| positive treatment and shows how they can work together using
| advantages of standard SQL along with processing power of
| DataFrames.
| nickpeterson wrote:
 | When someone gives fair opinions on something that directly
 | competes with their own work, you should take their opinion
| very seriously. It's an excellent quality in a person, and
| shows they're more focused on the problem than their ego.
| phillc73 wrote:
| Quite a while ago, when duckdb was just a duckling, I wrote
| an R package that supported direct manipulation of R
| dataframes using SQL.[1] duckdb was the engine for this.
|
| The approach was never as fast as data.table but did approach
| the speed of dplyr for more complex queries.
|
| Life had other things in store for me and I haven't touched
| this library for a while now.
|
| At the time there was no Julia connector for duckdb, but now
| that there is, I'd like to try this approach in that
| language.
|
| [1] https://github.com/phillc73/duckdf
| dang wrote:
| Related--others?:
|
| _Motherduck Raises $47.5M for DuckDB_ -
| https://news.ycombinator.com/item?id=33610218 - Nov 2022 (2
| comments)
|
| _Modern Data Stack in a Box with DuckDB_ -
| https://news.ycombinator.com/item?id=33191938 - Oct 2022 (5
| comments)
|
| _Querying Postgres Tables Directly from DuckDB_ -
| https://news.ycombinator.com/item?id=33035803 - Sept 2022 (38
| comments)
|
| _Notes on the SQLite DuckDB Paper_ -
| https://news.ycombinator.com/item?id=32684424 - Sept 2022 (28
| comments)
|
| _Show HN: CSVFiddle - Query CSV files with DuckDB in the
| browser_ - https://news.ycombinator.com/item?id=31946039 - July
| 2022 (13 comments)
|
| _Show HN: Easily Convert WARC (Web Archive) into Parquet, Then
| Query with DuckDB_ -
| https://news.ycombinator.com/item?id=31867179 - June 2022 (15
| comments)
|
| _Range joins in DuckDB_ -
| https://news.ycombinator.com/item?id=31530639 - May 2022 (24
| comments)
|
| _Friendlier SQL with DuckDB_ -
| https://news.ycombinator.com/item?id=31355050 - May 2022 (133
| comments)
|
| _Fast analysis with DuckDB and Pyarrow_ -
| https://news.ycombinator.com/item?id=31217782 - April 2022 (53
| comments)
|
| _Directly running DuckDB queries on data stored in SQLite files_
| - https://news.ycombinator.com/item?id=30801575 - March 2022 (23
| comments)
|
| _Parallel Grouped Aggregation in DuckDB_ -
| https://news.ycombinator.com/item?id=30589250 - March 2022 (10
| comments)
|
| _DuckDB quacks Arrow: A zero-copy data integration between Arrow
| and DuckDB_ - https://news.ycombinator.com/item?id=29433941 - Dec
| 2021 (13 comments)
|
| _DuckDB-Wasm: Efficient analytical SQL in the browser_ -
| https://news.ycombinator.com/item?id=29039235 - Oct 2021 (58
| comments)
|
| _Comparing SQLite, DuckDB and Arrow with UN trade data_ -
| https://news.ycombinator.com/item?id=29010103 - Oct 2021 (79
| comments)
|
| _DuckDB is the better SQLite with APIs for Java /Python/R and
| it's got potential_ -
| https://news.ycombinator.com/item?id=28692997 - Sept 2021 (2
| comments)
|
| _Fastest table sort in the West - Redesigning DuckDB 's sort_ -
| https://news.ycombinator.com/item?id=28328657 - Aug 2021 (27
| comments)
|
| _Querying Parquet with Precision Using DuckDB_ -
| https://news.ycombinator.com/item?id=27634840 - June 2021 (32
| comments)
|
| _DuckDB now has a Node.js API_ -
| https://news.ycombinator.com/item?id=25289574 - Dec 2020 (4
| comments)
|
| _DuckDB - An embeddable SQL database like SQLite, but supports
| Postgres features_ -
| https://news.ycombinator.com/item?id=24531085 - Sept 2020 (160
| comments)
|
| _DuckDB: SQLite for Analytics_ -
| https://news.ycombinator.com/item?id=23287278 - May 2020 (67
| comments)
| bluehat974 wrote:
| DuckCon 2023 - State of the Duck - Mark Raasveldt and Hannes
| Muhleisen
|
| https://youtu.be/rdnPkLSkoyU
| d_watt wrote:
| I'm duck-curious.
|
 | Looking at how it's deployed as an in-process database, how do
| people actually use this in production? Trying to figure out
| where I might actually want to think about replacing current
| databases or analyses with DuckDB.
|
| EG if you deployed new code
|
| 1. Do you have a stateful machine you're doing an old school
| "Kill the old process, start the new process" deploy, and there's
| some duckdb file on disk that is maintained?
|
| 2. Or do you back that duckdb file in some sort of shared disk
| (Eg EBS), and have a rolling deploy where multiple applications
| access the same DB at the same time?
|
 | 3. Or is DuckDB treated as ephemeral, and you're using it to
| process data on the fly, so persisted state isn't an issue?
| benrutter wrote:
 | There's a great podcast/interview with the creator of duckdb.
 | He's pretty clear that he thinks of the use case as more or
 | less equivalent to MySQL, but for aggregate queries. I think
 | trying to use it in place of something like a fully fledged
 | postgres server might get weird, less because of any issues
 | with duckdb and more because that isn't what it's designed for.
|
| Podcast/interview link for anyone interested:
| https://www.dataengineeringpodcast.com/duckdb-in-process-ola...
| bigdubs wrote:
| We use DuckDB extensively where I work (https://watershed.com),
| the primary way we're using it is to query Parquet formatted
| files stored in GCS, and we have some machinery to make that
| doable on demand for reporting and analysis "online" queries.
| dstroot wrote:
| Storing data in Parquet files and querying via DuckDB is fast
| and kind of magical.
| bigdubs wrote:
| Shockingly fast and nice and having the intermediate files
| be immutable is super nice.
| Tostino wrote:
| Haven't used it yet, but this aspect seems very
| appealing.
| mritchie712 wrote:
| Do you load the Parquet files in duckdb or just query them
| directly?
| bigdubs wrote:
| We query them directly in most cases with the httpfs
| plugin, but for hot paths we fetch them and cache them on
| disk locally.
| dunefox wrote:
| Good questions, I'm thinking about evaluating it along with
| pola.rs as data science tools. Would be interesting to hear
| from someone using it already.
| romski wrote:
| Yes using it in production - Stateless and ephemeral. For sure
| there's a learning curve.
| d_watt wrote:
 | I see. Would it be fair to say you treat it almost like
 | Pandas, except that it has a lower memory footprint since
 | data is written to disk instead of held in memory? I.e. you
 | use it for on-the-fly analysis of large frames of data, not
 | like a more traditional database/data warehouse?
| chrisjc wrote:
| I'm not sure they're saying that.
|
 | BTW, your questions are exactly those that I've been asking
 | over the last few months, but also with a lot of focus over
 | the last few days. Still learning as much as I can, so the
 | following might not be true.
|
| For what it's worth, there's a difference between using
| duckdb to query a set of files vs loading a bunch of files
 | into a table. But once the data has been loaded into a
| table it can be backed up as a duckdb db file.
|
| Therefore it might be more performant to preprocess duckdb
| db files (perhaps a process that works in conjunction with
| whatever manages your external tables) and load these db
| files into duckdb as needed (on the fly analysis) instead
| of loading datafiles into duckdb, transforming and CTAS
| every time.
|
| https://duckdb.org/docs/sql/statements/attach
|
| Of course all of this might be introducing more latency esp
| if you're trying to do NRT analytics.
|
| I assume you could partition your data into multiple db
| files similar to how you would probably do it with your
| data files (managing external tables).
| d_watt wrote:
 | Ah, almost like a little bit of a lower-level Trino,
 | where DuckDB is able to push out queries to different
 | data storage endpoints?
| chrisjc wrote:
| Unfortunately not. At least not without a little
| intervention. See this blog post for more details about
| what I mean. They inspect the iceberg table's catalogue
| to list the related parquet files and then load them into
| duckdb.
|
| https://www.arecadata.com/getting-started-with-iceberg-
| using...
|
| You would still need to interact with some kind of
| catalogue to understand which .db files you need to
| fetch.
|
| And honestly I don't really know or understand the
| performance implications of the attach command.
|
| I'm excited to see if the duckdb team will be able to
| integrate with external tables directly one day. (not
| that data files would be .db files)
|
| Imagine this:
|
 | 1) you have an externally managed table (iceberg, delta,
 | etc... managed by Glue, databricks, etc)
 |
 | 2) register this table in duckdb:
 |
 |     CREATE OR REPLACE EXTERNAL TABLE my_table ...
 |       TYPE = 'ICEBERG'
 |       CATALOG = 's3://...'
 |       CREDENTIALS = '...'
 |       etc
|
| 3) simply interact with table in duckdb as you would any
| other table
| argella wrote:
| Isn't SQLite adding features for analytic queries that should
| take the wind out of the sails of duckdb?
| tanin wrote:
 | I doubt SQLite will catch up soon in terms of analytics, for a
 | few reasons.
 |
 | The SQL dialect is lacking so much that it seems intentional.
 | SQLite is meant to be a transactional database, not an
 | analytics one.
 |
 | SQLite also takes pride in stability (deployed on a billion
 | Android devices). Adding 100+ analytics capabilities (e.g.
 | functions) is not gonna be easy in terms of maintaining
 | stability.
|
| I want to be wrong though because my paid app
 | (superintendent.app) uses SQLite. Not supporting analytics well
| is the number one complaint.
| simonw wrote:
| Not really. There was an academic project around this but I've
| not seen signs that it will be integrated with SQLite core:
| https://simonwillison.net/2022/Sep/1/sqlite-duckdb-paper/
|
| The key thing to consider here is trade-offs.
|
| Analytical databases tend to be optimized for analytical
| queries at the expense of fast atomic read-write transactions.
|
| SQLite is mainly used in situations where fast atomic read-
| write transactions are key - that's why it's used in so many
| mobile phone applications, for example.
|
| It's not going to grow analytical-query-at-scale capabilities
| if that means negatively impacting the stuff it's really good
| at already.
| kiratp wrote:
 | It's disappointing that C++ was chosen to build something that
 | is going to live in-process. Rust would have been so much
 | safer. All the segfaults you get when running DuckDB support
 | this statement.
| mcdonje wrote:
| I agree about Rust's memory safety advantage over C++, but I
| disagree that it's disappointing from a project perspective.
| Some DB experts made a good DB using a performant language
| they're comfortable with.
|
| You can't make project choices in a vacuum, and you can't
| assume others can either. People have limited time. The choice
| they were facing was probably not C++ vs Rust, but C++ vs
| nothing because they didn't have time to learn a new language
| before starting their project.
|
| Also, their first release was in 2019, so they probably heard
| of it, but that's around the beginning of its recent spike in
| popularity. It's starting to be viewed as a good long term
| option, but back then a lot of people were still wondering if
| it was a fad.
|
| I'm learning Rust, and I'm a big fan, but this is a bad take.
| intrasight wrote:
| I see no MDX query support. How is this an OLAP database? Or do I
| misunderstand what it is?
| slt2021 wrote:
| MDX is microsoft proptech, OLAP is more generic term to define
| analytical processing (in contrast to transaction processing as
| in CRUD).
|
| OLAP cube is microsoft's take on OLAP using 90s technologies
| for tech stacks from 1990s (Windows Server + SQL Server +
| SSAS).
|
| DuckDB is a modern take on OLAP
| epberry wrote:
| We just switched to DuckDB from Postgres to analyze AWS billing
| data and wrote up our experience,
| https://www.vantage.sh/blog/querying-aws-cost-data-duckdb
|
| Arguably Postgres was never the right tool to use for this
| analysis but nonetheless I was surprised at how much faster
| DuckDB was.
| thewisenerd wrote:
| ha! i'm using duckdb for a similar use-case.. plotting 18-month
| aggregates take less than 2 seconds.. vendor only provides data
| for the last 3 months (through the web portal).
|
| (i'm "only" working with 30 million rows though)
| lichtenberger wrote:
 | I wonder if, in the near future, we'll also see JVM-based
 | in-memory or hybrid OLAP database systems which make use of
 | SIMD instructions and columnar storage layouts via the
 | incubating Vector API.
|
| It would be also interesting to see how we can process semi-
| structured data in a similar way.
| jengleton wrote:
| Great to see this posted here! DuckDB is an integral part of an
| in-browser data analytics tool that I've been working on. It
| compiles to WASM and runs in a web worker. Queries against WASM
| DuckDB regularly run 10x faster than the original JavaScript
| implementation!
| thenaturalist wrote:
| In case you're interested in or willing to share more details
| about the tool or your architecture I'd be an interested
| reader.
| wasd wrote:
 | Our database is a Heroku PostgreSQL database. What's the best
 | way to get this working with DuckDB? I see there's a
 | postgresql connector but I'm not totally following how to
 | deploy it. Would I just spin up a dyno with the docker image /
 | custom build pack and connect it to the DB?
| chrisjc wrote:
| Doesn't postgres have a columnar option? If so, you could prob
| get better performance for your analytical interactions if you
| switched some tables to columnar.
|
| Otherwise check out postgres scanner.
| https://github.com/duckdblabs/postgres_scanner
|
| They have a blog entry about it too:
| https://motherduck.com/blog/duckdb-ecosystem-newsletter-two/
| ctc24 wrote:
 | We're big fans of DuckDB at https://prequel.co! We use it as part
| of our own dataframe implementation in Go. The speed is
| unbeatable and the tool is top notch. There are a few rough edges
| (it's not quite 1.0 level of stability yet), but the team is
| super reactive and has fixed bugs we've reported in < 48hrs
| pretty much every time.
| __mharrison__ wrote:
| Not trying to troll, but assuming proficiency in python, when
| would someone prefer this to say Pandas (or Polars)?
|
| I've written a lot of OLAP queries (wrote a materialization layer
| for MonetDb and Postgres years ago). I find Pandas so much easier
| to work with for semi complicated work.
| cdavid wrote:
| The obvious one is speed/data size: you can handle much larger
| data with duckdb compared to pandas. Depending on the data,
| maybe 10x larger, sometimes even more.
|
 | From an ergonomics perspective, I find pandas much harder to
 | use casually than SQL. When I was an IC and was using it a
 | lot, I was proficient in it. But now that I code maybe 5 hours
 | / month at work, I can't really do anything non-trivial beyond
 | basic stuff/pivots. OTOH, I never really forget SQL.
| brunoqc wrote:
| Anyone both tried duckdb and clickhouse-local?
___________________________________________________________________
(page generated 2023-02-10 23:00 UTC)