[HN Gopher] Querying Postgres Tables Directly from DuckDB
___________________________________________________________________
Querying Postgres Tables Directly from DuckDB
Author : hfmuehleisen
Score : 99 points
Date : 2022-09-30 14:59 UTC (8 hours ago)
(HTM) web link (duckdb.org)
(TXT) w3m dump (duckdb.org)
| samwillis wrote:
| DuckDB is brilliant. Something I'm working on uses SQLite;
| there is an analytical search users can perform that scans
| most of the DB with multiple joins. It would take about 45
| seconds to run the query. I exported the dataset and imported
| it into DuckDB. The exact same SQL query now takes only 1.5
| seconds.
|
| It really shows how much better a column store is for certain
| queries.
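|
| In case it's useful to anyone trying the same migration:
| DuckDB can scan a SQLite file directly via its sqlite
| extension, so a rough sketch (file and table names here are
| placeholders) is:
|
|     INSTALL sqlite; LOAD sqlite;
|     CREATE TABLE events AS
|       SELECT * FROM sqlite_scan('app.db', 'events');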
| galaxyLogic wrote:
| I wasn't aware of DuckDB, but now that I've read about it I
| wonder whether I should replace my SQLite drivers with it,
| since it seems more versatile?
| astrospective wrote:
| It's pretty cool, but I wouldn't switch over unless you know
| you need a column store approach.
| mhuffman wrote:
| I would say maybe don't switch over if you need something
| that has been rock-solid for years and years.
|
| If you want a shit-ton of built-in features, DuckDB is pretty
| damn fast even with non-column-store-optimized queries.
| Surprisingly so! I have done side-by-side tests and you
| really have to have a lot of rows for it to start to make a
| difference. And their built-in functions are awesome. And
| they (company/group) are awesome and very responsive!
|
| Having said that, DuckDB is still young enough to make me
| nervous about putting it in critical application stacks, and
| I feel about SQLite the same way I feel about PostgreSQL ...
| it is solid as hell and will just work.
|
| I will say that I have not had any stability issues in any of
| my personal projects that I use DuckDB in; it is just that I
| know for a fact that SQLite is solid.
| tkhattra wrote:
| one thing to be aware of is that duckdb doesn't currently allow
| concurrent readers and writers (unlike sqlite). depending on
| your use case, that might be a showstopper.
| revision17 wrote:
| DuckDB is awesome! I find it the easiest way to ingest data from
| various sources then query it into a form I can do analytics on.
|
| The datasets I work on are a bit too big for pandas, but
| Spark is way overkill for them. DuckDB lets me efficiently
| work on them using only a single computer.
| jxi wrote:
| xnx wrote:
| SQLite and DuckDB are excellent demonstrations that the computers
| we have on our desks (and even in our pockets) are ridiculously
| fast and capable for almost everything but the largest workloads.
| They're a stark contrast to framework bloat that has stolen our
| CPU cycles and given us perceptible lag when typing and
| scrolling.
| somenameforme wrote:
| It makes one wonder what the software world would look like
| if the whole notion of 'premature optimization is the root
| of all evil' never existed. Because that advice, well
| intentioned and
| perfectly reasonable when applied thoughtfully, gradually
| morphed into 'optimization is the root of all evil'
| contributing to this ongoing race between computers speeding up
| and software slowing down.
|
| And a cynic would observe that this scenario was almost
| certainly not unwelcome to the top players on either side.
| It's much easier to require what not long ago would have been
| considered a supercomputer just to run a word processor than
| it is to create scenarios where such power is meaningfully
| applied in a mass-market product.
| rgblambda wrote:
| I don't think the idea morphed into "any optimization is
| evil". But bloat is the unfortunate consequence of leaving
| optimization until after functional requirements are met.
| Same with any kind of tech debt: a mentality of "let's just
| get this out the door now and fix it later" results in
| "later" meaning never.
| [deleted]
| ClassicOrgin wrote:
| I was looking into exporting a database from DuckDB to
| Postgres for use with Google Data Studio, but whenever I
| exported, DuckDB was attaching decimals to integers. This was
| throwing off the entire import into Postgres. Think I'll just
| try this out.
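|
| A possible workaround (untested sketch; the column and table
| names are made up): cast explicitly while exporting so the
| integer type survives the trip, e.g.
|
|     COPY (SELECT CAST(id AS BIGINT) AS id FROM my_table)
|       TO 'my_table.csv' (HEADER);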
| isoprophlex wrote:
| Nice job, this will be useful to many, I think. The DuckDB
| team sure seems to have a knack for figuring out what their
| audience needs.
| eatonphil wrote:
| The way they do this read is interesting, and so is the
| explanation of why they don't just read the files on disk
| directly.
|
| > The Postgres Scanner uses the standard libpq library, which it
| statically links in. Ironically, this makes the Postgres Scanner
| easier to install than the other Postgres clients. However,
| Postgres' normal client-server protocol is quite slow, so we
| spent quite some time optimizing this. As a note, DuckDB's SQLite
| Scanner does not face this issue, as SQLite is also an in-process
| database.
|
| > We actually implemented a prototype direct reader for Postgres'
| database files, but while performance was great, there is the
| issue that committed but not yet checkpointed data would not be
| stored in the heap files yet. In addition, if a checkpoint was
| currently running, our reader would frequently overtake the
| checkpointer, causing additional inconsistencies. We abandoned
| that approach since we want to be able to query an actively used
| Postgres database and believe that consistency is important.
| Another architectural option would have been to implement a
| DuckDB Foreign Data Wrapper (FDW) for Postgres similar to
| duckdb_fdw but while this could improve the protocol situation,
| deployment of a postgres extension is quite risky on production
| servers so we expect few people will be able to do so.
|
| > Instead, we use the rarely-used binary transfer mode of the
| Postgres client-server protocol. This format is quite similar to
| the on-disk representation of Postgres data files and avoids some
| of the otherwise expensive to-string and from-string conversions.
| For example, to read a normal int32 from the protocol message,
| all we need to do is to swap byte order (ntohl).
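|
| For reference, driving the scanner from the DuckDB side looks
| roughly like this (connection string details will vary by
| setup):
|
|     INSTALL postgres_scanner; LOAD postgres_scanner;
|     CALL postgres_attach('dbname=mydb user=postgres host=127.0.0.1');
|     SELECT * FROM postgres_scan('dbname=mydb', 'public', 'mytable');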
| latenightcoding wrote:
| Wonder if there is anything like this for ClickHouse.
|
| edit: to query pg from clickhouse not clickhouse from duckdb
| zX41ZdbW wrote:
| Yes, ClickHouse has been able to query Postgres for about two
| years already.
|
| https://clickhouse.com/docs/en/integrations/postgresql/postg...
|
| You can connect to Postgres using a table function:
|
| SELECT ... FROM postgresql(...)
|
| You can create a table with ENGINE = PostgreSQL and use it like
| a normal table.
|
| You can create a database with ENGINE = PostgreSQL and it will
| represent all the tables from the PostgreSQL database.
|
| And finally, you can replicate data from PostgreSQL in realtime
| with CDC into ClickHouse tables.
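|
| Concretely, the table function and the engine take connection
| details like this (all values below are placeholders):
|
|     SELECT * FROM postgresql('pg-host:5432', 'db', 'tbl', 'user', 'pass');
|
|     CREATE TABLE t (id UInt32)
|       ENGINE = PostgreSQL('pg-host:5432', 'db', 'tbl', 'user', 'pass');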
| zX41ZdbW wrote:
| Similar features are available for MySQL, MongoDB and SQLite.
| zX41ZdbW wrote:
| What is surprising: if you query MySQL from ClickHouse, the
| queries often run faster than in MySQL itself, despite the
| fact that the data is simply read from MySQL.
|
| And two more use-cases:
|
| - you can create a table in ClickHouse pointing to multiple
| MySQL servers for sharding and failover, so ClickHouse can be
| used to query sharded MySQL;
|
| - you can create a key-value dictionary on top of MySQL
| tables and use it in JOINs.
| klysm wrote:
| Does the inverse of this also exist?
| 1egg0myegg0 wrote:
| You can use the DuckDB FDW (linked in the blog post) for this!
| https://github.com/alitrack/duckdb_fdw
| nlnn wrote:
| Unofficially maybe? There's a 3rd party duckdb_fdw plugin for
| postgres that allows querying DuckDB from Postgres:
| https://github.com/alitrack/duckdb_fdw
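|
| Setup is the usual FDW dance, something like this rough,
| untested sketch (server name and database path are made up):
|
|     CREATE EXTENSION duckdb_fdw;
|     CREATE SERVER duckdb_srv FOREIGN DATA WRAPPER duckdb_fdw
|       OPTIONS (database '/path/to/analytics.duckdb');
|     IMPORT FOREIGN SCHEMA public FROM SERVER duckdb_srv INTO public;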
| m_ke wrote:
| This might end up being the best way to ETL Postgres tables
| to Parquet. Of everything else that I tried, doing a copy to
| CSV and then converting to Parquet was the fastest, but it
| can be a pain when dealing with type conversions.
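|
| With the new scanner the whole pipeline could collapse into
| one statement, something like (untested sketch; the
| connection string and names are placeholders):
|
|     COPY (SELECT * FROM postgres_scan('dbname=mydb', 'public', 'events'))
|       TO 'events.parquet' (FORMAT PARQUET);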
| orthecreedence wrote:
| I didn't know DuckDB could read parquet natively[0]. Thanks for
| mentioning this. I recently had the displeasure of trying to
| deal with parquet files in nodejs and it was a disaster. I
| might try going through DuckDB instead.
|
| [0]: https://duckdb.org/docs/data/parquet
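|
| Reading one back is a one-liner (the file name here is a
| placeholder):
|
|     SELECT * FROM read_parquet('data.parquet');
|
| A bare SELECT * FROM 'data.parquet' works too.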
| zX41ZdbW wrote:
| It is very easy with ClickHouse. All you need is:
|
| SELECT ... FROM postgresql(...) FORMAT Parquet
|
| And you can run this query without installing ClickHouse, using
| the clickhouse-local command-line tool.
|
| It can be downloaded simply as:
|
| curl https://clickhouse.com/ | sh
| exyi wrote:
| Exactly, I almost started writing a tool for that, because
| nothing else I found supports arrays and composite types. Now
| let's hope DuckDB does support that :)
|
| If you don't use arrays and composites, Spark should be able to
| do it, right?
| 1egg0myegg0 wrote:
| Yes, DuckDB supports writing lists and structs to Parquet!
| https://github.com/duckdb/duckdb/pull/2832
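|
| For example, something like this quick sketch should write a
| list and a struct column to Parquet:
|
|     COPY (SELECT [1, 2, 3] AS xs, {'a': 1, 'b': 'two'} AS s)
|       TO 'nested.parquet' (FORMAT PARQUET);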
|
| Does that help or do you have any other questions?
___________________________________________________________________
(page generated 2022-09-30 23:01 UTC)