[HN Gopher] Querying Postgres Tables Directly from DuckDB
___________________________________________________________________
Querying Postgres Tables Directly from DuckDB
Author : hfmuehleisen
Score : 99 points
Date : 2022-09-30 14:59 UTC (8 hours ago)
(HTM) web link (duckdb.org)
(TXT) w3m dump (duckdb.org)
| samwillis wrote:
| DuckDB is brilliant. Something I'm working on uses SQLite;
| there is an analytical search users can perform that scans
| most of the DB with multiple joins. It would take about 45
| seconds to run the query. I exported the dataset and imported
| it into DuckDB. The exact same SQL query now takes only 1.5
| seconds.
|
| It really shows how much better a column store is for certain
| queries.
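|
| In case it's useful to anyone trying the same migration:
| DuckDB can scan a SQLite file directly via its sqlite
| extension, so a rough sketch (file and table names here are
| placeholders) is:
|
|     INSTALL sqlite; LOAD sqlite;
|     CREATE TABLE events AS
|       SELECT * FROM sqlite_scan('app.db', 'events');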
| galaxyLogic wrote:
| I wasn't aware of DuckDB, but now that I've read about it I
| wonder whether I should replace my SQLite drivers with it,
| since it seems more versatile?
| astrospective wrote:
| It's pretty cool, but I wouldn't switch over unless you know
| you need a column store approach.
| mhuffman wrote:
| I would say maybe don't switch over if you need something
| that has been rock-solid for years and years.
|
| If you want a shit-ton of built-in features, DuckDB is pretty
| damn fast even with non-column-store-optimized queries.
| Surprisingly so! I have done side-by-side tests and you
| really have to have a lot of rows for it to start to make a
| difference. And their built-in functions are awesome. And
| they (company/group) are awesome and very responsive!
|
| Having said that, DuckDB is still young enough to make me
| nervous about putting it in critical application stacks, and
| I feel about SQLite the same way I feel about PostgreSQL ...
| it is solid as hell and will just work.
|
| I will say that I have not had any stability issues in any of
| my personal projects that I use DuckDB in; it is just that I
| know for a fact that SQLite is solid.
| tkhattra wrote:
| one thing to be aware of is that duckdb doesn't currently allow
| concurrent readers and writers (unlike sqlite). depending on
| your use case, that might be a showstopper.
| revision17 wrote:
| DuckDB is awesome! I find it the easiest way to ingest data from
| various sources then query it into a form I can do analytics on.
|
| The datasets I work on are a bit too big for pandas, but
| Spark is way overkill for them. DuckDB lets me efficiently
| work on them using only a single computer.
| jxi wrote:
| xnx wrote:
| SQLite and DuckDB are excellent demonstrations that the computers
| we have on our desks (and even in our pockets) are ridiculously
| fast and capable for almost everything but the largest workloads.
| They're a stark contrast to framework bloat that has stolen our
| CPU cycles and given us perceptible lag when typing and
| scrolling.
| somenameforme wrote:
| It makes one wonder what the software world would look like
| if the whole notion of 'premature optimization is the root
| of all evil' never existed. Because that advice, well
| intentioned and
| perfectly reasonable when applied thoughtfully, gradually
| morphed into 'optimization is the root of all evil'
| contributing to this ongoing race between computers speeding up
| and software slowing down.
|
| And a cynic would observe that this scenario was almost
| certainly not unwelcome to the top players on either side.
| It's much easier to require what not long ago would have been
| considered a supercomputer just to run a word processor than
| it is to create scenarios where such power is meaningfully
| applied in a mass-market product.
| rgblambda wrote:
| I don't think the idea morphed into "any optimization is
| evil". But bloat is the unfortunate consequence of leaving
| optimization until after functional requirements are met.
| Same with any kind of tech debt: a mentality of "let's just
| get this out the door now and fix it later" results in
| "later" meaning never.
| [deleted]
| ClassicOrgin wrote:
| I was looking into exporting a database from DuckDB to
| Postgres for use with Google Data Studio, but whenever I
| exported, DuckDB was attaching decimals to integers. This was
| throwing off the entire import into Postgres. Think I'll just
| try this out.
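|
| A possible workaround (untested sketch; the column and table
| names are made up): cast explicitly while exporting so the
| integer type survives the trip, e.g.
|
|     COPY (SELECT CAST(id AS BIGINT) AS id FROM my_table)
|       TO 'my_table.csv' (HEADER);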
| isoprophlex wrote:
| Nice job, this will be useful to many, I think. The DuckDB
| team sure seems to have a knack for figuring out what their
| audience needs.
| eatonphil wrote:
| The way they do this read is interesting, and so is the
| explanation of why they don't just read the files on disk
| directly.
|
| > The Postgres Scanner uses the standard libpq library, which it
| statically links in. Ironically, this makes the Postgres Scanner
| easier to install than the other Postgres clients. However,
| Postgres' normal client-server protocol is quite slow, so we
| spent quite some time optimizing this. As a note, DuckDB's SQLite
| Scanner does not face this issue, as SQLite is also an in-process
| database.
|
| > We actually implemented a prototype direct reader for Postgres'
| database files, but while performance was great, there is the
| issue that committed but not yet checkpointed data would not be
| stored in the heap files yet. In addition, if a checkpoint was
| currently running, our reader would frequently overtake the
| checkpointer, causing additional inconsistencies. We abandoned
| that approach since we want to be able to query an actively used
| Postgres database and believe that consistency is important.
| Another architectural option would have been to implement a
| DuckDB Foreign Data Wrapper (FDW) for Postgres similar to
| duckdb_fdw but while this could improve the protocol situation,
| deployment of a postgres extension is quite risky on production
| servers so we expect few people will be able to do so.
|
| > Instead, we use the rarely-used binary transfer mode of the
| Postgres client-server protocol. This format is quite similar to
| the on-disk representation of Postgres data files and avoids some
| of the otherwise expensive to-string and from-string conversions.
| For example, to read a normal int32 from the protocol message,
| all we need to do is to swap byte order (ntohl).
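|
| For reference, driving the scanner from the DuckDB side looks
| roughly like this (connection string details will vary by
| setup):
|
|     INSTALL postgres_scanner; LOAD postgres_scanner;
|     CALL postgres_attach('dbname=mydb user=postgres host=127.0.0.1');
|     SELECT * FROM postgres_scan('dbname=mydb', 'public', 'mytable');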
| latenightcoding wrote:
| Wonder if there is anything like this for ClickHouse.
|
| edit: to query pg from clickhouse not clickhouse from duckdb
| zX41ZdbW wrote:
| Yes, ClickHouse has been able to query Postgres for about two
| years already.
|
| https://clickhouse.com/docs/en/integrations/postgresql/postg...
|
| You can connect to Postgres using a table function:
|
| SELECT ... FROM postgresql(...)
|
| You can create a table with ENGINE = PostgreSQL and use it like
| a normal table.
|
| You can create a database with ENGINE = PostgreSQL and it will
| represent all the tables from the PostgreSQL database.
|
| And finally, you can replicate data from PostgreSQL in realtime
| with CDC into ClickHouse tables.
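|
| Concretely, the table function and the engine take connection
| details like this (all values below are placeholders):
|
|     SELECT * FROM postgresql('pg-host:5432', 'db', 'tbl', 'user', 'pass');
|
|     CREATE TABLE t (id UInt32)
|       ENGINE = PostgreSQL('pg-host:5432', 'db', 'tbl', 'user', 'pass');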
| zX41ZdbW wrote:
| Similar features are available for MySQL, MongoDB and SQLite.
| zX41ZdbW wrote:
| What is surprising: if you query MySQL from ClickHouse, the
| queries often run faster than in MySQL itself, despite the
| fact that the data is simply read from MySQL.
|
| And two more use-cases:
|
| - you can create a table in ClickHouse pointing to multiple
| MySQL servers for sharding and failover, so ClickHouse can be
| used to query sharded MySQL;
|
| - you can create a key-value dictionary on top of MySQL
| tables and use it in JOINs.
| klysm wrote:
| Does the inverse of this also exist?
| 1egg0myegg0 wrote:
| You can use the DuckDB FDW (linked in the blog post) for this!
| https://github.com/alitrack/duckdb_fdw
| nlnn wrote:
| Unofficially maybe? There's a 3rd party duckdb_fdw plugin for
| postgres that allows querying DuckDB from Postgres:
| https://github.com/alitrack/duckdb_fdw
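|
| Setup is the usual FDW dance, something like this rough,
| untested sketch (server name and database path are made up):
|
|     CREATE EXTENSION duckdb_fdw;
|     CREATE SERVER duckdb_srv FOREIGN DATA WRAPPER duckdb_fdw
|       OPTIONS (database '/path/to/analytics.duckdb');
|     IMPORT FOREIGN SCHEMA public FROM SERVER duckdb_srv INTO public;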
| m_ke wrote:
| This might end up being the best way to ETL Postgres tables
| to Parquet. Of everything else that I tried, doing a copy to
| CSV and then converting to Parquet was the fastest, but it
| can be a pain when dealing with type conversions.
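|
| With the new scanner the whole pipeline could collapse into
| one statement, something like (untested sketch; the
| connection string and names are placeholders):
|
|     COPY (SELECT * FROM postgres_scan('dbname=mydb', 'public', 'events'))
|       TO 'events.parquet' (FORMAT PARQUET);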
| orthecreedence wrote:
| I didn't know DuckDB could read parquet natively[0]. Thanks for
| mentioning this. I recently had the displeasure of trying to
| deal with parquet files in nodejs and it was a disaster. I
| might try going through DuckDB instead.
|
| [0]: https://duckdb.org/docs/data/parquet
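|
| Reading one back is a one-liner (the file name here is a
| placeholder):
|
|     SELECT * FROM read_parquet('data.parquet');
|
| A bare SELECT * FROM 'data.parquet' works too.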
| zX41ZdbW wrote:
| It is very easy with ClickHouse. All you need is:
|
| SELECT ... FROM postgresql(...) FORMAT Parquet
|
| And you can run this query without installing ClickHouse, using
| the clickhouse-local command-line tool.
|
| It can be downloaded simply as:
|
| curl https://clickhouse.com/ | sh
| exyi wrote:
| Exactly, I almost started writing a tool for that, because
| nothing else I found supports arrays and composite types. Now
| let's hope DuckDB does support that :)
|
| If you don't use arrays and composites, Spark should be able to
| do it, right?
| 1egg0myegg0 wrote:
| Yes, DuckDB supports writing lists and structs to Parquet!
| https://github.com/duckdb/duckdb/pull/2832
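|
| For example, something like this quick sketch should write a
| list and a struct column to Parquet:
|
|     COPY (SELECT [1, 2, 3] AS xs, {'a': 1, 'b': 'two'} AS s)
|       TO 'nested.parquet' (FORMAT PARQUET);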
|
| Does that help or do you have any other questions?
___________________________________________________________________
(page generated 2022-09-30 23:01 UTC)