[HN Gopher] Faster PostgreSQL to BigQuery Transfers
       ___________________________________________________________________
        
       Faster PostgreSQL to BigQuery Transfers
        
       Author : fhk
       Score  : 106 points
       Date   : 2023-01-10 11:33 UTC (11 hours ago)
        
 (HTM) web link (tech.marksblogg.com)
 (TXT) w3m dump (tech.marksblogg.com)
        
       | legedemon wrote:
       | As someone whose interaction with spatial data is very limited,
       | I found the article to be a treasure trove of information.
       | 
       | Also, thanks for sharing S2! It'll be nice to look at.
        
       | fhk wrote:
       | Great to share this combination of tech here. Very interested
       | to see how others are ingesting spatial data at scale.
        
       | michaericalribo wrote:
       | Is there a reason not to use federated queries to hit Postgres
       | directly?
        
         | vgt wrote:
         | Federation is slow
         | 
         | (I was the PM for BigQuery federation with GCS and Cloud SQL.)
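         | 
         | For reference, Cloud SQL federated queries go through
         | EXTERNAL_QUERY, which pulls every requested row through the
         | connection at query time. A minimal sketch with the BigQuery
         | Python client; the connection ID and table names below are
         | made up:
         | 
         |     # Sketch: run a Cloud SQL federated query via EXTERNAL_QUERY.
         |     # Connection ID and table names are placeholders.
         |     from google.cloud import bigquery
         | 
         |     client = bigquery.Client()
         | 
         |     sql = """
         |     SELECT *
         |     FROM EXTERNAL_QUERY(
         |       'projects/my-project/locations/us/connections/my-pg-conn',
         |       'SELECT id, geom FROM parcels LIMIT 10')
         |     """
         | 
         |     for row in client.query(sql).result():
         |         print(dict(row))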
        
       | boundlessdreamz wrote:
       | Does anyone know what tools can be used to stream the result of
       | a MongoDB query into BigQuery?
        
         | jd_mongodb wrote:
         | See https://cloud.google.com/blog/products/data-
         | analytics/mongod...
        
           | boundlessdreamz wrote:
           | Thank you!
        
           | jd_mongodb wrote:
           | See also https://www.youtube.com/watch?v=9MFv8J7Smws
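           | 
           | Beyond the linked options, a rough sketch of doing it by
           | hand with pymongo and the BigQuery Python client; the
           | connection string, database, collection and table names
           | below are all placeholders:
           | 
           |     # Sketch: load a MongoDB query result into a BigQuery table.
           |     from pymongo import MongoClient
           |     from google.cloud import bigquery
           | 
           |     mongo = MongoClient("mongodb://localhost:27017")
           |     bq = bigquery.Client()
           | 
           |     rows = [
           |         {k: v for k, v in doc.items() if k != "_id"}  # drop ObjectId
           |         for doc in mongo.mydb.events.find({"type": "click"})
           |     ]
           | 
           |     job = bq.load_table_from_json(
           |         rows,
           |         "my-project.my_dataset.events",
           |         job_config=bigquery.LoadJobConfig(autodetect=True),
           |     )
           |     job.result()  # wait for the load job to finish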
        
       | pantsforbirds wrote:
       | I really love working with parquet and the general arrow
       | ecosystem. The performance and cost ratios you can get out of it
       | are really insane. AWS S3 + parquet + athena is one of the best
       | and cheapest databases I've ever used.
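       | 
       | As a rough illustration of why that stack is cheap, a sketch of
       | scanning a Parquet dataset on S3 with pyarrow (the bucket,
       | prefix and column names are made up, and S3 credentials are
       | assumed). Only the columns you ask for need to be read, and
       | row-group statistics let the filter skip much of the rest.
       | 
       |     # Sketch: scan Parquet on S3 with column pruning and a
       |     # pushed-down row filter. All names are placeholders.
       |     import pyarrow.dataset as ds
       | 
       |     dataset = ds.dataset("s3://my-bucket/parcels/", format="parquet")
       | 
       |     table = dataset.to_table(
       |         columns=["parcel_id", "geom_wkt"],
       |         filter=ds.field("state") == "CA",
       |     )
       |     print(table.num_rows)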
        
         | contol-m wrote:
         | [I work on BigQuery]
         | 
         | BigQuery also supports in-place querying of datasets on GCS (or
         | S3/Azure using Omni) via external/BigLake tables.
         | https://cloud.google.com/bigquery/docs/query-cloud-storage-u...
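         | 
         | A minimal sketch of setting one up over Parquet on GCS with
         | the Python client; the project, dataset and bucket names are
         | placeholders:
         | 
         |     # Sketch: define an external table over GCS Parquet and
         |     # query it in place, without loading into BigQuery storage.
         |     from google.cloud import bigquery
         | 
         |     client = bigquery.Client()
         | 
         |     external_config = bigquery.ExternalConfig("PARQUET")
         |     external_config.source_uris = ["gs://my-bucket/parcels/*.parquet"]
         | 
         |     table = bigquery.Table("my-project.my_dataset.parcels_ext")
         |     table.external_data_configuration = external_config
         |     client.create_table(table)
         | 
         |     query = "SELECT COUNT(*) AS n FROM my_dataset.parcels_ext"
         |     print(list(client.query(query).result()))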
        
       | yolo3000 wrote:
       | Nice to read this, I had a similar type of assignment 15 years
       | ago, visualize the rollout of the fiber optics network across the
       | city. But we had a lot less data to deal with.
        
       | vidarh wrote:
       | Back when I was working with shapefiles, they were the kind of
       | thing that tended to be far more convenient to process
       | in-process with something like GDAL [1] (which can operate
       | directly on an in-memory copy, gzipped files, SQLite databases
       | and far more) and to query with GDAL's SQL support, especially
       | when built with SpatiaLite [2], rather than to load into a
       | separate database. It would have been interesting if the author
       | had talked about what's stopping him from that approach, given
       | he's clearly aware of GDAL and given that 130M records and a
       | few tens of GB isn't a particularly big GIS dataset.
       | 
       | [1] https://gdal.org
       | 
       | [2] https://www.gaia-gis.it/fossil/libspatialite/index
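       | 
       | For anyone who hasn't used it, a minimal sketch of that
       | in-process approach from Python; the file, layer and column
       | names are made up, and the SQLITE dialect is what routes
       | through SpatiaLite when GDAL is built with it:
       | 
       |     # Sketch: query a shapefile in-process with OGR's SQL support.
       |     from osgeo import ogr
       | 
       |     ds = ogr.Open("parcels.shp")
       |     layer = ds.ExecuteSQL(
       |         "SELECT county, COUNT(*) AS n FROM parcels GROUP BY county",
       |         dialect="SQLITE",
       |     )
       |     for feature in layer:
       |         print(feature.GetField("county"), feature.GetField("n"))
       |     ds.ReleaseResultSet(layer)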
        
         | marklit wrote:
         | I'll take a look into it.
         | 
         | Last year I found ClickHouse did H3 enrichment much faster than
         | PG or BQ so it has become a go-between for larger workloads
         | with simple enrichment requirements.
         | 
         | The dataset in the post is an example anyone can download. I
         | handle 30B+ records at work.
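         | 
         | One way that kind of enrichment can look in ClickHouse, as a
         | rough sketch using the clickhouse-driver package; the table
         | and column names are made up, and geoToH3 takes longitude,
         | latitude and a resolution:
         | 
         |     # Sketch: materialise an H3 index per point in ClickHouse.
         |     from clickhouse_driver import Client
         | 
         |     ch = Client("localhost")
         |     ch.execute("""
         |         CREATE TABLE IF NOT EXISTS points_h3
         |         ENGINE = MergeTree ORDER BY h3
         |         AS SELECT id, lon, lat, geoToH3(lon, lat, 7) AS h3
         |         FROM points
         |     """)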
        
           | qoega wrote:
           | >It would also be good to see GEOS, GDAL and PROJ integrated
           | into ClickHouse.
           | 
           | You can just add feature requests like this to the
           | ClickHouse issue tracker.
           | https://github.com/ClickHouse/ClickHouse/issues/45129
        
             | marklit wrote:
             | Done :)
             | 
             | https://github.com/ClickHouse/ClickHouse/issues/45130
        
       | api wrote:
       | Anything to go the other way? I'd like to use BQ as the
       | warehouse where I can examine the data, but use PG for the
       | heavy analytics, given BQ's cost once you really start running
       | many repeated queries.
       | 
       | I guess I could just dump directly to CSVs and download but BQ is
       | a nice convenient bottomless data bucket.
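       | 
       | One rough way to script the reverse direction from Python; the
       | table names and connection details below are placeholders, and
       | for big tables an extract job to GCS plus COPY would be a
       | better fit than a round trip through pandas:
       | 
       |     # Sketch: pull a BigQuery result into a DataFrame, then push
       |     # it into Postgres. Needs pandas, db-dtypes and SQLAlchemy.
       |     from google.cloud import bigquery
       |     from sqlalchemy import create_engine
       | 
       |     bq = bigquery.Client()
       |     pg = create_engine("postgresql://user:pass@localhost:5432/analytics")
       | 
       |     df = bq.query("SELECT * FROM my_dataset.daily_rollup").to_dataframe()
       |     df.to_sql("daily_rollup", pg, if_exists="replace", index=False)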
        
         | buremba wrote:
         | You can utilize BI engine for this exact purpose:
         | https://cloud.google.com/bigquery/docs/bi-engine-query
        
         | mritchie712 wrote:
         | Do you already have an ETL pipeline to BQ? You could use
         | Airbyte [0] for PG to BQ. It's pretty heavy if this is your
         | only source, but it should work well if you need to query the
         | PG data in BQ frequently.
         | 
         | 0 - https://airbyte.com/connections/PostgreSQL-to-BigQuery
        
           | rockostrich wrote:
           | They said "the other way" so BQ to Postgres.
        
             | muspimerol wrote:
             | BigQuery and Postgres are both sources and destinations
             | in Airbyte, so the data can flow either way.
        
           | api wrote:
           | Airbyte looks nice! Thanks.
        
       | ctc24 wrote:
       | Very cool to see a walkthrough with actual benchmarks. Not
       | entirely surprised that Parquet shines here. Another big
       | advantage of Parquet over CSV is that you don't have to worry
       | about data integrity. Perhaps less relevant for GIS data, but not
       | having to think about things like string escaping is rather nice.
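       | 
       | A tiny illustration of the escaping point (a sketch with
       | pyarrow): strings containing commas, quotes and newlines
       | round-trip through Parquet untouched, with no quoting rules to
       | get right.
       | 
       |     import pyarrow as pa
       |     import pyarrow.parquet as pq
       | 
       |     table = pa.table({"name": ['O"Hare, Terminal 1', "line\nbreak"]})
       |     pq.write_table(table, "names.parquet")
       |     assert pq.read_table("names.parquet").equals(table)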
       | 
       |  _" It would be great to see data vendors deliver data straight
       | into the Cloud Databases of their customers. It would save a lot
       | of client time that's spent converting and uploading files."_
       | 
       | Hear hear! Shameless plug: this is exactly what we enable at
       | prequel.co. If there are any data vendors reading this, or anyone
       | who wants easier access to data from their vendor, we're here to
       | help.
       | 
       | edit: quote fmt
        
       ___________________________________________________________________
       (page generated 2023-01-10 23:01 UTC)