[HN Gopher] Faster PostgresSQL to BigQuery Transfers
___________________________________________________________________
Faster PostgresSQL to BigQuery Transfers
Author : fhk
Score : 106 points
Date : 2023-01-10 11:33 UTC (11 hours ago)
(HTM) web link (tech.marksblogg.com)
(TXT) w3m dump (tech.marksblogg.com)
| legedemon wrote:
| For someone whose interaction with spatial data is very limited,
| I found the article to be a treasure trove of information.
|
| Also, thanks for sharing S2! It'll be nice to look at.
| fhk wrote:
| Great to share the combination of tech here and very interested
| to see how others are ingesting spatial data at scale
| michaericalribo wrote:
| Is there a reason not to used federated queries to hit postgres
| directly?
| vgt wrote:
| Federation is slow
|
| (i was the pm for bigquery federation with gcs and cloudsql)
| boundlessdreamz wrote:
| Does anyone know what tools can be used to stream the result of a
| mongodb query into bigquery?
| jd_mongodb wrote:
| See https://cloud.google.com/blog/products/data-
| analytics/mongod...
| boundlessdreamz wrote:
| Thank you!
| jd_mongodb wrote:
| See also https://www.youtube.com/watch?v=9MFv8J7Smws
| pantsforbirds wrote:
| I really love working with parquet and the general arrow
| ecosystem. The performance and cost ratios you can get out of it
| are really insane. AWS S3 + parquet + athena is one of the best
| and cheapest databases I've ever used.
| contol-m wrote:
| [I work on BigQuery]
|
| BigQuery also supports in-place querying of datasets on GCS (or
| S3/Azure using Omni) via external/BigLake tables.
| https://cloud.google.com/bigquery/docs/query-cloud-storage-u...
| yolo3000 wrote:
| Nice to read this, I had a similar type of assignment 15 years
| ago, visualize the rollout of the fiber optics network across the
| city. But we had a lot less data to deal with.
| vidarh wrote:
| Back when I was working with shapefiles, it was the type of
| things that tended to be far more convenient to process in-
| process using something like GDAL [1] (which can operate directly
| on an in memory copy, gzip files, sqlite databases and far more)
| and query it with GDAL's SQL support, especially when build with
| Spatialite [2] rather than loading it into a separate database.
| It'd have been interesting if the author had talked about what's
| stopping him from that approach given he's clearly aware of GDAL
| and given that 130M records and a few tens of GB isn't a
| particularly big GIS dataset.
|
| [1] https://gdal.org
|
| [2] https://www.gaia-gis.it/fossil/libspatialite/index
| marklit wrote:
| I'll take a look into it.
|
| Last year I found ClickHouse did H3 enrichment much faster than
| PG or BQ so it has become a go-between for larger workloads
| with simple enrichment requirements.
|
| The dataset in the post is an example anyone can download. I
| handle 30B+ records at work.
| qoega wrote:
| >It would also be good to see GEOS, GDAL and PROJ integrated
| into ClickHouse.
|
| You can just add feature requests like this to ClickHouse
| issue tracker.
| https://github.com/ClickHouse/ClickHouse/issues/45129
| marklit wrote:
| Done :)
|
| https://github.com/ClickHouse/ClickHouse/issues/45130
| api wrote:
| Anything to go the other way? I'd like to use BQ to warehouse and
| be able to examine but PG to do heavy analytics due to the cost
| once you really start doing many repeated queries.
|
| I guess I could just dump directly to CSVs and download but BQ is
| a nice convenient bottomless data bucket.
| buremba wrote:
| You can utilize BI engine for this exact purpose:
| https://cloud.google.com/bigquery/docs/bi-engine-query
| mritchie712 wrote:
| do you already have an ETL pipeline to BQ? You could use
| Airbyte[0] for pg to bq. It's pretty heavy if this is your only
| source, but should work well if you need to query the pg data
| in bq frequently.
|
| 0 - https://airbyte.com/connections/PostgreSQL-to-BigQuery
| rockostrich wrote:
| They said "the other way" so BQ to Postgres.
| muspimerol wrote:
| BigQuery and Postgres are both sources and destinations in
| Airbyte, the data can flow either way.
| api wrote:
| Airbyte looks nice! thanks.
| ctc24 wrote:
| Very cool to see a walkthrough with actual benchmarks. Not
| entirely surprised that Parquet shines here. Another big
| advantage of Parquet over CSV is that you don't have to worry
| about data integrity. Perhaps less relevant for GIS data, but not
| having to think about things like string escaping is rather nice.
|
| _" It would be great to see data vendors deliver data straight
| into the Cloud Databases of their customers. It would save a lot
| of client time that's spent converting and uploading files."_
|
| Hear hear! Shameless plug: this is exactly what we enable at
| prequel.co. If there are any data vendors reading this, or anyone
| who wants easier access to data from their vendor, we're here to
| help.
|
| edit: quote fmt
___________________________________________________________________
(page generated 2023-01-10 23:01 UTC)