[HN Gopher] Datastream for BigQuery Preview
___________________________________________________________________
Datastream for BigQuery Preview
Author : rntn
Score : 87 points
Date : 2022-09-16 16:04 UTC (6 hours ago)
(HTM) web link (cloud.google.com)
(TXT) w3m dump (cloud.google.com)
| mousetree wrote:
| Finally! We use both Cloud SQL (Postgres) and BigQuery. Given
| both are managed GCP products you'd expect a simple way to
| replicate Postgres into BigQuery. Instead we've had to pay for
| Stitch/Fivetran/Airbyte or some janky "BigQuery external
| connections" to do this. Very happy they're finally supporting an
| obvious use case. Surprised it's taken so long.
| mywittyname wrote:
| We've been using the external queries feature of BQ against
| Cloud SQL, then redirecting the output to a new table. It's not
| perfect, but it's functional, so long as your tables are very
| simple.
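|
| For anyone curious, the shape is roughly this (a sketch, not an
| exact setup; the connection ID and table names are made up):
|
|     # Sketch of the "external query -> new table" pattern above.
|     # The connection ID, dataset, and table names are illustrative.
|     from google.cloud import bigquery
|
|     client = bigquery.Client()
|     sql = """
|     CREATE OR REPLACE TABLE analytics.orders_snapshot AS
|     SELECT *
|     FROM EXTERNAL_QUERY(
|       'my-project.us.cloudsql-orders',  -- Cloud SQL connection (made up)
|       'SELECT id, customer_id, total, created_at FROM orders'
|     )
|     """
|     client.query(sql).result()  # materialize the federated result in BQ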
| jimmyechan wrote:
| On a related note, Salesforce and Snowflake announced a
| partnership for real-time data sharing. Not exactly the same, but
| it's interesting to see data warehouse companies building
| upstream. I wonder if there's a trend going on here where data
| warehouse companies start natively integrating with more
| databases, apps, and other data sources in (near) real-time
|
| Edit. Link to announcement:
| https://techcrunch.com/2022/09/15/salesforce-snowflake-partn...
| mywittyname wrote:
| Yes, I think this will be the case. Google makes it very easy
| to integrate with customers' BigQuery instances. Have the
| customer supply a service account token and a target dataset,
| then run a query with the output table configured in that
| target dataset. Done.
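|
| In Python that's roughly the following (a sketch; the project,
| dataset, and table names are placeholders, and a key file stands
| in for the "token"):
|
|     # Run a query and land the results in the customer's dataset.
|     # All identifiers below are placeholders.
|     from google.cloud import bigquery
|     from google.oauth2 import service_account
|
|     creds = service_account.Credentials.from_service_account_file(
|         "customer-sa.json")
|     client = bigquery.Client(credentials=creds, project="customer-project")
|
|     job_config = bigquery.QueryJobConfig(
|         destination="customer-project.target_dataset.daily_export",
|         write_disposition="WRITE_TRUNCATE",
|     )
|     client.query("SELECT * FROM `source_dataset.daily_metrics`",
|                  job_config=job_config).result()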
| gigatexal wrote:
| Managed Debezium?
| ta3411 wrote:
| This seems like an ideal use case for us. I have a naive sketch
| of my workflow in mind: can someone please comment if I am off
| track?
|
| I am building an e-commerce product on AWS PostgreSQL. Every day,
| I want to be able to do analytics on order volume, new customers,
| etc.
|
| - For us to track internally: we fire client and backend events
| into Amplitude
|
| - For sellers to track: we directly query PostgreSQL to export
|
| Now with this, I am thinking of constantly streaming our SQL
| table to BigQuery. And any analysis can be done on top of this
| BigQuery instance across both internal tracking and external
| export.
|
| Is Redshift the AWS equivalent of this?
| te_chris wrote:
| BQ is proper SaaS vs Redshift, where you have to pick instance
| sizes etc. It's amazing, true superpower stuff in how little
| you have to think about it to get loads out of it.
| mchaynes wrote:
| basically yes. kinesis -> firehose -> s3
| faizshah wrote:
| As a heavy BQ user on my side projects, I haven't found a real
| alternative to BQ in AWS. Redshift does not provide a lot of the
| functionality and ease of use that BQ provides.
|
| That said the closest thing is Amazon Athena.
|
| The architecture would basically be Kinesis -> S3 <- Athena,
| where S3 is your data lake, or you can do it like AWS DMS -> S3
| <- Athena.
|
| To accomplish this or the Redshift solution you need to
| implement change data capture from your relational DB. For that
| you can use AWS Database Migration Service, like this for
| Redshift: https://aws.amazon.com/blogs/apn/change-data-capture-
| from-on...
|
| Like this for kinesis: https://aws.amazon.com/blogs/big-
| data/stream-change-data-to-...
|
| The reason you may want to use Kinesis is that you can use
| Flink in Kinesis Data Analytics, just like you can use Dataflow
| in GCP, to aggregate some metrics before dumping them into your
| data lake/warehouse.
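|
| To give a feel for the Athena end of that setup, a rough sketch
| (the bucket, database, and table names are placeholders):
|
|     # Query the S3 data lake through Athena; all names are made up.
|     import boto3
|
|     athena = boto3.client("athena", region_name="us-east-1")
|     resp = athena.start_query_execution(
|         QueryString="SELECT order_date, COUNT(*) AS orders "
|                     "FROM cdc.orders GROUP BY order_date",
|         QueryExecutionContext={"Database": "cdc"},
|         ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
|     )
|     print(resp["QueryExecutionId"])  # poll get_query_execution for status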
| pramsey wrote:
| No support for geometry or ranges, alas! "Datastream doesn't
| support columns of the Geometric data types or Range data types.
| The values in these columns will be replaced with NULL values."
| neeleshs wrote:
| Shameless plug: Syncari supports this as well, without most of
| the limits around backfills and schema changes.
| tehalex wrote:
| Hopefully it 'just works' and is easier than AWS's DMS.
|
| I use AWS DMS to replicate data from RDS => Redshift and it's a
| never-ending source of pain.
| johndfsgdgdfg wrote:
| Doesn't matter. Google will shut this service down. Moreover,
| Google is an ad company. It can never compete with the customer
| focus of Amazon.
| mritchie712 wrote:
| Doesn't Amazon have an ecomm business or something? Thought I
| heard something about that.
| brad0 wrote:
| I'm curious about customer use cases for this. Can anyone
| elaborate?
| skadamat wrote:
| A few come to mind!
|
| - A startup that's running analytics off their production
| Postgres DB can use this as a gateway drug to try out BigQuery.
| Bonus for not having to ETL the data themselves or pay Fivetran
| / host Airbyte to do it
|
| - Replace existing internally maintained sync job / ETL
| pipelines with this, if you're somewhat all-in on BigQuery /
| GCP for analytics (BQ is quite popular in analytics because of
| the pricing structure)
| dtauzell wrote:
| If you use some sort of ETL to copy data from your OLTP to data
| warehouse you might use this to stream data from your OLTP to
| Big Query so you can use Big Query as a data warehouse.
| jerglingu wrote:
| The lifting-and-dropping process from transactional stores into
| analytics stores is usually the first step in analytics data
| pipelines. This appears to fulfill that initial ingestion step
| specifically for Google BigQuery. A lot of the time these days
| this ingestion step is run through some third-party service:
| Fivetran, Stitch, Matillion, etc. Similar to AWS's offerings
| for Redshift, this looks like Google's "native" ingestion tool
| for BigQuery.
|
| I'd be curious to know how it handles the wrinkles Fivetran
| effectively solves (schema drift, logging, general performance)
| and if it does so at a much cheaper price.
| gfarah wrote:
| We have been using RDS -> DMS -> Kinesis -> BigQuery for 3 years
| now. Very little maintenance and has worked without issues. I
| doubt this service will be cost-competitive with a setup like
| ours. Perhaps for Cloud SQL -> BigQuery it will make sense?
| stingraycharles wrote:
| Kinesis -> BigQuery? How does that work? Or did you mean Kafka?
| faizshah wrote:
| Is this using Anthos? If not, aren't the bandwidth costs high
| for this solution from AWS -> GCP?
| polskibus wrote:
| I wonder how well it handles schema changes. Adding new
| nullable columns is easy to implement, but more complex changes
| can be a pain.
| kppullin wrote:
| We currently use the Kafka Connect BigQuery connector, along with
| Debezium, to "stream" both the "changelog"/"transaction log" and
| to "mirror" our RDBMS instances into BigQuery. While this works,
| it's been a fair amount of effort to iron out issues over time.
| We've also had to work around BigQuery limits, including
| exceeding the concurrent query limit (we switched to batch mode,
| which has its own issues) and the frequency of writes (we've had
| to throttle to flushing every minute, which is good enough, but
| we did have a use case for faster updates). We also have issues
| related to partitioning and clustering, and more...
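|
| For context, the Debezium side of a setup like this looks roughly
| as follows (a sketch; hostnames, credentials, and names are
| placeholders, and a real config needs a lot more tuning):
|
|     # Register a Debezium Postgres source connector through the
|     # Kafka Connect REST API. Everything below is a placeholder.
|     import json
|     import requests
|
|     connector = {
|         "name": "orders-postgres-cdc",
|         "config": {
|             "connector.class":
|                 "io.debezium.connector.postgresql.PostgresConnector",
|             "plugin.name": "pgoutput",
|             "database.hostname": "db.internal",
|             "database.port": "5432",
|             "database.user": "cdc_user",
|             "database.password": "********",
|             "database.dbname": "orders",
|             "database.server.name": "ordersdb",
|             "slot.name": "debezium_orders",
|             "table.include.list": "public.orders,public.customers",
|         },
|     }
|     requests.post("http://kafka-connect:8083/connectors",
|                   headers={"Content-Type": "application/json"},
|                   data=json.dumps(connector)).raise_for_status()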
|
| So seeing this as a potential replacement for the Kafka Connect
| BigQuery connector looked appealing. However, according to the
| docs and listed limitations
| (https://cloud.google.com/datastream/docs/sources-postgresql) it
| does not handle schema changes well, nor Postgres array types. Not
| that any of these tools handle this well, but given the open
| source bigquery connector, we've been able to work around this
| with customizations to the code. Hopefully they'll continue to
| iterate on the product and I'll be keeping an eye out.
| atwebb wrote:
| Yeah the Debezium connectors have some issues that really get
| in the way. I'm less familiar with BQ, but for some other DBs the
| data typing is really, really basic
|
| (case when string then varchar(4000/max))
|
| and similar. It looks like a relatively easy thing to
| incrementally improve.
| gunnarmorling wrote:
| Hey, Gunnar here from the Debezium team. I would love to
| learn more about those issues you encountered with Debezium.
| Perhaps it's something we could improve. So if you could
| share your feedback either here or on our mailing list
| (https://groups.google.com/g/debezium), that would be
| awesome. Thanks!
| kppullin wrote:
| For us, debezium has been working rather well in recent
| memory (there were a couple PRs I submitted that I'm proud
| of, even if tiny!). Most of the issues are on the Kafka
| connector side, whether it's the BigQuery sink, JDBC sink,
| or S3 sink.
|
| A couple of things that do pop to mind as they relate to
| Debezium include better capabilities around backups +
| restores and disaster recovery, and any further hints
| around schema changes. Admittedly I haven't looked at these
| areas for 6+ months so they may be improved.
| yumraj wrote:
| What do they mean by _near real-time_?
|
| Is it microseconds, milliseconds, seconds, minutes, hours ... ?
| rochoa wrote:
| From the documentation
| (https://cloud.google.com/datastream/docs/behavior-
| overview#p...):
|
| > When using the default stream settings, the latency from
| reading the data in the source to streaming it into the
| destination is between 10 and 120 seconds.
| lmkg wrote:
| This is a common thing for BigQuery to claim. It's quite fast
| for an analytics database, but it is not a transactional
| database. Even for static data, the simplest queries on the
| tiniest data still take 1/10th of a second or so, and realistic
| queries run in the 1-5s range.
|
| Basically you could use BigQuery to build tools supporting
| human-in-the-loop processes (BI reports, exec dashboards), and
| you could call those "real-time" to the humans involved. But it
| will not support transactional workloads, and it does not
| provide SLAs to support that. I don't know about this
| particular feature; I'm guessing seconds, but maybe minutes.
| yolo3000 wrote:
| They are also introducing a sort of search/indexing soon,
| which will probably speed up certain types of queries.
| carlineng wrote:
| I wonder if the underlying technology is from their acquisition
| of Alooma [1] several years back. High volume change data capture
| from an OLTP database to a data warehouse is a problem that seems
| simple on its face, but has a lot of nasty edge cases. Alooma did
| a decent job of handling the problem at moderate scale, so I'm
| cautiously optimistic about this announcement.
|
| [1]: https://techcrunch.com/2019/02/19/google-acquires-cloud-
| migr...
| base3 wrote:
| Definitely Alooma. It even uses Alooma's conventions for naming
| variables and staging tables.
| dominotw wrote:
| > problem that seems simple on its face, but has a lot of nasty
| edge cases.
|
| I highly recommend teams not roll their own solutions because
| of all these edge cases. My boss refused to pay for a vendor
| solution because it was an 'easy problem' according to him. He
| denied the whole team promotions because we spent a lot of time
| solving this 'easy problem', and outright asked us 'what's so
| hard about it that it deserves a promotion'. Worst person I've
| ever worked for.
| anonymousDan wrote:
| Well what are the hard parts then?
| mywittyname wrote:
| Keeping track of new records without hammering the database.
|
| Keeping track of updates without a dedicated column.
|
| Ditto for deletes.
|
| Schema drift.
|
| Managing datatype differences. E.g., Postgres supports
| complex datatypes that other DBs do not.
|
| Stored procedures.
|
| Basically any feature of a DB that is beyond the bare
| minimum necessary to call something a relational database is
| an edge case.
|
| I rolled a solution for Postgres -> BQ by hand and it was
| much more involved than I expected. I had to set up a read-
| only replica and completely copy over the databases each
| run, which is only possible because they are so tiny.
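|
| The "copy everything each run" version was basically this shape
| (heavily simplified; the connection string and table names are
| made up):
|
|     # Full-copy sync from a Postgres read replica into BigQuery.
|     # Only viable for tiny tables; all identifiers are placeholders.
|     import pandas as pd
|     import sqlalchemy
|     from google.cloud import bigquery
|
|     engine = sqlalchemy.create_engine(
|         "postgresql://readonly@replica.internal/appdb")
|     bq = bigquery.Client()
|
|     for table in ["orders", "customers", "line_items"]:
|         df = pd.read_sql_table(table, engine)  # full scan of the replica
|         job = bq.load_table_from_dataframe(
|             df,
|             f"analytics.raw_{table}",
|             job_config=bigquery.LoadJobConfig(
|                 write_disposition="WRITE_TRUNCATE"),
|         )
|         job.result()  # overwrite the BigQuery copy on every run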
| carlineng wrote:
| Just to name a few off the top of my head --
|
| - Data type mismatches between systems
|
| - Differences in handling ambiguous or bad data (e.g., null
| characters)
|
| - Handling backfills
|
| - Handling table schema changes
|
| - Writing merge queries to handle deletes/updates in a
| cost-effective way (see the sketch below)
|
| - Scrubbing the binlog of PII or other information that
| shouldn't make its way into the data warehouse
|
| - Determining which tables to replicate, and which to leave
| behind
|
| - Ability to replay the log from a point-in-time in case of
| an outage or other incident
|
| And I'm sure there are a lot more I'm not thinking of. None
| of these are terribly difficult in isolation, but there's a
| long tail of issues like these that need to be solved.
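|
| To make the merge-query bullet concrete, the shape is roughly
| this (a sketch; table and column names are made up):
|
|     # Apply a batch of CDC changes to the warehouse copy with one
|     # MERGE statement. All identifiers are illustrative.
|     from google.cloud import bigquery
|
|     client = bigquery.Client()
|     merge_sql = """
|     MERGE analytics.orders AS t
|     USING staging.orders_changes AS s
|     ON t.id = s.id
|     WHEN MATCHED AND s.op = 'DELETE' THEN
|       DELETE
|     WHEN MATCHED THEN
|       UPDATE SET status = s.status, total = s.total,
|                  updated_at = s.updated_at
|     WHEN NOT MATCHED AND s.op != 'DELETE' THEN
|       INSERT (id, status, total, updated_at)
|       VALUES (s.id, s.status, s.total, s.updated_at)
|     """
|     client.query(merge_sql).result()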
___________________________________________________________________
(page generated 2022-09-16 23:00 UTC)