[HN Gopher] Show HN: Peerdb Streams - Simple, native Postgres ch...
___________________________________________________________________
Show HN: Peerdb Streams - Simple, native Postgres change data
capture
Hello HN, I am Sai Srirampur, one of the co-founders of PeerDB
(https://github.com/PeerDB-io/peerdb). We spent the past 7 months
building a solid experience to replicate data from Postgres to data
warehouses. Now we're expanding to queues. PeerDB Streams provides
a simple and native way to replicate changes as they happen in
Postgres to queues (Kafka, Redpanda, Google PubSub, etc). We use
Postgres logical decoding to enable Change Data Capture (CDC).

Blog post here:
https://blog.peerdb.io/peerdb-streams-simple-native-postgres....
10-min quickstart here:
https://docs.peerdb.io/quickstart/streams-quickstart.

We chose queues as many users found that existing tools are complex.
Debezium is the most used tool for this use case and has large
production usage. However, a common pain point among our users is
that it has a significant learning curve and can take months to
productionize. A few issues are:

a) Interacting through a command line interface, understanding the
various settings, and learning best practices for running it in
production is not trivial. Debezium UI, released to address
usability concerns [1], is still in an incubating state [2].
Additionally, reading Debezium resources to get started can be
overwhelming [3].

b) Supporting data formats and transformations isn't easy. It needs
a Java project, building JAR packages, and setting up a runtime path
for the Kafka Connect plugin.

c) Debezium is not as native for other queues as it is for Kafka,
and doesn't offer the same level of configurability for them. For
example, with Event Hubs, it is difficult to stream to topics spread
across namespaces and subscriptions.

TL;DR: Debezium aims to provide a comprehensive experience for
engineers to implement CDC rather than making it dead simple for
them. So you can do a lot with Debezium, but you need to know a lot
about it.

At PeerDB, we are building a simple yet comprehensive experience for
Postgres CDC. The goal is to enable engineers to build prod-grade
Postgres CDC with a minimal learning curve, within a few days.
PeerDB's feature-set isn't at Debezium's level yet, and as we
evolve, we might face similar challenges. However, we're putting
usability at the forefront and we believe that we can achieve the
above goal.

First, PeerDB offers a simple UI to set up Postgres and Kafka by
creating PEERs and to initiate CDC by creating a MIRROR. Through the
UI, users can monitor the progress of CDC, including throughput and
latency; set up alerts to Slack/Email based on replication slot
growth; and investigate Postgres-native metrics such as slot size.
Here is a demo of the PeerDB UI in action:
https://www.loom.com/share/ebcfb7646a1e48738835853b760e5d04

Second, for users who prefer a CLI, we provide a
Postgres-compatible SQL layer to manage CDC. This offers the same
level of features as the UI and is more intuitive compared to bash
scripts.

Third, users can perform row-level transformations using Lua scripts
executed at runtime. This enables features such as encrypting or
masking PII data, supporting various data formats (JSON, MsgPack,
Protobuf, etc.), and more. We offer a script editor along with a
bunch of useful templates [5].

Fourth, we provide native connectors to non-Kafka targets. We also
provide native configurability options tailored to these platforms.
For example, with Event Hubs, users can perform CDC to topics
distributed across different namespaces and subscriptions [4].

Finally, we are laser focused on Postgres, enabling specific
optimizations like native metrics for replication, wait-events, and
number of connections. Features like faster initial loads through
parallel snapshotting and decoding transactions in-flight are in
private beta.

Our hope is to provide the best data-movement experience for
Postgres. PeerDB Streams is another step in that direction. We would
love to get your feedback on product experience, our thesis and
anything else that comes to your mind. It would be super useful for
us. Thank you!

References:
[1] https://debezium.io/blog/2020/10/22/towards-debezium-ui/
[2] https://debezium.io/documentation/reference/stable/operation...
[3] https://medium.com/@cooper.wolfe/i-hated-debezium-so-much-i-...
[4] https://blog.peerdb.io/enterprise-grade-replication-from-pos...
[5] https://github.com/PeerDB-io/examples
[5] https://app.peerdb.cloud
[6] https://github.com/PeerDB-io/PeerDB
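
The row-level transformation idea in the post (mask PII, then pick a
wire format) can be illustrated with a small sketch. PeerDB runs Lua
scripts for this; the Python below is only an illustration of the
concept and is not PeerDB's scripting API. The column names in
PII_COLUMNS and the helper names mask_value and transform_row are
hypothetical.

```python
import hashlib
import json

# Hypothetical column names treated as PII for this illustration.
PII_COLUMNS = {"email", "phone"}


def mask_value(value: str) -> str:
    """Replace a value with a short, stable SHA-256 digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def transform_row(row: dict) -> bytes:
    """Mask PII columns, then serialize the row as JSON for the queue."""
    masked = {
        key: mask_value(str(val)) if key in PII_COLUMNS and val is not None else val
        for key, val in row.items()
    }
    return json.dumps(masked, sort_keys=True).encode("utf-8")


message = transform_row({"id": 1, "email": "a@example.com", "plan": "pro"})
```

Swapping json.dumps for a MsgPack or Protobuf encoder would change
the wire format without touching the masking step.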
Author : saisrirampur
Score : 100 points
Date : 2024-05-06 17:00 UTC (5 hours ago)
| adontz wrote:
| SQS and EventBridge targets not even on the roadmap? Why? Any
| specific reason?
| saisrirampur wrote:
| So far, we have prioritized connectors that most of our users
| requested, including Redpanda, PubSub, and Event Hubs. We will
| add SQS and EventBridge to our backlog. It shouldn't be too
| hard to add these connectors; for example, adding the PubSub
| connector took a few weeks. Thanks for your feedback! :)
| __s wrote:
| you're underselling, I did pubsub over an afternoon. But it
| was an easy stretch since peerdb already had gcp auth logic
| for bigquery & pubsub mostly matches kafka so code only
| needed a couple tweaks
| flockonus wrote:
| What is Change Data Capture (CDC) ?
|
| PeerDB doesn't seem to explain the core problem it solves. Here's
| a reference from Debezium (mentioned in the text):
|
| > set up and configure Debezium to monitor your databases, and
| then your applications consume events for each row-level change
| made to the database. Only committed changes are visible, so your
| application doesn't have to worry about transactions or changes
| that are rolled back.
|
| It's good to know! This model seems to turn row changes into
| effectively a side-effect invocation for a queue.
| treyfitty wrote:
| To my knowledge, CDC provides a solution to the following 2
| pain points. Would love to hear more:
|
| 1. Replication: Imagine you have 5 different analytical
| environments. Do you backup & restore to each destination,
| essentially re-writing 10TB of data, when only 2MB worth of
| data changed? You'd be surprised how many organizations still
| do this.
|
| 2. System Triggering: How can you inform a downstream system that
| an event occurred? There are many mechanisms to do so, and they
| vary in complexity, but the requirements are simple: if a row in
| the DB changes, tag this change as an event and pass it to all
| the downstream systems that need to know about it.
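
The replication point above (ship only what changed instead of
rewriting the whole dataset) can be sketched as applying a stream of
change events to a replica. The event shape used here (op, key, row)
is a hypothetical simplification, not the format of PeerDB, Debezium,
or Postgres logical decoding.

```python
from typing import Any


def apply_change(replica: dict[int, dict[str, Any]], event: dict[str, Any]) -> None:
    """Apply one change event to an in-memory replica keyed by primary key."""
    op = event["op"]
    key = event["key"]
    if op in ("insert", "update"):
        replica[key] = event["row"]
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op!r}")


# The replica starts in sync with the source, then receives only the changes.
replica = {1: {"name": "alice"}, 2: {"name": "bob"}}
changes = [
    {"op": "update", "key": 1, "row": {"name": "alice", "plan": "pro"}},
    {"op": "delete", "key": 2},
    {"op": "insert", "key": 3, "row": {"name": "carol"}},
]
for event in changes:
    apply_change(replica, event)
```

The same event list could be fanned out to several destinations,
which is the "system triggering" use: each consumer sees every change
without polling the database.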
| assaxor wrote:
| Additionally, it can help with systems where 2-phase commits
| can lead to weird and complex situations. This is where
| applying SAGA and using Transactional Outbox can help: it
| should allow for abstracting the event stream away from
| the main application logic; the CDC mirror can then handle
| generating the stream. This can also allow for plug-and-play
| of other downstream services/sinks from the CDC stream.
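
A minimal sketch of the transactional outbox pattern mentioned above,
using SQLite from the Python standard library as a stand-in for
Postgres. The table names (orders, outbox) and the place_order helper
are hypothetical. The point is that the business row and the event
row are written in one local transaction, so no two-phase commit with
the queue is needed; a CDC mirror on the outbox table would then
publish the events.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER NOT NULL)")
conn.execute(
    "CREATE TABLE outbox ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "topic TEXT NOT NULL, "
    "payload TEXT NOT NULL)"
)


def place_order(conn: sqlite3.Connection, order_id: int, amount: int) -> None:
    """Write the order and its event atomically in one local transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO orders (id, amount) VALUES (?, ?)", (order_id, amount)
        )
        payload = json.dumps({"order_id": order_id, "amount": amount})
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order_placed", payload),
        )


place_order(conn, 1, 500)
```

If the order insert fails, the outbox row is never written, so
downstream consumers only ever see events for committed changes.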
| saisrirampur wrote:
| Change Data Capture (and its use cases) is a well known topic
| in the data-engineering and streaming communities. Hence we
| didn't mention more about it. But we will add it to our future
| posts. Thanks for the feedback here!
|
| Here is some commentary on CDC use cases: CDC provides a
| reliable stream of database changes, enabling various use cases
| such as real-time alerting (e.g., fraud detection in banking),
| replicating OLTP data to other types of stores (OLAP, search,
| etc.) for real-time analytics and search, and implementing the
| transactional outbox pattern which exposes database changes to
| other microservices/apps (e.g., a bank exposing a raw feed of
| transactions to its rewards microservice, etc.).
| take-five wrote:
| How do you handle Postgres cluster failover? Does PeerDB
| automatically restore the logical replication slot on a new primary?
| saisrirampur wrote:
| Great question! We have retry logic in place to handle Postgres
| restarts. If the failure occurs in-place, you should be fine as
| the slot will persist. If Postgres fails over to the standby,
| PeerDB will wait until the slot is created. We did consider
| automatically creating the slot if it doesn't exist on retries,
| but ensuring data reliability (creating the slot right after
| failover without data being ingested) is tricky. So, as of now,
| we leave it to the user.
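
The "wait until the slot is created" behavior described above can be
sketched as a retry loop with exponential backoff. This is an
assumption about the general shape of such logic, not PeerDB's
implementation. The wait_for_slot helper and its parameters are
hypothetical; in practice slot_exists might run a query against the
pg_replication_slots view.

```python
import time
from typing import Callable


def wait_for_slot(
    slot_exists: Callable[[], bool],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the replication slot exists, doubling the delay each retry.

    Returns True once the slot is found and False if attempts run out.
    The slot is never created here; that is left to the user.
    """
    delay = base_delay
    for attempt in range(max_attempts):
        if slot_exists():
            return True
        if attempt < max_attempts - 1:
            sleep(delay)
            delay *= 2
    return False
```

Passing sleep as a parameter keeps the loop easy to test without
real waiting.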
| arsalanb wrote:
| Noob question: What is the advantage of replicating data into a
| warehouse vs. just querying it in place on a postgres database?
| CuriouslyC wrote:
| If the postgres database is recording business transactions,
| you don't want to cause your business to stop being able to
| take credit cards because you generated a report.
| arsalanb wrote:
| Assuming you use a connection pool, why would it stop? Either
| the query returns a result or it doesn't? Am I missing
| something?
| CuriouslyC wrote:
| Reporting queries can put a significant load on the db, to
| the point that it interrupts service.
| arsalanb wrote:
| I see, thanks!
| necubi wrote:
| Furthermore, Postgres is an OLTP (transactional) database,
| designed to efficiently perform updates and deletes on
| individual rows. OLAP (analytical) databases/query
| engines like Clickhouse, Presto, Druid, etc. are designed
| for efficient processing across many rows of mostly
| unchanging data.
|
| Analytical queries (like "find the average sales price
| across all orders over the past year, grouped by store
| and region") can be 100-1000x faster in an OLAP database
| compared to Postgres.
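
One reason for the gap described above is storage layout. The sketch
below contrasts a row-oriented layout with a column-oriented one for
a grouped average. It only illustrates the layout difference, says
nothing about actual speedups, and uses made-up sample data and a
hypothetical helper name.

```python
from collections import defaultdict

# Row-oriented layout: each record is stored together (good for single-row updates).
rows = [
    {"order_id": 1, "store": "north", "price": 10.0},
    {"order_id": 2, "store": "south", "price": 30.0},
    {"order_id": 3, "store": "north", "price": 20.0},
]

# Column-oriented layout: each column is stored together (good for scans).
columns = {name: [row[name] for row in rows] for name in rows[0]}


def average_price_by_store(stores: list[str], prices: list[float]) -> dict[str, float]:
    """Compute the average price per store, reading only the two needed columns."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for store, price in zip(stores, prices):
        totals[store] += price
        counts[store] += 1
    return {store: totals[store] / counts[store] for store in totals}


result = average_price_by_store(columns["store"], columns["price"])
```

The aggregate never touches order_id, which is the kind of saving a
columnar engine exploits across wide tables.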
| matthieucan wrote:
| Additionally, unless your data model is designed as append-only
| (which is unusual and requires logic downstream), you won't be
| able to track updates and deletions, which are valuable for
| reporting
| ralfhn wrote:
| Data warehouses are structured to handle large volumes of data
| and complex queries more efficiently than a typical
| transactional database like PostgreSQL.
| gniting wrote:
| Nice to see more product development and offerings in this area.
| Well done.
|
| [Full disclosure, I work for Prisma and we have a similar product
| called Pulse (https://prisma.io/pulse)]
|
| Another use case for CDC is compliance. I reckon that in the near
| future, to ensure compliance with data regulations, CDC will
| become the better option for devs vs traditional
| seek/update/delete functions.
___________________________________________________________________
(page generated 2024-05-06 23:00 UTC)