https://github.com/supabase/pg_replicate Skip to content Navigation Menu Toggle navigation Sign in * Product + Actions Automate any workflow + Packages Host and manage packages + Security Find and fix vulnerabilities + Codespaces Instant dev environments + GitHub Copilot Write better code with AI + Code review Manage code changes + Issues Plan and track work + Discussions Collaborate outside of code Explore + All features + Documentation + GitHub Skills + Blog * Solutions By size + Enterprise + Teams + Startups By industry + Healthcare + Financial services + Manufacturing By use case + CI/CD & Automation + DevOps + DevSecOps * Resources Topics + AI + DevOps + Security + Software Development Explore + Learning Pathways + White papers, Ebooks, Webinars + Customer Stories + Partners * Open Source + GitHub Sponsors Fund open source developers + The ReadME Project GitHub community articles Repositories + Topics + Trending + Collections * Enterprise + Enterprise platform AI-powered developer platform Available add-ons + Advanced Security Enterprise-grade security features + GitHub Copilot Enterprise-grade AI features + Premium Support Enterprise-grade 24/7 support * Pricing Search or jump to... Search code, repositories, users, issues, pull requests... Search [ ] Clear Search syntax tips Provide feedback We read every piece of feedback, and take your input very seriously. [ ] [ ] Include my email address so I can be contacted Cancel Submit feedback Saved searches Use saved searches to filter your results more quickly Name [ ] Query [ ] To see all available qualifiers, see our documentation. Cancel Create saved search Sign in Sign up Reseting focus You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session. You switched accounts on another tab or window. Reload to refresh your session. Dismiss alert {{ message }} supabase / pg_replicate Public * * Notifications You must be signed in to change notification settings * Fork 2 * Star 128 License Apache-2.0 license 128 stars 2 forks Branches Tags Activity Star Notifications You must be signed in to change notification settings * Code * Issues 20 * Pull requests 0 * Actions * Projects 0 * Security * Insights Additional navigation options * Code * Issues * Pull requests * Actions * Projects * Security * Insights supabase/pg_replicate This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository. main BranchesTags Go to file Code Folders and files Name Name Last commit Last commit message date Latest commit History 197 Commits .github/workflows .github/ workflows api api config-types config-types pg_replicate pg_replicate replicator replicator .dockerignore .dockerignore .gitignore .gitignore Cargo.toml Cargo.toml LICENSE LICENSE README.md README.md deny.toml deny.toml View all files Repository files navigation * README * Code of conduct * Apache-2.0 license About pg_replicate pg_replicate is a Rust crate to quickly build replication solutions for Postgres. It provides building blocks to construct data pipelines which can continually copy data from Postgres to other systems. It builds abstractions on top of Postgres's logical streaming replication protocol and pushes users towards the pit of success without letting them worry about low level details of the protocol. Quickstart To quickly try out pg_replicate, you can run the stdout example, which will replicate the data to standard output. First, create a publication in Postgres which includes the tables you want to replicate: create publication my_publication for table table1, table2; Then run the stdout example: cargo run --example stdout -- --db-host localhost --db-port 5432 --db-name postgres --db-username postgres --db-password password cdc my_publication stdout_slot In the above example, pg_replicate connects to a Postgres database named postgres running on localhost:5432 with a username postgres and password password. The slot name stdout_slot will be created by pg_replicate automatically. Refer to the examples folder to run examples for sinks other than stdout (currently only bigquery and duckdb supported). A quick tip: to see all the command line options, run the example wihout any options specified, e.g. cargo run --example bigquery will print the detailed usage instructions for the bigquery sink. Getting Started To use pg_replicate in your Rust project, add it via a git dependency in Cargo.toml: [dependencies] pg_replicate = { git = "https://github.com/imor/pg_replicate" } The git dependency is needed for now because pg_replicate is not yet published on crates.io. You'd also need to add a dependency to tokio: [dependencies] ... tokio = { version = "1.38" } Now your main.rs can have code like the following: use std::error::Error; use pg_replicate::pipeline::{ data_pipeline::DataPipeline, sinks::stdout::StdoutSink, sources::postgres::{PostgresSource, TableNamesFrom}, PipelineAction, }; #[tokio::main] async fn main() -> Result<(), Box> { let host = "localhost"; let port = 5432; let database = "postgres"; let username = "postgres"; let password = Some("password".to_string()); let slot_name = Some("my_slot".to_string()); let table_names = TableNamesFrom::Publication("my_publication".to_string()); // Create a PostgresSource let postgres_source = PostgresSource::new( host, port, database, username, password, slot_name, table_names, ) .await?; // Create a StdoutSink. This sink just prints out the events it receives to stdout let stdout_sink = StdoutSink; // Create a `DataPipeline` to connect the source to the sink let mut pipeline = DataPipeline::new(postgres_source, stdout_sink, PipelineAction::Both); // Start the `DataPipeline` to start copying data from Postgres to stdout pipeline.start().await?; Ok(()) } For more examples, please refer to the examples folder in the source. Repository Structure The repository is a cargo workspace. Each of the individual sub-folders are crate in the workspace. A brief explanation of each crate is as follows: * api - REST api used for hosting pg_replicate in a cloud environment. * config-types - Common types for configuration used in projects across the workspace. * pg_replicate - The main library crate containing the core logic. * replicator - A binary crate using pg_replicate. Packaged as a docker container for use in cloud hosting. Roadmap pg_replicate is still under heavy development so expect bugs and papercuts but overtime we plan to add the following sinks. * [*] Add BigQuery Sink * [*] Add DuckDb Sink * [*] Add MotherDuck Sink * [ ] Add Snowflake Sink * [ ] Add ClickHouse Sink * [ ] Many more to come... Note: DuckDb and MotherDuck sinks do no use the batched pipeline, hence they currently perform poorly. A batched pipeline version of these sinks is planned. See the open issues for a full list of proposed features (and known issues). License Distributed under the Apache-2.0 License. See LICENSE for more information. Docker To create the docker image for replicator run docker build -f ./ replicator/Dockerfile . from the root of the repo. Similarly, to create the docker image for api run docker build -f ./api/Dockerfile .. Design Applications can use data sources and sinks from pg_replicate to build a data pipeline to continually copy data from the source to the sink. For example, a data pipeline to copy data from Postgres to DuckDB takes about 100 lines of Rust. There are three components in a data pipeline: 1. A data source 2. A data sink 3. A pipline The data source is an object from where data will be copied. The data sink is an object to which data will be copied. The pipeline is an object which drives the data copy operations from the source to the sink. +----------+ +----------+ | | | | | Source |---- Data Pipeline --->| Sink | | | | | +----------+ +----------+ So roughly you write code like this: let postgres_source = PostgresSource::new(...); let duckdb_sink = DuckDbSink::new(..); let pipeline = DataPipeline(postgres_source, duckdb_sink); pipeline.start(); Of course, the real code is more than these four lines, but this is the basic idea. For a complete example look at the duckdb example. Data Sources A data source is the source for data which the pipeline will copy to the data sink. Currently, the repository has only one data source: PostgresSource. PostgresSource is the primary data source; data in any other source or sink would have originated from it. Data Sinks A data sink is where the data from a data source is copied. There are two kinds of data sinks. Those which retain the essential nature of data coming out of a PostgresSource and those which don't. The former kinds of data sinks can act as a data source in future. The latter kind can't act as a data source and are data's final resting place. For instance, DuckDbSink ensures that the change data capture (CDC) stream coming in from a source is materialized into tables in a DuckDB database. Once this lossy data transformation is done, it can not be used as a CDC stream again. Contrast this with a potential future sink S3Sink or KafkaSink which just copies the CDC stream as is. The data deposited in the sink can later be used as if it was coming from Postgres directly. Data Pipeline A data pipeline encapsulates the business logic to copy the data from the source to the sink. It also orchestrates resumption of the CDC stream from the exact location it was last stopped at. The data sink participates in this by persisting the resumption state and returning it to the pipeline when it restarts. If a data sink is not transactional (e.g. S3Sink), it is not always possible to keep the CDC stream and the resumption state consistent with each other. This can result in these non-transactional sinks having duplicate portions of the CDC stream. Data pipeline helps in deduplicating these duplicate CDC events when the data is being copied over to a transactional store like DuckDB. Finally, the data pipeline reports back the log sequence number (LSN) upto which the CDC stream has been copied in the sink to the PostgresSource. This allows the Postgres database to reclaim disk space by removing WAL segment files which are no longer required by the data sink. +----------+ +----------+ | | | | | Source |<---- LSN Numbers -----| Sink | | | | | +----------+ +----------+ Kinds of Data Copies CDC stream is not the only kind of data a data pipeline performs. There's also full table copy, aka backfill. These two kinds can be performed either together or separately. For example, a one-off data copy can use the backfill. But if you want to regularly copy data out of Postgres and into your OLAP database, backfill and CDC stream both should be used. Backfill to get the intial copies of the data and CDC stream to keep those copies up to date and changes in Postgres happen to the copied tables. Performance Currently the data source and sinks copy table row and CDC events one at a time. This is expected to be slow. Batching, and other strategies will likely improve the performance drastically. But at this early stage the focus is on correctness rather than performance. There are also zero benchmarks at this stage, so commentary about performance is closer to speculation than reality. About No description, website, or topics provided. Resources Readme License Apache-2.0 license Code of conduct Code of conduct Activity Custom properties Stars 128 stars Watchers 3 watching Forks 2 forks Report repository Releases No releases published Sponsor this project Sponsor Learn more about GitHub Sponsors Packages 0 No packages published Languages * Rust 98.1% * Shell 1.4% * Dockerfile 0.5% Footer (c) 2024 GitHub, Inc. Footer navigation * Terms * Privacy * Security * Status * Docs * Contact * Manage cookies * Do not share my personal information You can't perform that action at this time.