[HN Gopher] We made Postgres writes faster, but it broke replica...
___________________________________________________________________
We made Postgres writes faster, but it broke replication
Author : philippemnoel
Score : 186 points
Date : 2025-07-21 11:26 UTC (11 hours ago)
(HTM) web link (www.paradedb.com)
(TXT) w3m dump (www.paradedb.com)
| timetoogo wrote:
| Love this style of no-fluff technical deep dive. HN needs more
| content like this.
| otterley wrote:
| > To be an effective alternative to Elasticsearch we needed to
| support high ingest workloads in real time.
|
| Why not just use OpenSearch or ElasticSearch? The tool is already
| in the inventory; why use a screwdriver when a chisel is needed
| and available?
|
| This is another one of those "when you have a hammer, everything
| looks like your thumb" stories.
| iw2rmb wrote:
| Because you don't need to sync and you have ACID with joins.
| otterley wrote:
| Is there a whole business to be had with those advantages
| alone? I'm curious as to who the target market is.
| levkk wrote:
| My last big co, we had a team of 10 whose entire job was to
| sync data from Postgres into Elastic. It would take weeks
| and fall over regularly due to traffic.
|
| If we could have a DB that could do search and be a store
| of record, it would be amazing.
| mathewpregasen wrote:
| Yeah, in general, I think a lot of businesses would love
| to skip ETL pipelines if possible / consolidate data.
| Postgres is very much a neutral database to extend
| upon; maybe a wild analogy, but it's the canola oil of
| databases.
| strbean wrote:
| Total tangent, but I think "Canola is a neutral oil" is a
| lie. It's got the most distinctive (and in my opinion,
| bad) flavor of the common cooking oils.
| retakeming wrote:
| What would you say is the most neutral oil then?
| ahartmetz wrote:
| Sunflower oil? It seems to very reliably taste like
| nothing.
| mikegreenberg wrote:
| Once upon a time, I was using postgres for OLTP and OLAP
| purposes combined with in-database transforms using
| TimescaleDB. I had a schema for optimized ingestion and
| then several aggregate views which produced a bunch of
| purpose-specific "materialized" tables for efficient
| analysis based on the ingestion tables.
|
| Timescale had a nice way of abstracting away the cost of
| updating these views without putting too much load on
| ingestion (processing multiple TBs of data at a time in a
| single instance with about 500GB of data churn daily).
| worldsayshi wrote:
| One db that could be interesting here is CrateDB. It's a
| Lucene based DB that supports the postgres wire protocol.
| So you can run SQL queries against it.
|
| I've tried figuring out if it supports acting as a pg
| read replica, which sounds to me like the ideal setup --
| but it doesn't seem to be supported.
|
| I have no affiliation to them, just met the team at an
| event and thought it sounded cool.
| philippemnoel wrote:
| One of the ParadeDB maintainers here -- Being PostgreSQL
| wire protocol compatible is very different from being
| built inside Postgres on top of the Postgres pages, which
| is what ParadeDB does. You still need the "T" in ETL,
| e.g. transforming data from your source into the format
| of the sink (in your example CrateDB). This is where ETL
| costs and brittleness come into play.
|
| You can read more about it here:
| https://www.paradedb.com/blog/block_storage_part_one
| otterley wrote:
| They're different access patterns, though. Are there no
| concerns about performance and potentially blocking
| behavior? Decoupling OLTP and analytics is frequently
| done with good reason: 1/to allow the systems to scale
| independently, and 2/to help prevent issues with one
| component from impacting the other (i.e., contain blast
| radius). I wouldn't want a failure of my search engine to
| also take down my transaction system.
| philippemnoel wrote:
| You don't need to. Customers usually deploy us on
| standalone replica(s) in their Postgres cluster. If a
| query were to take it down, it would only take down the
| replica(s) dedicated to ParadeDB, leaving the primary and
| all other read replicas dedicated to OLTP safe.
| otterley wrote:
| Are you saying that the cluster isn't homogeneous? It
| sounds like you're describing an architecture that
| involves a cluster that has two entirely different pieces
| of software on it, and whose roles aren't
| interchangeable.
| philippemnoel wrote:
| Bear with me, this will be a bit of a longer answer.
| Today, there are two topologies under which people deploy
| ParadeDB.
|
| - <some managed Postgres service> + ParadeDB. Frequently,
| customers already use a managed Postgres (e.g. AWS RDS)
| and want ParadeDB. In that world, they maintain their
| managed Postgres service and deploy a Kubernetes cluster
| running ParadeDB on the side, with one primary instance
| and some number of replicas. The AWS RDS primary sends
| data to the ParadeDB primary via logical replication. You
| can see a diagram here:
| https://docs.paradedb.com/deploy/byoc
|
| In this topology, the OLTP and search/OLAP workloads are
| fully isolated from each other. You have two clusters,
| but you don't need a third-party ETL service since
| they're both "just Postgres".
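|
| A minimal sketch of that first topology's wiring, using
| stock logical replication (the table names and connection
| string are illustrative, not from ParadeDB's docs):

```sql
-- On the managed (e.g. AWS RDS) primary: publish the tables
-- that should become searchable.
CREATE PUBLICATION search_pub FOR TABLE orders, products;

-- On the ParadeDB primary: subscribe to that publication.
CREATE SUBSCRIPTION search_sub
    CONNECTION 'host=rds.example.com dbname=app user=replicator'
    PUBLICATION search_pub;
```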
|
| - <self-hosted Postgres> + ParadeDB. Some customers,
| typically larger ones, prefer to self-host Postgres and
| want to install our Postgres extension directly. The
| extension is installed in their primary Postgres, and the
| CREATE INDEX commands must be issued on the primary;
| however, they may route reads only to a subset of the
| read replicas in their cluster.
|
| In this topology, all writes could be directed to the
| primary, all OLTP read queries could be routed to a pool
| of read replicas, and all search/OLAP queries could be
| directed to another subset of replicas.
|
| Both are completely reasonable approaches and depend on
| the workload. Hope this helps :)
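|
| For the second topology, the setup on the primary looks
| roughly like this -- a sketch adapted from ParadeDB's
| documented `bm25` index syntax, which may differ by
| version; the table and column names are illustrative:

```sql
CREATE EXTENSION pg_search;

-- Built on the primary; the index replicates to whichever
-- read replicas are designated to serve search queries.
CREATE INDEX products_search_idx ON products
    USING bm25 (id, description)
    WITH (key_field = 'id');

-- Full-text query via the extension's search operator.
SELECT id, description
FROM products
WHERE description @@@ 'wireless keyboard';
```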
| adsharma wrote:
| Which of these two is the higher order bit?
|
| * ParadeDB speaks postgres protocol
|
| * These setups don't have a complex ETL pipeline
|
| If you have an ETL pipeline specialized for PG logical
| replication (as opposed to generic JVM-based
| Debezium/Kafka setups), you get some fraction of the same
| benefits. I'm curious about Conduit and its postgres
| plugin.
|
| That leaves: ParadeDB uses vanilla postgres + rust
| extension. This is a technology detail. I was looking for
| an articulation of the customer benefit because of this
| technologically appealing architecture.
| philippemnoel wrote:
| The value props for customers vs Elasticsearch are:
|
| - ACID w/ JOINs
|
| - Real-time indexing under UPDATE-heavy workloads.
| Instacart wrote about this, they had to move away from
| Elasticsearch during COVID because of this problem:
| https://tech.instacart.com/how-instacart-built-a-modern-
| sear...
|
| Beyond these two, the added benefits are:
|
| - Infrastructure simplification (no need for ETL)
|
| - Lower costs
|
| Speaking the wire protocol is nice, but it's not worth
| much.
| dangoodmanUT wrote:
| they both sound like postgres to me, just with different
| extensions
| ucarion wrote:
| Since we both worked there: I can think of a few places
| at Segment where we'd have added more
| reporting/analytics/search if it weren't such a pain to
| set up an OLAP copy of our control plane databases.
| Remember how much engineering effort we spent on teams
| that did nothing but control plane database stuff?
|
| Data plane is a different story, but not everything is
| 1m+ RPS.
| gtowey wrote:
| It's not going to happen anytime soon, because you simply
| cannot cheat physics.
|
| A system that supports OLAP/ad-hoc queries is going to
| need a ton of IOPs & probably also CPU capacity to do
| your data transformations. If you want this to also scale
| beyond the capacity limits of a single node, then you're
| going to run into distributed joins and network becomes a
| huge factor.
|
| Now, to support OLTP at the same time, your big,
| distributed system needs to support ACID, be highly
| fault-tolerant, etc.
|
| All you end up with is a system that has to be scaled in
| every dimension. It needs to support the maximum possible
| workloads you can throw at it, or else a random,
| expensive reporting query is going to DoS it, leaving
| your primary customer-facing system unusable at the same
| time. It is sort of possible, but it's going to
| cost A LOT of money. You have to have tons and tons of
| "spare" capacity.
|
| Which brings us to the core of engineering -- anyone can
| build a system that burns dump trucks full of venture
| capital dollars to create the one-system-to-rule-them-
| all. But businesses that want to succeed need to optimize
| their costs so their storage systems don't break the
| bank. This is why the current status-quo of specialized
| systems that do one task well isn't going to change. The
| current technology paradigm cannot be optimized for every
| task simultaneously. We have to make tradeoffs.
| throwaway7783 wrote:
| I don't know. For me, I need
|
| * a primary transactional DB that I can write fast, with
| ACID guarantees and a read-after-write guarantee, and
| allows failover
|
| * one (or more) secondaries that are optimized for
| analytics and search. This should also tell me how caught
| up the system is, with the primary.
|
| If they all can talk the same language (SQL) and can
| replicate from the primary with no additional
| tools/technology (Postgres replication, for example), I
| will take it any day.
|
| It is about operational simplicity and not needing to
| intimately know multiple technologies. Granted, even if
| this is "just" PostgreSQL, it really is not, and all
| customizations will have their own tuning and whatnot,
| but the context is all still PostgreSQL.
|
| Yes, this will not magically solve the CAP theorem, but
| for most cases we don't need to care too much.
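|
| The "how caught up" piece, at least, is already observable
| in stock Postgres; a minimal check from the primary:

```sql
-- Per-standby replication lag, in bytes of WAL the standby
-- has not yet replayed.
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
           AS replay_lag_bytes
FROM pg_stat_replication;
```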
| cryptonector wrote:
| For JOINs? Absolutely! Who wants to hand-code queries at
| the executor level?! It's expensive!
|
| You need a query language.
|
| You don't necessarily need ACID, and you don't necessarily
| need a bunch of things that SQL RDBMSes give you, but you
| definitely need a QL, and it has to support a lot of what
| SQL supports, especially JOINs and GROUP BY w/
| aggregations.
|
| NoSQLs tend to evolve into having a QL layered on top. Just
| start with that if you really want to build a NoSQL.
| otterley wrote:
| To be clear here, I'm not arguing that
| OpenSearch/ElasticSearch is an adequate substitute for
| Postgres. They're different databases, each with
| different strengths and weaknesses. If you need JOINs and
| ACID compliance, you _should_ use Postgres. And if you
| need distributed search, you _should_ use
| OpenSearch/ElasticSearch.
|
| Unless they're building for single-host scale, you're not
| going to get JOINs for free. Lucene (the engine upon
| which ES/OS is based) already has JOIN capability. But
| it's not used in ES/OS because the performance of JOINs
| is absolutely abysmal in distributed databases.
| philippemnoel wrote:
| Our customers typically deploy ParadeDB in a primary-
| replicas topology, with one primary Postgres node and 2
| or more read replicas, depending on read volume. Queries
| are executed on a single node today, yes.
|
| We have plans to eventually support distributed queries.
| cryptonector wrote:
| I'm arguing that sometimes you don't need ACID, or
| rather, sometimes you accept that ACID is too painful so
| you accept not having ACID, but no one ever really
| doesn't want a QL -- they only think that they don't want
| a QL until they learn better.
|
| I.e., NoACID does not imply NoQueryLanguage, and you can
| always have a QL, so you should always get a QL, and you
| should always use a QL.
|
| > Unless they're building for single-host scale, you're
| not going to get JOINs for free.
|
| If by 'free' you mean not having to code them, then
| that's wrong. You can always have or implement a QL.
|
| If by 'free' you mean 'performant', then yes, you might
| have to denormalize your data so that JOINs vanish,
| though at the cost of write amplification. But so what,
| that's true whether you use a QL or not -- it's true in
| SQL RDBMSes too.
| derefr wrote:
| It's funny; as someone who is exactly pg_search's market,
| I actually often want the opposite: ACID, MVCC
| transactions, automatic table and index management... but
| no query language.
|
| At the data scale + level of complexity our OLAP queries
| operate at, we _very_ often run into situations where
| Postgres's very best plan [with a well-considered
| schema, with great indexes and statistics, and after tons
| of tuning and coaxing], still does something literally
| interminable -- not for any semantic reason to do with
| the query plan, but rather due to how Postgres's
| architecture _executes_ the query plan[1].
|
| The last such job, I thought would be simple enough to
| run in a few hours... I let it run for six days[2], and
| then gave up and killed it. Whereas, when we encoded the
| same "query plan" as a series of bulk-primitive ETL steps
| by:
|
| 1. dumping the raw source data from PG to CSV with a
| `COPY`,
|
| 2. whipping out simple POSIX CLI tools like
| sort/uniq/grep/awk (plus a few hand-rolled streaming
| aggregation scripts) to transform/reduce/normalize the
| source data into the shape we want it in,
|
| 3. and then loading the resulting CSVs back into PG with
| another `COPY`,
|
| ...then the runtime of the whole operation was reduced to
| just a few hours, with the individual steps completing in
| ~30 minutes each. (And that's despite the overhead of
| parsing and/or emitting non-string fields from/to CSV in
| almost every intermediate step!)
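|
| The three steps above can be sketched as a shell pipeline.
| This is not the author's actual script: the `psql`
| invocations and stand-in data are assumptions, the table
| names come from the comment, and the awk aggregation
| assumes numeric (a, b) keys:

```shell
# 1. Dump the raw (a, b, value) tuples to CSV. On a real cluster:
#      psql -c "\copy (SELECT a, b, value FROM source1) TO 'source1.csv' CSV"
#      psql -c "\copy (SELECT a, b, value FROM source2) TO 'source2.csv' CSV"
#    Tiny stand-in data so the sketch runs end to end:
printf '1,2,10\n1,1,5\n' > source1.csv
printf '1,1,7\n2,1,3\n' > source2.csv

# 2. Concatenate, external-merge-sort by the two keys, then
#    stream-aggregate: sum `value` per (a, b) group in one pass.
cat source1.csv source2.csv \
  | sort -t, -k1,1n -k2,2n \
  | awk -F, '
      { key = $1 FS $2 }
      key != prev { if (NR > 1) print prev FS total; prev = key; total = 0 }
      { total += $3 }
      END { if (NR > 0) print prev FS total }
    ' > totals.csv

cat totals.csv   # → 1,1,12 / 1,2,10 / 2,1,3

# 3. Load the aggregated rows back:
#      psql -c "\copy totals (a, b, total_value) FROM 'totals.csv' CSV"
```

| sort(1) spills to temporary files for inputs larger than
| memory, which is exactly the "inherently spills to disk"
| external merge sort the comment credits for the speedup.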
|
| Honestly, if Postgres would just let us program it the
| way one programs e.g. Redis through Lua, or ETS tables in
| Erlang -- where the tables and indices are ADTs with low-
| level public APIs, and you set up your own "query plan"
| as a set of streaming-channel actors making calls to
| these APIs -- then we would be a lot happier. But even in
| PL/pgSQL (which we _do_ use, here and there), the only
| APIs are high-level ones.
|
| * Sure, you can get a cursor on a _query_; but you can't
| e.g. get an LMDB-like B-tree cursor on a target B-tree
| index, and ask it to jump [i.e. re-nav down from root] or
| walk [i.e. nav up from current pos to nearest common
| ancestor then back down] to "the first row-tuple greater-
| than-or-equal to [key]".
|
| * You can't write your own efficient implementation of
| TABLESAMPLE semantics to set up your own Bigtable-esque
| balanced cluster-order-partitioned parallel seq scan.
|
| * You can't collect pointers to row-tuples, partially
| materialize them, filter them by some criterion on the
| read (but perhaps not _parsed_!) columns, and _then_
| more-fully materialize those same row-tuples "directly"
| from the references to them you still hold.
|
| ---
|
| [1] One example of what I mean by "execution": did you
| know that Postgres doesn't use any form of concurrency
| for query plans -- not even the most basic libuv-like
| "This Merge Append node's child-node A is in a blocking-
| wait on IO; that blocking-wait should yield, so that the
| Merge Append node's child-node B can instead send row-
| tuple batches for a while" kind of concurrency?
|
| ---
|
| [2] If you're wondering, the query that ran for six days
| was literally just this (anonymized):
|
|       SELECT a, b, SUM(value) AS total_value
|       FROM (
|           SELECT a, b, value FROM source1
|           UNION ALL
|           SELECT a, b, value FROM source2
|       ) AS u
|       GROUP BY a, b;
|
| `source1` and `source2` are ~150GB tables. (Or at least,
| they're 150GB when dumped to CSV.) Two integer keys
| (a,b), and a bigint value. With a b-tree index on `(a,b)
| INCLUDE (value)`, with correct statistics.
|
| And its EXPLAIN query plan looked like this (with `SET
| enable_hashagg = OFF;`) -- _nominally_ pretty good:
|
|       GroupAggregate  (cost=1.17..709462419.92 rows=40000 width=40)
|         Group Key: a, b
|         ->  Merge Append  (cost=1.17..659276497.84 rows=6691282944 width=16)
|               Sort Key: a, b
|               ->  Index Only Scan using source1_a_b_idx on source1
|                     (cost=0.58..162356175.31 rows=3345641472 width=16)
|               ->  Index Only Scan using source2_a_b_idx on source2
|                     (cost=0.58..162356175.31 rows=3345641472 width=16)
|
| Each one of the operations here is "obvious." It's what
| you'd think you'd want! You'd think this _would_ finish
| quickly. And yet.
|
| (And no, the machine it ran on was _not_ resource-
| bottlenecked. It had 1TB of RAM with no contention from
| other jobs, and this PG session was allowed to use much
| of it as work memory. But even if it was spilling to disk
| at every step... that should have been fine. The CSV
| equivalent of this inherently "spills to disk", for
| everything except the nursery levels of sort(1)'s merge-
| sort. And it does fine.)
| cryptonector wrote:
| > At the data scale + level of complexity our OLAP
| queries operate at, we very often run into situations
| where Postgres's very best plan [with a well-considered
| schema, with great indexes and statistics, and after tons
| of tuning and coaxing], still does something literally
| interminable -- not for any semantic reason to do with
| the query plan, but rather due to how Postgres's
| architecture executes the query plan[1].
|
| Well, ok, this is a problem, and I have run into it
| myself. That's not a reason for not wanting a QL. It's a
| reason for wanting a way to improve the query planning.
| Query hints in the QL are a bad idea for several reasons.
| What I would like instead is out-of-band query hints that
| I can provide along with my query (though obviously only
| when using APIs rather than `psql`; for `psql` one would
| have to provide the hints via some \hints command) where
| I would address each table source using names/aliases for
| the table source / join, and names for subqueries, and so
| really something like a path through the query and
| subqueries like `.<sub_query_alias0>.<sub_query_alias1>.<
| ..>.<sub_query_aliasN>.<table_source_alias>` and where
| the hint would indicate things like what sub-query plan
| type to use and what index to use.
| derefr wrote:
| I mean, in my case, I don't think what I want could be
| implemented via query hints. The types of things I would
| want to communicate to the server, are pragmas entirely
| divorced from the semantics of SQL: pragmas that only
| make sense if you can force the query's plan to take a
| specific shape to begin with, because you're trying to
| tune specific knobs _on specific plan nodes_ , so if
| those plan nodes aren't part of the final query, then
| your tuning is meaningless.
|
| And if you're pinning the query plan to a specific shape,
| then there's really no point in sending SQL + hints; you
| may as well just expose a lower-level "query-execution-
| engine abstract-machine bytecode" that the user can
| submit, to be translated in a very low-level -- but
| contractual! -- way into a query plan. Or, one step
| further, into _the thing a query plan does_ , skipping
| the plan-node-graph abstraction entirely in favor of
| arbitrarily calling the same primitives the plan nodes
| call [in a sandboxed way, because such bytecode _should_
| be low-level enough that it can encode invalid operation
| sequences that will crash the PG connection backend --
| and this is fine, the user signed up for that; they just
| want to be assured that such a crash won't affect data
| integrity outside the current transaction.]
|
| Such a bytecode wouldn't _have_ to be used as the literal
| compiled internal representation of SQL within the
| server, mind you. (It'd be _ideal_ if it was, but it
| doesn't _need_ to be.) Just like e.g. the published and
| versioned JVM bytecode spec isn't 1:1 with the bytecode
| ISA the JVM actually uses as its in-memory representation
| for interpretation -- there's module-load-time
| translation/compilation from the stable public format, to
| the current internal format.
| strbean wrote:
| Obligatory whine that the term NoSQL got co-opted to mean
| "non-relational". There's tons of space for a better query
| language for querying relational databases.
| beoberha wrote:
| Interestingly enough, it looks like the team was just
| hacking on an open source extension and organically
| attracted some customers, which snowballed into raising
| capital. So definitely seems like there's a market.
| sandeepkd wrote:
| There can be multiple reasons; one that I can think of right
| away would be to keep the stack as simple as possible for as
| long as you can. Realistically speaking, most companies do
| not operate at the scale where they would need the
| specialized tools.
| AtlasBarfed wrote:
| Why not the Cassandra-based Elastics if you need ingest?
| izietto wrote:
| Because it's a whole other architectural component for data
| that is already there, in a tool that is made just for
| that, missing only `SELECT full_text_search('kitty
| pictures');`
| truetraveller wrote:
| diagrams made how?
| hobs wrote:
| Guessing Figma.
| philippemnoel wrote:
| Yes, Figma!
___________________________________________________________________
(page generated 2025-07-21 23:01 UTC)