[HN Gopher] We made Postgres writes faster, but it broke replica...
       ___________________________________________________________________
        
       We made Postgres writes faster, but it broke replication
        
       Author : philippemnoel
       Score  : 186 points
       Date   : 2025-07-21 11:26 UTC (11 hours ago)
        
 (HTM) web link (www.paradedb.com)
 (TXT) w3m dump (www.paradedb.com)
        
       | timetoogo wrote:
       | Love this style of no-fluff technical deep dive. HN needs more
       | content like this.
        
       | otterley wrote:
       | > To be an effective alternative to Elasticsearch we needed to
       | support high ingest workloads in real time.
       | 
       | Why not just use OpenSearch or ElasticSearch? The tool is already
       | in the inventory; why use a screwdriver when a chisel is needed
       | and available?
       | 
       | This is another one of those "when you have a hammer, everything
       | looks like your thumb" stories.
        
         | iw2rmb wrote:
         | Because you don't need to sync and you have ACID with joins.
        
           | otterley wrote:
           | Is there a whole business to be had with those advantages
           | alone? I'm curious as to who the target market is.
        
             | levkk wrote:
              | My last big co, we had a team of 10 whose entire job was
              | to sync data from Postgres into Elastic. It would take
              | weeks and fall over regularly due to traffic.
             | 
             | If we could have a DB that could do search and be a store
             | of record, it would be amazing.
        
               | mathewpregasen wrote:
                | Yeah, in general, I think a lot of businesses would love
                | to skip ETL pipelines if possible / consolidate data.
                | Postgres is very much a neutral database to extend upon;
                | maybe a wild analogy, but it's the canola oil of
                | databases.
        
               | strbean wrote:
               | Total tangent, but I think "Canola is a neutral oil" is a
               | lie. It's got the most distinctive (and in my opinion,
               | bad) flavor of the common cooking oils.
        
               | retakeming wrote:
               | What would you say is the most neutral oil then?
        
               | ahartmetz wrote:
               | Sunflower oil? It seems to very reliably taste like
               | nothing.
        
               | mikegreenberg wrote:
               | Once upon a time, I was using postgres for OLTP and OLAP
               | purposes combined with in-database transforms using
               | TimescaleDB. I had a schema for optimized ingestion and
               | then several aggregate views which produced a bunch of
               | purpose-specific "materialized" tables for efficient
               | analysis based on the ingestion tables.
               | 
                | Timescale had a nice way of abstracting away the cost of
                | updating these views without putting too much load on
                | ingestion (processing multiple TBs of data at a time in
                | a single instance, with about 500 GB of data churn
                | daily).
        
               | worldsayshi wrote:
                | One db that could be interesting here is CrateDB. It's a
                | Lucene-based DB that supports the Postgres wire
                | protocol, so you can run SQL queries against it.
                | 
                | I've tried figuring out if it supports acting as a pg
                | read replica, which sounds to me like the ideal setup --
                | but it doesn't seem to be supported.
               | 
               | I have no affiliation to them, just met the team at an
               | event and thought it sounded cool.
        
               | philippemnoel wrote:
               | One of the ParadeDB maintainers here -- Being PostgreSQL
               | wire protocol compatible is very different from being
               | built inside Postgres on top of the Postgres pages, which
                | is what ParadeDB does. You still need the "T" in ETL,
                | i.e. transforming data from your source into the format
                | of the sink (in your example, CrateDB). This is where
                | ETL costs and brittleness come into play.
               | 
               | You can read more about it here:
               | https://www.paradedb.com/blog/block_storage_part_one
        
               | otterley wrote:
               | They're different access patterns, though. Are there no
               | concerns about performance and potentially blocking
               | behavior? Decoupling OLTP and analytics is frequently
               | done with good reason: 1/to allow the systems to scale
               | independently, and 2/to help prevent issues with one
               | component from impacting the other (i.e., contain blast
               | radius). I wouldn't want a failure of my search engine to
               | also take down my transaction system.
        
               | philippemnoel wrote:
                | You don't need to. Customers usually deploy us on one or
                | more standalone replicas in their Postgres cluster. If a
                | query were to take a node down, it would only take down
                | the replica(s) dedicated to ParadeDB, leaving the
                | primary and all other read replicas dedicated to OLTP
                | safe.
        
               | otterley wrote:
                | Are you saying that the cluster isn't homogeneous? It
               | sounds like you're describing an architecture that
               | involves a cluster that has two entirely different pieces
               | of software on it, and whose roles aren't
               | interchangeable.
        
               | philippemnoel wrote:
               | Bear with me, this will be a bit of a longer answer.
               | Today, there are two topologies under which people deploy
               | ParadeDB.
               | 
               | - <some managed Postgres service> + ParadeDB. Frequently,
               | customers already use a managed Postgres (e.g. AWS RDS)
               | and want ParadeDB. In that world, they maintain their
               | managed Postgres service and deploy a Kubernetes cluster
               | running ParadeDB on the side, with one primary instance
               | and some number of replicas. The AWS RDS primary sends
               | data to the ParadeDB primary via logical replication. You
               | can see a diagram here:
               | https://docs.paradedb.com/deploy/byoc
               | 
               | In this topology, the OLTP and search/OLAP workloads are
               | fully isolated from each other. You have two clusters,
               | but you don't need a third-party ETL service since
               | they're both "just Postgres".
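                | 
                | The RDS-to-ParadeDB link above is plain Postgres
                | logical replication under the hood; a minimal sketch
                | (table, host, and user names are invented for
                | illustration):

```sql
-- On the managed Postgres primary (hypothetical table name):
CREATE PUBLICATION paradedb_pub FOR TABLE orders;

-- On the ParadeDB primary:
CREATE SUBSCRIPTION paradedb_sub
    CONNECTION 'host=my-rds.example.com dbname=app user=replicator'
    PUBLICATION paradedb_pub;
```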
               | 
               | - <self-hosted Postgres> + ParadeDB. Some customers,
               | typically larger ones, prefer to self-host Postgres and
               | want to install our Postgres extension directly. The
               | extension is installed in their primary Postgres, and the
               | CREATE INDEX commands must be issued on the primary;
               | however, they may route reads only to a subset of the
               | read replicas in their cluster.
               | 
               | In this topology, all writes could be directed to the
               | primary, all OLTP read queries could be routed to a pool
               | of read replicas, and all search/OLAP queries could be
               | directed to another subset of replicas.
               | 
               | Both are completely reasonable approaches and depend on
               | the workload. Hope this helps :)
        
               | adsharma wrote:
               | Which of these two is the higher order bit?
               | 
               | * ParadeDB speaks postgres protocol
               | 
               | * These setups don't have a complex ETL pipeline
               | 
                | If you have an ETL pipeline specialized for PG logical
                | replication (as opposed to generic JVM-based
                | Debezium/Kafka setups), you get some fraction of the
                | same benefits. I'm curious about Conduit and its
                | Postgres plugin.
               | 
                | That leaves: ParadeDB uses vanilla Postgres + a Rust
                | extension. This is a technology detail. I was looking
                | for an articulation of the customer benefit of this
                | technologically appealing architecture.
        
               | philippemnoel wrote:
                | The value props for customers vs. Elasticsearch are:
               | 
               | - ACID w/ JOINs
               | 
               | - Real-time indexing under UPDATE-heavy workloads.
               | Instacart wrote about this, they had to move away from
               | Elasticsearch during COVID because of this problem:
               | https://tech.instacart.com/how-instacart-built-a-modern-
               | sear...
               | 
                | Beyond these two, the added benefits are:
               | 
               | - Infrastructure simplification (no need for ETL)
               | 
               | - Lower costs
               | 
               | Speaking the wire protocol is nice, but it's not worth
               | much.
        
               | dangoodmanUT wrote:
               | they both sound like postgres to me, just with different
               | extensions
        
               | ucarion wrote:
                | Since we both worked there: I can think of a few places
                | at Segment where we'd have added more
                | reporting/analytics/search if it weren't such a pain to
                | set up an OLAP copy of our control plane databases.
               | Remember how much engineering effort we spent on teams
               | that did nothing but control plane database stuff?
               | 
               | Data plane is a different story, but not everything is
               | 1m+ RPS.
        
               | gtowey wrote:
               | It's not going to happen anytime soon, because you simply
               | cannot cheat physics.
               | 
               | A system that supports OLAP/ad-hoc queries is going to
               | need a ton of IOPs & probably also CPU capacity to do
               | your data transformations. If you want this to also scale
               | beyond the capacity limits of a single node, then you're
               | going to run into distributed joins and network becomes a
               | huge factor.
               | 
               | Now, to support OLTP at the same time, your big,
               | distributed system needs to support ACID, be highly
               | fault-tolerant, etc.
               | 
               | All you end up with is a system that has to be scaled in
               | every dimension. It needs to support the maximum possible
                | workloads you can throw at it, or else a random,
                | expensive reporting query is going to DoS your system
                | and your primary customer-facing system will be unusable
                | at the same time. It is sort of possible, but it's going
               | the same time. It is sort of possible, but it's going to
               | cost A LOT of money. You have to have tons and tons of
               | "spare" capacity.
               | 
               | Which brings us to the core of engineering -- anyone can
               | build a system that burns dump trucks full of venture
               | capital dollars to create the one-system-to-rule-them-
               | all. But businesses that want to succeed need to optimize
               | their costs so their storage systems don't break the
                | bank. This is why the current status quo of specialized
               | systems that do one task well isn't going to change. The
               | current technology paradigm cannot be optimized for every
               | task simultaneously. We have to make tradeoffs.
        
               | throwaway7783 wrote:
               | I don't know. For me, I need
               | 
               | * a primary transactional DB that I can write fast, with
               | ACID guarantees and a read-after-write guarantee, and
               | allows failover
               | 
                | * one (or more) secondaries that are optimized for
                | analytics and search. This should also tell me how
                | caught up the system is with the primary.
               | 
                | If they all can talk the same language (SQL) and can
                | replicate from the primary with no additional
                | tools/technology (Postgres replication, for example), I
                | will take it any day.
                | 
                | It is about operational simplicity and not needing to
                | know multiple technologies intimately. Granted, even if
                | this is "just" PostgreSQL, it really is not, and all
                | customizations will have their own tuning and whatnot,
                | but the context is all still PostgreSQL.
                | 
                | Yes, this will not magically solve the CAP theorem, but
                | for most cases we don't need to care too much.
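                | 
                | (The "how caught up" piece already exists in stock
                | Postgres; a sketch using the standard monitoring
                | views/functions:)

```sql
-- On the primary: per-replica WAL lag from the standard stats view
SELECT application_name, state, write_lag, flush_lag, replay_lag
FROM pg_stat_replication;

-- On a replica: wall-clock staleness of the last replayed transaction
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;
```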
        
             | cryptonector wrote:
             | For JOINs? Absolutely! Who wants to hand-code queries at
             | the executor level?! It's expensive!
             | 
             | You need a query language.
             | 
             | You don't necessarily need ACID, and you don't necessarily
             | need a bunch of things that SQL RDBMSes give you, but you
             | definitely need a QL, and it has to support a lot of what
             | SQL supports, especially JOINs and GROUP BY w/
             | aggregations.
             | 
             | NoSQLs tend to evolve into having a QL layered on top. Just
             | start with that if you really want to build a NoSQL.
        
               | otterley wrote:
               | To be clear here, I'm not arguing that
               | OpenSearch/ElasticSearch is an adequate substitute for
               | Postgres. They're different databases, each with
               | different strengths and weaknesses. If you need JOINs and
               | ACID compliance, you _should_ use Postgres. And if you
                | need distributed search, you _should_ use
                | OpenSearch/ElasticSearch.
               | 
               | Unless they're building for single-host scale, you're not
               | going to get JOINs for free. Lucene (the engine upon
               | which ES/OS is based) already has JOIN capability. But
               | it's not used in ES/OS because the performance of JOINs
               | is absolutely abysmal in distributed databases.
        
               | philippemnoel wrote:
               | Our customers typically deploy ParadeDB in a primary-
               | replicas topology, with one primary Postgres node and 2
               | or more read replicas, depending on read volume. Queries
               | are executed on a single node today, yes.
               | 
               | We have plans to eventually support distributed queries.
        
               | cryptonector wrote:
               | I'm arguing that sometimes you don't need ACID, or
               | rather, sometimes you accept that ACID is too painful so
               | you accept not having ACID, but no one ever really
               | doesn't want a QL -- they only think that they don't want
               | a QL until they learn better.
               | 
               | I.e., NoACID does not imply NoQueryLanguage, and you can
               | always have a QL, so you should always get a QL, and you
               | should always use a QL.
               | 
               | > Unless they're building for single-host scale, you're
               | not going to get JOINs for free.
               | 
               | If by 'free' you mean not having to code them, then
               | that's wrong. You can always have or implement a QL.
               | 
               | If by 'free' you mean 'performant', then yes, you might
               | have to denormalize your data so that JOINs vanish,
               | though at the cost of write amplification. But so what,
               | that's true whether you use a QL or not -- it's true in
               | SQL RDBMSes too.
        
               | derefr wrote:
               | It's funny; as someone who is exactly pg_search's market,
               | I actually often want the opposite: ACID, MVCC
               | transactions, automatic table and index management... but
               | no query language.
               | 
                | At the data scale + level of complexity our OLAP
                | queries operate at, we _very_ often run into situations
                | where Postgres's very best plan [with a well-considered
                | schema, with great indexes and statistics, and after
                | tons of tuning and coaxing] still does something
                | literally interminable -- not for any semantic reason
                | to do with the query plan, but rather due to how
                | Postgres's architecture _executes_ the query plan[1].
               | 
               | The last such job, I thought would be simple enough to
               | run in a few hours... I let it run for six days[2], and
               | then gave up and killed it. Whereas, when we encoded the
               | same "query plan" as a series of bulk-primitive ETL steps
               | by:
               | 
               | 1. dumping the raw source data from PG to CSV with a
               | `COPY`,
               | 
               | 2. whipping out simple POSIX CLI tools like
               | sort/uniq/grep/awk (plus a few hand-rolled streaming
               | aggregation scripts) to transform/reduce/normalize the
               | source data into the shape we want it in,
               | 
               | 3. and then loading the resulting CSVs back into PG with
               | another `COPY`,
               | 
               | ...then the runtime of the whole operation was reduced to
               | just a few hours, with the individual steps completing in
               | ~30 minutes each. (And that's despite the overhead of
               | parsing and/or emitting non-string fields from/to CSV
               | with almost every intermediate step!)
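                | 
                | A compressed sketch of steps 1-3 (table and file names
                | invented; the psql steps are commented out so the
                | aggregation core runs standalone):

```shell
# 1. Dump the sources to CSV (would run against a real server):
# psql -c "\copy (SELECT a, b, value FROM source1) TO 'source1.csv' CSV"
# psql -c "\copy (SELECT a, b, value FROM source2) TO 'source2.csv' CSV"

# Stand-in input so the sketch is self-contained:
printf '1,1,10\n2,1,5\n1,1,7\n' > source1.csv
printf '2,1,3\n1,2,4\n' > source2.csv

# 2. UNION ALL + GROUP BY as a streaming pipeline: concatenate, sort on
#    the two key columns, then sum each run of equal keys in one awk pass.
cat source1.csv source2.csv \
  | sort -t, -k1,1n -k2,2n \
  | awk -F, '
      k != $1 FS $2 { if (NR > 1) print k FS s; k = $1 FS $2; s = 0 }
      { s += $3 }
      END { print k FS s }' > totals.csv
# totals.csv now holds: 1,1,17 / 1,2,4 / 2,1,8

# 3. Load the aggregate back in (would run against a real server):
# psql -c "\copy totals FROM 'totals.csv' CSV"
```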
               | 
               | Honestly, if Postgres would just let us program it the
               | way one programs e.g. Redis through Lua, or ETS tables in
               | Erlang -- where the tables and indices are ADTs with low-
               | level public APIs, and you set up your own "query plan"
               | as a set of streaming-channel actors making calls to
               | these APIs -- then we would be a lot happier. But even in
               | PL/pgSQL (which we _do_ use, here and there), the only
               | APIs are high-level ones.
               | 
                | * Sure, you can get a cursor on a _query_; but you can't
               | e.g. get an LMDB-like B-tree cursor on a target B-tree
               | index, and ask it to jump [i.e. re-nav down from root] or
               | walk [i.e. nav up from current pos to nearest common
               | ancestor then back down] to "the first row-tuple greater-
               | than-or-equal to [key]".
               | 
               | * You can't write your own efficient implementation of
               | TABLESAMPLE semantics to set up your own Bigtable-esque
               | balanced cluster-order-partitioned parallel seq scan.
               | 
               | * You can't collect pointers to row-tuples, partially
               | materialize them, filter them by some criterion on the
               | read (but perhaps not _parsed_!) columns, and _then_
                | more-fully materialize those same row-tuples "directly"
               | from the references to them you still hold.
               | 
               | ---
               | 
               | [1] One example of what I mean by "execution": did you
               | know that Postgres doesn't use any form of concurrency
               | for query plans -- not even the most basic libuv-like
               | "This Merge Append node's child-node A is in a blocking-
               | wait on IO; that blocking-wait should yield, so that the
               | Merge Append node's child-node B can instead send row-
               | tuple batches for a while" kind of concurrency?
               | 
               | ---
               | 
               | [2] If you're wondering, the query that ran for six days
               | was literally just this (anonymized):
                | SELECT a, b, SUM(value) AS total_value
                | FROM (
                |   SELECT a, b, value FROM source1
                |   UNION ALL
                |   SELECT a, b, value FROM source2
                | ) AS u
                | GROUP BY a, b;
               | 
               | `source1` and `source2` are ~150GB tables. (Or at least,
               | they're 150GB when dumped to CSV.) Two integer keys
               | (a,b), and a bigint value. With a b-tree index on `(a,b)
               | INCLUDE (value)`, with correct statistics.
               | 
               | And its EXPLAIN query plan looked like this (with `SET
               | enable_hashagg = OFF;`) -- _nominally_ pretty good:
                | GroupAggregate
                |   (cost=1.17..709462419.92 rows=40000 width=40)
                |   Group Key: a, b
                |   ->  Merge Append
                |         (cost=1.17..659276497.84
                |          rows=6691282944 width=16)
                |         Sort Key: a, b
                |         ->  Index Only Scan using source1_a_b_idx
                |             on source1  (cost=0.58..162356175.31
                |             rows=3345641472 width=16)
                |         ->  Index Only Scan using source2_a_b_idx
                |             on source2  (cost=0.58..162356175.31
                |             rows=3345641472 width=16)
               | 
               | Each one of the operations here is "obvious." It's what
               | you'd think you'd want! You'd think this _would_ finish
               | quickly. And yet.
               | 
                | (And no, the machine it ran on was _not_
                | resource-bottlenecked. It had 1TB of RAM with no
                | contention from other jobs, and this PG session was
                | allowed to use much of it as work memory. But even if
                | it was spilling to disk at every step... that should
                | have been fine. The CSV equivalent of this inherently
                | "spills to disk" for everything except the nursery
                | levels of sort(1)'s merge-sort. And it does fine.)
        
               | cryptonector wrote:
               | > At the data scale + level of complexity our OLAP
               | queries operate at, we very often run into situations
               | where Postgres's very best plan [with a well-considered
               | schema, with great indexes and statistics, and after tons
               | of tuning and coaxing], still does something literally
               | interminable -- not for any semantic reason to do with
               | the query plan, but rather due to how Postgres's
               | architecture executes the query plan[1].
               | 
                | Well, ok, this is a problem, and I have run into it
                | myself. That's not a reason for not wanting a QL. It's
                | a reason for wanting a way to improve the query
                | planning. Query hints in the QL are a bad idea for
                | several reasons. What I would like instead is
                | out-of-band query hints that I can provide along with
                | my query (though obviously only when using APIs rather
                | than `psql`; for `psql` one would have to provide the
                | hints via some \hints command). Each hint would address
                | a table source using names/aliases for table sources /
                | joins and names for subqueries -- really a path through
                | the query and subqueries like `.<sub_query_alias0>.<sub
                | _query_alias1>.<...>.<sub_query_aliasN>.<table_source_a
                | lias>` -- and would indicate things like what sub-query
                | plan type to use and what index to use.
        
               | derefr wrote:
               | I mean, in my case, I don't think what I want could be
               | implemented via query hints. The types of things I would
               | want to communicate to the server, are pragmas entirely
               | divorced from the semantics of SQL: pragmas that only
               | make sense if you can force the query's plan to take a
               | specific shape to begin with, because you're trying to
                | tune specific knobs _on specific plan nodes_, so if
               | those plan nodes aren't part of the final query, then
               | your tuning is meaningless.
               | 
               | And if you're pinning the query plan to a specific shape,
               | then there's really no point in sending SQL + hints; you
               | may as well just expose a lower-level "query-execution-
               | engine abstract-machine bytecode" that the user can
               | submit, to be translated in a very low-level -- but
                | contractual! -- way into a query plan. Or, one step
                | further, into _the thing a query plan does_, skipping
                | the plan-node-graph abstraction entirely in favor of
                | arbitrarily calling the same primitives the plan nodes
                | call [in a sandboxed way, because such bytecode
                | _should_ be low-level enough that it can encode invalid
                | operation sequences that will crash the PG connection
                | backend -- and this is fine, the user signed up for
                | that; they just want to be assured that such a crash
                | won't affect data integrity outside the current
                | transaction.]
               | 
                | Such a bytecode wouldn't _have_ to be used as the
                | literal compiled internal representation of SQL within
                | the server, mind you. (It'd be _ideal_ if it were, but
                | it doesn't _need_ to be.) Just like e.g. the published
                | and versioned JVM bytecode spec isn't 1:1 with the
                | bytecode ISA the JVM actually uses as its in-memory
                | representation for interpretation -- there's
                | module-load-time translation/compilation from the
                | stable public format to the current internal format.
        
               | strbean wrote:
                | Obligatory whine that the term NoSQL got co-opted to
                | mean "non-relational". There's tons of space for a
                | better query language for querying relational
                | databases.
        
             | beoberha wrote:
             | Interestingly enough, it looks like the team was just
             | hacking on an open source extension and organically
             | attracted some customers, which snowballed into raising
             | capital. So definitely seems like there's a market.
        
         | sandeepkd wrote:
          | There can be multiple reasons; the one I can think of
          | right away would be to keep the stack as simple as
          | possible for as long as you can. Realistically speaking,
          | most companies do not operate at a scale where they would
          | need specialized tools.
        
         | AtlasBarfed wrote:
          | Why not the Cassandra-based Elastics if you need ingest?
        
         | izietto wrote:
          | Because it's a whole other architectural component for
          | data that is already there, in a tool made just for that,
          | lacking only `SELECT full_text_search('kitty pictures');`
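          | 
          | For what it's worth, pg_search's syntax is close to that
          | wish -- roughly (table and column names invented; treat
          | the index options as a sketch from its docs):

```sql
-- BM25 index over a hypothetical table of pictures:
CREATE INDEX pictures_search_idx ON pictures
USING bm25 (id, caption)
WITH (key_field = 'id');

-- Full-text search via pg_search's @@@ operator:
SELECT id, caption
FROM pictures
WHERE caption @@@ 'kitty pictures';
```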
        
       | truetraveller wrote:
        | Diagrams made how?
        
         | hobs wrote:
         | Guessing Figma.
        
           | philippemnoel wrote:
           | Yes, Figma!
        
       ___________________________________________________________________
       (page generated 2025-07-21 23:01 UTC)