[HN Gopher] Jepsen: Amazon RDS for PostgreSQL 17.4
       ___________________________________________________________________
        
       Jepsen: Amazon RDS for PostgreSQL 17.4
        
       Author : aphyr
       Score  : 561 points
        Date   : 2025-04-29 14:30 UTC (1 day ago)
        
 (HTM) web link (jepsen.io)
 (TXT) w3m dump (jepsen.io)
        
       | henning wrote:
       | I thought this kind of bullshit was only supposed to happen in
       | MongoDB!
        
         | kabes wrote:
          | Then you haven't read enough Jepsen reports. Distributed
          | system guarantees generally can't be trusted.
        
           | __alexs wrote:
            | Postgres usually isn't a distributed system in this
            | configuration though, is it?
        
             | semiquaver wrote:
             | The result is for "Amazon RDS for PostgreSQL multi-AZ
             | clusters" which are certainly a distributed system.
             | 
             | I'm not well versed in RDS but I believe that clustered is
             | the only way to use it.
        
               | NewJazz wrote:
               | No, you can have single instances
        
               | reissbaker wrote:
               | This writeup tested multi-AZ RDS for Postgres -- which is
               | always distributed behind the scenes (otherwise, it
               | couldn't exist in multiple AZs).
        
               | dragonwriter wrote:
               | An RDS cluster can have a single instance (but it can't
               | be multi-AZ with a single instance.)
        
             | dragonwriter wrote:
             | A multi-AZ cluster is necessarily a distributed system.
        
         | colesantiago wrote:
         | Do people still use MongoDB in production?
         | 
          | I was quite surprised to read that Stripe used MongoDB in the
          | early days and still does today, and I can't imagine the sheer
         | nightmares they must have faced using it for all these years.
        
           | colechristensen wrote:
           | mongodb is a public company with a market cap of 14.2 billion
           | dollars. so yes, people still use it in production
        
             | djfivyvusn wrote:
             | I've been looking for a job the last few weeks.
             | 
             | Literally the only job ad I've seen talking about MongoDB
             | was a job ad for MongoDB itself.
        
           | senderista wrote:
           | MongoDB has come a long way. They acquired a world-class
           | storage engine (WiredTiger) and then they hired some world-
           | class distsys people (e.g. Murat Demirbas). They might still
           | be hamstrung by early design and API choices but from what I
           | can tell (never used it in anger) the implementation is
           | pretty solid.
        
           | computerfan494 wrote:
           | MongoDB is a very good database, and these days at scale I am
           | significantly more confident in its correctness guarantees
           | than any of the half-baked Postgres horizontal scaling
          | solutions. I have run both databases at seven-figure-a-month
           | spend scale, and I would not choose off-the-shelf Postgres
           | for this task again.
        
         | bananapub wrote:
          | I think ZooKeeper is still the only distributed system that got
          | through Jepsen without data-loss bugs, though at high cost:
         | https://aphyr.com/posts/291-jepsen-zookeeper
        
           | robterrell wrote:
           | Didn't FoundationDB get a clean bill of health?
        
             | MarkMarine wrote:
             | wasn't tested because: "haven't tested foundation in part
             | because their testing appears to be waaaay more rigorous
             | than mine."
             | 
             | https://web.archive.org/web/20150312112552/http://blog.foun
             | d...
        
             | bananapub wrote:
             | apparently wasn't tested because Kyle thought the internal
             | testing was better than jepsen itself:
             | https://abdullin.com/foundationdb-is-back/
        
             | necubi wrote:
             | Aphyr didn't test foundation himself, but the foundation
             | team did their own Jepsen testing which they reported
             | passing. All of this was a long time ago, before Foundation
             | was bought by Apple and open sourced.
             | 
             | Now members of the original Foundation team have started
             | Antithesis (https://antithesis.com/) to make it easier for
             | other systems to adopt this sort of testing.
        
         | Thaxll wrote:
          | Those memes are 10 years old. You know that some very big
          | tech companies use MongoDB, right? We're talking billions a
          | year.
        
           | djfivyvusn wrote:
           | What is your point?
        
             | Thaxll wrote:
             | MongoDB is reliable.
        
           | xmodem wrote:
           | Billion dollar companies lose their customer's data all the
           | time.
        
       | tibbar wrote:
       | The submitted title buries the lede: RDS for PostgreSQL 17.4 does
       | not properly implement snapshot isolation.
        
         | belter wrote:
          | And your comment also buries it... this is in Multi-AZ
          | clusters.
         | 
          | Well, this is from Kyle Kingsbury, the Chuck Norris of
          | transactional guarantees. AWS has to reply or clarify, even if
          | it only seems to apply to Multi-AZ clusters. Those are one of
          | the two possibilities for RDS with Postgres: Multi-AZ
          | deployments can have one standby or two standby DB instances,
          | and this is for the two standby DB instances. [1]
         | 
          | They make no such promises in their documentation. Their
          | 5,494-page manual on RDS hardly mentions isolation or
          | serializable, except in the documentation of parameters for
          | the different engines.
         | 
          | Nothing on global read consistency for Multi-AZ clusters,
          | because why should they... :-) They talk about semi-
          | synchronous replication, where the writer waits for one
          | standby to confirm the log record, but the two readers can be
          | on different snapshots?
         | 
         | [1] - "New Amazon RDS for MySQL & PostgreSQL Multi-AZ
         | Deployment Option: Improved Write Performance & Faster
         | Failover" - https://aws.amazon.com/blogs/aws/amazon-rds-multi-
         | az-db-clus...
         | 
         | [2] - "Amazon RDS Multi-AZ with two readable standbys: Under
         | the hood" - https://aws.amazon.com/blogs/database/amazon-rds-
         | multi-az-wi...
        
           | n2d4 wrote:
            | > They make no such promises in their documentation. Their
            | 5,494-page manual on RDS hardly mentions isolation or
            | serializable
           | 
           | Well, as a user, I wish they would mention it though. If I
           | migrate to RDS with multi-AZ after coming from plain Postgres
           | (which documents snapshot isolation as a feature), I would
           | probably want to know how the two differ.
        
         | gymbeaux wrote:
         | Par for the course
        
         | altairprime wrote:
         | I emailed the mods and asked them to change it to this phrase
         | copy-pasted from the linked article:
         | 
         | > Amazon RDS for PostgreSQL multi-AZ clusters violate Snapshot
         | Isolation
        
           | altairprime wrote:
           | (The mods replied above; thank you!)
        
         | aphyr wrote:
         | Folks on HN are often upset with the titles of Jepsen reports,
         | so perhaps a little more context is in order. Jepsen reports
         | are usually the product of a long collaboration with a client.
         | Clients often have strong feelings about how the report is
         | titled--is it too harsh on the system, or too favorable? Does
         | it capture the most meaningful of the dozen-odd issues we
         | found? Is it fair, in the sense that Jepsen aims to be an
         | honest broker of database safety findings? How will it be
         | interpreted in ten years when people link to it routinely, but
         | the findings no longer apply to recent versions? The resulting
         | discussions can be, ah, vigorous.
         | 
         | The way I've threaded this needle, after several frustrating
         | attempts, is to have a policy of titling all reports "Jepsen:
         | <system> <version>". HN is of course welcome to choose their
         | own link text if they prefer a more descriptive, or colorful,
         | phrase. :-)
        
           | dang wrote:
           | Given that author and submitter (and commenter!) are all the
           | same person I think we can go with your choice :)
           | 
           | The fact that the thread is high on HN, plus the GP comment
           | is high in the thread, plus that the audience knows how
           | interesting Jepsen reports get, should be enough to convey
           | the needful.
        
           | broost3r wrote:
           | long time lurker here who registered on HN many years ago
           | after reading Jepsen: Cassandra
           | 
            | the Jepsen writeups will surely stand the test of time.
            | Thank you!
        
       | cr3ative wrote:
       | This is in such a thick academic style that it is difficult to
       | follow what the problem actually might be and how it would impact
       | someone. This style of writing serves mostly to remind me that I
       | am not a part of the world that writes like this, which makes me
       | a little sad.
        
         | glutamate wrote:
         | In the beginning, when you read papers like this, it can be
         | hard work. You can either give up or put some effort in to try
         | to understand it. Maybe look at some of the other Jepsen
         | reports, some may be easier. Or perhaps an introductory CS
         | textbook. With practice and patience it will become easier to
         | read and eventually write like this.
         | 
         | You may not be part of that world now, but you can be some day.
         | 
          | EDIT: forgot to say, I had to read 6 or 7 books on Bayesian
          | statistics before I understood the most basic concepts. A few
          | years later I wrote a compiler for a statistical programming
          | language.
        
           | cr3ative wrote:
           | I'll look to do so, and appreciate your pointers. Thank you
           | for being kind!
        
           | concerndc1tizen wrote:
           | The state of the art is always advancing, which greatly
           | increases the burden of starting from first principles.
           | 
           | I somewhat feel that there was a generation that had it
           | easier, because they were pioneers in a new field, allowing
           | them to become experts quickly, while improving year-on-year,
           | being paid well in the process, and having great network and
           | exposure.
           | 
           | Of course, it can be done, but we should at least acknowledge
           | that sometimes the industry is unforgiving and simply doesn't
           | have on-ramps except for the privileged few.
        
             | _AzMoo wrote:
             | > I somewhat feel that there was a generation that had it
             | easier
             | 
             | I don't think so. I've been doing this for nearly 35 years
             | now, and there's always been a lot to learn. Each layer of
             | abstraction developed makes it easier to quickly iterate
             | towards a new outcome faster or with more confidence, but
             | hides away complexity that you might eventually need to
             | know. In a lot of ways it's easier these days, because
             | there's so much information available at your fingertips
             | when you need it, presented in a multitude of different
             | formats. I learned my first programming language by reading
             | a QBasic textbook trying to debug a text-based adventure
             | game that crashed at a critical moment. I had no Internet,
             | no BBS, nobody to help, except my Dad who was a solo RPG
             | programmer who had learned on the job after being promoted
             | from sweeping floors in a warehouse.
        
               | robertlagrant wrote:
               | > solo RPG programmer
               | 
               | The kids might not know this means "IBM mainframe" rather
               | than "role playing game" :)
        
         | jorams wrote:
         | It uses a lot of very specific terminology, but the linked
         | pages like the one on "G-nonadjacent" do a lot to clear up what
         | it all means. It _is_ a lot of reading.
         | 
         | Essentially: The configuration claims "Snapshot Isolation",
         | which means every transaction looks like it operates on a
         | consistent snapshot of the entire database at its starting
         | timestamp. All transactions starting after a transaction
         | commits will see the changes made by the transaction. Jepsen
         | finds that the snapshot a transaction sees doesn't always
         | contain everything that was committed before its starting
          | timestamp. Transactions A and B can both commit their changes,
         | then transactions C and D can start with C only seeing the
         | change made by A and D only seeing the change made by B.
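          | 
          | (A sketch of that long-fork outcome with two rows, x and y,
          | both initially 0; the table name is hypothetical:)
          | 
          |     -- A: UPDATE t SET v = 1 WHERE k = 'x';  -- committed
          |     -- B: UPDATE t SET v = 1 WHERE k = 'y';  -- committed
          |     -- C reads: x = 1, y = 0   (sees A but not B)
          |     -- D reads: x = 0, y = 1   (sees B but not A)
          | 
          | No single ordering of A and B can explain both C's and D's
          | reads; that is the "long fork" the report describes.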
        
         | renewiltord wrote:
         | It's maximal information communication. Use LLM to distill to
         | your own knowledge level. It is trivial with modern LLM. Very
         | good output in general.
        
           | benatkin wrote:
           | It addresses the reader no matter how knowledgeable they are.
           | It's a very good use of hypertext, making it so that a
           | knowledgeable reader won't need to skip over much.
        
             | renewiltord wrote:
             | Better with LLM.
        
         | ZYbCRq22HbJ2y7 wrote:
         | > such a thick academic style
         | 
         | Why? Because it has variables and a graph?
         | 
         | What sort of education background do you have?
        
         | vlovich123 wrote:
         | Have you tried using an LLM? I've found good results getting at
         | the underlying concepts and building a mental model that works
         | for me that way. It makes domain expertise - that often has
         | unique terminology for concepts you already know or at least
         | know without a specific name - more easily accessible after a
         | little bit of a QA round.
        
           | vlovich123 wrote:
            | Lots of downvotes with no actual explanation of what the
            | issue is with my suggestion.
           | 
           | I've repeatedly used ChatGPT and Claude to help me understand
           | papers and to cut through the verbiage to the underlying
           | concepts.
        
         | deathanatos wrote:
         | I empathize with the feeling of this being dense and
         | unapproachable; I remember when I was first approaching these
         | posts, and feeling the same.
         | 
         | For this particular one, the graph under "Results" is the most
         | approachable portion, I think. (Don't skip the top two
         | sections, though ... and they're so short.) In the graph, each
         | line is a transaction, and read them left-to-right.
         | 
         | Hopefully I get this right, though if I do not, I'm sure
         | someone will correct me. Our database is a set of ordered lists
          | of integers. Something like,
          | 
          |     CREATE TABLE test (
          |       id int primary key,
          |       -- (but specifically, this next column holds a list of
          |       --  ints, e.g., a value might be, '1,8,11'; the list of
          |       --  ints is a comma separated string.)
          |       list text not null
          |     );
         | 
          | The first transaction:
          | 
          |     a 89 9
          | 
          | This is shorthand; it means "(a)ppend to list #89 the integer
          | 9" (in SQL, _crudely_, this is perhaps something like
          | 
          |     UPDATE test SET list = CONCAT(list, ',9') WHERE id = 89;
          | 
          | ... though we'd need to handle the case where the list doesn't
          | exist yet, turning it into an `INSERT ... ON CONFLICT ... DO
          | UPDATE ...`, so it would get gnarlier.[2]); the next:
          | 
          |     r 90 nil    # read list 90; the result is nil
          |     r 89 [4 9]  # read list 89; the result is [4, 9]
          |     r 90 nil    # read list 90; the result is (still) nil
         | 
         | I assume you can `SELECT` ;) That should provide sufficient
         | syntax for one to understand the remainder.
         | 
         | The arrows indicate the dependencies; if you click "read-write
         | dependencies"[1], that page explains it.
         | 
         | Our first transaction appends 9 to list 89. Our second
         | transaction reads that same list, and sees that same 9, thus,
         | it must start _after_ the first transaction has committed. The
         | remaining arrows form similar dependencies, and once you take
         | them all into account, they form a cycle; this should _feel_
          | problematic. It's that they're in a cycle, which snapshot
          | isolation does not permit, so we've observed a contradiction in
          | the system: these cannot be obeying snapshot isolation. (This
          | is what "To understand why this cycle is illegal..." gets at;
          | it is fairly straightforward. T1 is the first row in the graph,
          | T2 the second, and so forth. But it is only straightforward once
         | you've understood the graph, I think.)
         | 
         | > _This is in such a thick academic style that it is difficult
         | to follow what the problem actually might be and how it would
         | impact someone._
         | 
         | I think a lot of this is because it is written with precision,
         | and that precision requires a lot of academic terminology.
         | 
         | Some of it is just syntax peculiar to Jepsen, which I think
         | comes from Clojure, which I think most of us (myself included)
         | are just not familiar with. Hence why I used SQL and comma-
         | sep'd lists in my commentary above; that is likely more widely
         | read. It's a bit rough when you first approach it, but once you
         | get the notation, the payoff is worth it, I guess.
         | 
         | More generally, I think once you grasp the graph syntax &
         | simple operations used here, it becomes easier to read other
         | posts, since they're mostly graphs of transactions that, taken
         | together, make no logical sense at all. Yet they happened!
         | 
         | > _This style of writing serves mostly to remind me that I am
         | not a part of the world that writes like this, which makes me a
         | little sad._
         | 
         | I think Jepsen posts, with a little effort, are approachable.
         | This post is a good starter post; normally I'd say Jepsen posts
         | tend to inject faults into the system, as we're testing if the
         | guarantees of the system hold up under stress. This one has no
         | fault injection, though, so it's a bit simpler.
         | 
          | Beware, though: if you learn to read these, you'll never
          | trust a database again.
         | 
         | [1]: https://jepsen.io/consistency/dependencies
         | 
         | [2]: I think this is it? https://github.com/jepsen-
         | io/postgres/blob/225203dd64ad5e5e4... -- but this is pushing
         | the limits of my own understanding.
        
           | mdaniel wrote:
           | > Beware though, that if you learn to read these, that you'll
           | never trust a database again.
           | 
            | I chuckled, but _(while I don't have links to offer)_ I
           | could have sworn that there were some of them which actually
           | passed, and a handful of others that took the report to heart
            | and _fixed_ the bugs. I similarly recall that a product
            | showed up to their Show HN or Launch HN with a Jepsen report
            | in hand, and I was especially in awe of the maturity of that
            | (assuming, of course, I'm not hallucinating such a thing)
        
       | nijave wrote:
       | It's not entirely clear but this isn't an issue in multi instance
       | upstream Postgres clusters?
       | 
        | Am I correct in understanding that either AWS is doing
        | something with the cluster configuration or has added some
        | patches that introduce this behavior?
        
         | belter wrote:
          | Yes, it's different. This is a deeper overview of what they did:
         | https://youtu.be/fLqJXTOhUg4
         | 
          | Specifically here: https://youtu.be/fLqJXTOhUg4?t=434
        
         | aphyr wrote:
         | This is a very good question! I do not understand AWS's
         | replication architecture well enough to reimplement it with
         | standard Postgres yet. This behavior doesn't happen in single-
         | node Postgres, as far as I can tell, but it might happen in
         | some replication setups!
         | 
         | I also understand there are lots of ways to do Postgres
         | replication in general, with varying results. For instance,
         | here's Bin Wang's report on Patroni:
         | https://www.binwang.me/2024-12-02-PostgreSQL-High-Availabili...
        
         | aeyes wrote:
          | What are multi-instance upstream Postgres clusters for you?
          | PostgreSQL has no official support for failover of a master
          | instance; the only mechanism is Postgres replication, which
          | you can make synchronous. You can then build your own tooling
          | around this to form a Postgres cluster (Patroni is one such
          | tool).
         | 
         | AWS patched Postgres to replicate to two instances and to call
         | it good if one of the two acknowledges the change. When this
         | ack happens is not public information.
         | 
         | My personal opinion is that filesystem level replication (think
         | drbd) is the better approach for PostgreSQL. I believe that
         | this is what the old school AWS Multi-AZ instances do. But you
         | get lower throughput and you can't read from the secondary
         | instance.
        
           | nijave wrote:
           | >My personal opinion is that filesystem level replication
           | (think drbd) is the better approach for PostgreSQL
           | 
           | That's basically what their Aurora variant does. It uses
            | clustered/shared storage, then uses traditional replication
           | only for cache invalidation (so replicas know when data
           | loaded into memory/cache has changed on the shared storage)
        
         | mattashii wrote:
         | > It's not entirely clear but this isn't an issue in multi
         | instance upstream Postgres clusters?
         | 
         | No, it isn't an issue with single-instance PostgreSQL clusters.
         | Multi-instance PostgreSQL clusters (single primary, plus
         | streaming/physical replicas) are affected.
         | 
          | What they, too, discovered is that PostgreSQL currently doesn't
         | have consistent snapshot behaviour between the primary and
         | replicas. Presumably, read-only transaction T2 was executed on
         | a secondary (replica) node, while T1, T3, and T4 (all modifying
         | transactions) were executed on the primary.
         | 
         | Some background:
         | 
         | Snapshots on secondary PostgreSQL nodes rely on transaction
         | persistence order (location of commit record in WAL) to
         | determine which transactions are visible, while the visibility
         | order on the primary is determined by when the backend that
         | authorized the transaction first got notice that the
         | transaction was completely committed (and then got to marking
         | the transaction as committed). On each of these (primary and
         | secondary) the commit order is consistent across backends that
         | connect to that system, but the commit order may be somewhat
         | different between the primary and the secondary.
         | 
          | There is _some_ work ongoing to improve this, but that's still
         | very much WIP.
        
           | aphyr wrote:
            | Thank you mattashii--this would definitely explain it. I've
           | also received another email suggesting this anomaly is due to
           | the difference in commit/visibility order between primary and
           | secondary. Is there by chance a writeup of this available
           | anywhere that I can link to? It looks like
           | https://postgrespro.com/list/thread-id/1827129 miiight be
           | related, but I'm not certain. If so, I'd like to update the
           | report.
           | 
           | My email is aphyr@jepsen.io, if you'd like to drop me a line.
           | :-)
        
             | ants_a wrote:
             | That thread is indeed about the same issue. I don't think
             | anyone has done a more concise writeup on it.
             | 
             | Core of the issue is that on the primary, commit inserts a
             | WAL record, waits for durability, local and/or replicated,
             | and then grabs a lock (ProcArrayLock) to mark itself as no
             | longer running. Taking a snapshot takes that same lock and
             | builds a list of running transactions. WAL insert and
             | marking itself as visible can happen in different order.
             | This causes an issue on the secondary where there is no
             | idea of the apparent visibility order, so visibility order
             | on secondary is strictly based on order of commit records
             | in the WAL.
             | 
             | The obvious fix would be to make visibility happen in WAL
             | order on the primary too. However there is one feature that
             | makes that complicated. Clients can change the desired
             | durability on a transaction-by-transaction basis. The
             | settings range from confirm transaction immediately after
             | it is inserted in WAL stream, through wait for local
             | durability, all the way up to wait for it to be visible on
             | synchronous replicas. If visibility happens in WAL order,
             | then an async transaction either has to wait on every
             | higher durability transaction that comes before it in the
             | WAL stream, or give up on read-your-writes. That's
             | basically where the discussion got stuck without achieving
             | a consensus on which breakage to accept. This same problem
             | is also the main blocker for adopting a logical (or
             | physical) clock based snapshot mechanism.
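              | 
              | (For reference, these are the stock Postgres knobs in
              | question; a sketch, since RDS's internal equivalents
              | aren't public:)
              | 
              |     -- per-transaction durability, weakest to strongest:
              |     SET synchronous_commit = off;           -- ack before local WAL flush
              |     SET synchronous_commit = local;         -- wait for local WAL flush
              |     SET synchronous_commit = remote_write;  -- standby received the WAL
              |     SET synchronous_commit = on;            -- standby flushed the WAL
              |     SET synchronous_commit = remote_apply;  -- standby applied the WAL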
             | 
             | By now I'm partial to the option of giving up on read-your-
             | writes, with an opt-in option to see non-durable
             | transactions as an escape hatch for backwards
             | compatibility. Re-purposing SQL read uncommitted isolation
             | level for this sounds appealing, but I haven't checked if
             | there is some language in the standard that would make that
             | a bad idea.
             | 
              | A somewhat related idea is Eventual Durability, where write
             | transactions become visible before they are durable, but
             | read transactions wait for all observed transactions to be
             | durable before committing.
        
       | ezekiel68 wrote:
       | In my reading of this, it looks like the practical implication
       | could be that reads happening quickly after writes to the same
       | row(s) might return stale data. The write transaction gets marked
        | as complete before all of the distributed layers of a multi-AZ
       | RDS instance have been fully updated, such that immediate reads
       | from the same rows might return nothing (if the row does not
       | exist yet) or older values if the columns have not been fully
       | updated.
       | 
       | Due to the way PostgreSQL does snapshotting, I don't believe this
       | implies such a read might obtain a nonsense value due to only a
       | portion of the bytes in a multi-byte column type having been
       | updated yet.
       | 
       | It seems like a race condition that becomes eventually
       | consistent. Or did anyone read this as if the later
       | transaction(s) of a "long fork" might never complete under normal
       | circumstances?
        
         | aphyr wrote:
         | This isn't just stale data, in the sense of "a point-in-time
         | consistent snapshot which does not reflect some recent
         | transactions". I think what's going on here is that a read-only
         | transaction against a secondary can observe some transaction T,
         | but also _miss_ transactions which must have logically executed
         | before T.
        
           | mikesun wrote:
           | _" I think what's going on here is that a read-only
           | transaction against a secondary can observe some transaction
           | T, but also miss transactions which must have logically
           | executed before T."_
           | 
           | i was intuitively wondering the same but i'm having trouble
           | reasoning how the post's example with transactions 1, 2, 3, 4
           | exhibits this behavior. in the example, is transaction 2 the
           | only read-only transaction and therefore the only transaction
           | to read from the read replica? i.e. transactions 1, 3, 4 use
           | the primary and transaction 2 uses the read replica?
        
             | aphyr wrote:
             | Yeah, that's right. It may be that the (apparent) order of
             | transactions differs between primary and secondary.
        
               | mikesun wrote:
               | ah, so something like... if the primary ordered
               | transaction 3 < transaction 1, but transaction 2 observes
               | only transaction 1 on the read-only secondary potentially
               | because the secondary orders transaction 1 < transaction
               | 3?
        
               | franckpachot wrote:
               | The Write-Ahead Logging (WAL) is single-threaded and
               | maintains consistency at a specific point in time in each
               | instance. However, there can be anomalies between two
               | instances. This behavior is expected because the RDS
               | Multi-AZ cluster does not wait for changes to be applied
               | in the shared buffers. It only waits for the WAL to sync.
               | This is similar to the behavior of PostgreSQL when
               | synchronous_commit is set to on. Nothing unexpected.
        
           | kevincox wrote:
           | To provide a simple (although contrived) example of the type
           | of thing that can happen. Imagine that you have a table with
           | three columns `gps_coordinate`, `postal_code` and `city`. The
           | way these are set is that the new coordinate gets posted to
           | the API and `gps_coordinate` is updated. This then kicks off
           | a background task that uses the new coordinate to lookup and
           | update `postal_code`. Then another background task uses the
           | postal code to look up and set `city`.
           | 
           | Since these happen sequentially, for a single update of
           | `gps_coordinate` you would only expect to be able to observe
           | one of:
           | 
           | 1. Nothing updated yet, all columns have the previous value.
           | 
           | 2. `gps_coordinate` updated, with `postal_code` and `city`
           | still having the previous values.
           | 
           | 3. `gps_coordinate` and `postal_code` updated with `city`
           | still having the previous value.
           | 
           | 4. All fields updated.
           | 
           | But the ordering that aphyr proved is possible allows you to
           | see "impossible" states such as
           | 
           | 1. `postal_code` updated with `gps_coordinate` and `city`
           | still having the previous values.
           | 
           | 2. `city` updated with `gps_coordinate` and `postal_code`
           | still having the previous values.
           | 
           | Basically since these transactions happen in order and depend
           | on one another you would expect that you can only see the
           | "left to right" progression. But actually you can see some
           | subset of the transactions applied even if that isn't a
           | possible logical state of the database.
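            | 
            | (A sketch of those three sequential transactions in SQL;
            | the `locations` table and the values are hypothetical:)
            | 
            |     -- tx 1: the API updates the coordinate
            |     UPDATE locations SET gps_coordinate = '52.1,4.3' WHERE id = 1;
            |     -- tx 2: a background task derives the postal code
            |     UPDATE locations SET postal_code = '2595' WHERE id = 1;
            |     -- tx 3: another background task derives the city
            |     UPDATE locations SET city = 'The Hague' WHERE id = 1;
            | 
            | Each transaction commits before the next begins, yet a
            | reader may be handed a snapshot containing tx 2's write but
            | not tx 1's.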
        
       | mushufasa wrote:
       | > These phenomena occurred in every version tested, from 13.15 to
       | 17.4.
       | 
       | I was worried I had made the wrong move upgrading major versions,
       | but it looks like this is not that. This is not a regression, but
       | just a feature request or longstanding bug.
        
       | skywhopper wrote:
       | This is an unfortunate report in a lot of ways. First, the title
       | is incomplete. Second, there's no context as to the purpose of
       | the test and very little about the parameters of the test. It
       | makes no comparison to other PostgreSQL architectures except one
       | reference at the end to a standalone system. Third, it
       | characterizes the transaction isolation of this system as if it
       | were a failure (see comments in this thread assuming this is a
       | bug or a missing feature of Postgres). Finally, it never compares
       | the promises made by the product vendors to the reality. Does AWS
       | or Postgres promise perfect snapshot isolation?
       | 
       | I understand the mission of the Jepsen project but presenting
       | results in this format is misleading and will only sow confusion.
       | 
       | Transaction isolation involves a ton of tradeoffs, and the
       | tradeoffs chosen here may be fine for most use cases. The issues
       | can be easily avoided by doing any critical transactional work
       | against the primary read-write node only, which would be the only
       | typical way in which transactional work would be done against a
       | Postgres cluster of this sort.
        
         | Sesse__ wrote:
         | Postgres does indeed promise perfect snapshot isolation, and
         | Amazon does not (to the best of my knowledge) document that
         | their managed Postgres service weakens Postgres' promises.
        
       | billiam wrote:
       | New headline: AWS RDS is not CockroachDB or Spanner. And it's not
       | trying to be.
        
         | cnlwsu wrote:
         | Cockroach doesn't offer strict serializability. It has
         | serializability with some limits depending on clock drift. Also
         | CockroachDB does not provide linearizability over the entire
         | database.
        
         | senderista wrote:
         | However, Aurora DSQL is trying to compete with both CDB and
         | Spanner, and they explicitly promise snapshot isolation.
        
           | anentropic wrote:
           | But this test wasn't of Aurora DSQL
        
       | film42 wrote:
       | I think AWS will need to update their documentation to
       | communicate this. Will a snapshot isolation fix introduce a
       | performance regression in latency or throughput? Or, maybe they
       | stand by what they have as being strong enough. Either way,
       | they'll need to say something.
        
         | kevincox wrote:
         | I think the ideal solution from AWS would be fixing the bug and
         | actually providing the guarantees that the docs say that they
         | do.
        
           | film42 wrote:
           | I agree, but I have a feeling this isn't a small fix. Sounds
           | like someone picked a mechanism that seemed to be equivalent
           | but is not. Swapping that will require a lot of time and
           | testing.
        
             | mdaniel wrote:
             | > Swapping that will require a lot of time and testing.
             | 
             | Lucky them, there is an automated suite[1] to verify the
             | correct behavior :-D
             | 
             | 1: https://github.com/jepsen-io/rds
        
           | zaphirplane wrote:
            | Yet below your comment is a quote that this has been the
            | case since v13, and above it is a comment that there is no
            | mention in the docs.
            | 
            | Isn't using the words "bug" and "guarantee" throwing casual
            | readers off the mark?
        
           | slt2021 wrote:
           | there is no trivial fix for this without breaking
           | performance. roughly, there is no free lunch in distributed
           | systems, and AWS made a tradeoff to relax consistency
           | guarantees for that specific setup, and didn't really
           | advertise that
        
           | belter wrote:
            | It looks like a bug, but the problem is the documentation
            | does not detail what guarantees are offered in this
            | scenario; I would love it if somebody could point me to
            | where it does...
        
       | oblio wrote:
       | I wonder how Aurora fares on this?
        
         | RachelF wrote:
          | I wondered how Microsoft SQL Server fares, but it's not
          | tested in the long list of databases:
         | 
         | https://jepsen.io/analyses
        
           | __float wrote:
            | It may violate the SQL Server license? Microsoft apparently
            | have not paid for a Jepsen analysis (or perhaps don't want
            | it public :))
        
             | KronisLV wrote:
             | > Microsoft have not apparently paid for a Jepsen analysis
             | (or perhaps don't want it public :))
             | 
             | If I was some database vendor that sometimes plays fast and
             | loose (not saying Microsoft is, just an example) and my
             | product is good for 99.95% of use cases and the remainder
             | is exceedingly hard to fix, I'd probably be more likely to
             | pay for Jepsen _not_ to do an analysis, because hiring them
             | would result in people being more likely to leave an
             | otherwise sufficient product due to those faults being
             | brought to light.
        
               | mdaniel wrote:
                | And yet, this one was done without compensation, so it
                | seems the value of the report and the backing
                | investigation is not only monetary
        
         | VeejayRampay wrote:
          | Aurora doesn't offer PostgreSQL 17 for now, I think
        
           | phonon wrote:
           | "They occurred in every PostgreSQL version we tested, from
           | 13.15 (the oldest version which AWS supported) to 17.4 (the
           | newest)."
           | 
            | So it's unlikely v17 will make a difference.
        
         | nijave wrote:
         | If I'm understanding the issue correctly, it probably doesn't.
         | 
          | From what I understand, multi-AZ has some setup with multiple
          | semi-synchronous replicas where only 1 replica needs to
         | acknowledge the transaction.
         | 
          | Aurora doesn't use semi-synchronous replication but uses
         | clustered/shared storage with a different replication setup for
         | cache invalidation
        
       | password4321 wrote:
       | It would be great to get all the Amazon RDS flavors Jepsen'd.
        
         | aphyr wrote:
         | I have actually been working on this (very slowly, in
         | occasional nights and weekends!) Peter Alvaro and I reported on
         | a safety issue in RDS for MySQL here too:
         | https://jepsen.io/analyses/mysql-8.0.34#fractured-read-like-...
        
           | password4321 wrote:
           | There is a universe where cloud providers announce each new
           | database offering by commissioning a Jepsen test and
           | iterating on the results until every issue has been resolved
           | or at least documented.
           | 
           | Unfortunately reliability is not that high on the priority
           | list here. Keep up the good work!
        
       | wb14123 wrote:
        | Surprised to see Amazon RDS doesn't pass such a simple test. Nicely
       | done!
        
         | belter wrote:
          | This is not a single RDS instance or one with a standby in
          | another AZ. It's a more specific and advanced cluster setup.
        
           | wb14123 wrote:
            | Yeah, yet this is a happy path with no failure injected.
            | Still surprised to see this kind of scenario was not tested
            | inside AWS.
        
       | cswilliams wrote:
       | Interesting. At a previous company, when we changed the pg_dump
       | command in a backup script to start using parallel workers (-j
        | flag) we started to see occasional errors that suggested
       | inconsistency when restoring the backups (duplicate key errors
       | and fk constraint errors). At the time, I tried reporting the
       | issue to both AWS and on the Postgres mailing list but never got
       | anywhere since I could not easily reproduce it. We eventually
       | gave up and went back to single threaded dumps. I wonder if this
       | issue is related to that behavior we were seeing.
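        | 
        | (For reference, a parallel dump requires the directory format;
        | the invocation was something like:
        | 
        |     pg_dump -j 4 -F d -f /backups/mydb mydb
        | 
        | where each worker opens its own connection and pg_dump relies
        | on a synchronized snapshot to keep the workers consistent.)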
        
         | belter wrote:
          | Was it a single instance, one instance with a standby in
          | another AZ, or a multi-AZ cluster as tested here?
        
           | cswilliams wrote:
           | We saw it when we ran the pg_dump off a standby instance (or
           | a "replica" to use RDS terminology). Our primary was a multi-
           | az instance. So not exactly what they tested here I guess,
           | but it makes me wonder what changes, if any, they've made to
           | postgres under the hood.
        
       | luhn wrote:
       | It's not mentioned in the headline and not made super clear in
       | the article: This is specific to multi-AZ clusters, which is a
        | relatively new feature of RDS and differs from the multi-AZ
        | instances that most will be familiar with. (Clear as mud.)
       | 
       | Multi-AZ instances is a long-standing feature of RDS where the
       | primary DB is synchronously replicated to a secondary DB in
       | another AZ. On failure of the primary, RDS fails over to the
       | secondary.
       | 
       | Multi-AZ clusters has _two_ secondaries, and transactions are
       | synchronously replicated to at least one of them. This is more
       | robust than multi-AZ instances if a secondary fails or is
       | degraded. It also allows read-only access to the secondaries.
       | 
       | Multi-AZ clusters no doubt have more "magic" under the hood, as
        | it's not a vanilla Postgres feature as far as I'm aware. I imagine
       | this is why it's failing the Jepsen test.
        
         | ashu1461 wrote:
          | I have one question.
         | 
          | So if the snapshot violation is happening inside Multi-AZ
          | clusters, can it happen with a single-region, multiple-read-
          | replica kind of setup as well? But it might be more easily
          | observable in Multi-AZ setups because the lag is higher?
        
           | luhn wrote:
           | A synchronous replica via WAL shipping is a well-worn feature
           | of Postgres. I'd expect RDS to be using that feature behind
           | the scenes and would be extremely surprised if that has
           | consistency bugs.
           | 
           | Two replicas in a "semi synchronous" configuration, as AWS
           | calls it, is to my knowledge not available in base Postgres.
           | AWS must be using some bespoke replication strategy, which
           | would have different bugs than synchronous replication and is
           | less battle-tested.
           | 
           | But as nobody except AWS knows the implementation details of
           | RDS, this is all idle speculation that doesn't mean much.
        
             | wb14123 wrote:
             | This kind of replication can be configured in vanilla
             | Postgres with something like ANY 3 (s1, s2, s3, s4) in
             | synchronous_standby_names? Doc:
             | https://www.postgresql.org/docs/current/runtime-config-
             | repli...
        
               | ctapobep wrote:
                | I don't think it's possible with the ANY setup. All you get
               | is that some replicas are more outdated than others. But
               | they won't return 2 conflicting states when ReplicaA says
               | tx1 wrote (but not tx2), while ReplicaB says tx2 wrote
               | (but not tx1). Which is what Long Fork and Parallel
               | Snapshot are about.
               | 
                | So the Amazon multi-AZ cluster seems to replicate
                | changes out of order?
        
               | mattashii wrote:
               | Kinda. I think it's "just" PostgreSQL behaviour that's to
               | blame here: On replicas, transaction commit visibility
               | order is determined by the order of WAL records; on the
               | primary it's based on when the backend that wrote the
               | transaction notices that its transaction is sufficiently
               | persisted.
               | 
               | See also my comment
               | https://news.ycombinator.com/item?id=43843790 elsewhere
               | in this thread
        
         | ants_a wrote:
         | Interesting why this magic would be needed. Vanilla Postgres
         | does support quorum commit which can do this. You can also set
         | up the equivalent multi-AZ cluster with Patroni, and (modulo
         | bugs) it does the necessary coordination to make sure to
         | promote primaries in a way that does not lose transactions or
          | make visible a transaction that is not durable.
         | 
         | There still is a Postgres deficiency that makes something
         | similar to this pattern possible. Non-replicated transactions
         | where the client goes away mid-commit become visible
         | immediately. So in the example, if T1 happens on a partitioned
         | leader, disconnects during commit, T2 also happens on a
         | partitioned node, and T3 and T4 happen later on a new leader,
          | you would also see the same result. However, this does not jibe
         | with the statement that fault injection was not done in this
         | test.
         | 
          | Edit: I did not notice the post explaining that this pattern
          | can be attributed to inconsistent commit order on the replica
          | and the primary. Kind of embarrassing, given I've done a talk
          | proposing how to fix that.
        
           | sontek wrote:
           | Link the talk video
        
             | ants_a wrote:
             | https://www.youtube.com/watch?v=vz-dhwSpjOw
        
         | x0x0 wrote:
         | it's the 2nd sentence in the article:
         | 
         | > We show that Amazon RDS for PostgreSQL multi-AZ clusters
         | violate Snapshot Isolation
         | 
         | you kind of have to expect people to read
        
           | evil-olive wrote:
           | I think it's still an important clarification, because for
           | years you've had a choice in RDS (classic RDS, not Aurora)
           | between "single-AZ" and "multi-AZ" instances, with the
           | general rule of thumb that production workloads should always
           | be multi-AZ.
           | 
           | however, "multi-AZ" has been made ambiguous, because there
           | are now multi-AZ _instances_ and multi-AZ _clusters_.
           | 
           | ...and your multi-AZ  "instance", despite being not a multi-
           | AZ "cluster" from AWS's perspective, is still two nodes that
           | are "clustered" together and treated as one logical database
           | from the client connection perspective.
           | 
           | see [0] and scroll down to the "availability and durability"
           | screenshot for an example.
           | 
           | 0: https://aws.amazon.com/blogs/aws/amazon-rds-multi-az-db-
           | clus...
        
       | badmonster wrote:
       | What safety or application-level bugs could arise if developers
       | assume Snapshot Isolation but Amazon RDS for PostgreSQL is
       | actually providing only Parallel Snapshot Isolation, especially
       | in multi-AZ configurations using the read replica endpoint?
        
         | Elucalidavah wrote:
         | Consider a "git push"-like flow: begin a transaction, read the
          | current state, check that it matches the expected one, write the
         | new state, commit (with a new state hash). In some unfortunate
         | situations, you'll have a commit hash that doesn't match any
         | valid state.
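          | 
          | (That flow, sketched in SQL; the `refs` table and values are
          | hypothetical:)
          | 
          |     BEGIN;
          |     SELECT state_hash FROM refs WHERE name = 'main';
          |     -- client verifies this matches the expected parent
          |     -- state, then writes the new state:
          |     UPDATE refs SET state_hash = 'h2' WHERE name = 'main';
          |     COMMIT;
          | 
          | The danger is that the state you read and hash may come from
          | a forked snapshot that no single serial order of commits ever
          | produced.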
         | 
         | And the mere fact that it's hard to reason about these things
         | means that it's hard to avoid problems. Hence, the easiest
         | solution is likely "it may be possible to recover Snapshot
         | Isolation by only using the writer endpoint", for anything
          | where a write is in any way conditional on a read.
         | 
         | Although I'm surprised the "only using the writer endpoint"
         | method wasn't tested, especially in availability loss
         | situations.
        
         | ctapobep wrote:
         | Consider this: you leave a comment under a post. The user who
         | posts first deserves a "first commenter badge". Now:
         | 
         | - User1 comments
         | 
         | - User2 comments
         | 
         | - User1 checks (in a separate tx) that there's only 1 comment,
         | so User1 gets the badge
         | 
         | - User2 checks the same (in a separate tx) and also sees only 1
         | comment (his), and also receives the badge.
         | 
         | With Snapshot isolation this isn't possible. At least one of
         | the checks made in a separate tx would see 2 comments.
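          | 
          | (The check, sketched in SQL; table and column names are
          | hypothetical:)
          | 
          |     BEGIN;
          |     SELECT count(*) FROM comments WHERE post_id = 42;
          |     -- if the count is 1 and the comment is ours,
          |     -- award the badge:
          |     INSERT INTO badges (user_id, kind)
          |       VALUES (1, 'first_commenter');
          |     COMMIT;
          | 
          | Under a long fork, User1's and User2's transactions can each
          | read a snapshot containing only their own comment, so both
          | inserts succeed.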
         | 
         | The original article on the Parallel Snapshot is a good read:
         | https://scispace.com/pdf/transactional-storage-for-geo-repli...
        
       | gitroom wrote:
       | honestly this made me side-eye aws docs hard, i always think
       | snapshot isolation just means what it says. good catch
        
       | baq wrote:
       | > This work was performed independently by Jepsen, without
       | compensation
       | 
        | not what an RDBMS stakeholder wants to wake up to on the best of
       | days. I'd imagine there were a couple emails expressing concern
       | internally.
       | 
       | hats off to aphyr as usual.
        
         | tasuki wrote:
         | What's a "RDBMS stakeholder" ?
         | 
         | (Hats off to aphyr for sure!)
        
           | bobnamob wrote:
            | The three layers of middle management between engineers and
           | whichever director owns this particular incarnation of RDS
        
           | baq wrote:
           | a stakeholder is anyone who has any business at all with the
           | system - customer, engineer, manager, etc.
           | 
           | RDBMS -
           | https://en.wikipedia.org/wiki/Relational_database#RDBMS
        
       | hliyan wrote:
       | I wish more writing in the software world was done this way:
       | 
       | "Amazon RDS for PostgreSQL is an Amazon Web Services (AWS)
       | service which provides managed instances of the PostgreSQL
       | database. We show that Amazon RDS for PostgreSQL multi-AZ
       | clusters violate Snapshot Isolation, the strongest consistency
       | model supported across all endpoints. Healthy clusters
       | occasionally allow..."
       | 
       | Direct, to-the-point, unembellished and analogous to how other
       | STEM disciplines share findings. There was a time I liked reading
       | cleverly written blog posts that use memes to explain things, but
       | now I long for the plain and simple.
        
         | augustl wrote:
         | Jepsen is awesome, on so many levels!
        
           | fuy wrote:
           | isolation levels, that is!
        
         | Twirrim wrote:
          | I'm so past wanting to read meme-laden blog posts, especially
          | when all too often it's just stretching a paragraph of content.
         | Security vulnerability stuff is probably the worst at it these
         | days.
        
         | sgarland wrote:
         | A company I was at had an internal blog where anyone could
         | write an article, and others could comment on it. Zero
         | requirement to do so, and it in no way factored into your
         | rating. I think it was the result of a hackathon one year.
         | 
         | Anyway, I really enjoyed it, because I like technical writing.
         | I found that if I wrote a deeply technical post, I'd get very
         | few likes and comments - in fact, I even had a Staff Eng tell
         | me I should more narrowly target the audience (you could tag
         | groups as an intended audience; they'd only see the
         | notification if they went to the blog, so it wasn't intrusive)
         | because most of engineering had no idea what I was talking
         | about.
         | 
         | Then, I made a post about Kubecost (disclaimer: this was in its
         | very early days, long before being acquired by IBM; I have no
         | idea how it performs now, and this should not dissuade you from
         | trying it if you want to) and how in my tests with it, its
         | recommendations were a poor fit, and would have resulted in
         | either minimal savings, or caused container performance issues.
         | The post was still fairly technical, examining CPU throttling,
         | discussing cgroups, etc. but the key difference was memes.
         | People LOVED it.
         | 
         | I later repeated this experiment with something even more
         | technical; IIRC it involved writing some tiny Python external
         | library in C and accessing it with ctypes, and comparing stack
         | vs. heap allocations. Except, I also included memes. Same
         | result, slightly lessened from the FinOps one, but still far
         | more likes and comments than I would expect for something so
         | dry and utterly inapplicable to most people's day-to-day job.
         | 
         | Like you, I find this trend upsetting, but I also don't know
         | how else to avoid it if you're trying to reach a broader
          | audience. Jepsen, of course, is not, and I applaud them for
         | their rigorous approach and pure writing.
        
           | jbaiter wrote:
           | It's funny, because I remember the early days of Jepsen, and
           | it relied heavily on memes (the whole name is based on "call
           | me maybe"/carly rae jepsen) and aphyr wasn't (and still
           | isn't) shy about his colorful real life personality :-)
           | 
           | See for example https://aphyr.com/posts/282-call-me-maybe-
           | postgres, which makes heavy uses of memes.
        
           | Mawr wrote:
           | The reason for that outcome was likely two-fold:
           | 
           | 1. If your memes were analogies to the dry technical
           | concepts, then the simple, easy to digest analogies were the
           | key here, not the memes themselves.
           | 
           | 2. Pictures _are_ worth a thousand words. The more visual you
           | can make your writing the better. Even something as simple as
           | using bullet points instead of dense paragraphs of text works
           | wonders. But the key is to use graphs and illustrations to
           | explain and show concepts wherever possible.
        
         | cwmma wrote:
         | I was just thinking about how much I miss the old old Jepsen,
         | the same matter of fact and directness but full of memes, see
         | for example, the old redis post
         | https://aphyr.com/posts/283-call-me-maybe-redis.
        
         | n8m8 wrote:
         | Amazon is known for a healthy culture around technical writing
          | that I can attest to. This comment reflects my own thoughts,
          | not those of my company. Here's a public article discussing it.
         | https://quartr.com/insights/business-philosophy/amazon-s-wri...
        
       | havkom wrote:
       | Good investigation!
       | 
       | Software developers nowadays barely know about transactions, and
       | definitely not about different transaction models (in my
       | experience). I have even encountered "senior developers" (who are
       | actually so called "CRUD developers"), who are clueless about
       | database transactions.. In reality, transactions and transaction
       | models matter a lot to performance and error free code (at least
       | when you have volumes of traffic and your software solves
       | something non-trivial).
       | 
       | For example: After a lot of analysis, I switched from SQL Server
       | standard Read Committed to Read Committed Snapshot Isolation in a
       | large project - the users could not be happier -> a lot of
       | locking contention has disappeared. No software engineer in that
       | project had any clue of transaction models or locks before I
       | taught them some basics (even though they had used transactions
        | extensively in that project).
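        | 
        | (For reference, the switch itself is one statement, though it
        | needs a moment of exclusive access to the database to take
        | effect; "MyDb" is a placeholder:)
        | 
        |     ALTER DATABASE MyDb SET READ_COMMITTED_SNAPSHOT ON;
        | 
        | After that, plain READ COMMITTED reads come from row versions
        | (kept in tempdb) instead of taking shared locks, which is where
        | the locking contention went.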
        
         | ljm wrote:
         | I've noticed the lack of transaction awareness mostly in
         | serverless/edge contexts where the backend architecture (if you
         | can even call it that) is driven exclusively by the needs of
         | the client. For instance, database queries are modelled as
         | react hooks or sequential API calls.
         | 
         | I've seen this work out _terribly_ at certain points in my
         | career.
        
         | fuy wrote:
          | Had a similar situation a few years ago - switched a (now)
          | billion-revenue product from Read Committed to Read Committed
          | Snapshot with huge improvements in performance. One thing to
          | be aware of when doing this - it will break all code that
          | relies on blocking reads (e.g. select with exists). These need
          | to be rewritten using explicit locks or some other methods.
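          | 
          | (e.g., a guard that relied on a blocking read can be pinned
          | down with explicit lock hints; a sketch, table and names
          | hypothetical:)
          | 
          |     BEGIN TRAN;
          |     IF NOT EXISTS (SELECT 1 FROM jobs WITH (UPDLOCK, HOLDLOCK)
          |                    WHERE name = 'nightly')
          |         INSERT INTO jobs (name) VALUES ('nightly');
          |     COMMIT;
          | 
          | Under RCSI the plain EXISTS check reads a row version instead
          | of blocking on a concurrent writer's lock, so the old race
          | protection silently disappears.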
        
         | baq wrote:
         | My recommendation for juniors stands unchanged for a decade
         | now: read a book about SQL databases over a weekend and a book
         | about the database your current work project is using over the
         | next weekend. Chances are you are now the database expert on
         | the project.
        
         | jacobsenscott wrote:
         | Soon most software devs will just be transcribing LLM trash to
          | code with no concept of what's actually happening (it's actually
         | required at shopify now - MS is bragging 1/3rd of their
         | software is written this way), and no new engineers are coming
         | up because why invest the time to learn if there won't be any
         | engineering jobs left?
        
           | whazor wrote:
           | I think that this is really the duality of LLMs. I can ask it
            | to explain different database transaction models and it will
            | perfectly explain to me how each works, which one to pick,
            | and how to apply it.
           | 
            | But code generated by an LLM will likely also have bugs that
           | could be fixed with transactions.
        
         | shivasaxena wrote:
         | This isn't confined just to senior developers. I have even
         | encountered system architects who were clueless about Isolation
         | levels. Some even confused "Consistency" in ACID with the
         | "Consistency" in CAP.
         | 
          | Makes me sad, since I work mostly in retail and encounter
          | systems that are infested with race conditions and similar
          | errors: things where these isolation levels would be of great
         | help.
         | 
          | However, it's mostly engineers at startups; I have a very
          | high opinion of typical Oracle/MSSQL developers at BigCos,
          | who at least have their fundamentals right.
        
           | icedchai wrote:
            | In 25+ years at various companies, I only recall one
           | interview where isolation levels were even discussed. Almost
           | nobody cares until it's a problem.
        
             | bdangubic wrote:
              | we must have had entirely different careers - same in
              | years and 180 degrees opposite: these were absolute core
              | (and disqualifying) questions at every interview, no
              | exceptions.
        
               | icedchai wrote:
               | Possibly. Most of my career has been at startups or
               | smaller companies where database fundamentals were
               | severely lacking.
        
       | kchoudhu wrote:
       | I've suspected that there are consistency issues on RDS for a
       | while now: if you push large quantities of data (e.g. 1MM+ rows)
       | into a database quickly and then try to read the same data out on
       | another connection, you'll periodically get null return sets.
       | 
       | We've worked around it by not touching the hot stove, but it's
       | kind of worrying that there are consistency issues with it.
        
       ___________________________________________________________________
       (page generated 2025-04-30 23:00 UTC)