[HN Gopher] A race condition in Aurora RDS
       ___________________________________________________________________
        
       A race condition in Aurora RDS
        
       Author : theanomaly
       Score  : 237 points
       Date   : 2025-11-14 18:20 UTC (1 day ago)
        
 (HTM) web link (hightouch.com)
 (TXT) w3m dump (hightouch.com)
        
       | redwood wrote:
        | A good reminder that building a mental model of read
        | replicas as a simple way to scale is a slippery slope. At
        | the end of the day you're scaling only one specific part of
        | your system, with certain consistency dynamics that are
        | difficult to reason about.
        
         | terminalshort wrote:
         | Works fine for workloads like:
         | 
         | 1. I need to grab some rows from a table
         | 
         | 2. Eventual consistency is good enough
         | 
         | And that's a lot of workloads.
        
           | candiddevmike wrote:
            | As a user, I've come to realize the situations where I
            | think eventual consistency (or delayed processing) is
            | good enough aren't the same ones the folks developing
            | most products have in mind. Nothing annoys me more than
            | stuff not showing up immediately or having to manually
            | refresh.
        
             | darth_avocado wrote:
             | Sometimes users want everything to show up immediately, but
             | not pay extra for the feature. Everything real time is
             | expensive. Eventual consistency is a good thing for most
             | systems.
        
             | terminalshort wrote:
             | For a workload where you need true read after write you can
             | just send those reads to the writer. But even if you don't
             | there are plenty of workarounds here. You can send a
             | success response to the user when the transaction commits
             | to the writer and update the UI on response. The only case
             | where this will fail is if the user manually reloads the
             | page within the replication lag window and the request goes
              | to the reader. This should be exceedingly rare in a
              | single-region cluster, and maybe a little less rare in
              | a multi-region setup, but still pretty rare. I almost
              | never see >
             | 1s replication lag between regions in my Aurora clusters.
             | There are certainly DB workloads where this will not be
             | true, but if you are in a high replication lag cluster, you
             | just don't want to use that for this type of UI dependency
             | in the first place.
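              | 
              | Roughly, the routing looks like this (a sketch; the
              | endpoint names are made up and psycopg2 is just an
              | example driver):
              | 
              |   import psycopg2
              | 
              |   # Aurora exposes separate writer and reader endpoints
              |   WRITER = "host=app.cluster-x.rds.amazonaws.com dbname=app"
              |   READER = "host=app.cluster-ro-x.rds.amazonaws.com dbname=app"
              | 
              |   def fetch(sql, params=(), read_after_write=False):
              |       # strict read-after-write goes to the writer;
              |       # everything else tolerates replication lag
              |       dsn = WRITER if read_after_write else READER
              |       with psycopg2.connect(dsn) as conn:
              |           with conn.cursor() as cur:
              |               cur.execute(sql, params)
              |               return cur.fetchall()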
        
             | nilamo wrote:
              | I think the key here is just proper notifications. Yes,
              | it's eventually consistent, but showing a "processing"
              | or "update in progress" state is a huge improvement
              | over showing the user old data.
        
           | redwood wrote:
            | Future you, or a future team member, may struggle to
            | reason about that.
        
           | morshu9001 wrote:
            | That's read-only. RW workloads usually don't tolerate
            | eventual consistency on the thing they're writing.
        
             | terminalshort wrote:
             | Yeah, if you have a mix of reads and writes in a workflow,
             | you gotta hit the writer node. But a lot of times an
             | endpoint is only reading data from a particular DB.
        
         | nijave wrote:
         | You can hit the same problems horizontally scaling compute. One
         | instance reads from the DB, a request hits a different instance
         | which updates the DB. The original instance writes to the DB
         | and overwrites the changes or makes decisions based on stale
         | data.
         | 
         | More broadly a distributed system problem
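          | 
          | The classic guard is optimistic locking: carry a version
          | column and make the write conditional on it. A sketch
          | (schema and names are made up):
          | 
          |   import psycopg2
          | 
          |   def save(conn, account_id, new_balance, seen_version):
          |       with conn.cursor() as cur:
          |           cur.execute(
          |               "UPDATE accounts SET balance = %s,"
          |               " version = version + 1"
          |               " WHERE id = %s AND version = %s",
          |               (new_balance, account_id, seen_version))
          |           if cur.rowcount == 0:
          |               # someone else wrote since we read;
          |               # reject the stale update and let the
          |               # caller re-read and retry
          |               raise RuntimeError("stale read, retry")
          |       conn.commit()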
        
       | gtowey wrote:
       | This article seems to indicate that manually triggered failovers
       | will always fail if your application tries to maintain its normal
       | write traffic during that process.
       | 
       | Not that I'm discounting the author's experience, but something
       | doesn't quite add up:
       | 
       | - How is it possible that other users of Aurora aren't
       | experiencing this issue basically all the time? How could AWS not
       | know it exists?
       | 
        | - If they know, how is this not an urgent P0 issue for AWS?
        | It seems like one of the most basic usability features is
        | 100% broken.
       | 
        | - Is there something more nuanced to the failure case here,
        | such as whether it depends on in-progress transactions? I
        | can see how
       | maybe the failover is waiting for in-flight transactions to close
       | and then maybe hits a timeout where it proceeds with the other
       | part of the failover by accident. That could explain why it
       | doesn't seem like the issue is more widespread.
        
         | maherbeg wrote:
          | Yeah, I agree, this seems like a pretty critical feature of
          | the Aurora product itself. We saw similar behavior recently
          | with a connection pooler in between, which suggests
          | something is wrong with how they propagate DNS changes
          | during the failover. wtf aws
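          | 
          | One cheap sanity check after a failover (a sketch; on
          | stock Postgres pg_is_in_recovery() is false on the primary
          | and true on a replica, and my understanding is Aurora PG
          | behaves the same):
          | 
          |   import psycopg2
          | 
          |   def is_writer(dsn):
          |       # true only if this endpoint currently resolves
          |       # to the primary instance
          |       with psycopg2.connect(dsn) as conn:
          |           with conn.cursor() as cur:
          |               cur.execute("SELECT pg_is_in_recovery()")
          |               return not cur.fetchone()[0]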
        
           | CaptainKanuk wrote:
           | Whenever we have to do any type of AWS Aurora or RDS cluster
           | modification in prod we always have the entire emergency
           | response crew standing by right outside the door.
           | 
           | Their docs are not good and things frequently don't behave
           | how you expect them to.
        
           | ekropotin wrote:
           | Oh, well, it's always DNS!
        
         | dboreham wrote:
         | Although the article has an SEO-optimized vibe, I think it's
         | reasonable to take it as true until refuted. My rule of thumb
         | is that any rarely executed, very tricky operation (e.g.
         | database writer fail over) is likely to not work because there
         | are too many variables in play and way too few opportunities to
         | find and fix bugs. So the overall story sounds very plausible
         | to me. It has a feel of: it doesn't work under continuous heavy
         | write load, in combination with some set of hardware
         | performance parameters that plays badly with some arbitrary
         | time out. Note that the system didn't actually fail. It just
         | didn't process the fail over operation. It reverted to the
         | original configuration and afaics preserved data.
        
         | theanomaly wrote:
         | I'm surprised this hasn't come up more often too. When we
         | worked with AWS on this, they confirmed there was nothing
         | unique about our traffic pattern that would trigger this issue.
         | We also didn't run into this race condition in any of our other
         | regions running similar workloads. What's particularly
         | concerning is that this seems to be a fundamental flaw in
         | Aurora's failover mechanism that could theoretically affect
         | anyone doing manual failover.
        
         | twisteriffic wrote:
         | > How is it possible that other users of Aurora aren't
         | experiencing this issue basically all the time? How could AWS
         | not know it exists?
         | 
         | If it's anything like how Azure handles this kind of issue,
         | it's likely "lots of people have experienced it, a restart
         | fixes it so no one cares that much, few have any idea how to
         | figure out a root cause on their own, and the process to find a
         | root cause with the vendor is so painful that no one ever sees
         | it through"
        
           | perching_aix wrote:
           | An experience not exclusive to cloud vendors :) Even better
           | when the vendor throws their hands up cause the issue is not
           | reliably repro'able.
           | 
            | That was when I scripted up a test that ran hundreds of
            | times a day in a lower environment, attempting a repro.
            | As they say, at scale even insignificant issues become
            | significant. I don't remember clearly, but I think there
            | was a 5-10% chance that the issue triggered.
           | 
           | At least confirming the fix, which we did eventually receive,
           | was mostly a breeze. Had to provide an inordinate amount of
           | captures, logs, and data to get there though. Was quite the
           | grueling few weeks, especially all the office politics laden
           | calls.
        
             | pixl97 wrote:
             | I've had customers with load related bugs for years simply
             | because they'd reboot when the problem happened. When
             | dealing with the F100 it seems there is a rather limited
             | number of people in these organizations that can
             | troubleshoot complex issues, that or they lock them away
             | out of sight.
        
               | perching_aix wrote:
               | It is a tough bargain to be fair, and it is seen in other
               | places too. From developers copying out their stuff from
               | their local git repo, recloning from remote, then pasting
               | their stuff back, all the way to phone repair just
               | meaning "here's a new device, we synced all your data
               | across for you", it's fairly hard to argue with the
               | economic factors and the effectiveness of this approach
               | at play.
               | 
               | With all the enterprise solutions being distributed,
               | loosely coupled, self-healing, redundant, and fault-
               | tolerant, issues like this essentially just slot in
               | perfectly. Compound this with man-hours (especially
               | expert ones) being a lot harder to justify for any one
               | particular bump in tail latency, and the equation is just
               | really not there for all this.
               | 
               | What gets us specifically to look into things is either
               | the issue being operationally gnarly (e.g. frequent,
               | impacting, or both), or management being swayed enough by
               | principled thinking (or at least pretending to be). I'd
               | imagine it's the same elsewhere. The latter would mostly
               | happen if fixing a given thing becomes an office
               | political concern, or a corporate reputation one. You
               | might wonder if those individual issues ever snowballed
               | into a big one, but turns out human nature takes care of
               | that just "sufficiently enough" before it would manifest
               | "too severely". [0]
               | 
               | Otherwise, you're looking at fixing / RCA'ing / working
               | around someone else's product defect on their behalf, and
               | giving your engineers a "fun challenge". Fun doesn't pay
               | the bills, and we rarely saw much in return from the
               | vendor in exchange for our research. I'd love to
               | entertain the idea that maybe behind closed doors the
               | negotiations went a little better because of these, but
               | for various reasons, I really doubt so in hindsight.
               | 
               | [0] as delightfully subjective as those get of course
        
               | hobs wrote:
               | If I had a nickel for every time I had to explain that
               | rebooting a database server is usually the wrong choice I
               | would have quite a fortune.
        
           | sally_glance wrote:
           | Theoretically you're supposed to assign lower prio to issues
           | with known workarounds but then there should also be
           | reporting for product management (which assigns weight by age
           | of first occurrence and total count of similar issues).
           | 
           | Amazon is mature enough for processes to reflect this, so my
           | guess for why something like this could slip through is
           | either too many new feature requests or many more critical
           | issues to resolve.
        
           | pwarner wrote:
           | Azure yes, I'd expect this and the restart would take many
           | minutes. Been there done that.
           | 
           | AWS this is surprising
        
         | nijave wrote:
          | fwiw we haven't seen issues doing manual failovers for
          | maintenance using the same/similar procedure described in
          | the article. I imagine there is something more nuanced
          | here, and it's hard to draw too many conclusions without a
          | lot more detail from AWS.
        
         | aetherson wrote:
         | My experience with AWS is that they are extremely, extremely
         | parsimonious about any information they give out. It is near-
         | impossible to get them to give you any details about what is
         | happening beyond the level of their API. So my gut hunch is
         | that they think that there's something very rare about this
         | happening, but they refuse to give the article writer the
         | information that might or might not help them avoid the bug.
        
           | everfrustrated wrote:
           | If you pay for the highest level of support you will get
           | extremely good support. But it comes with signing a NDA so
           | you're not going to read about anything coming out of it on a
           | blog.
           | 
           | I've had AWS engineers confirm very detailed and specific
           | technical implementation details many many times. But these
           | were at companies that happily spent over a $1M/year with
           | AWS.
        
             | qaq wrote:
              | Nah, if your monthly spend is really significant then
              | you will get good support, and issues you care about
              | will get prioritized. Going from a startup with a
              | $50K/month spend to a large company with untold
              | millions per month, the experience is night and day.
              | We have dev managers and engineers from key AWS teams
              | present in meetings when need be, we get issues we
              | raise prioritized and added to dev roadmaps, etc.
        
             | aetherson wrote:
             | I was at a company that spent over $90M a year with AWS and
             | we got defensive, limited comms.
        
         | Hovertruck wrote:
         | Agreed, we've been running multiple aurora clusters in
         | production for years now and have not encountered this issue
         | with failovers.
        
           | dalyons wrote:
           | Same. There's something missing here.
        
         | kobalsky wrote:
         | > - How is it possible that other users of Aurora aren't
         | experiencing this issue basically all the time? How could AWS
         | not know it exists?
         | 
          | I know that there is no comparison in the user base, but a
          | few years ago I ran into a massive Python + MySQL bug that:
          | 
          | 1. made SELECT ... FOR UPDATE fail silently
          | 
          | 2. aborted the transaction and set the connection into
          | autocommit mode
          | 
          | This is basically a worst-case scenario in a transactional
          | system.
          | 
          | I was basically screaming like a madman in the corner but
          | no one seemed to care.
         | 
         | Someone contacted me months later telling me that they
         | experienced the same problem with "interesting" consequences in
         | their system.
         | 
         | The bug was eventually fixed but at that point I wasn't
         | tracking it anymore, I provided a patch when I created the
         | issue and moved on.
         | 
         | https://stackoverflow.com/questions/945482/why-doesnt-anyone...
        
           | sroussey wrote:
           | Converting a connection to autocommit upon error. Yikes!!
        
             | evanelias wrote:
             | If I'm reading this correctly, it sounds like the
             | connection was already using autocommit by default? In that
             | situation, if you initiate a transaction, and then it gets
             | rolled back, you're back in autocommit unless/until you
             | initiate another transaction.
             | 
             | If so, that part is all totally normal and expected. It's
             | just that due to a bug in the Python client library (16
             | years ago), the rollback was happening silently because the
             | error was not surfaced properly by the client library.
        
               | o11c wrote:
               | I would argue that it's a bug for it even to be
               | _possible_ to autocommit.
        
               | evanelias wrote:
               | What do you mean? Autocommit mode is the default mode in
               | Postgres and MS SQL Server as well. This is by no means a
               | MySQL-specific behavior!
               | 
               | When you're in autocommit mode, BEGIN starts an explicit
               | transaction, but after that transaction (either COMMIT or
               | ROLLBACK), you return to autocommit mode.
               | 
               | The situation being described upthread is a case where a
               | transaction was started, and then rolled back by the
               | server due to deadlock error. So it's totally normal that
               | you're back in autocommit mode after the rollback. Most
               | DBMS handle this identically.
               | 
               | The bug described was entirely in the client library
               | failing to surface the deadlock error. There's simply no
               | autocommit-related bug as it was described.
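                | 
                | In driver terms (a psycopg2 sketch; the table name
                | is made up):
                | 
                |   import psycopg2
                | 
                |   conn = psycopg2.connect("dbname=test")
                |   conn.autocommit = True  # server-side autocommit
                |   cur = conn.cursor()
                | 
                |   cur.execute("UPDATE t SET x = 1")  # commits at once
                |   cur.execute("BEGIN")   # explicit transaction opens
                |   cur.execute("UPDATE t SET x = 2")
                |   cur.execute("ROLLBACK")  # x = 2 is discarded...
                |   # ...and we are back in autocommit mode, so this
                |   # next statement commits on its own again:
                |   cur.execute("UPDATE t SET x = 3")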
        
               | o11c wrote:
               | Yes, and most DBMS's are full of historical mistakes.
               | 
               | In a sane world, statements outside `BEGIN` would be an
               | unconditional error.
        
               | grogers wrote:
               | Autocommit mode is pretty handy for ad-hoc queries at
               | least. You wouldn't want to have to remember to close the
               | transaction since keeping a transaction open is often
               | really bad for the DB
        
               | evanelias wrote:
               | Lack of autocommit would be bad for performance at scale,
               | since it would add latency to every single query. And the
               | MVCC implications are non-trivial, especially for
               | interactive queries (human taking their time typing)
               | while using REPEATABLE READ isolation or stronger...
               | every interactive query would effectively disrupt
               | purge/vacuum until the user commits. And as the sibling
               | comment noted, that would be quite harmful if the user
               | completely forgets to commit, which is common.
               | 
               | In any case, that's a subjective opinion on database
               | design, not a bug. Anyway it's fairly tangential to the
               | client library bug described up-thread.
        
         | benmmurphy wrote:
          | It could be that most people pause writes, because it's
          | going to create errors if you try to execute a write
          | against an instance that refuses to accept writes, and for
          | some people those errors might not be recoverable. So they
          | just have some option in their application that puts it
          | into maintenance mode, where writes are hard-rejected at
          | the application layer.
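          | 
          | i.e. something like this in the app (a sketch, names made
          | up):
          | 
          |   MAINTENANCE = False  # flipped on before the failover
          | 
          |   class WritesPaused(Exception):
          |       pass
          | 
          |   def execute_write(conn, sql, params=()):
          |       if MAINTENANCE:
          |           # hard-reject instead of letting the write hit
          |           # a demoted, read-only instance
          |           raise WritesPaused("failover in progress")
          |       with conn.cursor() as cur:
          |           cur.execute(sql, params)
          |       conn.commit()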
        
         | biggoodwolf wrote:
          | I recall seeing this also happen in CosmosDB, with both
          | auto and manual failovers.
        
         | nrhrjrjrjtntbt wrote:
          | P0 if it happens to everyone, right? Like the USE1 outage
          | recently. If it's 0.001% of customers (enough to get an HN
          | story) it may not be that high. Maybe this customer is on
          | a migration or upgrade path under the hood, or just on a
          | bad unit in the rack.
        
         | belter wrote:
         | The article is low quality. It does not mention which Aurora
         | PostgreSQL version was involved, and it provides no real detail
         | about how the staging environment differed from production,
         | only saying that staging "didn't reproduce the exact
         | conditions," which is not actionable.
         | 
         | This AWS documentation section:
         | https://docs.aws.amazon.com/AmazonRDS/latest/AuroraPostgreSQ...
         | 
         | "Amazon Aurora PostgreSQL updates": under Aurora PostgreSQL
         | 17.5.3, September 16, 2025 - Critical stability enhancements
         | includes a potential match:
         | 
         | "...Fixed a race condition where an old writer instance may not
         | step down after a new writer instance is promoted and continues
         | to write..."
         | 
         | If that is the underlying issue, it would be serious, but
         | without more specifics we can't draw conclusions.
         | 
         | For context: I do not work for AWS, but I do run several
         | production systems on Aurora PostgreSQL. I will try to
         | reproduce this using the latest versions over the next few
         | hours. If I do not post an update within 24 hours, assume my
         | tests did not surface anything.
         | 
         | That would not rule out a real issue in certain edge cases,
         | configurations, or version combinations but it would at least
         | suggest it is not broadly reproducible.
        
         | grogers wrote:
         | It sounds like part of the problem was how the application
         | reacted to the reverted fail over. They had to restart their
         | service to get writes to be accepted, implying some sort of
         | broken caching behavior where it kept trying to send queries to
         | the wrong primary.
         | 
         | It's at least possible that this sort of aborted failover
         | happens a fair amount, but if there's no downtime then users
         | just try again and it succeeds, so they never bother
         | complaining to AWS. Unless AWS is specifically monitoring for
         | it, they might be blind to it happening.
        
       | jansommer wrote:
        | People who have experience with Aurora and RDS Postgres:
        | what's your experience in terms of performance? If you don't
        | need multi-AZ and quick failover, can you achieve better
        | performance with RDS and e.g. gp3 at 64,000 IOPS and 3,125
        | MiBps throughput (assuming everything else can deliver that
        | and CPU/memory isn't the bottleneck)? Aurora seems to be
        | especially slow for inserts and also quite expensive
        | compared to what I get with RDS when I estimate things in
        | the calculator. And what's the story on read performance for
        | Aurora vs RDS? There's an abundance of benchmarks showing
        | Aurora is better in terms of performance, but they leave out
        | so much about their RDS config that I'm having a hard time
        | believing them.
        
         | shawabawa3 wrote:
         | > 3125 throughput
         | 
         | Max throughput on gp3 was recently increased to 2GB/s, is there
         | some way I don't know about of getting 3.125?
        
           | jansommer wrote:
           | This is super confusing. Check out the RDS Postgres
           | calculator with gp3:
           | 
           | > General Purpose SSD (gp3) - Throughput > gp3 supports a max
           | of 4000 MiBps per volume
           | 
            | But the docs say 2,000. Then there's IOPS... The
            | calculator allows up to 64,000, but on [0], if you expand
            | "Higher performance and throughput" it says
           | 
           | > Customers looking for higher performance can scale up to
           | _80,000_ IOPS and 2,000 MiBps for an additional fee.
           | 
           | [0] https://aws.amazon.com/ebs/general-purpose/
        
             | nijave wrote:
             | RDS PG stripes multiple gp3 volumes so that's why RDS
             | throughput is higher than gp3
             | 
              | I think 80k IOPS on gp3 is a newer release, so
              | presumably AWS hasn't updated RDS from the old max of
              | 64k. iirc it took a while before gp3 and io2 were even
              | available for RDS after they were released as EBS
              | options.
             | 
             | Edit: Presumably it takes some time to do
             | testing/optimizations to make sure their RDS config can
             | achieve the same performance as EBS. Sometimes there are
             | limitations with instance generations/types that also
             | impact whether you can hit maximum advertised throughput
        
               | mkesper wrote:
               | Only if you allocate (and pay for) more than 400GB. And
               | if you have high traffic 24/7 beware of "EBS optimized"
               | instances which will fall down to baseline rates after a
               | certain time. I use vantage.sh/rds (not affiliated) to
               | get an overview of the tons of instance details stretched
               | out over several tables in AWS docs.
        
           | nijave wrote:
           | RDS stripes multiple gp3 volumes. Docs are saying 4Gi/s per
           | instance is the max for gp3 if I'm looking at the right table
        
         | nijave wrote:
          | We've seen better results and lower costs with a 1-writer,
          | 1-2-reader setup on Aurora PG 14. The main advantages are
          | 1) you don't re-pay for storage for each instance--you pay
          | for cluster storage instead of per-instance storage--and
          | 2) you no longer need to provision IOPS, and it provides
          | ~80k IOPS.
         | 
          | If you have a PG cluster with 1 writer, 2 readers, 10Ti of
          | storage and 16k provisioned IOPS (io1/2 has better latency
          | than gp3), you pay for 30Ti and 48k PIOPS without
          | redundancy, or 60Ti and 96k PIOPS with multi-AZ.
         | 
          | With the same Aurora setup you pay for 10Ti and get
          | multi-AZ for free (assuming the same cluster layout and
          | that you've put the instances in different AZs).
         | 
          | I don't want to work out the exact numbers, but iirc if
          | you have enough storage--especially io1/2--you can end up
          | saving money and getting better performance. For smaller
          | amounts of storage, the numbers don't necessarily work
          | out.
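          | 
          | Back-of-envelope with the numbers above:
          | 
          |   instances   = 3         # 1 writer + 2 readers
          |   storage_tib = 10
          |   piops       = 16_000
          | 
          |   rds_storage = instances * storage_tib  # 30Ti billed
          |   rds_piops   = instances * piops        # 48k PIOPS
          |   # multi-AZ doubles both: 60Ti / 96k PIOPS
          | 
          |   aurora_storage = storage_tib  # cluster volume, billed once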
         | 
          | There are also two IO billing modes to be aware of. The
          | default is pay-per-IO, which is really only helpful for
          | extreme spikes and generally low IO usage. The other mode
          | is "provisioned" or "storage optimized" or something,
          | where you pay a flat 30% of the instance cost (in addition
          | to the instance cost) for unlimited IO--you can get a lot
          | more IO and end up cheaper in this mode if you had an
          | IO-heavy workload before.
         | 
          | I'd also say Serverless is almost never worth it. Iirc
          | provisioned instances were ~17% of the cost of serverless.
          | Serverless only works out if you have roughly <4 hours of
          | heavy usage followed by almost total idle. You can add
          | instances fairly quickly and fail over with minimal
          | downtime (barring the bug the article describes, of
          | course...) to handle workload spikes using fixed instance
          | sizes without serverless.
           | jansommer wrote:
            | Have you benchmarked your load on RDS? [0] says that
            | IOPS on Aurora is vastly different from actual IOPS. We
            | have just one writer instance and mostly write hundreds
            | of GB in bulk.
           | 
           | [0] https://dev.to/aws-heroes/100k-write-iops-in-
           | aurora-t3medium...
        
         | jaggederest wrote:
          | I've had better results managing my own clusters on metal
          | instances. You get much better performance with e.g. NVMe
          | drives in a RAID 0+1 (~1 million IOPS in a pure RAID 0
          | with 7 drives), and I'm comfortable running my own
          | instances and clusters. I don't care for the way RDS
          | limits your options on extensions and configuration, and I
          | haven't had a good time with the high-availability
          | failovers internally; I'd rather run my own 3 instances in
          | a cluster, and 3 clusters in different AZs.
         | 
         | Blatant plug time:
         | 
         | I'm actually working for a company right now (
         | https://pgdog.dev/ ) that is working on proper sharding and
         | failovers from a connection pooler standpoint. We handle
         | failovers like this by pausing write traffic for up to 60
         | seconds by default at the connection pooler and swapping which
         | backend instance is getting traffic.
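          | 
          | The shape of it, stripped way down (an illustration, not
          | pgdog's actual API):
          | 
          |   import threading
          | 
          |   class WriteGate:
          |       def __init__(self, backend):
          |           self._open = threading.Event()
          |           self._open.set()
          |           self.backend = backend
          | 
          |       def pause(self):
          |           self._open.clear()  # new writes start blocking
          | 
          |       def swap_and_resume(self, new_backend):
          |           self.backend = new_backend
          |           self._open.set()    # blocked writers proceed
          | 
          |       def acquire(self, timeout=60.0):
          |           # writers wait here during a failover instead
          |           # of erroring against a demoted primary
          |           if not self._open.wait(timeout):
          |               raise TimeoutError("failover pause expired")
          |           return self.backend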
        
         | everfrustrated wrote:
          | Aurora doesn't use EBS under the hood. It has no option to
          | choose storage type or IO latency, only a billing choice
          | between pay-per-IO and fixed-price IO.
        
           | jansommer wrote:
           | Precisely! That's why RDS sounds so interesting. I get a lot
           | more knobs to tweak performance, but I'm curious if a maxed
           | out gp3 with instances that support it is going to fare any
           | better than Aurora.
        
         | Exoristos wrote:
         | We were burned by Aurora. Costs, performance, latency, all were
         | poor and affected our product. Having good systems admins on
         | staff, we ended up moving PostgreSQL on-prem.
        
         | Scubabear68 wrote:
         | For me, the big miss with Postgres Aurora RDS was costs. We had
         | some queries that did a fair amount of I/O in a way that would
         | not normally be a problem, but in the Aurora Postgres RDS world
         | that I/O was crazy expensive. A couple of fuzzy queries blew
         | costs up to over $3,000/month for a database that should have
         | cost maybe $50-$100/month. And this was for a dataset of only
         | about 15 million rows without anything crazy in them.
        
           | Hexcles wrote:
            | Sounds like you need to use the IO-optimized storage
            | billing mode.
        
         | paranoidrobot wrote:
         | My experience is with Aurora MySQL, not postgres. But my
         | understanding is that the way the storage layer works is much
         | the same.
         | 
         | We have some clusters with very high write IOPS on Aurora.
         | 
          | When looking at costs, we modelled both running MySQL
          | ourselves and regular RDS MySQL.
          | 
          | We found that for the IOPS capacity of Aurora, we wouldn't
          | be able to match it on AWS without paying a stupid amount
          | more.
        
         | belter wrote:
         | > There's an abundance of benchmarks showing Aurora is better
         | in terms of performance but they leave out so much about their
         | RDS config that I'm having a hard time believing them.
         | 
         | Do you have a problem believing these claims on equivalent
         | hardware?:
         | https://pages.cs.wisc.edu/~yxy/cs764-f20/papers/aurora-sigmo...
         | 
          | Or do your own performance assessment, following the
          | published documents and templates available, so you can
          | establish the facts on your own:
         | 
          | For Aurora MySQL:
         | 
         | "Amazon Aurora Performance Assessment Technical Guide" -
         | https://d1.awsstatic.com/product-marketing/Aurora/RDS_Aurora...
         | 
         | For Aurora Postgres:
         | 
         | "...Steps to benchmark the performance of the PostgreSQL-
         | compatible edition of Amazon Aurora using the pgbench and
         | sysbench benchmarking tools..." -
         | https://d1.awsstatic.com/product-marketing/Aurora/RDS_Aurora...
         | 
         | "Automate benchmark tests for Amazon Aurora PostgreSQL" -
         | https://aws.amazon.com/blogs/database/automate-benchmark-tes...
         | 
         | "Benchmarking Amazon Aurora Limitless with pgbench" -
         | https://aws.amazon.com/blogs/database/benchmarking-amazon-au...
        
       | grhmc wrote:
       | Yikes! This is exactly the kind of invariant I'd expect Aurora to
       | maintain on my behalf. It is why I pay them so much...
        
         | dangoodmanUT wrote:
         | It did, the storage layer did not allow for concurrent writes.
        
       | bob1029 wrote:
       | > Aurora's architecture differs from traditional PostgreSQL in a
       | crucial way: it separates compute from storage.
       | 
       | I find this approach very compelling. MSSQL has a similar thing
       | with their hyperscale offering. It's probably the only service in
       | Azure that I would actually use.
        
       | robinduckett wrote:
       | Glad to know I'm not crazy.
        
         | theanomaly wrote:
         | AWS Support initially pushed back and suggested it's because of
         | high replication lag but they were looking at metrics that were
         | more than 24 hours old. What kind of failure did you encounter?
         | I really want to understand what edge case we triggered in
         | their failover process - especially since we could not
         | reproduce it in other regions.
        
           | robben1234 wrote:
            | My cluster recently started to fail over every few days,
            | whenever it experiences enough load to trigger a
            | scale-up from 1-2 to 20+ ACUs.
            | 
            | And then, just like OP, I also encountered errors in my
            | app layer about trying to execute a write query in a
            | read-only transaction.
            | 
            | The workaround so far is to invalidate the connection on
            | error. When the app reconnects, the cluster write
            | endpoint correctly points to the current primary.
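            | 
            | Roughly this, for anyone else hitting it (a psycopg2
            | sketch; ReadOnlySqlTransaction maps to SQLSTATE 25006):
            | 
            |   from psycopg2 import errors, pool
            | 
            |   db = pool.SimpleConnectionPool(1, 10, dsn="...")
            | 
            |   def execute_write(sql, params=()):
            |       conn = db.getconn()
            |       try:
            |           with conn.cursor() as cur:
            |               cur.execute(sql, params)
            |           conn.commit()
            |           db.putconn(conn)
            |       except errors.ReadOnlySqlTransaction:
            |           # stale connection to the demoted writer:
            |           # close it so the next checkout re-resolves
            |           # the cluster endpoint to the new primary
            |           db.putconn(conn, close=True)
            |           raise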
        
       | d1egoaz wrote:
       | > AWS has indicated a fix is on their roadmap, but as of now, the
       | recommended mitigation aligns with our solution: use Aurora's
       | Failover feature on an as-needed basis and ensure that no writes
       | are executed against the DB during the failover.
       | 
       | Is there a case number where we can reach out to AWS regarding
       | this recommendation?
        
         | paranoidrobot wrote:
         | Yeah. I'd like this too.
         | 
         | We use Aurora MySQL but I would like to be able to point to
         | that and ask if it applies to us.
        
       | time0ut wrote:
       | Wow. This is alarming.
       | 
        | We have done a similar operation routinely on databases under
        | pretty write-intensive workloads (tens of thousands of
        | inserts per second). It is so routine that we have automation
        | to
       | adjust to planned changes in volume and do so a dozen times a
       | month or so. It has been very robust for us. Our apps are
       | designed for it and use AWS's JDBC wrapper.
       | 
       | Just one more thing to worry about I guess...
        
         | dangoodmanUT wrote:
          | Not really: their storage layer worked perfectly and
          | prevented any ACID violations.
        
       | almosthere wrote:
       | probably should have added postgres to end of title
        
         | evanelias wrote:
         | Absolutely this. The differences between Aurora Postgres and
         | Aurora MySQL are quite significant. A failover bug affecting
         | one doesn't imply the same bug exists in the other.
         | 
         | A lot of people seem to have the misconception that "Aurora" is
         | its own unique database system, with different front-ends
         | "pretending" to be Postgres or MySQL, but that isn't the case
         | at all.
        
       | ldkge wrote:
       | Am I the only one who misread that as "AI race condition"?
        
       | dangoodmanUT wrote:
        | This confirms a lot of what their engineers preach: the Lego
        | brick model.
       | 
       | They made the storage layer in total isolation, and they made
       | sure that it guaranteed correctness for exclusive writer access.
       | When the upstream service failed to also make its own guarantees,
       | the data layer was still protected.
       | 
       | Good job AWS engineering!
        
       | halifaxbeard wrote:
       | I think OP is wrong in their hypothesis based on the logs they
       | share and the root cause AWS support provided them.
       | 
       | I think the promotion fails to happen and then an external
       | watchdog notices that it didn't, and kills everything ASAP as
       | it's a cluster state mismatch.
       | 
       | The message about the storage subsystem going away is after the
       | other Postgres process was kill -9'd.
        
       | shayonj wrote:
        | Sadly, it's not the first time I have noticed unexpected and
        | odd behaviors from the Aurora PostgreSQL offering.
       | 
       | I noticed another interesting (and still unconfirmed) bug with
       | Aurora PostgreSQL around their Zero Downtime Patching.
       | 
       | During an Aurora minor version upgrade, Aurora preserves sessions
       | across the engine restart, but it appears to also preserve stale
       | per-session execution state (including the internal statement
       | timer). After ZDP, I've seen very simple queries (e.g. a single-
       | row lookup via Rails/ActiveRecord) fail with `PG::QueryCanceled:
       | ERROR: canceling statement due to statement timeout` in far less
       | than the configured statement_timeout (GUC), and only in the
       | brief window right after ZDP completes.
       | 
       | My working theory is that when the client reconnects (e.g. via
       | PG::Connection#reset), Aurora routes the new TCP connection back
       | to a preserved session whose "statement start time" wasn't
       | properly reset, so the new query inherits an old timer and gets
       | canceled almost immediately even though it's not long-running at
       | all.
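        | 
        | If that theory holds, one defensive shape is a single retry
        | on a fresh connection (sketched in Python for brevity; names
        | are illustrative):
        | 
        |   import psycopg2
        |   from psycopg2 import errors
        | 
        |   def query_with_zdp_retry(dsn, sql, params=()):
        |       conn = psycopg2.connect(dsn)
        |       try:
        |           with conn.cursor() as cur:
        |               cur.execute(sql, params)
        |               return cur.fetchall()
        |       except errors.QueryCanceled:
        |           # a brand-new connection sidesteps any preserved
        |           # session state carrying a stale statement timer
        |           conn.close()
        |           conn = psycopg2.connect(dsn)
        |           with conn.cursor() as cur:
        |               cur.execute(sql, params)
        |               return cur.fetchall()
        |       finally:
        |           conn.close()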
        
       ___________________________________________________________________
       (page generated 2025-11-15 23:01 UTC)