[HN Gopher] A race condition in Aurora RDS
___________________________________________________________________
A race condition in Aurora RDS
Author : theanomaly
Score : 237 points
Date : 2025-11-14 18:20 UTC (1 day ago)
(HTM) web link (hightouch.com)
(TXT) w3m dump (hightouch.com)
| redwood wrote:
| A good reminder of how the mental model of adding read replicas
| as a way to scale is a slippery slope. At the end of the day
| you're scaling only one specific part of your system, with
| consistency dynamics that are difficult to reason about
| terminalshort wrote:
| Works fine for workloads like:
|
| 1. I need to grab some rows from a table
|
| 2. Eventual consistency is good enough
|
| And that's a lot of workloads.
| candiddevmike wrote:
| As a user, I've come to realize that the situations where I
| think eventual consistency (or delayed processing) is good
| enough aren't the same as those of the folks developing most
| products. Nothing annoys me more than stuff not showing up
| immediately or having to manually refresh.
| darth_avocado wrote:
| Sometimes users want everything to show up immediately, but
| don't want to pay extra for the feature. Making everything
| real time is expensive. Eventual consistency is a good thing
| for most systems.
| terminalshort wrote:
| For a workload where you need true read after write you can
| just send those reads to the writer. But even if you don't
| there are plenty of workarounds here. You can send a
| success response to the user when the transaction commits
| to the writer and update the UI on response. The only case
| where this will fail is if the user manually reloads the
| page within the replication lag window and the request goes
| to the reader. This should be exceedingly rare in a single-
| region cluster, and maybe a little less rare in a multi-region
| setup, but still pretty rare. I almost never see >
| 1s replication lag between regions in my Aurora clusters.
| There are certainly DB workloads where this will not be
| true, but if you are in a high replication lag cluster, you
| just don't want to use that for this type of UI dependency
| in the first place.
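|
| Roughly the shape of that routing, as a sketch (the endpoint
| names and driver are illustrative, not Aurora specifics):
|
|   import psycopg2  # assumed Postgres driver
|
|   # Hypothetical Aurora endpoints: the cluster endpoint always
|   # points at the writer, the -ro endpoint at the readers.
|   WRITER_DSN = "host=myapp.cluster-abc.us-east-1.rds.amazonaws.com dbname=app"
|   READER_DSN = "host=myapp.cluster-ro-abc.us-east-1.rds.amazonaws.com dbname=app"
|
|   def run_query(sql, params=(), needs_read_your_writes=False):
|       # Reads that must see the caller's own just-committed write
|       # go to the writer; everything else tolerates replica lag.
|       dsn = WRITER_DSN if needs_read_your_writes else READER_DSN
|       conn = psycopg2.connect(dsn)
|       try:
|           with conn, conn.cursor() as cur:
|               cur.execute(sql, params)
|               return cur.fetchall()
|       finally:
|           conn.close()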
| nilamo wrote:
| I think the key here is just proper notifications. Yes it's
| eventually consistent, but showing a "processing" or "update
| in progress" state is a huge improvement over showing a user
| old data.
| redwood wrote:
| Future you, or a future team member, may struggle to reason
| about that
| morshu9001 wrote:
| That's read-only. RW workloads usually don't tolerate eventual
| consistency on the thing they're writing.
| terminalshort wrote:
| Yeah, if you have a mix of reads and writes in a workflow,
| you gotta hit the writer node. But a lot of times an
| endpoint is only reading data from a particular DB.
| nijave wrote:
| You can hit the same problems horizontally scaling compute. One
| instance reads from the DB, a request hits a different instance
| which updates the DB. The original instance writes to the DB
| and overwrites the changes or makes decisions based on stale
| data.
|
| More broadly, it's a distributed-systems problem
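|
| A common guard for that read-modify-write race is optimistic
| locking with a version column, roughly like this (table and
| column names are made up):
|
|   def save_widget(conn, widget_id, new_payload, expected_version):
|       # Compare-and-set update: refuses to overwrite a row someone
|       # else changed since we read it, instead of silently losing
|       # their write.
|       with conn.cursor() as cur:
|           cur.execute(
|               """
|               UPDATE widgets
|                  SET payload = %s, version = version + 1
|                WHERE id = %s AND version = %s
|               """,
|               (new_payload, widget_id, expected_version),
|           )
|           stale = cur.rowcount == 0
|       if stale:
|           conn.rollback()
|           raise RuntimeError("stale write rejected; re-read and retry")
|       conn.commit()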
| gtowey wrote:
| This article seems to indicate that manually triggered failovers
| will always fail if your application tries to maintain its normal
| write traffic during that process.
|
| Not that I'm discounting the author's experience, but something
| doesn't quite add up:
|
| - How is it possible that other users of Aurora aren't
| experiencing this issue basically all the time? How could AWS not
| know it exists?
|
| - If they know, how is this not an urgent P0 issue for AWS? This
| seems like the most basic of basic usability features is 100%
| broken.
|
| - Is there something more nuanced to the failure case here, such
| as a dependency on in-progress transactions? I can see how the
| failover might wait for in-flight transactions to close, hit a
| timeout, and then proceed with the other part of the failover
| anyway. That could explain why the issue doesn't seem to be
| more widespread.
| maherbeg wrote:
| Yeah I agree, this seems like a pretty critical feature of the
| Aurora product itself. We saw similar behavior recently with a
| connection pooler in between, which indicates something wrong
| with how they propagate DNS changes during the failover. wtf
| aws
| CaptainKanuk wrote:
| Whenever we have to do any type of AWS Aurora or RDS cluster
| modification in prod we always have the entire emergency
| response crew standing by right outside the door.
|
| Their docs are not good and things frequently don't behave
| how you expect them to.
| ekropotin wrote:
| Oh, well, it's always DNS!
| dboreham wrote:
| Although the article has an SEO-optimized vibe, I think it's
| reasonable to take it as true until refuted. My rule of thumb
| is that any rarely executed, very tricky operation (e.g.
| database writer fail over) is likely to not work because there
| are too many variables in play and way too few opportunities to
| find and fix bugs. So the overall story sounds very plausible
| to me. It has a feel of: it doesn't work under continuous heavy
| write load, in combination with some set of hardware
| performance parameters that play badly with some arbitrary
| timeout. Note that the system didn't actually fail. It just
| didn't process the fail over operation. It reverted to the
| original configuration and afaics preserved data.
| theanomaly wrote:
| I'm surprised this hasn't come up more often too. When we
| worked with AWS on this, they confirmed there was nothing
| unique about our traffic pattern that would trigger this issue.
| We also didn't run into this race condition in any of our other
| regions running similar workloads. What's particularly
| concerning is that this seems to be a fundamental flaw in
| Aurora's failover mechanism that could theoretically affect
| anyone doing manual failover.
| twisteriffic wrote:
| > How is it possible that other users of Aurora aren't
| experiencing this issue basically all the time? How could AWS
| not know it exists?
|
| If it's anything like how Azure handles this kind of issue,
| it's likely "lots of people have experienced it, a restart
| fixes it so no one cares that much, few have any idea how to
| figure out a root cause on their own, and the process to find a
| root cause with the vendor is so painful that no one ever sees
| it through"
| perching_aix wrote:
| An experience not exclusive to cloud vendors :) Even better
| when the vendor throws their hands up cause the issue is not
| reliably repro'able.
|
| That was when I scripted away a test that ran hundreds of
| times a day on a lower environment, attempting repro. As they
| say, at scale, even insignificant issues become significant.
| I don't remember clearly, I think it was a 5-10% chance that
| the issue triggered.
|
| At least confirming the fix, which we did eventually receive,
| was mostly a breeze. Had to provide an inordinate amount of
| captures, logs, and data to get there though. Was quite the
| grueling few weeks, especially all the office-politics-laden
| calls.
| pixl97 wrote:
| I've had customers with load related bugs for years simply
| because they'd reboot when the problem happened. When
| dealing with the F100 it seems there is a rather limited
| number of people in these organizations that can
| troubleshoot complex issues; that, or they lock them away
| out of sight.
| perching_aix wrote:
| It is a tough bargain to be fair, and it is seen in other
| places too. From developers copying out their stuff from
| their local git repo, recloning from remote, then pasting
| their stuff back, all the way to phone repair just
| meaning "here's a new device, we synced all your data
| across for you", it's fairly hard to argue with the
| economic factors and the effectiveness of this approach
| at play.
|
| With all the enterprise solutions being distributed,
| loosely coupled, self-healing, redundant, and fault-
| tolerant, issues like this essentially just slot in
| perfectly. Compound this with man-hours (especially
| expert ones) being a lot harder to justify for any one
| particular bump in tail latency, and the equation is just
| really not there for all this.
|
| What gets us specifically to look into things is either
| the issue being operationally gnarly (e.g. frequent,
| impacting, or both), or management being swayed enough by
| principled thinking (or at least pretending to be). I'd
| imagine it's the same elsewhere. The latter would mostly
| happen if fixing a given thing becomes an office
| political concern, or a corporate reputation one. You
| might wonder if those individual issues ever snowballed
| into a big one, but turns out human nature takes care of
| that just "sufficiently enough" before it would manifest
| "too severely". [0]
|
| Otherwise, you're looking at fixing / RCA'ing / working
| around someone else's product defect on their behalf, and
| giving your engineers a "fun challenge". Fun doesn't pay
| the bills, and we rarely saw much in return from the
| vendor in exchange for our research. I'd love to
| entertain the idea that maybe behind closed doors the
| negotiations went a little better because of these, but
| for various reasons, I really doubt so in hindsight.
|
| [0] as delightfully subjective as those get of course
| hobs wrote:
| If I had a nickel for every time I had to explain that
| rebooting a database server is usually the wrong choice I
| would have quite a fortune.
| sally_glance wrote:
| Theoretically you're supposed to assign lower prio to issues
| with known workarounds but then there should also be
| reporting for product management (which assigns weight by age
| of first occurrence and total count of similar issues).
|
| Amazon is mature enough for processes to reflect this, so my
| guess for why something like this could slip through is
| either too many new feature requests or many more critical
| issues to resolve.
| pwarner wrote:
| Azure, yes, I'd expect this, and the restart would take many
| minutes. Been there, done that.
|
| AWS, this is surprising
| nijave wrote:
| fwiw we haven't seen issues doing manual failovers for
| maintenance using the same/similar procedure described in the
| article. I imagine there is something more nuanced here and
| it's hard to draw too many conclusions without a lot more
| details being provided by AWS
| aetherson wrote:
| My experience with AWS is that they are extremely, extremely
| parsimonious about any information they give out. It is near-
| impossible to get them to give you any details about what is
| happening beyond the level of their API. So my gut hunch is
| that they think that there's something very rare about this
| happening, but they refuse to give the article writer the
| information that might or might not help them avoid the bug.
| everfrustrated wrote:
| If you pay for the highest level of support you will get
| extremely good support. But it comes with signing an NDA so
| you're not going to read about anything coming out of it on a
| blog.
|
| I've had AWS engineers confirm very detailed and specific
| technical implementation details many many times. But these
| were at companies that happily spent over $1M/year with
| AWS.
| qaq wrote:
| Nah, if your monthly spend is really significant then you will
| get good support and the issues you care about will get
| prioritized. Going from a startup with 50K/month spend to a
| large company spending untold millions per month, the
| experience is night and day. We have dev managers and engineers
| from key AWS teams present in meetings when need be, and we get
| issues we raise prioritized and added to dev roadmaps etc.
| aetherson wrote:
| I was at a company that spent over $90M a year with AWS and
| we got defensive, limited comms.
| Hovertruck wrote:
| Agreed, we've been running multiple aurora clusters in
| production for years now and have not encountered this issue
| with failovers.
| dalyons wrote:
| Same. There's something missing here.
| kobalsky wrote:
| > - How is it possible that other users of Aurora aren't
| experiencing this issue basically all the time? How could AWS
| not know it exists?
|
| I know that there is no comparison in the user base, but a few
| years ago I ran into a massive Python + MySQL bug that:
|
| 1. made SELECT ... FOR UPDATE fail silently
|
| 2. aborted the transaction and set the connection into
| autocommit mode
|
| This is basically a worst-case scenario in a transactional
| system.
|
| I was basically screaming like a madman in the corner but no
| one seemed to care.
|
| Someone contacted me months later telling me that they
| experienced the same problem with "interesting" consequences in
| their system.
|
| The bug was eventually fixed, but by that point I wasn't
| tracking it anymore; I had provided a patch when I created the
| issue and moved on.
|
| https://stackoverflow.com/questions/945482/why-doesnt-anyone...
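|
| To illustrate the failure mode (a from-memory sketch with
| generic DB-API calls, not the actual code; the connection
| starts in autocommit, as was the default):
|
|   def reserve_seat(conn, seat_id):
|       cur = conn.cursor()
|       cur.execute("BEGIN")  # explicit transaction
|       cur.execute("SELECT * FROM seats WHERE id = %s FOR UPDATE",
|                   (seat_id,))
|       # The server hits a deadlock here, rolls the transaction
|       # back and returns an error -- but the buggy client library
|       # swallowed it, so execution continued as if the row were
|       # still locked.
|       cur.execute("UPDATE seats SET taken = 1 WHERE id = %s",
|                   (seat_id,))
|       # After the server-side rollback the session is back in
|       # autocommit, so this UPDATE commits immediately, outside
|       # any lock or transaction.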
| sroussey wrote:
| Converting a connection to autocommit upon error. Yikes!!
| evanelias wrote:
| If I'm reading this correctly, it sounds like the
| connection was already using autocommit by default? In that
| situation, if you initiate a transaction, and then it gets
| rolled back, you're back in autocommit unless/until you
| initiate another transaction.
|
| If so, that part is all totally normal and expected. It's
| just that due to a bug in the Python client library (16
| years ago), the rollback was happening silently because the
| error was not surfaced properly by the client library.
| o11c wrote:
| I would argue that it's a bug for it even to be
| _possible_ to autocommit.
| evanelias wrote:
| What do you mean? Autocommit mode is the default mode in
| Postgres and MS SQL Server as well. This is by no means a
| MySQL-specific behavior!
|
| When you're in autocommit mode, BEGIN starts an explicit
| transaction, but after that transaction (either COMMIT or
| ROLLBACK), you return to autocommit mode.
|
| The situation being described upthread is a case where a
| transaction was started, and then rolled back by the
| server due to deadlock error. So it's totally normal that
| you're back in autocommit mode after the rollback. Most
| DBMSs handle this identically.
|
| The bug described was entirely in the client library
| failing to surface the deadlock error. There's simply no
| autocommit-related bug as it was described.
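|
| Concretely, in Postgres terms (psycopg2 with autocommit on, so
| each statement maps 1:1 to what the server sees; the DSN is a
| placeholder):
|
|   import psycopg2
|
|   conn = psycopg2.connect("dbname=test")
|   conn.autocommit = True                   # mirror the server's default mode
|   cur = conn.cursor()
|
|   cur.execute("INSERT INTO t VALUES (1)")  # autocommit: commits on its own
|
|   cur.execute("BEGIN")                     # explicit transaction starts
|   cur.execute("INSERT INTO t VALUES (2)")
|   cur.execute("ROLLBACK")                  # or the server aborts it on deadlock
|
|   cur.execute("INSERT INTO t VALUES (3)")  # back in autocommit mode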
| o11c wrote:
| Yes, and most DBMS's are full of historical mistakes.
|
| In a sane world, statements outside `BEGIN` would be an
| unconditional error.
| grogers wrote:
| Autocommit mode is pretty handy for ad-hoc queries at
| least. You wouldn't want to have to remember to close the
| transaction since keeping a transaction open is often
| really bad for the DB
| evanelias wrote:
| Lack of autocommit would be bad for performance at scale,
| since it would add latency to every single query. And the
| MVCC implications are non-trivial, especially for
| interactive queries (human taking their time typing)
| while using REPEATABLE READ isolation or stronger...
| every interactive query would effectively disrupt
| purge/vacuum until the user commits. And as the sibling
| comment noted, that would be quite harmful if the user
| completely forgets to commit, which is common.
|
| In any case, that's a subjective opinion on database
| design, not a bug. Anyway it's fairly tangential to the
| client library bug described up-thread.
| benmmurphy wrote:
| It could be that most people pause writes, because it's going
| to create errors if you try to execute a write against an
| instance that refuses to accept writes, and for some people
| those errors might not be recoverable. So they just have some
| option in their application that puts it into maintenance mode
| where it will hard-reject writes at the application layer.
| biggoodwolf wrote:
| I recall seeing this also happening in CosmosDB. Both auto and
| manual failovers.
| nrhrjrjrjtntbt wrote:
| P0 if it happens to everyone, right? Like the USE1 outage
| recently. If it is 0.001% of customers (enough to get an HN
| story) it may not be that high. Maybe this customer is on a
| migration or upgrade path under the hood. Or just on a bad unit
| in the rack.
| belter wrote:
| The article is low quality. It does not mention which Aurora
| PostgreSQL version was involved, and it provides no real detail
| about how the staging environment differed from production,
| only saying that staging "didn't reproduce the exact
| conditions," which is not actionable.
|
| This AWS documentation section:
| https://docs.aws.amazon.com/AmazonRDS/latest/AuroraPostgreSQ...
|
| "Amazon Aurora PostgreSQL updates": under Aurora PostgreSQL
| 17.5.3, September 16, 2025 - Critical stability enhancements
| includes a potential match:
|
| "...Fixed a race condition where an old writer instance may not
| step down after a new writer instance is promoted and continues
| to write..."
|
| If that is the underlying issue, it would be serious, but
| without more specifics we can't draw conclusions.
|
| For context: I do not work for AWS, but I do run several
| production systems on Aurora PostgreSQL. I will try to
| reproduce this using the latest versions over the next few
| hours. If I do not post an update within 24 hours, assume my
| tests did not surface anything.
|
| That would not rule out a real issue in certain edge cases,
| configurations, or version combinations, but it would at least
| suggest it is not broadly reproducible.
| grogers wrote:
| It sounds like part of the problem was how the application
| reacted to the reverted failover. They had to restart their
| service to get writes to be accepted, implying some sort of
| broken caching behavior where it kept trying to send queries to
| the wrong primary.
|
| It's at least possible that this sort of aborted failover
| happens a fair amount, but if there's no downtime then users
| just try again and it succeeds, so they never bother
| complaining to AWS. Unless AWS is specifically monitoring for
| it, they might be blind to it happening.
| jansommer wrote:
| People who have experience with Aurora and RDS Postgres: What's
| your experience in terms of performance? If you don't need
| multi-AZ and quick failover, can you achieve better performance
| with RDS and e.g. gp3 at 64,000 IOPS and 3,125 MiBps throughput
| (assuming everything else can deliver that and CPU/mem isn't the
| bottleneck)? Aurora seems to be especially slow for inserts and
| also quite expensive compared to what I get with RDS when I
| estimate things in the calculator. And what's the story on read
| performance for Aurora vs RDS? There's an abundance of
| benchmarks showing Aurora is better in terms of performance but
| they leave out so much about their RDS config that I'm having a
| hard time believing them.
| shawabawa3 wrote:
| > 3125 throughput
|
| Max throughput on gp3 was recently increased to 2GB/s, is there
| some way I don't know about of getting 3.125?
| jansommer wrote:
| This is super confusing. Check out the RDS Postgres
| calculator with gp3:
|
| > General Purpose SSD (gp3) - Throughput > gp3 supports a max
| of 4000 MiBps per volume
|
| But the docs say 2000. Then there's IOPS... The calculator
| allows up to 64,000, but on [0], if you expand "Higher
| performance and throughput" it says
|
| > Customers looking for higher performance can scale up to
| _80,000_ IOPS and 2,000 MiBps for an additional fee.
|
| [0] https://aws.amazon.com/ebs/general-purpose/
| nijave wrote:
| RDS PG stripes multiple gp3 volumes so that's why RDS
| throughput is higher than gp3
|
| I think 80k IOPS on gp3 is a newer release, so presumably
| AWS hasn't updated RDS from the old max of 64k. iirc it
| took a while before gp3 and io2 were even available for RDS
| after they were released as EBS options
|
| Edit: Presumably it takes some time to do
| testing/optimizations to make sure their RDS config can
| achieve the same performance as EBS. Sometimes there are
| limitations with instance generations/types that also
| impact whether you can hit maximum advertised throughput
| mkesper wrote:
| Only if you allocate (and pay for) more than 400GB. And
| if you have high traffic 24/7 beware of "EBS optimized"
| instances which will fall down to baseline rates after a
| certain time. I use vantage.sh/rds (not affiliated) to
| get an overview of the tons of instance details stretched
| out over several tables in AWS docs.
| nijave wrote:
| RDS stripes multiple gp3 volumes. Docs are saying 4Gi/s per
| instance is the max for gp3 if I'm looking at the right table
| nijave wrote:
| We've seen better results and lower costs in a 1 writer, 1-2
| reader setup on Aurora PG 14. The main advantages are 1) you
| don't re-pay for storage for each instance--you pay for cluster
| storage instead of per-instance storage, and 2) you no longer
| need to provision IOPS, and it provides ~80k IOPS
|
| If you have a PG cluster with 1 writer, 2 readers, 10Ti of
| storage and 16k provisioned IOPS (io1/2 has better latency than
| gp3), you pay for 30Ti and 48k PIOPS without redundancy or 60Ti
| and 96k PIOPS with multi-AZ.
|
| With the same Aurora setup you pay for 10Ti and get multi-AZ for
| free (assuming the same cluster setup and that you've stuck the
| instances in different AZs).
|
| I don't want to figure the exact numbers but iirc if you have
| enough storage--especially io1/2--you can end up saving money
| and getting better performance. For smaller amounts of storage,
| the numbers don't necessarily work out.
|
| There are also two IO billing modes to be aware of. There's the
| default pay-per-IO which is really only helpful for extreme
| spikes and generally low IO usage. The other mode is
| "provisioned" or "storage optimized" or something where you pay
| a flat 30% of the instance cost (in addition to the instance
| cost) for unlimited IO--you can get a lot more IO and end up
| cheaper in this mode if you had an IO heavy workload before
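|
| A rough break-even sketch for the two modes (all numbers are
| placeholders you'd fill in from your own bill, not AWS rates):
|
|   def io_optimized_is_cheaper(instance_cost_per_month,
|                               io_requests_per_month,
|                               price_per_million_io):
|       # Standard mode bills per I/O request; I/O-Optimized instead
|       # adds a flat ~30% uplift on instance cost for unlimited I/O.
|       pay_per_io = io_requests_per_month / 1_000_000 * price_per_million_io
|       uplift = 0.30 * instance_cost_per_month
|       return uplift < pay_per_io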
|
| I'd also say Serverless is almost never worth it. Iirc
| provisioning instances was ~17% of the cost of serverless.
| Serverless only works out if you have ~ <4 hours of heavy usage
| followed by almost all idle. You can add instances fairly
| quickly and failover for minimal downtime (of course barring
| running into the bug the article describes...) to handle
| workload spikes using fixed instance sizes without serverless
| jansommer wrote:
| Have you benchmarked your load on RDS? [0] says that IOPS on
| Aurora is vastly different from actual IOPS. We have just one
| writer instance and mostly write hundreds of GB in bulk.
|
| [0] https://dev.to/aws-heroes/100k-write-iops-in-
| aurora-t3medium...
| jaggederest wrote:
| I've had better results with managing my own clusters on metal
| instances. You get much better performance with e.g. NVMe
| drives in a RAID 0+1 (~1 million IOPS in a pure RAID 0 with 7
| drives) and I am comfortable running my own instances and
| clusters. I don't care for the way RDS limits your options on
| extensions and configuration, and I haven't had a good time
| with the high availability failovers internally, I'd rather run
| my own 3 instances in a cluster, 3 clusters in different AZs.
|
| Blatant plug time:
|
| I'm actually working for a company right now (
| https://pgdog.dev/ ) that is working on proper sharding and
| failovers from a connection pooler standpoint. We handle
| failovers like this by pausing write traffic for up to 60
| seconds by default at the connection pooler and swapping which
| backend instance is getting traffic.
| everfrustrated wrote:
| Aurora doesn't use EBS under the hood. It has no option to
| choose storage type or I/O latency. Only a billing choice
| between pay-per-I/O or fixed-price I/O.
| jansommer wrote:
| Precisely! That's why RDS sounds so interesting. I get a lot
| more knobs to tweak performance, but I'm curious if a maxed
| out gp3 with instances that support it is going to fare any
| better than Aurora.
| Exoristos wrote:
| We were burned by Aurora. Costs, performance, latency, all were
| poor and affected our product. Having good systems admins on
| staff, we ended up moving PostgreSQL on-prem.
| Scubabear68 wrote:
| For me, the big miss with Postgres Aurora RDS was costs. We had
| some queries that did a fair amount of I/O in a way that would
| not normally be a problem, but in the Aurora Postgres RDS world
| that I/O was crazy expensive. A couple of fuzzy queries blew
| costs up to over $3,000/month for a database that should have
| cost maybe $50-$100/month. And this was for a dataset of only
| about 15 million rows without anything crazy in them.
| Hexcles wrote:
| Sounds like you need to use IO optimized storage billing
| mode.
| paranoidrobot wrote:
| My experience is with Aurora MySQL, not Postgres. But my
| understanding is that the way the storage layer works is much
| the same.
|
| We have some clusters with very high write IOPS on Aurora.
|
| When looking at costs we modelled running MySQL and regular RDS
| MySQL.
|
| We found that for the IOPS capacity of Aurora, we wouldn't be
| able to
| match it on AWS without paying a stupid amount more.
| belter wrote:
| > There's an abundance of benchmarks showing Aurora is better
| in terms of performance but they leave out so much about their
| RDS config that I'm having a hard time believing them.
|
| Do you have a problem believing these claims on equivalent
| hardware?:
| https://pages.cs.wisc.edu/~yxy/cs764-f20/papers/aurora-sigmo...
|
| Or do your own performance assessments, following the published
| document and templates available so you can find the facts on
| your own?
|
| For Aurora MySql:
|
| "Amazon Aurora Performance Assessment Technical Guide" -
| https://d1.awsstatic.com/product-marketing/Aurora/RDS_Aurora...
|
| For Aurora Postgres:
|
| "...Steps to benchmark the performance of the PostgreSQL-
| compatible edition of Amazon Aurora using the pgbench and
| sysbench benchmarking tools..." -
| https://d1.awsstatic.com/product-marketing/Aurora/RDS_Aurora...
|
| "Automate benchmark tests for Amazon Aurora PostgreSQL" -
| https://aws.amazon.com/blogs/database/automate-benchmark-tes...
|
| "Benchmarking Amazon Aurora Limitless with pgbench" -
| https://aws.amazon.com/blogs/database/benchmarking-amazon-au...
| grhmc wrote:
| Yikes! This is exactly the kind of invariant I'd expect Aurora to
| maintain on my behalf. It is why I pay them so much...
| dangoodmanUT wrote:
| It did; the storage layer did not allow concurrent writes.
| bob1029 wrote:
| > Aurora's architecture differs from traditional PostgreSQL in a
| crucial way: it separates compute from storage.
|
| I find this approach very compelling. MSSQL has a similar thing
| with their hyperscale offering. It's probably the only service in
| Azure that I would actually use.
| robinduckett wrote:
| Glad to know I'm not crazy.
| theanomaly wrote:
| AWS Support initially pushed back and suggested it's because of
| high replication lag but they were looking at metrics that were
| more than 24 hours old. What kind of failure did you encounter?
| I really want to understand what edge case we triggered in
| their failover process - especially since we could not
| reproduce it in other regions.
| robben1234 wrote:
| My cluster recently started failing over every few days
| whenever it experiences enough load to trigger a scale-up from
| 1-2 to 20+ ACUs.
|
| And then I also encountered errors just like the OP's in my app
| layer, about trying to execute a write query in a read-only
| transaction.
|
| The workaround so far is to invalidate the connection on error.
| When the app reconnects, the cluster write endpoint correctly
| points at the current primary.
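|
| Something like this, in psycopg2 terms (the error class and the
| one-retry policy are my own assumptions):
|
|   import psycopg2
|
|   def execute_write(get_conn, invalidate_conn, sql, params):
|       # If a write lands on a stale "primary" that is now read-only,
|       # throw the connection away so the next attempt re-resolves
|       # the cluster writer endpoint. get_conn/invalidate_conn are
|       # whatever hooks your pool exposes.
|       for attempt in range(2):
|           conn = get_conn()
|           try:
|               with conn.cursor() as cur:
|                   cur.execute(sql, params)
|               conn.commit()
|               return
|           except psycopg2.Error as e:
|               if e.pgcode == "25006" and attempt == 0:
|                   # read_only_sql_transaction: drop the connection;
|                   # reconnecting resolves the real writer
|                   invalidate_conn(conn)
|                   continue
|               raise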
| d1egoaz wrote:
| > AWS has indicated a fix is on their roadmap, but as of now, the
| recommended mitigation aligns with our solution: use Aurora's
| Failover feature on an as-needed basis and ensure that no writes
| are executed against the DB during the failover.
|
| Is there a case number where we can reach out to AWS regarding
| this recommendation?
| paranoidrobot wrote:
| Yeah. I'd like this too.
|
| We use Aurora MySQL but I would like to be able to point to
| that and ask if it applies to us.
| time0ut wrote:
| Wow. This is alarming.
|
| We have done a similar operation routinely on databases under
| pretty write intensive workloads (like 10s of thousands of
| inserts per second). It is so routine we have automation to
| adjust to planned changes in volume and do so a dozen times a
| month or so. It has been very robust for us. Our apps are
| designed for it and use AWS's JDBC wrapper.
|
| Just one more thing to worry about I guess...
| dangoodmanUT wrote:
| Not really: Their storage layer worked perfectly and prevented
| the ACID violations.
| almosthere wrote:
| Probably should have added Postgres to the end of the title
| evanelias wrote:
| Absolutely this. The differences between Aurora Postgres and
| Aurora MySQL are quite significant. A failover bug affecting
| one doesn't imply the same bug exists in the other.
|
| A lot of people seem to have the misconception that "Aurora" is
| its own unique database system, with different front-ends
| "pretending" to be Postgres or MySQL, but that isn't the case
| at all.
| ldkge wrote:
| Am I the only one who misread that as "AI race condition"?
| dangoodmanUT wrote:
| This confirms a lot of what their engineers preach: the Lego
| brick model.
|
| They made the storage layer in total isolation, and they made
| sure that it guaranteed correctness for exclusive writer access.
| When the upstream service failed to also make its own guarantees,
| the data layer was still protected.
|
| Good job AWS engineering!
| halifaxbeard wrote:
| I think OP is wrong in their hypothesis based on the logs they
| share and the root cause AWS support provided them.
|
| I think the promotion fails to happen and then an external
| watchdog notices that it didn't, and kills everything ASAP as
| it's a cluster state mismatch.
|
| The message about the storage subsystem going away comes after
| the other Postgres process was kill -9'd.
| halfmatthalfcat wrote:
| CC pm. MgtzkskskzjauHjhffd
| shayonj wrote:
| Sadly, it's not the first time I have noticed unexpected and odd
| behaviors from the Aurora PostgreSQL offering.
|
| I noticed another interesting (and still unconfirmed) bug with
| Aurora PostgreSQL around their Zero Downtime Patching.
|
| During an Aurora minor version upgrade, Aurora preserves sessions
| across the engine restart, but it appears to also preserve stale
| per-session execution state (including the internal statement
| timer). After ZDP, I've seen very simple queries (e.g. a single-
| row lookup via Rails/ActiveRecord) fail with `PG::QueryCanceled:
| ERROR: canceling statement due to statement timeout` in far less
| than the configured statement_timeout (GUC), and only in the
| brief window right after ZDP completes.
|
| My working theory is that when the client reconnects (e.g. via
| PG::Connection#reset), Aurora routes the new TCP connection back
| to a preserved session whose "statement start time" wasn't
| properly reset, so the new query inherits an old timer and gets
| canceled almost immediately even though it's not long-running at
| all.
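|
| The band-aid I've been considering looks roughly like this
| (psycopg2-flavored; the "canceled suspiciously early" heuristic
| is just my theory above, not anything AWS has confirmed):
|
|   import time
|   import psycopg2
|   from psycopg2 import errors
|
|   def run_with_zdp_retry(connect, sql, params=(), statement_timeout_s=30):
|       # If a query is "canceled due to statement timeout" long
|       # before the configured timeout could have elapsed, assume we
|       # landed on a preserved session with a stale timer, then
|       # reconnect and retry once.
|       conn = connect()
|       started = time.monotonic()
|       try:
|           with conn, conn.cursor() as cur:
|               cur.execute(sql, params)
|               return cur.fetchall()
|       except errors.QueryCanceled:
|           elapsed = time.monotonic() - started
|           if elapsed < statement_timeout_s * 0.5:
|               conn.close()           # drop the preserved session
|               conn = connect()       # fresh session, fresh timer
|               with conn, conn.cursor() as cur:
|                   cur.execute(sql, params)
|                   return cur.fetchall()
|           raise
|       finally:
|           conn.close()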
___________________________________________________________________
(page generated 2025-11-15 23:01 UTC)