[HN Gopher] We migrated our PostgreSQL database with 11 seconds ...
       ___________________________________________________________________
        
       We migrated our PostgreSQL database with 11 seconds downtime
        
       Author : sh_tomer
       Score  : 563 points
       Date   : 2024-01-18 21:51 UTC (2 days ago)
        
 (HTM) web link (gds.blog.gov.uk)
 (TXT) w3m dump (gds.blog.gov.uk)
        
       | Edwinr95 wrote:
        | I'm quite unpleasantly surprised that a government service is
        | moving from their own platform to AWS for such an important
        | service.
        
         | cpursley wrote:
         | And an American one, at that (we're talking government services
         | here, not some SaaS). Are there really no native UK cloud
         | providers?
        
           | cameronh90 wrote:
            | Not any that are remotely comparable. The small "cloud"
            | providers we do have were just reselling vSphere last time
            | I looked into it.
        
         | jazzyjackson wrote:
          | I don't know what it's like in the UK, but it may be the case
          | that government has a hard time a{ttract,fford}ing talent to
          | administer everything in house. Not that AWS is great for
          | cost saving, but if it's between paying 50k/year for cloud
          | services and not being able to find an engineer who will
          | competently do the job for less than 50k, then the cloud is
          | your only move really.
        
           | swozey wrote:
            | They require various clearances (digging into your life and
            | past relationships to a miserable degree), don't allow
            | someone to have ever smoked pot, and pay half or less of
            | what you can make in the private sector here (USA).
            | 
            | Everyone I know working FedRAMP jobs is prior
            | military/g-level.
        
             | robertlagrant wrote:
             | They wouldn't need that. And having been SC cleared in the
             | UK, and known a few DV-cleared ones, at least in the UK
             | they don't care if you've smoked pot. They just care that
             | if you have, that you don't mind your family knowing one
             | day. They don't want people who can be blackmailed.
        
               | swozey wrote:
               | Here it's like this: Don't ever lie to them, "no matter
               | what it is they'll find out."
               | 
               | So, some people don't lie, say they smoked pot in high
               | school and none of them make it to the next step.
               | 
                | I had a twitter convo last year, or pre-X, whenever,
                | with the CTO of some org I can't remember (I don't
                | think CENTCOM, something much smaller) and he mentioned
                | that they've lightened up quite a bit, or at least his
                | program, which was a software engineering group, was
                | more lenient. He was looking for engineers via twitter
                | on his official account.
               | 
               | So maybe that's loosening up here thankfully.
        
           | justsomehnguy wrote:
           | > who will competently do the job for less than 50k, then the
           | cloud is your only move really
           | 
            | Well, there _is_ the other way, but, as we know, that
            | would never ever happen.
        
           | prmoustache wrote:
            | Once you're past the emerging-startup stage, running in the
            | cloud involves as many engineers and as much complexity as
            | running on prem, if you want to follow best practices.
           | 
           | The "let's be managed and only hire developers" is a huge
           | myth. All large organizations involve tons of "cloud
           | engineers" or "devops" depending on how they want to call
           | them and are just sysadmins with a different name and a
           | bigger paycheck.
           | 
            | Having actual datacenters doesn't add a ton of complexity,
            | and datacenters themselves are often managed by people who
            | don't even have an engineer's paycheck. The main difference
            | between being on prem vs cloud is that you have to plan
            | (how many servers/storage/network equipment you have to buy
            | and replace in the following year) and pay for stuff (like
            | space, racks) more in advance, plus take into account
            | delays in delivery. This is where cloud makes the job much
            | faster for companies, but given the slow pace at which
            | government stuff usually happens, I don't think this is a
            | problem for them.
        
           | solatic wrote:
           | > and not being able to find an engineer
           | 
           | Remember it's not just about being able to find one single
           | engineer - then they become key-person risk. You need
           | multiple engineers to be able to handle the loss of that
           | engineer, either temporarily (vacation) or permanently
           | (suddenly hit by a bus). Then you end up having a team of
           | DBAs. Then you have functional rather than feature teams.
           | Then you need multiple managers to align to get anything
           | done, and have internal politics.
           | 
           | Being able to consume databases _as a product_ has non-
           | trivial value.
        
         | ris wrote:
          | GOV.UK PaaS also runs on AWS (for as long as it continues to
          | exist).
        
         | okasaki wrote:
         | All UK businesses run on Oracle and Microsoft, so I'm not sure
         | why you're surprised. They have us by the balls.
        
         | kingkongjaffa wrote:
         | > This is an AWS RDS PostgreSQL database and it lives in the
         | PaaS' AWS account. Our apps that run in the PaaS talk to this
         | database. We are going to call this database our 'source
         | database'.
         | 
         | It already was. Read the article.
        
         | conception wrote:
         | AWS has a lot of pre-audited compliance built into their
         | services. Being able to inherit their certification for
         | services can save an organization a lot of time and effort.
        
           | lnxg33k1 wrote:
            | It's not an organisation, it's a blucking government. It
            | handles citizen data, and it's sending it to a company of a
            | foreign country, because it can't hire some system
            | administrators? A GOVERNMENT? What are they doing? Still
            | looking for their product-market fit and can't afford the
            | headcount? Is it a joke?
            | 
            | EDIT: If they are looking for money I'd like to participate
            | a bit in the seed round.
        
             | kunwon1 wrote:
             | AWS has a G-Cloud for UK just like they have one for US,
             | no?
        
               | snoman wrote:
               | AWS has government specific regions (called GovCloud).
               | Many services or features make it to GovCloud later than
               | other regions because of the certification requirements.
        
               | travem wrote:
               | AWS has US based GovCloud regions: AWS GovCloud (US-East)
               | and AWS GovCloud (US-West). It does not have a UK
               | specific GovCloud region that I am aware of.
        
               | bboygravity wrote:
                | Why can't the UK government build their own cloud?
               | 
               | It's just completely insane to me that they would make
               | the gov internet infrastructure completely
               | (geopolitically) dependent on another country AND just
               | literally give all their (citizens') data away AND pay
               | for that "privilege"?!
               | 
               | I mean if the government can't host the government's
               | websites using tech from the government's country, maybe
               | it would be better to just forget about the whole
               | cyberweb thing altogether? Just turn it off?
        
               | robertlagrant wrote:
               | They'd still be outsourcing to a firm to do this. They
               | wouldn't hire a load of people to do it in-house. See
                | also Fujitsu in the recently-popular Horizon scandal,
                | or the NHS National Programme for IT debacle[0].
               | 
               | [0]
               | https://en.wikipedia.org/wiki/NHS_Connecting_for_Health
        
               | vdaea wrote:
                | Why should they build their own cloud, seeing that it
                | costs more money?
        
               | willsmith72 wrote:
               | you want every government to build their own cloud? what
               | in the world? the whole world is interlinked, should they
               | also manufacture their own government laptops in the UK?
        
               | ris wrote:
               | I don't think you have any idea just how much it costs to
               | run infrastructure at the reliability levels provided by
               | AWS, and just how much investment it would require to get
               | the ball rolling on this.
               | 
               | A lot of people have a very unrealistic picture of what
               | government budgets are like.
        
               | blibble wrote:
               | > I don't think you have any idea just how much it costs
               | to run infrastructure at the reliability levels provided
               | by AWS
               | 
               | my $12/year VPS does better than us-east-1
        
               | hdlothia wrote:
                | Where do you get a 12-dollar-a-year VPS? Hetzner
                | charges me 4 bucks a month and it feels like a steal.
        
               | blibble wrote:
               | https://lowendtalk.com/
               | 
               | quality varies
        
               | bboygravity wrote:
               | I get that it costs a lot.
               | 
               | My point is that NOT hosting it yourself (as a
               | government) costs WAY more in the long run. See my points
               | above.
               | 
                | The same goes for companies in Europe who literally
                | host their trade secrets (designs, sales, the entire
                | company) on US servers (OneDrive, Google Drive, etc).
                | The US is the home of their competitors. Who cares
                | about infrastructure costs if you're PAYING to give
                | your trade secrets away to your competitor(s)?!
        
               | shagmin wrote:
                | I've always wondered how beholden the world is to
                | Microsoft. I was once surprised to learn the US
                | military (and probably virtually all others) doesn't
                | have its own OS to avoid being tied to a particular
                | company.
        
               | 0xbadcafebee wrote:
               | Why can't the UK government build their own cars? Their
               | own boots? Their own pens, paper? How wasteful and
               | pathetic that they wouldn't make all those things
               | themselves. If it's _possible_ to do it yourself, by
                | golly, you _should_ do it yourself, and there's
               | absolutely no reason in the entire world to purchase
               | those things from someone else instead.
        
               | kakoni wrote:
               | You know that UK's National Health Service did a deal
               | with Palantir for "federated data platform"?
        
               | NomDePlum wrote:
                | No. It is only relatively recently (~5-6 years) that
                | AWS has had any data centres in the UK.
               | 
               | That blocked use of AWS for a lot of UK departments due
               | to data sovereignty concerns.
        
               | kunwon1 wrote:
               | I see. I don't use AWS much, I saw this [1] and assumed
               | this was like the US g-cloud.
               | 
                | [1] https://aws.amazon.com/government-education/g-cloud-uk/
        
             | dijit wrote:
             | Sysadmins are cheaper than many people seem to think.
             | 
              | I had a person I trust a lot telling me that "if we go
              | with a bare metal provider like GCore we'd have to hire
              | someone"; his reason for bringing that up was that the
              | cost difference would be justified by _not_ having to
              | hire someone.
              | 
              | However, a GCore EUR400,000/year bill becomes
              | ~EUR6,000,000 if you were to use a public cloud, even
              | with the scaling up and down when not in use (we are an
              | extreme case of needing a lot of dumb, unreliable compute
              | that's geographically distributed).
             | 
              | I can hire a _lot_ of sysadmins for that money, but I
              | probably don't even need _one_, because public clouds
              | also need devops staff to manage the complexity anyway.
        
               | DylanDmitri wrote:
               | The risk is hiring a team of ineffective sysadmins,
               | especially if your organization can't assess sysadmin
               | competence.
        
               | dijit wrote:
               | That would indeed be a risk, but the circular logic of
               | this means no new company could ever have any competence
               | outside of its founders. Which feels shortsighted.
               | 
               | Anyway, I am a former sysadmin. I am confident that I can
               | identify competence in the requisite areas.
        
               | criley2 wrote:
               | Governments tend to be far less competent at determining
               | technical competence. Due to a wide variety of factors,
                | governments tend to be completely uncompetitive in
                | salary for technical positions, meaning they're already
                | hiring from the lower end of the pool (not including a
                | few altruistic folks willing to forgo their market
                | value).
               | 
               | At a company if a department isn't working out you just
               | restructure and move on, but in the government, that team
               | is going to retire in your org and collect pension from
               | you, and there's very little you can do about that.
        
               | dijit wrote:
                | Everything you said seems to also apply to developers
                | and the staff that would manage cloud resources.
               | 
               | Lack of cost control or effective use of a cloud provider
               | leads to spiralling uncontrollable costs.
        
               | rcxdude wrote:
               | yeah, every company I know of that uses cloud has a team
               | responsible for managing it anyway, and they don't seem
               | much smaller than the team needed to manage on-prem. I
               | don't really think this 'savings' exists in most cases.
        
             | belter wrote:
              | If you are worried about that, start with your
              | government's use of Microsoft Office and Windows, which
              | both send MBs of data per minute to a US-based company.
        
             | foofie wrote:
             | > (...) its a blucking government, it handles citizen data,
             | and its sending them to a company of foreign country,
             | because it can't hire some system administrators? A
             | GOVERNMENT? What are they doing?
             | 
              | This is a very good question, and it bears repeating.
              | 
              | It's not a massive database, either: 400GB with 1k
              | inserts/second.
        
             | everfrustrated wrote:
              | The UK relies entirely on the mercy of the USA for its
              | nuclear deterrent (Trident).
              | 
              | For the UK at least, that ship has _long_ since sailed....
        
               | blibble wrote:
                | Trident has UK-built warheads and is operationally
                | independent of the US.
        
             | solatic wrote:
             | > because it can't hire some system administrators?
             | 
             | Spoken like someone who has never worked in the public
              | sector. Hiring can easily take 6+ months due to an
             | ever-increasing list of requirements that government HR is
             | required to fulfill, not least of which is passing a
             | security clearance which takes even more time. The best
             | people on the market rarely have the patience for this.
              | Once your employees do get hired, onboarding can take
             | another few/several/more months, getting various
             | permissions, technical documentation, etc. Everything is
             | out-of-date because making changes requires committee
             | consensus, in a culture that is risk-averse, because nobody
             | notices when you out-perform (after all, the requirements
             | were also set by a committee that doesn't know who you are)
             | but something going wrong is grounds for termination.
             | Public sector work over-relies on hiring contractors
             | precisely to shift blame for failure to the contractors.
             | Managed database services are _excellent_ tools to shift
             | this kind of catastrophic risk of data loss to a contractor
             | /vendor (who is managing the database).
             | 
             | Governments not owning their data isn't due to technical or
             | budgetary limitations - it's strictly cultural.
        
               | NomDePlum wrote:
               | Fully agree with this. I'd also add that a lot of IT is
               | buy not build, in general. That includes support.
               | Particularly true for the public sector and has been in
               | place well before AWS existed.
               | 
               | Outsourcing the complexity to run and maintain a secure
               | reliable database cluster really is making good use of
               | the managed service model.
        
               | Dylan16807 wrote:
               | > Hiring can easily take 6+ months or more
               | 
               | Do you think this move didn't take even longer to plan?
               | 
               | > to shift blame
               | 
               | That reason is much more plausible.
        
           | sph wrote:
            | How would you feel if the US government ran on servers
            | from a European company, one which also works very hard to
            | avoid paying taxes on US soil?
           | 
           | All those reasons to go AWS hold for a private company, not
           | for a government service of a first world country and G7
           | member. AWS has a lot of compliant services, but it's not
            | like they're doing rocket science that one of the top 5
            | richest countries in the world cannot afford to develop or
            | contract within its borders.
           | 
            | The simple reason is that the UK has been on a long trend
            | of selling out to the highest bidder, whether that's US
            | tax-avoiding companies, Chinese firms, or businesses
            | managed by Russian oligarchs. We have chosen AWS for the
            | same reason the Post Office chose Fujitsu.
        
             | freedomben wrote:
             | I would be surprised if they aren't deploying to the London
             | data center, so I would think it _is_ within the UK
        
               | dijit wrote:
                | There's no GovCloud in the UK; unless there are
                | specific terms, the terms of service state that you are
                | licensing either the Irish entity or the American
                | entity to have access to, and dominion over, your data.
               | 
               | I had to spend a lot of time writing my privacy policy
               | (perks of being CTO... yay), and part of that privacy
               | policy was an admission that we transfer ownership of
               | data to a US company (by using public cloud) despite
               | using european datacenters.
               | 
               | This is because our agreement is with a US entity.
        
               | arpinum wrote:
               | The blogpost shows a connection string to eu-west-1 in
               | Ireland
        
               | esskay wrote:
                | eu-west-2 is a bit misleading: most of it is nowhere
                | near London; they've got DCs right up into the
                | Midlands. One of their newer ones, for example, is out
                | in Didcot, Oxfordshire, and they've also got a few up
                | towards Peterborough. All classed as 'London' despite
                | being a fair distance away from it.
        
         | otteromkram wrote:
          | I'd argue that AWS is much better suited than self-hosting
          | _because_ it's such an important service.
         | 
         | Downtime becomes negligible and global reach vastly increases
         | with comparably little cost.
        
         | 0xbadcafebee wrote:
         | That sentence was a little confusing. You're not happy that the
         | government is hiring experts to run an important service?
        
           | dtnewman wrote:
           | Yes. RDS is a very reasonable choice if you are a tech
           | company, let alone a govt org. The alternative isn't "let's
           | host this ourselves" it is "let's host this with Oracle at a
           | much higher cost".
        
             | 15457345234 wrote:
              | The alternative - at government scale - is absolutely
              | 'let's host this ourselves', and that's what they should
              | be doing, to ensure that institutional expertise remains.
              | They should also own and operate their own datacentres,
              | which should be physically secure, not shared with
              | commercial ventures, and guarded by the armed forces
              | rather than civilian security.
        
               | 0xbadcafebee wrote:
               | Why doesn't the government manufacture their own cars?
               | They're going to lose institutional expertise in building
               | cars! They should also own and operate their own
               | manufacturing facilities which should be physically
               | secure, not shared with some 'civilian commercial
               | venture'.
               | 
               | By golly, the government can't do business if it isn't a
               | datacenter operations company, a software vendor, and a
               | car manufacturer.
        
               | 15457345234 wrote:
               | > By golly, the government can't do business if it isn't
               | a datacenter operations company, a software vendor, and a
               | car manufacturer.
               | 
               | I mean, precisely, which is why some countries are
               | rapidly turning into failed states. Too much buck-passing
               | and outsourcing.
        
             | tobias_irmer wrote:
             | It isn't? AWS is crazy expensive and you don't have as much
             | control over things as you may occasionally need. The best
             | decision we took in the past few years with regards to
             | infrastructure was moving away from AWS and doing
             | everything ourselves.
             | 
             | On RDS we had inexplicable spikes in cost, deteriorating
             | support and no real support for any of our issues. When we
             | tried using DMS, it just didn't work as expected, even
             | after spending two days on the phone with their support.
        
           | 15457345234 wrote:
           | > is hiring experts
           | 
            | 'Moving to AWS' (or any cloud provider) is not 'hiring
            | experts'; it's just outsourcing the risk to an entity that
            | you, in the event of a genuine crisis, have no leverage
            | over beyond 'we're going to stop paying you (once we
            | migrate away from you, which will take ten years)'.
        
             | 0xbadcafebee wrote:
             | AWS are not experts at providing computing services? Holy
             | cow. This is news to me! I thought they were the most
             | popular and highly regarded computing infrastructure and
             | PaaS company in the world, managing both hardware and
             | software and providing subject matter experts to work with
             | customers on architecture and implementation, along with
             | official partners and a marketplace of turn-key products.
             | 
             | Boy, am I embarrassed! I need to start building my own
             | datacenter right away, all my shit is on AWS!!!
        
               | rcxdude wrote:
               | They may be experts, but you sure aren't hiring them. You
               | are renting something from them.
        
         | pixelesque wrote:
         | As other comments point out, their own platform was (at least
         | in terms of DB) already running on AWS, just using a different
         | account.
        
           | foofie wrote:
           | > As other comments point out, their own platform was (at
           | least in terms of DB) already running on AWS, just using a
           | different account.
           | 
            | That changes nothing. It just means this unjustifiable
            | nonsense has been going on for a while.
        
         | pmcp wrote:
          | As somebody who has worked for the European Commission and a
          | European national government, I agree with your sentiment,
          | but the harsh reality is that government divisions generally
          | work on a shoestring budget when it comes to decisions like
          | these. I wouldn't be surprised if this was a "best effort
          | given the circumstances" move.
        
         | _joel wrote:
          | There's an absolute ton of stuff on AWS. There used to be
          | G-Cloud, which allowed smaller clouds to tender for
          | government contracts, but there was a big pull to AWS, at
          | least from my experience with it.
        
         | ivix wrote:
         | I'm surprised that you're surprised. Why on earth would
         | government not be migrating to the cloud?
        
         | NomDePlum wrote:
         | Why?
         | 
         | I've worked on a number of UK government projects, including
         | some with particularly sensitive security and data
         | requirements.
         | 
          | Having some knowledge of their on-prem data centres and of
          | the UK cloud offering they have also used: moving to AWS has
          | so many operational, security and resilience benefits that
          | aren't available elsewhere. It's not a free lunch by any
          | means and certainly needs thought and governance, but the
          | procurement simplification benefits alone make going to the
          | public cloud a no-brainer for a lot of government services.
          | 
          | It is worth knowing that even the on-prem data centres are
          | usually operated by 3rd parties such as HP, BT and IBM. There
          | was an initiative to have "Crown-managed" data centres but
          | it's not particularly scalable.
        
         | overstay8930 wrote:
          | If you saw how non-tech companies run datacenters... well,
          | let's just say they're not exactly working with NATO like the
          | big 3 cloud providers do when designing their DCs and
          | backbone.
         | 
         | Honestly you should be frightened when you see someone NOT
         | using a cloud provider, because it is hard work to properly run
         | and secure a datacenter. Even Equinix fucks up HARD regularly
         | and they are considered the gold standard (shout out to those I
         | saw at 350 E Cermak over the weekend).
        
         | msla wrote:
         | It incentivizes public-private cooperation: If the government
         | cracks down on Amazon, Amazon turns off the government's AWS
         | accounts and deletes the data. The government finds that
         | subpoenaing a wiped hard drive is utterly nugatory, and thereby
         | learns humility.
        
         | mobilemidget wrote:
         | "The PaaS team offered us the ability to migrate databases
         | using AWS Database Migration Service (DMS)."
         | 
          | And I wouldn't be surprised if they got some kickback,
          | discount, etc. in some way to promote AWS on their blog. Not
          | claiming it's so, but I would not be surprised at all. It
          | reads as one big advertisement.
        
       | mbb70 wrote:
        | Interesting to compare this to
        | https://knock.app/blog/zero-downtime-postgres-upgrades
        | discussed here https://news.ycombinator.com/item?id=38616181
       | 
       | A lot of the discussion boiled down to 'this is a lot of
       | complexity to avoid a few minutes of downtime'. I guess this is
       | the proof, just use AWS Data Migration Service, swap the DNS
       | entries to go live and live with 11 seconds of downtime.
        
         | ris wrote:
         | There is no "just" about it. The absolute key takeaway is in
         | "what we learned":
         | 
         | > We chose to use DMS because it was well supported by the
         | GOV.UK PaaS and we could also get support from AWS. If we were
         | doing a PostgreSQL to PostgreSQL database migration in the
         | future, we would invest more time in trying alternative tools
         | such as pglogical. DMS potentially added more complexity, and
         | an unfamiliar replication process than what we may have found
         | with other tools. This backs up what AWS say themselves on
         | PostgreSQL to PostgreSQL migrations.
         | 
         | The message here is _not_ "just use DMS".
        
           | T-Winsnes wrote:
            | Even AWS in their own docs say to use the native tools
            | when migrating from postgres to postgres[1]. They don't go
            | into the details too much and point to pg_dump rather than
            | pglogical, but it's interesting to see that they don't
            | recommend using DMS for it.
           | 
            | [1] https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source...
        
             | ris wrote:
              | They _do_, but those recommendations are buried quite deep
             | in the documentation, well behind all the marketing guff
             | that suggests that DMS is all things to all people, and
             | wonderful magic that is ideal for all situations.
        
         | btown wrote:
          | Has anyone used
          | https://cloud.google.com/database-migration/docs/postgres/qu...
          | to do something like this? Does it work similarly to AWS DMS?
        
         | postpawl wrote:
         | There are a lot of gotchas with using DMS (which seems to use
         | pglogical under the hood). Since it's not hardware-level
         | replication, you can run into issues with large
         | rows/columns/tables and it doesn't really handle foreign keys.
         | It may not handle some special data types at all. You also need
         | to update the sequences after the migration or you'll get
         | errors about duplicate primary keys. You can also have issues
         | if you don't have proper primary keys, because it doesn't
         | always copy the entire row at once.
         | 
         | If the databases are within the same AWS account, it's likely
         | easier to use hardware-level replication with global database
         | or snapshots to do migrations if you're ok with 4-5 mins of
         | downtime.
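          | 
          | To illustrate the sequences gotcha: after a DMS copy, each
          | serial column's sequence can be bumped past the table's
          | current max. A minimal sketch with psycopg2 (the table names
          | and "id" column are hypothetical):
          | 
          |   import psycopg2
          |   
          |   # Hypothetical tables whose serial "id" columns DMS copied.
          |   TABLES = ["users", "notifications"]
          |   
          |   conn = psycopg2.connect("dbname=target")
          |   with conn, conn.cursor() as cur:
          |       for t in TABLES:
          |           # setval() moves the sequence past the highest
          |           # migrated id so new inserts don't collide.
          |           cur.execute(
          |               f"SELECT setval(pg_get_serial_sequence('{t}', 'id'),"
          |               f" COALESCE((SELECT MAX(id) FROM {t}), 1))"
          |           )
          |   conn.close()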
        
           | Twisell wrote:
            | There are many options available with PostgreSQL; you
            | could also do a physical full backup + WAL-level
            | replication to keep in sync AND get low downtime.
            | 
            | What might have oriented their choice is that they wanted
            | to upgrade from major version 11 to 15 during the
            | migration process. This is only possible using logical
            | replication. Otherwise you'd have to chain the upgrade
            | process through each major version (and possibly the OS,
            | because 11 is EOL on some architectures), and that is
            | neither trivial nor quick.
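            | 
            | A minimal sketch of that logical-replication route with
            | psycopg2, assuming the schema already exists on the
            | new-version target (the DSNs are placeholders):
            | 
            |   import psycopg2
            |   
            |   src = psycopg2.connect("host=old-pg11 dbname=app")
            |   dst = psycopg2.connect("host=new-pg15 dbname=app")
            |   # CREATE SUBSCRIPTION can't run in a transaction block.
            |   src.autocommit = True
            |   dst.autocommit = True
            |   
            |   # Publish all tables on the old-version source...
            |   src.cursor().execute(
            |       "CREATE PUBLICATION migrate FOR ALL TABLES")
            |   
            |   # ...and subscribe from the new-version target; this
            |   # copies existing rows, then streams changes until
            |   # cutover.
            |   dst.cursor().execute(
            |       "CREATE SUBSCRIPTION migrate "
            |       "CONNECTION 'host=old-pg11 dbname=app' "
            |       "PUBLICATION migrate")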
        
       | igammarays wrote:
       | Now all we need is for Amazon to come out with a "government-as-
       | a-service" product for your budding nation state.
        
         | ris wrote:
         | FWIW GOV.UK Notify is part of a suite of services offered by
         | GDS to UK public sector bodies (along with GOV.UK Pay and
         | GOV.UK PaaS) that was originally known as "Government As A
         | Platform".
        
           | robin_reala wrote:
           | ...and the platform is all open source, so people are free to
           | fork / steal.
           | https://github.com/orgs/alphagov/repositories?q=paas
        
         | hinkley wrote:
         | In partnership with pinkerton.com
        
       | londons_explore wrote:
        | There are various ways to 'pause' incoming postgres queries,
        | for example using pgbouncer - i.e. don't fail them, simply
        | delay them until the replication has caught up and then let
        | them continue on the new database.
       | 
       | If anything goes wrong and replication doesn't catch up, you can
       | unpause and let those queries happen on the old database.
       | 
        | Therefore, your 11 seconds of downtime becomes 0 to 11 seconds
        | of added page load time. More importantly, for the thousands
        | of users of the database who have never seen a query fail
        | before and might have buggy error-handling codepaths, or could
        | have a single failed query ruin a whole batch job, this
        | approach causes a lot less collateral damage.
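        | 
        | A minimal sketch of that pause-and-cutover pattern against
        | pgbouncer's admin console (the "pgbouncer" virtual database;
        | the DSN and the repoint step are placeholders):
        | 
        |   import psycopg2
        |   
        |   def repoint_to_new_database():
        |       # Hypothetical: rewrite pgbouncer.ini / flip DNS here.
        |       pass
        |   
        |   # Admin commands don't run in transactions: use autocommit.
        |   admin = psycopg2.connect(
        |       "host=pgbouncer-host dbname=pgbouncer user=admin")
        |   admin.autocommit = True
        |   cur = admin.cursor()
        |   
        |   cur.execute("PAUSE")  # drain in-flight queries, hold new ones
        |   try:
        |       repoint_to_new_database()
        |       cur.execute("RELOAD")  # pick up the new server address
        |   finally:
        |       cur.execute("RESUME")  # release the held queries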
        
         | whartung wrote:
          | It's one thing to pause queries, but can you also pause
          | transactions that are in flight? How does that work?
        
       | londons_explore wrote:
        | Note that the enemy of low/zero-downtime migrations like this
        | is long-running queries.
        | 
        | I.e. a single update query which takes 30 mins.
        | 
        | You either have to kill and roll back that query, or suffer 30
        | mins of unavailability.
        | 
        | As far as I know, there is no way to migrate a currently
        | in-progress query.
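        | 
        | For what it's worth, a sketch of the "kill and roll back"
        | option: terminate anything running longer than a threshold
        | just before cutover (the 5-minute threshold is arbitrary, and
        | this needs superuser or pg_signal_backend rights):
        | 
        |   import psycopg2
        |   
        |   conn = psycopg2.connect("dbname=app")
        |   with conn, conn.cursor() as cur:
        |       # pg_terminate_backend() rolls back whatever the
        |       # backend was doing.
        |       cur.execute("""
        |           SELECT pid, now() - query_start AS runtime,
        |                  pg_terminate_backend(pid)
        |           FROM pg_stat_activity
        |           WHERE state = 'active'
        |             AND now() - query_start > interval '5 minutes'
        |             AND pid <> pg_backend_pid()
        |       """)
        |       for pid, runtime, _ in cur.fetchall():
        |           print(f"terminated {pid} after {runtime}")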
        
         | Amezarak wrote:
         | It's hard for me personally to imagine a 30-minute update query
         | that is not written extremely inefficiently, or else a one-time
         | huge data migration.
         | 
          | There are a lot of the former in the wild, to be sure. I've
          | had a lot of fun turning minutes-to-hours into milliseconds.
          | :)
        
           | londons_explore wrote:
           | Data scientists... And when you look at the query, it'll be
           | 800 lines of SQL...
        
             | hobs wrote:
             | Heh, that's not so bad - 2 jobs ago I had to fix a KPI
             | generation process that was 70k lines of dynamic sql, that
             | unwrapped to up to 1m LOC :)
        
               | chaz6 wrote:
               | Did you ever run into a hard limit of statement size?
        
               | hobs wrote:
                | No, but all the code generation was happening in the
                | server itself so I don't know how hard and fast it
                | applies; iirc SQL Server's limits are like 64k x packet
                | size.
        
         | necovek wrote:
         | For a software engineering project, you probably want to limit
         | your transactions to much less than that (set statement_timeout
         | is your friend). If you've got extremely long transactions, you
         | can probably avoid doing the switch-over when they run
         | (hopefully they are not a random occurrence but a result of a
         | scheduled job or similar).
         | 
         | In combination with transaction time limit and fail-over
         | configuration (where you fail the old primary), you can control
         | the slowdown (instead of downtime, eg. with pgbouncer) very
         | precisely.
         | 
         | I would be more concerned with the DNS TTL being respected in
         | the entire stack (and external caching DNS servers you rely
         | on), tbh.
         | 
          | But it is usually not critical to avoid a dozen seconds of
          | downtime for an app, so whatever is simpler for you should be
          | your go-to solution.
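          | 
          | A sketch of those knobs (the 30s value and the app_user role
          | are arbitrary examples):
          | 
          |   import psycopg2
          |   
          |   conn = psycopg2.connect("dbname=app")
          |   with conn, conn.cursor() as cur:
          |       # Per-session cap: statements over 30s are cancelled.
          |       cur.execute("SET statement_timeout = '30s'")
          |       # Or persist a default for every new session of a
          |       # given role.
          |       cur.execute(
          |           "ALTER ROLE app_user SET statement_timeout = '30s'")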
        
         | callalex wrote:
          | Well, you've certainly introduced a teaching moment to me!
          | What is the nature of the writes you deal with that last 30+
          | minutes? What kind of data/what kind of people are involved
          | with such DB writes, where you need to rely on the DB engine
          | to work so hard instead of something more split up by queues
          | at a higher layer?
        
           | londons_explore wrote:
           | It's usually badly engineered things. However, just because
           | it's badly engineered doesn't mean it's fine for it to break
           | :-P
           | 
           | Things like a super complex n^3 complexity query to find all
           | users who were 3 hops away from a known fraudster, where a
           | 'hop' means 'shared any piece of account data in common'
        
             | chaz6 wrote:
             | Would that be more suited to a graph database such as
             | Neo4j?
        
           | londons_explore wrote:
           | Postgres queries are streaming. Ie. If the query result has 1
           | billion rows, but the client only has enough ram for 1
           | million rows, then the query will be slowed down until the
           | client is ready to accept more rows.
           | 
           | If the client is slowly processing through the result set,
           | then the query can take many minutes/hours.
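            | 
            | That's the behaviour you get with a server-side cursor,
            | e.g. psycopg2's named cursors; a sketch (the events table
            | and the slow consumer are hypothetical):
            | 
            |   import psycopg2
            |   
            |   def process(row):
            |       pass  # stand-in for slow client-side work
            |   
            |   conn = psycopg2.connect("dbname=app")
            |   # A named cursor is server-side: rows come over in
            |   # batches of itersize, and the query stays open for as
            |   # long as the iteration takes.
            |   cur = conn.cursor(name="stream")
            |   cur.itersize = 10_000
            |   cur.execute("SELECT * FROM events")
            |   for row in cur:
            |       process(row)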
        
       | Topgamer7 wrote:
        | We tried to work with DMS to migrate MySQL to Postgres, and it
        | was a nightmare. Support was useless and would often just not
        | get back to us without prodding, then give us canned responses
        | unrelated to our queries. The whole thing is nigh on
        | un-debuggable.
       | 
       | Stay away.
        
         | rjh29 wrote:
          | Tried it for MySQL and it was flat-out broken and silently
          | corrupted data.
        
           | Topgamer7 wrote:
           | For us it seemed to be trying to load data from the wrong
           | table on mysql, into the wrong table on postgres.
        
         | mey wrote:
         | I worked on migrating our MySQL system to PostgreSQL using
         | pgloader ( https://pgloader.io/ ).
         | 
          | There were some hiccups, things that needed clarification in
          | the documentation, and some additional processes that had to
          | be done outside of the system to get everything we needed in
          | place, but it was an amazing help. Not sure the project
          | would've been possible without it.
          | 
          | Data mapping from PostgreSQL to PostgreSQL as in the article
          | isn't nearly as bad as going between systems. We took a full
          | extended outage and didn't preload any data. There were many
          | dry runs and validation beforehand, but the system wasn't so
          | mission-critical that we couldn't afford to shut it off for
          | a couple of hours.
        
           | Topgamer7 wrote:
            | We also ended up using pgloader. It's not without its
            | friction either. For example, the concurrency/number-of-
            | rows setting seems broken out of the gate, and it's like
            | playing darts with a blindfold on to get it to run without
            | running out of memory. But being able to read the source
            | and the GitHub issues/actions meant that, overall, at
            | least I could debug my problems or find others who had the
            | same issue.
            | 
            | Would recommend pgloader.
        
         | Spivak wrote:
         | This was my experience as well. We thought it was a nice
         | managed way to move pg to pg with minimal setup but we ran into
         | so many issues we just did logical replication with bespoke
         | fixes for things it didn't handle well.
        
         | NomDePlum wrote:
          | Surprised by that. I've used AWS DMS quite a lot to do both
          | on-prem to AWS and AWS (MySQL) to AWS Postgres migrations,
          | plus long-term ongoing replication. Whilst there are some
          | complexities/gotchas there, it's always been more than up to
          | the task. It takes a little bit of validation/testing to
          | understand, but it's very well documented too.
         | 
         | What sort of issues did you hit? In all honesty I'm not sure
         | I've been more impressed by another AWS service.
        
           | tibbon wrote:
            | One issue we hit was that any schema change totally messed
            | it up. I don't have my notes in front of me, but we were
            | constantly hitting data that wouldn't migrate, or things
            | that suddenly broke whenever anything changed.
        
             | NomDePlum wrote:
              | Interesting. We managed several schema changes when
              | using AWS DMS to replicate data over a long period
              | between MySQL and Postgres clusters.
              | 
              | We treated these carefully and tested as we made them,
              | but never had any real issues with them. From memory,
              | DMS could cope with adding columns pretty transparently.
              | One setting we invested in configuring and understanding
              | allowed DMS to attempt to recover from replication
              | failures: it could error on DDL changes and attempt to
              | recover. This usually involved restarting the task, but
              | it would do so transparently as part of the recovery.
        
           | Topgamer7 wrote:
            | It seemed to try to load data into the wrong table on
            | postgres; that's the one that immediately comes to mind.
            | Honestly, poor support is what really killed it for us.
            | But we had other technical problems with it.
           | 
           | We burned 3 weeks just trying to get support to provide a
           | sensible response. I never got the sense anyone replying to
           | us knew any more than the surface level about the
           | infrastructure of how DMS worked.
        
             | NomDePlum wrote:
             | We kept our table mappings pretty much mirrors of the
             | source tables. Any data transformation we managed on the
             | target cluster, not through AWS DMS.
             | 
             | I've used it fairly frequently over a number of years so
             | maybe the issues we hit on the learning curve have dimmed.
             | 
             | We also deliberately kept things as simple as possible at
             | first and focused on DMS's ability to move the data from
             | source to target and really only tweaked settings that
             | increased security or reliability. We stayed well away from
             | any of the data transformation or more complex
             | functionality.
        
         | extesy wrote:
         | I can confirm that. Both DMS reliability and support quality
         | were terrible.
        
         | tibbon wrote:
          | We also tried to use DMS for a few things (Rackspace-to-AWS
          | migration, replication out to data lakes, etc.) and it was
          | consistently undersupported and buggy, and it ate months of
          | time before we went to other solutions. While a lot of AWS
          | support has been good, not for DMS. It feels entirely
          | half-baked.
        
         | troublebucket wrote:
          | +1. DMS is very half-baked: silent, undebuggable errors and
          | tons of unsupported LOB and CLOB data types. Built-in
          | Postgres logical replication is way easier.
        
       | karol wrote:
       | Hope the picture is illustrative only.
        
       | aidos wrote:
       | I love that where the rubber meets the road it doesn't matter if
       | you're a government organisation or a web agency, it's still just
       | plain old Postgres, python, sqlalchemy and dns with a low ttl.
        
         | NegativeK wrote:
         | To be fair, there's plenty of government orgs at the bottom and
         | web agencies at the top.
        
       | Ozzie_osman wrote:
       | We did a similar migration (somewhat larger database) with ~20
       | seconds of downtime and much less work... using the magic of AWS
       | RDS Blue-Green Deployments [1]. Surprised they aren't mentioned
       | in the thread yet.
       | 
       | Basically, you spin up a new Blue Green deployment with any
       | desired changes (in our case, we were upgrading Postgres major
       | from 13 to 15). While your blue configuration continues to serve
       | traffic, AWS uses logical replication to keep the "green"
       | deployment in-sync. You can keep modifying (or testing) the
       | "green" deployment (eg you could load test it if you wanted to),
       | as long as you don't do any writes to it (writes still have to go
       | to your live, blue configuration, and are replicated to green).
       | 
       | When you're ready, you run the "switch" command, and AWS does a
       | few things for you: run checks to ensure blue/green are in sync,
       | stops writes and connections, waits a few seconds to ensure
       | replication is caught up, renames your database, then allows
       | connections/writes again. We had less than 20 seconds of
       | downtime, by our count. And, we had a primary and several read
       | replicas and AWS successfully switched the full configuration
       | over with no hiccups. You don't even need to switch your
       | configuration because AWS swaps the database URLs for you. Green
       | becomes blue, blue becomes old blue, and when you're ready, you
       | delete "old blue".
       | 
       | Highly recommend! They do have some restrictions (for instance,
       | not sure if it would work if you're switching accounts, etc).
       | 
       | 1.
       | https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/blue-...
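        | 
        | For reference, the same flow through boto3 looks roughly like
        | this (the name, ARN and versions are placeholders; check the
        | current API reference before relying on exact parameters):
        | 
        |   import boto3
        |   
        |   rds = boto3.client("rds")
        |   
        |   # Create the green environment as a logical replica of
        |   # blue, upgraded to the target engine version.
        |   resp = rds.create_blue_green_deployment(
        |       BlueGreenDeploymentName="pg-13-to-15",
        |       Source="arn:aws:rds:eu-west-2:111122223333:db:source-db",
        |       TargetEngineVersion="15.4",
        |   )
        |   bgd_id = resp["BlueGreenDeployment"][
        |       "BlueGreenDeploymentIdentifier"]
        |   
        |   # ...test the green side, then cut over: RDS blocks writes,
        |   # waits for replication to sync, and swaps the endpoints.
        |   rds.switchover_blue_green_deployment(
        |       BlueGreenDeploymentIdentifier=bgd_id,
        |       SwitchoverTimeout=300,
        |   )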
        
         | sgarland wrote:
         | +1 for B/G. That said, I imagine they couldn't use it due to
         | the cross-account shift. I've used it for both MySQL (with much
         | higher QPS than TFA, by two orders of magnitude) and Postgres,
         | and they both went flawlessly.
         | 
         | Read the docs, especially the limitations. Re-read them. Do a
         | test run in the dev environment, under load. Do it again in
         | staging.
         | 
         | Or just YOLO into prod 'cause it'll probably be fine, I guess.
        
           | Ozzie_osman wrote:
           | We definitely did a dry run with a parity configuration a
           | couple nights before. It gave us a lot more confidence.
        
         | troublebucket wrote:
         | Has anyone encrypted the storage on a previously unencrypted
         | RDS using Blue/Green?
        
           | d1str0 wrote:
           | Good question. This was a pain point for my small team (me,
           | myself, and I) a little while back. We had an unencrypted db
           | deployed with CDK, and then tried to get it encrypted without
           | losing data.
        
           | bgschulman31 wrote:
            | We recently did this on my team over Christmas. We opted
            | not to use Blue/Green for this, but instead spun up an
            | encrypted snapshot and began replication from the old
            | database to the new one using MySQL's tools. Once traffic
            | on our platform was sufficiently low, we cut connections
            | to the database, waited for replica lag to reach 0 and
            | relaunched the servers with the new database's host info.
            | Our downtime was around a minute.
        
           | wisosim wrote:
            | We did the exact same thing not too long ago, but we
            | couldn't do it using blue/green. We were able to minimize
            | downtime quite
           | a bit, but it was on the order of minutes, not seconds. I
           | wrote a little bit about the process here. I spent a lot of
           | time prepping and running the migration, so happy to share
           | any details if it's helpful.
           | 
            | https://phizzle.space/dbadmin/aws/postgres/2023/12/30/rds-en...
        
         | cljacoby wrote:
         | I also used RDS Blue/Green deployment to apply a MySQL major
         | engine version upgrade from 5.7 to 8.0. With respect to
         | downtime it worked fantastically, I think we measured 13
         | seconds of observable downtime from the API.
         | 
         | However we did learn the hard way that RDS Blue/Green cannot be
         | used to apply arbitrary changes. In our case, we discovered RDS
         | Blue/Green can only be used to move up engine versions, not
         | down.
         | 
          | We discovered that on MySQL 8.0 one of our stored procedures
          | had very occasional failures, and considered the option of
          | using RDS Blue/Green again to move back down to 5.7. Turns
          | out that's not an option.
        
           | Mavvie wrote:
           | I wonder if that could be because MySQL 8's replication is
           | backwards compatible but MySQL 5.7's isn't forwards
           | compatible. If so, it makes sense that you're only able to
           | move forward.
        
           | 0xbadcafebee wrote:
           | Database changes are typically one-way. If your new change
           | includes creating or modifying a table, such that there are
           | new additional columns, and you populate those with data,
           | then downgrading would destroy the changed columns and the
           | data in them. Hence you can't downgrade once you upgrade or
           | you'd potentially be breaking things. To downgrade safely
           | you'd need to backup or snapshot the old database, and then
           | restore your database back to the backup/snapshot, but that's
           | not blue/green.
        
             | fulafel wrote:
             | DB schema migration script frameworks (at least in Python,
             | Ruby & Java lands) do typically support both upgrade and
             | downgrade directions. People skip implementing and testing
             | the downgrade side if the development model doesn't need it
             | but the problem of what happens to the data is controlled
             | by what you put in the "down" migration script.
             | 
             | I'd guess if you can't throw the data away, you won't do a
             | down migration, you'll do an up migration that changes the
             | db to save that data in your preferred way before undoing
             | or reworking the previous schema change.
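              | 
              | E.g. in Alembic terms, the "down" script decides the
              | data's fate; a minimal sketch (the table and column are
              | hypothetical):
              | 
              |   from alembic import op
              |   import sqlalchemy as sa
              |   
              |   def upgrade():
              |       op.add_column(
              |           "users", sa.Column("nickname", sa.Text()))
              |   
              |   def downgrade():
              |       # Destructive by choice: anything written to the
              |       # column since the upgrade is lost. A data-
              |       # preserving "down" would copy it aside first.
              |       op.drop_column("users", "nickname")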
        
               | yellowsir wrote:
                | But now you have 2 up paths, and migrations are
                | critical; I would avoid it where possible!
        
               | patmorgan23 wrote:
                | Database migrations are always "fail forward": if
                | there's an error, you figure out what it was and fix
                | it.
        
               | eropple wrote:
               | _> DB schema migration script frameworks (at least in
               | Python, Ruby  & Java lands) do typically support both
               | upgrade and downgrade directions._
               | 
               | They do, and in every shop I've ever been in these are
                | considered a trap precisely because they _don't_
               | consider data loss.
               | 
               | Always roll forward. If you have to change migration
               | history, restore a backup and lament past you's hubris.
        
               | evanelias wrote:
               | This is solved more cleanly in declarative schema
               | management systems, where you have a schema repo of
               | CREATE statements, and the tool can auto-generate the
               | correct DDL. You never need to write any migrations at
               | all, up or down. If you need to roll back, you use `git
               | revert` and then auto-generate from there. The history is
               | in Git, and you can fully leverage Git like a proper
               | codebase.
               | 
               | A key component is that the schema management tool must
               | be able to detect and warn/error on destructive changes
               | -- regardless of whether it's a conceptual revert or just
               | a bad change (i.e. altering a column's data type in a
               | lossy way). My declarative tool Skeema [1] has handled
               | this since the first release, among many other safety
               | features.
               | 
               | That all said, schema changes are mostly orthogonal to
               | database version upgrades, so this whole subthread is a
               | bit different than the issue discussed several levels
               | above :) The root of the blue/green no-rollback-after-
               | upgrade issue discussed above is that MySQL logical
               | replication officially supports older-version-primary ->
               | newer-version-replica, but not vice versa. Across
               | different release series, the replication format can
               | change in ways that the older version replicas do not
               | understand or support.
               | 
               | [1] https://github.com/skeema/skeema
        
               | fulafel wrote:
               | I have two theories where people end up wanting them:
               | 
               | (1) Circumstances that for some reason enforce the
               | requirement that sysadmin type ppl always have to be able
               | to downgrade / roll back deployments without "the
               | developers" producing new builds or sw artifacts. A
               | separation of ops and dev teams, where you decide you
               | need to survive an inability to make or procure new
               | software builds on demand, and just dig up old deployment
               | artifacts to use after down migration. There are a lot of
               | wrong reasons to do this in inhouse sw settings, but also
               | I guess the classic "we bought a 3rd party server app and
               | plugged it into our onprem database", like Jira or
               | something.
               | 
               | (2) Systems that are technically unable to recover from
               | errors happening in db migrations (missing transactional
               | schema change feature in db and/or application doing db
               | related stuff that can't be rolled back at deployment
               | time). So the down migration is more like a hand coded
               | rollback for the migration that will be automatically run
               | in the failure case of a deployment.
               | 
               | In both cases I can see how the "what happens to data in
               | new columns" situation might still work out. In the (2)
               | case it's sort of obvious, there's no new data yet. In
               | the (1) case you live with it or choose the backup
               | restore path - I can see scenarios where you decide it'd
               | be much worse to restore from backup and lose people's
               | entered data for a couple of days, or however long it took
               | to find the showstopper for the upgrade, vs run the down
               | migration and just lose new-feature-related data. (Which
               | you could also rehearse and test beforehand with backups.)
        
               | magicalhippo wrote:
               | Our in-house schema migration tool supports downgrading,
               | but it won't remove non-empty tables or columns etc.
               | 
               | For us this isn't a big deal though because we're writing
               | our software so it should be able to function as expected
               | on a DB with a newer schema. This makes upgrades much
               | easier to handle as users can run new and old software
               | side-by-side.
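               | 
               | For example (hypothetical table), an additive change old
               | clients can ignore versus the destructive reverse the
               | tool would refuse while data is present:
               | 
               |   -- upgrade: old software keeps running, it never
               |   -- touches the new column
               |   ALTER TABLE orders ADD COLUMN priority int;
               | 
               |   -- downgrade: only allowed once the column is empty
               |   -- ALTER TABLE orders DROP COLUMN priority;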
        
         | yeswecatan wrote:
         | How did you go about stopping and restarting applications which
         | reach out to the database? We have a number of tasks running in
         | ECS which can take a minute to spin down and a few minutes to
         | spin back up.
        
           | ThePowerOfFuet wrote:
           | Presumably you don't stop them, and they throw errors during
           | the cutover.
        
           | todd3834 wrote:
           | You aren't supposed to have to change anything in the
           | application code. The same database URL should work.
        
           | Ozzie_osman wrote:
            | For our web service, we didn't stop anything. There were a
            | few seconds of errors, though it seems like some sessions
            | were just buffered or paused and experienced high latency.
           | 
           | We also had background worker services. For the very high
           | throughput ones, we spun down the # of tasks to a bare
           | minimum for <5 minutes and let the queue build up, rather
           | than have a massive amount of errors and retries. For the
           | other ones where throughput wasn't high, we just let them be,
           | and during the downtime they errored and retried and the
           | retries mostly succeeded.
        
         | benmorris wrote:
          | I used this about 2 months ago going from MySQL 5.7 -> 8.0.
          | Really awesome feature.
        
         | sharadov wrote:
          | Blue/green is new, so I'm guessing most folks don't know about
          | it. It's the way to get close-to-zero-downtime upgrades.
          | 
          | I'm waiting for them to support upgrades from RDS Postgres to
          | Aurora.
        
         | shayonj wrote:
          | +1 for Route53 Groups and B/G setups. We did something similar
          | with PG upgrades: no downtime, using AWS R53 groups and
          | retrying in-flight queries with a custom Rails ActiveRecord
          | transaction patch.
         | 
         | Trade off: For a few seconds some requests were slower.
         | 
         | DNS Groups w/ retries is a nifty mechanism for these things.
         | 
         | Tool used: https://github.com/shayonj/pg_easy_replicate
        
         | crucialfelix wrote:
         | I'm planning to do this soon. Is there any problem skipping
         | versions when upgrading? I would like to go from Postgres 12 to
         | 16. Does Postgres handle that without worries?
        
       | ComputerGuru wrote:
       | > The second change was to create a DNS record in AWS Route53 for
       | `database.notifications.service.gov.uk` with a 1 second TTL (time
       | to live) [..] our migration script just needed to update the DNS
       | weighting in AWS to 100% of results being sent to the target
       | database location and wait 1 second for the TTL to expire. Then,
       | when our apps next try to query our database they will be
       | querying our target database.
       | 
       | Wait. Their (or Python's default?) db orm blocks while it
       | performs a DNS lookup for each and every query!? It doesn't cache
       | resolved addresses for _any_ length of time? No connections are
       | pooled and reused?
        
         | necovek wrote:
          | It would probably be the OS' `getaddrinfo` or `gethostbyname`
          | that does this: Python rarely reimplements system-level calls,
          | which means it relies on the system's configuration.
         | 
         | If TTL of 1s was respected, they would be cached for 1s, but
         | it's not uncommon for DNS query libraries and especially
         | caching DNS servers to not fully respect TTL anyway: tbh, that
         | might explain some of the downtime they've seen.
        
           | ComputerGuru wrote:
           | I didn't mean it was directly implementing the networking
           | call to the dns server -- just that it wasn't directly
           | caching the result.
           | 
            | getaddrinfo(3) and gethostbyname(3) don't implement caching,
            | at least not explicitly in the spec and not normally in
            | practice. On Windows, DNS results are cached by the OS, but
            | on Linux that would be distro-dependent behavior and usually
            | requires setting up a local caching DNS resolver (Ubuntu
            | ships systemd-resolved out-of-the-box; other choices include
            | nscd and dnsmasq).
           | 
            | Even if they implemented caching at the syscall level, this
            | still assumes no connection stays open for more than 1s or
            | is reused across queries. It seems like a big assumption (at
            | least I _hope_ it is, because I certainly wouldn't want my
            | app to initialize a new db connection, let alone perform a
            | DNS lookup, for every query).
        
             | juggertao wrote:
             | They mentioned they had a script which terminated all
             | connections to the old database then changed the password.
             | 
             | But on the app side you typically don't cache DNS, that
             | creates other problems like stale DNS.
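              | 
              | The kill-and-fence step is simple enough in SQL (role
              | name invented):
              | 
              |   -- change the password so nothing can reconnect, then
              |   -- kick the sessions that are still attached
              |   ALTER ROLE app_user WITH PASSWORD 'new-secret';
              |   SELECT pg_terminate_backend(pid)
              |     FROM pg_stat_activity
              |    WHERE usename = 'app_user'
              |      AND pid <> pg_backend_pid();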
        
       | ClassicOrgin wrote:
       | Minor detail but did anyone else notice they are using eu-west-1?
       | Kind of weird for the UK to be hosting their sites in another
       | country (Ireland). I'm sure this isn't super sensitive data but
       | still.
        
         | lozenge wrote:
         | The London AWS region was missing a lot of features for a long
         | time. It's easier to just go with a big region.
        
           | justinclift wrote:
           | "Lets go with the easier approach" is a bit worrying when it
           | means potentially breaking security assumptions / legal
           | assurances.
           | 
           | Though I guess it could be explained away by having something
           | in the website's "Terms and Conditions for use". ;)
        
             | juggertao wrote:
             | It's a gov website. What are they going to do, sue
             | themselves?
        
               | another2another wrote:
               | Have a public enquiry, with lots of tea.
        
         | esskay wrote:
         | They've been on AWS since before eu-west-2 was added, it's
         | still not at the scale of eu-west-1. Capacity issues are still
          | a thing (but getting much better), and only recently have they
          | reached something close to feature parity.
        
         | callalex wrote:
         | It made a lot more sense before...you know...the thing that
         | made no sense.
        
         | kitd wrote:
         | I think one of the bits of carry-over legislation in the
         | Withdrawal Agreement after Brexit was the UK continuing to
         | comply with GDPR. So using an Irish DC is still ok.
        
       | gbraad wrote:
       | They don't just send the notifications but store them? Sounds
       | like it might contain PPI as it records passport extension data,
       | etc. Might be minimal, though over 1 billion rows (400GB) sounds
       | like a massive trove to keep around.
        
         | M2Ys4U wrote:
         | I'd highly doubt their database contains payment protection
         | insurance.
        
       | efxhoy wrote:
       | Lovely! We just migrated from postgres 14 to 16 for 3 postgres
       | clusters (servers) on RDS containing about 2TB of data across 8
       | databases. We were down from 00:00 to 04:00. Steps we took:
       | 
       |   * enabled our fallback "maintenance mode" site. It's a super
       |     lightweight version of our site running on CF workers.
       |   * scaled down all apps using the db to 0 in terraform
       |   * hit the upgrade button in the aws web ui, which runs
       |     pg_upgrade. 14->15
       |   * waited for it to finish
       |   * hit the upgrade button again. 15->16
       |   * waited for the dbs to start accepting connections (they do
       |     before they're marked as ready, I think aws does more stuff
       |     than pg_upgrade)
       |   * started `VACUUM ANALYZE; REINDEX DATABASE CONCURRENTLY`. The
       |     idea is to avoid performance issues between versions and make
       |     use of performance improvements from new versions.
       |   * started spinning up the apps again
       |   * waited until all apps had a handful of containers running
       |   * started accepting traffic (disabled maintenance site)
       |   * went to bed
       | 
       | The REINDEX CONCURRENTLY happily chugged along for the biggest db
       | for another 18 hours without blocking anything. Next time we're
       | doing aws blue/green deploys to avoid downtime. We didn't do that
       | this time since we weren't on 14.9 yet (the minimum minor version
       | of 14 supported by blue/green).
       | 
       | If I were doing this myself I wouldn't pay the AWS tax; instead
       | I'd do blue/green myself with logical replication and a load
       | balancer.
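       | 
       | Roughly (connection details invented):
       | 
       |   -- on the old (source) cluster:
       |   CREATE PUBLICATION upgrade_pub FOR ALL TABLES;
       | 
       |   -- on the new cluster, after loading the schema:
       |   CREATE SUBSCRIPTION upgrade_sub
       |     CONNECTION 'host=old-db dbname=app user=replicator'
       |     PUBLICATION upgrade_pub;
       | 
       |   -- once lag hits zero: stop writes, flip the load balancer
       |   -- to the new cluster, then DROP SUBSCRIPTION upgrade_sub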
        
         | sharadov wrote:
          | I would just use pg_upgrade with --link (hard links) for an
          | in-place upgrade.
          | 
          | Have done 2 TB dbs in less than a minute.
         | 
         | We were running our own Postgres instances on-prem.
        
       | vfclists wrote:
       | This doesn't seem to bode well for Oracle, but lobbying carries
       | the day.
        
       | saisrirampur wrote:
       | Very clear and concise blog! Loved reading it. I'd be very
       | curious to see how Amazon DMS performs at scale. Scale includes
       | any of these situations: a) larger datasets - 2+TB b) higher
       | throughputs - WAL spikes (at least 2-3K TPS) c) reliably and
       | efficiently replicating TOAST columns (ex: large JSONBs) d)
       | advanced data types - BLOB, CLOB, HSTORE, ARRAYs etc.
       | 
       | In my previous role with the Azure Postgres team, we observed
       | that Azure DMS worked pretty seamlessly in migrations that did
       | not involve the above situations. Once any of the above came
       | in, the replication experience was hairy. Interestingly, in my
       | current gig at PeerDB, many use cases we are seeing involve one
       | or more of these situations. We have been doing quite a few
       | surgical optimizations to handle them, and they keep coming!
       | That is the beauty of Postgres, enabling a myriad of workloads,
       | each unique in its own way! :)
        
       | oulu2006 wrote:
       | We didn't even use blue/green deployment; we just had a multi-AZ
       | deployment and migrated from 11.x -> 15.2 with about 30 seconds
       | of downtime.
       | 
       | No dramas really, didn't even need logical replication.
        
         | ris wrote:
         | I'll bet that wasn't across AWS accounts though.
        
       | sharadov wrote:
       | DMS is a terrible migration tool, I spent almost a month fighting
       | with various migration issues before I gave up.
       | 
       | It would not migrate text and json types. Even AWS support could
       | not offer a solution.
       | 
          | We got in early on testing AWS Blue/Green, and that has made
          | close-to-zero-downtime upgrades a reality.
        
         | slyall wrote:
         | If you think DMS is a bad migration tool then try using it for
         | ongoing replication to an external destination.
         | 
         | Completely broken.
        
       | ryandv wrote:
       | I've also used DMS to migrate smaller datasets (~200GB) from AWS
       | RDS MySQL 5.7 to RDS PostgreSQL 11. 10 seconds of downtime here
       | was actually incurred not due to the migration itself, but to
       | enable MySQL binary logging on the RDS instance, which requires a
       | restart, and which AWS DMS uses to replicate changes from the
       | replication source (MySQL) to the target (PostgreSQL).
       | 
       | Traffic was steered to the new PostgreSQL instance not with
       | DNS/Route 53 weighted records, but an application-level
       | percentage enroller (based on user ID). Our particular set-up did
       | in fact have apps talking to both databases simultaneously, each
       | with a percentage share of traffic, and so we did not need to
       | incur additional downtime to cut off traffic to the original
       | source database - though now you do have to worry about eventual
       | consistency.
       | 
       | I wouldn't recommend using their Schema Conversion Tool. We
       | instead migrated the data as 1-to-1 as possible into the
       | PostgreSQL target table, and then used plpgsql triggers on the
       | replication target table to clean/validate the data and write it
       | to another table with our desired (and more strongly typed - no
       | more storing JSON as TEXT) schema.
       | 
       | There were also some issues with the replication task seeming to
       | stall and stop during the change data capture (CDC) phase. As
       | mentioned support is kind of spotty in this area, though we
       | learned it may have to do with sizing the DMS replication
       | instance correctly.
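       | 
       | The trigger approach looks roughly like this (hypothetical
       | names):
       | 
       |   -- validate TEXT-as-JSON on its way into the strongly
       |   -- typed table
       |   CREATE FUNCTION clean_event() RETURNS trigger AS $$
       |   BEGIN
       |     INSERT INTO events_clean (id, payload)
       |     VALUES (NEW.id, NEW.payload_text::jsonb);
       |     RETURN NEW;
       |   END;
       |   $$ LANGUAGE plpgsql;
       | 
       |   CREATE TRIGGER clean_on_insert
       |     AFTER INSERT ON events_raw
       |     FOR EACH ROW EXECUTE FUNCTION clean_event();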
        
       | zImPatrick wrote:
       | I really love how verbose and up to date the UK is with tech.
        
       | DeathArrow wrote:
       | Seems bad that government services depend on cloud providers.
        
         | callalex wrote:
         | Why? Do you deny the benefits of shared hosting in its
         | entirety?
        
           | DeathArrow wrote:
            | No. But I wish government agencies depended less on other
            | entities. A government agency isn't in business to make
            | money. They should be less concerned with economic
            | efficiency and more with accomplishing their goals.
           | 
           | They have a fiduciary duty towards the citizens.
        
             | mattsan wrote:
             | Out of all the entities to depend on, AWS is tried and
             | battle tested with US gov.
             | 
              | Additionally, the pandemic has shown that government
              | services actually need to scale to arbitrary amounts,
              | seemingly at random.
              | 
              | Don't you think that in this case using AWS is better to
              | simply guarantee availability? Imagine taxpayers' money
              | being spent on servers, and then a pandemic happens: when
              | the economy is screeching to a halt, more taxpayers' money
              | is needed to expand the cluster - money that would really
              | be useful to spend elsewhere.
        
             | patmorgan23 wrote:
              | And what if AWS is an efficient, effective, and affordable
              | way to achieve that goal?
        
       | denysvitali wrote:
       | They could probably have used pgpool [1] to avoid relying on DNS
       | (you never know who in the chain isn't respecting the TTL).
       | 
       | With this, you can choose which queries end up where - and
       | migrating the DB endpoint can be done once, centrally.
       | 
       | [1]: https://www.pgpool.net/docs/latest/en/html/
        
       | speculator wrote:
       | nice writeup. easy to read, easy to understand
        
       | pedrokost wrote:
       | We recently completed the migration of a self-hosted 3 TB
       | PostgreSQL database from version 12 to 16, transitioning from
       | Ubuntu 18 to Ubuntu 22. Concurrently, we had to upgrade various
       | extensions, most notably Timescale, for which a compatible
       | version did not exist across all scenarios. We performed the
       | upgrade by updating a replica in the following sequence:
       | 
       | - Start: PG12, Ubuntu 18, TS2.9
       | 
       | - Step 1: Set up a read-only replica with PG12 on Ubuntu 22,
       | maintaining TS2.9.
       | 
       | - Step 1.5: Enter maintenance mode and halt all services.
       | 
       | - Step 2: Detach the read-only replica, upgrading from PG12 to
       | PG15 on Ubuntu 22 with TS2.9.
       | 
       | - Step 3: Upgrade from PG15 with TS2.9 to TS2.13 on Ubuntu 22.
       | 
       | - Step 4: Upgrade from PG15 to PG16 on Ubuntu 22 with TS2.13.
       | 
       | - Step 4.5: Reconnect services to the new database server,
       | resume all services, and exit maintenance mode.
       | 
       | All the database upgrade steps were well-tested and automated
       | using Ansible. Nonetheless, we did encounter an issue that had
       | not arisen during testing. This extended our downtime to
       | approximately half an hour, which, for our use case, was
       | perfectly acceptable.
       | 
       | Employing logical replication could have mitigated the last-
       | minute surprise. So we will consider this approach for our next
       | upgrade cycle.
        
         | pbrooko wrote:
          | We recently followed an almost identical upgrade path (however
          | at the time PG16 wasn't yet supported by Timescale, so we
          | stopped at PG15 + TS 2.12).
         | 
         | We did look into using logical replication to reduce the
         | downtime of the upgrade, but because database schema and DDL
          | commands aren't replicated, it seems that it isn't recommended
          | with Timescale in the loop. (I suppose the underlying schema
         | changes that Timescale needs to make under the hood are mostly
         | a function of your hypertable chunk sizing and what your
         | incoming writes look like, so this could be planned around /
         | timed well, but we felt it added too much potential complexity
         | & risk compared to simply opting for a small maintenance window
         | while pg_upgrade completed).
        
       | globular-toast wrote:
       | Is that a picture of them doing it?
       | 
       | What is this weird fantasy that programming looks like this? It's
       | always the same: a couple of people, always of diverse
       | backgrounds, physically interacting, usually pointing and
       | laughing at something, with perhaps another group, representing
       | different demographics, looking on with a mixture of curiosity
       | and awe.
       | 
       | Do an image search for programming to see more examples of this.
       | 
       | I guarantee this work was the result of guys sitting at desks,
       | possibly in the dark, thinking. Are we collectively embarrassed
       | that this is what programming looks like? Or are these just
       | fantasies of management types who have never written a line in
       | their lives?
        
         | johneth wrote:
         | Give it a rest, mate.
        
           | globular-toast wrote:
           | Nah, this is the thing that leads to open offices etc. I want
           | people to understand what programming actually looks like and
           | these silly pictures need to go.
        
         | draaglom wrote:
         | The primary goal of most large organisations' dev blog is to
         | attract potential candidates to the top of the recruitment
         | funnel.
         | 
         | This picture appears to be (presumably) the actual team doing a
         | team building activity.
         | 
          | This is both a reasonable choice given the inferred
          | instrumental goals of the blog, and an honest reflection of
          | (one part of) the team's life, assuming it's a photo of the
          | real team.
        
         | beejiu wrote:
         | They are clearly not programming. It's a team building game
         | called the marshmallow challenge.
        
       | alerter wrote:
       | Interesting, though I have no idea why the government is using
       | AWS in the first place. This isn't a startup hacking away trying
       | to find PMF, or dealing with unpredictable marketing-driven
       | traffic spikes. We know we need these services running long term,
       | and can make solid predictions about usage patterns.
       | 
       | We could build a public sector cloud and/or adopt a sensible on-
       | prem approach. This requires funding, coordination and technical
       | leadership, but would save the taxpayer an enormous amount over
       | the long term.
       | 
       | Public sector IT is a disaster in general ofc, but I know there
       | are good engineers working there.
        
         | MrBuddyCasino wrote:
          | If you think the public sector could improve its operational
          | efficiency by building, operating and using its own cloud,
          | I've got a bridge to sell you.
        
         | rjzzleep wrote:
          | I don't know about the UK, but AWS has had GovCloud in the US
         | for a long time, and to be honest compared to a lot of
         | infrastructure I have seen there it's a blessing. On the
         | flipside, I've met some really amazing infrastructure and ops
         | people in a German gov healthcare institution running the in-
         | house DC, where the problem was neither the tech, nor the
         | people, but 100% the management and their processes, and their
         | desire to be the bottleneck for every single interaction
         | between infrastructure and engineering teams.
        
         | StupidOne wrote:
          | I'm sure a lot of countries in the EU did this. I know first-
          | hand they did this in Croatia, as I was one of the developers
          | who had to deploy on it. The worst thing I have ever seen in
          | my life. And I have worked on a lot of legacy apps written in
          | VB.NET, Web Forms, old SharePoint, Basic, and even one whole
          | app that was a big mess of stored procedures.
          | 
          | AWS, Azure, GC are at least written with thought for end users
          | (us, developers), while the government cloud was architected,
          | designed and built by the lowest bidder whose first goal was
          | to cut and cut and cut their costs whenever possible.
        
         | jamietanna wrote:
          | The UK government as a whole operates across the big names in
          | public cloud as well as some on-prem/colo.
         | 
         | I'd very much recommend watching
         | https://youtube.com/watch?v=mpY1lxkikqM&pp=ygUOUmljaGFyZCB0b...
         | from September about Gov.UK's various iterations and some of
         | the migrations across cloud that they've had to do.
         | 
          | One thing about (at least UK) government is that procurement
          | requirements mean they go to market for quotes around usage
          | every few years. If, say, Oracle Cloud were 1/10th the price,
          | it would likely win the deal, and so they'd have to migrate to
          | Oracle for the duration of the contract, and then potentially
          | do the same to another cloud if that was cheaper.
        
         | j-krieger wrote:
         | Disclaimer: I work in the public sector.
         | 
          | If you really think the public sector could build anything
          | closely resembling a cloud, you are dreaming. If we imagine
          | one cloud serving the entire public sector, we are entering
          | delusional territory.
         | 
         | Public sector projects are insanely expensive, take ages to
         | develop and often flat out fail.
         | 
          | Not only that, we are starved for engineers even now. If we
          | ran such a custom cloud, we would shoot ourselves in the foot
          | by not being able to hire from a pool of experienced
          | engineers.
        
           | alerter wrote:
           | Oh I totally understand that. Public sector IT is littered
           | with failed projects, massive overspends and lasting
           | functional problems. There are numerous examples from the
           | last few years alone - hundreds of millions spent on broken
           | software. I used to work in public affairs and followed a lot
           | of those projects closely.
           | 
           | I don't think this means "government = bad at tech" though.
           | You sometimes see smaller in-house teams do really good work.
           | The biggest issue seems to be with contracting and
           | procurement policy. For example, on the Police Scotland i6
           | program they hired a bunch of consultancies to write a tender
           | document, and then hired CGI for the final project. That
           | turned out to be a copy of the Spanish system, which turned
           | into a huge and expensive disaster as the requirements
           | differed.
           | 
           | Feels like government has a similar problem to a lot of
           | legacy non-tech companies. They don't have much technical
           | leadership, don't understand the problem, and decide to hand
           | it off to the lowest bidder. Doesn't help that they are often
           | legally obliged to act this way. But the underlying
           | engineering problems aren't unsolvable, and don't need to
           | become a huge mess every time. (Your point about recruitment
           | is fair though)
        
         | ris wrote:
         | > This isn't a startup hacking away trying to find PMF, or
         | dealing with unpredictable marketing-driven traffic spikes. We
         | know we need these services running long term, and can make
         | solid predictions about usage patterns.
         | 
         | If you think government needs and demands are predictable, you
         | don't follow politics (particularly uk politics in the last
         | decade).
         | 
         | And then there are these things like pandemics that completely
         | come out of left field. Being able to scale things on demand
         | over the pandemic was one of the key demonstrators for use of
         | the public commercial cloud by the public sector.
        
           | objektif wrote:
            | Why do we assume scaling can happen only in the cloud? Gov
            | sites are usually very low tech. We are not mining crypto. I
            | would guess even if they outright bought servers and just
            | scrapped them later on, they would come out ahead.
        
         | berkes wrote:
         | As a startup, my reason for using "cloud" (PaaS) is not to
         | catch spikes, but because of focus. Every hour that I spend
         | running around with HDDs, screwdrivers (or the cloud version
         | thereof - storage, ansible, etc) is an hour that I'm not
         | spending on stuff my customers need.
         | 
         | Why would this be any different for a government?
         | 
         | We don't expect our government to build their own cars, but to
         | buy them from Volkswagen or Renault. Even when the government
         | has a clear need for transport. Why do we then insist they
         | build their own IT infrastructure?
        
           | 0dayz wrote:
            | Because the government is not a corporation and its
            | obligations are different from the private market's.
        
             | berkes wrote:
              | So should a government (e.g. the police) produce its own
              | cars, rather than buy them? Or set up a factory to make
              | laptops?
             | 
             | I'm not trying to be pedantic. I'm trying to understand
             | where the line lies (and whether that line is universal for
             | all governments on all layers in all countries).
        
               | objektif wrote:
               | Your examples do not make sense: on-prem existed long
               | before the cloud, and some shops were perfectly fine with
               | it. If the police had been producing perfectly fine cars
               | and then went for a car-share program, I would argue that
               | yes, it may be reasonable for them to consider going back
               | to producing cars.
        
         | chasd00 wrote:
         | The government likely hired a consulting firm and AWS was part
         | of the solution proposed and bought.
        
           | plugin-baby wrote:
           | This doesn't sound like how GDS operates.
        
       | jm2dev wrote:
       | In a similar case, AWS DMS did not work for my team and we
       | proceeded with pglogical.
       | 
       | Although it has some limitations, it worked pretty well, and
       | personally I like that it is cloud agnostic. I would definitely
       | recommend considering pglogical first.
        
       | zeroclicks wrote:
       | Let's ignore the technicalities completely. Here's a government
       | providing a detailed look into their engineering practices,
       | complete with graphs and configuration snippets. How many
       | governments manage to publish such a thing?
        
       | glintik wrote:
       | 11 seconds of downtime - looks like devops took a smoke break
        
       | mwarkentin wrote:
       | Back when we first tried out DMS there was a fun bug where
       | booleans were interpreted as strings, so all false records were
       | interpreted as "false" and translated to true on the target DB.
       | It was fixed shortly after but was a good reminder to validate
       | your data during a migration.
        
       ___________________________________________________________________
       (page generated 2024-01-20 23:02 UTC)