[HN Gopher] We migrated our PostgreSQL database with 11 seconds downtime
___________________________________________________________________
We migrated our PostgreSQL database with 11 seconds downtime
Author : sh_tomer
Score : 563 points
Date : 2024-01-18 21:51 UTC (2 days ago)
(HTM) web link (gds.blog.gov.uk)
(TXT) w3m dump (gds.blog.gov.uk)
| Edwinr95 wrote:
| I'm quite unpleasantly surprised that a government service is
| moving from their own platform to AWS for such an important
| service.
| cpursley wrote:
| And an American one, at that (we're talking government services
| here, not some SaaS). Are there really no native UK cloud
| providers?
| cameronh90 wrote:
| Not any remotely comparable. The small "cloud" providers we
| do have were just reselling vSphere last time I looked into
| it.
| jazzyjackson wrote:
| I don't know what it's like in the UK but it may be the case
| that government has a hard time a{ttract,fford}ing talent to
| administer everything in house. Not that AWS is great for cost
| saving but if it's between paying 50k/year for cloud services
| and not being able to find an engineer who will competently do
| the job for less than 50k, then the cloud is your only move
| really.
| swozey wrote:
| They require various clearances (digging into your life and
| past relationships to a miserable degree), don't allow
| someone to have ever smoked pot and pay half or less of what
| you can make in the pvt sector here (usa).
|
| Everyone I know working FedRAMP jobs is prior
| military/g-level.
| robertlagrant wrote:
| They wouldn't need that. And having been SC cleared in the
| UK, and known a few DV-cleared ones, at least in the UK
| they don't care if you've smoked pot. They just care that
| if you have, that you don't mind your family knowing one
| day. They don't want people who can be blackmailed.
| swozey wrote:
| Here it's like this: Don't ever lie to them, "no matter
| what it is they'll find out."
|
| So, some people don't lie, say they smoked pot in high
| school and none of them make it to the next step.
|
| I had a twitter convo last year or pre-x whenever with
| the CTO of some org I can't remember (I don't think
| centcom, something much smaller) and he mentioned that
| they've lightened up quite a bit, or at least his program
| which was a software engineering group was more lenient.
| He was looking for engineers via twitter on his
| official account.
|
| So maybe that's loosening up here thankfully.
| justsomehnguy wrote:
| > who will competently do the job for less than 50k, then the
| cloud is your only move really
|
| Well, there _is_ the other way, but, as we know, that would
| never ever happen.
| prmoustache wrote:
| Once you're past the emerging-startup stage, running in the
| cloud involves as many engineers and as much complexity as
| running on prem if you want to follow best practices.
|
| The "let's be managed and only hire developers" idea is a huge
| myth. All large organizations employ tons of "cloud
| engineers" or "devops" (depending on what they want to call
| them) who are just sysadmins with a different name and a
| bigger paycheck.
|
| Having actual datacenters doesn't add a ton of complexity, and
| datacenters themselves are often managed by people who don't
| even have an engineer's paycheck. The main difference between
| being on prem vs cloud is you have to plan (how many
| servers/storage/network equipment you have to buy and replace
| in the following year) and pay for stuff (like space, racks)
| more in advance, plus take into account delays in delivery. This
| is where cloud makes the job much faster for companies, but
| given the slow pace at which gov stuff usually happens I don't
| think this is a problem for them.
| solatic wrote:
| > and not being able to find an engineer
|
| Remember it's not just about being able to find one single
| engineer - then they become a key-person risk. You need
| multiple engineers to be able to handle the loss of that
| engineer, either temporarily (vacation) or permanently
| (suddenly hit by a bus). Then you end up having a team of
| DBAs. Then you have functional rather than feature teams.
| Then you need multiple managers to align to get anything
| done, and have internal politics.
|
| Being able to consume databases _as a product_ has non-
| trivial value.
| ris wrote:
| GOV.UK PaaS also runs on AWS (for as long as it continues to
| exist)
| okasaki wrote:
| All UK businesses run on Oracle and Microsoft, so I'm not sure
| why you're surprised. They have us by the balls.
| kingkongjaffa wrote:
| > This is an AWS RDS PostgreSQL database and it lives in the
| PaaS' AWS account. Our apps that run in the PaaS talk to this
| database. We are going to call this database our 'source
| database'.
|
| It already was. Read the article.
| conception wrote:
| AWS has a lot of pre-audited compliance built into their
| services. Being able to inherit their certification for
| services can save an organization a lot of time and effort.
| lnxg33k1 wrote:
| It's not an organisation, it's a blucking government; it
| handles citizen data, and it's sending it to a company in a
| foreign country, because it can't hire some system
| administrators? A GOVERNMENT? What are they doing? Still
| looking for their product-market fit and can't afford the
| headcount? Is it a joke?
|
| EDIT If they are looking for money I'd like to participate a
| bit in the seed round
| kunwon1 wrote:
| AWS has a G-Cloud for UK just like they have one for US,
| no?
| snoman wrote:
| AWS has government specific regions (called GovCloud).
| Many services or features make it to GovCloud later than
| other regions because of the certification requirements.
| travem wrote:
| AWS has US based GovCloud regions: AWS GovCloud (US-East)
| and AWS GovCloud (US-West). It does not have a UK
| specific GovCloud region that I am aware of.
| bboygravity wrote:
| Why can't the UK government build their own cloud?
|
| It's just completely insane to me that they would make
| the gov internet infrastructure completely
| (geopolitically) dependent on another country AND just
| literally give all their (citizens') data away AND pay
| for that "privilege"?!
|
| I mean if the government can't host the government's
| websites using tech from the government's country, maybe
| it would be better to just forget about the whole
| cyberweb thing altogether? Just turn it off?
| robertlagrant wrote:
| They'd still be outsourcing to a firm to do this. They
| wouldn't hire a load of people to do it in-house. See
| also Fujitsu in the recently-popular Horizon scandal, or
| the NHS National Programme for IT debacle[0].
|
| [0]
| https://en.wikipedia.org/wiki/NHS_Connecting_for_Health
| vdaea wrote:
| Why should they build their own cloud, seeing that it costs
| more money?
| willsmith72 wrote:
| you want every government to build their own cloud? what
| in the world? the whole world is interlinked, should they
| also manufacture their own government laptops in the UK?
| ris wrote:
| I don't think you have any idea just how much it costs to
| run infrastructure at the reliability levels provided by
| AWS, and just how much investment it would require to get
| the ball rolling on this.
|
| A lot of people have a very unrealistic picture of what
| government budgets are like.
| blibble wrote:
| > I don't think you have any idea just how much it costs
| to run infrastructure at the reliability levels provided
| by AWS
|
| my $12/year VPS does better than us-east-1
| hdlothia wrote:
| Where do you get a 12 dollar a year vps. Hetzner charges
| me 4 bucks a month and it feels like a steal
| blibble wrote:
| https://lowendtalk.com/
|
| quality varies
| bboygravity wrote:
| I get that it costs a lot.
|
| My point is that NOT hosting it yourself (as a
| government) costs WAY more in the long run. See my points
| above.
|
| The same goes for companies in Europe who literally host
| their trade-secrets (designs, sales, the entire company)
| on US-servers (OneDrive, Google Drive, etc). The US is
| the home of their competitors. Who cares about
| infrastructure costs if you're PAYING to give your trade
| secrets away to your competitor(s)?!
| shagmin wrote:
| I've always wondered how beholden the world is to
| Microsoft. I was once surprised to learn the US military
| (and probably virtually all others) don't have their own
| OS to avoid being tied to a particular company.
| 0xbadcafebee wrote:
| Why can't the UK government build their own cars? Their
| own boots? Their own pens, paper? How wasteful and
| pathetic that they wouldn't make all those things
| themselves. If it's _possible_ to do it yourself, by
| golly, you _should_ do it yourself, and there's
| absolutely no reason in the entire world to purchase
| those things from someone else instead.
| kakoni wrote:
| You know that UK's National Health Service did a deal
| with Palantir for "federated data platform"?
| NomDePlum wrote:
| No. It is only relatively recently (~5/6 years) that AWS
| have had any data centres in the UK.
|
| That blocked use of AWS for a lot of UK departments due
| to data sovereignty concerns.
| kunwon1 wrote:
| I see. I don't use AWS much, I saw this [1] and assumed
| this was like the US g-cloud.
|
| [1] https://aws.amazon.com/government-education/g-cloud-
| uk/
| dijit wrote:
| Sysadmins are cheaper than many people seem to think.
|
| I had a person I trust a lot telling me that "if we go with
| a bare metal provider like GCore we'd have to hire
| someone"; his reason for bringing that up was that the cost
| difference would be justified by _not_ having to hire
| someone.
|
| However, a EUR400,000/y GCore bill becomes EUR6,000,000~/y
| if you were to use a public cloud, even with the scaling up
| and down when not in use (we are an extreme case of needing
| a lot of dumb unreliable compute that's geographically
| distributed).
|
| I can hire a _lot_ of sysadmins for that money, but I
| probably don't even need _one_ because public clouds also
| need devops staff to manage the complexity anyway.
| DylanDmitri wrote:
| The risk is hiring a team of ineffective sysadmins,
| especially if your organization can't assess sysadmin
| competence.
| dijit wrote:
| That would indeed be a risk, but the circular logic of
| this means no new company could ever have any competence
| outside of its founders. Which feels shortsighted.
|
| Anyway, I am a former sysadmin. I am confident that I can
| identify competence in the requisite areas.
| criley2 wrote:
| Governments tend to be far less competent at determining
| technical competence. Due to a wide variety of factors,
| governments tend to be completely uncompetitive in salary
| for technical positions meaning they're already hiring
| from the lower end of the pool (not including a few
| altruistic folks willing to forgo their market value).
|
| At a company if a department isn't working out you just
| restructure and move on, but in the government, that team
| is going to retire in your org and collect pension from
| you, and there's very little you can do about that.
| dijit wrote:
| Everything you said seems to also apply for developers
| and the staff that would manage cloud resources.
|
| Lack of cost control or effective use of a cloud provider
| leads to spiralling uncontrollable costs.
| rcxdude wrote:
| yeah, every company I know of that uses cloud has a team
| responsible for managing it anyway, and they don't seem
| much smaller than the team needed to manage on-prem. I
| don't really think this 'savings' exists in most cases.
| belter wrote:
| If you are worried about that, start with your government
| use of Microsoft Office and Windows, which both send MBs of
| data per minute to a US-based company.
| foofie wrote:
| > (...) its a blucking government, it handles citizen data,
| and its sending them to a company of foreign country,
| because it can't hire some system administrators? A
| GOVERNMENT? What are they doing?
|
| This is a very good question, and bears repeating.
|
| It's not a massive database, either: 400GB with 1k
| inserts/second.
| everfrustrated wrote:
| The UK relies entirely on the mercy of the USA for its nuclear
| deterrent (Trident)
|
| For the UK at least, that ship has _long_ since sailed....
| blibble wrote:
| trident has UK built warheads and is operationally
| independent of the US
| solatic wrote:
| > because it can't hire some system administrators?
|
| Spoken like someone who has never worked in the public
| sector. Hiring can easily take 6+ months due to an
| ever-increasing list of requirements that government HR is
| required to fulfill, not least of which is passing a
| security clearance which takes even more time. The best
| people on the market rarely have the patience for this.
| Once your employees do get hired - on-boarding can take
| another few/several/more months, getting various
| permissions, technical documentation, etc. Everything is
| out-of-date because making changes requires committee
| consensus, in a culture that is risk-averse, because nobody
| notices when you out-perform (after all, the requirements
| were also set by a committee that doesn't know who you are)
| but something going wrong is grounds for termination.
| Public sector work over-relies on hiring contractors
| precisely to shift blame for failure to the contractors.
| Managed database services are _excellent_ tools to shift
| this kind of catastrophic risk of data loss to a contractor
| /vendor (who is managing the database).
|
| Governments not owning their data isn't due to technical or
| budgetary limitations - it's strictly cultural.
| NomDePlum wrote:
| Fully agree with this. I'd also add that a lot of IT is
| buy not build, in general. That includes support.
| Particularly true for the public sector and has been in
| place well before AWS existed.
|
| Outsourcing the complexity to run and maintain a secure
| reliable database cluster really is making good use of
| the managed service model.
| Dylan16807 wrote:
| > Hiring can easily take 6+ months or more
|
| Do you think this move didn't take even longer to plan?
|
| > to shift blame
|
| That reason is much more plausible.
| sph wrote:
| How would you feel if the US government ran on servers from a
| European company, which also works very hard to avoid paying
| taxes on US soil?
|
| All those reasons to go AWS hold for a private company, not
| for a government service of a first world country and G7
| member. AWS has a lot of compliant services, but it's not
| like they're doing rocket science that one of the top 5 richest
| countries in the world cannot afford to develop or contract
| within its borders.
|
| The simple reason is that the UK has been on a long trend of
| selling out to the highest bidder, whether they are US
| tax-avoiding companies, Chinese, or managed by Russian
| oligarchs.
| We have chosen AWS for the same reason Post Office chose
| Fujitsu.
| freedomben wrote:
| I would be surprised if they aren't deploying to the London
| data center, so I would think it _is_ within the UK
| dijit wrote:
| There's no govcloud in the UK; unless there are specific
| terms then the terms-of-service state that you are
| licensing either the Irish entity or the American entity
| to have access to and dominion over your data.
|
| I had to spend a lot of time writing my privacy policy
| (perks of being CTO... yay), and part of that privacy
| policy was an admission that we transfer ownership of
| data to a US company (by using public cloud) despite
| using european datacenters.
|
| This is because our agreement is with a US entity.
| arpinum wrote:
| The blogpost shows a connection string to eu-west-1 in
| Ireland
| esskay wrote:
| eu-west-2 is a bit misleading; most of it is nowhere near
| London, they've got DCs right up into the Midlands. One
| of their newer ones, for example, is out in Didcot,
| Oxfordshire, and they've also got a few up towards
| Peterborough. All classed as 'London' despite being a
| fair distance away from it.
| otteromkram wrote:
| I'd argue that AWS is much better suited than self-hosting
| _because_ it's such an important service.
|
| Downtime becomes negligible and global reach vastly increases
| with comparably little cost.
| 0xbadcafebee wrote:
| That sentence was a little confusing. You're not happy that the
| government is hiring experts to run an important service?
| dtnewman wrote:
| Yes. RDS is a very reasonable choice if you are a tech
| company, let alone a govt org. The alternative isn't "let's
| host this ourselves" it is "let's host this with Oracle at a
| much higher cost".
| 15457345234 wrote:
| The alternative - at government scale - is absolutely
| 'let's host this ourselves' and that's what they should be
| doing, to ensure that institutional expertise remains. They
| should also own and operate their own datacentres which
| should be physically secure, not shared with commercial
| ventures and guarded by the armed forces, not civilian
| security.
| 0xbadcafebee wrote:
| Why doesn't the government manufacture their own cars?
| They're going to lose institutional expertise in building
| cars! They should also own and operate their own
| manufacturing facilities which should be physically
| secure, not shared with some 'civilian commercial
| venture'.
|
| By golly, the government can't do business if it isn't a
| datacenter operations company, a software vendor, and a
| car manufacturer.
| 15457345234 wrote:
| > By golly, the government can't do business if it isn't
| a datacenter operations company, a software vendor, and a
| car manufacturer.
|
| I mean, precisely, which is why some countries are
| rapidly turning into failed states. Too much buck-passing
| and outsourcing.
| tobias_irmer wrote:
| It isn't? AWS is crazy expensive and you don't have as much
| control over things as you may occasionally need. The best
| decision we took in the past few years with regards to
| infrastructure was moving away from AWS and doing
| everything ourselves.
|
| On RDS we had inexplicable spikes in cost, deteriorating
| support and no real support for any of our issues. When we
| tried using DMS, it just didn't work as expected, even
| after spending two days on the phone with their support.
| 15457345234 wrote:
| > is hiring experts
|
| 'moving to AWS' (or any cloud provider) is not 'hiring
| experts'; it's just outsourcing the risk to an entity that
| you, in the event of a genuine crisis, have no leverage over
| beyond 'we're going to stop paying you (once we migrate away
| from you which will take ten years)'
| 0xbadcafebee wrote:
| AWS are not experts at providing computing services? Holy
| cow. This is news to me! I thought they were the most
| popular and highly regarded computing infrastructure and
| PaaS company in the world, managing both hardware and
| software and providing subject matter experts to work with
| customers on architecture and implementation, along with
| official partners and a marketplace of turn-key products.
|
| Boy, am I embarrassed! I need to start building my own
| datacenter right away, all my shit is on AWS!!!
| rcxdude wrote:
| They may be experts, but you sure aren't hiring them. You
| are renting something from them.
| pixelesque wrote:
| As other comments point out, their own platform was (at least
| in terms of DB) already running on AWS, just using a different
| account.
| foofie wrote:
| > As other comments point out, their own platform was (at
| least in terms of DB) already running on AWS, just using a
| different account.
|
| That changes nothing. It just means this unjustifiable
| nonsense has been going on for a while.
| pmcp wrote:
| As somebody who worked for the European Commission, and a
| European national government, I agree with your sentiment, but
| the harsh reality is that government divisions generally
| work on a shoestring budget when it comes to decisions like
| these. I wouldn't be surprised if this was a "best effort given
| the circumstances" move.
| _joel wrote:
| There's an absolute ton of stuff on AWS. There used to be
| G-Cloud that allowed for smaller clouds to tender for government
| contracts but there was a big pull to AWS, at least from my
| experience with it.
| ivix wrote:
| I'm surprised that you're surprised. Why on earth would
| government not be migrating to the cloud?
| NomDePlum wrote:
| Why?
|
| I've worked on a number of UK government projects, including
| some with particularly sensitive security and data
| requirements.
|
| Having some knowledge of their on-prem data centres, and of
| the UKCloud offering they have also used, moving to AWS has
| so many operational, security and resilience benefits that
| aren't available elsewhere. It's not a free lunch by any means
| and certainly needs thought and governance, but the procurement
| simplification benefits alone make going to the public cloud a
| no-brainer for a lot of government services.
|
| It is worth knowing that even the on-prem data centres are
| usually operated by 3rd parties such as HP, BT and IBM. There
| was an initiative to have "Crown-managed" data-centers but it's
| not particularly scalable.
| overstay8930 wrote:
| If you saw how non-tech companies run datacenters, well let's
| just say they're not exactly working with NATO like the big 3
| cloud providers do when designing their DCs and backbone.
|
| Honestly you should be frightened when you see someone NOT
| using a cloud provider, because it is hard work to properly run
| and secure a datacenter. Even Equinix fucks up HARD regularly
| and they are considered the gold standard (shout out to those I
| saw at 350 E Cermak over the weekend).
| msla wrote:
| It incentivizes public-private cooperation: If the government
| cracks down on Amazon, Amazon turns off the government's AWS
| accounts and deletes the data. The government finds that
| subpoenaing a wiped hard drive is utterly nugatory, and thereby
| learns humility.
| mobilemidget wrote:
| "The PaaS team offered us the ability to migrate databases
| using AWS Database Migration Service (DMS)."
|
| And I wouldn't be surprised if they got some kickback, discount,
| etc. in some way to promote AWS on their blog. Not claiming it's
| so, but I would not be surprised at all. It reads as one big
| advertisement.
| mbb70 wrote:
| Interesting to compare this to
| https://knock.app/blog/zero-downtime-postgres-upgrades
| discussed here https://news.ycombinator.com/item?id=38616181
|
| A lot of the discussion boiled down to 'this is a lot of
| complexity to avoid a few minutes of downtime'. I guess this is
| the proof, just use AWS Data Migration Service, swap the DNS
| entries to go live and live with 11 seconds of downtime.
| ris wrote:
| There is no "just" about it. The absolute key takeaway is in
| "what we learned":
|
| > We chose to use DMS because it was well supported by the
| GOV.UK PaaS and we could also get support from AWS. If we were
| doing a PostgreSQL to PostgreSQL database migration in the
| future, we would invest more time in trying alternative tools
| such as pglogical. DMS potentially added more complexity, and
| an unfamiliar replication process than what we may have found
| with other tools. This backs up what AWS say themselves on
| PostgreSQL to PostgreSQL migrations.
|
| The message here is _not_ "just use DMS".
| T-Winsnes wrote:
| Even AWS in their own docs say to use the native tools when
| migrating from postgres to postgres[1]. They don't go into
| the details too much and point to pg_dump rather than
| pglogical, but it's interesting to see that they don't
| recommend using DMS for it
|
| [1] https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source...
| ris wrote:
| They _do_, but those recommendations are buried quite deep
| in the documentation, well behind all the marketing guff
| that suggests that DMS is all things to all people, and
| wonderful magic that is ideal for all situations.
| btown wrote:
| Has anyone used
| https://cloud.google.com/database-migration/docs/postgres/qu...
| to do something like this? Does it work similarly to AWS DMS?
| postpawl wrote:
| There are a lot of gotchas with using DMS (which seems to use
| pglogical under the hood). Since it's not hardware-level
| replication, you can run into issues with large
| rows/columns/tables and it doesn't really handle foreign keys.
| It may not handle some special data types at all. You also need
| to update the sequences after the migration or you'll get
| errors about duplicate primary keys. You can also have issues
| if you don't have proper primary keys, because it doesn't
| always copy the entire row at once.
|
| If the databases are within the same AWS account, it's likely
| easier to use hardware-level replication with global database
| or snapshots to do migrations if you're ok with 4-5 mins of
| downtime.
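|
| For the sequence fixup, a minimal sketch (table and column
| names are illustrative), run once on the target after cutover:
      -- Logical replication copies rows but not sequence
      -- state, so bump each serial/identity sequence past the
      -- copied maximum:
      SELECT setval(
        pg_get_serial_sequence('notifications', 'id'),
        (SELECT COALESCE(MAX(id), 1) FROM notifications)
      );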
| Twisell wrote:
| There are many options available with PostgreSQL; you could
| also do a physical full backup + WAL-level replication to
| keep in sync AND get low downtime.
|
| What might have oriented their choice is that they wanted to
| upgrade from major version 11 to 15 during the migration
| process. This is only available using logical replication.
| Otherwise you'd have to chain the upgrade process through each
| major version (and possibly the OS, because 11 is EOL on some
| arches) and this is neither trivial nor quick.
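|
| For reference, the native route is roughly (connection details
| illustrative; the source needs wal_level=logical):
      -- On the old primary (e.g. PG 11):
      CREATE PUBLICATION upgrade_pub FOR ALL TABLES;
      -- On the new server (e.g. PG 15), after restoring the
      -- schema with pg_dump --schema-only:
      CREATE SUBSCRIPTION upgrade_sub
        CONNECTION 'host=old-db dbname=notify user=replicator'
        PUBLICATION upgrade_pub;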
| igammarays wrote:
| Now all we need is for Amazon to come out with a "government-as-
| a-service" product for your budding nation state.
| ris wrote:
| FWIW GOV.UK Notify is part of a suite of services offered by
| GDS to UK public sector bodies (along with GOV.UK Pay and
| GOV.UK PaaS) that was originally known as "Government As A
| Platform".
| robin_reala wrote:
| ...and the platform is all open source, so people are free to
| fork / steal.
| https://github.com/orgs/alphagov/repositories?q=paas
| hinkley wrote:
| In partnership with pinkerton.com
| londons_explore wrote:
| There are various ways to 'pause' incoming postgres queries, for
| example using pgbouncer: i.e. don't fail them, simply delay them
| until the replication has caught up and then let them continue on
| the new database.
|
| If anything goes wrong and replication doesn't catch up, you can
| unpause and let those queries happen on the old database.
|
| Therefore, your 11 seconds of downtime becomes 0 to 11 seconds of
| added page load time. But more importantly, for the thousands of
| users of the database who have never seen a query fail before and
| might have buggy error-handling codepaths or have a single failed
| query ruin a whole batch job, this approach leads to a lot less
| collateral damage.
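|
| A minimal sketch of that flow, issued against pgbouncer's admin
| console (psql -p 6432 pgbouncer; the pool name is illustrative):
      PAUSE notify;    -- finish in-flight queries, hold new ones
      -- ...flip the target, wait for replication to catch up...
      RESUME notify;   -- held queries proceed on the new database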
| whartung wrote:
| It's one thing to pause queries but can you also pause
| transactions that are in flight? How does that work?
| londons_explore wrote:
| Note that the enemy of low/zero downtime migrations like this is
| long running queries.
|
| Ie. a single update query which takes 30 mins.
|
| You either have to kill and roll back that query, or suffer 30
| mins of unavailability.
|
| As far as I know, there is no way to migrate a currently in
| progress query.
| Amezarak wrote:
| It's hard for me personally to imagine a 30-minute update query
| that is not written extremely inefficiently, or else a one-time
| huge data migration.
|
| There are a lot of the former in the wild to be sure. I've had
| a lot of fun turning minutes-hours into milliseconds. :)
| londons_explore wrote:
| Data scientists... And when you look at the query, it'll be
| 800 lines of SQL...
| hobs wrote:
| Heh, that's not so bad - 2 jobs ago I had to fix a KPI
| generation process that was 70k lines of dynamic sql, that
| unwrapped to up to 1m LOC :)
| chaz6 wrote:
| Did you ever run into a hard limit of statement size?
| hobs wrote:
| No, but all the code generation was happening in the
| server itself so I don't know how hard and fast it
| applies; iirc SQL Server's limits are like 64k x packet
| size.
| necovek wrote:
| For a software engineering project, you probably want to limit
| your transactions to much less than that (SET statement_timeout
| is your friend). If you've got extremely long transactions, you
| can probably avoid doing the switch-over when they run
| (hopefully they are not a random occurrence but a result of a
| scheduled job or similar).
|
| In combination with transaction time limit and fail-over
| configuration (where you fail the old primary), you can control
| the slowdown (instead of downtime, eg. with pgbouncer) very
| precisely.
|
| I would be more concerned with the DNS TTL being respected in
| the entire stack (and external caching DNS servers you rely
| on), tbh.
|
| But it is usually not critical to avoid a dozen seconds of
| downtime for an app, so whatever is simpler for you should be
| your go-to solution.
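|
| On the statement_timeout point, for instance (role name
| illustrative):
      -- Cap how long any single statement may run, so the
      -- switch-over window can't be held open by a stray query:
      ALTER ROLE app_user SET statement_timeout = '30s';
      -- or just for the current session:
      SET statement_timeout = '30s';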
| callalex wrote:
| Well you've certainly introduced a teaching moment to me! What
| are the nature of writes you deal with that last 30+ minutes?
| What kind of data/what kind of people are involved with such DB
| writes where you need to rely on the DB engine to work so hard
| instead of something more split up by queues at a higher layer?
| londons_explore wrote:
| It's usually badly engineered things. However, just because
| it's badly engineered doesn't mean it's fine for it to break
| :-P
|
| Things like a super complex n^3 complexity query to find all
| users who were 3 hops away from a known fraudster, where a
| 'hop' means 'shared any piece of account data in common'
| chaz6 wrote:
| Would that be more suited to a graph database such as
| Neo4j?
| londons_explore wrote:
| Postgres queries are streaming, i.e. if the query result has 1
| billion rows, but the client only has enough RAM for 1
| million rows, then the query will be slowed down until the
| client is ready to accept more rows.
|
| If the client is slowly processing through the result set,
| then the query can take many minutes/hours.
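|
| One place this shows up is server-side cursors (table name
| illustrative), where the query stays open for as long as the
| client keeps fetching:
      BEGIN;
      DECLARE big_cur CURSOR FOR SELECT * FROM events;
      FETCH 10000 FROM big_cur;  -- repeat until no rows remain
      CLOSE big_cur;
      COMMIT;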
| Topgamer7 wrote:
| We tried to work with DMS to migrate mysql to postgres, and it
| was a nightmare. Support was useless, and would often just not
| get back to us without prodding, then they'd give us canned
| responses unrelated to our queries. The whole thing is nigh on
| un-debuggable.
|
| Stay away.
| rjh29 wrote:
| Tried it for mysql and it was flat out broken and silently
| corrupted data.
| Topgamer7 wrote:
| For us it seemed to be trying to load data from the wrong
| table on mysql, into the wrong table on postgres.
| mey wrote:
| I worked on migrating our MySQL system to PostgreSQL using
| pgloader ( https://pgloader.io/ ).
|
| There were some hiccups, things that needed clarification in
| the documentation, and some additional processes that needed to
| be done outside of the system to get everything we needed in
| place, but it was an amazing help. Not sure the project
| would've been possible without it.
|
| Data mapping from PostgreSQL to PostgreSQL as in the article
| isn't nearly as bad as going between systems. We took a full
| extended outage and didn't preload any data. There were many
| dry runs and validation beforehand, but the system
| wasn't so mission critical that we couldn't afford to shut off
| the system for a couple of hours.
| Topgamer7 wrote:
| We also ended up using pgloader. It's not without its friction
| either. For example, the concurrency/number-of-rows setting
| seems broken out of the gate, and it's like playing darts with a
| blindfold on to get it to run without running out of memory.
| But with being able to read the source and the GitHub actions,
| at least I could debug my problems, or find others
| who had the same issue.
|
| Would recommend pgloader.
| Spivak wrote:
| This was my experience as well. We thought it was a nice
| managed way to move pg to pg with minimal setup but we ran into
| so many issues we just did logical replication with bespoke
| fixes for things it didn't handle well.
| NomDePlum wrote:
| Surprised by that. I've used AWS DMS quite a lot to do both on-
| prem to AWS and AWS (MySQL) to AWS Postgres migrations and long
| term ongoing replication. Whilst there are some
| complexities/gotchas there, it's always been more than up to the
| task. Takes a little bit of validation/testing to understand
| but it's very well documented too.
|
| What sort of issues did you hit? In all honesty I'm not sure
| I've been more impressed by another AWS service.
| tibbon wrote:
| One issue we hit was that any schema change totally messed it
| up. I don't have my notes in front of me, but we were
| constantly hitting data that wouldn't migrate, or things that
| suddenly broke whenever things changed.
| NomDePlum wrote:
| Interesting; we managed several schema changes when using
| AWS DMS for replicating data over a long period between
| MySQL and Postgres clusters.
|
| We treated these carefully and tested as we made them but
| never had any real issues with them. From memory DMS could
| cope with adding columns pretty transparently. One setting
| we invested in configuring and understanding was to allow
| DMS to attempt to recover from replication failures. This
| allowed it to error on DDL changes and attempt to recover.
| This usually involved restarting the task, but it would do
| this transparently as part of the recovery.
| Topgamer7 wrote:
| It seemed to try to load data into the wrong table on
| postgres. That was the one that immediately comes to mind.
| Honestly poor support is what really killed it for us. But we
| had other technical problems with it.
|
| We burned 3 weeks just trying to get support to provide a
| sensible response. I never got the sense anyone replying to
| us knew any more than the surface level about the
| infrastructure of how DMS worked.
| NomDePlum wrote:
| We kept our table mappings pretty much mirrors of the
| source tables. Any data transformation we managed on the
| target cluster, not through AWS DMS.
|
| I've used it fairly frequently over a number of years so
| maybe the issues we hit on the learning curve have dimmed.
|
| We also deliberately kept things as simple as possible at
| first and focused on DMS's ability to move the data from
| source to target and really only tweaked settings that
| increased security or reliability. We stayed well away from
| any of the data transformation or more complex
| functionality.
| extesy wrote:
| I can confirm that. Both DMS reliability and support quality
| were terrible.
| tibbon wrote:
| We also tried to use DMS for a few things (Rackspace to AWS
| migration, replication out to data lakes, etc) and it has been
| consistently undersupported, buggy, and ate months of time
| before we went to other solutions. While a lot of AWS support
| has been good, not so for DMS. It feels entirely half-baked.
| troublebucket wrote:
| +1 DMS is very half baked. Silent, undebuggable errors. Tons of
| unsupported LOB and CLOB data types. Built-in Postgres logical
| replication is way easier.
| karol wrote:
| Hope the picture is illustrative only.
| aidos wrote:
| I love that where the rubber meets the road it doesn't matter if
| you're a government organisation or a web agency, it's still just
| plain old Postgres, python, sqlalchemy and dns with a low ttl.
| NegativeK wrote:
| To be fair, there's plenty of government orgs at the bottom and
| web agencies at the top.
| Ozzie_osman wrote:
| We did a similar migration (somewhat larger database) with ~20
| seconds of downtime and much less work... using the magic of AWS
| RDS Blue-Green Deployments [1]. Surprised they aren't mentioned
| in the thread yet.
|
| Basically, you spin up a new Blue Green deployment with any
| desired changes (in our case, we were upgrading Postgres major
| from 13 to 15). While your blue configuration continues to serve
| traffic, AWS uses logical replication to keep the "green"
| deployment in-sync. You can keep modifying (or testing) the
| "green" deployment (eg you could load test it if you wanted to),
| as long as you don't do any writes to it (writes still have to go
| to your live, blue configuration, and are replicated to green).
|
| When you're ready, you run the "switch" command, and AWS does a
| few things for you: run checks to ensure blue/green are in sync,
| stops writes and connections, waits a few seconds to ensure
| replication is caught up, renames your database, then allows
| connections/writes again. We had less than 20 seconds of
| downtime, by our count. And, we had a primary and several read
| replicas and AWS successfully switched the full configuration
| over with no hiccups. You don't even need to switch your
| configuration because AWS swaps the database URLs for you. Green
| becomes blue, blue becomes old blue, and when you're ready, you
| delete "old blue".
|
| Highly recommend! They do have some restrictions (for instance,
| not sure if it would work if you're switching accounts, etc).
|
| 1.
| https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/blue-...
| sgarland wrote:
| +1 for B/G. That said, I imagine they couldn't use it due to
| the cross-account shift. I've used it for both MySQL (with much
| higher QPS than TFA, by two orders of magnitude) and Postgres,
| and they both went flawlessly.
|
| Read the docs, especially the limitations. Re-read them. Do a
| test run in the dev environment, under load. Do it again in
| staging.
|
| Or just YOLO into prod 'cause it'll probably be fine, I guess.
| Ozzie_osman wrote:
| We definitely did a dry run with a parity configuration a
| couple nights before. It gave us a lot more confidence.
| troublebucket wrote:
| Has anyone encrypted the storage on a previously unencrypted
| RDS using Blue/Green?
| d1str0 wrote:
| Good question. This was a pain point for my small team (me,
| myself, and I) a little while back. We had an unencrypted db
| deployed with CDK, and then tried to get it encrypted without
| losing data.
| bgschulman31 wrote:
| We recently did this on my team over Christmas this year. We
| opted not to use Blue/Green for this but instead spun up an
| encrypted snapshot and began replication from the old
| database to the new database using MySQL's tools. Once
| traffic on our platform was sufficiently low, we cut
| connections to the database, waited for replica lag to
| reach 0, and relaunched the servers with the new database's
| host info. Our downtime was around a minute.
| wisosim wrote:
| We did the exact thing not too long ago, but we couldn't do
| it using blue/green. We were able to minimize downtime quite
| a bit, but it was on the order of minutes, not seconds. I
| wrote a little bit about the process here. I spent a lot of
| time prepping and running the migration, so happy to share
| any details if it's helpful.
|
| https://phizzle.space/dbadmin/aws/postgres/2023/12/30/rds-en...
| cljacoby wrote:
| I also used RDS Blue/Green deployment to apply a MySQL major
| engine version upgrade from 5.7 to 8.0. With respect to
| downtime it worked fantastically, I think we measured 13
| seconds of observable downtime from the API.
|
| However we did learn the hard way that RDS Blue/Green cannot be
| used to apply arbitrary changes. In our case, we discovered RDS
| Blue/Green can only be used to move up engine versions, not
| down.
|
| We discovered on MySQL 8.0 one of our stored procedures had
| very occasional failures, and considered the option of using
| RDS Blue/Green again to move back down to 5.7. Turns out that's
| not an option.
| Mavvie wrote:
| I wonder if that could be because MySQL 8's replication is
| backwards compatible but MySQL 5.7's isn't forwards
| compatible. If so, it makes sense that you're only able to
| move forward.
| 0xbadcafebee wrote:
| Database changes are typically one-way. If your new change
| includes creating or modifying a table, such that there are
| new additional columns, and you populate those with data,
| then downgrading would destroy the changed columns and the
| data in them. Hence you can't downgrade once you upgrade or
| you'd potentially be breaking things. To downgrade safely
| you'd need to backup or snapshot the old database, and then
| restore your database back to the backup/snapshot, but that's
| not blue/green.
| fulafel wrote:
| DB schema migration script frameworks (at least in Python,
| Ruby & Java lands) do typically support both upgrade and
| downgrade directions. People skip implementing and testing
| the downgrade side if the development model doesn't need it
| but the problem of what happens to the data is controlled
| by what you put in the "down" migration script.
|
| I'd guess if you can't throw the data away, you won't do a
| down migration, you'll do an up migration that changes the
| db to save that data in your preferred way before undoing
| or reworking the previous schema change.
| yellowsir wrote:
| but now you have 2 up paths. and migrations are critical,
| i would avoid it where possible!
| patmorgan23 wrote:
| Database migrations are always "fail forward": if there's
| an error, you figure out what it was and fix it.
| eropple wrote:
| _> DB schema migration script frameworks (at least in
| Python, Ruby & Java lands) do typically support both
| upgrade and downgrade directions._
|
| They do, and in every shop I've ever been in these are
| considered a trap precisely because they _don't_
| consider data loss.
|
| Always roll forward. If you have to change migration
| history, restore a backup and lament past you's hubris.
| evanelias wrote:
| This is solved more cleanly in declarative schema
| management systems, where you have a schema repo of
| CREATE statements, and the tool can auto-generate the
| correct DDL. You never need to write any migrations at
| all, up or down. If you need to roll back, you use `git
| revert` and then auto-generate from there. The history is
| in Git, and you can fully leverage Git like a proper
| codebase.
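|
| As a sketch of the idea (names illustrative), the repo holds
| only the desired end state, and the tool diffs it against the
| live schema to generate the DDL:
      -- In the schema repo (desired state):
      CREATE TABLE users (
        id    BIGINT PRIMARY KEY,
        name  VARCHAR(100) NOT NULL,
        email VARCHAR(255) NOT NULL  -- column added this commit
      );
      -- DDL the tool would generate against a live schema that
      -- still lacks the column:
      ALTER TABLE users ADD COLUMN email VARCHAR(255) NOT NULL;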
|
| A key component is that the schema management tool must
| be able to detect and warn/error on destructive changes
| -- regardless of whether it's a conceptual revert or just
| a bad change (i.e. altering a column's data type in a
| lossy way). My declarative tool Skeema [1] has handled
| this since the first release, among many other safety
| features.
|
| That all said, schema changes are mostly orthogonal to
| database version upgrades, so this whole subthread is a
| bit different than the issue discussed several levels
| above :) The root of the blue/green no-rollback-after-
| upgrade issue discussed above is that MySQL logical
| replication officially supports older-version-primary ->
| newer-version-replica, but not vice versa. Across
| different release series, the replication format can
| change in ways that the older version replicas do not
| understand or support.
|
| [1] https://github.com/skeema/skeema
| fulafel wrote:
| I have two theories where people end up wanting them:
|
| (1) Circumstances that for some reason enforce the
| requirement that sysadmin type ppl always have to be able
| to downgrade / roll back deployments without "the
| developers" producing new builds or sw artifacts. A
| separation of ops and dev teams, where you decide you
| need to survive an inability to make or procure new
| software builds on demand, and just dig up old deployment
| artifacts to use after down migration. There are a lot of
| wrong reasons to do this in inhouse sw settings, but also
| I guess the classic "we bought a 3rd party server app and
| plugged it into our onprem database", like Jira or
| something.
|
| (2) Systems that are technically unable to recover from
| errors happening in db migrations (missing transactional
| schema change feature in db and/or application doing db
| related stuff that can't be rolled back at deployment
| time). So the down migration is more like a hand coded
| rollback for the migration that will be automatically run
| in the failure case of a deployment.
|
| In both cases I can see how the "what happens to data in
| new columns" situation might still work out. In the (2)
| case it's sort of obvious, there's no new data yet. In
| the (1) case you live with it or choose the backup
| restore path - I can see scenarios where you decide it'd
| be much worse to restore from backup and lose people's
| entered data for couple of days, or however it took to
| find the showstopper for the upgrade, vs run the down
| migration and just lose new feature related data. (Which
| you could also rehearse and test beforehand with backups)
| magicalhippo wrote:
| Our in-house schema migration tool supports downgrading,
| but it won't remove non-empty tables or columns etc.
|
| For us this isn't a big deal though because we're writing
| our software so it should be able to function as expected
| on a DB with a newer schema. This makes upgrades much
| easier to handle as users can run new and old software
| side-by-side.
| yeswecatan wrote:
| How did you go about stopping and restarting applications which
| reach out to the database? We have a number of tasks running in
| ECS which can take a minute to spin down and a few minutes to
| spin back up.
| ThePowerOfFuet wrote:
| Presumably you don't stop them, and they throw errors during
| the cutover.
| todd3834 wrote:
| You aren't supposed to have to change anything in the
| application code. The same database URL should work.
| Ozzie_osman wrote:
| For our web service, we didn't stop anything. They had a few
| seconds of errors though it seems like some sessions were
| just buffered or paused and experienced high latency.
|
| We also had background worker services. For the very high
| throughput ones, we spun down the # of tasks to a bare
| minimum for <5 minutes and let the queue build up, rather
| than have a massive amount of errors and retries. For the
| other ones where throughput wasn't high, we just let them be,
| and during the downtime they errored and retried and the
| retries mostly succeeded.
| benmorris wrote:
| I used this about 2 months ago going from MySQL 5.7->8.0.
| Really awesome feature.
| sharadov wrote:
| Blue/green is new, so am guessing most folks don't know about
| it. It's the way for close to zero downtime upgrades.
|
| Am waiting for them to support upgrades from RDS Postgres to
| Aurora.
| shayonj wrote:
| +1 for Route53 Groups and B/G setups. We did something similar
| with PG upgrades, no downtime with AWS R53 groups & retry
| inflight queries with a custom Rails ActiveRecord transaction
| patch.
|
| Trade off: For a few seconds some requests were slower.
|
| DNS Groups w/ retries is a nifty mechanism for these things.
|
| Tool used: https://github.com/shayonj/pg_easy_replicate
| crucialfelix wrote:
| I'm planning to do this soon. Is there any problem skipping
| versions when upgrading? I would like to go from Postgres 12 to
| 16. Does Postgres handle that without worries?
| ComputerGuru wrote:
| > The second change was to create a DNS record in AWS Route53 for
| `database.notifications.service.gov.uk` with a 1 second TTL (time
| to live) [..] our migration script just needed to update the DNS
| weighting in AWS to 100% of results being sent to the target
| database location and wait 1 second for the TTL to expire. Then,
| when our apps next try to query our database they will be
| querying our target database.
|
| Wait. Their (or Python's default?) db orm blocks while it
| performs a DNS lookup for each and every query!? It doesn't cache
| resolved addresses for _any_ length of time? No connections are
| pooled and reused?
| necovek wrote:
| It would probably be the OS' `getaddrinfo` or `gethostbyname`
| that does this: Python rarely reimplements system-level calls,
| which means it relies on the system's configuration.
|
| If TTL of 1s was respected, they would be cached for 1s, but
| it's not uncommon for DNS query libraries and especially
| caching DNS servers to not fully respect TTL anyway: tbh, that
| might explain some of the downtime they've seen.
| ComputerGuru wrote:
| I didn't mean it was directly implementing the networking
| call to the dns server -- just that it wasn't directly
| caching the result.
|
| getaddrinfo(3) and gethostbyname(3) (guessing that's what you
| meant) don't implement caching, at least not explicitly in
| the spec and not normally in practice. On Windows, DNS
| results are cached by the OS but on Linux that would be
| distro-dependent behavior and usually requires setting up a
| local caching dns server (Ubuntu uses unbound out-of-the-box,
| iirc. Other choices include nscd and dnsmasq).
|
| Even if they implemented caching at the syscall level, this
| still assumes no connection stays open for more than 1s or is
| reused except per query. It seems like a big assumption (at
| least I _hope_ it is, because I certainly wouldn 't want my
| app to initialize a new db connection, let alone perform a
| DNS lookup, for every query).
| juggertao wrote:
| They mentioned they had a script which terminated all
| connections to the old database then changed the password.
|
| But on the app side you typically don't cache DNS, that
| creates other problems like stale DNS.
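|
| That cut-off script step is essentially (names illustrative):
      -- Block reconnects to the old database, then drop the
      -- connections that are still open:
      ALTER ROLE app_user WITH PASSWORD 'something-new';
      SELECT pg_terminate_backend(pid)
      FROM pg_stat_activity
      WHERE datname = 'notify' AND pid <> pg_backend_pid();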
| ClassicOrgin wrote:
| Minor detail but did anyone else notice they are using eu-west-1?
| Kind of weird for the UK to be hosting their sites in another
| country (Ireland). I'm sure this isn't super sensitive data but
| still.
| lozenge wrote:
| The London AWS region was missing a lot of features for a long
| time. It's easier to just go with a big region.
| justinclift wrote:
| "Lets go with the easier approach" is a bit worrying when it
| means potentially breaking security assumptions / legal
| assurances.
|
| Though I guess it could be explained away by having something
| in the website's "Terms and Conditions for use". ;)
| juggertao wrote:
| It's a gov website. What are they going to do, sue
| themselves?
| another2another wrote:
| Have a public enquiry, with lots of tea.
| esskay wrote:
| They've been on AWS since before eu-west-2 was added, and it's
| still not at the scale of eu-west-1. Capacity issues are still
| a thing (but getting much better), and only recently have they
| reached pretty much feature parity.
| callalex wrote:
| It made a lot more sense before...you know...the thing that
| made no sense.
| kitd wrote:
| I think one of the bits of carry-over legislation in the
| Withdrawal Agreement after Brexit was the UK continuing to
| comply with GDPR. So using an Irish DC is still ok.
| gbraad wrote:
| They don't just send the notifications but store them? Sounds
| like it might contain PPI as it records passport extension data,
| etc. Might be minimal, though over 1 billion rows (400GB) sounds
| like a massive trove to keep around.
| M2Ys4U wrote:
| I'd highly doubt their database contains payment protection
| insurance.
| efxhoy wrote:
| Lovely! We just migrated from postgres 14 to 16 for 3 postgres
| clusters (servers) on RDS containing about 2TB of data across 8
| databases. We were down from 00:00 to 04:00. Steps we took:
* enabled our fallback "maintenance mode" site. It's a super
  lightweight version of our site running on CF workers.
* scaled down all apps using the db to 0 in terraform
* hit the upgrade button in the aws web ui, which runs
  pg_upgrade. 14->15
* waited for it to finish
* hit the upgrade button again. 15->16
* waited for the dbs to start accepting connections (they do
  before they're marked as ready, I think aws does more stuff
  than pg_upgrade)
* Started `VACUUM ANALYZE; REINDEX DATABASE CONCURRENTLY`. The
  idea is to avoid performance issues between versions and make
  use of performance improvements from new versions.
* Started spinning up the apps again
* Waited until all apps had a handful of containers running
* Started accepting traffic (disabled maintenance site)
* Went to bed
|
| The REINDEX CONCURRENTLY happily chugged along for the biggest db
| for another 18 hours without blocking anything. Next time we're
| doing aws blue/green deploys to avoid downtime. We didn't this
| time since we weren't on 14.9 yet (the minimum minor version of
| 14 supported by blue green).
|
| If I was doing this myself I wouldn't pay the AWS tax; I'd
| instead do blue/green myself with logical replication and a
| load balancer.
| sharadov wrote:
| I would just use pg_upgrade with --link (hard links) for an
| in-place upgrade.
|
| Have done 2 TB dbs in less than a minute.
|
| We were running our own Postgres instances on-prem.
| vfclists wrote:
| This doesn't seem to bode well for Oracle, but lobbying trumps
| the day.
| saisrirampur wrote:
| Very clear and concise blog! Loved reading it. I'd be very
| curious to see how Amazon DMS performs at scale. Scale includes
| any of these situations: a) larger datasets (2+TB), b) higher
| throughputs / WAL spikes (at least 2-3K TPS), c) reliably and
| efficiently replicating TOAST columns (e.g. large JSONBs), or
| d) advanced data types (BLOB, CLOB, HSTORE, ARRAYs, etc.).
|
| In my previous role with the Azure Postgres team, we observed
| that Azure DMS worked pretty seamlessly in migrations that did
| not involve the above situations. Once either of the above came
| in, the replication experience was hairy. Interestingly, in my
| current gig at PeerDB, many use-cases we are seeing have either
| of the above situations. We have been doing quite a few surgical
| optimizations to handle those situations and they keep coming!
| That is the beauty of Postgres, enabling a myriad of workloads,
| each of them unique in their own way! :)
| oulu2006 wrote:
| We didn't even use blue/green deployment; just had a multi-AZ
| deployment and migrated from 11.x -> 15.2 with about 30 seconds
| of downtime.
|
| No dramas really, didn't even need logical replication.
| ris wrote:
| I'll bet that wasn't across AWS accounts though.
| sharadov wrote:
| DMS is a terrible migration tool, I spent almost a month fighting
| with various migration issues before I gave up.
|
| It would not migrate text and json types. Even AWS support could
| not offer a solution.
|
| We got in early on testing AWS Blue/Green, and that has made
| close to zero-downtime upgrades a reality.
| slyall wrote:
| If you think DMS is a bad migration tool then try using it for
| ongoing replication to an external destination.
|
| Completely broken.
| ryandv wrote:
| I've also used DMS to migrate smaller datasets (~200GB) from AWS
| RDS MySQL 5.7 to RDS PostgreSQL 11. 10 seconds of downtime here
| was actually incurred not due to the migration itself, but to
| enable MySQL binary logging on the RDS instance, which requires a
| restart, and which AWS DMS uses to replicate changes from the
| replication source (MySQL) to the target (PostgreSQL).
|
| Traffic was steered to the new PostgreSQL instance not with
| DNS/Route 53 weighted records, but an application-level
| percentage enroller (based on user ID). Our particular set-up did
| in fact have apps talking to both databases simultaneously, each
| with a percentage share of traffic, and so we did not need to
| incur additional downtime to cut off traffic to the original
| source database - though now you do have to worry about eventual
| consistency.
|
| I wouldn't recommend using their Schema Conversion Tool. We
| instead migrated the data as 1-to-1 as possible into the
| PostgreSQL target table, and then used plpgsql triggers on the
| replication target table to clean/validate the data and write it
| to another table with our desired (and more strongly typed - no
| more storing JSON as TEXT) schema.
|
| There were also some issues with the replication task seeming to
| stall and stop during the change data capture (CDC) phase. As
| mentioned support is kind of spotty in this area, though we
| learned it may have to do with sizing the DMS replication
| instance correctly.
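|
| The trigger pattern looks roughly like this (names and the
| TEXT-to-jsonb cast are illustrative):
      CREATE FUNCTION promote_event() RETURNS trigger AS $$
      BEGIN
        -- Re-type the landed row into the strongly typed table:
        INSERT INTO events_typed (id, payload)
        VALUES (NEW.id, NEW.payload_text::jsonb);
        RETURN NEW;
      END;
      $$ LANGUAGE plpgsql;

      CREATE TRIGGER promote_event_trg
        AFTER INSERT ON events_raw
        FOR EACH ROW EXECUTE FUNCTION promote_event();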
| zImPatrick wrote:
| I really love how verbose and up to date the UK is with tech
| DeathArrow wrote:
| Seems bad that government services depend on cloud providers.
| callalex wrote:
| Why? Do you deny the benefits of shared hosting in its
| entirety?
| DeathArrow wrote:
| No. But I wish government agencies depended less on other
| entities. A government agency isn't in the business of making
| money. They should be concerned less with economic efficiency
| and more with accomplishing their goals.
|
| They have a fiduciary duty towards the citizens.
| mattsan wrote:
| Out of all the entities to depend on, AWS is tried and
| battle tested with US gov.
|
| Additionally, a pandemic has shown that government services
| actually need to scale to arbitrary amounts, seemingly at
| random.
|
| Don't you think that in this case using AWS is better to
| simply guarantee availability? Imagine taxpayers' money
| being spent on servers, and then a pandemic happens and, when
| the economy is screeching to a halt, more taxpayers' money
| is needed to expand the cluster, money that really would be
| useful to spend elsewhere.
| patmorgan23 wrote:
| And what if AWS is an efficient, effective, and affordable
| way to achieve that goal?
| denysvitali wrote:
| They could probably have used pgpool [1] to avoid relying on DNS
| (you never know who in the chain isn't respecting the TTL).
|
| With this, you can choose which queries end up where, and migrating
| the DB endpoint can be done once, centrally.
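|
| The relevant bit of pgpool.conf is roughly this (parameter names
| are from the pgpool docs; hosts and weights are illustrative):
|
|     backend_hostname0 = 'old-db.internal'   # current primary
|     backend_port0     = 5432
|     backend_weight0   = 1                   # takes the traffic
|     backend_hostname1 = 'new-db.internal'   # migration target
|     backend_port1     = 5432
|     backend_weight1   = 0                   # idle until cutover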
|
| [1]: https://www.pgpool.net/docs/latest/en/html/
| speculator wrote:
| nice writeup. easy to read, easy to understand
| pedrokost wrote:
| We recently completed the migration of a self-hosted 3 TB
| PostgreSQL database from version 12 to 16, transitioning from
| Ubuntu 18 to Ubuntu 22. Concurrently, we had to upgrade various
| extensions, most notably Timescale, for which a compatible
| version did not exist across all scenarios. We performed the
| upgrade by updating a replica in the following sequence:
|
| - Start: PG12, Ubuntu 18, TS2.9
|
| - Step 1: Set up a read-only replica with PG12 on Ubuntu 22,
| maintaining TS2.9.
|
| - Step 1.5: Enter maintenance mode and halt all services.
|
| - Step 2: Detach the read-only replica, upgrading from PG12
| to PG15 on Ubuntu 22 with TS2.9.
|
| - Step 3: Upgrade from PG15 with TS2.9 to TS2.13 on Ubuntu 22.
|
| - Step 4: Upgrade from PG15 to PG16 on Ubuntu 22 with TS2.13.
|
| - Step 4.5: Reconnect services to the new database server,
| resume all services, and exit maintenance mode.
|
| All the database upgrade steps were well-tested and automated
| using Ansible. Nonetheless, we did encounter an issue that had
| not arisen during testing. This extended our downtime to
| approximately half an hour, which, for our use case, was
| perfectly acceptable.
|
| Employing logical replication could have mitigated the last-
| minute surprise, so we will consider this approach for our next
| upgrade cycle.
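|
| For reference, the basic setup would look roughly like this
| (publication/subscription names and the connection string are
| invented):
|
|     -- on the old primary (publisher):
|     CREATE PUBLICATION upgrade_pub FOR ALL TABLES;
|
|     -- on the new, already-upgraded server (subscriber):
|     CREATE SUBSCRIPTION upgrade_sub
|       CONNECTION 'host=old-db dbname=app user=replicator'
|       PUBLICATION upgrade_pub;
|
|     -- once lag hits zero: stop writes, repoint services, then
|     -- DROP SUBSCRIPTION upgrade_sub;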
| pbrooko wrote:
| We recently followed an almost identical upgrade path (however
| at the time PG16 wasn't yet supported by Timescale, so we
| stopped at PG15 + TS 2.12).
|
| We did look into using logical replication to reduce the
| downtime of the upgrade, but because database schema and DDL
| commands aren't replicated, it seems that it isn't recommended
| with Timescale in the loop. (I suppose the underlying schema
| changes that Timescale needs to make under the hood are mostly
| a function of your hypertable chunk sizing and what your
| incoming writes look like, so this could be planned around /
| timed well, but we felt it added too much potential complexity
| & risk compared to simply opting for a small maintenance window
| while pg_upgrade completed).
| globular-toast wrote:
| Is that a picture of them doing it?
|
| What is this weird fantasy that programming looks like this? It's
| always the same: a couple of people, always of diverse
| backgrounds, physically interacting, usually pointing and
| laughing at something, with perhaps another group, representing
| different demographics, looking on with a mixture of curiosity
| and awe.
|
| Do an image search for programming to see more examples of this.
|
| I guarantee this work was the result of guys sitting at desks,
| possibly in the dark, thinking. Are we collectively embarrassed
| that this is what programming looks like? Or are these just
| fantasies of management types who have never written a line in
| their lives?
| johneth wrote:
| Give it a rest, mate.
| globular-toast wrote:
| Nah, this is the thing that leads to open offices etc. I want
| people to understand what programming actually looks like and
| these silly pictures need to go.
| draaglom wrote:
| The primary goal of most large organisations' dev blogs is to
| attract potential candidates to the top of the recruitment
| funnel.
|
| This picture appears to be (presumably) the actual team doing a
| team building activity.
|
| This is both a reasonable choice given the inferred
| instrumental goals of the blog, and an honest reflection of (one
| part of) team life, assuming it's a photo of the real team.
| beejiu wrote:
| They are clearly not programming. It's a team building game
| called the marshmallow challenge.
| alerter wrote:
| Interesting, though I have no idea why the government is using
| AWS in the first place. This isn't a startup hacking away trying
| to find PMF, or dealing with unpredictable marketing-driven
| traffic spikes. We know we need these services running long term,
| and can make solid predictions about usage patterns.
|
| We could build a public sector cloud and/or adopt a sensible on-
| prem approach. This requires funding, coordination and technical
| leadership, but would save the taxpayer an enormous amount over
| the long term.
|
| Public sector IT is a disaster in general ofc, but I know there
| are good engineers working there.
| MrBuddyCasino wrote:
| If you think the public sector could improve its operational
| efficiency by building, operating, and using its own cloud,
| I've got a bridge to sell you.
| rjzzleep wrote:
| I don't know about the UK, but AWS has had GovCloud in the US
| for a long time, and to be honest, compared to a lot of
| infrastructure I have seen there, it's a blessing. On the
| flipside, I've met some really amazing infrastructure and ops
| people in a German gov healthcare institution running the in-
| house DC, where the problem was neither the tech, nor the
| people, but 100% the management and their processes, and their
| desire to be the bottleneck for every single interaction
| between infrastructure and engineering teams.
| StupidOne wrote:
| I'm sure a lot of countries in the EU did this. I know first
| hand they did it in Croatia, as I was one of the developers who
| had to deploy on it. The worst thing I have ever seen in my
| life. And I have worked on a lot of legacy apps written in
| VB.NET, Web Forms, old SharePoint, Basic, and even one where the
| whole app was one big mess of stored procedures.
|
| AWS, Azure, and GC are at least built with thought for end
| users (us, developers), while the government cloud was
| architected, designed, and built by the lowest bidder, whose
| first goal was to cut and cut and cut costs wherever
| possible.
| jamietanna wrote:
| The UK government as a whole operates across the big names in
| the public cloud as well as some on-prem/colo.
|
| I'd very much recommend watching
| https://youtube.com/watch?v=mpY1lxkikqM&pp=ygUOUmljaGFyZCB0b...
| from September about Gov.UK's various iterations and some of
| the migrations across cloud that they've had to do.
|
| One thing about (at least UK) government is that procurement
| requirements mean they go to market for quotes around
| usage every few years. If e.g. Oracle Cloud were 1/10th the
| price, Oracle would likely win the deal, and they'd have to
| migrate to Oracle for the duration of the contract, and then
| potentially do the same with another cloud if that was cheaper.
| j-krieger wrote:
| Disclaimer: I work in the public sector.
|
| If you really think the public sector could build anything
| closely resembling a cloud, you are dreaming. If we imagine one
| cloud working for the entire public sector, we are entering
| delusional territory.
|
| Public sector projects are insanely expensive, take ages to
| develop and often flat out fail.
|
| Not only that, we are starved for engineers even now. If we ran
| such a custom cloud, we would shoot ourselves in the foot by not
| being able to hire from a pool of experienced engineers.
| alerter wrote:
| Oh I totally understand that. Public sector IT is littered
| with failed projects, massive overspends and lasting
| functional problems. There are numerous examples from the
| last few years alone - hundreds of millions spent on broken
| software. I used to work in public affairs and followed a lot
| of those projects closely.
|
| I don't think this means "government = bad at tech" though.
| You sometimes see smaller in-house teams do really good work.
| The biggest issue seems to be with contracting and
| procurement policy. For example, on the Police Scotland i6
| program they hired a bunch of consultancies to write a tender
| document, and then hired CGI for the final project. That
| turned out to be a copy of the Spanish system, which turned
| into a huge and expensive disaster as the requirements
| differed.
|
| Feels like government has a similar problem to a lot of
| legacy non-tech companies. They don't have much technical
| leadership, don't understand the problem, and decide to hand
| it off to the lowest bidder. Doesn't help that they are often
| legally obliged to act this way. But the underlying
| engineering problems aren't unsolvable, and don't need to
| become a huge mess every time. (Your point about recruitment
| is fair though)
| ris wrote:
| > This isn't a startup hacking away trying to find PMF, or
| dealing with unpredictable marketing-driven traffic spikes. We
| know we need these services running long term, and can make
| solid predictions about usage patterns.
|
| If you think government needs and demands are predictable, you
| don't follow politics (particularly UK politics in the last
| decade).
|
| And then there are these things like pandemics that completely
| come out of left field. Being able to scale things on demand
| over the pandemic was one of the key demonstrators for use of
| the public commercial cloud by the public sector.
| objektif wrote:
| Why do we assume scaling can happen only on cloud? Gov sites
| are usually very low tech. We are not mining crypto. I would
| guess even if they bought servers outright and just scrapped
| them later on, they would come out ahead.
| berkes wrote:
| As a startup, my reason for using "cloud" (PaaS) is not to
| catch spikes, but because of focus. Every hour that I spend
| running around with HDDs, screwdrivers (or the cloud version
| thereof - storage, ansible, etc) is an hour that I'm not
| spending on stuff my customers need.
|
| Why would this be any different for a government?
|
| We don't expect our government to build their own cars, but to
| buy them from Volkswagen or Renault. Even when the government
| has a clear need for transport. Why do we then insist they
| build their own IT infrastructure?
| 0dayz wrote:
| Because the government is not a corporation, and its
| obligations are different from those of the private market.
| berkes wrote:
| So should a government (e.g. the police) produce its own
| cars, rather than buy them? Or set up a factory to make
| laptops?
|
| I'm not trying to be pedantic. I'm trying to understand
| where the line lies (and whether that line is universal for
| all governments on all layers in all countries).
| objektif wrote:
| Your examples do not make sense: on-prem existed long
| before cloud, and some shops were perfectly fine with it.
| If the police had been producing perfectly fine cars and then
| went for a car-share program, then I would argue that yes,
| it may be reasonable for them to consider going back
| to producing cars.
| chasd00 wrote:
| The government likely hired a consulting firm and AWS was part
| of the solution proposed and bought.
| plugin-baby wrote:
| This doesn't sound like how GDS operates.
| jm2dev wrote:
| In a similar case, AWS DMS did not work for my team and we
| proceeded with pglogical.
|
| Although it has some limitations, it worked pretty well, and
| personally I like that it is cloud agnostic. I would definitely
| recommend considering pglogical first.
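|
| For the record, the core calls are roughly these (DSNs and names
| are illustrative):
|
|     -- on the provider:
|     SELECT pglogical.create_node(
|       node_name := 'provider',
|       dsn       := 'host=old-db dbname=app'
|     );
|     SELECT pglogical.replication_set_add_all_tables(
|       'default', ARRAY['public']
|     );
|
|     -- on the subscriber:
|     SELECT pglogical.create_node(
|       node_name := 'subscriber',
|       dsn       := 'host=new-db dbname=app'
|     );
|     SELECT pglogical.create_subscription(
|       subscription_name := 'migration',
|       provider_dsn      := 'host=old-db dbname=app'
|     );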
| zeroclicks wrote:
| Let's ignore the technicalities completely. Here's a government
| providing a detailed look into their engineering practices,
| complete with graphs and configuration snippets. How many
| governments manage to publish such a thing?
| glintik wrote:
| 11 seconds of downtime - looks like devops took a smoke break
| mwarkentin wrote:
| Back when we first tried out DMS there was a fun bug where
| booleans were interpreted as strings, so all false records became
| the string "false" and were translated to true on the target DB.
| It was fixed shortly after but was a good reminder to validate
| your data during a migration.
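|
| A cheap check is to compare value distributions on both sides
| after the load, e.g. (table and column names invented):
|
|     SELECT is_active, count(*) FROM users GROUP BY is_active;
|     -- run on source and target; any drift in the per-value
|     -- counts flags a bad type mapping like this one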
___________________________________________________________________
(page generated 2024-01-20 23:02 UTC)