[HN Gopher] How we upgraded our 4TB Postgres database
___________________________________________________________________
How we upgraded our 4TB Postgres database
Author : mrbabbage
Score : 251 points
Date : 2022-04-19 15:07 UTC (7 hours ago)
(HTM) web link (retool.com)
(TXT) w3m dump (retool.com)
| giovannibonetti wrote:
| By the way, Google Cloud recently launched in-place upgrade of
| Postgres instances. A few days ago we used it to upgrade our
| multi TB database in my company as well.
|
| https://cloud.google.com/blog/products/databases/cloud-sql-l...
| davidkuennen wrote:
| Woah, this is great. Been waiting for this, since Cloud SQL has
| been very reliable in the past few years I've been using it,
| but upgrading was always a pain.
| aeyes wrote:
| What is "in-place" about this? According to the docs you'll
| have ~10min downtime and you'll lose table statistics which is
| exactly what happens when you run pg_upgrade manually.
|
| The biggest problem with all the cloud providers is that you
| won't know exactly when this 10 minute downtime window will
| start
|
| I guess the only advantage here is that you don't have to do
| 9->10->11->12->13->14 like in the past and maybe that was one
| of the blockers Azure has. AWS allows skipping some major
| versions but 9->14 is not possible.
| Teletio wrote:
| The first upgrade strategy is not the normal or easy one for
| anything production, btw.
|
| Very small companies might be able to do this on a 'no load' day
| but from a pure business perspective, running your db twice is
| easier and way less risky.
|
| You could have done this even without downtime by letting your
| connection proxy handle the switch.
| gamegod wrote:
| If you're running a 4 TB Postgres database, but you still have to
| worry about this level of maintenance, what's the value
| proposition of using a hosted service? There's usually insane
| markup on any hosted Postgres instance.
|
| If you want to pay multiple thousands of dollars a month for a
| server, it's WAY cheaper just to slap a server with a ton of
| drives in colocation.
| orangepurple wrote:
| Might be an accounting scam^H^H^H^H trick to book the costs as
| an operating expense vs a capital expenditure. In general
| capital expenditures have to be planned carefully, held on the
| books for years, and show a return on investment. Perhaps an
| accountant can provide more detail.
| ipaddr wrote:
| This is a big reason why yearly pricing vs buying hardware is
| popular (one of many reasons).
|
| An expense can be used that year, whereas capital expenses
| vest over years, so only a portion can be applied each year.
| mywittyname wrote:
| I feel like the explanation is much simpler: it's
| easier to use Azure/AWS/GCP for everything.
|
| Yeah, this migration might have been somewhat painful, but it
| looks like it's a situation the company experiences every few
| years.
| Spivak wrote:
| It's crazy how many apps opt for an RDBMS for append-only data
| like audit events. It's so tantalizing at the beginning but turns
| into a nightmare as time marches forward.
|
|     audit events -> queue -> elastic -> blob storage
|
| is so easy to maintain and we save TBs from living in the DB.
| luhn wrote:
| Mind expanding on what the "nightmare" is?
| mrbabbage wrote:
| We (Retool) are going to be doing this very soon and cut our
| database size by 50%+. And you're exactly right: it's so easy
| to get started by sticking audits (or other append-only data
| schema) in an RDBMS, but it quickly becomes a headache and
| bottleneck.
| abrkn wrote:
| What will you use as a queue and storage?
| mrbabbage wrote:
| To be determined!
| baq wrote:
| up until some size the headache from maintaining a separate
| datastore is bigger. everything should be in the RDBMS until
| proven otherwise for the sake of simplicity. it's actually
| amazing how much you can squeeze out of 'old school'
| databases.
|
| e.g. https://docs.microsoft.com/en-us/azure/azure-
| sql/database/le...
| nrmitchi wrote:
| One of the biggest things that keeping audit records in your DB
| gives you is transactionality around your audit logs. Sending
| audit events to an external system (quite often) loses this,
| and the resources to address this before you have to are way
| larger than a slightly larger AWS/GCP/Azure/<insert-computer-
| provider-here> bill.
| magicalhippo wrote:
| We're implementing something similar to what OP describes,
| but we'll keep the "queue" in the DB in order to insert the
| application audit event in the same transaction as the data
| change. A background process then uploads to secondary
| storage.
|
| We won't have billions of rows though, so once uploaded to
| secondary storage we'll just clear the blob field and set a
| "processed" flag.
|
| This way we can find all the relevant keys for a given order,
| invoice etc quickly based on a partial key search in the
| database, and transparently fetch from either db directly or
| secondary storage as needed.
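|
| A minimal SQL sketch of that write path (table and column
| names here are invented):
|
|     BEGIN;
|     UPDATE orders SET status = 'shipped' WHERE id = 42;
|     INSERT INTO audit_outbox (entity, entity_id, payload)
|         VALUES ('order', 42, '{"status": "shipped"}');
|     COMMIT;
|     -- a background job later copies unprocessed rows to
|     -- secondary storage, clears the payload, and sets the
|     -- "processed" flag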
| onphonenow wrote:
| Actually, I've seen more problems with folks mixing lots of
| different tools up than I have from folks doing an append-only
| audit event in an RDBMS.
|
| When your audit trail is in DB, you can pretty easily surface
| audit events to your customers. Who changed what when is just
| another feature. Capturing audit events is also usually pretty
| smooth.
|
| The folks doing the blob storage route, you would not BELIEVE
| the complexity they have to spin up to expose very simple
| histories etc. This matters a LOT in some spaces (financial
| etc), less so in others.
|
| In my RDBMS model, who changed this field when from what to
| what is a basic select. You can even shard by recordID or
| similar if you want to reduce table scans; a good selection of
| indexes etc. can be a huge help as well. In most cases users
| don't mind a bit of latency on these queries.
| jeffbee wrote:
| My only experience in the financial sector indicates the
| opposite. The firm held its trading history for tens of
| thousands of accounts going back 60 years in a SQL Server.
| Anyone who wanted a question answered had to submit it for
| overnight analysis and get the answer the next day. But in an
| optimal non-RDBMS representation, said trading history could
| be condensed to a single 200MB flat file that could be
| queried interactively, in microseconds. Dumping the RDBMS for
| most use cases pretty much revolutionized the daily
| experience for the people at that firm.
| icedchai wrote:
| Were they running this on a 386? Why didn't they optimize
| their database?
| jeffbee wrote:
| It was "optimized" for the aesthetic concerns of RDBMS
| purists and for ease of implementation of certain other
| systems, in other words it was optimal for the developers
| and severely sub-optimal for the users, which is a
| problem endemic to RDBMS fandom.
| icedchai wrote:
| So it was basically "write optimized", fully normalized,
| required at least a half dozen joins to get anything out
| of it?
| onphonenow wrote:
| This seems borderline impossible? I'd be curious if there
| were missing indexes or bad index selection or efforts to
| do things like joins? Normally for very large data sets you
| can shard by account if needed if that's the common filter
| if there is some insane table scan. Audit stuff tends to be
| "sparse" so if you can get an index which just tells you
| which pages have results, that is usually a 100X speedup
| with a pretty small index size.
|
| But agreed, a daily dump to something can go a long way to
| unlocking other tools - in govt this is especially true, not
| because the DBMS is hard to use, but because with so many layers
| of consultants and others in the way it's not usable.
| baq wrote:
| SQL Server can easily handle such datasets if columnstore
| is employed. I wouldn't be surprised if a single weekend of
| building a proper index (...being very generous here)
| made their DB go literally 100x faster.
| jeffbee wrote:
| The problem at that firm wasn't the SQL Server it was the
| RDBMS mindset which led them to do joins between tables
| of journaled trades and much larger tables of corporate
| actions going back decades, instead of materializing a
| more useful intermediate result. This is my main beef
| with RDBMSs: they lead to cognitive hazards of this type.
| It's not that the databases are themselves naturally bad
| systems.
| more_corn wrote:
| Set site read only. Snapshot master. Create new db from snapshot.
| Point site to new db?
| mrbabbage wrote:
| Hey folks--I wrote the post! This was my biggest Postgres project
| to date, and it proved quite tricky since I didn't rehearse with
| a test database of the same size. I learned a bunch about
| Postgres, not least the incredibly powerful NOT VALID option for
| safely and quickly adding constraints.
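|
| If you haven't seen NOT VALID before, the rough shape of the
| pattern (table and constraint names here are invented):
|
|     ALTER TABLE orders
|         ADD CONSTRAINT orders_user_id_fkey
|         FOREIGN KEY (user_id) REFERENCES users (id)
|         NOT VALID;
|
|     -- existing rows get checked later, without the long
|     -- exclusive lock a plain ADD CONSTRAINT would hold:
|     ALTER TABLE orders VALIDATE CONSTRAINT orders_user_id_fkey;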
|
| Happy to stick around and answer any questions you have.
| palijer wrote:
| I'm curious why a test run on a properly sized replica database
| wasn't in the testing plans. That is something I've been
| ensuring happens for a while now for similar projects.
| wrs wrote:
| Wondering why you couldn't use in-place upgrade (pg_upgrade
| [0]). Not supported in Azure?
|
| [0] https://www.postgresql.org/docs/current/pgupgrade.html
| majewsky wrote:
| I'm guessing this would still take more than one hour since
| it needs to rewrite all (or most) of the 4TB?
| aidos wrote:
| Not in my experience. You can use the --link flag to get
| it to use hard links so it doesn't need to move the data at
| all. Have been through this process myself a few times and
| it only took seconds on a 50-100GB db. I'm always a little
| surprised with how well it works.
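|
| For reference, a typical --link invocation looks roughly like
| this (paths assume a Debian-style layout; yours will differ):
|
|     pg_upgrade \
|       --old-bindir  /usr/lib/postgresql/9.6/bin \
|       --new-bindir  /usr/lib/postgresql/13/bin \
|       --old-datadir /var/lib/postgresql/9.6/main \
|       --new-datadir /var/lib/postgresql/13/main \
|       --link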
| aoms wrote:
| This instance must cost you a ton
| mrbabbage wrote:
| More than I care to admit!
|
| As mentioned downthread, we're doing some work soon to remove
| the audit table from the database, which will cut storage
| usage by over 50%.
| tmikaeld wrote:
| Seems to be around $1,784/month according to the pricing
| calculator
| loopdoend wrote:
| Let me save you a bunch of money:
| https://adamj.eu/tech/2021/04/13/reindexing-all-tables-
| after...
| karmelapple wrote:
| This technique saved us from seriously increasing the
| cost of our Heroku Postgres instance. Thank goodness it
| exists and works so well. Multiple 80+ GB indexes shrank
| down to less than 10GB after just a couple of hours.
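|
| If I remember the linked article right, the core of the trick
| is Postgres 12+'s concurrent reindex (index/table names here
| are invented):
|
|     REINDEX INDEX CONCURRENTLY big_table_created_at_idx;
|     -- or rebuild every index on a table:
|     REINDEX TABLE CONCURRENTLY big_table;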
| Teletio wrote:
| What's your strategy?
|
| Using compression? Second instance for old data? Creating
| in-between states?
|
| Just curious :)
| clessg wrote:
| > we're doing some work soon to remove the audit table from
| the database
|
| Out of pure curiosity, what are you replacing it with (if
| anything)? Just a simple rotating log file?
| icheishvili wrote:
| You can partition your audit/event table by time period
| and archive old events [1] or you can avoid the records
| hitting the database in the first place by generating the
| events elsewhere to begin with [2].
|
| [1] https://github.com/icheishvili/audit-trigger [2]
| https://github.com/eulerto/wal2json
| mrbabbage wrote:
| Exact strategy to be determined--we're looking at various
| data layers at the moment. I wish we could do something
| simple like a rotating log file, but we want to be able
| to query it in the app (for instance, to show recent
| logins).
| jamesrr39 wrote:
| Have you considered an OLAP database like Clickhouse or
| QuestDB? An OLAP database would be a much better fit for
| audit tables given the read and append-only writing
| requirements, would compress better, and you might be
| able to fit it directly without changing any app code
| with a PostgreSQL foreign data wrapper.
| carlhjerpe wrote:
| I was thinking TimescaleDB
| clessg wrote:
| Very interesting! Looking forward to the next blog post
| on that ;)
| abrkn wrote:
| Great read! Thanks
| tmikaeld wrote:
| How come you didn't consider a distributed database like
| CockroachDB, instead of Postgres?
| nrmitchi wrote:
| I'm not OP, but they were upgrading from Postgres 9.6, which
| at least _implies_ that this initial db was from ~2017.
|
| This is barely past the initial release of Cockroach. It
| would have been kind of crazy for the Retool team to use an
| experimental db with a lack of history when building up their
| product (that was not dependent on experimental new features)
| tmikaeld wrote:
| Baidu migrated from an old MySQL cluster to CockroachDB for
| their 40TBs of databases, now runs on 20 clusters.
|
| When does it stop being experimental to you?
|
| https://www.cockroachlabs.com/customers/
| flyinknockin wrote:
| I don't think they're calling Cockroach experimental
| right now - rather that it was experimental in 2017.
| burai wrote:
| I used to work at a company that had MongoDB as the main
| database. Leaving a lot of criticism aside, the replica set model
| for Mongo made the upgrades much easier than the ones in other
| types of databases.
| stingraycharles wrote:
| While that's true, managed services on e.g. AWS provide hot
| replicas as well, which you can use to upgrade the database
| and do a failover to the new version.
|
| We actually migrated from vanilla Postgres to Aurora that way
| with minimal risk / downtime, it was a really smooth process.
| sscarduzio wrote:
| Hey @mrbabbage! Retool customer here! Great service :)
|
| A non-distributed RDBMS is a great (yet underrated) choice. Thank
| you for the good writeup.
|
| I was thinking about how you could have a much less delicate
| migration experience next time (some years from now), so you
| could go for a quicker parallel "dump and restore" migration.
|
| For example:
|
| - Client side sharding in the application layer: you could shard
| your customers' data across N smaller DB instances (consistent
| hashing on customer ID)
|
| - Moving the append-only data somewhere else than postgres prior
| to the upgrade. You don't need RDBMS capabilities for that stuff
| anyway. Look at Elasticsearch, Clickhouse, or any DB oriented to
| time series data.
|
| WDYT?
| mrbabbage wrote:
| The second bullet point is underway! Getting audit events out
| of the main database will be a major headache saver.
|
| The first bullet point is on our radar for the near term. We
| have a very natural shard key in our schema (the customer ID),
| with AFAIK no relationships across that shard key. And once we
| start horizontally sharding, we can do cool things like putting
| your data in a shard geographically close to you, which will
| greatly increase app performance for our non-US customers.
| Exciting stuff coming down the pike!
| sscarduzio wrote:
| Oh cool! Yes locality is very important! Great idea :)
| georgewfraser wrote:
| It is amazing how many large-scale applications run on a single
| or a few large RDBMS. It seems like a bad idea at first: surely a
| single point of failure must be bad for availability and
| scalability? But it turns out you can achieve excellent
| availability using simple replication and failover, and you can
| get _huge_ database instances from the cloud providers. You can
| basically serve the entire world with a single supercomputer
| running Postgres and a small army of stateless app servers
| talking to it.
| lazide wrote:
| Well, it's always been that way.
|
| The big names started using no-sql type stuff because their
| instances got 2-3 orders of magnitude larger, and that didn't
| work. It adds a lot of other overhead and problems doing all
| the denormalization though, but if you literally have multi-PB
| metadata stores, not like you have a choice.
|
| Then everyone started copying them without knowing why.... and
| then everyone forgot how much you can actually do with a normal
| database.
|
| And hardware has been getting better and cheaper, which makes
| it only more so.
|
| Still not a good idea to store multi-PB metadata stores in a
| single DB though.
| more_corn wrote:
| Ideally you offer developers both a relational data store and
| a fast key-value data store. Train your developers to
| understand the pros and cons and then step back.
|
| There's nothing inherently wrong with a big db instance. The
| cloud providers have fantastic automation around multi-az
| masters, read replicas and failover. They even do cross
| region or cross account replication.
|
| That being said. Multi PB would raise an eyebrow.
| seunosewa wrote:
| Large solid state drives really changed the game; your
| working set doesn't have to fit in RAM for decent performance
| anymore.
| openthc wrote:
| With postgres you'll want to tune those cost parameters
| however. E.g. lowering the random page cost will change how
| the planner does things on some queries. But don't just
| blindly change it -- modify the value and run the
| benchmark again. The point is that the SSD is not 10x the
| cost of RAM (0.1 vs 1.0). In our case the planner moved a few
| queries to what I always assumed was the slower sequential
| scan -- but it's only slower depending on your table size
| (how tall and how wide). I mean, PG works awesome w/o
| tweaking that stuff but if you've got a few days to play
| with these values it's quite educational.
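|
| Concretely it's a one-liner to try, benchmark, and revert if it
| doesn't help (1.1 is just a common SSD starting point, not
| gospel):
|
|     ALTER SYSTEM SET random_page_cost = 1.1;  -- default is 4.0
|     SELECT pg_reload_conf();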
| jghn wrote:
| > Then everyone started copying them without knowing why
|
| People tend to have a very bad sense of what constitutes
| large scale. It usually maps to "larger than the largest
| thing I've personally seen". So they hear "Use X instead of Y
| when operating at scale", and all of a sudden we have people
| implementing distributed datastore for a few MB of data.
|
| Having gone downward in scale over the last few years of my
| career it has been eye opening how many people tell me X
| won't work due to "our scale", and I point out I have
| _already_ used X in prior jobs for scale that's much larger
| than what we have.
| dzhiurgis wrote:
| Place I've recently left had 10M record MongoDB table
| without indexes which would take tens of seconds to query.
| Celery was running in cron mode every 2 seconds or so
| meaning jobs would just pile up and redis eventually ran
| out of memory. No one understood why this was happening so
| just restart everything after pagerduty alert...
| lazide wrote:
| Yikes. Don't get me wrong, it's always been this way to
| some extent - not enough people who can look into a
| problem and understand what is happening to make many
| things actually work correctly, so iterate with new shiny
| thing.
|
| It seems like the last 4-5 years though have really made
| it super common again. Bubble maybe?
|
| Huge horde of newbs?
|
| Maybe I'm getting crustier.
|
| I remember it was SUPER bad before the dot-com crash, all
| the fake it 'til you make it too. I even had someone
| claim 10 years of Java experience who couldn't write out
| a basic class on a whiteboard at all, and tons of folks
| starting that literally couldn't write a hello world in
| the language they claimed experience in, and this was
| before decent GUI IDEs.
| dzhiurgis wrote:
| Technically we were a tech startup with 10+ "senior"
| engineers which scraped the entire web ;D
| [deleted]
| lazide wrote:
| 100% agree. I've also run across many cases where no-one
| bothered to even attempt any benchmarks or load tests on
| anything (either old or new solutions), compared latency,
| optimize anything, etc.
|
| Sometimes making 10+ million dollar decisions off that gut
| feel with literally zero data on what is actually going on.
|
| It rarely works out well, but hey, have to leave that
| opening for competition somehow I guess?
|
| And I'm not talking about 'why didn't they spend 6 months
| optimizing that one call which would save them $50 type
| stuff'. I mean literally zero idea what is going on, what
| actual performance issues are, etc.
| cortesoft wrote:
| This is funny, because I suffer from the opposite issue...
| every time I try to bring up scaling issues on forums like
| HN, everyone says I don't actually need to worry because it
| can scale up to size X... but my current work is with
| systems at 100X size.
|
| I feel like sometimes the pendulum has swung too far the
| other way, where people deny that there ARE people dealing
| with actual scale problems.
| dmd wrote:
| Yep. I've personally been in the situation where I had to
| show someone that I could do their analysis in a few
| seconds using the proverbial awk-on-a-laptop when they were
| planning on building a hadoop cluster in the cloud because
| "BIG DATA". (Their Big Data was 50 gigabytes.)
| strictfp wrote:
| I agree in principle. But one major headache for us has been
| upgrading the database software without downtime. Is there any
| solution that does this without major headaches? I would love
| some out-of-the-box solution.
| AtlasBarfed wrote:
| Cassandra can do it since it has cell level timestamps, so
| you can mirror online writes and clone existing data to the
| new database, and there's no danger of newer mutations being
| overwritten by the bulk restored data.
|
| Doing an active no-downtime database migration basically
| involves having a coherent row-level merge policy (assuming
| you AT LEAST have a per-row last updated column), or other
| tricks. Or maybe you temporarily write cell-level timestamps
| and then drop them later.
|
| Or if you have data that expires on a window, you just do
| double-writes for that period and then switch over.
| simonw wrote:
| The best trick I know of for zero-downtime upgrades is to
| have a read-only mode.
|
| Sure, that's not the same thing as pure zero-downtime but for
| many applications it's OK to put the entire thing into read-
| only mode for a few minutes at a well selected time of day.
|
| While it's in read-only mode (so no writes are being
| accepted) you can spin up a brand new DB server, upgrade it,
| finish copying data across - do all kinds of big changes.
| Then you switch read-only mode back off again when you're
| finished.
|
| I've even worked with a team that used this trick to migrate
| between two data centers without visible end-user downtime.
|
| A trick I've always wanted to try for smaller changes is the
| ability to "pause" traffic at a load balancer - effectively
| to have a 5 second period where each incoming HTTP request
| appears to take 5 seconds longer to return, but actually it's
| being held by the load balancer until some underlying upgrade
| has completed.
|
| Depends how much you can get done in 5 seconds though!
| kgeist wrote:
| >The best trick I know of for zero-downtime upgrades is to
| have a read-only mode.
|
| I've done something similar, although it wasn't about
| upgrading the database. We needed to not only migrate data
| between different DB instances, but also between completely
| different data models (as part of refactoring). We had
| several options, such as proper replication + schema
| migration in the target DB, or by making the app itself
| write to two models at the same time (which would require a
| multi-stage release). It all sounded overly complex to me
| and prone to error, due to a lot of asynchronous
| code/queues running in parallel. I should also mention that
| our DB is sharded per tenant (i.e. per organization).
| What I came up with was much simpler: I wrote a simple
| script which simply marked a shard read-only (for this
| feature), transformed and copied data via a simple HTTP
| interface, then marked it read-write again, and proceeded
| to the next shard. All other shards were read-write at a
| given moment. Since the migration window only affected a
| single shard at any given moment, no one noticed anything:
| for a tenant, it translated to 1-2 seconds of not being
| able to save. In case of problems it would also be easier
| to revert a few shards than the entire database.
| nuclearnice3 wrote:
| I like this approach.
|
| I'm picturing your migration script looping over shards.
| It flips it to read-only, migrates, then flips back to
| read-write.
|
| How did the app handle having some shards in read-write
| mode pre-migration and other shards in read-write post-
| migration simultaneously?
| kgeist wrote:
| Yes, it simply looped over shards, we already had a tool
| to do that.
|
| The app handled it by proxying calls to the new
| implementation if the shard was marked as "post-
| migration", the API stayed the same. If it was "in
| migration", all write operations returned an error. If
| the state was "pre-migration", it worked as before.
|
| I don't remember the details anymore but it was something
| about the event queue or the notification queue which
| made me prefer this approach over the others. When a
| shard was in migration, queue processing was also
| temporarily halted.
|
| Knowing that a shard is completely "frozen" during
| migration made it much easier to reason about the whole
| process.
| ww520 wrote:
| The system needs to be architected in a certain way to allow
| upgrades without downtime. Something like Command and
| Query Responsibility Segregation (CQRS) would work. An update
| queue serves as the explicit transaction log keeping track of
| the updates from the frontend applications, while the
| databases at the end of the queue apply updates and serve
| as the querying service. Upgrading the live database just
| means having a standby database with new version software
| replaying all the changes from the queue to catch up to the
| latest changes, pausing the live database from taking new
| changes from the queue when the new db has caught up,
| switching all client connections to the new db, and shutting
| down the old db.
| jtc331 wrote:
| Preconditions:
|
| 1. Route all traffic through pgbouncer in transaction pooling
| mode.
|
| 2. Logically replicate from old to new.
|
| For failover:
|
| 1. Ensure replication is not far behind.
|
| 2. Issue a PAUSE on pgbouncer.
|
| 3. Wait for replication to be fully caught up.
|
| 4. Update pgbouncer config to point to the new database.
|
| 5. Issue a RELOAD on pgbouncer.
|
| 6. Issue a RESUME on pgbouncer.
|
| Zero downtime; < 2s additional latency for in-flight queries
| at time of op is possible (and I've done it at scale).
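|
| For reference, steps 2 and 4-6 are plain commands on
| pgbouncer's admin console (port and user here are the usual
| defaults; yours may differ):
|
|     psql -p 6432 -U pgbouncer pgbouncer -c "PAUSE;"
|     # edit [databases] in pgbouncer.ini to point at the new
|     # primary, then:
|     psql -p 6432 -U pgbouncer pgbouncer -c "RELOAD;"
|     psql -p 6432 -U pgbouncer pgbouncer -c "RESUME;"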
| georgewfraser wrote:
| This is such a huge problem. It's even worse than it looks:
| because users are slow to upgrade, changes to the database
| system take years to percolate down to the 99th percentile
| user. This decreases the incentive to do certain kinds of
| innovation. My opinion is that we need to fundamentally
| change how DBMS are engineered and deployed to support silent
| in-the-background minor version upgrades, and probably stop
| doing major version bumps that incorporate breaking changes.
| brentjanderson wrote:
| Depends on the database - I know that CockroachDB supports
| rolling upgrades with zero downtime, as it is built with a
| multi-primary architecture.
|
| For PostgreSQL or MySQL/MariaDB, your options are more
| limited. Here are two that come to mind, there may be more:
|
| # The "Dual Writer" approach
|
| 1. Spin up a new database cluster on the new version.
|
| 2. Get all your data into it (including dual writes to both the
| old and new version).
|
| 3. Once you're confident that the new version is 100% up to
| date, switch to using it as your primary database.
|
| 4. Shut down the old cluster.
|
| # The eventually consistent approach
|
| 1. Put a queue in front of each service for writes, where
| each service of your system has its own database.
|
| 2. When you need to upgrade the database, stop consuming from
| the queue, upgrade in place (bringing the DB down temporarily)
| and resume consumption once things are back online.
|
| 3. No service can directly read from another service's
| database. Eventually consistent caches/projections service
| reads during normal service operation and during the upgrade.
|
| A system like this is more flexible, but suffers from stale
| reads or temporary service degradation.
| jtc331 wrote:
| Dual writing has huge downsides: namely you're now moving
| consistency into the application, and it's almost
| guaranteed that the databases won't match in any
| interesting application.
| msh wrote:
| Migrate the data to a new host having the new version.
| karmakaze wrote:
| The way I've done it with MySQL since 5.7 is to use multiple
| writers of which only one is actively used by clients. Take
| one out, upgrade it, put it back into replication but not
| serving requests until caught up. Switch the clients to
| writing to the upgraded one then upgrade the others.
| dijit wrote:
| Even when it's not a cloud provider, in fact, especially when
| it's not a cloud provider: you can achieve insane scale from
| single instances.
|
| Of course these systems have warm standbys, dedicated backup
| infrastructure and so it's not really a "single machine"; but
| I've seen 80TiB Postgres instances back in 2011.
| cogman10 wrote:
| We are currently pushing close to 80tb mssql on prem
| instances.
|
| The biggest issue we have with these giant dbs is they
| require pretty massive amounts of RAM. That's currently our
| main bottleneck.
|
| But I agree. While our design is pretty bad in a few ways,
| the amount of data that we are able to serve from these big
| DBs is impressive. We have something like 6 dedicated servers
| for a company with something like 300 apps. A handful of
| them hit dedicated dbs.
|
| Were I to redesign the system, I'd have more tiny dedicated
| dbs per app to avoid a lot of the noisy neighbor/scaling
| problems we've had. But at the same time, it's impressive how
| far this design has gotten us, and it appears to have a lot more
| legs on it.
| Nathanba wrote:
| Can I ask you how large tables can generally get before
| querying becomes slower? I just can't intuitively wrap my
| head around how tables can grow from 10gb to 100gb and why
| this wouldn't worsen query performance by 10x. Surely you do
| table partitions or cycle data out into archive tables to
| keep up the query performance of the more recent table
| data, correct?
| WJW wrote:
| At my old workplace we had a few multi-TB tables with
| several billion rows in a vanilla RDS MySql 5.7 instance
| (although it was obviously a sizable instance type),
| simple single-row SELECT queries on an indexed column (ie
| SELECT * FROM table WHERE external_id = 123;) would be
| low single-digit milliseconds.
|
| Proper indexing is key of course, and metrics to find
| bottlenecks.
| cogman10 wrote:
| > I just can't intuitively wrap my head around how tables
| can grow from 10gb to 100gb and why this wouldnt worsen
| query performance by x10
|
| SQL Server data is stored as a B-tree structure. So a 10
| -> 100gb growth ends up being roughly a 1/2 query
| performance slowdown (since it grows by a factor of log
| n) assuming good indexes are in place.
|
| Filtered indexes can work pretty well for improving query
| performance. But ultimately we do have some tables which
| are either archived if we can or partitioned if we can't.
| SQL Server native partitioning is rough if the query
| patterns are all over the board.
|
| The other thing that has helped is we've done a bit of
| application data shuffling. Moving heavy hitters onto new
| database servers that aren't as highly utilized.
|
| We are currently in the process of getting read only
| replicas (always on) setup and configured in our
| applications. That will allow for a lot more load
| distribution.
| AtlasBarfed wrote:
| The issue with b-tree scaling isn't really the lookup
| performance issues, it is the index update time issues,
| which is why log structured merge trees were created.
|
| EVENTUALLY, yes even read query performance also would
| degrade, but typically the insert / update load on a
| typical index is the first limiter.
| abraxas wrote:
| If there is a natural key and updates are infrequent then
| table partitioning can help extend the capacity of a
| table almost indefinitely. There are limitations of
| course but even for non-insane time series workloads,
| Postgres with partitioned tables will work just fine.
| cmckn wrote:
| Well, any hot table should be indexed (with regards to
| your access patterns) and, thankfully, the data
| structures used to implement tables and indexes don't
| behave linearly :)
|
| Of course, if your application rarely makes use of older
| rows, it could still make sense to offload them to some
| kind of colder, cheaper storage.
| noselasd wrote:
| Think of finding a record amongst many as e.g. a binary
| search. It doesn't take 10 times as many tries to find a
| thing (row/record) amongst 1000 as it does amongst 100: a
| binary search needs ~10 steps for 1000 items vs ~7 for 100.
| jlokier wrote:
| A lot depends on the type of queries. You could have
| tables the size of the entire internet and every disk
| drive ever made, and they'd still be reasonably fast for
| queries that just look up a single value by an indexed
| key.
|
| The trick is to have the right indexes (which includes
| the interior structures of the storage data structure) so
| that queries jump quickly to the relevant data and ignore
| the rest. Like opening a book at the right page because
| the page number is known. Sometimes a close guess is good
| enough.
|
| In addition, small indexes and tree interior nodes should
| stay hot in RAM between queries.
|
| When the indexes are too large to fit in RAM, those get
| queried from storage as well, and at a low level it's
| analogous to the system finding the right page in an
| "index book", using an "index index" to get that page
| number. As many levels deep as you need. The number of
| levels is generally small.
|
| For example, the following is something I worked on
| recently. It's a custom database (written by me) not
| Postgres, so the performance is higher but the table
| scaling principles are similar. The thing has 200GB
| tables at the moment, and when it's warmed up, querying a
| single value takes just one 4k read from disk, a single
| large sector, because the tree index fits comfortably in
| RAM.
|
| It runs at approximately 1.1 million random-access
| queries/second from a single SSD on my machine, which is
| just a $110/month x86 server. The CPU has to work quite
| hard to keep up because the data is compressed, albeit
| with special query-friendly compression.
|
| If there was very little RAM so nothing could be kept in
| it, the speed would drop by a factor of about 5, to 0.2
| million queries/second. That shows you don't need a lot
| of RAM, it just helps.
|
| Keeping the RAM and increasing table size to roughly 10TB
| the speed would drop by half to 0.5 million
| queries/second. In principle, with the same storage
| algorithms a table size of roughly 1000TB (1PB) would
| drop it to 0.3 million queries/second, and roughly
| 50,000TB (50PB) would drop it to 0.2 million. (But of
| course those sizes won't fit on a single SSD. A real
| system of that size would have more parallel components,
| and could have higher query performance.) You can grow to
| very large tables without much slowdown.
| the8472 wrote:
| > scalability
|
| You can scale quite far vertically and avoid all the clustering
| headaches for a long time these days. With EPYCs you can get
| 128C/256T, 128PCIe lanes (= 32 4x NVMes = ~half a petabyte of
| low-latency storage, minus whatever you need for your network
| cards), 4TB of RAM in a single machine. Of course that'll cost
| you an arm and a leg and maybe a kidney too, but so would
| renting the equivalent in the cloud.
| Vladimof wrote:
| I don't think that it's amazing.... I think that the new and
| shiny databases tried to make you think that it was not
| possible...
| eastbound wrote:
| React makes us believe everything must have 1-2s response to
| clicks and the maximum table size is 10 rows.
|
| When I come back to web 1.0 apps, I'm often surprised that it
| does a round-trip to the server in less than 200ms, and
| reloads the page seamlessly, including a full 5ms SQL query
| for 5k rows and returned them in the page (=a full 1MB of
| data, with basically no JS).
| icedchai wrote:
| Caches also help a ton (redis, memcache...)
| nesarkvechnep wrote:
| Also HTTP caching. It's always funny to me why people, not
| you in particular, reach for Redis when they don't even use
| HTTP caching.
| markandrewj wrote:
| Scaling databases vertically, like Oracle DB, was the norm in
| the past. It is possible to serve a large number of users, and
| data, from a single instance. There are some things worth
| considering though. First of all, no matter how reliable your
| database is, you will have to take it down eventually to do
| things like upgrades.
|
| The other consideration that isn't initially obvious, is how
| you may hit an upper bound for resources in most modern
| environments. If your database is sitting on top of a virtual
| or containerized environment, your single instance database
| will be limited in resources (CPU/memory/network) to a single
| node of the cluster. You could also eventually hit the same
| problem on bare metal.
|
| That said there are some very high density systems available.
| You may also not need the ability to scale as large as I am
| talking, or choose to shard and scale your database
| horizontally at a later time.
|
| If your project gets big enough you might also start wanting to
| replicate your data to localize it closer to the user. Another
| strategy might be to cache the data locally to the user.
|
| There are positives and negatives with a single node or cluster.
| If Retool's database were clustered, they would have been able to
| do a rolling upgrade though.
| booleanbetrayal wrote:
| Low / zero downtime is totally achievable with pglogical and
| really boils down to whether or not you want to try to adopt bi-
| directional writes (and conflict management / integrity issues),
| or if you're willing to just have a brief session termination
| event and swapover. To me, the latter has generally been
| preferable as conflict management systems tend to be more
| complicated in reality (based on business logic / state) than
| what pg_logcical provides. Interested if people here have had
| success with bi-directional writes though.
| bogomipz wrote:
| >"Last fall, we migrated this database from Postgres version 9.6
| to version 13 with minimal downtime."
|
| I thought it was interesting that they upgraded 4 major version
| numbers in one go. I kept expecting to read something about
| version compatibility and configuration but was surprised there
| was none. Are major upgrades like this just less of an issue with
| Postgres in general?
| rkwasny wrote:
| "However, on our 4 TB production database, the initial dump and
| restore never completed: DMS encountered an error but failed to
| report the error to us."
|
| THERE IS NO CLOUD: It's just someone else's computer
| forinti wrote:
| I recently upgraded a 9.x database to 14 using pg_upgrade and it
| was just simple and fast. No issues whatsoever.
|
| I only wish my Oracle updates were so simple and bug-free.
| VincentEvans wrote:
| Why is SQL Server able to backup and restore databases 100s of
| gigabytes in size in single-digit minutes, while Postgres is at
| least 10x slower?
| dboreham wrote:
| Perhaps time to re-architect to avoid having a huge monolith
| database, particularly since the SaaS product is inherently
| multi-tenant?
| wenbin wrote:
| For listennotes.com, we did a postgres 9.6 => 11 upgrade (in
| 2019), and 11 => 13 upgrade (in 2021). ~0 downtime for read ops,
| and ~45 seconds downtime for write ops.
|
| Our database is less than 1TB. One master (for writes + some
| reads) + multiple slaves (read-only).
|
| Here's what we did -
|
| 1, Launched a new read-only db with pg9.6, let's call it DB_A.
|
| 2, Stopped all offline tasks, and only maintained a minimal fleet
| of online servers (e.g., web, api...).
|
| 3, Changed all db hosts (no matter master or slave) in /etc/hosts
| on the minimal fleet of online servers (e.g., web, api...) to use
| old read-only db with pg9.6, let's call it DB_B. From this point
| on, all write ops should fail.
|
| 4, Ran pg_upgrade (with --link) on DB_A to upgrade to pg11, and
| promoted it to be a master db.
|
| 5, Changed /etc/hosts on the minimal fleet of online servers
| (e.g., web, api...) to use DB_A for all db hosts. By this point,
| DB_A is a master db. And write ops should be good now.
|
| 6, Changed /etc/hosts for all other servers and brought back all
| services.
|
| Step 4 is the most critical. If it fails or runs too long (e.g.,
| more than 10 minutes), then we had to rollback by changing
| /etc/hosts on those online servers.
|
| We carefully rehearsed these steps for an entire week, and timed
| each step. By the time we did it on production, we knew how many
| seconds/minutes each step would take. And we tried to automate as
| many things as possible in bash scripts.
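|
| (For anyone unfamiliar with the /etc/hosts trick in steps 3 and
| 5: the app servers resolve the db hostname locally, so the
| switch is a one-line edit. Hostnames and IPs below are made up.)
|
|     # /etc/hosts on each app server, before step 3:
|     10.0.0.5   db-master.internal
|     # after step 5, the same name points at DB_A:
|     10.0.0.7   db-master.internal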
| aolle wrote:
| wow
| karmelapple wrote:
| We did something similar recently jumping from 10 to 13. We
| took measurements, did some dry runs, and came up with
| strategies to ensure our read-only followers would work fine
| and we'd have a minimum downtime for writes.
|
| We missed one or two pieces of reconnecting things afterwards,
| and some of that seems to be limitations of Heroku Postgres
| that we couldn't change. Hopefully those keep improving.
| hankman86 wrote:
| Having successfully built (and sold!) a technology startup
| myself, I would always, always opt for a managed database
| service. Yes, it's more expensive on paper and you want to run
| the numbers and choose the right offering. But nothing beats the
| peace of mind of storing your customers' data on a system that
| others (Google Cloud in our case) look after. Not to mention that
| you're better off focussing on your core value proposition and
| spending your scarce resources there than to have a database
| administrator on your payroll.
| pojzon wrote:
| This only works up to some scale. The startup from my previous
| contract reached the point where "using a bigger instance" would
| be losing them money.
|
| On the other hand, self-built databases maintained by
| professionals were A LOT cheaper.
|
| We are talking about bills of millions of dollars per month
| just for databases.
|
| Self hosted solution did cost around 300k per month.
|
| This included precompiled kernel patches, tweaks to postgres
| engine etc.
|
| Overall the investment of a year of work by db admins will
| probably return itself ten times over, if not more.
| lijogdfljk wrote:
| How does MySQL fare in this type of setup? Do the DBs differ
| greatly?
| thatwasunusual wrote:
| I worked for a company that did a similar "upgrade by
| replication", but with MySQL. It's quite a few years ago, so I
| don't remember the versions involved, but it was quite
| straightforward once we had done _weeks_ of test runs on a dev
| environment.
|
| One invaluable thing, though: our application was from the
| beginning designed to do 100% of all the reads from a read-only
| slave _if the slave was up to sync_ (which it was 95% of the
| time). We could also identify testers/developers in the
| application itself, so we had them using the upgraded slave for
| two weeks before the actual upgrade.
|
| This made it possible for us to filter out problems in the
| application/DB-layer, which were few - few enough that it felt
| more like a minor version upgrade.
|
| But upgrading by replication is something I can recommend.
| evanelias wrote:
| MySQL's built-in replication has always been logical
| replication, and it officially supports replicating from an
| older-version primary to newer-version replicas. So similar
| concept to what's described here, but much simpler upgrade
| process.
|
| Generally you just upgrade the replicas; then promote a replica
| to be the new primary; then upgrade the old primary and turn it
| into a replica.
|
| The actual "upgrade" step is quite fast, since it doesn't
| actually need to iterate over your tables' row data.
|
| At large scale, the painful part of major-version MySQL
| upgrades tends to be performance testing, but that's performed
| separately and prior to the actual upgrade process. Third-party
| tools (pt-upgrade, proxysql mirroring, etc) help a lot with
| this.
| cube00 wrote:
| _> To resolve this, we ended up choosing to leave foreign key
| constraints unenforced on a few large tables.
|
| > We reasoned this was likely safe, as Retool's product logic
| performs its own consistency checks, and also doesn't delete from
| the referenced tables, meaning it was unlikely we'd be left with
| a dangling reference._
|
| I was holding my breath here and I'm glad these were eventually
| turned back on.
|
| Nobody should ever rely on their own product logic to ensure
| consistency of the database.
|
| The database has features (constraints, transactions, etc) for
| this purpose which are guaranteed to work correctly and
| atomically in all situations, such as database-initiated
| rollbacks, that your application will never have control over.
| phphphphp wrote:
| Does that pattern you describe require any considerations when
| writing code? I'm thinking of applications I've worked on where
| events are triggered by change, and so a database rolling back
| independent of my application would be a nightmare. I treat the
| database as a place to store data, not an authority: the
| application is the authority. Do you approach it differently?
| Thanks!
| kijin wrote:
| ON DELETE CASCADE can be dangerous when used with
| applications that expect to be notified of deletions, like in
| your case.
|
| Ideally, everything that needs to change when a row is
| deleted would be changed automatically and atomically using
| database-side constraints and triggers. In practice,
| applications often need to sync state with external services
| that the database knows nothing about, so I understand your
| concerns.
|
| ON DELETE RESTRICT, on the other hand, will result in errors
| just like any other query error that you can handle in your
| application. Nothing happened, so there's nothing to be
| notified of.
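|
| Declaratively the difference is just the referential action on
| the constraint (table and column names invented):
|
|     ALTER TABLE line_items
|         ADD CONSTRAINT line_items_order_id_fkey
|         FOREIGN KEY (order_id) REFERENCES orders (id)
|         ON DELETE RESTRICT;  -- fail loudly instead of
|                              -- silently cascading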
| brightball wrote:
| The database is the only place that can be the authority
| because the application can have race conditions. It's the
| only way to guarantee data integrity.
| grogers wrote:
| There's no way to specify every single application specific
| constraint directly in the database. Race conditions are
| not present when using locking reads (select ... for
| update, or DB specific shared locking selects) or
| serializable isolation level, which are the typical way of
| enforcing application level constraints.
| icedchai wrote:
| You'd be surprised. I used to work on a product where the lead
| developer made sure foreign keys were NOT enabled on
| production. They were only "allowed" in dev. Some teams have a
| strict "no foreign keys" rule.
| oppositelock wrote:
| It's difficult to make a blanket statement like this.
|
| I've built some very high throughput Postgres backed systems in
| my years, and doing application side foreign key constraints
| (FKC) does have its benefits. Doing this client side will
| result in constraints that are usually, but not always, in sync
| with data. However, this kind of almost-consistency lets you do
| much higher throughput queries. An FKC is a read on every
| write, for example, and does limit write throughput. Of course,
| this isn't ok for some workloads, and you do proper FKC in the
| DB, but if you don't need absolute consistency, you can make
| writes far cheaper.
| javajosh wrote:
| The trade-offs between foreign key constraints vs none are
| almost identical to the trade-offs between static typing vs
| dynamic typing. Nowadays people realize that when they turn
| off these features, they'll eventually have to re-
| implement them later.
| WJW wrote:
| You make this claim as if this happens to every company
| sooner or later, but if a company the size of GitHub can
| still do without ( https://github.com/github/gh-
| ost/issues/331#issuecomment-266...) it does become a little
| bit of a "you do not have google problems" type discussion.
|
| (Perhaps you do have such problems, I don't know where you
| work! But 99%+ of companies don't have such problems and
| never will.)
| [deleted]
| jasfi wrote:
| True, but the DB constraints often aren't enough. I like to
| have a verify program to check that the DB is as expected.
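|
| E.g. a verify pass for dangling references can be as simple as
| this (table and column names invented):
|
|     SELECT c.id
|     FROM child c
|     LEFT JOIN parent p ON p.id = c.parent_id
|     WHERE c.parent_id IS NOT NULL
|       AND p.id IS NULL;  -- any rows here violate the implied FK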
| weird-eye-issue wrote:
| You don't trust DB constraints?
| [deleted]
| deathanatos wrote:
| Depends on the database, sometimes the database config, as
| to whether they'll actually be enforced or not, or in what
| situations data might evade enforcement of the
| constraints...
|
| Applies to vendors, too. Had some data in Rackspace Files
| where "list files" would say X existed, but "GET X" got you
| a 404. Had an AWS RDS instance; query on it returned no
| results. Adding the "don't use the index" index hint caused
| it to return data. (Allegedly, this bug was fixed, but we
| had migrated off by that point, so I never got to confirm
| it.)
|
| Conversely, I do like DB constraints, because if the DB
| constraint _doesn't_ exist, then I guarantee you the
| production DB has a row that is a counter-example to
| whatever constraint you think the data _should_ obey...
| majewsky wrote:
| > Had some data in Rackspace Files where "list files"
| would say X existed, but "GET X" got you a 404.
|
| Well yes, Rackspace Files (aka OpenStack Swift) is
| eventually consistent. It says so literally in the _first
| sentence_ of the documentation [1]. But this discussion
| is about relational databases with ACID guarantees, where
| the C is literally "consistent".
|
| [1] https://docs.openstack.org/swift/latest/
| rattray wrote:
| They're limited in what they can express; your application
| often has invariants you want to enforce/maintain that
| can't be (performantly) expressed with DB constraints, and
| must be validated another way.
|
| As great as it can be to enforce rules within the database,
| a lot of them usually end up needing to be enforced at the
| application layer instead. Especially when performance at
| scale comes into play.
| brightball wrote:
| I think it's a balance. Transactions + Constraints can
| enforce most things but there will certainly be things
| that can only be verified in the app.
|
| My goal is always to verify what I can in the database to
| minimize potential data cleanup. In my experience, app
| only verification always leads to future time investments
| to clean up the mess.
| jasfi wrote:
| Make no mistake, I think DB constraints are a best
| practice.
| WJW wrote:
| They have their place, but also their limits.
| simonw wrote:
| I found the script they used for copying data really interesting:
| https://gist.github.com/peterwj/0614bf6b6fe339a3cbd42eb93dc5...
|
| It's written in Python, spins up a queue.Queue object, populates
| it with ranges of rows that need to be copied (based on min < ID
| < max ranges), starts up a bunch of Python threads and then each
| of those threads uses os.system() to run this:
| psql "{source_url}" -c "COPY (SELECT * FROM ...) TO STDOUT" \
| | psql "{dest_url}" -c "COPY {table_name} FROM STDIN"
|
| This feels really smart to me. The Python GIL won't be a factor
| here.
| teej wrote:
| For ETL out of Postgres, it is very hard to beat psql.
| Something as simple as this will happily saturate all your
| available network, CPU, and disk write. Wrapping it in Python
| helps you batch it out cleanly.
|
|     psql -c "..." | pigz -c > file.tsv.gz
| mrbabbage wrote:
| Thanks Simon! I can indeed confirm that this script managed to
| saturate the database's hardware capacity (I recall CPU being
| the bottleneck, and I had to dial down the parallelism to leave
| some CPU for actual application queries).
| SahAssar wrote:
| Sounds to me like this is the exact thing that the normal
| parallel command was made for, not sure python is needed here
| if the end result is shelling out to os.system anyway.
| valzam wrote:
| I am generally a fan of using as few moving parts as possible but
| if > 60-70% (2TB + a "few hundred GB") of your prod database are
| an append-only audit log, surely it would make sense to split that
| part into a separate DB server? Especially when you are using a
| hosted service. It sounds like both uptime and consistency
| requirements are very different between these two parts of the
| production data.
| ranyefet wrote:
| My thoughts exactly.
|
| Does Postgres still make sense for append-only tables, or would
| Elastic or another kind of database be more suitable?
| abraxas wrote:
| Postgres makes a lot of sense with append only tables. You
| can easily partition them by time (usually) and thus have an
| easy way to break up the index trees as well as using a cheap
| indexing scheme like BRIN and being able to just drop old
| chunks as they become irrelevant.
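|
| A minimal sketch of that setup (Postgres 11+ so the index
| cascades to partitions; names invented):
|
|     CREATE TABLE audit_events (
|         created_at timestamptz NOT NULL,
|         payload    jsonb
|     ) PARTITION BY RANGE (created_at);
|
|     CREATE TABLE audit_events_2022_04
|         PARTITION OF audit_events
|         FOR VALUES FROM ('2022-04-01') TO ('2022-05-01');
|
|     CREATE INDEX ON audit_events USING brin (created_at);
|
|     -- retiring old data is a cheap metadata operation:
|     DROP TABLE audit_events_2022_03;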
| sega_sai wrote:
| Using pg_upgrade I recently updated this 100Tb sized DB (from
| PG12 to PG13) in ~10 min of downtime.
|
|     => select pg_size_pretty(pg_database_size('**'));
|      pg_size_pretty
|     ----------------
|      99 TB
|     (1 row)
|
| (The re-analyze of tables took a day or so though)
| riku_iki wrote:
| Curious what your storage story is. Where is the DB actually
| stored? Some NAS?
| sega_sai wrote:
| It's DAS. Mostly it's in one box of RAIDed 32x5.5Tb drives
| with a couple of tablespaces/WAL elsewhere. The DB is mostly
| read-only and not many concurrent users, so that's probably
| not the most typical case.
| comboy wrote:
| Yeah, my petty 6Tb also went fine with pg_upgrade and
| practically no downtime. Upgrade slave, promote to master,
| upgrade master and then promote it back. It's a marvelous piece
| of technology.
|
| It's really just a handful of core people who did most of the
| work, crafting it so thoughtfully over the years and it has
| such a huge impact on the world.
|
| Doing a huge part of it before postgresql was as popular as it is
| today, spending countless hours on making some great design
| choices and implementing them carefully.
|
| It seems unlikely any of them will read that but I'm so deeply
| grateful to these people. It allowed so many things to flourish
| on top thanks to it being open source and free.
| jeffrallen wrote:
| Nifty. But I can't help thinking this was harder than it needed
| to be in the cloud. Because frankly, 4 TB is not big: my home
| Synology backup server is 4 TB. Making a pair of standalone Linux
| servers to rehearse this locally, and with full control of which
| Postgres modules and other software to use, would have made
| things easier, it seems.
|
| Anyway, thanks for food for thought.
| TruthWillHurt wrote:
| Wow this takes me back 20 years ago when we did this kind of
| partial-dump+sync migrations of mysql.
|
| Then the cloud and DBaaS was invented.
| electroly wrote:
| In SQL Server you just... do the upgrade. You install the upgrade
| on your nodes starting with the passive nodes, and it will
| automatically failover from the old version to the new version
| once half the nodes have been upgraded. No downtime, but your
| redundancy drops when some nodes have been upgraded but the
| cluster hasn't fully been upgraded yet. You certainly don't have
| to dump and restore your database. Without giving private
| numbers, our database is much bigger than OP's 4TB; dump and
| restore would be wildly unacceptable.
|
| The idea that you don't get a seamless upgrade of the database
| itself with PostgreSQL is absurd to me. The part about
| "maximizing the amount of time this upgrade buys is" is only
| necessary because of how difficult PostgreSQL makes upgrades. We
| upgrade to every new version of SQL Server. It's not that big of
| a deal.
|
| With every PostgreSQL blog article I read, I become more and more
| of an SQL Server fanboy. At this point it's full-blown. So many
| "serious business" PostgreSQL ops posts are just nothingburgers
| in the SQL Server world.
| calpaterson wrote:
| PG is far behind SQL Server on ease of upgrade but the method
| described in this post is not the best practice right now,
| which I think is:
|
| - physical restore to new cluster
|
| - pg_upgrade the new cluster
|
| - catch up on logical wal logs from old cluster
|
| - failover to new cluster
|
| - STONITH
|
| I think the above was not open to them because of the
| limitations of their managed PG instance. I haven't used Azure
| but GCP managed SQL has loads of limitations. It seems very
| common and I think is a major (and undiscussed) drawback of
| these managed instances.
|
| But the truth is that very few of the people who use PG want to
| hear that things are better in the MS SQL community for reasons
| of prejudice and as a result you're being downvoted unfairly
| for pointing out PGs relative backwardness here.
| Teletio wrote:
| Any real facts you can show besides your sentiment on blog
| posts?
| electroly wrote:
| What facts are you looking for? I just described the steps
| from this document: https://docs.microsoft.com/en-us/sql/sql-
| server/failover-clu... -- specifically, the "Perform a
| rolling upgrade or update" section. There's nothing else to
| my post other than contrasting the SQL Server upgrade process
| to the one described in the article, and musing about my
| growing appreciation for SQL Server; I apologize if it seemed
| like it was going to be deeper than that.
|
| EDIT: I realized you're looking for the other PostgreSQL blog
| posts. Here's an example of two recent HN posts about
| PostgreSQL issues that I pulled out of my comment history.
| Both of these blog posts exist because PostgreSQL doesn't
| have query hints. SQL Server has them; I've dealt with issues
| like these blog posts describe but they have trivial fixes in
| the SQL Server world. Nothing to write a blog post about. I
| don't have a link handy regarding PostgreSQL's txn id
| wraparound problem, but SQL Server doesn't have that problem,
| either.
|
| - https://news.ycombinator.com/item?id=30296490
|
| - https://news.ycombinator.com/item?id=29981737
| aidos wrote:
| It's been a long time since I used SQL Server so I don't know
| that upgrade process well (I'm willing to believe it's smoother
| though, especially wrt to replication / failover).
|
| Keep in mind that they're upgrading from a database version
| that's almost 6 years old. Postgres has improved a lot in the
| last 5 major versions since then.
|
| Another thing here is that I'm pretty sure they could have just
| done the in-place upgrade and it would have been fine. I've run
| pg_upgrade myself for a bunch of major versions now and it's
| easy and doesn't require dumping / restoring anything. Maybe
| there's something else going on that I'm missing though.
|
| What setup are you running with SQL Server to have it
| automatically fail over? Is it a multi-master configuration or
| are the additional nodes just read replicas?
|
| These days Postgres actually allows logical replication so your
| servers can be running different versions at the same time,
| which allows for much smoother upgrades (haven't tried that
| myself yet, don't quote me on it!)
| Symbiote wrote:
| They did not dump and restore.
|
| They created a replica database running the new version, then
| switched over to it. Not too dissimilar to what you describe,
| although more work since they started out with only a single
| instance without replication support.
| electroly wrote:
| They _ultimately_ didn't dump and restore, but it was the
| first thing they tried. It didn't work; it actually failed
| catastrophically for them. They describe this under the
| "Implementing logical replication" section. Their ultimate
| solution is what they tried after the dump-and-restore based
| DMS method failed and they took an unplanned outage due to
| yet more PostgreSQL-specific issues (unvacuumed tuples).
|
| All of this is exactly what I'm talking about. This blog post
| describes kind of a nightmare process for something that is
| trivial in SQL Server. They actually needed a third party
| product from Citus just to successfully upgrade PostgreSQL!
| Stunning.
| VincentEvans wrote:
| Postgres dump and restore tooling is very poor performance-
| wise, easily 10x slower compared to SQL Server. I love
| Postgres dearly and prefer to use it despite that, but I
| wish Postgres devs renewed their interest in improving
| neglected dump/restore tooling.
___________________________________________________________________
(page generated 2022-04-19 23:00 UTC)