[HN Gopher] It's fine, Rewind: Revert a migration without losing...
___________________________________________________________________
It's fine, Rewind: Revert a migration without losing data
Author : ksec
Score : 128 points
Date : 2022-03-24 12:16 UTC (10 hours ago)
(HTM) web link (planetscale.com)
(TXT) w3m dump (planetscale.com)
| losvedir wrote:
| This is pretty amazing. What are the restrictions on it working?
| For example, suppose we had a non-NULL column A which we drop in
| the migration, and new records come in without A data. That works
| on the new table, but if you revert, presumably you would lose
| those records since they can't be added to the old schema. Does
| it prevent you from rolling back, or let you roll back to a
| modified previous schema that allows NULLs, or does it roll back
| and drop those new records?
| shlomi-noach wrote:
| Engineer at PlanetScale; If you drop columns that are `NOT NULL
| DEFAULT <something>`, and then you insert some new rows to your
| newly-versioned table, then you're in a good spot: when you
| revert, those columns will get the DEFAULT value on those rows.
|
| If those columns were `NOT NULL` and with no DEFAULT, then you
| are unable to rewind. The rewind process will make an attempt
| -- after all, maybe you didn't add new rows; maybe you just
| deleted or updated -- but if you did INSERT new rows, then the
| Rewind process will fail (and you will be notified that rewind
| is impossible).
|
| There's a couple more interesting scenarios, see this doc page
| for more: https://docs.planetscale.com/concepts/deploy-requests
| omegalulw wrote:
| This is why you should put minimal consistency checks in your
| DB directly. Put them in your storage later instead.
| cesaref wrote:
| So that you can cope with someone messing up a deployment
| and wanting to roll it back?
|
| Let's say you avoid column restrictions in your db design,
| and you deploy some dodgy code that doesn't fill in a
| column correctly, now you've got heaps of dubious data
| cluttering your database, and apps failing because some
| precondition isn't being met. That doesn't sound like a
| good compromise to avoid headaches when updating a schema,
| does it?
|
| So as ever, it's about making the right decisions on a case
| by case basis, and there is no 'one size fits all' for
| stuff like this.
| mixedCase wrote:
| That would create an explosion of code to handle edge cases
| that should never happen when interacting with the DB, or
| worse, pretending everything's alright and letting
| everything blow up in your face later.
| robertlagrant wrote:
| Does it only work if no data has been written to the new
| structure? E.g. drop a column, and new record comes in without
| that column? What then?
| shlomi-noach wrote:
| Engineer at PlanetScale here: it _does_ work if data has been
| written to the new structure!
|
| In your scenario, you drop a column, populate some new rows in
| the new structure. Then, you regret the migration and rewind.
| You get the column back with all the pre-dropped values, AND
| you get to keep all the new rows which you've inserted.
|
| The values for the now-restored column for the newly inserted
| rows is the DEFAULT value per column definition.
| mnapoli wrote:
| This is pretty incredible TBH
| asleepawake wrote:
| that's awesome but when would it be useful? wouldn't that
| lead to data loss?
| shlomi-noach wrote:
| Engineer at PlanetScale; it will let you go back to safety
| _without_ data loss, and without making your database
| inconsistent.
|
| If you will indulge a realistic story; I've been through
| this process multiple times in production.
|
| You change a large table via ALTER TABLE; you possibly
| change a data type, or drop a column, or modify an index.
| The change takes 5 hours to complete - and things go bad.
| Testing in staging was good, but as it turns out the
| production environment cannot cope with the changes and
| still needs the previous schema. Some traffic is still able
| to pass through, but some requests are erroring.
|
| What do you do?
|
| One option is to run another ALTER TABLE that takes you
| back into the original schema. This will take yet another 5
| hours, during which your app may be degraded or altogether
| down. Plus you'll be unable to recover lost data (such as
| in a DROP COLUMN scenario). Another option is to do a point
| in time recovery for your entire database. This will both
| take time, but more importantly you will lose all the data
| you've accumulated since the migration completed. Any new
| user account, any new artifact, any new event - will be
| lost. Rows that were deleted suddenly reappear. Data that
| should not be available anymore suddenly is.
|
| Most people will try a third option: do a point in time
| recovery on an offline server, and extract/copy just the
| specific table and copy it onto production. Typically this
| involves a lot of juggling and most environments will not
| have the infrastructure to automate the entire process. But
| even once this is done, you're still hit with the
| unfortunate implication: your data set is now both
| incomplete as well as inconsistent.
|
| It is incomplete because data is missing from the restored
| table. Any rows accumulated since the point in time
| recovery point - are lost. It is inconsistent, because in
| many cases, due to the natural relational design of your
| schema, other tables will have rows that relate to the
| missing restored table's rows. You may try to then manually
| backfill those missing rows into the restored table (or
| remove rows previously deleted) , but in reality some
| processes will already have manipulated the data on the
| restored table even while you're trying to resolve the
| situation, leading to more conflicts.
|
| It seems like the only safe way is to take everything
| offline, disable any writes to the broken tables as well as
| some of, or all tables, associated with it, resolve all
| conflicts, then restore data onto production and enable
| writes again. Or, you choose to lose data, track down any
| known conflicts, reach out to users and inform them of the
| data loss. Either way this has a significant impact on your
| service.
|
| And so Rewind offers an instant fall back to your previous
| schema, while still retaining any data you've accumulated
| since the time of incident. Rewind resolves the differences
| between previous and current schema, and adapts the latest
| data changes onto the old schema. As you rewind the
| migration your table still has the same amount of rows, and
| maintains all incoming or outgoing references from and to
| other tables. It all happens on your production environment
| and does not require an offline server.
|
| Here's a technical description of how Rewind works:
| https://planetscale.com/blog/behind-the-scenes-how-we-
| built-...
| aboodman wrote:
| This is a really clear description of the problem and
| solution. Thanks!
|
| I recommend amending the blog (or maybe making a new
| post) with this exact content. I regularly run SQL
| databases for smaller projects but was not able to
| immediately conceptualize how I would use this feature
| from just the blog post. Maybe your target audience would
| be able to? But I don't see the downside in just spelling
| it out clearly!
| edf13 wrote:
| Sounds a bit iffy if your prod deployment fails after
| testing... surely the correct way is to have staging
| pickup any issues prior to prod
| ledauphin wrote:
| lol.
|
| I mean, yes... but also - have you really never seen a
| bug make it to production?
| edf13 wrote:
| Yes - but this kinda gives it an easy pass. Makes it
| easier to 'move fast and break things' which isn't always
| best
| rorymalcolm wrote:
| If you drop a 'title' column from a users table, for
| example, you can revert and have the title column reappear
| with the dropped data, new users added during this time
| (while the column was dropped) will not have a title.
| codingdave wrote:
| Just to clarify my understanding -
|
| If we extend that scenario a bit to dropping a title
| column and at the same time adding a foo column. Then add
| rows with data in foo. Then revert. Do you lose the foo
| data?
|
| Alternatively, can we separate those actions out? Drop
| columns in one migration. Add columns in another. Add
| rows and data. Then revert only the migration where
| columns were dropped, keeping the more recent adds?
| shlomi-noach wrote:
| Right. So if you're both adding one column and removing
| another, then the revert will lose your new column and
| will regain your old column. Normally, you deploy DB and
| app in steps. E.g. if your migration adds a new column,
| your app is not yet aware of the column (or else your app
| would break). The moment the migration completes, your
| app is still on the not-knowing state. It takes an app
| deployment to actually start utilizing the new column. If
| you do that, and then want to revert -- you will lose any
| new data you've added to the new column.
|
| In my experience, when a schema migration goes wrong, it
| goes wrong with a bang. It takes seconds to maybe one
| minute until pagers are alarming. So I'd say in a common
| scenario you will not get to deploy your app with the new
| column awareness, because you'll have realized the
| migration was bad right away.
|
| > Alternatively, can we separate those actions out? Drop
| columns in one migration.
|
| If you choose to do that, then you're on safer grounds;
| it costs you some wall clock time, because migrations do
| take a while to complete on medium to large tables.
|
| Do note that Rewind only lets you rewind your most recent
| deployment (PlanetScale's app will not let you run the
| next migration before you've committed to, or have
| rewinded, the previous one).
| robertlagrant wrote:
| Makes sense! Thanks. And congratulations.
| robertlagrant wrote:
| And now I have a follow-up!
|
| So what if you upgrade your data structure, add a new
| required column and a record is inserted. You then roll back,
| removing the column, in order to fix a bug. When you roll
| forward again, is that removed/readded column's data for the
| inserted record still there?
| epberry wrote:
| P.S. you can do the forward version of this, online schema
| changes, directly in the Arctype SQL GUI
| https://arctype.com/blog/changelog/planetscale-exports/. I
| imagine we'll have support for Rewind too in the future. Game
| changing feature.
| ZeroCool2u wrote:
| I'm not really the target audience for PlanetScale, but this is
| still pretty damn cool.
|
| This is the link where they go into detail about the mechanism
| that enables this feature.
|
| https://vitess.io/docs/13.0/reference/vreplication/vreplicat...
|
| Maybe GitHub should look into migrating to PlanetScale for their
| mysql1 cluster that keeps going down this week? Unless
| PlanetScale uses GitHub and that would introduce a circular
| dependency. Eh, it's turtles the whole way down either way I
| suppose.
| gtowey wrote:
| They already do. Not for that cluster, but large tranches of
| data are being logically separated fr there then physically
| moved to separate Vitess-based clusters. It takes a lot of
| engineering time to safely separate the data.
| ksec wrote:
| Is this similar to Temporal Tables in MSSQL?
| shlomi-noach wrote:
| Engineer at PlanetScale -- It is not similar; so temporal
| tables are about getting a table's dataset at a given point in
| time in the past, sort of a time machine for your table. Rewind
| is about undoing a (bad) structural schema change, and
| _without_ having to go back in time - you continue your current
| timeline, with your current data, but flipped into the previous
| schema.
|
| It's also not something you need to activate ahead of time,
| like you do in MSSQL temporal tables; it is activated on your
| behalf for any schema change you deploy.
| joshstrange wrote:
| This is pretty awesome. I've been super happy with my move to
| PlanetScale so far and they keep improving the product and adding
| things I didn't even think were possible (or at least
| easy/viable).
| throwusawayus wrote:
| "We are giving away limited-edition, retro Rewind t-shirts to the
| first 100 people that successfully revert a schema change."
|
| they are encouraging people to record videos of making schema
| changes and reverting them, in order to win a t-shirt
|
| does this seem unnecessarily risky or in really poor taste to
| anyone else?
| misterbwong wrote:
| I think that's the point. Schema migration is traditionally a
| _very_ risky thing but they're trying to make it come off as
| "safe", if you're a PlanetScale customer.
| throwusawayus wrote:
| yeah but half the risk is on the application side, completely
| independent of planetscale! not all applications can handle
| some schema change occurring and then suddenly reversing.
| someone may get fired just trying to win a shirt
|
| also their exact wording, "successfully", so if it fails your
| db is broken AND you do not get a shirt?
| ebiester wrote:
| Hopefully someone is doing this in testing or staging.
| ahachete wrote:
| Does it support migrations that affect more than one table?
| [deleted]
| danielovichdk wrote:
| Do you backup the present schema and data before the new
| migration is done ?
|
| And then if you rewind, you "simply" point to that backup scehma
| and data instead of the newly migration?
|
| I know it's very simplified, but is this the gist of it?
| shlomi-noach wrote:
| It's not like that -- that BTW is super simple to achieve with
| either of the existing online schema change tools (pt-online-
| schema-change, gh-ost, facebook's OSC) -- they all end up with
| your old table renamed away, and which you can instantly
| reinstate back in place. Very cool and important feature! But
| then, you lose data; all the data you've accumulated since the
| migration completed; or some data you've deleted will suddenly
| reappear.
|
| Rewind does not move you back to an old snapshot, but rather
| keeps you on your current timeline, with the current data, but
| with the old schema.
|
| Technically, there are two tables involved, yes! And a synching
| mechanism that compensates for the structural differences
| between them. But perhaps I should just point to this technical
| explanation of how this works internally:
| https://planetscale.com/blog/behind-the-scenes-how-we-built-...
| throwusawayus wrote:
| > But then, you lose data; all the data you've accumulated
| since the migration completed; or some data you've deleted
| will suddenly reappear.
|
| this is not correct, for example pt-online-schema-change has
| long had a --reverse-triggers option which reverses the
| direction of the triggers to keep the old table up to date
| KwisaksHaderach wrote:
| Does planetscale doesn't have PITR? couldn't find anything in the
| docs.
___________________________________________________________________
(page generated 2022-03-24 23:02 UTC)