[HN Gopher] It's fine, Rewind: Revert a migration without losing...
       ___________________________________________________________________
        
       It's fine, Rewind: Revert a migration without losing data
        
       Author : ksec
       Score  : 128 points
       Date   : 2022-03-24 12:16 UTC (10 hours ago)
        
 (HTM) web link (planetscale.com)
 (TXT) w3m dump (planetscale.com)
        
       | losvedir wrote:
       | This is pretty amazing. What are the restrictions on it working?
       | For example, suppose we had a non-NULL column A which we drop in
       | the migration, and new records come in without A data. That works
       | on the new table, but if you revert, presumably you would lose
       | those records since they can't be added to the old schema. Does
       | it prevent you from rolling back, or let you roll back to a
       | modified previous schema that allows NULLs, or does it roll back
       | and drop those new records?
        
         | shlomi-noach wrote:
         | Engineer at PlanetScale; If you drop columns that are `NOT NULL
         | DEFAULT <something>`, and then you insert some new rows to your
         | newly-versioned table, then you're in a good spot: when you
         | revert, those columns will get the DEFAULT value on those rows.
         | 
         | If those columns were `NOT NULL` and with no DEFAULT, then you
         | are unable to rewind. The rewind process will make an attempt
         | -- after all, maybe you didn't add new rows; maybe you just
         | deleted or updated -- but if you did INSERT new rows, then the
         | Rewind process will fail (and you will be notified that rewind
         | is impossible).
         | 
         | There's a couple more interesting scenarios, see this doc page
         | for more: https://docs.planetscale.com/concepts/deploy-requests
        
           | omegalulw wrote:
           | This is why you should put minimal consistency checks in your
           | DB directly. Put them in your storage later instead.
        
             | cesaref wrote:
             | So that you can cope with someone messing up a deployment
             | and wanting to roll it back?
             | 
             | Let's say you avoid column restrictions in your db design,
             | and you deploy some dodgy code that doesn't fill in a
             | column correctly, now you've got heaps of dubious data
             | cluttering your database, and apps failing because some
             | precondition isn't being met. That doesn't sound like a
             | good compromise to avoid headaches when updating a schema,
             | does it?
             | 
             | So as ever, it's about making the right decisions on a case
             | by case basis, and there is no 'one size fits all' for
             | stuff like this.
        
             | mixedCase wrote:
             | That would create an explosion of code to handle edge cases
             | that should never happen when interacting with the DB, or
             | worse, pretending everything's alright and letting
             | everything blow up in your face later.
        
       | robertlagrant wrote:
       | Does it only work if no data has been written to the new
       | structure? E.g. drop a column, and new record comes in without
       | that column? What then?
        
         | shlomi-noach wrote:
         | Engineer at PlanetScale here: it _does_ work if data has been
         | written to the new structure!
         | 
         | In your scenario, you drop a column, populate some new rows in
         | the new structure. Then, you regret the migration and rewind.
         | You get the column back with all the pre-dropped values, AND
         | you get to keep all the new rows which you've inserted.
         | 
         | The values for the now-restored column for the newly inserted
         | rows is the DEFAULT value per column definition.
        
           | mnapoli wrote:
           | This is pretty incredible TBH
        
           | asleepawake wrote:
           | that's awesome but when would it be useful? wouldn't that
           | lead to data loss?
        
             | shlomi-noach wrote:
             | Engineer at PlanetScale; it will let you go back to safety
             | _without_ data loss, and without making your database
             | inconsistent.
             | 
             | If you will indulge a realistic story; I've been through
             | this process multiple times in production.
             | 
             | You change a large table via ALTER TABLE; you possibly
             | change a data type, or drop a column, or modify an index.
             | The change takes 5 hours to complete - and things go bad.
             | Testing in staging was good, but as it turns out the
             | production environment cannot cope with the changes and
             | still needs the previous schema. Some traffic is still able
             | to pass through, but some requests are erroring.
             | 
             | What do you do?
             | 
             | One option is to run another ALTER TABLE that takes you
             | back into the original schema. This will take yet another 5
             | hours, during which your app may be degraded or altogether
             | down. Plus you'll be unable to recover lost data (such as
             | in a DROP COLUMN scenario). Another option is to do a point
             | in time recovery for your entire database. This will both
             | take time, but more importantly you will lose all the data
             | you've accumulated since the migration completed. Any new
             | user account, any new artifact, any new event - will be
             | lost. Rows that were deleted suddenly reappear. Data that
             | should not be available anymore suddenly is.
             | 
             | Most people will try a third option: do a point in time
             | recovery on an offline server, and extract/copy just the
             | specific table and copy it onto production. Typically this
             | involves a lot of juggling and most environments will not
             | have the infrastructure to automate the entire process. But
             | even once this is done, you're still hit with the
             | unfortunate implication: your data set is now both
             | incomplete as well as inconsistent.
             | 
             | It is incomplete because data is missing from the restored
             | table. Any rows accumulated since the point in time
             | recovery point - are lost. It is inconsistent, because in
             | many cases, due to the natural relational design of your
             | schema, other tables will have rows that relate to the
             | missing restored table's rows. You may try to then manually
             | backfill those missing rows into the restored table (or
             | remove rows previously deleted) , but in reality some
             | processes will already have manipulated the data on the
             | restored table even while you're trying to resolve the
             | situation, leading to more conflicts.
             | 
             | It seems like the only safe way is to take everything
             | offline, disable any writes to the broken tables as well as
             | some of, or all tables, associated with it, resolve all
             | conflicts, then restore data onto production and enable
             | writes again. Or, you choose to lose data, track down any
             | known conflicts, reach out to users and inform them of the
             | data loss. Either way this has a significant impact on your
             | service.
             | 
             | And so Rewind offers an instant fall back to your previous
             | schema, while still retaining any data you've accumulated
             | since the time of incident. Rewind resolves the differences
             | between previous and current schema, and adapts the latest
             | data changes onto the old schema. As you rewind the
             | migration your table still has the same amount of rows, and
             | maintains all incoming or outgoing references from and to
             | other tables. It all happens on your production environment
             | and does not require an offline server.
             | 
             | Here's a technical description of how Rewind works:
             | https://planetscale.com/blog/behind-the-scenes-how-we-
             | built-...
        
               | aboodman wrote:
               | This is a really clear description of the problem and
               | solution. Thanks!
               | 
               | I recommend amending the blog (or maybe making a new
               | post) with this exact content. I regularly run SQL
               | databases for smaller projects but was not able to
               | immediately conceptualize how I would use this feature
               | from just the blog post. Maybe your target audience would
               | be able to? But I don't see the downside in just spelling
               | it out clearly!
        
               | edf13 wrote:
               | Sounds a bit iffy if your prod deployment fails after
               | testing... surely the correct way is to have staging
               | pickup any issues prior to prod
        
               | ledauphin wrote:
               | lol.
               | 
               | I mean, yes... but also - have you really never seen a
               | bug make it to production?
        
               | edf13 wrote:
               | Yes - but this kinda gives it an easy pass. Makes it
               | easier to 'move fast and break things' which isn't always
               | best
        
             | rorymalcolm wrote:
             | If you drop a 'title' column from a users table, for
             | example, you can revert and have the title column reappear
             | with the dropped data, new users added during this time
             | (while the column was dropped) will not have a title.
        
               | codingdave wrote:
               | Just to clarify my understanding -
               | 
               | If we extend that scenario a bit to dropping a title
               | column and at the same time adding a foo column. Then add
               | rows with data in foo. Then revert. Do you lose the foo
               | data?
               | 
               | Alternatively, can we separate those actions out? Drop
               | columns in one migration. Add columns in another. Add
               | rows and data. Then revert only the migration where
               | columns were dropped, keeping the more recent adds?
        
               | shlomi-noach wrote:
               | Right. So if you're both adding one column and removing
               | another, then the revert will lose your new column and
               | will regain your old column. Normally, you deploy DB and
               | app in steps. E.g. if your migration adds a new column,
               | your app is not yet aware of the column (or else your app
               | would break). The moment the migration completes, your
               | app is still on the not-knowing state. It takes an app
               | deployment to actually start utilizing the new column. If
               | you do that, and then want to revert -- you will lose any
               | new data you've added to the new column.
               | 
               | In my experience, when a schema migration goes wrong, it
               | goes wrong with a bang. It takes seconds to maybe one
               | minute until pagers are alarming. So I'd say in a common
               | scenario you will not get to deploy your app with the new
               | column awareness, because you'll have realized the
               | migration was bad right away.
               | 
               | > Alternatively, can we separate those actions out? Drop
               | columns in one migration.
               | 
               | If you choose to do that, then you're on safer grounds;
               | it costs you some wall clock time, because migrations do
               | take a while to complete on medium to large tables.
               | 
               | Do note that Rewind only lets you rewind your most recent
               | deployment (PlanetScale's app will not let you run the
               | next migration before you've committed to, or have
               | rewinded, the previous one).
        
           | robertlagrant wrote:
           | Makes sense! Thanks. And congratulations.
        
           | robertlagrant wrote:
           | And now I have a follow-up!
           | 
           | So what if you upgrade your data structure, add a new
           | required column and a record is inserted. You then roll back,
           | removing the column, in order to fix a bug. When you roll
           | forward again, is that removed/readded column's data for the
           | inserted record still there?
        
       | epberry wrote:
       | P.S. you can do the forward version of this, online schema
       | changes, directly in the Arctype SQL GUI
       | https://arctype.com/blog/changelog/planetscale-exports/. I
       | imagine we'll have support for Rewind too in the future. Game
       | changing feature.
        
       | ZeroCool2u wrote:
       | I'm not really the target audience for PlanetScale, but this is
       | still pretty damn cool.
       | 
       | This is the link where they go into detail about the mechanism
       | that enables this feature.
       | 
       | https://vitess.io/docs/13.0/reference/vreplication/vreplicat...
       | 
       | Maybe GitHub should look into migrating to PlanetScale for their
       | mysql1 cluster that keeps going down this week? Unless
       | PlanetScale uses GitHub and that would introduce a circular
       | dependency. Eh, it's turtles the whole way down either way I
       | suppose.
        
         | gtowey wrote:
         | They already do. Not for that cluster, but large tranches of
         | data are being logically separated fr there then physically
         | moved to separate Vitess-based clusters. It takes a lot of
         | engineering time to safely separate the data.
        
       | ksec wrote:
       | Is this similar to Temporal Tables in MSSQL?
        
         | shlomi-noach wrote:
         | Engineer at PlanetScale -- It is not similar; so temporal
         | tables are about getting a table's dataset at a given point in
         | time in the past, sort of a time machine for your table. Rewind
         | is about undoing a (bad) structural schema change, and
         | _without_ having to go back in time - you continue your current
         | timeline, with your current data, but flipped into the previous
         | schema.
         | 
         | It's also not something you need to activate ahead of time,
         | like you do in MSSQL temporal tables; it is activated on your
         | behalf for any schema change you deploy.
        
       | joshstrange wrote:
       | This is pretty awesome. I've been super happy with my move to
       | PlanetScale so far and they keep improving the product and adding
       | things I didn't even think were possible (or at least
       | easy/viable).
        
       | throwusawayus wrote:
       | "We are giving away limited-edition, retro Rewind t-shirts to the
       | first 100 people that successfully revert a schema change."
       | 
       | they are encouraging people to record videos of making schema
       | changes and reverting them, in order to win a t-shirt
       | 
       | does this seem unnecessarily risky or in really poor taste to
       | anyone else?
        
         | misterbwong wrote:
         | I think that's the point. Schema migration is traditionally a
         | _very_ risky thing but they're trying to make it come off as
         | "safe", if you're a PlanetScale customer.
        
           | throwusawayus wrote:
           | yeah but half the risk is on the application side, completely
           | independent of planetscale! not all applications can handle
           | some schema change occurring and then suddenly reversing.
           | someone may get fired just trying to win a shirt
           | 
           | also their exact wording, "successfully", so if it fails your
           | db is broken AND you do not get a shirt?
        
         | ebiester wrote:
         | Hopefully someone is doing this in testing or staging.
        
       | ahachete wrote:
       | Does it support migrations that affect more than one table?
        
       | [deleted]
        
       | danielovichdk wrote:
       | Do you backup the present schema and data before the new
       | migration is done ?
       | 
       | And then if you rewind, you "simply" point to that backup scehma
       | and data instead of the newly migration?
       | 
       | I know it's very simplified, but is this the gist of it?
        
         | shlomi-noach wrote:
         | It's not like that -- that BTW is super simple to achieve with
         | either of the existing online schema change tools (pt-online-
         | schema-change, gh-ost, facebook's OSC) -- they all end up with
         | your old table renamed away, and which you can instantly
         | reinstate back in place. Very cool and important feature! But
         | then, you lose data; all the data you've accumulated since the
         | migration completed; or some data you've deleted will suddenly
         | reappear.
         | 
         | Rewind does not move you back to an old snapshot, but rather
         | keeps you on your current timeline, with the current data, but
         | with the old schema.
         | 
         | Technically, there are two tables involved, yes! And a synching
         | mechanism that compensates for the structural differences
         | between them. But perhaps I should just point to this technical
         | explanation of how this works internally:
         | https://planetscale.com/blog/behind-the-scenes-how-we-built-...
        
           | throwusawayus wrote:
           | > But then, you lose data; all the data you've accumulated
           | since the migration completed; or some data you've deleted
           | will suddenly reappear.
           | 
           | this is not correct, for example pt-online-schema-change has
           | long had a --reverse-triggers option which reverses the
           | direction of the triggers to keep the old table up to date
        
       | KwisaksHaderach wrote:
       | Does planetscale doesn't have PITR? couldn't find anything in the
       | docs.
        
       ___________________________________________________________________
       (page generated 2022-03-24 23:02 UTC)