[HN Gopher] Building a better and scalable system for data migra...
       ___________________________________________________________________
        
       Building a better and scalable system for data migrations
        
       Author : YorickPeterse
       Score  : 64 points
       Date   : 2024-10-24 14:16 UTC (5 days ago)
        
 (HTM) web link (yorickpeterse.com)
 (TXT) w3m dump (yorickpeterse.com)
        
       | gregw2 wrote:
        | Considered doing the migration via Liquibase? I don't know that
        | it supports parallelism in the migration process, but you could
        | do most of the rest in it (pre/post migration scripts,
        | rollbacks, test cases (within precondition changesets), etc...
        | Not a real programming language though; it's XML config files
        | or annotated SQL scripts...
        
         | YorickPeterse wrote:
         | I hadn't heard of Liquibase until now. Taking a brief look, it
         | doesn't seem like it would be suitable though. For large scale
         | migrations you really need to be able to express arbitrary
         | logic ("migrate A to B but only if file X on the file system
         | contains Y"), which you can't do with something like SQL or
         | XML.
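          | 
          | To make that concrete, a minimal sketch of a migration written
          | as a function (the db handle, table, and file path here are
          | hypothetical, not any particular framework's API):
          | 
          |     # migrations/20241024_backfill_b.py
          |     from pathlib import Path
          | 
          |     def up(db):
          |         # Arbitrary logic: only run the data change when a
          |         # file on the file system contains the expected value.
          |         marker = Path("/var/lib/app/migrate_marker")
          |         if marker.exists() and "Y" in marker.read_text():
          |             db.execute("UPDATE things SET b = a WHERE b IS NULL")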
        
           | svieira wrote:
           | The great thing is, Liquibase even allows you to run
           | arbitrary programs as migrations using `customChange`:
           | https://docs.liquibase.com/change-types/custom-change.html
           | 
            | Though you can get a _long_ way with just specifying the
            | appropriate contexts before you kick off your migrations (and
            | tagging your changesets with those context tags as well):
            | https://docs.liquibase.com/concepts/changelogs/attributes/co...
        
         | gelatocar wrote:
          | I found that Liquibase doesn't really support rollbacks,
          | particularly with MySQL, since you can't run schema updates in
          | a transaction; if the migration fails in the middle of an
          | update, it just gets left in a half-updated state.
        
           | a8m wrote:
           | I'm glad to see I'm not the ~only one bothered by this issue.
           | This has been, and still is, one of the challenges in
           | managing MySQL schemas.
           | 
           | That's one of the reasons we implemented "migrate down"
           | differently than other tools.
           | 
           | I'm not here to promote my blog post, but if you are
           | interested in seeing how we tackled this in Atlas, you can
          | read more here:
          | https://atlasgo.io/blog/2024/04/01/migrate-down
        
       | brunoarueira wrote:
        | I liked the author's line of thought and have run into similar
        | problems at a much smaller scale! Once I planned a migration
        | that moved multiple columns, and the associated data, from one
        | table to another, but I miscalculated the time it would take and
        | the amount of CPU/RAM it would need. If I'd had the right tools,
        | I probably would have scheduled it better and planned a better
        | approach.
        
       | jensenbox wrote:
       | I have been eyeing up https://atlasgo.io/ - I am going to give it
       | a whirl soon.
        
       | jakozaur wrote:
        | SQL migrations run in place and make changes that are hard to
        | reverse, and production is often the only realistic environment
        | to run them in.
        | 
        | I wish there was a better way to run blue/green DB deployments.
        | Tools that support this are rare (e.g. gh-ost), and not that
        | usable below big tech scale.
        
         | fabianlindfors wrote:
          | My project Reshape (https://github.com/fabianlindfors/reshape)
          | might be interesting to you; it's a bit like blue/green DB
          | deployments in that it uses a two-step process which can
          | support both the old and the new schema at the same time during
          | the deployment phase.
        
       | fabianlindfors wrote:
       | For people interested in this subject, you might also be
       | interested in my project Reshape:
       | https://github.com/fabianlindfors/reshape. It takes database
       | migrations a step further than most tools, guaranteeing zero-
       | downtime migration for both schema and data changes.
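        | 
        | For a rough idea of the pattern this builds on, here is a
        | generic expand/contract sketch (not Reshape's actual mechanism;
        | the table, columns, and run() helper are made up for
        | illustration):
        | 
        |     # Expand: add the new shape alongside the old one, so both
        |     # application versions keep working during the deploy.
        |     def expand(run):
        |         run("ALTER TABLE users ADD COLUMN full_name text")
        |         run("UPDATE users SET full_name = first_name || ' ' || last_name "
        |             "WHERE full_name IS NULL")
        | 
        |     # Contract: only once every deployed version reads the new
        |     # column is the old shape removed.
        |     def contract(run):
        |         run("ALTER TABLE users DROP COLUMN first_name, "
        |             "DROP COLUMN last_name")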
        
         | oulipo wrote:
          | Interesting, thanks! How would it compare to others like
          | bytebase/bytebase, Xataio/pgroll, liquibase/liquibase,
          | flyway/flyway, graphile/migrate?
        
           | oulipo wrote:
            | And why build a separate "reshapedb" if "postgres +
            | reshape" is good enough? Won't this just bring more issues?
        
           | oulipo wrote:
            | Also a separate question, but I feel it would be nice to
            | have a few "example" tutorials where you go through setting
            | up a database with fake data, with a sample app that queries
            | it in real time in a different terminal and adds data or
            | interacts with it, and then go through a few updates /
            | reverts and check that the sample app still works perfectly
            | without having to close and restart it. That would be a
            | quite useful "getting started" tutorial to get the idea of
            | your lib.
        
       | tianzhou wrote:
        | This post addresses the data migration specifics well. Besides
        | that, there is also the organizational side of scaling how
        | database migrations get deployed. We've built a project,
        | https://github.com/bytebase/bytebase, to address this.
        
       | oulipo wrote:
        | Not directly linked to migration (although I'd love to know the
        | best way to handle those when you have a kind of data lake with
        | a big number of past tables that have been archived on S3).
       | 
        | I'm trying to move my database to Postgres. There is a part
        | which is "describing all the objects" (object id, properties,
        | etc), and a huge table which is a log of events that I'm storing
        | in case I want to data-mine it later.
       | 
        | Of course this last table:
        | 
        | - is huge (or should become huge at some point)
        | - is better suited to columnar storage
        | - might be archived from time to time on S3
        | 
        | My initial thinking was to store it in Postgres "natively" or as
        | a "duckdb/clickhouse" extension with Postgres-querying
        | capabilities, keep the last 90 days of data in the database, and
        | regularly have a script export the rest as Parquet files on S3.
       | 
       | does this seem reasonable? is there a "best practice" to do this?
       | 
        | I also want to do the same with "audit logs" of everything
        | going on in the system (modifications to the fields, actions
        | taken by users on the dashboard, etc).
       | 
       | what would you recommend?
        
       | WJW wrote:
       | I agree with most points but would like to add one about
       | "tunability". This article focuses mostly on correctness, which
       | is of course important (and without which the whole system would
       | be useless) but in itself is often not sufficient for large
       | systems. I've seen it happen more than once that a big table
       | rewrite causes so much load the database might as well be down
       | completely.
       | 
       | One thing I would love to see implemented more thoroughly in
       | database systems is the ability to throttle and/or pause the load
       | that DML can impose. Normal SQL only says "what" the ALTER TABLE
       | should do, but the "how" is often rather lacking. At most you get
       | a CREATE INDEX CONCURRENTLY in postgres or an "ALGORITHM=instant"
        | in mysql, but rarely can you get fine-grained enough to say
        | "use at most XYZ iops for this task", let alone vary that XYZ
        | variable dynamically or assign priorities to load caused by
        | different queries.
       | 
        | AFAIK TiDB and pt-osc provide ways to pause a running migration,
        | and gh-ost can also throttle a migration dynamically. Vitess
        | also has several ways to manage migrations, as it leverages
        | gh-ost. For Postgres I don't think any of the currently popular
        | tools have good ways to manage this, but I would love to be
        | proven wrong.
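        | 
        | Lacking database-level knobs, the usual workaround is to move
        | the batching into the application, where you can at least tune
        | and pause things yourself. A rough sketch (the items table is
        | made up; assumes a psycopg-style connection and a get_throttle()
        | setting that can be changed while the backfill runs):
        | 
        |     import time
        | 
        |     def backfill(conn, get_throttle):
        |         # get_throttle() returns (batch_size, pause_seconds) and
        |         # is re-read between batches, so load can be tuned live.
        |         while True:
        |             batch_size, pause = get_throttle()
        |             if batch_size == 0:      # effectively "pause it"
        |                 time.sleep(pause)
        |                 continue
        |             with conn.cursor() as cur:
        |                 cur.execute(
        |                     "UPDATE items SET new_col = old_col WHERE id IN ("
        |                     "  SELECT id FROM items WHERE new_col IS NULL LIMIT %s)",
        |                     (batch_size,))
        |                 done = cur.rowcount == 0
        |             conn.commit()
        |             if done:
        |                 break
        |             time.sleep(pause)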
        
         | YorickPeterse wrote:
         | Good point, I forgot about tunability. At GitLab we did have a
          | mechanism for this for background migrations; IIRC it looked at
         | the vacuuming statistics of PostgreSQL and would back off if it
         | deemed there was too much load.
        
         | shlomi-noach wrote:
         | Vitess maintainer here. While Vitess can still use gh-ost, the
         | recommended way is to use Vitess' internal VReplication
          | mechanism. With this, you can throttle, pause, resume, cancel,
         | retry, revert a migration, as well as auto-resume from failover
         | (migration continues from point of interruption once a new
         | primary is instated).
         | 
         | Vitess throttling is by default based on replication lag, but
         | you can use different metrics, such as load average, or indeed
          | multiple metrics combined, to define what constitutes too much
          | load.
        
       | toolslive wrote:
       | Not exactly an SQL database but we were migrating data
       | _constantly_ (dead devices, load balancing, ...) for the
       | data/metadata of an exabyte scale object store...
       | 
        | Once client code can work with the mixed state (while the
        | migration is in progress), it no longer matters how long it
        | takes. Once the migration is robust enough that it can handle
        | crashes, interrupts, ... it no longer matters how often you
        | trigger the "continue". Is the migration using too many IOPS?
        | Just kill it and schedule a continuation later.
       | 
        | Also, your smallest step needs to be an atomic multi-update (you
        | don't want to bother with partial failures).
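        | 
        | Something along these lines, roughly (a generic sketch rather
        | than our actual system; the tables and the psycopg 3 style
        | connection are made up):
        | 
        |     def continue_migration(conn, batch_size=1000):
        |         # Resumable: the checkpoint is committed in the same
        |         # transaction as the batch it describes, so the smallest
        |         # step is atomic and a crash or "kill it, continue
        |         # later" just restarts from the last checkpoint.
        |         while True:
        |             with conn.transaction():
        |                 # migration_checkpoint is assumed to hold one row.
        |                 last_id = conn.execute(
        |                     "SELECT last_id FROM migration_checkpoint"
        |                 ).fetchone()[0]
        |                 rows = conn.execute(
        |                     "SELECT id FROM old_records WHERE id > %s "
        |                     "ORDER BY id LIMIT %s",
        |                     (last_id, batch_size)).fetchall()
        |                 if not rows:
        |                     return
        |                 ids = [r[0] for r in rows]
        |                 conn.execute(
        |                     "INSERT INTO new_records SELECT * FROM old_records "
        |                     "WHERE id = ANY(%s)", (ids,))
        |                 conn.execute(
        |                     "UPDATE migration_checkpoint SET last_id = %s",
        |                     (ids[-1],))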
        
       | Temporary_31337 wrote:
       | I work for a nosql db vendor. We don't have a fixed schema so
       | there are no migrations. If you want fast access to some parts of
       | data you just add an index and reindex async. If you need SQL for
       | Tableau or PowerBI you just add a view which is also like an
       | index.
       | 
        | But the underlying data and its model can be in flux, and we
        | handle exabyte-scale HA, DR, rebalancing, etc.
        
         | YorickPeterse wrote:
         | If you enjoy maintaining 5 different formats of the same data,
         | then by all means use a document storage DB. But at some point
         | you'll have to migrate the data, at which point you'll face the
         | same issues (or possibly worse due to the lack of support for
         | schema migrations) as relational databases.
        
       | sgarland wrote:
       | > As I've hinted at before, migrations should be functions.
       | Functions as in "a function written in a programming language",
       | not functions as in "SQL snippets I just refer to as functions
       | because it makes me sound smart".
       | 
       | Disagree. SQL is a declarative language that is both clear and
       | universal. Feel free to generate the migrations however you'd
       | like, but I want to see the SQL.
       | 
       | Another bonus (and I'm sure I'll be told I'm wrong) is that you
       | don't need to write tests for the migration, because it's
       | declarative. Assuming you know what it means (if you don't, maybe
       | you shouldn't be administering a DB) and what its locking methods
       | entail, you will get _precisely_ what it says, and nothing more.
       | If you get a failure from, say, a duplicate entry when creating a
       | UNIQUE constraint, that's orthogonal to the migration itself -
       | you described the end state, the DB tried to make that happen,
       | but was unable to do so due to issues with your data. All the
       | tests in the world wouldn't catch that, short of selecting and
       | deduping the column[s], and at that point, you're just creating
       | work.
       | 
       | I am firmly convinced that any and all infrastructure should be
       | declaratively instantiated, and declaratively maintained. I do
       | not want or need to test my Terraform files, nor my DDL.
        
         | YorickPeterse wrote:
         | > Disagree. SQL is a declarative language that is both clear
         | and universal. Feel free to generate the migrations however
         | you'd like, but I want to see the SQL.
         | 
         | This works until you need something beyond the trivial "UPDATE
         | table SET column1 = column2", such as "Update the values in
         | column X with the contents of file Y that exists in Amazon S3",
         | or really anything else you can't express in SQL.
         | 
         | > Another bonus (and I'm sure I'll be told I'm wrong) is that
         | you don't need to write tests for the migration, because it's
         | declarative.
         | 
         | This indeed is wrong, and essentially comes down to "It looks
         | simple so it's correct". A SQL query "DELETE FROM users" might
         | be correct, but if you meant for it to be "DELETE FROM users
         | WHERE id IN (...)" it's going to cause problems.
         | 
         | In other words, at least for data migrations you _absolutely_
         | have to write tests or you _will_ run into problems.
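          | 
          | Even a small test catches that class of mistake. A sketch (the
          | apply_migration helper and the throwaway test database handle
          | are hypothetical):
          | 
          |     def test_only_flagged_users_are_deleted(db):
          |         db.execute(
          |             "INSERT INTO users (id, flagged) VALUES (1, true), (2, false)")
          |         apply_migration(db, "20241024_delete_flagged_users")
          |         remaining = db.fetch_all("SELECT id FROM users")
          |         # Fails loudly if the migration forgot its WHERE clause
          |         # and deleted every row instead of just the flagged one.
          |         assert remaining == [(2,)]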
        
           | sgarland wrote:
           | These are not migrations in the traditional sense. I was
           | referring to DDL. Nevertheless:
           | 
           | > "Update the values in column X with the contents of file Y
           | that exists in Amazon S3"
           | 
           | You actually can do that in SQL, assuming it's hosted on AWS
           | (or for Postgres, you've installed the correct extension). It
           | would be a bit convoluted (I think to handle the UPDATE,
           | you'd first dump from S3 into a temp table, then update from
           | that), but it would work.
           | 
           | > A SQL query "DELETE FROM users" might be correct, but if
           | you meant for it to be "DELETE FROM users WHERE id IN (...)"
           | it's going to cause problems.
           | 
           | If someone doesn't notice this issue in SQL, they're not
           | going to notice it in an ORM, either. It's also possible (and
           | a good idea) to enforce predicates for DELETE at a server
           | configuration level, such that the DB will refuse to execute
           | them. And really, if you actually want to delete everything
           | in a table, you should be using TRUNCATE anyway.
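            | 
            | Roughly like this, assuming RDS Postgres with the aws_s3
            | extension installed (the bucket, table, and column names are
            | made up):
            | 
            |     import psycopg  # assumes psycopg 3
            | 
            |     with psycopg.connect("dbname=app") as conn:
            |         conn.execute(
            |             "CREATE TEMP TABLE s3_import (id bigint, value text)")
            |         # Load the S3 object into the temp table server-side.
            |         conn.execute(
            |             "SELECT aws_s3.table_import_from_s3("
            |             "  's3_import', 'id,value', '(FORMAT csv)',"
            |             "  aws_commons.create_s3_uri("
            |             "    'my-bucket', 'data/y.csv', 'us-east-1'))")
            |         # Then update the real table from the temp table.
            |         conn.execute(
            |             "UPDATE items SET x = s3_import.value "
            |             "FROM s3_import WHERE items.id = s3_import.id")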
        
           | tucnak wrote:
           | You can express it in SQL; you would need to create a table
           | with "the contents of file Y" from S3, or use something like
           | PostgreSQL foreign data wrappers to that effect: we're using
           | pg_analytics[1] to query CSV/Parquet data, and the same
           | applies to most databases these days.
           | 
            | The limitation you're implying is not that of SQL, but of
            | your data model.
           | 
           | [1] https://github.com/paradedb/pg_analytics
        
             | YorickPeterse wrote:
             | Both you and sgarland are missing the point here: the
             | example I gave is literally just that: an example. There
             | could be many more complex cases, some relying on other
             | business logic provided by your application, some more
             | isolated. Some of those cases could be handled using some
             | random PG extension, but for many that won't be sufficient.
             | 
             | In contrast, a migration written in an actual programming
             | language can handle all such cases. Depending on the amount
             | of abstractions applied, it can also look close enough to a
             | declarative language (in other words, it doesn't have to be
             | verbose).
             | 
             | So yes, the limitation I'm implying very much _is_ a
             | limitation of SQL (or really any declarative query language
             | for that matter). It has nothing to do with the data model
             | as it applies equally well to using e.g. MongoDB, Redis, or
             | really anything else that stores a pile of data you may
             | want to transform in complex ways.
        
               | sgarland wrote:
               | I think you're missing the point that what you're
               | describing is not a migration as a DB{A,RE} would call
               | it. That is what I'm pushing back on.
               | 
               | If you're shifting or otherwise manipulating tuples, then
               | yes, you probably want to handle that logic in a more
               | general-purpose language (though it isn't _required_,
               | annoying though it might be otherwise).
               | 
               | But for DDL? No, I don't want anything in the way. The
               | state should be stored in VCS, of course, but I don't
               | want some abstraction trying to be clever about ALTER
               | TABLE foo. Just run it how I specified.
        
       | yayitswei wrote:
       | The database I'm working with, Rama, has an interesting way to do
       | "instant migrations" at scale. Your migration function runs on
       | every read until data is durably migrated in the background. So
       | you get migrated data immediately after deploy, while Rama
       | quietly persists the changes during normal processing. No
       | downtime.
       | 
       | https://blog.redplanetlabs.com/2024/09/30/migrating-terabyte...
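        | 
        | The general idea looks roughly like this (a sketch of
        | migrate-on-read, not Rama's actual API; the store and record
        | shapes are made up):
        | 
        |     def migrate(record):
        |         # Pure function from the old shape to the new one.
        |         migrated = dict(
        |             record,
        |             full_name=f"{record['first']} {record['last']}")
        |         migrated["version"] = 2
        |         return migrated
        | 
        |     def read(store, key):
        |         record = store[key]
        |         # Readers see migrated data immediately after deploy...
        |         return record if record.get("version") == 2 else migrate(record)
        | 
        |     def background_backfill(store):
        |         # ...while a background pass durably persists the new shape.
        |         for key, record in store.items():
        |             if record.get("version") != 2:
        |                 store[key] = migrate(record)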
        
       | victorNicollet wrote:
       | Avoiding SQL migrations was my #1 reason for moving to event
       | sourcing.
       | 
       | This approach cuts the "database server" into an event stream (an
       | append-only sequence of events), and a cached view (a read-only
       | database that is kept up-to-date whenever events are added to the
       | stream, and can be queried by the rest of the system).
       | 
       | Migrations are overwhelmingly cached view migrations (that don't
       | touch the event stream), and in very rare cases they are event
       | stream migrations (that don't touch the cached view).
       | 
       | A cached view migration is made trivial by the fact that multiple
       | cached views can co-exist for a single event stream. Migrating
       | consists in deploying the new version of the code to a subset of
       | production machines, waiting for the new cached view to be
       | populated and up-to-date (this can take a while, but the old
       | version of the code, with the old cached view, is still running
       | on most production machines at this point), and then deploying
       | the new version to all other production machines. Rollback
       | follows the same path in reverse (with the advantage that the old
       | cached view is already up-to-date, so there is no need to wait).
       | 
       | An event stream migration requires a running process that
       | transfers events from the old stream to the new stream as they
       | appear (transforming them if necessary). Once the existing events
       | have been migrated, flip a switch so that all writes point to the
       | new stream instead of the old.
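        | 
        | As a toy sketch of the cached view side (made-up event and view
        | shapes, not our actual code): the new projection is replayed
        | from the same stream on a subset of machines while the old one
        | keeps serving everywhere else.
        | 
        |     def apply_v1(view, event):
        |         # Old projection: keeps a per-user order count.
        |         entry = view.setdefault(event["user"], {"orders": 0})
        |         entry["orders"] += 1
        | 
        |     def apply_v2(view, event):
        |         # New projection: also tracks revenue.
        |         entry = view.setdefault(
        |             event["user"], {"orders": 0, "revenue": 0})
        |         entry["orders"] += 1
        |         entry["revenue"] += event["amount"]
        | 
        |     def build_view(events, apply):
        |         # Replay the append-only stream to populate a view.
        |         view = {}
        |         for event in events:
        |             apply(view, event)
        |         return view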
        
       | bjornsing wrote:
       | Having worked with migrating petabyte scale sharded SQL databases
       | I get a sinking feeling when I think about automating this. I
       | think it would actually be a pretty good test case for artificial
       | super intelligence.
        
       ___________________________________________________________________
       (page generated 2024-10-29 23:02 UTC)