[HN Gopher] Building a better and scalable system for data migra...
___________________________________________________________________
Building a better and scalable system for data migrations
Author : YorickPeterse
Score : 64 points
Date : 2024-10-24 14:16 UTC (5 days ago)
(HTM) web link (yorickpeterse.com)
(TXT) w3m dump (yorickpeterse.com)
| gregw2 wrote:
| Considered doing the migration via Liquibase? I don't know that it
| supports parallelism in the migration process, but you could do
| most of the rest in it (pre/post migration scripts, rollbacks,
| test cases (within precondition changesets), etc... Not a real
| programming language though; it's XML config files or annotated
| SQL scripts...
| YorickPeterse wrote:
| I hadn't heard of Liquibase until now. Taking a brief look, it
| doesn't seem like it would be suitable though. For large scale
| migrations you really need to be able to express arbitrary
| logic ("migrate A to B but only if file X on the file system
| contains Y"), which you can't do with something like SQL or
| XML.
| svieira wrote:
| The great thing is, Liquibase even allows you to run
| arbitrary programs as migrations using `customChange`:
| https://docs.liquibase.com/change-types/custom-change.html
|
| Though you can get a _long_ way with just specifying the
| appropriate contexts before you kick off your migrations (and
| tagging your changesets with those context tags as well):
| https://docs.liquibase.com/concepts/changelogs/attributes/co...
| gelatocar wrote:
| I found that Liquibase doesn't really support rollbacks,
| particularly with MySQL, since you can't run schema updates in
| transactions: if a migration fails in the middle of an update, it
| just gets left in a half-updated state.
| a8m wrote:
| I'm glad to see I'm not the ~only one bothered by this issue.
| This has been, and still is, one of the challenges in
| managing MySQL schemas.
|
| That's one of the reasons we implemented "migrate down"
| differently than other tools.
|
| I'm not here to promote my blog post, but if you are
| interested in seeing how we tackled this in Atlas, you can
| read more here: https://atlasgo.io/blog/2024/04/01/migrate-down
| brunoarueira wrote:
| I liked the author's line of thought and have suffered similar
| pains at a much smaller scale! Once, I'd planned a migration that
| moved multiple columns (and the associated data) from one table to
| another, but I miscalculated the time to complete and the amount
| of CPU/RAM it would need. If I'd had the right tools, I probably
| would have scheduled it better and planned a better approach.
| jensenbox wrote:
| I have been eyeing up https://atlasgo.io/ - I am going to give it
| a whirl soon.
| jakozaur wrote:
| SQL migrations, when run in-place, are hard-to-reverse changes,
| and production is the only realistic environment for them.
|
| I wish there were a better way to run blue/green DB deployments.
| Tooling for this is rare (e.g. gh-ost) and not that usable at
| less than big-tech scale.
| fabianlindfors wrote:
| My project Reshape (https://github.com/fabianlindfors/reshape)
| might be interesting to you, it's a bit like blue/green DB
| deployments in that it uses a two-step process where it can
| support both the old and the new schema at the same time during
| the deployment phase.
| fabianlindfors wrote:
| For people interested in this subject, you might also be
| interested in my project Reshape:
| https://github.com/fabianlindfors/reshape. It takes database
| migrations a step further than most tools, guaranteeing zero-
| downtime migration for both schema and data changes.
| oulipo wrote:
| interesting, thanks! How would it compare to others like
| bytebase/bytebase, Xataio/pgroll, liquibase/liquibase,
| flyway/flyway, graphile/migrate?
| oulipo wrote:
| and why build a separate "reshapedb" if "postgres + reshape" is
| good enough? won't this just bring more issues?
| oulipo wrote:
| also a separate question, but I feel it would be nice to have a
| few "example" tutorials: set up a database with fake data, run a
| sample app in a different terminal that queries it in real time
| and adds or modifies data, then go through a few updates / reverts
| and check that the sample app keeps working without having to be
| closed and restarted. That would be a quite useful "getting
| started" tutorial to get the idea of your lib.
| tianzhou wrote:
| This post addresses the data migration specifics well. Beyond
| that, there is also the organizational side: scaling how a team
| deploys database migrations. We've built a project,
| https://github.com/bytebase/bytebase, to address this.
| oulipo wrote:
| Not directly linked to migrations (although I'd love to know the
| best way to handle those when you have a kind of data lake with a
| large number of past tables that have been archived on S3).
|
| I'm trying to move my database to Postgres. One part of it
| "describes all the objects" (object id, properties, etc.), and
| there is also a huge table which is a log of events that I'm
| keeping in case I want to data-mine it later.
|
| Of course this last table is:
|
| - huge (or should become huge at some point)
| - better suited to columnar storage
| - likely to be archived from time to time on S3
|
| My initial thinking was to store it in Postgres "natively" or via
| a "duckdb/clickhouse" extension with Postgres-querying
| capabilities, keep the last 90 days of data in the database, and
| regularly run a script to export the rest as Parquet files on S3.
|
| does this seem reasonable? is there a "best practice" to do this?
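|
| For concreteness, a rough sketch of the "export the rest as
| Parquet" step (assuming psycopg2 and pyarrow; the events table,
| its columns and the 90-day cutoff are hypothetical):
|
|     import datetime
|     import psycopg2
|     import pyarrow as pa
|     import pyarrow.parquet as pq
|
|     CUTOFF = (datetime.datetime.now(datetime.timezone.utc)
|               - datetime.timedelta(days=90))
|
|     conn = psycopg2.connect("dbname=app")
|     # Server-side cursor so the archive window streams instead of
|     # being pulled through the client all at once.
|     with conn, conn.cursor(name="events_export") as cur:
|         cur.execute(
|             "SELECT id, event_type, payload::text, created_at "
|             "FROM events WHERE created_at < %s",
|             (CUTOFF,),
|         )
|         rows = cur.fetchall()
|
|     if rows:
|         ids, types, payloads, stamps = map(list, zip(*rows))
|         table = pa.table({"id": ids, "event_type": types,
|                           "payload": payloads,
|                           "created_at": stamps})
|         # Write locally, then upload to S3 (e.g. with boto3)
|         # before deleting the exported window from Postgres.
|         pq.write_table(table, "events_archive.parquet")
|         with conn, conn.cursor() as cur:
|             cur.execute(
|                 "DELETE FROM events WHERE created_at < %s",
|                 (CUTOFF,))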
|
| I also want to do the same with "audit logs" of everything going
| on in the system (modifications to fields, actions taken by users
| on the dashboard, etc.)
|
| what would you recommend?
| WJW wrote:
| I agree with most points but would like to add one about
| "tunability". This article focuses mostly on correctness, which
| is of course important (and without which the whole system would
| be useless) but in itself is often not sufficient for large
| systems. I've seen it happen more than once that a big table
| rewrite causes so much load the database might as well be down
| completely.
|
| One thing I would love to see implemented more thoroughly in
| database systems is the ability to throttle and/or pause the load
| that DDL can impose. Normal SQL only says "what" the ALTER TABLE
| should do, but the "how" is often rather lacking. At most you get
| a CREATE INDEX CONCURRENTLY in Postgres or an "ALGORITHM=INSTANT"
| in MySQL, but rarely do you get fine-grained enough control to say
| "use at most XYZ iops for this task", let alone the ability to
| vary that XYZ variable dynamically or to assign priorities to load
| caused by different queries.
|
| AFAIK TiDB and pt-osc provide ways to pause a running migration,
| and gh-ost can also throttle a migration dynamically. Vitess also
| has several ways to manage migrations, as it leverages gh-ost. For
| Postgres I don't think any of the currently popular tools have
| good ways to manage this, but I would love to be proven wrong.
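|
| For what it's worth, one way to approximate this from the
| application side is a batched backfill that measures how long the
| database takes per batch and backs off accordingly (a crude stand-
| in for an iops budget). A sketch assuming psycopg2; the table,
| columns and thresholds are made up:
|
|     import time
|     import psycopg2
|
|     BATCH = 1000    # rows rewritten per statement
|     BUDGET = 0.1    # seconds of DB time we allow per batch
|     conn = psycopg2.connect("dbname=app")
|
|     while True:
|         start = time.monotonic()
|         with conn, conn.cursor() as cur:
|             # Rewrite one small batch; the WHERE clause makes the
|             # loop resumable after interruption.
|             cur.execute(
|                 "UPDATE accounts SET new_col = old_col "
|                 "WHERE id IN (SELECT id FROM accounts "
|                 "WHERE new_col IS NULL LIMIT %s)",
|                 (BATCH,),
|             )
|             if cur.rowcount == 0:
|                 break
|         elapsed = time.monotonic() - start
|         if elapsed > BUDGET:
|             # Back off proportionally so the database can recover;
|             # real tools key this off replication lag, vacuum
|             # statistics or load average instead.
|             time.sleep(min(elapsed * 5, 30))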
| YorickPeterse wrote:
| Good point, I forgot about tunability. At GitLab we did have a
| mechanism for this for background migrations; IIRC it looked at
| the vacuuming statistics of PostgreSQL and would back off if it
| deemed there was too much load.
| shlomi-noach wrote:
| Vitess maintainer here. While Vitess can still use gh-ost, the
| recommended way is to use Vitess' internal VReplication
| mechanism. With this, you can throttle, pause, resume, cancel,
| retry, revert a migration, as well as auto-resume from failover
| (migration continues from point of interruption once a new
| primary is instated).
|
| Vitess throttling is by default based on replication lag, but
| you can use different metrics, such as load average, or indeed
| multiple metrics combined, to define what constitutes too much load.
| toolslive wrote:
| Not exactly an SQL database, but we were migrating data
| _constantly_ (dead devices, load balancing, ...) for the
| data/metadata of an exabyte-scale object store...
|
| Once client code can work with the mixed state (while the
| migration is in progress), it no longer matters how long it takes.
| Once the migration is robust enough to handle crashes, interrupts,
| etc., it no longer matters how often you trigger the "continue".
| The migration is using too many iops? Just kill it and schedule a
| continuation later.
|
| Also, your smallest step needs to be an atomic multi-update (you
| don't want to bother with partial failures).
| Temporary_31337 wrote:
| I work for a nosql db vendor. We don't have a fixed schema so
| there are no migrations. If you want fast access to some parts of
| data you just add an index and reindex async. If you need SQL for
| Tableau or PowerBI you just add a view which is also like an
| index.
|
| But the underlying data and its model can be in flux, and we
| handle exabyte scale, HA, DR, rebalancing, etc.
| YorickPeterse wrote:
| If you enjoy maintaining 5 different formats of the same data,
| then by all means use a document storage DB. But at some point
| you'll have to migrate the data, at which point you'll face the
| same issues (or possibly worse due to the lack of support for
| schema migrations) as relational databases.
| sgarland wrote:
| > As I've hinted at before, migrations should be functions.
| Functions as in "a function written in a programming language",
| not functions as in "SQL snippets I just refer to as functions
| because it makes me sound smart".
|
| Disagree. SQL is a declarative language that is both clear and
| universal. Feel free to generate the migrations however you'd
| like, but I want to see the SQL.
|
| Another bonus (and I'm sure I'll be told I'm wrong) is that you
| don't need to write tests for the migration, because it's
| declarative. Assuming you know what it means (if you don't, maybe
| you shouldn't be administering a DB) and what its locking methods
| entail, you will get _precisely_ what it says, and nothing more.
| If you get a failure from, say, a duplicate entry when creating a
| UNIQUE constraint, that's orthogonal to the migration itself -
| you described the end state, the DB tried to make that happen,
| but was unable to do so due to issues with your data. All the
| tests in the world wouldn't catch that, short of selecting and
| deduping the column[s], and at that point, you're just creating
| work.
|
| I am firmly convinced that any and all infrastructure should be
| declaratively instantiated, and declaratively maintained. I do
| not want or need to test my Terraform files, nor my DDL.
| YorickPeterse wrote:
| > Disagree. SQL is a declarative language that is both clear
| and universal. Feel free to generate the migrations however
| you'd like, but I want to see the SQL.
|
| This works until you need something beyond the trivial "UPDATE
| table SET column1 = column2", such as "Update the values in
| column X with the contents of file Y that exists in Amazon S3",
| or really anything else you can't express in SQL.
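|
| As a rough illustration of what that can look like as a function
| (assuming boto3 and psycopg2; the bucket, key, table and columns
| are hypothetical):
|
|     import csv
|     import io
|     import boto3
|     import psycopg2
|
|     def migrate(conn):
|         """Update widgets.name from a CSV mapping stored in S3."""
|         body = boto3.client("s3").get_object(
|             Bucket="my-bucket",
|             Key="migrations/widget_names.csv",
|         )["Body"].read().decode("utf-8")
|
|         with conn, conn.cursor() as cur:
|             for row in csv.DictReader(io.StringIO(body)):
|                 # Arbitrary per-row logic can live here, e.g.
|                 # skipping rows based on other application state.
|                 cur.execute(
|                     "UPDATE widgets SET name = %s WHERE id = %s",
|                     (row["name"], int(row["id"])),
|                 )
|
|     migrate(psycopg2.connect("dbname=app"))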
|
| > Another bonus (and I'm sure I'll be told I'm wrong) is that
| you don't need to write tests for the migration, because it's
| declarative.
|
| This indeed is wrong, and essentially comes down to "It looks
| simple so it's correct". A SQL query "DELETE FROM users" might
| be correct, but if you meant for it to be "DELETE FROM users
| WHERE id IN (...)" it's going to cause problems.
|
| In other words, at least for data migrations you _absolutely_
| have to write tests or you _will_ run into problems.
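|
| A sketch of what such a test can look like (pytest-style, run
| against a throwaway database; all names are hypothetical):
|
|     import psycopg2
|
|     def migrate(conn):
|         # The data migration under test: only banned users go.
|         with conn, conn.cursor() as cur:
|             cur.execute("DELETE FROM users WHERE banned")
|
|     def test_migration_only_removes_banned_users():
|         conn = psycopg2.connect("dbname=app_test")
|         with conn, conn.cursor() as cur:
|             cur.execute("TRUNCATE users")
|             cur.execute(
|                 "INSERT INTO users (id, banned) "
|                 "VALUES (1, true), (2, false)")
|
|         migrate(conn)
|
|         with conn, conn.cursor() as cur:
|             cur.execute("SELECT id FROM users ORDER BY id")
|             # A missing WHERE clause would fail this assertion.
|             assert [r[0] for r in cur.fetchall()] == [2]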
| sgarland wrote:
| These are not migrations in the traditional sense. I was
| referring to DDL. Nevertheless:
|
| > "Update the values in column X with the contents of file Y
| that exists in Amazon S3"
|
| You actually can do that in SQL, assuming it's hosted on AWS
| (or for Postgres, you've installed the correct extension). It
| would be a bit convoluted (I think to handle the UPDATE,
| you'd first dump from S3 into a temp table, then update from
| that), but it would work.
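|
| A rough sketch of that route (here the S3 read is done client-side
| with boto3 feeding COPY; on RDS the aws_s3 extension can do the
| import server-side instead; all names are made up):
|
|     import boto3
|     import psycopg2
|
|     conn = psycopg2.connect("dbname=app")
|     s3_body = boto3.client("s3").get_object(
|         Bucket="my-bucket", Key="widget_names.csv")["Body"]
|
|     with conn, conn.cursor() as cur:
|         # 1. Dump the file into a temp table...
|         cur.execute(
|             "CREATE TEMP TABLE staging (id bigint, name text)")
|         cur.copy_expert(
|             "COPY staging FROM STDIN WITH (FORMAT csv)", s3_body)
|         # 2. ...then update from it in one set-based statement.
|         cur.execute(
|             "UPDATE widgets SET name = staging.name "
|             "FROM staging WHERE widgets.id = staging.id")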
|
| > A SQL query "DELETE FROM users" might be correct, but if
| you meant for it to be "DELETE FROM users WHERE id IN (...)"
| it's going to cause problems.
|
| If someone doesn't notice this issue in SQL, they're not
| going to notice it in an ORM, either. It's also possible (and
| a good idea) to enforce predicates for DELETE at a server
| configuration level, such that the DB will refuse to execute
| them. And really, if you actually want to delete everything
| in a table, you should be using TRUNCATE anyway.
| tucnak wrote:
| You can express it in SQL; you would need to create a table
| with "the contents of file Y" from S3, or use something like
| PostgreSQL foreign data wrappers to that effect: we're using
| pg_analytics[1] to query CSV/Parquet data, and the same
| applies to most databases these days.
|
| The limitation you're implying is not that of SQL, but of your
| data model.
|
| [1] https://github.com/paradedb/pg_analytics
| YorickPeterse wrote:
| Both you and sgarland are missing the point here: the
| example I gave is literally just that: an example. There
| could be many more complex cases, some relying on other
| business logic provided by your application, some more
| isolated. Some of those cases could be handled using some
| random PG extension, but for many that won't be sufficient.
|
| In contrast, a migration written in an actual programming
| language can handle all such cases. Depending on the amount
| of abstractions applied, it can also look close enough to a
| declarative language (in other words, it doesn't have to be
| verbose).
|
| So yes, the limitation I'm implying very much _is_ a
| limitation of SQL (or really any declarative query language
| for that matter). It has nothing to do with the data model
| as it applies equally well to using e.g. MongoDB, Redis, or
| really anything else that stores a pile of data you may
| want to transform in complex ways.
| sgarland wrote:
| I think you're missing the point that what you're
| describing is not a migration as a DB{A,RE} would call
| it. That is what I'm pushing back on.
|
| If you're shifting or otherwise manipulating tuples, then
| yes, you probably want to handle that logic in a more
| general-purpose language (though it isn't _required_,
| annoying though it might be otherwise).
|
| But for DDL? No, I don't want anything in the way. The
| state should be stored in VCS, of course, but I don't
| want some abstraction trying to be clever about ALTER
| TABLE foo. Just run it how I specified.
| yayitswei wrote:
| The database I'm working with, Rama, has an interesting way to do
| "instant migrations" at scale. Your migration function runs on
| every read until data is durably migrated in the background. So
| you get migrated data immediately after deploy, while Rama
| quietly persists the changes during normal processing. No
| downtime.
|
| https://blog.redplanetlabs.com/2024/09/30/migrating-terabyte...
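|
| Independent of Rama's actual API, the general read-path pattern
| described above looks roughly like this (hypothetical record
| shapes and store interface):
|
|     CURRENT_VERSION = 2
|
|     def migrate(record):
|         # Pure function: old-shape record in, new shape out.
|         if record["version"] < 2:
|             record["full_name"] = record.pop("name")
|             record["version"] = 2
|         return record
|
|     def read(store, key):
|         record = store[key]
|         # Reads return migrated data immediately after deploy,
|         # even before the backfill has rewritten this record.
|         if record["version"] < CURRENT_VERSION:
|             record = migrate(record)
|         return record
|
|     def background_backfill(store):
|         # Meanwhile a background task persists the new shape.
|         for key, record in store.items():
|             if record["version"] < CURRENT_VERSION:
|                 store[key] = migrate(record)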
| victorNicollet wrote:
| Avoiding SQL migrations was my #1 reason for moving to event
| sourcing.
|
| This approach cuts the "database server" into an event stream (an
| append-only sequence of events), and a cached view (a read-only
| database that is kept up-to-date whenever events are added to the
| stream, and can be queried by the rest of the system).
|
| Migrations are overwhelmingly cached view migrations (that don't
| touch the event stream), and in very rare cases they are event
| stream migrations (that don't touch the cached view).
|
| A cached view migration is made trivial by the fact that multiple
| cached views can co-exist for a single event stream. Migrating
| consists in deploying the new version of the code to a subset of
| production machines, waiting for the new cached view to be
| populated and up-to-date (this can take a while, but the old
| version of the code, with the old cached view, is still running
| on most production machines at this point), and then deploying
| the new version to all other production machines. Rollback
| follows the same path in reverse (with the advantage that the old
| cached view is already up-to-date, so there is no need to wait).
|
| An event stream migration requires a running process that
| transfers events from the old stream to the new stream as they
| appear (transforming them if necessary). Once the existing events
| have been migrated, flip a switch so that all writes point to the
| new stream instead of the old.
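|
| A bare-bones sketch of that transfer process (generic Python, not
| tied to any particular event store; the stream interface and event
| shapes are made up):
|
|     import time
|
|     def transform(event):
|         # Rewrite old-format events into the new format.
|         if event["type"] == "user_renamed_v1":
|             event = dict(event)
|             event["type"] = "user_renamed_v2"
|             event["full_name"] = event.pop("name")
|         return event
|
|     def migrate_stream(old_stream, new_stream, switch_writes):
|         position = 0
|         switched = False
|         while True:
|             events = old_stream.read(from_position=position)
|             for event in events:
|                 new_stream.append(transform(event))
|                 position += 1
|             if not events:
|                 if switched:
|                     return  # old stream fully drained
|                 # Existing events are migrated: flip the switch so
|                 # writers append to the new stream, then make one
|                 # more pass to drain any stragglers.
|                 switch_writes()
|                 switched = True
|             time.sleep(0.5)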
| bjornsing wrote:
| Having worked with migrating petabyte scale sharded SQL databases
| I get a sinking feeling when I think about automating this. I
| think it would actually be a pretty good test case for artificial
| super intelligence.
___________________________________________________________________
(page generated 2024-10-29 23:02 UTC)