[HN Gopher] Update millions of records in Rails
___________________________________________________________________
Update millions of records in Rails
Author : thunderbong
Score : 45 points
Date : 2023-04-26 17:36 UTC (5 hours ago)
(HTM) web link (blog.eq8.eu)
(TXT) w3m dump (blog.eq8.eu)
| colonwqbang wrote:
| Postgres has a built-in function that converts strings into
| lowercase. Why not use that? This seems like an extremely
| convoluted way of performing the example task.
|
| One could suppose that it's just a contrived simple example to
| show the point, but the headline suggests otherwise.
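|
| A minimal sketch of that approach (assuming the addresses
| table with city/state columns from the article's example),
| doing the whole thing in one statement:
|
|     # lower() runs inside Postgres; no rows travel to Ruby
|     ActiveRecord::Base.connection.execute(<<~SQL)
|       UPDATE addresses
|       SET city = lower(city), state = lower(state)
|       WHERE city <> lower(city) OR state <> lower(state)
|     SQL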
| stevebmark wrote:
| The best part is the map that not only mutates its own
| arguments, but also mutates the arguments to the function,
| which all is in a "service" which has to be instantiated and
| then "called." I really wish the Rails community would pick up
| best practices from other languages.
| manv1 wrote:
| Really, a better solution might be to spew the results into
| another table and then schedule downtime and replace the original
| table (via rename or whatever).
|
| It all depends on your database, which is why you need an actual
| DBA role on your team instead of just using your database like an
| abstract storage entity.
| alex_lav wrote:
| > It all depends on your database, which is why you need an
| actual DBA role on your team instead of just using your
| database like an abstract storage entity.
|
| People seem to be getting along alright without it, no?
| Alifatisk wrote:
| Cool, does the same apply if I run MySQL?
| what-no-tests wrote:
| Ooohh umm...no transaction?
|
| WhenEVER I make an insert, update, or delete -- I always use a
| transaction.
|
| That's what we should always do, right?
|
| Right?
| whalesalad wrote:
| If you're updating a single entity it doesn't really matter. An
| exception is if you're doing something like select for update
| in psql where you're explicitly introducing locks. If you're
| performing multiple operations simultaneously and want them all
| to go or die together, yes.
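|
| For reference, a minimal sketch of that select-for-update
| case in ActiveRecord (Account, account_id and amount are
| hypothetical), where the transaction is what holds the lock:
|
|     Account.transaction do
|       # SELECT ... FOR UPDATE; the row lock is held until
|       # the transaction commits or rolls back
|       account = Account.lock.find(account_id)
|       account.update!(balance: account.balance - amount)
|     end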
| SkyPuncher wrote:
| Transactions are only necessary when the change is tightly
| coupled.
|
| ---
|
| In this case, it's okay for individual records to fail. They
| can be manually processed, re-run, or root-caused and handled
| individually.
| swe_dima wrote:
| Grouping several updates inside a transaction actually makes
| it faster, because the database doesn't have to "commit"
| changes for each one. This has diminishing returns, however:
| if too much data changes inside a transaction, it can become
| slower due to memory issues. So something like 1,000-10,000
| updates per transaction is the sweet spot, I think.
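|
| A minimal sketch of that idea, assuming an Address model
| over the article's addresses table (the batch size is just
| a starting point to tune):
|
|     # one COMMIT per 5,000 records instead of one per record
|     Address.in_batches(of: 5_000) do |batch|
|       Address.transaction do
|         batch.each do |address|
|           address.update!(city:  address.city&.downcase,
|                           state: address.state&.downcase)
|         end
|       end
|     end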
| what-no-tests wrote:
| THIS.
|
| A lot of the work I do involves analyzing and improving
| Ruby applications' performance. Often, simply wrapping
| transactions around the database operations will help -
| without making other (often much needed) structural
| changes.
| chrisandchris wrote:
| I think you really don't want to keep a transaction > 5 hours
| open.
|
| > For our setup/task (just update few fields on a table) the
| process of probing different batch sizes & Sidekiq thread
| numbers with couple of thousands/millions records took about 5
| hours [...]
| jaggederest wrote:
| I've seen transactions (pathologically) held open for _weeks_
| with no huge issues. Postgres is _amazing_. You're 100%
| right that you should not do it though. Also little things
| like if you have more than some very large number of
| transactions while one transaction is open, things start to
| break down.
| tinglymintyfrsh wrote:
| Wasteful and the wrong place to do it. Also, Sidekiq is slow,
| old, and commercial crippleware.
|
| Generate and send SQL (wrapped in transactions) from C bindings
| via a proper background job processor like beanstalkd in batches
| of say 50k. Getting Ruby involved in millions of records is
| asking for trouble.
| jaggederest wrote:
| This is the perfect time to drop down into the database and let
| the database handle it. If you're doing mass migration of data,
| the data should not ever touch rails.
|
| Also, this is another example of why normalizing your schemas is
| important - in a normalized schema, you would only have fiftyish
| state records, and a few tens of thousands of city records. (Yes,
| this is just an example, I'm sure they wouldn't actually update
| city and state on every record this way, but it's a useful
| example of why normalized data is easier to work with. Imagine if
| you missed a few records - now you have a random assortment of
| Cleveland, CLEVELAND, and cleveland, where if your schema were
| normalized with a constraint, you'd have one city_id pointing to
| Cleveland, OH)
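|
| A rough sketch of that normalized shape as a migration
| (table and column names are illustrative, not from the
| article):
|
|     class NormalizeCities < ActiveRecord::Migration[7.0]
|       def change
|         create_table :states do |t|
|           t.string :code, null: false       # e.g. "OH"
|         end
|         add_index :states, :code, unique: true
|
|         create_table :cities do |t|
|           t.string :name, null: false       # e.g. "Cleveland"
|           t.references :state, null: false, foreign_key: true
|         end
|         add_index :cities, [:name, :state_id], unique: true
|
|         # each address now points at one canonical city row
|         add_reference :addresses, :city, foreign_key: true
|       end
|     end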
| hk1337 wrote:
| Agreed. Rails uses ActiveRecord and you'll most likely be
| trying to load millions of records through an ORM which will
| make you go insane trying to get it to work efficiently.
|
| I have no problem with Rails or ActiveRecord but bulk upload of
| data through any ORM is a big no for me.
| izietto wrote:
| > Rails uses ActiveRecord and you'll most likely be trying to
| load millions of records through an ORM which will make you
| go insane trying to get it to work efficiently.
|
| You just need to use batches (the `.in_batches` in the
| link); it's an easy-peasy approach which is way easier on
| your database than a single update of 500m records. I've
| been using it for years and it has never raised a single
| issue.
| FrancoisBosun wrote:
| The downcasing is done in Ruby-land, thus the records had
| to be loaded in memory, on the worker. If the downcasing
| was done on the DB itself, then the records would not have
| needed to travel all the way to the worker, only to be
| discarded 200ms later. Better to run an UPDATE query
| directly against the DB instance, with IDs to control
| concurrency:
|
|     UPDATE addresses SET city = lower(city),
|       state = lower(state) WHERE id BETWEEN a AND b
| revscat wrote:
| This is what the `in_batches` method already does. See
| [1], which is called by the `in_batches` method for
| update operations.
|
| [1] https://api.rubyonrails.org/classes/ActiveRecord/Relation.ht...
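|
| Put together, that looks roughly like this, assuming an
| Address model over the article's addresses table (one
| UPDATE per batch, with lower() running in the database and
| no Address objects built in Ruby):
|
|     Address.in_batches(of: 10_000)
|            .update_all("city = lower(city), state = lower(state)")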
| inferiorhuman wrote:
| Yeah, one of the first things I learned (Ruby 1.8) was
| that Ruby is slow. If you can avoid creating a bunch of
| Ruby objects and push the logic into the database you'll
| typically see performance gains. The big exception was
| Mongo, where more "advanced" features got shoved off to
| the JS interpreter which was an expensive operation - at
| least until the 'aggregation framework' became a thing.
|
| Anyways. Before blindly doing string manipulation in
| Postgres, make sure that all the locale settings are
| correct.
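|
| For instance, a quick way to see which locale lower() will
| use for non-ASCII case mapping (output shown is just an
| example):
|
|     ActiveRecord::Base.connection.select_value("SHOW LC_CTYPE")
|     # => "en_US.UTF-8"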
| itake wrote:
| I've tried this, but we ran into high CPU usage while the task
| was being run, causing other queries to time out or be delayed.
|
| Is there a way to send low-priority updates that won't
| interfere with other requests?
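|
| Postgres has no real notion of query priority, so a common
| workaround is to keep batches small and pause between them
| so other queries get a turn. A minimal sketch (batch size
| and sleep are assumptions to tune against your own load):
|
|     Address.in_batches(of: 1_000) do |batch|
|       batch.update_all("city = lower(city), state = lower(state)")
|       sleep 0.5   # let other queries breathe between batches
|     end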
| whalesalad wrote:
| Agreed. You could add/apply them as Rails migrations, too;
| that way the transformations exist as code and can be
| referenced/seen in the future.
| jaggederest wrote:
| There's a useful gem that creates a data_migrations
| directory (and associated table once run) for things like
| this that don't change the schema. I tend to write them as
| rake tasks with an `up` and `down` method if they are
| reversible. Also, hopefully, the migrations are idempotent
| and just do nothing if there are no affected records
| (though not all data changes are like that, of course).
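|
| As a sketch, such a rake task might look like this (file
| and task names are made up; the WHERE clause is what makes
| the up task safe to re-run):
|
|     # lib/tasks/downcase_addresses.rake (hypothetical)
|     namespace :data do
|       desc "Downcase city/state on addresses (idempotent)"
|       task downcase_addresses_up: :environment do
|         Address.where("city <> lower(city) OR state <> lower(state)")
|                .in_batches(of: 10_000)
|                .update_all("city = lower(city), state = lower(state)")
|       end
|
|       desc "No-op: original casing is lost, not reversible"
|       task downcase_addresses_down: :environment do
|         # intentionally empty
|       end
|     end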
| chao- wrote:
| > _There 's a useful gem that creates a data_migrations
| directory_
|
| Would you please share this useful gem's name? I have hand-
| rolled some variety of this solution so many times that it
| feels like just a part of life, like doing the laundry. If
| a gem can do that setup for me, I would gladly use it.
| jaggederest wrote:
| I was thinking of this one, I believe, but it's pretty
| old:
|
| https://github.com/OffgridElectric/rails-data-migrations
|
| This is maybe a more modern one I found:
|
| https://github.com/ilyakatz/data-migrate
| mrcb123 wrote:
| https://github.com/ilyakatz/data-migrate
|
| and for seeding data
| https://github.com/pboling/seed_migration
| simonw wrote:
| "Letting the database handle it" for truly large tables can
| have some potentially tricky issues.
|
| The most common I've seen relate to replication. If your
| database has replicas and uses row-based replication (as
| opposed to statement-based replication) running "update
| huge_table set title = upper(title)" could create a billion
| updated rows, which then need to be copied to all of those
| replicas. This can cause big problems.
|
| In those cases, it's better to update 1,000 rows at a time
| using some kind of batching mechanism, similar to the one
| described in the article.
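|
| A rough sketch of that kind of batching, walking the primary
| key in fixed id ranges so each UPDATE (and each replicated
| chunk) stays small; the step of 1,000 is just a starting
| point, and Address stands in for the real model:
|
|     step   = 1_000
|     min_id = Address.minimum(:id).to_i
|     max_id = Address.maximum(:id).to_i
|
|     (min_id..max_id).step(step) do |lo|
|       Address.where(id: lo...(lo + step))
|              .update_all("city = lower(city), state = lower(state)")
|     end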
| jaggederest wrote:
| I agree, but I would do the batching in the database rather
| than in a job, using something like a function that counts up
| or pg_sleep between queries with a subset of the range of
| existing rows.
| jmuguy wrote:
| We have a need to upsert millions of records in Rails on an
| ongoing basis (weekly). The biggest problem we ran into was
| using Sidekiq with too many workers, each making smallish
| inserts. You don't run out of connections in that case, but
| PG cannot handle that many parallel inserts into the same
| table(s).
|
| After discussing with a friend who frequently works with
| large data sets, he suggested that single very large
| upserts/inserts (50k to 100k records) would be handled much
| better by PostgreSQL, and they were.
|
| Our final setup still has all those Sidekiq workers, but
| they dump the records as JSON blobs into Kafka. From there
| a single consumer pulls them out and buffers. Once it has
| 50k, it dumps them all in a single ActiveRecord upsert.
| It's been working very well for us for a while now.
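|
| The buffering consumer in that setup could be sketched like
| this (`consumer.each_message`, the Record model and the
| unique index name are placeholders for whatever your Kafka
| client and schema actually provide):
|
|     BATCH_SIZE = 50_000
|     buffer = []
|
|     consumer.each_message do |message|
|       buffer << JSON.parse(message.value).symbolize_keys
|       next if buffer.size < BATCH_SIZE
|
|       # one statement upserts the whole 50k-row batch
|       Record.upsert_all(buffer,
|                         unique_by: :index_records_on_external_id)
|       buffer.clear
|     end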
| albertgoeswoof wrote:
| If you want to build scalable software, this is the only line you
| need from this blog post:
|
| > You need to schedule few thousand record samples and monitor
| how well/bad will your worker perform
|
| I find it staggering how many senior developers don't know how
| to profile or performance test their software. Just throwing a
| few million rows into a local db and measuring your app's
| performance is so basic, yet I've seen many teams not even
| consider this, and others farm it out to a "QA".
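|
| Even a crude version of that pays off: seed a few million
| synthetic rows locally, then time the actual job. A sketch,
| with made-up numbers and an assumed Address model:
|
|     require "benchmark"
|
|     # seed ~2M synthetic rows in chunks of 10k
|     2_000_000.times.each_slice(10_000) do |chunk|
|       Address.insert_all(chunk.map { |i|
|         { city: "City #{i}", state: ["OH", "oh", "Oh"].sample }
|       })
|     end
|
|     seconds = Benchmark.realtime do
|       Address.in_batches(of: 10_000)
|              .update_all("city = lower(city), state = lower(state)")
|     end
|     puts "full run: #{seconds.round(1)}s"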
| inferiorhuman wrote:
| I've seen two things that prompt this behavior:
|
| - Fear of the database. For some folks, keeping the logic in
| the app, which is fully under their control, is comforting. I
| find this to be common among folks who've been burnt by their
| db in the past. It's a lot easier to assert (quality) control
| over your own software.
|
| - Difficulty in modeling data. This is more tricky, but IMO
| separates code monkeys from developers.
| what-no-tests wrote:
| Yes - often what once worked when the application only had a
| few hundred or thousand records doesn't work anymore, now that
| it has millions of records.
|
| Those milliseconds, N+1 queries, nested queries, large joins
| (owing to incongruence between logical data structure and how
| that data is stored on disk) and poorly-indexed tables combine
| into lots of sad times.
___________________________________________________________________
(page generated 2023-04-26 23:00 UTC)