[HN Gopher] Update millions of records in Rails
       ___________________________________________________________________
        
       Update millions of records in Rails
        
       Author : thunderbong
       Score  : 45 points
       Date   : 2023-04-26 17:36 UTC (5 hours ago)
        
 (HTM) web link (blog.eq8.eu)
 (TXT) w3m dump (blog.eq8.eu)
        
       | colonwqbang wrote:
        | Postgres has a built-in function that converts strings into
        | lowercase. Why not use that? This seems like an extremely
        | convoluted way of performing the example task.
       | 
       | One could suppose that it's just a contrived simple example to
       | show the point, but the headline suggests otherwise.
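        | 
        | For the example in the post, that's roughly a one-liner
        | (assuming the article's Address model); the whole thing runs
        | as a single UPDATE inside Postgres:
        | 
        |     # no rows are loaded into Ruby; lower() runs in the db
        |     Address.update_all(
        |       "city = lower(city), state = lower(state)"
        |     )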
        
         | stevebmark wrote:
          | The best part is the map that not only mutates its own
          | arguments but also mutates the arguments to the function, all
          | of which lives in a "service" that has to be instantiated and
          | then "called." I really wish the Rails community would pick
          | up best practices from other languages.
        
       | manv1 wrote:
       | Really, a better solution might be to spew the results into
       | another table and then schedule downtime and replace the original
       | table (via rename or whatever).
       | 
       | It all depends on your database, which is why you need an actual
       | DBA role on your team instead of just using your database like an
       | abstract storage entity.
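        | 
        | A rough sketch of that shape (Postgres; assuming a simple
        | `addresses` table with only id/city/state columns):
        | 
        |     conn = ActiveRecord::Base.connection
        |     # build the transformed copy ahead of time
        |     conn.execute(<<~SQL)
        |       CREATE TABLE addresses_new (LIKE addresses INCLUDING ALL)
        |     SQL
        |     conn.execute(<<~SQL)
        |       INSERT INTO addresses_new (id, city, state)
        |       SELECT id, lower(city), lower(state) FROM addresses
        |     SQL
        |     # then, during the downtime window, swap the tables
        |     conn.execute(
        |       "ALTER TABLE addresses RENAME TO addresses_old")
        |     conn.execute(
        |       "ALTER TABLE addresses_new RENAME TO addresses")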
        
         | alex_lav wrote:
         | > It all depends on your database, which is why you need an
         | actual DBA role on your team instead of just using your
         | database like an abstract storage entity.
         | 
         | People seem to be getting along alright without it, no?
        
       | Alifatisk wrote:
       | Cool, is the same applicable if I run mysql?
        
       | what-no-tests wrote:
       | Ooohh umm...no transaction?
       | 
       | WhenEVER I make an insert, update, or delete -- I always use a
       | transaction.
       | 
       | That's what we should always do, right?
       | 
       | Right?
        
         | whalesalad wrote:
          | If you're updating a single entity it doesn't really matter.
          | An exception is if you're doing something like SELECT ... FOR
          | UPDATE in Postgres, where you're explicitly introducing
          | locks. If you're performing multiple operations and want them
          | all to go or die together, then yes.
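          | 
          | For example (hypothetical Order/Payment models), two writes
          | that must succeed or fail as a unit:
          | 
          |     # both writes commit together, or neither does; `lock`
          |     # is the SELECT ... FOR UPDATE case mentioned above
          |     ActiveRecord::Base.transaction do
          |       order = Order.lock.find(42)
          |       order.update!(status: "paid")
          |       Payment.create!(order: order, amount: order.total)
          |     end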
        
         | SkyPuncher wrote:
          | Transactions are only necessary when the changes are tightly
          | coupled.
          | 
          | ---
          | 
          | In this case, it's okay for individual records to fail. They
          | can be manually processed, re-run, or root-caused and handled
          | individually.
        
           | swe_dima wrote:
            | Grouping several updates inside a transaction actually
            | makes it faster because the database doesn't have to
            | "commit" changes for each one. This has diminishing
            | returns, however: if too much data changes inside a
            | transaction it can become slower due to memory issues. So
            | something like 1,000-10,000 updates per transaction is the
            | sweet spot, I think.
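            | 
            | A sketch of that idea (assuming the article's Address model
            | and a mid-range batch size):
            | 
            |     Address.in_batches(of: 5_000) do |batch|
            |       # one commit per 5,000 updates instead of one per row
            |       ActiveRecord::Base.transaction do
            |         batch.each do |address|
            |           address.update!(city: address.city&.downcase)
            |         end
            |       end
            |     end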
        
             | what-no-tests wrote:
             | THIS.
             | 
             | A lot of the work I do involves analyzing and improving
             | Ruby applications' performance. Often, simply wrapping
             | transactions around the database operations will help -
             | without making other (often much needed) structural
             | changes.
        
         | chrisandchris wrote:
          | I think you really don't want to keep a transaction open for
          | more than 5 hours.
         | 
         | > For our setup/task (just update few fields on a table) the
         | process of probing different batch sizes & Sidekiq thread
         | numbers with couple of thousands/millions records took about 5
         | hours [...]
        
           | jaggederest wrote:
            | I've seen transactions (pathologically) held open for
            | _weeks_ with no huge issues. Postgres is _amazing_. You're
            | 100% right that you should not do it, though. Also, little
            | things start to break down once some very large number of
            | other transactions have run while that one transaction is
            | still open.
        
       | tinglymintyfrsh wrote:
       | Wasteful and the wrong place to do it. Also, Sidekiq is slow,
       | old, and commercial crippleware.
       | 
       | Generate and send SQL (wrapped in transactions) from C bindings
       | via a proper background job processor like beanstalkd in batches
       | of say 50k. Getting Ruby involved in millions of records is
       | asking for trouble.
        
       | jaggederest wrote:
        | This is the perfect time to drop down into the database and
        | let the database handle it. If you're doing a mass migration
        | of data, the data should never touch Rails.
       | 
       | Also, this is another example of why normalizing your schemas is
       | important - in a normalized schema, you would only have fiftyish
       | state records, and a few tens of thousands of city records. (Yes,
       | this is just an example, I'm sure they wouldn't actually update
       | city and state on every record this way, but it's a useful
       | example of why normalized data is easier to work with. Imagine if
       | you missed a few records - now you have a random assortment of
       | Cleveland, CLEVELAND, and cleveland, where if your schema were
       | normalized with a constraint, you'd have one city_id pointing to
       | Cleveland, OH)
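        | 
        | A normalized version of the same data might look roughly like
        | this (hypothetical migration):
        | 
        |     class NormalizeCities < ActiveRecord::Migration[7.0]
        |       def change
        |         create_table :cities do |t|
        |           t.string :name,  null: false
        |           t.string :state, null: false
        |         end
        |         # one canonical row per city, enforced by the db
        |         add_index :cities, [:name, :state], unique: true
        |         # each address points at exactly one city row
        |         add_reference :addresses, :city, foreign_key: true
        |       end
        |     end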
        
         | hk1337 wrote:
         | Agreed. Rails uses ActiveRecord and you'll most likely be
         | trying to load millions of records through an ORM which will
         | make you go insane trying to get it to work efficiently.
         | 
         | I have no problem with Rails or ActiveRecord but bulk upload of
         | data through any ORM is a big no for me.
        
           | izietto wrote:
           | > Rails uses ActiveRecord and you'll most likely be trying to
           | load millions of records through an ORM which will make you
           | go insane trying to get it to work efficiently.
           | 
            | You just need to use batches (the `.in_batches` in the
            | link); it's an easy-peasy approach that is much easier on
            | your database than a single update of 500m records. I've
            | been using it for years and it has never caused a single
            | issue.
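            | 
            | For example, pushing the lowercasing into the database one
            | batch at a time (assuming the article's Address model):
            | 
            |     # one UPDATE per batch of 10,000 rows, not one per row
            |     Address.in_batches(of: 10_000) do |batch|
            |       batch.update_all(
            |         "city = lower(city), state = lower(state)"
            |       )
            |     end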
        
             | FrancoisBosun wrote:
              | The downcasing is done in Ruby-land, so the records had
              | to be loaded into memory on the worker. If the downcasing
              | were done in the DB itself, the records would not have
              | needed to travel all the way to the worker, only to be
              | discarded 200ms later. Better to run an UPDATE query
              | directly against the DB instance, with IDs to control
              | concurrency:
              | 
              |     UPDATE addresses SET city = lower(city),
              |       state = lower(state)
              |     WHERE id BETWEEN a AND b
        
               | revscat wrote:
               | This is what the `in_batches` method already does. See
               | [1], which is called by the `in_batches` method for
               | update operations.
               | 
               | [1] https://api.rubyonrails.org/classes/ActiveRecord/Rela
               | tion.ht...
        
               | inferiorhuman wrote:
                | Yeah, one of the first things I learned (Ruby 1.8) was
                | that Ruby is slow. If you can avoid creating a bunch of
                | Ruby objects and push the logic into the database,
                | you'll typically see performance gains. The big
                | exception was Mongo, where more "advanced" features got
                | shoved off to the JS interpreter, which was expensive -
                | at least until the "aggregation framework" became a
                | thing.
                | 
                | Anyway: before blindly doing string manipulation in
                | Postgres, make sure that all the locale stuff is set
                | correctly.
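                | 
                | For instance, a quick check of what the database is
                | actually using before relying on lower():
                | 
                |     # e.g. returns "en_US.UTF-8" or "C"
                |     ActiveRecord::Base.connection
                |       .select_value("SHOW lc_ctype")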
        
         | itake wrote:
         | I've tried this, but we ran into high CPU usage while the task
         | was being run, causing other queries to time out or be delayed.
         | 
          | Is there a way to send low-priority updates that won't
          | interfere with other requests?
        
         | whalesalad wrote:
          | Agreed. You could add/apply them as Rails migrations, too;
          | that way the transformations exist as code and can be
          | referenced/seen in the future.
        
           | jaggederest wrote:
            | There's a useful gem that creates a data_migrations
            | directory (and an associated table once run) for things
            | like this that don't change the schema. I tend to write
            | them as rake tasks with an `up` and a `down` if they are
            | reversible. Also, hopefully, the migrations are idempotent
            | and just do nothing if there are no affected records
            | (though not all data changes are like that, of course).
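            | 
            | A minimal sketch of that rake-task style (hypothetical task
            | name, assuming the article's Address model):
            | 
            |     # lib/tasks/data_downcase_addresses.rake
            |     namespace :data do
            |       desc "Downcase city on addresses"
            |       task downcase_addresses: :environment do
            |         # idempotent: only touches rows that still need it
            |         Address.where("city <> lower(city)")
            |                .in_batches do |batch|
            |           batch.update_all("city = lower(city)")
            |         end
            |       end
            |     end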
        
             | chao- wrote:
             | > _There 's a useful gem that creates a data_migrations
             | directory_
             | 
             | Would you please share this useful gem's name? I have hand-
             | rolled some variety of this solution so many times that it
             | feels like just a part of life, like doing the laundry. If
             | a gem can do that setup for me, I would gladly use it.
        
               | jaggederest wrote:
               | I was thinking of this one, I believe, but it's pretty
               | old:
               | 
               | https://github.com/OffgridElectric/rails-data-migrations
               | 
               | This is maybe a more modern one I found:
               | 
               | https://github.com/ilyakatz/data-migrate
        
               | mrcb123 wrote:
               | https://github.com/ilyakatz/data-migrate
               | 
               | and for seeding data
               | https://github.com/pboling/seed_migration
        
         | simonw wrote:
         | "Letting the database handle it" for truly large tables can
         | have some potentially tricky issues.
         | 
         | The most common I've seen relate to replication. If your
         | database has replicas and uses row-based replication (as
         | opposed to statement-based replication) running "update
         | huge_table set title = upper(title)" could create a billion
         | updated rows, which then need to be copied to all of those
         | replicas. This can cause big problems.
         | 
         | In those cases, it's better to update 1,000 rows at a time
         | using some kind of batching mechanism, similar to the one
         | described in the article.
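          | 
          | For instance, something along these lines keeps each
          | replicated write small (hypothetical model, mirroring the
          | example above):
          | 
          |     HugeTable.in_batches(of: 1_000) do |batch|
          |       # one small UPDATE per 1,000-row slice
          |       batch.update_all("title = upper(title)")
          |       sleep(0.1)  # give replication a chance to keep up
          |     end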
        
           | jaggederest wrote:
            | I agree, but I would do the batching in the database rather
            | than in a job: something like a function that counts up
            | through the id range, or pg_sleep between queries that each
            | cover a subset of the existing rows.
        
       | jmuguy wrote:
       | We have a need to upsert millions of records in Rails on an
        | ongoing basis (weekly). The biggest problem we ran into was
        | attempting to use Sidekiq with too many workers, each making
        | smallish inserts. You don't run out of connections in that
        | case, but PG cannot handle that many parallel inserts into the
        | same table(s).
       | 
        | After discussing it with a friend who frequently works with
        | large data sets, he suggested that single very large
        | upserts/inserts (50k to 100k records) would be handled much
        | better by PostgreSQL, and they were.
       | 
       | Our final setup still has all those Sidekiq workers, but they
       | dump the records as JSON blobs into Kafka. From there a single
       | consumer pulls them out and buffers. Once it has 50k, it dumps
        | them all in a single ActiveRecord upsert. It's been working very
       | well for us for a while now.
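        | 
        | The final step looks roughly like this (hypothetical consumer
        | loop and Record model; tail-of-buffer flush omitted):
        | 
        |     buffer = []
        |     # `consumer` stands in for whatever Kafka client is in use
        |     consumer.each_message do |message|
        |       buffer << JSON.parse(message.value, symbolize_names: true)
        |       if buffer.size >= 50_000
        |         # one INSERT ... ON CONFLICT statement for the batch
        |         Record.upsert_all(buffer, unique_by: :external_id)
        |         buffer.clear
        |       end
        |     end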
        
       | albertgoeswoof wrote:
       | If you want to build scalable software, this is the only line you
       | need from this blog post:
       | 
       | > You need to schedule few thousand record samples and monitor
       | how well/bad will your worker perform
       | 
        | I find it staggering how many senior developers don't know how
        | to profile or performance-test their software. Just throwing a
        | few million rows into a local db and measuring your app's
        | performance is so basic, yet I've seen many teams not even
        | consider it, and others farm it out to a "QA".
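        | 
        | Even something as crude as this catches most of the surprises
        | (hypothetical UpdateAddressesJob, throwaway local data,
        | assuming the article's Address model):
        | 
        |     require "benchmark"
        | 
        |     # seed a few million fake rows into the local db...
        |     5_000_000.times.each_slice(10_000) do |slice|
        |       Address.insert_all(
        |         slice.map { |i| { city: "City #{i}", state: "OH" } })
        |     end
        | 
        |     # ...then time the real job against them
        |     puts Benchmark.realtime { UpdateAddressesJob.perform_now }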
        
         | inferiorhuman wrote:
         | I've seen two things that prompt this behavior:
         | 
          | - Fear of the database. For some folks, keeping the logic in
          | the app, which is fully under their control, is comforting. I
          | find this to be common among folks who've been burnt by their
          | db in the past. It's a lot easier to assert (quality) control
          | over your own software.
          | 
          | - Difficulty in modeling data. This is trickier, but IMO it's
          | what separates code monkeys from developers.
        
         | what-no-tests wrote:
         | Yes - often what once worked when the application only had a
         | few hundred or thousand records doesn't work anymore, now that
         | it has millions of records.
         | 
         | Those milliseconds, N+1 queries, nested queries, large joins
         | (owing to incongruence between logical data structure and how
         | that data is stored on disk) and poorly-indexed tables combine
         | into lots of sad times.
        
       ___________________________________________________________________
       (page generated 2023-04-26 23:00 UTC)