Using find_in_batches and activerecord-import to process big data


Hi there. When you work on a greenfield project, everything goes well. But at some point every big project hits database performance issues. It becomes hard to change the DB structure or insert data, and queries get slower and slower. It's time to do something about it.

Let's consider a simple example: we have a Rails application with a User model. The model has a full_name field which consists of a first name and a last name. We get a request: move full_name out into a Profile model and split it into two separate fields, first_name and last_name.
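
For reference, here is a minimal sketch of the assumed schema; the table and column names simply follow the example above, and the migration version depends on your Rails release:

class CreateProfiles < ActiveRecord::Migration[6.0]
  def change
    create_table :profiles do |t|
      t.references :user, null: false, foreign_key: true # link each profile to its user
      t.string :first_name
      t.string :last_name
      t.timestamps
    end
  end
end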

To solve this we can create a rake task that fills in the first_name and last_name fields properly. But there is a catch: our project has been on the market for a while, and we have millions of users.

Let's first write the code the way we would on a greenfield project:

namespace :profiles do
  task populate_first_and_last_name: :environment do
    # Naive approach: load every user at once and insert profiles one by one
    User.all.each do |user|
      first_name, last_name = user.full_name.split(' ')
      Profile.create(user_id: user.id, first_name: first_name, last_name: last_name)
    end
  end
end

This code could work on a small app, but it causes a huge issue on a big database. The problem is this part: User.all. It sends the following query:

SELECT `users`.* FROM `users`

On a big database such a query is dangerous: it tries to load every user into memory at once, which can exhaust the app server's memory and put serious load on the DB server.

Let's find a way to process the data in smaller chunks.

As usual, Rails has already addressed this issue and provides the find_in_batches method.

find_in_batches fetches data from the database in batches small enough for the DB server to handle comfortably. By default each batch contains 1000 rows.
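
Since we will mention find_each later, it's worth noting the difference: both use the same batching under the hood, but find_each yields records one by one while find_in_batches yields whole arrays. A quick sketch:

# find_each yields one record at a time (still fetched in batches of 1000 under the hood)
User.find_each do |user|
  # process a single user
end

# find_in_batches yields an array of up to 1000 records at a time
User.find_in_batches do |users|
  # process the whole batch
end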

Let's rewrite our rake task with find_in_batches:

namespace :profiles do
  task populate_first_and_last_name: :environment do
    User.find_in_batches do |users|
      users.each do |user|
        first_name, last_name = user.full_name.split(' ')
        Profile.create(user_id: user.id, first_name: first_name, last_name: last_name)
      end
    end
  end
end

First it takes the first 1000 users and stores them in the users variable. Then we go through each user and fill in first_name and last_name on the profile. Only after that do we fetch the next 1000 users. The first SQL query looks like this:

SELECT  `users`.* FROM `users`  ORDER BY `users`.`id` ASC LIMIT 1000

As we see, Rails orders users by id and sets the limit to 1000.
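
For each following batch Rails remembers the last fetched id and adds a WHERE condition, so no row is read twice. Assuming the first batch ended at id 1000, the next query looks like this:

SELECT `users`.* FROM `users` WHERE `users`.`id` > 1000 ORDER BY `users`.`id` ASC LIMIT 1000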

If we want to change the batch size, we can pass the batch_size option:

User.find_in_batches(batch_size: 2000) do |users|
  # more code
end

This approach reduces the load on the DB server and makes the task finish much faster. But there is one more problem in this code: Profile.create. Our task issues as many INSERT queries as we have users. And we have millions of them!
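
In the query log this looks roughly like the following, repeated once per user (simplified, with made-up values):

INSERT INTO `profiles` (`user_id`, `first_name`, `last_name`) VALUES (1, 'John', 'Doe')
INSERT INTO `profiles` (`user_id`, `first_name`, `last_name`) VALUES (2, 'Jane', 'Roe')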

In this case we can use the activerecord-import gem. It allows us to insert a lot of DB rows with a single INSERT query.
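
To start using it, add the gem to the Gemfile and run bundle install:

# Gemfile
gem 'activerecord-import'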

The gem's documentation contains many usage examples.
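
For instance, besides importing an array of models as we do below, the gem can import an array of columns plus an array of value rows, skipping model instantiation entirely; the sample values here are made up:

columns = [:user_id, :first_name, :last_name]
values  = [[1, 'John', 'Doe'], [2, 'Jane', 'Roe']]
Profile.import columns, values, validate: false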

Using this gem, we can rewrite the rake task like this:

namespace :profiles do
  task populate_first_and_last_name: :environment do
    User.find_in_batches do |users|
      profiles = []
      users.each do |user|
        first_name, last_name = user.full_name.split(' ')
        profiles << Profile.new(user_id: user.id, first_name: first_name, last_name: last_name)
      end
      # One bulk INSERT per batch instead of 1000 separate INSERTs
      Profile.import profiles, validate: false
    end
  end
end

We still fetch users in batches, but now we build an array of profiles, and at the end of each iteration a single INSERT creates a thousand profiles at once. Then the same logic applies to the next thousand.
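
Under the hood activerecord-import compiles the batch into a single multi-row INSERT, roughly like this (simplified, values abbreviated):

INSERT INTO `profiles` (`user_id`, `first_name`, `last_name`) VALUES (1, 'John', 'Doe'), (2, 'Jane', 'Roe'), ...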

Let's think about what this approach buys us. First of all, we will not bring down the DB server with a single User.all query. Second, we avoid per-row writes: iterating with find_each and calling Profile.create for each record would issue one INSERT per user, so 100 million users would mean about 100 million queries during the migration.

Using find_in_batches and activerecord-import we reduced the task's execution time and query count drastically: 100 million users can be migrated with roughly 200 thousand queries, 100 thousand batched SELECTs plus 100 thousand bulk INSERTs.

When dealing with big data, we are always looking for performance improvements, and find_in_batches is a good fit for tasks like this.

Thanks!
