find_in_batches and activerecord-import to process big data
Hi there. When you work on a greenfield project, everything goes well. But at some point every big project runs into database performance issues. It becomes hard to change the DB structure or insert data, and queries get slower and slower. It's time to do something about it.
Let's consider a simple example: we have a Rails application with a User model. The model has a full_name field which consists of a first name and a last name. We get a request: extract a Profile model and split full_name into two separate fields, first_name and last_name.
To solve this problem we can create a rake task which fills in the first_name and last_name fields properly. But there is a catch: our project has been on the market for a while, and we have millions of users.
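Before writing the task, note that the core transformation is just a string split. Here is a minimal sketch in plain Ruby (the sample name is made up; the original task uses a plain split(' '), while passing a limit of 2 is one way to keep multi-word last names intact):

```ruby
# Splitting a full name into first and last name.
# The limit argument (2) ensures everything after the first
# space stays together as the last name.
full_name = "Anna van der Berg"
first_name, last_name = full_name.split(' ', 2)
first_name # => "Anna"
last_name  # => "van der Berg"
```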
Let's try to write the code as if this were a greenfield project:
```ruby
namespace :profiles do
  task populate_first_and_last_name: :environment do
    User.all.each do |user|
      first_name, last_name = user.full_name.split(' ')
      Profile.create(user_id: user.id, first_name: first_name, last_name: last_name)
    end
  end
end
```
This code might work on a small app, but it causes a huge issue on a big database. The problem is User.all, which sends the following query:
```sql
SELECT `users`.* FROM `users`
```
On a big database such a query is evil: it loads millions of rows into memory at once and can have bad consequences. Let's find a way to process the data in smaller chunks.
As usual, Rails has already addressed this issue: find_in_batches fetches data from the database in batches, which allows the DB server to handle such queries. By default each batch contains 1000 rows.
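The batching idea itself can be sketched in plain Ruby with each_slice, an in-memory stand-in for what find_in_batches does against the database (the row count here is made up for illustration):

```ruby
# 2500 fake user ids standing in for rows in the users table.
ids = (1..2500).to_a

# Process them in batches of 1000, like find_in_batches would.
batches = ids.each_slice(1000).to_a

batches.size         # => 3
batches.map(&:size)  # => [1000, 1000, 500]
```

The difference is that each_slice slices an array already in memory, while find_in_batches issues a separate bounded query per batch, so the full table never has to fit in memory.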
Let's rewrite our rake task with find_in_batches:
```ruby
namespace :profiles do
  task populate_first_and_last_name: :environment do
    User.find_in_batches.each do |users|
      users.each do |user|
        first_name, last_name = user.full_name.split(' ')
        Profile.create(user_id: user.id, first_name: first_name, last_name: last_name)
      end
    end
  end
end
```
First it fetches the first 1000 users and stores them in the users variable. Then we go through each user and fill in first_name and last_name for the profile. Only after that does it fetch the next 1000 users.
The SQL queries look like this:

```sql
SELECT `users`.* FROM `users` ORDER BY `users`.`id` ASC LIMIT 1000
SELECT `users`.* FROM `users` WHERE `users`.`id` > 1000 ORDER BY `users`.`id` ASC LIMIT 1000
```

As we can see, Rails orders the rows by id and sets the limit to 1000. Each following batch adds a WHERE condition on the last id from the previous batch (1000 here), so the database never re-scans rows it has already returned.
If we want to change the batch size, we can pass the batch_size option:

```ruby
User.find_in_batches(batch_size: 2000) do |users|
  # more code
end
```
This approach reduces the load on the DB server and makes the task run much faster. But there is one more problem in this code: Profile.create. Our task performs as many INSERT queries as we have users. And we have millions of them!
In this case we can use the activerecord-import gem. It allows us to import a lot of DB rows with a single INSERT query. The documentation has many usage examples.
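Besides importing an array of model instances, activerecord-import also accepts a columns array plus an array of value rows. A rough sketch of that shape, using made-up user hashes in place of real records:

```ruby
# Columns and value rows in the shape Profile.import(columns, values) expects.
columns = [:user_id, :first_name, :last_name]

users = [
  { id: 1, full_name: "John Doe" },
  { id: 2, full_name: "Jane Roe" }
]

values = users.map do |user|
  first_name, last_name = user[:full_name].split(' ', 2)
  [user[:id], first_name, last_name]
end

values # => [[1, "John", "Doe"], [2, "Jane", "Roe"]]
```

With real models this data would then be passed as Profile.import columns, values, validate: false, saving the cost of instantiating an ActiveRecord object per row.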
Using this gem, we can rewrite the rake task this way:
```ruby
namespace :profiles do
  task populate_first_and_last_name: :environment do
    User.find_in_batches.each do |users|
      profiles = users.map do |user|
        first_name, last_name = user.full_name.split(' ')
        Profile.new(user_id: user.id, first_name: first_name, last_name: last_name)
      end
      Profile.import profiles, validate: false
    end
  end
end
```
We still fetch users in batches, but now we build an array of profiles and, at the end of each iteration, execute a single INSERT that creates a thousand profiles at once. Then the same logic is applied to the next thousand.
Let's consider the benefits of this approach. First of all, we don't break our DB server with a User.all query. Second, we avoid per-user writes: even find_each, which batches its SELECTs, would still issue one INSERT per user when combined with Profile.create, so 100 million users would cause roughly 100 million queries during the migration. Thanks to activerecord-import we reduce the execution time and the number of queries drastically: 100 million users can be migrated with about 200 thousand queries, one SELECT and one bulk INSERT per batch of 1000.
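The query-count claim is easy to verify with back-of-the-envelope arithmetic (using the default batch size of 1000, as in the task):

```ruby
users      = 100_000_000
batch_size = 1_000

batches = users / batch_size  # => 100_000
selects = batches             # one SELECT per batch
inserts = batches             # one bulk INSERT per batch

selects + inserts             # => 200_000 queries in total
```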
When dealing with big data, we always seek performance improvements. find_in_batches combined with activerecord-import is a good fit for tasks like this.