Processing big data with find_in_batches and activerecord-import
Hi there. When you work on a greenfield project, everything goes well. But at some point every big project runs into database performance issues: it becomes hard to change the DB structure or insert data, and queries get slower and slower. It's time to do something about it.
Let's consider a simple example: we have a Rails application with a User model. The model has a full_name field which consists of a first name and a last name. We get a request: move full_name out into a Profile model and split it into two separate fields, first_name and last_name.
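For this walkthrough, assume a profiles table along these lines (an illustrative sketch, not part of the original setup; the migration version tag depends on your Rails version):
class CreateProfiles < ActiveRecord::Migration[5.2]
  def change
    create_table :profiles do |t|
      t.references :user, index: true
      t.string :first_name
      t.string :last_name
      t.timestamps
    end
  end
end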
To solve this problem we can create a rake task which fills in the first_name and last_name fields properly. But there is a catch: our project has been on the market for a while, and we have millions of users.
Let's start by writing the code the way we would on a greenfield project:
namespace :profiles do
  task populate_first_and_last_name: :environment do
    User.all.each do |user|
      first_name, last_name = user.full_name.split(' ')
      Profile.create(user_id: user.id, first_name: first_name, last_name: last_name)
    end
  end
end
This code might work on a small app, but it causes a huge issue on a big database. The problem is in this part: User.all. It sends the following query:
SELECT `users`.* FROM `users`
On a big database such a query is dangerous: it loads every user into memory at once, which can exhaust the application server and put heavy load on the DB. Let's find a way to process the data in smaller chunks.
As usual, Rails has already addressed this issue and provides the find_in_batches method. find_in_batches fetches data from the database in batches of a size the DB server can handle comfortably. By default each batch is 1000 rows.
Let's rewrite our rake task with find_in_batches:
namespace :profiles do
  task populate_first_and_last_name: :environment do
    User.find_in_batches.each do |users|
      users.each do |user|
        first_name, last_name = user.full_name.split(' ')
        Profile.create(user_id: user.id, first_name: first_name, last_name: last_name)
      end
    end
  end
end
First it fetches the first 1000 users and stores them in the users variable. Then we go through each user and fill in first_name and last_name on their profile. Only after that do we fetch the next 1000 users.
SQL queries look like this:
SELECT `users`.* FROM `users` ORDER BY `users`.`id` ASC LIMIT 1000
As we can see, Rails orders users by id and sets the limit to 1000.
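For subsequent batches, find_in_batches keeps its position using the primary key rather than OFFSET. Assuming the last id in the previous batch was 1000, the next query would look like this:
SELECT `users`.* FROM `users` WHERE `users`.`id` > 1000 ORDER BY `users`.`id` ASC LIMIT 1000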
If we want to change the batch size, we can pass the batch_size parameter:
User.find_in_batches(batch_size: 2000) do |users|
# more code
end
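As a side note, find_in_batches also accepts a start option that bounds the primary key range (newer Rails versions add finish as well). This can be handy for resuming an interrupted task; a minimal sketch, with an arbitrary id:
# Resume processing from users whose id is >= 50_000
User.find_in_batches(batch_size: 2000, start: 50_000) do |users|
  # more code
end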
This approach reduces the load on the DB server and makes the task run much faster. But there is one more problem in this code: Profile.create. Our task issues as many INSERT queries as we have users, and we have millions of them! In this case we can use the activerecord-import gem. It allows us to insert a lot of DB rows with a single INSERT query.
Its documentation contains many usage examples.
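For instance, besides importing an array of models (as we'll do below), the gem can take a list of columns and raw value arrays. A minimal sketch with made-up values:
columns = [:user_id, :first_name, :last_name]
values  = [[1, 'John', 'Doe'], [2, 'Jane', 'Roe']]
Profile.import columns, values, validate: false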
Using this gem, we can rewrite the rake task this way:
namespace :profiles do
  task populate_first_and_last_name: :environment do
    User.find_in_batches.each do |users|
      profiles = []
      users.each do |user|
        first_name, last_name = user.full_name.split(' ')
        profiles << Profile.new(user_id: user.id, first_name: first_name, last_name: last_name)
      end
      Profile.import profiles, validate: false
    end
  end
end
We still fetch users in batches, but now we build up an array of profiles, and at the end of each cycle a single INSERT creates a thousand profiles at once. Then we apply the same logic to the next thousand.
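Roughly speaking (the exact SQL depends on your adapter), each batch now produces one multi-row statement instead of a thousand single-row ones:
INSERT INTO `profiles` (`user_id`, `first_name`, `last_name`) VALUES (1, 'John', 'Doe'), (2, 'Jane', 'Roe'), ...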
Let's look at the benefits of this approach. First of all, we won't break our DB server with a User.all query. We also avoid the per-record pattern: even find_each, which batches its SELECTs, would still issue one INSERT per user when combined with Profile.create, and a fully per-record approach (one SELECT plus one INSERT per user) would mean about 200 million queries for 100 million users.
Using find_in_batches and activerecord-import we reduced the execution time and the number of queries drastically: 100 million users migrate in about 200 thousand queries (100,000 batches, each issuing one SELECT and one INSERT).
When we deal with big data, we always look for performance improvements, and find_in_batches is a good fit for tasks like this.
Thanks!