How to speed up import from csv to database?

V

vikholodov2018-02-08 16:49:55

Django

vikholodov, 2018-02-08 16:49:55

We have django, postgresql and 15 million products in a csv file, all imported in 1 stream.
I do it like this:

with open('new_products.csv', 'r', encoding='utf-8') as csvfile:
...
for row in rows:
    product = Product()
    product.price = row[0]
    product.shipping = row[1]
    product.save()

etc. Perhaps there is some way to speed up this process several times?

Reply

Answer the question

In order to leave comments, you need to log in

1 answer(s)

N

neatsoft, 2018-02-08
@neatsoft

1. Iterate rows of several hundred pieces, and create objects corresponding to them with a single call to bulk_create:

import itertools

def grouper(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

...

limit = 250

for index, items in enumerate(grouper(rows, limit)):
    items = (i for i in items if i) # remove empty rows added by grouper

    products = []

    for item in items:
        product = Product(
            price=item[0],
            shipping=item[1],
        )
        products.append(product)

    Product.objects.bulk_create(products)

It is not worth creating all 15 million objects at once - nothing good will come of it.
2. If you need to create dependent objects, use transactions:

from django.db import transaction

...

limit = 250

for index, items in enumerate(grouper(rows, limit)):
    with transaction.atomic():
        for item in (i for i in items if i):
            product = Product.objects.create(
                price=item[0],
                shipping=item[1],
            )
            # product can be used there to create another objects, e.g.:
            for color in item[2].split(','):
                ProductColor.objects.create(
                    product=product,
                    color=color,
                )

Transactions, in this case, provide a significant performance boost.
3. Launch several processes, each of which should be given in the parameters from which position it should start, and how many lines to process. So that child processes do not have to read the entire file from the very beginning, it is better to pass the position not by line number, but by offset.
4. Combine 3 and 1 or 3 and 2.
5. Models in django are just tables in a database. You can add rows using SQL queries from a separate Python or Go script.