Optimizing objects.all() for a huge database. How to get everything and not freeze for N minutes?
Good day, friends.
I have a PostgreSQL database with two tables. The first contains about 3 million records (links to pages on the Internet). To build the second table, I have to go through every row of the first one, take its link, do some work over the network, and write the result back (each record in the first table produces 200+ records in the second).
The essence of the problem:
The most obvious approach:

for i in Item.objects.all():
    doSomething(i)

I also tried forcing the queryset into a list first:

items = list(Item.objects.all())
for i in items:
    doSomething(i)

Both variants freeze for many minutes.
When iterating, the entire queryset is loaded into memory, hence the problem. The solution proposed by Alexander Vtyurin, although somewhat clumsy, will work: the idea behind it is correct. A few years ago this problem was very acute, which is why Snippet #1949, widely known in narrow circles, appeared; it is built on exactly this principle.
But starting with Django 1.4, if I'm not mistaken, a regular tool appeared for exactly this purpose: the queryset's iterator() method.
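A minimal sketch of how iterator() might be applied here, assuming the Item model and doSomething() from the question:

# iterator() streams rows as they come from the database instead of
# building the queryset's internal result cache, so memory stays bounded.
for item in Item.objects.all().iterator():
    doSomething(item)

In newer Django versions the PostgreSQL backend additionally runs this through a server-side cursor, so the rows are streamed from the database rather than fetched all at once.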
There is a suspicion that the problem is not .all() but doSomething (going out to the Internet for every row and saving something to the database). You can run this code to check whether plain iteration is really the bottleneck:

for i, item in enumerate(Item.objects.all()):
    x = i + i  # dummy work: just iterate, no network calls and no writes
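If a concrete number is more convincing than a feel, here is a small sketch of the same check wrapped in a timer (only the standard library time module is added):

import time

start = time.monotonic()
for i, item in enumerate(Item.objects.all()):
    x = i + i  # same dummy body: no network calls, no writes
print('plain iteration took %.1f seconds' % (time.monotonic() - start))

If this finishes quickly, the minutes are being spent in doSomething, not in the ORM.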
I googled this solution, which fetches rows in batches of 1000 by primary-key range:

from django.db.models import Max

# filter() never raises Item.DoesNotExist; an empty pk range just returns
# an empty queryset, so the original "except ... break" would loop forever.
# Stop once the batch start passes the largest pk instead.
max_pk = Item.objects.aggregate(max_pk=Max('pk'))['max_pk'] or 0
i = 0
while i * 1000 <= max_pk:
    items = Item.objects.filter(pk__gte=i * 1000, pk__lt=(i + 1) * 1000)
    for j in items:
        doSomething(j)
    i += 1