Python
Nikita Kamenev, 2021-08-19 23:12:20

Why does asyncio hang with big data?

import asyncio
from aiohttp import ClientSession

# bound_fetch (not shown here) acquires the semaphore and fetches a single url
async def run(r):
    tasks = []
    sem = asyncio.Semaphore(1000)

    async with ClientSession() as session:
        for url in r:
            task = asyncio.ensure_future(bound_fetch(sem, url, session))
            tasks.append(task)

        responses = await asyncio.gather(*tasks)

with open('0.txt') as f:
    urls = f.read().splitlines()

que = []
for url in urls:
    que.append(url)

    if len(que) == 5000:
        loop = asyncio.get_event_loop()
        future = asyncio.ensure_future(run(que))
        loop.run_until_complete(future)
        que = []

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(que))
loop.run_until_complete(future)


I feed the tasks the URLs in chunks of 5000 elements. If you remove that chunking and submit the entire array of 1 million at once, it hangs almost instantly; when fed in slices of 5k, it starts to hang after about a couple of minutes.

Where is the mistake?

1 answer

Vindicar, 2021-08-19

You're loading the entire million addresses into memory, twice: first with f.read(), and then .splitlines() creates a second copy as a list of individual lines.
And yes, a million individual tasks is also a problem in itself: asyncio has to keep checking whether each of those tasks can continue running.
I would create a fixed-size pool of worker tasks and have each worker call f.readline() in a loop to get the next URL on its own. Then you don't need to keep the whole list in memory, and you get better control over the number of concurrent tasks. A sketch of this is shown below.
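A minimal sketch of that idea, assuming aiohttp and plain GET requests; the worker count, file name and response handling are illustrative, and a bounded asyncio.Queue is used here to hand URLs to the workers instead of having each worker call f.readline() directly:

import asyncio
from aiohttp import ClientSession

WORKERS = 100  # fixed pool size instead of a million simultaneous tasks

async def worker(queue, session):
    # Each worker pulls URLs from the queue until it receives the None sentinel.
    while True:
        url = await queue.get()
        if url is None:
            queue.task_done()
            break
        try:
            async with session.get(url) as resp:
                await resp.read()  # process the response here
        except Exception:
            pass  # log or retry as needed
        finally:
            queue.task_done()

async def main(path):
    # A bounded queue gives backpressure: the reader pauses while workers catch up.
    queue = asyncio.Queue(maxsize=WORKERS * 2)
    async with ClientSession() as session:
        workers = [asyncio.create_task(worker(queue, session)) for _ in range(WORKERS)]
        with open(path) as f:
            for line in f:  # the file is read line by line, never all at once
                url = line.strip()
                if url:
                    await queue.put(url)
        for _ in range(WORKERS):  # one sentinel per worker to shut the pool down
            await queue.put(None)
        await queue.join()
        await asyncio.gather(*workers)

asyncio.run(main('0.txt'))

This keeps at most a couple of hundred URLs and a hundred in-flight requests in memory at any moment, no matter how large the input file is.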
