Python
Kirill Gorelov, 2021-07-24 22:56:56

Python: how to make many requests and get the responses without errors?

The question is how to run many checks at once and collect the answers - that is, how to build a service for monitoring site pages.

We use the aiohttp Python library with multithreading. Locally I can do up to 25,000 checks in 2 minutes; the machine has 8 GB of RAM. Maybe on more powerful hardware it would be faster, I haven't checked yet, or maybe the code is just bad. In the future the number of checks may grow to 300 thousand ...
I also read this article https://pawelmhm.github.io/asyncio/python/aiohttp/... , but it only covers sending requests, not processing the responses ...

Questions:
1. How can I increase either the number of threads, or simply the number of checks at a time?
2. How do I correctly account for pages that delay their response or time out, and how sound is my approach?
3. How do I process the responses correctly when a lot of data arrives at once?

Answer:
It seems to me that the only solution is to use microservices.

As I see it.
A pool of tasks arrives, say 100 thousand. First I spread them across several servers, for example 5 of them, so each server handles 20 thousand at a time. All these tasks first land in the local database on each server, and later they can be merged into one common table. Alternatively, the data can be read from the different servers directly ... When there are even more checks, I will need to add servers and distribute the tasks among them .... A rough sketch of the split is below.
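As a minimal illustration of that split (the round-robin slicing is my assumption; how the chunks are shipped to each server, via a queue, an API, etc., is a separate question):

tasks = list(range(100_000))  # placeholder for 100,000 check tasks
n_servers = 5
# round-robin slicing: every server gets roughly len(tasks) / n_servers items
chunks = [tasks[i::n_servers] for i in range(n_servers)]
print([len(c) for c in chunks])  # [20000, 20000, 20000, 20000, 20000]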

Why do I need this? Honestly, for practice, to solve interesting problems for myself .... and to train at building microservices.

Maybe someone has already solved this, or knows where such tasks are covered with examples; I would be grateful.


2 answer(s)
Pavel Dunaev, 2021-07-29
@Kirill-Gorelov

First: NEVER mix multithreading with asynchrony. Asynchrony was invented precisely so that you don't have to spawn threads.
If I understood correctly, you want to make a program that checks the health of many sites at the same time.
See:

How can I increase either the number of threads, or simply the number of checks at a time?

Make several "workers" - that is, simply run the program several times. BUT doing this on one device will most likely not help, since the main load falls on the network, not on the CPU.
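A common complement to multiple workers is bounding the number of simultaneous requests inside each process with a semaphore. A minimal sketch, assuming aiohttp as in the question (the limit of 1000 is an arbitrary example, not a recommendation):

import asyncio
import aiohttp

async def run_checks(urls, limit=1000):
    sem = asyncio.Semaphore(limit)  # cap on simultaneous requests

    async def bounded(session, url):
        async with sem:  # at most `limit` coroutines pass this point at once
            async with session.get(url) as resp:
                return url, resp.status

    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(bounded(session, u) for u in urls))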
First I spread them across several servers, for example 5 of them, so each server handles 20 thousand at a time. All these tasks first land in the local database on each server, and later they can be merged into one common table. Alternatively, the data can be read from the different servers directly ...

That's right. But the database can be a single one shared by all servers (SQLite won't work; MySQL or PostgreSQL would be a decent option). You make a main server that hosts the database plus 1 worker, and several extra servers with 1 worker each that connect to the database on the main one.
The next step is to understand how an asynchronous request differs from a synchronous one.
[Diagram: a simplified comparison of a synchronous and an asynchronous request (in reality it is usually more complicated).]
1. Synchronous: for the whole time marked in red the program is not running; the OS suspends its execution. The program can execute only one request at a time, and most of that time it does nothing.
2. Asynchronous: the OS immediately returns something like "will do" and the program can continue working. But you need the actual response, so while asyncio is waiting for an answer from the OS (which is waiting for an answer from the server), it can perform other tasks (for example, start other requests). Now your program can make several requests at the same time; how many depends on the OS and the network.
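To make the difference concrete, here is a minimal self-contained sketch (asyncio.sleep stands in for waiting on the network; no real requests are made):

import asyncio
import time

async def fetch(i):
    await asyncio.sleep(1)  # stands in for waiting on the network
    return i

async def sequential():
    return [await fetch(i) for i in range(3)]  # the waits run one after another: ~3 s

async def concurrent():
    return await asyncio.gather(*(fetch(i) for i in range(3)))  # the waits overlap: ~1 s

start = time.perf_counter()
asyncio.run(sequential())
print(f"sequential: {time.perf_counter() - start:.1f} s")

start = time.perf_counter()
asyncio.run(concurrent())
print(f"concurrent: {time.perf_counter() - start:.1f} s")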
How do I correctly account for pages that delay their response or time out, and how sound is my approach?

The OS keeps track of the timeout for each request separately. The timeout can be set manually when creating a request; if you don't set it, it defaults to 5 minutes (which is waaay too much for a synchronous request, but will do for an asynchronous one). You can read about timeouts in the documentation.
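In aiohttp the timeout is expressed with aiohttp.ClientTimeout; a minimal sketch (the 10-second value is an arbitrary example):

import aiohttp

timeout = aiohttp.ClientTimeout(total=10)  # cap the whole request at 10 s instead of the 5-minute default

async def check(url):
    # the timeout can be set session-wide, as here, or per request via session.get(url, timeout=...)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(url) as resp:
            return url, resp.status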
About tasks in asyncio: to execute several functions in (pseudo) parallel, use the asyncio.gather() function:
import asyncio
import aiohttp

async def make_request(session, address):
    # the original left this body out; fetching the page text is one natural choice
    async with session.get(address) as resp:
        response = await resp.text()
    return address, response  # so that later we can tell which URL this is the answer to

async def main():
    urls = [...]
    async with aiohttp.ClientSession() as session:
        reqs = []
        for url in urls:
            reqs.append(make_request(session, url))  # note the absence of await: we don't want to wait for completion yet

        results = await asyncio.gather(*reqs)  # gather combines several coroutines into one; now we wait until all requests finish
    # results is now a list of (address, response) tuples
    for url, result in results:
        print(...)  # printing can also happen in the handler; it depends on what you want to do with the response

asyncio.run(main())

There are other ways to run tasks (ensure_future and create_task, they might work better for you).
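For example, a minimal create_task variant, reusing make_request from the sketch above:

import asyncio
import aiohttp

async def main(urls):
    async with aiohttp.ClientSession() as session:
        # create_task schedules each coroutine on the event loop immediately
        tasks = [asyncio.create_task(make_request(session, url)) for url in urls]
        for task in tasks:
            address, response = await task  # collect the results one by one
            print(address, len(response))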
How do I process the responses correctly when a lot of data arrives at once?

Handle each response separately, right after the request (in the example above, inside the make_request function). Here is an example from the documentation.
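A hedged sketch of such per-request handling that also accounts for timeouts and connection errors (returning a status tuple instead of the raw body is my assumption, not a requirement):

import asyncio
import aiohttp

async def make_request(session, url):
    try:
        async with session.get(url) as resp:
            body = await resp.text()  # process the response right here
            return url, resp.status, len(body)
    except asyncio.TimeoutError:
        return url, None, "timeout"
    except aiohttp.ClientError as e:
        return url, None, f"error: {e}"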
When writing code with aiohttp, don't forget what a request looks like:

async with session.get(...) as resp:

async with essentially contains an await inside itself, so while such code is executing, several other parallel tasks can run. Asynchronous code is also quite hard to debug, so use logging. You can start with regular prints, but not print("xx") and print("yy"): write print(f"Starting request to {url}...") and print(f"Got response from {url}...") instead, otherwise you won't be able to tell which site takes a long time to respond, how long a request to one site takes (logging and timestamps will help here), and on how many requests the program hangs.
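A small sketch of that logging with timestamps (the format string is just an example):

import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger(__name__)

async def make_request(session, url):
    log.info("Starting request to %s", url)
    async with session.get(url) as resp:
        body = await resp.text()
    log.info("Got response from %s (status %s)", url, resp.status)
    return url, body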

C0COK, 2021-07-25
@C0COK

I didn't quite understand what exactly you want to achieve, but what prevents you from doing asynchronous multithreading with try/except, specifying a timeout and verify=False? Each thread works on its own and doesn't wait for the previous one, and the timeout controls the maximum wait for a response from the site.
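A hedged sketch of what this answer seems to describe, using the requests library in a thread pool (requests is my assumption here, since verify= is its parameter; the pool size and timeout are arbitrary examples):

import concurrent.futures
import requests

def check(url):
    try:
        # timeout caps the wait; verify=False disables TLS certificate checks (insecure)
        resp = requests.get(url, timeout=10, verify=False)
        return url, resp.status_code
    except requests.RequestException as e:
        return url, f"error: {e}"

urls = ["https://example.com"]  # placeholder list
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    for url, status in pool.map(check, urls):
        print(url, status)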
