Have I implemented the parser correctly?
Hello, I wrote a parser for an official website. It can list anywhere from 0 to 350 matches, and to load the data for a match you have to request that match's page separately. Here is what I do: I load the main page, read which matches are listed, and remember their ids. Then I create the corresponding number of threads and start them. Each thread runs a function with essentially a single instruction: it loads the page and appends it to an array.
Then I wait for all the threads to finish and parse the collected data. With about 100 matches, all threads complete in roughly 0.7 seconds on average.
But sometimes, with the same number of matches, the time jumps to 1.5–3 seconds. Why might that be? Is it because of the GIL? Am I doing this the right way, and does this count as asynchronous requests? I still can't figure out what an asynchronous request actually is. Thanks for the help.
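For reference, here is a rough sketch of the scheme described above as I understand it (one thread per match); the URL pattern, the list of ids, and the helper names are placeholders, not actual code from the project.

import threading
import requests

MATCH_URL = 'https://example.com/match/{}'   # placeholder URL pattern

pages = []                      # each thread appends a downloaded page here
lock = threading.Lock()

def fetch(match_id):
    # the thread's single job: download the match page and store it
    html = requests.get(MATCH_URL.format(match_id)).text
    with lock:
        pages.append(html)

match_ids = [101, 102, 103]     # ids read from the main page (placeholder)
threads = [threading.Thread(target=fetch, args=(i,)) for i in match_ids]
for t in threads:
    t.start()
for t in threads:
    t.join()                    # wait for every thread, then parse `pages`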
No, it's not because of the GIL.
Creating one thread per page is wrong: spawning a thread is a relatively expensive operation for the OS.
So your approach is not entirely correct; it's better to reuse what already exists.
If scrapy doesn't suit you, you can use a thread pool instead of managing threads manually through the low-level API. More or less like this:
from concurrent.futures import ThreadPoolExecutor
from requests import Session

session = Session()
urls = [
    'https://toster.ru/q/372757',
    'https://toster.ru/',
]

with ThreadPoolExecutor(7) as pool:  # 7 is the number of worker threads
    # pool.map returns responses in the same order as urls
    for response in pool.map(session.get, urls):
        do_something_with_response(response)  # your parsing goes here
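A fixed pool like this reuses the same handful of threads for all the URLs, so you don't pay the thread-creation cost for every page, and the pool size caps how many requests are in flight at once.

As for the last part of the question: threads give you concurrent but still blocking requests; truly asynchronous requests run on an event loop that switches between tasks while each one waits on the network. Below is a minimal sketch of the same downloads done asynchronously with aiohttp (a third-party library); the URLs and do_something_with_page are placeholders, not code from the question.

import asyncio
import aiohttp

urls = [
    'https://toster.ru/q/372757',
    'https://toster.ru/',
]

async def fetch(session, url):
    # one non-blocking request; the event loop runs other tasks while this waits
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        # start all requests at once and wait for them all to finish
        pages = await asyncio.gather(*(fetch(session, url) for url in urls))
        for page in pages:
            do_something_with_page(page)  # placeholder for your parsing

asyncio.run(main())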