Ivan Yakushenko, 2019-06-20 19:56:03
Python

How to check the existence of a url in multiple threads?

Here is the simple code I use to check whether a page exists:

from bs4 import BeautifulSoup

# get_html() and url are defined elsewhere in my script
def check_url():
    for page in range(0, 239999):
        soup = BeautifulSoup(get_html(url + str(page)), 'html.parser')
        # the site marks missing pages with this <h3> element
        if soup.find('h3', class_='description_404_A hide'):
            print('Page not exists: {}'.format(url + str(page)))
        else:
            print('Page found: {}'.format(url + str(page)))
            with open('pages.txt', 'a') as file:
                file.write(url + str(page) + '\n')

Naturally, checking 239,999 pages one by one takes quite a long time.
Alternatively, I could just start several processes with multiprocessing, each checking its own range of pages, but I don't think that's the Pythonic way.
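A thread pool avoids splitting the range by hand: submit every page number and let the pool schedule them. A minimal sketch, assuming a `fetch(page)` callable standing in for `get_html(url + str(page))` from the snippet above, and the same 404-marker check:

```python
from concurrent.futures import ThreadPoolExecutor

def page_exists(fetch, page):
    # fetch(page) is a stand-in for get_html(url + str(page))
    html = fetch(page)
    # the original code looks for an <h3 class="description_404_A hide"> marker
    return 'description_404' not in html

def find_pages(fetch, pages, workers=32):
    # map() runs page_exists concurrently and preserves input order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flags = pool.map(lambda p: page_exists(fetch, p), pages)
        return [p for p, ok in zip(pages, flags) if ok]
```

Because the work is network-bound, threads help here despite the GIL; `workers` is a guess to tune against the server's rate limits.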

3 answers
Ivan Shumov, 2019-06-20
@kshnkvn

It has been said many times already that the way to scale stateless handlers horizontally is queues) RabbitMQ, for example.
And if you don't want to bother with that, take AWS SQS + AWS Lambda and have it process all of this; I think it would finish in a couple of minutes) even the free tier might be enough
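The queue idea can be tried locally with the stdlib before reaching for RabbitMQ: a producer puts page numbers on a queue, and worker threads pull and check them. A rough sketch, where `fetch` is a hypothetical stand-in for the real HTTP request:

```python
import queue
import threading

def run_workers(fetch, pages, n_workers=8):
    # producer/consumer: page numbers go into a queue, workers pull and check them
    q = queue.Queue()
    found = []
    lock = threading.Lock()

    def worker():
        while True:
            page = q.get()
            if page is None:          # sentinel: no more work for this worker
                q.task_done()
                break
            if 'description_404' not in fetch(page):
                with lock:            # found is shared, so guard appends
                    found.append(page)
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for p in pages:
        q.put(p)
    for _ in threads:
        q.put(None)                   # one sentinel per worker
    q.join()                          # wait until every item is processed
    return sorted(found)
```

Swapping `queue.Queue` for a RabbitMQ (or SQS) client is what lets this scale past one machine, as the answer suggests.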

Astrohas, 2019-06-20
@Astrohas

asyncio
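A minimal illustration of this one-word answer, with an async `fetch(page)` coroutine standing in for a real async HTTP client (such as aiohttp) and a semaphore to cap how many requests are in flight:

```python
import asyncio

async def page_exists(fetch, page, sem):
    # sem caps concurrent requests so the server isn't flooded
    async with sem:
        html = await fetch(page)
        return 'description_404' not in html

async def find_pages(fetch, pages, limit=100):
    pages = list(pages)
    sem = asyncio.Semaphore(limit)
    # gather() runs all checks concurrently and preserves order
    flags = await asyncio.gather(*(page_exists(fetch, p, sem) for p in pages))
    return [p for p, ok in zip(pages, flags) if ok]
```

For 240k I/O-bound requests, a single event loop like this is typically the lightest option in modern Python.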

tsarevfs, 2019-06-20
@tsarevfs

It's simple: https://stackoverflow.com/a/3332884/1762922
If you want a longer write-up in Russian: toly.github.io/blog/2014/02/13/parallelism-in-one-line
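The linked answer boils down to `multiprocessing.dummy`, whose `Pool` has the `multiprocessing` API but runs threads. A small sketch with a hypothetical `fetch(page)` callable in place of the real request:

```python
from multiprocessing.dummy import Pool  # same API as multiprocessing, but threads

def find_pages(fetch, pages, workers=32):
    def exists(page):
        # same marker check as the question's snippet
        return 'description_404' not in fetch(page)
    with Pool(workers) as pool:
        flags = pool.map(exists, pages)  # the "parallelism in one line"
    return [p for p, ok in zip(pages, flags) if ok]
```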
