Python
Egorian, 2018-07-27 14:53:59

How to implement a parser?

I need a parser for a site with 11,000 pages. The parser computes some data from the text on those pages.

url="https://website/post"
def parser(): 
 for num in range(1,11000):
    #получаю страницу сайта
    BeautifulSoup=bs4.BeautifulSoup( requests.get(url+str(num)).text,"html.parser" )
    # получаю блок с текстом
    post_text=BeautifulSoup.select(".post-text")
    print( "num= %s  " % num)
    try:
     print(post_text)
     print(post_text[0].text)
     ###
     ###Тут я обрабатываю текст и сохраняю число в переменную size
     ###
    except IndexError:
        pass
print(size)

I doubt this will be fast, so I think multithreading is needed here, but I haven't fully figured it out yet.
How can such a task be divided between threads?
Should I allocate a certain range of pages to each thread, so that the same page isn't parsed more than once?
And how do I then combine the final size value from each thread? Write it to a file, adding to the number already stored there by the other threads? Or collect the values in an array?
I don't really know multithreading yet, since I've only just started learning about it.
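
One common way to structure this is a thread pool that hands each page number to exactly one worker and collects the per-page results in the main thread, so no page is parsed twice and no file or shared variable is needed. Below is a minimal sketch along those lines; process_text is a hypothetical stand-in for the question's text-processing step, and max_workers=20 is an arbitrary choice.

from concurrent.futures import ThreadPoolExecutor

import bs4
import requests

url = "https://website/post"  # base URL from the question

def process_text(text):
    # hypothetical stand-in for the real text processing;
    # here it simply returns the text length
    return len(text)

def parse_page(num):
    # fetch and parse exactly one page; since each num is submitted
    # only once, no page is processed twice
    soup = bs4.BeautifulSoup(requests.get(url + str(num)).text, "html.parser")
    post_text = soup.select(".post-text")
    if not post_text:
        return 0
    return process_text(post_text[0].text)

def main():
    # map() distributes page numbers 1..11000 across the worker threads
    with ThreadPoolExecutor(max_workers=20) as executor:
        sizes = executor.map(parse_page, range(1, 11001))
    # aggregating in the main thread avoids shared mutable state between workers
    print(sum(sizes))

if __name__ == "__main__":
    main()

Because executor.map returns every page's result to the main thread, the workers never touch common state, which sidesteps the file-versus-array question entirely.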


1 answer
Jock Tanner, 2018-07-28
@Tanner

Use Scrapy; it handles concurrent requests for you.
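
For reference, a minimal Scrapy spider in that spirit might look like this; the URL pattern and the .post-text selector are taken from the question, and len(text) stands in for the real calculation.

import scrapy

class PostSpider(scrapy.Spider):
    name = "posts"
    # Scrapy schedules these requests concurrently on its own
    # (CONCURRENT_REQUESTS, 16 by default), so no manual threading is needed
    start_urls = ["https://website/post%d" % num for num in range(1, 11001)]

    def parse(self, response):
        text = response.css(".post-text::text").get()
        if text:
            # len(text) is a stand-in for the real size calculation
            yield {"size": len(text)}

Run it with scrapy runspider posts_spider.py -o sizes.json and sum the size values from the output file afterwards; Scrapy throttles and parallelizes the downloads itself.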
