Python
zlodiak, 2020-01-27 22:26:04

How to use threads non-sequentially?

I wrote a small parser using the threading module. The problem is that in my case multithreading does not reduce the script's running time.

As you can see, after starting a thread the script waits for it to finish and only then launches the next one. Please tell me how to fix this.
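The serialization is easy to measure with a minimal timing sketch (a stand-in `work` function that sleeps 0.2 s replaces the real request):

```python
import threading
import time

def work():
    time.sleep(0.2)  # stand-in for a network request

# join() inside the loop: each thread finishes before the next one starts
start = time.perf_counter()
for _ in range(5):
    t = threading.Thread(target=work)
    t.start()
    t.join()
sequential = time.perf_counter() - start

# start all threads first, join afterwards: the sleeps overlap
start = time.perf_counter()
threads = [threading.Thread(target=work) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
parallel = time.perf_counter() - start

print(f'{sequential:.1f}s vs {parallel:.1f}s')  # roughly 1.0s vs 0.2s
```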

import requests
from bs4 import BeautifulSoup
import threading


personal_pages_paths = []
domain = 'https://vk.com'
search_host = 'https://vk.com/people/'
lastnames = [
    'Иванов',
    'Петров',
    'Сидоров',
    'Козлов',
    'Смирнов',
    'Михайлов',
    'Соколов',
    'Кузнецов',
    'Попов',
    'Лебедев',
    'Волков',
    'Морозов',
    'Новиков',
]


def get_personal_page_paths(html_text):
    paths = []
    soup = BeautifulSoup(html_text, 'lxml')
    link_obj = soup.find('div', {'class': 'results'}).find_all('a', {'class': 'search_item'})

    for path in link_obj:
        paths.append(path['href'])

    return paths


def recieve_page_html(lastname_page):
    with requests.Session() as session:
        html = session.get(lastname_page)
        lastname_paths = get_personal_page_paths(html.text)
        personal_pages_paths.extend(lastname_paths)


def main():
    for lastname in lastnames:
        lastname_page = search_host + lastname
        lastname_paths = []
        paths = threading.Thread(target=recieve_page_html, args=(lastname_page,))
        paths.start()
        paths.join()

    print('PATHS:', personal_pages_paths)
    print('\n LENGTH: ', len(personal_pages_paths))


if __name__ == "__main__":
    main()

1 answer
Vladimir, 2020-01-27
@zlodiak

You need to start all the threads first, and only then join them in a separate loop. For this, the threads can be collected in a list:

workers = []
for lastname in lastnames:
    lastname_page = search_host + lastname
    paths = threading.Thread(target=recieve_page_html, args=(lastname_page,))
    paths.start()
    workers.append(paths)

for w in workers:
    w.join()
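The same start-all-then-join pattern can be written more concisely with `concurrent.futures.ThreadPoolExecutor`, which also lets each worker return its paths instead of extending a shared global list. A sketch with a stand-in `fetch_paths` function (the real one would do the request and parsing):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_paths(lastname_page):
    # Stand-in for recieve_page_html: returns the found paths
    # instead of appending them to a shared list.
    return [lastname_page + '/1', lastname_page + '/2']

lastname_pages = ['https://vk.com/people/A', 'https://vk.com/people/B']

with ThreadPoolExecutor(max_workers=8) as pool:
    # map() submits all tasks at once and yields results in submission order
    results = list(pool.map(fetch_paths, lastname_pages))

# flatten the per-page lists into one list of paths
personal_pages_paths = [p for paths in results for p in paths]
print(len(personal_pages_paths))  # 4
```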
