How to implement multithreading correctly?

Python

stayHARD, 2015-08-27 14:33:04

Hello.
I have a small site parser and need to make it run faster.
Its structure:

# imports
import ....

# database connection (the URLs to crawl are read from here)
conn = sqlite3.connect('db.sqlite3')
curs = conn.cursor()
select = curs.execute("SELECT url from test;")

# main function, which fetches the data from the site
def scrape(link):
    ....
    # write to the database using the connection above
    return "Success"

# call scrape for every row fetched from the database
for link in select.fetchall():
    scrape(link)

I tried to add multithreading with the threading module, unfortunately without success, using this code:

thread_list = []

t = threading.Thread(target=scrape, args=(link,))
thread_list.append(t)

for thread in thread_list:
    thread.start()

for thread in thread_list:
    thread.join()

4 answer(s)

Oleg, 2015-08-27
@Bahusss

Here is my version, using a task queue:

import sqlite3
from Queue import Queue  # Python 2; in Python 3 the module is named queue
from threading import Thread

def scrape(link):
    conn = sqlite3.connect('db.sqlite3')
    curs = conn.cursor()
    # insert link, close connection

class Worker(Thread):

    def __init__(self, tasks):
        super(Worker, self).__init__()
        self.tasks = tasks
        self.daemon = True

    def run(self):
        while True:
            link = self.tasks.get()
            try:
                scrape(link)
            finally:
                self.tasks.task_done()


if __name__ == '__main__':
    # maximum number of queued tasks
    capacity = 0  # 0 means unbounded
    queue = Queue(capacity)

    # number of worker threads running at the same time
    workers = 3
    for _ in range(workers):
        Worker(queue).start()

    for link in select.fetchall():
        queue.put(link)

    queue.join()
    print 'Done'
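
For readers on Python 3, roughly the same worker-pool idea can be sketched with concurrent.futures from the standard library. This is only a sketch under assumptions, not the code from the question: fetch_page is a hypothetical stand-in for the network part of scrape, and the body column is invented for illustration (the question only shows a url column in the test table). The pool threads only download; all SQLite access stays in the calling thread.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def fetch_page(url):
    # Hypothetical placeholder: the real download/parsing code goes here.
    return "content of " + url


def scrape_all(db_path, fetch=fetch_page, workers=3):
    conn = sqlite3.connect(db_path)
    try:
        # Each row is a 1-tuple such as ('http://...',), so unpack it.
        urls = [row[0] for row in conn.execute("SELECT url FROM test")]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # map() runs fetch in the pool and yields results in input order.
            results = list(pool.map(fetch, urls))
        # Assumed column "body"; writes happen only in this thread.
        conn.executemany(
            "UPDATE test SET body = ? WHERE url = ?",
            zip(results, urls),
        )
        conn.commit()
    finally:
        conn.close()
    return len(urls)
```

Calling scrape_all('db.sqlite3') would process every URL with three threads; raising workers mostly helps while the time is spent waiting on the network.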

lega, 2015-08-27
@lega

For a site parser it is better to use an asynchronous framework; there is an example at www.py-my.ru/post/4f278211bbddbd0322000000

angru, 2015-08-27
@angru

As advised above, use an asynchronous framework (I liked gevent at one time). You will have to change the code either way, whether you go with threads or asynchrony, and in my opinion asynchrony suits this case better because the work is mostly I/O. Besides, Python bytecode runs on only one core at a time (google the GIL), and SQLite does not cope well with concurrent access from threads, so I'm not sure threads will give you any advantage.
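
To make the asynchronous approach concrete, here is a minimal sketch that swaps gevent for asyncio from the Python 3 standard library. The names fetch_page, crawl and crawl_sync are made up for this example, and fetch_page is only a placeholder; a real one would have to use a non-blocking HTTP client. A semaphore limits how many requests are in flight at once.

```python
import asyncio


async def fetch_page(url):
    # Hypothetical placeholder for a real non-blocking HTTP request.
    await asyncio.sleep(0)
    return "content of " + url


async def crawl(urls, fetch=fetch_page, limit=10):
    # The semaphore caps how many fetches are in flight at once.
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            return url, await fetch(url)

    # gather() returns results in the same order as its arguments.
    return await asyncio.gather(*(bounded(u) for u in urls))


def crawl_sync(urls, fetch=fetch_page, limit=10):
    # Convenience wrapper for calling from ordinary synchronous code.
    return asyncio.run(crawl(urls, fetch, limit))
```

Everything runs in a single thread, so the (url, content) pairs returned by crawl_sync can be written to SQLite afterwards over one ordinary connection.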

sim3x, 2015-08-27
@sim3x

sqlite3 is a single-threaded, file-based DBMS. Do not use it in multi-threaded mode.
Use scrapy for parsing.
If you really want to write everything yourself, see stackoverflow.com/a/6125654
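
One common way to combine threads with SQLite, shown here only as a sketch and not necessarily what the linked answer does, is to give the database to a single writer thread and have the scraping threads pass their results to it through a queue. By default Python's sqlite3 module refuses to use a connection from a thread other than the one that created it, so the connection below is opened inside the writer. The pages table and the fetch argument are assumptions invented for the example.

```python
import queue
import sqlite3
import threading

_STOP = object()  # sentinel telling the writer thread to finish


def writer(db_path, results):
    # The connection is created and used only inside this one thread.
    conn = sqlite3.connect(db_path)
    try:
        while True:
            item = results.get()
            if item is _STOP:
                break
            # Assumed table "pages(url, body)", invented for this sketch.
            conn.execute("INSERT INTO pages (url, body) VALUES (?, ?)", item)
        conn.commit()
    finally:
        conn.close()


def run(db_path, urls, fetch, workers=3):
    results = queue.Queue()
    writer_thread = threading.Thread(target=writer, args=(db_path, results))
    writer_thread.start()

    def work(chunk):
        for url in chunk:
            results.put((url, fetch(url)))

    # Split the URLs round-robin between the scraping threads.
    threads = [
        threading.Thread(target=work, args=(urls[i::workers],))
        for i in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    results.put(_STOP)  # all scrapers are done, let the writer exit
    writer_thread.join()
```

The scrapers never touch the database, so there is nothing to lock: the queue serializes the writes.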