stayHARD, 2015-10-20 16:22:59
Python

Multithreaded page processing using Python3+Grab. How?

Hello.
I needed to write a fairly simple site handler (not a parser!).
The most important requirements are multithreading and speed.
Here is the code I have so far:

from queue import Queue
from threading import Thread
import time
from grab import Grab


def submit_form(i, q):
    while True:
        link = q.get()
        g = Grab()
        g.go(link)
        # Some actions with page 
        q.task_done()

start_time = time.time()
num_threads = 5
queue = Queue()

for i in range(num_threads):
    worker = Thread(target=submit_form, args=(i, queue))
    worker.daemon = True  # daemon threads won't block program exit
    worker.start()

q = [
    "link1",
    # ... ~100 links ...
    "link100"
]

for item in q:
    queue.put(item)

queue.join()
print("--- %s seconds ---" % (time.time() - start_time))

The list q holds ~100 links that need to be processed in parallel, independently of each other.
With 5 threads this miracle finishes in ~50 seconds (seems pretty good, right?).
But when I set 30 threads (and I will need more, since over time there will be many times more links), I get:
grab.error.GrabConnectionError: [Errno 7] Failed to connect to site_link port 80: Connection refused

What could be the reason for this and how can the script processing be improved further?
Thanks for the advice :)
UPD:
After reading up a bit on "connection refused" with Python requests, I concluded that I cannot open more than one connection per second. Is that so?


3 answer(s)
sim3x, 2015-10-20
@sim3x

Forget about Grab.
Either use Python 2 with Scrapy, or Python 3 with its goodies, or just run synchronous scripts in parallel with GNU parallel:

cat file_with_links.txt | \
     parallel -j number_of_threads myscript.py --param1={}
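With this approach, myscript.py only ever has to handle one link per invocation; parallel handles the concurrency. A minimal sketch of such a worker (the script name, the --param1 flag, and the process() helper follow the command line above; the actual fetching with Grab is stubbed out):

```python
import sys

# Hypothetical single-link worker for the GNU parallel approach: each
# process receives one URL as --param1=<link> and handles just that one.
# process() is a stand-in for the real page handling done with Grab.

def process(link):
    # real version: g = Grab(); g.go(link); ...actions with the page...
    return "processed %s" % link

def main(argv):
    for arg in argv:
        if arg.startswith("--param1="):
            return process(arg.split("=", 1)[1])
    return None

if __name__ == "__main__":
    print(main(sys.argv[1:]))
```

Since each link runs in its own OS process, a crash or hang on one page cannot take the others down.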

Andrey K, 2015-10-20
@mututunus

Use aiohttp.
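The point of aiohttp is one event loop with many requests in flight, capped explicitly, instead of 30 OS threads. A minimal sketch of that pattern follows; the network call is stubbed with asyncio.sleep so the structure is visible and runnable on its own. In real use you would open an aiohttp.ClientSession and await session.get(link) inside fetch() (all names below are illustrative, not the asker's code):

```python
import asyncio

# Bounded-concurrency pattern that aiohttp enables: the semaphore caps
# how many requests are "on the wire" at once, which also addresses the
# Connection refused problem from the question.

async def fetch(link, sem):
    async with sem:               # at most `limit` requests at a time
        await asyncio.sleep(0)    # stand-in for the awaited HTTP call
        return link

async def main(links, limit=10):
    sem = asyncio.Semaphore(limit)
    # gather preserves the order of the input links
    return await asyncio.gather(*(fetch(link, sem) for link in links))

links = ["link%d" % i for i in range(1, 101)]
results = asyncio.run(main(links))
```

aiohttp can also enforce the cap for you via TCPConnector's connection limit, so the semaphore becomes optional in a real client.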

GetData, 2015-10-23
@GetData

> When I set 30 threads (I need more, because over time there will be many times more links) I get
Most likely the site's backend cannot handle more than 30 simultaneous connections, or the firewall / web server is configured to limit the number of connections per client.
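If that is the cause, the threaded version can keep its 30 workers but gate the network call itself with a semaphore, so the server never sees more than a chosen number of open sockets. A hypothetical sketch, with fetch() standing in for Grab's g.go(link) (MAX_CONNECTIONS and the variable names are assumptions for illustration):

```python
from queue import Queue
from threading import Semaphore, Thread

# Cap simultaneous connections independently of the thread count:
# 30 workers pull links, but only MAX_CONNECTIONS of them may be
# inside the fetch at any moment.

MAX_CONNECTIONS = 10
conn_slots = Semaphore(MAX_CONNECTIONS)

def fetch(link):
    return link  # replace with: g = Grab(); g.go(link)

def worker(q, results):
    while True:
        link = q.get()
        with conn_slots:          # blocks while 10 fetches are running
            results.append(fetch(link))
        q.task_done()

q = Queue()
results = []
for _ in range(30):
    Thread(target=worker, args=(q, results), daemon=True).start()

for link in ["link%d" % i for i in range(1, 101)]:
    q.put(link)
q.join()
```

Tuning MAX_CONNECTIONS down until the Connection refused errors stop would also confirm whether a server-side limit is really the culprit.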
