How to speed up a large number of get requests?
There is a table with two columns, id and url, and 5 million records.
I need to fetch every URL with a GET request and collect the IDs of the URLs whose status is not 200.
Doing this sequentially takes a lot of time (especially since some of them return 504).
How can this be sped up?
The main work happens on the line with pool.map:
import urllib2
from multiprocessing.dummy import Pool as ThreadPool
urls = [
'http://www.python.org',
'http://www.python.org/about/',
'http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html',
'http://www.python.org/doc/',
'http://www.python.org/download/',
'http://www.python.org/getit/',
'http://www.python.org/community/',
'https://wiki.python.org/moin/',
'http://planet.python.org/',
'https://wiki.python.org/moin/LocalUserGroups',
'http://www.python.org/psf/',
'http://docs.python.org/devguide/',
'http://www.python.org/community/awards/'
# etc..
]
# Make the Pool of workers
pool = ThreadPool(4)
# Open the urls in their own threads
# and return the results
results = pool.map(urllib2.urlopen, urls)
# Close the pool and wait for the work to finish
pool.close()
pool.join()
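The snippet above uses urllib2, which exists only in Python 2, and urlopen raises an exception for 4xx/5xx responses instead of returning them, so pool.map would stop at the first 504. A minimal Python 3 sketch of the same thread-pool idea, adapted to return the IDs the question asks for, might look like this. The names get_status and find_failed_ids, the (id, url) row layout, the 10-second timeout and the 50 workers are assumptions for illustration, not part of the original answer.

```python
import http.client
import urllib.error
import urllib.request
from multiprocessing.dummy import Pool as ThreadPool


def get_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises for 4xx/5xx responses; the code is on the exception.
        return error.code
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # DNS failure, refused connection, timeout, malformed URL, and so on.
        return None


def find_failed_ids(rows, fetch=get_status, workers=50):
    """rows is an iterable of (id, url); return the ids whose status is not 200."""
    rows = list(rows)
    pool = ThreadPool(workers)
    try:
        # pool.map keeps the input order, so statuses line up with rows.
        statuses = pool.map(fetch, [url for _, url in rows])
    finally:
        pool.close()
        pool.join()
    return [row_id for (row_id, _), status in zip(rows, statuses) if status != 200]
```

Note that urlopen follows redirects, so a URL that redirects to a working page is reported as 200 here. With 5 million rows it is probably better to read the table in batches and call find_failed_ids once per batch than to build one huge list.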
If several requests go to the same server, it is worth using a session so that a new connection is not opened each time. The requests library, for example, has requests.Session.
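One way to apply this is to group the rows by host and push each group through a single session. The sketch below assumes the third-party requests library is installed. The helper names group_by_host, check_host and check_all are hypothetical, and the (id, url) row layout and the 10-second timeout are assumptions.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(rows):
    """Group (id, url) rows by the host part of the URL."""
    groups = defaultdict(list)
    for row_id, url in rows:
        groups[urlsplit(url).netloc].append((row_id, url))
    return dict(groups)


def check_host(session, rows, timeout=10):
    """Request each URL through one session; return ids whose status is not 200."""
    failed = []
    for row_id, url in rows:
        try:
            status = session.get(url, timeout=timeout).status_code
        except OSError:
            # requests.RequestException subclasses OSError (IOError), so this
            # covers connection errors and timeouts. Assumption: they count as "not 200".
            status = None
        if status != 200:
            failed.append(row_id)
    return failed


def check_all(rows):
    """Check every row, reusing one requests.Session per host."""
    import requests  # third-party; imported here so the helpers above work without it

    failed = []
    for host_rows in group_by_host(rows).values():
        with requests.Session() as session:
            failed.extend(check_host(session, host_rows))
    return failed
```

check_all processes hosts one after another. It can be combined with a thread pool by mapping check_host over the groups, one session per thread.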
Most of the time is lost waiting for responses from the remote servers.
Options for speeding that up:
Faster than the server can respond: none.
Use scrapy. It has all the required functionality.
How about parallelizing the task with workers? I always use a message broker for this: I just push the tasks into it and run more or fewer workers depending on the number of tasks. The one problem is that if all the URLs are on a single host, or on a small number of hosts, you can get banned.
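The worker idea can be sketched inside a single process, with the standard library's queue.Queue standing in for a real message broker. To reduce the ban risk mentioned above, the sketch also enforces a minimum delay between requests to the same host. HostThrottle, run_workers, the fetch callable (assumed to return a status code or None without raising) and the default values are all hypothetical and serve only as illustration.

```python
import queue
import threading
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = {}

    def wait(self, url):
        host = urlsplit(url).netloc
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_allowed.get(host, now))
            # Reserve the next slot for this host before sleeping.
            self._next_allowed[host] = start + self.min_interval
        time.sleep(start - now)


def run_workers(rows, fetch, num_workers=8, min_interval=1.0):
    """Check (id, url) rows with worker threads; return sorted ids whose status is not 200."""
    tasks = queue.Queue()
    failed = []
    failed_lock = threading.Lock()
    throttle = HostThrottle(min_interval)

    def worker():
        while True:
            item = tasks.get()
            if item is None:  # sentinel: no more work for this worker
                break
            row_id, url = item
            throttle.wait(url)
            if fetch(url) != 200:
                with failed_lock:
                    failed.append(row_id)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for row in rows:
        tasks.put(row)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return sorted(failed)
```

With a real broker, the producer loop and the worker function would run in separate processes or on separate machines, and the number of workers could be changed while the queue is being drained.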