Python
Alexander Kulaga, 2018-10-17 20:58:35

How to speed up a large number of get requests?

A table with two columns, id and url, 5 million records.
I need to hit every url with a GET request and collect the IDs of the urls whose status is not 200.
Doing this sequentially takes a very long time (especially since some of them return 504).
How can this be sped up?


6 answers
asd111, 2018-10-17
@AlexKulaga

The main work happens in the pool.map line:

from urllib.request import urlopen  # Python 3; the original snippet used Python 2's urllib2
from multiprocessing.dummy import Pool as ThreadPool

urls = [
  'http://www.python.org',
  'http://www.python.org/about/',
  'http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html',
  'http://www.python.org/doc/',
  'http://www.python.org/download/',
  'http://www.python.org/getit/',
  'http://www.python.org/community/',
  'https://wiki.python.org/moin/',
  'http://planet.python.org/',
  'https://wiki.python.org/moin/LocalUserGroups',
  'http://www.python.org/psf/',
  'http://docs.python.org/devguide/',
  'http://www.python.org/community/awards/'
  # etc..
  ]

# Make the pool of worker threads
pool = ThreadPool(4)
# Open the urls in their own threads and collect the results
results = pool.map(urlopen, urls)
# Close the pool and wait for the work to finish
pool.close()
pool.join()
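
For the task in the question (collecting the ids whose status is not 200), the same pool pattern can be adapted. A minimal sketch, assuming the (id, url) pairs from the table are loaded into rows, and using requests instead of urlopen so the status code is easy to read:

import requests
from multiprocessing.dummy import Pool as ThreadPool

def check(row):
    # Return the id when the url does not answer 200, else None
    row_id, url = row
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        return row_id  # connection errors and timeouts count as failures
    return row_id if resp.status_code != 200 else None

rows = [(1, 'http://www.python.org'), (2, 'http://www.python.org/doc/')]  # sample data

pool = ThreadPool(50)  # for I/O-bound waiting, far more than 4 workers pays off
bad_ids = [r for r in pool.map(check, rows) if r is not None]
pool.close()
pool.join()
print(bad_ids)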

Vadim Shatalov, 2018-10-17
@netpastor

If several requests go to the same server, it is worth using a session so you don't open a new connection each time; requests, for example, has requests.Session.
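
A minimal sketch of the difference:

import requests

# One Session reuses the underlying TCP connection (and keeps cookies)
# across requests to the same host instead of reconnecting each time.
session = requests.Session()
for url in ('http://www.python.org/', 'http://www.python.org/doc/'):
    resp = session.get(url, timeout=10)
    print(url, resp.status_code)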

Oleg, 2018-10-17
@OlegPyatakov

The main loss of time is waiting for responses from the remote servers.
Options for speeding things up:

  1. Work in multiple threads
  2. Work in multiple processes
  3. Work through asynchronous requests (a sketch follows below)
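
A minimal sketch of option 3, assuming aiohttp (the answer does not name a library) and the (id, url) pairs loaded into rows:

import asyncio
import aiohttp

async def check(session, sem, row_id, url):
    async with sem:  # cap the number of concurrent connections
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                return row_id if resp.status != 200 else None
        except (aiohttp.ClientError, asyncio.TimeoutError):
            return row_id  # network errors and timeouts count as failures

async def main(rows):
    sem = asyncio.Semaphore(100)
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(check(session, sem, i, u) for i, u in rows))
    return [r for r in results if r is not None]

rows = [(1, 'http://www.python.org/'), (2, 'http://www.python.org/doc/')]  # sample data
print(asyncio.run(main(rows)))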

sim3x, 2018-10-18
@sim3x

Nothing will go faster than the servers themselves can respond.
Use scrapy: it has all the required functionality.
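
A minimal sketch of such a spider for the non-200 task; the load_rows helper for reading the (id, url) pairs is hypothetical:

import scrapy

class StatusSpider(scrapy.Spider):
    name = "status_check"

    def start_requests(self):
        for row_id, url in load_rows():  # hypothetical loader for the (id, url) table
            yield scrapy.Request(
                url,
                callback=self.parse,
                cb_kwargs={"row_id": row_id},
                meta={"handle_httpstatus_all": True},  # let non-200 responses reach parse
            )

    def parse(self, response, row_id):
        if response.status != 200:
            yield {"id": row_id, "status": response.status}

Run it with scrapy runspider spider.py -o bad.jl; concurrency, retries and throttling are handled through Scrapy settings. Connection-level failures (DNS errors, timeouts) would additionally need an errback.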

Ivan Shumov, 2018-10-17
@inoise

How about parallelizing the tasks with workers? I always keep a message broker around for this case: I just throw tasks into it and, depending on how many there are, start more or fewer workers. The one problem is that if all the urls point to one resource, or only a few, you can get banned.
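
A minimal sketch of such a worker task, assuming Celery with a Redis broker (the answer does not name a specific broker):

from celery import Celery
import requests

app = Celery("checker",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

@app.task
def check_url(row_id, url):
    # Return the id when the url does not answer 200, else None
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        return row_id  # network errors count as failures
    return row_id if resp.status_code != 200 else None

# Producer side: enqueue one task per row; scale by starting more workers.
# for row_id, url in rows:
#     check_url.delay(row_id, url)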

Dimonchik, 2018-10-17
@dimonchik2013

https://compiletoi.net/fast-scraping-in-python-wit...
