How to go through a list of URLs as quickly as possible, get the status code (200 OK or ...), and write a result of the form "URL - 200 OK" to a file?
There is a task: go through a list of URLs as quickly as possible, get the status code (200 OK or ...), and write a result of the form "URL - 200 OK" to a file.
The script is already written in Python 3.3 using urllib, but it takes an unacceptably long time.
Actually, the question is: is it possible to do this much faster in Python?
Thanks in advance.
0. Do not close the connection after each request (depends on how it is implemented now and what you are polling)
1. Use HEAD requests
2. Use threads (parallelize)
3. Use gevent (Python 3 support there is complicated). Nevertheless, it plays very nicely with requests, and implementing this on top of it is not a problem at all
4. Use the excellent requests library (a sketch combining these points follows this list)
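A minimal sketch combining points 1, 2 and 4, assuming Python 3 with requests installed; the file names urls.txt and results.txt and the pool size of 20 are my own placeholders, not something from the question:

# Check URLs with HEAD requests in a thread pool and write "URL - status" lines to a file.
from concurrent.futures import ThreadPoolExecutor
import requests

def check(url):
    try:
        # HEAD asks the server only for the status line and headers, not the body
        resp = requests.head(url, timeout=10, allow_redirects=True)
        return url, resp.status_code
    except requests.RequestException as exc:
        return url, 'error: %s' % exc

def main():
    with open('urls.txt') as f:
        urls = [line.strip() for line in f if line.strip()]
    # Almost all the time is spent waiting for the network, so ordinary threads
    # already give a large speedup over a sequential loop despite the GIL
    with ThreadPoolExecutor(max_workers=20) as pool, open('results.txt', 'w') as out:
        for url, status in pool.map(check, urls):
            out.write('%s - %s\n' % (url, status))

if __name__ == '__main__':
    main()

To also follow point 0, one could give each thread its own requests.Session so that keep-alive connections get reused.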
HTTP HEAD Request
Obviously, you just need to send a HEAD request to the resource. There are quite a few examples of this.
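For instance, with nothing but the standard library (Python 3.3+, where urllib.request.Request accepts a method argument; the URL below is just a placeholder):

# A single HEAD request with the standard library
import urllib.request

req = urllib.request.Request('http://example.com/', method='HEAD')
resp = urllib.request.urlopen(req, timeout=10)  # note: raises urllib.error.HTTPError for 4xx/5xx
print(resp.status)  # e.g. 200
resp.close()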
Most of the time is spent waiting for a network response. You can, of course, parallelize with threads, but the complexity of the code will increase.
I recommend doing it with Twisted; it is designed for exactly this kind of task. It will take no more than a screenful of lines, though it only works with Python 2.x.
The fastest way is through an asynchronous framework, without threads, let alone processes.
The main asynchronous framework for Python is now Twisted.
Here are links to documentation and examples:
stackoverflow.com/questions/2147148/twisted-http-client
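A rough sketch of that approach with twisted.web.client.Agent; the URL list is a placeholder and error handling is reduced to printing the failure (recent Twisted releases expose the same Agent API on Python 3 as well):

# Issue HEAD requests concurrently on one event loop and print "URL - code" lines.
from twisted.internet import defer, reactor
from twisted.web.client import Agent

urls = ['http://example.com/', 'http://example.org/']  # placeholder list
agent = Agent(reactor)

def check(url):
    d = agent.request(b'HEAD', url.encode('ascii'))

    def on_response(response):
        print('%s - %s' % (url, response.code))

    def on_error(failure):
        print('%s - error: %s' % (url, failure.getErrorMessage()))

    d.addCallbacks(on_response, on_error)
    return d

# Fire all requests, then stop the reactor once every Deferred has finished.
done = defer.DeferredList([check(u) for u in urls])
done.addCallback(lambda _: reactor.stop())
reactor.run()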
If it is mandatory to use Python, then I would take gevent and urllib2. Instead of using HEAD (which many servers don't understand), I suggest you simply don't download the response body.
There is an example here https://github.com/surfly/gevent/blob/master/examp... but it is very simple - in practice it is better to have a pool of a limited number of greenlets.
Well, the line in that example that reads the response body should be replaced with
resp = urllib2.urlopen(url)
print resp.getcode()
resp.close()
so that only the headers are downloaded, without the body. If the script does not have to be Python only, you can try calling curl in several threads :)
In general, this has already been covered on SO without any Python at all: stackoverflow.com/questions/6136022/script-to-get-...
To parallelize urllib2, it is very convenient to use gevent's monkey patching; a sketch is below.
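A sketch of that combination, written in Python 2 style to match the urllib2 snippet above; the pool size and the file names urls.txt / results.txt are my own assumptions:

# Monkey-patch the standard library so urllib2's sockets become cooperative,
# then poll the URLs with a pool of a limited number of greenlets.
from gevent import monkey
monkey.patch_all()  # must happen before urllib2 is used

import urllib2
from gevent.pool import Pool

def check(url):
    try:
        resp = urllib2.urlopen(url, timeout=10)
        code = resp.getcode()
        resp.close()  # close without read(), so the body is not downloaded
        return url, code
    except urllib2.HTTPError as exc:
        return url, exc.code  # urlopen raises HTTPError for 4xx/5xx answers
    except Exception as exc:
        return url, 'error: %s' % exc

urls = [line.strip() for line in open('urls.txt') if line.strip()]

pool = Pool(20)  # no more than 20 greenlets in flight at once
with open('results.txt', 'w') as out:
    for url, status in pool.imap_unordered(check, urls):
        out.write('%s - %s\n' % (url, status))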
Thanks to all who responded.
It is a pity that some of the solutions are not suitable for Python 3.x.