How to go through a list of URLs as quickly as possible, get the status code (200 OK or ...), and write a result of the form "URL - 200 OK" to a file?
There is a task: go through a list of URLs as quickly as possible, get the status code (200 OK or ...), and write a result of the form "URL - 200 OK" to a file.
The script is already written in Python 3.3 using urllib, but it takes an unacceptably long time.
Actually, the question is: is it possible to do this much faster in Python?
Thanks in advance.
0. Do not close the connection after each request (depends on how it is implemented now and what you are polling)
1. Use HEAD requests
2. Use threads (parallelize)
3. Use gevent (Python 3 support there is complicated). Nevertheless, it plays very nicely with requests, and implementing this on top of it is not a problem at all
4. Use the excellent requests library (a sketch combining these points follows this list)
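A minimal sketch combining points 1, 2 and 4, assuming Python 3 with requests installed; the file names urls.txt and results.txt and the pool size of 20 are my own placeholders, not something from the question:

# Check URLs with HEAD requests in a thread pool and write "URL - status" lines to a file.
from concurrent.futures import ThreadPoolExecutor
import requests

def check(url):
    try:
        # HEAD asks the server only for the status line and headers, not the body
        resp = requests.head(url, timeout=10, allow_redirects=True)
        return url, resp.status_code
    except requests.RequestException as exc:
        return url, 'error: %s' % exc

def main():
    with open('urls.txt') as f:
        urls = [line.strip() for line in f if line.strip()]
    # Almost all the time is spent waiting for the network, so ordinary threads
    # already give a large speedup over a sequential loop despite the GIL
    with ThreadPoolExecutor(max_workers=20) as pool, open('results.txt', 'w') as out:
        for url, status in pool.map(check, urls):
            out.write('%s - %s\n' % (url, status))

if __name__ == '__main__':
    main()

To also follow point 0, one could give each thread its own requests.Session so that keep-alive connections get reused.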
HTTP HEAD Request
Obviously, you just need to send a HEAD request to the resource. There are quite a few examples of this.
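For instance, with nothing but the standard library (Python 3.3+, where urllib.request.Request accepts a method argument; the URL below is just a placeholder):

# A single HEAD request with the standard library
import urllib.request

req = urllib.request.Request('http://example.com/', method='HEAD')
resp = urllib.request.urlopen(req, timeout=10)  # note: raises urllib.error.HTTPError for 4xx/5xx
print(resp.status)  # e.g. 200
resp.close()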
Most of the time is spent waiting for a network response. You can, of course, parallelize with threads, but the complexity of the code will increase.
I recommend doing it with Twisted; it is designed for exactly this kind of task. It will take no more than a screenful of lines, though it only works with Python 2.x.
The fastest way is through an asynchronous framework, without threads, let alone processes.
The main asynchronous framework for Python is now Twisted.
Here are links to documentation and examples:
stackoverflow.com/questions/2147148/twisted-http-client
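A rough sketch of that approach with twisted.web.client.Agent; the URL list is a placeholder and error handling is reduced to printing the failure (recent Twisted releases expose the same Agent API on Python 3 as well):

# Issue HEAD requests concurrently on one event loop and print "URL - code" lines.
from twisted.internet import defer, reactor
from twisted.web.client import Agent

urls = ['http://example.com/', 'http://example.org/']  # placeholder list
agent = Agent(reactor)

def check(url):
    d = agent.request(b'HEAD', url.encode('ascii'))

    def on_response(response):
        print('%s - %s' % (url, response.code))

    def on_error(failure):
        print('%s - error: %s' % (url, failure.getErrorMessage()))

    d.addCallbacks(on_response, on_error)
    return d

# Fire all requests, then stop the reactor once every Deferred has finished.
done = defer.DeferredList([check(u) for u in urls])
done.addCallback(lambda _: reactor.stop())
reactor.run()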
If it is mandatory to use Python, then I would take gevent and urllib2. Instead of using HEAD (which many servers don't understand), I suggest you simply don't download the response body.
There is an example here https://github.com/surfly/gevent/blob/master/examp... but it is very simple - in practice it is better to have a pool of a limited number of greenlets.
Well, the line in that example that reads the response body should be replaced with
resp = urllib2.urlopen(url)
print resp.getcode()
resp.close()
so that only the headers are downloaded, without the body. If the script does not have to be Python only, you can try calling curl in several threads :)
In general, this has already been covered on SO without any Python at all: stackoverflow.com/questions/6136022/script-to-get-...
To parallelize urllib2, it is very convenient to use gevent's monkey patching; a sketch is below.
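A sketch of that combination, written in Python 2 style to match the urllib2 snippet above; the pool size and the file names urls.txt / results.txt are my own assumptions:

# Monkey-patch the standard library so urllib2's sockets become cooperative,
# then poll the URLs with a pool of a limited number of greenlets.
from gevent import monkey
monkey.patch_all()  # must happen before urllib2 is used

import urllib2
from gevent.pool import Pool

def check(url):
    try:
        resp = urllib2.urlopen(url, timeout=10)
        code = resp.getcode()
        resp.close()  # close without read(), so the body is not downloaded
        return url, code
    except urllib2.HTTPError as exc:
        return url, exc.code  # urlopen raises HTTPError for 4xx/5xx answers
    except Exception as exc:
        return url, 'error: %s' % exc

urls = [line.strip() for line in open('urls.txt') if line.strip()]

pool = Pool(20)  # no more than 20 greenlets in flight at once
with open('results.txt', 'w') as out:
    for url, status in pool.imap_unordered(check, urls):
        out.write('%s - %s\n' % (url, status))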
Thanks to all who responded.
It is a pity that some of the solutions are not suitable for Python 3.x.