Python
rodion-dev, 2015-04-07 04:43:13

How to increase the speed of a Python Tornado script?

Given the following Python code,
it behaves the same way with 500 asynchronous requests as with 15000
(the limit is set by the "process_max" variable),
and I need to increase the speed to the maximum.
Judging by the statistics it prints, the actual throughput is only about 10-20 concurrent requests.

import tornado
from tornado import httpclient
from tornado import gen
from functools import partial
import Queue
from tornado.httpclient import AsyncHTTPClient
import os
from time import gmtime, strftime
import json
from urlparse import urlparse

gloop = tornado.ioloop.IOLoop.instance()
qinput = Queue.Queue()
process_count = 0

process_max = 15000    # maximum number of concurrent requests

# create tmp dir if it does not exist
dirname = "tmp"
if not os.path.exists(dirname):
    os.makedirs(dirname)

# fill the queue with target URLs
f = open('100000_hostsList.lst')
for line in f:
    line = line.strip()  # drop the trailing newline, otherwise the URL is invalid
    if line:
        qinput.put("http://" + line)
f.close()


def data_process(data, url, headers):
    data = {'url': url, 'data': data, 'headers': headers}

    dirname = "tmp/" + strftime("%Y-%m-%d_%H", gmtime())
    if not os.path.exists(dirname):
        os.makedirs(dirname)

    f = open(dirname + "/" + urlparse(url).hostname, "w+")
    f.write(json.dumps(data))
    f.flush()
    f.close()

@gen.engine
def process(url):
    global process_count
    try:
        http_client = httpclient.AsyncHTTPClient()

        request = tornado.httpclient.HTTPRequest(url=str(url), connect_timeout=5.0, request_timeout=5.0, follow_redirects=True)
        response = yield tornado.gen.Task(http_client.fetch, request)

        if response.error: raise Exception(response.error)
        data_process(response.body, url, dict(response.headers))
    except Exception as e:
        print e
    process_count -= 1
    gloop.add_callback(worker)

def worker():
    global gloop, process_count, process_max
    print '# %d / %d (%d)' % (process_count, process_max, qinput.qsize())
    while process_count < process_max:
        if qinput.empty(): break
        url = qinput.get_nowait()
        process_count += 1
        gloop.add_callback(partial(process, url))
    if qinput.empty():
        if not process_count: gloop.stop()

print 'start'
gloop.add_callback(worker)
tornado.httpclient.AsyncHTTPClient.configure("tornado.curl_httpclient.CurlAsyncHTTPClient")
gloop.start()
print 'finish'


5 answer(s)
Nikita, 2015-04-07
@jkotkot

If you do not have a 500-core computer, then the solution is most likely to reduce the number of workers. If there are many more of them than cores and they are all constantly busy, switching between them costs time and resources; nobody has cancelled the overhead of context switching.
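A minimal illustration of the sizing rule this answer implies; the multiplier of 2 is an assumed starting point, not something the answer specifies:

import multiprocessing

# Cap concurrency at a small multiple of the available cores instead of
# a fixed 15000 (the factor is an assumption, to be tuned by measurement).
process_max = multiprocessing.cpu_count() * 2
print process_max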

Vitaly Pukhov, 2015-04-07
@Neuroware

You need to look at the load on the host's resources; performance probably rests on the disk operations in data_process.
As a check, you can create a RAM disk and write to it; if that gives a win, you need to somehow get rid of writing to the disk.
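For example, on a Linux host where /dev/shm is a tmpfs (RAM-backed) mount, the check is a one-line change to the script; the exact path is an assumption for illustration:

import os

# Point the output directory at a RAM-backed filesystem; if throughput
# improves noticeably, the disk writes in data_process are the bottleneck.
dirname = "/dev/shm/tmp"
if not os.path.exists(dirname):
    os.makedirs(dirname)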

Ilya, 2015-04-07
@766dt

You need to make the disk work asynchronous too, not just the HTTP requests.
Writing to disk may already be asynchronous from the script's point of view, and if so, it might be enough to remove

f.flush()
f.close()

so that control returns to the script immediately. How critical is it to wait for the write to finish?
You can also look for libraries for asynchronous file I/O, or alternatively use any database with an asynchronous driver for Python.
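A minimal sketch of one way to do this: offload the blocking write to a thread pool so the IOLoop stays free. This assumes Python 2 with the futures backport (pip install futures); the pool size is an assumption:

from concurrent.futures import ThreadPoolExecutor

# A handful of threads is enough for short writes; the size is a guess.
io_pool = ThreadPoolExecutor(max_workers=4)

# Inside process(), replace the direct call
#     data_process(response.body, url, dict(response.headers))
# with a pool submission, so the write no longer blocks the IOLoop:
#     io_pool.submit(data_process, response.body, url, dict(response.headers))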

lega, 2015-04-07
@lega

In fact, there are 4 things that can slow this down: the CPU, the disk, the network, and the servers it downloads from.
1) See whether a single core is loaded at 100% (not the entire processor); if it is, you need to "fork" (a sketch follows below).
2) Disable saving:

def data_process(data, url, headers):
    pass

If with that change the CPU is below 100% and the network is not saturated, then the (intermediate) servers are responding slowly.
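A minimal sketch of the "fork" option from point 1, splitting the host list across one process per core; run_crawler is a hypothetical wrapper around the existing queue/worker/IOLoop logic from the question:

import multiprocessing

def run_crawler(urls):
    # hypothetical wrapper: fill qinput from `urls`, then register worker()
    # and start the IOLoop exactly as in the question's script
    pass

if __name__ == '__main__':
    with open('100000_hostsList.lst') as f:
        hosts = [l.strip() for l in f if l.strip()]
    n = multiprocessing.cpu_count()
    chunks = [hosts[i::n] for i in range(n)]  # round-robin split, one chunk per core
    procs = [multiprocessing.Process(target=run_crawler, args=(c,))
             for c in chunks]
    for p in procs:
        p.start()
    for p in procs:
        p.join()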

rodion-dev, 2015-04-08
@rodion-dev

All cores are at 0%;
the resources are barely used at all.
It's not about threads, it's about asynchronous downloads.
