Python
vetal_mart, 2016-08-18 09:15:47

Multiprocess parser losing links while parsing (Selenium + PhantomJS + ProcessPoolExecutor)?

I want to write a parser for a website that is heavy on JS. For this I chose the Selenium + PhantomJS + lxml stack, working in Python. The parser needs to be fast enough to get through at least 1000 links per hour, so I decided to use multiprocessing (not multithreading, because of the GIL!). To split the work across processes I used concurrent.futures.ProcessPoolExecutor.

The problem is this: if I feed, say, 10 links in, at best 9 come out processed (sometimes only 6). That is bad! There is also a pattern: as the number of workers grows, the number of lost links grows too. The first thing I did was try to track down where the program breaks, where it stops running (assert, as far as I understand, won't help in my case because of the multiprocessing). I found that it breaks inside browser.get(l) - the page simply does not load. I tried adding time.sleep(x), then implicit and explicit waits; nothing changed. I started digging into get() in the selenium module, found that it delegates to execute() in the same module, and there I got into a jungle that my knowledge does not let me untangle, and there is not much time.

I also tried running with a single process, i.e. with the number of processes equal to 1, and one link was still lost. That made me think the problem might not be in Selenium + PhantomJS but in ProcessPoolExecutor. I replaced that module with multiprocessing.Pool and, lo and behold, links were no longer lost. But another problem appeared instead: it will not run with more than 4 workers. If I set more, it gives the following error:

"""
    multiprocessing.pool.RemoteTraceback: 
    Traceback (most recent call last):
    File "/usr/lib/python3.4/multiprocessing/pool.py", line 119, in worker
        result = (True, func(*args, **kwds))
    File "/usr/lib/python3.4/multiprocessing/pool.py", line 44, in mapstar
        return list(map(*args))
    File "interface.py", line 34, in hotline_to_mysql
        w = Parse_hotline().browser_manipulation(link)
    File "/home/water/work/parsing/class_parser/parsing_classes.py", line 352, in browser_manipulation
        browser.get(l)
    File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/webdriver.py", line 247, in get
        self.execute(Command.GET, {'url': url})
    File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/webdriver.py", line 233, in execute
        response = self.command_executor.execute(driver_command, params)
    File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/remote_connection.py", line 401, in execute
        return self._request(command_info[0], url, body=data)
    File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/remote_connection.py", line 471, in _request
        resp = opener.open(request, timeout=self._timeout)
    File "/usr/lib/python3.4/urllib/request.py", line 463, in open
        response = self._open(req, data)
    File "/usr/lib/python3.4/urllib/request.py", line 481, in _open
        '_open', req)
    File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
        result = func(*args)
    File "/usr/lib/python3.4/urllib/request.py", line 1210, in http_open
        return self.do_open(http.client.HTTPConnection, req)
    File "/usr/lib/python3.4/urllib/request.py", line 1185, in do_open
        r = h.getresponse()
    File "/usr/lib/python3.4/http/client.py", line 1171, in getresponse
        response.begin()
    File "/usr/lib/python3.4/http/client.py", line 351, in begin
        version, status, reason = self._read_status()
    File "/usr/lib/python3.4/http/client.py", line 321, in _read_status
        raise BadStatusLine(line)
    http.client.BadStatusLine: ''

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
    File "interface.py", line 69, in <module>
        main()
    File "interface.py", line 63, in main
        executor.map(hotline_to_mysql, link_list)
    File "/usr/lib/python3.4/multiprocessing/pool.py", line 260, in map
        return self._map_async(func, iterable, mapstar, chunksize).get()
    File "/usr/lib/python3.4/multiprocessing/pool.py", line 599, in get
        raise self._value
    http.client.BadStatusLine: ''
    """
    import random
    import sys
    import time

    import lxml.html as lh
    from multiprocessing import Pool
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
    from concurrent.futures import ProcessPoolExecutor  # tried first, then replaced with multiprocessing.Pool

    from config import USER_AGENT  # list of user-agent strings kept in my config.py

    AMOUNT_PROCESS = 5

    def parse(h) -> list:
        # h - str, HTML of the page
        lxml_ = lh.document_fromstring(h)
        name = lxml_.xpath('/html/body/div[2]/div[7]/div[6]/ul/li[1]/a/@title')
        prices_ = (price.text_content().strip().replace('\xa0', ' ')
                   for price in lxml_.xpath('//*[@id="gotoshop-price"]'))
        markets_ = (market.text_content().strip() for market in
                    lxml_.find_class('cell shop-title'))
        wares = [[name[0], market, price] for (market, price)
                 in zip(markets_, prices_)]
        return wares


    def browser_manipulation(l):
        # l - str, URL of the page to load
        #options = []
        #options.append('--load-images=false')
        #options.append('--proxy={}:{}'.format(host, port))
        #options.append('--proxy-type=http')
        #options.append('--user-agent={}'.format(user_agent))  # headers are picked randomly here

        dcap = dict(DesiredCapabilities.PHANTOMJS)
        # user agent is taken from my config.py
        dcap["phantomjs.page.settings.userAgent"] = random.choice(USER_AGENT)
        browser = webdriver.PhantomJS(desired_capabilities=dcap)
        #browser.implicitly_wait(20)
        #browser.set_page_load_timeout(80)
        browser.get(l)
        time.sleep(20)
        result = parse(browser.page_source)
        browser.quit()
        return result

    def main():
        # read the file with links, one per line
        with open(sys.argv[1], 'r') as f:
            link_list = [i.replace('\n', '') for i in f]
        with Pool(AMOUNT_PROCESS) as executor:
            executor.map(browser_manipulation, link_list)

    if __name__ == '__main__':
        main()
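
For illustration, here is a minimal sketch of what guarding browser.get() could look like: an explicit wait on the #gotoshop-price element that parse() already queries, instead of a fixed time.sleep(20), plus a retry with backoff. MAX_RETRIES, PAGE_TIMEOUT and the backoff are made-up illustration values, and parse() and USER_AGENT are the ones from the code above.

    import random
    import time

    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException, WebDriverException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    MAX_RETRIES = 3      # illustration value, not from the original code
    PAGE_TIMEOUT = 30    # seconds to wait for the price block to appear

    def browser_manipulation(l):
        # parse() and USER_AGENT come from the code in the question
        dcap = dict(DesiredCapabilities.PHANTOMJS)
        dcap["phantomjs.page.settings.userAgent"] = random.choice(USER_AGENT)
        browser = webdriver.PhantomJS(desired_capabilities=dcap)
        browser.set_page_load_timeout(PAGE_TIMEOUT)
        try:
            for attempt in range(MAX_RETRIES):
                try:
                    browser.get(l)
                    # wait until the price block is actually in the DOM
                    # instead of sleeping a fixed 20 seconds
                    WebDriverWait(browser, PAGE_TIMEOUT).until(
                        EC.presence_of_element_located((By.ID, "gotoshop-price")))
                    return parse(browser.page_source)
                except (TimeoutException, WebDriverException):
                    time.sleep(2 ** attempt)   # simple backoff before retrying
            return []                          # give up; caller can re-queue the link
        finally:
            browser.quit()

With a readiness condition instead of a 20-second sleep, each worker spends only as long on a page as it actually needs, which matters for the 1000-links-per-hour target.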

So, the actual questions: where could the error be - in Selenium and PhantomJS, in ProcessPoolExecutor, or have I written something wrong in the code?
How can I increase the parsing speed so that it handles 1000 links in 1 hour?
Finally, is there some other way to parse dynamic pages (in Python, of course)?
Thanks for the answers.

Alexey Sundukov, 2016-11-12
@alekciy

Probably not very relevant anymore, but I'll leave a note for history. Losing pages during parsing is a normal situation: an abundance of JS, code crashing while manipulating the DOM, network problems - any of it can raise an exception and make the data retrieval fail. So you should treat this as a regular occurrence right from the design stage: simply catch the pages that failed to parse and send them back for re-parsing.
1000 pages in 1 hour is a more than realistic target. I myself got a rate of 1000 pages in 15 minutes, achieved simply by spinning up a cluster. It does take a lot of resources (I ended up with something like 10 nodes, each with up to 5 GB of RAM).
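
As a rough sketch of that "catch and re-parse" idea, assuming browser_manipulation returns an empty list when a page fails (as in the guarded sketch above) rather than raising: failed links are collected and fed back into the pool for another pass, up to a made-up MAX_PASSES budget.

    from multiprocessing import Pool

    MAX_PASSES = 3   # illustration value: how many full re-parse rounds to allow

    def scrape_all(link_list, processes=5):
        # browser_manipulation is the worker function from the question
        results = []
        pending = list(link_list)
        for _ in range(MAX_PASSES):
            if not pending:
                break
            with Pool(processes) as pool:
                parsed = pool.map(browser_manipulation, pending)
            failed = []
            for link, wares in zip(pending, parsed):
                if wares:                  # got data back
                    results.extend(wares)
                else:                      # empty result -> send for re-parsing
                    failed.append(link)
            pending = failed
        return results, pending            # pending holds links that never parsed

Links that are still unparsed after the last pass end up in pending, so nothing is lost silently.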
