Multiprocess parser losing links when parsing (Selenium + PhantomJS + ProcessPoolExecutor)?
I want to write a parser for a website that has a lot of JS code. For this I chose the Selenium + PhantomJS + lxml bundle, working in Python. The parser needs to be fast enough to process at least 1000 links per hour, so I decided to use multiprocessing (not multithreading, because of the GIL!). To split the work into processes I used concurrent.futures.ProcessPoolExecutor.

The problem is this: if, for example, I feed in 10 links, at best 9 come out processed (sometimes only 6). This is bad! There is also a pattern: as the number of workers grows, the number of lost links grows too. The first thing I did was try to track down where the program breaks, where it stops running (assert, as far as I understand, will not work in my case because of multiprocessing). I determined that the break happens in browser.get(l) - it does not load the page. I tried adding time.sleep(x), then adding explicit and implicit waits. Nothing changed. I started digging into the get() function in the selenium module and found that it in turn calls the execute() function from the same module, and there I got into a jungle that my knowledge does not let me untangle, and there is not much time.

At the same time, I tried running with a single process, i.e. number of processes = 1, and one link was still lost. That made me think the problem might not be in Selenium + PhantomJS but in ProcessPoolExecutor. I replaced that module with multiprocessing.Pool and, lo and behold, the links were no longer lost. But instead another problem appeared: it does not run with more than 4 processes. If I set more, it gives the following error:
"""
multiprocessing.pool.RemoteTraceback:
Traceback (most recent call last):
  File "/usr/lib/python3.4/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.4/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "interface.py", line 34, in hotline_to_mysql
    w = Parse_hotline().browser_manipulation(link)
  File "/home/water/work/parsing/class_parser/parsing_classes.py", line 352, in browser_manipulation
    browser.get(l)
  File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/webdriver.py", line 247, in get
    self.execute(Command.GET, {'url': url})
  File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/webdriver.py", line 233, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/remote_connection.py", line 401, in execute
    return self._request(command_info[0], url, body=data)
  File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/remote_connection.py", line 471, in _request
    resp = opener.open(request, timeout=self._timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 463, in open
    response = self._open(req, data)
  File "/usr/lib/python3.4/urllib/request.py", line 481, in _open
    '_open', req)
  File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 1210, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.4/urllib/request.py", line 1185, in do_open
    r = h.getresponse()
  File "/usr/lib/python3.4/http/client.py", line 1171, in getresponse
    response.begin()
  File "/usr/lib/python3.4/http/client.py", line 351, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.4/http/client.py", line 321, in _read_status
    raise BadStatusLine(line)
http.client.BadStatusLine: ''

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "interface.py", line 69, in <module>
    main()
  File "interface.py", line 63, in main
    executor.map(hotline_to_mysql, link_list)
  File "/usr/lib/python3.4/multiprocessing/pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.4/multiprocessing/pool.py", line 599, in get
    raise self._value
http.client.BadStatusLine: ''
"""
import random
import sys
import time

import lxml.html as lh
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
from multiprocessing import Pool
from concurrent.futures import Future, ProcessPoolExecutor, ThreadPoolExecutor

from config import USER_AGENT  # the list of user agents is taken from my config.py

AMOUNT_PROCESS = 5


def parse(h) -> list:
    # h - str, html of page
    lxml_ = lh.document_fromstring(h)
    name = lxml_.xpath('/html/body/div[2]/div[7]/div[6]/ul/li[1]/a/@title')
    prices_ = (price.text_content().strip().replace('\xa0', ' ')
               for price in lxml_.xpath('//*[@id="gotoshop-price"]'))
    markets_ = (market.text_content().strip() for market in
                lxml_.find_class('cell shop-title'))
    wares = [[name[0], market, price] for (market, price)
             in zip(markets_, prices_)]
    return wares


def browser_manipulation(l):
    # options = []
    # options.append('--load-images=false')
    # options.append('--proxy={}:{}'.format(host, port))
    # options.append('--proxy-type=http')
    # options.append('--user-agent={}'.format(user_agent))  # headers are picked randomly here
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    # user agent is taken from my config.py
    dcap["phantomjs.page.settings.userAgent"] = random.choice(USER_AGENT)
    browser = webdriver.PhantomJS(desired_capabilities=dcap)
    # print(browser)
    # print('~~~~~~', l)
    # browser.implicitly_wait(20)
    # browser.set_page_load_timeout(80)
    # time.sleep(2)
    browser.get(l)
    time.sleep(20)
    result = parse(browser.page_source)
    # print('++++++', result[0][0])
    browser.quit()
    return result


def main():
    # open some file with links
    with open(sys.argv[1], 'r') as f:
        link_list = [i.replace('\n', '') for i in f]
    with Pool(AMOUNT_PROCESS) as executor:
        executor.map(browser_manipulation, link_list)


if __name__ == '__main__':
    main()
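For reference, the fixed time.sleep(20) in browser_manipulation() can usually be replaced with an explicit wait on the element that parse() actually reads. A minimal sketch under that assumption (the helper name and the 30-second timeout are my own choices, not from the question):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def wait_for_prices(browser, timeout=30):
    # Block until the #gotoshop-price element used by parse() is present in the DOM,
    # or raise selenium.common.exceptions.TimeoutException.
    WebDriverWait(browser, timeout).until(
        EC.presence_of_element_located((By.ID, 'gotoshop-price'))
    )

Calling wait_for_prices(browser) right after browser.get(l) would return as soon as the prices are in the DOM instead of always sleeping for 20 seconds.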
Probably no longer very relevant, but I'll leave a note for history. Losing pages during parsing is a normal situation. An abundance of JS, code crashing while working with the DOM, network problems - any of these can raise an exception and make the data retrieval fail. So when developing you should plan for this from the start: simply catch the pages that failed to parse and send them back for re-parsing.
1000 pages per hour is more than achievable. I myself got a rate of 1000 pages in 15 minutes, simply by spinning up a cluster. It does take a lot of resources (I ended up with about 10 nodes, each with up to 5 GB of RAM).
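The re-parsing approach described above could look roughly like the sketch below. It is only an illustration: it reuses browser_manipulation() from the question, and MAX_RETRIES and the worker count are assumptions.

from multiprocessing import Pool

MAX_RETRIES = 3  # assumption: how many passes to make over the failed links


def safe_parse(link):
    # Any exception (BadStatusLine, timeouts, JS crashes) just marks the link as failed.
    try:
        return link, browser_manipulation(link)
    except Exception:
        return link, None


def parse_all(link_list, processes=4):
    results = {}
    pending = list(link_list)
    for _ in range(MAX_RETRIES):
        if not pending:
            break
        with Pool(processes) as pool:
            outcome = pool.map(safe_parse, pending)
        results.update({link: data for link, data in outcome if data is not None})
        pending = [link for link, data in outcome if data is None]
    return results, pending  # pending holds the links that failed every attempt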