Python
Bjornie, 2016-12-17 02:31:27

How to speed up data parsing with Python/Selenium?

The current version parses with chromedriver. In practice I have about 100,000 links, each containing a table. Each table has a "Details" button, which the parser currently clicks, then copies the contents of the popup, closes it, and so on.
At this rate, parsing what is probably a million such rows will take a month of continuous Selenium work, so I'm looking for a way to speed it up.
Part of the problem is the small fixed delays I have to insert so the popup has time to load and then to close; without them I get "element not found" errors.
In short, help me out. Tell me how this is actually done, to speed the work up at least 10x (in half an hour it got through about 400 pages, parsing about 2,000 rows). It feels like browsing the site myself, clicking every "Details" link, and only delegating the copying to the script. That can hardly be called full automation, especially at these volumes (which I don't even consider large).
Are there "real" accelerators for operations like this? I understand Selenium is made for testing, or at least for parsing pages that aren't full of popups that all have to be clicked.
upd: after posting, I continued to google and found the following in one discussion:

javascript tables is exactly why I went with selenium for some sites. However, rather than parsing directly with selenium, I was passing driver.page_source (raw html containing whatever javascript generated) to bs4 and parsing with bs4. I was shocked to find out that this round about method was faster than using selenium.find_element_by_XXXXX methods without ever invoking bs4.

Is it really true?


2 answers
Alexey Sundukov, 2016-12-17
@Bjornie

For 100k links, especially if you need to crawl them fairly often (or the server is short on resources), it already makes sense to think about more custom mechanisms: lower-level things you build by hand, but which run faster. For example, fetching the AJAX data directly with curl. Or, if the data is generated at runtime on the client by tricky JS, using SpiderMonkey, V8 or another server-side engine.
I built a parser on a PhantomJS cluster that had to crawl a bit over 1k pages in 15 minutes and extract various tricky attributes from them. It took about ten PhantomJS instances, 20 GB of RAM and 16 CPU cores. A cluster like that really will digest 100k a day.
When the time requirement tightened to 5 minutes, I switched to SpiderMonkey.
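At a smaller scale the cluster idea is simply several browser instances splitting the link list between them. A rough sketch, assuming a chromedriver recent enough for headless mode; the LINKS list and the parsing step are hypothetical placeholders:

```python
# Sketch only: several headless browser instances crawling in parallel.
from concurrent.futures import ProcessPoolExecutor
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def crawl(links):
    opts = Options()
    opts.add_argument("--headless")           # no visible window
    driver = webdriver.Chrome(options=opts)
    pages = []
    try:
        for url in links:
            driver.get(url)
            pages.append(driver.page_source)  # parse later, outside the browser
    finally:
        driver.quit()
    return pages

def split(seq, n):
    """Cut seq into n roughly equal chunks, one per worker."""
    k = -(-len(seq) // n)  # ceiling division
    return [seq[i:i + k] for i in range(0, len(seq), k)]

if __name__ == "__main__":
    LINKS = [f"https://example.com/table?page={i}" for i in range(1, 401)]
    with ProcessPoolExecutor(max_workers=4) as pool:
        for pages in pool.map(crawl, split(LINKS, 4)):
            ...  # hand each batch of HTML to your parser here
```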
You need to use explicit waits (WebDriverWait) instead of fixed delays: the code then continues as soon as the desired element appears on the page, rather than sleeping for a worst-case interval.
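A minimal sketch of what that looks like in Selenium's Python bindings; the ".popup" and ".close" selectors are hypothetical and have to be adjusted to the real markup:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

POPUP = (By.CSS_SELECTOR, ".popup")         # hypothetical selectors:
CLOSE = (By.CSS_SELECTOR, ".popup .close")  # adjust to the real markup

def read_details(driver, button, timeout=10):
    """Click a "Details" button and return the popup text, with no fixed sleeps."""
    wait = WebDriverWait(driver, timeout)
    button.click()
    # Blocks only until the popup is actually visible, then returns immediately.
    popup = wait.until(EC.visibility_of_element_located(POPUP))
    text = popup.text
    driver.find_element(*CLOSE).click()
    # Make sure the popup is gone before the caller clicks the next row.
    wait.until(EC.invisibility_of_element_located(POPUP))
    return text
```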
The presence or absence of popups doesn't matter: anything that appears in the DOM can be processed. I regularly pull data from Yandex Wordstat, which is full of cunning handlers, but sooner or later everything is solved with PhantomJS through webdriver.
Maybe. But whether it holds in your context, only an experiment will tell: take that claim and test it on your own parsing task.
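The claim from the question is cheap to check: pull the same table once through find_elements and once by handing driver.page_source to bs4, and time both. A sketch; the URL and the "table td" selector are placeholders:

```python
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/table-page")  # placeholder URL

# Variant 1: every cell goes through the webdriver protocol (one round trip each).
t0 = time.perf_counter()
via_selenium = [td.text for td in driver.find_elements(By.CSS_SELECTOR, "table td")]
t1 = time.perf_counter()

# Variant 2: fetch the rendered HTML once, parse it locally with bs4.
soup = BeautifulSoup(driver.page_source, "html.parser")
via_bs4 = [td.get_text(strip=True) for td in soup.select("table td")]
t2 = time.perf_counter()

print(f"selenium: {t1 - t0:.2f} s, bs4: {t2 - t1:.2f} s")
driver.quit()
```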

Artem Kislenko, 2016-12-17
@webwork

You can use phantomjs instead of chromedriver; it will speed up page navigation.
But I'm almost 100% sure you don't need a JavaScript interpreter at all to get at the required data.
If the popup opens without loading data (no AJAX), then the data is already somewhere in the HTML and can be parsed out of it.
If it does load data, then make the request directly, to the URL the data is loaded from.
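A sketch of that second case, assuming you have already spotted the XHR endpoint in the browser's DevTools Network tab while clicking "Details" (the URL and the id parameter below are hypothetical):

```python
import requests

# Hypothetical endpoint seen in DevTools -> Network when the popup loads.
DETAILS_URL = "https://example.com/api/details"

session = requests.Session()  # reuses the TCP connection across 100k requests
session.headers.update({"User-Agent": "Mozilla/5.0"})

def fetch_details(row_id):
    """One cheap HTTP request per row instead of click / wait / copy / close."""
    resp = session.get(DETAILS_URL, params={"id": row_id}, timeout=10)
    resp.raise_for_status()
    return resp.json()  # or resp.text if the endpoint returns an HTML fragment

print(fetch_details(12345))
```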
