How to parse a product card and return the result in real time?

Parsing

Vitaly B, 2018-05-09 09:55:38

Task description: a web page has a URL input field and a button; on submit it shows the parsing result (name, price, size/color options) for the entered address. Example: parsing a product card on Taobao.
Additional task: get the page source after its JS has already executed.
Platform: Ubuntu 16.04 in VirtualBox (4 GB of RAM, maximum CPU, KVM virtualization) on a home PC (100 Mbps Internet).
Attempted implementations: PhantomJS, Selenium (with various drivers), Scrapy, Beautiful Soup (with and without PyQt4 + xvfb).
Test sites: Taobao.com and Dns-shop.ru
Results: it works, but the same script with the same URL can take anywhere from 5 seconds to 5 minutes to return results. This happens not only with Taobao product URLs but also with DNS-shop and other Russian sites.
A 5-second run is tolerable, but anything longer makes the whole thing pointless. Yes, I know Taobao has an API, but if that were an option I would not have turned to parsing.
How can I get rid of such long delays? Or what other options are there?
I also tried requesting directly the URL from which Taobao loads the price/options data I need, but I could not figure it out: accessing that URL returns a 403 error.

2 answer(s)

Evgen, 2018-05-09
@Verz1Lka

Try Splash for rendering.
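To illustrate the suggestion, here is a minimal sketch of asking Splash for a page's HTML after its JavaScript has run, via its `render.html` HTTP endpoint. It assumes a Splash instance is already running locally on port 8050; the helper names `build_render_url` and `fetch_rendered_html` and the default `wait`/`timeout` values are made up for the example, not something from the question.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Splash is running locally; 8050 is its default HTTP port.
SPLASH_ENDPOINT = "http://localhost:8050/render.html"


def build_render_url(page_url, wait=0.5, timeout=10.0):
    """Build the render.html request URL for the page we want rendered.

    wait    - seconds Splash waits after page load so scripts can finish
    timeout - seconds after which Splash gives up on the render
    """
    params = {"url": page_url, "wait": wait, "timeout": timeout}
    return SPLASH_ENDPOINT + "?" + urlencode(params)


def fetch_rendered_html(page_url, wait=0.5, timeout=10.0):
    """Return the page HTML after JavaScript execution (needs a live Splash)."""
    request_url = build_render_url(page_url, wait, timeout)
    # Give the HTTP client slightly longer than Splash's own timeout.
    with urlopen(request_url, timeout=timeout + 5) as response:
        return response.read().decode("utf-8", errors="replace")
```

Setting an explicit `timeout` caps the worst case, so a slow render fails fast instead of hanging for minutes; the returned HTML can then be handed to whatever parser is already in use.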

Alexey Sundukov, 2018-05-10
@alekciy

"the same script with the same URL can take anywhere from 5 seconds to 5 minutes to return results"

As I understand it, the headless browser runs as a single instance and all requests go to it? Then you simply have deadlocks inside it. When I ran into a similar problem, I could not solve it within a single running process, so I took a simpler route. In your parser, add a connection manager whose tasks include:
1) launching additional copies of the browser when needed;
2) guaranteeing that each copy handles only one request at a time, and that no new task (within the same domain) is sent until a response has been received.
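The two responsibilities above could be sketched roughly as follows. This is only an illustration under assumptions, not the code I actually used: `BrowserManager`, `launch_browser`, and the `.render(url)` method on a browser object are hypothetical names, and the real launch/render calls depend on the driver (PhantomJS, Selenium, etc.).

```python
import threading
from urllib.parse import urlparse


class BrowserManager:
    """Hands out browser instances so each one serves one request at a time."""

    def __init__(self, launch_browser, max_browsers=4):
        self._launch = launch_browser   # hypothetical factory; returns an object with .render(url)
        self._max = max_browsers        # upper bound on launched copies
        self._idle = []                 # launched copies that are currently free
        self._count = 0                 # copies launched (or being launched)
        self._busy_domains = set()      # domains with a request in flight
        self._cond = threading.Condition()

    def _acquire(self, domain):
        with self._cond:
            # Wait until the domain is free and a copy is idle or may be launched.
            while domain in self._busy_domains or not (
                self._idle or self._count < self._max
            ):
                self._cond.wait()
            self._busy_domains.add(domain)
            if self._idle:
                return self._idle.pop()
            self._count += 1
        try:
            return self._launch()       # launch outside the lock; it can be slow
        except Exception:
            with self._cond:            # roll back the reservation on failure
                self._count -= 1
                self._busy_domains.discard(domain)
                self._cond.notify_all()
            raise

    def _release(self, domain, browser):
        with self._cond:
            self._idle.append(browser)
            self._busy_domains.discard(domain)
            self._cond.notify_all()

    def render(self, url):
        """Render url on a copy that is not serving anything else right now."""
        domain = urlparse(url).netloc
        browser = self._acquire(domain)
        try:
            return browser.render(url)
        finally:
            self._release(domain, browser)
```

Callers from any number of threads just call `manager.render(url)`; a request blocks while its domain is busy or while all copies are occupied. In this sketch a copy is returned to the pool even when `render` raises; a real manager would probably restart such a copy instead.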
Also note that, in my experience, the PhantomJS 2.x branch is slower than 1.9: where the old version took half a second, the new one takes about a second and a half.