JavaScript
forklive, 2018-03-28 22:03:00

How can I parse a large number of web pages simultaneously with JS?

Good afternoon!
Suppose there are many open web pages whose content changes dynamically and very often, maybe once a second. "Many" here means a few thousand.
The task is to parse these pages.
Why parse with JS?
Because the content is generated by scripts. That is, if you just blindly fire GET requests, then, first, the IP will quickly get banned, and second, there is no guarantee the response will even contain what you need.
So the idea is to write a separate JS script for each type of page, run it on that page (say, once a second), and have it output a JSON array with the needed data to a wrapper program, which in turn writes the data to a database.
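A minimal sketch of what such a per-page script could look like; the selectors, attributes, and the PARSE_RESULT: prefix are made up for illustration, and a real page would need its own:

```javascript
// Runs inside the page (injected by the wrapper).
// Once a second it scrapes a few hypothetical elements and
// prints a JSON array to the console for the wrapper to pick up.
setInterval(() => {
  const rows = document.querySelectorAll('.event-row');          // hypothetical selector
  const data = Array.from(rows).map(row => ({
    id:    row.dataset.eventId,                                   // hypothetical attribute
    name:  row.querySelector('.event-name')?.textContent.trim(),
    price: row.querySelector('.event-price')?.textContent.trim(),
  }));
  // The wrapper watches console output for this prefix.
  console.log('PARSE_RESULT:' + JSON.stringify(data));
}, 1000);
```

The wrapper then only has to filter console messages by that prefix and push the JSON into the database.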
The problem is that just to open these few thousand pages in a browser, you need several hundred computers.
It's clear that for such a task I would have to rent several servers, each with as much RAM as possible (say, 64 GB) and a powerful CPU, since a JS script has to run on every page once a second.
Are there any tools that can open web pages, keep them in memory, execute JS scripts on them, and return data from the console, while being less resource-hungry than a regular browser?
Googling pointed me towards a few directions:
- parsing with QT + JS
- Scrapy
- Node.js
Which way should I dig?


4 answers
Stalker_RED, 2018-03-28
@Stalker_RED

> just to open these few thousand pages in a browser, you need several hundred computers.
10 tabs per computer, seriously?
Have a look at PhantomJS and Selenium.
Also, most likely the data is not generated directly in those tabs but comes over the network. Have you tried working out what protocol is used there?

Evgen, 2018-03-29
@Verz1Lka

All "dynamically generated content every second" is nothing but those stupid get (possibly post) requests that you don't want to use. Those. the site is written so that every second the page sends a request of a certain format (using the correct headers and parameters) to the service. The most efficient, just so for speed, is to fake these requests, and read the responses. To prevent you from being banned, change the IP and headers (for example, User Agent, Cookie).
For queries, you can use scrapy (multithreading is supported).
If you still want to directly emulate the entire browser, try headless chrome and selenium.
PS If you let me look at the page, I'll tell you which technology is more suitable.
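Since the question is tagged JavaScript, a rough Node.js (18+) sketch of this request-faking idea; the endpoint, query parameter, cookie, and User-Agent values are made up, the real ones have to be copied from the Network tab in DevTools:

```javascript
// Replays the kind of XHR the page would normally send itself.
// Endpoint, parameter, cookie and User-Agent values are hypothetical.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
];

async function fetchInplayData(eventId) {
  const ua = USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
  const res = await fetch(`https://example-bookmaker.com/api/inplay?event=${eventId}`, {
    headers: {
      'User-Agent': ua,
      'Accept': 'application/json',
      'Cookie': 'session=...', // copied from a real browser session
    },
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.json();
}

fetchInplayData(12345).then(console.log).catch(console.error);
```

IP rotation would sit in front of this, e.g. a pool of proxies, rather than in the code itself.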

forklive, 2018-03-29
@forklive

Thanks for the answers!
> P.S. If you show me the page, I'll tell you which technology fits better.
These are all betting sites that are blocked in the Russian Federation.
For example, https://www.betfair.com/sport/inplay, just pick any event there.
Yes, most of them send GET/POST requests, and the response is either JSON or plain HTML.
And yes, it's very hard to figure out which field belongs to what. For example, on the initial load a table arrives in which every cell has its own ID.
Then the dynamic requests return "cell ID - cell value" pairs (roughly as in the sketch below).
And each of the 50-100 bookmakers has its own scheme with its own quirks.
For example: you look in Chrome at what request the page sends, you send the same request from another tab, and the server already returns some kind of error. That is, the server already recognises it as a bogus request, and now you have to figure out what exactly is wrong...
So, to get some kind of universality, IMO it's better to write JS scripts. The browser page (or a browser emulator) sends all the necessary requests itself, and all that's left is to pick up a JSON array from the console, where it's much harder to get confused (see the wrapper sketch at the end of this post).
Well, that's my train of thought...
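Roughly how I picture handling those "cell ID - cell value" pairs on the wrapper side; a sketch only, the actual payload shapes of course differ from bookmaker to bookmaker:

```javascript
// One snapshot per page: cell ID -> current value.
// The initial table fills the snapshot, later deltas patch it.
const snapshot = new Map();

// Hypothetical shape of the initial table: [{ id, value }, ...]
function applyInitialTable(cells) {
  for (const cell of cells) snapshot.set(cell.id, cell.value);
}

// Hypothetical shape of a delta update: { "cell-17": "2.45", ... }
function applyDelta(delta) {
  for (const [id, value] of Object.entries(delta)) snapshot.set(id, value);
}

applyInitialTable([{ id: 'cell-17', value: '2.30' }, { id: 'cell-18', value: '1.65' }]);
applyDelta({ 'cell-17': '2.45' });
console.log(Object.fromEntries(snapshot)); // { 'cell-17': '2.45', 'cell-18': '1.65' }
```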
> 10 tabs per computer, seriously?
Maybe not 10. But keep in mind that a script runs in every tab once a second and the resulting JSON then has to be parsed. In my experiments, even 20 such open pages created a significant load.
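And this is roughly how I picture the wrapper on top of headless Chrome; a sketch using Puppeteer as one possible driver, with hypothetical URLs and the same made-up PARSE_RESULT: convention as in the sketch in the question:

```javascript
// Wrapper sketch: opens pages in headless Chrome via Puppeteer,
// injects a per-page script and collects the JSON it logs to the console.
const puppeteer = require('puppeteer');

const PAGES = [
  'https://example-bookmaker.com/event/1',   // hypothetical URLs
  'https://example-bookmaker.com/event/2',
];

(async () => {
  const browser = await puppeteer.launch({ headless: true });

  for (const url of PAGES) {
    const page = await browser.newPage();

    // Pick up everything the page-side script prints with the agreed prefix.
    page.on('console', msg => {
      const text = msg.text();
      if (text.startsWith('PARSE_RESULT:')) {
        const data = JSON.parse(text.slice('PARSE_RESULT:'.length));
        // TODO: write `data` to the database here.
        console.log(url, data.length, 'rows');
      }
    });

    await page.goto(url, { waitUntil: 'networkidle2' });

    // Inject the per-page script (placeholder payload here).
    await page.evaluate(() => {
      setInterval(() => {
        console.log('PARSE_RESULT:' + JSON.stringify([]));
      }, 1000);
    });
  }
  // Pages stay open; the process keeps running while the browser is connected.
})();
```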

Mikhail Sisin, 2018-03-29
@JabbaHotep

A colleague of mine wrote a betting parser (as a contract job); 2000 requests had to be processed every 10 seconds, including the actual data collection, parsing, and writing to the database. I can say that Python could not keep up, so Go was used.
