What is the best way to collect data from third-party sites?

Alexey FRZ, 2020-02-10 13:38:21

Java

Please help me get a clearer picture; I can't find anything specific on the Internet.
The task: collect data from certain sites (double values), compare each value with a reference figure, and compute the difference. The values change many times per minute and sit on different pages of each site. Speed of data acquisition and analysis is the top priority. If I download each page and parse it with regular expressions, it will be very slow: by the time the pages from all the sites have been downloaded and analyzed, the data may no longer be relevant.

In short: there are 10 sites, each with double values spread across different pages, and the values change, say, 6 times per minute. How can I pull them out as quickly as possible?

UPD: I will solve the problem in Java + JSoup

1 answer(s)

Sergey Shvyrev, 2020-02-10
@leshqow

Great choice of technologies.
All that remains is to run the crawlers in many threads.
Or, if you build it as microservices, run several instances in parallel, each handling one of the sites you need to monitor.
I would recommend the second option. It is more convenient: if, say, an error occurs in one of the crawlers, the rest keep working while Kubernetes brings the failed one back up.
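
To illustrate the isolation idea on a small scale, here is a minimal in-process sketch using plain JDK threads rather than separate microservice instances. The class name `IsolatedCrawlers`, the `fetchValue` function, and the choice to report a failed site as `NaN` are all assumptions made for the example; in a real setup each site would be its own service and the orchestrator would restart it.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class IsolatedCrawlers {

    /**
     * Runs one crawl task per site in parallel and returns, per site,
     * the difference between the fetched value and the reference.
     * A site whose task throws is reported as NaN; other sites are unaffected.
     */
    static Map<String, Double> pollAll(List<String> sites,
                                       Function<String, Double> fetchValue,
                                       double reference) {
        // One thread per site, so a slow site does not delay the others.
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, sites.size()));
        try {
            // Submit everything first so all sites are crawled concurrently.
            Map<String, Future<Double>> pending = new LinkedHashMap<>();
            for (String site : sites) {
                pending.put(site, pool.submit(() -> fetchValue.apply(site) - reference));
            }
            // Collect results; a failure is contained to its own entry.
            Map<String, Double> diffs = new LinkedHashMap<>();
            for (Map.Entry<String, Future<Double>> entry : pending.entrySet()) {
                try {
                    diffs.put(entry.getKey(), entry.getValue().get());
                } catch (ExecutionException failed) {
                    diffs.put(entry.getKey(), Double.NaN);
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    diffs.put(entry.getKey(), Double.NaN);
                }
            }
            return diffs;
        } finally {
            pool.shutdown();
        }
    }
}
```

Here `fetchValue` stands in for whatever downloads and parses one site (for example, a jsoup-based function). Calling `pollAll` in a loop gives one polling round per iteration, and a broken site only costs its own entry in the result.
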
At my company, the parsers are built on Vert.x.
Reasons for choosing this framework:

Its own DNS resolver. It can use the host's resolver, 8.8.8.8, or a custom list of servers.
Non-blocking requests. You can run many requests at the same time. This is a lifesaver with slow free proxies.
Complex blocking operations, such as parsing a response or waiting for a captcha, run in separate worker threads with their own pool. This makes it very convenient to keep them apart from the main logic.
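
The last two points can be sketched with the plain JDK (Java 11+) instead of Vert.x: requests are started without blocking, and parsing is handed to a separate worker pool. This is not the Vert.x API, only the same pattern. The class name `AsyncPoller`, the `<span id="price">` markup, and the regex-based `parseValue` are assumptions made for the example; with jsoup the parsing step would be a selector lookup.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class AsyncPoller {

    // Hypothetical markup: <span id="price">123.45</span>
    private static final Pattern VALUE =
            Pattern.compile("<span id=\"price\">\\s*(-?\\d+(?:\\.\\d+)?)\\s*</span>");

    /** Parsing step, kept separate so it can run on a worker pool. */
    static double parseValue(String html) {
        Matcher m = VALUE.matcher(html);
        if (!m.find()) {
            throw new IllegalArgumentException("value not found");
        }
        return Double.parseDouble(m.group(1));
    }

    /** Non-blocking fetcher backed by the JDK HTTP client. */
    static Function<String, CompletableFuture<String>> httpFetcher(HttpClient client) {
        return url -> client
                .sendAsync(HttpRequest.newBuilder(URI.create(url)).build(),
                           HttpResponse.BodyHandlers.ofString())
                .thenApply(HttpResponse::body);
    }

    /**
     * Starts all requests at once, parses each body on the worker pool,
     * and returns the differences from the reference in input order.
     */
    static List<Double> diffs(List<String> urls,
                              Function<String, CompletableFuture<String>> fetch,
                              Executor workers,
                              double reference) {
        // Build every future before joining any, so the requests overlap.
        List<CompletableFuture<Double>> inFlight = urls.stream()
                .map(url -> fetch.apply(url)
                        .thenApplyAsync(AsyncPoller::parseValue, workers)
                        .thenApply(value -> value - reference))
                .collect(Collectors.toList());
        return inFlight.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }
}
```

In real use you would pass `AsyncPoller.httpFetcher(HttpClient.newHttpClient())` as `fetch` and a fixed thread pool as `workers`. Taking the fetcher as a parameter also lets you swap in a stub for testing or a proxy-aware client later.
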

This is getting long. Listen, I already have a finished project that is open on GitHub. I entered it in a competition, but lost. Everything I'm describing is implemented there, although not all of the Javadocs are written yet.
Maybe I should write an article on Habr rather than cram everything into one answer... What do you think?