What is the best way to collect data from third-party sites?
Help me clarify the picture; the information available online is not very specific.
The task: collect data from certain sites (numeric, double-type values) and compare them against a reference figure to obtain the difference. The values change many times per minute, and the figures sit on different pages of each site. Speed of data acquisition and analysis has the highest priority: downloading each page and parsing it with regular expressions turns out to be very slow, and by the time you have downloaded and analyzed the pages from all the sites, the data may no longer be current.
In summary: there are 10 sites where double values, spread across different pages, change roughly 6 times per minute. How do I pull them out as quickly as possible?
UPD: I will solve the problem in Java + JSoup
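A minimal sketch of the basic building block with JSoup, assuming a hypothetical URL, CSS selector, and reference figure:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SingleValueFetch {
    public static void main(String[] args) throws Exception {
        // Hypothetical page and selector; replace with the real ones.
        String url = "https://example.com/quotes";
        Document doc = Jsoup.connect(url)
                .timeout(3000)   // fail fast: stale data is useless here
                .get();
        // Assume the figure sits in an element like <span class="value">42.17</span>.
        double value = Double.parseDouble(doc.selectFirst("span.value").text());
        double reference = 42.0; // the reference figure from the question
        System.out.printf("value=%.2f diff=%.2f%n", value, value - reference);
    }
}
```

Parsing the DOM with a CSS selector is also considerably more robust than running regular expressions over raw HTML.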
Great choice of technologies.
All that remains is to launch the crawlers in multiple threads (a sketch follows below).
Or, if you are building microservices, run several instances in parallel, one processing each site that needs to be monitored.
I would recommend the second option. It is more convenient: if an error occurs in one of the crawlers, the rest keep working while Kubernetes restarts the failed one.
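As a rough illustration of the multi-threaded variant, a sketch that polls every page on its own fixed schedule with a ScheduledExecutorService; the page list and the span.value selector are hypothetical:

```java
import org.jsoup.Jsoup;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ParallelPoller {
    // Hypothetical page list; in practice ~10 sites with several pages each.
    static final List<String> PAGES = List.of(
            "https://site1.example/page",
            "https://site2.example/page");

    public static void main(String[] args) {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(PAGES.size());
        for (String url : PAGES) {
            // Values change ~6 times per minute, so poll every 10 seconds.
            pool.scheduleAtFixedRate(() -> {
                try {
                    String text = Jsoup.connect(url).timeout(3000)
                            .get().selectFirst("span.value").text();
                    System.out.println(url + " -> " + Double.parseDouble(text));
                } catch (Exception e) {
                    // An uncaught exception would cancel the schedule,
                    // so log the failure and retry on the next tick.
                    System.err.println(url + ": " + e.getMessage());
                }
            }, 0, 10, TimeUnit.SECONDS);
        }
    }
}
```

One thread per page keeps a slow site from delaying the others, which matches the latency requirement from the question.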
In my company, parsers are implemented on Vert.x.
Reasons for choosing this framework:
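For illustration only, a minimal sketch of a non-blocking Vert.x poller, assuming the vertx-web-client module and the callback-style send of Vert.x 3/4; the target URL and the 10-second period are hypothetical:

```java
import io.vertx.core.Vertx;
import io.vertx.ext.web.client.WebClient;

public class VertxPoller {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();
        // WebClient comes from the vertx-web-client module.
        WebClient client = WebClient.create(vertx);
        // Fire a non-blocking request on the event loop every 10 seconds.
        vertx.setPeriodic(10_000, timerId ->
                client.getAbs("https://site1.example/page").send(ar -> {
                    if (ar.succeeded()) {
                        String body = ar.result().bodyAsString();
                        // Hand the HTML to JSoup (Jsoup.parse(body)) here.
                        System.out.println("got " + body.length() + " chars");
                    } else {
                        System.err.println("fetch failed: " + ar.cause());
                    }
                }));
    }
}
```

The event-loop model lets a single thread keep many requests in flight at once, which is one common reason to pick Vert.x for this kind of polling.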