What is the best way to implement a multi-threaded web scraper / site parser?
In what programming language, and with which libraries / frameworks, would you write a multi-threaded web scraper / site parser?
php, nodejs, go, C++ ....
The task: for a specific user query, products have to be scraped simultaneously from third-party store sites - without crawling deep into each site, just the first results page.
That is, the user types in "jeans" and the server launches up to 50 parallel scripts or functions, each making 1-5 HTTP requests (several requests may be needed because of complex authorization on the site, captcha entry, etc.) plus other logic unique to each source site.
The information is then collected from all the threads and returned to the user as a single listing from all sites.
Key information will be cached for some time, but as a rule the system has to be ready for high load, when many users search for different words / phrases at the same time: jeans, jacket, shirt, etc.
When the server does not find the information in the cache, it has to re-scrape the third-party sites in parallel threads. In other words, if users request 50 different phrases at once, the server needs to spawn N parallel parser functions, each with its own logic.
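A minimal sketch of the cache-or-scrape part of this flow in Go, just to make the idea concrete (assumptions: the cache is a plain in-memory map with a TTL, and scrapeAll is a hypothetical stand-in for the real parallel per-site parsers):

```go
// Hypothetical sketch: serve a query from a TTL cache, or scrape on a miss.
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	items   []string
	expires time.Time
}

type Cache struct {
	mu   sync.Mutex
	data map[string]entry
	ttl  time.Duration
}

func NewCache(ttl time.Duration) *Cache {
	return &Cache{data: make(map[string]entry), ttl: ttl}
}

// GetOrScrape returns cached results for the query if they are still fresh,
// otherwise it calls scrape and stores the result for later requests.
func (c *Cache) GetOrScrape(query string, scrape func(string) []string) []string {
	c.mu.Lock()
	if e, ok := c.data[query]; ok && time.Now().Before(e.expires) {
		c.mu.Unlock()
		return e.items
	}
	c.mu.Unlock()

	items := scrape(query) // fan out to the per-site parsers here
	c.mu.Lock()
	c.data[query] = entry{items: items, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return items
}

func main() {
	cache := NewCache(5 * time.Minute)
	// scrapeAll is a placeholder for the real parallel scraping logic.
	scrapeAll := func(q string) []string { return []string{"result for " + q} }
	fmt.Println(cache.GetOrScrape("jeans", scrapeAll))
	fmt.Println(cache.GetOrScrape("jeans", scrapeAll)) // second call is served from cache
}
```

Note that this sketch does not deduplicate concurrent misses for the same query; under real load you would want something like a singleflight-style guard in front of the scrape.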
The choice of language for such a task comes down to the choice of ready-made libraries that you only need to wire together. The problem has been solved hundreds of times already; there is no point in reinventing the wheel. Almost any language widely used for web development has ready-made libraries for exactly this.
Definitely not C++, though: it would be slow to develop, expensive, and pointless - that language is meant for other tasks.
Another reasonable piece of advice is to use the language that the person who will actually write the code knows and likes best. And if that person doesn't know any language suitable for the task, then PHP, simply because it will be easier to pick up.
I would make it asynchronous rather than parallel, write it in Python, and store the information in some database, perhaps Postgres.
Go is a good fit for this, and 50 concurrent workers is nothing for it. Send everything into one channel, collect it there, and return it to the user.
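For illustration, a minimal sketch of that goroutine fan-out / channel fan-in pattern (assumptions: the URLs are placeholders, and parseSite only issues a single GET instead of the real per-site logic with authorization, captcha, etc.):

```go
// Hypothetical sketch: fan out one search query to several site-specific
// parsers and collect their results through a single channel.
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// Result is what each per-site parser returns for a query.
type Result struct {
	Site  string
	Items []string
	Err   error
}

// parseSite stands in for the per-site logic (auth, captcha, HTML parsing).
// Here it only issues one GET and reports the status code.
func parseSite(client *http.Client, site, query string) Result {
	resp, err := client.Get(site + "/search?q=" + query)
	if err != nil {
		return Result{Site: site, Err: err}
	}
	defer resp.Body.Close()
	return Result{Site: site, Items: []string{fmt.Sprintf("status %d", resp.StatusCode)}}
}

func search(query string, sites []string) []Result {
	client := &http.Client{Timeout: 10 * time.Second}
	out := make(chan Result, len(sites))
	var wg sync.WaitGroup

	// Fan out: one goroutine per donor site.
	for _, site := range sites {
		wg.Add(1)
		go func(site string) {
			defer wg.Done()
			out <- parseSite(client, site, query)
		}(site)
	}

	// Close the channel once every parser has reported back.
	go func() {
		wg.Wait()
		close(out)
	}()

	// Fan in: collect everything from the single channel.
	var results []Result
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	sites := []string{"https://example.com", "https://example.org"} // placeholder donor sites
	for _, r := range search("jeans", sites) {
		fmt.Println(r.Site, r.Items, r.Err)
	}
}
```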
I did site parsing by simulating a browser with JavaScript, tested it all on Linux servers with RabbitMQ; the setup ran for about half a year until I got tired of maintaining Chromium and Lazarus-IDE on the server side, with all the installation ...