Parsing
Andreda, 2017-11-15 18:28:53

What is the best way to implement a multi-threaded web scraper / site parser?

In what programming language, and with which libraries or frameworks, would you write a multi-threaded web scraper / site parser? PHP, Node.js, Go, C++, ...

The use case: in response to a specific user query, the server must simultaneously parse products from third-party store sites. No deep crawling is needed, only the first results page of each site. For example, the user types "jeans", and the server launches up to 50 parallel scripts or functions. Each one makes 1-5 HTTP requests (several requests may be needed because of complex authorization on a site, captcha entry, and so on) plus other logic unique to each donor site.
The results from all these workers are then aggregated and returned to the user.
Key information will be cached for some time, but in general the system must be ready for high load, when many users simultaneously search for different words and phrases: jeans, jacket, shirt, and so on.
Whenever the server finds nothing in the cache, it must re-parse the data from the third-party sites in parallel. In other words, if users request 50 different phrases at the same time, the server needs to spawn N parallel parser functions, each with its own site-specific logic.
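In any of the listed languages the core pattern is the same: fan the per-site parsers out onto a pool of workers, give the whole search a deadline, and merge whatever comes back. Below is a minimal Python sketch of that pattern using the stdlib `ThreadPoolExecutor`; the site list and the `parse_site` stub are invented for illustration (a real parser would do its own HTTP requests, authorization and captcha handling per site):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical per-site parser: in a real scraper each site would get its
# own function with its own 1-5 HTTP requests and site-specific logic.
def parse_site(site: str, query: str) -> list[dict]:
    time.sleep(0.01)  # stand-in for network latency
    return [{"site": site, "title": f"{query} item from {site}"}]

SITES = [f"shop{i}.example" for i in range(50)]  # up to 50 donor sites

def search(query: str, timeout: float = 5.0) -> list[dict]:
    results = []
    # One worker per site: all site parsers run in parallel.
    with ThreadPoolExecutor(max_workers=len(SITES)) as pool:
        futures = {pool.submit(parse_site, s, query): s for s in SITES}
        # as_completed yields results as they arrive, up to the deadline.
        for fut in as_completed(futures, timeout=timeout):
            try:
                results.extend(fut.result())
            except Exception:
                pass  # one slow or broken donor site must not kill the search
    return results

if __name__ == "__main__":
    items = search("jeans")
    print(len(items))  # one result per site
```

The aggregation step is just the loop over `as_completed`; a cache lookup would sit in front of `search`, and its result would be stored with a TTL afterwards.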


7 answers
Pavel Shvedov, 2017-11-15
@mmmaaak

go+goquery

Konstantin, 2017-11-15
@puchkovk

For a task like this, the choice of language comes down to the choice of ready-made libraries that you only need to wire together. The problem has been solved hundreds of times; there is no point in reinventing the wheel. Almost any language widely used for web development has ready-made libraries for exactly this.
Definitely not C++, though: that would be long, expensive and pointless, as it is a language for other kinds of tasks.
Another sensible approach is to pick the language that the person who will actually write the code knows and likes best. And if that person does not yet know any language suited to the task, pick PHP, simply because it will be the easiest to get started with.

Evgen, 2017-11-16
@Verz1Lka

python + scrapy.org

devalone, 2017-11-15
@devalone

I would make it asynchronous rather than parallel, write it in Python, and store the results in some database, probably PostgreSQL.

Alexander Pushkarev, 2017-11-15
@AXP-dev

Go is a good fit for this, and 50 concurrent workers is a very small number for it. Send everything into one channel and collect the results from there.

Emil Revencu, 2018-01-02
@Revencu

Python: Multithreading + Requests + LXML
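lxml itself is a third-party package, so for a self-contained illustration of the extraction step, here is the same idea with the stdlib `html.parser` instead; the markup and the `title`/`price` class names are invented for the example:

```python
from html.parser import HTMLParser

class ProductTitleParser(HTMLParser):
    """Collects the text of every <span class="title"> element."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles.append(data.strip())

if __name__ == "__main__":
    page = """
    <div class="item"><span class="title">Blue jeans</span><span class="price">49</span></div>
    <div class="item"><span class="title">Black jeans</span><span class="price">59</span></div>
    """
    p = ProductTitleParser()
    p.feed(page)
    print(p.titles)  # ['Blue jeans', 'Black jeans']
```

With lxml the same extraction typically collapses to one XPath call, e.g. `tree.xpath('//span[@class="title"]/text()')`, which is why the answer pairs it with Requests for speed.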
More RAM = More Threads

Yuri Paimurzin, 2020-01-16
@rusellsystems

I parsed JavaScript-heavy sites by simulating a real browser, and tested it all on Linux servers with RabbitMQ. The setup ran for half a year, until I got tired of maintaining Chromium and Lazarus-IDE on the server side, with the installation ...
