Parsing
PRAIT, 2018-03-26 00:36:50

Data parsing: which language is more practical?

Greetings, everyone.
I will try to describe this as briefly and clearly as possible.
I need to write a full-fledged parser that collects information from sites according to categories specified by users.
Simple example 1: collect information from toster.ru about users whose nickname starts with the letter A: how many questions they solved, their percentage of solved questions, how many messages they left, under which tags they resolved the most questions, and so on.
Simple example 2: collect the average hourly rate of a PHP developer on a freelance site, restricted to Russia or Ukraine, along with the percentage of positive and negative reviews, and so on.
The actual question is about implementation: in which language is it most practical to write this script?
I heard somewhere that such projects are well suited to ASP.NET. For those who know: is that so?
Classic PHP: since the script will run on the server, I think it will take a long time to process multiple requests per second. Say 500 people start parsing every second over (хххх mb) of data. Is that a small load for PHP, and how long would the task take to complete?
Golang: how does Go handle such tasks?
Node.js and Java: I would love to hear your opinion.
Friends, please do not throw unflattering remarks my way. I have no experience with this, so I decided to ask you.
I do not quite understand what is best to use in these situations.
Requirements: speed, reliability, multithreading, and the ability to withstand a large number of requests per second.
Thanks to all! :)


7 answers
pantsarny, 2018-03-26
@pantsarny

You can do it in any of them.
PHP will work: there is curl_multi for parallel requests, plenty of code on the net, and php-fpm runs stably.

Danil Sapegin, 2018-03-26
@ynblpb_spb

It seems to me it is worth looking at JavaScript engines for parsers, such as PhantomJS and CasperJS.
With their help, extracting data from a page becomes many times easier.
Running these tools in parallel can then be implemented in any language, with the results stored in a convenient format in a database or elsewhere.

Stalker_RED, 2018-03-26
@Stalker_RED

It doesn't matter which language you choose: pick the one you know best right now, or the one you want to learn along the way. Either way, the bottleneck will not be the parser's speed but your network bandwidth. Or you will get banned for putting too much load on the server.

xmoonlight, 2018-03-26
@xmoonlight

Keep regular expressions to a minimum!
Beyond that, any language will do.

"Classic PHP: since the script will run on the server, I think it will take a long time to process multiple requests per second. Say 500 people start parsing every second over (хххх mb) of data. Is that a small load for PHP, and how long would the task take to complete?"
I would advise downloading everything GRADUALLY at first, so as not to strain the source site.
After that, you can calmly parse it on your own server.
Try HTTrack.
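
The "download gradually, then parse locally" advice could be sketched in Go roughly like this. It is only a minimal sketch under stated assumptions: downloadGradually, countPagesContaining, the stub fetcher and the example.com URLs are hypothetical names made up for illustration. A real crawler would pass an actual HTTP download as the fetch function and would probably write the pages to disk rather than keep them in a map.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// fetchFunc downloads one page. It is a parameter so the sketch runs
// without network access; a real crawler would wrap an HTTP client here.
type fetchFunc func(url string) (string, error)

// downloadGradually fetches the URLs one at a time, sleeping for pause
// between requests so the source site is not hit with a burst of traffic.
// Pages that fail to download are skipped.
func downloadGradually(urls []string, pause time.Duration, fetch fetchFunc) map[string]string {
	pages := make(map[string]string, len(urls))
	for i, u := range urls {
		if i > 0 {
			time.Sleep(pause)
		}
		body, err := fetch(u)
		if err != nil {
			continue
		}
		pages[u] = body
	}
	return pages
}

// countPagesContaining is the offline "parse" step: it only reads the
// stored copies and never touches the network.
func countPagesContaining(pages map[string]string, needle string) int {
	n := 0
	for _, body := range pages {
		if strings.Contains(body, needle) {
			n++
		}
	}
	return n
}

func main() {
	// Hypothetical stub standing in for a real download.
	stub := func(url string) (string, error) {
		return "<html>profile " + url + "</html>", nil
	}
	urls := []string{"https://example.com/user/a1", "https://example.com/user/a2"}
	pages := downloadGradually(urls, 10*time.Millisecond, stub)
	fmt.Println(len(pages), countPagesContaining(pages, "profile"))
}
```

Separating the two steps means the parsing logic can be rerun and fixed as often as needed without sending a single extra request to the source site.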

Mikhail Sisin, 2018-03-26
@JabbaHotep

Go is a great language and handles multithreading very well. If high loads and concurrent runs are planned, then of everything listed, only Go.
However, for examples 1 and 2 it is not clear how you would use multithreading effectively. First decide on the volumes: how many requests you will make, how often the datasets will be updated, and so on. After that you can choose the tool.
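
To make the concurrency point concrete, here is a rough sketch of a bounded worker pool in Go, where the number of workers caps how many requests are in flight at once. The names crawl and fetchFunc and the stub fetcher are hypothetical and exist only for this illustration; a real crawler would put an actual HTTP download behind the fetch function.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchFunc downloads one page. It is injected so the sketch runs
// without network access.
type fetchFunc func(url string) (string, error)

// crawl processes urls with a fixed number of worker goroutines and
// returns the downloaded bodies keyed by URL. Failed URLs are skipped.
func crawl(urls []string, workers int, fetch fetchFunc) map[string]string {
	jobs := make(chan string)
	results := make(map[string]string, len(urls))
	var mu sync.Mutex // guards results, which the workers share
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				body, err := fetch(u)
				if err != nil {
					continue
				}
				mu.Lock()
				results[u] = body
				mu.Unlock()
			}
		}()
	}

	// Feed the queue, then close it so the workers exit their loops.
	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	// Hypothetical stub standing in for a real download.
	stub := func(url string) (string, error) {
		return "page for " + url, nil
	}
	res := crawl([]string{"a", "b", "c"}, 2, stub)
	fmt.Println(len(res))
}
```

The worker count is the knob that the volume estimate should set: with only a handful of pages per user request, a pool like this adds little, which is the point made above.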

beduin01, 2018-03-26
@beduin01

D is a great language and handles multithreading very well. If high loads and concurrent runs are planned, then of everything listed, only D.

Yuri Paimurzin, 2020-01-16
@rusellsystems

I parsed sites with JavaScript through simulation and tested it all on Linux servers with RabbitMQ. The network ran for half a year, until I got tired of Chromium and Lazarus-IDE on the server side, with the installation ...
