What language will parse faster?
I have a list of 5k sites that is submitted through a web form to a PHP script for processing. Each line has the form URL:LOGIN:PASS. The script logs in to the site and checks the user's rights: it loads 3 pages (using the cURL library), builds a DOM from one of them with the simple_html_dom parser, and writes the result to a file (or to a database). In PHP this takes a very long time, and the script is killed when it exceeds its execution time limit. What is the best way, and in which language, to speed up the check? Or is it at least possible to do multithreading in PHP? Please advise or point me to something to read. Thanks in advance.
Basically any language will do; the main thing is to use multithreading.
Network delays are many times longer than the parsing time.
Other than running the checks on 100+ sites at the same time, there is no way to "speed it up".
For Python, Scrapy can do this out of the box.
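To illustrate the "many sites at the same time" idea, here is a minimal sketch using only the Python standard library rather than Scrapy. The names check_site, check_all and fake_fetch are hypothetical, and fake_fetch only sleeps in place of the real login and page loads, so treat this as an outline of the approach, not a working checker.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def check_site(line, fetch):
    """Split one URL:LOGIN:PASS line and run the (hypothetical) check."""
    # rsplit from the right keeps the colons inside the URL ("http://", port).
    # Assumption: neither login nor password contains a colon.
    url, login, password = line.rsplit(":", 2)
    return url, fetch(url, login, password)


def check_all(lines, fetch, workers=100):
    """Run check_site for every line on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input lines.
        return list(pool.map(lambda line: check_site(line, fetch), lines))


def fake_fetch(url, login, password):
    """Stand-in for the real login + page loads: waits 50 ms, no network."""
    time.sleep(0.05)
    return login == "admin"


if __name__ == "__main__":
    lines = [f"http://site{i}.example:admin:secret" for i in range(200)]
    started = time.perf_counter()
    results = check_all(lines, fake_fetch)
    elapsed = time.perf_counter() - started
    print(len(results), "sites checked in", round(elapsed, 2), "s")
```

Threads fit here because each worker spends almost all of its time waiting on the network, so the waits overlap instead of adding up.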
When you process remote resources, the stability and quality of the connection, its throughput, and the performance of the remote servers become critical. Suppose we write the most efficient site parser possible in C, one that chews through a page in, say, 1 ms, and the same parser in some heavyweight Python that takes, say, 15 ms. Those numbers are nothing compared to the time spent connecting and downloading the document: 100 ms to connect, plus 1 Mbit at 10 Mbit/s, for a total of 200 ms just to get a document that may still arrive with errors or not arrive at all. On top of that, the remote server also needs time to process the request.
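The arithmetic above can be worked through in a few lines. The 100 ms connect time, the 1 Mbit document, the 10 Mbit/s link and the 1 ms / 15 ms parse times come from the answer; the 5k sites and 3 pages come from the question; the concurrency of 100 and the assumption that concurrency scales perfectly are idealizations for illustration only.

```python
CONNECT_MS = 100                 # connection setup
TRANSFER_MS = 1 / 10 * 1000      # 1 Mbit over a 10 Mbit/s link = 100 ms
NETWORK_MS = CONNECT_MS + TRANSFER_MS


def total_ms(parse_ms):
    """Time for one document: network time plus parse time."""
    return NETWORK_MS + parse_ms


def batch_seconds(sites, pages_per_site, parse_ms, concurrency):
    """Idealized batch time, assuming perfect scaling with concurrency."""
    return sites * pages_per_site * total_ms(parse_ms) / concurrency / 1000


if __name__ == "__main__":
    print("C parser, per document:     ", total_ms(1), "ms")
    print("Python parser, per document:", total_ms(15), "ms")
    print("Sequential, C parser:       ", batch_seconds(5000, 3, 1, 1), "s")
    print("Sequential, Python parser:  ", batch_seconds(5000, 3, 15, 1), "s")
    print("100 at once, Python parser: ", batch_seconds(5000, 3, 15, 100), "s")
```

Under these assumptions, moving from the 15 ms parser to the 1 ms parser saves about 7% per document, while running 100 requests at once cuts the batch from roughly 54 minutes to roughly half a minute.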
So what matters most is loading documents as asynchronously as possible. Processing can take as long as loading does, because loading is the bottleneck and nothing can be processed faster than it arrives. One way out is to launch parallel processes (threads) against different resources, but do not overdo it: your bandwidth is not unlimited, the quality of the connection can drop many times over, and the system has hard limits on the number of simultaneous connections.
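One common way to load asynchronously without overloading the link is to cap the number of simultaneous requests with a semaphore. The sketch below uses Python's asyncio. crawl, fetch_limited and FakeFetcher are hypothetical names, the limit of 20 is an arbitrary assumption, and FakeFetcher only sleeps instead of touching the network.

```python
import asyncio


async def fetch_limited(url, fetch, limit):
    """Run one fetch while holding a slot in the shared semaphore."""
    async with limit:
        return await fetch(url)


async def crawl(urls, fetch, max_connections=20):
    """Fetch all URLs, never running more than max_connections at once."""
    limit = asyncio.Semaphore(max_connections)
    tasks = [fetch_limited(url, fetch, limit) for url in urls]
    # gather returns results in the same order as urls.
    return await asyncio.gather(*tasks)


class FakeFetcher:
    """Stand-in for a real HTTP client; records the peak concurrency."""

    def __init__(self, delay=0.01):
        self.delay = delay
        self.active = 0
        self.peak = 0

    async def __call__(self, url):
        self.active += 1
        self.peak = max(self.peak, self.active)
        try:
            await asyncio.sleep(self.delay)  # simulated network latency
            return f"<html>{url}</html>"
        finally:
            self.active -= 1


if __name__ == "__main__":
    fetcher = FakeFetcher()
    urls = [f"http://site{i}.example/" for i in range(100)]
    pages = asyncio.run(crawl(urls, fetcher, max_connections=20))
    print(len(pages), "pages, peak concurrency", fetcher.peak)
```

The right value for max_connections depends on your bandwidth and on what the remote servers tolerate, so it is worth tuning by measurement rather than guessing.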
dlang.org
You don't need curl: there is code.dlang.org/packages/requests
After C#, Go will definitely make you spit. It's like switching to Basic.
Any language will do. I like JavaScript for this, namely Node.js + cheerio (which lets you use jQuery-style selectors in the parser) + Promise (convenient for managing the parser's concurrent tasks).