The best language for web scraping
For the last few weeks I have been writing a web scraper in PHP. I had my doubts before, but after reading this article and the comments to it, I became convinced that I need to look for another way, or rather, another language.
Which programming language (plus framework/library) is, in your opinion, best suited to the task of parsing web pages?
I will be very grateful for reasoned answers, and even more so for links to articles on using a language for this purpose and/or to repositories of relevant projects.
A little about the specific problem I am working on: there are 50+ independent sites (manufacturers of certain types of products) from which I need to collect a database of their products. The scraper has to run not just once but at least daily, or whenever new products appear (which also means adding code when new products introduce new attributes). Because the number of sites is large and will only grow over time, the ability to scale is necessary. At the same time, unifying all product parameters is extremely important.
As soon as the word "parsing" comes up, the first thing worth remembering is Perl (Practical Extraction and Report Language); its very name points at this kind of task.
Half of my work experience is writing spiders and web scrapers.
I also wrote them in PHP + cURL, then in bare Python + threads.
Then I learned about Scrapy (an asynchronous framework for crawling websites in Python) and implemented about 5 independent projects on it, including one that aggregates and periodically updates information from 20 different forums. Its main limitation is that it is asynchronous but single-threaded, so a Scrapy process cannot load more than one core and cannot afford long blocking database queries. Other than that, it is a very good framework.
Then I made several spiders in Python using Celery.
And quite recently I rewrote a rather heavily loaded spider (50-70 Mbit/s through proxy lists) from Python + Celery to Erlang and realized that this is IT! Not only did it start working 2-3 times faster, I also realized that it is hard to come up with anything more suitable for this task.
Let me explain: with an Erlang spider you can, for example, change the number of worker processes, update the code, and reload the configs without stopping the process. You can profile the code on the fly to find out why the speed has dropped or what is eating so much CPU. You can combine green threads, asynchronous networking, and long database queries. And all of this comes out of the box. On top of that, the code ends up more logical.
PHP has XPath, there are libraries like phpQuery, and so on. You can also issue several requests at the same time via multi cURL, so one way or another you can organize it all. Python and any other language offer all the same things and more, so any language you already know will do for this task.
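A minimal sketch of that multi cURL idea, with hypothetical URLs: several pages are fetched concurrently, then each result is parsed with DOMDocument + DOMXPath.
<?php
// Fetch several pages in parallel with curl_multi, then parse each one.
// The URLs below are hypothetical examples.
$urls = array(
    'https://example.com/catalog?page=1',
    'https://example.com/catalog?page=2',
);

$mh = curl_multi_init();
$handles = array();
foreach ( $urls as $url ) {
    $ch = curl_init( $url );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
    curl_multi_add_handle( $mh, $ch );
    $handles[ $url ] = $ch;
}

// Drive all transfers until they complete.
do {
    $status = curl_multi_exec( $mh, $running );
    if ( $running ) {
        curl_multi_select( $mh ); // wait for activity instead of busy-looping
    }
} while ( $running && $status == CURLM_OK );

foreach ( $handles as $url => $ch ) {
    $html = curl_multi_getcontent( $ch );
    $doc = new DOMDocument();
    @$doc->loadHTML( $html ); // suppress warnings caused by real-world markup
    $xpath = new DOMXPath( $doc );
    foreach ( $xpath->query( '//a[@href]' ) as $a ) {
        echo $url, ' -> ', $a->getAttribute( 'href' ), "\n";
    }
    curl_multi_remove_handle( $mh, $ch );
    curl_close( $ch );
}
curl_multi_close( $mh );
?>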
The choice of language for this task is a tertiary concern; it comes after parsing the HTML and finding the key elements.
Since that last task is the most important one, you need to focus on it. It usually comes down to a bunch of regular expressions plus a certain controller that handles variations and exceptions, because regular expressions are not an ideal tool for this job.
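A minimal sketch of that "regexes plus a controller" approach, with hypothetical site keys and patterns: each site gets its own set of patterns, and a small wrapper tries them and reports failures instead of crashing the whole run.
<?php
// Per-site regular expressions plus a small "controller" that handles
// variations and exceptions. Site keys and patterns are hypothetical.
$patterns = array(
    'site-a' => array(
        'title' => '~<h1[^>]*>(.*?)</h1>~su',
        'price' => '~<span class="price">\s*([\d\s.,]+)\s*</span>~u',
    ),
    'site-b' => array(
        'title' => '~<title>(.*?)\s*\|~su',
        'price' => '~data-price="(\d+)"~',
    ),
);

function extract_field( $html, array $sitePatterns, $field ) {
    if ( ! isset( $sitePatterns[ $field ] ) ) {
        return null; // this site does not expose the field at all
    }
    if ( preg_match( $sitePatterns[ $field ], $html, $m ) ) {
        return trim( $m[1] );
    }
    return null; // no match: the layout changed or the field is missing
}

$html  = file_get_contents( 'product-page.html' ); // hypothetical saved page
$title = extract_field( $html, $patterns['site-a'], 'title' );
$price = extract_field( $html, $patterns['site-a'], 'price' );

if ( $price === null ) {
    // The "exceptions" part: log and move on instead of failing the run.
    error_log( 'site-a: price pattern failed, layout may have changed' );
}
?>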
I wrote a similar aggregator system (for books) in PHP, but only because of WP.
What do you think is the best programming language for the task <any_task>?
The one you know best.
For such a task, a couple of years ago I used Perl together with CPAN libraries (in particular HTML::Parser), because I really liked how it works with regexps. You can see something similar here or here.
PHP has DOM and XPath support, and regular expressions, of course. What else do you need =)
Perhaps you will find this test snippet useful; I used it to look for advertising blocks on Yandex pages.
<?php
// Load a locally saved Yandex page and normalize it to valid XML with Tidy,
// so that it can be handled by SimpleXML.
$s = file_get_contents( 'yandex.html' );
$tidy = new tidy();
$tidy->parseString( $s, array(
    'output-xml'       => true,
    'clean'            => true,
    'numeric-entities' => true
), 'utf8' );
$tidy->cleanRepair();
$xml = simplexml_load_string( tidy_get_output( $tidy ) );

// Probe 1: advertising blocks by their CSS class.
$adwords = $xml->xpath( '//*[@class="b-adv"]' );
var_dump( $adwords );
exit;

// Probe 2 (unreachable because of the exit above; swap probes as needed):
// the container with id="tads".
$tads = $xml->xpath( '//*[@id="tads"]' );
var_dump( $tads );
exit;

// Probe 3 (also unreachable by default): print pagination links.
$a = $xml->xpath( '//a[@href]' );
array_walk( $a, function( $item ) {
    $href = $item->attributes()->href;
    if ( strpos( $href, 'start=' ) !== false )
        echo $href."\n";
} );
?>
You can choose whatever you like best without leaving PHP. Beyond that, there are also SHD and phpQuery.
For parsing sites I used Qt together with QtWebKit. It conveniently exposes the entire DOM, and you can also pull out the current sizes and coordinates of frames and of page elements in general.