The best language for web scraping
For the last few weeks I have been writing a web scraper in PHP. I had my doubts before, but after reading this article and the comments to it, I became convinced that I need to look for another way, or rather, another language.
Which programming language (plus framework/library) is, in your opinion, best suited to the task of parsing web pages?
I will be very grateful for reasoned answers, and even more so for links to articles on using a language for this purpose and/or to repositories of relevant projects.
A little about the specific problem I am working on: there are 50+ independent sites (manufacturers of certain types of products) from which I need to collect a database of their products. The scraper has to run not just once but at least daily, or whenever new products appear (which also means adding code when new products introduce new attributes). Because the number of sites is large and will only grow over time, the ability to scale is necessary. At the same time, unifying all product parameters is extremely important.
As soon as the word "parsing" comes up, the first thing worth remembering is Perl (Practical Extraction and Report Language); its very name points at this kind of task.
Half of my work experience is writing spiders and web scrapers.
I also wrote them in PHP + cURL, then in bare Python + threads.
Then I learned about Scrapy (an asynchronous framework for crawling websites in Python) and implemented about 5 independent projects on it, including one that aggregates and periodically updates information from 20 different forums. Its main limitation is that it is asynchronous but single-threaded, so a Scrapy process cannot load more than one core and cannot afford long blocking database queries. Other than that, it is a very good framework.
Then I made several spiders in Python using Celery.
And quite recently I rewrote a rather heavily loaded spider (50-70 Mbit/s through proxy lists) from Python + Celery to Erlang and realized that this is IT! Not only did it start working 2-3 times faster, I also realized that it is hard to come up with anything more suitable for this task.
Let me explain: with an Erlang spider you can, for example, change the number of worker processes, update the code, and reload the configs without stopping the process. You can profile the code on the fly to find out why the speed has dropped or what is eating so much CPU. You can combine green threads, asynchronous networking, and long database queries. And all of this comes out of the box. On top of that, the code ends up more logical.
PHP has XPath, there are libraries like phpQuery, and so on. You can also issue several requests at the same time via multi cURL, so one way or another you can organize it all. Python and any other language offer all the same things and more, so any language you already know will do for this task.
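A minimal sketch of that multi cURL idea, with hypothetical URLs: several pages are fetched concurrently, then each result is parsed with DOMDocument + DOMXPath.
<?php
// Fetch several pages in parallel with curl_multi, then parse each one.
// The URLs below are hypothetical examples.
$urls = array(
    'https://example.com/catalog?page=1',
    'https://example.com/catalog?page=2',
);

$mh = curl_multi_init();
$handles = array();
foreach ( $urls as $url ) {
    $ch = curl_init( $url );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
    curl_multi_add_handle( $mh, $ch );
    $handles[ $url ] = $ch;
}

// Drive all transfers until they complete.
do {
    $status = curl_multi_exec( $mh, $running );
    if ( $running ) {
        curl_multi_select( $mh ); // wait for activity instead of busy-looping
    }
} while ( $running && $status == CURLM_OK );

foreach ( $handles as $url => $ch ) {
    $html = curl_multi_getcontent( $ch );
    $doc = new DOMDocument();
    @$doc->loadHTML( $html ); // suppress warnings caused by real-world markup
    $xpath = new DOMXPath( $doc );
    foreach ( $xpath->query( '//a[@href]' ) as $a ) {
        echo $url, ' -> ', $a->getAttribute( 'href' ), "\n";
    }
    curl_multi_remove_handle( $mh, $ch );
    curl_close( $ch );
}
curl_multi_close( $mh );
?>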
The choice of language for this task is a tertiary concern; it comes after parsing the HTML and finding the key elements.
Since that last task is the most important one, you need to focus on it. It usually comes down to a bunch of regular expressions plus a certain controller that handles variations and exceptions, because regular expressions are not an ideal tool for this job.
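A minimal sketch of that "regexes plus a controller" approach, with hypothetical site keys and patterns: each site gets its own set of patterns, and a small wrapper tries them and reports failures instead of crashing the whole run.
<?php
// Per-site regular expressions plus a small "controller" that handles
// variations and exceptions. Site keys and patterns are hypothetical.
$patterns = array(
    'site-a' => array(
        'title' => '~<h1[^>]*>(.*?)</h1>~su',
        'price' => '~<span class="price">\s*([\d\s.,]+)\s*</span>~u',
    ),
    'site-b' => array(
        'title' => '~<title>(.*?)\s*\|~su',
        'price' => '~data-price="(\d+)"~',
    ),
);

function extract_field( $html, array $sitePatterns, $field ) {
    if ( ! isset( $sitePatterns[ $field ] ) ) {
        return null; // this site does not expose the field at all
    }
    if ( preg_match( $sitePatterns[ $field ], $html, $m ) ) {
        return trim( $m[1] );
    }
    return null; // no match: the layout changed or the field is missing
}

$html  = file_get_contents( 'product-page.html' ); // hypothetical saved page
$title = extract_field( $html, $patterns['site-a'], 'title' );
$price = extract_field( $html, $patterns['site-a'], 'price' );

if ( $price === null ) {
    // The "exceptions" part: log and move on instead of failing the run.
    error_log( 'site-a: price pattern failed, layout may have changed' );
}
?>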
I wrote a similar aggregator system (for books) in PHP, but only because of WP.
What do you think is the best programming language for the task <any_task>?
The one you know best.
For such a task, a couple of years ago I used Perl together with CPAN libraries (in particular HTML::Parser), because I really liked how it works with regexps. You can see something similar here or here.
PHP has DOM and XPath support, and regular expressions, of course. What else do you need =)
Perhaps you will find this test snippet useful; I used it to look for advertising blocks on Yandex pages.
<?php
// Load a locally saved Yandex page and normalize it to valid XML with Tidy,
// so that it can be handled by SimpleXML.
$s = file_get_contents( 'yandex.html' );
$tidy = new tidy();
$tidy->parseString( $s, array(
    'output-xml'       => true,
    'clean'            => true,
    'numeric-entities' => true
), 'utf8' );
$tidy->cleanRepair();
$xml = simplexml_load_string( tidy_get_output( $tidy ) );

// Probe 1: advertising blocks by their CSS class.
$adwords = $xml->xpath( '//*[@class="b-adv"]' );
var_dump( $adwords );
exit;

// Probe 2 (unreachable because of the exit above; swap probes as needed):
// the container with id="tads".
$tads = $xml->xpath( '//*[@id="tads"]' );
var_dump( $tads );
exit;

// Probe 3 (also unreachable by default): print pagination links.
$a = $xml->xpath( '//a[@href]' );
array_walk( $a, function( $item ) {
    $href = $item->attributes()->href;
    if ( strpos( $href, 'start=' ) !== false )
        echo $href."\n";
} );
?>
You can choose whatever you like best without leaving PHP. Beyond that, there are also SHD and phpQuery.
For parsing sites I used Qt together with QtWebKit. It conveniently exposes the entire DOM, and you can also pull out the current sizes and coordinates of frames and of page elements in general.