PHP
Ilya Parshakov, 2018-02-08 20:49:46

How to scrape data using Symfony DomCrawler?

Hello! I'm trying to get page parsing working in Laravel with Symfony DomCrawler and would appreciate some help. Not everything in the manual is clear to me. While googling I came across an article on a site that no longer exists; I managed to read part of it through a cached copy in Yandex.
Code example from that article:

/**
 * Get content from html.
 *
 * @param $parser object parser settings
 * @param $link string link to html page
 *
 * @return array with parsing data
 * @throws \Exception
 */
public function getContent($parser, $link)
{
    // Get html remote text.
    $html = file_get_contents($link);

    // Create new instance for parser.
    $crawler = new Crawler(null, $link);
    $crawler->addHtmlContent($html, 'UTF-8');

    // Get title text.
    $title = $crawler->filter($parser->settings->title)->text();

    // Get teaser text if a selector for it is configured; otherwise keep an empty string.
    $teaser = '';
    if (!empty(trim($parser->settings->teaser))) {
        $teaser = $crawler->filter($parser->settings->teaser)->text();
    }

    // Get images from page.
    $images = $crawler->filter($parser->settings->image)->each(function (Crawler $node, $i) {
        return $node->image()->getUri();
    });

    // Get body text.
    $bodies = $crawler->filter($parser->settings->body)->each(function (Crawler $node, $i) {
        return $node->html();
    });

    $content = [
        'link' => $link,
        'title' => $title,
        'images' => $images,
        'teaser' => strip_tags($teaser),
        'body' => $bodies
    ];

    return $content;
}

What is not clear to me here is the $parser argument that the getContent() method accepts.
What should it contain? From the method you can see that it is used as, for example, $parser->settings->teaser, and holds a selector to search for, but how is this object created?
If anyone has used this approach, I would appreciate your help.
Thanks in advance for your replies!

2 answer(s)
Valery, 2018-02-13
@Akuma

Why do you even need to know what the properties of this $parser object contain?
Just write your own selectors and that's it. Ordinary CSS selectors work (and :contains is supported too).
You tore the method out of the documentation but lost its context. It is only an example. Rewrite it your own way and the problem will go away by itself.
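
For illustration, here is a minimal sketch of what such a $parser object might look like, assuming it is nothing more than a plain object whose settings property holds one selector per field. The helper name makeParser() and every selector string are hypothetical; the original article's real object is not known, so treat this as one possible shape rather than the author's implementation.

```php
<?php
// Hypothetical helper: builds the object that getContent() reads its selectors from.
// The default selectors are made-up examples; replace them with ones matching the target site.
function makeParser(array $selectors): stdClass
{
    $parser = new stdClass();

    // Casting an associative array to (object) gives a stdClass with the keys as properties.
    // Values passed in $selectors override the defaults with the same key.
    $parser->settings = (object) array_merge([
        'title'  => 'h1',
        'teaser' => '',          // empty string means "no teaser selector configured"
        'image'  => 'img',
        'body'   => 'article p',
    ], $selectors);

    return $parser;
}

$parser = makeParser([
    'title'  => 'h1.post-title',
    'teaser' => 'div.lead',
]);

echo $parser->settings->title, PHP_EOL;  // h1.post-title
echo $parser->settings->body, PHP_EOL;   // article p
```

An object built this way could then be passed as the first argument of getContent(), with the page URL as the second.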

UksusoFF, 2018-02-08
@UksusoFF

Most likely the CSS/XPath selectors there are specific to a particular site. The documentation describes this in some detail.
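
To show what "specific to a particular site" means in practice, here is a small sketch that runs XPath expressions against a made-up HTML fragment. It deliberately uses PHP's built-in DOMDocument and DOMXPath as a stand-in so that it runs without Composer; it is not DomCrawler itself. In DomCrawler the same XPath strings would go to filterXPath(), and the CSS forms noted in the comments would go to filter(). The markup and class names are assumptions for the example only.

```php
<?php
// Made-up page fragment standing in for a real site's markup.
$html = '<html><body><h1 class="post-title">Hello</h1>'
      . '<div class="lead"><p>Intro text</p></div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// Roughly what the CSS selector "h1.post-title" would match on this page.
$titleNodes = $xpath->query('//h1[@class="post-title"]');
$title = $titleNodes->item(0)->textContent;

// Roughly what the CSS selector "div.lead p" would match on this page.
$teaserNodes = $xpath->query('//div[@class="lead"]//p');
$teaser = $teaserNodes->item(0)->textContent;

echo $title, PHP_EOL;   // Hello
echo $teaser, PHP_EOL;  // Intro text
```

A different site would use different tags and class names, so the selectors have to be worked out per site by inspecting its markup.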
