HTML
Mixa, 2016-04-07 12:32:23

Parse / scrape web pages without the garbage?

Recently a lot of lazy-reading services have appeared that scrape site content straight from the page (not from feeds), neatly stripping out everything superfluous and leaving only cleanly marked-up text, without any spans, font sizes and the like, plus the images. For example, https://getpocket.com/
Question: has anyone come across publicly available scripts that can do this, which you could hook into your own project to "pull in" pages for yourself? ;)


9 answer(s)
Eugene 222, 2016-04-08
@mik222

This task is called data region mining, and it is a rather tricky problem: the layout can differ from site to site, and you are solving the problem of finding the main content on the page (i.e. cropping out ads, navigation blocks, side inserts, hidden content, etc.).
Here is an algorithm for you:

1. For each HTML node in the tree, compute its area (render the page with phantom.js and compute the area via Element.getBoundingClientRect()).
2. Remove everything smaller than the average area at that level (this clears out the insignificant blocks).
3. Descend one level down and repeat the algorithm.

As a result, you get the set of text blocks that occupy the most area on the page.
You will need to tune the algorithm empirically for your use case:
for example, if a region in front of you contains a large number of text blocks, take the text from all of its children and put it into the region (this way we avoid cutting out bold or italic text).
After that, combining these regions into an article (or articles, in the case of a feed) is up to you.
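The pruning steps above can be sketched in Python, assuming the per-node areas were already measured in a headless browser via Element.getBoundingClientRect(); the Node class and function names here are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    tag: str
    area: float                      # width * height from getBoundingClientRect()
    children: List["Node"] = field(default_factory=list)

def prune_by_area(node: Node) -> Node:
    """Drop children smaller than the average area at this level,
    then repeat one level down (steps 2 and 3 of the algorithm)."""
    if node.children:
        avg = sum(c.area for c in node.children) / len(node.children)
        node.children = [prune_by_area(c) for c in node.children
                         if c.area >= avg]
    return node

# A tiny tree: a big content block next to small nav/ad blocks
root = Node("body", 1000, [Node("nav", 50), Node("div", 800), Node("aside", 100)])
prune_by_area(root)
print([c.tag for c in root.children])  # ['div']
```

After pruning, the surviving subtrees are the large blocks the answer refers to.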
--------
There are also more interesting algorithms that compute pairwise similarity between arbitrary child nodes in order to find the data region,
but for those you need to read the published papers on the topic, for example:
dl.acm.org/citation.cfm?id=1060761

Alexey Cheremisin, 2016-04-07
@leahch

Of course there is: lxml.de/lxmlhtml.html#cleaning-up-html
From the cleaned-up document you can then select whatever you need.
And yes, it is better to fetch the pages with the requests library: docs.python-requests.org/en/master
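A minimal sketch of the cleaner from the link above; the Cleaner options shown are a small subset of what the library supports (on recent lxml versions the cleaner lives in the separate lxml_html_clean package), and the sample HTML is made up:

```python
import requests  # recommended above for the real fetch step
from lxml.html.clean import Cleaner

# html = requests.get("https://example.com/article").text  # real fetch
html = "<html><body><script>evil()</script><p style='color:red'>Hello</p></body></html>"

# Strip scripts, embedded javascript, styles, and comments
cleaner = Cleaner(scripts=True, javascript=True, style=True, comments=True)
clean = cleaner.clean_html(html)
# the <script> block is gone; the paragraph text survives
```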

ThunderCat, 2016-04-07
@ThunderCat

Handmade piecework: each site gets its own little bicycle.
Well, not so much a separate bicycle as different wheels screwed onto the same one.

xmoonlight, 2016-04-07
@xmoonlight

Well, in short, this is the task of finding the MAIN content of the page.
1. Remove all containers with more than one child element.
2. Clean the body container of all tags except container tags (div, td).
3. Find the container (div, td) with the longest text.
4. Feel free to scrape it.
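A quick sketch of step 3 above with lxml (the sample HTML and ids are made up); it is deliberately a crude heuristic:

```python
from lxml import html

page = """<html><body>
  <div id="nav">Home About Contact</div>
  <div id="content">A long article body with far more text than any
  other block on the page, which is what this heuristic latches onto.</div>
  <div id="footer">(c) 2016</div>
</body></html>"""

doc = html.fromstring(page)
# Step 3: among the container tags, take the one with the longest text
best = max(doc.iter("div"), key=lambda el: len(el.text_content()))
print(best.get("id"))  # content
```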

Mikhail S, 2016-04-07
@sokolov86

JavaScript https://github.com/mozilla/readability

Eugene, 2016-05-04
@beatleboy

Apist is a great thing! It lets you parse pages easily, addressing elements in jQuery style. An example of parsing Habr:

public function index()
{
  return $this->get('/', [
    'title' => Apist::filter('.page_head .title')->text()->trim(),
    'posts' => Apist::filter('.posts .post')->each([
      'title'      => Apist::filter('h1.title a')->text(),
      'link'       => Apist::filter('h1.title a')->attr('href'),
      'hubs'       => Apist::filter('.hubs a')->each(Apist::filter('*')->text()),
      'author'     => [
        'username'     => Apist::filter('.author a')->text(),
        'profile_link' => Apist::filter('.author a')->attr('href'),
        'rating'       => Apist::filter('.author .rating')->text()
      ]
    ])
  ]);
}

Returns the data as an array (shown here encoded as JSON):
{
    "title": "Публикации",
    "posts": [
        {
            "title": "Проверьте своего хостера на уязвимость Shellshock (часть 2)",
            "link": "http:\/\/habrahabr.ru\/company\/host-tracker\/blog\/240389\/",
            "hubs": [
                "Блог компании ХостТрекер",
                "Серверное администрирование",
                "Информационная безопасность"
            ],
            "author": {
                "username": "smiHT",
                "profile_link": "http:\/\/habrahabr.ru\/users\/smiHT\/",
                "rating": "26,9"
            }
        },
        {
            "title": "Курсы этичного хакинга и тестирования на проникновение от PentestIT",
            "link": "http:\/\/habrahabr.ru\/company\/pentestit\/blog\/240995\/",
            "hubs": [
                "Блог компании PentestIT",
                "Учебный процесс в IT",
                "Информационная безопасность"
            ],
            "author": {
                "username": "pentestit-team",
                "profile_link": "http:\/\/habrahabr.ru\/users\/pentestit-team\/",
                "rating": "36,4"
            }
        },
        ...
    ]
}

More details here

Alexander Taratin, 2016-04-07
@Taraflex

php-readability. Which port to choose?
https://github.com/masukomi/ar90-readability

Vladimir Proskurin, 2016-04-07
@Vlad_IT

For Python 3 I used https://pypi.python.org/pypi/newspaper. It extracts only the page content, and on most sites with a normal layout it works fine.

KkJ, 2016-04-08
@KkJ

There is plenty of tooling for this, e.g. Scrapy.
