Parse / "rob" web pages without the garbage?
Recently, many read-it-later services have appeared that "rob" site content straight from the page (not from feeds), neatly clearing away everything superfluous and leaving only cleanly marked-up text (no spans, font sizes, etc.) plus the images. For example, https://getpocket.com/
Question: has anyone come across publicly available scripts that can do this, something you could wire into your own project to "suck in" pages for yourself? ;)
This task is called data region mining, and it is a rather tricky problem: the layout differs from site to site, and you are essentially solving the problem of finding the main content of the page (i.e. cropping out ads, navigation blocks, side inserts, hidden content, etc.).
Here is an algorithm for you (a code sketch follows the list):
1. For every HTML node in the tree, compute its area (render the page with phantom.js and get the area from Element.getBoundingClientRect()).
2. Remove everything smaller than the average area at that level (this clears out the blocks that don't matter).
3. Go down one level and repeat.
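A minimal sketch of this area-pruning pass in Python. Playwright stands in for PhantomJS (which is no longer maintained); the algorithm itself is unchanged, the target URL is hypothetical, and the below-average cutoff is taken straight from step 2:

# Area-based pruning: measure each child of a node, drop children whose
# rendered area is below the average at that level, then recurse.
# Uses Playwright instead of the deprecated PhantomJS (pip install playwright).
from playwright.sync_api import sync_playwright

def prune_level(node):
    children = node.query_selector_all("xpath=./*")
    boxes = [(c, c.bounding_box()) for c in children]
    boxes = [(c, b) for c, b in boxes if b]            # skip invisible nodes
    if not boxes:
        return
    avg = sum(b["width"] * b["height"] for _, b in boxes) / len(boxes)
    for child, box in boxes:
        if box["width"] * box["height"] < avg:
            child.evaluate("el => el.remove()")        # below average: drop it
        else:
            prune_level(child)                         # descend one level

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/article")           # hypothetical URL
    prune_level(page.query_selector("body"))
    print(page.inner_text("body"))
    browser.close()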
Of course there is: lxml.de/lxmlhtml.html#cleaning-up-html
Then you just pick out what you need from the cleaned-up HTML.
Yes, and it's better to fetch the pages with the requests library: docs.python-requests.org/en/master
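A small sketch of that combination: fetch the page with requests, then strip scripts, styles and other noise with lxml's Cleaner. The URL is hypothetical (note that in recent lxml versions the Cleaner lives in the separate lxml_html_clean package):

# Fetch a page with requests, then strip scripts, styles, comments,
# forms and embedded objects with lxml's Cleaner.
import requests
from lxml.html.clean import Cleaner

html_source = requests.get("https://example.com/article").text  # hypothetical URL
cleaner = Cleaner(scripts=True, javascript=True, style=True,
                  comments=True, embedded=True, forms=True)
print(cleaner.clean_html(html_source))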
It's handmade piecework: each site needs its own little bicycle.
Well, not a whole new bicycle; rather, different wheels get bolted onto the same one.
Well, in short: this is the task of finding the MAIN content of the page (a rough code sketch follows the list).
1. Remove all containers with more than one child element.
2. Strip the body container of all tags except container tags (div, td).
3. Find the container (div or td) with the longest text.
4. Feel free to rob it.
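A minimal sketch of the longest-text heuristic (steps 3-4) with requests and lxml. The URL is hypothetical, and as an assumption each container's text length is counted without its nested div/td children, so the outermost wrapper does not automatically win:

# Longest-text heuristic: the div/td whose own text (excluding nested
# containers) is longest is most likely the main content block.
import requests
from lxml import html

def own_text_len(el):
    # Total text length minus the text of direct div/td children.
    total = len(el.text_content())
    nested = sum(len(c.text_content()) for c in el
                 if isinstance(c.tag, str) and c.tag in ("div", "td"))
    return total - nested

doc = html.fromstring(requests.get("https://example.com/article").text)  # hypothetical URL
best = max(doc.iter("div", "td"), key=own_text_len, default=None)
if best is not None:
    print(best.text_content().strip())  # the "loot": the main content text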
Apist is a great thing! It lets you parse pages easily, addressing elements jQuery-style. An example that parses Habr:
// Define an API method: GET the front page and map CSS selectors to result fields.
public function index()
{
return $this->get('/', [
'title' => Apist::filter('.page_head .title')->text()->trim(),
'posts' => Apist::filter('.posts .post')->each([
'title' => Apist::filter('h1.title a')->text(),
'link' => Apist::filter('h1.title a')->attr('href'),
'hubs' => Apist::filter('.hubs a')->each(Apist::filter('*')->text()),
'author' => [
'username' => Apist::filter('.author a')->text(),
'profile_link' => Apist::filter('.author a')->attr('href'),
'rating' => Apist::filter('.author .rating')->text()
]
])
]);
}
The resulting JSON:

{
"title": "Публикации",
"posts": [
{
"title": "Проверьте своего хостера на уязвимость Shellshock (часть 2)",
"link": "http:\/\/habrahabr.ru\/company\/host-tracker\/blog\/240389\/",
"hubs": [
"Блог компании ХостТрекер",
"Серверное администрирование",
"Информационная безопасность"
],
"author": {
"username": "smiHT",
"profile_link": "http:\/\/habrahabr.ru\/users\/smiHT\/",
"rating": "26,9"
}
},
{
"title": "Курсы этичного хакинга и тестирования на проникновение от PentestIT",
"link": "http:\/\/habrahabr.ru\/company\/pentestit\/blog\/240995\/",
"hubs": [
"Блог компании PentestIT",
"Учебный процесс в IT",
"Информационная безопасность"
],
"author": {
"username": "pentestit-team",
"profile_link": "http:\/\/habrahabr.ru\/users\/pentestit-team\/",
"rating": "36,4"
}
},
...
]
}
php-readability, a port of Arc90's Readability. If you're wondering which port to choose:
https://github.com/masukomi/ar90-readability
For Python 3 I used https://pypi.python.org/pypi/newspaper. It extracts just the page's main content, and on most sites with a normal layout it works fine.
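A minimal usage sketch, assuming the Python 3 fork installed as newspaper3k and a hypothetical article URL:

# Extract the main article text and title with newspaper (pip install newspaper3k).
from newspaper import Article

article = Article("https://example.com/some-article")  # hypothetical URL
article.download()
article.parse()

print(article.title)
print(article.text)       # main content only, boilerplate stripped
print(article.top_image)  # URL of the lead image, if detected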