How to get the content of a page?

Zawchik, 2012-09-17 15:09:18

PHP

Good day, Habr community!
I have fetched a page with cURL, and now I need to extract the content from it, i.e. the text itself, without menus, comments, headers, footers, and so on. Can you advise how to do this with minimal effort? I suspect it must be some pattern for preg_match...

15 answer(s)

EugeneOZ, 2012-09-17
@EugeneOZ

No need to parse HTML with regular expressions.
Read on StackOverflow: stackoverflow.com/a/1732454/680786

Yuri Morozov, 2012-09-18
@metamorph

People write scientific articles on this topic, and you want to get by with just preg_match :)

Urvin, 2012-09-17
@Urvin

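// Note: with the "u" flag preg_replace() returns NULL if $aText is not valid UTF-8.
function getContentFromHtml($aText)
{
  return preg_replace(
      array(
        // Elements whose content is never visible text: remove them entirely
        '@<head[^>]*?>.*?</head>@siu',
        '@<style[^>]*?>.*?</style>@siu',
        '@<script[^>]*?>.*?</script>@siu',
        '@<object[^>]*?>.*?</object>@siu',
        '@<embed[^>]*?>.*?</embed>@siu',
        '@<applet[^>]*?>.*?</applet>@siu',
        '@<noframes[^>]*?>.*?</noframes>@siu',
        '@<noscript[^>]*?>.*?</noscript>@siu',
        '@<noembed[^>]*?>.*?</noembed>@siu',

        // Block-level tags: matched only so that a space can be put in front of them
        '@</?((address)|(blockquote)|(center)|(del))@iu',
        '@</?((div)|(h[1-6])|(ins)|(isindex)|(p)|(pre))@iu',
        '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
        '@</?((table)|(th)|(td)|(caption))@iu',
        '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
        '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
        '@</?((frameset)|(frame)|(iframe))@iu',
        // Every remaining tag
        '@<[^>]*>@siu',
        // Character entities such as &nbsp; or &#160;
        '@&#?\w+;@iu',
        // Runs of whitespace
        '@(\s+)@siu'
      ),
      array(
        '',
        '',
        '',
        '',
        '',
        '',
        '',
        '',
        '',

        // A plain '$0' would be a no-op; the leading space keeps neighbouring
        // blocks from running together once the tags are stripped
        ' $0',
        ' $0',
        ' $0',
        ' $0',
        ' $0',
        ' $0',
        ' $0',
        '',
        ' ',
        ' '
      ),
      $aText
    );
}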

This is what I have in my stash. It's nothing special, of course.

Gesper, 2012-09-17
@Gesper

Try studying code.google.com/p/boilerpipe/

avalak, 2012-09-17
@avalak

Use Simple HTML DOM or other libraries with similar functionality.

m08pvv, 2012-09-17
@m08pvv

If we are talking about an article on Habr, the page source shows that the entire article is contained in
<div id="post_123456" class="post shortcuts_item">
where 123456 is the post number from the URL.
The text itself (without the title, the list of hubs, etc.) is contained in
<div class="content html_format">
In the general case, though, you need an HTML parser, because regular expressions are not enough.
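
A minimal sketch of the parser-based approach for the markup quoted above, using PHP's bundled DOM extension instead of regular expressions. The helper name extractByClass() is hypothetical, and the sample markup is made up to mirror the two div elements mentioned; the input is assumed to be UTF-8.

```php
<?php
// Sketch: return the text of the first element that carries a given CSS class.
// extractByClass() is a hypothetical helper, not part of any library.
function extractByClass(string $html, string $class): ?string
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The XML prolog tells the parser to treat the input as UTF-8 (assumption about the input).
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match $class as a whole token inside the space-separated class attribute.
    // $class is assumed to be a single class name without quotes or spaces.
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // Collapse the whitespace left over from the markup.
    return trim(preg_replace('/\s+/u', ' ', $nodes->item(0)->textContent));
}

// Made-up sample that mirrors the structure described above.
$html = '<div id="post_123456" class="post shortcuts_item">'
      . '<h1>Title</h1>'
      . '<div class="content html_format"><p>Article text here.</p></div>'
      . '</div>';

echo extractByClass($html, 'html_format'), "\n";
```

The same function works for any site once you know which class marks its content block; only the class name changes.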

Alexey Akulovich, 2012-09-17
@AterCattus

Many sites built on popular engines use template layouts, and this works in your favor. DLE, WordPress, and the like clearly mark the main content of the page with CSS classes. You can use these markers to identify the engine and then write the selectors for SHD (Simple HTML DOM, mentioned above) once per engine. For unrecognized sites, look for blocks with telltale names (main, content, body, etc.).
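
One way this could look, sketched with PHP's bundled DOM extension rather than Simple HTML DOM. The selector table, the "generator" meta tag check, and the function names are all assumptions for illustration; the class and id names should be verified against real pages of each engine.

```php
<?php
// Hypothetical selector table: engine keyword => XPath of its main content block.
// These class/id names are illustrative assumptions, not verified for every theme.
const CONTENT_SELECTORS = [
    'wordpress' => '//*[contains(concat(" ", normalize-space(@class), " "), " entry-content ")]',
    'datalife'  => '//*[@id="dle-content"]',
];
// Fallback for unrecognized sites: names that commonly mark the main block.
const FALLBACK_SELECTOR =
    '//*[@id="main" or @id="content" or @class="main" or @class="content"]';

function loadXPath(string $html): DOMXPath
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate invalid markup
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html); // input assumed to be UTF-8
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return new DOMXPath($doc);
}

// Guess the engine from <meta name="generator" content="...">, if the page has one.
function detectEngine(DOMXPath $xpath): ?string
{
    $generator = strtolower($xpath->evaluate('string(//meta[@name="generator"]/@content)'));
    foreach (array_keys(CONTENT_SELECTORS) as $engine) {
        if (strpos($generator, $engine) !== false) {
            return $engine;
        }
    }
    return null;
}

function extractMainContent(string $html): ?string
{
    $xpath = loadXPath($html);
    $queries = [];
    $engine = detectEngine($xpath);
    if ($engine !== null) {
        $queries[] = CONTENT_SELECTORS[$engine];
    }
    $queries[] = FALLBACK_SELECTOR; // always tried last
    foreach ($queries as $query) {
        $nodes = $xpath->query($query);
        if ($nodes !== false && $nodes->length > 0) {
            return trim(preg_replace('/\s+/u', ' ', $nodes->item(0)->textContent));
        }
    }
    return null;
}

// Made-up sample pages.
$wordpressPage = '<html><head><meta name="generator" content="WordPress 6.0"></head>'
    . '<body><div id="sidebar">Menu</div>'
    . '<div class="entry-content"><p>Post body.</p></div></body></html>';
$unknownPage = '<html><body><div id="nav">Menu</div>'
    . '<div id="content"><p>Plain page.</p></div></body></html>';

echo extractMainContent($wordpressPage), "\n";
echo extractMainContent($unknownPage), "\n";
```

Adding support for another engine then means adding one row to the table rather than writing new parsing code.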

uadeveloper, 2012-09-17
@uadeveloper

You can also use the “almost” ready-made option
habrahabr.ru/post/114323/

Zawchik, 2012-09-17
@Zawchik

Thank you all for the solutions, but many questions remain.
For example, the sites the script will mainly be run against not only lack an h1 tag altogether, but are also laid out with tables and built on ASP, with all the resulting garbage.
Yet the Evernote plugin for Chrome, for instance, immediately and unmistakably highlighted the right column. VKontakte does something similar: you give it a link, it instantly builds a preview, and the content of the article is there.
I need something like that...

Stepan, 2012-09-17
@L3n1n

I don't think you will find a universal parser.
Of course, you can try to write handlers for all the popular CMSs, but is it worth it?
For parsing, I also recommend Simple HTML DOM. It is just right for this task.

deadkrolik, 2012-09-17
@deadkrolik

Is your task to write a universal parser, one that could return an approximate title and body for any page?

Alexander Khmelev, 2013-02-25
@akhmelev

Very relevant. Have you found a solution? If so, please share it.

Maxim Dyachenko, 2013-04-28
@Mendel

I used two algorithms for this, on top of the options for extracting clean text already described here. The filtering was a little smarter: all i, strong, h1, etc. were replaced with b.
All p, span, div, etc. were replaced with a separator of some kind (I no longer remember which). All insignificant tags like head, img, etc. were deleted.
The result was a lot of text blocks that contained plain text, highlighted text, and text with links.
Then, for each block, I calculated the amount of text in it, what percentage of that text was highlighted, and what percentage was inside links.
I discarded the blocks in which the text was too short or had too many highlights or links.
If anyone is interested in my six-year-old crappy code (and 99% of it was wild), I can send it in a private message. But it is better to reproduce the algorithm yourself. It will turn out better :)
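
A rough reconstruction of the steps above, not the original code. The thresholds, the tag lists, the separator character, and all function names are assumptions chosen for illustration; the input is assumed to be valid UTF-8.

```php
<?php
// Illustrative thresholds; the post does not give the real values.
const MIN_TEXT_LENGTH = 40;   // shorter blocks are discarded
const MAX_LINK_SHARE  = 0.3;  // max share of block text inside <a>
const MAX_BOLD_SHARE  = 0.5;  // max share of block text inside <b>
const SEPARATOR       = "\x1E"; // arbitrary character unlikely to occur in pages

// Length of the visible text of a fragment (bytes are good enough for ratios).
function visibleLength(string $fragment): int
{
    return strlen(trim(preg_replace('/\s+/u', ' ', strip_tags($fragment))));
}

// Total visible length of the text enclosed in <$tag>...</$tag> pairs.
function lengthInside(string $fragment, string $tag): int
{
    preg_match_all("@<{$tag}\\b[^>]*>(.*?)</{$tag}>@siu", $fragment, $matches);
    return array_sum(array_map('visibleLength', $matches[1]));
}

function extractTextBlocks(string $html): array
{
    // 1. Delete elements that never hold article text, together with their content.
    $html = preg_replace('@<(head|script|style|noscript)\b[^>]*>.*?</\1>@siu', '', $html);
    // 2. Normalize every kind of emphasis to <b>.
    $html = preg_replace('@<(/?)(?:i|em|strong|h[1-6])\b[^>]*>@iu', '<${1}b>', $html);
    // 3. Replace block-level tags with a separator.
    $html = preg_replace('@</?(?:p|div|span|td|tr|table|li|ul|ol|br)\b[^>]*>@iu', SEPARATOR, $html);

    $blocks = [];
    foreach (explode(SEPARATOR, $html) as $fragment) {
        $total = visibleLength($fragment);
        if ($total < MIN_TEXT_LENGTH) {
            continue; // too short to be article text
        }
        // 4. Share of the block that is link text or highlighted text.
        $linkShare = lengthInside($fragment, 'a') / $total;
        $boldShare = lengthInside($fragment, 'b') / $total;
        if ($linkShare > MAX_LINK_SHARE || $boldShare > MAX_BOLD_SHARE) {
            continue; // looks like a menu or a heading block
        }
        $blocks[] = trim(preg_replace('/\s+/u', ' ', strip_tags($fragment)));
    }
    return $blocks;
}

// Made-up sample: a link-heavy menu, a heading, and one real paragraph.
$html = '<html><head><title>T</title></head><body>'
      . '<div><a href="/">Home page</a> <a href="/news">Latest news archive</a> '
      . '<a href="/about">About us and contacts page</a></div>'
      . '<div><h1>Headline</h1></div>'
      . '<p>This paragraph is the actual article text and it is long enough '
      . 'to be kept, with <a href="#">one link</a>.</p>'
      . '</body></html>';

$blocks = extractTextBlocks($html);
print_r($blocks);
```

In this sample the menu is rejected for its link share and the heading for its length, so only the paragraph survives. Real pages will need the thresholds tuned.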

Alexey, 2013-12-19
@photo_profile

Try simplehtmldom.sourceforge.net