How to get the content of a page?
Good day, Habr community!
A page has been fetched via cURL; now I need to extract the content from it, i.e. the text itself, without menus, comments, headers, footers, and so on. Any advice on how to do this with "little blood"? I suspect it must be some pattern for preg_match...
No need to parse HTML with regular expressions.
Read on StackOverflow: stackoverflow.com/a/1732454/680786
People write scientific papers on this topic, and you want to get by with a single preg_match :)
function getContentFromHtml($aText)
{
    return preg_replace(
        array(
            // Strip invisible containers along with their contents.
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?.*?</script>@siu',
            '@<object[^>]*?.*?</object>@siu',
            '@<embed[^>]*?.*?</embed>@siu',
            '@<applet[^>]*?.*?</applet>@siu',
            '@<noframes[^>]*?.*?</noframes>@siu',
            '@<noscript[^>]*?.*?</noscript>@siu',
            '@<noembed[^>]*?.*?</noembed>@siu',
            // Insert a separator before block-level tags so the text of
            // adjacent blocks does not run together once tags are stripped.
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
            // Strip all remaining tags, drop HTML entities, collapse whitespace.
            '@<[^>]*>@siu',
            '@&[^;]+?;@siu',
            '@(\s+)@siu'
        ),
        array(
            '', '', '', '', '', '', '', '', '',
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
            '',
            ' ',
            ' '
        ),
        $aText
    );
}
If we are talking about an article on Habr, then opening the article's source immediately shows that the entire article is contained in
<div id="post_123456" class="post shortcuts_item">
where post_123456 is the post number from the URL.
The text itself (without the title, the list of hubs, etc.) is contained in
<div class="content html_format">
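Based on the markup above, the extraction can be sketched with DOMDocument/DOMXPath instead of regular expressions. The class names come from the post; Habr's markup may of course change, so treat this as illustrative.

```php
<?php
// Sketch: extract the article body from a fetched Habr page using XPath.
// The "content html_format" classes are taken from the discussion above.
function getHabrArticleText(string $html): ?string
{
    $doc = new DOMDocument();
    // Suppress warnings caused by imperfect real-world markup.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    // The article text lives in <div class="content html_format">.
    $nodes = $xpath->query(
        '//div[contains(@class, "content") and contains(@class, "html_format")]'
    );
    if ($nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$html = '<div id="post_123456" class="post shortcuts_item">'
      . '<h1>Title</h1><div class="content html_format">Article text.</div></div>';
// getHabrArticleText($html) returns "Article text."
```

Unlike the regex approach, the DOM parser tolerates attributes in any order and nested tags inside the content block.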
If we are talking about the general case, then you need an HTML parser, because regular expressions are not enough.
Many sites built on popular engines use templated layouts, and that works in your favor. Any DLE, WordPress and the like clearly mark the main content of the page with CSS classes. On this basis you can identify the engine in use and write the Simple HTML DOM queries (mentioned above) once per engine. For unrecognized sites, look for telltale blocks (main, content, body, etc.).
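The "telltale blocks" fallback can be sketched as follows; the list of hints is illustrative, not exhaustive, and the function simply returns the first matching container.

```php
<?php
// Sketch: for unrecognized sites, look for containers whose id or class
// hints at main content. Hint list and priority order are assumptions.
function findContentBlock(string $html): ?string
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // tolerate sloppy markup
    $xpath = new DOMXPath($doc);
    foreach (['content', 'main', 'article', 'post', 'body'] as $hint) {
        $nodes = $xpath->query(
            "//div[contains(@id, '$hint') or contains(@class, '$hint')]"
        );
        if ($nodes->length > 0) {
            return trim($nodes->item(0)->textContent);
        }
    }
    return null; // no recognizable content container
}
```

A real implementation would also prefer the deepest or longest matching block rather than the first one.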
You can also use the “almost” ready-made option
habrahabr.ru/post/114323/
Thank you all for the solutions; however, many questions remain.
For example, the sites that will be the main targets when the script runs not only lack an h1 tag entirely, but are also laid out with tables in ASP, with all the resulting garbage.
Yet, say, the Evernote Chrome plugin immediately and unmistakably picked out the desired column. VKontakte does something similar: you feed it a link, it immediately builds a preview, and there is the content of the article.
Something similar is needed...
I don't think you will find a universal parser.
Of course, you can try to write functionality for all popular CMS, but is it worth it?
For parsing, I also recommend Simple HTML DOM. It is just about perfect for this task.
So your task is to write a universal parser — one that could extract an approximate title and body from any page?
For this purpose I used two algorithms on top of the clean-text extraction options already described. The filtering here was a little smarter: all i, strong, h1, etc. were replaced with b.
All p, span, div, etc. were replaced with some kind of separator (I no longer remember which). All insignificant tags like head, img, etc. were deleted.
The result was a set of text blocks containing plain text, highlighted text, and text with links.
Then, for each block, I calculated the amount of text, what percentage of it was highlighted, and what percentage was inside links.
I discarded blocks in which the text was too short or had too many highlights or links.
If anyone is interested in my six-year-old spaghetti code (and it was 99% wild), I can send it in a private message. But it is better to reproduce the algorithm yourself; the result will be better :)
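The scoring heuristic described above can be sketched as follows. The thresholds (minimum length, maximum share of highlighted and linked text) are made-up illustrative values, not the original author's.

```php
<?php
// Sketch of the block-scoring heuristic: for each candidate text block,
// measure total length, the share of highlighted text, and the share of
// link text, then discard blocks that are too short or too "noisy".
// Input: blocks as ['text' => string, 'bold' => int, 'linked' => int],
// where 'bold' and 'linked' are character counts. Thresholds are assumptions.
function scoreBlocks(array $blocks): array
{
    $kept = [];
    foreach ($blocks as $b) {
        $len = strlen($b['text']);
        if ($len === 0) {
            continue;
        }
        $boldShare = $b['bold'] / $len;   // fraction of highlighted text
        $linkShare = $b['linked'] / $len; // fraction of text inside links
        // Keep only long blocks that are mostly plain running text.
        if ($len >= 200 && $boldShare <= 0.3 && $linkShare <= 0.3) {
            $kept[] = $b['text'];
        }
    }
    return $kept;
}

$blocks = [
    ['text' => str_repeat('a', 300), 'bold' => 10, 'linked' => 10],  // kept
    ['text' => 'short',              'bold' => 0,  'linked' => 0],   // too short
    ['text' => str_repeat('b', 300), 'bold' => 0,  'linked' => 200], // link farm
];
// scoreBlocks($blocks) keeps only the first block.
```

In practice this filters out navigation menus (mostly links) and teaser lists (short, heavily highlighted) while keeping article paragraphs.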