Yii
Dmitry, 2016-07-05 01:50:20

How do I get only the text from an HTML page?

Good evening.
I need to get all the text that appears on a page of a site, regardless of tags and their attributes. The contents of script and iframe elements must be excluded.
At the end, the text has to be written to a text file, with line breaks.
I started with the following:

$str = file_get_contents('http://site.com');
$doc = new DOMDocument();
@$doc->loadHTML($str);
$body = $doc->getElementsByTagName('body');
...

After that I hit a dead end: I cannot figure out how to do this properly.
From the result I can get a DOMElement whose textContent contains all the text.
How can I parse it so that it can be written to a file? What is the right way to do this?
P.S. The order in the file should be something like this:
Heading
Subheading
Text
Menu
Text
Text
etc.

2 answers
Vitaly, 2016-07-05
@vshvydky

Use simple_html_dom: select the elements you need with find() and read their plaintext property.

xmoonlight, 2016-07-05
@xmoonlight

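$text = <<<HTML
hellow <b>test</b> <iframe src=javascript>browser is bad!</iframe>
<script>alert('hi');</script>
test2<br>
<script>alert('hi');</script>


test3


HTML;
// Remove iframe and script elements together with their contents
// (s: the dot also matches newlines, i: case-insensitive tag names).
$text = preg_replace('#<iframe\b.*?</iframe>#uis', '', $text);
$text = preg_replace('#<script\b.*?</script>#uis', '', $text);
// Strip the remaining tags but keep the text between them.
$text = preg_replace('#</?[^>]+>#u', '', $text);
// Collapse runs of line breaks (LF or CRLF) into a single line break.
$text = preg_replace('#(\r?\n)+#u', "\n", $text);
echo "<pre>" . $text . "</pre>";
/*
hellow test 
test2
test3
*/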
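
For comparison, here is a minimal sketch of the DOM-based route the question started with, using only PHP's bundled DOM extension. The helper name extractTextLines(), the sample markup, and the file name output.txt are assumptions made for illustration; removing style elements is also an addition that the question did not ask for.

```php
<?php
// Sketch: extract the visible text of an HTML document, one entry per text node.
// extractTextLines() is a hypothetical helper, not part of any library.
function extractTextLines(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them
    // (real-world markup is rarely valid).
    libxml_use_internal_errors(true);
    // The XML prolog hints UTF-8 so that non-ASCII text is not mangled.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Remove script, iframe and style elements together with their contents.
    // iterator_to_array() copies the node list before nodes are removed.
    foreach (iterator_to_array($xpath->query('//script | //iframe | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    $lines = [];
    // Walk the remaining text nodes under <body> in document order.
    foreach ($xpath->query('//body//text()') as $textNode) {
        // Collapse internal whitespace and skip whitespace-only nodes.
        $line = trim(preg_replace('/\s+/u', ' ', $textNode->nodeValue));
        if ($line !== '') {
            $lines[] = $line;
        }
    }
    return $lines;
}

// Sample markup (assumed for the example).
$html = '<html><body><h1>Heading</h1><script>alert("hi");</script>'
      . '<p>Text <b>bold</b></p><iframe src="x">fallback</iframe></body></html>';

$lines = extractTextLines($html);

// Write one entry per line; 'output.txt' is an assumed file name.
file_put_contents('output.txt', implode(PHP_EOL, $lines) . PHP_EOL);
```

Note that every text node becomes its own line, so inline elements such as b or a split a sentence across lines. If one line per block element is needed, the text nodes would have to be grouped by their nearest block-level ancestor, which this sketch does not attempt.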
