Is there a way to remove extra end tags when parsing?

PHP

Khurshed Abdujalil, 2018-02-01 11:52:27

I'm parsing a site, and the fetched markup contains a lot of extra closing </div> tags, which break my layout.
I tried this:

$content = preg_replace("/<\/?div[^>]*\>/i", "", $content);

It does not work... Has anyone run into this?

2 answer(s)

novrm, 2018-02-01
@novrm

You need an HTML markup filter.
With the right settings, HTML Purifier (htmlpurifier) will do the job.

zzamzam, 2018-02-01
@Inlore

Or you can parse the page with DOMDocument and get the contents of the body without tags:

$url = 'http://yandex.ru';
$result = file_get_contents($url);

$dom = new \DOMDocument();
libxml_use_internal_errors(true);
/* By default loadHTML assumes ISO-8859-1, so the conversion is specified explicitly */
$dom->loadHTML(mb_convert_encoding($result, 'HTML-ENTITIES', 'UTF-8'));
libxml_use_internal_errors(false);
$bodyContent = $dom->getElementsByTagName('body')[0]->textContent;
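
If the goal is to keep the markup and only drop the unmatched closing tags, the same parser can be used to re-serialize the repaired tree instead of reading textContent. Below is a minimal sketch, not a definitive implementation: removeStrayEndTags is a hypothetical helper name, the input is assumed to be a UTF-8 body fragment, and the encoding is hinted with a meta tag rather than with mb_convert_encoding, since the 'HTML-ENTITIES' target is deprecated as of PHP 8.2.

```php
<?php
/**
 * Re-serialize an HTML fragment through DOMDocument so that closing tags
 * with no matching opening tag are dropped by the parser.
 * Hypothetical helper; assumes $fragment is UTF-8 and belongs inside <body>.
 */
function removeStrayEndTags(string $fragment): string
{
    $dom = new DOMDocument();

    // Collect parser warnings (e.g. "Unexpected end tag") instead of printing them.
    $previous = libxml_use_internal_errors(true);

    // Wrap the fragment in a full document; the meta tag tells libxml
    // that the content is UTF-8 (assumption about the input).
    $dom->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>'
        . $fragment
        . '</body></html>'
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize only the children of <body>, i.e. the repaired fragment.
    $clean = '';
    foreach ($body->childNodes as $child) {
        $clean .= $dom->saveHTML($child);
    }

    return $clean;
}

// Usage: two of the three closing </div> tags have no opening tag.
$dirty = '<div class="a"><p>Hello</p></div></div></div><p>World</p>';
echo removeStrayEndTags($dirty), PHP_EOL;
```

Keep in mind that the parser cannot know which closing tag is the extra one: if a stray </div> appears while a legitimate outer div is still open, it closes that div early, so the result should be checked against the real page.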

Unwanted parts, such as scripts and styles, will remain in the extracted text, but you can strip them from the HTML with regular expressions before creating the DOMDocument.
If you do not need the entire body, you can get the content of individual elements.
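
Both points can be sketched with DOMXPath. Note that this is a swapped-in technique: instead of stripping scripts and styles with regular expressions before parsing, the sketch removes the script and style nodes from the parsed tree, then reads the text of the elements matched by an XPath query. The function name extractElementText and the sample markup are assumptions made for illustration.

```php
<?php
/**
 * Return the trimmed text of every element matched by $xpathQuery,
 * after removing <script> and <style> nodes from the parsed document.
 * Hypothetical helper; not part of any library.
 */
function extractElementText(string $html, string $xpathQuery): array
{
    $dom = new DOMDocument();

    // Suppress warnings about invalid markup while parsing.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Drop scripts and styles so their source does not leak into textContent.
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    $matches = $xpath->query($xpathQuery);
    if ($matches === false) {
        throw new InvalidArgumentException('Invalid XPath expression: ' . $xpathQuery);
    }

    $texts = [];
    foreach ($matches as $node) {
        $texts[] = trim($node->textContent);
    }

    return $texts;
}

// Usage with sample markup (assumed for the example).
$html = '<html><head><style>p{color:red}</style></head><body>'
    . '<div id="main"><p>Hello</p><script>var x = 1;</script></div>'
    . '<p>Other</p></body></html>';

print_r(extractElementText($html, '//div[@id="main"]'));
```

Removing nodes from the tree avoids the usual pitfalls of matching nested or unusual script blocks with a regular expression, at the cost of parsing the whole page first.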