PHP
Khurshed Abdujalil, 2018-02-01 11:52:27

Is there a way to remove extra end tags when parsing?

I'm parsing a site, and the content I get contains a lot of extra closing </div> tags, which break my layout.
I tried this:

$content = preg_replace("/<\/?div[^>]*\>/i", "", $content);
It does not work. Has anyone run into this?


2 answer(s)
novrm, 2018-02-01
@novrm

You need an HTML markup filter.
With the right settings, HTML Purifier (htmlpurifier) will do.
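
A markup filter fixes this kind of problem by reparsing the HTML and serializing it again. As a rough sketch of the same idea without an external library, the fragment can be round-tripped through PHP's built-in DOMDocument instead of HTML Purifier. The function name balanceTags is hypothetical, and the sketch assumes the input is a UTF-8 body fragment.

```php
<?php
// Hypothetical helper (not part of any library): reparses an HTML fragment
// so that the parser discards stray closing tags and closes unclosed ones.
function balanceTags(string $fragment): string
{
    $dom = new DOMDocument();
    // Collect parser warnings about malformed markup instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // Assumption: input is UTF-8; the XML encoding prefix hints that to loadHTML.
    $dom->loadHTML('<?xml encoding="UTF-8"><html><body>' . $fragment . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Serialize only the children of <body>, i.e. the repaired fragment.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$content = balanceTags('<div><p>Text</p></div></div></div>');
```

Unlike the preg_replace attempt, this keeps the legitimate div elements and only drops the closing tags that have no matching opening tag. It does not sanitize the markup, so a real filter is still the better choice for untrusted input.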

zzamzam, 2018-02-01
@Inlore

Or you can parse the page with DOMDocument and get the contents of the body without tags:

$url = 'http://yandex.ru';
$result = file_get_contents($url);

$dom = new \DOMDocument();
libxml_use_internal_errors(true);
/* By default loadHTML assumes ISO-8859-1, so we convert the encoding explicitly */
$dom->loadHTML(mb_convert_encoding($result, 'HTML-ENTITIES', 'UTF-8'));
libxml_use_internal_errors(false);
$bodyContent = $dom->getElementsByTagName('body')[0]->textContent;

Unnecessary parts, such as the contents of scripts and styles, will remain in the text, but you can remove them from the HTML with regular expressions before creating the DOMDocument.
If you do not need the entire body, you can get the content of individual elements instead.
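
The two points above can be sketched as follows. This variant removes script and style nodes through the DOM itself rather than with regular expressions, and optionally narrows the result to one element. The function name extractText and the id-based lookup are assumptions made for illustration, and the id is assumed not to contain double quotes.

```php
<?php
// Hypothetical helper: returns the text of <body>, or of the element with
// the given id, after removing <script> and <style> nodes.
function extractText(string $html, ?string $elementId = null): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // Assumption: input is UTF-8; the XML encoding prefix hints that to loadHTML.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName returns a live list, so copy it before removing nodes.
    foreach (['script', 'style'] as $tag) {
        foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    $xpath = new DOMXPath($dom);
    // Assumption: $elementId contains no double quotes.
    $query = $elementId === null ? '//body' : sprintf('//*[@id="%s"]', $elementId);
    $target = $xpath->query($query)->item(0);

    return $target === null ? '' : trim($target->textContent);
}

$page = '<html><head><style>p{color:red}</style></head><body>'
      . '<div id="main"><p>Hello</p></div></div>'
      . '<script>alert(1)</script><p>World</p></body></html>';
$mainText = extractText($page, 'main');
$bodyText = extractText($page);
```

Removing nodes through the DOM avoids regular expressions that can miss scripts with unusual attributes or nesting.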
