PHP
Adel1ne, 2015-04-07 12:59:38

Problem with encoding in Simple HTML DOM Parser?

Hello!
I've run into an encoding issue when using PHP Simple HTML DOM Parser.
I extract the HTML text from paragraphs using the innertext() function. The text may contain tags,
for example text1 or even a link somewhere.
Now to the point, here is the code:

$fulltext = ''; // initialize before appending to avoid an undefined-variable notice
foreach ($html->find('div[class="text"] p') as $text) {
    $fulltext .= iconv("Windows-1251", "UTF-8", $text->innertext());
}

In addition, a lot of other content on the page is extracted via plaintext.
The problem is this:
The page I'm parsing is Windows-1251 encoded, while my code (index.php) and simple_html_dom.php itself are UTF-8 encoded. The parser pulls out information in the page encoding, that is, in my case, Windows-1251.
OK, so I convert it using iconv and, in theory, everything should be fine. Most of the text is displayed correctly in UTF-8, but the catch is that for some reason the text enclosed in tags is displayed as gibberish. Either iconv doesn't work on it, or something else is going on, but I haven't figured out how to fix it. Saving my own page as Windows-1251 doesn't help either.
Any ideas?

3 answer(s)
Vit, 2015-04-07
@Adel1ne

Put the entire content of the HTML page into a string variable, convert it to the desired encoding (UTF-8), and only then feed it into Simple HTML DOM Parser. That's what I've always done, and I've never had any problems.
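
The approach described above could be sketched roughly as follows. This is only an illustration: to_utf8() is a hypothetical helper name, the source encoding is assumed to be known in advance (Windows-1251, as in the question), and the hard-coded string stands in for whatever you would fetch with file_get_contents(). The parser part only runs if simple_html_dom.php has been included, since str_get_html() comes from that library.

```php
<?php
// Hypothetical helper: convert the raw page to UTF-8 before any parsing.
function to_utf8(string $rawHtml, string $sourceEncoding = 'Windows-1251'): string
{
    $converted = iconv($sourceEncoding, 'UTF-8', $rawHtml);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $sourceEncoding");
    }
    return $converted;
}

// Stand-in for file_get_contents($url): "Привет" as Windows-1251 bytes inside markup.
$rawHtml = "<div class=\"text\"><p><b>\xCF\xF0\xE8\xE2\xE5\xF2</b></p></div>";
$utf8Html = to_utf8($rawHtml);

// With simple_html_dom.php loaded, parse the already-converted string.
if (function_exists('str_get_html')) {
    $html = str_get_html($utf8Html);
    $fulltext = '';
    foreach ($html->find('div[class="text"] p') as $text) {
        $fulltext .= $text->innertext(); // no per-node iconv call here
    }
}
```

The point of this ordering is that every node the parser returns, nested tags included, comes from a string that was converted once as a whole, so no per-node conversion is needed afterwards.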

Adel1ne, 2015-04-07
@Adel1ne

Vit, "Put the entire content of the HTML page into a string variable"

Can you tell me how to do it?

Eugene, 2016-10-10
@Jekshmek

$d = mb_convert_encoding($d, 'utf-8', mb_detect_encoding($d));
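
One caveat on this one-liner, offered as a hedged sketch rather than a definitive fix: when mb_detect_encoding() is called without a candidate list, it relies on the configured mb_detect_order(), which on many installations does not include Windows-1251, so the detection may not return what you expect. Passing the candidates explicitly with strict mode enabled is one way around that. The helper name detect_and_convert() and the candidate list are assumptions for illustration, and the mbstring extension has to be enabled.

```php
<?php
// Hypothetical helper: detect among explicit candidates, then convert to UTF-8.
function detect_and_convert(string $d, array $candidates = ['UTF-8', 'Windows-1251']): string
{
    // Strict mode rejects candidates for which the bytes are invalid.
    $from = mb_detect_encoding($d, $candidates, true);
    if ($from === false) {
        throw new RuntimeException('Could not detect the source encoding');
    }
    return mb_convert_encoding($d, 'UTF-8', $from);
}

// "Привет" as Windows-1251 bytes; this byte sequence is not valid UTF-8.
$d = "\xCF\xF0\xE8\xE2\xE5\xF2";
$d = detect_and_convert($d);
```

Detection is still a guess when the input is valid in more than one candidate encoding, so if the page encoding is known, naming it directly is the safer choice.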
