Problem with encoding in Simple HTML DOM Parser?

PHP

Adel1ne, 2015-04-07 12:59:38

Hello!
I ran into an encoding issue when using PHP Simple HTML DOM Parser.
I extract the HTML of paragraphs using the innertext() function. The text may contain tags,
for example text1 or even a link somewhere.
Now to the point, here is the code:

foreach ($html->find('div[class="text"] p') as $text) {
    $fulltext .= iconv("Windows-1251", "UTF-8", $text->innertext());
}

In addition, a lot of other content on the page is extracted through plaintext.
The problem is this:
The page I'm parsing has Windows-1251 encoding, while my code (index.php) and simple_html_dom.php itself are UTF-8 encoded. The parser returns information in the page encoding, that is, in my case, Windows-1251.
OK, I do the conversion using iconv and, in theory, everything should be fine. Most of the text is displayed correctly in UTF-8, but the catch is that for some reason the text enclosed in tags is displayed as gibberish. Either iconv doesn't work on it, or it is something else, but I haven't figured out how to fix it. Saving my own page in Windows-1251 doesn't help either. Any ideas?

3 answer(s)

Vit, 2015-04-07
@Adel1ne

Put all the content of the html page into a string variable, convert it to the desired encoding (UTF-8), and only then feed it to Simple HTML DOM Parser. That's what I've always done, and I have never had any problems.
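
A minimal sketch of that approach, assuming the page is Windows-1251 as in the question. The helper name page_to_utf8, the placeholder URL, and the location of simple_html_dom.php are assumptions for illustration; str_get_html() is the library's function for parsing a string. Whether the library also acts on the page's own charset declaration may depend on its version, so treat this as a starting point rather than a guaranteed fix.

```php
<?php
// Hypothetical helper: convert the raw page bytes to UTF-8 before any parsing.
function page_to_utf8(string $raw, string $from = 'Windows-1251'): string
{
    $utf8 = iconv($from, 'UTF-8', $raw);
    if ($utf8 === false) {
        throw new RuntimeException("Cannot convert from $from to UTF-8");
    }
    return $utf8;
}

// Usage, assuming simple_html_dom.php sits next to this script.
// The URL is a placeholder, not taken from the question.
if (is_file(__DIR__ . '/simple_html_dom.php')) {
    require_once __DIR__ . '/simple_html_dom.php';

    $raw = file_get_contents('http://example.com/page.html');
    if ($raw !== false) {
        // The parser receives a string that is already UTF-8.
        $html = str_get_html(page_to_utf8($raw));
        if ($html) {
            $fulltext = '';
            foreach ($html->find('div[class="text"] p') as $p) {
                $fulltext .= $p->innertext; // no per-node iconv needed here
            }
            echo $fulltext;
        }
    }
}
```

The point of the design is that the conversion happens once, on the whole document, so text inside nested tags goes through exactly the same conversion as everything else.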

Adel1ne, 2015-04-07
@Adel1ne

Vit, "Put all the content of the html page into a string variable"

Can you tell me how to do it?

Eugene, 2016-10-10
@Jekshmek

$d = mb_convert_encoding($d, 'UTF-8', mb_detect_encoding($d));
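
One caveat with this one-liner: mb_detect_encoding() only considers the encodings in its detect order, which by default is usually just ASCII and UTF-8, so a Windows-1251 page may not be recognized. A hedged sketch that passes an explicit candidate list and uses strict mode; the function name detect_and_convert and the candidate list are assumptions, not part of the answer above.

```php
<?php
// Hypothetical helper: detect among explicit candidates, then convert to UTF-8.
function detect_and_convert(string $d, array $candidates = ['UTF-8', 'Windows-1251']): string
{
    // Strict mode: only accept an encoding in which $d is entirely valid.
    $from = mb_detect_encoding($d, $candidates, true);
    if ($from === false) {
        throw new RuntimeException('None of the candidate encodings matched');
    }
    return mb_convert_encoding($d, 'UTF-8', $from);
}
```

Detection is still a guess for short or ambiguous input, so when the source encoding is known in advance (as in the question), naming it explicitly is the safer choice.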