Why do I get the wrong encoding when parsing HTML?


Parsing

Makanchor, 2020-02-12 08:24:18

I'm parsing the page https://classinform.ru/fkko-2017.html.

In the browser everything displays correctly, and copying the text by hand also works fine. But when I call UrlFetchApp.fetch(), the Cyrillic characters turn into �, as if the content were being read as utf-8.

Request parameters

var options = {
  "method": "get",
  "headers": {},
}

What am I doing wrong?


2 answers


Alexander Ivanov, 2020-02-12
@Makanchor

Strictly speaking, you should always specify the encoding when reading fetched content. It just so happens that everyone is used to UTF-8, which is what you get by default.
Specify the actual encoding of the content when extracting the text:

const data = UrlFetchApp.fetch('https://classinform.ru/fkko-2017.html');
// Decode the response body as windows-1251 instead of the default UTF-8
console.log(data.getContentText('windows-1251'));
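To see why the charset argument matters, here is a small sketch of what happens at the byte level. It is meant for Node.js or a browser console, where the standard TextDecoder is available with windows-1251 support; whether TextDecoder exists inside Apps Script is not assumed here. The sample word and variable names are made up for illustration.

```javascript
// Bytes of the word "Привет" as a windows-1251 server would send them.
const cp1251Bytes = new Uint8Array([0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2]);

// Wrong guess: read as UTF-8, these bytes are invalid sequences,
// so the decoder substitutes the replacement character U+FFFD.
const asUtf8 = new TextDecoder('utf-8').decode(cp1251Bytes);

// Right guess: naming the real encoding recovers the original text.
const asCp1251 = new TextDecoder('windows-1251').decode(cp1251Bytes);

console.log(asUtf8);
console.log(asCp1251);
```

getContentText('windows-1251') plays the same role as the second decoder: the bytes are the same, only the table used to interpret them changes.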


Sergey Pankov, 2020-02-12
@trapwalker

The page is in cp1251 (windows-1251). This encoding is declared in a meta tag on the page:

<meta http-equiv="content-type" content="text/html; charset=windows-1251">

When you copy from a browser, the browser has already decoded the page using this encoding, so the copied text comes out correct. Before parsing, convert the HTML to UTF-8, or convert the individual fragments you cut out.
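If you'd rather not hard-code the encoding, one possible approach is to read the declared charset first and then decode with it. The sketch below is only an illustration: detectCharset and fetchDecoded are hypothetical helper names, the regular expression is deliberately simple (it takes the first "charset=" it finds), and the 'Content-Type' header key is assumed to arrive in that exact casing. UrlFetchApp.fetch(), getHeaders() and getContentText(charset) are the standard Apps Script calls.

```javascript
// Hypothetical helper: pull a charset name out of a Content-Type value
// or out of HTML containing a <meta ... charset=...> declaration.
function detectCharset(text) {
  const match = /charset\s*=\s*["']?([\w-]+)/i.exec(text);
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper for Apps Script: fetch a page and decode it
// using the charset it declares, falling back to UTF-8.
function fetchDecoded(url) {
  const response = UrlFetchApp.fetch(url);
  const headers = response.getHeaders();
  const fromHeader = detectCharset(String(headers['Content-Type'] || ''));
  // ASCII markup survives a UTF-8 decode, so the meta tag is still readable
  // even when the Cyrillic text around it is garbled.
  const fromMeta = detectCharset(response.getContentText());
  return response.getContentText(fromHeader || fromMeta || 'utf-8');
}
```

In Apps Script you would then call fetchDecoded('https://classinform.ru/fkko-2017.html') and parse the returned string. The header is checked first because a charset sent by the server normally takes precedence over the one in the meta tag.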