How to determine the encoding of a site in Python?


Python

AlexRAV, 2016-11-12 08:48:31

I need to parse this page: dailybloggz.com/tfy/andreeva_secrets
Whatever I try, I get all sorts of garbage with the encoding. Initially, when printed, the Russian text comes out like this:

\x8d\xe5\xe4\xe0\xe2\xed\xee \xe2\xe8\xe4\xe5\xeb\xe0

None of the online decoders could handle it, except one: https://2cyr.com/decode/?lang=ru
It identified the encoding as MACCYRILLIC, so I do:
print(soup.prettify().encode('MACCYRILLIC'))
The output is the same as the input. With .encode('utf-8') the result is the same. So the actual question: how do I determine the encoding? Or am I doing something wrong myself? The whole module code:

import urllib.request
from bs4 import BeautifulSoup


url = 'http://dailybloggz.com/tfy/andreeva_secrets/'
req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
con = urllib.request.urlopen(req)
soup = BeautifulSoup(con, "html5lib")
print(soup.prettify().encode('utf-8'))
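
A note on the output above: a sequence like \x8d\xe5\xe4... is how Python prints a bytes object, and .encode() always returns bytes, whichever codec is passed. Going from bytes to readable text is .decode(). A minimal sketch using the sample bytes from the question; the variable names are illustrative only, and it assumes the bytes really are in mac_cyrillic, as the online decoder suggested:

```python
# Sample bytes copied from the question.
raw = b'\x8d\xe5\xe4\xe0\xe2\xed\xee \xe2\xe8\xe4\xe5\xeb\xe0'

# decode(): bytes -> str. Assumes mac_cyrillic, as the online decoder suggested.
text = raw.decode('mac_cyrillic')
print(text)

# encode(): str -> bytes. Printing the result shows the \x.. escapes again.
encoded = text.encode('mac_cyrillic')
print(encoded)

# str and bytes are different types, so print() renders them differently.
print(type(text).__name__, type(encoded).__name__)
```

Under that assumption the sample decodes to "Недавно видела", and re-encoding it gives back exactly the bytes that were printed. That would explain why the output of .encode(...) looked "the same as the input".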

2 answer(s)

asd111, 2016-11-12
@AlexRAV

The site is in UTF-8.
I just checked:

from pprint import pprint

import requests

data = requests.get("http://dailybloggz.com/tfy/andreeva_secrets/")
pprint(data.content.decode('utf-8'))

Everything prints fine.
Maybe your console is not in UTF-8?
The encoding is specified in the meta charset tag at the top of the page.
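
To read that meta charset from code rather than by eye, something like the following might work. It is only a rough sketch: sniff_meta_charset, the regular expression and the 4096-byte scan limit are my own assumptions, and the sample page is made up, not the real site.

```python
import re

# Matches both <meta charset="utf-8"> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> form.
META_CHARSET_RE = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)',
    re.IGNORECASE,
)


def sniff_meta_charset(raw_html, scan_bytes=4096):
    """Return the charset named in a <meta> tag, or None if none is found.

    Hypothetical helper: only the first scan_bytes bytes are searched.
    """
    match = META_CHARSET_RE.search(raw_html[:scan_bytes])
    if match is None:
        return None
    return match.group(1).decode('ascii').lower()


# Made-up sample page, not the real site.
sample = b'<html><head><meta charset="UTF-8"><title>t</title></head></html>'
print(sniff_meta_charset(sample))
```

The search runs on the raw bytes on purpose: the tag has to be found before the encoding is known. With requests, the bytes to pass in would be data.content.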

Andy_U, 2016-11-12
@Andy_U

Here is code that works in the PyCharm console (the charset is detected automatically!):

# -*- coding: utf-8 -*-

import urllib.request

resource = urllib.request.urlopen('http://dailybloggz.com/tfy/andreeva_secrets')
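# get_content_charset() returns None if the header has no charset;
# falling back to utf-8 here is an assumption.
charset = resource.headers.get_content_charset() or 'utf-8'
print(charset)
content = resource.read().decode(charset)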
print(content)
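
One caveat: get_content_charset() returns None when the Content-Type header carries no charset parameter, and a server can also send a codec name that Python does not know. A minimal sketch of handling both cases; decode_body and the 'utf-8' default are assumptions of mine, not part of urllib:

```python
def decode_body(raw, header_charset, default='utf-8'):
    """Decode raw bytes using the header charset, else fall back to default.

    Hypothetical helper; the 'utf-8' default is an assumption, not a rule.
    """
    for candidate in (header_charset, default):
        if not candidate:
            continue  # the header did not declare a charset
        try:
            return raw.decode(candidate)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or wrong guess, try the next one
    # Last resort: never raise, mark undecodable bytes with U+FFFD.
    return raw.decode(default, errors='replace')


body = 'Привет'.encode('utf-8')
print(decode_body(body, None))             # no charset in the header
print(decode_body(body, 'no-such-codec'))  # bogus charset name in the header
```

With the code above it would be called as decode_body(resource.read(), resource.headers.get_content_charset()).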