How to determine the encoding of a site in Python?
I need to parse this page: dailybloggz.com/tfy/andreeva_secrets
No matter what I try, the encoding comes out wrong. Initially, the Russian text is printed in the following form:
\x8d\xe5\xe4\xe0\xe2\xed\xee \xe2\xe8\xe4\xe5\xeb\xe0
print(soup.prettify().encode('MACCYRILLIC'))
import urllib.request
from bs4 import BeautifulSoup
url = 'http://dailybloggz.com/tfy/andreeva_secrets/'
req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
con = urllib.request.urlopen(req)
soup = BeautifulSoup(con, "html5lib")
print(soup.prettify().encode('utf-8'))
The site is served in UTF-8; I just checked:
from pprint import pprint
import requests
data = requests.get("http://dailybloggz.com/tfy/andreeva_secrets/")
pprint(data.content.decode('utf-8'))
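A note on why the output in the question looks like `\x8d\xe5...`: `str.encode()` returns a `bytes` object, and `print()` shows bytes as escaped hex rather than as readable text. A minimal sketch of the difference, using a sample string in place of the real page (the sample text is an assumption, not the page's content):

```python
# Sample text standing in for the page content (assumption for illustration).
text = "Недавно видела"

# Encoding turns the str into bytes; printing bytes shows \x escapes.
encoded = text.encode("utf-8")
print(encoded)

# Printing the str itself, or decoding the bytes back with the same codec,
# shows readable text (provided the console can display Cyrillic).
decoded = encoded.decode("utf-8")
print(decoded)
```

So in the question's code, dropping the trailing `.encode(...)` and printing `soup.prettify()` directly should be enough, assuming the parser decoded the page correctly.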
Here is code I ran in the PyCharm console; it detects the charset automatically:
# -*- coding: utf-8 -*-
import urllib.request
resource = urllib.request.urlopen('http://dailybloggz.com/tfy/andreeva_secrets')
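# get_content_charset() returns None when the Content-Type header has no charset, so fall back to UTF-8
charset = resource.headers.get_content_charset() or 'utf-8'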
print(charset)
content = resource.read().decode(charset)
print(content)
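If the server does not send a charset in its headers, the approach above has nothing to go on. One possible fallback, sketched here under assumptions, is to try the header value first, then a charset declared in the HTML itself, then UTF-8. `decode_html` is a hypothetical helper name, and the regex is a rough heuristic rather than a full HTML parser:

```python
import re
from typing import Optional


def decode_html(raw: bytes, header_charset: Optional[str] = None) -> str:
    """Decode raw HTML bytes, trying several charset candidates in order.

    Hypothetical helper: header charset, then a charset found in the markup,
    then UTF-8 as a last resort.
    """
    candidates = []
    if header_charset:
        candidates.append(header_charset)

    # Rough heuristic: look for charset=... near the top of the document,
    # which covers <meta charset="..."> and <meta ... content="...; charset=...">.
    match = re.search(rb'charset=["\']?([\w-]+)', raw[:2048], re.IGNORECASE)
    if match:
        candidates.append(match.group(1).decode("ascii"))

    candidates.append("utf-8")

    for name in candidates:
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or bytes that do not fit it: try the next one.
            continue

    # Nothing fit cleanly; decode as UTF-8 and mark undecodable bytes.
    return raw.decode("utf-8", errors="replace")


# Offline example: a cp1251 page that declares its charset in a meta tag.
sample = '<meta charset="windows-1251"><p>Недавно видела</p>'.encode("cp1251")
print(decode_html(sample))
```

With the `urllib` code above, this could be called as `decode_html(resource.read(), resource.headers.get_content_charset())`.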