Python
nexus0, 2020-01-11 04:41:36

Problem with encoding in requests_html?

I can't parse the page title in the correct encoding.

>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://pm.by/live.html')
>>> print(r.encoding)
WINDOWS-1251
>>> r.html.xpath('//title/text()')
['������ Live � ������ �� ����� ���� (�� ���� �����): �� ��������']

The site uses cp1251 encoding, but when I run an XPath query I get garbled text.
The garbled characters (krakozyabry) can't even be converted back to bytes with the encode method:
>>> r.html.xpath('//title/text()')[0].encode('cp1251')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/encodings/cp1251.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-5: character maps to <undefined>

What could be the problem?


1 answer(s)
Drill, 2020-01-11
@nexus0

Try r.content.decode('cp1251') or
r.html.encoding = 'cp1251'
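
To illustrate why encode('cp1251') fails on the garbled title while decoding the raw bytes works, here is a minimal offline sketch using only the standard library. The sample string and the names title, raw, garbled and can_encode are hypothetical stand-ins for the real page, and the assumption that the bytes were decoded as UTF-8 with replacement is only one plausible cause of the U+FFFD characters seen in the question.

```python
# Hypothetical sample text standing in for the page title.
title = "Привет, мир"

# Stand-in for r.content: the raw bytes as a cp1251 server would send them.
raw = title.encode("cp1251")

# Assumption: the bytes get decoded with the wrong codec and errors="replace".
# Every undecodable byte becomes U+FFFD, so the original byte value is lost.
garbled = raw.decode("utf-8", errors="replace")


def can_encode(text: str, codec: str) -> bool:
    """Return True if text can be encoded with codec without errors."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# cp1251 has no mapping for U+FFFD, hence the UnicodeEncodeError in the question.
recoverable = can_encode(garbled, "cp1251")

# Decoding the untouched bytes with the right codec restores the text.
fixed = raw.decode("cp1251")

print(repr(garbled))
print(recoverable)
print(fixed)
```

In the session from the question, raw corresponds to r.content, so r.content.decode('cp1251') plays the role of the last step here. Once the text has been replaced by U+FFFD characters there is nothing left to re-encode, so the fix has to be applied before or instead of the wrong decode.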
