requests and the encoding of the fetched page: how do I fix garbled Russian characters?

Python

wanomgn, 2014-03-20 06:29:37

Dear friends,
I have a very simple program in Python 3.3.5 on Win7 x64.
When I parse lenta.ru, everything works as it should: the link anchor text is shown in Russian letters.
But when I run the same code against da.ru, all the Russian anchor text comes out garbled.
How can I fix this so that Russian characters are displayed correctly in all cases?

import requests
from lxml import html

r = requests.get('http://lenta.ru')
#r = requests.get('http://da.ru')
docHtml = r.text
parsed_body = html.fromstring(docHtml)
for y in parsed_body.xpath("//a"):
    url = y.get("href")
    anchor = y.text
    print(url,anchor)

4 answer(s)

theromis, 2018-02-25
@theromis

I had the same problem. What helped was forcing the encoding on the response returned by `requests.get`:
```
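import requests

# link and headers are assumed to be defined earlier in the script
r = requests.get(link, timeout=60, verify=False, headers=headers)
r.encoding = 'utf-8'  # override the encoding requests inferred from the headers
print(r.text)  # the text is now clean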
```
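A likely reason this helps: when the `Content-Type` header is `text/*` but carries no charset, requests falls back to ISO-8859-1 for `r.text`. Hard-coding `'utf-8'` would in turn break pages served in another encoding such as windows-1251, so here is a minimal sketch that picks the encoding per page. `pick_encoding` is a hypothetical helper written for this answer, not part of requests, and the regular expressions are a rough assumption about how charsets are usually declared.

```python
import re

# Hypothetical helper, not part of requests: choose an encoding from the
# Content-Type header first, then from a <meta> tag, then a default.
_META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def pick_encoding(content_type, body, default="utf-8"):
    # 1. An explicit charset in the HTTP header wins.
    match = re.search(r'charset=["\']?([\w-]+)', content_type, re.IGNORECASE)
    if match:
        return match.group(1).lower()
    # 2. Otherwise look for a charset declared near the top of the HTML.
    match = _META_CHARSET.search(body[:2048])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Fall back to the caller's default.
    return default
```

With requests this could be used as `r.encoding = pick_encoding(r.headers.get('Content-Type', ''), r.content)` before reading `r.text`. Two alternatives that avoid the helper: set `r.encoding = r.apparent_encoding`, or pass the raw bytes `r.content` to `html.fromstring` and let lxml read the charset from the page itself.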

q1t, 2014-03-20
@q1t

Maybe it's a Unicode problem? Try encoding the string explicitly and see what you get:

encoded = original.encode('utf-8')  # original is the anchor text (a str)
print(encoded)

wanomgn, 2014-03-20
@wanomgn

The anchor texts then look like this:
\xc3\x91\xc2\x80\xc3\x90\xc2\xbe\xc3\x91\xc2\x82\xc3\x90\xc2\xbe\xc3\x91\xc2\x82\xc3\x90\xc2\xb8\xc3\x90\xc2\xbf\xc3\x90\xc2\xb8\xc3\x91\xc2\x80\xc3\x90\xc2\xbe\xc3\x90\xc2\xb2\xc3\x90\xc2\xb0\xc3\x90\xc2\xbd\xc3\x90\xc2\xb8\xc3\x90\xc2\xb5 \xc3\x91\xc2\x81\xc3\x90\xc2\xb0\xc3\x90\xc2\xb9\xc3\x91\xc2\x82\xc3\x90\xc2\xb0'
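These bytes are consistent with UTF-8 text that was decoded as ISO-8859-1 (Latin-1) and then encoded as UTF-8 a second time. A small sketch that reproduces the effect with the standard library only; the sample word is what the last group of bytes above decodes to, and the variable names are chosen just for illustration:

```python
# Sample text: the last word of the output above, once decoded correctly.
original = "сайта"

raw = original.encode("utf-8")            # bytes as a UTF-8 server would send them
garbled = raw.decode("latin-1")           # what you get if ISO-8859-1 is assumed
double_encoded = garbled.encode("utf-8")  # what encoding the garbled str produces

# Undo the wrong decoding step: back to the raw bytes, then decode as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")

print(double_encoded)
print(repaired)
```

So encoding the string again does not fix anything. The wrong decoding has to be undone, or better, avoided by setting the right encoding on the response before reading `r.text`.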