Problem with encoding when parsing a Russian site?

Python

Fantinum, 2018-08-14 18:19:31

There is a problem with the encoding when parsing the site https://beton24.ru/sochi/beton/

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://beton24.ru/sochi/beton/')
bs = BeautifulSoup(html.read())
result = bs.findAll("span", "catalog-index__link-text")[1]
parse = str(result)

To pull out the price of concrete, I convert result to str, and the text comes out as 'от\xa03\u2009836\xa0₽'.
Has anyone run into this, and how can it be solved? Thank you!

1 answer(s)

igorzakhar, 2018-08-14
@Fantinum

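Look at the HTML in, for example, Chrome DevTools: the spaces inside the price are special characters, not ordinary spaces.
According to the BeautifulSoup 4 documentation (section "Entities"), incoming HTML entities are converted to the corresponding Unicode characters, so '\xa0' (no-break space) and '\u2009' (thin space) are real characters in the text rather than an encoding error. Replace them explicitly: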

>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> html = urlopen('https://beton24.ru/sochi/beton/')
>>> bs = BeautifulSoup(html.read(), 'lxml')
>>> result = bs.findAll("span", "catalog-index__link-text")[1]
>>> result.text.replace(u'\xa0',' ').replace(u'\u2009', '')
'от 3836 ₽'
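
If the goal is the number itself rather than a cleaned-up string, the same idea can be taken one step further. The following is only a minimal sketch: the helper name parse_price is hypothetical (it is not part of BeautifulSoup), and it assumes the price is the first run of digits in the text once all whitespace has been removed. It works on the string from the question, so it needs no network access.

```python
import re


def parse_price(text):
    """Return the first integer in text, ignoring whitespace between digits.

    Hypothetical helper for illustration; assumes the price is the first
    run of digits once every whitespace character has been removed.
    """
    # For str patterns, \s matches Unicode whitespace, which includes
    # the no-break space (\xa0) and the thin space (\u2009).
    compact = re.sub(r"\s+", "", text)
    match = re.search(r"\d+", compact)
    if match is None:
        raise ValueError("no digits found in {!r}".format(text))
    return int(match.group())


# The text of the span as reported in the question.
raw = "от\xa03\u2009836\xa0₽"
price = parse_price(raw)
print(price)
```

In the session above, the same helper could be called as parse_price(result.text). Removing all whitespace with a regular expression avoids listing each special space by hand, at the cost of also merging digits that were meant to be separate numbers, so it only suits strings that contain a single price.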