How to parse utf-8 in lxml?

Python

Ilya, 2016-02-18 14:19:06

According to the lxml documentation (and a related question on SO), the input to lxml.html.fromstring() should be an undecoded byte string, because lxml tries to determine the encoding itself. Otherwise, if the already decoded string contains an encoding declaration, it raises this error:

ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

I did as recommended, but ran into a problem: some characters are decoded incorrectly. Yet everything works if I decode the bytes manually first:

>>> import lxml.html
>>> bs = b'Hyv\xc3\xa4 juoni!'
>>> lxml.html.fromstring(bs).text
'HyvÃ¤ juoni!'
>>> lxml.html.fromstring(bs.decode()).text
'Hyvä juoni!'

So the question is: how do I make lxml decode UTF-8 correctly?
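
For illustration, the garbled output above is what you get when UTF-8 bytes are decoded with a single-byte encoding such as Latin-1 or Windows-1252. The sketch below uses only the standard library (no lxml involved) and assumes cp1252 as the wrong codec; the variable names are illustrative.

```python
bs = b'Hyv\xc3\xa4 juoni!'

# Decoding UTF-8 bytes with a single-byte codec turns the two-byte
# sequence for "ä" (0xC3 0xA4) into two separate characters.
wrong = bs.decode('cp1252')
right = bs.decode('utf-8')

print(wrong)
print(right)

# The damage is reversible here because no bytes were lost:
# re-encode with the wrong codec, then decode as UTF-8.
repaired = wrong.encode('cp1252').decode('utf-8')
print(repaired == right)
```

So the bytes themselves are fine; only the encoding guess is wrong.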

2 answers

Ilya, 2016-02-18
@766dt

It turned out the problem was the strange behavior of chardet (cchardet here) on short input:

>>> import cchardet
>>> cchardet.detect('Hyvä juoni'.encode())
{'confidence': 0.8032709360122681, 'encoding': 'WINDOWS-1252'}
>>> cchardet.detect('Hyv juoni'.encode())
{'confidence': 0.0, 'encoding': 'ASCII'}
>>> cchardet.detect('ä'.encode())
{'confidence': 0.5049999952316284, 'encoding': 'UTF-8'}

So far, the best I have come up with is to set the content encoding manually when it is known.
But if the encoding is unknown and chardet also detects it incorrectly, I don't see a solution yet.
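
One way to set the encoding manually without decoding first is to pass lxml a parser created with an explicit encoding. This is a minimal sketch, assuming the content is known to be UTF-8; the sample bytes and variable names are illustrative.

```python
import lxml.html

raw = b'<div>Hyv\xc3\xa4 juoni!</div>'

# Tell the parser which encoding the bytes use instead of letting it guess.
parser = lxml.html.HTMLParser(encoding='utf-8')
doc = lxml.html.fromstring(raw, parser=parser)

print(doc.text)
```

The input stays as bytes, so the ValueError about encoding declarations in Unicode strings does not come into play.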

abcd0x00, 2016-02-20
@abcd0x00

If the encoding is not declared, how is lxml supposed to know that it is UTF-8?
Decode the bytes before passing them in.

>>> import lxml.html
>>> 
>>> s = b'<div>Hyv\xc3\xa4 juoni!</div>'.decode('utf-8')
>>> 
>>> doc = lxml.html.fromstring(s)
>>> doc
<Element div at 0xb744be3c>
>>> doc.text
'Hyvä juoni!'
>>>
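
When the encoding is not known in advance, one possible heuristic is to try a strict UTF-8 decode first and only fall back to another codec if that fails, since text in a single-byte encoding rarely happens to be valid UTF-8. The helper name decode_html_bytes and the cp1252 fallback are assumptions for illustration, not part of lxml.

```python
def decode_html_bytes(data, fallback='cp1252'):
    """Decode as UTF-8 if the bytes are valid UTF-8, else use the fallback codec."""
    try:
        # bytes.decode() is strict by default and raises on invalid sequences.
        return data.decode('utf-8')
    except UnicodeDecodeError:
        # Hypothetical fallback; replace undecodable bytes rather than fail.
        return data.decode(fallback, errors='replace')


utf8_text = decode_html_bytes(b'<div>Hyv\xc3\xa4 juoni!</div>')
legacy_text = decode_html_bytes(b'<div>Hyv\xe4 juoni!</div>')

print(utf8_text)
print(legacy_text)
```

The resulting str can then be passed to lxml.html.fromstring() as shown above. A string that still carries an encoding declaration will raise the ValueError quoted in the question.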