How to parse utf-8 in lxml?
According to the lxml documentation (and a question on SO), lxml.html.fromstring() should be given undecoded bytes, because lxml tries to determine the encoding itself. Otherwise, if an already decoded string contains an encoding declaration, it raises this error:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
>>> bs = b'Hyv\xc3\xa4 juoni!'
>>> lxml.html.fromstring(bs).text
'HyvÃ¤ juoni!'
>>> lxml.html.fromstring(bs.decode()).text
'Hyvä juoni!'
In the end, the problem turned out to be the strange behavior of cchardet:
>>> cchardet.detect('Hyvä juoni'.encode())
{'confidence': 0.8032709360122681, 'encoding': 'WINDOWS-1252'}
>>> cchardet.detect('Hyv juoni'.encode())
{'confidence': 0.0, 'encoding': 'ASCII'}
>>> cchardet.detect('ä'.encode())
{'confidence': 0.5049999952316284, 'encoding': 'UTF-8'}
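Since statistical detection is shaky on short inputs like these, one option is to skip detection and simply try strict UTF-8 first. Below is a minimal sketch using only the standard library; `decode_html` and its `fallback` parameter are hypothetical names for illustration, and the cp1252 fallback is an assumption about what legacy pages are likely to use.

```python
def decode_html(raw, fallback="cp1252"):
    """Hypothetical helper: try strict UTF-8 first, then a fallback encoding."""
    try:
        # Strict decoding raises UnicodeDecodeError on invalid UTF-8 sequences.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" avoids a second failure on bytes the fallback does not define.
        return raw.decode(fallback, errors="replace")


# Valid UTF-8 input is decoded as UTF-8.
text = decode_html(b"Hyv\xc3\xa4 juoni")

# A lone 0xE4 byte is not valid UTF-8, so the fallback encoding is used.
legacy = decode_html(b"Hyv\xe4 juoni")
```

Both calls return the same string. Text in a legacy encoding rarely happens to be valid UTF-8, so a strict attempt tends to be a safer first guess than a confidence score, though it is not a guarantee.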
If the encoding is not declared, how is lxml supposed to know that the input is UTF-8?
Decode the bytes before passing them to lxml.
>>> import lxml.html
>>>
>>> s = b'<div>Hyv\xc3\xa4 juoni!</div>'.decode('utf-8')
>>>
>>> doc = lxml.html.fromstring(s)
>>> doc
<Element div at 0xb744be3c>
>>> doc.text
'Hyvä juoni!'
>>>
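To see why undeclared bytes can come out garbled, here is a small sketch that uses only `bytes.decode`, not lxml. It assumes the parser falls back to a single-byte encoding such as Latin-1 when nothing is declared. That is an assumption about the fallback, not something lxml documents in these terms.

```python
raw = b'<div>Hyv\xc3\xa4 juoni!</div>'

# Reading each byte as its own character (Latin-1) splits the two-byte UTF-8 "ä".
misread = raw.decode('latin-1')

# Decoding as UTF-8 first yields the intended text, ready to hand to the parser.
decoded = raw.decode('utf-8')
```

Here `misread` contains 'HyvÃ¤ juoni!' while `decoded` contains 'Hyvä juoni!'. If you would rather keep passing bytes, creating the parser as lxml.html.HTMLParser(encoding='utf-8') and passing it through the parser= argument of fromstring() should have the same effect.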