Encoding Chinese characters when parsing?

Python

gowa66, 2016-06-14 15:54:39

I am writing a parser for a Chinese online store.

from urllib.request import urlopen
from urllib.parse import urljoin
from lxml.html import fromstring

URL = 'http://list.suning.com/0-258003-0.html'
ITEM_PATH = '.clearfix .product .border-out .border-in .wrap .res-info .sell-point'

def parse_items():
    f = urlopen(URL)
    list_html = f.read().decode('utf-8')
    list_doc = fromstring(list_html)
    for elem in list_doc.cssselect(ITEM_PATH):
        a = elem.cssselect('a')[0]
        href = a.get('href')
        title = a.text
        em = elem.cssselect('em')[0]
        title2 = em.text
        print(href, title, title2)

def main():
    parse_items()

if __name__ == '__main__':
    main()

I get this error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

Can anyone explain what is going wrong with the encoding?

1 answer(s)

Fixid, 2016-06-14
@gowa66

Switch to Python 3, where all strings are Unicode by default. If you stay on Python 2, you can't rely on str for non-ASCII text; there are other ways of handling strings there, and they are well documented online.
Try it without the decode call.
Also, show the type of the object that causes the error.
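
One more thing worth checking, offered as a guess rather than a diagnosis: the script imports urllib.request, so it is already running on Python 3, and a UnicodeEncodeError that names the 'ascii' codec is then often raised by print() when standard output is configured as ASCII (for example with LANG=C). The sketch below reproduces that situation without touching the real stdout. The helper name write_text and the sample string are made up for illustration.

```python
import io
import sys


def write_text(text, encoding):
    """Write text through a text stream with the given encoding.

    Returns the bytes that reached the underlying binary stream, which is
    roughly what print() hands to a terminal using that encoding.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline='\n')
    stream.write(text)  # raises UnicodeEncodeError if the codec cannot encode it
    stream.flush()
    return raw.getvalue()


if __name__ == '__main__':
    sample = '苏宁易购'  # hypothetical product title, not taken from the site

    # The type of the object and the encoding print() will use.
    print(type(sample).__name__, ascii(sample))
    print('stdout encoding:', sys.stdout.encoding)

    # A UTF-8 stream accepts the text.
    print(write_text(sample, 'utf-8'))

    # An ASCII stream fails in the same way as the traceback in the question.
    try:
        write_text(sample, 'ascii')
    except UnicodeEncodeError as exc:
        print('ascii stream failed:', ascii(str(exc)))
```

If sys.stdout.encoding turns out to be an ASCII variant, setting the environment variable PYTHONIOENCODING=utf-8 before running the script, or switching the shell to a UTF-8 locale, should let print() output the titles.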