Python problem?

avonar, 2011-05-03 21:02:30

Python

pastebin.com/vjn4QeKv
Why doesn't this piece of code work?
I need a unicode string, but it throws an error.

4 answer(s)

marazmiki, 2011-05-03
@marazmiki

Windows, huh? Try

print link_text.encode('UTF-8')

buriy, 2011-05-04
@buriy

On Windows (including Win7):

>>> import sys
>>> print sys.stdin.encoding
cp866
>>> print sys.stdout.encoding
cp866

This is the encoding used by the Windows cmd console.
So

print link_text.encode('cp866','replace')

will print the Russian text in the cp866 console, replacing any Unicode characters that cp866 lacks with a question mark ("?").
When you print a unicode object directly, the same conversion happens implicitly, but in strict mode (without replacement), so it raises an error as soon as it hits a character that cannot be represented in cp866.
How do you find those characters?

>>> t = link_text.encode('cp866', 'replace').decode('cp866')
>>> for i in xrange(len(t)):
...     if link_text[i] != t[i]: link_text[i]
...
u'\xea'
u'\xab'
u'\xbb'
u'\xea'
u'\xea'
u'\xea'
u'\xea'
u'\xea'
u'\xea'
u'\xea'
u'\xea'
u'\xea'
u'\u2014'
>>> import htmlentitydefs
>>> for i in xrange(len(t)):
...     if link_text[i] != t[i]: htmlentitydefs.codepoint2name[ord(link_text[i])]
...
'ecirc'
'laquo'
'raquo'
'ecirc'
'ecirc'
'ecirc'
'ecirc'
'ecirc'
'ecirc'
'ecirc'
'ecirc'
'ecirc'
'mdash'

In short, the culprits are the usual ones: characters that came from HTML entities.
Similar problems can occur on Linux when using a non-utf8 console.
For example:
>>> import sys
>>> sys.stdin.encoding
'KOI8-R'
>>> sys.stdout.encoding
'KOI8-R'
>>> e='привет'
>>> e
'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'
>>> print e
привет
>>> e.decode('koi8-r')
u'\u043f\xa9\u044f\u2500\u043f\u2566\u043f\u2561\u043f\u2563\u044f\u250c'
>>> print e.decode('koi8-r')

As you can see, when outputting unicode, print converts it to the console encoding, and when outputting non-unicode, print prints the bytes “as is”.
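
A minimal sketch of that implicit conversion, and of how to keep it from raising. It is written in Python 3 syntax, where text is str and encoded data is bytes. `to_console_bytes` is a hypothetical helper, and cp866 is assumed as the console encoding, as in the Windows session earlier in this answer.

```python
def to_console_bytes(text, encoding="cp866", errors="replace"):
    """Encode text for a console encoding with an explicit error policy."""
    return text.encode(encoding, errors)


text = "Привет \u2014 мир"

# Strict encoding is what happens implicitly; it fails on U+2014.
try:
    text.encode("cp866")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'replace' substitutes "?" for anything the console encoding lacks.
replaced = to_console_bytes(text).decode("cp866")

# 'backslashreplace' keeps the code point visible as an escape sequence.
escaped = to_console_bytes(text, errors="backslashreplace").decode("cp866")

print(strict_failed)
print(replaced)
print(escaped)
```

Which error policy to pick is a judgment call: "replace" keeps the output short, while "backslashreplace" shows exactly which code points were lost.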

Fak3, 2011-05-03
@Fak3

Did you forget import urllib?

import urllib
link = 'http://www.barcelona-nsk.ru/catalog/mebel/jacob-delafone/reve/mebel-pod-rakovinu-117x43,5x37sm-reve'
link_text = unicode(''.join(urllib.urlopen(link).readlines()), 'utf-8')
print link_text

Sergey, 2011-05-03
@seriyPS

What wonderful code :)
Why join the output of readlines() when you can just call read().replace('\n', '')?
I would write something like this:

import urllib
link='http://www.barcelona-nsk.ru/catalog/mebel/jacob-delafone/reve/mebel-pod-rakovinu-117x43,5x37sm-reve'
body=urllib.urlopen(link).read().replace('\n', '').decode('utf8')

Though that may be a matter of taste...
Otherwise, the advice you were given is correct; see habrahabr.ru/blogs/python/117236/
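
For anyone reading this on Python 3, here is a rough equivalent of the snippet above. It assumes the page is served as UTF-8, as in the original code; `body_to_text` and `fetch_text` are names invented for this sketch. In Python 3, urllib.urlopen lives in urllib.request, and read() returns bytes that must be decoded explicitly.

```python
import urllib.request


def body_to_text(raw, encoding="utf-8"):
    """Decode raw response bytes to text and drop newlines, as in the snippet above."""
    return raw.decode(encoding).replace("\n", "")


def fetch_text(url, encoding="utf-8"):
    """Download url and return its body as text. Needs network access."""
    with urllib.request.urlopen(url) as response:
        return body_to_text(response.read(), encoding)


# Offline demonstration with hand-made UTF-8 bytes instead of a real download.
sample = "при\nвет".encode("utf-8")
print(body_to_text(sample))
```

Keeping the decoding in its own function makes it possible to try it on sample bytes without touching the network.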