How to encode a string with characters from different encodings?

Python

Oleg Pravdin, 2015-11-05 13:08:30

>>> a='привет, '.encode('utf-8')
>>> b='мир!'.encode('cp1251')
>>> c=a+b
>>> c
b'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82, \xec\xe8\xf0!'

How can I losslessly convert these mixed bytes into a single UTF-8 string?

2 answer(s)

Oleg Pravdin, 2015-11-05
@opravdin

I solved the problem with a hack: in my case the only conflict came from guillemets (the « » quotes, known in Russian as "Christmas trees"), so I check whether each \xab or \xbb byte is the second byte of the letter Ы (\xd0\xab) or л (\xd0\xbb). If it is not, I replace it with a space.

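def clean_guillemets(rawtext):  # function name is illustrative
  # Replace stray cp1251 guillemets (0xAB «, 0xBB ») with spaces, then decode as UTF-8.
  text = bytearray()
  for i, byte in enumerate(rawtext):
    # In this data, 0xAB/0xBB are expected only after 0xD0 (Ы = D0 AB, л = D0 BB).
    after_d0 = i > 0 and rawtext[i - 1] == 0xD0  # i > 0 avoids wrapping to rawtext[-1]
    if byte in (0xAB, 0xBB) and not after_d0:
      text.append(0x20)  # space
    else:
      text.append(byte)
  return text.decode('utf-8', 'ignore')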
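
A more general variant of the same idea, as a rough sketch rather than a definitive fix: let the UTF-8 decoder do the work and fall back to cp1251 only for the bytes it rejects, via `codecs.register_error`. The handler name `cp1251_fallback` is made up for this example. This is a heuristic and not truly lossless: a cp1251 byte pair that happens to form valid UTF-8 (for example \xd0\xbb) will still be read as UTF-8.

```python
import codecs

def cp1251_fallback(err):
    # Called by the decoder for each byte run it cannot read as UTF-8.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Reinterpret only the rejected bytes as cp1251, then resume after them.
    return bad.decode('cp1251', 'replace'), err.end

# 'cp1251_fallback' is a hypothetical name chosen for this sketch.
codecs.register_error('cp1251_fallback', cp1251_fallback)

raw = 'привет, '.encode('utf-8') + 'мир!'.encode('cp1251')
text = raw.decode('utf-8', 'cp1251_fallback')
print(text)  # привет, мир!
```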

Slava Kryvel, 2015-11-05
@kryvel

May I ask why you need this?
Either way, this string will not be displayed correctly anywhere, because most software uses a single encoding table for all of the content.
If there is a good reason for it, then keep the data in binary form and do not glue the pieces together as strings.
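
One way to read that advice, as a minimal sketch: keep every piece as bytes together with the encoding it came in, and decode each piece on its own before joining. The `chunks` list and its (bytes, encoding) layout are assumptions made for illustration, not something from the question.

```python
# Each chunk stays as bytes, paired with the encoding it was produced in.
chunks = [
    ('привет, '.encode('utf-8'), 'utf-8'),
    ('мир!'.encode('cp1251'), 'cp1251'),
]

# Decode every chunk with its own encoding, then join the resulting str objects.
text = ''.join(data.decode(encoding) for data, encoding in chunks)

# Only now re-encode the whole string in one encoding.
utf8_bytes = text.encode('utf-8')
print(text)  # привет, мир!
```

Because the encoding of each piece is never lost, no guessing is needed and the conversion is exact.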