Python
akelsey, 2020-01-02 22:45:08

How to remove line breaks from a Unicode string?

The standard methods do not remove them completely.

In short: I fetch a document with a certain structure through the elasticsearch library and need to reshape it. That is, these strings are already stored in Elasticsearch, so I expected not to have to fiddle with them. But for some strange, even incomprehensible reason the standard library does not escape quotes, and arrays in the JSON come back with apostrophes instead of double-quoted values. For example:

{
"lastname": "Иванов",
"education": [
'пту №1',
'университет патрисы лумумбы'
],
"hobbies": "Люблю вышивать "крестиком" и 
вязать на спицах"
}
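
One possible cause, assuming the document arrives as a Python dict and is turned into text with str(): str() produces Python's repr, which uses apostrophes and True/False/None, whereas json.dumps produces valid JSON with quotes and line breaks escaped. A minimal sketch of the difference; the `doc` value is made up to resemble the example above and is not taken from a real index:

```python
import json

# Hypothetical document, shaped like the example above.
doc = {
    "lastname": "Иванов",
    "education": ["пту №1", "университет патрисы лумумбы"],
    "hobbies": 'Люблю вышивать "крестиком" и \r\nвязать на спицах',
}

as_repr = str(doc)                             # Python repr: apostrophes, not JSON
as_json = json.dumps(doc, ensure_ascii=False)  # valid JSON: quotes and \r\n escaped

print(as_repr)
print(as_json)
```

If the output of str() looks like the broken example, the quotes were never wrong inside Elasticsearch; they only changed on the way out.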


I have partially dealt with the quotes
by writing a crude function (warning: it has a side effect, and perfectionists may find it painful to look at):
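import re

def filter(mystr) -> str:
    # If mystr is a dict, str() yields Python's repr (apostrophes, True/None), not JSON.
    mystr = str(mystr)
    # Real control characters: CRLF first, then lone CR and LF.
    mystr = mystr.replace('\r\n', ' ')
    mystr = mystr.replace('\r', ' ')
    mystr = mystr.replace('\n', ' ')
    # Literal backslash sequences: the longer one first, otherwise it never matches.
    mystr = mystr.replace('\\r\\n', ' ')
    mystr = mystr.replace('\\r', ' ')
    mystr = mystr.replace('\\n', ' ')
    # Side effect: every quote, including the ones just converted, gets escaped.
    mystr = mystr.replace('\'', '"')
    mystr = mystr.replace('"', '\\"')
    mystr = mystr.replace('True', 'true')
    mystr = mystr.replace('False', 'false')
    mystr = mystr.replace('None', 'null')
    # re.sub returns a new string, so the result has to be assigned.
    mystr = re.sub(r'^\s+|\n|\r|\s+$', '', mystr)
    return mystr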


It partially works, but out of 100 documents, the line breaks are still left in 5.
I looked at a hex dump of a string with a line break (between "и" and "в"):
[screenshot of the hex dump]
I can see 0x0D 0x0A, which the function should have removed, but they are still there. If I'm not mistaken, in Unicode they should be 0x000D and 0x000A (the letters themselves are encoded correctly, as 0xD0B8 and 0xD0B2). Or is that normal for UTF-8?
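
For reference, U+000D and U+000A each take a single byte (0x0D, 0x0A) in UTF-8, while Cyrillic letters take two bytes, so a dump like the one described is what valid UTF-8 looks like. A small check; the sample fragment is an assumption based on the example document:

```python
# Fragment around the line break; an assumption based on the example document.
fragment = "и \r\nв"

encoded = fragment.encode("utf-8")
print(encoded.hex(" "))  # d0 b8 20 0d 0a d0 b2

# For comparison: two-byte code units such as 0x000D appear in UTF-16, not UTF-8.
print(fragment.encode("utf-16-be").hex(" "))
```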
Is there a standard method, function or library for feeding JSON into Elasticsearch in bulk that recursively walks all keys and values and gets everything right?
Thank you.

1 answer(s)
teenager_python, 2020-01-03

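With a regular expression:
import re

mystr = " balabla\n zzz "

# re.sub returns a new string, so keep the result
cleaned = re.sub(r"^\s+|\n|\r|\s+$", "", mystr)
print(cleaned)  # balabla zzz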
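
As for walking every key and value recursively, here is a minimal sketch, assuming the documents are ordinary Python dicts and lists and that collapsing each line break into a single space is acceptable. The names clean_value and to_bulk_body and the index name "people" are hypothetical. json.dumps handles all quoting, and the body follows the newline-delimited layout of the Elasticsearch bulk API (an action line, then the document line).

```python
import json
import re

# A run of CR/LF characters together with any whitespace around it.
_LINE_BREAKS = re.compile(r"\s*[\r\n]+\s*")


def clean_value(value):
    """Recursively replace line breaks in every string inside dicts and lists."""
    if isinstance(value, str):
        return _LINE_BREAKS.sub(" ", value).strip()
    if isinstance(value, dict):
        return {key: clean_value(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [clean_value(item) for item in value]
    return value  # numbers, booleans and None are left untouched


def to_bulk_body(docs, index_name):
    """Build a bulk body: one action line plus one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(clean_value(doc), ensure_ascii=False))
    return "\n".join(lines) + "\n"  # the bulk format expects a trailing newline


docs = [{"hobbies": 'Люблю вышивать "крестиком" и \r\nвязать на спицах', "active": True}]
print(to_bulk_body(docs, "people"))
```

If the official Python client is installed, its elasticsearch.helpers.bulk helper should accept the cleaned dicts as actions and serialize them itself, so no manual quote handling would be needed in that case either.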
