Alexander, 2020-01-23 22:50:11
linux

How do I extract Cyrillic text from an RTF file using Python or Linux?

Good evening. I'm trying to extract Russian text from an RTF file by running the unrtf utility from Python:

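from subprocess import PIPE, Popen


def rtf_file_to_text(path: str) -> str:
    """
        Returns the text of an RTF document
    """

    cmd = ['unrtf', path]
    # Only stdout is piped, so the second value returned by communicate() is None
    p = Popen(cmd, stdout=PIPE)
    stdout, _ = p.communicate()
    text = stdout.decode('utf-8')
    return text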

This is the text I get back:
<b><font face="Times New Roman"><font size="4">&#1054;&#1073;&#1086;&#1089;&#1085;&#1086;&#1074;&#1072;&#1085;&#1080;&#1077; &#1085;&#1072;&#1095;&#1072;&#1083;&#1100;&#1085;&#1086;&#1081; (&#1084;&#1072;&#1082;&#1089;&#1080;&#1084;&#1072;&#1083;&#1100;&#1085;&#1086;&#1081;) &#1094;&#1077;&#1085;&#1099; </font></font></b>&#1082;&#1086;&#1085;&#1090;&#1088;&#1072;&#1082;&#1090;&#1072;

Every character comes out like this, as an HTML numeric character reference. How can I get the text as real characters in the encoding I need? I have tried several Python libraries, but they give the same result. Is there some other Linux utility that can pull the text out?


2 answer(s)
Alexander, 2020-01-25
@AlexMine

I found a solution that works well enough in my case. I installed LibreOffice on the server and ran the conversion from Python:

import os

# The original snippet opened the string with ' and closed it with ", which is a syntax error
os.system('lowriter --headless --convert-to txt file.rtf')

That converts the document to a txt file, from which I then read the full text.
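
The same approach can be sketched with subprocess instead of os.system, so that a failed conversion raises an error rather than passing silently. This is only a sketch under assumptions: the function names and the out_dir parameter are made up for illustration, and the "txt:Text (encoded):UTF8" filter string is my assumption for forcing UTF-8 output (the answer above uses plain "txt"). It also assumes lowriter is on the PATH, as in the answer.

```python
import subprocess
from pathlib import Path


def build_convert_command(rtf_path: str, out_dir: str) -> list:
    """Build the LibreOffice command line (hypothetical helper)."""
    return [
        "lowriter",
        "--headless",
        # Assumption: this filter string asks for UTF-8 encoded text output
        "--convert-to", "txt:Text (encoded):UTF8",
        "--outdir", out_dir,
        rtf_path,
    ]


def rtf_to_text_via_libreoffice(rtf_path: str, out_dir: str) -> str:
    """Convert an RTF file to text with LibreOffice and return its contents."""
    # check=True raises CalledProcessError if the converter exits non-zero
    subprocess.run(build_convert_command(rtf_path, out_dir), check=True)
    # LibreOffice names the output after the input file, with a .txt suffix
    txt_path = Path(out_dir) / (Path(rtf_path).stem + ".txt")
    # utf-8-sig drops a leading byte order mark if the converter wrote one
    return txt_path.read_text(encoding="utf-8-sig")
```

Calling rtf_to_text_via_libreoffice('file.rtf', '/tmp/out') would then return the text directly. Passing the arguments as a list also avoids shell quoting problems with file names that contain spaces.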

AUser0, 2020-01-24
@AUser0

I think something like this should work:

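try:
     # Python 3.4+ (HTMLParser.unescape was removed in Python 3.9)
     from html import unescape
except ImportError:
     # Python 2.6-2.7
     from HTMLParser import HTMLParser
     unescape = HTMLParser().unescape
# Decodes references such as &#1054; into the characters they stand for
text = unescape(text)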

P.S. Bear in mind that I know next to nothing about Python.
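
To make that idea concrete, here is a minimal sketch for Python 3 that also drops the HTML tags unrtf emits, using a shortened piece of the output from the question. The helper name html_to_plain_text is made up for illustration, and the regex is a crude tag stripper rather than a real HTML parser, so treat it as a starting point only.

```python
import html
import re

# Matches anything that looks like an HTML tag, e.g. <b> or <font size="4">
TAG_RE = re.compile(r"<[^>]+>")


def html_to_plain_text(markup: str) -> str:
    """Strip tags, then decode character references (hypothetical helper)."""
    # Tags are removed first so that decoded &lt; or &gt; are not mistaken for tags
    without_tags = TAG_RE.sub("", markup)
    return html.unescape(without_tags)


sample = (
    '<b><font size="4">&#1094;&#1077;&#1085;&#1099; </font></b>'
    '&#1082;&#1086;&#1085;&#1090;&#1088;&#1072;&#1082;&#1090;&#1072;'
)
print(html_to_plain_text(sample))  # prints: цены контракта
```

In the function from the question, the last line would become return html_to_plain_text(text).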
