How do I extract Cyrillic text from a PDF in Python 3?

Python

domanskiy, 2021-06-28 15:49:54

I can't get a clean text export from a PDF.
I'm using the PyPDF2 library. There are no problems with English text,
but Cyrillic comes out like this:

˛˚˛
ˇ˛˝ˇ©˚ˇ˘˛™‰˚˘”˛˙˛˚˛‘˙˘˛ˆ˚‡
˛à˛‰
’˙˛”˛˚˚˘”Ł˛
˛˚‰
˛˚˚ˇ˛‰•˘˛ˇ˛’˚‰‰˘•˛˛˚ˇ˛ˇ‰•Ł˛˘›
˛
Ł˛¨˘˚˛
˛ﬂ˛
˛–˛
˛ﬂ•˛

The reading code itself:

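from PyPDF2 import PdfFileReader

pdf_file = 'test.pdf'

# Open the PDF in binary mode; the context manager closes the file afterwards
with open(pdf_file, 'rb') as pl:
    plread = PdfFileReader(pl)
    getpage37 = plread.getPage(37)  # page indices are zero-based
    text37 = getpage37.extractText()

# extractText() already returns str, so the encode/decode round trip changes nothing
print(text37.encode('utf-8').decode('utf-8'))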

I have tried different encodings, but that did not help.


2 answer(s)

Vindicar, 2021-06-28
@domanskiy

Try pdfplumber; it has worked for me without problems.

import pdfplumber

srcfile = 'test.pdf'
with pdfplumber.open(srcfile) as pdf:
    pages = [page.extract_text() or '' for page in pdf.pages]  # '' for pages with no text
text = '\n'.join(pages)
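Whichever library is used, it can help to check whether the extracted text is really Cyrillic before trusting it. Below is a minimal sketch of such a check using only the standard library. The names `cyrillic_ratio` and `looks_extracted` and the 0.5 threshold are my own assumptions for illustration, not part of pdfplumber or PyPDF2.

```python
def cyrillic_ratio(text: str) -> float:
    """Return the share of alphabetic characters that lie in the Cyrillic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # U+0400..U+04FF is the main Unicode Cyrillic block.
    cyrillic = sum(1 for ch in letters if '\u0400' <= ch <= '\u04FF')
    return cyrillic / len(letters)


def looks_extracted(text: str, threshold: float = 0.5) -> bool:
    """Heuristic (assumed threshold): treat text as valid if most letters are Cyrillic."""
    return cyrillic_ratio(text) >= threshold
```

Applied to each per-page string returned by extract_text(), a low ratio on a page that should be in Russian suggests the text layer could not be decoded, and that page may need a different approach such as OCR.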

Coder 1448, 2021-06-29
@wows15

I had a similar task: parsing a PDF. I tried all the libraries and none of them worked. I also did not know which font the document used, and the pages had a very complex structure.
I ended up using pytesseract and opencv. Not perfect and not fast, but it worked.
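
For reference, an OCR fallback along these lines might look roughly like the sketch below. This is not the code from the answer above. It assumes pdf2image (which needs Poppler) to render the page, plus a Tesseract install with the Russian language data. The function name `ocr_pdf_page`, the 300 dpi default and the Otsu binarization step are illustrative choices.

```python
def ocr_pdf_page(pdf_path: str, page_number: int, dpi: int = 300, lang: str = "rus") -> str:
    """OCR a single page (1-based number) of a PDF and return the recognized text."""
    # Imported here so the module still loads where the OCR dependencies are missing.
    import cv2
    import numpy as np
    import pytesseract
    from pdf2image import convert_from_path

    # Render only the requested page to a PIL image (assumption: pdf2image + Poppler).
    images = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
    )
    if not images:
        raise ValueError(f"page {page_number} not found in {pdf_path}")

    # PIL images are RGB; convert to grayscale and binarize with Otsu's threshold.
    gray = cv2.cvtColor(np.array(images[0].convert("RGB")), cv2.COLOR_RGB2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Needs the Tesseract binary and the "rus" traineddata to be installed.
    return pytesseract.image_to_string(binary, lang=lang)
```

Something like ocr_pdf_page('test.pdf', 38) would then return the recognized text of page 38. A higher dpi usually helps with small print, at the cost of speed.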