Python
domanskiy, 2021-06-28 15:49:54

How do I extract Cyrillic text from a PDF in Python 3?

I can't get a clean text export from a PDF.
I use the PyPDF2 library. There are no problems with English text,
but Cyrillic text comes out like this:

˛˚˛
ˇ˛˝ˇ©˚ˇ˘˛™‰˚˘”˛˙˛˚˛‘˙˘˛ˆ˚‡
˛à˛‰
’˙˛”˛˚˚˘”Ł˛
˛˚‰
˛˚˚ˇ˛‰•˘˛ˇ˛’˚‰‰˘•˛˛˚ˇ˛ˇ‰•Ł˛˘›
˛
Ł˛¨˘˚˛
˛fl˛
˛–˛
˛fl•˛


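The code I use to read the file:
from PyPDF2 import PdfFileReader

pdf_file = 'test.pdf'

# Keep the file open while the page is being read
with open(pdf_file, 'rb') as pl:
    plread = PdfFileReader(pl)
    getpage37 = plread.getPage(37)  # page indices are zero-based
    text37 = getpage37.extractText()

# The encode/decode round trip returns the same string unchanged
print(text37.encode('utf-8').decode('utf-8'))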


I have tried different encodings, without success.

2 answers
Vindicar, 2021-06-28
@domanskiy

Try pdfplumber; it has worked for me without problems.

import pdfplumber

# srcfile: path to the PDF (or an already opened binary file object)
with pdfplumber.open(srcfile) as pdf:
  pages = [page.extract_text() or '' for page in pdf.pages]
text = '\n'.join(pages)
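
To check whether a given extractor really returns Cyrillic rather than symbol soup like in the question, something along these lines might help. This is only a rough sketch: `cyrillic_ratio` and `extract_pdf_text` are hypothetical helper names, not part of pdfplumber, and a low ratio is just a hint that extraction failed, not proof.

```python
def cyrillic_ratio(text: str) -> float:
    """Share of alphabetic characters that lie in the basic Cyrillic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for ch in letters if '\u0400' <= ch <= '\u04FF')
    return cyrillic / len(letters)


def extract_pdf_text(path: str) -> str:
    """Join the text of all pages; pages without text add an empty string."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(path) as pdf:
        return '\n'.join(page.extract_text() or '' for page in pdf.pages)


# Example use (assumes a file named test.pdf exists):
#   text = extract_pdf_text('test.pdf')
#   print(f'Cyrillic share: {cyrillic_ratio(text):.0%}')
```

For a Russian document the share should be close to 100%. If it stays near zero, the PDF probably needs a different tool.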

Coder 1448, 2021-06-29
@wows15

I once had the same task: parsing a PDF. I tried every library I could find, and none of them worked. I didn't know which font the document used either. The pages had a very complex structure.
I ended up using pytesseract and opencv. Not perfect, not fast, but it worked.
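
For reference, an OCR route like that could look roughly as follows. This is not the code from the answer above, only a minimal sketch. It assumes the page has already been rendered to an image file, that Tesseract is installed together with its Russian language data, and that the helper names `ocr_page_image` and `tidy_ocr_text` are made up for illustration.

```python
import re


def ocr_page_image(image_path: str, lang: str = 'rus') -> str:
    """OCR one page image after grayscale conversion and Otsu binarization."""
    import cv2  # third-party: pip install opencv-python
    import pytesseract  # third-party: pip install pytesseract (needs Tesseract)

    image = cv2.imread(image_path)
    if image is None:  # cv2.imread returns None instead of raising
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # With THRESH_OTSU the threshold is chosen automatically; the 0 is ignored.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary, lang=lang)


def tidy_ocr_text(text: str) -> str:
    """Strip trailing spaces and collapse runs of blank lines into one."""
    lines = [line.rstrip() for line in text.splitlines()]
    return re.sub(r'\n{3,}', '\n\n', '\n'.join(lines)).strip()


# Example use (page.png is a hypothetical rendered page):
#   print(tidy_ocr_text(ocr_page_image('page.png')))
```

Binarizing before recognition is a common preprocessing step, but scans with uneven lighting or multi-column layouts may need more work than this.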
