How do I extract Cyrillic text from a PDF in Python 3?
I can't get a clean text export from a PDF.
I use the PyPDF2 library.
There are no problems with English text.
But Cyrillic text comes out like this:
˛˚˛
ˇ˛˝ˇ©˚ˇ˘˛™‰˚˘”˛˙˛˚˛‘˙˘˛ˆ˚‡
˛à˛‰
’˙˛”˛˚˚˘”Ł˛
˛˚‰
˛˚˚ˇ˛‰•˘˛ˇ˛’˚‰‰˘•˛˛˚ˇ˛ˇ‰•Ł˛˘›
˛
Ł˛¨˘˚˛
˛fl˛
˛–˛
˛fl•˛
from PyPDF2 import PdfFileReader
pdf_file = 'test.pdf'
pl = open(pdf_file, 'rb')
plread = PdfFileReader(pl)
getpage37 = plread.getPage(37)
text37 = getpage37.extractText()
print(text37.encode('utf-8').decode('utf-8'))
Try pdfplumber; I have used it without problems.
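import pdfplumber

with pdfplumber.open(srcfile) as pdf:  # srcfile: path to the PDF
    # extract_text() can return None for pages with no text layer
    pages = [page.extract_text() or '' for page in pdf.pages]
text = '\n'.join(pages)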
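Whichever library is used, it can help to check that the extracted text really contains Cyrillic letters, because a broken font mapping produces symbols like the ones in the question rather than an error. A minimal sketch of such a check follows; the names `cyrillic_ratio` and `looks_like_cyrillic` and the 0.5 threshold are illustrative assumptions, not part of pdfplumber or PyPDF2.

```python
def cyrillic_ratio(text):
    """Return the share of alphabetic characters that fall in the Cyrillic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # U+0400..U+04FF is the basic Unicode Cyrillic block.
    cyrillic = [ch for ch in letters if '\u0400' <= ch <= '\u04FF']
    return len(cyrillic) / len(letters)


def looks_like_cyrillic(text, threshold=0.5):
    """Heuristic: True if at least `threshold` of the letters are Cyrillic."""
    return cyrillic_ratio(text) >= threshold
```

For example, `looks_like_cyrillic(text)` could be called on the joined page text, and a `False` result treated as a sign that the PDF needs a different extraction approach.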
I had a similar task: parsing a PDF. I tried all the libraries and nothing worked. I also did not know which font the document used. The pages had a very complex structure.
I ended up using pytesseract and opencv. Not perfect, not fast, but it worked.
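The answer above does not show its code, so the following is only a rough sketch of what an OCR pass with these two libraries might look like. It assumes the page has already been rendered to an image file, that the opencv-python and pytesseract packages are installed, and that Tesseract has Russian (`rus`) language data available. The function names and the default `--psm`/`--oem` values are hypothetical choices for illustration.

```python
def tesseract_config(psm=6, oem=3):
    """Build Tesseract options: page segmentation mode and OCR engine mode."""
    return f"--oem {oem} --psm {psm}"


def ocr_page_image(image_path, lang="rus"):
    """Run OCR on one page image and return the recognized text."""
    # Imported here so the module still loads where the OCR stack is missing.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        # cv2.imread returns None instead of raising when the file is unreadable.
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Otsu thresholding picks the black/white cutoff automatically.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary, lang=lang, config=tesseract_config())
```

A call might look like `ocr_page_image('page38.png')`, where the file name is a placeholder. Binarizing before OCR is a common preprocessing step, but the right settings depend heavily on the scan quality and page layout.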