PDF OCR: a console tool or Python?
Is there a suitable tool for recognizing PDFs (I only need the text) from the command line? Or, if I have to put something together myself in Python, can you recommend a reasonably adequate library?
cuneiform recognizes and saves text fine from the command line. There are also tesseract and pyocr. They cannot read PDF directly, but rasterizing it to PNG is a trivial matter.
Something like this:
from wand.image import Image as Img
from wand.color import Color
from PIL import Image
import pyocr
import pyocr.builders
import os
from timeit import default_timer as timer
pdf_name = '1.pdf'
pdf_path = os.path.join(os.getcwd(), pdf_name)
img_name = 'pdf_1'
image = f'{img_name}.png'
with Img(filename=pdf_path, resolution=300) as img:
    img.format = 'png'
    img.background_color = Color('white')
    img.alpha_channel = 'remove'
    img.save(filename=image)
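# get_available_tools() lists only the OCR engines that are actually installed,
# so a hard-coded index [1] fails when just one engine is present.
tools = pyocr.get_available_tools()
if not tools:
    raise RuntimeError('No OCR tool found; install tesseract or cuneiform')
tool = tools[0]
# The 'rus' language data must be installed; check tool.get_available_languages().
builder = pyocr.builders.TextBuilder()
start = timer()
text = tool.image_to_string(Image.open(image), lang='rus', builder=builder)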
end = timer()
print(f"{end - start} \n\n")
print(text)
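If you would rather stay with the console tools, the same two steps (rasterize, then recognize) can be driven from Python through subprocess. This is only a rough sketch under a few assumptions: pdftoppm (from poppler-utils) and tesseract are installed and on PATH, the 'rus' language data is present, and the helper names rasterize_command, ocr_command and ocr_pdf are made up for illustration. Unlike the snippet above, it goes through every page of the PDF.

```python
import shutil
import subprocess
from pathlib import Path


def rasterize_command(pdf_path, work_dir, dpi=300):
    """Build the pdftoppm call that renders each page to <work_dir>/page-N.png."""
    prefix = Path(work_dir) / "page"
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def ocr_command(png_path, lang="rus"):
    """Build the tesseract call that prints the recognized text to stdout."""
    return ["tesseract", str(png_path), "stdout", "-l", lang]


def ocr_pdf(pdf_path, work_dir, lang="rus", dpi=300):
    """Rasterize a PDF and OCR it page by page; returns the joined text."""
    # Fail early with a clear message if a required binary is missing.
    for exe in ("pdftoppm", "tesseract"):
        if shutil.which(exe) is None:
            raise RuntimeError(f"{exe} is not on PATH")

    Path(work_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_command(pdf_path, work_dir, dpi), check=True)

    pages = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for png in sorted(Path(work_dir).glob("page-*.png")):
        result = subprocess.run(
            ocr_command(png, lang),
            check=True,
            capture_output=True,
            text=True,
            encoding="utf-8",
        )
        pages.append(result.stdout)
    return "\n".join(pages)
```

Calling print(ocr_pdf('1.pdf', 'pages')) would then print the text of the whole document. The commands are built in separate functions so they can be inspected or copied into a shell without running anything.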