Python
Igor Statkevich, 2019-10-16 22:26:32

PDF OCR: command-line tool or Python?

Is there anything suitable for recognizing text in PDFs from the command line? Or, if I have to build my own in Python, can you recommend a reasonably adequate library?


1 answer(s)
Alexey Guest007, 2019-10-17
@MadInc

cuneiform recognizes text and saves the result just fine from the command line. Otherwise, there are tesseract and pyocr. They do not handle PDF directly, but rasterizing it to PNG first is no big deal ...
Something like this:

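import os
from timeit import default_timer as timer

import pyocr
import pyocr.builders
from PIL import Image
from wand.color import Color
from wand.image import Image as Img

pdf_name = '1.pdf'
pdf_path = os.path.join(os.getcwd(), pdf_name)

img_name = 'pdf_1'
image = f'{img_name}.png'

# Rasterize the PDF to PNG at 300 dpi on a white background.
# Note: this assumes a single-page PDF; for several pages ImageMagick
# writes numbered files (pdf_1-0.png, pdf_1-1.png, ...) instead.
with Img(filename=pdf_path, resolution=300) as img:
    img.format = 'png'
    img.background_color = Color('white')
    img.alpha_channel = 'remove'
    img.save(filename=image)

# Use the first OCR tool PyOCR finds instead of a hard-coded index,
# which fails when fewer tools are installed.
tools = pyocr.get_available_tools()
if not tools:
    raise SystemExit('No OCR tool found; install tesseract or cuneiform')
tool = tools[0]

# Make sure the language data is installed before running OCR.
lang = 'rus'
if lang not in tool.get_available_languages():
    raise SystemExit(f'Language data for {lang!r} is not installed')

builder = pyocr.builders.TextBuilder()

start = timer()
text = tool.image_to_string(Image.open(image), lang=lang, builder=builder)
end = timer()
print(f'OCR took {end - start:.2f} s\n')

print(text)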

PyOCR has decent documentation.
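
For the pure command-line route the question asks about, the same two steps (rasterize, then recognize) can be chained without any Python libraries. The sketch below only builds the two command lines; it does not run them. It assumes `pdftoppm` from poppler-utils for rasterizing, which is not mentioned above and is just one possible choice, and the `tesseract` CLI with Russian language data installed. The function names and the `1-1.png` page file name are hypothetical; the exact page numbering `pdftoppm` produces depends on the page count.

```python
import shlex
from pathlib import Path


def rasterize_cmd(pdf_path, dpi=300):
    """Build a pdftoppm command that writes one PNG per page."""
    pdf = Path(pdf_path)
    # pdftoppm appends "-<page number>.png" to this output prefix.
    prefix = pdf.with_suffix("")
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(png_path, lang="rus"):
    """Build a tesseract command for one page image."""
    png = Path(png_path)
    # tesseract writes the recognized text to "<output base>.txt".
    return ["tesseract", str(png), str(png.with_suffix("")), "-l", lang]


# Print the commands in a form that can be pasted into a shell.
print(shlex.join(rasterize_cmd("1.pdf")))
print(shlex.join(ocr_cmd("1-1.png")))
```

To actually run them, each list can be passed to `subprocess.run(cmd, check=True)`; building the commands as lists rather than strings avoids quoting problems with file names that contain spaces.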
