PDF OCR: command-line tool or Python?

Igor Statkevich, 2019-10-16 22:26:32

Python

Is there a suitable tool for recognizing PDFs (I only need the text) from the command line? Or, if I have to build my own in Python, can you recommend a reasonably adequate library?

1 answer(s)

Alexey Guest007, 2019-10-17
@MadInc

cuneiform recognizes and saves text fine from the command line. Otherwise there are tesseract and pyocr. Neither can read PDF directly, but rasterizing the PDF to PNG is simple enough.
Something like this:

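import os
from timeit import default_timer as timer

import pyocr
import pyocr.builders
from PIL import Image
from wand.color import Color
from wand.image import Image as Img

pdf_name = '1.pdf'
pdf_path = os.path.join(os.getcwd(), pdf_name)

img_name = 'pdf_1'
image = f'{img_name}.png'

# Rasterize the PDF to PNG at 300 dpi on a white background.
# Assumes a single-page PDF: for a multi-page file ImageMagick
# writes pdf_1-0.png, pdf_1-1.png, ... instead of pdf_1.png.
with Img(filename=pdf_path, resolution=300) as img:
    img.format = 'png'
    img.background_color = Color('white')
    img.alpha_channel = 'remove'
    img.save(filename=image)

# Use the first OCR tool pyocr can find (e.g. tesseract or cuneiform).
tools = pyocr.get_available_tools()
if not tools:
    raise RuntimeError('No OCR tool found: install tesseract or cuneiform')
tool = tools[0]

builder = pyocr.builders.TextBuilder()

# Time the recognition step; lang='rus' needs the Russian language data.
start = timer()
text = tool.image_to_string(Image.open(image), lang='rus',
                            builder=builder)
end = timer()
print(f"{end - start} \n\n")

print(text)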

PyOCR has decent documentation.
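
If you would rather stay close to the command line, the same rasterize-then-recognize idea can be sketched by calling the console tools from Python. This is only a rough sketch under assumptions not made in the answer above: it swaps wand for pdftoppm (from poppler-utils), assumes both pdftoppm and tesseract are on PATH, and the function names rasterize_cmd, ocr_cmd and pdf_to_text are made up for illustration. Unlike the snippet above, it handles multi-page PDFs.

```python
import subprocess
import tempfile
from pathlib import Path


def rasterize_cmd(pdf_path, prefix, dpi=300):
    """Build the pdftoppm command: one PNG per page, named <prefix>-<page>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def ocr_cmd(image_path, lang="rus"):
    """Build the tesseract command; the 'stdout' output base prints the text."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def pdf_to_text(pdf_path, lang="rus", dpi=300):
    """OCR every page of a PDF and return the page texts joined by form feeds."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        # Assumption: pdftoppm is installed; check=True raises if it fails.
        subprocess.run(rasterize_cmd(pdf_path, prefix, dpi), check=True)
        pages = []
        # pdftoppm zero-pads page numbers, so a lexical sort keeps page order.
        for png in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(ocr_cmd(png, lang), check=True,
                                    capture_output=True, encoding="utf-8")
            pages.append(result.stdout)
    return "\f".join(pages)
```

Calling pdf_to_text('1.pdf') should then return the recognized text of all pages. The two helpers only build argument lists, so the exact commands can be printed and tried by hand in a shell first.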