Scanning
Fagot78, 2020-02-10 17:15:17

How to automatically remove information from a scanned document?

The task is to scan a large volume of documents. These documents contain information that needs to be removed automatically, by keyword: for example, certain nomenclature items in specifications. The resulting scan (PDF, JPG, etc.) should no longer contain those words.
Is there such software?


2 answer(s)
Vladimir Korotenko, 2020-02-10
@firedragon

FineReader
CuneiForm
Both have an SDK for enterprise clients.

U235U235, 2020-02-11
@U235U235

Option 1:
Run OCR with Tesseract using hOCR output, then find the required words and their coordinates in the result. Use ImageMagick to paint over the words on the scans at those coordinates.
Option 2:
Run OCR with FineReader, export to DjVu, extract the text layer with coordinates from the DjVu file and parse it. Then paint over the words with ImageMagick, as in Option 1.
All this can be automated with scripts.
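
A rough sketch of how Option 1 could be scripted in Python. This is only an illustration, not a tested production tool. It assumes Tesseract writes word spans with class `ocrx_word` and a `bbox x0 y0 x1 y1` entry in the `title` attribute (the usual hOCR layout), and that ImageMagick is available as `convert` (on ImageMagick 7 the command is `magick`). The names `HocrWords`, `find_boxes`, `redact_command` and `redact_scan` are made up for this example. It matches single words only, so multi-word keywords would need extra logic.

```python
import re
import subprocess
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []
        self._depth = 0     # nesting depth of tags inside the word span

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            self._depth += 1  # e.g. <strong> or <em> inside a word
            return
        a = dict(attrs)
        if tag == "span" and "ocrx_word" in (a.get("class") or "").split():
            m = BBOX_RE.search(a.get("title") or "")
            if m:
                self._bbox = tuple(int(g) for g in m.groups())
                self._text = []
                self._depth = 0

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        self.words.append(("".join(self._text).strip(), self._bbox))
        self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)


def find_boxes(hocr_text, keywords):
    """Return bounding boxes of words that match a keyword (case-insensitive)."""
    parser = HocrWords()
    parser.feed(hocr_text)
    wanted = {k.lower() for k in keywords}
    return [box for text, box in parser.words
            if text.strip(".,;:()\"'").lower() in wanted]


def redact_command(src, dst, boxes):
    """Build an ImageMagick command that paints a black rectangle over each box."""
    cmd = ["convert", src, "-fill", "black"]
    for x0, y0, x1, y1 in boxes:
        cmd += ["-draw", f"rectangle {x0},{y0} {x1},{y1}"]
    return cmd + [dst]


def redact_scan(src, dst, keywords):
    """OCR one image with Tesseract, then paint over the matching words."""
    base = src.rsplit(".", 1)[0]
    # "tesseract <image> <outputbase> hocr" writes <outputbase>.hocr
    subprocess.run(["tesseract", src, base, "hocr"], check=True)
    with open(base + ".hocr", encoding="utf-8") as f:
        boxes = find_boxes(f.read(), keywords)
    subprocess.run(redact_command(src, dst, boxes), check=True)
```

Calling `redact_scan("scan.png", "scan_redacted.png", ["Widget-X"])` for each file in a folder would cover the batch case. PDF scans would first have to be split into page images, since the rectangles are drawn on the raster image.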
