How to automatically remove information from a scanned document?


Scanning

Fagot78, 2020-02-10 17:15:17

The task is to scan a large volume of documents. These documents contain information that must be removed automatically, by keyword: for example, certain item names (nomenclature) in specifications. The resulting scan (PDF, JPG, etc.) should no longer contain these words.
Is there such software?


2 answer(s)

Vladimir Korotenko, 2020-02-10
@firedragon

FineReader
CuneiForm
Both have an SDK for enterprise clients.

U235U235, 2020-02-11
@U235U235

Option 1:
Run OCR with Tesseract using hOCR output, then find the required words and their coordinates in it. Use ImageMagick to paint over those words on the scans at those coordinates.
Option 2:
Run OCR with FineReader, export to DjVu, extract the text layer with coordinates from the DjVu file, and parse it. Then paint over the words with ImageMagick, as in option 1.
All of this can be automated with scripts.
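
A rough sketch of how option 1 might be scripted in Python. It assumes the hOCR file was already produced with something like `tesseract page.png page hocr`, and that word spans look the way Tesseract usually writes them (`class='ocrx_word'` followed by a `title` containing `bbox x0 y0 x1 y1`). The function names `find_keyword_boxes` and `redact_command` are made up for this example, and matching is a simple case-insensitive whole-word comparison, so multi-word phrases or badly recognized words would need extra handling.

```python
import html
import re

# One hOCR word span: captures the four bbox numbers and the inner markup.
# Assumes the class attribute comes before title, as in Tesseract's output.
WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"][^'\"]*?bbox (\d+) (\d+) (\d+) (\d+)[^'\"]*['\"][^>]*>"
    r"(.*?)</span>",
    re.DOTALL,
)


def find_keyword_boxes(hocr, keywords):
    """Return (x0, y0, x1, y1) boxes of words that match any keyword."""
    wanted = {k.lower() for k in keywords}
    boxes = []
    for m in WORD_RE.finditer(hocr):
        # Drop nested tags such as <strong>, then decode HTML entities.
        text = html.unescape(re.sub(r"<[^>]+>", "", m.group(5))).strip()
        # Ignore surrounding punctuation when comparing.
        if text.strip(".,;:()").lower() in wanted:
            boxes.append(tuple(int(m.group(i)) for i in range(1, 5)))
    return boxes


def redact_command(src, dst, boxes):
    """Build an ImageMagick argument list that fills each box with black."""
    cmd = ["convert", src, "-fill", "black"]
    for x0, y0, x1, y1 in boxes:
        cmd += ["-draw", f"rectangle {x0},{y0} {x1},{y1}"]
    return cmd + [dst]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. On ImageMagick 7 the executable is `magick` rather than `convert`. Since OCR can miss or misread a keyword, it is worth spot-checking the output scans.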