How to organize a quick search by content across PDF documents?

HexUserHex, 2021-06-25 22:32:34

Text recognition

Greetings,

I have a fairly large collection of PDF documents (50 GB) and need to set up a search over their contents. How can I do this as simply and quickly as possible? (I need a temporary solution, without ELK and the like.)

The options I see:
1. Parse the files in Python, save the text to a database, and search there. The difficulty is that PDF is not as easy to parse as HTML, JSON, or XML.

2. Find some miracle utility that recognizes the text, builds a JSON/XML object from it, and saves it, and then search through those files...
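
Option 1 could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested solution for 50 GB: it assumes the PDFs have a text layer (scanned documents would need OCR first), that the third-party `pypdf` package is installed, and that the local SQLite build includes the FTS5 extension. The table name `docs`, the file name `pdf_index.db`, and all function names are made up for the example.

```python
import sqlite3
from pathlib import Path
from typing import Callable, Iterable


def extract_text_pypdf(path: Path) -> str:
    """Extract the text layer of a PDF with pypdf (third-party, assumed installed)."""
    from pypdf import PdfReader  # imported lazily so the rest works without pypdf

    reader = PdfReader(str(path))
    # Pages without a text layer (e.g. scans) may yield an empty string.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def open_index(db_path: str = "pdf_index.db") -> sqlite3.Connection:
    """Open (or create) a SQLite database with an FTS5 full-text table."""
    conn = sqlite3.connect(db_path)
    # 'path' is stored but not indexed; 'body' holds the searchable text.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(path UNINDEXED, body)"
    )
    return conn


def index_pdfs(
    conn: sqlite3.Connection,
    paths: Iterable[Path],
    extract: Callable[[Path], str] = extract_text_pypdf,
) -> int:
    """Extract text from each file and add it to the index; return the count added."""
    count = 0
    for path in paths:
        try:
            body = extract(path)
        except Exception as exc:  # broken or encrypted PDFs should not stop the run
            print(f"skipped {path}: {exc}")
            continue
        conn.execute("INSERT INTO docs (path, body) VALUES (?, ?)", (str(path), body))
        count += 1
    conn.commit()
    return count


def search(conn: sqlite3.Connection, keywords: str) -> list[str]:
    """Return paths of documents matching an FTS5 query, best matches first."""
    rows = conn.execute(
        "SELECT path FROM docs WHERE docs MATCH ? ORDER BY rank", (keywords,)
    )
    return [row[0] for row in rows]
```

Usage might look like `conn = open_index()`, then `index_pdfs(conn, Path("/data/pdfs").rglob("*.pdf"))` (the directory is hypothetical), then `search(conn, "invoice total")`. In FTS5, several space-separated words are combined with an implicit AND; keywords containing punctuation may need quoting, and running the indexing twice would insert duplicate rows.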

I would be glad of any ideas and suggestions. My goal is simply to find the PDF files in which given keywords occur.

2 answer(s)

Dimonchik, 2021-06-25
@HexUserHex

docfetcher.sourceforge.net/en/index.html
and other desktop search tools

Roman Mirilaczvili, 2021-06-26
@2ord

If you build it yourself, use the Solr full-text search engine. It already includes a module for processing PDF documents and exposes its own HTTP API for queries. You will need to write your own client program.
Or take ready-made software, as Dimonchik suggested.
Added
Found https://www.opensemanticsearch.org/
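
For the Solr route, the client program mentioned above can be quite small. The following is a rough sketch, not a complete client: it assumes Solr is reachable at a base URL such as `http://localhost:8983`, that the PDFs have already been indexed into a core (the name `pdfs` is hypothetical), that the core's default search field covers the extracted text, and that each document's `id` field holds the file path. The function names are made up for the example.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, core: str, keywords: str, rows: int = 20) -> str:
    """Build a URL for Solr's standard /select query handler."""
    params = urllib.parse.urlencode(
        {
            "q": keywords,  # searched in the core's default field (assumption)
            "rows": rows,   # maximum number of hits to return
            "fl": "id",     # only return the id field (assumed to be the file path)
            "wt": "json",   # ask for a JSON response
        }
    )
    return f"{base_url.rstrip('/')}/solr/{core}/select?{params}"


def search_solr(base_url: str, core: str, keywords: str, rows: int = 20) -> list[str]:
    """Query Solr over HTTP and return the ids of the matching documents."""
    url = build_query_url(base_url, core, keywords, rows)
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    # A standard Solr JSON response lists hits under response -> docs.
    return [doc["id"] for doc in payload["response"]["docs"]]
```

A call such as `search_solr("http://localhost:8983", "pdfs", "annual report")` would then return the ids of matching documents, provided the assumptions above hold for the actual Solr setup.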