Text recognition
HexUserHex, 2021-06-25 22:32:34

How to set up a quick content search across PDF documents?

Greetings,

I have a fairly large set of PDF documents (50 GB) and need to set up search over their contents. What is the simplest and quickest way to do this? I need a temporary solution, without ELK or similar stacks.

The options I see:
1. Parse the PDFs in Python, save the text to a database, and search there. The difficulty is that PDF is not as easy to parse as HTML, JSON, or XML.

2. Find some miracle utility that recognizes the text, builds a JSON/XML object from it, and saves it, and then search through those files.
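
For illustration, option 1 might look roughly like the sketch below. It is only a sketch under several assumptions: the third-party `pypdf` package is used for text extraction, the Python build's SQLite includes the FTS5 extension, and the names `extract_text`, `build_index`, `search`, and the `docs` table are hypothetical. Scanned PDFs without a text layer would come back empty and would need OCR first.

```python
import sqlite3
from pathlib import Path


def extract_text(pdf_path):
    """Return the text layer of one PDF (assumes the third-party pypdf package)."""
    from pypdf import PdfReader  # assumption: installed via `pip install pypdf`

    reader = PdfReader(str(pdf_path))
    # extract_text() may return nothing for image-only pages, hence the `or ""`.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def build_index(db_path, pdf_dir, extractor=extract_text):
    """Index every *.pdf under pdf_dir into an SQLite FTS5 table."""
    conn = sqlite3.connect(db_path)
    # Assumption: this SQLite build ships FTS5 (true for most CPython builds).
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(path UNINDEXED, body)"
    )
    for pdf in sorted(Path(pdf_dir).rglob("*.pdf")):
        try:
            text = extractor(pdf)
        except Exception as exc:  # damaged or encrypted files should not stop the run
            print(f"skipped {pdf}: {exc}")
            continue
        conn.execute("INSERT INTO docs (path, body) VALUES (?, ?)", (str(pdf), text))
    conn.commit()
    return conn


def search(conn, keywords):
    """Return paths of documents matching an FTS5 query, best match first."""
    rows = conn.execute(
        "SELECT path FROM docs WHERE docs MATCH ? ORDER BY rank", (keywords,)
    )
    return [row[0] for row in rows]
```

Usage would be along the lines of `conn = build_index("pdfs.db", "/path/to/pdfs")` followed by `search(conn, "invoice")`. Note that running `build_index` twice on the same database would insert duplicate rows; a real script would need to track which files are already indexed.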

I would be glad to hear any ideas and suggestions. My goal is simply to find the PDF files in which given keywords occur.


2 answer(s)
Dimonchik, 2021-06-25
@HexUserHex

docfetcher.sourceforge.net/en/index.html
or another desktop search tool

Roman Mirilaczvili, 2021-06-26
@2ord

If you build it yourself, use the Solr full-text search engine. It already includes a module for processing PDF documents and exposes its own HTTP API for queries. You would need to write your own client program.
Or take ready-made software, as Dimonchik suggested.
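
As a rough idea of what such a client could look like, here is a minimal sketch that queries Solr's standard `select` endpoint over HTTP. The core name, the host and port, and the helper names `build_select_url` and `search` are assumptions for illustration. Which field holds the extracted PDF text depends on how the core is configured, so the query string is passed through as is.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(base_url, core, query, rows=10):
    """Build a URL for Solr's /select handler, asking only for document ids."""
    params = urlencode({"q": query, "rows": rows, "fl": "id", "wt": "json"})
    return f"{base_url.rstrip('/')}/solr/{core}/select?{params}"


def search(base_url, core, query, rows=10):
    """Run the query and return the ids of the matching documents."""
    with urlopen(build_select_url(base_url, core, query, rows)) as resp:
        data = json.load(resp)
    # Solr wraps the hits as {"response": {"docs": [...]}} in its JSON output.
    return [doc["id"] for doc in data["response"]["docs"]]
```

With a hypothetical core named `pdfs` on a local instance, a call might be `search("http://localhost:8983", "pdfs", "invoice")`. If the file path was used as the document id at indexing time, the result is directly the list of matching PDF files.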
Update: I also found https://www.opensemanticsearch.org/
