PDF Parsing with Block Position Extraction

ChemAli, 2012-01-13 13:59:16

Parsing

Is it possible to parse a PDF file (text and images) so that individual blocks of text are extracted together with the coordinates of each block on the page?

The end goal is to search for text in a file and highlight what was found.

The implementations I've found only go as far as extracting the text as one continuous stream, with no position information.

2 answer(s)

egorinsk, 2012-01-13
@egorinsk

Yes, it certainly is. The coordinates are stored in the PDF file itself, and there is no problem extracting them from there. See the PDF specification for details.
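To give a rough idea of where those coordinates live: in a page's content stream, text is drawn between `BT` and `ET`, positioned with operators such as `Td` and `Tm`, and shown with `Tj`. Below is a minimal sketch in Python, not a real parser. It assumes an uncompressed content stream, an unscaled text matrix, and only the `Td`, `Tm` and `Tj` operators; real files usually compress their streams and also use `TJ`, `T*`, font encodings and so on. The names `extract_text_runs`, `find_text` and `SAMPLE` are made up for the illustration.

```python
import re

# One token is a literal string "(...)", a name "/F1", a number, or an operator.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|/\S+|[-+]?\d*\.?\d+|[A-Za-z*'\"]+")


def extract_text_runs(content):
    """Return (x, y, text) for every Tj operator in a content stream.

    Assumption: the text matrix is never scaled or rotated, so the Td
    offsets and the last two Tm operands can be used as page coordinates.
    """
    runs = []
    operands = []
    line_x = line_y = 0.0
    for token in TOKEN.findall(content):
        if token.startswith("("):
            operands.append(token[1:-1])      # string operand, parentheses removed
        elif token.startswith("/"):
            operands.append(token)            # name operand, e.g. a font name
        elif token[0] in "+-.0123456789":
            operands.append(float(token))     # numeric operand
        else:
            # Anything else is an operator that consumes the collected operands.
            if token == "BT":
                line_x = line_y = 0.0         # a text object starts at the origin
            elif token == "Td" and len(operands) >= 2:
                line_x += operands[-2]        # move relative to the current line start
                line_y += operands[-1]
            elif token == "Tm" and len(operands) >= 6:
                line_x, line_y = operands[-2], operands[-1]  # absolute position
            elif token == "Tj" and operands:
                runs.append((line_x, line_y, operands[-1]))
            operands = []
    return runs


def find_text(runs, needle):
    """Return the (x, y) start positions of the runs that contain `needle`."""
    needle = needle.lower()
    return [(x, y) for x, y, text in runs if needle in text.lower()]


# A hand-written content stream used only for this illustration.
SAMPLE = """
BT
/F1 12 Tf
72 720 Td
(Invoice number) Tj
0 -14 Td
(Total due: 42) Tj
ET
BT
1 0 0 1 300 500 Tm
(Page footer) Tj
ET
"""

runs = extract_text_runs(SAMPLE)
matches = find_text(runs, "total")
```

Each position is the start of a text run in page units, measured from the bottom-left corner in the default coordinate system. To highlight a match you would still need the width of the run, and that requires the font metrics, which this sketch ignores.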

Sergey, 2012-01-13
Protko @Fesor

My fevered brain came up with the idea of rendering the PDF to images, finding the block coordinates, recognizing the text, picking out what is needed in the right block and then taking that block's coordinates... O_o. To be fair, there was one project where we had to find empty areas in a PDF document and fill them with advertising junk. For search there are many possible approaches. The problem needs to be defined more clearly: what you have as input and what you need as output.