How to parse pdf?

kr_ilya, 2021-02-20 00:36:44

Parsing

I need to parse data from a PDF document. The document consists of a three-column table preceded by some unneeded text; only the table is needed. It is several pages long. Besides text, the table contains hyperlinks that also need to be extracted.

How can I efficiently pull all the data out of this table so that it is easy to work with later? I was thinking of JSON, but I don't know whether I'm looking in the right direction.

The programming language doesn't matter, as long as there is a library that can do this.

3 answer(s)

Saboteur, 2021-02-20
@saboteur_kiev

What does JSON have to do with parsing?
PDF is the kind of format where a table can be just an image, in which case the only option is to run text recognition (OCR) on it.
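
A rough way to check whether a given file has real text or only an image is to run it through pdftotext (from poppler-utils) and see whether anything readable comes out. This is only a sketch: it assumes pdftotext is installed and on the PATH, and the helper names and the `min_chars` threshold are made up for illustration, not taken from any library.

```python
import subprocess


def extract_text(pdf_path):
    """Run pdftotext (poppler-utils) and return whatever text it finds."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def has_text_layer(text, min_chars=20):
    """Heuristic: a scanned PDF yields little more than page-break characters.

    min_chars is an arbitrary threshold chosen for this sketch.
    """
    visible = "".join(text.split())  # drop spaces, newlines and form feeds
    return len(visible) >= min_chars
```

Calling `has_text_layer(extract_text("file.pdf"))` and getting `False` suggests the pages are images and need OCR first; `file.pdf` is a placeholder name.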

Alan Gibizov, 2021-02-20
@phaggi

If the PDF is not an image, that is, if you can select and copy text when the file is open in Adobe Reader, then to pull the table out you can open the PDF in Word. From Word the table can be copied and pasted into Excel.
Accordingly, the simplest way to automate this process is with MS Office and VBA.

12rbah, 2021-02-20
@12rbah

How can I efficiently pull all the data out of this table so that it is easy to work with later?

- If you need to extract just the table, write your own parser (or hire a freelancer). If extracting the entire text is enough, use poppler-utils (it can extract page by page) and then parse the extracted text; you only need to determine where the table begins and where it ends.
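
The second option could look roughly like the sketch below. It assumes pdftotext from poppler-utils is installed, that the `-layout` output keeps the three columns separated by at least two spaces, and that each table row fits on one line. The function names, the marker strings and the sample text are all invented for the example; a real document will need its own markers, and multi-line cells will need extra handling.

```python
import json
import re
import subprocess


def page_text(pdf_path, page):
    """Return the text of a single page via pdftotext (poppler-utils)."""
    cmd = ["pdftotext", "-layout", "-f", str(page), "-l", str(page), pdf_path, "-"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def table_rows(text, start_marker, end_marker=None, columns=3):
    """Collect the rows found between start_marker and end_marker."""
    rows = []
    inside = False
    for line in text.splitlines():
        if not inside:
            # The line containing the start marker (the header) is skipped.
            inside = start_marker in line
            continue
        if end_marker is not None and end_marker in line:
            break
        # -layout pads columns with runs of spaces, so split on 2+ of them.
        cells = re.split(r"\s{2,}", line.strip())
        if len(cells) == columns:  # ignore blank or malformed lines
            rows.append(cells)
    return rows


# Made-up text standing in for the output of page_text().
sample = """Some introduction text
Name      Link                  Note
Alpha     https://a.example     first row
Beta      https://b.example     second row
End of table
"""

rows = table_rows(sample, "Name", "End of table")
print(json.dumps(rows, indent=2))
```

For a real file you would call `page_text()` for each page, pass the result to `table_rows()`, and dump the combined list to JSON. Note that pdftotext only returns visible text, so a hyperlink whose URL is not printed in the cell will not appear in this output.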