How to parse a PDF as a structure in C#?


Programming

Evgeny Elchev, 2012-07-03 06:18:46

I'm interested in how to parse a PDF not just as a pile of text but as a structure, and in particular how to read tables.


3 answers

ant99, 2012-07-03
@ant99

stackoverflow.com/questions/3424588/programmatically-extract-pdf-tables

Given your requirement, the straightforward answer is that it is not really possible. The reason is that, unlike Word or Excel, the PDF specification does not have an object called Table. The tables you see in PDF documents are just a series of rectangles drawn so that they look like a table. How they are drawn is up to the PDF writer that created the file: some writers draw the table structure with a series of lines instead.

In other words, PDF page content has no table object: a table is represented by a set of rectangular areas and lines, with text positioned between them. (Tagged PDFs can carry optional table structure elements, but many files are not tagged.) You can write your own algorithm that recognizes such a set of areas as a table by certain criteria, or you can use existing libraries and utilities that already implement this (they are listed in the last comment at the link).
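As a rough illustration of the "own algorithm" route, here is a minimal sketch that guesses a table from positioned text alone. It assumes some PDF library has already given you text fragments with page coordinates; `TextFragment`, `TableGuesser.GuessTable` and both tolerance values are hypothetical names and numbers made up for this example, not part of any real library. It only handles the simple case where the cells in a column share a left edge.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical input type: one piece of text and the position of its
// lower-left corner in page coordinates (assumed to come from a PDF library).
public record TextFragment(double X, double Y, string Text);

public static class TableGuesser
{
    // Groups fragments into rows by similar Y and into columns by similar
    // left edge X. Returns one string array per row, top to bottom.
    public static List<string[]> GuessTable(
        IEnumerable<TextFragment> fragments,
        double yTolerance = 2.0,   // assumed value, tune per document
        double xTolerance = 5.0)   // assumed value, tune per document
    {
        var items = fragments.ToList();

        // Column anchors: sorted left edges, merged when they lie within
        // xTolerance of the current anchor.
        var columns = new List<double>();
        foreach (var x in items.Select(f => f.X).OrderBy(x => x))
        {
            if (columns.Count == 0 || x - columns[^1] > xTolerance)
                columns.Add(x);
        }

        // Rows: PDF Y grows upwards, so sort descending to read top to bottom.
        var rows = new List<List<TextFragment>>();
        foreach (var f in items.OrderByDescending(f => f.Y))
        {
            if (rows.Count > 0 && Math.Abs(rows[^1][0].Y - f.Y) <= yTolerance)
                rows[^1].Add(f);
            else
                rows.Add(new List<TextFragment> { f });
        }

        var table = new List<string[]>();
        foreach (var row in rows)
        {
            var cells = Enumerable.Repeat("", columns.Count).ToArray();
            foreach (var f in row.OrderBy(f => f.X))
            {
                // The fragment belongs to the last anchor at or left of it.
                int col = columns.FindLastIndex(c => c <= f.X);
                cells[col] = cells[col].Length == 0 ? f.Text : cells[col] + " " + f.Text;
            }
            table.Add(cells);
        }
        return table;
    }
}
```

Calling `TableGuesser.GuessTable(fragments)` on the fragments of one page gives a list of rows that you can then inspect or export. Real documents usually need more than this: right-aligned or centered columns, cells spanning several lines, and the drawn rectangles and lines themselves are all ignored here.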

Zhbert, 2012-07-03
@Zhbert

Well, first you need to find a description of the PDF format, and then work from there.

lexa107, 2012-07-03
@lexa107

There is a library project for working with PDF in C#. You can see it here