Automated migration of PDF to SQL

SQL

AJ, 2012-12-20 10:02:00

Good day, everyone.
A developer's life has no shortage of absurd situations, especially given what customers dream up. Another one has just landed on me.
There is a catalog of spare parts for construction equipment. In… PDF format.
25 GB of files contain exploded-view diagrams, part numbers, names, and other necessary information. All of it has to be moved into a usable database format. For now, SQL.
I'm sure a text version exists, but no one will provide it. Neither the competitors nor the manufacturer have any interest in that. Any AutoCD releases are locked into a closed format.

Can anyone suggest the shortest route from PDF to SQL? So far, the only thing that comes to mind is PDF -> XLSX -> parser -> SQL.
But who knows. Maybe someone has already dealt with this.

Thanks in advance for your replies.

4 answer(s)

Sergey, 2012-12-20
@Ualde

Take a look at this; it is close to the topic, especially the comments: habrahabr.ru/post/130601/

KEKSOV, 2012-12-20
@KEKSOV

Here is another utility for extracting text: multivalent.sourceforge.net/Tools/. By the way, ABBYY also has a utility that may be useful.
Frankly, a PDF can be put together in such a convoluted way that you will get nothing machine-readable out of it.

ChemAli, 2012-12-21
@ChemAli

I once did a simple search over PDFs: converted them with pdf2xml, then just searched the resulting XML directly.
In your case I don't think this will help much, because the layout differs from page to page, and the XML stores each text block only as its page coordinates plus the text itself. In other words, you are unlikely to get structured data out of it.
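
For pages that do happen to be laid out as tables, the coordinates can at least be used to regroup text blocks into rows. Below is a minimal sketch in Python, assuming the converter emits `<page>` elements containing `<text top="…" left="…">` blocks; the tag names, attribute names, and sample part numbers are all made up for illustration, and real output may look different. It does not address the problem of layouts changing from page to page.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a real converter may use other tag/attribute names.
SAMPLE = """<pdf2xml>
  <page number="1">
    <text top="100" left="50">1</text>
    <text top="101" left="120">4110-0032</text>
    <text top="100" left="300">Hex bolt M10</text>
    <text top="130" left="50">2</text>
    <text top="130" left="120">4110-0047</text>
    <text top="131" left="300">Lock washer</text>
  </page>
</pdf2xml>"""


def rows_from_page(page, tolerance=3):
    """Group text blocks whose 'top' values are within `tolerance` into rows."""
    blocks = sorted(
        (int(t.get("top")), int(t.get("left")), "".join(t.itertext()).strip())
        for t in page.iter("text")
    )
    rows = []  # each entry: (top of the row's first block, [(left, text), ...])
    for top, left, text in blocks:
        if rows and abs(top - rows[-1][0]) <= tolerance:
            rows[-1][1].append((left, text))
        else:
            rows.append((top, [(left, text)]))
    # Inside each row, order the cells from left to right.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


def extract_rows(xml_text):
    """Return all rows from all pages, in page order."""
    root = ET.fromstring(xml_text)
    result = []
    for page in root.iter("page"):
        result.extend(rows_from_page(page))
    return result


if __name__ == "__main__":
    for row in extract_rows(SAMPLE):
        print(row)
```

The `tolerance` value is a guess; blocks on the same visual line often differ by a pixel or two, so it would need tuning against the real files.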

Alexander, 2012-12-21
@akalend

PDF -> text -> parser -> sql
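
A rough sketch of the last two steps of that chain, in Python, assuming the text has already been extracted by some tool and that each part sits on one line as "position, part number, name". The line pattern, the table layout, the sample data, and the choice of SQLite as the SQL target are all assumptions for illustration, not something taken from the actual catalog.

```python
import re
import sqlite3

# Hypothetical line layout: "<position> <part number> <name>".
# The pattern has to be adapted to whatever the real extracted text looks like.
LINE_RE = re.compile(r"^\s*(\d+)\s+([A-Z0-9][A-Z0-9-]+)\s+(.+?)\s*$")

# Made-up sample of extracted text, including lines that are not part rows.
SAMPLE_TEXT = """\
Fig. 12  Boom cylinder
1   4110-0032   Hex bolt M10
2   4110-0047   Lock washer
(continued on next page)
"""


def parse_parts(text):
    """Yield (position, part_number, name) for matching lines; skip the rest."""
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            yield int(m.group(1)), m.group(2), m.group(3)


def load_parts(conn, text):
    """Create the table if needed and insert every parsed part row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS parts ("
        "position INTEGER, part_number TEXT, name TEXT)"
    )
    conn.executemany("INSERT INTO parts VALUES (?, ?, ?)", parse_parts(text))
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, just for the demo
load_parts(conn, SAMPLE_TEXT)
rows = conn.execute(
    "SELECT position, part_number, name FROM parts ORDER BY position"
).fetchall()
print(rows)
```

Lines that do not match the pattern are silently skipped here; on a real 25 GB catalog it would probably be worth logging them instead, to see how much the parser misses.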