Convert tables without cell borders from pdf to excel or csv?

Python

ac130kz, 2017-11-15 19:38:11

I have a PDF file containing a multipage table without cell borders.
Example of a pair of rows:
Download pdf file
It must be converted to Excel or CSV with the cells split correctly. The difficulty is that many converters, including the one built into Adobe Acrobat, PyPDF2, and others, read the file incorrectly: they add extra lines and break the layout. For now I use PDF2XL, which has a manual mode for setting the cell borders by hand. However, I would like to automate this process using Python or another language.

3 answer(s)

Alexander Samokhin, 2017-11-16
@ac130kz

To solve a similar problem, I wrote a script that used pdfminer.
The main operations it performed:
1. converted the PDF to XML. Here is an example of the result of the conversion:

<textbox id="17" bbox="384.771,365.240,431.953,377.063">
<textline bbox="384.771,365.240,431.953,377.063">
<text font="DJHCLP+TT66ACo00" bbox="384.771,365.240,396.357,377.063" size="11.823">N</text>
<text font="DJHCLP+TT66ACo00" bbox="396.337,365.240,408.821,377.063" size="11.823">G</text>
<text font="DJHCLP+TT66ACo00" bbox="408.800,365.240,419.489,377.063" size="11.823">S</text>
<text font="DJHCLP+TT66ACo00" bbox="419.469,365.240,431.953,377.063" size="11.823">O</text>
</textline>
</textbox>

The bbox attribute holds the text coordinates X1, Y1, X2, Y2.
2. parsed the XML and created "text elements";
3. calculated the average Y value for each element. Elements with the same average Y belong to the same row, provided they are on the same page;
4. sorted the elements by page number and average Y;
5. sorted the elements belonging to the same row by X1;
6. assembled rows in the required format from the sorted elements.
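
Steps 2 to 6 can be sketched roughly as follows. This is not the original script, only a minimal illustration using the Python standard library. It assumes the XML came from pdfminer's `pdf2txt.py -t xml`, treats each `<textline>` as one "text element", and sorts by descending Y because PDF coordinates grow upward. The `y_tolerance` parameter, the function names, and the sample data are assumptions made up for the example; a tolerance is used because average Y values rarely match exactly.

```python
import csv
import io
import xml.etree.ElementTree as ET


def extract_rows(xml_text, y_tolerance=2.0):
    """Group text lines into table rows by page and average Y."""
    root = ET.fromstring(xml_text)
    elements = []
    for page_index, page in enumerate(root.iter("page")):
        for line in page.iter("textline"):
            x1, y1, x2, y2 = (float(v) for v in line.get("bbox").split(","))
            # Each <text> child holds one character; join them into a string.
            text = "".join(t.text or "" for t in line.iter("text")).strip()
            if text:
                elements.append((page_index, (y1 + y2) / 2, x1, text))

    # Sort by page, then top to bottom (larger Y is higher on the page).
    elements.sort(key=lambda e: (e[0], -e[1], e[2]))

    rows, current, row_key = [], [], None
    for page_index, y_mid, x1, text in elements:
        same_row = (
            row_key is not None
            and row_key[0] == page_index
            and abs(row_key[1] - y_mid) <= y_tolerance
        )
        if not same_row:
            if current:
                # Within a row, order the cells left to right by X1.
                rows.append([cell for _, cell in sorted(current)])
            current, row_key = [], (page_index, y_mid)
        current.append((x1, text))
    if current:
        rows.append([cell for _, cell in sorted(current)])
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up sample: two rows, listed out of order, with slightly uneven Y values.
SAMPLE = """<pages><page id="1">
<textline bbox="200,700,240,712"><text>N</text><text>a</text><text>m</text><text>e</text></textline>
<textline bbox="50,700,70,712"><text>I</text><text>D</text></textline>
<textline bbox="200,680,240,692"><text>N</text><text>G</text><text>S</text><text>O</text></textline>
<textline bbox="50,680.4,60,692.4"><text>7</text></textline>
</page></pages>"""

print(rows_to_csv(extract_rows(SAMPLE)))
```

Note that this sketch does not handle empty cells: if a row is missing a value, the remaining cells shift left. Handling that would require mapping X1 values to fixed column positions.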

Alexey Cheremisin, 2017-11-15
@leahch

Alas, the PDF format knows nothing about tables at all; it has no such structures. PDF was designed for prepress and carries only text, graphics, and instructions for positioning them on the page. Each cell is just a block of text with positioning instructions, nothing more. At one point we deliberately went out of our way to make copying difficult by shuffling the blocks inside the PDF, so that copy-paste produced a monstrous mess of fragments from different paragraphs of the page. So formally, nothing meaningfully structured can be pulled out of a PDF. If you want to exchange tables, there are XLS, CSV, and XML for that...
Actually, you can put something into a PDF, but getting it back out is painful.
And since the format was put together from scraps of PostScript, a language for printing, the round-trip conversion is flawless :-)

Dimonchik, 2017-11-15
@dimonchik2013

PDF in Python is hard.
Convert it to CSV, process that with masks, and go from there to Excel.
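
One possible reading of "masks" here is regular-expression patterns applied to each extracted line of text. The sketch below assumes that reading; the row layout (an integer id, a free-text name, a decimal amount), the `ROW_MASK` pattern, and the sample lines are all hypothetical and would need to be adapted to the real table.

```python
import csv
import io
import re

# Hypothetical row layout: an integer id, a free-text name, a decimal amount.
ROW_MASK = re.compile(r"^(?P<id>\d+)\s+(?P<name>.+?)\s+(?P<amount>\d+\.\d{2})$")


def lines_to_csv(lines):
    """Keep only the lines that fit the mask and write their fields as CSV."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in lines:
        match = ROW_MASK.match(line.strip())
        if match:  # lines that do not fit the mask (headers, footers) are skipped
            writer.writerow(match.group("id", "name", "amount"))
    return buffer.getvalue()


# Made-up sample lines, including a page footer that the mask rejects.
print(lines_to_csv(["7 NGSO 12.50", "Page 3 of 10", "8 Two Words 3.00"]))
```

The resulting CSV can then be opened in Excel directly.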