Convert tables without cell borders from pdf to excel or csv?
There is a pdf file with a multipage table without cell borders.
Example of a pair of rows:
Download pdf file
It needs to be converted to Excel or CSV with the cells split correctly. The difficulty is that many tools, including the converter built into Adobe Acrobat and libraries such as PyPDF2, read the file incorrectly: they add extra rows and break the layout. For now I use PDF2XL, which has a manual mode that lets you set the cell borders by hand. However, I would like to automate this process using Python or another language.
To solve a similar problem, I wrote a script that used pdfminer.
The main operations it performed:
1. Converted the PDF to XML. Here is an example of the conversion result:
<textbox id="17" bbox="384.771,365.240,431.953,377.063">
<textline bbox="384.771,365.240,431.953,377.063">
<text font="DJHCLP+TT66ACo00" bbox="384.771,365.240,396.357,377.063" size="11.823">N</text>
<text font="DJHCLP+TT66ACo00" bbox="396.337,365.240,408.821,377.063" size="11.823">G</text>
<text font="DJHCLP+TT66ACo00" bbox="408.800,365.240,419.489,377.063" size="11.823">S</text>
<text font="DJHCLP+TT66ACo00" bbox="419.469,365.240,431.953,377.063" size="11.823">O</text>
</textline>
</textbox>
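For illustration, here is a minimal sketch of how XML like the fragment above could be turned back into table rows. It is not the original script: the function names, the `y_tol` tolerance and the sample data (apart from the "NGSO" box) are assumptions made up for the example. It only relies on the `bbox` attributes visible in the fragment, joins the per-character `<text>` elements back into strings, and groups text lines that share roughly the same vertical position into one row.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the "NGSO" box is taken from the fragment above,
# the other two boxes are invented to give a second cell and a second row.
SAMPLE = """<pages><page id="1" bbox="0,0,595,842">
<textbox id="17" bbox="384.771,365.240,431.953,377.063">
<textline bbox="384.771,365.240,431.953,377.063">
<text bbox="384.771,365.240,396.357,377.063">N</text>
<text bbox="396.337,365.240,408.821,377.063">G</text>
<text bbox="408.800,365.240,419.489,377.063">S</text>
<text bbox="419.469,365.240,431.953,377.063">O</text>
</textline>
</textbox>
<textbox id="18" bbox="450.000,365.240,470.000,377.063">
<textline bbox="450.000,365.240,470.000,377.063">
<text bbox="450.000,365.240,460.000,377.063">4</text>
<text bbox="460.000,365.240,470.000,377.063">2</text>
</textline>
</textbox>
<textbox id="19" bbox="384.771,340.000,420.000,351.823">
<textline bbox="384.771,340.000,420.000,351.823">
<text bbox="384.771,340.000,396.000,351.823">A</text>
<text bbox="396.000,340.000,408.000,351.823">B</text>
<text bbox="408.000,340.000,420.000,351.823">C</text>
</textline>
</textbox>
</page></pages>"""


def read_textlines(xml_text):
    """Return (x0, y0, x1, y1, text) for every <textline> element."""
    root = ET.fromstring(xml_text)
    lines = []
    for line in root.iter("textline"):
        x0, y0, x1, y1 = (float(v) for v in line.get("bbox").split(","))
        # Each <text> child holds one character; join them back into a string.
        text = "".join(t.text or "" for t in line.iter("text")).strip()
        if text:
            lines.append((x0, y0, x1, y1, text))
    return lines


def group_into_rows(lines, y_tol=2.0):
    """Group text lines whose bottom edge (y0) differs by at most y_tol."""
    rows = []
    # PDF y coordinates grow upward, so sort descending to read top to bottom.
    for item in sorted(lines, key=lambda l: -l[1]):
        if rows and abs(rows[-1][0][1] - item[1]) <= y_tol:
            rows[-1].append(item)
        else:
            rows.append([item])
    # Inside each row, order the cells left to right by x0.
    return [[cell[4] for cell in sorted(row, key=lambda l: l[0])] for row in rows]


if __name__ == "__main__":
    for row in group_into_rows(read_textlines(SAMPLE)):
        print(row)
```

On a real file the tolerance would need tuning, and cells that wrap onto several lines would need extra handling that this sketch does not attempt.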
Alas, the PDF format knows nothing about tables; it has no such structure. PDF was designed for prepress and carries only text and graphics plus instructions for positioning them on the page. Each cell is just a block of text with positioning instructions, nothing more. At one point we deliberately went out of our way to make copying difficult by shuffling the order of the blocks inside the PDF, so that copy-paste produced a monstrous mess of fragments from different paragraphs of the page. So formally, nothing meaningfully structured can be pulled out of a PDF. If you want to exchange tables, that is what xls, csv and xml are for...
In practice, you can put anything into a PDF, but getting it back out is painful.
And yes, given that the format was cobbled together from scraps of PostScript, a language for printing, round-trip conversion is of course "flawless" :-)
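Since all a PDF gives you is text blocks and their coordinates, any column structure has to be imposed from outside. A rough sketch of that idea, assuming the column boundaries are known in advance (much like setting borders by hand in a manual mode): the `assign_columns` helper, the boundary values and the sample blocks below are all hypothetical.

```python
import csv
import io
from bisect import bisect_right


def assign_columns(rows, boundaries):
    """Place each (x0, text) block into the column its left edge falls in.

    rows: list of rows, each a list of (x0, text) tuples.
    boundaries: sorted x positions that separate neighbouring columns.
    """
    table = []
    for row in rows:
        cells = [""] * (len(boundaries) + 1)
        for x0, text in sorted(row):
            col = bisect_right(boundaries, x0)
            # Blocks landing in the same column are joined with a space.
            cells[col] = (cells[col] + " " + text).strip()
        table.append(cells)
    return table


if __name__ == "__main__":
    # Hypothetical blocks: x0 of the left edge plus the block's text.
    rows = [
        [(384.771, "NGSO"), (50.0, "Alpha")],
        [(384.771, "X")],
    ]
    table = assign_columns(rows, boundaries=[200.0, 440.0])
    buffer = io.StringIO()
    csv.writer(buffer).writerows(table)
    print(buffer.getvalue())
```

The boundaries themselves still have to come from somewhere, for example from measuring one page by hand, which is exactly the information the PDF does not store.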