How to parse tables in a pdf file with Python?
I wrote a program that extracts text from a PDF and writes it to a file, but part of the text comes out wrong, not at all the way it should. Does anyone know how to parse the table exactly cell by cell, from one cell to the next? Or maybe there are other ways to do this.
The code of my program:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO


def getPDFText(pdfFilenamePath):
    retstr = StringIO()
    parser = PDFParser(open(pdfFilenamePath, 'rb'))
    try:
        document = PDFDocument(parser)
    except Exception:
        print(pdfFilenamePath, 'could not be opened')
        return ''
    if document.is_extractable:
        rsrcmgr = PDFResourceManager()
        # codec='ascii' drops non-ASCII characters; let the converter use its default
        device = TextConverter(rsrcmgr, retstr, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
        return retstr.getvalue()
    else:
        print(pdfFilenamePath, 'text is not extractable')
        return ''


if __name__ == '__main__':
    words = getPDFText('09.03_0.pdf')
    print(words)
    with open('new.txt', 'w') as file:
        file.write(words)
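For reference, pdfminer.six also exposes a higher-level layout API that yields every text box together with its coordinates, which makes it easier to see where the cell structure is being lost. This is only a minimal sketch, assuming a recent pdfminer.six and the same 09.03_0.pdf file as in the code above:

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    # Walk the layout tree and print every text box with its bounding box,
    # to show how pdfminer groups the characters on each page.
    for page_layout in extract_pages('09.03_0.pdf'):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                x0, y0, x1, y1 = element.bbox
                print(f'({x0:.1f}, {y0:.1f}, {x1:.1f}, {y1:.1f}) {element.get_text().strip()}')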
PDF, in general, is not at all intended for storing structured data. A table in it can (with certain export settings from some programs that create PDFs) be saved in a way that can still be read as structured data (the order of rows is preserved and the cells are separated by tabs, for example). But in the general case, text in a PDF loses its structure and is stored simply as a vector image made up of text characters.
Accordingly, in the general case a PDF has to be analyzed as a graphic image, segmenting it into lines, blocks, and so on. In practice, this is exactly what OCR tools do, minus the need to recognize the individual characters.
I also ran into this problem. I did not find any open source tools that solve it, so I had to write my own parser, HoChiMinh.
I no longer maintain it, but it is in working order and does a fairly good job of finding the grid of regular tables that are aligned with the sides of the PDF page. It does, however, rely on OCR to extract the text from each cell. By default that is Tesseract, but for quality results it is better to use another tool.
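If writing a custom parser is not an option, there are also ready-made table-extraction libraries that work from the page layout rather than the raw text stream. Below is a minimal sketch using pdfplumber, an alternative library not mentioned in the answer above; the file name 09.03_0.pdf is taken from the question:

    import pdfplumber

    # Open the PDF and pull out every table pdfplumber can detect on each page.
    # extract_tables() returns a list of tables; each table is a list of rows,
    # and each row is a list of cell strings (None for empty cells).
    with pdfplumber.open('09.03_0.pdf') as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for table in page.extract_tables():
                print(f'--- table on page {page_number} ---')
                for row in table:
                    print('\t'.join(cell if cell is not None else '' for cell in row))

Whether the detected cell boundaries are correct depends on how the table was drawn in the PDF, so the output is worth checking against the original document.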