How to parse tables in a pdf file with Python?
I wrote a program that extracts text from a PDF and writes it to a file, but part of the text comes out wrong, not at all the way it should. Does anyone know how to parse the table exactly cell by cell, from one cell to the next? Or maybe there are other ways to do this.
The code of my program:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO


def getPDFText(pdfFilenamePath):
    retstr = StringIO()
    parser = PDFParser(open(pdfFilenamePath, 'rb'))
    try:
        document = PDFDocument(parser)
    except Exception:
        print(pdfFilenamePath, 'could not be opened')
        return ''
    if document.is_extractable:
        rsrcmgr = PDFResourceManager()
        # codec='ascii' drops non-ASCII characters; let the converter use its default
        device = TextConverter(rsrcmgr, retstr, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
        return retstr.getvalue()
    else:
        print(pdfFilenamePath, 'text is not extractable')
        return ''


if __name__ == '__main__':
    words = getPDFText('09.03_0.pdf')
    print(words)
    with open('new.txt', 'w') as file:
        file.write(words)
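For reference, pdfminer.six also exposes a higher-level layout API that yields every text box together with its coordinates, which makes it easier to see where the cell structure is being lost. This is only a minimal sketch, assuming a recent pdfminer.six and the same 09.03_0.pdf file as in the code above:

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    # Walk the layout tree and print every text box with its bounding box,
    # to show how pdfminer groups the characters on each page.
    for page_layout in extract_pages('09.03_0.pdf'):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                x0, y0, x1, y1 = element.bbox
                print(f'({x0:.1f}, {y0:.1f}, {x1:.1f}, {y1:.1f}) {element.get_text().strip()}')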
PDF, in general, is not at all intended for storing structured data. A table in it can (with certain export settings from some programs that create PDFs) be saved in a way that can still be read as structured data (the order of rows is preserved and the cells are separated by tabs, for example). But in the general case, text in a PDF loses its structure and is stored simply as a vector image made up of text characters.
Accordingly, in the general case a PDF has to be analyzed as a graphic image, segmenting it into lines, blocks, and so on. In practice, this is exactly what OCR tools do, minus the need to recognize the individual characters.
I also ran into this problem. I did not find any open source tools that solve it, so I had to write my own parser, HoChiMinh.
I no longer maintain it, but it is in working order and does a fairly good job of finding the grid of regular tables that are aligned with the sides of the PDF page. It does, however, rely on OCR to extract the text from each cell. By default that is Tesseract, but for quality results it is better to use another tool.
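If writing a custom parser is not an option, there are also ready-made table-extraction libraries that work from the page layout rather than the raw text stream. Below is a minimal sketch using pdfplumber, an alternative library not mentioned in the answer above; the file name 09.03_0.pdf is taken from the question:

    import pdfplumber

    # Open the PDF and pull out every table pdfplumber can detect on each page.
    # extract_tables() returns a list of tables; each table is a list of rows,
    # and each row is a list of cell strings (None for empty cells).
    with pdfplumber.open('09.03_0.pdf') as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for table in page.extract_tables():
                print(f'--- table on page {page_number} ---')
                for row in table:
                    print('\t'.join(cell if cell is not None else '' for cell in row))

Whether the detected cell boundaries are correct depends on how the table was drawn in the PDF, so the output is worth checking against the original document.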