PHP
Anton Piskunov, 2013-01-18 14:56:56

How to parse PDF using PHP?

Task: open the document and extract usable text.
Tried: everything. Really everything: every Google result is already a visited (purple) link, and even contextual advertising has taken pity on me and offers to sell me a PDF book reader.

Besides PHP, I am ready to use any other technology that gives a reliable result. Still, I would prefer not to leave the project's native language.

Writing a parser from scratch from the format specification is the hardcore option and not very appealing, given the complexity of the format and the zoo of versions.

8 answer(s)
egorinsk, 2013-01-19
@egorinsk

How do you imagine such a conversion when a PDF stores text as lines at specific coordinates rather than as paragraphs? Text can also be stored as an image or as vector outlines. A table is stored as a set of text fragments and ruled lines. A heading is just a line of text in a slightly larger font.
To restore the logical structure of the text you need a system like the one in ABBYY's FineReader products. That system is complex and ABBYY spent a lot of money developing it, so it is unlikely that you can solve the problem more simply. Without it, the most you can extract from the file is a set of blocks of the form "a line with such-and-such text is located at such-and-such coordinates". Words may also be split by hyphens at line breaks.
Paragraphs can, of course, be partly restored by ordering the lines by their coordinates, but the hyphenation will remain, and anything non-standard, such as a picture caption, will break the algorithm.
In summary: choose a different source format, or give up on converting PDF to meaningful text and convert it to an image, for example. Otherwise you will be adding workarounds forever, every time someone feeds your system a document that was laid out differently.
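
The line-ordering idea described in this answer can be sketched roughly as follows. This is a minimal illustration, not a robust solution: the input array shape ('top', 'left', 'height', 'text'), the function name rebuildParagraphs and the 1.5 gap factor are all assumptions made for the example, and the hyphen rule will also glue genuinely hyphenated compounds that happen to break at a line end.

```php
<?php
// Hypothetical input: one entry per text line, as some extractor might report it.
// 'top' grows downward; all field names here are illustrative assumptions.

/**
 * Rebuild paragraphs from positioned lines.
 * A vertical gap larger than $gapFactor * line height starts a new paragraph.
 */
function rebuildParagraphs(array $lines, float $gapFactor = 1.5): array
{
    // Read top to bottom, then left to right.
    usort($lines, function (array $a, array $b): int {
        return [$a['top'], $a['left']] <=> [$b['top'], $b['left']];
    });

    $paragraphs = [];
    $current = '';
    $prev = null;

    foreach ($lines as $line) {
        $text = trim($line['text']);
        if ($text === '') {
            continue;
        }
        $newParagraph = $prev !== null
            && ($line['top'] - $prev['top']) > $gapFactor * $prev['height'];

        if ($newParagraph) {
            $paragraphs[] = $current;
            $current = $text;
        } elseif ($current === '') {
            $current = $text;
        } elseif (preg_match('/\p{L}-$/u', $current)) {
            // Previous line ends with letter + hyphen: assume a split word and glue it.
            $current = substr($current, 0, -1) . $text;
        } else {
            $current .= ' ' . $text;
        }
        $prev = $line;
    }
    if ($current !== '') {
        $paragraphs[] = $current;
    }
    return $paragraphs;
}

// Deliberately out of order, to show that sorting by coordinates matters.
$lines = [
    ['top' => 150, 'left' => 50, 'height' => 12, 'text' => 'A second paragraph.'],
    ['top' => 100, 'left' => 50, 'height' => 12, 'text' => 'PDF stores text as posi-'],
    ['top' => 114, 'left' => 50, 'height' => 12, 'text' => 'tioned lines, not paragraphs.'],
];

$paragraphs = rebuildParagraphs($lines);
print_r($paragraphs);
```

A caption, a footnote or a two-column layout would defeat this ordering, which is exactly the fragility the answer warns about.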

plaha, 2013-01-18
@plaha

What about calling third-party software through exec() from PHP? A PDF-to-text converter, for example.
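
One way this could look, assuming the pdftotext utility from poppler-utils is installed and on PATH (the answer does not name a specific converter, so this choice is an assumption). The function names buildPdftotextCommand and pdfToText are made up for the sketch.

```php
<?php
// Assumes pdftotext (poppler-utils) is available on PATH.

/** Build the shell command; '-' as the output file makes pdftotext write to stdout. */
function buildPdftotextCommand(string $pdfPath): string
{
    // -layout keeps the physical layout, -enc sets the output encoding.
    return 'pdftotext -layout -enc UTF-8 ' . escapeshellarg($pdfPath) . ' -';
}

/** Run the converter and return the extracted text, or throw on failure. */
function pdfToText(string $pdfPath): string
{
    if (!is_readable($pdfPath)) {
        throw new InvalidArgumentException("Cannot read $pdfPath");
    }
    exec(buildPdftotextCommand($pdfPath), $output, $exitCode);
    if ($exitCode !== 0) {
        throw new RuntimeException("pdftotext exited with code $exitCode");
    }
    return implode("\n", $output);
}

// Usage (document.pdf is a placeholder name):
// echo pdfToText('document.pdf');
```

Passing the path through escapeshellarg() matters if file names come from users. The result is still only as good as the text layer in the PDF: a scanned document will produce nothing useful.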

Skull, 2013-01-18
@Skull

I converted the PDF to XML using pdftohtml, then used SimpleXMLElement to parse 3-page tables of contents out of the resulting document.
Or does your document simply look like plain text that was dumped into a PDF?
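
A rough sketch of that approach, assuming the XML was produced with `pdftohtml -xml`. The sample document below is invented for illustration, and extractRows is a hypothetical helper; grouping fragments by an exactly equal `top` value is also a simplifying assumption, since real output may place fragments of one visual row a pixel or two apart.

```php
<?php
// Sample shaped like `pdftohtml -xml` output; the values are made up for illustration.
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<pdf2xml>
  <page number="1" position="absolute" top="0" left="0" height="1188" width="918">
    <text top="120" left="90" width="200" height="16" font="0">1. Introduction</text>
    <text top="120" left="700" width="20" height="16" font="0">3</text>
    <text top="145" left="90" width="260" height="16" font="0"><b>2. File structure</b></text>
    <text top="145" left="700" width="20" height="16" font="0">7</text>
  </page>
</pdf2xml>
XML;

/** Group <text> fragments that share a 'top' coordinate into rows, left to right. */
function extractRows(string $xml): array
{
    $doc = new SimpleXMLElement($xml);
    $rows = [];
    foreach ($doc->page as $page) {
        foreach ($page->text as $node) {
            // textContent also picks up text nested in <b>/<i> children.
            $text = trim(dom_import_simplexml($node)->textContent);
            $key = (int) $page['number'] . ':' . (int) $node['top'];
            $rows[$key][(int) $node['left']] = $text;
        }
    }
    foreach ($rows as &$cells) {
        ksort($cells);               // order cells by their left coordinate
        $cells = array_values($cells);
    }
    unset($cells);
    return array_values($rows);
}

$rows = extractRows($xml);
foreach ($rows as $cells) {
    echo implode(' | ', $cells), PHP_EOL;
}
```

For a table of contents, each row then comes out as a title cell followed by a page-number cell.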

KEKSOV, 2013-01-18
@KEKSOV

A similar question was discussed here. Unfortunately, the author of that question never commented on whether the utilities suggested in the answers worked. It might make sense to send him a private message and ask; perhaps he has made some progress.

oENDark, 2013-01-18
@oENDark

Try a PDF -> Excel converter; there are already plenty of Excel parsers.

Denis Medved, 2013-01-18
@BuCeFaL

FPDI www.setasign.de/products/pdf-php-solutions/fpdi/

asm0dey, 2013-01-19
@asm0dey

You can run the OpenOffice daemon; as far as I remember, it can convert PDF to HTML, or to RTF.

sivabur, 2015-04-18
@sivabur

The class: pastebin.com/dvwySU1a

// Load the PDF2Text class (saved from the pastebin link above)
require_once 'class.pdf2text.php';
$a = new PDF2Text();
// Test file: http://www.newyorklivearts.org/Videographer_RFP.pdf
$a->setFilename('Videographer_RFP.pdf');
$a->decodePDF();
echo $a->output(); // print the extracted text

There are currently problems with some characters; I am still figuring out why.
But you will get plain text.
