PHP
King_Of_Magic, 2013-02-07 23:35:45

How can I parse DOC (DOCX) and PDF files in PHP?

At the moment, the task is to parse the contents of a Word document. It contains a large table (don't ask why it isn't in Excel, I am not the author of the source material), and I need to read the values of all its cells and also work out what each value refers to: the table is built like a Pythagorean multiplication table, with the desired data at the intersection of rows and columns.

With PDF it is the same story: I need the text, the tables, the pictures...

What libraries for PHP do you recommend?

4 answer(s)
Sergey Protko, 2013-02-07
@Fesor

With DOCX, everything is simpler: it is just a compressed (ZIP) container holding XML files, from which the data can be parsed directly. There are plenty of libraries; pick one based on the functionality you need.
As for PDF, the only library I know is FPDF, and that one generates PDFs rather than reading them. I would recommend finding a tool like PDF2HTML and simply running it as a command from PHP to pull out the information.
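
To illustrate the "ZIP container with XML inside" point, here is a minimal sketch that reads table cells straight out of a DOCX without any third-party library. It assumes the usual WordprocessingML layout (the main part at word/document.xml, tables as w:tbl / w:tr / w:tc with text in w:t) and the zip and dom PHP extensions. The function names (extractTables, readDocxTables, indexCrossTable) are made up for this example, and merged cells and nested tables are not handled specially.

```php
<?php
// Sketch only: function names are hypothetical, not part of any library.
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

/** Parse WordprocessingML and return tables as [table][row][cell] => text. */
function extractTables(string $documentXml): array
{
    $dom = new DOMDocument();
    $dom->loadXML($documentXml);
    $xp = new DOMXPath($dom);
    $xp->registerNamespace('w', W_NS);

    $tables = [];
    foreach ($xp->query('//w:tbl') as $tbl) {
        $rows = [];
        foreach ($xp->query('./w:tr', $tbl) as $tr) {
            $cells = [];
            foreach ($xp->query('./w:tc', $tr) as $tc) {
                // A cell's text may be split across several runs; join them.
                $text = '';
                foreach ($xp->query('.//w:t', $tc) as $t) {
                    $text .= $t->textContent;
                }
                $cells[] = trim($text);
            }
            $rows[] = $cells;
        }
        $tables[] = $rows;
    }
    return $tables;
}

/** Open a .docx (a ZIP archive) and parse its main document part. */
function readDocxTables(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a ZIP archive");
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException('word/document.xml not found in archive');
    }
    return extractTables($xml);
}

/** Turn a cross table into $lookup[rowHeader][columnHeader] = value. */
function indexCrossTable(array $rows): array
{
    $header = array_shift($rows);        // first row holds the column headers
    $lookup = [];
    foreach ($rows as $row) {
        if ($row === []) {
            continue;                    // skip rows without cells
        }
        $rowKey = $row[0];               // first cell holds the row header
        // Preserve keys so each value lines up with its column header.
        foreach (array_slice($row, 1, null, true) as $i => $value) {
            $lookup[$rowKey][$header[$i] ?? $i] = $value;
        }
    }
    return $lookup;
}
```

With a real file this would be used roughly as $tables = readDocxTables('report.docx'); $lookup = indexCrossTable($tables[0]); where 'report.docx' is a placeholder path. The lookup step addresses the "multiplication table" layout from the question: the first row and first column are treated as headers, and everything else is stored under that pair.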

barker, 2013-02-07
@barker

There are plenty of solutions for DOCX, as mentioned above. PDF, on the other hand, obviously cannot be parsed into "text, tables, pictures" in the general case. More precisely, it is possible, but only to a limited extent.

Alexey Akulovich, 2013-02-08
@AterCattus

I have parsed PDFs after running them through pdftohtml. There are no tables in the output: the layout has to be worked out from the coordinates in the styles. It is clumsy and inconvenient, but I did not find another solution.
The XML output format is, IMHO, more convenient for parsing.
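
A rough sketch of that coordinate-based approach, assuming poppler's pdftohtml is installed and that its -xml output consists of page elements containing text elements with top and left attributes (that is how I remember the format, so check it against your version). The function names and the 3-unit row tolerance are arbitrary choices for illustration. The -i flag skips images, so drop it if you need the pictures as well.

```php
<?php
// Sketch only: pdfToXml and groupTextIntoRows are hypothetical helper names.

/** Run pdftohtml and return its XML output (assumes poppler-utils is on PATH). */
function pdfToXml(string $pdfPath): string
{
    $cmd = 'pdftohtml -xml -i -stdout ' . escapeshellarg($pdfPath);
    $out = shell_exec($cmd);
    if (!is_string($out) || $out === '') {
        throw new RuntimeException("pdftohtml produced no output for $pdfPath");
    }
    return $out;
}

/**
 * Group positioned text fragments into visual rows.
 * Returns [page][row] => list of fragment strings, ordered left to right.
 */
function groupTextIntoRows(string $xml, int $tolerance = 3): array
{
    $dom = new DOMDocument();
    if (!$dom->loadXML($xml)) {
        throw new RuntimeException('Could not parse pdftohtml XML');
    }

    $pages = [];
    foreach ($dom->getElementsByTagName('page') as $page) {
        $items = [];
        foreach ($page->getElementsByTagName('text') as $node) {
            $items[] = [
                'top'  => (int) $node->getAttribute('top'),
                'left' => (int) $node->getAttribute('left'),
                // textContent also flattens inline <b>/<i> markup
                'text' => trim($node->textContent),
            ];
        }

        // Walk the fragments top to bottom and start a new row whenever
        // the vertical gap to the current row's first fragment is too large.
        usort($items, fn($a, $b) => $a['top'] <=> $b['top']);
        $rows = [];
        $rowTop = null;
        foreach ($items as $item) {
            if ($rowTop === null || $item['top'] - $rowTop > $tolerance) {
                $rows[] = [];
                $rowTop = $item['top'];
            }
            $rows[count($rows) - 1][] = $item;
        }

        // Within each row, order fragments left to right and keep only the text.
        foreach ($rows as &$row) {
            usort($row, fn($a, $b) => $a['left'] <=> $b['left']);
            $row = array_column($row, 'text');
        }
        unset($row);

        $pages[] = $rows;
    }
    return $pages;
}
```

Typical use would be $pages = groupTextIntoRows(pdfToXml('table.pdf')); with a placeholder file name. This only recovers rows. Assigning fragments to columns still means comparing their left values against the column boundaries of the particular document, which is exactly the inconvenient part.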

Vitaly Yushkevich, 2013-02-08
@yushkevichv

Much has already been said about PDF. As for DOC / DOCX:
1) habrahabr.ru/post/138666/
habrahabr.ru/post/140012/
These explain quite sensibly how it all works from the inside.
2) Somewhere in the same series of articles there was also one about .doc.
3) My advice to you: www.phpdocx.com/
At one time we also thought it would be easier to do everything by hand. There is a free version, which should be enough for most simple tasks.
P.S. Still, do read up on how everything works.
