How can I parse DOC (DOCX) and PDF files in PHP?
The task right now is to parse the contents of a Word document. It contains one large table (don't ask why it isn't in Excel, I'm not the author of the source material), and I need to read the values of all its cells and also work out what each value refers to: the table is laid out like a Pythagorean multiplication table, with the data at the intersection of a row and a column.
With PDF it's the same story: I need the text, the tables, the pictures...
What PHP libraries would you recommend?
DOCX is the simpler case: it's just a compressed (ZIP) container of XML files, and the data can be parsed out of those. There are plenty of libraries; choose one based on the functionality you need.
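To illustrate the "ZIP container with XML inside" point, here is a minimal sketch that uses only PHP's bundled zip and DOM extensions. The function names (readDocumentXml, firstTableRows, crossTable) are made up for this example, and it assumes a simple table: no merged cells (gridSpan / vMerge), no nested tables, column headers in the first row and row headers in the first column.

```php
<?php
declare(strict_types=1);

// Namespace of the main WordprocessingML elements (w:tbl, w:tr, w:tc, w:t).
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

/** Read word/document.xml out of a .docx file (needs the zip extension). */
function readDocumentXml(string $docxPath): string
{
    $zip = new ZipArchive();
    if ($zip->open($docxPath) !== true) {
        throw new RuntimeException("Cannot open $docxPath");
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException('word/document.xml not found');
    }
    return $xml;
}

/** Return the first table of the document as rows of cell texts. */
function firstTableRows(string $documentXml): array
{
    $dom = new DOMDocument();
    $dom->loadXML($documentXml);
    $xp = new DOMXPath($dom);
    $xp->registerNamespace('w', W_NS);

    $rows = [];
    foreach ($xp->query('(//w:tbl)[1]/w:tr') as $tr) {
        $cells = [];
        foreach ($xp->query('w:tc', $tr) as $tc) {
            // A cell's text may be split over several runs, so join every w:t.
            $text = '';
            foreach ($xp->query('.//w:t', $tc) as $t) {
                $text .= $t->textContent;
            }
            $cells[] = trim($text);
        }
        $rows[] = $cells;
    }
    return $rows;
}

/** Turn a cross table into [rowHeader][columnHeader] => value. */
function crossTable(array $rows): array
{
    $header = array_shift($rows); // first row holds the column headers
    $out = [];
    foreach ($rows as $row) {
        if ($row === []) {
            continue;
        }
        // Keep the original column indexes so they line up with $header.
        foreach (array_slice($row, 1, null, true) as $i => $value) {
            if (isset($header[$i])) {
                $out[$row[0]][$header[$i]] = $value;
            }
        }
    }
    return $out;
}
```

Usage would be something like crossTable(firstTableRows(readDocumentXml('table.docx'))), where table.docx is a placeholder path. For real-world documents with merged cells, one of the ready-made libraries is probably less painful.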
As for PDF, the only library I know of is FPDF, and that one is for generating PDFs rather than reading them. I would recommend finding some tool like PDF2HTML and simply running it as a command from PHP to pull out the information.
As mentioned above, there are plenty of solutions for DOCX. PDF, obviously, cannot in the general case be parsed into "text, tables, pictures". Or rather, it can, but only to a limited extent.
I have parsed PDFs after running them through pdftohtml. The output has no tables; the layout has to be worked out from the coordinates in the styles. It's clumsy and inconvenient, but I didn't find another solution.
Output in XML format is, IMHO, more convenient to parse.
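A rough sketch of the pdftohtml route with XML output, to show what "working out the layout from coordinates" looks like in practice. It assumes poppler's pdftohtml is installed and on PATH, that the -xml, -i, -q and -stdout flags behave as described in its man page, and that the output consists of page elements containing text elements with top and left attributes. The function names and the $tolerance parameter are invented for this example.

```php
<?php
declare(strict_types=1);

/** Run pdftohtml and return its XML output (assumes the binary is on PATH). */
function pdfToXml(string $pdfPath): string
{
    // -xml: XML output, -i: skip images, -q: no messages, -stdout: print result
    $cmd = 'pdftohtml -xml -i -q -stdout ' . escapeshellarg($pdfPath);
    $xml = shell_exec($cmd);
    if (!is_string($xml) || $xml === '') {
        throw new RuntimeException('pdftohtml produced no output');
    }
    return $xml;
}

/**
 * Group the text fragments of every page into rows by their "top" coordinate.
 * $tolerance is how far below the first fragment of a row another fragment
 * may sit and still be counted as part of the same row.
 * Returns [pageNumber => [[cellText, ...], ...]].
 */
function textRows(string $xml, int $tolerance = 2): array
{
    $dom = new DOMDocument();
    if (!$dom->loadXML($xml)) {
        throw new RuntimeException('pdftohtml output is not well-formed XML');
    }
    $pages = [];
    foreach ($dom->getElementsByTagName('page') as $page) {
        $frags = [];
        foreach ($page->getElementsByTagName('text') as $t) {
            $frags[] = [
                'top'  => (int) $t->getAttribute('top'),
                'left' => (int) $t->getAttribute('left'),
                'text' => trim($t->textContent), // also drops inline <b>/<i>
            ];
        }
        // Read top to bottom, then left to right.
        usort($frags, fn($a, $b) => [$a['top'], $a['left']] <=> [$b['top'], $b['left']]);

        $rows = [];
        $rowTop = null;
        foreach ($frags as $f) {
            if ($rowTop === null || $f['top'] - $rowTop > $tolerance) {
                $rows[] = [];          // start a new row
                $rowTop = $f['top'];
            }
            $rows[count($rows) - 1][] = $f;
        }
        foreach ($rows as &$row) {
            usort($row, fn($a, $b) => $a['left'] <=> $b['left']);
            $row = array_column($row, 'text');
        }
        unset($row);
        $pages[(int) $page->getAttribute('number')] = $rows;
    }
    return $pages;
}
```

Calling textRows(pdfToXml('file.pdf')) (placeholder path) gives rows of text per page. It does not detect columns; for a real table you would also have to bucket the left coordinates, and multi-line cells will break the simple grouping.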
Plenty has already been said about PDF. As for DOC / DOCX:
1) habrahabr.ru/post/138666/
habrahabr.ru/post/140012/
These describe quite sensibly how it all works from the inside.
2) Somewhere in the same series of articles there was also one about .doc.
3) My advice to you: www.phpdocx.com/
At one time we also thought it was easier to do everything by hand. There is a free version, and it should be enough for most simple tasks.
P.S. Read up on how it all works anyway.