How to get plain text from .doc file in php?
Greetings! For a month now I have been struggling with code that reads text from files in different formats. So far I can only get clean text from pdf, txt, and docx. The .doc format is holding up the whole project. I have tried hundreds of search queries, and none of the solutions I found online helped. For example, these are the ones I tried:
function parseWord($userDoc)
{
    $fileHandle = fopen($userDoc, "r");
    $line = @fread($fileHandle, filesize($userDoc));
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        $pos = strpos($thisline, chr(0x00));
        // Keep only non-empty chunks that contain no NUL byte
        if ($pos === false && strlen($thisline) > 0) {
            $outtext .= $thisline . " ";
        }
    }
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}
function read_doc_file($filename) {
    if (file_exists($filename)) {
        if (($fh = fopen($filename, 'r')) !== false) {
            $headers = fread($fh, 0xA00);
            // 1 = (ord(n)*1) ; Document has from 0 to 255 characters
            $n1 = ( ord($headers[0x21C]) - 1 );
            // 1 = ((ord(n)-8)*256) ; Document has from 256 to 63743 characters
            $n2 = ( ( ord($headers[0x21D]) - 8 ) * 256 );
            // 1 = ((ord(n)*256)*256) ; Document has from 63744 to 16775423 characters
            $n3 = ( ( ord($headers[0x21E]) * 256 ) * 256 );
            // 1 = (((ord(n)*256)*256)*256) ; Document has from 16775424 to 4294965504 characters
            $n4 = ( ( ( ord($headers[0x21F]) * 256 ) * 256 ) * 256 );
            // Total length of text in the document
            $textLength = ($n1 + $n2 + $n3 + $n4);
            $extracted_plaintext = fread($fh, $textLength);
            // simple print character stream without new lines
            //echo $extracted_plaintext;
            // if you want to see your paragraphs in a new line, do this
            return nl2br($extracted_plaintext);
            // need more spacing after each paragraph use another nl2br
        }
    }
}
This turned out to be an interesting topic, and I had to get to the bottom of it.
The only thing you are missing is converting the extracted text from UTF-16LE to UTF-8 with mb_convert_encoding.
Put together, it looks like this:
function read_doc_file($filename) {
    if (file_exists($filename)) {
        // 'rb' makes the binary read explicit
        if (($fh = fopen($filename, 'rb')) !== false) {
            $headers = fread($fh, 0xA00);
            // 1 = (ord(n)*1) ; Document has from 0 to 255 characters
            $n1 = ( ord($headers[0x21C]) - 1 );
            // 1 = ((ord(n)-8)*256) ; Document has from 256 to 63743 characters
            $n2 = ( ( ord($headers[0x21D]) - 8 ) * 256 );
            // 1 = ((ord(n)*256)*256) ; Document has from 63744 to 16775423 characters
            $n3 = ( ( ord($headers[0x21E]) * 256 ) * 256 );
            // 1 = (((ord(n)*256)*256)*256) ; Document has from 16775424 to 4294965504 characters
            $n4 = ( ( ( ord($headers[0x21F]) * 256 ) * 256 ) * 256 );
            // Total length of text in the document
            $textLength = ($n1 + $n2 + $n3 + $n4);
            $extracted_plaintext = fread($fh, $textLength);
            // Release the file handle before returning
            fclose($fh);
            // The text is stored as UTF-16LE; convert it to UTF-8
            $extracted_plaintext = mb_convert_encoding( $extracted_plaintext, 'UTF-8', 'UTF-16LE' );
            return nl2br($extracted_plaintext);
        } else {
            return FALSE;
        }
    } else {
        return FALSE;
    }
}
$text = read_doc_file('test.doc');
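The four-byte arithmetic in read_doc_file() can also be written as a single little-endian read. This is only a sketch, assuming the same header offset 0x21C as the function above. The names doc_text_length and doc_text_length_bytewise are hypothetical and exist only for this illustration, and the header is built by hand rather than taken from a real .doc file:

```php
<?php
// Hypothetical helper: same result as $n1 + $n2 + $n3 + $n4 in read_doc_file().
function doc_text_length(string $headers): int
{
    // 'V' = unsigned 32-bit integer in little-endian byte order.
    $raw = unpack('V', substr($headers, 0x21C, 4))[1];
    // The original subtracts 1 from the first byte and 8 from the second:
    // 1 + 8 * 256 = 0x801.
    return $raw - 0x801;
}

// The byte-by-byte version from read_doc_file(), kept for comparison.
function doc_text_length_bytewise(string $headers): int
{
    return (ord($headers[0x21C]) - 1)
        + (ord($headers[0x21D]) - 8) * 256
        + ord($headers[0x21E]) * 256 * 256
        + ord($headers[0x21F]) * 256 * 256 * 256;
}

// Synthetic 0xA00-byte header for demonstration only (assumption, not real data).
$headers = str_repeat("\0", 0x21C)
    . pack('V', 0x801 + 1234)
    . str_repeat("\0", 0xA00 - 0x220);

var_dump(doc_text_length($headers) === doc_text_length_bytewise($headers));
```

Both functions return the same number for any header, so the choice is mostly about readability.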
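// To find out which source encoding gives readable output,
// convert a sample string from every encoding mbstring knows about
$text = "A strange string ø, æ, å, ж, п, ą, ū, ė, …";
foreach (mb_list_encodings() as $encoding) {
    echo mb_convert_encoding( $text, 'UTF-8', $encoding ) . " : " . $encoding . "<br><br>";
}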
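To see why the mb_convert_encoding step matters, here is a small sketch with hand-built sample bytes (an assumption for illustration, not data from a real document). In UTF-16LE every Latin letter is followed by a NUL byte, which is also why a filter like the one in parseWord(), which skips any chunk containing chr(0x00), would throw such text away:

```php
<?php
// Sample bytes built by hand: "Hi" followed by Cyrillic "ж" (U+0436) in UTF-16LE.
$utf16 = "H\x00i\x00\x36\x04";

// Same conversion call as in read_doc_file(); requires the mbstring extension.
$utf8 = mb_convert_encoding($utf16, 'UTF-8', 'UTF-16LE');
echo $utf8, "\n";

// The raw UTF-16LE bytes contain NUL bytes, the converted UTF-8 string does not.
var_dump(strpos($utf16, chr(0x00)) !== false);
var_dump(strpos($utf8, chr(0x00)) !== false);
```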
Or, to keep it simple, why not install catdoc?
catdoc foo.doc > foo.txt
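If you go the catdoc route, the command can be called from PHP as well. This is a rough sketch, assuming catdoc is installed and on the PATH and that shell_exec is not disabled on the server. The helper names catdoc_command and doc_to_text_with_catdoc are hypothetical:

```php
<?php
// Hypothetical helper: builds the catdoc command with the path safely quoted.
function catdoc_command(string $path): string
{
    return 'catdoc ' . escapeshellarg($path);
}

// Hypothetical wrapper: returns the text catdoc writes to stdout,
// or null when the file is missing or the command yields no output.
function doc_to_text_with_catdoc(string $path): ?string
{
    if (!is_file($path)) {
        return null;
    }
    $output = shell_exec(catdoc_command($path));
    return is_string($output) ? $output : null;
}

// Show the command that would be run for foo.doc.
echo catdoc_command('foo.doc'), "\n";
```

Quoting the path with escapeshellarg matters when file names come from users, for example from an upload form.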