How to get plain text from a .doc file in PHP?

PHP

desperate_one, 2019-10-08 13:52:13

Greetings! For a month now I've been struggling with code that reads text from files in different formats. So far I can only get clean text from pdf, txt, and docx; the .doc format is holding up all the work. I have gone through hundreds of search results, and none of the solutions offered online helped. One of them is this:

function parseWord($userDoc) 
{
    $fileHandle = fopen($userDoc, "r");
    $line = @fread($fileHandle, filesize($userDoc));   
    $lines = explode(chr(0x0D),$line);
    $outtext = "";
    foreach($lines as $thisline)
      {
        $pos = strpos($thisline, chr(0x00));
        if (($pos !== FALSE)||(strlen($thisline)==0))
          {
          } else {
            $outtext .= $thisline." ";
          }
      }
     $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$outtext);
    return $outtext;
}

UPD: The code above was posted by mistake. Even if Cyrillic is added to the character filter in that parser, it does not solve the problem, so the filter is not the real issue. There is another script that seems to parse the .doc file more correctly, but it does not support Cyrillic either. I say "more correctly" because the code above returns the wrong number of characters and gets even Latin characters wrong, whereas the following code returns the correct number of characters and even preserves paragraphs, but renders any characters other than Latin letters as squares.

function read_doc_file($filename) {
     if(file_exists($filename))
    {
        if(($fh = fopen($filename, 'r')) !== false ) 
        {
           $headers = fread($fh, 0xA00);

           // 1 = (ord(n)*1) ; Document has from 0 to 255 characters
           $n1 = ( ord($headers[0x21C]) - 1 );

           // 1 = ((ord(n)-8)*256) ; Document has from 256 to 63743 characters
           $n2 = ( ( ord($headers[0x21D]) - 8 ) * 256 );

           // 1 = ((ord(n)*256)*256) ; Document has from 63744 to 16775423 characters
           $n3 = ( ( ord($headers[0x21E]) * 256 ) * 256 );

           // 1 = (((ord(n)*256)*256)*256) ; Document has from 16775424 to 4294965504 characters
           $n4 = ( ( ( ord($headers[0x21F]) * 256 ) * 256 ) * 256 );

           // Total length of text in the document
           $textLength = ($n1 + $n2 + $n3 + $n4);

           $extracted_plaintext = fread($fh, $textLength);

           // simple print character stream without new lines
           //echo $extracted_plaintext;

           // if you want to see your paragraphs in a new line, do this
           return nl2br($extracted_plaintext);
           // need more spacing after each paragraph use another nl2br
        }
    }   
    }
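
A side note on the length arithmetic in the function above: the four per-byte terms add up to the 4 bytes at offset 0x21C read as one little-endian 32-bit integer, minus 0x801 (the 1 subtracted from the first byte plus 8 * 256 from the second). A minimal sketch of that equivalence follows; `doc_text_length` is a hypothetical helper name and the header bytes are synthetic, not taken from a real document.

```php
<?php
// Hypothetical helper: same result as $n1 + $n2 + $n3 + $n4 in read_doc_file().
function doc_text_length(string $headers): int
{
    // 'V' = unsigned 32-bit integer, little-endian byte order
    $value = unpack('V', substr($headers, 0x21C, 4))[1];

    // 0x801 = 1 (taken from byte 0) + 8 * 256 (taken from byte 1)
    return $value - 0x801;
}

// Synthetic 0xA00-byte header with bytes 0D 0A 00 00 at offset 0x21C.
$headers = str_repeat("\0", 0x21C) . "\x0D\x0A\x00\x00" . str_repeat("\0", 0xA00 - 0x220);

// Per-byte version from the function above, for comparison.
$n1 = ord($headers[0x21C]) - 1;
$n2 = (ord($headers[0x21D]) - 8) * 256;
$n3 = ord($headers[0x21E]) * 256 * 256;
$n4 = ord($headers[0x21F]) * 256 * 256 * 256;

echo doc_text_length($headers), "\n"; // 524
echo $n1 + $n2 + $n3 + $n4, "\n";     // 524
```

Both lines print the same number, so the single `unpack` call can stand in for the four separate terms.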

I tried PHPWord, but it only works with docx, which can already be read in about 10 lines of code.
Cyrillic is still not handled, though, and I need support for all languages. Does anyone have a solution, or at least advice on how to approach it: how do you get just plain text from .doc files in general?

2 answer(s)

Pavel Chesnokov, 2019-10-08
@desperate_one

The topic turned out to be very interesting, and I had to get to the bottom of it.
The one thing you are missing is converting the extracted text from UTF-16LE to UTF-8 with mb_convert_encoding.
Put together, it looks like this:

function read_doc_file($filename) {
    if (file_exists($filename)) {
        if (($fh = fopen($filename, 'r')) !== false) {
            $headers = fread($fh, 0xA00);

            // 1 = (ord(n)*1) ; Document has from 0 to 255 characters
            $n1 = ( ord($headers[0x21C]) - 1 );

            // 1 = ((ord(n)-8)*256) ; Document has from 256 to 63743 characters
            $n2 = ( ( ord($headers[0x21D]) - 8 ) * 256 );

            // 1 = ((ord(n)*256)*256) ; Document has from 63744 to 16775423 characters
            $n3 = ( ( ord($headers[0x21E]) * 256 ) * 256 );

            // 1 = (((ord(n)*256)*256)*256) ; Document has from 16775424 to 4294965504 characters
            $n4 = ( ( ( ord($headers[0x21F]) * 256 ) * 256 ) * 256 );

            // Total length of text in the document
            $textLength = ($n1 + $n2 + $n3 + $n4);

            $extracted_plaintext = fread($fh, $textLength);
            fclose($fh);

            // Treat the extracted bytes as UTF-16LE and convert them to UTF-8
            $extracted_plaintext = mb_convert_encoding($extracted_plaintext, 'UTF-8', 'UTF-16LE');
            return nl2br($extracted_plaintext);

        } else {
            return FALSE;
        }
    } else {
        return FALSE;
    }
}

$text = read_doc_file('test.doc');
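
To see why the conversion matters, here is a small self-contained sketch that does not touch a real .doc file. It assumes, as the function above does, that the text bytes are UTF-16LE; `doc_bytes_to_utf8` is a hypothetical name used only for illustration.

```php
<?php
// Hypothetical helper mirroring the conversion step in read_doc_file().
function doc_bytes_to_utf8(string $raw): string
{
    return mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE');
}

// Stand-in for bytes read from a document: "Привет" encoded as UTF-16LE.
$raw = mb_convert_encoding('Привет', 'UTF-16LE', 'UTF-8');

echo strlen($raw), "\n";            // 12: two bytes per character
echo bin2hex($raw), "\n";           // 1f0440043804320435044204
echo doc_bytes_to_utf8($raw), "\n"; // Привет
```

Printed without the conversion, those 12 bytes are not valid UTF-8, which would explain the squares. Latin letters in UTF-16LE are an ASCII byte followed by a zero byte, so they can still look roughly right.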

While looking into this, I also came across an interesting test that may come in handy:

$text = "A strange string ø, æ, å, ж, п, ą, ū, ė, …"; 
foreach(mb_list_encodings() as $chr){ 
    echo mb_convert_encoding( $text, 'UTF-8', $chr ) . " : " . $chr . "<br><br>";    
}
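
A variation on the same idea, as a rough sketch: if you know a word that must appear in the document, you can let the code pick the encoding instead of reading the output by eye. `guess_encoding` and the candidate list are assumptions for illustration, not part of mbstring.

```php
<?php
// Hypothetical helper: returns the first candidate encoding under which
// $bytes decodes to text containing $expected, or null if none does.
function guess_encoding(string $bytes, string $expected, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        $decoded = mb_convert_encoding($bytes, 'UTF-8', $encoding);
        if (is_string($decoded) && strpos($decoded, $expected) !== false) {
            return $encoding;
        }
    }
    return null;
}

// Simulate bytes stored in a legacy Cyrillic code page.
$bytes = mb_convert_encoding('Привет', 'Windows-1251', 'UTF-8');
$candidates = ['UTF-16LE', 'UTF-16BE', 'Windows-1251', 'KOI8-R'];

echo guess_encoding($bytes, 'Привет', $candidates), "\n"; // Windows-1251
```

This only works when a known word is available, and short samples can match more than one encoding, so treat the result as a hint.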

developer007, 2019-10-09
@developer007

Why not make it simpler and install catdoc?

catdoc foo.doc > foo.txt
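
If the result is needed inside PHP, the command can be called through `shell_exec`. A minimal sketch, assuming catdoc is installed and on the PATH and that its `-d` option (output charset) is available in your build; `catdoc_command` and `read_doc_with_catdoc` are hypothetical names.

```php
<?php
// Hypothetical helper: builds the catdoc command line for one file.
function catdoc_command(string $path): string
{
    // -d utf-8 asks catdoc for UTF-8 output; escapeshellarg() quotes the file name
    return 'catdoc -d utf-8 ' . escapeshellarg($path);
}

// Hypothetical wrapper: returns the extracted text, or null on failure.
function read_doc_with_catdoc(string $path): ?string
{
    if (!is_file($path)) {
        return null;
    }
    $output = shell_exec(catdoc_command($path));
    return is_string($output) ? $output : null;
}

$text = read_doc_with_catdoc('foo.doc');
if ($text === null) {
    echo "Could not read foo.doc\n";
} else {
    echo $text;
}
```

This needs `shell_exec` to be enabled on the server, which is not always the case on shared hosting.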