Character encoding
germn, 2011-12-20 14:43:49

Text processing with mystem in PHP?

The mystem program performs morphological analysis of Russian text.
I'm trying to process a string with it:

<?php
function mystem($q) {
 $out = array();
 exec('echo ' . $q . ' | ' . dirname(__FILE__) . '\mystem\mystem.exe -i', $out);
 $q = implode('', $out);
 return $q;
}

echo mystem('в мурелки шлепают пельсиски');
?>

Environment: XAMPP 1.7.3, PHP 5.3.1, Windows 7. I output the result in a UTF-8 encoded document:
�{�??}�{�??}��{��??}���{���??}�{�??}����{����??}���{���??}��{��??}���{���??}

The problem is the encoding, because mystem works with cp1251 by default. I tried adding the -e option to change the encoding (see the documentation) and changed the corresponding line:
exec('echo ' . $q . ' | ' . dirname(__FILE__) . '\mystem\mystem.exe -e utf-8 -i', $out);

This outputs nothing.
I also tried working with windows-1251:
<?php
function mystem($q) {
 $out = array();
 $q = iconv("utf-8", "windows-1251", $q);
 exec('echo ' . $q . ' | ' . dirname(__FILE__) . '\mystem\mystem.exe -i', $out);
 $q = implode('', $out);
 return $q;
}

echo mystem('в мурелки шлепают пельсиски');
?>

I output the result in a document with windows-1251 encoding. Output:
ў{ў??}гаҐ{гаҐ??}ЄЁ{ЄЁ??}и{и=INTJ=|и=PART=|и=S,сокр=им,ед|=S,сокр=им,мн|=S,сокр=род,ед|=S,сокр=род,мн|=S,сокр=дат,ед|=S,сокр=дат,мн|=S,сокр=вин,ед|=S,сокр=вин,мн|=S,сокр=твор,ед|=S,сокр=твор,мн|=S,сокр=пр,ед|=S,сокр=пр,мн|и=CONJ=}ҐЇ{ҐЇ??}ов{ов??}ЇҐ{ЇҐ??}мбЁбЄЁ{мбЁбЄЁ??}

Only the letter и is processed correctly; everything else is gibberish.
Why is this happening? What am I doing wrong? Please explain the errors. Thanks in advance.
Update: antoo, thanks, your solution works.
In the end, I did this for a script saved in UTF-8:
<?php
function mystem($q) {
 $q = iconv("utf-8", "windows-1251", $q);
 $result = exec('echo "'.$q.'" | mystem.exe -i -e cp866');
 $result = iconv("cp866", "utf-8", $result);
 return $result;
}

header("Content-type: text/html; charset=utf-8");
echo mystem('в мурелки шлепают пельсиски');
?>

It is still a mystery to me why the string is passed in windows-1251 while cp866 is specified in the -e option (rather than cp1251, which would seem correct), but the task is solved.
The fact that the script does not work correctly with -e utf8 appears to be a flaw in mystem, because it handles files in this encoding without problems.
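
The conversion steps in the update can be separated from the process call, which makes them checkable without mystem installed. The following is only a sketch: mystem_utf8 and the $run callable are names made up here, and it assumes a UTF-8 script, the iconv extension, and a mystem.exe reachable from the working directory, as in the code above.

```php
<?php
// Sketch only. $run is a hypothetical callable that executes a command line
// and returns the raw bytes printed by mystem (for example, a wrapper around exec()).
function mystem_utf8($text, $run)
{
    // The script is UTF-8; the command line is built from windows-1251 bytes.
    $input = iconv('UTF-8', 'WINDOWS-1251', $text);
    $raw = $run('echo "' . $input . '" | mystem.exe -i -e cp866');
    // mystem was started with -e cp866, so its output is decoded from cp866.
    return iconv('CP866', 'UTF-8', $raw);
}

// Real use on Windows:
// echo mystem_utf8('в мурелки шлепают пельсиски', function ($cmd) { return exec($cmd); });
```

Passing a fake $run that returns known cp866 bytes lets you confirm that the two iconv calls round-trip correctly before involving the console at all.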

2 answer(s)
@antoo, 2011-12-20
@germn

This option works for me:

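<?php
function mystem($q) {
  // Run mystem with -e cp866, the encoding used by the Windows console.
  $result = exec('echo "'.$q.'" | mystem.exe -i -e cp866');
  // Convert the cp866 output to the encoding of the page (windows-1251).
  $result = iconv("cp866", "windows-1251", $result);
  return $result;
}

echo mystem('в мурелки шлепают пельсиски');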

Script encoding: ANSI (Notepad++)

@antoo, 2011-12-20

mystem is a console application, and the console uses the cp866 encoding.
You can try:
$q = convert_cyr_string($q,"k","w");
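
To see why the console encoding matters, here is a small sketch (not from the thread) that reproduces part of the gibberish from the question without running mystem. It assumes the script file is saved as UTF-8 and that the iconv extension is available; the function name reinterpret_bytes is made up for illustration. It encodes a word as cp866, as the console would, and then reads the same bytes as windows-1251:

```php
<?php
// Hypothetical helper: encode UTF-8 text in $writtenAs, then decode the very
// same bytes as if they were $readAs, and return the result as UTF-8.
function reinterpret_bytes($utf8Text, $writtenAs, $readAs)
{
    $bytes = iconv('UTF-8', $writtenAs, $utf8Text);
    return iconv($readAs, 'UTF-8', $bytes);
}

// The cp866 bytes of "шлепают", shown as hex.
echo bin2hex(iconv('UTF-8', 'CP866', 'шлепают')), "\n"; // e8aba5afa0eee2

// The same bytes read as windows-1251.
echo reinterpret_bytes('шлепают', 'CP866', 'WINDOWS-1251'), "\n";
```

The first byte, 0xE8, is ш in cp866 but и in windows-1251. That would explain why only и was recognized in the question's second attempt: the remaining bytes map to other letters or to non-letters such as «, so the word falls apart into the fragments ҐЇ and ов seen in the output.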
