PHP
@zipo, 2014-05-21 20:38:42

How can I compare two texts in PHP?

Can anyone recommend a ready-made PHP solution that takes two texts and returns the difference between them? In other words, a text comparator written in PHP.
Ideally it should handle UTF-8, but I would appreciate any suggestions, including links to descriptions of text comparison algorithms or to Java/C++ implementations.
I need the exact differences between large texts. I am not trying to measure how similar lines or texts are.


5 answer(s)
Velross, 2015-05-17
@Velross

I think this is what you need:
easywebscripts.net/php/php_text_differences.php

Dmitry Kozlov, 2016-04-08
@asperin

/**
 * Highlights the differences between two texts (at line or word granularity).
 * Changes are wrapped in "span" tags with the classes 'added', 'deleted', 'changed'.
 * Algorithm: http://easywebscripts.net/php/php_text_differences.php
 *
 * @return array|false - texts A and B with changes marked up, or FALSE on invalid arguments
 * @param string $textA
 * @param string $textB
 * @param string $delimeter - " " (space): find changes at word granularity; "\n": at line granularity
 */
function getTextDiff($textA, $textB, $delimeter = "\n") {
    if (!is_string($textA) || !is_string($textB) || !is_string($delimeter)) {
        return FALSE;
    }

    // Build a table of unique words (or lines)
    $arrA = explode($delimeter, str_replace("\r", "", $textA));
    $arrB = explode($delimeter, str_replace("\r", "", $textB));
    $unickTable = array_unique(array_merge($arrA, $arrB));
    $unickTableFlip = array_flip($unickTable);

    // Convert each text into a sequence of IDs
    $arrAid = $arrBid = array();
    foreach($arrA as $v) {
        $arrAid[] = $unickTableFlip[$v];
    }
    foreach($arrB as $v) {
        $arrBid[] = $unickTableFlip[$v];
    }

    // Find the longest common subsequence (LCS)
    $maxLen = array();
    for ($i = 0, $x = count($arrAid); $i <= $x; $i++) {
        $maxLen[$i] = array();
        for ($j = 0, $y = count($arrBid); $j <= $y; $j++) {
            $maxLen[$i][$j] = 0; // numeric zero: these cells are used in arithmetic below
        }
    }
    for ($i = count($arrAid) - 1; $i >= 0; $i--) {
        for ($j = count($arrBid) - 1; $j >= 0; $j--) {
            if ($arrAid[$i] == $arrBid[$j]) {
                $maxLen[$i][$j] = 1 + $maxLen[$i+1][$j+1];
            } else {
                $maxLen[$i][$j] = max($maxLen[$i+1][$j], $maxLen[$i][$j+1]);
            }
        }
    }
    $longest = array();
    for ($i = 0, $j = 0; $maxLen[$i][$j] != 0 && $i < $x && $j < $y;) {
        if ($arrAid[$i] == $arrBid[$j]) {
            $longest[] = $arrAid[$i];
            $i++;
            $j++;
        } else {
            if ($maxLen[$i][$j] == $maxLen[$i+1][$j]) {
                $i++;
            } else {
                $j++;
            }
        }
    }

    // Compare each text against the LCS and mark the changes
    $arrBidDiff = array();
    $i1 = 0; $i2 = 0;
    for ($i = 0, $iters = count($arrBid); $i < $iters; $i++) {
        $simbol = array();
        if (isset($longest[$i1]) && $longest[$i1] == $arrBid[$i2]) {
            $simbol[] = $longest[$i1];
            $simbol[] = "*";
            $arrBidDiff[] = $simbol;
            $i1++;
            $i2++;
        } else {
            $simbol[] = $arrBid[$i2];
            $simbol[] = "+";
            $arrBidDiff[] = $simbol;
            $i2++;
        }
    }
    $arrAidDiff = array();
    $i1 = 0; $i2 = 0;
    for ($i = 0, $iters = count($arrAid); $i < $iters; $i++) {
        $simbol = array();
        if (isset($longest[$i1]) && $longest[$i1] == $arrAid[$i2]) {
            $simbol[] = $longest[$i1];
            $simbol[] = "*";
            $arrAidDiff[] = $simbol;
            $i1++;
            $i2++;
        } else {
            $simbol[] = $arrAid[$i2];
            $simbol[] = "-";
            $arrAidDiff[] = $simbol;
            $i2++;
        }
    }

    // Map the IDs back to text
    $arrAdiff = array();
    foreach($arrAidDiff as $v) {
        $arrAdiff[] = array(
            $unickTable[$v[0]],
            $v[1],
        );
    }
    $arrBdiff = array();
    foreach($arrBidDiff as $v) {
        $arrBdiff[] = array(
            $unickTable[$v[0]],
            $v[1],
        );
    }

    // If, at the same position, an item is "deleted" in text A and "added" in text B, relabel it as "changed"
    $max = max(count($arrAdiff), count($arrBdiff));
    for ($i1 = 0, $i2 = 0; $i1 < $max && $i2 < $max;) {
        if (!isset($arrAdiff[$i1]) || !isset($arrBdiff[$i2])) {
            // no action
        } elseif ($arrAdiff[$i1][1] == "-" && $arrBdiff[$i2][1] == "+" && $arrBdiff[$i2][0] != "") {
            $arrAdiff[$i1][1] = "*";
            $arrBdiff[$i2][1] = "m";
        } elseif ($arrAdiff[$i1][1] != "-" && $arrBdiff[$i2][1] == "+") {
            $i2++;
        } elseif ($arrAdiff[$i1][1] == "-" && $arrBdiff[$i2][1] != "+") {
            $i1++;
        }
        $i1++;
        $i2++;
    }

    // Wrap the changes in tags for later styling
    $textA = array();
    foreach($arrAdiff as $v) {
        if ('+' == $v[1]) {
            $textA[] = '<span class="added">' . $v[0] . '</span>';
        } elseif ('-' == $v[1]) {
            $textA[] = '<span class="deleted">' . $v[0] . '</span>';
        } elseif ('m' == $v[1]) {
            $textA[] = '<span class="changed">' . $v[0] . '</span>';
        } else {
            $textA[] = $v[0];
        }
    }
    $textA = implode($delimeter, $textA);
    $textB = array();
    foreach($arrBdiff as $v) {
        if ('+' == $v[1]) {
            $textB[] = '<span class="added">' . $v[0] . '</span>';
        } elseif ('-' == $v[1]) {
            $textB[] = '<span class="deleted">' . $v[0] . '</span>';
        } elseif ('m' == $v[1]) {
            $textB[] = '<span class="changed">' . $v[0] . '</span>';
        } else {
            $textB[] = $v[0];
        }
    }
    $textB = implode($delimeter, $textB);

    return array($textA, $textB);
}
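
For readers who only want the core idea, here is a minimal sketch of the same longest-common-subsequence approach, reduced to a list of line operations instead of HTML. The name lineDiff() and the ' ', '-', '+' markers are my own assumptions for illustration, they are not part of the function above.

```php
<?php
/**
 * Sketch: line-level diff built on a longest-common-subsequence (LCS) table.
 * lineDiff() is a hypothetical helper name chosen for this example.
 *
 * @return array list of [op, line] pairs, where op is
 *               ' ' (unchanged), '-' (only in $a) or '+' (only in $b)
 */
function lineDiff(string $a, string $b): array
{
    $x = explode("\n", str_replace("\r", '', $a));
    $y = explode("\n", str_replace("\r", '', $b));
    $n = count($x);
    $m = count($y);

    // $len[$i][$j] = LCS length of the tails $x[$i..] and $y[$j..]
    $len = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    for ($i = $n - 1; $i >= 0; $i--) {
        for ($j = $m - 1; $j >= 0; $j--) {
            $len[$i][$j] = ($x[$i] === $y[$j])
                ? $len[$i + 1][$j + 1] + 1
                : max($len[$i + 1][$j], $len[$i][$j + 1]);
        }
    }

    // Walk the table from the top-left corner and emit operations.
    $ops = [];
    $i = $j = 0;
    while ($i < $n && $j < $m) {
        if ($x[$i] === $y[$j]) {
            $ops[] = [' ', $x[$i]];
            $i++;
            $j++;
        } elseif ($len[$i + 1][$j] >= $len[$i][$j + 1]) {
            $ops[] = ['-', $x[$i++]];
        } else {
            $ops[] = ['+', $y[$j++]];
        }
    }
    // Whatever is left in either text is a pure deletion or addition.
    while ($i < $n) { $ops[] = ['-', $x[$i++]]; }
    while ($j < $m) { $ops[] = ['+', $y[$j++]]; }

    return $ops;
}

foreach (lineDiff("один\nдва\nтри", "один\nтри\nчетыре") as [$op, $line]) {
    echo $op, ' ', $line, "\n";
}
```

Splitting on "\n" is safe for UTF-8 input because that byte never occurs inside a multibyte character. Keep in mind that the table needs memory proportional to (lines in A) x (lines in B), so for really large texts this approach can get expensive.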

Stepan, 2014-05-21
@L3n1n

int similar_text(string str_first, string str_second [, double percent])
This function measures how similar two strings are.
similar_text() uses the algorithm described by Oliver. It returns the number of matching characters in str_first and str_second. The optional third parameter is passed by reference and receives the similarity as a percentage.
www.softtime.ru/bookphp/gl3_11.php
If you need the actual differences between lines, use the command-line diff utility.
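
A quick sketch of how the call looks in practice. The sample strings are arbitrary, and $matched and $percent are just names picked for this example.

```php
<?php
// similar_text() returns the number of matching bytes and, when a third
// argument is given, writes the similarity percentage into it by reference.
$matched = similar_text('World', 'Word', $percent);

printf("matched: %d, similarity: %.2f%%\n", $matched, $percent);
```

Note that similar_text() works on bytes, so a multibyte UTF-8 character counts as several units, and the result only says how similar the strings are, not where they differ.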

Alexey Kuleshov, 2014-05-22
@GingerbreadMSK

Applying Fuzzy Search Algorithms in PHP

Vlad Zhivotnev, 2014-05-22
@inkvizitor68sl

ftp.gnu.org/gnu/diffutils: the GNU diffutils sources (written in C).
