A
A
Alexey2015-02-06 20:49:24
PHP
Alexey, 2015-02-06 20:49:24

How to correctly compare arrays and evaluate their similarity?

So, let's say we have numbers arranged in a circle:
08da16694f0b4fa0841218dd8ca2061b.png
If we translate this into an array, it will turn out
$numbers = [1,2,3,4,5,7,2,8];
But if we start counting from a different element,
6c2e7d4bff1c41ea8f936f4ff82b67bf.png
then we get such an array
$numbers = [2,3,4,5,7,2,8,1];
In fact, the circles are the same, but the resulting arrays are different.
Question 1: How to compare them correctly?
Let's say these circles are slightly different, by a couple of values
a13bbf0ccb6f401fb1233e2d3165dea1.png
. In this case, circles with numbers are similar to 6/8 = 75%
Question 2: How to determine the percentage of their similarity?
Unfortunately, their brains are not quite enough. I ask not for ready-made code, but at least algorithms

Answer the question

In order to leave comments, you need to log in

8 answer(s)
N
nowm, 2015-02-06
@TsSaltan

If two arrays are the same length, you can simply move around the first array and compare its elements with the elements of the second. Then you can simply select the maximum match and overtake in percentages. More or less like this:

$arr1 = [1,2,3,4,5,7,2,8];
$arr2 = [2,9,5,5,7,2,8,1];

$len = count($arr1);
$conformity = [];

for($i = 0; $i < $len; $i++) {
  /**
   * $temp содержит нули в позициях, где числа в двух массивах 
   * по одному и тому же индексу не равны. Единицы — там, где равны.
   */
  $temp = array_map(function($x,$y){return intval($x==$y);}, $arr1, $arr2);
  
  // Элементы полученного массива суммируются и добавляются в отчётный массив
  $conformity[] = array_sum($temp);
  
  // Массив прокручивается на одну позицию
  $arr1[] = array_shift($arr1);
}

//С помощью max($conformity) выбирается максимальное совпадение элементов
echo sprintf("Max conformity is %s%%\n", number_format(100*(max($conformity)/$len), 2));

This is specifically for the situation when the length of the "rings" is the same.
Update: one more option:
$arr1 = [1,2,3,4,5,7,2,8];
$arr2 = [2,9,5,5,7,2,8,1];

function conformity($arr1, $arr2) {
  $len = count($arr1);
  $max = $curr = 0;
  
  for($i = 0; $i < $len; $i++) {
    array_map(function($x,$y)use(&$curr){$curr += intval($x==$y);}, $arr1, $arr2);
    
    if($curr == $len) {
      return 100;
    }

    $max = $max > $curr ? $max : $curr;
    $curr = 0;
    
    $arr1[] = array_shift($arr1);
  }
  
  return 100*($max/$len);
};

echo sprintf("Max conformity is %s%%\n", number_format(conformity($arr1, $arr2), 2));

I
Igor Ermolaev, 2015-02-06
@ErmIg

Essentially, rings of numbers are periodic functions. It is better to compare not the values ​​themselves, but their Fourier spectra. If we discard the phase of the complex Fourier spectrum, then the spectra of such rings will be similar, even if they are counted from different positions.

A
Armenian Radio, 2015-02-06
@gbg

There are an unlimited number of ways to compare arrays.
As a rule, before comparison, certain requirements are first put forward (equivalence criteria), and then a comparison is invented on these criteria.

A
Alexander, 2015-02-06
Madzhugin @Suntechnic

It is necessary to formulate what similarity is and it will immediately become clear how to compare;)

N
Nikita Zvonarev, 2015-02-07
@nikitazvonarev

In fact, I've thought about it, and came up with a better Fourier.
For example, if you say that arrays are the same up to a rotation, then you can and should compare them for equivalence (I'm not talking about percentage similarity, it's more difficult here), then you can interpret them as strings, and compose such, for example:
S + " $" + T + T,
where S is one array, and after the separator, the right array is written twice in a row. Then it is enough to start calculating the prefix function using the Knuth-Morris-Pratt algorithm in linear time. If you are interested in similar pieces, then you need to dig in the direction of suffix arrays and trees, if you want linear time

S
Sumor, 2015-02-06
@Sumor

There is no correct way to compare two arrays with two or more elements.
Valid comparison methods depend on your subject area, on where these very circles with numbers come from. What is the capacity of the arrays? Is she the same? Is it a set of numbers (a set) or does their order on the circle matter? Are the numbers represented by a quantitative scale (mathematical calculations can be done) or are they qualitative values ​​(mathematical calculations are impossible or do not make logical sense)?
As a measure of similarity, you can choose, for example:
1. Number of different elements: [1,2,3,4,5,7,2,8] [2,9,5,5,7,2,8,1] - the measure is 2
2. The sum of the modules of the difference of elements: [1,2,3,4, 5 ,7, 2 ,8] [1,2,3,4, 6,7, 4 ,8] - the measure is 3
3. Analogue of the Levenshtein distance: [1,2,3,4,5,7,2,8] [2,3,4,5,7,2,8,1 ] - measure is 2

F
FanatPHP, 2015-02-06
@FanatPHP

i think the diff algorithm should be fine

M
Mrrl, 2015-02-06
@Mrl

There is a solution that works in the worst case in O(N*sqrt(N*log(N))), and in the typical case in O(N*log(N)).
Let our arrays be A and B.
Let's create an array Q of 2*N triples containing (array element, array index, whether it's A or B).
Sort by the "array element" field.
We get an array C of length N, in which we will consider C[k]=number of matches when shifted by k positions.
We look at the sorted array Q. For each value of X, it is immediately clear how many times and at what places X occurred in arrays A and B. Let it occur p times in A (at positions a1,a2,...,ap) and q times - in B (in positions b1,b2,...,bq). If p*q < N*log(N), then we modify C in p*q operations, increasing all C[(bj-ai) mod N] by 1.
Otherwise, we build arrays of 0 and 1 containing the masks of the occurrence of X in A and B, and calculate their convolution using the fast Fourier transform. We add it to C.
The worst case for this algorithm is when the arrays contain approximately sqrt(N/log(N)) different values ​​that occur approximately the same number of times.

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question