How to quickly compare many arrays?

vvovas2012-09-11 05:20:05

Search Engine Optimization

vvovas, 2012-09-11 05:20:05

There are two lists of arrays: the first is 10 million arrays, the second is 100 arrays. All arrays are the same length and sorted.
Task:
For each array from the first list, collect statistics of matches with each array from the second list.

Those. we take the first array from the 1st list, compare it with each array from the 2nd, and put the results (the number of matched elements) into the dictionary. In a dictionary, the key is the result, and the value is the number of times that result occurred. Such a dictionary must be built for each array from the 1st list

Example:
List 1: {1,2,3}
List 2: {1,2,3}; {1,2,5}; {1,3,4}; {4,5,6}

Compare the array from the 1st list with each of the 2nd:
{1,2,3} and {1,2,3} — 3 matched elements.
{1,2,3} and {1,2,5} are 2 matched elements.
{1,2,3} and {1,3,4} are 2 matched elements.
{1,2,3} and {4,5,6} - 0 matched elements.

Result:
{0,1}; //0 matches - 1 times
{1,0}; //1 match - 0 times
{2,2}; //2 matches - 2 times { 3,1
}} //3 matches - 1 time

I tried to represent the second list as a tree, but it is slower (probably because the tree was built on dictionaries).
When it was necessary to simply check whether such an array exists in the list, the tree helped a lot, but here it is necessary to count the number of matches.

Answer the question

In order to leave comments, you need to log in

8 answer(s)

vassabi, 2012-09-11
@vassabi

judging by the example, you don’t have arrays, but sets, which can be represented as bit arrays (for 1-50, 64 bits is enough)
thus. array elements is 1 in the corresponding position ({1,2,5} ==> 0b0001 0110) the number of matched elements, this is the number of ones after the & operation.
(calculation of the number of units for acceleration is taken from the tables)

Vampiro, 2012-09-11
@Vampiro

I didn’t master the example, but as a rule, to compare such data, I convert arrays to strings - work with string primitives is faster.
For example, searching for the substring "_2_" in the strings "1_2_3", "1_2_5", etc. is a very fast operation. If the dimensions of the arrays are the same, then you can combine the arrays into text, knowing the position of the substring, you can optionally determine which array such a string belongs to. Think along those lines, I was honestly trying to light up your problem, but we have extremely different ways of thinking :)

leventov, 2012-09-11
@leventov

Try to precalculate 50 arrays with array numbers out of 100 that contain the index. Then the calculation of statistics for 1 array out of 10 million will take approximately (100*10/50)*10=200 operations instead of 30*100=3000 when comparing arrays head-on.

avalak, 2012-09-11
@avalak

I think MapReduce would be useful here.

Wott, 2012-09-12
@Wott

you need to find a single-valued transformation to turn the array into a number and compare already numbers
about sets - the comment above
if there are repeating elements, then you can sort them and convert them to a number using the elements as digits in the positional calculus according to the base, which is the maximum of elements or more.
something like this…

runnig, 2012-09-23
@runnig

If you sort the arrays and "compress" in a primitive way to avoid multiple passes through the same elements: [1,1,1,1 0,0, 2, 2, 2, 2, 3] -> [(0 ,2), (1,4),(2,2), (3,1)]
after that we set pointers to the beginning of the reference array from the first group and the one to be checked from the second and check the number of matches, taking into account the second number.
// arr1, arr2 = проверяемые отсортированные сжатые массивы, // arr[нечётный_индекс] = эл-т массива, // arr[чётный_индекс] = кол-во вхождений этого значения int i, j; int matches = 0; for(i = 0, j = 0; i < arr1.size() && j < arr2.size(); ) { if(arr1[i] == arr2[j]) { matches+= min(arr1[i+1], arr2[j+1]; i+=2; j+=2; } else if(arr1[i] < arr2[j]) { i+=2; } else { arr2[j] < arr1[i] ) { j+= 2; } }
The complexity of checking two compressed sorted arrays is O(m+n), m and n of their length.
Some arrays are likely to match at all. In this case, it is better to calculate the hash of their elements before comparing, and first compare the lengths and hashes of the arrays before starting the comparison. If the hashes match, then you can skip the check and add the length of the uncompressed array.

runnig, 2012-09-23
@runnig

Even if there are only 50 values in the array, then 50 elements can be easily placed in 64-bit variables (uint64_t in C ++, probably in C # too). If it is important to count the number of identical values in two arrays, and not the number of occurrences of a value, then we simply take a bitwise AND operation and count the set bits:
pastebin.com/3rV0CEnQ

Pavel, 2012-10-10
@PavelMSTU

As I understand it, you have two lists - one is VERY large, the second is small (100 in total). Sort the array of 100 arrays itself so that each next tuple is greater than the previous tuple.
( tuple {x1,x2,x3, ..} is greater than {y1,y2,y3, ...} if x1>y1 or x1=y1 but x2>y2 or x1=y1,x2=y2 but x3> y3 ...)
Take the first element (i.e. an array) from an array of arrays of 10 million arrays. Compare the LAST element of the selected element and compare with the FIRST elements of the sorted array of arrays (out of 100). Then take the second from the end and compare with the second elements of the tuples. And so on for all the elements of the selected first array of 10 million. Then take the second array element (of 10 million), and so on.
The rule on which you save is the following:
If any element (mi) for the nth array out of 10M is less than element i in tuple j, so we can no longer consider tuple j and all tuples greater than tuple j to compare the nth array.
This will save money.