Finding two intersecting arrays among k sorted ones

D

Dmitry2012-06-05 13:33:24

.NET

Dmitry, 2012-06-05 13:33:24

Given:
k sorted arrays of integers.
Need to find:
Intersecting arrays, with the number of common elements greater than minSupport;

Example:

Input:
1,3,7,8
2,3,8,10
3,10,11,12,13,14

minSupport = 1

Output:

1 and 2: 2, 8
1 and 3: 3
2 and 3:3, 10

Here is an implementation of the straight forward approach:

    var minSupport = 2;
    var random = new Random(123);

    var sortedArrays = Enumerable.Range(0,100)
    .Select(x => Enumerable.Range(0,30).Select(t => random.Next(1000)).Distinct()
    .ToList()).ToList();
    var result = new List<int[]>();
    var resultIntersection = new List<List<int>>();

    foreach (var array in sortedArrays)
    {
        array.Sort();
    }

    var sw = Stopwatch.StartNew();

    //****MAIN PART*****//

    for (int i = 0; i < sortedArrays.Count-1; i++)
    {
        for (int j = i+1; j < sortedArrays.Count; j++)
        {
            var intersect = sortedArrays[i].Intersect(sortedArrays[j]).ToList();
            if(intersect.Count()>=minSupport)
            {
                result.Add( new []{i,j});
                resultIntersection.Add(intersect);
            }
        }
    }

    //*****************//

    sw.Stop();

    Console.WriteLine(sw.Elapsed);

It works incredibly slowly

Here is an example of an approach that works a little faster (2 times)

var maxValue = 1000;
    
    var reverseIndexDict = new List<int>[maxValue];
    
    for (int i = 0; i < maxValue; i++)
    {
        reverseIndexDict[i] = new List<int>();
    }
    
    for (int i = 0; i < sortedArrays.Count; i++)
    {
        for (int j = 0; j < sortedArrays[i].Count; j++)
        {
            reverseIndexDict[sortedArrays[i][j]].Add(i);
        }
    }
    
    
    
    for (int i = 0; i < sortedArrays.Count; i++)
    {
        var tempArr = new Dictionary<int, List<int>>();
        
        for (int j = 0; j < sortedArrays[i].Count; j++)
        {
            var sortedArraysij = sortedArrays[i][j];
            
            
            for (int k = 0; k < reverseIndexDict[sortedArraysij].Count; k++)
            {
                if(!tempArr.ContainsKey(reverseIndexDict[sortedArraysij][k]))
                {
                    tempArr[reverseIndexDict[sortedArraysij][k]] = new[]{sortedArraysij}.ToList();
                }
                else
                {
                   tempArr[reverseIndexDict[sortedArraysij][k]].Add(sortedArrays[i][j]);
                }
                
            }
        }
        
        
        for (int j = 0; j < reverseIndexDict.Length; j++)
        {
            if(reverseIndexDict[j].Count>=minSupport)
            {
                result.Add(new[]{i,j});
                resultIntersection.Add(reverseIndexDict[j]);
            }
        }
        
    }

       // Далее фильтруем рез-ты

How can you do it with maximum speed? What keywords to google for?

EDIT:

Implemented the variation suggested by nickme. The speed is not pleasing, it increases 2 times for every 1000 elements in the range from 1 to 6 and works slower than my second option (I haven’t tested it on very large values). Here is the code, maybe I messed something up somewhere.

public void Process(List<int[]> arrays)
{
    var arrayInfoCollection = arrays.Select( (x,i) => new ArrayInfo(i,x)).ToList();
    var minValue = arrayInfoCollection.First().Array[0];
    var minIndex = arrayInfoCollection.First().Id;
    var resultMatrix = new int[arrayInfoCollection.Count,arrayInfoCollection.Count];
    var iterators = (new int[arrayInfoCollection.Count]).ToList();
    
        
    while(iterators.Count!=0)
    {
    
    
    // Находим минимальный элемент
    minValue = arrayInfoCollection[0].Array[iterators[0]];
    for (int i = 0; i < arrayInfoCollection.Count; i++)
    {
        if(arrayInfoCollection[i].Array[iterators[i]]<=minValue)
        {
            minValue = arrayInfoCollection[i].Array[iterators[i]];
                 minIndex = i;            
        }
    }

    
    // Находим пересечения
    for (int i = 0; i < iterators.Count; i++)
    {
               
        if(arrayInfoCollection[i].Array[iterators[i]]==minValue)
        {
            resultMatrix[arrayInfoCollection[i].Id,arrayInfoCollection[minIndex].Id]++;
        }
    }
    
    
    // Двигаемся дальше
    if(arrayInfoCollection[minIndex].Array.Length-1==iterators[minIndex])
    {
       // Если мы уже уперлись в конец массива, удаляем его
       arrayInfoCollection.RemoveAt(minIndex);
       iterators.RemoveAt(minIndex);
    }
    else
    {
       iterators[minIndex]++;
    }
    
    
    }
    
    resultMatrix.Dump();
    
}


public class ArrayInfo
{
    public int Id { get; set; }
    public int[] Array { get; set; }
    
    public ArrayInfo(int id, int[] array)
    {
        Id = id;
        Array = array;
    }
}

Reply

Answer the question

In order to leave comments, you need to log in

4 answer(s)

N

nickme, 2012-06-05
@nickme

You can try a variant of merge (see merge sort), where you store by index from each array and increment (by one) each time the one that points to the minimum element (if there are several of them, then increment all of them). If more than one index has been updated, then increase by one the length of the common parts of the corresponding pairs of arrays (you need a matrix of intersection lengths). The advantage of this algorithm is that you go through each array exactly once...

M

MikhailEdoshin, 2012-06-06
@MikhailEdoshin

It is convenient to merge sorted arrays using a heap. Let n be the total number of elements and k be the number of arrays. We make a heap of k elements, put the first elements of the arrays into it. Then, at each step, we extract the minimum element from it (log k ) and add back the one that follows it in the array (log k ). The total complexity of this step is 2 n log k .

P

PuzzleW, 2012-06-06
@PuzzleW

and where does the information about the fact that Intersect works with complexity n+m?!

P

PuzzleW, 2012-06-06
@PuzzleW

I read msdn.microsoft.com/ru-ru/library/bb355408.aspx and
I quote:
When enumerating the object returned by this method, Intersect enumerates the first elements, collecting all the differing elements of this sequence into a collection. Then, the elements of the second sequence are enumerated, with the marking of the elements included in both sequences. Finally, the tagged items are returned in the order in which they were collected.
If I understand correctly, Intersec meant that your arrays are sorted.
it stupidly enumerates the first set (possibly spending a lot of time removing duplicates) then adds the second one to it (again, guaranteed to spend a lot of time checking if the same element is in the first set to "mark" that the element is present in both sets)
/ / I will be glad to be wrong in assessing the complexity of the algorithm implemented in Intersect
in your case (I've been thinking for the whole evening already) it seems to me very important to take into account the fact that your arrays are sorted.
those. you need to implement your own intersecting element search function that takes into account the nature of your input.
well, for example, even heuristics of the form: if A[A.length] < B[1] then exit, nothing shines for us here, they will save you a lot of time ... (though only in rare cases :) )
while I'm thinking about a general algorithm of this kind:
we take for A an array with A[1] < B[1] (we start indexing the array from 1, not from zero, although this is not important)
i=1;
j=1;
count=0;
look through A[] to find the element >= B[j] naturally increasing i
if A[i]==B[j] then j++; continue viewing A starting from the current i (well, yes, do not forget count++ :) )
if A[i] > B[j] then we are not lucky to find a match with B[j], you can swap arrays (in the sense that B[ j] now we have the “beginning” of trying to find the first matching element, and hence the intersecting sequence,
and don’t forget the end of the array, as well as edge cases like an array of 1 element, arrays that have no intersecting elements, etc.
and before trying to implement an algorithm based on my fantasies (honestly, Cormen and Knuth have been around for a very long time, I would recommend measuring the speed of intersect, for example, by feeding it two arrays of the same (just don’t think of passing the same array - this can be checked and the execution time will be = O(length()) or O(count())), as well as variants of two different arrays sorted in the same and in opposite directions. and also THESE arrays, but NOT SORTED. i.e. .generate random arrays, test intersect on them, then sort them and test intersect again, all within the same test run. If I'm right, then the execution time should practically not differ. If it differs, then my proposal probably makes no sense)
ps I'm trying to understand the meaning of the algorithm proposed by nickme without a leaf and a pen - there is more and more suspicion that I proposed the same idea, but in a simplified-perverted format.
pps It's very difficult for you to describe the condition - you need ALL possible arrays in which (simultaneously) the intersection power is >= given? Or are you satisfied with only pairs of arrays that satisfy the same condition? if the pairs are what, is there a need to display all or not, and so on. clarify the task, and it will be even better if you give an understanding of a slightly higher level - what will you do with the result then? it may be easier to implement an algorithm aimed at the final result than to calculate your current intermediate step.