How to efficiently group rows?

M

MaxLich2018-04-17 22:01:12

Java

MaxLich, 2018-04-17 22:01:12

Hello. It is necessary to solve such a problem, and it does not work.
The input is a set of strings of the form:
A;B;C
X;Y;Z
J;A;Z
U;V;W
E;E;E
D;F;G
If two strings have common elements in the same position (one or more ), then they are considered to belong to the same group. If two groups intersect, then it is considered that this is one group. That is, the following lines will form one group:
F;I;J
F;X;A
D;X;P
The first and second lines have a common element - "F", the second and third - "X". And they all belong to the same group.
You need to find all such groups, count their number, and print these groups in descending order of their size (the size of a group is the number of rows in it). I've been fighting for almost two days now, and I can't solve this problem. I came up with one solution, but it does not work when there are about 1,000,000 rows: the calculation takes too long. Well, accordingly, with a large amount of data, the question arises about the economical use of memory.
At first I had the following algorithm:
1. I create a list of lists of strings: List> , it will store groups
2. I throw the first line into it (respectively, I create the first group).
3. I sort through the lines from the original list:
3.1. I take the next line, I delete it from the list
3.2 I receive the iterator for search of groups
3.3. I'm going through the groups
3.3.1 I take the next group
3.3.2 I go through the lines in it
3.3.2.1 For each line from the group, I check if there are matching elements with the elements of the line from the original list:
a) if so, then I check if the line has already been added to which group:
1) if it was added, then I take the current group and completely add it to the one where the line was added, after that I delete the current group
2) if it was not added, then add a line to the current group and remember this group
, then exit the iteration cycle group rows
b) if not, then go to the next iteration
3.4 After leaving the group cycle, I check if there is data about the group in which a line is found that has common elements with the current line, if there is no information about such a group, then I create a new group and add the current line there.

Reply

Answer the question

In order to leave comments, you need to log in

3 answer(s)

M

MaxLich, 2018-04-22
@MaxLich

Found one solution. Algorithm:

store the result as a list of lists: [group_number -> [group_strings]]
use an auxiliary list of hash tables: [word_position -> { word -> group_number }] and an auxiliary hash table to store which group was merged into which

we look for each word of the line in the corresponding (position of the word in the line) hash table
a) if the word exists, remember the number of the group (value from the hash table) in which it is found
b) if the word does not exist, then add it to the list of new words
Group search method code:

code

private static List<List<String>> findLineGroups(List<String> lines) {
        class NewLineElement {
            private String lineElement;
            private int columnNum;

            private NewLineElement(String lineElement, int columnNum) {
                this.lineElement = lineElement;
                this.columnNum = columnNum;
            }
        }

        if (lines == null)
            return Collections.emptyList();

        List<List<String>> linesGroups = new ArrayList<>(); //список групп, каждый элемент вида "номер группы - список строк группы"
        if (lines.size() < 2) {
            linesGroups.add(lines);
            return linesGroups;
        }

        List<Map<String, Integer>> columns = new ArrayList<>(); // список стобцов, каждый столбец - мапа с парами "элемент строки/столбца-номер группы"
        Map<Integer, Integer> unitedGroups = new HashMap<>(); //мэп с парами "номер некоторой группы - номер группы, с которой надо объединить данную"
        for (String line : lines) {
            String[] lineElements = line.split(";");
            TreeSet<Integer> groupsWithSameElems = new TreeSet<>(); //список групп, имеющих совпадающие элементы
            List<NewLineElement> newElements = new ArrayList<>(); //список элементов, которых нет в мапах столбцов

            for (int elmIndex = 0; elmIndex < lineElements.length; elmIndex++) {
                String currLnElem = lineElements[elmIndex];
                if (columns.size() == elmIndex)
                    columns.add(new HashMap<>());
                if ("".equals(currLnElem.replaceAll("\"","").trim()))
                    continue;

                Map<String, Integer> currCol = columns.get(elmIndex);
                Integer elemGrNum = currCol.get(currLnElem);
                if (elemGrNum != null) {
                    while (unitedGroups.containsKey(elemGrNum)) // если группа с таким номером объединена с другой,
                        elemGrNum = unitedGroups.get(elemGrNum); //то сохраняем номер группы, с которой была объединена данная
                    groupsWithSameElems.add(elemGrNum);
                } else {
                    newElements.add(new NewLineElement(currLnElem, elmIndex));
                }
            }
            int groupNumber;
            if (groupsWithSameElems.isEmpty()) {
                linesGroups.add(new ArrayList<>());
                groupNumber = linesGroups.size() - 1;
            } else {
                groupNumber = groupsWithSameElems.first();
            }
            for (NewLineElement newLineElement : newElements) {
                columns.get(newLineElement.columnNum).put(newLineElement.lineElement, groupNumber);
            }
            for (int matchedGrNum : groupsWithSameElems) { //перебираем все группы с таким же элементом
                if (matchedGrNum != groupNumber) {
                    unitedGroups.put(matchedGrNum, groupNumber); //сохраняем инф-цию об объединённых группах
                    linesGroups.get(groupNumber).addAll(linesGroups.get(matchedGrNum)); //объединяем группы
                    linesGroups.set(matchedGrNum, null); //помечаем группу с текущим номер, как несуществующую
                }
            }
            linesGroups.get(groupNumber).add(line);
        }
        linesGroups.removeAll(Collections.singleton(null)); //удаляем несуществующие группы
        return linesGroups;
    }

A

Alexey Cheremisin, 2018-04-17
@leahch

Oddly enough, you don't have to iterate groups at all!
1) your group consists of one element. In your example, F and X are the two groups in which to put the line numbers.
2) in one pass, we run through the lines and add them to the corresponding groups of terms, which we keep in the hashtable, where the key is the term itself, and the value is an array of line numbers.
3) after we have filled in the hash, we run through it once and see who has the array length greater than one, these will be the original groups.
If we need to additionally form groups of two or three terms, then it does everything the same, but we set the treeset of these elements as the key.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map.Entry;
import java.util.TreeSet;

public class Groups {

  public static void main(String[] args) {
    String[] myData = {
        "F;I;J", 
        "F;X;A",
        "X;D;P",

        "A;B;C",
        "X;Y;Z",
        "J;A;Z",
        "U;V;W",
        "E;E;E",
        "D;F;G",
    };
    
    HashMap<String, TreeSet<Integer>> groups = new HashMap<String, TreeSet<Integer>>();
    
    for(int line=0; line< myData.length; line++ ) { // бежим по строкам
      
      List<String> terms = Arrays.asList(myData[line].split(";")); // разбиваем на термы
      
      for(String term: terms) { // пробегаем по термам
        TreeSet<Integer> group = groups.get(term); // выдергиваем группу
        
        if(group == null) { // если группы нет
          group = new TreeSet<Integer>();
          groups.put(term, group);
        }
        group.add(line); // добваляем строку
      }
    }
    
    // выводим результат
    for(Entry<String, TreeSet<Integer>> group: groups.entrySet()) {
      if(group.getValue().size() >1)
        System.out.printf("%s - %s\n", group.getKey().toString(), group.getValue().toString());
    }
  }
}

And the result

A - [1, 3, 5]
D - [2, 8]
F - [0, 1, 8]
J - [0, 5]
X - [1, 2, 4]
Z - [4, 5]

D

Dimonchik, 2018-04-17
@dimonchik2013

bloom filter