T
T
TomaZe2017-03-30 00:15:35
Java
TomaZe, 2017-03-30 00:15:35

How to effectively split a set of string values ​​into non-overlapping groups?

Rows look like this:
A1;B1;C1
A2;B2;C2
How to find a set of unique rows and divide it into non-overlapping groups according to the following criterion: if two rows have matching non-empty values ​​in one or more columns, they belong to the same group
For example, rows
1,2,3
4,5,6
1,5,7
Belong to the same group.
Initially, I thought to do it through a set of hashsets (three hashsets for each column) to quickly see if a string is included in the list of unique values, followed by adding either to the list of already grouped strings or to the list of unique strings. But the algorithm in this case has a performance bottleneck: if it is necessary to merge groups, it is necessary to go through each group in the list. The algorithm works slowly on a large amount of data (>1 million records), with a large number of merges. If there are few merges (of the order of thousands), it works quickly. Caught a gag in this place and do not know how to optimize this bottleneck or whether it is necessary to use other data structures and algorithms.
Maybe someone will tell you in which direction to dig.

Answer the question

In order to leave comments, you need to log in

2 answer(s)
K
Konstantin Stepanov, 2017-03-30
@koronabora

The only option I see is to store multiple unique values ​​for each column. If the string has a match in one of the columns, then add to this set and update the unique values ​​of the columns.
IMHO, you can't do it faster. Only if you do not optimize the search for unique column values ​​by converting it to binary, for example.

M
MaxLich, 2018-04-22
@MaxLich

Found one solution. Algorithm:

  1. store the result as a list of lists: [group_number -> [group_strings]]
  2. use an auxiliary list of hash tables: [word_position -> { word -> group_number }] and an auxiliary hash table to store which group was merged into which
  3. we look for each word of the line in the corresponding (position of the word in the line) hash table
    a) if the word exists, remember the number of the group (value from the hash table) in which it is found
    b) if the word does not exist, then add it to the list of new words
    Group search method code:
    code
    private static List<List<String>> findLineGroups(List<String> lines) {
            class NewLineElement {
                private String lineElement;
                private int columnNum;
    
                private NewLineElement(String lineElement, int columnNum) {
                    this.lineElement = lineElement;
                    this.columnNum = columnNum;
                }
            }
    
            if (lines == null)
                return Collections.emptyList();
    
            List<List<String>> linesGroups = new ArrayList<>(); //список групп, каждый элемент вида "номер группы - список строк группы"
            if (lines.size() < 2) {
                linesGroups.add(lines);
                return linesGroups;
            }
    
            List<Map<String, Integer>> columns = new ArrayList<>(); // список стобцов, каждый столбец - мапа с парами "элемент строки/столбца-номер группы"
            Map<Integer, Integer> unitedGroups = new HashMap<>(); //мэп с парами "номер некоторой группы - номер группы, с которой надо объединить данную"
            for (String line : lines) {
                String[] lineElements = line.split(";");
                TreeSet<Integer> groupsWithSameElems = new TreeSet<>(); //список групп, имеющих совпадающие элементы
                List<NewLineElement> newElements = new ArrayList<>(); //список элементов, которых нет в мапах столбцов
    
                for (int elmIndex = 0; elmIndex < lineElements.length; elmIndex++) {
                    String currLnElem = lineElements[elmIndex];
                    if (columns.size() == elmIndex)
                        columns.add(new HashMap<>());
                    if ("".equals(currLnElem.replaceAll("\"","").trim()))
                        continue;
    
                    Map<String, Integer> currCol = columns.get(elmIndex);
                    Integer elemGrNum = currCol.get(currLnElem);
                    if (elemGrNum != null) {
                        while (unitedGroups.containsKey(elemGrNum)) // если группа с таким номером объединена с другой,
                            elemGrNum = unitedGroups.get(elemGrNum); //то сохраняем номер группы, с которой была объединена данная
                        groupsWithSameElems.add(elemGrNum);
                    } else {
                        newElements.add(new NewLineElement(currLnElem, elmIndex));
                    }
                }
                int groupNumber;
                if (groupsWithSameElems.isEmpty()) {
                    linesGroups.add(new ArrayList<>());
                    groupNumber = linesGroups.size() - 1;
                } else {
                    groupNumber = groupsWithSameElems.first();
                }
                for (NewLineElement newLineElement : newElements) {
                    columns.get(newLineElement.columnNum).put(newLineElement.lineElement, groupNumber);
                }
                for (int matchedGrNum : groupsWithSameElems) { //перебираем все группы с таким же элементом
                    if (matchedGrNum != groupNumber) {
                        unitedGroups.put(matchedGrNum, groupNumber); //сохраняем инф-цию об объединённых группах
                        linesGroups.get(groupNumber).addAll(linesGroups.get(matchedGrNum)); //объединяем группы
                        linesGroups.set(matchedGrNum, null); //помечаем группу с текущим номер, как несуществующую
                    }
                }
                linesGroups.get(groupNumber).add(line);
            }
            linesGroups.removeAll(Collections.singleton(null)); //удаляем несуществующие группы
            return linesGroups;
        }

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question