X
X
xgyrfalconx2021-08-20 17:20:27
Python
xgyrfalconx, 2021-08-20 17:20:27

How to quickly compare huge lists of dictionaries?

Let's say we have two lists of dictionaries:
a = [{'column1': 'Russia','column2':'Moscow', 'column3': '02/01/2019'}, {'column1': 'Russia','column2 ':'Murmansk', 'column3': '02/01/2018'}...]
b = [{'column1': 'Russia','column2':'Moscow', 'column3': '02/01/2019'} , {'column1': 'Russia','column2':'Murmansk', 'column3': '02/01/2018'}...]

The number of dictionaries in the lists is about 190000.
The keys of the dictionaries are the same in both lists and remain unchanged , but the values ​​can change in both dictionary a and dictionary b. We need to find the differences between dictionary a and b and vice versa. I use this comparison:

no_in_b = []
        no_in_a = []
        for i in a:
            if i not in b:
                no_in_b.append(i)
            else:
                pass

        for i in b:
            if i not in a:
                no_in_a.append(i)
            else:
                pass

But doing this kind of comparison takes about an hour, is there any way to compare lists of dictionaries faster?

Answer the question

In order to leave comments, you need to log in

3 answer(s)
S
ScriptKiddo, 2021-08-20
@xgyrfalconx

from collections import defaultdict

a = [{'a': '1', 'b': '2'},
     {'a': '1', 'b': '2'},
     {'a': '5', 'b': '7'}]
b = [{'a': '3', 'b': '4'},
     {'a': '1', 'b': '2'}]


def make_hash_table(list_of_dicts):
    hashes = defaultdict(list)

    for i, item in enumerate(list_of_dicts):
        calculated_hash = hash(frozenset(item.items()))
        hashes[calculated_hash].append(i)

    return hashes


# Считаем хеши
a_hashes = make_hash_table(a)

b_hashes = make_hash_table(b)

# Определяем через вычитание множеств каких хешей у нас нет

not_in_a = [b[b_hashes[hash_value][0]] for hash_value in b_hashes.keys() - a_hashes.keys()]

not_in_b = [a[a_hashes[hash_value][0]] for hash_value in a_hashes.keys() - b_hashes.keys()]

print('Not in a \n')
print(not_in_a)
print()
print('Not in b \n')
print(not_in_b)

OUT
Not in a 

[{'a': '3', 'b': '4'}]

Not in b 

[{'a': '5', 'b': '7'}]

A
Alexandroppolus, 2021-08-20
@Alexandroppolus

Sort both arrays, comparing by column1, column2, column3
(the comparison function first compares rows from column1, if they are equal - column2, etc.)
then it will be possible to make a comparison in linear time, bypassing both arrays in parallel.

M
Mikhail Krostelev, 2021-08-21
@twistfire92

I would advise you to use Pandas for such things.
Create 2 DataFrames, and then use merge
Details of all actions can be found in Google.

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question