Answer the question
In order to leave comments, you need to log in
How to quickly compare huge lists of dictionaries?
Let's say we have two lists of dictionaries:
a = [{'column1': 'Russia','column2':'Moscow', 'column3': '02/01/2019'}, {'column1': 'Russia','column2 ':'Murmansk', 'column3': '02/01/2018'}...]
b = [{'column1': 'Russia','column2':'Moscow', 'column3': '02/01/2019'} , {'column1': 'Russia','column2':'Murmansk', 'column3': '02/01/2018'}...]
The number of dictionaries in the lists is about 190000.
The keys of the dictionaries are the same in both lists and remain unchanged , but the values can change in both dictionary a and dictionary b. We need to find the differences between dictionary a and b and vice versa. I use this comparison:
no_in_b = []
no_in_a = []
for i in a:
if i not in b:
no_in_b.append(i)
else:
pass
for i in b:
if i not in a:
no_in_a.append(i)
else:
pass
Answer the question
In order to leave comments, you need to log in
from collections import defaultdict
a = [{'a': '1', 'b': '2'},
{'a': '1', 'b': '2'},
{'a': '5', 'b': '7'}]
b = [{'a': '3', 'b': '4'},
{'a': '1', 'b': '2'}]
def make_hash_table(list_of_dicts):
hashes = defaultdict(list)
for i, item in enumerate(list_of_dicts):
calculated_hash = hash(frozenset(item.items()))
hashes[calculated_hash].append(i)
return hashes
# Считаем хеши
a_hashes = make_hash_table(a)
b_hashes = make_hash_table(b)
# Определяем через вычитание множеств каких хешей у нас нет
not_in_a = [b[b_hashes[hash_value][0]] for hash_value in b_hashes.keys() - a_hashes.keys()]
not_in_b = [a[a_hashes[hash_value][0]] for hash_value in a_hashes.keys() - b_hashes.keys()]
print('Not in a \n')
print(not_in_a)
print()
print('Not in b \n')
print(not_in_b)
Not in a
[{'a': '3', 'b': '4'}]
Not in b
[{'a': '5', 'b': '7'}]
Sort both arrays, comparing by column1, column2, column3
(the comparison function first compares rows from column1, if they are equal - column2, etc.)
then it will be possible to make a comparison in linear time, bypassing both arrays in parallel.
I would advise you to use Pandas for such things.
Create 2 DataFrames, and then use merge
Details of all actions can be found in Google.
Didn't find what you were looking for?
Ask your questionAsk a Question
731 491 924 answers to any question