Python
Andrew, 2019-10-20 23:09:10

How to remove duplicates by key in a large file?

There is a JSON-lines file with 8+ million lines, 700+ MB in size, in this format:


{'title':'7778', 'mes':'ruseo', 'coord': '755'}
{'title':'77789', 'mes':'ruseo', 'coord': '755'}
{'mes': 'seoru', 'title' : '7778', 'coord' : '-'}
{'mes': 'seoru', 'title' : '7778', 'coord' : '-'}

In about half of the lines title is the first key, in the other half it is in the middle. I need to remove duplicates by title so that only unique records remain.
Please suggest any practical way to do this.
Right now I'm working in Python.
1) json.loads plus a set (or a list) = MemoryError (the server has 64 GB of RAM); a rough sketch of this attempt is after the code below.
2) I open it as a text file, split each line, and check for uniqueness myself, but it is very slow: in 2 days only 1.5 million lines were processed.
In detail, the code for attempt 2:
with open(r'C:\json3toster.json', 'r', encoding="utf-8") as fp:
    ds = fp.readlines()

print(len(ds))
mem = []
for record in ds:
    name = record.replace('{', '').split(',')
    for dat in name:
        dat2 = dat.split(':')
        if dat2[0].strip() == "'title'":
            newline = dat2[1]
            # substring search over every record kept so far:
            # O(n) per line, O(n^2) overall - this is why it takes days
            if any(newline in lice for lice in mem):
                pass
            else:
                mem.append(record)
print(len(mem))
with open(r'd:/json_fin.json', 'a', encoding='utf-8') as fg:
    for newjs in mem:
        fg.write(newjs)

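Attempt 1 was, roughly, the following (a reconstruction from the description above, not the original code):

# rough reconstruction of attempt 1 - the original code is not shown in the question
import json

with open(r'C:\json3toster.json', encoding='utf-8') as fp:
    # parsing all 8M+ lines up front keeps every record alive at once
    records = [json.loads(line.replace("'", '"')) for line in fp]

titles, unique = set(), []
for r in records:
    if r['title'] not in titles:
        titles.add(r['title'])
        unique.append(r)
print(len(unique))
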
Before that, I had already removed exactly identical lines with sort.
Any solution is welcome, not necessarily Python.
Thanks for any hint.

2 answers
Andrey Dugin, 2019-10-20
@prolisk

You are doing something terrible. Searching in a list instead of a set is especially bad. Do this instead:

from ast import literal_eval as eval  # ast.literal_eval() is safe, plain eval() is not

with open('input.txt', 'r') as fi, open('output.txt', 'w') as fo:
    cache = set()                      # only the titles are kept in memory
    for line in fi:                    # stream the file instead of readlines()
        title = eval(line).get('title')
        if title not in cache:         # O(1) membership test in a set
            cache.add(title)
            fo.write(line)
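
A note on why literal_eval rather than json.loads: the sample records use single quotes, which is valid Python literal syntax but not valid JSON, so json.loads rejects them:

import json
from ast import literal_eval

line = "{'title': '7778', 'mes': 'ruseo', 'coord': '755'}"

print(literal_eval(line)['title'])  # -> 7778
# json.loads(line) raises json.decoder.JSONDecodeError
#   ("Expecting property name enclosed in double quotes")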

A fancier way is to implement the caching with a decorator:
from ast import literal_eval as eval
from functools import lru_cache

@lru_cache(None)                 # memoized by title: the body runs only on a cache miss
def process(title):
    # record and fo are resolved in the global scope at call time;
    # note that this writes the parsed dict's repr, not the original line
    print(record, file=fo)

with open('input.txt', 'r') as fi, open('output.txt', 'w') as fo:
    for record in map(eval, fi):
        process(record['title'])  # a repeated title hits the cache and prints nothing

And you can check the cache statistics while you're at it:
>>> process.cache_info()
CacheInfo(hits=994960, misses=5040, maxsize=None, currsize=5040)
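
To see why this deduplicates: lru_cache memoizes by the title argument, so the function body executes only on the first call with a given title; repeats are answered from the cache and print nothing. A minimal self-contained demonstration with made-up titles:

from functools import lru_cache

written = []

@lru_cache(None)
def process(title):
    # the body runs only on a cache miss, i.e. the first time a title is seen
    written.append(title)

for title in ['7778', '77789', '7778', '7778']:
    process(title)

print(written)               # ['7778', '77789'] - the duplicates were skipped
print(process.cache_info())  # CacheInfo(hits=2, misses=2, maxsize=None, currsize=2)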

longclaps, 2019-10-20
@longclaps

Piece of cake. As an option, you can try this:

import re

with open(r'C:\json3toster.json', 'r', encoding="utf-8") as fp:
    ds = fp.readlines()
d = {"'title'": 0, "'mes'": 1, "'coord'": 2}
print(len(ds))
findall, buf = re.compile(r"'[^']*'").findall, [''] * 3
for i, s in enumerate(ds):
    # pull out the quoted tokens and rewrite the line as
    # title<TAB>mes<TAB>coord, regardless of the original key order
    l = findall(s)
    while l:
        w = l.pop()          # value
        buf[d[l.pop()]] = w  # its key decides the slot
    ds[i] = '\t'.join(buf)
ds.sort()                    # duplicates by title are now adjacent
a = ''
with open(r'd:/json_fin.json', 'a', encoding='utf-8') as fg:
    for s in ds:
        title, mes, coord = s.split('\t')
        if a != title:       # keep the first record of each title run
            a = title
            fg.write(f"{{'title': {title}, 'mes': {mes}, 'coord': {coord}}}\n")
