How to quickly remove duplicates in a large (350GB) file?
What is the fastest way to remove duplicates from a large (350 GB) file (~10 billion records, lines up to 255 characters long)? What tools are best suited for this, and how should they be tuned for the task? Please share real-world experience if you have any.
P.S. Intel Core i5-3550, 8 GB RAM
It won't work quickly
A ready-made solution:
sort \
  --unique \
  --parallel=<number of threads> \
  -T /path/to/temp/dir/ \
  /path/to/huge/file > /out/file
Keep in mind that sort does an external merge through temporary files, so the -T directory should have roughly as much free space as the input file.
Good question!
At third glance, I would go with the following algorithm.
I would take a database, say MySQL, to store the hashes and the collisions.
So, we need to walk through the records of the big file and build a new big file (a minimal sketch of the whole loop follows the numbered steps).
1) Take a line from the file and compute a hash of it (or of some part of it), for example sha1.
2) Look up our sha1 in the database
(a "hashes" table with fields "hash", "offset", "count").
2.1) If it is not found: write the line to the new big file and insert the sha1 together with the line's offset into "hashes".
2.2) If it is found: re-read the line at the stored offset and compare it with the current one; if they match, it is a duplicate and we drop it, otherwise it is a hash collision.
3) Collision handling: the same lookup-and-compare, but against a "collisions" table
(with fields "hash", "offset", "count").
In principle, this process can be parallelized across an arbitrary number of processes, though that part needs some extra thought.
P.S. You can also use the additional "count" field in each of the tables, incrementing it whenever a comparison of records occurs, to gather statistics.