How to quickly remove duplicates in a large (350GB) file?
What is the fastest way to remove duplicates from a large (350 GB) file (~10 billion records, lines up to 255 characters long)? What tools are best suited for this, and how should they be tuned for the task? Please share real-world experience if you have any.
P.S. Intel Core i5-3550, 8 GB RAM
It won't work quickly
A ready-made solution:
sort \
  --unique \
  --parallel=<number of threads> \
  -T /path/to/temp/dir/ \
  /path/to/huge/file > /out/file
Keep in mind that sort does an external merge through temporary files, so the -T directory should have roughly as much free space as the input file.
Good question!
At third glance, I would go with the following algorithm.
I would take a database, say MySQL, to store the hashes and the collisions.
So, we need to walk through the records of the big file and build a new big file (a minimal sketch of the whole loop follows the numbered steps).
1) Take a line from the file and compute a hash of it (or of some part of it), for example sha1.
2) Look up our sha1 in the database
(a "hashes" table with fields "hash", "offset", "count").
2.1) If it is not found: write the line to the new big file and insert the sha1 together with the line's offset into "hashes".
2.2) If it is found: re-read the line at the stored offset and compare it with the current one; if they match, it is a duplicate and we drop it, otherwise it is a hash collision.
3) Collision handling: the same lookup-and-compare, but against a "collisions" table
(with fields "hash", "offset", "count").
In principle, this process can be parallelized across an arbitrary number of processes, though that part needs some extra thought.
P.S. You can also use the additional "count" field in each of the tables, incrementing it whenever a comparison of records occurs, to gather statistics.