Java
esin, 2012-07-04 00:02:01

How to optimize a file sorting algorithm?

Good day to all!
I recently took a test that required sorting a 4 GB file using only 512 MB of RAM. The language is Java, and the time allotted for the task (NOT the algorithm's running time) was 4 hours. Each line of the file has three columns; the second column is a date/time in ISO format, and the file must be sorted by it.
I did the following: read lines from the source file into an ArrayList until roughly 250 MB of memory is filled, sort the list with merge sort (I chose it because it has good running time and I had already worked with it), and write the sorted chunk to a temporary file. Then keep reading lines from the source file, sorting and saving chunks in the same way. Once the whole source file has been read, merge the sorted chunks into a single output file with the same merge procedure, again storing intermediate results in temporary sorted files.
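A minimal sketch of the run-generation step described above (tab-separated columns, a line-count threshold instead of a byte threshold, and the library sort in place of the hand-written merge sort are assumptions, not the actual code):

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;
    import java.util.*;

    public class ChunkSplitSketch {

        // Sort by the second tab-separated column; ISO-8601 strings compare correctly as text.
        static final Comparator<String> BY_DATE =
                Comparator.comparing(line -> line.split("\t", 3)[1]);

        // Read up to maxLines lines at a time, sort them in memory, dump them to a temp file.
        static List<Path> splitIntoSortedChunks(Path input, int maxLines) throws IOException {
            List<Path> chunks = new ArrayList<>();
            try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                List<String> buffer = new ArrayList<>(maxLines);
                String line;
                while ((line = in.readLine()) != null) {
                    buffer.add(line);
                    if (buffer.size() == maxLines) {
                        chunks.add(writeSortedChunk(buffer));
                        buffer.clear();
                    }
                }
                if (!buffer.isEmpty()) chunks.add(writeSortedChunk(buffer));
            }
            return chunks;
        }

        static Path writeSortedChunk(List<String> lines) throws IOException {
            lines.sort(BY_DATE);                        // in-place sort of the in-memory chunk
            Path tmp = Files.createTempFile("chunk", ".txt");
            Files.write(tmp, lines, StandardCharsets.UTF_8);
            return tmp;
        }
    }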
The feedback on the assignment was that the algorithm is not optimal. I have ideas for minor optimizations (for example, a faster comparison of the two substrings), but I have not come up with anything that improves it substantially.
If anyone has ideas, please share them.
Thanks in advance!


8 answers
Nazar Mokrinsky, 2012-07-04
@nazarpc

I'm no sorting specialist, but if you quickly scan the original (first) file and copy out only the second column together with the line number, much more data will fit into memory (this becomes the second file).
Once that is sorted, build a third file with the results: walk the sorted second file, take each line number, and copy the corresponding line from the first file.

Rowdy Ro, 2012-07-04
@rowdyro

Here is what I would do.
Read each line, remember its offset from the beginning of the file (int32) and its length (int32), and convert the time to a timestamp (int32); that is ~12 bytes per entry (plus the overhead of Java containers).
Sort all of these entries by timestamp in a single container.
Then walk the sorted index, pulling each line out of the source file by offset and length and appending it to the new file.
About 44 million entries fit in 512 MB (excluding container overhead).
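A rough sketch of this index-based approach (the tab-separated layout, Instant-parsable timestamps with a zone designator, and a single-byte encoding for the offset arithmetic are assumptions; the original suggestion packs each entry into three int32 values, while the sketch uses a long[] for brevity):

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;
    import java.time.Instant;
    import java.util.*;

    public class IndexSortSketch {
        public static void main(String[] args) throws IOException {
            File src = new File("input.txt");          // hypothetical file name
            List<long[]> index = new ArrayList<>();    // entries: {offset, length, epochMillis}

            // Pass 1: build the index. Offsets are tracked by hand because BufferedReader
            // does not expose file positions; assumes single-byte encoding and '\n' endings.
            long offset = 0;
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream(src), StandardCharsets.US_ASCII))) {
                String line;
                while ((line = in.readLine()) != null) {
                    long ts = Instant.parse(line.split("\t", 3)[1]).toEpochMilli();
                    index.add(new long[]{offset, line.length(), ts});
                    offset += line.length() + 1;       // +1 for the '\n'
                }
            }

            // Sort only the index, by timestamp.
            index.sort(Comparator.comparingLong(e -> e[2]));

            // Pass 2: write the lines out in sorted order, seeking by offset.
            try (RandomAccessFile raf = new RandomAccessFile(src, "r");
                 BufferedWriter out = Files.newBufferedWriter(Paths.get("sorted.txt"),
                         StandardCharsets.US_ASCII)) {
                for (long[] e : index) {
                    byte[] buf = new byte[(int) e[1]];
                    raf.seek(e[0]);
                    raf.readFully(buf);
                    out.write(new String(buf, StandardCharsets.US_ASCII));
                    out.newLine();
                }
            }
        }
    }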

apangin, 2012-07-05
@apangin

What kind of lines are in the file? Perhaps they can be stored more compactly in memory. For example, a date/time fits exactly into one long, see Date.getTime(). Maybe the same applies to the other columns?
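For example (java.time on Java 8+, or Date.getTime() with a SimpleDateFormat on older JDKs):

    // An ISO-8601 timestamp packed into a single long (milliseconds since the epoch):
    // 8 bytes instead of a 20-30 character String object per date.
    long key = java.time.Instant.parse("2012-07-04T00:02:01Z").toEpochMilli();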

knekrasov, 2012-07-04
@knekrasov

Why not use the classical algorithms for sorting sequences (see D. Knuth, vol. 3)? This is quite a classic case. Keywords: "single-phase"/"multi-phase", "single-path"/"double-path", "tournament" sorting.
As something more exotic, you could do a combined sort (reading blocks, partial sorting, mixing, for example as here), but it will take a long time to run.
If you are worried that a single line might not fit into the memory limit, build an index and sort that.
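For reference, a minimal sketch of replacement selection (the "tournament" style of run generation from Knuth vol. 3), assuming tab-separated lines keyed by the second column; on random input it produces runs roughly twice the size of the in-memory heap:

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;
    import java.util.*;

    public class ReplacementSelectionSketch {

        // Sort key: the second tab-separated column (ISO date/time).
        static String key(String line) { return line.split("\t", 3)[1]; }

        // Generates sorted runs; each run is written to its own temp file.
        static List<Path> makeRuns(Path input, int heapSize) throws IOException {
            List<Path> runs = new ArrayList<>();
            // Heap entries: {Integer runNumber, String line}, smallest (run, key) first.
            PriorityQueue<Object[]> heap = new PriorityQueue<>((a, b) -> {
                int byRun = Integer.compare((int) a[0], (int) b[0]);
                return byRun != 0 ? byRun : key((String) a[1]).compareTo(key((String) b[1]));
            });
            try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                String line;
                while (heap.size() < heapSize && (line = in.readLine()) != null) {
                    heap.add(new Object[]{0, line});
                }
                if (heap.isEmpty()) return runs;

                int currentRun = 0;
                String lastKey = null;
                BufferedWriter out = newRunWriter(runs);
                while (!heap.isEmpty()) {
                    Object[] top = heap.poll();
                    if ((int) top[0] != currentRun) {   // current run exhausted, open the next
                        out.close();
                        out = newRunWriter(runs);
                        currentRun = (int) top[0];
                    }
                    String rec = (String) top[1];
                    out.write(rec);
                    out.newLine();
                    lastKey = key(rec);

                    String next = in.readLine();
                    if (next != null) {
                        // A record smaller than the last output can only go into the next run.
                        int run = key(next).compareTo(lastKey) >= 0 ? currentRun : currentRun + 1;
                        heap.add(new Object[]{run, next});
                    }
                }
                out.close();
            }
            return runs;
        }

        static BufferedWriter newRunWriter(List<Path> runs) throws IOException {
            Path run = Files.createTempFile("run", ".txt");
            runs.add(run);
            return Files.newBufferedWriter(run, StandardCharsets.UTF_8);
        }
    }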

Dmitry, 2012-07-04
@Neir0

I did not fully understand your solution, but the brute-force way is: go through the file and select the top 512 MB of records, then write them to a file. Then go through it again, selecting the next top 512 MB, this time starting from the lower boundary of the previous top. That is 8 passes in total.
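A rough sketch of this multi-pass selection (file names and the per-pass record count are illustrative; the line number is appended to the key so the per-pass boundary is unique, which assumes fixed-length ISO timestamps; the sketch works upward from the smallest keys, the mirror image of taking the largest block first):

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;
    import java.util.*;

    public class MultiPassSelectionSketch {
        public static void main(String[] args) throws IOException {
            Path input = Paths.get("input.txt");   // hypothetical file name
            int perPass = 200_000;                 // records kept per pass, sized to fit in RAM
            String boundary = null;                // largest key already written out

            try (BufferedWriter out = Files.newBufferedWriter(Paths.get("sorted.txt"),
                    StandardCharsets.UTF_8)) {
                while (true) {
                    // Max-heap holding the smallest `perPass` candidates above the boundary.
                    PriorityQueue<String[]> best =
                            new PriorityQueue<>((a, b) -> b[0].compareTo(a[0]));
                    try (BufferedReader in =
                                 Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                        String line;
                        long lineNo = 0;
                        while ((line = in.readLine()) != null) {
                            // Key = date column + line number, unique and stable across passes.
                            String key = line.split("\t", 3)[1] + "#" + (lineNo++);
                            if (boundary != null && key.compareTo(boundary) <= 0) continue;
                            if (best.size() < perPass) {
                                best.add(new String[]{key, line});
                            } else if (key.compareTo(best.peek()[0]) < 0) {
                                best.poll();
                                best.add(new String[]{key, line});
                            }
                        }
                    }
                    if (best.isEmpty()) break;      // everything has been written
                    String[][] batch = best.toArray(new String[0][]);
                    Arrays.sort(batch, (a, b) -> a[0].compareTo(b[0]));
                    for (String[] e : batch) {
                        out.write(e[1]);
                        out.newLine();
                    }
                    boundary = batch[batch.length - 1][0];
                }
            }
        }
    }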

Arktos, 2012-07-04
@Arktos

1. Sort the data via an index built on the second column.
2. MergeSort uses twice as much RAM as it really needs. In memory it is better to sort with quicksort (the default Arrays.sort()); then you can sort twice as much data at once.
3. Since the date can be represented as an integer, you can use radix sort (which amounts to several counting sorts). It is even faster: if the date fits into a number < 10^18, the sort takes just three counting sorts (on base-10^6 digits), each of which runs in linear time.
4. ArrayList is slow; use a plain array.
5. I did not understand how exactly you merge the files. They must be merged the same way MergeSort does, i.e. not one after another but logarithmically (or in a single k-way pass, see the sketch below).
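A sketch of point 5: a single k-way merge of all the sorted chunk files with a priority queue keyed by the date column gives the same O(N log k) total work as a logarithmic pairwise merge, but in one pass over the data (tab-separated columns are an assumption):

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;
    import java.util.*;

    public class KWayMergeSketch {

        static String key(String line) { return line.split("\t", 3)[1]; }

        static void merge(List<Path> sortedChunks, Path output) throws IOException {
            List<BufferedReader> readers = new ArrayList<>();
            // Heap entries: {line, index of the reader it came from}.
            PriorityQueue<Object[]> heap = new PriorityQueue<>(
                    (a, b) -> key((String) a[0]).compareTo(key((String) b[0])));
            try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
                // Seed the heap with the first line of every chunk.
                for (Path chunk : sortedChunks) {
                    BufferedReader r = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
                    readers.add(r);
                    String first = r.readLine();
                    if (first != null) heap.add(new Object[]{first, readers.size() - 1});
                }
                // Repeatedly emit the globally smallest line and refill from its chunk.
                while (!heap.isEmpty()) {
                    Object[] smallest = heap.poll();
                    out.write((String) smallest[0]);
                    out.newLine();
                    String next = readers.get((int) smallest[1]).readLine();
                    if (next != null) heap.add(new Object[]{next, smallest[1]});
                }
            } finally {
                for (BufferedReader r : readers) r.close();
            }
        }
    }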

CKOPOBAPKuH, 2012-07-04
@CKOPOBAPKuH

You also need to consider the situation where the file contains only 3 lines, each 650 MB long. Then you cannot read even one whole line into memory, and you have to do what rowdyro suggests.

max7 M7, 2012-07-04
@max7

Don't take this as a cheap trick ;-)
But I would just dump the file into SQLite.
Then you can do whatever you want with it.
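A rough sketch of that route, assuming the xerial sqlite-jdbc driver is on the classpath and tab-separated columns; ISO-8601 strings compare correctly as plain text, so ORDER BY on the date column is enough:

    import java.io.BufferedReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;
    import java.sql.*;

    public class SqliteSortSketch {
        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection("jdbc:sqlite:sort.db")) {
                db.createStatement().execute("CREATE TABLE rows (c1 TEXT, c2 TEXT, c3 TEXT)");
                db.setAutoCommit(false);
                try (BufferedReader in = Files.newBufferedReader(Paths.get("input.txt"),
                             StandardCharsets.UTF_8);
                     PreparedStatement ins =
                             db.prepareStatement("INSERT INTO rows VALUES (?, ?, ?)")) {
                    String line;
                    long n = 0;
                    while ((line = in.readLine()) != null) {
                        String[] cols = line.split("\t", 3);
                        ins.setString(1, cols[0]);
                        ins.setString(2, cols[1]);
                        ins.setString(3, cols[2]);
                        ins.addBatch();
                        if (++n % 10_000 == 0) ins.executeBatch();   // flush periodically
                    }
                    ins.executeBatch();
                    db.commit();
                }
                // SQLite does the sorting (its own external sort if needed).
                try (ResultSet rs = db.createStatement()
                        .executeQuery("SELECT c1, c2, c3 FROM rows ORDER BY c2")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + "\t" + rs.getString(2)
                                + "\t" + rs.getString(3));
                    }
                }
            }
        }
    }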
