How to process a large txt file (~1 GB)?
Good day.
There is a large txt file, ~1 GB, with lines like:
x y z text text
...
I need to find all repetitions of each (x, y) pair and sum their z values.
Example:
1 2 5 test test
1 2 10 test test
After processing:
1 2 15
My MATLAB code (it takes very long to run):
% read the three numeric columns, skipping the trailing text
fid = fopen('test.txt');
T = textscan(fid, '%f %f %f %*[^\n]');
fclose(fid);
A = T{1};
B = T{2};
C = T{3};
A_2 = [];
B_2 = [];
C_2 = [];
while ~isempty(A)
    temp_A = A(1);
    temp_B = B(1);
    temp_C = C(1);
    total_sum = 0;
    % indices of all rows with the same (A, B) pair
    I_A = find(A == temp_A);
    I_B = find(B == temp_B);
    I_AB = intersect(I_A, I_B);
    if isempty(I_AB)
        total_sum = temp_C;
        A_2 = [A_2; temp_A];
        B_2 = [B_2; temp_B];
        C_2 = [C_2; total_sum];
        A(1) = [];
        B(1) = [];
        C(1) = [];
    else
        % sum the matching rows and delete them, back to front
        for i = length(I_AB):-1:1
            total_sum = total_sum + C(I_AB(i));
            A(I_AB(i)) = [];
            B(I_AB(i)) = [];
            C(I_AB(i)) = [];
        end
        A_2 = [A_2; temp_A];
        B_2 = [B_2; temp_B];
        C_2 = [C_2; total_sum];
    end
end
data = [A_2 B_2 C_2];
dlmwrite('out.txt', data, 'delimiter', '\t', 'precision', 3)
Run the file through the sort utility (on Linux). The output is a sorted file, and from there it's a matter of technique: you don't need to load the file into memory. Read it line by line and accumulate z; as soon as x or y changes, write out the sum and start counting the next pair.
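A minimal sketch of that second pass in C++, assuming the file was pre-sorted with something like sort -k1,1n -k2,2n test.txt > sorted.txt (the file names are just examples):
#include <cstdio>

int main() {
    // assumes the input was pre-sorted, e.g.:
    //   sort -k1,1n -k2,2n test.txt > sorted.txt
    FILE* in = fopen("sorted.txt", "r");
    FILE* out = fopen("out.txt", "w");
    if (!in || !out) return 1;

    long x, y, prev_x = 0, prev_y = 0;
    double z, sum = 0;
    bool have_pair = false;
    char tail[4096]; // trailing text columns, assumed short; discarded

    while (fscanf(in, "%ld %ld %lf", &x, &y, &z) == 3) {
        fgets(tail, sizeof tail, in); // skip the rest of the line
        if (have_pair && (x != prev_x || y != prev_y)) {
            // the pair changed: flush the finished sum, start a new one
            fprintf(out, "%ld\t%ld\t%g\n", prev_x, prev_y, sum);
            sum = 0;
        }
        prev_x = x;
        prev_y = y;
        sum += z;
        have_pair = true;
    }
    if (have_pair) // flush the last pair
        fprintf(out, "%ld\t%ld\t%g\n", prev_x, prev_y, sum);

    fclose(in);
    fclose(out);
}
Only one pair's running sum is ever held in memory, which is the whole point of sorting first.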
The operating-system API for mapping files into RAM (mmap on Linux, MapViewOfFile on Windows) exists for exactly this purpose.
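For illustration, a minimal sketch of that approach using POSIX mmap on Linux; the line handling is reduced to a line count here, and the real parsing would go in the scan loop:
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("test.txt", O_RDONLY);
    if (fd < 0) return 1;
    struct stat st;
    fstat(fd, &st);

    // map the whole file; the kernel pages it in on demand
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;
    const char* data = static_cast<const char*>(p);

    // scan the mapping byte by byte; every '\n' ends a line
    long lines = 0;
    for (off_t i = 0; i < st.st_size; ++i)
        if (data[i] == '\n') ++lines;
    printf("%ld lines\n", lines);

    munmap(p, st.st_size);
    close(fd);
}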
If possible, I would still pre-trim the file (drop the last two unneeded columns); EmEditor, for example, handles large files well.
If you need to do this regularly and speed matters as much as air, choose C++. The C++ version goes something like this:
1) Read the entire file into memory.
2) Parse each line manually, since the built-in functions carry overhead.
3) To accumulate the sums, use the container (a fuller sketch follows below):
map<pair<int, int>, int> result;
// get_parse stands for the manual line parser from step 2
get_parse(val1, val2, val3);
result[std::make_pair(val1, val2)] += val3;
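A self-contained sketch of this plan, under a couple of assumptions not in the original: the values may be fractional, so z is summed as a double, and long is used for x and y:
#include <cstdio>
#include <cstdlib>
#include <map>
#include <string>
#include <utility>

int main() {
    // 1) read the entire file into memory
    FILE* f = fopen("test.txt", "rb");
    if (!f) return 1;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    std::string buf(size, '\0');
    if (fread(&buf[0], 1, size, f) != (size_t)size) return 1;
    fclose(f);

    // 3) (x, y) -> running sum of z
    std::map<std::pair<long, long>, double> result;

    // 2) parse manually: strtol/strtod advance the pointer themselves
    const char* p = buf.c_str();
    const char* end = p + buf.size();
    while (p < end) {
        char* next;
        long x = strtol(p, &next, 10);
        long y = strtol(next, &next, 10);
        double z = strtod(next, &next);
        result[std::make_pair(x, y)] += z;
        while (next < end && *next != '\n') ++next; // skip the text columns
        p = next + 1;
    }

    FILE* out = fopen("out.txt", "w");
    for (const auto& kv : result)
        fprintf(out, "%ld\t%ld\t%g\n", kv.first.first, kv.first.second, kv.second);
    fclose(out);
}
std::map keeps the output ordered by key; std::unordered_map with a custom hash for the pair would usually be faster when the order does not matter.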
1) Delete the last two columns in the txt (they are not used).
- If it is possible not to process data you don't need, then that's exactly what you should do.
2) Load the txt into SQL and process it there.
- A pointless exercise.
3) Perhaps there is something for large files in C++.
- The essence of parsing large files (1, 10, 100, 1000 gigabytes per file and more) is to read them in chunks of some number of bytes (the size is chosen for the task) and analyze the file as a stream, without loading it all into memory.
Algorithm:
1. Read the stream N bytes at a time.
2. When you encounter \n, a new line has begun.
3. As soon as a full line is assembled, do whatever you want with it; in this case, split it on spaces (explode(" ", $str) in PHP, but the language is not the point).
4. Put the third cell into some resulting array as the value, keyed by the first two cells; if the key is already there, add to the sum (see the sketch after this list).
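A minimal sketch of this loop, translated to C++ rather than PHP; the 64 KB chunk size is an arbitrary choice:
#include <cstdio>
#include <cstdlib>
#include <map>
#include <string>
#include <utility>

int main() {
    FILE* f = fopen("test.txt", "rb");
    if (!f) return 1;

    std::map<std::pair<long, long>, double> result;
    std::string line;  // the line assembled so far (step 3)
    char chunk[65536]; // read N bytes at a time (step 1)
    size_t n;
    while ((n = fread(chunk, 1, sizeof chunk, f)) > 0) {
        for (size_t i = 0; i < n; ++i) {
            if (chunk[i] != '\n') { // step 2: '\n' ends the line
                line += chunk[i];
                continue;
            }
            // step 4: first two cells are the key, third is the value
            char* next;
            long x = strtol(line.c_str(), &next, 10);
            long y = strtol(next, &next, 10);
            double z = strtod(next, &next); // trailing text is ignored
            result[std::make_pair(x, y)] += z;
            line.clear();
        }
    }
    // if the file does not end with '\n', `line` still holds the last row here
    fclose(f);

    for (const auto& kv : result)
        printf("%ld\t%ld\t%g\n", kv.first.first, kv.first.second, kv.second);
}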
Of course, this will not save you from the resulting array growing to the number of unique keys (the (x, y) pairs). MapReduce exists for that, but for your task it is overkill.
How to get rid of the large resulting array:
1. The algorithm is the same.
2. Instead of the array, write each value to MySQL (or any other database) with a query like:
INSERT INTO result_table
SET
`key` = ".$key.",
value = ".$value."
ON DUPLICATE KEY UPDATE value = value + ".$value."
(key is a reserved word in MySQL, hence the backticks; ON DUPLICATE KEY UPDATE also needs a UNIQUE index on the key column to fire.)
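For illustration, a hypothetical helper along these lines that needs no database driver at all: it reads the txt on stdin and prints an SQL script for the mysql client. The table and column names are made up, and the UNIQUE key on (x, y) is what makes the upsert work:
#include <cstdio>

int main() {
    // hypothetical schema: x and y together form the unique key,
    // which is what makes ON DUPLICATE KEY UPDATE fire
    printf("CREATE TABLE IF NOT EXISTS result_table (\n"
           "  x BIGINT NOT NULL,\n"
           "  y BIGINT NOT NULL,\n"
           "  value DOUBLE NOT NULL,\n"
           "  UNIQUE KEY (x, y)\n"
           ");\n");

    long x, y;
    double z;
    char tail[4096]; // trailing text columns, discarded
    while (scanf("%ld %ld %lf", &x, &y, &z) == 3) {
        fgets(tail, sizeof tail, stdin);
        printf("INSERT INTO result_table SET x=%ld, y=%ld, value=%g\n"
               "  ON DUPLICATE KEY UPDATE value = value + %g;\n",
               x, y, z, z);
    }
}
Usage would be something like gen < test.txt | mysql dbname, so the summing happens inside the database rather than in an in-memory array.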