C++ / C#
delkov, 2016-02-10 19:38:57

[Large txt ~1 GB] how to process?

Good day.
I have a large txt file (~1 GB) with lines like:
x y z text text
...
I need to collect all repetitions of each (x, y) pair and sum their z values.
Example:
1 2 5 test test
1 2 10 test test
After processing:
1 2 15 test
My MATLAB code (it runs very slowly):

fid = fopen('test.txt');
T = textscan(fid, '%f %f %f %*[^\n]');  % read x, y, z and skip the rest of each line
fclose(fid);

A = T{1};
B = T{2}; 
C = T{3};

A_2 = [];
B_2 = [];
C_2 = [];

while (~isempty(A))

  temp_A = A(1);
  temp_B = B(1);
  temp_C = C(1);
  total_sum = 0;
  I_A = find(A==temp_A);
  I_B = find(B==temp_B);
  I_AB = intersect(I_A, I_B);


  if (isempty(I_AB))
    total_sum = temp_C;
    A_2 = [A_2; temp_A];
    B_2 = [B_2; temp_B];
    C_2 = [C_2; total_sum];
    A(1) = []; 
    B(1) = [];
    C(1) = [];
  else
    for i=length(I_AB):-1:1
      total_sum = total_sum + C(I_AB(i));
      A(I_AB(i)) = [];
      B(I_AB(i)) = [];
      C(I_AB(i)) = [];
    end 
    A_2 = [A_2; temp_A];
    B_2 = [B_2; temp_B];
    C_2 = [C_2; total_sum];
  end 

end


data = [A_2 B_2 C_2];

dlmwrite('out.txt', data, 'delimiter','\t','precision',3)

Possible solutions:
1) delete the last 2 columns in the txt (they are not used)
2) load the txt into SQL -> process it there
3) perhaps there is something for large files in C++
4) Python?
Any advice, dear users?
Thank you.

6 answer(s)
kazmiruk, 2016-02-10
@delkov

Run the file through the sort utility (if you are on Linux). You get a sorted file as output, and from there it is just a matter of technique: you no longer need to load the file into memory. Read it line by line and sum z; as soon as x or y changes, write out the accumulated sum, reset it, and start counting for the next pair.
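A minimal C++ sketch of this approach, assuming the file has already been sorted numerically on the first two columns (for example with sort -k1,1n -k2,2n test.txt > sorted.txt) and that x and y are integers; the file names are only placeholders:

#include <fstream>
#include <sstream>
#include <string>

int main() {
    std::ifstream in("sorted.txt");   // pre-sorted input (assumed name)
    std::ofstream out("out.txt");     // aggregated output
    std::string line;
    long long cur_x = 0, cur_y = 0;
    double sum_z = 0.0;
    bool have_group = false;

    while (std::getline(in, line)) {
        std::istringstream ss(line);
        long long x, y;
        double z;
        if (!(ss >> x >> y >> z)) continue;      // skip malformed lines, ignore trailing text
        if (have_group && x == cur_x && y == cur_y) {
            sum_z += z;                          // same (x, y) pair: keep accumulating
        } else {
            if (have_group)                      // pair changed: flush the finished group
                out << cur_x << '\t' << cur_y << '\t' << sum_z << '\n';
            cur_x = x; cur_y = y; sum_z = z;
            have_group = true;
        }
    }
    if (have_group)                              // flush the last group
        out << cur_x << '\t' << cur_y << '\t' << sum_z << '\n';
}

Because the input is sorted, only the current group has to be kept, so memory use stays constant regardless of file size.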

abcdefghijklmnopq, 2016-02-10
@abcdefghijklmnopq

The operating-system API for mapping files into RAM exists exactly for this purpose.
If possible, I would still pre-trim the file (drop the last two unneeded columns); EmEditor, for example, handles large files well.
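A minimal sketch of that API on a POSIX system (Linux/macOS), mapping the file read-only and walking it as one big character buffer; on Windows the counterpart would be CreateFileMapping/MapViewOfFile. The parsing itself is only hinted at here:

#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("test.txt", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    // Map the whole file read-only; the OS pages it in on demand,
    // so the ~1 GB is not copied into the process up front.
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    const char* data = static_cast<const char*>(p);

    // Walk the mapping; here we only count lines, a real parser would
    // extract x, y, z from each line and accumulate them instead.
    size_t lines = 0;
    for (off_t i = 0; i < st.st_size; ++i)
        if (data[i] == '\n') ++lines;
    std::printf("%zu lines\n", lines);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}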

AtomKrieg, 2016-02-10
@AtomKrieg

If you need to do this all the time and speed matters as much as air, choose C++. The C++ version goes roughly like this:
1) Read the entire file into memory.
2) Parse each line manually (the built-in functions have overhead).
3) To compute the sums, use an associative container:

std::map<std::pair<int, int>, long long> result;
// get_parse stands for your own line parser that extracts x, y, z
int val1, val2, val3;
get_parse(val1, val2, val3);
result[std::make_pair(val1, val2)] += val3;  // accumulate z per (x, y) pair

4) Serialize the whole container (with a hand-written function) and save it to disk.
SQL will definitely be the slowest solution. If you want to use Python, take the pandas library ( stackoverflow.com/questions/15570099/pandas-pivot-... ) or NumPy (but then without the text columns; look up an example yourself). A fuller C++ sketch is below.
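A self-contained sketch of that plan (for brevity it parses with istringstream rather than the hand-rolled parser recommended above; file names are taken from the question, and x and y are assumed to be integers):

#include <fstream>
#include <map>
#include <sstream>
#include <string>
#include <utility>

int main() {
    std::ifstream in("test.txt");
    std::map<std::pair<long long, long long>, double> result;

    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        long long x, y;
        double z;
        if (ss >> x >> y >> z)          // the trailing text columns are ignored
            result[{x, y}] += z;        // sum z per unique (x, y) pair
    }

    std::ofstream out("out.txt");
    for (const auto& kv : result)
        out << kv.first.first << '\t' << kv.first.second << '\t' << kv.second << '\n';
}

Replacing std::map with std::unordered_map (plus a hash for the pair) usually speeds this up further, at the cost of unsorted output.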

Oleg Shevelev, 2016-02-10
@mantyr

1) delete the last 2 columns in txt (because they are not used)
- if you can avoid processing data you do not need, you should do so.
2) drive txt into sql -> process it there.
- a pointless exercise.
3) perhaps there is something for large files in c++.
- the essence of parsing large files (1, 10, 100, 1000 gigabytes and more per file) is to read them in chunks of a few bytes (the size is chosen for the task) and process the file as a stream, without loading it all into memory.
Algorithm:
1. read the stream N bytes at a time
2. when you hit \n, treat it as the start of a new line
3. as soon as a line is assembled, do whatever you want with it; in this case split it on spaces (explode(" ", $str) is a PHP example, but the language is not the point)
4. put the third field into a result array as the value, with the first two fields as the key; if the key already exists, add the values
Of course, this does not save you from the result array growing to the number of unique keys (x y pairs). MapReduce is used for that, but for your task it is overkill. A minimal C++ sketch of the chunked reading follows below.
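As an illustration of steps 1-3 only (buffer size and file name are arbitrary assumptions), here is one way to do the chunked reading in C++; each assembled line would then be split on spaces and fed into the same kind of accumulator as in the map example above:

#include <fstream>
#include <functional>
#include <string>
#include <vector>

// Calls handle() once per complete line, reading the file in fixed-size chunks
// so that only the buffer and the current line live in memory.
void for_each_line(const char* path,
                   const std::function<void(const std::string&)>& handle) {
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(1 << 20);     // read 1 MB at a time
    std::string pending;                // holds a line split across chunk boundaries

    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        std::streamsize got = in.gcount();
        for (std::streamsize i = 0; i < got; ++i) {
            if (buf[i] == '\n') {
                handle(pending);        // a complete line has been assembled
                pending.clear();
            } else {
                pending += buf[i];
            }
        }
    }
    if (!pending.empty()) handle(pending);   // last line without a trailing '\n'
}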
How to get rid of a large resulting array:
1. the algorithm is the same
2. write the value into MySQL or any other database like this (in the answer's PHP style, where $key would be the x y pair and $value the z value):

INSERT INTO result_table
SET
    `key` = ".$key.",
    `value` = ".$value."
ON DUPLICATE KEY UPDATE `value` = `value` + ".$value."

where result_table.`key` is the PRIMARY KEY

abcd0x00, 2016-02-11
@abcd0x00

I would first split the original file into several files, each holding only the lines with the same x y.
Then turn each of those files into a single line, one by one, and append it to a common output file.
