How to optimize the processing of long cycles?

I

IvanKalinin2013-11-18 08:04:41

Delphi

IvanKalinin, 2013-11-18 08:04:41

Good day everyone, could you help to optimize the operation of the following procedure:

procedure TForm1.Button3Click(Sender: TObject);
var L1,L2,L3:TStringList;
D1:TOpenDialog;
n1:string;
i:integer;
begin
L1:=TStringList.Create; //создание 1го стринглиста
L2:=TStringList.Create; //создание 2го стринглиста
L3:=TStringList.Create; //создание 3го стринглиста
 
D1:=TOpenDialog.Create(self);
 
if D1.Execute then n1:=D1.Filename else exit;
L1.LoadFromFile(n1);
 
if D1.Execute then L2.LoadFromFile(D1.Filename) else exit;
 
for i:=0 to L1.Count-1 do
if L2.IndexOf(L1[i])=-1 then L3.Add(L1[i]);
 
L3.SaveToFile('3.txt');
D1.Free;L1.Free; L2.Free; L3.Free;
 
end;

In general terms:
there is a file 1 with a set of lines
there is a file 2 with a set of lines-2
The task from file 1 is to delete all the lines that occur in file 2 and write the result to file 3.
BUT the problem is this. There are about 2 million lines in file 1, 20 thousand lines in file.
In this algorithm, 1000 lines out of 2 million are processed in 7-8 seconds, which can be seen even with the naked eye, which is very long.
Are there any ideas on how to optimize file processing in order to significantly speed up the process
On one of the resources it was suggested to use hashed tables, but I did not learn how to use them.

Reply

Answer the question

In order to leave comments, you need to log in

5 answer(s)

H

Hint, 2013-11-18
@Hint

I would first check the execution time with a sorted list (in this case, IndexOf uses a smart search algorithm, not a simple enumeration). That is, add "L2.Sorted := True". If the result does not suit you, then experiment further. For example, consider using a TDictionary hash table and the ContainsKey method for L2 instead of TStringList.
Also, you have memory leaks. In the exit case, the objects are not destroyed. Use try and finally.

M

Masterme, 2013-11-18
@Masterme

Pseudocode if first duplicated into a list:

L3 = L1.clone
for line in L3
    if line not in L2
        L3.remove line
    endif
endfor
L3.savetofile '3.txt'

Pseudocode if you write directly to a file (if you work only with files, then stringlists are not needed, you can read from the first file line by line):

for line in L1
    if line not in L2
        line.append_to_file '3.txt'
    endif
endfor

Now to the question of checking a string in L2 and hashes.
There are two concepts of hashes.
The first is a hash function from some object H(S), such that if S1=S2, then H(S1)=H(S2). This means that to check the equality of objects, it is enough to compare their hashes. Comparison objects can be strings, files, and so on. The hash function is usually MD5.
The second is hash arrays or associative arrays . They are good because the search for a value in them is always O(1) in complexity. That is, to check, you do not need to iterate over all the elements.
Now an example:
let's say we have three strings: 'abc', 'def', 'ghi'.
md5('abc') = '900150983cd24fb0d6963f7d28e17f72'
md5('def') = '4ed9407630eb1000c0f6b63842defa7d'
md5('ghi') = '826bbc5d0522f5f20a1da4b60fa8c871'
building a hash array

L2_hash = {
  '900150983cd24fb0d6963f7d28e17f72' = 'abc',
  '4ed9407630eb1000c0f6b63842defa7d' = 'def',
  '826bbc5d0522f5f20a1da4b60fa8c871' = 'ghi'
}

now, when checking if there is a string some_string, you can do this (pseudocode)

if L2_hash[md5(some_string)]
  // строка содержится
endif

moreover, it is not necessary to store strings, you can do this

L2_hash = {
  '900150983cd24fb0d6963f7d28e17f72' = 1,
  '4ed9407630eb1000c0f6b63842defa7d' = 1,
  '826bbc5d0522f5f20a1da4b60fa8c871' = 1
}

All this is pseudocode, I already forgot Pascal. I hope the concept is clear

N

Nikita Permin, 2013-11-18
@NekitoSP

My advice to you is don't use TStringList to handle large amounts of data. It is better to load your files into simple arrays of strings and then you can bypass them with the same enumeration. The result will be noticeable to the naked eye, I say this from my experience.

M

M_PRO, 2013-12-29
@M_PRO

Option 1 - insert ProgressBar, Application.ProcesMessages, cursor - watch.
Option 2 - don't use IndexOf in such a context.
Pseudocode:

List1.Sorted := True;
List2.Sorted := True;
Index1 := 0;
Index2 := 0;
while (Index1 < List1.Count - 1) and (Index2 < List1.Count - 1) do
begin
   Cmp := CompareStrings(List1[Index1],
                                         List2[Index2]);
   if Cmp = 0 then Inc(Index1);
   if Cmp > 0 then Inc(Index2);
   if Cmp < 0 then 
   begin
      List3.Add(List1[Index1]);
      Inc(Index1);
   end;
end;

It will be easier and faster than hashes.

A

Arman Toksimbaev, 2014-01-04
@toxicdream

LoadFromFile creates a copy of the file in RAM.
This can be very long with a large file.
In your case, List1 is best left on disk and read from it line by line.
List3 is also not stored in memory, but immediately written to disk.
List2 can be left in memory, but as already mentioned, it is desirable to pre-sort it.