Working with a huge number of files?

Software design

Richard_Novozhilov, 2021-12-29 16:25:45

C++, C#, or Rust? The task is to process a large number of files (more than 2 million). Which technology should I choose?

4 answer(s)

Saboteur, 2021-12-29
@saboteur_kiev

bash will do just fine

find . -name "file.ext" -print0 | xargs -0 -n 10 -P 10 grep "phrase"

rPman, 2021-12-29
@rPman

If all the files are on one physical device, multithreading is unnecessary and even harmful. To search across multiple drives, simply launch multiple instances of the search application, each with its own list of files on its own drive.
Reading files sequentially to search for a substring is a very simple task. Take C++, write a loop around fgets (if you need line-by-line processing), and prepare the search strings in every encoding in use as sets of char* bytes (ideally as constants, i.e. by generating code), then simply look for them in each line with strstr. If there are many such strings, prepare character-by-character search tables (generate successively nested switch cases). This approach is about as fast as it gets and lets you process millions of lines per second.
P.S. If needed, the GUI can be written in one language (C# .NET) and the search in C++, with the GUI launching the search application and passing the necessary parameters on the command line or in a special file.
P.P.S. If the search needs to be run often, could you load these files into a database and build indexes for the data you are looking for?
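
A minimal sketch of the sequential line-by-line scan described above, in case it helps to see it spelled out. It is an illustration under assumptions, not a tuned implementation: the function names count_matching_lines and count_matching_lines_in_file are hypothetical, it uses std::getline and std::string::find in place of fgets and strstr, and it does not build the generated switch tables mentioned for the many-strings case.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <istream>
#include <string>
#include <vector>

// Counts the lines of `in` that contain at least one of `needles`.
// Each needle is a raw byte sequence, so the same phrase can be listed
// once per encoding that occurs in the files.
// (Hypothetical helper name, chosen for this illustration.)
std::size_t count_matching_lines(std::istream& in,
                                 const std::vector<std::string>& needles) {
    assert(!needles.empty() && "at least one search string is required");
    std::size_t hits = 0;
    std::string line;
    while (std::getline(in, line)) {
        for (const std::string& needle : needles) {
            if (line.find(needle) != std::string::npos) {
                ++hits;
                break;  // count each line once, even if several needles match
            }
        }
    }
    return hits;
}

// Opens `path` in binary mode so bytes are compared exactly as stored,
// then scans the file sequentially. Returns 0 if the file cannot be opened.
// (Hypothetical helper name, chosen for this illustration.)
std::size_t count_matching_lines_in_file(const std::string& path,
                                         const std::vector<std::string>& needles) {
    std::ifstream file(path, std::ios::binary);
    if (!file) {
        return 0;
    }
    return count_matching_lines(file, needles);
}
```

A driver would call count_matching_lines_in_file once per path from its file list, one process per physical drive as suggested above. With many search strings, the inner loop over needles is the part that a generated search table would replace.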

Griboks, 2021-12-29
@Griboks

The bottleneck is in the file system. Everything else doesn't matter.

Vladimir Korotenko, 2021-12-30
@firedragon

https://www.elastic.co/elasticsearch/features#elas...
Plus some kind of web front end. Search will be much faster than anything you build yourself.