How to parse 5-7 GB files?

Parsing

Vitaly, 2014-03-14 08:15:09

What methods are there for working with massive objects in memory? I need to read a file of about 5-6 GB, split it into columns, then filter and dedupe by column. That much data does not fit comfortably in memory, so how could I implement this?

7 answer(s)

caper, 2014-03-14
@vipuhoff

If the file structure is known (for example, a delimited file), then:
1. Load it into an intermediate (staging) MS SQL table using BULK INSERT (or bcp) (0.5 GB in 12 sec, 4 GB in 1.5 min).
2. Run the data checks/transformations and distribute the data into the target tables from the staging table (how long this takes depends on the complexity of the checks/transformations, etc.).
If the structure is unknown, things are somewhat worse: either read the file in portions, or perhaps use memory-mapped files?
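
To illustrate the memory-mapped option, here is a minimal sketch in Python (not the MS SQL route described above). The function name `iter_mapped_lines` is made up for this example, and it assumes a non-empty, newline-delimited file; the OS pages the data in on demand, so the whole file never has to sit in RAM at once.

```python
import mmap


def iter_mapped_lines(path):
    """Yield each line of the file as bytes, without its line ending.

    Assumption: the file is non-empty (mapping an empty file raises
    ValueError) and lines are separated by \n or \r\n.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are loaded lazily by the OS.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # mm.readline() returns b"" once the end of the map is reached.
            for line in iter(mm.readline, b""):
                yield line.rstrip(b"\r\n")
```

Each yielded line can then be split on the delimiter and filtered one at a time. On a 32-bit process a multi-gigabyte file may not fit in the address space, so this assumes a 64-bit build.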

Ruslan Lopatin, 2014-03-14
@lorus

The brute-force way: load it all into a database and do whatever you want with it there.
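
A rough sketch of what "load it into a database" could look like, using Python's built-in `sqlite3` purely as a stand-in for whichever database you actually use. The table name `staging`, the column names `c1`..`c3`, the `|` delimiter, and the assumption that every row has exactly three fields are all invented for the example.

```python
import csv
import itertools
import sqlite3


def load_into_staging(conn, path, delimiter="|", batch_size=50_000):
    """Stream a 3-column delimited file into a staging table in batches."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staging (c1 TEXT, c2 TEXT, c3 TEXT)"
    )
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=delimiter)
        while True:
            # Pull at most batch_size rows so memory use stays bounded.
            batch = list(itertools.islice(reader, batch_size))
            if not batch:
                break
            # Assumption: every row has exactly three fields.
            conn.executemany("INSERT INTO staging VALUES (?, ?, ?)", batch)
    conn.commit()


def distinct_rows(conn):
    """Let the database do the deduplication."""
    return conn.execute(
        "SELECT DISTINCT c1, c2, c3 FROM staging ORDER BY c1, c2, c3"
    ).fetchall()
```

Usage would be along the lines of `conn = sqlite3.connect("work.db")`, then `load_into_staging(conn, "data.txt")`, after which filtering is an ordinary `WHERE` clause.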

gleb_kudr, 2014-03-15
@gleb_kudr

FileStream reads your file in chunks of whatever size you like. It hands you the bytes sequentially, one buffer at a time, and you can do anything you want with each buffer (for example, process it and write it to another file). You just need to think the algorithm through properly so that it needs as few passes as possible.
StreamReader does the same for text files.
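
The same chunked-read idea, sketched in Python rather than with the .NET FileStream class. The names `iter_lines_chunked` and `filter_file` and the 1 MiB default chunk size are assumptions for illustration. The one subtle point is that a chunk boundary can fall in the middle of a line, so the unfinished tail has to be carried over to the next chunk.

```python
def iter_lines_chunked(path, chunk_size=1 << 20):
    """Yield lines as bytes (without the trailing \n), reading chunk_size bytes at a time."""
    tail = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            parts = (tail + chunk).split(b"\n")
            # The last piece may be an incomplete line; keep it for the next chunk.
            tail = parts.pop()
            for line in parts:
                yield line
    if tail:
        # Final line of a file that does not end with a newline.
        yield tail


def filter_file(src, dst, keep):
    """Single pass: copy to dst only the lines for which keep(line) is true."""
    with open(dst, "wb") as out:
        for line in iter_lines_chunked(src):
            if keep(line):
                out.write(line + b"\n")
```

For example, `filter_file("in.txt", "out.txt", lambda line: line.split(b"|")[0] != b"")` would drop rows whose first column is empty in one pass over the file.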

afiskon, 2014-03-15
@afiskon

As I understand it, you just have lines of text there? Then cat ... | cut -d '|' -f 1,2,3 | sort, or something like that.
Or write a script in Perl/Python/whatever. 5-7 GB is not that much. Only if you need further grouping or joins against other files is it worth using some kind of RDBMS.
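
A possible shape for such a script in Python, as a sketch only. The function name `cut_and_dedupe`, the choice of the first three fields, and the policy of skipping short lines are assumptions. It keeps a 16-byte digest per distinct key instead of the key itself, which helps only as long as the number of distinct keys still fits in memory; beyond that, sort-based deduplication (as in the pipeline above with sort -u) is the safer route.

```python
import hashlib


def cut_and_dedupe(src, dst, fields=(0, 1, 2), delimiter="|"):
    """Write the selected columns of each line to dst, skipping repeated keys.

    Returns the number of lines written. Output keeps first-seen order;
    it is not sorted.
    """
    seen = set()
    kept = 0
    with open(src, encoding="utf-8") as inp, open(dst, "w", encoding="utf-8") as out:
        for line in inp:
            cols = line.rstrip("\n").split(delimiter)
            if len(cols) <= max(fields):
                continue  # assumption: lines with too few columns are dropped
            key = delimiter.join(cols[i] for i in fields)
            # Store a fixed-size digest rather than the full key to save memory.
            digest = hashlib.blake2b(key.encode("utf-8"), digest_size=16).digest()
            if digest in seen:
                continue
            seen.add(digest)
            out.write(key + "\n")
            kept += 1
    return kept
```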

Vitaliy, 2014-03-17
@vipuhoff

Thanks, everyone! I will read the files through FileStream, pre-process them, and store them in the database!

Nikolai Turnaviotov, 2014-03-18
@foxmuldercp

I recently wrote a small utility that generates T-SQL statements from a CSV file for inserting into a table. I can share the C# source code; with a few additions you could write directly to the table.