Database
hostovik, 2020-09-14 09:05:35

What should I convert a huge CSV file (hundreds of GB) into for the fastest possible reading by "key"?

There is a CSV file, anywhere from tens of GB up to 1 TB, i.e. much larger than my RAM.
A table row has about 5 fields; the size of a field is not fixed.
The number of rows is from 100 million to 2 billion. It will be read by a single user, with no writes to the file.

The table has a unique field, "hash". What should I convert the CSV file to in order to access a row by that key as quickly as possible?
A) SQL, something like PostgreSQL. Convenient, but that database supports multi-user writes, replication; I don't need any of that, it's overkill.
B) SQL like SQLite. It flies on small volumes, but I'm not sure it handles large files well, including building indexes quickly.
C) A NoSQL database like MongoDB?
D) Plain files with my own indexes, processed in Python (roughly like the sketch below).
I think option D is the best. Or are there other options? Where exactly should I look?
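For context, by option D I mean a side index that maps each hash to the byte offset of its row, so a lookup is one seek() plus one readline(). A minimal sketch; the file names, the position of the hash column and the use of pickle are all assumptions for illustration:

```python
import csv
import pickle

CSV_PATH = "data.csv"        # assumed source file
INDEX_PATH = "data.offsets"  # assumed index file: pickled dict hash -> byte offset
HASH_COLUMN = 0              # assumed position of the "hash" field

def build_index():
    """One pass over the CSV, remembering where each row starts."""
    index = {}
    with open(CSV_PATH, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            key = next(csv.reader([line.decode("utf-8")]))[HASH_COLUMN]
            index[key] = offset
    with open(INDEX_PATH, "wb") as f:
        pickle.dump(index, f)

def lookup(key):
    """Seek straight to the row for a given hash."""
    with open(INDEX_PATH, "rb") as f:
        index = pickle.load(f)
    with open(CSV_PATH, "rb") as f:
        f.seek(index[key])
        return next(csv.reader([f.readline().decode("utf-8")]))
```

The obvious catch: with 2 billion rows the in-memory dict of offsets stops fitting in RAM as well, which is exactly the problem the answers below solve with ready-made storage engines.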


7 answers
h4r7w3l1, 2020-09-24
@hostovik

Apache Parquet
A columnar binary format, well suited to compact storage of data such as CSV tables.
Compression of a 1 TB CSV source is on the order of 80-85%, i.e. ~120 GB in Parquet.
Read speed is roughly 34x faster than the raw CSV file.
Thanks to the columnar layout, it is possible to run selections without reading the entire file.
But this is first of all a solution for storing and reading data; working with the format has its own intricacies, see the documentation. There are many subtleties, and it may turn out not to suit your task at all.
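To give an idea of what this looks like in practice, here is a rough sketch of converting the CSV in chunks and then reading rows by key with pyarrow. The file names, chunk size and the column name "hash" are assumptions:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

CSV_PATH = "data.csv"          # assumed source file
PARQUET_PATH = "data.parquet"  # assumed output file

# Convert the CSV to Parquet in chunks so a 1 TB source never has to fit in RAM.
# (In practice you may need to pin dtypes so every chunk produces the same schema.)
writer = None
for chunk in pd.read_csv(CSV_PATH, chunksize=5_000_000):
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        writer = pq.ParquetWriter(PARQUET_PATH, table.schema)
    writer.write_table(table)
if writer is not None:
    writer.close()

# Look up rows by key: the filter is pushed down, so only the matching
# row groups are read from disk, not the whole file.
rows = pq.read_table(PARQUET_PATH, filters=[("hash", "=", "abc123")])
print(rows.to_pandas())
```

Note that filter pushdown only skips most of the file if the data is sorted or partitioned by hash so row-group statistics can prune; Parquet is built for column scans, and a key-value store will usually answer single-key point queries faster.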

Ronald McDonald, 2020-09-14
@Zoominger

What should I convert the CSV file to in order to access a row by that key as quickly as possible?

Into a SQL table.
MySQL at your service. SQLite will do fine too.
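With SQLite the whole thing fits in the standard library. A minimal sketch; the table layout, file names and batch size are assumptions:

```python
import csv
import sqlite3

con = sqlite3.connect("data.db")  # assumed database file
con.execute("CREATE TABLE IF NOT EXISTS t (hash TEXT, f1 TEXT, f2 TEXT, f3 TEXT, f4 TEXT)")

# Load the CSV in batches so memory use stays flat.
with open("data.csv", newline="") as f:  # assumed source file
    reader = csv.reader(f)
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) >= 100_000:
            con.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?)", batch)
            batch.clear()
    if batch:
        con.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?)", batch)
con.commit()

# Build the index once, after the bulk load -- much faster than inserting
# into an already indexed table.
con.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_hash ON t(hash)")
con.commit()

# A point lookup by key is then a single B-tree probe.
print(con.execute("SELECT * FROM t WHERE hash = ?", ("abc123",)).fetchone())
```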

Alexey Ukolov, 2020-09-14
@alexey-m-ukolov

this database supports multi-user writes, replication; I don't need any of that, it's overkill
What exactly is the downside? If you don't need it, don't use it.
D) Plain files with my own indexes, processed in Python.
I think option D is the best.
Then with every change in requirements you will have to redo everything. Given the volume of data, it is better to pick something relatively flexible right away.

WinPooh32, 2020-09-14
@WinPooh32

An embedded key-value store from Google, LevelDB, to the rescue. There are wrappers for almost every popular programming language.
LevelDB is used to store transactions in the Bitcoin client, and there, mind you, the database has grown past 200 GB.
Just keep in mind that there is no magic once the database does not fit in RAM: the bottleneck becomes the disk subsystem. A fast SSD, especially NVMe over PCIe, gives a big speedup.
So add a cache for frequent requests and, with luck, rejoice.
Since this is an embedded store, you will have to implement any network layer yourself.
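A minimal sketch using the plyvel wrapper; the paths, the CSV layout and the hash column position are assumptions:

```python
import csv
import plyvel

db = plyvel.DB("rows.ldb", create_if_missing=True)  # assumed database directory

# Load the CSV once: key = hash (assumed to be the first column),
# value = the raw row. Batches are flushed every 100k rows so the
# pending writes never pile up in memory.
with open("data.csv", newline="") as f:  # assumed source file
    batch = db.write_batch()
    for i, row in enumerate(csv.reader(f), 1):
        batch.put(row[0].encode(), ",".join(row).encode())
        if i % 100_000 == 0:
            batch.write()
            batch = db.write_batch()
    batch.write()

# Afterwards a point lookup is a single call.
print(db.get(b"abc123"))
db.close()
```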

Viktor T2, 2020-09-14
@Viktor_T2

Lightning Memory-Mapped Database (LMDB)
Tokyo Cabinet
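For LMDB there is the lmdb Python binding. A rough sketch; the map_size, paths and CSV layout are assumptions:

```python
import csv
import lmdb

# map_size is the maximum database size and must be set up front;
# 2 TiB here is an assumption sized for a ~1 TB dataset.
env = lmdb.open("rows.lmdb", map_size=2 * 1024**4)

# Load once: key = hash (assumed first column), value = the raw row.
with open("data.csv", newline="") as f, env.begin(write=True) as txn:
    for row in csv.reader(f):
        txn.put(row[0].encode(), ",".join(row).encode())

# Reads are served straight from the memory-mapped file.
with env.begin() as txn:
    print(txn.get(b"abc123"))
env.close()
```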

Vladimir Korotenko, 2020-09-14
@firedragon

https://www.quora.com/What-tools-are-data-scientis... Look through the answers in this thread; it seems these are exactly your tasks.

uvelichitel, 2020-09-14
@uvelichitel

It will be read by a single user, with no writes to the file.

D. J. Bernstein came up with something for exactly this case: the Constant Database (CDB). Here is a modern pure-Python implementation: https://github.com/bbayles/python-pure-cdb (available on PyPI).
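A rough sketch of how that could be used. It assumes cdblib's Writer64/Reader64 interface (the 64-bit variants, since classic CDB files are limited to 4 GB); if those names differ in the installed version, the standard Writer/Reader work the same way for smaller files. File names and CSV layout are made up:

```python
import csv
import mmap

import cdblib  # pip install pure-cdb

# Build the constant database once; CDB files are write-once, read-many,
# which matches "one reader, no writes" exactly.
with open("rows.cdb", "wb") as f:              # assumed output file
    writer = cdblib.Writer64(f)                # 64-bit variant for files > 4 GB (assumption)
    with open("data.csv", newline="") as src:  # assumed source file
        for row in csv.reader(src):
            writer.put(row[0].encode(), ",".join(row).encode())
    writer.finalize()

# Lookups: mmap the file so the OS pages in only what is actually touched.
with open("rows.cdb", "rb") as f:
    reader = cdblib.Reader64(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ))
    print(reader.get(b"abc123"))
```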
