big data
utopialvo, 2021-11-07 17:51:25

How to load a large dataset into the server's memory (+ a couple of related problems)?

There is a dataset (raw text, but formatted so that pandas can read it) with dimensions (4000, 20,000,000); it weighs about 550 GB. There is also a server with 112 cores, 1.7 TB of RAM, and 3 TB of swap.
The data looks like this: the first 6 columns are junk and irrelevant to processing; the rest contain only 2, 1, 0, and NaN. The question is: is it even possible to load this data into the server's memory without using Big Data tools?
My attempts to solve the problem:
A direct attempt to load the data with pandas was unsuccessful. As I understand it, pandas does not cope well with data that has millions of columns.
I transposed the dataset with an external tool. The data is now (20,000,000, 4000), which pandas should be able to read without trouble. I then tried to read the data again. The process started, but it exhausted the server's memory.
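What I have in mind for the next attempt is a chunked read with an explicit narrow dtype (the file name and chunk size below are made up). By default pandas ends up with 8-byte columns (int64, or float64 where NaN is present), i.e. about 4000 * 20,000,000 * 8 bytes ≈ 640 GB just for the values, so float32 should roughly halve that:

import numpy as np
import pandas as pd

# Chunked read with an explicit dtype (path and chunk size are assumptions).
reader = pd.read_csv(
    "data_transposed.csv",   # hypothetical path to the (20,000,000 x 4000) file
    dtype=np.float32,        # default 8-byte columns would need ~640 GB for the values alone
    chunksize=1_000_000,     # rows per chunk
)

n_rows = 0
for chunk in reader:
    n_rows += len(chunk)     # placeholder for the real per-chunk processing
print(n_rows)
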
After loading into memory, I need to reduce the dimensionality with PCA. I thought about splitting the file into chunks of several million rows and working on it in parts. The chunks load quite quickly and are perfectly workable. But, as I understand it, this is not valid mathematically: principal component analysis has to be performed on the whole dataset, not on its parts.
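If batch-wise fitting is acceptable after all, scikit-learn has IncrementalPCA, which is fitted chunk by chunk via partial_fit. A rough sketch of what I would try; the chunk file names are made up, and filling NaN with 0 is only a placeholder:

import glob
import numpy as np
import pandas as pd
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=100)   # target dimensionality is a guess

# Pass 1: fit the PCA batch by batch.
for path in sorted(glob.glob("chunk_*.csv")):        # hypothetical chunk files
    chunk = pd.read_csv(path, dtype=np.float32)
    ipca.partial_fit(chunk.fillna(0).to_numpy())      # PCA cannot take NaN; 0 is a stand-in

# Pass 2: transform each chunk and stack the reduced rows.
reduced = []
for path in sorted(glob.glob("chunk_*.csv")):
    chunk = pd.read_csv(path, dtype=np.float32)
    reduced.append(ipca.transform(chunk.fillna(0).to_numpy()))
X_reduced = np.vstack(reduced)                        # shape (n_rows, 100)
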
I also thought about dropping columns with zero and near-zero variance using VarianceThreshold from sklearn to reduce the number of columns. I took a threshold of 20%, i.e. VarianceThreshold(0.2), and got the following error: "ValueError: No feature in X meets the variance threshold 0.20000". As I understand it, this means that every column would be removed?
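A tiny reproduction of how the threshold behaves on this kind of data (the toy array is made up). A column containing only 0 and 1 cannot have variance above 0.25, so a 0.2 threshold is very aggressive here:

import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy columns imitating the real data: values 0/1/2, one constant column.
X = np.array([
    [0, 0, 2],
    [0, 1, 2],
    [0, 0, 0],
    [0, 1, 2],
], dtype=np.int8)

print(X.var(axis=0))               # per-column variance: [0.   0.25 0.75]
sel = VarianceThreshold(0.2)       # keeps only columns with variance > 0.2
print(sel.fit_transform(X).shape)  # (4, 2): the constant column is dropped
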

2 answers
ScriptKiddo, 2021-11-07

I think it's better to use some kind of DBMS for volumes like this.
1) You can try clickhouse-local:
https://clickhouse.com/docs/en/operations/utility...

$ echo -e "1,2\n3,4" | clickhouse-local --query "
    CREATE TABLE table (a Int64, b Int64) ENGINE = File(CSV, stdin);
    SELECT a, b FROM table;
    DROP TABLE table"
Read 2 rows, 32.00 B in 0.000 sec., 4987 rows/sec., 77.93 KiB/sec.
1   2
3   4

2) Also look at Dask + Parquet:
https://docs.dask.org/en/stable/dataframe-best-pra...
https://docs.dask.org/en/stable/10-minutes-to-dask.html
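Roughly like this (the paths, dtype and block size are placeholders): convert once to Parquet, then later reads only touch the columns you actually need:

import dask.dataframe as dd

# Hypothetical input file; an explicit narrow dtype keeps each partition small.
df = dd.read_csv("data_transposed.csv", dtype="float32", blocksize="256MB")

# One-off conversion to a columnar format.
df.to_parquet("data_parquet/", write_index=False)

# Later work is lazy and out of core.
df = dd.read_parquet("data_parquet/")
print(len(df))   # row count, computed partition by partition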

rPman, 2022-02-23

I apologize for the necropost.
4000 columns * 20 million rows * 1 byte per value (you don't even need to pack it into bits, even though with only 4 possible values you could) is 80,000,000,000 bytes, i.e. about 80 GB of data read as a plain matrix.
What kind of processing do you need? A simple array handles linear reading just fine.
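For example, after a one-off conversion to a flat int8 binary file (the conversion step and the file name are assumptions; NaN would need a sentinel value such as -1), the whole matrix can be memory-mapped and scanned linearly:

import numpy as np

# Hypothetical flat binary file: 20,000,000 * 4000 int8 values = 80 GB on disk.
X = np.memmap("values.int8", dtype=np.int8, mode="r",
              shape=(20_000_000, 4_000))

# Linear reads stream from disk; only the touched pages sit in RAM.
col_sums = np.zeros(4_000, dtype=np.int64)
for start in range(0, X.shape[0], 1_000_000):
    block = X[start:start + 1_000_000]
    col_sums += block.sum(axis=0, dtype=np.int64)
print(col_sums[:10])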
