K
kolosovas, 2019-07-11 13:55:19
PHP

How to organize the processing of large amounts of data?

There is a field of integers from 0 to 15,000,000.
Every three days, sequences of 500,000 values are calculated from this field (the algorithms that produce the sequences change, so the values can be treated as random). There are 15,000,000 such sequences, and each sequence is sorted in ascending order. From these, you need to choose the N sequences with the maximum coverage of the field. The calculations have to fit into 3 hours and use a minimum of computing resources.
Example:
$sequence1 = array(n1=>1, n2=>5, n3=>100, ..., n500000=>14900999) // 1st sequence
$sequence2 = array(n1=>4, n2=>5, n3=>99, ..., n500000=>14900999) // 2nd sequence
...and so on for sequences $sequence1, $sequence2, ...
For N sequences, we count the total number of unique field values they cover.
We choose the combination with the maximum total.
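For the two example sequences above, the value to maximize could be counted roughly like this (illustrative only):

// Illustrative only: coverage of the two example sequences above
$covered = count(array_unique(array_merge($sequence1, $sequence2)));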
Questions about the task:
I. How is it best to organize storage? Or is there perhaps a solution that avoids storing all the data?
II. Which tools should I choose for the calculations?
III. What fast algorithms are there for comparing large amounts of data? I need to compare 500,000 records against 500,000 records.
IV. Can this problem be solved at home?
My thoughts:
I. Calculating one sequence takes a long time, about 3 minutes, so I assume the sequences have to be stored.
Ideas that didn't pan out:
SQL: an indexed table for fast comparisons (one table > 10 MB, 15,000,000 tables is 150 TB; nobody will give me that much space).
TXT: a comma-separated file (one file > 4 MB, the whole data set fits into 60 TB, which is also a lot).
ZIP: one file > 1 MB, the whole data set fits into 15 TB; closer, but still a lot.
Maybe there are other ideas for storing the data?
II. I took PHP and SQL since I am familiar with them; other tools are possible.
III. I tried comparing indexed tables with JOIN. The speed is acceptable, but it requires 15,000,000 indexed tables, which is a lot of storage.
Comparison in PHP with count(array_diff($arr1, $arr2)): I could not fit 500,000 values into two arrays (memory error). I tried Redis, which helps, but loading two arrays into it takes a lot of time.
Running through the arrays in a loop and checking whether each value is already present is the brute-force option and the slowest.
IV. Is it possible, with 16 GB of RAM, a Core i7 and a 2 TB HDD, to keep within 3 hours of calculations? And is it realistic to do such calculations in a reasonable time at all?

5 answers
A
Adamos, 2019-07-11
@Adamos

In general, in combinatorial problems storage is the bottleneck, and the first thing to think about is not how to optimize it but how to avoid it. That is, reduce the algorithm to streaming processing of the data as it arrives, immediately discarding whatever is not relevant for further work.
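For example, a minimal sketch of such streaming processing in PHP, assuming each sequence can be scored against a running coverage bitmap the moment it is produced; produceSequence(), $sequenceIds and MIN_USEFUL_GAIN are placeholders, not anything from the question:

// Sketch only: keep a running coverage bitmap and decide per sequence whether
// it is worth keeping, without ever storing all 15M sequences.
const FIELD_SIZE = 15000000;
const MIN_USEFUL_GAIN = 10000;                 // placeholder threshold

$coverage = str_repeat("\0", intdiv(FIELD_SIZE, 8) + 1); // 1 bit per field value, ~1.9 MB
$kept = array();

foreach ($sequenceIds as $id) {
    $seq = produceSequence($id);               // the 3-minute calculation (placeholder)
    $gain = 0;
    foreach ($seq as $v) {                     // how many values are not covered yet?
        if (((ord($coverage[$v >> 3]) >> ($v & 7)) & 1) === 0) {
            $gain++;
        }
    }
    if ($gain < MIN_USEFUL_GAIN) {             // not useful: discard immediately
        continue;
    }
    foreach ($seq as $v) {                     // merge into the running coverage
        $coverage[$v >> 3] = chr(ord($coverage[$v >> 3]) | (1 << ($v & 7)));
    }
    $kept[] = $id;
}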

X
xmoonlight, 2019-07-11
@xmoonlight

When analyzing the next sequence, you need criteria for stopping and moving on to the next one. Store the best results achieved so far (the criteria) in an intermediate stack. As soon as the current set is worse than the criterion, exit the enumeration of the current set's values immediately (break).
Initially take all the sets at once and run through their values in parallel; then filter sets out as they fall behind the chosen criteria.
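A sketch of that early exit, assuming the candidate sequence is a plain 0-indexed array and $isCovered is a placeholder callback that reports whether a value is already covered by the current selection:

// Sketch: stop scanning a candidate as soon as it can no longer beat the best.
function uniqueCountWithCutoff(array $seq, callable $isCovered, int $bestSoFar)
{
    $count = 0;
    $n = count($seq);
    foreach ($seq as $i => $v) {
        if (!$isCovered($v)) {
            $count++;
        }
        // Even if every remaining value were new, the current best cannot be beaten: stop.
        if ($count + ($n - $i - 1) <= $bestSoFar) {
            return -1;                         // rejected early
        }
    }
    return $count;
}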

R
rPman, 2019-07-11
@rPman

In your case, even if the numbers are stored as 4-byte values, you need 15,000,000 * 500,000 * 4 = 30,000,000,000,000 bytes, i.e. 30 terabytes. That is just a linear blob, a file format of 4 bytes per number, and without indexes (those appear once you need search queries over the selections). Don't reach for universal databases: your case is narrowly specialized, and almost any ready-made solution will make you pay for it in either space or processor time.
You can't get away from these numbers!
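A minimal sketch of such a blob layout in PHP, assuming sequences are written one after another in fixed-size slots (the file name and constants are illustrative):

// Sketch: sequence i occupies a fixed-size slot, so no index is needed,
// the offset is computed directly from i.
const SEQ_LEN   = 500000;
const SLOT_SIZE = SEQ_LEN * 4;                 // 4 bytes per number ('V' = uint32 LE)

function writeSequence($fh, $i, array $values)
{
    fseek($fh, $i * SLOT_SIZE);
    fwrite($fh, pack('V*', ...$values));       // the whole sorted sequence at once
}

function readSequence($fh, $i)
{
    fseek($fh, $i * SLOT_SIZE);
    return array_values(unpack('V*', fread($fh, SLOT_SIZE)));
}

$fh = fopen('sequences.bin', 'c+b');           // hypothetical blob file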
3 minutes per sequence multiplied by 15 million is a death sentence: roughly 31 thousand CPU-days. You would need to keep a cluster of a thousand processors busy for a month, and it is good if you can use a GPU (that can let one machine run not a dozen calculations at a time but hundreds); then you could get by with a dozen Amazon instances and finish in two or three weeks.
Believe me, the cost of storage in this case is so negligible that it is almost funny ;)
You need to speed up the calculations by an order of magnitude. Almost certainly your algorithms are of the same type, and even more likely, data from neighbouring sequences can be reused somewhere in the middle of the algorithm. And who knows, perhaps you can store not the final values of the sequences but only intermediate values from near the end of the calculation, and whenever a final number is needed, run just the last step (for example, if a hundred neighbouring numbers differ only in the last step of the algorithm, you store 100 times less data and perform only the last stage of the calculation for each value; even if that takes a second, it is a good price for a hundredfold saving in space).
A good example: you need to calculate the Jacobian matrix of a neural network by perturbing the weights one at a time by +e and -e. That is, you need to compute a matrix of N*N numbers, where N is the number of weights in the network. Solving the problem head-on means O(N^3) calculations, which is a lot. But since only one weight changes for each entry of the matrix, in almost half of the cases the calculations use the same numbers (especially when the changed weight is close to the end of the network), so if you store the intermediate values of the calculations you can avoid recomputing them. In practice you do not have to store EVERYTHING permanently; it is enough to know the order in which the calculations are performed (for example, you can evaluate the network recursively from the end, keeping the intermediate values on the stack).

S
Sergey Sokolov, 2019-07-11
@sergiks

A few unrelated ideas:

  1. Store each "sequence" (set) of 500,000 numbers (however many there are) as a 1-bit image, 3873*3873 px, to cover the range 0..15e6. There will be 15 million such pictures. A black pixel means the number is present, a white one means it is absent. You could overlay the pictures and see how dark the result gets, but doing that digitally is inefficient; it would only make sense as an analogue trick.
  2. Store the sequence as a binary string where the set bits mark the selected numbers. 15e6 bits is about 1875e3 bytes =~ 1.9 MB per set; 1875e3 * 15e6 = 28125e9 bytes = ~28 TB. (A sketch of this is shown right after this list.)
  3. Store it as a binary file with 3 bytes (24 bits) per number; 0..15 million fits perfectly: 2^24 = 16 777 216. See the PHP pack() / unpack() functions. One set is 500000 * 3 = 1.5 MB, 15M sets are 22.5 TB.
  4. Don't store everything. Full coverage of the 0..15M range by perfectly matched 500K-value sets would require only 30 such sets.
  5. Hypothesis: if the samples are really random, you can take any N sequences; they will be only slightly worse than the "true" maximum.
  6. "Calculations of one sequence take a long time, about 3 minutes": 180 seconds * 15e6 = 27e8 seconds, which is almost 86 years. How were you planning to fit that into a few days?
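A rough sketch of idea 2 in PHP, assuming each sequence is available as an array of integers in 0..15e6 (the helper names are illustrative):

// One bit per field value; 1,875,001 bytes covers 0..15,000,000 inclusive (~1.9 MB).
const MASK_BYTES = 1875001;

// Turn one sequence into a binary string with the corresponding bits set.
function toBitmask(array $seq)
{
    $mask = str_repeat("\0", MASK_BYTES);
    foreach ($seq as $v) {
        $mask[$v >> 3] = chr(ord($mask[$v >> 3]) | (1 << ($v & 7)));
    }
    return $mask;
}

// Coverage of a group of sequences = number of set bits in the OR of their masks.
function coverage(array $masks)
{
    static $popcount = null;
    if ($popcount === null) {                  // per-byte popcount table
        $popcount = array();
        for ($b = 0; $b < 256; $b++) {
            $popcount[$b] = substr_count(decbin($b), '1');
        }
    }
    $union = str_repeat("\0", MASK_BYTES);
    foreach ($masks as $m) {
        $union |= $m;                          // PHP ORs binary strings byte by byte
    }
    $bits = 0;
    for ($i = 0; $i < MASK_BYTES; $i++) {
        $bits += $popcount[ord($union[$i])];
    }
    return $bits;
}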

Y
Yuri, 2019-07-12
@riky

Comparison in php count(array_diff($arr1, $arr2)), it was not possible to cram 500,000 values into two arrays, a memory error

I myself recently dabbled in intersections; there were arrays of more than 10M numbers, not even sorted.
1) It's easy to fit 500k values into PHP, just raise memory_limit with ini_set().
2) Of course you shouldn't use array_diff; use array_diff_key, it will be an order of magnitude faster, since there is an index on the keys. The arrays must of course be flipped first with array_flip(). Even including the flip it is an order of magnitude faster.
3) In the end I did it in Go; I don't remember exactly, but it turned out probably 3-5 times faster. It's hard to compare exactly, since in PHP the data loading was also quite slow, and it consumes much more memory. If you need to compute the intersection of sorted lists, you make a loop that runs through both arrays simultaneously in a single pass.
More or less like this:
// Count how many values the two sorted slices have in common,
// walking both in a single pass: O(len(ids1) + len(ids2)).
func intersectCount(ids1, ids2 []uint32) int {
  j := 0
  cnt := 0
  for i := 0; i < len(ids1); i++ {
    // advance j until ids2[j] >= ids1[i]
    for j < len(ids2) && ids2[j] < ids1[i] {
      j++
    }
    if j < len(ids2) && ids2[j] == ids1[i] {
      cnt++
    }
  }
  return cnt
}

In PHP, of course, it makes no sense to write such a loop yourself, because array_diff_key is implemented in C and will be an order of magnitude faster.
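A rough sketch of the array_flip / array_diff_key approach from point 2, assuming $arr1 and $arr2 each hold one sequence as a plain array of unique integers:

ini_set('memory_limit', '1G');                 // point 1: just raise the limit

$k1 = array_flip($arr1);                       // value => position, key lookup is fast
$k2 = array_flip($arr2);

$onlyIn1 = count(array_diff_key($k1, $k2));    // values of $arr1 missing from $arr2
$common  = count($arr1) - $onlyIn1;            // size of the intersection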
Well, on the task in general: you have already been told here that you won't find the ideal solution at home. Just look for any good one that is acceptable for the task; the fewer resources you have, the worse it is likely to be.
I had 1000 lists of numbers, each containing from 1 to 15 million uint32 values, and I needed to count the intersection of each with each. In one thread on a not very powerful computer it took about 3-4 hours.
Reading from disk takes a lot of time, so I loaded lists into memory in batches of 200, calculated the intersection of each with each, then loaded the next batch, and so on.
Computing the intersection of 15 million lists each with each head-on in 3 hours looks unrealistic. You need a cheap way to select a small number of suitable candidates in a first round and then look for the optimum among those.
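One possible (hypothetical) shape for that cheap first round: build a small, coarse signature per sequence, shortlist by signature, and only then evaluate the shortlist exactly. The bucket size and scoring below are arbitrary placeholders, not something from this thread:

// Sketch: a coarse signature with one counter per bucket of 65,536 field values
// (about 229 buckets for 0..15e6), cheap to score against the current selection.
const BUCKET_SHIFT = 16;                       // 2^16 values per bucket

function signature(array $seq)
{
    $sig = array();
    foreach ($seq as $v) {
        $b = $v >> BUCKET_SHIFT;
        $sig[$b] = isset($sig[$b]) ? $sig[$b] + 1 : 1;
    }
    return $sig;                               // bucket => how many values fall in it
}

// Cheap scoring pass: prefer sequences that fill buckets the current selection lacks.
function cheapScore(array $sig, array $selectedBucketFill)
{
    $score = 0;
    foreach ($sig as $bucket => $count) {
        $missing = (1 << BUCKET_SHIFT) - ($selectedBucketFill[$bucket] ?? 0);
        $score += min($count, max(0, $missing));
    }
    return $score;
}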
