How to organize the processing of large amounts of data?
There is a field of integers from 0 to 15,000,000.
Every three days, sequences of 500,000 values each are calculated from this field (the algorithms that generate the sequences change and can be treated as random). There are 15,000,000 such sequences, each sorted in ascending order. From these, N sequences with the maximum coverage of the field must be chosen. The goal is to fit the calculations into 3 hours while using a minimum of computing resources.
Example:
$sequence1 = array(n1=>1, n2=>5, n3=>100.....n500000 => 14900999) //1st sequence
$sequence2 = array(n1=>4, n2=>5, n3=>99.....n500000 => 14900999) //2nd sequence
Sequences $sequence1, $sequence2......
For N sequences, we count the sum of unique field values they cover.
We choose the sequences with the maximum sum.
Questions about the task:
I. What is the best way to organize storage, or is there perhaps a solution that avoids storing all the data?
II. What tools should be chosen for the calculations?
III. What fast algorithms are there for comparing large amounts of data? 500,000 records have to be compared against 500,000 records.
IV. Can this problem be solved at home?
My thoughts:
I. Calculating one sequence takes a long time, about 3 minutes, so I intend to store the sequences.
Ideas that were discarded:
SQL: an indexed table for quick comparisons (one table > 10 MB, so 15,000,000 tables is 150 TB; nobody will give me that much space).
txt: a comma-separated file (one file > 4 MB, the whole data set fits into 60 TB, which is also a lot).
zip: one file > 1 MB, the whole data set fits into 15 TB; closer to the mark, but still a lot.
Are there other ideas for storing the data?
II. I took PHP and SQL because I am familiar with them; other tools are possible.
III. I tried comparing indexed tables with JOIN; the speed is acceptable, but it requires 15,000,000 indexed tables, which is a lot of storage.
Comparison in PHP with count(array_diff($arr1, $arr2)): I could not fit 500,000 values into two arrays without a memory error. I tried Redis, which helps, but loading the two arrays into it takes a lot of time.
Running through the arrays in a loop and checking whether each value is already there: the brute-force option, and the slowest.
IV. Is it possible, with 16 GB of RAM, a Core i7 and a 2 TB HDD, to fit within 3 hours of calculations? And is it even possible to do such calculations in a reasonable time?
In general, in combinatorial problems storage is the bottleneck, and the first thing to think about is not how to optimize it but how to avoid it. That is, reduce the algorithm to streaming processing of the data as it arrives, immediately discarding whatever is irrelevant for further work.
When analyzing the next sequence, you need criteria for stopping and moving on to the next one. Keep the best results achieved so far (the criteria) in an intermediate store. As soon as the current set is worse than the criterion, immediately break out of the enumeration of its values.
Initially we take all the sets at once and walk through their values in parallel; then we filter sets out as they fall behind the selected criteria.
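A minimal sketch of that idea in Go (my own illustration, not the author's code): a bitset over the 15,000,001 possible field values tracks what is already covered, and the scan of a candidate sequence gives up as soon as it can no longer beat the best gain seen so far. Names such as gain, markCovered and fieldSize are assumptions; the sequences are assumed to arrive as sorted []uint32 slices.

const fieldSize = 15_000_001 // possible field values: 0..15,000,000

type bitset []uint64

func newBitset(n int) bitset       { return make(bitset, (n+63)/64) }
func (b bitset) has(v uint32) bool { return b[v/64]&(1<<(v%64)) != 0 }
func (b bitset) set(v uint32)      { b[v/64] |= 1 << (v % 64) }

// gain counts how many values of seq are not yet covered. It gives up early
// once even counting every remaining value could not beat `best` -- the
// "break as soon as the current set is worse than the criterion" idea above.
func gain(covered bitset, seq []uint32, best int) int {
	cnt := 0
	for i, v := range seq {
		if cnt+(len(seq)-i) <= best {
			return 0 // cannot win any more, abandon this sequence
		}
		if !covered.has(v) {
			cnt++
		}
	}
	return cnt
}

// markCovered records a chosen sequence in the coverage bitset.
// Per greedy round: covered := newBitset(fieldSize); stream the candidates,
// keep the one with the largest gain, then markCovered it.
func markCovered(covered bitset, seq []uint32) {
	for _, v := range seq {
		covered.set(v)
	}
}

The coverage bitset is only about 1.9 MB, so the whole selection state fits in RAM and the candidate sequences themselves can be streamed from disk one at a time; repeating such a round N times is the usual greedy approach to maximum coverage.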
In your case, even if the numbers are 4-byte floats, you need to store 15,000,000 * 500,000 * 4 = 30,000,000,000,000 bytes, i.e. 30 terabytes. That is just a linear blob, a file format of 4 bytes per number, and without indexes (those appear when you need search queries over the selections). Do not try to take general-purpose databases: your case is a narrow specialization, and almost any ready-made solution will make you pay for it with either space or processor time.
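For illustration, a minimal sketch of such a blob format in Go, under the assumption of one file per sequence and 4-byte little-endian values (the layout and the names writeSeq/readSeq are mine):

import (
	"encoding/binary"
	"os"
)

// writeSeq stores one sorted sequence as a plain binary blob:
// 4 bytes per value, little-endian, no headers and no indexes.
func writeSeq(path string, seq []uint32) error {
	buf := make([]byte, 4*len(seq))
	for i, v := range seq {
		binary.LittleEndian.PutUint32(buf[4*i:], v)
	}
	return os.WriteFile(path, buf, 0o644)
}

// readSeq loads a sequence back; 500,000 values is only ~2 MB per file.
func readSeq(path string) ([]uint32, error) {
	buf, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	seq := make([]uint32, len(buf)/4)
	for i := range seq {
		seq[i] = binary.LittleEndian.Uint32(buf[4*i:])
	}
	return seq, nil
}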
You can't get away from these numbers!
3 minutes per sequence multiplied by 15 million sequences is a death sentence: about 31 thousand CPU-days, so you would need to keep a cluster of a thousand processors busy for a month. And it is good if you can use a GPU (which can let one machine run not a dozen calculations at a time but hundreds); then you could get by with a dozen Amazon instances and finish the count in two or three weeks.
Believe me, next to that the cost of storage is so small it is laughable ;)
You need to speed up the calculations by orders of magnitude. Almost certainly your algorithms are all of the same type, and it is even more likely that data from neighboring sequences can be reused somewhere in the middle of the algorithm. And who knows, you might manage to store not the final values of the sequences but only intermediate ones from near the end of the calculation, and whenever a final number is needed, run just the last step of the computation. For example, if a hundred neighboring sequences differ only in the last step of the algorithm, you store 100 times less data and perform only the final stage for each value; even if that takes a second, it is a good price for a hundredfold saving in space.
A good example: you need to compute the Jacobian matrix for a neural network by perturbing the weights one at a time by +e and -e, i.e. a matrix of N * N numbers, where N is the number of weights in the network. Solving the problem head-on means O(N^3) computations, which is a lot. But since only one weight changes for each entry of the matrix, in almost half of the cases the computation uses the same numbers (especially if the changed weight is close to the end of the network), so if you store the intermediate values of the computation you can avoid recomputing them. In practice you do not have to store EVERYTHING permanently; it is enough to use the knowledge of the order in which the computations are done (no matter in what order, for example from the end): you can evaluate the neural network recursively, keeping these intermediate values on the stack.
A few unrelated ideas:
Regarding "Comparison in php count(array_diff($arr1, $arr2)), failed to cram 500,000 values into two arrays, memory error": since both sequences are already sorted in ascending order, the intersection can be counted in a single merge-style pass with no extra memory, for example in Go:
func intersectCount(ids1, ids2 []uint32) int {
	// Both slices must be sorted ascending; j only ever moves forward,
	// so the whole comparison is one O(len(ids1)+len(ids2)) pass.
	j := 0
	cnt := 0
	for i := 0; i < len(ids1); i++ {
		// advance j until ids2[j] >= ids1[i]
		for ; j < len(ids2) && ids2[j] < ids1[i]; j++ {
		}
		if j < len(ids2) && ids2[j] == ids1[i] {
			cnt++
		}
	}
	return cnt
}
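A quick illustrative call with the truncated sequences from the question (inside any function; the expected result in the comment is mine):

	a := []uint32{1, 5, 100, 14900999} // $sequence1, truncated
	b := []uint32{4, 5, 99, 14900999}  // $sequence2, truncated
	common := intersectCount(a, b)     // common == 2: the shared values are 5 and 14900999
	_ = common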