Mathematics
Ytsu Ytsuevich, 2016-03-08 22:14:49

How to calculate the standard deviation if the mean is unknown?

Suppose there is a very, very large file in which each line contains a number; for simplicity, an integer.
The mean cannot be calculated with the sum / count formula, because the file is huge and the sum will not fit in RAM (just take that as a given). So I came up with this formula (I'm sure I'm not the first):
((i-1) * avg + nextValue) / i, where
i is the number of values processed so far (the current step, starting from 1);
avg is the current mean;
nextValue is the next value read from the file.
For example:
3
3
6
Reading it line by line:
for 3: (0 * 0 + 3) / 1 = 3
for 3: (1 * 3 + 3) / 2 = 3
for 6: (2 * 3 + 6) / 3 = 4
That is, you can stop at any moment and read off the current arithmetic mean.
In summary: the mean changes at every step, and there is no way to look into the future.
Question: how can the standard deviation be found if it requires knowing the mean in advance?
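
The update rule above can be sketched in a few lines of Python. This is only an illustration: the function name `running_means` is made up here, and the values are assumed to arrive as an ordinary iterable rather than from a real file.

```python
def running_means(values):
    """Yield the mean of the values seen so far, one value at a time."""
    avg = 0.0
    for i, next_value in enumerate(values, start=1):
        # Same update as in the question: ((i-1) * avg + nextValue) / i
        avg = ((i - 1) * avg + next_value) / i
        yield avg


print(list(running_means([3, 3, 6])))  # [3.0, 3.0, 4.0]
```

Only `i` and `avg` are kept between steps, so memory use does not grow with the file.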
Wikipedia: Standard deviation
P.S. Please note that the file is very large. Imagine the most powerful supercomputer with an infinitely large SSD (but negligible RAM) reading data for weeks. It will stop on a signal, after which it must give an answer immediately rather than start reading again (now knowing the mean).

4 answer(s)
Andy_U, 2016-03-08
@Andy_U

If you know how many numbers there are, their sum, and the sum of their squares, you can calculate both the mean and the variance. Any textbook on mathematical statistics and/or statistical data processing will help you. The only things to watch for are accumulated rounding errors and possible overflow during summation.
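
A minimal sketch of this approach in Python, assuming integer input as in the question. The function name `mean_and_std` is hypothetical, and the population formula (dividing by n) is an assumption. Python integers have arbitrary precision, so the sums here cannot overflow; in a language with fixed-width integers or with floating-point input, the overflow and rounding concerns apply in full.

```python
import math


def mean_and_std(values):
    """One pass: keep only the count, the sum, and the sum of squares."""
    n = 0
    total = 0
    total_sq = 0
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    if n == 0:
        raise ValueError("no data")
    mean = total / n
    # Population variance E[x^2] - (E[x])^2, rearranged so that the
    # subtraction is done on exact integers before the single division.
    variance = (n * total_sq - total * total) / (n * n)
    return mean, math.sqrt(variance)


print(mean_and_std([3, 3, 6]))
```

For the file 3, 3, 6 from the question this gives a mean of 4 and a variance of 2.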

Rsa97, 2016-03-08
@Rsa97

The formula doesn't make any sense.
Reason: (i-1) * avg = SUM(Value_0 ... Value_{i-1}), which, according to you, will not fit in memory.
You can compute the average of a block (for example, 100 numbers), then sum these averages and divide by the number of blocks. Continuing the algorithm, the averages of every 100 blocks can be computed separately, as a superblock, then summed, and so on.
The exact value of the standard deviation cannot be calculated without knowing the arithmetic mean. So you need to know what kind of numbers these are. It may well be enough to take a small random sample to obtain estimates of the desired parameters.
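
The block-averaging idea in this answer could be sketched roughly as follows. This is a one-level illustration only: the superblock step is left out, the function name `mean_of_block_means` is invented for the example, and weighting a short final block by its size is an assumption added so that the result equals the true mean.

```python
def mean_of_block_means(values, block_size=100):
    """Average the per-block averages; a short last block is weighted by size."""
    block_means_sum = 0.0  # sum of the averages of all complete blocks
    blocks = 0             # number of complete blocks
    block_sum = 0          # sum of the values in the current block
    in_block = 0           # how many values the current block holds
    for x in values:
        block_sum += x
        in_block += 1
        if in_block == block_size:
            block_means_sum += block_sum / block_size
            blocks += 1
            block_sum, in_block = 0, 0
    total_count = blocks * block_size + in_block
    if total_count == 0:
        raise ValueError("no data")
    # Complete blocks all carry the same weight (block_size values each);
    # the leftover partial block contributes its raw sum.
    return (block_means_sum * block_size + block_sum) / total_count


print(mean_of_block_means(range(1, 11), block_size=4))  # 5.5
```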

evgeniy_lm, 2016-03-09
@evgeniy_lm

You are calculating a standard deviation, which presumes that all your numbers are roughly similar, so in principle it does not matter how many numbers you take from the file: 10, 100, 1000 or 1,000,000. The sample size affects only the accuracy of the result. Since nobody demands perfect accuracy, pick a few random values and don't worry.
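
One way to "pick a few random values" in a single pass, without knowing the file length in advance, is reservoir sampling. The answer does not name this technique; it is swapped in here as an assumption, and `reservoir_sample` is a hypothetical helper. The reservoir is a uniform random sample of everything read so far, so reading can stop at any moment.

```python
import random
import statistics


def reservoir_sample(values, k, rng):
    """Keep a uniform random sample of at most k items from a stream."""
    sample = []
    for i, x in enumerate(values):
        if i < k:
            sample.append(x)       # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = x      # replace with probability k / (i + 1)
    return sample


rng = random.Random(0)  # fixed seed only to make the demo repeatable
sample = reservoir_sample(range(100_000), k=1000, rng=rng)
# Estimates from the sample, not exact values for the whole stream.
print(statistics.mean(sample), statistics.stdev(sample))
```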

Cheshire-Cat, 2020-06-22
@Cheshire-Cat

This can be calculated using Welford's algorithm: https://ru.qwe.wiki/wiki/Standard_deviation#Rapid_... https://en.wikipedia.org/wiki/Algorithms_for_calcu...
It lets you compute the mean and the standard deviation sequentially, in a single pass over the data.
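
A short sketch of Welford's algorithm in Python, for illustration. The class name `RunningStats` and its methods are made up for this example, and the values are assumed to be fed in one at a time as they are read.

```python
import math


class RunningStats:
    """Welford's online algorithm: mean and variance in one pass."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean            # deviation from the old mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # uses the updated mean

    def std(self, sample=False):
        # sample=False divides by n (population), sample=True by n - 1.
        denom = self.n - 1 if sample else self.n
        if denom <= 0:
            raise ValueError("not enough data")
        return math.sqrt(self.m2 / denom)


stats = RunningStats()
for value in [3, 3, 6]:
    stats.push(value)
print(stats.mean, stats.std())
```

Only three numbers are stored between steps, and `mean` and `std()` can be read at any moment, which matches the "stop on a signal" requirement in the question. For 3, 3, 6 the mean is 4 and the population variance is 2.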
