Programming
kuaw26, 2013-06-17 21:09:37

How do I come up with a metric for the uniformity of reading a file?

There is a file, which we treat as divided into blocks of equal length (at least 4 KB each).

The file is read over time.
I need a metric that tracks how "uniformly" the file is read, on a scale from 0 to 1: the closer to 1, the more even the reading.

So far I have thought of doing this:
Number of blocks: k.
Keep an array of counters m_i (how many times each block has been read) and let n be the total number of reads (n = sum of m_i).
Then compute the frequency of each block: p_i = m_i / n.
The expected frequency under perfectly uniform reading is mo = 1/k.
Then the standard deviation from that expectation is std_dev = sqrt(sum((p_i - mo)^2) / k) (a code sketch is at the end of the question).

In practice the resulting values mostly fall between 0.7 and 1.0.

Can you recommend something else to calculate uniformity?

If I use the coefficient of variation instead, i.e. divide the standard deviation by the mean, then for some inputs the result is greater than 1, which does not fit the required range.
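For reference, a minimal sketch in Python of the std_dev computation described above (the function name is mine, and n is taken to be the total number of reads):

import math

def read_uniformity_std_dev(m):
    """Standard deviation of per-block read frequencies from the uniform mean.
    m is the list of counters: m[i] is how many times block i was read.
    Returns 0 for perfectly uniform reading."""
    k = len(m)                      # number of blocks
    n = sum(m)                      # total number of reads (assumed: n = sum of m_i)
    if n == 0:
        return 0.0                  # nothing has been read yet; treat as uniform
    p = [m_i / n for m_i in m]      # frequency of each block
    mo = 1.0 / k                    # expected frequency under uniform reading
    return math.sqrt(sum((p_i - mo) ** 2 for p_i in p) / k)

# Example: read_uniformity_std_dev([5, 5, 5, 5]) == 0.0
#          read_uniformity_std_dev([20, 0, 0, 0]) ~= 0.43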


5 answers
kuaw26, 2013-06-17
@kuaw26

Example of use: detecting "abnormal" files during data processing.
If there is a large file and only a small part of it is read all the time, such a metric lets us identify it.
More broadly, we are writing a profiler for the Hadoop File System.

jcmvbkbc, 2013-06-18
@jcmvbkbc

With this definition, std_dev tends to 0 as k grows even in the case of maximum non-uniformity (only a single block is ever read). Maybe take 1 - std_dev * sqrt(k) as the metric?
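A sketch of this normalization in Python (the function name is mine; std_dev is computed as in the question, with n taken as the total number of reads):

import math

def uniformity(m):
    """1 - std_dev * sqrt(k): close to 1 for uniform reading,
    close to 0 when a single block accounts for all reads."""
    k = len(m)
    n = sum(m)
    if n == 0:
        return 1.0                            # no reads yet: nothing non-uniform observed
    mo = 1.0 / k
    std_dev = math.sqrt(sum((m_i / n - mo) ** 2 for m_i in m) / k)
    return 1.0 - std_dev * math.sqrt(k)       # rescaled so the worst case is close to 0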

EugeneOZ, 2013-06-18
@EugeneOZ

For each file, keep a hash that records its blocks.
On each read, either set the entry for the block that was read to one, or increment it by 1 (whichever is easier and more useful).
Then, when it is time to analyze, look at this hash: if more than 60% of it is empty (or holds values close to zero, in the incrementing case), the file is "suspicious". That's how I see it :)
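A rough sketch of this heuristic in Python, assuming a dict of per-block read counters; the 60% threshold and all names here are illustrative:

def is_suspicious(block_reads, total_blocks, empty_share=0.6):
    """block_reads is a dict {block index: read count}, updated on every read.
    Flags the file when more than empty_share of its blocks were never read
    (missing or zero entries count as 'empty')."""
    blocks_touched = sum(1 for count in block_reads.values() if count > 0)
    return (total_blocks - blocks_touched) / total_blocks > empty_share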

simbajoe, 2013-06-18
@simbajoe

The sum of the absolute deviations from the median, divided by the total number of reads. I am not sure it will come out exactly between 0 and 1 (you can normalize), but in essence it is what you need.
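A possible sketch of this estimate in Python (names are mine; normalization to [0, 1] is not guaranteed, as noted):

import statistics

def median_deviation_metric(m):
    """Sum of absolute deviations of the block counters from their median,
    divided by the total number of reads; 0 means perfectly uniform reading."""
    n = sum(m)
    if n == 0:
        return 0.0
    med = statistics.median(m)
    return sum(abs(m_i - med) for m_i in m) / n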

youngmysteriouslight, 2013-06-18
@youngmysteriouslight

We are given the m_i. Uniformity means that each p_i is close to mo. IMHO, it is more convenient to work directly with dp_i = abs(mo - p_i), the deviation from the mean: the more uniform the reading, the closer the dp_i are to zero on average.
If we do not want to take the order of the m_i into account, then the uniformity function must be symmetric in all its arguments.
You suggest using the standard deviation sqrt(average(dp_i^2)); simbajoe suggests average(dp_i). In general you can take any moment (after all, the dp_i can be viewed as a sample of some random variable): average(dp_i^t) or, bringing it back to the same "dimension", power(average(dp_i^t), 1/t), where t is any natural number. std_dev corresponds to t = 2.
By the way, if f(x) is strictly monotonically increasing from 0 to 1 on [0, 1], then whatever estimate you take (for example, std_dev), f(std_dev) is also a valid uniformity estimate. So both average(dp_i^t) and power(average(dp_i^t), 1/t) are suitable candidates, and if the resulting range does not suit you, you can correct it with such an f(x). For example, if std_dev usually lies in about 0-0.3, then sqrt(std_dev) lies in about 0-0.5 and power(std_dev, 1/4) in about 0-0.7, i.e. it covers almost the entire range. We do not have the statistics you want to build the metric on, so experiment with the choice of exponent.
There is one more degree of freedom: the choice of the averaging method average, i.e. you are not limited to the arithmetic mean and can try other options.
However, there is the question of whether the order of the m_i matters; that depends on the task. Should the statistics [0.1, 0.2, 0.2, 0.2, 0.2, 0.1] and [0.2, 0.1, 0.1, 0.2, 0.2, 0.2] give the same result or different ones? Does it matter whether the heavily read and idle blocks are scattered across the file or grouped into contiguous sections? For reading a file, I think this should be taken into account. If it is not important to you, though, look toward power(std_dev, 1/t) (a sketch follows below).
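A sketch of the generalized estimate with t as a parameter, plus a simple power-function remapping f(x); Python and all names are my choice:

def deviation_moment(m, t=2):
    """power(average(dp_i^t), 1/t), where dp_i = |p_i - 1/k|.
    t = 2 reproduces std_dev from the question; larger t stresses outliers more."""
    k = len(m)
    n = sum(m)
    if n == 0:
        return 0.0
    mo = 1.0 / k
    dp = [abs(m_i / n - mo) for m_i in m]
    return (sum(d ** t for d in dp) / k) ** (1.0 / t)

def stretch(x, exponent=0.25):
    """A monotone remapping f(x) = x**exponent that spreads values clustered
    near zero over more of the [0, 1] range."""
    return x ** exponent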
