How can one come up with a metric for the uniformity of reading a file?
There is a file, which we treat as divided into blocks of equal length, at least 4 KB each.
The file is being read.
You need to come up with a metric to track the "uniformity" of reading a file.
From 0 to 1. The closer to 1, the more even the reading.
So far I have thought of doing this:
Number of blocks = k
We keep an array of counters m_i: how many times each block has been read.
Then we calculate the frequency for each block: p_i = m_i / n, where n = sum(m_i) is the total number of reads.
Then we find the expected frequency under perfectly uniform reading: mo = 1/k.
Then we find the standard deviation from it: std_dev = sqrt(sum((p_i - mo)^2) / k).
Rough limits of the resulting values: 1.0 down to 0.7.
Can you recommend something else to calculate uniformity?
If we use the coefficient of variation instead, i.e. divide the standard deviation by the mean, then for some inputs the result is greater than 1, which is unacceptable.
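A minimal sketch of the scheme above in Python; the function name and the final 1 - std_dev step (so that 1 means perfectly uniform) are my assumptions, not part of the original post:

```python
import math

def uniformity_std_dev(m):
    """m: list of per-block read counters m_i."""
    k = len(m)
    n = sum(m)                       # total number of reads
    if n == 0:
        return 1.0                   # no reads yet: treat as trivially uniform (assumption)
    p = [mi / n for mi in m]         # per-block read frequencies p_i
    mo = 1.0 / k                     # expected frequency under uniform reading
    std_dev = math.sqrt(sum((pi - mo) ** 2 for pi in p) / k)
    return 1.0 - std_dev             # closer to 1 = more uniform reading
```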
Example of use: detecting "abnormal" files during data processing. If only a small part of a large file is constantly read, such a metric will let us identify it. In general, we are writing a profiler for the Hadoop file system (HDFS).
With this definition, std_dev tends to 0 as k grows even at maximum non-uniformity (a single block gets all the reads). Maybe take 1 - std_dev * sqrt(k) as the metric instead?
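A quick check of this rescaling (a sketch; the helper name is hypothetical): with all reads in a single block, std_dev = sqrt(k - 1) / k, so std_dev * sqrt(k) tends to 1 and the metric tends to 0 as k grows, while perfectly uniform reading still scores 1:

```python
import math

def uniformity_scaled(m):
    # Assumes at least one read (sum(m) > 0).
    k, n = len(m), sum(m)
    p = [mi / n for mi in m]
    mo = 1.0 / k
    std_dev = math.sqrt(sum((pi - mo) ** 2 for pi in p) / k)
    return 1.0 - std_dev * math.sqrt(k)  # rescaled so max non-uniformity -> 0

# Single hot block: the metric stays near 0 no matter how large k gets.
for k in (4, 64, 1024):
    m = [100] + [0] * (k - 1)
    print(k, round(uniformity_scaled(m), 4))
```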
For each file, keep a hash (a map) with an entry per block.
On every read, either set the entry for the block being read to one, or increment it by 1 (whichever is easier and more useful).
Then, when it is time to analyze, look at this hash: if more than 60% of it is empty (or holds values close to zero, in the incrementing case), the file is "suspicious". I think so :)
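A sketch of this bookkeeping, assuming a simple in-memory map per file; all names and the 4 KB default block size are illustrative:

```python
from collections import defaultdict

# block_counts[path][block_index] -> number of reads of that block
block_counts = defaultdict(lambda: defaultdict(int))

def record_read(path, offset, block_size=4096):
    block_counts[path][offset // block_size] += 1

def is_suspicious(path, total_blocks, threshold=0.6):
    # "Filled with emptiness": blocks that were never read at all.
    touched = len(block_counts[path])
    return (total_blocks - touched) / total_blocks > threshold
```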
The sum of the absolute deviations from the median, divided by the total number of reads. I am not sure, however, that it will come out exactly between 0 and 1 (you can normalize), but in essence it is what you need.
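A sketch of this suggestion, with the result clamped into [0, 1] since, as noted, the raw value is not guaranteed to stay there (the normalization is my assumption):

```python
from statistics import median

def uniformity_median(m):
    """m: list of per-block read counters m_i."""
    n = sum(m)
    if n == 0:
        return 1.0
    med = median(m)
    dev_sum = sum(abs(mi - med) for mi in m)
    return max(0.0, 1.0 - dev_sum / n)  # clamp, as the ratio can exceed 1
```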
We are given the m_i. Uniformity suggests p_i ≈ mo (that is, m_i ≈ n/k). IMHO, it is more convenient to go straight to dp_i = abs(mo - p_i), the deviation from the mean: the more uniform the reading, the closer the dp_i are to zero on average.
If we do not want to take the order of the m_i into account, then the uniformity function must be symmetric in all its arguments.
You suggest using the standard deviation sqrt(average(dp_i^2)); simbajoe suggests average(dp_i). In the general case you can write out any moment average(dp_i^t) (after all, the numbers dp_i can be viewed as a sample of some random variable) or, bringing it back to the same "dimension", power(average(dp_i^t), 1/t), where t is any natural parameter. std_dev corresponds to t = 2.
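A sketch of this family of metrics for a natural parameter t; mapping the result onto the asker's [0, 1] scale via 1 - ... is my addition:

```python
def uniformity_moment(m, t=2):
    # Assumes at least one read (sum(m) > 0); t = 2 reproduces std_dev.
    k, n = len(m), sum(m)
    p = [mi / n for mi in m]
    mo = 1.0 / k
    dp = [abs(mo - pi) for pi in p]
    moment = (sum(d ** t for d in dp) / k) ** (1.0 / t)  # power(average(dp_i^t), 1/t)
    return 1.0 - moment
```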
By the way, if f(x) is strictly monotonically increasing from 0 to 1 on [0, 1], then no matter which estimate you take (std_dev, for example), f(std_dev) is also a valid uniformity estimate. So both average(dp_i^t) and power(average(dp_i^t), 1/t) are equally suitable candidates, and if the result does not satisfy you, you can correct it with such an f(x). For example, if std_dev is usually about 0 - 0.3, then sqrt(std_dev) is about 0 - 0.5, and power(std_dev, 1/4) about 0 - 0.7, that is, it occupies almost the entire range. We do not have the statistics you want to build the metric for, so try experimenting with the exponent.
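A sketch of such an f(x) correction, using f(x) = x^(1/4) from the example above (the helper name is mine):

```python
def uniformity_rescaled(m, exponent=0.25):
    # f(x) = x ** exponent strictly increases from 0 to 1 on [0, 1],
    # stretching typically small std_dev values over more of the range.
    k, n = len(m), sum(m)
    p = [mi / n for mi in m]
    mo = 1.0 / k
    std_dev = (sum((pi - mo) ** 2 for pi in p) / k) ** 0.5
    return 1.0 - std_dev ** exponent
```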
There is one more degree of freedom: the choice of the averaging method, i.e. not limiting yourself to the arithmetic mean but trying other options as well.
However, there is the question of whether the order of the m_i matters; that depends on the task. Should the statistics [0.1, 0.2, 0.2, 0.2, 0.2, 0.1] and [0.2, 0.1, 0.1, 0.2, 0.2, 0.2] give the same result or different ones? Does it matter whether the loaded and idle blocks are scattered or grouped in contiguous runs? For reading a file, I think this should be taken into account. But if it is not important to you, look towards power(std_dev, 1/t); see the sketch below.
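A quick illustration of the point about order, using the two frequency vectors above: any metric built only from the multiset of dp_i is symmetric in its arguments, so both vectors score identically (the helper name is mine):

```python
def moment_metric(p, t=2):
    k = len(p)
    mo = 1.0 / k
    dp = [abs(mo - pi) for pi in p]
    return 1.0 - (sum(d ** t for d in dp) / k) ** (1.0 / t)

a = [0.1, 0.2, 0.2, 0.2, 0.2, 0.1]
b = [0.2, 0.1, 0.1, 0.2, 0.2, 0.2]
print(moment_metric(a), moment_metric(b))  # identical: block order is ignored
```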