How to discard outliers from a stream of similar values (input data filtering)?


C++ / C#

Dmitry, 2015-08-06 19:11:32

There is the following input stream:

How can I extract from it, for each sensor, the two most frequently occurring numbers out of three?
I.e., in this case, get the triples [*, 53, 27], [*, 46, 37], [2, 17, *], [*, 95, 51].
Currently the arithmetic mean is used, but because of random bursts it deviates from the "reference" value by more than 3 units, which is not acceptable: the result is [25, 53, 25], [32, 46, 33], [5, 17, 4], [54, 95, 52].
Roughly speaking, out of the 200 numbers in each column I need to discard the obviously random bursts that differ greatly from the rest, then find for each column the number with the highest frequency of occurrence (plus or minus the allowable error), and select the two most frequently occurring numbers from each triple.
What are such algorithms called? In particular, for C++.


2 answer(s)


Alexander Ruchkin, 2015-08-07
@VoidEx

I can offer a naive solution.
Count how many times each number occurs, for example in a std::map<int, std::size_t>:

// needs <algorithm>, <cstddef>, <map>, <utility>, <vector>
// vals holds the readings of one sensor (one column of the stream)
std::map<int, std::size_t> m;
std::size_t total = 0;
for (int x : vals) { ++m[x]; ++total; }
std::vector<std::pair<int, std::size_t> > v(m.begin(), m.end());
std::sort(v.begin(), v.end(), [] (std::pair<int, std::size_t> const & l, std::pair<int, std::size_t> const & r) { return l.second > r.second; });
// v now holds "number, occurrence count" pairs, sorted by descending count
// the tail can be dropped (numbers that occur less often than some share of all samples, e.g. 10%)
std::size_t const min_count = total / 10;
v.erase(
  std::find_if(v.begin(), v.end(), [min_count] (std::pair<int, std::size_t> const & x) { return x.second < min_count; }),
  v.end());
// and the most frequent number is taken from the top
// (v is empty if no number reaches the threshold, so check v.empty() first in real code)
int row_value = v.front().first;
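
The counting above treats neighbouring readings such as 52 and 53 as unrelated numbers, while the question asks for the most frequent number "plus or minus the allowable error". One possible way to account for that, as a rough sketch only: sort the column and, for each sample, count how many samples lie within the tolerance of it. The function name mode_with_tolerance and the parameter tolerance are hypothetical names chosen for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the sample whose [x - tolerance, x + tolerance] neighbourhood
// contains the most samples. `tolerance` is an assumed parameter that stands
// for the "allowable error" from the question.
int mode_with_tolerance(std::vector<int> samples, int tolerance) {
    assert(!samples.empty() && tolerance >= 0);
    std::sort(samples.begin(), samples.end());

    std::size_t best_count = 0;
    int best_value = samples.front();
    for (std::size_t i = 0; i < samples.size(); ++i) {
        // After sorting, all samples within the tolerance form one contiguous run.
        auto lo = std::lower_bound(samples.begin(), samples.end(), samples[i] - tolerance);
        auto hi = std::upper_bound(samples.begin(), samples.end(), samples[i] + tolerance);
        std::size_t count = static_cast<std::size_t>(hi - lo);
        if (count > best_count) {  // on ties the smaller value wins
            best_count = count;
            best_value = samples[i];
        }
    }
    return best_value;
}
```

Isolated bursts get a neighbourhood count of 1 and therefore never win, so they do not need to be removed beforehand. For 200 samples per column the O(n log n) cost is negligible.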


Dmitry, 2015-08-10
@Tomasina

A median filter saved the day: at a depth of 10-20 iterations, everything superfluous is cut off remarkably well.
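
For readers who have not used one, a sliding-window median filter can be sketched roughly as follows. This is not the code actually used here, only a minimal illustration: the class name MedianFilter and the window parameter are hypothetical, and the "depth" mentioned above is assumed to mean the number of recent samples kept per sensor.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Sliding-window median filter for a single sensor channel.
// `window` (an assumed parameter) is the number of recent samples kept.
class MedianFilter {
public:
    explicit MedianFilter(std::size_t window) : window_(window) {
        assert(window > 0);
    }

    // Feeds one raw sample and returns the median of the samples in the window.
    int push(int sample) {
        recent_.push_back(sample);
        if (recent_.size() > window_) {
            recent_.pop_front();  // forget the oldest sample
        }
        // Work on a copy so the arrival order in recent_ is preserved.
        std::vector<int> sorted(recent_.begin(), recent_.end());
        auto mid = sorted.begin() + sorted.size() / 2;
        std::nth_element(sorted.begin(), mid, sorted.end());
        return *mid;  // upper median when the window holds an even number of samples
    }

private:
    std::size_t window_;
    std::deque<int> recent_;
};
```

With the made-up readings 53, 52, 53, 0, 54, 53, 120, 53, 52, 53 and a window of 5, every call to push returns 53: the bursts 0 and 120 never reach the middle of the sorted window. One filter instance per sensor column would be needed.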