OpenCV
Pudjak, 2021-12-03 09:00:36

How do Haar Cascades work in OpenCV?

Can you explain in plain language how this works?
Say there is an image in which faces need to be found, and we pass it through this algorithm.
The algorithm has black-and-white masks and features 24 by 24 pixels in size. There are also 25 stages of some kind.

So what happens to them in the end?
My guess: we slide a 24-by-24 window over the entire image, and at each position we also apply the masks to decide whether a face could be at that spot (if so, how?). As a result, we get an image with several 24-by-24 marks where a face supposedly is. How is it ultimately decided whether there is a face in the whole image?

1 answer(s)
Vindicar, 2021-12-03

"Haar sign" is a rectangular filter divided into two areas - light and dark. This filter is applied to some area of ​​the image (window). The value (response) of the feature is the sum of the brightness of the image pixels in the bright area minus the sum of the brightness of the pixels in the dark area. If this difference exceeds a certain threshold, then we consider that the given filter gave a response in the given place of the image.
This is a primitive classifier. When training a Haar classifier with a boosting algorithm, a set of such primitive classifiers is combined into one composite classifier. But such a classifier is either slow or produces many false positives; even a 0.01% false-positive rate is a lot, considering how many candidate windows (possible face positions) an image contains.
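The composite classifier can be sketched as a weighted vote of weak classifiers, roughly in the spirit of AdaBoost (all names, thresholds, and weights below are invented for illustration):

```python
# Each weak classifier is a (feature_fn, threshold, weight) triple.
# The window is accepted if the weighted vote of the weak classifiers
# that fire reaches half of the total weight.

def boosted_classify(window, weak_classifiers):
    total = sum(weight for _, _, weight in weak_classifiers)
    votes = sum(weight for fn, thresh, weight in weak_classifiers
                if fn(window) > thresh)
    return votes >= total / 2

# Toy weak classifiers over a 3-element "window" of feature responses.
weak = [
    (lambda w: w[0], 5, 2.0),  # fires if first response exceeds 5
    (lambda w: w[1], 5, 1.0),
    (lambda w: w[2], 5, 1.0),
]

boosted_classify([9, 9, 0], weak)  # vote 3.0 of 4.0 -> accepted (True)
boosted_classify([0, 9, 0], weak)  # vote 1.0 of 4.0 -> rejected (False)
```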
Therefore, the principle of an "attentional cascade" is used. A chain of several composite classifiers is built so that each subsequent stage sifts out as many negative windows as possible while passing all, or almost all, positive ones (detection rate > 95%). This lets us get by with evaluating only the relatively fast and simple composite classifiers for the vast majority of windows in the image.
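The cascade itself reduces to an early-exit loop: a window that fails any stage is rejected immediately, so most windows never reach the expensive later stages. A sketch (stage contents are invented; real stages are boosted classifiers as above):

```python
def cascade_classify(window, stages):
    """Run the window through the stages in order; reject on first failure."""
    for stage in stages:
        if not stage(window):
            return False  # early exit: most non-face windows stop here
    return True  # survived every stage -> candidate face

# Toy stages over a dict of precomputed window statistics.
stages = [
    lambda w: w["brightness"] > 10,  # cheap, permissive first stage
    lambda w: w["contrast"] > 5,     # stricter later stage
]

cascade_classify({"brightness": 50, "contrast": 8}, stages)  # True
cascade_classify({"brightness": 2, "contrast": 8}, stages)   # False (stage 1)
```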
Finally, several nearby windows can be merged into one using non-maximum suppression. This is needed because the same face can fall into several neighboring windows that are slightly offset from each other.
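A minimal grouping sketch in the spirit of non-maximum suppression: detections whose boxes overlap strongly (by intersection-over-union) are collapsed into one. OpenCV's actual grouping works differently (`groupRectangles`, driven by the `minNeighbors` parameter); this only illustrates the idea:

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def merge_detections(boxes, thresh=0.5):
    """Keep a box only if it does not overlap an already-kept box."""
    kept = []
    for box in boxes:
        if all(iou(box, k) < thresh for k in kept):
            kept.append(box)
    return kept

# Two offset windows on the same face collapse to one; the distant window
# survives as a separate detection.
merge_detections([(0, 0, 24, 24), (1, 1, 24, 24), (100, 100, 24, 24)])
```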
If any responses remain at the end, we know the positions and sizes of the candidate faces in the image, and further heuristics can be built on top of that information. For example, if we want a close-up shot, we can reject the image if the largest face occupies less than 75% of the image area.
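That close-up heuristic is easy to express once the detector has returned its boxes (in OpenCV that would be the `(x, y, w, h)` tuples from `cv2.CascadeClassifier.detectMultiScale`; the function name and 75% default below are just the example from the text):

```python
def is_closeup(faces, img_w, img_h, min_ratio=0.75):
    """True if the largest detected face covers at least min_ratio of the image.

    faces: list of (x, y, w, h) boxes, e.g. from a cascade detector.
    """
    if not faces:
        return False
    largest = max(w * h for _, _, w, h in faces)
    return largest / (img_w * img_h) >= min_ratio

is_closeup([(0, 0, 90, 90)], 100, 100)    # 81% of the frame -> True
is_closeup([(10, 10, 50, 50)], 100, 100)  # 25% of the frame -> False
```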
