A few clarifications on SIFT based on an article on Habrahabr. Can anyone help?

Sacerdos, 2016-05-22 13:35:09

Programming

I have a few questions based on the same article: https://habrahabr.ru/post/106302/

First, let's define the window (neighborhood) of the key point in which the gradients will be examined. In essence, this is the window needed for convolution with a Gaussian kernel; it is round, and the blur radius of this kernel (sigma) is 1.5*keypoint_scale. The so-called "three sigma" rule applies to the Gaussian kernel: its value is very close to zero at distances greater than 3*sigma. The window radius is therefore defined as [3*sigma].
The direction of the key point is found from the orientation histogram O. The histogram consists of 36 components that evenly cover the full 360 degrees, and it is built as follows: each point (x, y) of the window contributes m*G(x, y, sigma) to the histogram component whose interval contains the gradient direction theta(x, y).

So if my Gaussian uses sigma = 2, do I examine a neighborhood of radius 6 at this stage? Or can the radius be chosen "arbitrarily", for example 4?
As for the contribution: as I understand it, m is the gradient magnitude, and it has to be multiplied by the value of the extremum at that point? Couldn't that mean multiplying by 0?
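
For reference, here is a minimal Python sketch of how the quoted histogram step can be read; it is not taken from the article. It assumes that G(x, y, sigma) is the Gaussian weight evaluated at the pixel's offset from the key point (so it is always positive and is not the value of the extremum), that [3*sigma] means rounding to an integer, and that gradients come from central differences. The name `orientation_histogram` and its parameters are hypothetical.

```python
import numpy as np

def orientation_histogram(image, row, col, keypoint_scale, num_bins=36):
    """Gradient-orientation histogram in a round window around (row, col)."""
    image = np.asarray(image, dtype=float)
    sigma = 1.5 * keypoint_scale
    radius = int(round(3 * sigma))  # "three sigma" rule; rounding is an assumption
    hist = np.zeros(num_bins)
    height, width = image.shape
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dx * dx + dy * dy > radius * radius:
                continue  # keep the window round
            y, x = row + dy, col + dx
            if not (1 <= y < height - 1 and 1 <= x < width - 1):
                continue  # central differences need both neighbors
            gx = image[y, x + 1] - image[y, x - 1]
            gy = image[y + 1, x] - image[y - 1, x]
            m = np.hypot(gx, gy)                       # gradient magnitude
            theta = np.degrees(np.arctan2(gy, gx)) % 360.0
            # Unnormalized Gaussian weight of the offset from the key point;
            # a constant factor would not change which bin is the maximum.
            weight = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
            bin_index = int(theta // (360.0 / num_bins)) % num_bins
            hist[bin_index] += m * weight
    return hist
```

Under this reading, a smaller radius such as 4 with sigma = 2 would simply drop pixels whose weight is already small, so it changes the histogram only slightly; whether that is acceptable is a design choice, not something the quoted text settles.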

The direction of the key point lies in the interval covered by the maximum component of the histogram. The values of the maximum component (max) and its two neighbors are interpolated with a parabola, and the point where this parabola reaches its maximum is taken as the direction of the key point. If the histogram has other components with values of at least 0.8*max, they are interpolated in the same way and additional directions are assigned to the key point.

Why are additional directions assigned (if there are any) when, while building the descriptor, we rotate the point and its neighborhood by an angle equal to the main direction?
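
The peak selection described in the quote could be sketched roughly as follows. This is an illustration under stated assumptions, not the article's code: the name `dominant_orientations` is hypothetical, each bin's direction is taken at the bin center, and a candidate must also be a local peak (Lowe's paper words it that way; the quoted text only gives the 0.8*max threshold).

```python
import numpy as np

def dominant_orientations(hist, peak_ratio=0.8):
    """Return interpolated directions (degrees) for every qualifying peak."""
    hist = np.asarray(hist, dtype=float)
    num_bins = len(hist)
    bin_width = 360.0 / num_bins
    max_value = hist.max()
    orientations = []
    for i in range(num_bins):
        left = hist[(i - 1) % num_bins]    # the histogram is circular
        center = hist[i]
        right = hist[(i + 1) % num_bins]
        is_local_peak = center > left and center > right
        if not is_local_peak or center < peak_ratio * max_value:
            continue
        # Vertex of the parabola through (-1, left), (0, center), (1, right);
        # the denominator is negative for a strict local peak, never zero.
        offset = 0.5 * (left - right) / (left - 2.0 * center + right)
        orientations.append(((i + 0.5 + offset) * bin_width) % 360.0)
    return orientations
```

In Lowe's original description, each returned direction produces a separate key point at the same location and scale, and a descriptor is built for each one after rotating by that particular direction.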

Here, a part of the image (on the left) and the descriptor obtained from it (on the right) are shown schematically. First, let's look at the left side. The pixels are drawn as small squares. These pixels are taken from the square descriptor window, which in turn is divided into four equal parts (we will call them regions below). The small arrow in the center of each pixel represents that pixel's gradient. Note that the center of this window lies between the pixels; it should be chosen as close as possible to the exact coordinates of the key point. The last detail is the circle, which represents the window for convolution with a Gaussian kernel (similar to the window used to compute the direction of the key point). For this kernel, sigma is set to half the width of the descriptor window.

What radius should be examined here, 3*sigma again? Or can it be 4, as in the image? And "the value of the Gaussian kernel at this point": what exactly is meant, the extremum?
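
To make the question concrete, here is a simplified sketch of the 2x2x8 layout from the figure. It rests on several assumptions that the quoted text does not confirm: every pixel of the square window is used (the circle only illustrates the Gaussian weighting), the weight is the Gaussian of the pixel's distance from the window center, the gradient directions are already expressed relative to the key point direction, and the interpolation between neighboring bins and regions used by full implementations is left out. The name `raw_descriptor` is hypothetical.

```python
import numpy as np

def raw_descriptor(magnitude, angle_deg, regions=2, num_bins=8):
    """Build a regions x regions x num_bins descriptor from a square patch.

    `magnitude` and `angle_deg` hold the gradient magnitude and direction
    (degrees) of each pixel in the descriptor window; the patch side must
    be divisible by `regions`.
    """
    size = magnitude.shape[0]
    cell = size // regions              # side of one region in pixels
    sigma = 0.5 * size                  # half the width of the window
    center = (size - 1) / 2.0           # lies between pixels for even sizes
    hist = np.zeros((regions, regions, num_bins))
    for y in range(size):
        for x in range(size):
            d2 = (y - center) ** 2 + (x - center) ** 2
            weight = np.exp(-d2 / (2.0 * sigma ** 2))
            b = int((angle_deg[y, x] % 360.0) // (360.0 / num_bins)) % num_bins
            hist[y // cell, x // cell, b] += magnitude[y, x] * weight
    return hist.ravel()                 # 2*2*8 = 32 values for the figure
```

With `regions=4` and a 16x16 patch the same function returns the 128-component layout mentioned in the article.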

The key point descriptor consists of all the resulting histograms. As already mentioned, the dimension of the descriptor in the figure is 32 components (2x2x8), but in practice descriptors with 128 components (4x4x8) are used.
The resulting descriptor is normalized, after which every component whose value is greater than 0.2 is truncated to 0.2, and then the descriptor is normalized again. In this form, the descriptors are ready for use.

So in the end the descriptor for a point is an array of 32 values: 4 "regions", each with 8 directions? And in that case, what should be done with points near the edge of the image? Are they skipped, since their neighborhoods would be truncated?
Normalization: is it done as for vectors, that is, dividing by the square root of the sum of squares of all 32 values?
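
The normalize, clip, renormalize step from the quote could look like this, assuming the normalization is the Euclidean (L2) one, which is how Lowe's paper describes it ("unit length"). The name `normalize_descriptor` and the `eps` guard against a zero vector are my own additions.

```python
import numpy as np

def normalize_descriptor(descriptor, clip=0.2, eps=1e-12):
    """Normalize to unit length, clip large components, normalize again."""
    v = np.asarray(descriptor, dtype=float)
    v = v / max(np.linalg.norm(v), eps)   # divide by sqrt(sum of squares)
    v = np.minimum(v, clip)               # truncate components above 0.2
    return v / max(np.linalg.norm(v), eps)
```

Note that after the second normalization individual components can again exceed 0.2; the clipping only limits how much a single large gradient dominates the vector.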