Why does performance drop when using SIMD (gcc, auto-vectorization)?


Andryukha, 2015-08-24 09:46:09

GCC


I am writing a raytracer for fun, entirely by hand, including the vector and matrix operations. The linear algebra code is written as plain loops. Here is an excerpt, the vector addition function:
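// Tell GCC that pointer a is aligned to align_size bytes, then run the assert_align check on it.
#define assume_aligned(a) (__builtin_assume_aligned(a, align_size)); assert_align(a)
// Declare a_, an alias of pointer a that carries the alignment promise.
#define align(a) decltype(a) a##_ = (decltype(a)) assume_aligned(a)
template <typename T, size_t size>
void add(const T* __restrict v, const T* __restrict u, T* __restrict o) {
    align(v); align(u); align(o);
    // Element-wise sum over a fixed-size vector; this is the loop GCC is expected to vectorize.
    for (size_t i = 0; i < size; i++)
        o_[i] = v_[i] + u_[i];
}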
All the functions are written in this vein, relying on -O3 to optimize them. I also use OpenMP, which immediately gave roughly a 4x speedup, matching the number of cores. The problem is with SIMD. I did not want to write SIMD instructions by hand and preferred to leave that to the compiler. But when I use vectors of 4 floats instead of 3, which I expected to be faster, the code actually runs 3 times slower. (I have not looked at the generated assembly.) All data is declared alignas(16), and the compiler reports "loop vectorized". I have read that many others have also seen unexpected results from auto-vectorization.
So the actual question is: who has had positive experience with GCC auto-vectorization, either on its own or combined with OpenMP (#pragma omp simd)?
Thanks in advance.
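
For reference, the #pragma omp simd variant mentioned in the question could look roughly like the sketch below. It is an illustration, not the asker's actual code: the name add_simd, the template parameter N, and the hard-coded 16-byte alignment are assumptions. GCC honours the pragma only when compiling with -fopenmp or -fopenmp-simd; without those flags it is ignored and the plain loop remains.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: element-wise sum of two N-element arrays.
// The aligned clause is a promise to the compiler, not a check,
// so the asserts below verify it in debug builds.
template <typename T, std::size_t N>
void add_simd(const T* __restrict v, const T* __restrict u, T* __restrict o) {
    assert(reinterpret_cast<std::uintptr_t>(v) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(u) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(o) % 16 == 0);

    // Ask for a SIMD loop and state the assumed 16-byte alignment.
    #pragma omp simd aligned(v, u, o : 16)
    for (std::size_t i = 0; i < N; ++i)
        o[i] = v[i] + u[i];
}
```

A call would pass alignas(16) arrays, for example add_simd<float, 4>(a, b, c). Whether this beats the plain loop for a trip count as small as 4 is not guaranteed and has to be measured.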


2 answer(s)


Armenian Radio, 2015-08-24
@gbg

You have not looked at the assembly, and that is the main problem. Look at it.
Positive experiences do exist, including on a similar problem.
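
One way to follow this advice is to isolate a single kernel in a small file and inspect what GCC emits for it. The sketch below rests on assumptions: the function name add4, the file name kernel.cpp, and the fixed length of 4 are hypothetical and not taken from the question.

```cpp
// kernel.cpp (hypothetical file name)
//
// Emit assembly for inspection:
//   g++ -O3 -S -masm=intel kernel.cpp -o kernel.s
// Ask GCC which loops it vectorized and which it did not:
//   g++ -O3 -fopt-info-vec-optimized -fopt-info-vec-missed -c kernel.cpp
#include <cstddef>

// noinline keeps the kernel as a separate symbol so it is easy to find in kernel.s.
__attribute__((noinline))
void add4(const float* __restrict v, const float* __restrict u, float* __restrict o) {
    // Promise 16-byte alignment to the optimizer, as the macro in the question does.
    const float* v_ = static_cast<const float*>(__builtin_assume_aligned(v, 16));
    const float* u_ = static_cast<const float*>(__builtin_assume_aligned(u, 16));
    float* o_ = static_cast<float*>(__builtin_assume_aligned(o, 16));

    for (std::size_t i = 0; i < 4; ++i)
        o_[i] = v_[i] + u_[i];
}
```

On x86, a packed instruction such as addps (vaddps with AVX) in the output suggests the loop was vectorized, while a run of scalar addss instructions means it was not. Extra shuffles or loads and stores around the packed add are a likely place to look for the lost time.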