Why do the following SIMD instructions take the time they do?
I'm trying to understand vectorization, so I wrote this simple program:
#include <stdio.h>
int main()
{
    unsigned int i;
    double x1, x2, x3, x4, y1, y2, y3, y4;
    x1 = 1.0000000001;
    x2 = 1.0000000002;
    x3 = 1.0000000003;
    x4 = 1.0000000004;
    y1 = x1;
    y2 = x2;
    y3 = x3;
    y4 = x4;
    for (i = 0; i < 3640000000; i++) {
        y1 *= x1;
        y2 *= x2;
        y3 *= x3;
        y4 *= x4;
    }
    printf("%g %g %g %g\n", y1, y2, y3, y4);
}
Option 1, x87 FPU:
11d4: fxch %st(4)
11d6: fxch %st(3)
11d8: fxch %st(2)
11da: fmul %st,%st(1)
11dc: fxch %st(2)
11de: fmull -0x1fd8(%ebx)
11e4: fxch %st(3)
11e6: fmull -0x1fe0(%ebx)
11ec: fxch %st(4)
11ee: fmull -0x1fe8(%ebx)
11f4: sub $0x1,%eax
11f7: jne 11d4 <[email protected]+0x184>
Option 2, scalar SSE2:
11e8: mulsd %xmm7,%xmm0
11ec: mulsd %xmm6,%xmm1
11f0: mulsd %xmm5,%xmm2
11f4: mulsd %xmm4,%xmm3
11f8: sub $0x1,%eax
11fb: jne 11e8 <[email protected]+0x198>
Option 3, scalar AVX:
11e8: vmulsd %xmm7,%xmm0,%xmm0
11ec: vmulsd %xmm6,%xmm1,%xmm1
11f0: vmulsd %xmm5,%xmm2,%xmm2
11f4: vmulsd %xmm4,%xmm3,%xmm3
11f8: sub $0x1,%eax
11fb: jne 11e8 <[email protected]+0x198>
Option 4, packed SSE2 (two doubles per register):
11d0: mulpd %xmm3,%xmm1
11d4: mulpd %xmm2,%xmm0
11d8: add $0x1,%eax
11db: cmp $0xd8f5fe00,%eax
11e0: jne 11d0 <[email protected]+0x180>
Option 5, packed AVX (four doubles per register):
11c7: vmulpd %ymm1,%ymm0,%ymm0
11cb: add $0x1,%eax
11ce: cmp $0xd8f5fe00,%eax
11d3: jne 11c7 <[email protected]+0x177>
I didn't find anything to read, but I did find a recommendation to use llvm-mca, and for the SSE/AVX variants of the code it gave performance numbers that matched what I saw in practice. So if you want to dig deeper, you can read the llvm-mca sources (and from there get to the pipeline model somewhere in the depths of LLVM).
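To try llvm-mca on the loop from the question, one approach is to wrap the loop body in the comment markers that llvm-mca uses to delimit a code region. The sketch below is only an assumption about how that could look, not the setup actually used here: the function `mul_chains`, the macros `MCA_BEGIN` and `MCA_END`, and the file name in the comment are hypothetical. Note also that the markers may themselves keep the compiler from vectorizing the loop, so this fits the scalar variants best.

```c
#include <assert.h>

/* On x86 with GCC or Clang these macros emit assembler comments that
   llvm-mca recognizes as region delimiters; elsewhere they do nothing. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define MCA_BEGIN(name) __asm__ volatile("# LLVM-MCA-BEGIN " name)
#define MCA_END()       __asm__ volatile("# LLVM-MCA-END")
#else
#define MCA_BEGIN(name) ((void)0)
#define MCA_END()       ((void)0)
#endif

/* Hypothetical helper: the loop from the question, with four independent
   multiplication chains and a caller-supplied iteration count n.
   On return, y[k] holds x[k] raised to the power n + 1. */
void mul_chains(const double x[4], double y[4], unsigned n)
{
    assert(x && y);

    /* Local copies so the factors and accumulators can stay in registers. */
    double x1 = x[0], x2 = x[1], x3 = x[2], x4 = x[3];
    double y1 = x1, y2 = x2, y3 = x3, y4 = x4;

    for (unsigned i = 0; i < n; i++) {
        MCA_BEGIN("mul_chains");
        y1 *= x1;
        y2 *= x2;
        y3 *= x3;
        y4 *= x4;
        MCA_END();
    }

    y[0] = y1;
    y[1] = y2;
    y[2] = y3;
    y[3] = y4;
}

/* Possible invocation (loop.c is a hypothetical file name, and the CPU
   model is only an example):
     clang -O2 -S -o - loop.c | llvm-mca -mcpu=skylake -timeline      */
```

The report llvm-mca prints for the marked region includes an estimated cycle count, which can then be compared with the measured run time.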
Why doesn't the fifth option run 4 times faster than the first?
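One possible explanation, offered as a rough model rather than a measurement: every option carries each result from one iteration to the next, so an iteration cannot finish faster than the latency of one multiplication, however many values a single instruction processes. The sketch below puts numbers on that idea. The function name `cycles_per_iteration` is hypothetical, and the figures used with it (a latency of 4 cycles, 2 multiplications issued per cycle) are illustrative assumptions, not values measured on the machine from the question.

```c
#include <assert.h>

/* Rough lower bound on cycles per loop iteration when every multiply
   instruction in the body belongs to its own loop-carried chain.
   latency   - cycles until a multiplication result can be reused (assumed)
   per_cycle - multiply instructions the core can start per cycle (assumed)
   muls      - multiply instructions in the loop body                     */
double cycles_per_iteration(double latency, double per_cycle, unsigned muls)
{
    assert(per_cycle > 0.0);

    /* Bound imposed by how fast the multiplies can be issued. */
    double throughput_bound = (double)muls / per_cycle;

    /* Each chain advances by one multiplication per iteration, so the
       latency of a single multiplication is the other bound. */
    return latency > throughput_bound ? latency : throughput_bound;
}
```

Under these assumed figures, four scalar multiplications per iteration give max(4, 4/2) = 4 cycles, and a single packed multiplication gives max(4, 1/2) = 4 cycles as well. In this model the packed version does the same work with fewer instructions but in about the same time, because the dependency chain, not the number of instructions, sets the pace.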