Why do the following SIMD instructions have these execution times?

mlyamasov, 2021-11-14 00:43:20

assembler

I'm trying to understand vectorization, so I wrote this simple program:

#include <stdio.h>

int main()
{
    unsigned int i;
    double x1, x2, x3, x4, y1, y2, y3, y4;
    x1 = 1.0000000001;
    x2 = 1.0000000002;
    x3 = 1.0000000003;
    x4 = 1.0000000004;
    y1 = x1;
    y2 = x2;
    y3 = x3;
    y4 = x4;
    for (i = 0; i < 3640000000; i++) {
        y1 *= x1;
        y2 *= x2;
        y3 *= x3;
        y4 *= x4;
    }
    printf("%g %g %g %g\n", y1, y2, y3, y4);
}

Architecture: i686.

gcc options: -O1. 4 fmul instructions in a loop. Execution time: 10s.

11d4:       fxch   %st(4)
11d6:       fxch   %st(3)
11d8:       fxch   %st(2)
11da:       fmul   %st,%st(1)
11dc:       fxch   %st(2)
11de:       fmull  -0x1fd8(%ebx)
11e4:       fxch   %st(3)
11e6:       fmull  -0x1fe0(%ebx)
11ec:       fxch   %st(4)
11ee:       fmull  -0x1fe8(%ebx)
11f4:       sub    $0x1,%eax
11f7:       jne    11d4

gcc options: -O1 -mfpmath=sse -msse2. 4 mulsd instructions in a loop. Execution time: 6.7s.

11e8:       mulsd  %xmm7,%xmm0
11ec:       mulsd  %xmm6,%xmm1
11f0:       mulsd  %xmm5,%xmm2
11f4:       mulsd  %xmm4,%xmm3
11f8:       sub    $0x1,%eax
11fb:       jne    11e8

gcc options: -O1 -mfpmath=sse -mavx. 4 vmulsd instructions in a loop. Execution time: 6.7s.

11e8:       vmulsd %xmm7,%xmm0,%xmm0
11ec:       vmulsd %xmm6,%xmm1,%xmm1
11f0:       vmulsd %xmm5,%xmm2,%xmm2
11f4:       vmulsd %xmm4,%xmm3,%xmm3
11f8:       sub    $0x1,%eax
11fb:       jne    11e8

gcc options: -O1 -mfpmath=sse -msse2 -ftree-vectorize. 2 mulpd instructions in a loop. Execution time: 6.7s.

11d0:       mulpd  %xmm3,%xmm1
11d4:       mulpd  %xmm2,%xmm0
11d8:       add    $0x1,%eax
11db:       cmp    $0xd8f5fe00,%eax
11e0:       jne    11d0

gcc options: -O1 -mfpmath=sse -mavx -ftree-vectorize. 1 vmulpd instruction in a loop. Execution time: 6.7s.

11c7:       vmulpd %ymm1,%ymm0,%ymm0
11cb:       add    $0x1,%eax
11ce:       cmp    $0xd8f5fe00,%eax
11d3:       jne    11c7

Why doesn't the fifth option run 4 times faster than the first? What can I read on this topic?

1 answer(s)

jcmvbkbc, 2021-11-17
@mlyamasov

I didn't find anything to read, but I did find a recommendation to use llvm-mca, and for the code variants that use SSE/AVX it gave me performance numbers that matched what I saw in practice. So, if you wish, you can read the llvm-mca sources (and get to the pipeline model somewhere in the depths of LLVM).
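One way to try this on the loop from the question is to mark the loop body with llvm-mca's region markers, which are plain assembler comments emitted through inline asm. The sketch below is only an illustration and not necessarily the setup used for this answer: the function name mul_chains is made up, it assumes GCC or Clang targeting x86 (where # starts an assembler comment), and the compiler may still move code across the markers or refuse to vectorize a loop that contains them, so check the generated assembly before trusting the report.

```c
/* Hypothetical helper: the four multiply chains from the question,
   with the iteration count passed in as a parameter. */
void mul_chains(const double x[4], double y[4], unsigned int n)
{
    double x1 = x[0], x2 = x[1], x3 = x[2], x4 = x[3];
    double y1 = x1, y2 = x2, y3 = x3, y4 = x4;
    unsigned int i;

    for (i = 0; i < n; i++) {
        /* Start of the region llvm-mca should analyze. */
        __asm__ volatile("# LLVM-MCA-BEGIN mul_chains" ::: "memory");
        y1 *= x1;
        y2 *= x2;
        y3 *= x3;
        y4 *= x4;
        /* End of the analyzed region. */
        __asm__ volatile("# LLVM-MCA-END" ::: "memory");
    }

    /* Store the results so the multiplications are not optimized away. */
    y[0] = y1;
    y[1] = y2;
    y[2] = y3;
    y[3] = y4;
}
```

Compiling this to assembly with the options from the question plus -S and passing the resulting .s file to llvm-mca should produce a report for the marked region. For 32-bit code llvm-mca may need to be told the target and CPU explicitly (it has -mtriple and -mcpu options).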

Why doesn't the fifth option run 4 times faster than the first?

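Because execution time here is determined by the latency of the instructions, not by how many of them there are. Each multiplication needs the result of the multiplication from the previous iteration, so one iteration cannot take less than the multiply latency, whether the multiply is scalar or packed. The vmulpd in the fifth variant takes 4 cycles, and the rest of the loop instructions execute entirely within that time, so there is also no point in shortening the loop by changing add+cmp+jne to sub+jne.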
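The arithmetic behind this can be sketched in a few lines of C. This is a rough model under stated assumptions, not a measurement: it takes the 4-cycle figure quoted above, assumes nothing else limits the loop, and uses made-up function names. The asker's CPU and clock frequency are not given, so the frequency it prints is only what the quoted numbers would imply.

```c
#include <stdio.h>

/* Total cycles if every iteration has to wait for the previous
   iteration's multiply to finish. */
double latency_bound_cycles(double iterations, double latency_cycles)
{
    return iterations * latency_cycles;
}

/* Clock frequency (Hz) at which a latency-bound loop would take
   the given number of seconds. */
double implied_clock_hz(double iterations, double latency_cycles,
                        double seconds)
{
    return latency_bound_cycles(iterations, latency_cycles) / seconds;
}

/* Cycles per iteration implied by a measured time at a given clock. */
double cycles_per_iteration(double seconds, double clock_hz,
                            double iterations)
{
    return seconds * clock_hz / iterations;
}

/* Prints the estimates for the numbers quoted in this thread. */
void print_estimates(void)
{
    const double iterations = 3640000000.0; /* loop count from the question */
    const double latency = 4.0;             /* cycles quoted in the answer */
    double hz = implied_clock_hz(iterations, latency, 6.7);

    printf("implied clock: %.2f GHz\n", hz / 1e9);
    printf("x87 variant: %.1f cycles per iteration\n",
           cycles_per_iteration(10.0, hz, iterations));
}
```

With these inputs the model gives a clock of roughly 2.17 GHz for the 6.7 s runs, and at that clock the 10 s x87 run works out to about 6 cycles per iteration. If the real clock is close to that, all four SSE/AVX variants are consistent with a loop that is limited by multiply latency, which is why packing more values into one instruction does not shorten the run.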