Distributed Computing
thecove, 2021-10-02 14:06:55

Why is my AMD Ryzen 3970 2-3x slower than Core i9 10850K?

The situation is strange. We bought an AMD Ryzen 3970 for work, intending to run our multithreaded math tasks on it.
Before that, we used a Core i9 10850K (10 cores, 20 threads with Hyper-Threading).
The result is an ugly picture.
40 million iterations of our math (the math contains no floating-point operations at all, only addition, subtraction, multiplication and shifts):
- on Intel, using 20 threads, it takes 20.5 seconds
- on AMD, using 64 threads, it takes 37.0 seconds
At the same time, the resource monitor shows all Intel cores loaded at 100% and the AMD cores at 35-45%.

Setting the priority and affinity when creating the threads has no effect:

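// Create the worker and pin it to its own logical processor.
threads[core] = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)testMathThread, params, 0, NULL);
::SetThreadPriority(threads[core], THREAD_PRIORITY_HIGHEST);
// Shift in DWORD_PTR: a plain int 1 << core overflows for core >= 31.
::SetThreadAffinityMask(threads[core], (DWORD_PTR)1 << core);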
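
A note on building the affinity mask for a 64-thread machine: the literal 1 is a 32-bit int, so 1 << core is only well-defined for core below 31, and the mask has to be computed in a 64-bit type. A minimal portable sketch, where affinityMaskFor is a hypothetical helper name and not part of the original project:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: builds a one-bit affinity mask for a logical
// processor index in the range 0..63.
// The shift is done on a 64-bit value, so indexes 32..63 do not overflow.
inline std::uint64_t affinityMaskFor(unsigned core)
{
    assert(core < 64);
    return std::uint64_t{1} << core;
}
```

On 64-bit Windows the result fits the DWORD_PTR mask parameter of SetThreadAffinityMask.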

There are no global variables. Each thread has its own object in memory where it stores its results.
If we run the calculation on a single core, 4 million iterations take 7.7 sec on Intel and 9.25 sec on AMD.

Enabling/disabling AMD's SMT in the BIOS makes almost no difference.
Switching the Windows power plan from AMD Ryzen Balanced to Maximum performance increases the processing time by 10-12%, to 45 sec.

It seems that AMD has a limiter somewhere that does not allow the cores to be loaded to 100%: 40-50% and that's all.

The community is my last hope of figuring out what the hell is going on. Otherwise I'm tempted to take it back to the store.

P.S. Yes, the RAM is identical on both machines: DDR4-3200, 32 GB each. But the math uses little memory anyway.


2 answer(s)
thecove, 2021-10-03
@thecove

Update: issue resolved!
After 12 hours with the problem, I managed to figure it out.
The project used a home-grown RNG library based on mt19937, and the person who wrote it 5 years ago made it thread-safe by cramming this into every call:

std::lock_guard guard(mMutex);

I don't know why AMD stalled on these calls longer than Intel, but the fact remains: the Reds lost roughly twice as large a share of their time on them as the Blues. As a result, the Blues show 100% load and the Reds about 50%. As a temporary fix (until the old library is rewritten), I gave each thread its own Random object, modeled on the standard rand() / srand():

__declspec(thread) Random* random = nullptr;

It is a quick-and-dirty solution, but the root cause was found and the accuracy of the calculations was not affected.

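#include <cstdint>
#include <cstdlib>   // RAND_MAX

// Per-thread linear congruential generator using the same constants as the MSVC rand().
class Random
{
public:
    Random() : _rand_state(0) {}

    void srand(unsigned int const seed)
    {
        _rand_state = seed;
    }

    // With MSVC, RAND_MAX is 0x7fff, so the result lies in [0, 0x7fff].
    uint16_t rand()
    {
        _rand_state = _rand_state * 214013 + 2531011;
        return (_rand_state >> 16) & RAND_MAX;
    }

private:
    uint32_t _rand_state;
};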
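
For the eventual rewrite of the old library, one option that keeps mt19937 but avoids the mutex is to give every thread its own generator through thread_local storage. A minimal sketch using the standard <random> header; threadRng and randomBelow are hypothetical names, and seeding from std::random_device is an assumption (a reproducible run would need fixed per-thread seeds instead):

```cpp
#include <cstdint>
#include <random>

// Hypothetical accessor: returns the calling thread's private generator.
// thread_local gives each thread its own instance, constructed on first use,
// so no lock is needed when drawing numbers.
inline std::mt19937& threadRng()
{
    thread_local std::mt19937 rng{std::random_device{}()};
    return rng;
}

// Hypothetical helper: uniform integer in [0, bound); bound must be > 0.
inline std::uint32_t randomBelow(std::uint32_t bound)
{
    std::uniform_int_distribution<std::uint32_t> dist(0, bound - 1);
    return dist(threadRng());
}
```

Because each thread touches only its own state, the generators never contend with each other, which is the property the temporary Random class relies on as well.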

Finally, the results.
Before the fix:
4 million iterations, AMD, 32 threads = 4.05 sec. CPU utilization 45%
4 million iterations, AMD, 64 threads = 3.61 sec. CPU utilization 47%
4 million iterations, Intel, 10 threads = 4.01 sec. CPU utilization 75%
4 million iterations, Intel, 20 threads = 2.61 sec. CPU utilization 100%

After the fix:
4 million iterations, AMD, 32 threads = 1.25 sec. CPU utilization 60% (1 thread per physical core)
4 million iterations, AMD, 64 threads = 0.71 sec. CPU utilization 100% (1 thread per physical core + SMT)
4 million iterations, Intel, 10 threads = 2.8 sec. CPU utilization 70% (1 thread per physical core)
4 million iterations, Intel, 20 threads = 2.1 sec. CPU utilization 100% (1 thread per physical core + HT)

As the new test shows, AMD is about 3 times faster than Intel when all cores are fully used (0.71 sec versus 2.1 sec), in line with the known benchmarks.
Single-core tests showed that Intel is 15-20 percent faster than AMD.

Ronald McDonald, 2021-10-02
@Zoominger

- on Intel when using 20 threads takes 20.5 seconds
- on Amd when using 64 threads takes 37.0 seconds

Make the number of software threads equal to the number of hardware threads of the processor. Otherwise performance is lost to context switching, as the cores jump from thread to thread.
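
A portable way to follow this advice is to ask the runtime how many hardware threads exist and size the worker pool from that. A minimal sketch; workerCount is a hypothetical helper name, and the fallback of 1 is an assumption for platforms where the count cannot be determined:

```cpp
#include <thread>

// Hypothetical helper: number of worker threads to start.
// std::thread::hardware_concurrency() is only a hint and may return 0
// when the value is not computable, hence the fallback.
inline unsigned workerCount()
{
    const unsigned n = std::thread::hardware_concurrency();
    return n != 0 ? n : 1;
}
```

On the two machines in the question this would be expected to report 64 and 20 respectively with SMT / Hyper-Threading enabled.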
