Algorithms
buzzi888, 2015-07-09 16:58:43

Why is an x64 application twice as slow as x86?

I'm new to C. I made my first project in Visual Studio and compiled x86 and x64 release builds.
The task is simple: load a dictionary of 8 million words and build a prefix tree (the implementation, believe it or not, comes from a comment on Stack Overflow - the first thing that compiled and worked).
Searching for hundreds of words:
x86: 2-3 sec.
x64: 6-7 sec.
x64 is exactly twice as slow as x86; the same picture when searching for 1000 words - exactly 2x.
I would like to understand in general terms where such a difference comes from. Most likely there is something about x64 that I don't know or understand.
Win 8.1, VS 2013 Express for Windows Desktop; I didn't touch the build/compile settings.
cl.exe product version 11.00.50727.1
I also tried building with gcc using the -m32 and -m64 flags - x64 is a little slower, about 5-10%.
gcc version - Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
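For reference, the trie is roughly like this (a simplified sketch, not my exact code):

#include <stdlib.h>

/* Simplified trie sketch (not the exact code from the project): lowercase a-z only. */
struct node {
    struct node *child[26];
    int is_word;
};

static struct node *new_node(void) {
    return calloc(1, sizeof(struct node));    /* zeroed node, all children NULL */
}

static void trie_insert(struct node *root, const char *word) {
    for (; *word; word++) {
        int i = *word - 'a';
        if (!root->child[i])
            root->child[i] = new_node();
        root = root->child[i];                /* walk down, allocating edges as needed */
    }
    root->is_word = 1;
}

static int trie_search(const struct node *root, const char *word) {
    for (; *word; word++) {
        root = root->child[*word - 'a'];
        if (!root)
            return 0;                         /* missing edge: word is not in the dictionary */
    }
    return root->is_word;
}

int main(void) {
    struct node *root = new_node();
    trie_insert(root, "hello");
    return trie_search(root, "hello") ? 0 : 1;
}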


5 answer(s)
Deerenaros, 2015-07-09
@Deerenaros

Answering this question without any additional information is like reading coffee grounds. What CPU is it? If it's an ancient Pentium D with an antediluvian pipeline and clumsy registers, that's one thing; if it's a recent Core i7 on Haswell, quite another. As for the settings - honestly, "standard" tells us nothing at all. Not to mention that it wouldn't hurt to state the number of runs along with the minimum and maximum - it's quite possible the OS scheduler is simply allocating time badly. Any answer given here may be technically literate and still completely wrong.
I'll venture a guess at this scenario: we know almost nothing about optimizations and do a "standard" debug build. In that case the compiler inserts special markers into the code so that instructions can later be matched to a file and line number. Obviously idle loops and useless dereferences end up in the object files as well. There can be no talk of caching values or reordering instructions - you get exactly what you asked for.
Now let's think together about what actually differs between x86-64 and x86. Strictly speaking, the question is not well posed - x86-64 almost completely includes x86. What changed is the size of a pointer (address), and the register logic was slightly reworked (all the old registers are still there, a few dozen more were added): some function arguments are now passed in extra registers, whereas in x86 they all go through the stack. However, getting an advantage out of this is not so easy - the processor is no fool either: with linear data processing (or any sustained work on small regions of memory) it caches everything beautifully, and working with the stack is not much slower than working with registers.
Now let's look at the code. What's in there? A pile of address arithmetic, few functions, and almost no arguments. 8 million words? I doubt the recursion will push the stack out of the cache, so I'd expect the architectures to be on par there. But a large amount of address arithmetic combined with an address that is twice as wide in bits... slower by how much? Twice?
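To make that concrete, here is a minimal sketch (a hypothetical node layout, not the poster's actual code) showing how a pointer-heavy trie node roughly doubles in size when pointers grow from 4 to 8 bytes - fewer nodes fit in the caches, so more lookups have to go out to RAM:

#include <stdio.h>

/* Hypothetical trie node: 26 child pointers plus a flag.
 * x86: pointers are 4 bytes -> about 26*4 + 4 = 108 bytes per node.
 * x64: pointers are 8 bytes -> about 26*8 + 4 (+ padding) = 216 bytes per node.
 * With millions of nodes the working set roughly doubles. */
struct trie_node {
    struct trie_node *children[26];
    int is_word;
};

int main(void) {
    printf("sizeof(void *)           = %zu\n", sizeof(void *));
    printf("sizeof(struct trie_node) = %zu\n", sizeof(struct trie_node));
    return 0;
}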
Well, okay - addition is, of course, done in one cycle. Most likely. That really is a question about the particular processor, and even knowing the model it would be hard to find out for sure without a synthetic test (access an address many times, the address being the sum of two random numbers). Also, Windows 8.1 was never a benchmark for performance (quite the opposite), and VC++ was never the best compiler.
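If one actually wanted to run such a synthetic test, a rough sketch might look like the following (buffer size and iteration count are arbitrary choices, not anything specified above):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Rough sketch of the synthetic test mentioned above: repeatedly read memory
 * at an address formed from the sum of two pseudo-random numbers, and time it.
 * Buffer size and iteration count are arbitrary. */
#define BUF_SIZE   (64 * 1024 * 1024)
#define ITERATIONS 50000000UL

int main(void) {
    unsigned char *buf = calloc(BUF_SIZE, 1);
    if (!buf)
        return 1;

    unsigned long sum = 0;
    clock_t start = clock();
    for (unsigned long i = 0; i < ITERATIONS; i++) {
        size_t addr = ((size_t)rand() + (size_t)rand()) % BUF_SIZE;
        sum += buf[addr];                      /* random-access read */
    }
    double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;

    printf("checksum %lu, elapsed %.2f s\n", sum, elapsed);
    free(buf);
    return 0;
}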
Try gcc (I'm just curious where gcc came from on Windows) with the -O3 flag. And look at the generated native code for 64 and 32 bits (you can use objdump from binutils, or view the disassembly in the Visual Studio IDE - I don't remember exactly where the button is, but it can be found in the menus). There is probably not one reason but many. For example, a function call is accompanied by saving the context, and on x64 there are more registers, hence more context to save. Collect such moments bit by bit... and that's how you end up with the difference.
PS A long time ago I discussed this with a teacher. A simple recompile for 64 bits sped the code up by 30%. It was a home-grown codec, somewhat like libx264 (part of the code was in fact lifted from there). Naturally, the project was built with all optimizations and all the instruction-set extensions - with everything possible - and built for the x86-64 platform (with SSE, MMX, FMA and the rest). Terribly science-heavy, motley code (written by everyone from green graduate students to Stroustrup's contemporaries and university professors) - lots of functions, structures, unions, and very, very many parameters, many of them passed as function arguments. And the target platform was a heavily cut-down and reworked Windows Embedded - there was simply nothing else there to schedule.

jcmvbkbc, 2015-07-10
@jcmvbkbc

I would like to understand in general terms where such a difference comes from.

The easiest way to answer this question is usually profiling: it will tell you where the program spends its time; then you only have to understand why it spends it there.
When using gcc, this is done by adding the -pg option to the compile and link commands, running the application, and running gprof afterwards. There is also a profiler for VS.
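A minimal sketch of that gcc/gprof workflow (the file and function names below are purely illustrative):

/* Illustrative gcc + gprof workflow, assuming this file is example.c:
 *
 *   gcc -pg example.c -o example    # compile and link with profiling enabled
 *   ./example                       # run; writes gmon.out in the working directory
 *   gprof example gmon.out          # print the flat profile and call graph
 *
 * The functions below exist only so the profile has something to show. */

static long busy_build(long n) {
    long acc = 0;
    for (long i = 0; i < n; i++)
        acc += i * 3;
    return acc;
}

static long busy_lookup(long n) {
    long acc = 0;
    for (long i = 0; i < n; i++)
        acc += i % 7;
    return acc;
}

int main(void) {
    long r = busy_build(50000000L) + busy_lookup(100000000L);
    return (int)(r & 1);
}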

Sergey Protko, 2015-07-09
@Fesor

Most likely the x64 build does some extra work with memory. In any case, the problem lies somewhere in how pointers are handled and in the compiler optimizations.

ichernob, 2015-07-09
@ichernob

Could it be related to memory pages and addressing?

Leonid Fedotov, 2015-07-09
@iLeonidze

It comes down to RAM.
