How does hardware parallelization work?

Processors

TimkaTV, 2016-01-08 01:53:44

Good afternoon,
Sorry if this is a naive question.
I would like to know whether processors parallelize instructions in hardware.
How can something like that be implemented?
One more thing: how does the operating system select the core to execute on (for example, on x86-64)? Is there perhaps some register through which a selection circuit sends an enable signal? Thank you!

2 answer(s)

Mercury13, 2016-01-08
@Mercury13

I'll try to explain.
Step 1. Pipelined architecture (on x86, the i486; the first Pentium already went a step further).
It works something like a multi-barreled machine gun: one cartridge is being loaded, another is being fired, a third is being ejected. In the same way, one instruction is being fetched, a second is being decoded, and a third and fourth are being executed.
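To make the machine gun picture concrete, here is a minimal sketch of an ideal pipeline with no stalls. The four stage names and both function names are assumptions chosen for illustration; they do not describe any particular processor.

```python
# Hypothetical 4-stage pipeline; real processors have different (and more) stages.
STAGES = ["fetch", "decode", "execute", "writeback"]


def pipeline_schedule(n_instructions, stages=STAGES):
    """Map each cycle to the (instruction index, stage) pairs active in it.

    Assumes an ideal pipeline: no stalls, no branches, one instruction
    enters per cycle.
    """
    schedule = {}
    for i in range(n_instructions):
        for s, name in enumerate(stages):
            # Instruction i enters stage s in cycle i + s.
            schedule.setdefault(i + s, []).append((i, name))
    return schedule


def total_cycles(n_instructions, n_stages, pipelined=True):
    """Cycles needed to finish all instructions, with or without overlap."""
    if n_instructions == 0:
        return 0
    if pipelined:
        return n_instructions + n_stages - 1
    return n_instructions * n_stages


if __name__ == "__main__":
    for cycle, work in sorted(pipeline_schedule(5).items()):
        print(cycle, work)
    print("pipelined:", total_cycles(5, 4))
    print("sequential:", total_cycles(5, 4, pipelined=False))
```

In cycle 3 all four stages are busy with four different instructions, which is exactly the "loaded, fired, ejected" overlap.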
Step 2. Superscalar execution (on x86, starting with the first Pentium, which had two integer pipelines).
There are several execution units (in this case, integer ones). If two instructions do not conflict with each other, they can run on both units in parallel.
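A rough sketch of what "do not conflict" can mean for two units, assuming in-order issue and checking only two kinds of conflict: the second instruction reads what the first writes, or both write the same register. The `Instr` tuple and the pairing rule are simplifications invented for this example; real pairing rules are considerably longer.

```python
from collections import namedtuple

# Hypothetical instruction record: text for display, one destination, source registers.
Instr = namedtuple("Instr", "text dest srcs")


def can_pair(first, second):
    """True if the two instructions may issue in the same cycle."""
    reads_result = first.dest in second.srcs   # read-after-write conflict
    same_target = first.dest == second.dest    # write-after-write conflict
    return not (reads_result or same_target)


def dual_issue(program):
    """Group an in-order program into cycles of one or two instructions."""
    cycles = []
    i = 0
    while i < len(program):
        if i + 1 < len(program) and can_pair(program[i], program[i + 1]):
            cycles.append([program[i], program[i + 1]])
            i += 2
        else:
            cycles.append([program[i]])
            i += 1
    return cycles


PROGRAM = [
    Instr("add eax, ebx", "eax", ("eax", "ebx")),
    Instr("add ecx, edx", "ecx", ("ecx", "edx")),  # independent of the first
    Instr("add esi, ecx", "esi", ("esi", "ecx")),
    Instr("add edi, esi", "edi", ("edi", "esi")),  # needs esi from the previous one
]

if __name__ == "__main__":
    for n, group in enumerate(dual_issue(PROGRAM)):
        print(n, [ins.text for ins in group])
```

The first two instructions share a cycle; the last two cannot, because the fourth needs the result of the third.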
Step 3. Micro-operations and VLIW (if I'm not mistaken, among x86-compatible processors VLIW was first used in the Transmeta Crusoe).
It consists of these steps.
1. We break the x86 instructions into micro-operations, for example "move eax to the adder" or "shift the contents of the adder right by 1".
2. We assemble the "very long instruction word" from micro-operations, making sure there are no data dependencies between them. One adder receives a word from eax while, in parallel, a second one shifts by 1. Each field (group of bits) of the long word controls its own processor unit: an adder, memory, input/output, and so on.
3. Then we execute that word.
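The packing in step 2 can be sketched as a greedy, in-order bundler. Everything here is an assumption made for illustration: the `MicroOp` record, the slot layout (two adders and one memory port per word), and the rule that ops inside one word read their inputs before any of them writes. It is not how Crusoe or any real processor builds its words.

```python
from collections import Counter, namedtuple

# Hypothetical micro-op: display text, unit kind, destination (or None), sources.
MicroOp = namedtuple("MicroOp", "text unit dest srcs")

# Assumed layout of one long word: two adder slots and one memory slot.
SLOTS = {"alu": 2, "mem": 1}


def fits(bundle, op):
    """True if op can join the bundle: a free slot and no data dependency."""
    used = Counter(o.unit for o in bundle)
    if used[op.unit] >= SLOTS[op.unit]:
        return False
    for o in bundle:
        # op must not read or overwrite a value produced inside this bundle.
        if o.dest is not None and (o.dest in op.srcs or o.dest == op.dest):
            return False
    return True


def pack_bundles(ops):
    """Pack micro-ops, in program order, into as few long words as the rules allow."""
    bundles = []
    for op in ops:
        if bundles and fits(bundles[-1], op):
            bundles[-1].append(op)
        else:
            bundles.append([op])
    return bundles


EXAMPLE_OPS = [
    MicroOp("t0 <- eax", "alu", "t0", ("eax",)),
    MicroOp("t1 <- t1 >> 1", "alu", "t1", ("t1",)),
    MicroOp("t2 <- load [ebx]", "mem", "t2", ("ebx",)),
    MicroOp("t0 <- t0 + t2", "alu", "t0", ("t0", "t2")),
    MicroOp("store [ecx] <- t1", "mem", None, ("ecx", "t1")),
]

if __name__ == "__main__":
    for n, bundle in enumerate(pack_bundles(EXAMPLE_OPS)):
        print(n, [op.text for op in bundle])
```

The first three micro-ops fill one word (both adders plus the memory port); the addition has to wait for the next word because it needs `t0` and `t2`.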
All of these architectures rely on so-called branch prediction. For any of this to work, several instructions have to be fetched and decoded ahead of time. The problem is branches: if we guess wrong about whether a branch is taken, all the preparatory work is wasted. In microcontrollers with a short pipeline and execution times that are predictable down to the cycle, this hardly matters. For example, the AVR documentation describes a two-stage pipeline: one cycle for fetch and decode and one (two, three) for execution. Usually the fetch/decode cycle is invisible (and not listed in the instruction timings), but it is lost whenever a branch is taken.
On x86, the branch prediction algorithms are quite complex.
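For a feel of what a predictor does, here is one of the simplest textbook schemes, a 2-bit saturating counter per branch. It is only a sketch: the class name, the initial counter value, and the example address are assumptions, and real x86 predictors are far more elaborate than this.

```python
class TwoBitPredictor:
    """One 2-bit saturating counter per branch address.

    Counter values 0 and 1 predict "not taken", 2 and 3 predict "taken".
    """

    def __init__(self, initial=1):
        self.initial = initial   # assumed starting state: weakly not taken
        self.counters = {}       # branch address -> counter in 0..3

    def predict(self, addr):
        return self.counters.get(addr, self.initial) >= 2

    def update(self, addr, taken):
        c = self.counters.get(addr, self.initial)
        self.counters[addr] = min(3, c + 1) if taken else max(0, c - 1)


def count_mispredictions(outcomes, addr=0x400):
    """Run one branch through the predictor and count wrong guesses."""
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict(addr) != taken:
            misses += 1
        predictor.update(addr, taken)
    return misses


# A loop branch: taken 9 times, then falls through once; the loop runs 3 times.
LOOP_OUTCOMES = ([True] * 9 + [False]) * 3

if __name__ == "__main__":
    print(count_mispredictions(LOOP_OUTCOMES), "misses out of", len(LOOP_OUTCOMES))
```

After the first warm-up miss, the counter only gets the loop exit wrong, so a single exit does not flip the prediction for the next run of the loop.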
PS. Modern superscalar processors have several dozen physical registers that are assigned dynamically: right now EAX = r5, and two instructions later it is already r13. This is called "register renaming".
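A toy version of renaming, assuming a simple rule: every write gets a fresh physical register, and every read uses the current mapping. The function name, the register counts, and the fact that registers are never returned to the free list are all simplifications for this sketch.

```python
def rename(program, arch_regs=("eax", "ebx", "ecx"), n_physical=16):
    """Rewrite (dest, srcs) instructions to use physical registers r0..r15.

    Each write is given a new physical register, so a later write to the
    same architectural register no longer clashes with earlier readers.
    Registers are never freed here, which a real processor must do.
    """
    mapping = {reg: f"r{i}" for i, reg in enumerate(arch_regs)}
    free = [f"r{i}" for i in range(len(arch_regs), n_physical)]
    renamed = []
    for dest, srcs in program:
        new_srcs = tuple(mapping[s] for s in srcs)  # read the current names first
        new_dest = free.pop(0)                      # then allocate a fresh target
        mapping[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed, mapping


PROGRAM = [
    ("eax", ("eax", "ebx")),  # eax = eax + ebx
    ("ecx", ("eax",)),        # ecx = eax
    ("eax", ("ebx",)),        # eax = ebx  (reuses the name eax)
]

if __name__ == "__main__":
    renamed, mapping = rename(PROGRAM)
    for dest, srcs in renamed:
        print(dest, "<-", srcs)
    print(mapping)
```

The third instruction writes a different physical register than the one the second instruction reads, so the two no longer have to wait for each other even though both mention `eax`.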

Sergey, 2016-01-08
Protko @Fesor

Well, just imagine: you have a stream of instructions. Say it contains, one after another, instructions along the lines of "add two numbers", "copy a value from memory into a register", and "send something over the bus to a device". All three need different resources: the first needs a free ALU, the second needs control signals sent to RAM, and the third needs something else again. They also take very different amounts of time.
The conclusion: we can reorder instructions inside the processor pipeline and dispatch each one for execution as the resources it needs become available.
Or, for example, suppose a processor core has 4 ALUs. Then, ideally, we can perform 4 arithmetic operations at once. This can be parallelized when such instructions come one after another and do not depend on each other's results.
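As a rough illustration of dispatching instructions when resources are free, the sketch below issues up to 4 ready arithmetic operations per cycle. It assumes single-cycle operations and that each register is written only once in the program, so only "needs an earlier result" dependencies exist. The function names and the example program are made up for this sketch.

```python
def find_deps(program):
    """For each (dest, srcs) instruction, the indices of earlier producers it needs."""
    last_writer = {}
    deps = []
    for i, (dest, srcs) in enumerate(program):
        deps.append({last_writer[s] for s in srcs if s in last_writer})
        last_writer[dest] = i
    return deps


def schedule(program, n_alus=4):
    """Issue up to n_alus ready instructions per cycle, oldest first.

    Returns a list of cycles; each cycle is a list of instruction indices.
    """
    deps = find_deps(program)
    finished = set()
    pending = list(range(len(program)))
    cycles = []
    while pending:
        # Ready means every producer has finished in an earlier cycle.
        ready = [i for i in pending if deps[i] <= finished][:n_alus]
        cycles.append(ready)
        finished.update(ready)
        pending = [i for i in pending if i not in ready]
    return cycles


PROGRAM = [
    ("a", ("x", "y")),  # 0: a = x + y
    ("b", ("x", "z")),  # 1: b = x + z
    ("c", ("y", "z")),  # 2: c = y + z
    ("d", ("a", "b")),  # 3: d = a + b   needs 0 and 1
    ("e", ("c", "d")),  # 4: e = c + d   needs 2 and 3
    ("f", ("x", "x")),  # 5: f = x + x   independent
]

if __name__ == "__main__":
    print("4 ALUs:", schedule(PROGRAM))
    print("1 ALU: ", schedule(PROGRAM, n_alus=1))
```

With 4 ALUs the six operations finish in 3 cycles instead of 6, and instruction 5 runs ahead of instructions 3 and 4 because nothing it needs is still in flight.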
In general, it is best to read up on processor pipelines. It is a very big topic.
As for how the OS picks a core, read about schedulers, for example here: Process Scheduling in Linux