Perl
Alexey Al, 2015-01-01 23:45:23

How to explain the different execution speed of nested loops in different languages?

Good evening!
I read a post about how assembly language code runs faster than C.
thedeemon.livejournal.com/49226.html
There, the essence is that there are two int arrays with 256 elements:

The task is very simple: there are two arrays of 16-bit integers; find the sum of the squared differences. In the original these were 16x16 blocks that had to be traversed in a nested loop... I run such a loop 10 million times, measure the time, and also look at what kind of code was generated.

And I was wondering how long this process would take in C and in Python.
I wrote this code in C:
void array_func(int arr1[], int arr2[]);

int main(int argc, char **argv)
{
  int g = 0;
  int a_list[] = { /* 256 random numbers from 1 to 100 */ };
  int b_list[] = { /* 256 random numbers from 1 to 100 */ };

  for (g = 0; g < 1000000; g++)
    array_func(a_list, b_list);
  return 0;
}

void array_func(int arr1[], int arr2[])
{
  int sum = 0, j = 0;
  int x = 0, y = 0;
  for (y = 0; y < 16; y++)
  {
    for (x = 0; x < 16; x++)
    {
      short v = arr1[j] - arr2[j];
      sum += v * v;
      j++;
    }
  }
}

I have the same code in Python:
def main_func(a_list, b_list):
    summ = 0  # initialize the sum
    j = 0     # independent counter from 0 to 255
    for y in range(0, 16):
        for x in range(0, 16):
            p = a_list[j] - b_list[j]
            summ += p * p
            j += 1

# a_list and b_list are filled with the same 256 random numbers as in the C version
for g in range(0, 1000000):
    main_func(a_list, b_list)

I measured the execution time using the Linux time command.
And I discovered the following things, amazing to me:
  1. Python 2 completes a million iterations faster than Python 3: 40 s vs 52-55 s
  2. Perl (I wrote in it, admittedly, almost blindly) works through the same million in two and a half minutes
  3. C does ten million in 11 seconds
  4. If everything is rewritten as a single function (i.e. three nested loops), Python and C run a couple of ms faster, while Perl runs a whole minute faster (!?)
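One source of noise in numbers like these is that the `time` command measures the whole process, interpreter startup included. A minimal sketch of timing just the function itself with Python's standard timeit module (the array contents here are illustrative placeholders, not the original data):

```python
# Time only the function under test with timeit, excluding interpreter
# startup. Timing 1000 calls and scaling up avoids a long wait.
import random
import timeit

a_list = [random.randint(1, 100) for _ in range(256)]
b_list = [random.randint(1, 100) for _ in range(256)]

def main_func(a_list, b_list):
    summ = 0
    j = 0
    for y in range(16):
        for x in range(16):
            p = a_list[j] - b_list[j]
            summ += p * p
            j += 1
    return summ

t = timeit.timeit(lambda: main_func(a_list, b_list), number=1000)
print("estimated time for 10**6 calls: %.1f s" % (t * 1000))
```

The extrapolation assumes per-call cost stays constant, which holds well for a pure CPU loop like this one.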

I'm just starting to learn to program, and I have a few questions:
  1. Why is Python 3 slower than Python 2 in this case?
  2. Why is Perl so slow, and why does it behave so strangely?
  3. And most importantly: why does the same code in JavaScript run in Mozilla in 6.8 seconds?? How can it be faster than compiled C?

PS The JavaScript code:
// a_list and b_list are filled with the same 256 random numbers
function main_func(a_list, b_list) {
    for (var g = 0; g < 10000000; g++) {
        var summ = 0;
        var j = 0;
        for (var y = 0; y < 16; y++) {
            for (var x = 0; x < 16; x++) {
                var p = a_list[j] - b_list[j];
                summ += p * p;
                j++;
            }
        }
    }
}
console.time('test');
main_func(a_list, b_list);
console.timeEnd('test');

4 answer(s)
N
nirvimel, 2015-01-02
@palkokrut

I wasn't happy with how you assessed Python's performance, so I took it upon myself to tweak your example slightly to show completely different results. Besides numpy (pip install numpy), running my example requires installing another interesting library, numba (pip install numba). Its installation can be tricky on some operating systems (it also depends on LLVM), but believe me, it's worth it: the resulting performance numbers should please you.
To demonstrate the real speed of the computation, a million iterations is too few for my example; the CPU doesn't even have time to warm up properly. Note that I replaced a million iterations with 100 million, so the result must be divided by 100 when comparing with the other languages. Here is the code itself:

from numba import jit
import numpy


@jit
def inner_func(a_list, b_list):
    sum = 0
    j = 0
    for y in range(0, 16):
        for x in range(0, 16):
            p = a_list[j] - b_list[j]
            sum += p * p
            j += 1
    return sum


@jit
def outer_func(a_list, b_list):
    sum = 0
    for g in range(0, 100000000):  # 100 000 000 == 10^8 !!!
        sum += inner_func(a_list, b_list)
    return sum


def main():
    maxint = numpy.iinfo(numpy.intc).max
    a_list = numpy.random.randint(maxint, size=256)
    b_list = numpy.random.randint(maxint, size=256)
    sum = outer_func(a_list, b_list)
    print(sum)


if __name__ == '__main__':
    main()

If you've managed to get this up and running and are impressed with the numbers, then I would ask that you tweak the Python part of your question to do justice to this great language.
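For comparison, the inner loop can also be collapsed into a single vectorized numpy expression with no numba and no explicit Python loop at all; a minimal sketch (using small random values from 1 to 100, as in the original question, to avoid any overflow concerns):

```python
# The whole 16x16 double loop as one vectorized numpy expression:
# element-wise difference, square, then sum over all 256 elements.
import numpy

a = numpy.random.randint(1, 101, size=256).astype(numpy.int64)
b = numpy.random.randint(1, 101, size=256).astype(numpy.int64)

d = a - b
result = int((d * d).sum())  # sum of squared differences
```

The per-call overhead is higher than the numba-compiled loop for arrays this small, but the code is shorter and needs no extra dependency beyond numpy.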

O
Oleg Tsilyurik, 2015-01-02
@Olej

I'm just starting to learn how to program, and I have a few questions:

Actually, such "measurements" are largely nonsense.
Your results will depend on:
- the optimization level set for the compiler...
- the particular compiler for the same language (Clang from the much-touted LLVM will do worse here than GCC)
- how your arrays (especially larger ones) land in the processor caches, and at which cache level, which depends on how the arrays are laid out in memory, the order of traversal, etc. (and the difference in speed may be not 15% but 4-5 times or more)
- the number of cores and the language runtime's ability to use multiple processors...
- ...the time of sunrise at your latitude ;-)
So such "measurements" are a rather meaningless exercise; they can only be trusted to within an order of magnitude.
For curiosity's sake, see Programming Languages: Speed and Comparative Review of Programming Languages.
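The run-to-run noise behind this point is easy to see for yourself; a small sketch using timeit.repeat (the timed statement is an arbitrary toy workload) shows how much a single measurement can wander:

```python
# Repeat the same measurement several times: reporting one number
# hides how noisy a benchmark run actually is.
import timeit

samples = timeit.repeat("sum(x * x for x in range(1000))",
                        repeat=5, number=1000)
print("min %.3fs  max %.3fs" % (min(samples), max(samples)))
```

The usual advice is to report the minimum of several repeats, since the slower runs mostly reflect interference from the rest of the system.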

V
VZVZ, 2015-01-02
@VZVZ

In general, in such cases everything comes down to interpretation (or to the execution of machine code, if it is machine code; in essence that is the same kind of interpretation).
Namely, to two factors:
1) How well the interpreter itself is optimized: the code-parsing algorithm, plus whether the interpreter performs too many extra operations unrelated to executing the code (for example, loading something at startup).
> I measured the execution time using the Linux time command.
That is, you measured not the running time of the algorithm but the running time of the program as a whole, from launch to exit? There can be many reasons for a difference there: perhaps the interpreter does not start instantly, and what exactly it loads and does at startup only its authors know.
2) How minimized (and generally how well suited to interpretation) the interpreted code itself is. This is why code in a binary format (machine code, bytecode) is potentially faster than code in a text format. First, a binary is simply more compact: instead of the word "function" (at least 8 bytes), there may be just one or two opcode bytes. Second, a binary format is strict; everything in it is unambiguous, and there is none of this mess with spaces, tabs, etc., so the interpreter has less work figuring out what each byte means in the context of the bytes before and after it. Minifying the code with utilities makes it more compact, but the interpreter still has to process tabs, spaces, etc., so even minified code in text format is not always much faster.
As for C, standard compilers by default simply litter the output with all sorts of "junk"; this is plainly visible if you open the result in OllyDbg and compare it, at least in length, with the code produced by assemblers.
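The bytecode point can be seen directly in CPython, which compiles source text to bytecode once and then executes opcodes rather than characters; a small sketch with the standard dis module (exact opcode names vary between Python versions):

```python
# Show the compiled bytecode for a tiny function: the interpreter
# executes these compact opcodes, not the source text.
import dis

def sq_diff(a, b):
    return (a - b) * (a - b)

dis.dis(sq_diff)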

V
Vasily, 2015-01-02
@Foolleren

You forgot to do the calculation multi-threaded in C, which is why it lost to JavaScript.
As for libraries, C has plenty of them.
Different languages also work with arrays differently: C, for example, simply skips array bounds checking, leaving it on the programmer's conscience. A lot also depends on how the arrays are laid out in memory, and on alignment, not to mention tricks like loop unrolling. Here the loop length is known in advance, so the compiler should unroll such a loop as a matter of course: fewer conditional jumps means faster code.
