Performance evaluation
Q2W, 2017-09-29 19:35:29

How to measure code performance in production?

We have a backend server that runs many web applications.
I would like to achieve two goals:
1. Reduce the page generation time so that Google and users love our site.
2. Reduce the consumption of CPU resources so that fewer backends are needed.
So, when implementing various optimizations in the code, we usually take measurements.
Most often we simply measure the average time needed to process one HTTP request.
Averaged per day, because waiting any longer takes too long. Ideally I would not have to wait even a day: watch for 5 minutes, collect the data, compute the average, and conclude whether performance has improved or worsened.
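If per-request wall times are already being logged, turning a 5-minute sample into summary numbers is only a few lines. A minimal sketch (the synthetic sample and the approximate percentile formula are my own illustration, not something from the question):

```perl
#!/usr/bin/perl
# Sketch: summarize a short sample of per-request wall times.
# @sample here is synthetic; in practice it would come from
# 5 minutes of request logs, one duration per request.
use strict;
use warnings;

sub summarize {
    my @t = sort { $a <=> $b } @_;
    my $sum = 0;
    $sum += $_ for @t;
    # returns: sample size, mean, approximate 95th percentile
    return (scalar @t, $sum / @t, $t[int(0.95 * $#t)]);
}

my @sample = map { 0.05 + rand(0.2) } 1 .. 300;   # fake timings, seconds
my ($n, $mean, $p95) = summarize(@sample);
printf "n=%d mean=%.3fs p95=%.3fs\n", $n, $mean, $p95;
```

Comparing the p95 before and after a rollout is usually less noisy than comparing raw means, since a handful of slow outliers can drag the mean around.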
The trouble is that the execution time of even an idealized ("spherical") algorithm depends, among other things, on server load. You can change something in the code, roll it out to production, and see the metric move in the wrong direction, not because the algorithm's efficiency actually changed, but simply because it is a peak hour, or a day of the week on which users send many requests.
I tried measuring only user time (the value the time utility reports as user).
But while the server is below 100% load, this metric is also sensitive to load: a half-idle server runs almost 2 times faster than one loaded to 100%.
Here, 100% load means running as many single-threaded processes eating all available CPU time as the server has physical cores.
That is, where hyperthreading is enabled you need to run half as many such processes as there are visible cores, and where it is disabled, exactly as many as are visible.
Here is how I measured:

#!/usr/bin/perl

use strict;
use warnings;
use Time::HiRes qw(time);

my $forks = shift;   # each fork() below doubles the process count

my $lscpu = `lscpu`;
my ($cpus)           = $lscpu =~ /^CPU\(s\):\s+(\d+)$/m;
my ($threadsPerCore) = $lscpu =~ /^Thread\(s\) per core:\s+(\d+)$/m;
my $cores = $cpus / $threadsPerCore;   # physical cores

# CPU-bound busy loop
sub load {
  my $a = 0;
  $a += rand() foreach (0 .. 100_000_000);
}

fork() for (1 .. $forks);   # 2 ** $forks processes in total

my $u = - times();   # scalar context: user CPU time
my $t = - time();    # wall-clock time
load();
$u += times();
$t += time();

printf "| %d | %d | %d | %.2f | %.2f |\n", $cpus, $cores, 2 ** $forks, $t, $u;

And here are my measurements on a machine with hyperthreading:
|----|-----|-----|-------|---------|
|CPUs|Cores|Procs| Time  |User time|
|----|-----|-----|-------|---------|
|  4 |  2  |  1  | 11.08 |  11.07  |
|  4 |  2  |  2  | 11.70 |  11.69  |
|  4 |  2  |  4  | 19.79 |  19.64  |
|  4 |  2  |  8  | 39.42 |  19.62  |
|  4 |  2  |  16 | 83.36 |  19.86  |
|----|-----|-----|-------|---------|

And on a machine without hyperthreading:
|----|-----|-----|-------|---------|
|CPUs|Cores|Procs| Time  |User time|
|----|-----|-----|-------|---------|
|  2 |  2  |  1  | 23.74 |  23.73  |
|  2 |  2  |  2  | 23.53 |  23.52  |
|  2 |  2  |  4  | 46.78 |  23.38  |
|  2 |  2  |  8  | 93.76 |  23.43  |
|----|-----|-----|-------|---------|

And on this machine, user time is about the same everywhere!
But what's wrong with the first machine? As soon as I load it with more processes than it has physical cores, user time increases.
What is this, the magic of hyperthreading? Yet htop shows that during the test only one virtual core out of 4 was loaded.
I launched it on another machine with hyperthreading:
|----|-----|-----|-------|---------|
|CPUs|Cores|Procs| Time  |User time|
|----|-----|-----|-------|---------|
|  8 |  4  |  1  | 6.23  |  6.18   |
|  8 |  4  |  2  | 6.20  |  6.16   |
|  8 |  4  |  4  | 8.38  |  8.33   |
|  8 |  4  |  8  | 19.95 |  11.90  |
|  8 |  4  |  16 | 33.71 |  11.98  |
|----|-----|-----|-------|---------|

Here user time grows until we have loaded all 8 virtual cores, not just the physical ones.
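One way to probe the hyperthreading question is to pin the workers explicitly and compare placements. This is only a sketch, not a verified experiment: it assumes taskset is installed and that CPUs 0 and 1 are hyperthread siblings of one physical core while 0 and 2 sit on different cores; check `lscpu -e` for the real mapping on your machine.

```perl
#!/usr/bin/perl
# Sketch: run two busy loops pinned to chosen CPUs and time each placement.
# The CPU lists below are assumptions; verify the sibling layout first.
use strict;
use warnings;
use Time::HiRes qw(time);

# the same kind of CPU-bound loop as in the question, just shorter
sub burn {
    my $x = 0;
    $x += rand() for 1 .. 5_000_000;
    return $x;
}

for my $placement ([0, 1], [0, 2]) {
    my $start = time();
    my @pids;
    for my $cpu (@$placement) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            system('taskset', '-cp', $cpu, $$);   # pin this child to one CPU
            burn();
            exit 0;
        }
        push @pids, $pid;
    }
    waitpid $_, 0 for @pids;
    printf "CPUs %s: %.2fs wall\n", join(',', @$placement), time() - $start;
}
```

If siblings on one core run noticeably slower than the same loops on two separate cores, that would be consistent with the two hyperthreads sharing one core's execution units.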


1 answer

alejandro68, 2017-09-29

It's much simpler than that.
If performance is your problem, just log the stages (with timings) of processing each request, each INDIVIDUAL request. Then collect and analyze; even Excel will do.
You don't need to monitor at the core level.
You need to pinpoint exactly which step of your request processing is slow. Only from there do you start worrying about cores and so on.

> Averaged per day, because waiting any longer takes too long.

It's enough to wait a few minutes to collect results for tens or hundreds of requests.
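The per-request logging suggested here might look something like this; a minimal sketch in which the stage names and the sleeps are placeholder stand-ins for real handler code:

```perl
#!/usr/bin/perl
# Sketch: time each stage of one request and emit a single log line.
# Stage names and the sleeps are placeholders for real handler code.
use strict;
use warnings;
use Time::HiRes qw(time sleep);

my %elapsed;
sub timed {
    my ($name, $code) = @_;
    my $start = time();
    $code->();
    $elapsed{$name} = time() - $start;
}

# one simulated request
timed(parse  => sub { sleep 0.01 });   # stand-in for input parsing
timed(db     => sub { sleep 0.05 });   # stand-in for a DB round trip
timed(render => sub { sleep 0.02 });   # stand-in for templating

print join(' ', map { sprintf '%s=%.3f', $_, $elapsed{$_} }
                sort keys %elapsed), "\n";
```

One such line per request, with a request id attached, is enough to aggregate later and see which stage dominates.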
> But what's wrong with the first machine? As soon as I load more processes than it has physical cores, user time increases. What is this, the magic of hyperthreading? But htop shows that during the test only one virtual core out of 4 was loaded.

Without going down to the level of the actual application, you're unlikely to figure this out.
It could simply be lock contention.
And who says your subsystem is really designed to run in parallel?
