Urukhayy, 2015-03-20 09:59:44

Are there any ground rules for super optimization?

Are there any generally recommended rules to follow when optimizing?


5 answers
Igor Ermolaev, 2015-03-20
@Urukhayy

Here is a detailed answer for C++ code, but it generalizes to other languages and platforms:
If the application is already written, the first step is to run a profiler and find the bottlenecks (the code sections that take the most time to execute). Once such areas are found, you can start optimizing them. It is important to remember that, as a rule, any optimization hurts the readability of the code, so you should not optimize where it is not actually required. It is also easy to make mistakes while optimizing, so it is highly desirable to cover the functionality being optimized with unit tests beforehand.
1) It is important to understand that algorithmic optimization almost always gives a better payoff than low-level code optimization. If an algorithm has complexity O(n^2), then on large inputs no amount of tuning will make it faster than an unoptimized O(n) algorithm. However, you should not rely on complexity alone when choosing an algorithm: if the input is small, an O(n^2) algorithm may well run faster than an O(n) one (the first sketch after this list illustrates the pair).
2) It is very important to reuse (cache) intermediate data computed by the program, because the fastest work is the work that does not have to be done. Still, do not get carried away with caching: if the volume of cached data grows too large, it can hurt overall performance, since some data is faster to recompute on the fly than to read from memory (second sketch below).
3) Avoid excessive copying of data, for example by passing complex data types by reference rather than by value (third sketch below).
4) If possible, avoid conditional statements in nested loops: the conditional jumps they produce are handled poorly by the pipelined architecture of modern processors (fourth sketch below).
5) The data used by the algorithms should, if possible, be laid out in order in memory and accessed sequentially. This lets the processor cache it effectively, and accessing the CPU cache is much faster than accessing RAM (fifth sketch below).
6) If the algorithm allows it, it may be worth executing it in parallel (in separate threads or processes). This makes effective use of modern multi-core processors (sixth sketch below).
7) In some cases (for example, image processing), a large gain can be had from specialized processor extensions (SSE, SSE2, AVX, AVX2, and others). Most modern compilers (GCC, MSVC, ICC) allow these extensions to be used directly from C++ code via special built-in functions (intrinsics). The drawbacks of this approach are the loss of portability (which can be worked around by keeping separate code paths for different processors) and a significant complication of the program (seventh sketch below).
8) A large gain can also be achieved with specialized accelerators such as GPUs (CUDA, OpenCL). The drawbacks of such solutions are the loss of generality, a significant complication of the program, and the fact that, as a rule, not every algorithm maps well onto a specialized accelerator.
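
First sketch, for point 1: the same task (checking a vector for duplicates) solved in O(n^2) and in O(n). The function names are illustrative, not from any library.

    #include <unordered_set>
    #include <vector>

    // O(n^2): compares every pair; fine for tiny inputs.
    bool has_duplicates_quadratic(const std::vector<int>& v) {
        for (size_t i = 0; i < v.size(); ++i)
            for (size_t j = i + 1; j < v.size(); ++j)
                if (v[i] == v[j]) return true;
        return false;
    }

    // O(n) on average: one pass with a hash set; wins on large inputs,
    // but the set's overhead can make it slower on very small ones.
    bool has_duplicates_linear(const std::vector<int>& v) {
        std::unordered_set<int> seen;
        for (int x : v)
            if (!seen.insert(x).second) return true;
        return false;
    }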
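
Second sketch, for point 2: a minimal memoization wrapper around a hypothetical expensive() function; in real code the cache should be bounded, for the reason given above.

    #include <cmath>
    #include <unordered_map>

    // Hypothetical expensive computation.
    double expensive(int n) {
        double r = 0;
        for (int i = 1; i <= 1000000; ++i) r += std::sin(n * 1.0 / i);
        return r;
    }

    // Memoized wrapper: reuses previous results instead of recomputing.
    // If the map grows too large, reading it from memory can cost more
    // than recomputing the value, so bound its size in real code.
    double expensive_cached(int n) {
        static std::unordered_map<int, double> cache;
        auto it = cache.find(n);
        if (it != cache.end()) return it->second;
        return cache[n] = expensive(n);
    }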
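
Third sketch, for point 3: the same function taking a large container by value and by const reference; sum_by_value and sum_by_ref are made-up names.

    #include <vector>

    // By value: the whole vector is copied on every call.
    long sum_by_value(std::vector<long> v) {
        long s = 0;
        for (long x : v) s += x;
        return s;
    }

    // By const reference: only a reference is passed, no copy is made.
    long sum_by_ref(const std::vector<long>& v) {
        long s = 0;
        for (long x : v) s += x;
        return s;
    }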
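
Fourth sketch, for point 4: a branchy loop rewritten so the condition becomes arithmetic that the compiler can turn into a conditional move or SIMD code. Whether this actually wins depends on the data and the compiler, so measure.

    #include <vector>

    // Branchy version: the if inside the hot loop produces a
    // hard-to-predict conditional jump on every element.
    long sum_positive_branchy(const std::vector<int>& v) {
        long s = 0;
        for (int x : v)
            if (x > 0) s += x;
        return s;
    }

    // Branchless version: the comparison is folded into arithmetic,
    // removing the conditional jump from the loop body.
    long sum_positive_branchless(const std::vector<int>& v) {
        long s = 0;
        for (int x : v)
            s += (x > 0) * x;
        return s;
    }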
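
Fifth sketch, for point 5: the same matrix summed in row order and in column order; assumes a non-empty rectangular matrix.

    #include <vector>

    // Row order: walks memory sequentially, so the CPU cache and
    // prefetcher work well.
    long sum_rows_first(const std::vector<std::vector<int>>& m) {
        long s = 0;
        for (size_t i = 0; i < m.size(); ++i)
            for (size_t j = 0; j < m[i].size(); ++j)
                s += m[i][j];
        return s;
    }

    // Column order over the same data: each access jumps to a
    // different row, missing the cache on almost every element.
    long sum_cols_first(const std::vector<std::vector<int>>& m) {
        long s = 0;
        for (size_t j = 0; j < m[0].size(); ++j)
            for (size_t i = 0; i < m.size(); ++i)
                s += m[i][j];
        return s;
    }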
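
Sixth sketch, for point 6: splitting a reduction between two threads with std::async; this only pays off when each half is large enough to amortize the cost of starting a thread.

    #include <future>
    #include <numeric>
    #include <vector>

    // Sums the two halves of the vector on two threads.
    long parallel_sum(const std::vector<long>& v) {
        auto mid = v.begin() + v.size() / 2;
        // Lower half runs on a separate thread.
        auto lower = std::async(std::launch::async, [&] {
            return std::accumulate(v.begin(), mid, 0L);
        });
        // Upper half runs on the current thread meanwhile.
        long upper = std::accumulate(mid, v.end(), 0L);
        return lower.get() + upper;
    }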
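
Seventh sketch, for point 7: summing floats four at a time with SSE intrinsics. For brevity it assumes the length is a multiple of 4; a real version would handle the tail.

    #include <immintrin.h>  // SSE/AVX intrinsics

    float sse_sum(const float* a, int n) {
        __m128 acc = _mm_setzero_ps();
        // Process four floats per iteration.
        for (int i = 0; i < n; i += 4)
            acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));
        // Fold the four lanes of the accumulator into one scalar.
        float tmp[4];
        _mm_storeu_ps(tmp, acc);
        return tmp[0] + tmp[1] + tmp[2] + tmp[3];
    }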

Armenian Radio, 2015-03-20
@gbg

The main question is: what are we optimizing for? Memory? Speed? Program size? Ease of modification? It is like an RPG: pump strength and the character gets dumber; raise the speed and it gobbles up a lot of memory. And so on.

Loot, 2015-03-21
@Cesavel

www.highload.com

Optimus Pyan, 2015-03-20
@marrk2

PHP

Sergey Melnikov, 2015-03-26
@RainM

Performance can also be squeezed out of the compiler by playing with flags. For starters you can try -O2, and then LTO (IPO), though compile time will grow.
If the algorithm is parallelizable, then, as previous answers pointed out, OpenMP can be used; newer versions of the standard support both explicit vectorization and offloading to an accelerator (see the sketch below).
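
A minimal sketch of what that looks like: a reduction parallelized and vectorized with an OpenMP 4.0 pragma. The flag names in the comment are for GCC/Clang; other compilers spell them differently.

    #include <vector>

    // Compile with e.g.: g++ -O2 -fopenmp demo.cpp
    // (add -flto to try link-time optimization as well)
    double omp_sum(const std::vector<double>& v) {
        double s = 0.0;
        // Distribute iterations across threads and vectorize each
        // thread's chunk; reduction(+:s) merges the partial sums.
        #pragma omp parallel for simd reduction(+:s)
        for (long i = 0; i < (long)v.size(); ++i)
            s += v[i];
        return s;
    }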
In general, the best tool for working on performance is VTune: you can immediately see what is running slowly and where. You can then start from the profiling results and look at which optimization would bring the biggest overall gain.
