C++ / C#
Egorian, 2020-05-10 09:25:15

How to speed up this code (cellular automaton)?

I am making an engine for running cellular automata. Everything was fine until I started processing large maps, where the calculation speed drops sharply, and I have no idea how to bring it back up.
There is a lot of code, but it is extremely simple.
The program works like this: I create a 2D array filled with objects of the Cell class, wrapped in a Grid class. I will give the code first and explain the meaning of some variables and methods afterwards.

// Grid.h
private:
    Rule* rule;
    int ** _ranges;
    int _width;
    int _height;
    /// a regular array, just with an overloaded operator[]
    OverflowArray<OverflowArray<Cell>> _grid;
    std::vector<Cell*> _cells_to_draw;
    ThreadPool * pool;
    std::vector<Cell*> * storage;

    static int ** DivideGridIntoZones(int zones, int length);
    void CalculateZone(int id,const int range[2]);
    void UpdateCellsStates(const int * range);
public:
    Grid(int width, int height, int threadsAmount,sf::RenderTarget& screen);
    ~Grid();
    Cell& GetCell(int x, int y);
    void CalculateCells(int threads=4);
    void DisplayCells();

// Cell.h
enum CellState{
    Empty,
    Alive,
};
class Cell{
private:
    CellState _state;
    // the program first computes each cell's next state, then applies
    // all of them in the following pass -- that is what _nextState is for
    CellState _nextState;
public:
    Cell();
    Cell(int x,int y,CellState state=CellState::Empty){...};
    bool IsAlive(){ return _state==CellState::Alive;}
    CellState & GetNextState()  { return _nextState;}
    CellState & GetState()  { return _state;}
    void SetNextState(CellState state){ _nextState = state;}
};
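The two-phase update described in the comments (compute every cell's next state from the current generation, then commit all of them at once) can be sketched like this. This is my own flattened illustration, not the question's code: the grid type, the step function, and the toroidal wrapping are all assumptions.

```cpp
#include <array>
#include <cstddef>

enum CellState { Empty, Alive };

struct Cell {
    CellState state = Empty;  // what the cell is this generation
    CellState next  = Empty;  // what it becomes next generation
};

// One Game-of-Life generation over a small fixed grid (illustrative).
template <std::size_t W, std::size_t H>
void step(std::array<std::array<Cell, W>, H>& grid) {
    // Phase 1: compute every next state from the *current* states only.
    for (std::size_t y = 0; y < H; ++y) {
        for (std::size_t x = 0; x < W; ++x) {
            int alive = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    if (dx == 0 && dy == 0) continue;
                    std::size_t nx = (x + dx + W) % W;  // toroidal wrap
                    std::size_t ny = (y + dy + H) % H;
                    alive += grid[ny][nx].state == Alive;
                }
            bool a = grid[y][x].state == Alive;
            grid[y][x].next = (alive == 3 || (a && alive == 2)) ? Alive : Empty;
        }
    }
    // Phase 2: commit -- the equivalent of what UpdateCellsStates does per zone.
    for (auto& row : grid)
        for (auto& c : row)
            c.state = c.next;
}
```

Reading only from `state` in phase 1 and writing only to `next` is what makes the two passes safe to run concurrently over disjoint zones.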


I start an infinite loop that calls CalculateCells over and over; all the computation happens there.
Now I will explain what _ranges and DivideGridIntoZones(int zones, int length) are for:
Grid::Grid(...){
    // DivideGridIntoZones splits the width into `threadsAmount` intervals;
    // the code is multithreaded, so each thread processes its own zone.
    // With _width = 100 and threadsAmount = 2 the result is roughly
    // _ranges = {{0, 50}, {50, 100}}
    _ranges = DivideGridIntoZones(threadsAmount, _width);
}
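The question never shows the body of DivideGridIntoZones, so here is one plausible implementation of what it computes, as a hypothetical sketch: split `length` into `zones` near-equal half-open ranges (returning pairs instead of the raw int** the question uses).

```cpp
#include <utility>
#include <vector>

// Hypothetical equivalent of DivideGridIntoZones: returns half-open
// [begin, end) column ranges, one per worker thread. The remainder
// columns are spread over the first zones.
std::vector<std::pair<int, int>> DivideIntoZones(int zones, int length) {
    std::vector<std::pair<int, int>> ranges;
    int base  = length / zones;  // columns per zone
    int rem   = length % zones;  // leftover columns
    int begin = 0;
    for (int i = 0; i < zones; ++i) {
        int end = begin + base + (i < rem ? 1 : 0);
        ranges.emplace_back(begin, end);
        begin = end;
    }
    return ranges;
}
```

With the example from the comment, `DivideIntoZones(2, 100)` yields `{0, 50}` and `{50, 100}`.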

// Grid.cpp
// _width = 1000, _height = 1000, threadsAmount = 8
// with these parameters, on a large map with many live cells, I get 8 FPS
void Grid::CalculateCells(const int threadsAmount) {
    // this part holds up reasonably well and runs in ~0.01 s with these parameters;
    // it is where _state is replaced with the value of _nextState
    for (int i = 0; i < threadsAmount; i++) {
        pool->AddJob([this, i]() { this->UpdateCellsStates(_ranges[i]); });
    }
    pool->WaitAll();

    // my main problem: with a very large pattern this part takes 0.032 s
    for (int i = 0; i < threadsAmount; i++) {
        pool->AddJob([this, i]() { this->CalculateZone(i, _ranges[i]); });
    }
    pool->WaitAll();

    // some code
}

Now I will give the CalculateZone code, but first let me explain the meaning of storage. Into this vector, threads write pointers to the cells that need to be drawn after the calculation. Then I merge storage into _cells_to_draw.
void Grid::CalculateZone(int id, const int range[2]) {
    std::vector<Cell*> temp;
    Cell* cell;
    // here is where range is actually used
    for (int y = 0; y < _height; y++) {
        for (int x = range[0]; x < range[1]; x++) {
            cell = &GetCell(x, y);
            // collect into temp first, to store it into storage at the end
            if (cell->GetState() == CellState::Alive) {
                temp.emplace_back(cell);
            }
            if (!_isPaused) {
                // the cell's next state is computed and set here;
                // I will not quote the code, but it does 9 array accesses
                rule->Execute(cell, x, y);
            }
        }
    }
    // store the result; moving avoids copying the whole vector
    storage[id] = std::move(temp);
}

I think this much code is enough.
My goal is to reach 20 frames per second, but I have no idea how to get there.
I found a good example on GitHub that shows excellent performance.
Here is his rule processing: the author uses a pointer table and bit shifts.
Does that really give such a performance boost? Or is there a huge error in my code that is eating up the performance? I can only assume that I copy values too much somewhere, although I followed the rule "more pointers to the god of pointers", and I have already reviewed the code 10 times.


1 answer
Evgeny Shatunov, 2020-05-10
@Egorian

In a general sense, as far as I can see from your code, you have run into true sharing, picked up cache misses along the way, and finally killed your performance with an unreasonably large cell size.
8 B per cell whose state fits in 1 B is genuinely huge.
Declaring the enum as enum CellState : uint8_t reduces the state size from 4 B to 1 B. The type should also be renamed, because it is not really a cell state but something describing the cell's behavior. The new CellState would then look like this:

// Renamed from CellState.
enum CellBehavior : uint8_t
{
    Empty,
    Alive,
};

struct CellState final
{
    CellBehavior current_behavior : 4;
    CellBehavior next_behavior : 4;
};

This allows you to reduce the cell size to 1 byte.
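A minimal self-contained check of that packing claim (it mirrors the struct above; on common ABIs the two 4-bit fields share one byte):

```cpp
#include <cstdint>

enum CellBehavior : std::uint8_t { Empty, Alive };

struct CellState final {
    CellBehavior current_behavior : 4;
    CellBehavior next_behavior : 4;
};

// Two 4-bit bit-fields of a uint8_t-backed enum pack into a single byte.
static_assert(sizeof(CellState) == 1, "cell state fits in one byte");
```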
The processor pulls data from RAM into its internal cache. A processor has several caches and they are all linked. The cache is divided into cache lines, and work with a line is synchronized between the cores. This is where two terms come in: false cacheline sharing and true cacheline sharing. With false sharing, logically independent data used by different cores happens to land in the same cache line; with true sharing, different cores genuinely need the same data in one line. Either way the cores end up synchronizing over that line, and that is very slow.
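The usual remedy is to give each thread's hot data its own cache line. A minimal sketch, where the 64-byte line size is an assumption (C++17 exposes it as std::hardware_destructive_interference_size in <new>) and the counter struct is purely illustrative:

```cpp
#include <cstddef>

// Per-thread counters padded out to separate cache lines, so one
// core's writes do not invalidate the line another core is reading.
// 64 is the typical x86 cache-line size; verify for your target.
struct alignas(64) PaddedCounter {
    long value = 0;
};

static_assert(alignof(PaddedCounter) == 64, "each counter starts a new line");
static_assert(sizeof(PaddedCounter) == 64, "no two counters share a line");

PaddedCounter counters[8];  // one slot per worker thread
```

The same idea applies to the question's storage array: adjacent vector objects written by different threads can sit on one line.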
Every modern processor also has a fortune teller (the prefetcher) that predicts which data to pull from RAM into the cache next. A fetch from RAM is a long procedure, so the prefetcher has to guess which path your algorithm will take on the next step. When it guesses wrong, your lagorithm stalls in synchronization waiting for the memory fetch to complete, and that is even slower than synchronizing over cache lines. This is called a cache miss.
To be fair, the fortune teller is not to blame for her mistake: you simply wrote the lagorithm that way. To turn a lagorithm back into an algorithm, you should make it friendlier to the prefetcher and the processor cache.
Let me throw in some more useful information.
Look at Adam Martin and at Unity, study the ES/ESP/ECS paradigm. Learn DOD (data-oriented design). Try refactoring from your current stream of entities with fields to streams of entity fields. Redesign the batching of cell processing so that data is not synchronized between processor cores.
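One way to read that batching advice: the question partitions by columns, so every thread walks a strided band across every row; assuming row-major storage, partitioning by rows instead keeps each thread inside one contiguous block of memory, and threads touch each other's cache lines only at band boundaries. A sketch with illustrative names (the question's ThreadPool is replaced by plain std::thread here):

```cpp
#include <functional>
#include <thread>
#include <vector>

// Run `body(beginRow, endRow)` on `threads` workers, giving each a
// contiguous band of rows [beginRow, endRow). With row-major storage
// each worker then scans one contiguous region of memory.
void for_each_row_band(int height, int threads,
                       const std::function<void(int, int)>& body) {
    std::vector<std::thread> workers;
    int base = height / threads;
    int rem  = height % threads;
    int begin = 0;
    for (int i = 0; i < threads; ++i) {
        int end = begin + base + (i < rem ? 1 : 0);
        workers.emplace_back(body, begin, end);
        begin = end;
    }
    for (auto& t : workers) t.join();
}
```

Whether rows or columns are contiguous depends on how OverflowArray nests, so check the actual layout before flipping the split.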
Understanding the out-of-line approach may also help: it explains well why very large objects are unfriendly to the processor cache when you stream through them.
You can also add automatic vectorization to the picture: it lets the compiler emit SIMD instructions for your code. With DOD, processing your cells with SIMD instructions becomes very elegant.
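As an illustration of what auto-vectorization-friendly code looks like under the DOD layout suggested above: flat byte arrays instead of an array of Cell objects, and a branchless inner loop. This is my own sketch; the function and array names are not from either post, and it assumes the neighbour counts are computed in a separate pass.

```cpp
#include <cstddef>
#include <cstdint>

// Branchless Life rule over flat byte arrays: state[i] is 0 or 1,
// neighbours[i] already holds the 3x3 alive count for cell i.
// No branches and no pointer chasing in the loop body, so compilers
// can typically turn it into SIMD instructions on their own at -O2/-O3.
void next_generation(const std::uint8_t* state,
                     const std::uint8_t* neighbours,
                     std::uint8_t* next, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        // Alive next if: exactly 3 neighbours, or alive with exactly 2.
        next[i] = static_cast<std::uint8_t>(
            (neighbours[i] == 3) | (state[i] & (neighbours[i] == 2)));
    }
}
```

Compare this with the original per-cell rule->Execute call: one virtual dispatch plus 9 scattered array accesses per cell is exactly the shape the vectorizer cannot digest.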
I have only sketched directions here and deliberately left much unwritten: while studying what I did describe, you will inevitably pick up everything I did not. You can imagine how much material all of this would amount to if it were written up in a convenient, understandable format with every topic fully covered.
