How to optimize CUDA?

CUDA

Type Programmer, 2020-08-04 20:18:02

I have a piece of code that looks quite simple to me, but it takes as long as 5 milliseconds to run, which seems like a lot for parallel computing.

__global__ void kernel_compute_global_lighting(float* device_lenght_buff, CudaRenderWindow render_window, CudaRenderCamera camera, CudaRenderMap map, CudaRenderTextures textures, CudaRenderLight lights) {
  int pixel_coordinate_y = blockIdx.x * blockDim.x + threadIdx.x;
  int pixel_coordinate_x = blockIdx.y * blockDim.y + threadIdx.y;
  if (pixel_coordinate_y >= render_window.render_height || pixel_coordinate_x >= render_window.render_width)
    return;
  render_window.device_rendered_window[pixel_coordinate_y*render_window.render_width + pixel_coordinate_x].r = render_window.device_render_window[pixel_coordinate_y*render_window.render_width + pixel_coordinate_x].r * lights.device_light_pointers[0]->r;
  render_window.device_rendered_window[pixel_coordinate_y*render_window.render_width + pixel_coordinate_x].g = render_window.device_render_window[pixel_coordinate_y*render_window.render_width + pixel_coordinate_x].g * lights.device_light_pointers[0]->g;
  render_window.device_rendered_window[pixel_coordinate_y*render_window.render_width + pixel_coordinate_x].b = render_window.device_render_window[pixel_coordinate_y*render_window.render_width + pixel_coordinate_x].b * lights.device_light_pointers[0]->b;
}

To launch this kernel, I use the following parameters:
Grid
X: (900 + 31) / 32
Y: (1400 + 31) / 32
Threads
X: 32
Y: 32
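
For reference, the launch shape described above can be written out as a small host-side sketch. The helper names `div_up`, `question_block` and `question_grid` are made up for illustration and do not appear in the original code; the grid's X dimension covers the height because the kernel derives `pixel_coordinate_y` from `blockIdx.x`.

```cuda
#include <cuda_runtime.h>

// Ceiling division: how many blocks of `block` threads are needed to cover `n` items.
// (Hypothetical helper, equivalent to the (n + 31) / 32 arithmetic in the question.)
constexpr unsigned int div_up(unsigned int n, unsigned int block) {
    return (n + block - 1) / block;
}

// Block shape used in the question: 32 x 32 threads.
inline dim3 question_block() {
    return dim3(32, 32);
}

// Grid shape used in the question: X covers the rows (height), Y covers the columns (width).
inline dim3 question_grid(unsigned int render_height, unsigned int render_width) {
    const dim3 block = question_block();
    return dim3(div_up(render_height, block.x), div_up(render_width, block.y));
}
```

With the numbers from the question, `question_grid(900, 1400)` gives 29 x 44 blocks of 32 x 32 = 1024 threads each.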

I tried to reduce the indirect addressing and introduced an intermediate object for the calculations to cut down the number of memory accesses. But everything that I expected to speed up execution only makes it slower.
How am I then supposed to handle images larger than 1024x800?

Just tell me what affects it the most; I'm tired of digging without results...

1 answer(s)

Sergey Tikhonov, 2020-08-05
@MegaCraZy6

First, make sure you measured the execution time of the kernel itself, and not the combined time of uploading the data, running the kernel and downloading the result.
Second, try getting rid of the if altogether (you can pad the buffers with unused elements up to a multiple of the block size).
Third, multiply each pixel by the light as a single vector operation instead of three separate per-channel statements.
Fourth, check the block size limits of your card; your launch configuration might not fit.
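
The four points above can be sketched roughly as follows. This is only an illustration under several assumptions, not the asker's actual code: pixels are stored as `float4` (r, g, b, unused) rather than the original pixel type, the light colour is passed to the kernel by value instead of through `lights.device_light_pointers`, the image is treated as a flat 1D buffer, and the names `mul_rgb`, `light_kernel` and `run_lighting` are hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Third point: one component-wise multiply per pixel (the w component is unused padding).
__device__ inline float4 mul_rgb(float4 p, float4 l) {
    return make_float4(p.x * l.x, p.y * l.y, p.z * l.z, p.w);
}

// Second point: no bounds check, because both buffers are padded to a
// multiple of the block size by the host code below.
__global__ void light_kernel(const float4* __restrict__ src,
                             float4* __restrict__ dst, float4 light) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i] = mul_rgb(src[i], light);
}

// Runs the kernel on `in`, writes the result to `out`, and returns the
// kernel-only time in milliseconds (or -1.0f on a CUDA error).
float run_lighting(const std::vector<float4>& in, std::vector<float4>& out,
                   float4 light, int threads = 256) {
    out.resize(in.size());
    if (in.empty()) return 0.0f;

    // Fourth point: clamp the block size to what device 0 reports as its limit.
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1.0f;
    if (threads > prop.maxThreadsPerBlock) threads = prop.maxThreadsPerBlock;

    const size_t n = in.size();
    const size_t blocks = (n + threads - 1) / threads;
    const size_t padded_bytes = blocks * threads * sizeof(float4);

    float4* d_src = nullptr;
    float4* d_dst = nullptr;
    if (cudaMalloc(&d_src, padded_bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_dst, padded_bytes) != cudaSuccess) {
        cudaFree(d_src);
        return -1.0f;
    }
    cudaMemset(d_src, 0, padded_bytes);  // the padding elements hold zeros
    cudaMemcpy(d_src, in.data(), n * sizeof(float4), cudaMemcpyHostToDevice);

    // Warm-up launch so one-time startup cost is not part of the measurement.
    light_kernel<<<(unsigned int)blocks, threads>>>(d_src, d_dst, light);
    cudaDeviceSynchronize();

    // First point: the events bracket only the kernel, not the copies.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    light_kernel<<<(unsigned int)blocks, threads>>>(d_src, d_dst, light);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaMemcpy(out.data(), d_dst, n * sizeof(float4), cudaMemcpyDeviceToHost);
    const bool ok = (cudaGetLastError() == cudaSuccess);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_src);
    cudaFree(d_dst);
    return ok ? ms : -1.0f;
}
```

Calling `run_lighting` with a 900 * 1400 element vector would give a kernel-only figure to compare against the 5 milliseconds from the question. Whether it turns out faster depends on the card, so treat the sketch as a starting point for measuring rather than a guaranteed fix.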