Grid size limits in NVIDIA CUDA with a 2D grid?

CUDA

al_indigo, 2010-09-06 00:17:49

Hello. My question may look trivial at first glance, but please read it to the end before dismissing it.
Here is the problem. According to the CUDA documentation, the grid size of a kernel launch is limited, and the limit depends on the device. On most current video cards it is 65535x65535x1; I checked that this is the case on my g210m and 8800gt. However, I ran into something strange: for a reason I cannot figure out, my program cannot launch a kernel larger than 5808x5808 threads (this is the strict maximum; it can be lower depending on the block size), or larger than 264x264 when measured in blocks, and that second number never changes. As soon as the grid exceeds 265x265 blocks, the kernel launches and runs, but the result is always zero.
The NVIDIA Nsight debugger reports nothing, no errors are raised, and the profiler output shows that the kernel was launched. The limit shows up on every video card I have run the program on, 8 different models in total (8400m g, 8800gt, 9600gso, 8500gt, 9600gt, ION, g210m, gf9300).
All of this makes me think that there is a limit not only on the grid dimensions but also on the total number of threads in the grid (there is a limit on the number of threads per block, after all, so why not here too). But neither the official documentation, nor the Boreskov/Kharlamov textbook, nor the Best Practices Guide says anything about this; they only state the limits I quoted at the beginning of the question.
I have been digging into this for about two hours a day for the past week without any progress, so I am asking for help: where should I look? Any comments are welcome, and if anything needs clarifying, just say so.

1 answer(s)

liq, 2010-09-06
@al_indigo

I just checked and could not reproduce your problem.
I have a GTX 470.
Here is the kernel I wrote:

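// Thread 0 of every block writes that block's (y, x) coordinates
// into two consecutive ints of the output array.
__global__ void testKernel( int* g_odata) 
{
  if(threadIdx.x==0)
  {
    // Linear block index is blockIdx.y*gridDim.x+blockIdx.x; each block owns two ints.
    g_odata[2*(blockIdx.y*gridDim.x+blockIdx.x)] = blockIdx.y;
    g_odata[2*(blockIdx.y*gridDim.x+blockIdx.x)+1] = blockIdx.x;
  }
}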

I launched it with 8192x8192 blocks of 1024 threads each (your video cards support at most 512 threads per block; Fermi supports 1024):

    dim3  grid( 8192, 8192, 1);
    dim3  threads( 1024, 1, 1);
    testKernel<<< grid, threads, 0 >>>(  d_odata);
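The per-device limits mentioned above (threads per block, grid dimensions) can also be read at run time instead of taken from the documentation. The following is only a rough sketch: `cudaGetDeviceProperties` and the `cudaDeviceProp` fields are part of the CUDA runtime API, while the helper names `launchConfigFits` and `printLaunchLimits` are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if the launch shape respects the limits in `prop`.
bool launchConfigFits(const cudaDeviceProp& prop, dim3 grid, dim3 block)
{
    const unsigned long long threadsPerBlock =
        (unsigned long long)block.x * block.y * block.z;

    return threadsPerBlock >= 1
        && threadsPerBlock <= (unsigned long long)prop.maxThreadsPerBlock
        && block.x <= (unsigned)prop.maxThreadsDim[0]
        && block.y <= (unsigned)prop.maxThreadsDim[1]
        && block.z <= (unsigned)prop.maxThreadsDim[2]
        && grid.x >= 1 && grid.x <= (unsigned)prop.maxGridSize[0]
        && grid.y >= 1 && grid.y <= (unsigned)prop.maxGridSize[1]
        && grid.z >= 1 && grid.z <= (unsigned)prop.maxGridSize[2];
}

// Hypothetical helper: prints the launch limits of device `dev`.
// Returns false if the query itself fails.
bool printLaunchLimits(int dev)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return false;
    }
    printf("%s: max grid %d x %d x %d, max block %d x %d x %d, "
           "max threads per block %d\n",
           prop.name,
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2],
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2],
           prop.maxThreadsPerBlock);
    return true;
}
```

Note that these fields only describe the shape of a launch; they do not include a separate cap on the total number of threads in the grid.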

I allocated the memory and so on, of course.
The last element of the array came back as 8191x8191.
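For anyone who wants to repeat the experiment, here is one possible way to fill in the host side that was left out above. It is a sketch, not the code that was actually run: the driver `runGridTest`, the pre-fill with -1, and the full verification loop are assumptions added for illustration. It also checks `cudaGetLastError()` after the launch, because a rejected launch configuration is reported through the runtime error state and does not stop the program by itself.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Same idea as the kernel above: thread 0 of each block records (y, x).
__global__ void testKernel(int* g_odata)
{
    if (threadIdx.x == 0) {
        const size_t block = (size_t)blockIdx.y * gridDim.x + blockIdx.x;
        g_odata[2 * block]     = blockIdx.y;
        g_odata[2 * block + 1] = blockIdx.x;
    }
}

// Hypothetical driver: launches gx x gy blocks of `threads` threads.
// Returns the number of blocks with wrong output, or -1 on any CUDA error.
long long runGridTest(unsigned gx, unsigned gy, unsigned threads)
{
    const size_t blocks = (size_t)gx * gy;
    const size_t bytes  = 2 * blocks * sizeof(int);

    int* h_odata = (int*)malloc(bytes);
    if (!h_odata) return -1;
    int* d_odata = NULL;

    cudaError_t err = cudaMalloc((void**)&d_odata, bytes);
    // Fill with 0xFF bytes (every int becomes -1) so untouched entries stand out.
    if (err == cudaSuccess) err = cudaMemset(d_odata, 0xFF, bytes);
    if (err == cudaSuccess) {
        testKernel<<<dim3(gx, gy, 1), dim3(threads, 1, 1)>>>(d_odata);
        err = cudaGetLastError();   // catches an invalid launch configuration
    }
    // The blocking copy also waits for the kernel and reports execution errors.
    if (err == cudaSuccess)
        err = cudaMemcpy(h_odata, d_odata, bytes, cudaMemcpyDeviceToHost);

    long long bad = -1;
    if (err == cudaSuccess) {
        bad = 0;
        for (size_t b = 0; b < blocks; ++b)
            if (h_odata[2 * b] != (int)(b / gx) || h_odata[2 * b + 1] != (int)(b % gx))
                ++bad;
    } else {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_odata);
    free(h_odata);
    return bad;
}
```

A minimal `main` could simply call `runGridTest(512, 512, 256)` and treat any non-zero result as a failure. A result of 0 means every block wrote its own coordinates; -1 means some CUDA call reported an error, which is printed to stderr. On pre-Fermi cards the `threads` argument has to stay at or below 512.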
I did not test larger grids because the card runs out of memory :( Going further would need some extra logic.
But it is not clear at all where your non-round values 264 and 265 come from.
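On the memory limit mentioned above: the test stores two ints per block, so the buffer size follows directly from the grid dimensions. The sketch below assumes a 4-byte `int` and a 64-bit `size_t`; the helper names `outputBytes` and `outputFits` are hypothetical, while `cudaMemGetInfo` is a real runtime call.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Bytes needed by the test's output buffer: two ints per block.
size_t outputBytes(size_t gridX, size_t gridY)
{
    return 2 * gridX * gridY * sizeof(int);
}

// Hypothetical check: does that buffer fit into the currently free device memory?
// Returns false if the query fails or the buffer is too large.
bool outputFits(size_t gridX, size_t gridY)
{
    size_t freeBytes = 0, totalBytes = 0;
    if (cudaMemGetInfo(&freeBytes, &totalBytes) != cudaSuccess)
        return false;
    return outputBytes(gridX, gridY) <= freeBytes;
}
```

With these assumptions an 8192x8192 grid needs 2 * 8192 * 8192 * 4 = 536,870,912 bytes (512 MiB), and a full 65535x65535 grid would need about 32 GiB, so testing the maximum grid size requires a kernel that writes far less than one record per block.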