Grid size limits in Nvidia CUDA with 2D grid?
Hello. My question may seem obvious to many at first glance, but I would ask you not to dismiss it as idiotic until you have read it to the end.
So, to the crux of the matter. According to the CUDA documentation, the grid size of a launched kernel is limited, and the limit depends on the specific device. On most modern video cards the limit is 65535x65535x1. This is exactly the case on my g210m and 8800gt cards; I checked. But I have run into something rather strange: in my program, for a reason I cannot identify, it is impossible to launch a kernel whose grid is larger than 5808x5808 when measured in threads (this number can be smaller depending on the block size; I am quoting the strict maximum) or larger than 264x264 when measured in blocks, and that last number never changes. As soon as the grid reaches 265x265 blocks or more, the kernel launches and runs, but always returns zero as its result.
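To make the numbers in the question concrete, here is a minimal sketch of how the grid side could be derived and compared with what the device reports. The helper names `blocksNeeded` and `gridFitsDevice` are hypothetical, and the block side of 22 threads is only an assumption inferred from 5808 / 264 = 22; the question does not state the block size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Number of blocks needed to cover `threads` threads along one axis
// with `blockSize` threads per block (ceiling division).
inline unsigned int blocksNeeded(unsigned int threads, unsigned int blockSize)
{
    return (threads + blockSize - 1) / blockSize;
}

// Hypothetical helper: returns true if a gridX x gridY grid is within the
// per-dimension limits reported by device `device`. Prints the limits too.
inline bool gridFitsDevice(unsigned int gridX, unsigned int gridY, int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return false;  // no usable device, or the query itself failed
    std::printf("%s: maxGridSize = %d x %d x %d, maxThreadsPerBlock = %d\n",
                prop.name, prop.maxGridSize[0], prop.maxGridSize[1],
                prop.maxGridSize[2], prop.maxThreadsPerBlock);
    return gridX <= static_cast<unsigned int>(prop.maxGridSize[0]) &&
           gridY <= static_cast<unsigned int>(prop.maxGridSize[1]);
}
```

Under that assumption, `blocksNeeded(5808, 22)` gives 264, the largest grid side that worked. On a device that reports 65535 x 65535, `gridFitsDevice(265, 265, 0)` would be expected to return true, which suggests the per-dimension grid limit is not the one being hit.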
The Nvidia Nsight debugger is silent, no errors are thrown, and the profiler output shows that the kernel does launch. The limitation shows up on every video card I have run the program on, 8 different models in total (8400m g, 8800gt, 9600gso, 8500gt, 9600gt, ION, g210m, gf9300).
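A kernel launch is asynchronous and does not report failure on its own, so a silent debugger does not prove that the launch succeeded. Below is a minimal sketch of explicit error checking. `checkCuda` and `syncAndCheck` are hypothetical helper names, not part of the CUDA runtime; only `cudaGetLastError`, `cudaDeviceSynchronize` and `cudaGetErrorString` are real runtime calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints a CUDA runtime error, returns true on success.
inline bool checkCuda(cudaError_t err, const char* what)
{
    if (err == cudaSuccess)
        return true;
    std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return false;
}

// Hypothetical helper: call immediately after a kernel launch.
// cudaGetLastError() catches launch-time failures (for example an invalid
// launch configuration); cudaDeviceSynchronize() waits for the kernel and
// returns errors raised while it ran (for example a launch timeout).
// Note: toolkits older than CUDA 4.0 used cudaThreadSynchronize() instead.
inline bool syncAndCheck(const char* kernelName)
{
    if (!checkCuda(cudaGetLastError(), kernelName))
        return false;
    return checkCuda(cudaDeviceSynchronize(), kernelName);
}
```

Calling `syncAndCheck("myKernel")` right after the `<<<...>>>` launch (the kernel name here is a placeholder) would show whether the large grid fails at launch or during execution. Checking the return values of `cudaMalloc` and `cudaMemcpy` the same way would show whether an earlier call had already failed.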
All of this makes me think that there are limits not only on the grid dimensions but also on the total number of threads in a grid (after all, there is a limit on the number of threads in a block, so why not here as well). However, neither the official documentation, nor the Boreskov / Kharlamov textbook, nor the Best Practices Guide says anything about this; they mention only the limits already stated at the beginning of the question.
I have been digging into this for about two hours a day for the past week without any progress, so I am asking for help: where should I look? Any comments are welcome; if anything needs clarifying, just tell me.
Just checked. I was unable to reproduce your problem.
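I have a GTX470.
So, I wrote this kernel:
__global__ void testKernel(int* g_odata)
{
    // Thread 0 of each block records the block's (y, x) indices.
    if (threadIdx.x == 0)
    {
        g_odata[2 * (blockIdx.y * gridDim.x + blockIdx.x)]     = blockIdx.y;
        g_odata[2 * (blockIdx.y * gridDim.x + blockIdx.x) + 1] = blockIdx.x;
    }
}

And launched it like this:
// d_odata is assumed to be a device buffer of at least 2 * 8192 * 8192 ints.
dim3 grid(8192, 8192, 1);
dim3 threads(1024, 1, 1);
testKernel<<<grid, threads, 0>>>(d_odata);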
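For completeness, here is a sketch of host code that could drive a test kernel like this one and verify its output. The driver `runGridTest`, the helper `outputInts`, and the verification loop are assumptions; the answer shows only the kernel and the launch. The kernel is repeated so that the block compiles on its own.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Same kernel as in the answer: thread 0 of each block records its indices.
__global__ void testKernel(int* g_odata)
{
    if (threadIdx.x == 0)
    {
        g_odata[2 * (blockIdx.y * gridDim.x + blockIdx.x)]     = blockIdx.y;
        g_odata[2 * (blockIdx.y * gridDim.x + blockIdx.x) + 1] = blockIdx.x;
    }
}

// Number of ints the kernel writes: two per block.
inline size_t outputInts(unsigned int gridX, unsigned int gridY)
{
    return 2 * static_cast<size_t>(gridX) * gridY;
}

// Hypothetical driver: launches the kernel and checks every recorded pair.
// Returns true only if all checked CUDA calls succeed and all pairs match.
bool runGridTest(unsigned int gridX, unsigned int gridY,
                 unsigned int threadsPerBlock)
{
    const size_t n = outputInts(gridX, gridY);
    int* d_odata = nullptr;
    if (cudaMalloc((void**)&d_odata, n * sizeof(int)) != cudaSuccess)
        return false;
    // Fill with 0xFF bytes (-1 per int) so unwritten slots stand out.
    cudaMemset(d_odata, 0xFF, n * sizeof(int));

    dim3 grid(gridX, gridY, 1);
    dim3 threads(threadsPerBlock, 1, 1);
    testKernel<<<grid, threads, 0>>>(d_odata);

    std::vector<int> h_odata(n);
    // cudaMemcpy waits for the kernel and also reports errors raised by it.
    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy(h_odata.data(), d_odata, n * sizeof(int),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_odata);

    for (unsigned int y = 0; ok && y < gridY; ++y)
        for (unsigned int x = 0; ok && x < gridX; ++x)
        {
            const size_t i = 2 * (static_cast<size_t>(y) * gridX + x);
            ok = h_odata[i] == static_cast<int>(y) &&
                 h_odata[i + 1] == static_cast<int>(x);
        }
    return ok;
}
```

`runGridTest(8192, 8192, 1024)` mirrors the launch above and needs a 512 MiB device buffer. On devices of compute capability 1.x, which allow at most 512 threads per block, a smaller block size such as 256 would be needed, along with a smaller grid if the card has less memory.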