How many threads and blocks does it take to multiply two matrices in CUDA?
Hello, I can't figure out how many threads and blocks I need when launching this kernel so that the matrices are multiplied correctly.
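// c (m x k) = a (m x n) * b (n x k), all matrices stored row-major.
__global__ void mul(float *a, float *b, float *c, int m, int n, int k)
{
    // Each thread computes one element of c.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    // Threads that fall outside the matrix do nothing.
    if (col < k && row < m)
    {
        for (int i = 0; i < n; i++)
        {
            sum += a[row * n + i] * b[i * k + col];
        }
        c[row * k + col] = sum;
    }
}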
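One possible way to launch it, as a rough sketch rather than the only correct answer: the kernel computes one element of c per thread, so the grid has to cover at least k threads in x (columns) and m threads in y (rows). The bounds check inside the kernel lets you round the grid up. The 16 x 16 block size, the helper `ceil_div`, the matrix sizes and the test data below are my own assumptions for illustration. Other block shapes work too, as long as a block has no more than 1024 threads.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Kernel from the question: c (m x k) = a (m x n) * b (n x k), row-major.
__global__ void mul(float *a, float *b, float *c, int m, int n, int k)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    if (col < k && row < m)
    {
        for (int i = 0; i < n; i++)
            sum += a[row * n + i] * b[i * k + col];
        c[row * k + col] = sum;
    }
}

// Hypothetical helper (not from the question): integer division rounded up.
static int ceil_div(int x, int y) { return (x + y - 1) / y; }

int main()
{
    // Assumed sizes, deliberately not multiples of the block size.
    const int m = 100, n = 70, k = 50;
    std::vector<float> a(m * n), b(n * k), c(m * k);
    for (int i = 0; i < m * n; i++) a[i] = (i % 7) * 0.5f;
    for (int i = 0; i < n * k; i++) b[i] = (i % 5) - 2.0f;

    float *da, *db, *dc;
    cudaMalloc(&da, a.size() * sizeof(float));
    cudaMalloc(&db, b.size() * sizeof(float));
    cudaMalloc(&dc, c.size() * sizeof(float));
    cudaMemcpy(da, a.data(), a.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), b.size() * sizeof(float), cudaMemcpyHostToDevice);

    // Assumed block shape: 16 x 16 = 256 threads per block.
    const int TILE = 16;
    dim3 block(TILE, TILE);
    // x covers columns (k), y covers rows (m); round up so every element is covered.
    dim3 grid(ceil_div(k, TILE), ceil_div(m, TILE));
    mul<<<grid, block>>>(da, db, dc, m, n, k);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
    {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(c.data(), dc, c.size() * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare against a plain CPU triple loop.
    float max_err = 0.0f;
    for (int r = 0; r < m; r++)
        for (int col = 0; col < k; col++)
        {
            float ref = 0.0f;
            for (int i = 0; i < n; i++)
                ref += a[r * n + i] * b[i * k + col];
            max_err = fmaxf(max_err, fabsf(ref - c[r * k + col]));
        }
    printf("grid = (%u, %u), block = (%u, %u), max abs error = %g\n",
           grid.x, grid.y, block.x, block.y, max_err);

    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    return 0;
}
```

Note that the x dimension goes with k and the y dimension with m, because the kernel takes col from x and row from y. If you swap them, part of the result is never written whenever m and k differ.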