How can this piece of CUDA code be optimized?
Take a look at this piece of code:
void OpenNNL::calculateNeuronsOutputsAndDerivatives(double * inputs, double * deviceOutputs, double * deviceDerivatives)
{
    int inputsCount = _inputsCount;

    double * deviceTemp;
    double * deviceInputs;

    cudaCall(cudaMalloc((void**)&deviceInputs, inputsCount*sizeof(double)));
    cudaCall(cudaMemcpy(deviceInputs, inputs, inputsCount*sizeof(double), cudaMemcpyDeviceToDevice));

    for(int i = 0; i < _layersCount; i++)
    {
        cudaCall(cudaMalloc((void**)&deviceTemp, _neuronsPerLayerCount[i]*inputsCount*sizeof(double)));

        dim3 threadsMul = dim3(BLOCK_SIZE, 1);
        int blocksCount = floor((double) _neuronsPerLayerCount[i]*inputsCount / threadsMul.x) + 1;
        dim3 blocksMul = dim3(blocksCount, 1);

        weighting<<<blocksMul, threadsMul>>>(deviceTemp, deviceInputs, _neuronsInputsWeights, _inputsInPreviousLayers[i], inputsCount, _neuronsPerLayerCount[i]);

        cudaCall(cudaFree(deviceInputs));
        cudaCall(cudaMalloc((void**)&deviceInputs, _neuronsPerLayerCount[i]*sizeof(double)));

        dim3 threadsSum = dim3(BLOCK_SIZE, 1);
        blocksCount = floor((double) _neuronsPerLayerCount[i] / threadsSum.x) + 1;
        dim3 blocksSum = dim3(blocksCount, 1);

        calculateOutputsAndDerivatives<<<blocksSum, threadsSum>>>(deviceOutputs, deviceDerivatives, deviceInputs, deviceTemp, _neuronsBiases, inputsCount, _neuronsPerLayerCount[i], _neuronsInPreviousLayers[i]);

        inputsCount = _neuronsPerLayerCount[i];

        cudaCall(cudaFree(deviceTemp));
    }

    cudaCall(cudaFree(deviceInputs));
}
Here are the results of the profiler: ubuntuone.com/41OVwiE7NEd3fI9jALsFi6
But, alas, I do not understand what this means (I am new to CUDA). I would be glad if someone could explain this to me.
Judging by the cudaMemcpyDeviceToDevice flag, inputs is already a device pointer, so don't copy the data at all — pass it directly as the input of the weighting kernel (provided the kernel doesn't modify it, of course).
It would also be a good idea to wait for the kernels to finish: kernel launches are asynchronous.
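A minimal sketch of this advice, reusing the identifiers from the question's code and assuming inputs really is a device pointer that the weighting kernel only reads:

```cuda
// Skip the initial cudaMalloc + device-to-device copy: reuse the
// caller's device buffer directly for the first layer.
double * deviceInputs = inputs;  // assumption: inputs is device memory

// ... launch weighting<<<...>>>(deviceTemp, deviceInputs, ...) as before ...

// Kernel launches are asynchronous; synchronize before the host
// reads results or measures time.
cudaCall(cudaDeviceSynchronize());
```

Note that the loop must then not cudaFree this pointer on the first iteration, since it is owned by the caller.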
1. I would also get rid of the cudaFree/cudaMalloc calls in the inner loop, although I don't know exactly how much they cost.
2. I would use single precision: double on NVIDIA is an order of magnitude slower, especially on a mobile card.
3. I second the advice to wait for the kernels to finish.
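Point 1 can be sketched by allocating worst-case buffers once before the loop and ping-ponging between them. maxTempCount and maxNeuronsCount are hypothetical upper bounds (not in the original code) that would be computed once from _neuronsPerLayerCount:

```cuda
// Sketch: hoist cudaMalloc/cudaFree out of the per-layer loop.
// maxTempCount / maxNeuronsCount are assumed precomputed maxima
// over all layers (hypothetical names, not from the original code).
double *deviceTemp, *deviceBufA, *deviceBufB;
cudaCall(cudaMalloc((void**)&deviceTemp, maxTempCount * sizeof(double)));
cudaCall(cudaMalloc((void**)&deviceBufA, maxNeuronsCount * sizeof(double)));
cudaCall(cudaMalloc((void**)&deviceBufB, maxNeuronsCount * sizeof(double)));

for (int i = 0; i < _layersCount; i++)
{
    // ... launch weighting with deviceBufA as input and deviceTemp
    // as output, then calculateOutputsAndDerivatives writing the
    // layer's outputs into deviceBufB ...

    // Swap the ping-pong buffers: this layer's outputs become the
    // next layer's inputs, with no allocation or copy.
    double * swap = deviceBufA;
    deviceBufA = deviceBufB;
    deviceBufB = swap;
}

cudaCall(cudaFree(deviceTemp));
cudaCall(cudaFree(deviceBufA));
cudaCall(cudaFree(deviceBufB));
```

The buffers are sized for the largest layer, so every iteration simply reuses them instead of paying for cudaMalloc/cudaFree per layer.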