How can this piece of CUDA code be optimized?
Take a look at this piece of code:
void OpenNNL::calculateNeuronsOutputsAndDerivatives(double * inputs, double * deviceOutputs, double * deviceDerivatives)
{
    int inputsCount = _inputsCount;

    double * deviceTemp;
    double * deviceInputs;

    cudaCall(cudaMalloc((void**)&deviceInputs, inputsCount*sizeof(double)));
    cudaCall(cudaMemcpy(deviceInputs, inputs, inputsCount*sizeof(double), cudaMemcpyDeviceToDevice));

    for(int i = 0; i < _layersCount; i++)
    {
        cudaCall(cudaMalloc((void**)&deviceTemp, _neuronsPerLayerCount[i]*inputsCount*sizeof(double)));

        dim3 threadsMul = dim3(BLOCK_SIZE, 1);
        int blocksCount = floor((double) _neuronsPerLayerCount[i]*inputsCount / threadsMul.x) + 1;
        dim3 blocksMul = dim3(blocksCount, 1);

        weighting<<<blocksMul, threadsMul>>>(deviceTemp, deviceInputs, _neuronsInputsWeights, _inputsInPreviousLayers[i], inputsCount, _neuronsPerLayerCount[i]);

        cudaCall(cudaFree(deviceInputs));
        cudaCall(cudaMalloc((void**)&deviceInputs, _neuronsPerLayerCount[i]*sizeof(double)));

        dim3 threadsSum = dim3(BLOCK_SIZE, 1);
        blocksCount = floor((double) _neuronsPerLayerCount[i] / threadsSum.x) + 1;
        dim3 blocksSum = dim3(blocksCount, 1);

        calculateOutputsAndDerivatives<<<blocksSum, threadsSum>>>(deviceOutputs, deviceDerivatives, deviceInputs, deviceTemp, _neuronsBiases, inputsCount, _neuronsPerLayerCount[i], _neuronsInPreviousLayers[i]);

        inputsCount = _neuronsPerLayerCount[i];

        cudaCall(cudaFree(deviceTemp));
    }

    cudaCall(cudaFree(deviceInputs));
}
Here are the results of the profiler: ubuntuone.com/41OVwiE7NEd3fI9jALsFi6
But, alas, I do not understand what this means (I am new to CUDA). I would be glad if someone could explain this to me.
Judging by the cudaMemcpyDeviceToDevice flag, inputs is already a device pointer, so don't copy the data at all — pass it directly as the input of the weighting kernel (provided the kernel doesn't modify it, of course).
It would also be a good idea to wait for the kernels to finish: kernel launches are asynchronous.
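A minimal sketch of this advice, reusing the identifiers from the question's code and assuming inputs really is a device pointer that the weighting kernel only reads:

```cuda
// Skip the initial cudaMalloc + device-to-device copy: reuse the
// caller's device buffer directly for the first layer.
double * deviceInputs = inputs;  // assumption: inputs is device memory

// ... launch weighting<<<...>>>(deviceTemp, deviceInputs, ...) as before ...

// Kernel launches are asynchronous; synchronize before the host
// reads results or measures time.
cudaCall(cudaDeviceSynchronize());
```

Note that the loop must then not cudaFree this pointer on the first iteration, since it is owned by the caller.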
1. I would also get rid of the cudaFree/cudaMalloc calls in the inner loop, although I don't know exactly how much they cost.
2. I would use single precision: double on NVIDIA is an order of magnitude slower, especially on a mobile card.
3. I second the advice to wait for the kernels to finish.
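Point 1 can be sketched by allocating worst-case buffers once before the loop and ping-ponging between them. maxTempCount and maxNeuronsCount are hypothetical upper bounds (not in the original code) that would be computed once from _neuronsPerLayerCount:

```cuda
// Sketch: hoist cudaMalloc/cudaFree out of the per-layer loop.
// maxTempCount / maxNeuronsCount are assumed precomputed maxima
// over all layers (hypothetical names, not from the original code).
double *deviceTemp, *deviceBufA, *deviceBufB;
cudaCall(cudaMalloc((void**)&deviceTemp, maxTempCount * sizeof(double)));
cudaCall(cudaMalloc((void**)&deviceBufA, maxNeuronsCount * sizeof(double)));
cudaCall(cudaMalloc((void**)&deviceBufB, maxNeuronsCount * sizeof(double)));

for (int i = 0; i < _layersCount; i++)
{
    // ... launch weighting with deviceBufA as input and deviceTemp
    // as output, then calculateOutputsAndDerivatives writing the
    // layer's outputs into deviceBufB ...

    // Swap the ping-pong buffers: this layer's outputs become the
    // next layer's inputs, with no allocation or copy.
    double * swap = deviceBufA;
    deviceBufA = deviceBufB;
    deviceBufB = swap;
}

cudaCall(cudaFree(deviceTemp));
cudaCall(cudaFree(deviceBufA));
cudaCall(cudaFree(deviceBufB));
```

The buffers are sized for the largest layer, so every iteration simply reuses them instead of paying for cudaMalloc/cudaFree per layer.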