Linux, OpenMP and GCC: Is a multi-threaded program 5-6 times slower than a single-threaded one?
I'm learning OpenMP, trying to parallelize a neural network. But for some reason, it always turns out that a parallelized application is significantly slower than a regular one. What could be the problem?
Single-threaded application:
pastebin.com/22vp3LYU
Compilation: g++ -lrt -O0 main.cpp -o nnlv2
Multithreaded:
pastebin.com/w3m01QTK
Compilation:
g++ -lrt -fopenmp -O0 main_openmp.cpp -o nnlv2_openmp
there is also
cudaMemcpyToSymbol( "alpha", &alpha, 4);
I skimmed through it and everything seems to be correct.
Here is some advice for a novice radio operator: do not tighten two nuts at the same time.
That is, first just copy the data to the GPU and back and check that the values match.
Then move on to the multithreaded part.
I also do this (I will share a piece of code :) ):
#include <cstring>  // memset

// Check() and LOG() are my own helpers and are not shown here.

// Each thread writes 1 into its own slot.
__global__ void RunTest(unsigned long *heartbeat)
{
    unsigned long tid = blockIdx.x*blockDim.x + threadIdx.x;
    heartbeat[tid] = 1;
}

void Runkerneltest(int grids, int threads, unsigned long *heartbeat)
{
    RunTest <<<grids, threads>>> (heartbeat);
}

bool TestKernel(int grids, int threads)
{
    unsigned long* heartbeat = new unsigned long [grids*threads];
    memset (heartbeat, 0, grids*threads*sizeof (unsigned long));
    void* heartbeat_device = 0;
    cudaError err;
    err = cudaMalloc ((void**)&heartbeat_device, grids*threads*sizeof (unsigned long));
    Check("cudaMalloc heartbeat_device",err);
    err = cudaMemset (heartbeat_device, 0, grids*threads*sizeof(unsigned long));
    Check("cudaMemset heartbeat_device",err);
    Runkerneltest(grids,threads,(unsigned long*)heartbeat_device);
    // cudaThreadSynchronize() is deprecated in newer CUDA versions;
    // cudaDeviceSynchronize() is its replacement.
    err = cudaThreadSynchronize();
    Check("cudaThreadSynchronize()", err);
    err = cudaMemcpy( heartbeat, heartbeat_device, grids*threads*sizeof(unsigned long), cudaMemcpyDeviceToHost );
    Check("cudaMemcpy( heartbeat, heartbeat_device)", err);
    // Every slot must have been set to 1 by its thread.
    bool error = false;
    for (int i=0; i<grids*threads; i++)
        if (heartbeat[i] != 1)
        {
            LOG("tid %d test fails",i);
            error = true;
            break;
        }
    cudaFree (heartbeat_device);
    delete[] heartbeat;  // allocated with new[], so it must be freed with delete[]
    return !error;  // false if there was an error
}
I optimized the code:
Single-threaded version: pastebin.com/KAx4RmSJ
Multi-threaded version: pastebin.com/fbe4gZSn
Now the multi-threaded version is only 2 times slower than the single-threaded version (and it needs to be at least 3 times faster). What else can be optimized?
Here are the latest versions of the code: ubuntuone.com/p/jPV/
The source code compiles like this:
nvcc -lrt main_cuda.cu -o nnlv2_cuda
Everything is still sad. Not only is it slower, it also computes the wrong result. What am I doing wrong?
The GPU parameters can be obtained "automatically" like this:
cudaDeviceProp prop;
// i is the device index
if(cudaGetDeviceProperties( & prop, i) == cudaSuccess)
{
    LOG ( "Device: %s\n",prop.name);
    LOG ( "Compute capability : %d.%d\n", prop.major, prop.minor );
    LOG ( "Name : %s\n", prop.name );
    // the memory sizes are size_t, hence %zu (assuming LOG is printf-style)
    LOG ( "Total Global Memory : %zu\n", prop.totalGlobalMem );
    LOG ( "Shared memory per block: %zu\n", prop.sharedMemPerBlock );
    LOG ( "Registers per block : %d\n", prop.regsPerBlock );
    LOG ( "Warp size : %d\n", prop.warpSize );
    LOG ( "Max Grid : %d\n", prop.maxGridSize[0] );
    LOG ( "Max threads per block : %d\n", prop.maxThreadsPerBlock );
    LOG ( "Total constant memory : %zu\n", prop.totalConstMem );
    ...
}
You have an atomic there before k++. On every iteration the threads have to synchronize so that the k++ increments execute one after another. Naturally, that will be much slower.
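To illustrate the point about the atomic, here is a minimal sketch that is not taken from the pastebin code: the function names and the is_active() condition are made up for the example. The first variant increments a shared counter under an atomic, so the threads contend on every k++. The second uses an OpenMP reduction, so each thread counts into its own private copy of k and the copies are summed once at the end of the loop.

```cpp
#include <vector>

// Hypothetical stand-in for whatever decides that k should be incremented.
static bool is_active(double x) { return x > 0.5; }

// Variant 1: one shared counter protected by an atomic.
// Correct, but every increment is serialized across threads.
long count_atomic(const std::vector<double>& v) {
    const long n = static_cast<long>(v.size());
    long k = 0;
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        if (is_active(v[i])) {
            #pragma omp atomic
            k++;
        }
    }
    return k;
}

// Variant 2: reduction. Each thread gets a private k initialized to 0;
// OpenMP adds the private copies together when the loop finishes.
long count_reduction(const std::vector<double>& v) {
    const long n = static_cast<long>(v.size());
    long k = 0;
    #pragma omp parallel for reduction(+:k)
    for (long i = 0; i < n; ++i) {
        if (is_active(v[i])) {
            k++;
        }
    }
    return k;
}
```

Both functions should return the same count for the same input. Compile with g++ -fopenmp as in the question; without that flag the pragmas are ignored and both loops simply run single-threaded. Whether this removes the slowdown in the real program depends on how much other work each iteration does.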