Linux, OpenMP and GCC: Is a multi-threaded program 5-6 times slower than a single-threaded one?

linux

Robotex, 2011-03-24 02:28:23

I'm learning OpenMP by trying to parallelize a neural network. For some reason, the parallelized application always turns out significantly slower than the regular one. What could be the problem?
Single-threaded application:
pastebin.com/22vp3LYU
Compilation:
g++ -lrt -O0 main.cpp -o nnlv2
Multithreaded application:
pastebin.com/w3m01QTK
Compilation:

g++ -lrt -fopenmp -O0 main_openmp.cpp -o nnlv2_openmp

P.S. Ignore the sigmoid function; I know it should look different :)

7 answer(s)

edeldm, 2011-04-05
@edeldm

There is also
cudaMemcpyToSymbol( "alpha", &alpha, 4);
I skimmed through the code and everything seems to be correct.
My advice, as to a novice radio amateur: don't turn two nuts at the same time.
That is, first try simply copying the data to the video card and back, and check that the values match.
Only then add multithreading.
I do this as well (I'll share a piece of code :) ):

__global__ void RunTest(unsigned long *heartbeat)
{
  unsigned long tid = blockIdx.x*blockDim.x + threadIdx.x;
  heartbeat[tid] = 1;
}

void Runkerneltest(int grids, int threads, unsigned long *heartbeat)
{
  RunTest <<<grids, threads>>> (heartbeat);
}




bool TestKernel(int grids, int threads)
{
  unsigned long* heartbeat =  new unsigned long [grids*threads];
  memset (heartbeat, 0, grids*threads*sizeof (unsigned long)); 
  void* heartbeat_device = 0;
  cudaError err;
  err = cudaMalloc ((void**)&heartbeat_device, grids*threads*sizeof (unsigned long));
  Check("cudaMalloc heartbeat_device",err);
  err = cudaMemset (heartbeat_device, 0, grids*threads*sizeof(unsigned long));
  Check("cudaMemset heartbeat_device",err);

  Runkerneltest(grids,threads,(unsigned long*)heartbeat_device);

  err = cudaDeviceSynchronize(); // cudaThreadSynchronize() is deprecated
  Check("cudaDeviceSynchronize()", err);
  err = cudaMemcpy( heartbeat, heartbeat_device, grids*threads*sizeof(unsigned long), cudaMemcpyDeviceToHost );
  Check("cudaMemcpy( heartbeat, heartbeat_device)", err);

  bool error = false;
  for (int i=0; i<grids*threads; i++)
    if (heartbeat[i] != 1) 
    {
      LOG("tid %d test fails",i);
      error = true;
      break;
    }

  cudaFree (heartbeat_device);
  delete[] heartbeat; // allocated with new[], so it must be released with delete[]
  return !error; // false if there was an error
}

Sergey Lerg, 2011-03-24
@Lerg

Use explicit threads, for example via the pthread library.

Robotex, 2011-03-24
@Robotex

I optimized the code:
Single-threaded version: pastebin.com/KAx4RmSJ
Multi-threaded version: pastebin.com/fbe4gZSn
Now the multi-threaded version is only 2 times slower than the single-threaded one (and it needs to be at least 3 times faster). What else can be optimized?

Robotex, 2011-03-25
@Robotex

Here are the latest versions of the code: ubuntuone.com/p/jPV/
The CUDA source is compiled like this:
nvcc -lrt main_cuda.cu -o nnlv2_cuda
Things are still sad. Not only is it slower, it also computes incorrect results. What am I doing wrong?

edeldm, 2011-04-06
@edeldm

The video card's parameters can be read "automatically" like this:

cudaDeviceProp prop;
if(cudaGetDeviceProperties( & prop, i) == cudaSuccess) 
{
  LOG ("Device: %s\n",prop.name);
  LOG ( "Compute capability     : %d.%d\n", prop.major, prop.minor );
  LOG ( "Name                   : %s\n", prop.name );
  LOG ( "Total Global Memory    : %zu\n", prop.totalGlobalMem );    // size_t field
  LOG ( "Shared memory per block: %zu\n", prop.sharedMemPerBlock ); // size_t field
  LOG ( "Registers per block    : %d\n", prop.regsPerBlock );
  LOG ( "Warp size              : %d\n", prop.warpSize );
  LOG ( "Max Grid               : %d\n", prop.maxGridSize[0] );
  LOG ( "Max threads per block  : %d\n", prop.maxThreadsPerBlock );
  LOG ( "Total constant memory  : %zu\n", prop.totalConstMem );     // size_t field
...
 }

mikhanoid, 2011-04-06
@mikhanoid

You have an atomic there before k++. On every iteration the threads have to synchronize so that they perform k++ one after another. Naturally, that is much slower.
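
To illustrate the point about the atomic, here is a minimal sketch of one common way to avoid a shared counter: derive the index from the loop variables so that each iteration writes only to its own element. The pastebin code is not reproduced in this thread, so the function name forward, the row-major weight layout, and the parameters w, x and outputs are all hypothetical and only stand in for whatever the real network uses.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical dense layer (names and layout are assumptions, not the
// original pastebin code): `outputs` neurons, each with x.size() weights,
// stored row-major in `w`.
std::vector<double> forward(const std::vector<double>& w,
                            const std::vector<double>& x,
                            int outputs)
{
    const int inputs = static_cast<int>(x.size());
    std::vector<double> y(outputs, 0.0);

    // The weight index is computed from i and j instead of a shared k++,
    // and each iteration of the outer loop writes only y[i].
    // No atomic or critical section is needed inside the loop.
    #pragma omp parallel for
    for (int i = 0; i < outputs; ++i) {
        double sum = 0.0;  // private to the iteration
        for (int j = 0; j < inputs; ++j) {
            sum += w[static_cast<std::size_t>(i) * inputs + j] * x[j];
        }
        y[i] = sum;
    }
    return y;
}
```

Built with g++ -fopenmp the outer loop is split across threads; without that flag the pragma is ignored and the same code runs serially, which makes it easy to compare the two results.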

Robotex, 2011-05-24
@Robotex

But what if there simply shouldn't be any speedup on a dual-core machine? One core processes the main loop, and the other the parallel region.
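
One way to check this assumption: in the OpenMP execution model, the thread that reaches a parallel region becomes thread 0 of the team and takes part in the work, so it does not sit idle while another core runs the region. The sketch below counts how many loop iterations each thread executed. The function name iterations_per_thread is hypothetical, and the _OPENMP guards are only there so the code also builds without -fopenmp.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns, for each thread of the team, how many of the n iterations it ran.
// (Hypothetical helper for illustration only.)
std::vector<int> iterations_per_thread(int n)
{
#ifdef _OPENMP
    const int max_threads = omp_get_max_threads();
#else
    const int max_threads = 1;  // serial build: a single "thread 0"
#endif
    std::vector<int> counts(max_threads, 0);

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
#ifdef _OPENMP
        const int tid = omp_get_thread_num();
#else
        const int tid = 0;
#endif
        counts[tid] += 1;  // each thread updates only its own slot
    }
    return counts;
}
```

Printing the returned vector on the dual-core machine shows how the iterations were actually divided between the threads; the exact split depends on the schedule the implementation picks.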