How to implement element-wise multiplication of arrays using OpenMP, CUDA and GLSL?
I've been trying to solve this problem for several days now, but my solutions are slower than the sequential single-threaded solution:
for (int i = 0; i < LAYERS; i++)
{
    // element-wise products: each weight of layer i times the matching input
    for (int j = 0; j < INPUTS * NEURONS; j++)
    {
        temp[j] = inputs[j % INPUTS] * weights[i][j];
    }
    // per-neuron sum of those products, then the activation
    for (int j = 0; j < NEURONS; j++)
    {
        inputs[j] = 0;
        for (int l = 0; l < INPUTS; l++)
        {
            inputs[j] += temp[j * INPUTS + l];
        }
        inputs[j] = sigmoid(inputs[j]);
    }
}
It looks like in the second variant the second loop is not parallelized. Replace inputs[j] with a local variable so that it ends up in a register. You may also need an intermediate array to accumulate these inputs[j] into, so that you can switch over to it in one go after the loop exits. A sketch of what I mean is below.
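A minimal sketch of that idea, assuming inputs and outputs are two float* buffers of equal size; the outputs array, the fusion of the two inner loops and the std::swap are my additions, not code from the question:

#include <algorithm>   // std::swap
#include <omp.h>

for (int i = 0; i < LAYERS; i++)
{
    #pragma omp parallel for
    for (int j = 0; j < NEURONS; j++)
    {
        float sum = 0.0f;                          // local accumulator, lives in a register
        for (int l = 0; l < INPUTS; l++)
            sum += inputs[l] * weights[i][j * INPUTS + l];
        outputs[j] = sigmoid(sum);                 // written to the intermediate array, not to inputs
    }
    std::swap(inputs, outputs);                    // switch to the new values in one go after the parallel loop
}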
Here I'm trying to use cuBLAS with CUDA. I found matrix multiplication, but I don't understand how to multiply two arrays element-wise.
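As far as I know, cuBLAS has no dedicated element-wise (Hadamard) product routine (cublasSdgmm can be bent into doing it, but a tiny custom kernel is usually simpler). A sketch with made-up names:

__global__ void hadamard(float* out, const float* a, const float* b, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's global element index
    if (idx < n)                                        // guard: n need not be a multiple of the block size
        out[idx] = a[idx] * b[idx];
}

// launch with enough blocks to cover all n elements
int threads = 256;
int blocks  = (n + threads - 1) / threads;
hadamard<<<blocks, threads>>>(devOut, devA, devB, n);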
What is the fundamental difference between this: pastebin.com/8dpNKBwt (faster than its single-threaded counterpart) and this: pastebin.com/fbe4gZSn (slower)?
So with OpenMP I settled on this: pastebin.com/sJ4fXiAb - now there seems to be a noticeable speedup.
Now I'm thinking about CUDA; so far it's more or less working out.
But GLSL is giving me trouble. I'm not familiar with this technology at all - can anyone help?
Well, look.
There are no arrays as such here.
Create a window - this is mandatory.
Initialize the pipeline.
Create a vertex buffer object with vertex coordinates. Yes, I know you don't have any vertices - but otherwise the shader can't be launched, it runs once per vertex. The vertices can be fake.
Create the textures. Modern video cards support float textures, so there shouldn't be any problems - but do check it on the GPU the code will actually run on, otherwise you'll need a different way of doing the calculation.
Set all this stuff as render state. Run the shader you've created and compiled.
It takes the values from the textures and draws us a picture. More precisely, two shaders are needed, a vertex one and a pixel one; the second is the one that writes out the result. There are also tricks for writing intermediate results out of the shader - but there you have to look at what a particular GPU can do.
Read back the result. From where is the question. If pixel buffer objects and render targets are supported, then from a PBO; otherwise from the screen buffer (I hope it won't come to that and the GPU will be reasonably decent).
I may have forgotten something.
The above is only an outline - a minimal fragment-shader sketch follows below. In practice the code turns out to be a real handful; it never looks like "simple array multiplication".
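A minimal sketch of the pixel (fragment) shader stage described above, assuming both operands have been uploaded into single-channel float textures and the full-screen quad / render-target setup is already in place (all the names here are mine):

#version 330 core

uniform sampler2D u_inputs;    // first operand, packed into a float texture
uniform sampler2D u_weights;   // second operand, same layout
in  vec2 v_texcoord;           // interpolated from the (fake) vertex shader
out vec4 fragColor;

void main()
{
    float a = texture(u_inputs,  v_texcoord).r;
    float b = texture(u_weights, v_texcoord).r;
    fragColor = vec4(a * b, 0.0, 0.0, 1.0);    // the product lands in the render target's red channel
}

Reading the render target back (via a PBO or glReadPixels) then gives the element-wise product on the CPU side.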
CUDA: pastebin.com/TkrhuEWA
Something is wrong. With L equal to 1 it computes correctly, but with 2 or more the numbers come out wildly wrong.
for (int j = 0; j < L; j++)
{
    // multiply stage, then sum stage, for layer j; both launches go to the same stream
    mulKernel<<<blocksMul, threadsMul>>>(devTemp, devInputs, devWeights, j * N * I, I);
    sumKernel<<<blocksSum, threadsSum>>>(devInputs, devTemp, N);
}
Something tells me this won't work without __syncthreads().
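I can't see the kernel bodies behind the link, but if sumKernel has several threads cooperating on one neuron's sum, the usual shape is a shared-memory tree reduction, and that is exactly where __syncthreads() matters. A generic sketch - the names, sizes and the one-block-per-neuron split are my assumptions, not the code from the pastebin, and blockDim.x must be a power of two:

// one block per neuron; blockDim.x threads cooperate on that neuron's sum
__global__ void sumKernelSketch(float* out, const float* temp, int inputs)
{
    extern __shared__ float partial[];      // blockDim.x floats, size passed at launch
    int neuron = blockIdx.x;
    int tid    = threadIdx.x;

    // each thread accumulates a strided slice of this neuron's products
    float s = 0.0f;
    for (int l = tid; l < inputs; l += blockDim.x)
        s += temp[neuron * inputs + l];
    partial[tid] = s;
    __syncthreads();                        // every partial sum must be written before reducing

    // tree reduction in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                    // finish one stage before the next one reads
    }

    if (tid == 0)
        out[neuron] = partial[0];           // the sigmoid could be applied here as well
}

// launch: the third parameter is the dynamic shared-memory size
sumKernelSketch<<<NEURONS, 128, 128 * sizeof(float)>>>(devInputs, devTemp, I);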
Here: pastebin.com/ggETxnX8 - element-wise multiplication of two arrays is implemented.
I have a few questions:
1. The number of elements here has to be a multiple of 4. How can I fix it so it works with any count?
2. How does the kernel know which index it is working with?
3. How do I use two kernels and run one after the other?
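For what it's worth, on 1 and 2: each thread computes its own global index as blockIdx.x * blockDim.x + threadIdx.x, and a simple if (idx < n) guard lets the element count be arbitrary (see the multiplication sketch further up). On 3: kernel launches issued into the same stream execute in order, so you just launch them back to back; only the host needs an explicit wait before reading the result. A sketch with made-up kernel names:

// both launches go to the default stream, so kernelB does not start
// until kernelA has finished - no sync call is needed between them
kernelA<<<blocks, threads>>>(devTemp, devA, devB, n);
kernelB<<<blocks, threads>>>(devOut, devTemp, n);

// the host, however, must wait before touching the result
cudaDeviceSynchronize();
cudaMemcpy(hostOut, devOut, n * sizeof(float), cudaMemcpyDeviceToHost);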