How to implement element-wise multiplication of arrays using OpenMP, CUDA and GLSL?
I've been trying to solve this problem for several days now, but my solutions are slower than the sequential single-threaded solution:
for (int i = 0; i < LAYERS; i++)
{
    // element-wise products: each weight of layer i times the matching input
    for (int j = 0; j < INPUTS * NEURONS; j++)
    {
        temp[j] = inputs[j % INPUTS] * weights[i][j];
    }
    // per-neuron sum of those products, then the activation
    for (int j = 0; j < NEURONS; j++)
    {
        inputs[j] = 0;
        for (int l = 0; l < INPUTS; l++)
        {
            inputs[j] += temp[j * INPUTS + l];
        }
        inputs[j] = sigmoid(inputs[j]);
    }
}
It looks like in the second variant the second loop is not parallelized. Replace inputs[j] with a local variable so that it ends up in a register. You may also need an intermediate array to accumulate these inputs[j] into, so that you can switch over to it in one go after the loop exits. A sketch of what I mean is below.
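A minimal sketch of that idea, assuming inputs and outputs are two float* buffers of equal size; the outputs array, the fusion of the two inner loops and the std::swap are my additions, not code from the question:

#include <algorithm>   // std::swap
#include <omp.h>

for (int i = 0; i < LAYERS; i++)
{
    #pragma omp parallel for
    for (int j = 0; j < NEURONS; j++)
    {
        float sum = 0.0f;                          // local accumulator, lives in a register
        for (int l = 0; l < INPUTS; l++)
            sum += inputs[l] * weights[i][j * INPUTS + l];
        outputs[j] = sigmoid(sum);                 // written to the intermediate array, not to inputs
    }
    std::swap(inputs, outputs);                    // switch to the new values in one go after the parallel loop
}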
Here I'm trying to use cuBLAS with CUDA. I found matrix multiplication, but I don't understand how to multiply two arrays element-wise.
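As far as I know, cuBLAS has no dedicated element-wise (Hadamard) product routine (cublasSdgmm can be bent into doing it, but a tiny custom kernel is usually simpler). A sketch with made-up names:

__global__ void hadamard(float* out, const float* a, const float* b, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's global element index
    if (idx < n)                                        // guard: n need not be a multiple of the block size
        out[idx] = a[idx] * b[idx];
}

// launch with enough blocks to cover all n elements
int threads = 256;
int blocks  = (n + threads - 1) / threads;
hadamard<<<blocks, threads>>>(devOut, devA, devB, n);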
What is the fundamental difference between this: pastebin.com/8dpNKBwt (faster than its single-threaded counterpart) and this: pastebin.com/fbe4gZSn (slower)?
So with OpenMP I settled on this: pastebin.com/sJ4fXiAb - now there seems to be a noticeable speedup.
Now I'm thinking about CUDA; so far it's more or less working out.
But GLSL is giving me trouble. I'm not familiar with this technology at all - can anyone help?
Well, look.
There are no arrays as such here.
Create a window - this is mandatory.
Initialize the pipeline.
Create a vertex buffer object with vertex coordinates. Yes, I know you don't have any vertices - but otherwise the shader can't be launched, it runs once per vertex. The vertices can be fake.
Create the textures. Modern video cards support float textures, so there shouldn't be any problems - but do check it on the GPU the code will actually run on, otherwise you'll need a different way of doing the calculation.
Set all this stuff as render state. Run the shader you've created and compiled.
It takes the values from the textures and draws us a picture. More precisely, two shaders are needed, a vertex one and a pixel one; the second is the one that writes out the result. There are also tricks for writing intermediate results out of the shader - but there you have to look at what a particular GPU can do.
Read back the result. From where is the question. If pixel buffer objects and render targets are supported, then from a PBO; otherwise from the screen buffer (I hope it won't come to that and the GPU will be reasonably decent).
I may have forgotten something.
The above is only an outline - a minimal fragment-shader sketch follows below. In practice the code turns out to be a real handful; it never looks like "simple array multiplication".
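A minimal sketch of the pixel (fragment) shader stage described above, assuming both operands have been uploaded into single-channel float textures and the full-screen quad / render-target setup is already in place (all the names here are mine):

#version 330 core

uniform sampler2D u_inputs;    // first operand, packed into a float texture
uniform sampler2D u_weights;   // second operand, same layout
in  vec2 v_texcoord;           // interpolated from the (fake) vertex shader
out vec4 fragColor;

void main()
{
    float a = texture(u_inputs,  v_texcoord).r;
    float b = texture(u_weights, v_texcoord).r;
    fragColor = vec4(a * b, 0.0, 0.0, 1.0);    // the product lands in the render target's red channel
}

Reading the render target back (via a PBO or glReadPixels) then gives the element-wise product on the CPU side.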
CUDA: pastebin.com/TkrhuEWA
Something is wrong. With L equal to 1 it computes correctly, but with 2 or more the numbers come out wildly wrong.
for (int j = 0; j < L; j++)
{
    // multiply stage, then sum stage, for layer j; both launches go to the same stream
    mulKernel<<<blocksMul, threadsMul>>>(devTemp, devInputs, devWeights, j * N * I, I);
    sumKernel<<<blocksSum, threadsSum>>>(devInputs, devTemp, N);
}
Something tells me this won't work without __syncthreads().
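I can't see the kernel bodies behind the link, but if sumKernel has several threads cooperating on one neuron's sum, the usual shape is a shared-memory tree reduction, and that is exactly where __syncthreads() matters. A generic sketch - the names, sizes and the one-block-per-neuron split are my assumptions, not the code from the pastebin, and blockDim.x must be a power of two:

// one block per neuron; blockDim.x threads cooperate on that neuron's sum
__global__ void sumKernelSketch(float* out, const float* temp, int inputs)
{
    extern __shared__ float partial[];      // blockDim.x floats, size passed at launch
    int neuron = blockIdx.x;
    int tid    = threadIdx.x;

    // each thread accumulates a strided slice of this neuron's products
    float s = 0.0f;
    for (int l = tid; l < inputs; l += blockDim.x)
        s += temp[neuron * inputs + l];
    partial[tid] = s;
    __syncthreads();                        // every partial sum must be written before reducing

    // tree reduction in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                    // finish one stage before the next one reads
    }

    if (tid == 0)
        out[neuron] = partial[0];           // the sigmoid could be applied here as well
}

// launch: the third parameter is the dynamic shared-memory size
sumKernelSketch<<<NEURONS, 128, 128 * sizeof(float)>>>(devInputs, devTemp, I);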
Here: pastebin.com/ggETxnX8 - element-wise multiplication of two arrays is implemented.
I have a few questions:
1. The number of elements here has to be a multiple of 4. How can I fix it so it works with any count?
2. How does the kernel know which index it is working with?
3. How do I use two kernels and run one after the other?
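For what it's worth, on 1 and 2: each thread computes its own global index as blockIdx.x * blockDim.x + threadIdx.x, and a simple if (idx < n) guard lets the element count be arbitrary (see the multiplication sketch further up). On 3: kernel launches issued into the same stream execute in order, so you just launch them back to back; only the host needs an explicit wait before reading the result. A sketch with made-up kernel names:

// both launches go to the default stream, so kernelB does not start
// until kernelA has finished - no sync call is needed between them
kernelA<<<blocks, threads>>>(devTemp, devA, devB, n);
kernelB<<<blocks, threads>>>(devOut, devTemp, n);

// the host, however, must wait before touching the result
cudaDeviceSynchronize();
cudaMemcpy(hostOut, devOut, n * sizeof(float), cudaMemcpyDeviceToHost);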