Python
bobrikan, 2020-02-01 00:30:53

Is it possible to use multiple graphics cards to speed up scientific calculations in C++ or Python?

At the moment I'm doing scientific calculations (matrix computations, which parallelize well). It started with sequential calculations on the CPU in Python; soon came the idea of parallelizing on the CPU, and then of using CUDA (I use the numba library in Python). The results are very good. However, even on a 1080 Ti the computation now takes more than 15 hours.

So, my questions:
1) How much faster is a Tesla V100 than a 1080 Ti?
2) How much faster will the calculations be after switching to C++?
3) Is it possible to use multiple GPUs for calculations at the same time in Python? Of course, I could split the matrix into 4 parts by hand, run 4 copies of the same code (differing only in the cuda.select_device(n) call that assigns the GPU), and then join the 4 result matrices manually, but that is very cumbersome.
4) The same question for C++.
I searched the Internet for information but didn't find any.
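For what it's worth, the splitting-and-joining step from question 3 can be automated rather than done by hand. A minimal CPU-only sketch with numpy (on real hardware each chunk would be processed in its own process after calling numba's cuda.select_device(i); the per-chunk "kernel" below is just a placeholder):

```python
import numpy as np

def split_rows(a, n_devices):
    """Split a matrix into n_devices row blocks of near-equal size."""
    return np.array_split(a, n_devices, axis=0)

def process_chunk(chunk):
    # Placeholder for the per-GPU kernel; a CPU stand-in for illustration.
    return chunk * 2.0

def run_on_devices(a, n_devices=4):
    # In a real multi-GPU setup, each chunk would go to a separate
    # process that first calls numba.cuda.select_device(i).
    parts = [process_chunk(c) for c in split_rows(a, n_devices)]
    return np.vstack(parts)  # join the partial results back into one matrix

a = np.arange(12.0).reshape(6, 2)
result = run_on_devices(a)
```

The point is only that the split/join bookkeeping is a few lines, not four hand-edited copies of the script.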


4 answer(s)
Vladimir Korotenko, 2020-02-01
@firedragon

See: https://www.microway.com/knowledge-center-articles...
Tesla's only real problem is the cost.

Andrey Dugin, 2020-02-01
@adugin

I'll just leave this here as an illustration of speeding up matrix multiplication by a factor of 330 without any GPU at all:
How to speed up matrix multiplication in numpy?
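The gist of that linked answer is replacing Python-level loops with numpy's BLAS-backed operations. A minimal illustration (the exact speedup factor depends on matrix sizes and the BLAS build numpy is linked against):

```python
import numpy as np

def matmul_naive(a, b):
    """Pure-Python triple loop: what not to do."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2
    out = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            s = 0.0
            for t in range(k):
                s += a[i, t] * b[t, j]
            out[i, j] = s
    return out

rng = np.random.default_rng(0)
a = rng.random((30, 40))
b = rng.random((40, 20))

slow = matmul_naive(a, b)   # interpreted Python loops
fast = a @ b                # dispatches to the optimized BLAS routine
```

Both produce the same matrix; only the second runs at native speed.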

Oleg Kirillov, 2020-02-02
@madstrix

In my experience, the difference between Python and C++ will be negligible in this case, because the calculations themselves run on the device, and the host is only responsible for I/O and general logic. I haven't worked with CUDA, but with OpenCL most of the time goes into copying data to/from the GPU and writing files. You need to profile. I've also heard that you can map a region of RAM (or even ROM) directly into the GPU's address space, which avoids unnecessary copying (though I haven't tried it myself).
Regarding multiple GPUs: I have used them. It gives a speedup of about 0.9n to 0.95n, where n is the number of devices. The host gathers data on the approximate performance of each device, the whole task is divided into parts proportionally and launched, and then the results are collected back on the host.
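The proportional split described above can be sketched like this (the device weights are made-up illustrative numbers; in practice they would come from a calibration run on each GPU):

```python
def partition_rows(total_rows, weights):
    """Divide total_rows among devices proportionally to their weights.

    Rounds down per device, then hands the leftover rows to the
    highest-weight devices so the sizes always sum to total_rows.
    """
    total_w = sum(weights)
    sizes = [int(total_rows * w / total_w) for w in weights]
    remainder = total_rows - sum(sizes)
    # distribute leftover rows to the fastest devices first
    order = sorted(range(len(weights)), key=lambda i: -weights[i])
    for i in range(remainder):
        sizes[order[i % len(order)]] += 1
    return sizes

# e.g. one faster card plus three equal ones (weights are hypothetical)
sizes = partition_rows(10_000, [1.6, 1.0, 1.0, 1.0])
```

Each device then gets a slice of rows of the corresponding size, and the host concatenates the partial results afterwards.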
