AI and computer vision - [CUDA] - [CUDA multi-stream under CPU multithreading]

First, the problem: under CPU multithreading, if you call the same CUDA kernel function from multiple threads, you will find that efficiency is very low. On inspection, no matter how many host threads you use, CUDA places every kernel launch into the default stream, where the launches are queued up, which is effectively single-threaded execution. This problem was addressed in CUDA 7. Here are some tests of this behavior.

This article refers to the following links:

https://www.cnblogs.com/wujianming-110117/p/14091897.html

https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/

CUDA 7 streams simplify concurrency in heterogeneous computing, that is, the efficient use of all processors in the system, CPU and GPU alike. To do this, an application must execute work concurrently on multiple processors. CUDA applications manage concurrency by issuing asynchronous commands into streams; commands within a stream execute in order, while commands in different streams can execute concurrently or out of order with respect to each other. When an asynchronous CUDA command is issued without specifying a stream, the runtime uses the default stream. Prior to CUDA 7, the default stream was a special stream that implicitly synchronized with all other streams on the device. CUDA 7 introduces a number of powerful new features, including an option to give each host thread its own default stream, which avoids the serialization of the legacy default stream. This article shows how this simplifies concurrency between kernels and data copies in CUDA programs.

Specifying a stream is optional; you can issue CUDA commands without specifying a stream (or by setting the stream parameter to zero). The following two lines of code both launch a kernel on the default stream.

  kernel<<< blocks, threads, bytes >>>();    // default stream
  kernel<<< blocks, threads, bytes, 0 >>>(); // stream 0

The default stream is useful when concurrency is not important to performance. Prior to CUDA 7, each device had a single default stream shared by all host threads, which caused implicit synchronization: as described in the "Implicit Synchronization" section of the CUDA C Programming Guide, two commands from different streams cannot run concurrently if the host thread issues any CUDA command to the default stream between them.

CUDA 7 introduces a new option, a per-thread default stream, which has two effects. First, it gives each host thread its own default stream, so commands issued to the default stream by different host threads can run concurrently. Second, these default streams are regular streams, so commands in a default stream can run concurrently with commands in non-default streams.

To enable per-thread default streams in CUDA 7 and later, either pass the nvcc command-line option --default-stream per-thread, or #define the CUDA_API_PER_THREAD_DEFAULT_STREAM preprocessor macro before including the CUDA headers (cuda.h or cuda_runtime.h). Note that defining CUDA_API_PER_THREAD_DEFAULT_STREAM cannot enable this behavior in a .cu file compiled by nvcc, because nvcc implicitly includes cuda_runtime.h at the top of the translation unit.

In Visual Studio, the specific method is to open the project properties, right-click into the CUDA C/C++ page (this option only exists if you created a CUDA project), and add --default-stream per-thread to the Command Line option.
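Outside Visual Studio, the same flag can be passed to nvcc directly on the command line; the source and output file names below are placeholders:

```shell
# Compile with one default stream per host thread.
# --default-stream per-thread is the exact nvcc option name;
# stream_test.cu / stream_test are placeholder file names.
nvcc --default-stream per-thread stream_test.cu -o stream_test
```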

 

The following describes the test method:

1. Open Nsight Monitor as an administrator
2. Click Nsight in the Visual Studio menu and select Start Performance Analysis
3. Check the System and CUDA options
4. Click Launch to run
5. When the run finishes, the results are output automatically; click Timeline to view the stream usage

The first example uses a for loop to test multiple streams. The code is as follows:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <iostream>
#include <thread>


const int N = 1 << 20;

__global__ void kernel(float *x, int n)
{
	int tid = threadIdx.x + blockIdx.x * blockDim.x;
	for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
		x[i] = sqrt(pow(3.14159, i));
	}
}

int main()
{
	const int num_streams = 8;
	cudaStream_t streams[num_streams];
	float *data[num_streams];
	for (int i = 0; i < num_streams; i++) {
		cudaStreamCreate(&streams[i]);
		cudaMalloc(&data[i], N * sizeof(float));
		// launch one worker kernel per stream
		kernel<<<1, 64, 0, streams[i]>>>(data[i], N);
		// launch a dummy kernel on the default stream
		kernel<<<1, 1>>>(0, 0);
	}
	cudaDeviceReset();
	return 0;
}

When compiled without the --default-stream per-thread option, the running results are as follows:

You can see that the kernel launches do not run concurrently.

After adding the option:

The second example uses multithreading; the code is as follows:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <thread>

const int N = 1 << 20;

__global__ void kernel(float *x, int n)
{
	int tid = threadIdx.x + blockIdx.x * blockDim.x;
	for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
		x[i] = sqrt(pow(3.14159, i));
	}
}

void launch_kernel(cudaStream_t stream)
{
	float *data;
	cudaMalloc(&data, N * sizeof(float));
	kernel << <1, 64,0, stream >> > (data, N);
	//cudaStreamSynchronize(0);
	return;
}

int main()
{
	const int num_threads = 8;
	std::thread threads[num_threads];
	cudaStream_t streams[num_threads];
	for (int i = 0; i < num_threads; i++) {
		cudaStreamCreate(&streams[i]);
		threads[i] = std::thread(launch_kernel, streams[i]);
	}

	for (int i = 0; i < num_threads; i++) {
		threads[i].join();
	}

	cudaDeviceReset();

	return 0;
}

Similarly, when compiled without the option, the results are as follows:

After adding the option, see the figure below:

 

Tip:
When programming for concurrency, you also need to keep the following points in mind.
1. With per-thread default streams, the default stream in each thread behaves like a regular stream as far as synchronization and concurrency are concerned. This is not true of the legacy default stream.
2. The --default-stream option is applied per compilation unit, so make sure it is passed on every nvcc command line that needs it.
3. cudaDeviceSynchronize() continues to synchronize everything on the device, even with the new per-thread default stream option. To synchronize only a single stream, use cudaStreamSynchronize(cudaStream_t stream), as in the second example.
4. Starting from CUDA 7, you can also explicitly access the per-thread default stream through the handle cudaStreamPerThread, and the legacy default stream through the handle cudaStreamLegacy. Note that cudaStreamLegacy still synchronizes implicitly with the per-thread default streams if you happen to mix them in one program.
5. You can create non-blocking streams that do not synchronize with the legacy default stream by passing the cudaStreamNonBlocking flag to cudaStreamCreateWithFlags().
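Tips 3 to 5 can be sketched in a minimal program; this assumes CUDA 7 or later, and the kernel body and sizes are placeholders, not part of the original examples:

```cuda
#include "cuda_runtime.h"

__global__ void kernel(float *x, int n)
{
	int i = threadIdx.x + blockIdx.x * blockDim.x;
	if (i < n) x[i] = (float)i;  // placeholder work
}

int main()
{
	float *data;
	cudaMalloc(&data, 1024 * sizeof(float));

	// Tip 4: the per-thread default stream has an explicit handle.
	kernel<<<1, 64, 0, cudaStreamPerThread>>>(data, 1024);
	// Tip 3: synchronize just this stream, not the whole device.
	cudaStreamSynchronize(cudaStreamPerThread);

	// Tip 5: a non-blocking stream does not synchronize with the
	// legacy default stream.
	cudaStream_t s;
	cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
	kernel<<<1, 64, 0, s>>>(data, 1024);
	cudaStreamSynchronize(s);
	cudaStreamDestroy(s);

	cudaFree(data);
	return 0;
}
```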

Tags: image processing C++ AI CUDA nvidia

Posted by Mouse on Mon, 03 Jan 2022 22:41:40 +1030