Serial Protocol Problem
problem analysis
Let's start by analyzing a simple protocol problem, assuming we have a simple array at this point
size=8; int array[size]={1, 2, 3, 4, 5, 6, 7, 8};
Our goal is to calculate the sum of arrays, that is, array[0]+array[1]+...+array[9], which can generally be implemented in a simple loop like this
int i; int sum = 0; for(i=0;i<size;i++) { sum+=array[i]; } printf("%d",sum); /* output: 36*/
Let's think about it, what if we use recursion?How will you do it?
We can think of this 10-length array as a bunch of nodes
Looking at the tree in reverse (from top to bottom), we're going to add up an array of 8 lengths.
First step: Calculate the sum of two consecutive positions of the array, that is, add 8/2=4 times array[0]+array[1]=3
array[2]+array[3]=7
array[4]+array[5]=11
array[6]+array[7]=15
We keep them in the original array in order, that is
array[0]=3
array[1]=7
array[2]=11
array[3]=15
Step 2: Follow the first step and calculate only subscript 0-4.
that is
array[0]=array[0]+array[1]=10
array[1]=array[2]+array[3]=26
Step 3: Still follow the previous approach, that is
array[0]=array[0]+array[1]=36
Result
Serial Code Implementation Process
By analogy, we can define a range value of the int function recursiveReduce, which has the following parameters
int recursiveReduce(int *data,const int size)
Then, according to our analysis, to calculate the sum of the first size value protocols, we must first calculate and add the sum of the first half and the second half of the former size values, and we can start writing our function body.
1. It is easy to imagine that if the length of the array is 1, then the return value of this function is obviously the first element of the array, then it can be as follows
int recursiveReduce(int *data,const int size){ if(size==1){ return data[0]; } }
2. We'll start with the first step as we thought before, which is to add two adjacent elements of an array, add the size/2 times, and save in the first half of the array
int recursiveReduce(int *data,const int size){ if (size==1){ return data[0]; } int const stride =size/2; for (int i=0; i<stride;i++) { data[i]+=data[i+stride]; } }
3. Still according to the previous analysis, what we need to do after running the second step program is to calculate the second step, which is to add the first half of the elements of the array and save them in the first half of the array, then we can accept the supplementary function and get the total code.
Total Code
int recursiveReduce(int *data,const int size){ if (size==1){ return data[0]; } int const stride =size/2; for (int i=0; i<stride;i++) { data[i]+=data[i+stride]; } return cpuRecursiveReduce(data,stride); }
Parallel Protocol
Concepts of grid and block for GPU
First in the gpu, we need to clarify some concepts about grid and block
In gpu, threads are saved in blocks as shown
There are several gird s in a kennel of the GPU, and several blocks in a grid. The specific number needs to be set in the code itself. The limit number of grids and blocks is related to the hardware configuration of the GPU.
Let's start with a grid and a block in the grid
There can be several blocks in a gird. blockIdx.x and blockIdx.y can be used to get the coordinates of a block in a grid
There can be several threads in a block, and threadIdx.x and threadIdx.y can be used to get the coordinates of threads in a block
With the GPU, we can distinguish each thread by its coordinates and assign them different tasks
General Parallel Protocol Issue
Let's start by thinking about the simplest case
We have a grid, a gird with only one block, a block with only one row of threads, and a row with exactly eight threads
That is, as shown in the diagram
We can assign tasks that add array positions I and i+1 to thread i. For example, to assign thread0 tasks that save values to array[0]+array[1], to assign thread2 tasks that save values to array[2]+array[3].And for the convenience of subsequent operations, unlike serial operations, we decided to store each addition where the array subscript equals the thread subscript, that is, array[0]=array[0]+array[1],array[2]=array[2]+array[3]
You can more intuitively understand the assignment of tasks by looking at the following figure
Ignore the waste (unused) of some threads before resolving it
Simple implementation of common parallel code
With the thread task assignment in mind, we can start writing our code
First allocate the data space in the main function and set the sizes of our grid s and block s.
/* Set the sizes of grid s and block s */ dim3 block(8,1); dim3 gird(1,1); int dataSize=8; int nBytes=dataSize*sizeof*(int); int *d_data,*d_odata,*gpuRef,h_data; /******************************************** * We use d_data saves data * With d_odata holds the final result of the protocol * In order to compare the results correctly, we need to compare them with serial results * So h_is assignedData to hold serial data * And gpuRef to accept the final result of the GPU * * cudaMalloc by Memory Allocation Method for GPU ********************************************/ cudaMalloc((void **)&d_data,nBytes); cudaMalloc((void **)&d_odata,sizeof(int)); h_data=malloc(nBytes); gpuRef=malloc(nBytes); int i; /* Let h_data[8]={1, 2, 3, 4, 5, 6, 7, 8} */ for(i=0;i<dataSize;i++){ h_data[i]=i+1; } /* GPU Data Copy Method */ cudaMemcpy(&d_data,&h_data,nBytes,cudaMemcpyHostToDevice);
Then write our GPU protocol functions
__global__ void reduceNeighbored(int *g_idata,int *g_odata,unsigned int size);
It is easy to know what a parameter means by its name
g_idata: Array pointer representing the global containing data
g_odata: Indicates the global pointer to the array that holds the result
size: represents the length of the array containing the data
We assign tasks one step at a time by getting the location of the threads before assigning them by subscript
__global__ void reduceNeighbored(int *g_data,int *g_odata,unsigned int size) { unsigned int tid=threadIdx.x; }
A simple rethink of the serial computing process
First step: Calculate the sum of two consecutive positions of the array, that is, add 8/2=4 times array[0]+array[1]=3
array[2]+array[3]=7
array[4]+array[5]=11
array[6]+array[7]=15
We keep them in the original array in order, that is
array[0]=3
array[1]=7
array[2]=11
array[3]=15
Step 2: Follow the first step of the calculation, where only the subscript 0-4 is calculated, that is, array[0]=array[0]+array[1]=10
array[1]=array[2]+array[3]=26
Step 3: Still follow the previous approach, that is
array[0]=array[0]+array[1]=36
Result
So in parallel, we can plan three phases in a loop
__global__ void reduceNeighbored(int *g_data,int *g_odata,unsigned int size) { unsigned int tid=threadIdx.x; int i; for(i=0;i<3;i++){ //thread(tid) operation in phase i } }
We can easily discover the rule that a thread's subscript at the first operation is a multiple of 2, the second operation is a multiple of 4, and the third operation is a multiple of 8.That is, the thread subscript for the first operation should be 2 i 2^{i} Multiples of 2i, and the results of the protocol are saved in g_after the end of the three operationsData[0], so it's easy to write code
__global__ void reduceNeighbored(int *g_data,int *g_odata,unsigned int size) { unsigned int tid=threadIdx.x; int stride; for(stride=1;i<blockDim.x;stride*=2){ if(tid%(2*stride)==0){ g_data[tid]+=g_data[tid+stride]; } } /* We set thread0 to copy the results */ if(tid==0){ *g_odata=g_data[0]; } }
I'm sure you can see at a glance what's wrong with this code, because of the uncertainty of parallel threads running, we can't control the consistency of the phases of all threads in the loop.
It is likely that thread6 will be in a stage where stride equals 1 while thread0 is still calculating g_data[6]+=g_data[7]
But thread4 is at stage stride=2, where thread4 is already calculating g_data[4]+=g_data[6]
This will make the result different from what we expected.
Fortunately, NVDIA provides a native fence function syncthreads for thread synchronization, which further improves our code.
__global__ void reduceNeighbored(int *g_data,int *g_odata,unsigned int size) { unsigned int tid=threadIdx.x; int stride; for(stride=1;i<blockDim.x;stride*=2){ if(tid%(2*stride)==0){ g_data[tid]+=g_data[tid+stride]; } /* Synchronize threads by adding a fence after each step of the loop */ __syncthreads(); } /* We set thread0 to copy the results */ if(tid==0){ *g_odata=g_data[0]; } }
Uncoupled implementation of normal parallel code (about grid and block coupling)
Considering a simple grid and a block, what we should do now is to reduce the coupling, and now assume that our data volume is a long one
2
11
2^{11}
An array of 211, it's clear that you can't have as many threads in a block in the GPU at all, so we need more than one grid.
Let's set the number of thread s in the block to 512, then the number of grid s should be equal to
2
11
/
512
2^{11}/512
211/512, we set in the program
dataSize=1<<12;
blocksize=512;
girdsize=dataSize/blocksize;
Obviously, this is problematic because integer division is rounding down, and if blocksize is not a divisor of dataSize, it is problematic
So you can set gridsize=(dataSize+blocksize-1)/blocksize;
As before, the main function now allocates data space and sets the sizes of our grids and block s.And since the latitude of our grid is no longer 1, we need to allocate space to the array where we store the gpu results, one grid for each location.
int size=1<<12; int blocksize=512; dim3 block(blocksize,1); dim3 grid((size+block.x-1)/block.x,1); size_t nBytes=size*sizeof(int); int *h_idata=(int *)malloc(nBytes); int *gpuRef=(int *)malloc(grid.x*sizeof(int)); int *tmp=(int*)malloc(nBytes); int *d_idata=NULL; int *d_odata=NULL; for (int i=0;i<size;i++){ h_idata[i]=(int)(rand()&0xFF); } memcpy(tmp,h_idata,nBytes); cudaMalloc((void **) &d_idata,nBytes); cudaMalloc((void **) &d_odata,grid.x*sizeof(int)); cudaMemcpy(d_idata,h_idata,nBytes,cudaMemcpyHostToDevice);
So during the protocol process, we need to react to the data subscript that needs to be processed by the thread through the threads subscript, and we need to create a local variable idata to point to the address where the block will start processing the data.
For point-to-address, since the amount of data is not particularly large, the latitude and longitude of grid s and blocks are one-dimensional in the code, so the y-coordinates of blocks and thread s are not considered.
Let's start with block0,0: the obvious starting coordinate is 0 for the data and ending at 512, which is blocksize-1.
block1,0: The corresponding starting coordinate is 512 of the data, which follows the end of the data to which block0,0 points.
As you can easily see from this rule, the subscript at the beginning of the data you point to is equal to the x-coordinate of the blocksize*block
Code implementation is (blockDim.x represents the latitude of the block in the x-axis)
index=blockIdx.x*blockDim.x;
So for each thread, its subscript should be start+threadIdx.x
So you can set the variable idx in your code
int idx=blockIdx.x*blockDim.x+threadIdx.x;
We can use the above data to determine that the location idata is pointing to should be equal to
int *idata=index+threadIdx.x;
And use the data subscript idx corresponding to thread s to determine whether overflow occurs
if(idx>=size){return;}
The complete code for the function is as follows
__global__ void reduceNeighbored(int *g_idata,int *g_odata,unsigned int size) { unsigned int tid=threadIdx.x; unsigned int idx=blockIdx.x * blockDim.x+threadIdx.x; int *idata=g_idata+blockIdx.x*blockDim.x; if (idx >= size){ return ; } for (int stride=1;stride<blockDim.x;stride*=2) { if ((tid%(2*stride))==0) { idata[tid]+=idata[tid+stride]; } __syncthreads(); } if (tid==0){ g_odata[blockIdx.x]=idata[0]; } }
Solve small problems
Recall previous thread task assignment graphs
We found that there are many threads that are not working, termed Thread bundle differentiation.
It is obvious that threads whose corresponding data array subscripts are not 2stride multiples are not assigned tasks at all stages, so we can easily modify the code. This time, we only set half of the threads and multiply their corresponding indexes by 2stride. Of course, to prevent subscript overflow, we need to set a critical condition, idx<blockDim.x.
The improvement code is
__global__ void reduceNeighboredLess(int *g_idata,int *g_odata,unsigned int size) { unsigned int tid=threadIdx.x; unsigned int idx=blockIdx.x * blockDim.x+threadIdx.x; int *idata=g_idata+blockIdx.x*blockDim.x; if (idx >= size){ return ; } for (int stride=1;stride<blockDim.x;stride*=2) { int index=2*tid*stride; if (index<blockDim.x){ idata[index]+=idata[index+stride]; } __syncthreads(); } if (tid==0){ g_odata[blockIdx.x]=idata[0]; } }
Nested Recursive Parallel Protocol Problem
GPU Dynamic Parallel
Next, we discuss nested parallelization protocols.
Nesting is recursion, and the core idea is that the parent thread calls out the child thread.
As shown, each row represents a set of sibling threads, and thread 0 in each row calls out a child thread
Specific can be referred to Dynamic Parallel
Use of dynamic parallelism in protocols
Or, as previously set, blocksize is 512, thread0 in each grid should create 256 sub-threads to complete the second stage after 512 threads in the first stage complete the task, and so on, until the number of threads equals 1, then it can be returned.
To make our code easier, we need a different way of thinking about the protocol and the assignment of tasks, as shown below
Statute thinking
Task Assignment Ideas
Clear your mind and you can easily write code
The code for the nested protocol is as follows
__global__ void gpuRecursiveReduce(int *g_idata,int *g_odata,const int iSize){ unsigned int tid=threadIdx.x; int *idata=g_idata+blockIdx.x*blockDim.x; int *odata=g_odata+blockIdx.x; if (iSize==2 && tid==0){ g_odata[blockIdx.x]=idata[0]+idata[1]; return; } int iStride=iSize>>1; if (iStride>1 && tid<iStride) { idata[tid] += idata[tid + iStride]; } __syncthreads(); if (tid==0){ gpuRecursiveReduce<<<1,iStride>>>(idata,odata,iStride); cudaDeviceSynchronize(); } __syncthreads(); }
_u is used multiple times to ensure thread synchronizationThe syncthreads() function allows the reader to understand the intent of synchronization everywhere and the possible consequences of not synchronizing
Complete code (including main function)
#include <stdio.h> #include <cuda_runtime.h> #include "../tool.h" /* CHECK Functions and GET_TIME functions are defined in the tool.h file * Readers should delete code when copying or copying and pasting*/ int cpuRecursiveReduce(int *data,const int size){ if (size==1){ return data[0]; } int const stride =size/2; for (int i=0; i<stride;i++) { data[i]+=data[i+stride]; } return cpuRecursiveReduce(data,stride); } __global__ void gpuRecursiveReduce(int *g_idata,int *g_odata,const int iSize){ unsigned int tid=threadIdx.x; int *idata=g_idata+blockIdx.x*blockDim.x; int *odata=g_odata+blockIdx.x; if (iSize==2 && tid==0){ g_odata[blockIdx.x]=idata[0]+idata[1]; return; } int iStride=iSize>>1; if (iStride>1 && tid<iStride) { idata[tid] += idata[tid + iStride]; } __syncthreads(); if (tid==0){ gpuRecursiveReduce<<<1,iStride>>>(idata,odata,iStride); cudaDeviceSynchronize(); } __syncthreads(); } int main(int argc,char* argv[]){ int size=32; int blocksize=32; int gridsize=(size+blocksize-1)/blocksize; size_t nBytes=size*sizeof(int); int *hData,*diData,*doData,*gpuRef,cpuSum=0.0,gpuSum=0.0; double start,finish,elapsed; if (argc>1){ gridsize=atoi(argv[1]); blocksize=(size+gridsize-1)/gridsize; } dim3 block(blocksize,1); dim3 grid(gridsize,1); hData=(int *)malloc(nBytes); gpuRef=(int *)malloc(grid.x*sizeof(int)); CHECK(cudaMalloc((void**)&diData,nBytes)); CHECK(cudaMalloc((void**)&doData,grid.x*sizeof(int))); for (int i=0;i<size;i++){ *(hData)=i; } CHECK(cudaMemcpy(diData,hData,nBytes,cudaMemcpyHostToDevice)); GET_TIME(start); cpuSum=cpuRecursiveReduce(hData,size); GET_TIME(finish); elapsed=finish-start; printf("calculate %d numbers,cpu cost %lf sc. result is %d\n",size,elapsed,cpuSum); GET_TIME(start); gpuRecursiveReduce<<<grid,block>>>(diData,doData,block.x); CHECK(cudaDeviceSynchronize()); CHECK(cudaGetLastError()); GET_TIME(finish); elapsed=finish-start; CHECK(cudaMemcpy(gpuRef,doData,grid.x*sizeof(int),cudaMemcpyDeviceToHost);); for (int i=0;i<grid.x;i++){ gpuSum+=gpuRef[i]; } printf("calculate %d numbers,gpu cost %lf sc. result is %d\n",size,elapsed,gpuSum); return 0; }
Paste Out Results
(base) $ nvcc -arch=sm_50 -rdc=true -c nestedReduce.cu (base) $ nvcc -arch=sm_50 -dlink nestedReduce.o -o device_link.o (base) $ g++ device_link.o nestedReduce.o -o reduce.out -L/usr/local/cuda-10.1/lib64 -lcudart -lcudadevrt (base)$./reduce.out ./reduce.out starting .... at device 0: GeForce GTX 750 Ti calculate 32 numbers,cpu cost 0.000002 sc. result is 31 calculate 32 numbers,gpu cost 0.000196 sc. result is 31
Problems with dynamic parallelism
Due to the special nature of nesting, problems and errors may occur during compilation
error: calling a __global__ function("gpuRecursiveReduce") from a __global__ function("gpuRecursiveReduce") is only allowed on the compute_35 architecture or above
Since I was compiled using CMakeLists, there is no workable way to solve the problem in CMakeLists, so I can only compile the target file from the command line
(base) $ nvcc -arch=sm_50 -rdc=true -c nestedReduce.cu (base) $ nvcc -arch=sm_50 -dlink nestedReduce.o -o device_link.o (base) $ g++ device_link.o nestedReduce.o -o reduce.out -L{path} -lcudart -lcudadevrt
Path to the cuda package lib64 on your computer in path
Thanks for reading
Thank you for taking the time to see the end and wish you a happy life every day!