# Serial Protocol Problem

## problem analysis

Let's start by analyzing a simple protocol problem, assuming we have a simple array at this point

```size=8;
int array[size]={1, 2, 3, 4, 5, 6, 7, 8};
```

Our goal is to calculate the sum of arrays, that is, array+array+...+array, which can generally be implemented in a simple loop like this

```int i;
int sum = 0;
for(i=0;i<size;i++)
{
sum+=array[i];
}
printf("%d",sum);
/*	output:	36*/
```

Let's think about it, what if we use recursion?How will you do it?

We can think of this 10-length array as a bunch of nodes

Looking at the tree in reverse (from top to bottom), we're going to add up an array of 8 lengths.

First step: Calculate the sum of two consecutive positions of the array, that is, add 8/2=4 times array+array=3
array+array=7
array+array=11
array+array=15
We keep them in the original array in order, that is
array=3
array=7
array=11
array=15
Step 2: Follow the first step and calculate only subscript 0-4.
that is
array=array+array=10
array=array+array=26
Step 3: Still follow the previous approach, that is
array=array+array=36
Result

## Serial Code Implementation Process

By analogy, we can define a range value of the int function recursiveReduce, which has the following parameters

```int recursiveReduce(int *data,const int size)
```

Then, according to our analysis, to calculate the sum of the first size value protocols, we must first calculate and add the sum of the first half and the second half of the former size values, and we can start writing our function body.
1. It is easy to imagine that if the length of the array is 1, then the return value of this function is obviously the first element of the array, then it can be as follows

```int recursiveReduce(int *data,const int size){
if(size==1){
return data;
}
}

```

2. We'll start with the first step as we thought before, which is to add two adjacent elements of an array, add the size/2 times, and save in the first half of the array

```int recursiveReduce(int *data,const int size){
if (size==1){
return data;
}
int const stride =size/2;

for (int i=0; i<stride;i++)
{
data[i]+=data[i+stride];
}
}
```

3. Still according to the previous analysis, what we need to do after running the second step program is to calculate the second step, which is to add the first half of the elements of the array and save them in the first half of the array, then we can accept the supplementary function and get the total code.

### Total Code

```int recursiveReduce(int *data,const int size){
if (size==1){
return data;
}
int const stride =size/2;

for (int i=0; i<stride;i++)
{
data[i]+=data[i+stride];
}

return cpuRecursiveReduce(data,stride);
}
```

# Parallel Protocol

## Concepts of grid and block for GPU

First in the gpu, we need to clarify some concepts about grid and block
In gpu, threads are saved in blocks as shown There are several gird s in a kennel of the GPU, and several blocks in a grid. The specific number needs to be set in the code itself. The limit number of grids and blocks is related to the hardware configuration of the GPU.
There can be several blocks in a gird. blockIdx.x and blockIdx.y can be used to get the coordinates of a block in a grid
There can be several threads in a block, and threadIdx.x and threadIdx.y can be used to get the coordinates of threads in a block

With the GPU, we can distinguish each thread by its coordinates and assign them different tasks

## General Parallel Protocol Issue

Let's start by thinking about the simplest case
We have a grid, a gird with only one block, a block with only one row of threads, and a row with exactly eight threads
That is, as shown in the diagram

We can assign tasks that add array positions I and i+1 to thread i. For example, to assign thread0 tasks that save values to array+array, to assign thread2 tasks that save values to array+array.And for the convenience of subsequent operations, unlike serial operations, we decided to store each addition where the array subscript equals the thread subscript, that is, array=array+array,array=array+array
You can more intuitively understand the assignment of tasks by looking at the following figure

Ignore the waste (unused) of some threads before resolving it

## Simple implementation of common parallel code

With the thread task assignment in mind, we can start writing our code
First allocate the data space in the main function and set the sizes of our grid s and block s.

```/*	Set the sizes of grid s and block s	*/
dim3 block(8,1);
dim3 gird(1,1);

int dataSize=8;
int nBytes=dataSize*sizeof*(int);
int *d_data,*d_odata,*gpuRef,h_data;
/********************************************
* We use d_data saves data
* With d_odata holds the final result of the protocol
* In order to compare the results correctly, we need to compare them with serial results
* So h_is assignedData to hold serial data
* And gpuRef to accept the final result of the GPU
*
* cudaMalloc by	Memory Allocation Method for GPU
********************************************/
cudaMalloc((void **)&d_data,nBytes);
cudaMalloc((void **)&d_odata,sizeof(int));
h_data=malloc(nBytes);
gpuRef=malloc(nBytes);
int i;
/*	Let h_data={1, 2, 3, 4, 5, 6, 7, 8}	*/
for(i=0;i<dataSize;i++){
h_data[i]=i+1;
}
/*	GPU Data Copy Method	*/
cudaMemcpy(&d_data,&h_data,nBytes,cudaMemcpyHostToDevice);
```

Then write our GPU protocol functions

```__global__ void reduceNeighbored(int *g_idata,int *g_odata,unsigned int size)；

```

It is easy to know what a parameter means by its name
g_idata: Array pointer representing the global containing data
g_odata: Indicates the global pointer to the array that holds the result
size: represents the length of the array containing the data
We assign tasks one step at a time by getting the location of the threads before assigning them by subscript

```__global__ void reduceNeighbored(int *g_data,int *g_odata,unsigned int size)
{
}
```

A simple rethink of the serial computing process

First step: Calculate the sum of two consecutive positions of the array, that is, add 8/2=4 times array+array=3
array+array=7
array+array=11
array+array=15
We keep them in the original array in order, that is
array=3
array=7
array=11
array=15
Step 2: Follow the first step of the calculation, where only the subscript 0-4 is calculated, that is, array=array+array=10
array=array+array=26
Step 3: Still follow the previous approach, that is
array=array+array=36
Result

So in parallel, we can plan three phases in a loop

```__global__ void reduceNeighbored(int *g_data,int *g_odata,unsigned int size)
{
int i;
for(i=0;i<3;i++){
}
}
```

We can easily discover the rule that a thread's subscript at the first operation is a multiple of 2, the second operation is a multiple of 4, and the third operation is a multiple of 8.That is, the thread subscript for the first operation should be 2 i 2^{i} Multiples of 2i, and the results of the protocol are saved in g_after the end of the three operationsData, so it's easy to write code

```__global__ void reduceNeighbored(int *g_data,int *g_odata,unsigned int size)
{
int stride;

for(stride=1;i<blockDim.x;stride*=2){
if(tid%(2*stride)==0){
g_data[tid]+=g_data[tid+stride];
}
}

/*	We set thread0 to copy the results	*/
if(tid==0){
*g_odata=g_data;
}

}
```

I'm sure you can see at a glance what's wrong with this code, because of the uncertainty of parallel threads running, we can't control the consistency of the phases of all threads in the loop.
It is likely that thread6 will be in a stage where stride equals 1 while thread0 is still calculating g_data+=g_data
This will make the result different from what we expected.
Fortunately, NVDIA provides a native fence function syncthreads for thread synchronization, which further improves our code.

```__global__ void reduceNeighbored(int *g_data,int *g_odata,unsigned int size)
{
int stride;

for(stride=1;i<blockDim.x;stride*=2){
if(tid%(2*stride)==0){
g_data[tid]+=g_data[tid+stride];
}
/*	Synchronize threads by adding a fence after each step of the loop	*/
}

/*	We set thread0 to copy the results	*/
if(tid==0){
*g_odata=g_data;
}

}
```

## Uncoupled implementation of normal parallel code (about grid and block coupling)

Considering a simple grid and a block, what we should do now is to reduce the coupling, and now assume that our data volume is a long one 2 11 2^{11} An array of 211, it's clear that you can't have as many threads in a block in the GPU at all, so we need more than one grid.
Let's set the number of thread s in the block to 512, then the number of grid s should be equal to 2 11 / 512 2^{11}/512 211/512, we set in the program

dataSize=1<<12;
blocksize=512;
girdsize=dataSize/blocksize;

Obviously, this is problematic because integer division is rounding down, and if blocksize is not a divisor of dataSize, it is problematic
So you can set gridsize=(dataSize+blocksize-1)/blocksize;

As before, the main function now allocates data space and sets the sizes of our grids and block s.And since the latitude of our grid is no longer 1, we need to allocate space to the array where we store the gpu results, one grid for each location.

```	int size=1<<12;

int blocksize=512;
dim3 block(blocksize,1);
dim3 grid((size+block.x-1)/block.x,1);

size_t nBytes=size*sizeof(int);
int *h_idata=(int *)malloc(nBytes);
int *gpuRef=(int *)malloc(grid.x*sizeof(int));
int *tmp=(int*)malloc(nBytes);
int *d_idata=NULL;
int *d_odata=NULL;
for (int i=0;i<size;i++){
h_idata[i]=(int)(rand()&0xFF);
}
memcpy(tmp,h_idata,nBytes);
cudaMalloc((void **) &d_idata,nBytes);
cudaMalloc((void **) &d_odata,grid.x*sizeof(int));
cudaMemcpy(d_idata,h_idata,nBytes,cudaMemcpyHostToDevice);

```

So during the protocol process, we need to react to the data subscript that needs to be processed by the thread through the threads subscript, and we need to create a local variable idata to point to the address where the block will start processing the data.
For point-to-address, since the amount of data is not particularly large, the latitude and longitude of grid s and blocks are one-dimensional in the code, so the y-coordinates of blocks and thread s are not considered.
Let's start with block0,0: the obvious starting coordinate is 0 for the data and ending at 512, which is blocksize-1.
block1,0: The corresponding starting coordinate is 512 of the data, which follows the end of the data to which block0,0 points.
As you can easily see from this rule, the subscript at the beginning of the data you point to is equal to the x-coordinate of the blocksize*block
Code implementation is (blockDim.x represents the latitude of the block in the x-axis)

index=blockIdx.x*blockDim.x;

So you can set the variable idx in your code

We can use the above data to determine that the location idata is pointing to should be equal to

And use the data subscript idx corresponding to thread s to determine whether overflow occurs

if(idx>=size){return;}

The complete code for the function is as follows

```__global__ void reduceNeighbored(int *g_idata,int *g_odata,unsigned int size)
{

int *idata=g_idata+blockIdx.x*blockDim.x;

if (idx >= size){
return ;
}

for (int stride=1;stride<blockDim.x;stride*=2)
{
if ((tid%(2*stride))==0)
{
idata[tid]+=idata[tid+stride];
}
}

if (tid==0){
g_odata[blockIdx.x]=idata;
}
}

```

## Solve small problems

We found that there are many threads that are not working, termed Thread bundle differentiation.
It is obvious that threads whose corresponding data array subscripts are not 2stride multiples are not assigned tasks at all stages, so we can easily modify the code. This time, we only set half of the threads and multiply their corresponding indexes by 2stride. Of course, to prevent subscript overflow, we need to set a critical condition, idx<blockDim.x.
The improvement code is

```__global__ void reduceNeighboredLess(int *g_idata,int *g_odata,unsigned int size)
{

int *idata=g_idata+blockIdx.x*blockDim.x;

if (idx >= size){
return ;
}

for (int stride=1;stride<blockDim.x;stride*=2)
{
int index=2*tid*stride;
if (index<blockDim.x){
idata[index]+=idata[index+stride];
}
}

if (tid==0){
g_odata[blockIdx.x]=idata;
}
}

```

## Nested Recursive Parallel Protocol Problem

### GPU Dynamic Parallel

Next, we discuss nested parallelization protocols.
Nesting is recursion, and the core idea is that the parent thread calls out the child thread.
As shown, each row represents a set of sibling threads, and thread 0 in each row calls out a child thread
Specific can be referred to Dynamic Parallel ### Use of dynamic parallelism in protocols

Or, as previously set, blocksize is 512, thread0 in each grid should create 256 sub-threads to complete the second stage after 512 threads in the first stage complete the task, and so on, until the number of threads equals 1, then it can be returned.
To make our code easier, we need a different way of thinking about the protocol and the assignment of tasks, as shown below
Statute thinking Clear your mind and you can easily write code
The code for the nested protocol is as follows

```__global__ void gpuRecursiveReduce(int *g_idata,int *g_odata,const int iSize){

int *idata=g_idata+blockIdx.x*blockDim.x;
int *odata=g_odata+blockIdx.x;

if (iSize==2 && tid==0){
g_odata[blockIdx.x]=idata+idata;
return;
}
int iStride=iSize>>1;
if (iStride>1 && tid<iStride) {
idata[tid] += idata[tid + iStride];
}

if (tid==0){
gpuRecursiveReduce<<<1,iStride>>>(idata,odata,iStride);

}

}

```

_u is used multiple times to ensure thread synchronizationThe syncthreads() function allows the reader to understand the intent of synchronization everywhere and the possible consequences of not synchronizing

### Complete code (including main function)

```#include <stdio.h>
#include <cuda_runtime.h>
#include "../tool.h"
/*	CHECK Functions and GET_TIME functions are defined in the tool.h file
*	Readers should delete code when copying or copying and pasting*/
int cpuRecursiveReduce(int *data,const int size){
if (size==1){
return data;
}
int const stride =size/2;

for (int i=0; i<stride;i++)
{
data[i]+=data[i+stride];
}

return cpuRecursiveReduce(data,stride);
}

__global__ void gpuRecursiveReduce(int *g_idata,int *g_odata,const int iSize){

int *idata=g_idata+blockIdx.x*blockDim.x;
int *odata=g_odata+blockIdx.x;

if (iSize==2 && tid==0){
g_odata[blockIdx.x]=idata+idata;
return;
}
int iStride=iSize>>1;
if (iStride>1 && tid<iStride) {
idata[tid] += idata[tid + iStride];
}

if (tid==0){
gpuRecursiveReduce<<<1,iStride>>>(idata,odata,iStride);

}

}

int main(int argc,char* argv[]){
int size=32;
int blocksize=32;
int gridsize=(size+blocksize-1)/blocksize;
size_t nBytes=size*sizeof(int);
int *hData,*diData,*doData,*gpuRef,cpuSum=0.0,gpuSum=0.0;
double start,finish,elapsed;
if (argc>1){
gridsize=atoi(argv);
blocksize=(size+gridsize-1)/gridsize;
}

dim3 block(blocksize,1);
dim3 grid(gridsize,1);

hData=(int *)malloc(nBytes);
gpuRef=(int *)malloc(grid.x*sizeof(int));
CHECK(cudaMalloc((void**)&diData,nBytes));
CHECK(cudaMalloc((void**)&doData,grid.x*sizeof(int)));
for (int i=0;i<size;i++){
*(hData)=i;
}
CHECK(cudaMemcpy(diData,hData,nBytes,cudaMemcpyHostToDevice));

GET_TIME(start);
cpuSum=cpuRecursiveReduce(hData,size);
GET_TIME(finish);
elapsed=finish-start;
printf("calculate %d numbers,cpu cost %lf sc. result is %d\n",size,elapsed,cpuSum);

GET_TIME(start);
gpuRecursiveReduce<<<grid,block>>>(diData,doData,block.x);
CHECK(cudaGetLastError());
GET_TIME(finish);
elapsed=finish-start;
CHECK(cudaMemcpy(gpuRef,doData,grid.x*sizeof(int),cudaMemcpyDeviceToHost););
for (int i=0;i<grid.x;i++){
gpuSum+=gpuRef[i];
}

printf("calculate %d numbers,gpu cost %lf sc. result is %d\n",size,elapsed,gpuSum);
return 0;
}

```

Paste Out Results

```(base) \$ nvcc -arch=sm_50 -rdc=true -c nestedReduce.cu
(base)\$./reduce.out
./reduce.out starting .... at device 0: GeForce GTX 750 Ti
calculate 32 numbers,cpu cost 0.000002 sc. result is 31
calculate 32 numbers,gpu cost 0.000196 sc. result is 31

```

## Problems with dynamic parallelism

Due to the special nature of nesting, problems and errors may occur during compilation

```error: calling a __global__ function("gpuRecursiveReduce") from a __global__ function("gpuRecursiveReduce") is only allowed on the compute_35 architecture or above

```

Since I was compiled using CMakeLists, there is no workable way to solve the problem in CMakeLists, so I can only compile the target file from the command line

```(base) \$ nvcc -arch=sm_50 -rdc=true -c nestedReduce.cu