Spatial coordinate system transformation implemented with CUDA Thrust
Catalogue of series articles
The CUDA-based parallel computing technology for space target orbits consists of four parts; the contents of the first and second parts are as follows:
1. Orbit calculation requirements and mission analysis of space targets based on CUDA
2. Calculation of the spatial coordinate system transformation matrix based on CUDA
2.1 Spatial coordinate system transformation based on SOFA
2.2 Spatial coordinate system transformation implemented with CUDA Thrust
2.3 Spatial coordinate system transformation implemented with CUDA Runtime
--------
Principle
CUDA supports development based on Thrust. In this mode, only the corresponding functor needs to be implemented; unlike the Runtime mode, work items are distinguished by their sequence number (element index) rather than by thread and block indices.
The CUDA-based spatial coordinate system transformation is, in essence, the parallel computation of the transformation matrix at each time point: each thread takes a time as input and outputs the required transformation matrix.
Since the most efficient input for Thrust is a plain sequence number, the time has to be derived from that number. The functor therefore needs member variables from which the time can be computed; for example, with a 1000 ms step, index i corresponds to the start time plus i seconds.
In addition, data such as the polar motion (EOP) parameters are loaded into host memory in advance, and device code cannot read host memory directly, so an access mechanism has to be provided: the data are copied to device memory and the functor holds raw device pointers to them.
Therefore, the functor class is defined as:
```cpp
struct c2tTransFunctor {
    timeOfSpace _bt;            // start time of the propagation
    int _stepMilliSecond;       // step size in milliseconds
    double *_xps, *_yps, *_dut1s, *_ddp80s, *_dde80s; // device pointers to the polar motion (EOP) data
    int _eopNum;                // number of EOP records

    __device__ mat3x3 operator()(int& t)
    {
        // offset of the EOP record (one record per day) for sample t
        int off = (long long int)_stepMilliSecond * t / (86400 * 1000);
        if (off >= _eopNum) off = _eopNum - 1;
        double xp   = _xps[off] * DAS2R;
        double yp   = _yps[off] * DAS2R;
        double dut1 = _dut1s[off];

        // time of sample t = start time + t * step
        timeOfSpace t0 = _bt;
        t0.addMilliSecond((long long int)_stepMilliSecond * t);

        double djmjd0, date;
        iauCal2jd(t0._year, t0._month, t0._day, &djmjd0, &date);
        mat3x3 mat;
        // ...... (the remaining calls follow the SOFA implementation of the previous section and fill mat)
        return mat;
    }
};
```
In the functor, the propagation start time _bt and the step size _stepMilliSecond (in milliseconds) are defined as member variables; in the operator body, the time t0 of the current sample is computed from this information and the sequence number.
The functor also defines the polar motion data pointers _xps, _yps, _dut1s, _ddp80s and _dde80s; the offset of the corresponding record within the EOP arrays is computed from the time.
The omitted code corresponds to the SOFA-based implementation given in the previous section.
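For reference, a minimal sketch of what that omitted chain typically looks like is given below, following the classical IAU 1976/1980 sequence from the SOFA "Earth attitude" cookbook. The function c2tMatrix76_80 and its parameter list are illustrative, not the original code; the SOFA routines would have to be compiled as __device__ functions to be callable from the functor, and all angles are assumed to be in radians.

```cpp
// Hedged sketch: classical IAU 1976/1980 celestial-to-terrestrial transformation,
// adapted from the SOFA "Earth attitude" cookbook. The implementation in the
// previous section may differ in detail.
#include "sofa.h"
#include <math.h>

void c2tMatrix76_80(int iy, int im, int id, double tod /* fraction of day (UTC) */,
                    double xp, double yp, double dut1, double ddp80, double dde80,
                    double rc2it[3][3])
{
    double djmjd0, date, dat, tai, tt, tut, dp80, de80, dpsi, deps, epsa, ee, gst;
    double rp[3][3], rn[3][3], rnpb[3][3], rc2ti[3][3], rpom[3][3];

    // Time scales: UTC -> TAI -> TT, and UT1
    iauCal2jd(iy, im, id, &djmjd0, &date);
    iauDat(iy, im, id, tod, &dat);
    tai = date + tod + dat / 86400.0;
    tt  = tai + 32.184 / 86400.0;
    tut = tod + dut1 / 86400.0;

    // IAU 1976 precession and IAU 1980 nutation, with EOP corrections
    iauPmat76(djmjd0, tt, rp);
    iauNut80(djmjd0, tt, &dp80, &de80);
    dpsi = dp80 + ddp80;
    deps = de80 + dde80;
    epsa = iauObl80(djmjd0, tt);
    iauNumat(epsa, dpsi, deps, rn);
    iauRxr(rn, rp, rnpb);                       // NPB = N x P

    // Greenwich apparent sidereal time and Earth rotation
    ee  = iauEqeq94(djmjd0, tt) + ddp80 * cos(epsa);
    gst = iauAnp(iauGmst82(djmjd0 + date, tut) + ee);
    iauCr(rnpb, rc2ti);
    iauRz(gst, rc2ti);

    // Polar motion -> final celestial-to-terrestrial matrix
    iauPom00(xp, yp, 0.0, rpom);
    iauRxr(rpom, rc2ti, rc2it);
}
```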
```cpp
void InitC2TFunctorPolarData(const timeOfSpace& t0, const timeOfSpace& t1, c2tTransFunctor& functor)
{
    // Read one EOP record per day covering [t0, t1] into host memory
    thrust::host_vector<double> xps;
    timeOfSpace t = t0;
    while (!(t > t1)) {
        double xp, yp, dut1, ddp80, dde80;
        find(t._year, t._month, t._day, xp, yp, dut1, ddp80, dde80);
        xps.push_back(xp);
        t.addDays(1);
    }
    functor._eopNum = xps.size();

    // Copy to device memory; the buffer must outlive the transform call, hence the
    // static storage (a local device_vector would leave functor._xps dangling)
    static thrust::device_vector<double> d_xps;
    d_xps = xps;
    functor._xps = thrust::raw_pointer_cast(d_xps.data());
    // _yps, _dut1s, _ddp80s and _dde80s are filled in the same way
}
```
Following the Thrust conventions, the parallel computation of the coordinate system transformation is then completed by the following function.
```cpp
void propagateTransMat_Thrust(const timeOfSpace& t0, const timeOfSpace& t1,
                              int stepMilliSecond, thrust::device_vector<mat3x3>& d_mats)
{
    // Set up the functor: start time, step size and device-side EOP data
    c2tTransFunctor functor;
    functor._bt = t0;
    functor._stepMilliSecond = stepMilliSecond;
    InitC2TFunctorPolarData(t0, t1, functor);

    // One sample index per time point
    int num = getSamplesNums(t0, t1, stepMilliSecond);
    thrust::device_vector<int> d_tms(num);
    d_mats.resize(num);
    thrust::sequence(d_tms.begin(), d_tms.end(), 0, 1);

    // Each index is transformed into its transformation matrix in parallel
    thrust::transform(d_tms.begin(), d_tms.end(), d_mats.begin(), functor);
}
```
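As a usage sketch, the call might look as follows; the way timeOfSpace is initialized here is an assumption for illustration only, since its constructors are not shown in the article.

```cpp
// Hypothetical usage: propagate roughly one day of matrices at a 1 s step.
timeOfSpace t0, t1;
t0._year = 2021; t0._month = 1; t0._day = 1;    // start date (illustrative)
t1 = t0;
t1.addDays(1);                                   // end: one day later

thrust::device_vector<mat3x3> d_mats;
propagateTransMat_Thrust(t0, t1, 1000, d_mats);  // 1000 ms step -> ~86400 matrices

mat3x3 m0 = d_mats[0];                           // copy one result back to the host if needed
```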
Efficiency analysis
The CPU version and the Thrust version were timed for different numbers of sampling points; the results are shown in the table below.
Sampling points | CPU version time (s) | Thrust version time (s) |
---|---|---|
14400 | 0.155 | 0.079 |
86400 | 0.976 | 0.132 |
864000 | 5.611 | 0.432 |
864000*5 | 26.798 | 1.909 |
It can be seen that when the number of sampling points is small, the benefit of parallel computing is not obvious, but as the number of sampling points grows the speedup exceeds a factor of ten, a clear efficiency improvement.
Profiling the run with the NVIDIA Visual Profiler, nvvp (14400 sampling points), the timeline is shown in the figure.
It can be seen that the time spent in cudaMalloc and in driver API calls is considerable, much higher than that of the kernel itself. At the same time, as the number of sampling points increases, the cudaMalloc time grows nonlinearly but slowly; in addition, tests show that where cudaMalloc is called does not affect its cost. It is therefore reasonable to infer that the time attributed to cudaMalloc and the driver API calls includes the time needed for GPU initialization (the first CUDA call creates the context). The subsequent efficiency analysis will therefore focus mainly on the kernel.
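To isolate the kernel from this one-time initialization cost, the transform itself can be timed with CUDA events. The sketch below is illustrative and assumes that functor, d_tms and d_mats have already been prepared as above; it is not the measurement code used for the table.

```cpp
// Minimal sketch: time only the thrust::transform, excluding context creation
// and allocation costs.
cudaFree(0);                      // optional warm-up: forces context creation up front

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
thrust::transform(d_tms.begin(), d_tms.end(), d_mats.begin(), functor);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // kernel time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);
```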
Developing with Thrust has the advantage of simplicity and ease of use: many CUDA calls are hidden, and in many cases the efficiency is similar to that of the Runtime mode.
However, the Runtime mode still has its advantages. First, it is more controllable: the number of threads and the memory access pattern can be set explicitly. Second, Thrust produces results through the functor's return value, so when several results must be produced per element they can only be packed into a struct; the output is then an array of structures (AoS), whereas subsequent computations usually need a structure of arrays (SoA), which is more efficient. The following sections therefore focus on the implementation and optimization of the Runtime mode.
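To illustrate the layout difference, a small sketch follows; the type and field names are illustrative and not taken from the original code.

```cpp
#include <thrust/device_vector.h>

// Array of structures (AoS): what a Thrust functor returning a struct produces.
// Consecutive threads reading only '.px' touch strided addresses -> poor coalescing.
struct State { double px, py, pz; };
thrust::device_vector<State> aos;              // aos[i].px, aos[i].py, aos[i].pz

// Structure of arrays (SoA): what later kernels usually prefer.
// Consecutive threads reading px[i] touch contiguous addresses -> coalesced access.
struct StateSoA {
    thrust::device_vector<double> px, py, pz;  // px[i], py[i], pz[i]
};
```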