CN113452383A

CN113452383A - GPU parallel optimization method for TPC decoding of software radio system

Info

Publication number: CN113452383A
Application number: CN202010225156.5A
Authority: CN
Inventors: 习勇; 陈海赞; 黄铁; 肖辉明; 王欣
Original assignee: Hunan Leading Wisdom Telecommunication and Technology Co Ltd
Current assignee: Hunan Leading Wisdom Telecommunication and Technology Co Ltd
Priority date: 2020-03-26
Filing date: 2020-03-26
Publication date: 2021-09-28

Abstract

The invention belongs to the technical fields of software radio and high-performance computing, and discloses a GPU parallel optimization method for TPC decoding in a software radio system, comprising the following steps: first, initializing, and copying the input information bit block data from CPU memory to the GPU global memory space; second, performing row decoding on all the information bit block data; third, performing column decoding on the TPC data blocks produced by the row decoding of the second step; fourth, repeating the second and third steps K times; and fifth, calculating a decision codeword from the TPC data blocks after the K-th decoding iteration and outputting it to the CPU side as the decoding result. By exploiting the computing power of the GPU, the invention can greatly increase the TPC decoding speed in a software radio system and meet the requirement of real-time decoding of multi-channel signals.

Description

GPU parallel optimization method for TPC decoding of software radio system

Technical Field

The invention relates to the technical field of software radio and the field of high-performance computing of computers, in particular to a GPU parallel optimization method for TPC decoding of a software radio system.

Background

TPC decoding is a key step in TDMA satellite signal demodulation, and its performance affects the performance of the entire demodulation system. The Turbo Product Code (TPC) is widely used in software radio communication systems as a high-code-rate code, but most TPC decoding algorithms suffer from a complex structure, high resource requirements and large data processing delay. For example, TPC requires iterative row-column decoding of a data block, and both the number of iterations and the efficiency of the row and column decoding affect the performance of the decoder. Current CPU-based TPC decoders execute serially and therefore struggle to decode in real time; TPC decoding is often the most time-consuming part of the whole satellite signal demodulation system, and this inefficient implementation limits its wider use there. Accelerating TPC decoding is therefore important.

Several approaches have been proposed to improve TPC decoding efficiency, such as coarse-grained parallel optimization on multi-core CPUs and hardware acceleration on FPGAs. Coarse-grained CPU parallelization yields some performance gain but lacks sufficient computing power at higher sampling rates. FPGA-based implementations have a long development cycle and limited flexibility. Compared with a conventional CPU, a GPU provides greater computing power and higher memory bandwidth. In addition, high-level programming models such as CUDA (Compute Unified Device Architecture) have greatly improved the programmability of GPUs. GPUs are now widely and effectively used in science and engineering, and recent successes in deep learning have made them even more popular. The TPC decoding algorithm is inherently highly parallel and matches the architectural characteristics of the GPU, so a GPU parallel optimization method that accelerates TPC decoding in a software radio system is both urgently needed and practical.

Disclosure of Invention

In order to solve the technical problems, the invention provides a GPU parallel optimization method aiming at TPC decoding of a software radio system, which is used for solving the problems of large time overhead and insufficient flexibility of the traditional TPC decoding implementation mode, and the specific technical scheme is as follows:

a GPU parallel optimization method aiming at TPC decoding of a software radio system comprises the following steps:

(S1) initializing, copying the input information bit block data from the CPU memory to the GPU global storage space, setting a corresponding output result storage space according to the size of the input information bit block data, and setting the initial value of the output result storage space to be 0;

(S2) performing row decoding on all the information bit block data;

(S3) column-decoding the row-decoded information bit block data;

(S4) repeating the steps (S2) to (S3) K times, wherein K is the number of iterations and is an integer greater than or equal to 6;

(S5) calculating a decision codeword from the TPC data block after the K-th decoding iteration and outputting it to the CPU as the decoding result.

Further, the step (S1) specifically includes the steps of:

(S11) copying the information bit block data to be processed from the CPU memory space to the GPU global memory space using the cudaMemcpy function;

(S12) initializing a GPU global memory space for storing the decoding result using a cudaMemset function, with an initial value of 0.

Further, the step (S2) specifically includes the steps of:

(S21) finding, by a two-level parallel method over thread blocks and threads, the positions of the p values with the smallest absolute value in each row of all the information bit block data, wherein p is a preset positive integer threshold, set to 4 in the embodiment of the invention; for each row of the information bit block data, $i_0, i_1, \dots, i_{p-1}$ denote the positions of the p values with the smallest absolute value in that row;

(S22) dividing the threads in the thread block equally into $2^p$ groups, each group computing one row test sequence of the row data; the threads within each thread block compute the $2^p$ row test sequences of the row data in parallel and compute the Euclidean distance between each row test sequence and the corresponding input row data, yielding $2^p$ Euclidean distances; the $2^p$ Euclidean distances corresponding to the row form a data set A; both the computation of the row test sequences and the computation of their Euclidean distance to the input row data use existing row decoding techniques.

The row test sequences are computed as follows: a symbol decision is made on the corresponding input row of the information bit block data to obtain a sequence L; positions $i_0, i_1, \dots, i_{p-1}$ of the sequence L are each set to 0 and to 1 while the values at all other positions are kept unchanged, forming $2^p$ row test sequences; for example, the first row test sequence has 0 at position $i_0$ with the other positions unchanged, the second row test sequence has 1 at position $i_0$ with the other positions unchanged, and so on.

The symbol decision rule is: for the element at a given position, if its value is less than 0 the decision is 0, otherwise it is 1;

(S23) the first thread of each thread block finds the minimum of the $2^p$ Euclidean distances (data set A) corresponding to the row, and all threads in the thread block then compute in parallel, according to this minimum, the row update amount of the corresponding row of data; the row update amount is obtained as follows: the row test sequence with the minimum Euclidean distance is taken as the optimal codeword, and all threads simultaneously compute the row extrinsic information at each position of the corresponding row;

(S24) updating the data of all rows in all the information bit blocks in parallel by adding the row extrinsic information to the elements at the corresponding positions of the input row data, which completes one row decoding pass.

Further, the step (S3) specifically includes the steps of:

(S31) finding, by a two-level parallel method over thread blocks and threads, the positions of the p values with the smallest absolute value in each column of all the information bit blocks; for each column, $j_0, j_1, \dots, j_{p-1}$ denote the positions of the p values with the smallest absolute value in that column;

(S32) dividing the threads in the thread block equally into $2^p$ groups, each group computing one column test sequence of the column data; the threads within each thread block compute the $2^p$ column test sequences of the column data in parallel and compute the Euclidean distance between each column test sequence and the corresponding input column data, yielding $2^p$ Euclidean distances; the $2^p$ Euclidean distances corresponding to the column form a data set B;

the column test sequences are computed as follows: a symbol decision is made on the corresponding input column of the information bit block data to obtain a sequence M; positions $j_0, j_1, \dots, j_{p-1}$ of the sequence M are each set to 0 and to 1 while the values at all other positions are kept unchanged, forming $2^p$ column test sequences; the symbol decision rule is: for the element at a given position, if its value is less than 0 the decision is 0, otherwise it is 1;

(S33) the first thread of each thread block finds the minimum of all the Euclidean distances (data set B), and all threads in the thread block then compute in parallel, according to this minimum, the column update amount of the corresponding column of data; the column update amount is obtained as follows: the column test sequence with the minimum Euclidean distance is taken as the optimal codeword, and all threads simultaneously compute the column extrinsic information at each position of the corresponding column;

(S34) updating the data of all columns in all the information bit blocks in parallel by adding the column extrinsic information to the elements at the corresponding positions of the input column data, which completes one column decoding pass.

The row extrinsic information and the column extrinsic information (the off-line and off-column information quantities) are computed according to the extrinsic information rule of the prior-art Chase algorithm.

Further, the step (S5) includes:

each thread reads the data at its corresponding position in the corresponding TPC data block; if the value is less than 0, the decision codeword, and hence the decoding result, is 1; otherwise the decision codeword and the decoding result are 0; each thread writes the decoding result for its position into the output data storage space.

Further, in the step (S21), the number of thread blocks equals the number of information bit blocks to be processed multiplied by the number of rows of each information bit block, and the number of threads in a thread block equals the length of one row of the information bit block data multiplied by $2^p$.

Further, in the step (S24), the parallelism of the parallel update is: number of information bit blocks × size of each data block.

Further, the value of K is an integer larger than or equal to 6.

The GPU kernels executed in the step (S2) and the step (S3) both use shared memory to store the data of the corresponding row or column and intermediate data such as the Euclidean distances, so as to reduce the number of accesses to GPU global memory as far as possible and improve the efficiency of the program.

In the step (S22), the Euclidean distances of all test sequences are computed, which avoids having to check on the GPU side whether test sequences are identical and defers the search for the minimum of all the Euclidean distances to the step (S23).

Compared with the prior art, the invention has the following advantages and beneficial effects: by GPU parallel optimization of the TPC decoding computation, all data are processed simultaneously by tens of thousands of threads, and the originally time-consuming parts are parallelized to achieve fast decoding; the row and column decoding is subdivided, and a suitable degree of parallelism is chosen for fine-grained parallel optimization according to the parallel characteristics and data dependencies of each stage; shared memory is used to cache the data to be processed and the intermediate data of each thread block, which reduces the number of global memory accesses and the memory access cost.

Drawings

FIG. 1 is a general flow diagram of the present invention;

FIG. 2 is a schematic structural diagram of a GPU parallel optimization method for TPC decoding of a software radio system according to the present invention.

Detailed Description

In order to better understand the technical solution of the present application, the present application will be described in detail with reference to the drawings and the detailed description in the embodiments of the present application.

Referring to fig. 1, it is a schematic general flow chart of a GPU parallel optimization method for TPC decoding of a software radio system according to an embodiment of the present invention. The specific process is as follows:

firstly, initializing the parameters, and copying the input information bit block data from CPU memory to the GPU global memory space;

secondly, performing row decoding on all the information bit block data; the row decoding is carried out by a row-processing GPU kernel function, each thread block of which decodes one row of an information bit block according to the Chase algorithm, and the number of thread blocks of the row-processing GPU kernel function is the number of information bit blocks to be processed multiplied by the number of rows of each information bit block;

thirdly, performing column decoding on the TPC data blocks produced by the row decoding of the second step; the column decoding is carried out by a column-processing GPU kernel function, each thread block of which decodes one column of a row-decoded information bit block according to the Chase algorithm, and the number of thread blocks of the column-processing GPU kernel function is the number of information bit blocks to be processed multiplied by the number of columns of each information bit block;

fourthly, repeating the second step to the third step for K times;

and fifthly, calculating a judgment code word according to the TPC data block after the Kth decoding, and outputting the judgment code word to a CPU (Central processing Unit) end as a decoding result.
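The iteration structure of the second to fourth steps can be sketched serially in Python as follows. This is only a model of the control flow for one block of soft values, under the assumption that the per-row and per-column updates are supplied as callables; `iterate_row_column`, `row_update` and `col_update` are hypothetical names, and on the GPU each inner loop would instead be one kernel launch with one thread block per row or column.

```python
def iterate_row_column(block, row_update, col_update, K=6):
    """Apply K row/column decoding iterations to one m x n block.

    block      : list of m rows, each a list of n soft values
    row_update : hypothetical callable, one row in  -> updated row out
    col_update : hypothetical callable, one column in -> updated column out
    """
    m, n = len(block), len(block[0])
    data = [list(r) for r in block]  # work on a copy of the input
    for _ in range(K):
        # Second step: rows are independent of one another.
        data = [row_update(r) for r in data]
        # Third step: columns of the row-decoded data are independent.
        cols = [col_update([data[r][c] for r in range(m)]) for c in range(n)]
        # Write the updated columns back into row-major form.
        data = [[cols[c][r] for c in range(n)] for r in range(m)]
    return data
```

The column pass must wait for the whole row pass to finish, which is why the two passes are naturally separate kernels rather than one fused kernel.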

The following describes the specific technical scheme of the embodiment of the invention in detail as follows:

Firstly, the parameters are initialized and the input information bit block data are copied from CPU memory to the GPU global memory space. The output data storage space out is initialized to 0 according to the number N of input information bit blocks and the size n × m of each information bit block, where n is the number of columns and m is the number of rows.

And secondly, row decoding is performed on all the information bit block data. This step executes the row-decoding GPU kernel, whose parallel structure is shown in FIG. 2. Three levels of parallelism are available: first, the different data blocks (N) can be processed in parallel; second, the decoding of all rows (m) within each data block can be processed in parallel; and finally, within the decoding of each row, the $2^p$ test sequences and their Euclidean distances can be computed in parallel. The row-processing GPU kernel is configured as follows: the number of thread blocks is given by a dim3 variable grid1 with value (N × m, 1, 1), meaning that each thread block processes one row of a data block, there are N data blocks in total, and each block has m rows. The thread block size is set to n × $2^p$, so that within each of the $2^p$ groups each thread processes one element of the row. Inside the kernel, one warp (a thread bundle of 32 threads) first sorts the values of the row, together with their positions, in ascending order of absolute value; the first p entries of the sorted result give the p values with the smallest absolute value and their positions in the input row. Next, the n × $2^p$ threads are divided into $2^p$ groups; each group generates one test sequence and then computes the Euclidean distance between that test sequence and the input row data.

Whereas the serial CPU program first compares the test sequences and then computes the Euclidean distances only for the distinct ones, the embodiment of the invention trades space for time: it does not compare the test sequences but computes the Euclidean distances of all of them, turning comparison operations into summations and improving GPU efficiency. Finally, the test sequence with the minimum Euclidean distance is taken as the optimal codeword, all threads simultaneously compute the extrinsic information at each position of the row, and the extrinsic information is added to the elements at the corresponding positions of the input row data to give the updated row, which completes one row decoding pass. In FIG. 2, s = $2^p$.
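The launch geometry described above can be checked with a small Python calculation. The sketch below is illustrative: the function names are hypothetical, the group-major mapping of a thread index to a (test sequence, element) pair is an assumption that the text does not specify, and the limit of 1024 threads per block reflects current CUDA GPUs rather than anything stated in the source.

```python
def row_kernel_geometry(N, m, n, p, max_threads_per_block=1024):
    """Grid and block sizes of the row kernel as described in the text."""
    grid = (N * m, 1, 1)   # one thread block per row of each data block
    block = n * 2 ** p     # n threads for each of the 2^p test sequences
    if block > max_threads_per_block:
        raise ValueError("n * 2^p exceeds the per-block thread limit")
    return grid, block


def locate(block_idx, thread_idx, m, n):
    """Map (blockIdx.x, threadIdx.x) to (data block, row, test sequence,
    element). The group-major thread layout is an assumption."""
    data_block, row = divmod(block_idx, m)
    test_seq, element = divmod(thread_idx, n)
    return data_block, row, test_seq, element


grid1, block_size = row_kernel_geometry(N=8, m=32, n=32, p=4)
```

With p = 4 the block size is 16n, so under this configuration a row length of up to 64 elements fits within a single thread block of 1024 threads.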

And thirdly, column decoding is performed on the row-decoded information bit block data. This step executes the column-decoding GPU kernel, which is configured according to the number of input data blocks and the size of each data block. The number of thread blocks is given by a dim3 variable grid2 with value (N × n, 1, 1), meaning that each thread block processes one column of a data block, there are N data blocks in total, and each block has n columns. The thread block size of the column-decoding GPU kernel is set to m × $2^p$, so that within each of the $2^p$ groups each thread processes one element of the column. Inside the kernel, one warp (32 threads) first sorts the column of data in ascending order of absolute value, and the p values with the smallest absolute value and their positions in the input column are taken from the sorted result. Next, the m × $2^p$ threads generate the $2^p$ test sequences in parallel and compute the Euclidean distance between each test sequence and the input column data. Finally, the test sequence with the minimum Euclidean distance is taken as the optimal codeword, all threads simultaneously compute the extrinsic information at each position of the column, and the extrinsic information is added to the elements at the corresponding positions of the input column data, which completes one column decoding pass. The workflow of each thread follows the literature (David Chase. A Class of Algorithms for Decoding Block Codes with Channel Measurement Information. IEEE Transactions on Information Theory, Vol. IT-18, No. 1, January 1972, pp. 170-179).
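How a column thread block might address its data can be illustrated in Python. The sketch assumes that the N data blocks are stored back to back in one flat array, each in row-major order with m rows and n columns; the source does not state the memory layout, and `column_indices` is a hypothetical helper.

```python
def column_indices(block_idx, m, n):
    """Flat indices of the column handled by thread block `block_idx`.

    Assumes N data blocks stored contiguously, each m x n in row-major
    order, and grid2 = (N * n, 1, 1) as described in the text.
    """
    data_block, col = divmod(block_idx, n)
    base = data_block * m * n          # start of this data block
    return [base + r * n + col for r in range(m)]


idx = column_indices(block_idx=5, m=3, n=4)
```

Under this layout consecutive elements of a column lie n entries apart in global memory, which is one reason why staging the column in shared memory, as the text describes, is attractive.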

And fourthly, repeating the second step to the third step K times, wherein K is the number of times of iterative processing and is set to be not less than 6.

And fifthly, a decision codeword is calculated from the TPC data block after the K-th decoding iteration and output to the CPU side as the decoding result. This step executes the decision-codeword GPU kernel, whose parallelism is determined by the output data block size k × l. In the kernel configuration, the number of thread blocks is set to N × k and the size of each thread block can be set to l, where N is the number of data blocks processed, k is the length of each output data block, and l is the height of each output data block. Each thread of the decision-codeword GPU kernel decides the information bit block data after K row-column decoding iterations according to the decision-codeword logic described in the Chase algorithm; the number of thread blocks of this kernel is the number of information bit blocks to be decided multiplied by the number of rows of each decided information bit block.

Inside the kernel function, each thread reads the data at its corresponding position in the corresponding TPC data block; if the value is less than 0 the decoding result is 1, otherwise the decoding result is 0. Each thread writes the decoding result for its position into the output data storage space out. Finally, the output data are copied to the corresponding storage space on the CPU side with cudaMemcpy.
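The decision step can be modelled serially in Python as shown below. The two loops stand in for the N × k thread blocks of l threads each; the flat indexing `block_idx * l + thread_idx` is an assumption about the memory layout, and `decision_kernel` is a hypothetical name.

```python
def decision_kernel(soft, N, k, l):
    """Serial model of the decision kernel: one thread per soft value,
    value < 0 -> bit 1, otherwise bit 0 (the rule stated in the text)."""
    if len(soft) != N * k * l:
        raise ValueError("input length must equal N * k * l")
    out = [0] * len(soft)               # mirrors the zero-initialised buffer
    for block_idx in range(N * k):      # plays the role of blockIdx.x
        for thread_idx in range(l):     # plays the role of threadIdx.x
            idx = block_idx * l + thread_idx
            out[idx] = 1 if soft[idx] < 0 else 0
    return out


bits = decision_kernel([0.5, -0.2, -1.0, 0.0], N=1, k=2, l=2)
```

Each output element depends on exactly one input element, so this step has no inter-thread communication and needs no shared memory.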

In summary, the GPU parallel optimization method for TPC decoding of a software radio system according to the embodiment of the present invention has the advantages that: parallel acceleration is carried out on TPC decoding by constructing a multistage parallel GPU kernel function, and high-throughput decoding is realized by effectively utilizing the strong computing power of a GPU; and the access overhead of GPU-side data is effectively reduced by adopting a shared memory, and the time overhead of TPC decoding is further reduced.

The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims

1. A GPU parallel optimization method aiming at TPC decoding of a software radio system is characterized by comprising the following steps:

(S2) performing row decoding on all the information bit block data;

(S3) column-decoding the row-decoded information bit block data;

(S4) repeating the steps (S2) to (S3) K times, K being the number of iterations;

(S5) calculating a decision codeword from the TPC data block after the K-th decoding iteration and outputting it to the CPU as the decoding result.

2. The method for GPU parallel optimization for TPC decoding for a software radio system of claim 1, wherein the step (S1) specifically comprises the steps of:

3. The method for GPU parallel optimization for TPC decoding for a software radio system of claim 1, wherein the step (S2) specifically comprises the steps of:

(S21) solving the position of p data with the minimum absolute value of each line of data of all information bit block data by adopting a thread block and thread two-stage parallel method;

(S22) the threads within each thread block compute the $2^p$ row test sequences in parallel and compute the Euclidean distance between each row test sequence and the corresponding input row data, yielding $2^p$ Euclidean distances;

(S23) the first thread of each thread block finds the minimum of the $2^p$ Euclidean distances, and all threads in the thread block then compute in parallel, according to this minimum, the row update amount of the corresponding row of data; the row update amount is obtained as follows: the row test sequence with the minimum Euclidean distance is taken as the optimal codeword, and all threads simultaneously compute the row extrinsic information at each position of the corresponding row;

4. The method for GPU parallel optimization for TPC decoding for a software radio system of claim 1, wherein the step (S3) specifically comprises the steps of:

(S31) solving the positions of the p data with the minimum absolute value of each column of data of all the information bit blocks by adopting a thread block and thread two-stage parallel method;

(S32) the threads within each thread block compute the $2^p$ column test sequences in parallel and compute the Euclidean distance between each column test sequence and the corresponding input column data, yielding $2^p$ Euclidean distances;

(S33) the first thread of each thread block finds the minimum of the $2^p$ Euclidean distances, and all threads in the thread block then compute in parallel, according to this minimum, the column update amount of the corresponding column of data; the column update amount is obtained as follows: the column test sequence with the minimum Euclidean distance is taken as the optimal codeword, and all threads simultaneously compute the column extrinsic information at each position of the corresponding column;

5. The method for GPU parallel optimization for TPC decoding for a software radio system as claimed in claim 1, wherein the step (S5) is specifically performed by:

and each thread reads the data of the corresponding position in the corresponding TPC data block, if the data is less than 0, the decoding result is 1, otherwise, the decoding result is 0, and each thread writes the decoding result of the corresponding position into the output data storage space.

6. The method for GPU parallel optimization for TPC decoding for software radio systems of claim 3 wherein in step (S21), the number of thread blocks is equal to the number of information bit blocks that need to be processed multiplied by the number of rows per information bit block data, the number of threads in a thread block being the length of one row of data in the information bit block data.

7. The method of GPU parallel optimization for TPC decoding for a software radio system of claim 3, wherein: in the step (S24), the parallelism of the parallel update is the number of information bit blocks × the size of each data block.

8. The method for GPU parallel optimization for TPC decoding for a software radio system of claim 1, where K is an integer greater than or equal to 6.