CN106970895B

CN106970895B - FFT device and method based on FPGA

Info

Publication number: CN106970895B
Application number: CN201610024296.XA
Authority: CN
Inventors: 王纪宁
Original assignee: Potevio Information Technology Co Ltd
Current assignee: Potevio Information Technology Co Ltd
Priority date: 2016-01-14
Filing date: 2016-01-14
Publication date: 2023-10-03
Anticipated expiration: 2036-01-14
Also published as: CN106970895A

Abstract

The invention relates to an FPGA-based FFT device and method. The device comprises a cache module, a control module and a radix-4 butterfly operator. The control module is connected to the cache module and to the radix-4 butterfly operator; it controls the input and output of data, controls buffering of the data in the cache module in a ping-pong manner, and controls completion of the FFT operation in the radix-4 butterfly operator using cyclic addressing. The cache module initially receives the first 3/4 of the input data, outputs the last 3/4 of the operation results, and stores intermediate results. The radix-4 butterfly operator initially receives the last 1/4 of the input data and outputs the first 1/4 of the operation results. The invention increases operation speed and, with a DSP multiplier count comparable to that of an FFT device using radix-2 butterfly units, reduces the total depth of the RAM used to store data.

Description

FFT device and method based on FPGA

Technical Field

The invention relates to the field of digital signals, in particular to an FFT device and method based on an FPGA.

Background

In wireless communication systems, an input time-domain signal is often analyzed with the fast Fourier transform (FFT), and the frequency-domain waveform is observed to obtain the frequency-domain characteristics of the signal. OFDM uses the inverse discrete Fourier transform and the discrete Fourier transform (IDFT/DFT) in place of conventional multi-carrier modulation and demodulation: the transmitter modulates by performing an IFFT on the data to be modulated, and the receiver demodulates by performing an FFT on the received data, which greatly reduces the complexity of system implementation.

An FPGA handles parallelism and speed well, is flexible to configure and is easy to upgrade, so it is a common platform for implementing the FFT. For example, Xilinx Virtex-6 series FPGAs provide not only a large number of computing units called DSP slices, but also readable and writable LUT units and dual-port RAM units.

The FFT IP soft core for Xilinx Virtex-6 series chips offers four modes: pipelined streaming I/O (Pipelined, Streaming I/O), radix-4 burst I/O (Radix-4, Burst I/O), radix-2 burst I/O (Radix-2, Burst I/O) and radix-2 Lite burst I/O (Radix-2 Lite, Burst I/O). Structurally these fall into two types, pipelined and burst. The implementation of each type is briefly introduced below:

(1) Pipelined streaming I/O.

The pipelined data stream I/O architecture enables continuous data processing through the pipelining of a set of radix-2 butterfly unit processing engines. Each processing engine has a memory block to store input data and intermediate data.

(2) Radix-4 burst I/O.

For the radix-4 burst I/O structure, the FFT IP core is implemented with a radix-4 butterfly unit processing engine.

With the pipelined streaming I/O structure, the IP core can load the input data of the next frame and output the transform results of the previous frame while it computes the transform of the current frame. Data can be input continuously, and after a fixed computation delay the results are output continuously. The input data is in natural order, and the output data may be in bit-reversed or natural order. An 8-point radix-2 butterfly pipelined FFT (shown in fig. 1) is used as an example below.

Radix-2 DIF performs each butterfly operation on two points, so data must be buffered before the operation in order to pair the upper half of the input data with the lower half. The basic structure is as follows:

Let one data item be buffered per clock cycle, i.e. the first clock buffers data 0, the second clock buffers data 1, and so on. The final frequency-domain output is in bit-reversed order. The input/output table of the 8-point radix-2 butterfly pipelined FFT is shown in Table 1:

Table 1 Radix-2 butterfly pipelined FFT I/O table

Input (natural order)	Decimal	Output (bit-reversed order)	Decimal
000	0	000	0
001	1	100	4
010	2	010	2
011	3	110	6
100	4	001	1
101	5	101	5
110	6	011	3
111	7	111	7
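The output column of Table 1 can be reproduced by reversing the bits of each input index. The following is a minimal sketch, assuming the number of points is a power of two; the function names `bit_reverse` and `output_order` are illustrative and do not come from the patent.

```python
def bit_reverse(index, bits):
    """Return `index` with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # shift in the current lowest bit
        index >>= 1
    return result


def output_order(n_points):
    """Bit-reversed output order for an n_points FFT (n_points assumed a power of two)."""
    bits = n_points.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n_points)]


print(output_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

For 8 points this matches the decimal output column of Table 1 row by row.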

According to the butterfly diagram, the 8-point radix-2 FFT is divided into 3 stages. The buffer needed before the first operation has depth 4, and the intermediate data need depths of 4 and 2 respectively. When natural-order output is required, all 8 points must first be buffered and then addressed for output, which needs a further depth of 8. In total, 4 + 4 + 2 + 8 = 18 RAM locations are used. One butterfly unit is used per stage, for a total of 3 butterfly units; assuming each butterfly unit uses 3 DSP multipliers, a total of 9 DSP multipliers are used.

An FFT device using radix-2 butterfly units places one butterfly unit at each stage and stores that stage's intermediate data, so that a fixed-point FFT can be performed on continuous data. The resources occupied grow with the number of FFT points. Because each stage uses only one radix-2 butterfly unit, the order of computation is fixed, so additional RAM is required when the final stage must output in natural order. Table 2 lists the total RAM depth occupied by stored data and the number of DSP multipliers occupied by the operation when an FFT device using radix-2 butterfly units operates in scaled mode.
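The resource count for the 8-point example can be checked with a few lines of arithmetic. This sketch only restates the figures given in the text (buffer depths 4, 4, 2 and 8, and 3 DSP multipliers per butterfly unit); the function name is illustrative, and no general formula for other point counts is implied.

```python
def radix2_pipeline_resources_8pt(dsp_per_butterfly=3):
    """RAM depth and DSP count for the 8-point radix-2 pipelined example."""
    input_buffer = 4              # buffer before the first butterfly stage
    intermediate_buffers = [4, 2] # intermediate data, as stated in the text
    reorder_buffer = 8            # full-frame buffer for natural-order output
    ram_depth = input_buffer + sum(intermediate_buffers) + reorder_buffer
    butterflies = 3               # one butterfly unit per stage, 3 stages
    return ram_depth, butterflies * dsp_per_butterfly


print(radix2_pipeline_resources_8pt())  # (18, 9)
```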

Table 2 amount of resources occupied by FFT apparatus employing radix-2 butterfly operation unit

Disclosure of Invention

The technical problem to be solved by the invention is that conventional FFT devices make poor use of RAM when outputting data in natural order and consume many FPGA resources.

In order to solve the technical problems, the invention provides an FFT device based on an FPGA. The FPGA-based FFT device comprises:

a cache module, a control module and a radix-4 butterfly operator;

the control module is respectively connected with the buffer module and the radix-4 butterfly operator and is used for controlling the input and output of data, controlling the data to be buffered in the buffer module in a ping-pong buffer mode and controlling the data to finish FFT operation in the radix-4 butterfly operator in a cyclic addressing mode;

the cache module is used for initially receiving the first 3/4 of the input data, outputting the last 3/4 of the operation results, and storing intermediate data;

the radix-4 butterfly operator is used for initially receiving the last 1/4 of the input data and outputting the first 1/4 of the operation results.

Optionally, the cache module is a plurality of dual-port RAMs or a plurality of single-port RAMs.

Optionally, the number of the dual-port RAM is 7 or 8, which is determined by the number of the FFT operation points.

Optionally, the total depth of the plurality of RAMs is less than or equal to twice the FFT operation point number.

Optionally, the number of the radix-4 butterfly operators is 1 or 2, which is determined by the number of the FFT operation points.

Another aspect of the present invention proposes an FFT method employing the FPGA-based FFT apparatus as described above, comprising:

sequentially inputting first frame data, after finishing 1-level butterfly operation of the first frame data, sequentially inputting second frame data by adopting ping-pong buffer, and finishing M-level butterfly operation of the first frame data;

completing the sequential output of the butterfly operation result of the first frame data, and simultaneously carrying out the buffer memory and the butterfly operation of the second frame data;

finishing M-level butterfly operation of the second frame data, simultaneously adopting ping-pong buffer to buffer the third frame data, and starting 1-level butterfly operation of the third frame data;

continuously repeating the caching, butterfly operation and result output processes of the data to finish the butterfly operation of multi-frame data;

wherein M is the number of stages of the butterfly operation, N is the number of points of the FFT operation, and N = 4^M; data reading and storing use a cyclic addressing mode.
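The relation N = 4^M can be illustrated with a small software model. The sketch below is a plain recursive radix-4 decimation-in-frequency FFT, not the patent's hardware architecture: it ignores the RAM layout, ping-pong buffering and cyclic addressing, and all function names are illustrative. Its recursion depth equals M, and it is checked against a direct DFT.

```python
import cmath


def num_stages(n_points):
    """Return M such that n_points == 4**M; raise ValueError otherwise."""
    if n_points < 1:
        raise ValueError("n_points must be positive")
    m, n = 0, n_points
    while n > 1:
        if n % 4 != 0:
            raise ValueError("n_points must be a power of 4")
        n //= 4
        m += 1
    return m


def fft_radix4_dif(x):
    """Recursive radix-4 DIF FFT; len(x) must be a power of 4."""
    n = len(x)
    num_stages(n)  # validates the length
    if n == 1:
        return [complex(x[0])]
    q = n // 4
    branches = [[0j] * q for _ in range(4)]
    for i in range(q):
        a, b, c, d = x[i], x[i + q], x[i + 2 * q], x[i + 3 * q]
        # Radix-4 butterfly: a 4-point DFT needing only +/- and multiplies by +/-j.
        outs = (a + b + c + d,
                a - 1j * b - c + 1j * d,
                a - b + c - d,
                a + 1j * b - c - 1j * d)
        for r in range(4):
            twiddle = cmath.exp(-2j * cmath.pi * r * i / n)
            branches[r][i] = outs[r] * twiddle
    result = [0j] * n
    for r in range(4):
        result[r::4] = fft_radix4_dif(branches[r])  # branch r yields X[4k + r]
    return result


def naive_dft(x):
    """Direct O(N^2) DFT used only as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


print(num_stages(4096))  # 6
```

For a 4096-point frame this gives M = 6, which agrees with the six levels of butterfly operation in the worked example of the detailed description.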

Optionally, the steps of sequentially inputting the first frame data, sequentially inputting the second frame data using the ping-pong buffer after the 1-level butterfly operation of the first frame data is completed, completing the M-level butterfly operation of the first frame data, and completing the sequential output of the butterfly operation results of the first frame data while buffering and operating on the second frame data comprise:

sequentially inputting the first 3/4 of the first frame data into the first part of the cache module; when the last 1/4 of the first frame data reaches the radix-4 butterfly operator, performing the butterfly operation directly with the data in the cache module according to the butterfly diagram, and storing the 1-level butterfly results in the first part of the cache module;

completing the M-level butterfly operation of the first frame data, the radix-4 butterfly operator sequentially outputting the first 1/4 of the butterfly results of the first frame data and storing the last 3/4 of the results in the first part of the cache module; using the ping-pong buffer to sequentially input the first 3/4 of the second frame data into the second part of the cache module, and, when the last 1/4 of the second frame data reaches the radix-4 butterfly operator, performing the butterfly operation directly with the data in the cache module according to the butterfly diagram;

the first part of the cache module sequentially outputting the last 3/4 of the butterfly results of the first frame data;

correspondingly, the data reading and storage of the cache module adopts a circular addressing mode.
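The 3/4 and 1/4 split means that each sample in the last quarter of a frame can be combined, as soon as it arrives, with three samples already held in the cache. A minimal sketch of that pairing follows, assuming 1-based sample positions and a frame length divisible by 4; the function name `butterfly_partners` is illustrative.

```python
def butterfly_partners(k, n_points):
    """For the k-th arriving sample (1-based) in the last quarter of a frame,
    return the positions of the three buffered samples that share its
    first-level radix-4 butterfly."""
    quarter = n_points // 4
    if not 3 * quarter < k <= n_points:
        raise ValueError("k must lie in the last quarter of the frame")
    return (k - 3 * quarter, k - 2 * quarter, k - quarter)


print(butterfly_partners(3073, 4096))  # (1, 1025, 2049)
print(butterfly_partners(4096, 4096))  # (1024, 2048, 3072)
```

These values agree with the 4096-point example in the detailed description, where the 3073rd sample is combined with the 1st, 1025th and 2049th samples.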

Optionally, the cyclic addressing mode includes:

performing 1-level butterfly operation, and storing a 1-level butterfly operation result in a cache module in a cyclic addressing mode;

performing intermediate butterfly operation, reading data in the cache module according to a cyclic addressing mode, and storing intermediate butterfly operation results in the cache module according to the cyclic addressing mode;

and performing the butterfly operation of the last stage, storing the butterfly operation result into a cache module according to a cyclic addressing mode, sequentially reading the data in the cache module and sequentially outputting the butterfly operation result.

Optionally, performing the level 1 butterfly operation, and storing the level 1 butterfly operation result in the cache module according to a cyclic addressing mode includes:

performing the 1-level butterfly operation and dividing its results into 16 groups; storing groups 0-3 of the 16 groups in the first, second, third and fourth RAMs, respectively; storing groups 4-7 in the second, third, fourth and first RAMs, respectively; storing groups 8-11 in the third, fourth, first and second RAMs, respectively; and storing groups 12-15 in the fourth, first, second and third RAMs, respectively.
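The rotating assignment of the 16 groups to the four RAMs can be written as a modular expression. The sketch below assumes groups are numbered 0-15 in the order listed above, with each run of four consecutive groups corresponding to the four butterfly output ports; the function name `ram_for_group` is illustrative.

```python
def ram_for_group(group):
    """Return the RAM number (1-4) that stores result group `group` (0-15)."""
    port = group % 4      # position within a run of four groups
    segment = group // 4  # which run of four groups (0-3)
    return (port + segment) % 4 + 1  # rotate the RAM order by one per run


print([ram_for_group(g) for g in range(16)])
# [1, 2, 3, 4, 2, 3, 4, 1, 3, 4, 1, 2, 4, 1, 2, 3]
```

Within any run of four groups, the four results go to four different RAMs, so they can be written in the same cycle without a port conflict.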

Optionally, the performing the intermediate butterfly operation, reading the data in the cache module according to the cyclic addressing mode, and storing the intermediate butterfly operation result in the cache module according to the cyclic addressing mode includes:

performing the intermediate-level butterfly operation: reading data from the cache module according to the cyclic addressing mode and inputting it to the first, second, third and fourth ports of the radix-4 butterfly operator in turn; reading data from the cache module according to the cyclic addressing mode and inputting it to the second, third, fourth and first ports in turn; reading data from the cache module according to the cyclic addressing mode and inputting it to the third, fourth, first and second ports in turn; and reading data from the cache module according to the cyclic addressing mode and inputting it to the fourth, first, second and third ports in turn;

wherein the data length for each switch of input port is (1/4^M) × N;

the butterfly operation results of each intermediate level are divided into 16 groups, and the 16 groups are stored in the cache module according to the cyclic addressing mode.

Optionally, the performing the butterfly operation of the last stage, storing the butterfly operation result in the buffer module according to a cyclic addressing mode, sequentially reading the data in the buffer module and sequentially outputting the butterfly operation result includes:

performing the final stage of butterfly operation, storing the data of a first port in the base-4 butterfly operator into a first RAM, storing the data of a second port in the base-4 butterfly operator into a third RAM, storing the data of a third port in the base-4 butterfly operator into the second RAM, and storing the data of a fourth port in the base-4 butterfly operator into a fourth RAM;

the cache module subdivides the data into groups level by level until each group contains one data item;

and sequentially reading the data in the cache module and sequentially outputting butterfly operation results.

The FPGA-based FFT device and method use a radix-4 butterfly operator, which increases operation speed. Because intermediate data are stored using cyclic addressing, no additional RAM is needed for intermediate data or for sequential output. With a DSP multiplier count comparable to that of an FFT device using radix-2 butterfly units, the total depth of the RAM used to store data is reduced, RAM utilization is improved, and FPGA resources are saved.

Drawings

The features and advantages of the present invention will be more clearly understood by reference to the accompanying drawings, which are illustrative and should not be construed as limiting the invention in any way, in which:

fig. 1 is a schematic diagram of an FFT apparatus employing a radix-2 butterfly operation unit;

FIG. 2 is a schematic diagram of an FPGA-based FFT device according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an FPGA-based FFT device according to an embodiment of the invention;

fig. 4 is a schematic diagram of a method of FPGA-based FFT according to one embodiment of the invention.

Detailed Description

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

Fig. 2 shows a schematic structural diagram of an FPGA-based FFT apparatus according to an embodiment of the present invention.

As shown in fig. 2, the FPGA-based FFT apparatus of the present embodiment includes:

the device comprises a cache module 1, a control module 2 and a radix-4 butterfly operator 3;

the control module 2 is respectively connected with the buffer module 1 and the radix-4 butterfly operator 3, and is used for controlling the input and output of data, controlling the buffer of the data in the buffer module 1 in a ping-pong buffer manner and controlling the completion of FFT operation in the radix-4 butterfly operator 3 in a cyclic addressing manner;

the cache module 1 is used for initially receiving the first 3/4 of the input data, outputting the last 3/4 of the operation results, and storing intermediate data;

the radix-4 butterfly operator 3 is used for initially receiving the last 1/4 of the input data and outputting the first 1/4 of the operation results.

The FPGA-based FFT device of this embodiment uses a radix-4 butterfly operator, which increases operation speed. Because intermediate data are stored using cyclic addressing, no additional RAM is needed for intermediate data or for sequential output. With a DSP multiplier count comparable to that of an FFT device using radix-2 butterfly units, the total depth of the RAM used to store data is reduced, RAM utilization is improved, and FPGA resources are saved.

In an alternative embodiment, the cache module is a plurality of dual-port RAMs or a plurality of single-port RAMs. Using dual-port RAMs for the cache module of the FPGA-based FFT device reduces the number of RAMs required.

The number of the dual-port RAMs is 7 or 8, and is determined by the number of FFT operation points.

And the total depth of the RAMs is less than or equal to twice the number of FFT operation points.

The number of the radix-4 butterfly operators is 1 or 2, and is determined by the number of FFT operation points.

Fig. 3 is a schematic diagram of an FPGA-based FFT apparatus according to an embodiment of the present invention. As shown in fig. 3, the FFT apparatus includes a plurality of dual-port RAMs, butterfly operators and selectors; the total RAM depth is at most twice the number of FFT points, and the RAM width equals the data width. At most 2 radix-4 butterfly operators are provided. Eight RAMs can output 8 data items in parallel in one clock cycle, which keeps two radix-4 butterfly units fully utilized and increases operation speed.

Fig. 4 is a schematic diagram of a method of FPGA-based FFT according to one embodiment of the invention. As shown in fig. 4, the FFT method employing the FPGA-based FFT apparatus as described above includes:

S41: sequentially inputting first frame data, after finishing 1-level butterfly operation of the first frame data, sequentially inputting second frame data by adopting ping-pong buffer, and finishing M-level butterfly operation of the first frame data;

S42: completing the sequential output of the butterfly operation result of the first frame data, and simultaneously carrying out the buffer memory and the butterfly operation of the second frame data;

S43: finishing M-level butterfly operation of the second frame data, simultaneously adopting ping-pong buffer to buffer the third frame data, and starting 1-level butterfly operation of the third frame data;

S44: continuously repeating the caching, butterfly operation and result output processes of the data to finish the butterfly operation of multi-frame data.

Further, the first frame data is sequentially input, after the 1-level butterfly operation of the first frame data is completed, the second frame data is sequentially input by adopting a ping-pong buffer, and the M-level butterfly operation of the first frame data is completed; completing the sequential output of the butterfly operation result of the first frame data, and simultaneously performing the buffering and the butterfly operation of the second frame data comprises the following steps:

The ping-pong buffering procedure in this FPGA-based FFT method is described below with a specific example.

Let the number of FFT points for one frame of serial data be 4096, computed with radix-4 DIF. The RAMs used are RAM1-14 in FIG. 3. (Note that RAM1-14 in FIG. 3 are single-port RAMs, and the ping-pong buffering process below is described using single-port RAMs as an example; for a 4096-point FFT, 7 dual-port RAMs can be used instead, with a similar process and working principle.) The process is as follows:

(1) The input serial data of frame 0 is buffered. The buffer space is 3/4 of the number of FFT points, i.e. 4096 × 0.75 = 3072, so the data is buffered in RAM1-6.

(2) When the 3073rd data item arrives, a radix-4 butterfly operation is performed directly, according to the butterfly diagram, with the 1st, 1025th and 2049th data items already in the cache RAM, and the result is stored in RAM1-RAM8.

(3) When the 3074th data item arrives, a radix-4 butterfly operation is performed directly, according to the butterfly diagram, with the 2nd, 1026th and 2050th data items in the cache, and the result is stored in RAM1-RAM8.

(4) When the 3075th data item arrives, a radix-4 butterfly operation is performed directly, according to the butterfly diagram, with the 3rd, 1027th and 2051st data items in the cache, and the result is stored in RAM1-RAM8.

The same applies when the 3076th and later data items arrive.

When the 4096th data item arrives, a radix-4 butterfly operation is performed directly, according to the butterfly diagram, with the 1024th, 2048th and 3072nd data items in the cache, and the result is stored in the cache RAM. All level-1 butterfly operations are now complete. The level-1 results are stored in RAM1-RAM8.

(5) The next input frame, frame 1, is buffered, starting from RAM9. Over 1024 clock cycles, the data in cache RAMs 1-6 can be processed continuously using 2 radix-4 butterfly units: 8 points are read from the cache per clock cycle for butterfly operation, so 1024 × 8 = 8192 points, i.e. 8192/4096 = 2 levels of butterfly operation, are completed in 1024 cycles. The data after level 3 is still stored in RAM1-RAM8, so the operation is performed in place.

(6) Buffering of frame 1 continues into RAM11 and RAM12. Over 1024 clock cycles, the data in cache RAMs 1-6 can be processed continuously using 2 radix-4 butterfly units: 8 points are read from the cache per clock cycle for butterfly operation, so 1024 × 8 = 8192 points, i.e. 8192/4096 = 2 levels of butterfly operation, are completed in 1024 cycles. The data after level 5 is still stored in RAM1-RAM8, so the operation is performed in place.

(7) Buffering of frame 1 continues into RAM13 and RAM14. Over 1024 clock cycles, the data in cache RAMs 7-14 can be processed continuously using 1 radix-4 butterfly unit: 4 points are read from the cache per clock cycle for butterfly operation, so 1024 × 4 = 4096 points, i.e. 4096/4096 = 1 level of butterfly operation, are completed in 1024 cycles. The data after level 6 is still stored in RAM1-RAM8, so the operation is performed in place. Because this is the last level, the level-6 results can be output directly during the calculation; when all calculations are complete, 1/4 of the results have been output.

(8) Level 1 is performed on frame 1, with the results stored in RAM1-2 and RAM9-14, while RAM3 outputs the results of the previous frame, frame 0.

(9) RAM3 and RAM4 buffer the data of the next frame, frame 2; RAM5 outputs the results of the previous frame; and frame 1 undergoes butterfly operation up to level 3 using two butterfly operators.

(10) RAM5 and RAM6 buffer data, RAM7 starts outputting frame 0 data, and frame 1 completes level 5.

(11) RAM7 and RAM8 buffer data, and the frame 1 data is output after level 6.
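The cycle budget in steps (5) to (7) can be checked with simple arithmetic. The sketch below assumes one sample arrives per clock cycle, so a quarter frame of 4096 points takes 1024 cycles, and each radix-4 butterfly unit consumes 4 points per cycle; the function names and the `(2, 2, 1)` window schedule are taken from or named after the worked example and are otherwise illustrative.

```python
def levels_completed(cycles, butterflies, n_points):
    """Number of full butterfly levels finished in `cycles` clock cycles."""
    points_processed = cycles * butterflies * 4  # 4 points per radix-4 butterfly
    return points_processed // n_points


def total_levels(n_points=4096, butterflies_per_window=(2, 2, 1)):
    """Levels completed for one frame under the schedule of the example."""
    window = n_points // 4  # cycles needed to receive a quarter frame
    level = 1               # level 1 runs while the last quarter arrives
    for butterflies in butterflies_per_window:
        level += levels_completed(window, butterflies, n_points)
    return level


print(levels_completed(1024, 2, 4096))  # 2
print(total_levels())                   # 6
```

The total of 6 levels equals log4(4096), so under these assumptions the frame's computation fits within the time needed to buffer the first 3/4 of the next frame.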

Further, the cyclic addressing mode includes:

Specifically, performing the 1-level butterfly operation, and storing the 1-level butterfly operation result in the cache module according to the circular addressing mode includes:

Specifically, the performing the intermediate butterfly operation, reading the data in the cache module according to the cyclic addressing mode, and storing the intermediate butterfly operation result in the cache module according to the cyclic addressing mode includes:

wherein the data length for each switch of input port is (1/4^M) × N;

Specifically, the performing the butterfly operation of the last stage, storing the butterfly operation result in the buffer module according to the cyclic addressing mode, sequentially reading the data in the buffer module and sequentially outputting the butterfly operation result includes:

The cyclic addressing process in the FPGA-based FFT method is described below with a specific example. (The description uses one butterfly operator; the method with two butterfly operators is consistent with it.)

(1) N-point data are sequentially input into the RAMs; once 3/4 of the data have been stored, level-1 addressing and calculation begin.

(2) Level 1 addressing: RAM1-3 each read out data sequentially at addresses 0 to (N/4 - 1) as the first 3 inputs of the butterfly operator; the 4th input of the butterfly operator is the data arriving directly. After calculation, data 0 to (N/16 - 1) of butterfly operator output ports 1-4 are stored in RAM1, 2, 3 and 4 respectively; data (N/16) to (N/8 - 1) are stored in RAM2, 3, 4 and 1; data (N/8) to (3N/16 - 1) are stored in RAM3, 4, 1 and 2; and data (3N/16) to (N/4 - 1) are stored in RAM4, 1, 2 and 3.

(3) Level 2 addressing: RAM1 reads the data at addresses 0 to (N/16 - 1), (N/16) to (N/8 - 1), (N/8) to (3N/16 - 1) and (3N/16) to (N/4 - 1) as the data for butterfly operator input ports 1, 2, 3 and 4, respectively. At the same time, RAM2 reads the data at addresses (N/16) to (N/8 - 1), (N/8) to (3N/16 - 1), (3N/16) to (N/4 - 1) and 0 to (N/16 - 1) as the data for input ports 2, 3, 4 and 1. RAM3 reads the data at addresses (N/8) to (3N/16 - 1), (3N/16) to (N/4 - 1), 0 to (N/16 - 1) and (N/16) to (N/8 - 1) as the data for input ports 3, 4, 1 and 2. RAM4 reads the data at addresses (3N/16) to (N/4 - 1), 0 to (N/16 - 1), (N/16) to (N/8 - 1) and (N/8) to (3N/16 - 1) as the data for input ports 4, 1, 2 and 3. After calculation, data 0 to (N/64 - 1) of butterfly operator output ports 1-4 are stored in RAM1, 2, 3 and 4 respectively; data (N/64) to (N/32 - 1) are stored in RAM2, 3, 4 and 1; data (N/32) to (3N/64 - 1) are stored in RAM3, 4, 1 and 2; and data (3N/64) to (N/16 - 1) are stored in RAM4, 1, 2 and 3. The remaining data are handled in the same way: data (N/16) to (5N/64 - 1) are stored in RAM1, 2, 3 and 4; data (5N/64) to (6N/64 - 1) in RAM2, 3, 4 and 1; data (6N/64) to (7N/64 - 1) in RAM3, 4, 1 and 2; and data (7N/64) to (8N/64 - 1) in RAM4, 1, 2 and 3.

(4) Level 3 addressing: RAM1 reads the data at addresses 0 to (N/64 - 1), (N/64) to (2N/64 - 1), (2N/64) to (3N/64 - 1) and (3N/64) to (4N/64 - 1) as the data for butterfly operator input ports 1, 2, 3 and 4, respectively. At the same time, RAM2 reads the data at addresses (4N/64) to (5N/64 - 1), (5N/64) to (6N/64 - 1), (6N/64) to (7N/64 - 1) and (7N/64) to (8N/64 - 1) as the data for input ports 2, 3, 4 and 1. RAM3 reads the data at addresses (8N/64) to (9N/64 - 1), (9N/64) to (10N/64 - 1), (10N/64) to (11N/64 - 1) and (11N/64) to (12N/64 - 1) as the data for input ports 3, 4, 1 and 2. RAM4 reads the data at addresses (12N/64) to (13N/64 - 1), (13N/64) to (14N/64 - 1), (14N/64) to (15N/64 - 1) and (15N/64) to (16N/64 - 1) as the data for input ports 4, 1, 2 and 3. The remaining address data are handled in the same way. After calculation, data 0 to (N/256 - 1) of butterfly operator output ports 1-4 are stored in RAM1, 2, 3 and 4 respectively; data (N/256) to (2N/256 - 1) are stored in RAM2, 3, 4 and 1; data (2N/256) to (3N/256 - 1) are stored in RAM3, 4, 1 and 2; and data (3N/256) to (4N/256 - 1) are stored in RAM4, 1, 2 and 3. The remaining data are handled in the same way: data (4N/256) to (5N/256 - 1) are stored in RAM1, 2, 3 and 4; data (5N/256) to (6N/256 - 1) in RAM2, 3, 4 and 1; data (6N/256) to (7N/256 - 1) in RAM3, 4, 1 and 2; and data (7N/256) to (8N/256 - 1) in RAM4, 1, 2 and 3.

(5) Level 4, 5, 6, 7, ... addressing: by analogy with the levels above.

(6) The last stage addressing: firstly, sequentially reading addresses 0, 2/16×n, 3/16×n and 1/16×n from RAM1, inputting the output data as port 1 of the butterfly computing device, sequentially reading addresses (a+ a1.°), (2/16×n+a+a1.), (3/16×n+a 1.), (1/16×n+a 1.), inputting the output data as port 2 of the butterfly computing device, sequentially reading addresses 2×2×a+ a1..+ -), [2/16×n+2×a+ a1..>, [3/16×n+2×a+ a1..>, [1/16×n+2 ] (a+ a1..) ], sequentially reading addresses (a+ a1. +a+a1.+ -), [ 2+n+16×n+3.+ -), [3/16×n+3.+ -), [3/16×n+6.+ -) (a+3... The port 1 is directly output as final output data, the port 2 is directly stored in RAM3, the port 3 is directly stored in RAM2, the port 4 is directly stored in RAM4, the port 4 is continuously read, the read address is 2/64 n, (2/16 n+2/64 n), (3/16 n+2/64 n), (1/16 n+2/64 n), the read address is (2/64 n+a+a1.+ -), [ (2/16 n+2/64 n) +a+ a1. ], [ (3/16 n+2/64 n) +a+ a1. ], the read address is [2/64 n+2/16 n+58 a+n ], [ (2/16) n+14 ] (2/16+n+6 ] ], and the port is (2/16) n+2/16+n+2/4.), [ (2/16+2/16) n+4 ], [ (2/16+2/16+n+4.) ], and the port is (2/4 ], [ (2/16+2/16+4.) ] ]. The calculation result of the port 1 is directly output as final output data, the data original address of the port 2 is stored in the RAM3, the data original address of the port 3 is stored in the RAM2, and the data original address of the port 4 is stored in the RAM4. If the last stage is a 4-stage operation, i.e., 256 points, a=16, a1=4, a2=1. If the last stage is an M-stage operation, i.e., 4^M points, a= 4^M/16, a1= 4^M/64.

(7) When the last-stage operation is completed, 1/4 x N data have been output in sequence; the data in RAM2 to RAM4 are then output in sequence.

Summarizing the addressing process, the following can be observed:

The data of the first stage can be read out sequentially from each RAM and input in sequence to the 4 input ports of the butterfly operator for operation. The output data of the operation are divided into 16 groups (the butterfly operator outputs 4 groups of data at the same time, and each output port of the butterfly operator generates 4 groups of data) and stored in sequence in RAM1, 2, 3, 4; RAM2, 3, 4, 1; RAM3, 4, 1, 2; and RAM4, 1, 2, 3.
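The rotation used when storing the 16 output groups can be written as a one-line rule. The following Python sketch is only an illustration; the function name `write_ram` and the 0-based group index are assumptions introduced here.

```python
def write_ram(port, group):
    """RAM (1..4) that receives output `port` (1..4) for group index `group` (0..3).

    Hypothetical helper: each successive group rotates the target RAM by one.
    """
    return (port - 1 + group) % 4 + 1


# Rows are the successive groups, columns are output ports 1..4.
table = [[write_ram(port, group) for port in range(1, 5)] for group in range(4)]
for row in table:
    print(row)
```

Each row is a permutation of RAM1 to RAM4, so the four output ports always write to four different RAMs in the same cycle, and each column shows that the 4 groups from one output port land in 4 different RAMs.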

When addressing the data of an intermediate stage, RAM1 is always read starting from address 0, and the data read are input to input ports 1, 2, 3 and 4 of the butterfly operator in turn, cycling in sequence. The data length for each switch of input port is 1/4 x N for stage 1 and 1/16 x N for stage 2. The start read address of RAM2 is a1+a2+..., where a1=1/4 x N and a2, a3, ... =0 for the 1-stage calculation; a1=1/4 x N, a2=1/16 x N and a3, ... =0 for the 2-stage calculation; and a1=1/4 x N, a2=1/16 x N, ..., aM=1/4^M x N for the M-stage calculation. The addresses are then read sequentially, returning to address 0 when the maximum address is reached, and addressing continues until one full cycle over the depth of the address space is completed. The start read addresses of RAM3 and RAM4 are 2 x (a1+a2+...) and 3 x (a1+a2+...), respectively; otherwise they operate in the same way as RAM2. When the butterfly operation finishes and writing starts, the data are written to the same addresses from which they were read, so that in-place storage is achieved. Note that the 4 groups of data from each output port of the butterfly operator must be placed in different RAMs. The data length of each group is defined by the stage number: 1/16 x N for the 1-stage operation output and 1/64 x N for the 2-stage operation output.
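The start read addresses can be evaluated numerically with a short Python sketch. It is an illustration only: the function name `start_read_address` is hypothetical, each RAM is assumed to have a depth of N/4, and a start address that exceeds this depth is assumed to wrap around in the same way as sequential reads return to address 0.

```python
def start_read_address(n_points, m, ram):
    """Start read address of RAM `ram` (1..4) for the m-stage calculation.

    Hypothetical helper: a_k = N / 4**k for k = 1..m, wrapped to the RAM depth.
    """
    depth = n_points // 4                                    # assumed RAM depth
    total = sum(n_points // 4 ** k for k in range(1, m + 1))  # a1 + a2 + ... + am
    return ((ram - 1) * total) % depth                        # RAM1 starts at 0


for m in (1, 2):
    print(m, [start_read_address(256, m, ram) for ram in range(1, 5)])
```

Under these assumptions, N = 256 and m = 2 give start addresses 0, 16, 32 and 48, which coincide with the values 0, 4/64 x N, 8/64 x N and 12/64 x N listed for RAM1 to RAM4 in step (4).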

When addressing the data of the last stage, the addressing rule follows from the final output order of the radix-4 butterfly operation diagram. In the first step, the data in each RAM are divided by address into 4 groups, called 1-level groups. Each group has a depth of 1/16 x N, and the groups are numbered 1-4. RAM1 is then addressed at 0, 2/16 x N, 3/16 x N and 1/16 x N, i.e. the first addresses of groups 1, 3, 2, 4 are given in sequence, and the addresses of the other RAMs are offset by a1+a2.... In the second step, each 1-level group is divided into 4 groups, called 2-level groups, each with a depth of 1/64 x N and numbered 1-4. RAM1 is then addressed in the order of group 1, group 3, group 2 and group 4 (2-level group 1 has already been addressed in the first step), and the addresses of the other RAMs are offset by a1+a2.... In the third step, each 2-level group is further divided into 4 groups. After the butterfly operation, the data of output port 1 are sent directly to the output port of the overall module, the data of port 2 are stored in RAM3, the data of port 3 are stored in RAM2, and the data of port 4 are stored in RAM4. When the butterfly operation finishes, 1/4 x N data have been output; the data in RAM2 to RAM4 are then read out and output in sequence.

According to the above process, the FPGA-based FFT method of the present embodiment adopts a radix-4 butterfly operator, which improves the operation speed, and adopts ping-pong buffering and cyclic addressing to implement in-place computation of the data. No additional RAM is needed when intermediate data are stored, and no additional RAM is needed when data are output sequentially (as can be seen by comparing Table 2 and Table 3). As a result, with a number of DSP multipliers equivalent to that of an FFT device using a radix-2 butterfly operator, the total depth of the RAM storing the data is reduced, the utilization rate of the RAM is improved, and the resources of the FPGA are saved. Table 3 lists the total depth of RAM occupied by the stored data and the number of DSP multipliers occupied by the operations, where the bit width of the storage RAM is the data bit width:

Table 3: Amount of resources occupied by the FFT apparatus employing the radix-4 butterfly operator

It will be appreciated by those skilled in the art that embodiments of the invention may be a system or a computer program product. Accordingly, the apparatus of the present invention may take the form of an entirely hardware embodiment. The present invention is described with reference to flowchart illustrations and/or block diagrams of apparatus (systems) according to embodiments of the invention. While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

According to the FPGA-based FFT device and method, the radix-4 butterfly operator is adopted, so that the operation speed is improved. With cyclic addressing, no additional RAM is needed when intermediate data are stored, and no additional RAM is needed when data are output sequentially (as can be seen by comparing Table 2 and Table 3). With a number of DSP multipliers equivalent to that of an FFT device using a radix-2 butterfly operator, the total depth of the RAM for storing the data is reduced, the utilization rate of the RAM is improved, and the resources of the FPGA are saved.

Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims

1. An FFT method of an FPGA-based FFT apparatus, comprising:

wherein M is the number of stages of the butterfly operation, N is the number of points of the FFT operation, and N=4^M; the data reading and storing adopt a cyclic addressing mode;

the FPGA-based FFT apparatus includes:

the buffer memory module is used for initially inputting the data of the previous 3/4, outputting the operation result of the next 3/4 and storing intermediate data;

2. The FFT method of claim 1 wherein the sequentially inputting the first frame data, after completing the 1-level butterfly operation of the first frame data, sequentially inputting the second frame data using a ping-pong buffer, and completing the M-level butterfly operation of the first frame data; completing the sequential output of the butterfly operation result of the first frame data, and simultaneously performing the buffering and the butterfly operation of the second frame data comprises the following steps:

3. The FFT method of claim 1 wherein the cyclic addressing scheme comprises:

4. The FFT method of claim 3 wherein performing the level 1 butterfly operation and storing the level 1 butterfly operation result in a buffer module in a circular addressing manner comprises:

5. The FFT method of claim 4 wherein the performing the intermediate butterfly operation reads data in the buffer module in a cyclic addressing manner, and saves intermediate butterfly operation results in the buffer module in a cyclic addressing manner; the butterfly operation of the last stage is carried out, the butterfly operation result is stored in the buffer memory module according to a cyclic addressing mode, the data in the buffer memory module are sequentially read, and the butterfly operation result is sequentially output, wherein the method comprises the following steps:

wherein the length of each conversion input port is 1/4^M × N;

Dividing butterfly operation results of each intermediate stage into 16 groups, and storing the 16 groups in a buffer memory module in a cyclic addressing mode;

6. The FFT method of claim 1 wherein the buffer module is a plurality of dual port RAMs or a plurality of single port RAMs.

7. The FFT method of claim 6 wherein the number of dual port RAMs is 7 or 8, as determined by the number of points of the FFT operation.

8. The FFT method according to claim 6, characterized in that,

and the total depth of the plurality of dual-port RAMs is less than or equal to twice the number of FFT operation points.

9. The FFT method of claim 1, wherein the number of radix-4 butterfly operators is 1 or 2, as determined by the number of points of the FFT operation.