CN115640493A - FPGA-based piecewise linear fractional order operation IP core - Google Patents

Info

Publication number
CN115640493A
Authority
CN
China
Prior art keywords
data
linear
calculation
address
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211332312.3A
Other languages
Chinese (zh)
Inventor
赵佳
钟乔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Medical College
Original Assignee
Chengdu Medical College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Medical College
Priority to CN202211332312.3A
Publication of CN115640493A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses an FPGA (field programmable gate array)-based piecewise linear fractional order operation IP core. An input conversion module receives data acquired by an ADC (analog-to-digital converter), converts its format, and sends it to a nonlinear operation module and a linear operation module. The nonlinear operation module calculates the nonlinear convolution result of the fractional order operation on the data, and the linear operation module calculates the linear convolution result; part of the binomial coefficients are obtained by piecewise linear fitting. A merge operation module combines the nonlinear and linear convolution results into the fractional order operation result and sends it to an output conversion module, which converts the format and outputs the result. The invention fits the real binomial coefficient curve with several linear segments, adapts flexibly to different application scenarios through a configurable number of segments, and uses parallel execution to improve operation efficiency.

Description

FPGA-based piecewise linear fractional order operation IP core
Technical Field
The invention belongs to the technical field of digital signal processing, and particularly relates to a piecewise linear fractional order operation IP core based on an FPGA.
Background
In recent years, fractional calculus has become an active research field, because researchers have found that it describes the evolution of a system more accurately. Fractional calculus has been widely applied in many technical fields, such as sliding mode control, multidimensional chaotic systems, weak signal detection, voice encryption, digital filtering, memristors, image recognition, and neuron simulation.
Analog circuit implementations and digital implementations of fractional calculus both play an important role in fractional order application research. Analog implementations mainly use resistors, capacitors, inductors, operational amplifiers and the like to build a fractional calculus circuit; for example, a memristor circuit is built from several basic elements. However, because an analog implementation is affected by factors such as parasitic parameters, the environmental characteristics of the devices, and PCB (Printed Circuit Board) distributed parameters, it can only verify the function, and a high-accuracy fractional calculus operation is difficult to achieve. A digital implementation does not depend on the characteristics of the devices themselves, and many EDA (Electronic Design Automation) tools are available for it. This greatly reduces the difficulty of implementing fractional calculus and further accelerates its application in fields such as secure communication, weak signal detection, and automatic control. Digital implementations of fractional calculus are therefore becoming increasingly popular.
Existing fractional order calculation circuits use a finite number of points: the number of convolution points is fixed, and when the number of points to be calculated exceeds the configured number, the excess data is discarded, i.e. the corresponding fractional order coefficients are zero-filled. This approach has a large calculation error and limited practical engineering value. On this basis, a linear fitting algorithm has been proposed that replaces the real fractional order coefficient curve with a straight line, so that the fractional order result can always be calculated recursively.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a piecewise linear fractional order operation IP core based on an FPGA (field programmable gate array). A true fractional order coefficient curve is fitted by using a piecewise linear mode, different application scenes are flexibly adapted by the characteristic of configurable segment numbers, and meanwhile, the operation efficiency is improved by adopting a parallel mode.
In order to achieve the above object, the present invention provides an IP core based on FPGA with piecewise linear fractional order operation, which is characterized by comprising an input conversion module, a nonlinear operation module, a linear operation module, a merge operation module, and an output conversion module, wherein:
the input conversion module is used for receiving the data acquired by the ADC (analog-to-digital converter) and converting it into a floating-point or fixed-point fractional format to obtain converted data $x(n)$, where $n$ denotes the sampling point index, $n = 1, 2, \ldots$;
the nonlinear operation module is used for calculating a nonlinear convolution operation result W (n) in the fractional order operation of the data x (n) and sending the result W (n) to the merging operation module, and a calculation formula of the nonlinear convolution operation result W (n) is as follows:
Figure BDA0003913975290000021
where $m$ denotes the memory length and $b(j)$ denotes the binomial coefficient, $j = 0, 1, \ldots, L-1$, with $L$ the total number of binomial coefficients. The binomial coefficients $b(j)$ are calculated as follows:
for $j = 0, 1, \ldots, m-1$, the binomial coefficient $b(j)$ is obtained from the theoretical calculation formula; for $j = m, m+1, \ldots, L-1$, the binomial coefficient $b(j)$ is obtained by piecewise linear fitting, specifically:
let the binomial coefficients $b(m)$ to $b(L-1)$ be composed of $K$ linear segments, indexed by $k$ with $1 \le k \le K$. Let $L_k$ denote the length of the $k$-th linear segment, i.e. the number of fitted coefficients $b(j)$ in the $k$-th segment is $L_k$, and let $y_k$ denote the $k$-th linear function. The linear function $y_k$ is required to pass through $b(n_{k-1})$ and $b(n_k)$, both of which are calculated with the theoretical calculation formula, and the slope $\beta_k$ of the $k$-th linear function $y_k$ is calculated as:
$$\beta_k = \frac{b(n_k) - b(n_{k-1})}{L_k}$$
where $n_0 = m-1$ and $n_k = n_{k-1} + L_k$.
The binomial coefficient $b(j)$ is then calculated as:
$$b(j) = b(j-1) + \beta_k, \quad n_{k-1} < j \le n_k$$
the linear operation module is used for calculating a linear convolution operation result S (n) in the fractional order operation of the data x (n) and sending the linear convolution operation result S (n) to the merging operation module, and the calculation formula of the linear convolution operation result S (n) is as follows:
Figure BDA0003913975290000031
the merge operation module receives the nonlinear convolution result $W(n)$ and the linear convolution result $S(n)$, calculates the fractional order operation result $D^{\alpha}x(n)$ of the data $x(n)$ according to the formula $D^{\alpha}x(n) = T_s[W(n) + S(n)]$, where $T_s$ denotes the fractional time interval coefficient, and sends $D^{\alpha}x(n)$ to the output conversion module;
the output conversion module performs format conversion on the fractional order operation result $D^{\alpha}x(n)$ calculated by the merge operation module and, according to actual needs, either retains the original acquired data format or converts the result into quantized data for output.
In the FPGA (field programmable gate array)-based piecewise linear fractional order operation IP core of the invention, the input conversion module receives the data acquired by the ADC (analog-to-digital converter), converts its format, and sends it to the nonlinear operation module and the linear operation module. The nonlinear operation module calculates the nonlinear convolution result of the fractional order operation on the data, and the linear operation module calculates the linear convolution result; part of the binomial coefficients are obtained by piecewise linear fitting. The merge operation module combines the two results into the fractional order operation result and sends it to the output conversion module, which converts the format and outputs the result.
The invention has the following beneficial effects:
1) When the binomial coefficient is subjected to piecewise linear fitting, the combination configuration of any nonlinear point number and linear point number is supported, and the piecewise number can be configured, so that the method can be more flexibly suitable for various conditions, the calculation precision is higher when the number of the sections is larger, and when the number of the sections is set to be equal to the linear point number, the result is equal to the real value calculation result;
2) The nonlinear operation module and the linear operation module are adopted for parallel execution, so that the calculation time is reduced;
3) In a specific implementation, the nonlinear operation module uses a data-flow-based convolution method with pipelined parallel computation, which increases the operation speed; the linear operation module uses parallel pipelined multiplication and a configurable tree-shaped addition structure, parallelizing the multiply and add operations as far as possible to reduce the operation time;
4) In a specific implementation, the IP (intellectual property) core can be developed with a High Level Synthesis (HLS) tool, which first converts C/C++ code into Verilog code and then synthesizes it into a concrete circuit. All configuration information can be placed in a header file and modified as parameters, which is more convenient and faster.
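Effect 1) can be checked numerically. The sketch below is illustrative only: it assumes the theoretical coefficients follow the standard Grünwald-Letnikov recurrence b(0) = 1, b(j) = b(j-1)(1 - (α+1)/j), and the names `gl_coeffs`, `fitted_coeffs` and `max_fit_error` are hypothetical. It measures the largest gap between fitted and theoretical coefficients as 32 linear points are split into more and more segments.

```python
def gl_coeffs(alpha, count):
    """Assumed theoretical weights: b(0) = 1, b(j) = b(j-1) * (1 - (alpha + 1) / j)."""
    b = [1.0]
    for j in range(1, count):
        b.append(b[-1] * (1.0 - (alpha + 1.0) / j))
    return b


def fitted_coeffs(exact, m, seg_lengths):
    """Keep b(0..m-1); rebuild the rest from straight segments between exact nodes."""
    b = list(exact[:m])
    n_prev = m - 1                                   # n_0 = m - 1
    for length in seg_lengths:
        n_k = n_prev + length                        # n_k = n_{k-1} + L_k
        beta = (exact[n_k] - exact[n_prev]) / length
        b.extend(exact[n_prev] + step * beta for step in range(1, length + 1))
        n_prev = n_k
    return b


def max_fit_error(alpha, m, seg_lengths):
    """Largest absolute difference between fitted and theoretical coefficients."""
    exact = gl_coeffs(alpha, m + sum(seg_lengths))
    fitted = fitted_coeffs(exact, m, seg_lengths)
    return max(abs(f - e) for f, e in zip(fitted, exact))


# 8 nonlinear points and 32 linear points, split into 1, 2, 4 and 32 segments.
errors = {k: max_fit_error(0.5, 8, [32 // k] * k) for k in (1, 2, 4, 32)}
for k, err in errors.items():
    print(f"K = {k:2d}: max coefficient error = {err:.3e}")
```

With one segment per linear point (K = 32 here) every fitted coefficient sits on a node, so the error drops to floating-point rounding level, which matches the statement that the result then equals the true-value calculation.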
Drawings
FIG. 1 is a schematic diagram of fractional order operation of a digital signal in a combination of non-linear convolution and linear convolution;
FIG. 2 is a block diagram of an IP core for FPGA-based piecewise linear fractional order operation of the present invention;
FIG. 3 is a flow chart of a single-point calculation with a convolution length of 4;
FIG. 4 is a flow chart of a multi-path parallel calculation with a convolution length of 4;
FIG. 5 is a block diagram of a non-linear operation module according to the present embodiment;
FIG. 6 is a view showing a memory structure of a binomial coefficient memory cell in serial calculation;
FIG. 7 is a schematic diagram of a cyclic blocking method;
FIG. 8 is a memory structure diagram of a binomial coefficient memory cell in the present embodiment;
FIG. 9 is a schematic diagram of a read/write method of read/write separation in this embodiment;
FIG. 10 is a block diagram of a linear operation block in the present embodiment;
FIG. 11 is a schematic diagram of a signal storage unit in this embodiment;
FIG. 12 is a memory structural diagram of a fitting coefficient memory unit in the present embodiment;
FIG. 13 is a structural diagram of a calculation block in the linear operation unit in the present embodiment;
FIG. 14 is a schematic diagram of an address mapping process in the signal address calculation circuit in the present embodiment;
FIG. 15 is a schematic diagram of an address mapping process in the coefficient address calculation circuit in the present embodiment.
Detailed Description
Specific embodiments of the present invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the invention. Note that in the following description, detailed descriptions of known functions and designs are omitted where they would obscure the main content of the present invention.
In order to better explain the technical solution of the present invention, first, a brief derivation description is made on the principle of the present invention.
The following is a conventional fractional order calculation:
Figure BDA0003913975290000041
where $x(n)$ denotes the digital signal, $n$ is the sampling point index of the digital signal, $m$ is the memory length, $D^{\alpha}x(n)$ denotes the fractional order operation on the digital signal $x(n)$, $T_s$ denotes the fractional time interval coefficient, $\alpha$ is the order, and $b(j)$ is the binomial coefficient, whose theoretical calculation formula is as follows:
Figure BDA0003913975290000051
The formula above shows that the fractional order operation cannot be calculated once the sampling point index $n$ exceeds the upper limit of memory storage. The finite memory principle is therefore adopted, fixing the number of multiply-accumulate operations at $m$. When $n > m-1$, some terms are lost, i.e. part of the coefficients $b(j)$ are replaced by zeros, which introduces error. The smaller $m$ is, the larger the absolute values of the $b(j)$ that are replaced by zeros, and the larger the resulting error; increasing $m$ to improve accuracy increases the calculation time and resource consumption.
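The truncation effect described above can be reproduced in a few lines. This is a minimal sketch under stated assumptions, not the patent's formula (1) verbatim: the coefficients are taken from the standard Grünwald-Letnikov recurrence, $T_s$ defaults to 1, and the names `gl_coeffs` and `finite_memory_op` are hypothetical.

```python
def gl_coeffs(alpha, count):
    """Assumed theoretical weights: b(0) = 1, b(j) = b(j-1) * (1 - (alpha + 1) / j)."""
    b = [1.0]
    for j in range(1, count):
        b.append(b[-1] * (1.0 - (alpha + 1.0) / j))
    return b


def finite_memory_op(x, alpha, m, ts=1.0):
    """Finite-memory sum: at most m multiply-accumulates per output sample."""
    b = gl_coeffs(alpha, m)
    out = []
    for n in range(len(x)):
        acc = sum(b[j] * x[n - j] for j in range(min(n, m - 1) + 1))
        out.append(ts * acc)
    return out


x = [1.0] * 64                                 # unit step as a test signal
full = finite_memory_op(x, 0.5, m=64)          # memory covers every sample
short = finite_memory_op(x, 0.5, m=8)          # b(8), b(9), ... treated as zero
err = [abs(f - s) for f, s in zip(full, short)]
print(f"error at n = 7: {err[7]:.3e}, n = 8: {err[8]:.3e}, n = 63: {err[63]:.3e}")
```

For this input the two results agree while $n \le m-1$ and drift apart afterwards, since every dropped coefficient has the same sign and the lost terms accumulate.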
To increase the operation rate under limited operation time and resources and to reduce the error caused by zero filling in the conventional method, the invention enlarges the number of binomial coefficients to a total of $L$, with $L > m$. The coefficients $b(0)$ to $b(m-1)$ are calculated with the theoretical calculation formula, and the coefficients $b(m)$ to $b(L-1)$ are obtained by a fitting method based on the piecewise linear idea: the values of $b(j)$ for $j > m-1$ are fitted with several linear segments, which replace part of the zero-filled data. A large gain in accuracy is thus obtained with a small amount of logic resources, and the calculation result is closer to the ideal value.
The principle of obtaining the binomial coefficient by the fitting method based on the piecewise linearity idea is as follows:
let the linearly fitted binomial coefficients $b(m)$ to $b(L-1)$ be composed of $K$ linear segments, indexed by $k$ with $1 \le k \le K$. Let $L_k$ denote the length of the $k$-th linear segment, i.e. the number of fitted coefficients $b(j)$ in the $k$-th segment is $L_k$, and let $y_k$ denote the $k$-th linear function. Requiring $y_k$ to pass through $b(n_{k-1})$ and $b(n_k)$ (obtained by pre-calculation), the slope $\beta_k$ of the $k$-th linear function $y_k$ satisfies:
$$\beta_k = \frac{b(n_k) - b(n_{k-1})}{L_k} \qquad (3)$$
where $n_k$ satisfies the recurrence relation $n_0 = m-1$, $n_k = n_{k-1} + L_k$.
From the above conclusion, the calculation formula of the linear-fitted binomial coefficient b (j) can be obtained:
$$b(j) = \alpha_{k-1} + l_k \beta_k \qquad (4)$$
where $n_{k-1} < j \le n_k$, $l_k = j - n_{k-1}$, i.e. $0 < l_k \le L_k$, $\alpha_0 = b(n_0)$, and $\alpha_k = b(n_k)$.
The recursive form is as follows:
$$b(j) = b(j-1) + \beta_k \qquad (5)$$
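The closed form (4) and the recursion (5) describe the same fitted coefficients. The sketch below builds the nodes $n_k$, node values $\alpha_k$ and slopes $\beta_k$, then evaluates both forms and compares them. The theoretical weights are assumed to follow the standard Grünwald-Letnikov recurrence, and all function names are hypothetical.

```python
def gl_coeffs(alpha, count):
    """Assumed theoretical weights: b(0) = 1, b(j) = b(j-1) * (1 - (alpha + 1) / j)."""
    b = [1.0]
    for j in range(1, count):
        b.append(b[-1] * (1.0 - (alpha + 1.0) / j))
    return b


def segment_table(exact, m, seg_lengths):
    """Return nodes n_0..n_K, node values alpha_k = b(n_k), and slopes beta_k."""
    nodes = [m - 1]
    for length in seg_lengths:
        nodes.append(nodes[-1] + length)
    alphas = [exact[n] for n in nodes]
    betas = [(alphas[k] - alphas[k - 1]) / seg_lengths[k - 1]
             for k in range(1, len(nodes))]
    return nodes, alphas, betas


def fitted_closed_form(j, nodes, alphas, betas):
    """Equation (4): b(j) = alpha_{k-1} + l_k * beta_k for n_{k-1} < j <= n_k."""
    for k in range(1, len(nodes)):
        if nodes[k - 1] < j <= nodes[k]:
            return alphas[k - 1] + (j - nodes[k - 1]) * betas[k - 1]
    raise ValueError("j lies outside the fitted range")


def fitted_recursive(start, nodes, betas):
    """Equation (5): b(j) = b(j-1) + beta_k, started from b(n_0)."""
    values, current = [], start
    for k in range(1, len(nodes)):
        for _ in range(nodes[k] - nodes[k - 1]):
            current += betas[k - 1]
            values.append(current)
    return values


m, seg_lengths = 8, [4, 8, 16]
exact = gl_coeffs(0.5, m + sum(seg_lengths))
nodes, alphas, betas = segment_table(exact, m, seg_lengths)
closed = [fitted_closed_form(j, nodes, alphas, betas) for j in range(m, nodes[-1] + 1)]
recursive = fitted_recursive(alphas[0], nodes, betas)
```

The recursive form needs only one addition per coefficient, which is presumably why it suits a hardware coefficient generator; the closed form is convenient for random access to a single $b(j)$.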
on the basis of the above binomial coefficient b (j), the following improvement can be made to the formula (1):
Figure BDA0003913975290000061
where $W(n)$ denotes the nonlinear convolution operation and $S(n)$ denotes the linear convolution operation, which satisfy:
Figure BDA0003913975290000062
Figure BDA0003913975290000063
Therefore, equation (6) can be simplified as follows:
$$D^{\alpha}x(n) = T_s[W(n) + S(n)], \quad n \ge 0 \qquad (9)$$
The formula above performs the fractional order operation on the digital signal by combining nonlinear and linear convolution: it adds to formula (1) a linear convolution sum $S(n)$, obtained by convolving the $K$ segments of fitted coefficients $b(j)$ with the digital signal $x(n)$. FIG. 1 is a schematic diagram of the fractional order operation on a digital signal that combines nonlinear convolution and linear convolution. From the calculation rule shown in FIG. 1, the following can be seen:
(1) When $0 \le n \le n_0$, the amount of data $x(n)$ is not enough to enter the first linear fitting interval, and the linear convolution sum is:
$$S(n) = 0 \qquad (10)$$
(2) When $n_0 < n \le n_1$, the data $x(n)$ enters the first linear fitting interval, i.e. the convolution of $b(n_0+1), b(n_0+2), \ldots, b(n)$ with $x(0), x(1), \ldots, x(n-n_0-1)$ is added, and the linear convolution sum is:
Figure BDA0003913975290000064
(3) When $n_{k-1} < n \le n_k$, the data $x(n)$ enters the $k$-th linear fitting interval, i.e. the convolution of $b(n_{k-1}+1), b(n_{k-1}+2), \ldots, b(n)$ with $x(0), x(1), \ldots, x(n-n_{k-1}-1)$ is added, and the linear convolution sum is:
Figure BDA0003913975290000071
(4) When $n > n_K$, the data $x(n)$ enters the zero-filling interval, i.e. the convolution of $b(n_K+1), b(n_K+2), \ldots, b(n)$ with $x(0), x(1), \ldots, x(n-n_K-1)$ is added, and the convolution sum is:
Figure BDA0003913975290000072
Since the formulas above have a recurrence relation, they can be expressed in recursive form. Given that $S(n) = 0$ when $n = 0$, the following recursive forms are obtained:
(1) When $0 < n \le n_0$, the amount of data $x(n)$ is not enough to enter the first linear fitting interval, and the linear convolution sum is:
$$S(n) = S(n-1) = 0 \qquad (14)$$
(2) When $n_0 < n \le n_1$, the data $x(n)$ begins to enter the first linear fitting interval, and the recursive form is:
Figure BDA0003913975290000073
(3) When $n_1 < n \le n_2$, the data $x(n)$ enters the second linear fitting interval, and the recursive form is:
Figure BDA0003913975290000074
(4) When $n_{k-1} < n \le n_k$, the data $x(n)$ enters the $k$-th linear fitting interval, and the recursive form is:
Figure BDA0003913975290000075
(5) When $n > n_K$, the data $x(n)$ enters the zero-filling interval, and the recursive form is:
Figure BDA0003913975290000081
To simplify circuit implementation, the segmented expressions above need to be unified into a single expression. To this end, $x(i)$ is zero-filled and right-shifted, i.e. $n_K + 1$ zeros are inserted before $x(0)$. Define $x^*(i)$ as the new sequence after zero filling and right shifting, which satisfies:
Figure BDA0003913975290000082
therefore, the above-mentioned segmented form is expressed as the following single-segment form:
Figure BDA0003913975290000083
To simplify the formula, the multiply-accumulate part of each expression is rewritten and defined as the multiply-accumulate terms $P_k(n)$ and $Q_k(n)$, together with a data term $D(n)$ and a correction term $C(n)$, which satisfy the following forms respectively:
Figure BDA0003913975290000084
Figure BDA0003913975290000085
$$D(n) = \alpha_1 x^*(n + n_K - n_0), \quad n \ge 0 \qquad (23)$$
$$C(n) = \alpha_K x^*(n), \quad n \ge 0 \qquad (24)$$
In this case, equation (20) is simplified as follows:
Figure BDA0003913975290000086
Because the multiply-accumulate terms $P_k(n)$ and $Q_k(n)$ have a recurrence relation, recursive terms $p_k(n)$ and $q_k(n)$ are defined, satisfying:
$$p_k(n) = \beta_k x^*(n + n_K - n_{k-1}), \quad n \ge 1 \qquad (26)$$
$$q_k(n) = \beta_k x^*(n + n_K - n_k), \quad n \ge 1 \qquad (27)$$
The multiply-accumulate buffers $P_k(n)$ and $Q_k(n)$ then satisfy the following recursive forms:
$$P_k(n) = P_k(n-1) + p_k(n), \quad n \ge 1 \qquad (28)$$
$$Q_k(n) = Q_k(n-1) + q_k(n), \quad n \ge 1 \qquad (29)$$
The two equations above show that the linear convolution sum $S(n)$ can be calculated recursively: the required multiply-accumulate results are stored recursively in $P_k(n)$ and $Q_k(n)$, which avoids a large number of repeated multiply-accumulate operations each time $S(n)$ is calculated and thus reduces resource consumption and calculation time.
The result S (n) of calculating the linear convolution operation in a recursive manner can be expressed as follows:
Figure BDA0003913975290000091
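where:
$$P_k(n) = P_k(n-1) + p_k(n), \quad n \ge 1 \qquad (31)$$
$$Q_k(n) = Q_k(n-1) + q_k(n), \quad n \ge 1 \qquad (32)$$
$$p_k(n) = \beta_k x^*(n + n_K - n_{k-1}), \quad n \ge 1 \qquad (33)$$
$$q_k(n) = \beta_k x^*(n + n_K - n_k), \quad n \ge 1 \qquad (34)$$
$$D(n) = \alpha_1 x^*(n + n_K - n_0), \quad n \ge 0 \qquad (35)$$
$$C(n) = \alpha_K x^*(n), \quad n \ge 0 \qquad (36)$$
In the formulas above, $\beta_k$, $\alpha_k$, $n_k$ and $x^*(i)$ are known, and $S(0) = 0$ is the initial condition.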
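The recursive scheme can be sanity-checked against a direct evaluation of the linear convolution sum. The sketch below is one reading of the recursion, not a transcription of formula (30), whose image is not reproduced in this text. It assumes the update $S(n) = S(n-1) + \sum_k [P_k(n) - Q_k(n)] + D(n) - C(n)$, which is what differencing $S(n)$ and $S(n-1)$ yields, and it weights the data term with $b(n_0)$. It also assumes Grünwald-Letnikov theoretical weights. All function names are hypothetical.

```python
import math


def gl_coeffs(alpha, count):
    """Assumed theoretical weights: b(0) = 1, b(j) = b(j-1) * (1 - (alpha + 1) / j)."""
    b = [1.0]
    for j in range(1, count):
        b.append(b[-1] * (1.0 - (alpha + 1.0) / j))
    return b


def linear_sum_direct(x, b, n0, nK):
    """S(n) = sum of b(j) * x(n - j) over n0 < j <= min(n, nK), from scratch."""
    return [sum(b[j] * x[n - j] for j in range(n0 + 1, min(n, nK) + 1))
            for n in range(len(x))]


def linear_sum_recursive(x, nodes, alphas, betas):
    """nodes = [n_0..n_K], alphas[k] = b(n_k), betas[k] = slope of segment k + 1."""
    K, n0, nK = len(betas), nodes[0], nodes[-1]

    def xs(i):
        # x*(i): the input delayed by n_K + 1 leading zeros.
        t = i - nK - 1
        return x[t] if t >= 0 else 0.0

    P, Q = [0.0] * K, [0.0] * K
    S = [0.0]                                             # S(0) = 0
    for n in range(1, len(x)):
        for k in range(K):
            P[k] += betas[k] * xs(n + nK - nodes[k])      # p_k(n)
            Q[k] += betas[k] * xs(n + nK - nodes[k + 1])  # q_k(n)
        D = alphas[0] * xs(n + nK - n0)                   # assumed weight b(n_0)
        C = alphas[K] * xs(n)
        S.append(S[-1] + sum(P) - sum(Q) + D - C)
    return S


alpha, m, seg_lengths = 0.5, 4, [3, 5]
exact = gl_coeffs(alpha, m + sum(seg_lengths))
nodes = [m - 1]
for length in seg_lengths:
    nodes.append(nodes[-1] + length)
alphas = [exact[n] for n in nodes]
betas = [(alphas[k + 1] - alphas[k]) / seg_lengths[k] for k in range(len(seg_lengths))]

# Fitted coefficient table used by the direct evaluation.
b_fit = list(exact[:m])
for k, length in enumerate(seg_lengths):
    b_fit.extend(alphas[k] + step * betas[k] for step in range(1, length + 1))

x = [math.sin(0.3 * n) for n in range(40)]
direct = linear_sum_direct(x, b_fit, nodes[0], nodes[-1])
recursive = linear_sum_recursive(x, nodes, alphas, betas)
max_gap = max(abs(d - r) for d, r in zip(direct, recursive))
```

Per output sample the recursion costs a fixed number of multiply-adds proportional to $K$, independent of how many fitted coefficients each segment contains, whereas the direct sum grows with $n_K - n_0$.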
According to the derivation process, the invention designs the IP core based on the FPGA and used for the piecewise linear fractional order operation. FIG. 2 is a block diagram of an IP core for FPGA-based piecewise linear fractional order operation according to the present invention. As shown in fig. 2, the FPGA-based IP core includes an input conversion module 1, a nonlinear operation module 2, a linear operation module 3, a merge operation module 4, and an output conversion module 5, where:
The input conversion module 1 receives the data acquired by the ADC and converts it into a floating-point or fixed-point fractional format (i.e. the format required by the actual project) to obtain converted data $x(n)$, where $n$ denotes the sampling point index, $n = 1, 2, \ldots$. The data $x(n)$ is then split into two paths, one sent to the nonlinear operation module 2 and the other to the linear operation module 3. The format conversion can be implemented directly with library functions provided by the HLS tool.
The nonlinear operation module 2 is configured to calculate a nonlinear convolution operation result W (n) in the fractional order operation of the data x (n), and send the result W (n) to the merge operation module 4. The calculation formula of the nonlinear convolution operation result W (n) is as follows:
Figure BDA0003913975290000101
where $m$ denotes the memory length and $b(j)$ denotes the binomial coefficient, $j = 0, 1, \ldots, L-1$, with $L$ the total number of binomial coefficients. The binomial coefficients $b(j)$ are calculated as follows:
for $j = 0, 1, \ldots, m-1$, the binomial coefficient $b(j)$ is obtained from the theoretical calculation formula; for $j = m, m+1, \ldots, L-1$, the binomial coefficient $b(j)$ is obtained by piecewise linear fitting, specifically:
let the binomial coefficients $b(m)$ to $b(L-1)$ be composed of $K$ linear segments, indexed by $k$ with $1 \le k \le K$. Let $L_k$ denote the length of the $k$-th linear segment, i.e. the number of fitted coefficients $b(j)$ in the $k$-th segment is $L_k$, and let $y_k$ denote the $k$-th linear function. The linear function $y_k$ is required to pass through $b(n_{k-1})$ and $b(n_k)$, both of which are calculated with the theoretical calculation formula, and the slope $\beta_k$ of the $k$-th linear function $y_k$ is calculated as:
$$\beta_k = \frac{b(n_k) - b(n_{k-1})}{L_k} \qquad (38)$$
where $n_0 = m-1$ and $n_k = n_{k-1} + L_k$.
The binomial coefficient $b(j)$ is then calculated as:
$$b(j) = b(j-1) + \beta_k, \quad n_{k-1} < j \le n_k \qquad (39)$$
the linear operation module 3 is configured to calculate a linear convolution operation result S (n) in the fractional order operation of the data x (n), and send the linear convolution operation result S (n) to the merge operation module 4, where a calculation formula of the linear convolution operation result S (n) is as follows:
Figure BDA0003913975290000103
The merge operation module 4 receives the nonlinear convolution result $W(n)$ and the linear convolution result $S(n)$, calculates the fractional order operation result $D^{\alpha}x(n)$ of the data $x(n)$ according to the formula $D^{\alpha}x(n) = T_s[W(n) + S(n)]$, where $T_s$ denotes the fractional time interval coefficient, and sends $D^{\alpha}x(n)$ to the output conversion module 5.
The output conversion module 5 performs format conversion on the fractional order operation result $D^{\alpha}x(n)$ calculated by the merge operation module 4 and, according to actual needs, either retains the original acquired data format or converts the result into quantized data for output.
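The split into a nonlinear part, a linear part and a merge step can be mimicked in software. The following sketch is a behavioural model under several assumptions that are not taken from the source: the theoretical weights follow the Grünwald-Letnikov recurrence, $T_s$ is taken as $h^{-\alpha}$ for a sampling interval $h$, $W(n)$ sums over $0 \le j \le \min(n, m-1)$, and $S(n)$ sums over $m \le j \le \min(n, L-1)$. Function names are hypothetical.

```python
def gl_coeffs(alpha, count):
    """Assumed theoretical weights: b(0) = 1, b(j) = b(j-1) * (1 - (alpha + 1) / j)."""
    b = [1.0]
    for j in range(1, count):
        b.append(b[-1] * (1.0 - (alpha + 1.0) / j))
    return b


def piecewise_coeffs(alpha, m, seg_lengths):
    """Exact b(0..m-1), then linear segments anchored on exact node values."""
    exact = gl_coeffs(alpha, m + sum(seg_lengths))
    b, n_prev = list(exact[:m]), m - 1
    for length in seg_lengths:
        n_k = n_prev + length
        beta = (exact[n_k] - exact[n_prev]) / length
        for _ in range(length):
            b.append(b[-1] + beta)          # b(j) = b(j-1) + beta_k
        n_prev = n_k
    return b


def fractional_op(x, b, m, ts):
    """Merge step: D^alpha x(n) = T_s * [W(n) + S(n)]."""
    total = len(b)
    out = []
    for n in range(len(x)):
        w = sum(b[j] * x[n - j] for j in range(0, min(n, m - 1) + 1))      # nonlinear part
        s = sum(b[j] * x[n - j] for j in range(m, min(n, total - 1) + 1))  # linear part
        out.append(ts * (w + s))
    return out


h, alpha = 0.01, 0.5
ts = h ** (-alpha)                          # assumed meaning of T_s
x = [n * h for n in range(64)]              # ramp test signal

reference = fractional_op(x, gl_coeffs(alpha, 64), 64, ts)                 # all exact
truncated = fractional_op(x, gl_coeffs(alpha, 8), 8, ts)                   # zero filling
fitted = fractional_op(x, piecewise_coeffs(alpha, 8, [8, 16, 32]), 8, ts)  # 3 segments

err_truncated = abs(truncated[63] - reference[63])
err_fitted = abs(fitted[63] - reference[63])
print(f"n = 63: zero-filling error {err_truncated:.4f}, fitted error {err_fitted:.4f}")
```

In this model `w` and `s` have no data dependency on each other, which mirrors the statement that the nonlinear and linear modules can run in parallel before the merge.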
To increase the operation rate, this embodiment provides a preferred implementation of the nonlinear operation module 2: a pipeline-based parallel computation that parallelizes and pipelines the operations. The principle is as follows:
Because the data samples arrive as a stream and cannot all be received in the same clock cycle, an output result cannot be calculated within a single clock cycle; instead, each multiply-accumulate result can be calculated in a pipelined manner that follows the streaming input. FIG. 3 is a flow chart of a single-point calculation with a convolution length of 4. As shown in FIG. 3, each input sample requires one set of multiply and add operations, and one output point is completed after 4 cycles. Since the data arrives sequentially, 4 paths of multiply and add operations must run simultaneously in order to output one data point per cycle.
FIG. 4 is a flow chart of a multi-path parallel calculation with a convolution length of 4. As shown in FIG. 4, the calculation consists of 4 independent rows; there is no data dependency between the rows, so each can be allocated its own operation resources and memory space. For fully continuous calculation, each operation cycle proceeds column by column: in one calculation cycle, each input sample is multiplied by all filter coefficients and each product is added to the result of the previous cycle. Viewed row by row, each result is output after four cycles of multiply and add operations: the first result of the first row can be output after the fourth cycle, the first result of the second row after the fifth cycle, and so on. Selecting one row to output at the end of each calculation cycle yields a continuous output. The calculation rate of this scheme depends on the clock interval between two consecutive outputs, i.e. on how fast each cycle is calculated. To increase the speed, the multiply and add operations can be pipelined and parallelized to achieve real-time operation at higher sample rates. With pipelining, the resources used for multiplication and addition can be reused; successive operations are offset by one clock cycle, so completing four operations takes only three more clock cycles than completing one. When the memory length $m$ is large, parallelism can be added as well: the operation resources are simply replicated so that several copies compute at the same time, further reducing the number of clock cycles.
As described above, the advantage of this calculation scheme is that, within each pipeline group, each input sample needs to be read only once and is multiplied by all filter coefficients according to a fixed rule, so the input data need not be buffered and storage space is reduced.
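The column-wise scheme can be modelled in software as a bank of accumulators, one per row, that each input sample updates exactly once. This is a behavioural sketch, not the circuit of FIG. 4: the row-to-coefficient rule `(r - t) % m` is an assumption chosen so that row `r` finishes the outputs whose index is congruent to `r` modulo `m`, and the function name is hypothetical.

```python
def streaming_convolution(samples, b):
    """Yield y(t) = sum_j b[j] * x(t - j) while reading each sample exactly once."""
    m = len(b)
    acc = [0.0] * m                      # one partial sum per row
    for t, xt in enumerate(samples):
        done = None
        for r in range(m):               # in hardware these m multiply-adds run in parallel
            j = (r - t) % m              # coefficient this row needs for sample t
            acc[r] += b[j] * xt
            if j == 0:                   # row r has now seen its last sample
                done = r
        yield acc[done]
        acc[done] = 0.0                  # free the row for a later output


coeffs = [1.0, -0.5, -0.125, -0.0625]    # convolution length 4, illustrative values
samples = [0.5, 1.0, -2.0, 3.0, 0.25, -1.5, 4.0, 2.0, -0.75, 1.25]

streamed = list(streaming_convolution(samples, coeffs))
direct = [sum(coeffs[j] * samples[t - j] for j in range(min(t, 3) + 1))
          for t in range(len(samples))]
```

Only the `m` partial sums are stored; no history of past inputs is kept, which is the storage saving the paragraph above describes.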
Based on this principle, a pipeline-based nonlinear operation module is provided. FIG. 5 is a structural diagram of the nonlinear operation module in this embodiment. As shown in FIG. 5, in this embodiment the nonlinear operation module 2 includes a binomial coefficient storage unit 21, a nonlinear buffer unit 22 and a nonlinear operation unit 23, where:
the binomial coefficient storage unit 21 is used to store a binomial coefficient b (j) calculated in advance according to a binomial coefficient calculation method. If serial calculation is used, the binomial coefficient storage unit 21 can be directly realized by a single-port ROM. Fig. 6 is a memory structure diagram of a binomial coefficient memory cell in serial calculation. As shown in fig. 6, according to the calculation formula of the nonlinear convolution operation result W (n), if the number of the binomial coefficients b (j) to be used is m, the depth of the single block single-port ROM configured in the serial calculation is also set to be m, and the binomial coefficients b (j) are stored in the reverse order according to the j value.
However, this embodiment requires pipeline-based parallel computation, and the storage structure shown in FIG. 6 does not have enough ports for parallel reading. To increase the number of ports, the ROM must be partitioned. The partitioning method is called the cyclic blocking method (Cyclic), or C method for short. FIG. 7 is a schematic diagram of the cyclic blocking method.
FIG. 8 is a memory structure diagram of the binomial coefficient memory cell in this embodiment. As shown in FIG. 8, since the number of parallel paths is $P$, the single-port ROM of depth $m$ is divided into $P$ blocks by cyclic blocking, and each ROM block has depth $m/P$. In practice, $m$ and $P$ are chosen so that the block depth is an integer, or $m$ is padded up to make it one. The $m$ binomial coefficients $b(j)$ are then decimated by a factor of $P$ and stored in the corresponding ROM blocks in reverse address order. The index of the ROM block holding binomial coefficient $b(j)$ is addr_col(j) and its address within that block is addr_row(j), calculated respectively as follows:
addr_col(j)=P-1-j%P (41)
addr_row(j)=m/P-1-⌊j/P⌋ (42)
where ⌊·⌋ denotes rounding down.
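The placement of the coefficients can be sketched as follows. The block number follows equation (41); the in-block address is an assumption, taken as the result of cyclically partitioning the reverse-ordered ROM of Fig. 6, i.e. ⌊(m-1-j)/P⌋, which equals m/P-1-⌊j/P⌋ when m is a multiple of P. The function names are hypothetical.

```python
def rom_location(j, m, P):
    """Return (block, address) of b(j) for an m-deep ROM split into P blocks."""
    if m % P != 0:
        raise ValueError("m must be a multiple of P (pad m if necessary)")
    block = P - 1 - j % P        # equation (41)
    address = (m - 1 - j) // P   # assumed: reversed index, rounded down after dividing by P
    return block, address


def build_rom_blocks(coeffs, P):
    """Distribute the coefficients b(0..m-1) over P ROM blocks of depth m/P."""
    m = len(coeffs)
    blocks = [[None] * (m // P) for _ in range(P)]
    for j, value in enumerate(coeffs):
        block, address = rom_location(j, m, P)
        blocks[block][address] = value
    return blocks
```

For m = 8 and P = 2, with each coefficient represented by its index j, the reverse-ordered ROM holds 7, 6, ..., 0, and the two blocks hold its even and odd addresses respectively.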
The nonlinear buffer unit 22 is used to buffer the intermediate accumulated data produced during the multiply-accumulate calculation of the data x(n) and the binomial coefficients b(j). In this embodiment, the nonlinear buffer unit 22 uses single-port RAM. Because a single RAM has too few read ports for parallel operation, it must also be partitioned according to the number of parallel paths, in the same way as the storage of b(j): the single-port RAM is divided into P blocks by cyclic partitioning, and the depth of each RAM block is m/P. Taking 8 points divided into 2 blocks as an example, with the storage addresses numbered 0 to 7: the data at address 0 goes to the first block, the data at address 1 to the second block, the count then starts over so that the data at address 2 goes to the first block, the data at address 3 to the second block, and so on. The ROM and the RAM can be implemented with either BRAM or distributed RAM (DRAM) resources of the FPGA; in the HLS tool, the resource type is selected simply with the pragma HLS bind_storage directive.
The nonlinear operation unit 23 is configured to receive the data x(n), read the binomial coefficients b(j) from the binomial coefficient storage unit 21, perform the calculation to obtain the nonlinear convolution operation result W(n), and buffer it in the nonlinear buffer unit 22. The nonlinear operation unit 23 includes P operation modules and uses P-path parallel calculation, with each path performing m/P calculations. To implement the data shift in fewer clock cycles, this embodiment adopts a specific read-write scheme whose basic idea is to separate reading from writing so that the spatial position of the data changes between the read before the calculation and the write after it, which is equivalent to shifting the data. Fig. 9 is a schematic diagram of this read-write separation scheme. As shown in Fig. 9, the nonlinear operation unit 23 reads the binomial coefficients b(j) from the binomial coefficient storage unit 21 in a fixed manner: the p-th operation module always reads from the p-th ROM block. The accumulated data are read and written across blocks: the p-th operation module reads from the p-th RAM block of the nonlinear buffer unit 22 and, after the operation is completed, writes to the (p-1)-th RAM block, so the read and write block numbers differ by one, and the write block number wraps around to the maximum value when it would fall below 0. Denoting the RAM read block number by p and the RAM write block number by p′, the rule can be expressed as:
p′=((p-1))_{P-1} (43)
where ((·))_{P-1} denotes the modulo operation, i.e. p′=P-1 when p-1<0, and p′=p-1 otherwise.
The advantage of this scheme is that the wiring between physical resources is fixed, so no large selector logic is needed to choose the operation space, which saves a considerable amount of resources.
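As a small illustration of the block rule above, the following sketch computes the write block for each read block; the function name `write_block` is hypothetical. Because the mapping from p to p′ is a fixed rotation, every RAM block is written by exactly one operation module in each step, which is why the connections can be hard-wired.

```python
def write_block(p, P):
    """Block written by the module that reads block p: p-1, wrapping to P-1 below zero."""
    return (p - 1) % P


def write_blocks(P):
    """Write block for every read block 0..P-1."""
    return [write_block(p, P) for p in range(P)]
```

For P = 4, `write_blocks(4)` gives `[3, 0, 1, 2]`.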
Once the selection of storage blocks is determined, the specific read and write addresses must be calculated. The read address of the binomial coefficient b(j) and the read address of the accumulated data are simple: both are read sequentially from the ROM block and RAM block of the corresponding number. The write address of the accumulated result differs, because it shifts depending on the calculation path. The specific calculation method is as follows:
For an m-point convolution with P-path parallel calculation, each calculation module performs m/P calculations, each of which reads one coefficient b(j), reads one accumulated value and writes one accumulated result. A counter count_q records the number of calculations, count_q=0,1,…,m/P-1. A register reg_h stores the read address of the coefficient b(j), reg_h=0,1,…,m/P-1. A register reg_read stores the read address of the accumulated data, reg_read=0,1,…,m/P-1. A register reg_p stores the write block number (i.e. p′), reg_p=0,1,…,P-1. A register reg_write stores the write address of the accumulated data, reg_write=0,1,…,m/P-1. The counter and the registers satisfy the following relationships:
reg_h=count_q (44)
reg_read=count_q (45)
Figure BDA0003913975290000141
where ((·))_{m/P-1} denotes the modulo operation, i.e. when
Figure BDA0003913975290000142
reg_write=m/P-1, and the value is otherwise unchanged, i.e.
Figure BDA0003913975290000143
In an actual circuit, the values of reg_h, reg_read and reg_write are calculated by corresponding logic circuits. Because the P paths operate simultaneously, only one counter count_q with an initial value of 0 is needed for the parallel calculation. In every path, the read address reg_h of the binomial coefficient b(j) and the read address reg_read of the accumulated data take the same value as count_q, so all calculation paths can share a single set of reg_read and count_q values. The write address, by contrast, depends on the register reg_p, so each calculation module must compute the write address of its own accumulated result.
In summary, the specific method of P-way parallel computation in the nonlinear operation unit 23 is as follows:
After the nonlinear operation unit 23 receives the data x(n), the P operation modules read the binomial coefficients from the binomial coefficient storage unit 21 in m/P batches and multiply them by the data x(n). In batch count_q, count_q=0,1,…,m/P-1, the p-th operation module reads the binomial coefficient at address count_q of the p-th ROM block in the binomial coefficient storage unit 21, reads the accumulated data at address count_q of the p-th RAM block in the nonlinear buffer unit 22, adds the current multiplication result to the accumulated data, and writes the sum to address reg_write of the p′-th RAM block in the nonlinear buffer unit 22. The RAM block number p′ and the write address reg_write are calculated as follows:
p′=((p-1))_{P-1} (47)
where ((·))_{P-1} denotes the modulo operation, i.e. p′=P-1 when p-1<0, and p′=p-1 otherwise.
Figure BDA0003913975290000144
where ((·))_{m/P-1} denotes the modulo operation, i.e. when
Figure BDA0003913975290000151
reg_write=m/P-1, and the value is otherwise unchanged.
After all calculations are completed, the nonlinear operation unit 23 controls the nonlinear buffer unit 22 to output the accumulated data at address 0 of the (P-1)-th RAM block to the merging operation module 4 as the nonlinear convolution operation result W(n) of the data x(n), and then clears the accumulated data at that address so that it can store the product of the next data and the binomial coefficient b(0).
For the linear operation module 3, this embodiment provides a preferred implementation based on the recursive derivation in the foregoing principle analysis, in which the linear convolution operation result S(n) of the fractional-order calculation is obtained by recursive calculation. Fig. 10 is a structural diagram of the linear operation module in the present embodiment. As shown in Fig. 10, the linear operation module 3 in this embodiment includes a signal storage unit 31, a fitting coefficient storage unit 32, a linear operation unit 33, an accumulation buffer unit 34, a tree-shaped addition unit 35 and a summation operation unit 36, where:
the signal storage unit 31 is configured to receive the data x(n) and store it cyclically. To simplify the subsequent calculation and the circuit implementation, the signal storage unit 31 uses an improved storage scheme for the data x(n). Fig. 11 is a schematic structural diagram of the signal storage unit in the present embodiment. As shown in Fig. 11, in this embodiment the memory length available to the signal is set to m=2^a, the linear segment length to H=2^b and the number of linear segments to K=2^c-1, so the total linear segment length is L=K×H, where a, b and c are all positive integers. A compensation space of length m_0 is added such that m+m_0=H, so the storage depth of the signal storage unit 31 is N=L+m+m_0=2^c×2^b, which ensures that the storage depth N is also an integer power of 2. To enable parallel computation, the signal storage unit 31 must be partitioned: with P parallel paths, P single-port RAM blocks are used to store the data x(n).
Because the blocking processing is performed, a certain mapping relationship exists between the original storage space and the new storage space after the blocking, and the mapping relationship adopted in this embodiment is as follows:
First, the original storage space of depth N is decimated by a factor of H to obtain H groups of storage space, each of depth K+1; the H groups are arranged one after another to give the first-step mapping result.
Then, a P-partition cyclic partitioning method (cyclic), referred to as the P-partition C method for short, is applied to the storage space obtained from the first step. The rule is to decimate the first-step storage space of depth N by a factor of P to obtain P groups of storage space, each of depth N/P. With p denoting the group number, p=0,1,…,P-1, the p-th group of storage space corresponds to the p-th single-port RAM block, so the P groups yield P RAM blocks, which completes the partitioning of the single RAM. The specific process is as follows:
Let addr_old(n) denote the original address in the original storage space and addr_new(n) the address after the first mapping step; addr_new(n) satisfies:
addr_new(n)=(addr_old(n) mod H)×(K+1)+⌊addr_old(n)/H⌋
where mod denotes the remainder operation and ⌊·⌋ denotes rounding down.
addr_col(n) denotes the RAM block number and addr_row(n) the address within each block; they satisfy:
addr_col(n)=addr_new(n) mod P
addr_row(n)=⌊addr_new(n)/P⌋
After partitioning, the way the data x(n) is written must also be specified. This embodiment uses sequential cyclic storage: addr_initial=L+m is the initial write address, and the address of each subsequent x(n) is incremented by one; when the address reaches the storage depth N it wraps around to 0, after which it continues to increment and the cycle repeats. With addr_old(n) denoting the theoretical write address of x(n), the following relationship holds:
addr_old(n)=(addr_initial+n) mod N
Applying the mapping rule above to addr_old(n) gives the corresponding addr_row(n) and addr_col(n), i.e. the actual storage address of the data after partitioning.
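The storage parameters and the two-step mapping can be sketched as follows. The sketch rests on several assumptions: the storage depth is taken as N = L + m + m_0; the first step places samples with the same (addr_old mod H) next to each other; and the second step uses the conventional cyclic assignment, with the block number equal to addr_new mod P and the in-block address equal to ⌊addr_new/P⌋. All function names are hypothetical.

```python
def storage_params(a, b, c):
    """Derive (m, H, K, L, m0, N) from the exponents a, b, c (requires a <= b)."""
    m, H, K = 2 ** a, 2 ** b, 2 ** c - 1
    L = K * H
    m0 = H - m          # compensation space so that m + m0 = H
    N = L + m + m0      # assumed depth; equals 2**c * 2**b
    return m, H, K, L, m0, N


def map_address(addr_old, H, K, P):
    """Map an original address to (block, row) in the P single-port RAM blocks."""
    # Step 1: H-fold decimation, groups of depth K+1 placed one after another.
    addr_new = (addr_old % H) * (K + 1) + addr_old // H
    # Step 2: cyclic partition of the remapped space (assumed assignment).
    return addr_new % P, addr_new // P


def write_location(n, H, K, L, m, N, P):
    """(block, row) at which x(n) is written under sequential cyclic storage."""
    addr_old = (L + m + n) % N   # addr_initial = L + m
    return map_address(addr_old, H, K, P)
```

Under these assumptions, the K+1 samples spaced H apart in the original address space occupy K+1 consecutive remapped addresses, so every batch of P of them falls into P different RAM blocks and can be read in the same cycle.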
The fitting coefficient storage unit 32 is used to store the fitting coefficients for calculating the linear-segment binomial coefficients, i.e. α_1, β_1, β_2, …, β_K and α_K. Fig. 12 is a memory structure diagram of the fitting coefficient storage unit in the present embodiment. As shown in Fig. 12, so that the coefficients can be read simultaneously for parallel computation, the fitting coefficient storage unit 32 uses K+2 registers: the coefficient α_K corresponding to the correction term C(n) is negated and stored in the 1st register, the coefficient α_1 corresponding to the data item D(n) is negated and stored in the last register, and the coefficients β_k are stored in the remaining K registers in reverse order of k.
The linear operation unit 33 is used to read data from the signal storage unit 31, read the fitting coefficients of the binomial coefficients from the fitting coefficient storage unit 32, and calculate the multiply-accumulate term P_k(n), the multiply-accumulate term Q_k(n), the data item D(n) and the correction term C(n). The linear operation unit 33 comprises 2 parallel calculation modules: the 1st path calculates the multiply-accumulate term P_k(n) and the data item D(n) and is denoted the PD path calculation module; the 2nd path calculates the multiply-accumulate term Q_k(n) and the correction term C(n) and is denoted the QC path calculation module. Each path therefore has to calculate K+1 results. To increase the calculation speed, each calculation module in this embodiment uses parallel computation: the multiplication and addition resources of both calculation modules are each instantiated P times, and each instance calculates (K+1)/P of the results. Fig. 13 is a structural diagram of a calculation module in the linear operation unit in the present embodiment. As shown in Fig. 13, each calculation module in the linear operation unit 33 includes a signal address calculation circuit 331, a coefficient address calculation circuit 332, P multiplication circuits 333 and P addition circuits 334.
The operation of the linear operation unit 33 essentially multiplies corresponding data from the signal storage unit 31 and the fitting coefficient storage unit 32 and accumulates the products, so a single calculation broadly proceeds as follows:
In the first step, the corresponding data are read from the signal storage unit 31 and the fitting coefficient storage unit 32.
In the second step, the two values read are multiplied.
In the third step, the existing value is read from the accumulation buffer unit 34 and the multiplication result is added to it.
In the fourth step, the result is written back to the accumulation buffer unit 34.
To ensure that data are read correctly from the signal storage unit 31 and the fitting coefficient storage unit 32, each calculation module in this embodiment is provided with a signal address calculation circuit 331 and a coefficient address calculation circuit 332 to generate the read addresses. The read address calculation of the two circuits is derived and described in detail below.
The signal address calculation circuit 331 is configured to generate a read address of the signal storage unit 31, and in order to ensure that no port collision occurs during parallel reading, a specific address calculation rule needs to be set, which is specifically as follows:
Each time a calculation is performed, the PD path calculation module and the QC path calculation module each need to read K+1 data from the signal storage unit 31. With P parallel paths, P data are read at a time, so V reads are performed in total, where V=(K+1)/P. In this embodiment, the signal address calculation circuit 331 must calculate P data read addresses, where the p-th data read address is given by the RAM block number addr_col(p)(v) in the signal storage unit 31 selected when the p-th computing resource performs the v-th calculation, p=1,2,…,P, v=1,2,…,V. A counter count_l records the number of operations of the linear operation unit 33; its initial value is 0, it is incremented by 1 after each operation, and it is reset to zero after reaching N-1. The read address is tied to the partitioned storage scheme of the signal storage unit through this operation counter count_l.
When count_l=0, the reading rule is as follows:
At the 0th read, the 0th computing resource reads data from address 0 of RAM block 0, for calculating the recursion item p_k(n); the 1st computing resource reads data from address 0 of RAM block 1; and so on, up to the (P-1)-th computing resource, which reads data from address 0 of RAM block P-1. At the 1st read, the 0th computing resource reads data from address 1 of RAM block 0, for calculating the recursion item p_k(n); the 1st computing resource reads data from address 1 of RAM block 1; and so on, up to the (P-1)-th computing resource, which reads data from address 1 of RAM block P-1. The process repeats according to this rule until the (V-1)-th read, at which the (P-1)-th computing resource reads data from address V-1 of RAM block P-1.
As the operation count count_l increases, the input signal x(n) is updated according to the algorithm and the reading rule changes accordingly. The RAM block number addr_col(p)(v) depends only on the value of count_l: each time count_l increases by H, addr_col(p)(v) increases by one from its previous value, returns to zero after exceeding the maximum value P-1, and the cycle repeats.
For the address addr_row(p)(v) within the RAM, the rule within one operation is the same as above: addr_row(p)(v) increases with the read index v. As count_l increases, however, it no longer starts from 0 but continues from the maximum address of the previous operation, i.e. the one reached at read V-1, until count_l has increased by H and the address exceeds the maximum value N/P-1, at which point addr_row(p)(v) returns to zero and the cycle restarts. In addition, because the RAM block number addr_col(p)(v) changes each time count_l increases by H, the read address addr_row(p)(v) also depends on addr_col(p)(v): each time addr_col(p)(v) returns to zero, the corresponding addr_row(p)(v) is incremented by one from its previous value, and when addr_row(p)(v) exceeds the maximum value addr_row(p)(V-1) it is reset to the minimum value addr_row(p)(0), and the cycle repeats.
Based on the above rules, the RAM block number addr_col(p)(v) in the theoretical read address of the signal storage unit 31 in this embodiment is calculated using the following formula:
Figure BDA0003913975290000181
The address addr_row(p)(v) within the RAM is calculated using the following formula:
Figure BDA0003913975290000182
Under the above calculation rules, a given computing resource in each calculation module does not read from a fixed RAM block: the block changes as the operation count count_l increases, which consumes a large amount of selector resources. To reduce resource consumption, this embodiment changes the way computing resources are paired with RAM blocks so that the p-th resource always reads the p-th RAM block, i.e. the RAM block number addr_col(p)(v) no longer changes with count_l. Under this allocation, the value of p must be remapped; let p* denote the remapped p value, which satisfies:
Figure BDA0003913975290000191
It is then only necessary to substitute the remapped p* into the original formulas for addr_col(p)(v) and addr_row(p)(v) to obtain the reallocated address, namely:
for the p*-th read address, the single-port RAM number addr_col(p*)(v) in the signal storage unit is always p*, and the read address addr_row(p*)(v) within the single-port RAM is calculated according to the following formula:
Figure BDA0003913975290000192
Fig. 14 is a schematic diagram of the address mapping process in the signal address calculation circuit in the present embodiment. As shown in Fig. 14, without the p-value mapping, the single-port RAM number is switched periodically when a multiplication circuit reads data; with the mapping, each multiplication circuit always reads from its own single-port RAM and no switching is needed, which simplifies the actual circuit and facilitates implementation of the present invention.
The coefficient address calculation circuit 332 is configured to calculate the read addresses of the fitting coefficients in the fitting coefficient storage unit 32. As with the signal address calculation circuit, each time the linear operation module 3 performs a calculation, the PD path and the QC path each read K+1 data from the fitting coefficient storage unit 32; with P parallel paths, P data are read at a time and V reads are performed in total, where V=(K+1)/P. Let addr_read(p)(v) denote the initial read address when the p-th computing resource performs the v-th calculation. The reading rule is as follows:
For the PD path calculation module, at the 0th read, the 0th computing resource reads β_K for calculating p_K(n), the 1st computing resource reads β_{K-1} for calculating p_{K-1}(n), and so on, up to the (P-1)-th computing resource, which reads β_{K-P+1} for calculating p_{K-P+1}(n). At the 1st read, the 0th computing resource reads β_{K-P} for calculating p_{K-P}(n). The process repeats according to this rule until the (V-1)-th read, at which the last, (P-1)-th computing resource reads the coefficient -α_1 for calculating D(n).
For the QC path calculation module, at the 0th read, the 0th computing resource reads -α_K for calculating C(n), the 1st computing resource reads β_K for calculating p_K(n), and so on, up to the (P-1)-th computing resource, which reads β_{K-P+2} for calculating p_{K-P+2}(n). At the 1st read, the 0th computing resource reads β_{K-P+1} for calculating p_{K-P+1}(n). The process repeats according to this rule until the (V-1)-th read, at which the (P-1)-th computing resource reads β_1 for calculating p_1(n).
The above is a sequential reading scheme. The initial read address addr_read(p)(v) when the p-th computing resource performs the v-th calculation is given as follows:
For the PD path, whose start address is 1:
addr_read(p)(v)=p+vP+1
For the QC path, whose start address is 0:
addr_read(p)(v)=p+vP
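The register layout and the sequential read addresses can be checked with a short sketch. It assumes 0-based register addresses 0 to K+1, with p=0,…,P-1 and v=0,…,V-1 as in the reading rule above; the function names and the `path` argument are hypothetical.

```python
def coefficient_registers(alpha_1, alpha_K, betas):
    """Contents of the K+2 registers; betas[k-1] holds beta_k for k = 1..K."""
    # Register 0: -alpha_K, registers 1..K: beta_K down to beta_1, register K+1: -alpha_1.
    return [-alpha_K] + betas[::-1] + [-alpha_1]


def sequential_read_address(p, v, P, path):
    """Initial read address of computing resource p at read v for path 'PD' or 'QC'."""
    start = 1 if path == "PD" else 0
    return p + v * P + start


def read_order(K, P, path):
    """Addresses visited over all V = (K+1)/P reads, resource by resource."""
    V = (K + 1) // P
    return [sequential_read_address(p, v, P, path) for v in range(V) for p in range(P)]
```

With K = 3 and P = 2, the PD path visits registers 1 to 4 (β_3, β_2, β_1, -α_1) and the QC path visits registers 0 to 3 (-α_3, β_3, β_2, β_1), so both paths read K+1 values from the same K+2 registers.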
In this embodiment, the way the signal storage unit 31 is read has been reassigned. For the fitting coefficients to remain correctly matched to the input signal x(n) after reading, the pairing of fitting coefficient read addresses with computing resources must be reassigned as well, i.e. addr_initial(p)(v) must be remapped; let addr_real(p)(v) denote the actual read address when the p-th computing resource performs the v-th calculation. As before, the counter count_l records the number of operations of the linear operation unit 33 and starts at 0. For the first H counts of count_l, addr_initial(p)(v) keeps its initial value, and each time count_l advances by a further H counts, addr_initial(p)(v) is transformed once. The mapping rule is as follows:
When count_l<H, addr_initial(p)(v) remains unchanged;
When H≤count_l<2H, 1 is subtracted from addr_initial(p)(v), and if the result is less than the minimum value it is set to the maximum value;
When 2H≤count_l<3H, 2 is subtracted from addr_initial(p)(v), and if the result is less than the minimum value it is set to the maximum value;
and so on, until count_l is reset to 0, at which point addr_initial(p)(v) returns to its initial value and the above rule is applied again. In summary, in this embodiment the actual initial read address addr_read(p)(v) when the p-th computing resource performs the v-th calculation is calculated as follows:
when the calculation module is a PD way calculation module, the initial addresses of the P read addresses are calculated by using the following formula:
Figure BDA0003913975290000201
when the calculation module is a QC path calculation module, the initial addresses of the P reading addresses are calculated by adopting the following formula:
Figure BDA0003913975290000211
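The remapping rule stated in the text can be sketched as follows. The sketch is built only from the textual rule (subtract one more for every H operations, wrapping from below the minimum to the maximum); it assumes that the wrap is modular over the K+1 addresses used by each path, and the function name and `path` argument are hypothetical.

```python
def remapped_read_address(p, v, P, K, H, count_l, path):
    """Coefficient read address of resource p at read v after count_l operations."""
    start = 1 if path == "PD" else 0   # lowest register address used by this path
    initial = p + v * P + start        # sequential address before remapping
    shift = count_l // H               # one further step back for every H operations
    # Assumed wrap: addresses falling below `start` re-enter from the top of the range.
    return start + (initial - start - shift) % (K + 1)
```

Under this assumption, the addresses used by the P resources over the V reads remain a permutation of the path's K+1 registers for every value of count_l, so each coefficient is still read exactly once per operation.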
fig. 15 is a schematic diagram of an address mapping process in the coefficient address calculation circuit in the present embodiment. As shown in fig. 15, in this embodiment, the coefficient addresses are periodically switched according to the count _ l to adapt to mapping of the serial number of the single-port RAM, so that the data can be multiplied by the correct coefficient.
After receiving the data read addresses from the signal address calculation circuit and the fitting coefficient read addresses from the coefficient address calculation circuit, the P multiplication circuits 333 perform P multiplications in parallel, each multiplying the corresponding data by the corresponding fitting coefficient, and send the multiplication results to the corresponding addition circuits 334. In this embodiment, the multiplication circuits are implemented with DSP resources in the FPGA; the implementation is handled by the HLS tool and is not described in detail here.
After receiving the multiplication result from the corresponding multiplication circuit 333, each of the P addition circuits 334 determines the type of the multiplication result from its batch v and the serial number p of the multiplication circuit, and processes it according to that type, specifically:
When the calculation module is the PD path calculation module, if the batch v=V-1 and the multiplication circuit serial number p=P-1, the multiplication result is the data item D(n) and is written directly into the register D_reg in the accumulation buffer unit 34. Otherwise, the multiplication result is a recursion item p_k(n): the accumulated value in the p-th register of the PD path registers in the accumulation buffer unit 34 is read, added to the recursion item p_k(n), and the sum is written back over the p-th register of the PD path registers.
When the calculation module is the QC path calculation module, if the batch v=0 and the multiplication circuit serial number p=0, the multiplication result is the correction term C(n) and is written directly into the register C_reg in the accumulation buffer unit 34. Otherwise, the multiplication result is a recursion item q_k(n): the accumulated value in the p-th register of the QC path registers in the accumulation buffer unit 34 is read, added to the recursion item q_k(n), and the sum is written back over the p-th register of the QC path registers.
The accumulation buffer unit 34 is used to store the accumulation results calculated by the linear operation unit. It contains two groups of registers, denoted the PD path registers and the QC path registers, each comprising P registers; the two groups store the P recursion item accumulation results output by the PD path calculation module and the QC path calculation module respectively. Two separate registers, D_reg and C_reg, store the data item D(n) and the correction term C(n) respectively.
The tree-shaped addition unit 35 reads the accumulation results of the P registers in the PD path registers and in the QC path registers from the accumulation buffer unit 34, and sums each group with a tree-shaped addition structure to obtain the multiply-accumulate terms P_k(n) and Q_k(n). The degree of parallelism and the number of stages of the tree-shaped addition structure can be set according to the actual situation; because the structure is common, its details are not described here.
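A tree-shaped addition structure can be modelled in software as pairwise addition, level by level. The following is a generic sketch of a binary adder tree and is not taken from this embodiment; the function name `tree_sum` is hypothetical, and padding odd levels with zero is one possible choice among several.

```python
def tree_sum(values):
    """Sum values pairwise, level by level; return (total, number of adder stages)."""
    level = list(values)
    stages = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0.0)   # pad odd levels so every adder has two inputs
        # All additions of one level are independent and could run in parallel.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        stages += 1
    return (level[0] if level else 0.0), stages
```

For P register values the tree needs ⌈log2 P⌉ stages, compared with P-1 sequential additions in a simple accumulator chain.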
The summation operation unit 36 reads the multiply-accumulate terms P_k(n) and Q_k(n) from the tree-shaped addition unit 35, reads the data item D(n) and the correction term C(n) from the registers D_reg and C_reg of the accumulation buffer unit 34, and sums them with the currently buffered linear convolution operation result S(n-1) to obtain the linear convolution operation result S(n), which is output to the merging operation module 4 and buffered locally for the next linear convolution operation.
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments. To those skilled in the art, various changes are apparent as long as they fall within the spirit and scope of the invention as defined by the appended claims, and all inventions and creations that make use of the inventive concept are protected.

Claims (3)

1. An FPGA-based piecewise linear fractional order operation IP core, characterized by comprising an input conversion module, a nonlinear operation module, a linear operation module, a merging operation module and an output conversion module, wherein:
the input conversion module is used for receiving ADC acquisition data and converting it into floating-point or fixed-point decimal format to obtain converted data x(n), where n denotes the sampling point index, n=1,2,…; the data x(n) is then split into two paths, one sent to the nonlinear operation module and the other to the linear operation module;
the nonlinear operation module is used for calculating a nonlinear convolution operation result W (n) in fractional order operation of data x (n) and sending the result W (n) to the merging operation module, and a calculation formula of the nonlinear convolution operation result W (n) is as follows:
Figure FDA0003913975280000011
wherein m represents the memory length, b(j) represents the binomial coefficients, j = 0, 1, …, L-1, L represents the total number of binomial coefficients, and the binomial coefficients b(j) are calculated as follows:
when j = 0, 1, …, m-1, the binomial coefficient b(j) is obtained from the theoretical calculation formula; when j = m, m+1, …, L-1, the binomial coefficient b(j) is obtained by piecewise linear fitting, as follows:
let the binomial coefficients b(m) to b(L-1) be composed of K linear segments, where K ≥ 1, and let L_k denote the length of the k-th linear segment, i.e. the number of fitted coefficients b(j) in the k-th segment is L_k; denote the k-th linear function by y_k and let its two endpoints be b(n_{k-1}) and b(n_k), both of which are calculated with the theoretical calculation formula; the slope β_k of the k-th linear function y_k is calculated using the following formula:
Figure FDA0003913975280000012
wherein n_0 = m-1 and n_k = n_{k-1} + L_k;
then the binomial coefficient b(j) is calculated by the following formula:
b(j) = b(j-1) + β_k, n_{k-1} < j ≤ n_k;
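The coefficient construction above can be sketched in software as follows. Two points are assumptions rather than quotations from the claim. First, the "theoretical calculation formula" is not reproduced in this section, so the sketch uses the standard Grünwald-Letnikov recurrence b(0) = 1, b(j) = (1 - (1 + α)/j)·b(j-1). Second, the slope formula is an image in the source; the sketch uses β_k = (b(n_k) - b(n_{k-1}))/L_k, which is the slope of a line through the two stated endpoints when n_k - n_{k-1} = L_k. The function names are hypothetical.

```python
def gl_coeffs(alpha, count):
    """Exact coefficients b(0..count-1).

    ASSUMPTION: standard Gruenwald-Letnikov recurrence; the patent's own
    theoretical formula is not shown in this section.
    """
    b = [1.0]
    for j in range(1, count):
        b.append(b[-1] * (1.0 - (alpha + 1.0) / j))
    return b


def piecewise_coeffs(alpha, m, seg_lengths):
    """b(0..m-1) exact, then K linear segments of lengths L_1..L_K."""
    n_last = m - 1 + sum(seg_lengths)        # n_K
    exact = gl_coeffs(alpha, n_last + 1)
    b = exact[:m]                            # nonlinear part, j = 0..m-1
    n_prev = m - 1                           # n_0 = m - 1
    for L_k in seg_lengths:
        n_k = n_prev + L_k                   # n_k = n_{k-1} + L_k
        beta_k = (exact[n_k] - exact[n_prev]) / L_k   # assumed slope formula
        for j in range(n_prev + 1, n_k + 1):
            b.append(b[j - 1] + beta_k)      # b(j) = b(j-1) + beta_k
        n_prev = n_k
    return b
```

With this construction the fitted values agree with the exact coefficients at every breakpoint n_k (up to rounding), and only the K slopes plus the endpoint values need to be stored for the linear part.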
the linear operation module is used for calculating the linear convolution operation result S(n) in the fractional order operation of the data x(n) and sending S(n) to the merging operation module; the calculation formula of S(n) is as follows:
Figure FDA0003913975280000021
after receiving the nonlinear convolution operation result W(n) and the linear convolution operation result S(n), the merging operation module calculates the fractional order operation result D^α x(n) of the data x(n) according to the formula
Figure FDA0003913975280000022
wherein
Figure FDA0003913975280000023
represents the fractional order time interval coefficient, and then sends the fractional order operation result D^α x(n) to the output conversion module;
the output conversion module performs format conversion on the fractional order operation result D^α x(n) calculated by the merging operation module and, according to actual needs, either retains the original acquired data format or converts the result into quantized data for output.
2. The FPGA-based piecewise linear fractional order operation IP core of claim 1, wherein the nonlinear operation module comprises a binomial coefficient storage unit, a nonlinear cache unit and a nonlinear operation unit, wherein:
the binomial coefficient storage unit is used for storing the binomial coefficients b(j) calculated in advance according to the binomial coefficient calculation method; the binomial coefficient storage unit adopts a single-port ROM with depth m, and the specific storage mode is as follows:
the single-port ROM is divided into P blocks by cyclic partitioning, each ROM block having depth m/P; the m binomial coefficients b(j) are then decimated by a factor of P and stored into the corresponding ROM blocks in reverse address order. The number of the ROM block in which the binomial coefficient b(j) is located is addr_col(j) and its address within that ROM block is addr_row(j), calculated respectively as follows:
addr_col(j)=P-1-j%P
Figure FDA0003913975280000024
wherein
Figure FDA0003913975280000025
represents rounding down;
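The cyclic partitioning of the coefficient ROM can be illustrated with a short sketch. The block number addr_col(j) = P-1-j%P is taken from the claim. The in-block address formula is an image in the source, so the expression used here, m/P - 1 - ⌊j/P⌋, is only an assumption chosen to match the stated "reverse address order" and the rounding-down operator; the function name `rom_location` is hypothetical.

```python
def rom_location(j, m, P):
    """Return (addr_col, addr_row) for binomial coefficient b(j)."""
    addr_col = P - 1 - j % P          # block number, as given in the claim
    addr_row = m // P - 1 - j // P    # ASSUMPTION: reversed row index (formula is an image)
    return addr_col, addr_row


def build_rom_blocks(coeffs, P):
    """Distribute m coefficients over P blocks of depth m/P."""
    m = len(coeffs)
    blocks = [[None] * (m // P) for _ in range(P)]
    for j, value in enumerate(coeffs):
        col, row = rom_location(j, m, P)
        blocks[col][row] = value
    return blocks
```

Under this assumed mapping every coefficient lands in a distinct location, so P coefficients with consecutive indices can be read from P different blocks in the same cycle.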
the nonlinear cache unit is used for caching the intermediate accumulated data of the multiply-accumulate calculation of the data x(n) and the binomial coefficients b(j); the nonlinear cache unit adopts a single-port RAM with depth m, which is divided into P blocks by cyclic partitioning, each RAM block having depth m/P;
the nonlinear operation unit is used for receiving the data x(n), reading the corresponding binomial coefficients from the binomial coefficient storage unit, calculating the nonlinear convolution operation result W(n) and caching it in the nonlinear cache unit; the nonlinear operation unit comprises P operation modules working in P parallel paths, each path performing m/P calculations, and the specific calculation mode is as follows:
after the nonlinear operation unit receives the data x(n), the P operation modules read binomial coefficients from the binomial coefficient storage unit in m/P batches and multiply them by the data x(n), wherein the binomial coefficient read by the p-th operation module in batch count_q is the binomial coefficient at address count_q in the p-th ROM block of the binomial coefficient storage unit, count_q = 0, 1, …, m/P-1; accumulated data is read from address count_q in the p-th RAM block of the nonlinear cache unit, the current multiplication result is added to it, and the sum is written to address reg_write in the p′-th RAM block of the nonlinear cache unit, wherein the RAM block number p′ and the write address reg_write are calculated respectively as follows:
p′ = ((p-1))_{P-1}
wherein ((·))_{P-1} denotes the modulo operation, i.e. when p-1 < 0, p′ = P-1, and otherwise the value is unchanged;
Figure FDA0003913975280000031
wherein (·)_{m/P-1} denotes the modulo operation, i.e. when
Figure FDA0003913975280000032
reg_write = m/P-1, and otherwise the value is unchanged;
after completing all the calculations, the nonlinear operation unit controls the nonlinear cache unit to output the accumulated data at address 0 in the (P-1)-th RAM block as the nonlinear convolution operation result W(n) of the data x(n) to the merging operation module, and then clears the accumulated data at that address.
3. The IP core of claim 1, wherein the linear operation module comprises a signal storage unit, a fitting coefficient storage unit, a linear operation unit, an accumulation buffer unit, a tree-shaped addition unit and a summation operation unit, wherein:
the signal storage unit is used for receiving the data x(n) and storing it dynamically in a block storage mode, as follows:
set the signal memory length m = 2^a, the linear segment length H = 2^b and the number of linear segments K = 2^c - 1, so that the total linear segment length is L = K × H, where a, b and c are positive integers; set a compensation space of length m_0 such that m + m_0 = H, and let the storage depth of the signal storage unit be N = L + m_0; denote the parallel number as P and instantiate P blocks of single-port RAM for storing the data x(n);
the original storage space of depth N in the signal storage unit is decimated by a factor of H to obtain H groups of storage space, each of depth K+1, and the H groups are arranged in sequence to obtain the first-step mapping result; the storage space of depth N after the first-step mapping is then decimated by a factor of P using the P-way cyclic partitioning method to obtain P groups of storage space, each of depth N/P, where p denotes the group number, p = 0, 1, …, P-1, and the p-th group of storage space corresponds to the p-th single-port RAM;
the single-port RAM number addr_col(n) corresponding to the data x(n) and the address addr_row(n) within that single-port RAM are calculated as follows:
let addr_initial = L + m be the initial write address for cyclically writing data into the signal storage unit; the theoretical write address addr_old(n) of the data x(n) is calculated by the following formula:
addr_old(n) = (addr_initial + n) mod N
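The cyclic write address can be checked with a few lines of code. This sketch covers only the addr_old(n) formula stated above; the later mapping steps depend on formulas that are images in the source and are not modeled. The depth and initial address used in the usage note are arbitrary illustrative values, not parameters from the patent.

```python
def addr_old(n, addr_initial, depth):
    """Theoretical cyclic write address: (addr_initial + n) mod N."""
    return (addr_initial + n) % depth


def write_sequence(count, addr_initial, depth):
    """Write addresses for samples n = 1..count, wrapping at the depth N."""
    return [addr_old(n, addr_initial, depth) for n in range(1, count + 1)]
```

For example, with a hypothetical depth of 32 and an initial address of 28, `write_sequence(6, 28, 32)` yields 29, 30, 31, 0, 1, 2, showing how the oldest samples are overwritten once the buffer wraps.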
the address addr_new(n) after the first mapping is calculated by using the following formula:
Figure FDA0003913975280000041
wherein mod represents the remainder operation, and
Figure FDA0003913975280000042
represents rounding down;
the single-port RAM number addr_col(n) corresponding to the data x(n) is calculated by the following formula:
Figure FDA0003913975280000043
the address addr_row(n) of the data x(n) in the addr_col(n)-th single-port RAM is calculated by the following formula:
addr_row(n) = addr_new(n) mod P
the fitting coefficient storage unit is used for storing the fitting coefficients for calculating the linear-segment binomial coefficients, namely α_1, β_1, β_2, …, β_K and α_K; the specific storage mode is as follows: K+2 registers are adopted; the coefficient α_K corresponding to the correction term C(n) is stored in the 1st register, the coefficient α_1 corresponding to the data item D(n) is negated and stored in the last register, and the coefficients β_k are stored into the remaining K registers in reverse order of k;
the linear operation unit is used for reading data from the signal storage unit, reading the fitting coefficients of the binomial coefficients from the fitting coefficient storage unit, and calculating the multiply-accumulate term P_k(n), the multiply-accumulate term Q_k(n), the data item D(n) and the correction item C(n); the linear operation unit comprises 2 parallel calculation modules, wherein the 1st path is used for calculating the multiply-accumulate term P_k(n) and the data item D(n) and is denoted the PD-path calculation module, the 2nd path is used for calculating the multiply-accumulate term Q_k(n) and the correction item C(n) and is denoted the QC-path calculation module, and each path needs to calculate K+1 results; each calculation module comprises a signal address calculation circuit, a coefficient address calculation circuit, P multiplication circuits and P addition circuits, wherein:
the signal address calculation circuit is used for calculating the signal read addresses of the signal storage unit and sending them to the multiplication circuits; the data read addresses are calculated as follows: the signal address calculation circuit generates V batches of read addresses, each batch containing P data read addresses, and the p*-th read address is sent to the p*-th multiplication circuit, p* = 1, 2, …, P, wherein for the p*-th data read address the single-port RAM number addr_col(p*)(v) in the signal storage unit is p*, and the read address addr_row(p*)(v) within the addr_col(p*)(v)-th single-port RAM is calculated using the following formula:
Figure FDA0003913975280000051
wherein count_l represents a run counter recording the number of runs of the linear operation module; its initial value is 0, it is incremented by 1 after each completed operation, and it is reset to zero after reaching N-1;
the coefficient address calculation circuit is used for generating the fitting coefficient read addresses of the fitting coefficient storage unit and sending them to the multiplication circuits; the fitting coefficient read addresses are calculated as follows: the coefficient address calculation circuit generates V batches of read addresses, each batch comprising P read addresses generated consecutively from an initial address, and the p-th read address is sent to the p-th multiplication circuit, p = 1, 2, …, P; when the calculation module is the PD-path calculation module, the initial address of the P read addresses is calculated using the following formula:
Figure FDA0003913975280000052
when the calculation module is the QC-path calculation module, the initial address of the P read addresses is calculated using the following formula:
Figure FDA0003913975280000053
the P multiplication circuits execute P multiplications in parallel each time: according to the data read address sent by the signal address calculation circuit and the fitting coefficient read address sent by the coefficient address calculation circuit, each multiplies the corresponding data by the corresponding fitting coefficient and then sends the multiplication result to the corresponding addition circuit;
after the P addition circuits receive the multiplication results from the corresponding multiplication circuits, each determines the type of its multiplication result from the batch v of the result and the number p of the multiplication circuit, and performs subsequent processing according to the type, as follows:
when the calculation module is the PD-path calculation module, if the batch v = V-1 and the multiplication circuit number p = P-1, the multiplication result is the data item D(n) and is written directly into the register D_reg in the accumulation cache unit; otherwise the multiplication result is a recursion term p_k(n), and the accumulated value in the p-th register of the PD-path registers in the accumulation cache unit is read, added to the recursion term p_k(n), and the sum is written back to the p-th register of the PD-path registers, overwriting it;
when the calculation module is the QC-path calculation module, if the batch v = 0 and the multiplication circuit number p = 0, the multiplication result is the correction term C(n) and is written directly into the register C_reg in the accumulation cache unit; otherwise the multiplication result is a recursion term q_k(n), and the accumulated value in the p-th register of the QC-path registers in the accumulation cache unit is read, added to the recursion term q_k(n), and the sum is written back to the p-th register of the QC-path registers, overwriting it;
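The routing rule applied by the addition circuits can be sketched as a small behavioral model. The function and dictionary names are hypothetical, and the sketch assumes zero-based batch and circuit indices (v = 0..V-1, p = 0..P-1), which is the convention implied by the conditions v = V-1, p = P-1 and v = 0, p = 0 above.

```python
def new_registers(P):
    """Accumulation cache: two groups of P registers plus D_reg and C_reg."""
    return {"PD": [0.0] * P, "QC": [0.0] * P, "D_reg": 0.0, "C_reg": 0.0}


def route_product(regs, path, v, p, product, V, P):
    """Store or accumulate one multiplication result according to its type."""
    if path not in ("PD", "QC"):
        raise ValueError("path must be 'PD' or 'QC'")
    if path == "PD" and v == V - 1 and p == P - 1:
        regs["D_reg"] = product        # data item D(n): written directly
    elif path == "QC" and v == 0 and p == 0:
        regs["C_reg"] = product        # correction item C(n): written directly
    else:
        regs[path][p] += product       # recursion term: read, add, overwrite
```

Only one product per path is diverted to a dedicated register; every other product is folded into the p-th accumulator, which is what lets the later tree-shaped addition reduce just P values per path.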
the accumulation cache unit is used for storing the accumulation results calculated by the linear operation unit; two groups of registers are provided, denoted the PD-path registers and the QC-path registers, each group comprising P registers used respectively for storing the P recursion-term accumulation results output by the PD-path calculation module and the QC-path calculation module; two independent registers D_reg and C_reg are additionally provided for storing the data item D(n) and the correction item C(n);
the tree-shaped addition unit reads the accumulated results of the P registers in the PD-path registers and in the QC-path registers from the accumulation cache unit, and sums each group with a tree-shaped addition structure to obtain the multiply-accumulate terms P_k(n) and Q_k(n);
the summation operation unit reads the accumulated terms P_k(n) and Q_k(n) from the tree-shaped addition unit, reads the data item D(n) and the correction item C(n) from the registers D_reg and C_reg of the accumulation cache unit, and sums them with the currently cached linear convolution operation result S(n-1) to obtain the linear convolution operation result S(n); it outputs S(n) to the merging operation module and caches it locally for the next linear convolution operation.
CN202211332312.3A 2022-10-28 2022-10-28 FPGA-based piecewise linear fractional order operation IP core Pending CN115640493A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211332312.3A CN115640493A (en) 2022-10-28 2022-10-28 FPGA-based piecewise linear fractional order operation IP core

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211332312.3A CN115640493A (en) 2022-10-28 2022-10-28 FPGA-based piecewise linear fractional order operation IP core

Publications (1)

Publication Number Publication Date
CN115640493A true CN115640493A (en) 2023-01-24

Family

ID=84947052

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211332312.3A Pending CN115640493A (en) 2022-10-28 2022-10-28 FPGA-based piecewise linear fractional order operation IP core

Country Status (1)

Country Link
CN (1) CN115640493A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084361A (en) * 2017-10-30 2019-08-02 上海寒武纪信息科技有限公司 A kind of arithmetic unit and method
CN113157637A (en) * 2021-04-27 2021-07-23 电子科技大学 High-capacity reconfigurable FFT operation IP core based on FPGA
CN113377340A (en) * 2021-05-12 2021-09-10 电子科技大学 Digital oscilloscope with fractional calculus operation and display function
CN113778940A (en) * 2021-09-06 2021-12-10 电子科技大学 High-precision reconfigurable phase adjustment IP core based on FPGA

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
朱栋等: "《面向EDA课程设计的数字神经元电路实现》", 《实验室科学》, vol. 25, no. 3, 30 June 2022 (2022-06-30) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116720554A (en) * 2023-08-11 2023-09-08 南京师范大学 Method for realizing multi-section linear fitting neuron circuit based on FPGA technology
CN116720554B (en) * 2023-08-11 2023-11-14 南京师范大学 Method for realizing multi-section linear fitting neuron circuit based on FPGA technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination