WO2022141321A1 - Dsp and parallel computing method therefor - Google Patents

Dsp and parallel computing method therefor

Info

Publication number
WO2022141321A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
circuit
control signal
processing circuit
decoding
Prior art date
Application number
PCT/CN2020/141848
Other languages
French (fr)
Chinese (zh)
Inventor
任子木
韩彬
Original Assignee
深圳市大疆创新科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司 filed Critical 深圳市大疆创新科技有限公司
Priority to PCT/CN2020/141848 priority Critical patent/WO2022141321A1/en
Publication of WO2022141321A1 publication Critical patent/WO2022141321A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead

Definitions

  • the present application relates to the technical field of DSP, and in particular, to a DSP processor and a parallel computing method thereof.
  • Parallel computing refers to the process of using multiple computing resources to solve computing problems at the same time, and it is an effective means to improve the computing speed and processing capacity of computer systems.
  • DSP (Digital Signal Processor) and other processors can realize parallel computing.
  • the current DSP processor cannot support the operation of complex instructions very well.
  • the present application provides a DSP processor and a parallel computing method thereof, aiming at solving the technical problems that the current DSP processor cannot well support the operation of complex instructions.
  • an embodiment of the present application provides a DSP processor, where the DSP processor includes a data bus connected to a data memory and a program bus connected to a program memory;
  • the DSP processor further includes a control circuit and a data processing circuit, and the control circuit includes:
  • a read request generating circuit configured to output a read data request to the data bus according to an operation instruction, where the operation instruction is obtained from the program memory through the program bus, and the read data request is used by the data processing circuit to obtain the data to be processed from the data memory through the data bus;
  • a first decoding circuit configured to perform first-level decoding on the operation instruction, and transmit the first control signal obtained by the first-level decoding to the data processing circuit;
  • a second decoding circuit configured to perform second-level decoding on the first control signal, and transmit the second control signal obtained by the second-level decoding to the data processing circuit;
  • the data processing circuit is configured to perform an operation on the data to be processed according to the first control signal and the second control signal, and store the operation result to the data memory through the data bus;
  • the first control signal is uniquely determined by the operation instruction, and the second control signal is used to determine the processing flow of the data processing circuit according to the operation instruction.
  • an embodiment of the present application provides a parallel computing method for a DSP processor, where the DSP processor includes a data bus connected to a data memory, a program bus connected to a program memory, a control circuit and a data processing circuit, wherein the control circuit includes a read request generating circuit, a first decoding circuit and a second decoding circuit;
  • the parallel computing method includes:
  • the read request generation circuit outputs a read data request to the data bus according to an operation instruction, the operation instruction is obtained from the program memory through the program bus, and the read data request is used by the data processing circuit to obtain the data to be processed from the data memory through the data bus;
  • the first decoding circuit performs first-level decoding on the operation instruction, and transmits the first control signal obtained by the first-level decoding to the data processing circuit;
  • the second decoding circuit performs second-level decoding on the first control signal, and transmits the second control signal obtained by the second-level decoding to the data processing circuit;
  • the data processing circuit performs an operation on the data to be processed according to the first control signal and the second control signal, and stores the operation result to the data memory through the data bus;
  • the first control signal is uniquely determined by the operation instruction, and the second control signal is used to determine the processing flow of the data processing circuit according to the operation instruction.
  • the embodiments of the present application provide a DSP processor and a parallel computing method thereof. By providing a first decoding circuit and a second decoding circuit in the control circuit of the DSP processor, the structure is clearly divided and the mapping of complex instructions is convenient. The separation of the first control signal and the second control signal makes the DSP processor more scalable and better able to support the operation of complex instructions. For example, when a new operation instruction needs to be added, the overall architecture of the DSP processor does not need to be adjusted; only the corresponding control signals need to be added, that is, the first control signal and the second control signal corresponding to the operation instruction are determined.
  • FIG. 1 is a schematic block diagram of a DSP processor provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a data processing circuit pipeline
  • FIG. 3 is a schematic structural diagram of a multiplier circuit in an embodiment
  • FIG. 4 is a schematic diagram of a data processing circuit reading data
  • FIG. 5 is a schematic diagram of a sliding window operation in one embodiment
  • FIG. 6 is a schematic diagram of a sliding window operation in another embodiment
  • FIG. 7 is a schematic diagram of a sliding window operation in yet another embodiment
  • FIG. 8 is a schematic flowchart of a parallel computing method for a DSP processor provided by an embodiment of the present application.
  • reference numerals: 10, data bus; 11, data memory; 20, program bus; 21, program memory; 30, configuration bus; 110, control circuit; 111, read request generation circuit; 112, first decoding circuit; 113, second decoding circuit; 114, configuration bus interface circuit; 120, data processing circuit; 121, data loading circuit; 122, preprocessing circuit; 123, multiplier circuit; 124, accumulation circuit; 125, post-processing circuit.
  • FIG. 1 is a schematic block diagram of a DSP processor provided by an embodiment of the present application.
  • the core idea of a DSP processor is SIMD (Single Instruction, Multiple Data) parallel computing. Compared with general-purpose processors (such as a CPU, Central Processing Unit), DSP processors can provide stronger parallel computing capabilities.
  • the vector execution circuit is the core processing circuit inside the DSP processor.
  • the vector execution circuit is used to execute vector instructions. For example, the addition of two vectors can be completed according to the vector instructions, with a high degree of parallelism and strong computing power.
  • common applications of DSP processors include vector multiply-accumulate operations, and operations such as image filtering and feature extraction can be mapped to vector multiply-accumulate operations.
  • the computing capability of the DSP processor can be measured by the multiply-accumulate parallelism that the DSP processor can provide for different data types.
  • the vector execution circuit of the DSP processor needs to support a variety of instructions, such as VMADD, VMADDS, VMADDC for instructing addition operations, and VMMUL, VMMULS, VMMULC for instructing multiplication operations, but of course not limited to this.
  • the symbol types of input data also have various combinations: uchar+char, uchar +uchar, etc., where uchar represents an unsigned byte type and char represents a signed byte type. Therefore, the vector execution circuit needs to support a variety of instructions, and each instruction contains a mixture of multiple data types and multiple symbol types. How to map these instructions to the hardware structure and how to implement these complex instructions with the smallest area is a
  • the vector execution circuit of the DSP processor includes a control circuit 110 and a data processing circuit 120 , wherein the control circuit 110 may be referred to as a control path (ctrl_path), and the data processing circuit 120 may be referred to as a data_path.
  • the DSP processor further includes a data bus 10 connected to the data memory 11 and a program bus 20 connected to the program memory 21 .
  • the data bus 10 includes, for example, a data crossbar inside the DSP processor.
  • control circuit 110 includes: a read request generating circuit 111 , a first decoding circuit 112 and a second decoding circuit 113 .
  • the read request generation circuit 111 (for example, called cbar_req_gen) is used for outputting a read data request to the data bus 10 according to the operation instruction.
  • the operation instruction is obtained from the program memory 21 through the program bus 20 .
  • a read data request may include a source address, a destination address, a read request valid signal, and an amount of data.
  • the read data request includes a field used to indicate that the issued read request is valid, such as read_vld, a field used to indicate the data amount of the read request, such as read_cnt, and a field used to indicate the source address and destination address, such as set_src /dst.
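  • as a minimal sketch, the fields listed above can be modeled as a small record. The field names read_vld, read_cnt and set_src/set_dst follow the text; the class name, the Python types and the helper make_read_request are hypothetical and only illustrate how such a request might be assembled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadDataRequest:
    """Hypothetical model of a read data request; the types are assumptions."""
    read_vld: bool  # the issued read request is valid
    read_cnt: int   # amount of data requested
    set_src: int    # source address
    set_dst: int    # destination address


def make_read_request(src: int, dst: int, count: int) -> ReadDataRequest:
    """Build a request; it is marked valid only when it asks for some data."""
    return ReadDataRequest(read_vld=count > 0, read_cnt=count,
                           set_src=src, set_dst=dst)


# Example: request 64 units of data starting at an assumed address 0x1000.
req = make_read_request(src=0x1000, dst=0x0, count=64)
```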
  • the control circuit 110 further includes a configuration bus interface circuit 114, and the configuration bus interface circuit 114 is configured to configure the read request generation circuit 111 according to the configuration signal of the configuration bus 30, so that the read request generation circuit 111, based on that configuration, outputs a read data request to the data bus 10 according to the operation instruction.
  • the configuration bus 30 is, for example, a configuration bus 30 (crf bus) inside the DSP processor.
  • the configuration bus interface circuit 114 (referred to as crf_reg, for example) can configure the read request generation circuit 111 according to the configuration signal of the configuration bus 30 .
  • the configuration bus interface circuit 114 is further configured to output a register configuration signal to the data processing circuit 120 .
  • the configuration bus interface circuit 114 outputs some register configuration signals to the data path according to the configuration signals of the configuration bus 30 , and such register configuration signals may be uniformly prefixed by rctrl_.
  • the read request generating circuit 111 can configure the value of the corresponding register according to the register configuration signal of the configuration bus interface circuit 114, and perform a data request according to the configured value.
  • the first decoding circuit 112 (for example, called static_ctrl_gen) is used to perform first-level decoding on the operation instruction, and transmit the first control signal obtained by the first-level decoding to the data processing circuit 120 .
  • the first control signal may include a control signal that remains unchanged throughout the execution of the instruction. It can be understood that the first control signal is uniquely determined by the operation instruction.
  • the first control signal may be prefixed by sctrl_.
  • the first control signal includes at least one of a symbol type, an instruction type, and a data extension type.
  • the first control signal can be transmitted to the data processing circuit 120, for example, it can be used to control the data processing circuit 120 to determine the type of data to be processed, and the like.
  • the first control signal can also be transmitted to the second decoding circuit 113 (for example, called dynamic_ctrl_gen) for secondary decoding.
  • the second decoding circuit 113 is configured to perform second-level decoding on the first control signal, and transmit the second control signal obtained by the second-level decoding to the data processing circuit 120 .
  • the second control signal is used to determine the processing flow of the data processing circuit 120 according to the operation instruction, and the data processing circuit 120 performs operation on the data to be processed according to the second control signal.
  • the second control signal is a control signal that varies with the flow of the data stream.
  • the second control signal mainly controls the operation of each pipeline stage in the data processing circuit 120, and controls when the pipeline of each stage is turned on and off.
  • the second control signal may be prefixed by dctrl_.
  • the second decoding circuit 113 starts to perform second-level decoding on the first control signal when receiving the read data valid signal, and transmits the second control signal obtained by the second-level decoding to data processing circuit 120.
  • the read data valid signal is used to indicate that the data processing circuit 120 has acquired the data to be processed.
  • the data processing circuit 120 is configured to perform an operation on the data to be processed according to the first control signal and the second control signal, and store the operation result to the data memory 11 through the data bus 10.
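  • the two-level decoding described above can be sketched in software as follows. This is only an illustration under stated assumptions: the function names static_ctrl_gen and dynamic_ctrl_gen and the sctrl_/dctrl_ prefixes come from the text, while the table contents, the individual signal names and the per-cycle model are hypothetical.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical instruction table; the encodings are assumptions, not from the source.
STATIC_TABLE: Dict[str, Dict[str, str]] = {
    "VMADD": {"sctrl_instr_type": "madd", "sctrl_sign_type": "uchar+char"},
    "VMMUL": {"sctrl_instr_type": "mul", "sctrl_sign_type": "uchar+uchar"},
}


def static_ctrl_gen(instruction: str) -> Dict[str, str]:
    """First-level decode: the result is uniquely determined by the instruction."""
    return dict(STATIC_TABLE[instruction])


def dynamic_ctrl_gen(sctrl: Dict[str, str],
                     read_data_valid: Iterable[bool]) -> Iterator[Dict[str, bool]]:
    """Second-level decode: one set of dctrl_ signals per cycle.

    The signals change with the data flow: stages are enabled only in
    cycles where the read data valid signal is asserted.
    """
    is_madd = sctrl["sctrl_instr_type"] == "madd"
    for vld in read_data_valid:
        yield {"dctrl_pipe_en": bool(vld), "dctrl_acc_en": bool(vld) and is_madd}


sctrl = static_ctrl_gen("VMADD")
dctrl = list(dynamic_ctrl_gen(sctrl, [False, True]))
```

  • in this sketch, supporting a new instruction only requires a new row in the table, which mirrors the scalability argument made in the text.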
  • the data processing circuit 120 includes: a data loading circuit 121 , a preprocessing circuit 122 , a multiplier circuit 123 , an accumulation circuit 124 and a post-processing circuit 125 .
  • the data loading circuit 121 is used to register the data to be processed; the preprocessing circuit 122 is used to perform addition and subtraction operations, multiplier data source allocation and/or taking absolute values; the multiplier circuit 123 is used to perform multiplication operations; the accumulation circuit 124 is used to perform the accumulation of data; the post-processing circuit 125 is used for truncating, saturating and/or rounding the data after the operation, and storing the processed data, that is, the operation result, to the data memory 11 through the data bus 10.
  • the data loading circuit 121 (which may be referred to as a load unit) is used to register data on the data exchange bus as data to be processed.
  • the data loading circuit 121 can support three sets of 512-bit wide data input, such as master_a_rdata, master_b_rdata, master_c_rdata, and can buffer the data read back from the three data ports (ports) a/b/c.
  • the number of pipeline stages of the data loading circuit 121 is one stage, such as the pipeline stages marked ireg0 , ireg1 , and ireg2 correspondingly.
  • a read data valid signal may be sent to the second decoding circuit 113, so that the second decoding circuit 113 starts to output the second control signal to the data processing circuit 120.
  • the second control signal is used to control the states of a plurality of pipeline stages in the data processing circuit 120 .
  • the multiple pipeline stages in the data processing circuit 120 can be referred to in FIG. 2 .
  • the second control signal is used to control the data processing circuit 120 to process the data to be processed in a pipeline manner.
  • the preprocessing circuit 122 (may be referred to as a pre_proc unit) is used for data preprocessing (pre_proc) operations, performing addition and subtraction operations before multiplication, and performing preprocessing such as assignment of multiplier data sources and taking absolute values.
  • the number of pipeline stages of the preprocessing circuit 122 is two, for example the pipeline stages labeled 0 and 1.
  • the multiplier circuit 123 (which may be referred to as a multi unit) is used to perform a multiplication (mult) operation, and a truncation operation after the multiplication.
  • the number of pipeline stages of the multiplier circuit 123 is four, such as pipeline stages numbered 2 to 5 correspondingly.
  • the accumulation circuit 124 (which may be referred to as an acc unit) is used to perform an accumulation (acc) operation of data, and a tree-shaped accumulation (tree_add) operation of multiple accumulators in the accumulation circuit 124, such as 64 accumulators.
  • the number of pipeline stages of the accumulating circuit 124 is seven, such as pipeline stages numbered 6 to 13.
  • the post-processing circuit 125 (which may be referred to as a post_proc unit) is used to truncate, saturate, and round the calculated data, select the data, and output the selected data to the corresponding data port of the data bus 10.
  • the operation result output by the post-processing circuit 125 such as master_a_wdata, may also have a data bit width of 512 bits, and the operation result may be stored in the data memory 11 through the data bus 10 .
  • the first control signal is fixed when processing the operation instruction, and the second control signal is dynamically changed when the operation instruction is processed.
  • the symbol type, instruction type, data extension type, etc. corresponding to the operation instruction are all the same, so the first control signal can be generated according to the fixed information when the operation instruction is executed;
  • the corresponding operations are decomposed into basic operations such as multiplication and addition that can be performed by hardware multipliers and adders, etc., and these basic operations can be processed in a pipelined manner on multiple pipeline stages in the data processing circuit 120.
  • the second control signal controls the operation of each pipeline stage, and controls when each pipeline stage is opened and closed, so as to realize the operation corresponding to the operation instruction.
  • the second decoding circuit 113 is configured to control the data loading circuit 121, the preprocessing circuit 122, the multiplier circuit 123, the accumulating circuit 124, and the post-processing circuit 125 in the data processing circuit 120 to perform corresponding data processing in a pipeline manner.
  • the second decoding circuit 113 includes a data loading control circuit (may be called load_stg), a preprocessing control circuit (may be called pre_stg), a multiplication control circuit (may be called mult_stg), an accumulation control circuit (may be called acc_stg), and a post-processing control circuit (may be called post_stg).
  • the data loading control circuit, preprocessing control circuit, multiplication control circuit, accumulation control circuit, and post-processing control circuit respectively control the data loading circuit 121, the preprocessing circuit 122, the multiplier circuit 123, the accumulation circuit 124, and the post-processing circuit 125 in the data processing circuit 120 to perform the corresponding data processing.
  • the structure is clearly divided, which facilitates the mapping of complex instructions, and the first control signal and the second control signal are separated, so that the DSP processor is highly extensible and can better support the operation of complex instructions. For example, when a new operation instruction needs to be added, the overall architecture of the DSP processor does not need to be adjusted; only the corresponding control signals need to be added, that is, the first control signal and the second control signal corresponding to the operation instruction are determined.
  • the data processing circuit 120 is used to obtain multiple sets of data to be processed from the data bus 10 in parallel.
  • the data loading circuit 121 can support three sets of 512-bit wide data inputs, such as master_a_rdata, master_b_rdata, master_c_rdata, and can buffer the data read back from the three data ports a/b/c.
  • the second decoding circuit 113 is configured to transmit the second control signal obtained by decoding to the data processing circuit 120 when the acquisition of multiple groups of data to be processed is completed, and to pause transmitting the decoded second control signal to the data processing circuit 120 when the acquisition of the multiple groups of data to be processed is not completed. Therefore, the entire pipeline can be correctly controlled when the data loading circuit 121 has not acquired the complete data to be processed. This prevents data processing errors when a conflict while reading from the data memory 11 keeps the data to be processed from being returned within a preset period of time. In addition, it is not necessary to reorganize potentially conflicting data into steps that guarantee conflict-free reads, which improves the efficiency of instruction execution.
  • the second decoding circuit 113 is configured to transmit the second control signal obtained by decoding to the data processing circuit 120 when receiving the read data valid signals corresponding to the multiple groups of data to be processed, where each read data valid signal is obtained when the acquisition of the corresponding group of data to be processed is completed.
  • the three pipeline stages marked ireg0 , ireg1 and ireg2 correspond to the three data ports a/b/c of the data loading circuit 121 one-to-one.
  • the three pipeline stages of ireg0, ireg1, and ireg2 can be used to indicate whether the read data on the corresponding data port is valid.
  • the second decoding circuit 113 transmits the decoded second control signal to the data processing circuit 120, and the pipeline starts to flow.
  • the second control signal mainly controls the operation of each pipeline stage of the data processing circuit 120. If the read data of one of the three data ports arrives late due to a conflict or other reasons, the second decoding circuit 113 waits until the read data on all three data ports is valid, and then transmits the decoded second control signal to the data processing circuit 120, and the pipeline begins to flow.
  • the second decoding circuit 113 waits until the 5th clock cycle, finds that all three groups of data are ready, and then controls the pipeline to start flowing again, so that the data processing circuit 120 processes the data to be processed in a pipeline manner.
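  • the waiting behavior described above can be illustrated with a small cycle-level model. This is a sketch under assumptions: the port names a/b/c follow the text, but the arrival cycles, the function name pipeline_start_cycle and the idea of representing each port by the cycle in which its data becomes valid are hypothetical.

```python
from typing import Dict, Optional


def pipeline_start_cycle(arrival: Dict[str, int],
                         max_cycles: int = 64) -> Optional[int]:
    """Return the first cycle in which read data on every port is valid.

    `arrival` maps a port name to the cycle in which its read data becomes
    valid (an assumed model). The pipeline is held until all ports are valid.
    """
    for cycle in range(max_cycles):
        valid = {port: cycle >= t for port, t in arrival.items()}
        if all(valid.values()):
            return cycle
    return None  # data never became valid within the simulated window


# Assumed example: port c is delayed by a conflict until cycle 5.
start = pipeline_start_cycle({"a": 1, "b": 1, "c": 5})
```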
  • the second decoding circuit 113 suspends transmitting the decoded second control signal to the data processing circuit 120 until the operation result passes through the data bus.
  • the second decoding circuit 113 transmits the decoded second control signal to the data processing circuit 120 .
  • the post-processing circuit 125 can store the operation result in the data memory 11. If a conflict occurs at the data memory 11 and it cannot receive the data, the second decoding circuit 113 suspends transmitting the second control signal to the data processing circuit 120, suspends the execution of all pipeline stages, waits for the external buffer to receive the written data, and then restarts the execution of the pipeline. In this way the second decoding circuit 113 correctly controls the entire pipeline and prevents data processing errors when the data memory 11 cannot receive the operation results in time. It is also unnecessary to temporarily store the operation results in a separate storage space and then move them to the data memory 11 afterwards, so the efficiency of instruction execution can be improved.
  • the multiplier circuit 123 in the data processing circuit 120 includes a four-stage pipeline operation circuit, wherein the first-stage pipeline operation circuit is used to perform the 16bit×8bit unsigned multiplication operation; the second-stage pipeline operation circuit is used to perform the 16bit×16bit multiplication and splicing operation and the 32bit×16bit multiplication and splicing operation; the third-stage pipeline operation circuit is used to perform the 32bit×32bit multiplication and splicing operation; and the fourth-stage pipeline operation circuit is used to perform shift processing on the result of the 32bit×16bit multiplication and splicing operation and/or the result of the 32bit×32bit multiplication and splicing operation.
  • the multiplier circuit 123 can implement the operations shown in Table 2 through multiplier splicing:
  • the preprocessing circuit 122 in the data processing circuit 120 takes the absolute value of the signed data, and the third-stage pipeline operation circuit combines the sign bit of the signed data with the result of the 32bit×16bit multiplication and splicing operation or the result of the 32bit×32bit multiplication and splicing operation (comb sign).
  • the first-stage pipeline operation circuit is used to calculate the unsigned multiplication operation of 16bit×8bit;
  • the second-stage pipeline operation circuit includes two-stage addition: the first-stage addition completes the multiplication and splicing of 16bit×16bit, and the second-stage addition completes the multiplication and splicing of 32bit×16bit;
  • the third-stage pipeline operation circuit includes one-stage addition, completes the multiplication and splicing of 32bit×32bit, and finally combines the unsigned result with the sign bit;
  • the fourth-stage pipeline operation circuit completes the shift function for the 32bit×16bit and 32bit×32bit results, and includes two shifters.
  • the input data of the first shifter is 64 bits, the shift range is from a right shift by 32 bits to a right shift by 0 bits, and the output data bit width is 48 bits;
  • the input data of the second shifter is 48 bits, the shift range is 16 bits to the right to 0 bits to the right, and the bit width of the output data is 48 bits.
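  • the multiplier splicing described above can be checked with a short arithmetic sketch. It only demonstrates that wider products can be assembled from 16bit×8bit unsigned products by shifting and adding, and that the sign can be recombined afterwards. The function names, which operand is split at each step, and the use of a single shift_right helper for both shifters are assumptions, not details taken from the source.

```python
def mul16x8(a: int, b: int) -> int:
    """Stage 1: base unsigned 16bit x 8bit multiplication."""
    assert 0 <= a < (1 << 16) and 0 <= b < (1 << 8)
    return a * b


def mul16x16(a: int, b: int) -> int:
    """Stage 2, first addition: splice two 16x8 products into 16x16."""
    return mul16x8(a, b & 0xFF) + (mul16x8(a, b >> 8) << 8)


def mul32x16(a: int, b: int) -> int:
    """Stage 2, second addition: splice two 16x16 products into 32x16."""
    return mul16x16(a & 0xFFFF, b) + (mul16x16(a >> 16, b) << 16)


def mul32x32(a: int, b: int) -> int:
    """Stage 3: splice two 32x16 products into 32x32."""
    return mul32x16(a, b & 0xFFFF) + (mul32x16(a, b >> 16) << 16)


def signed_mul32x32(a: int, b: int) -> int:
    """Take absolute values first, multiply unsigned, then combine the sign."""
    magnitude = mul32x32(abs(a), abs(b))
    return -magnitude if (a < 0) != (b < 0) else magnitude


def shift_right(value: int, shift: int, max_shift: int, out_bits: int = 48) -> int:
    """Stage 4 shifter model: right shift by 0..max_shift, keep out_bits bits."""
    assert 0 <= shift <= max_shift
    return (value >> shift) & ((1 << out_bits) - 1)
```

  • for example, shift_right(p, 32, max_shift=32) models the first shifter on a 64-bit product p, and max_shift=16 models the second shifter on a 48-bit input.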
  • the data processing circuit 120 may read two 512-bit image data and one 512-bit coefficient from the data memory 11 each time.
  • the image data to be processed is stored in the ireg0 cached on-chip.
  • the coefficients are stored in ireg1 in the on-chip cache
  • the image data to be processed includes a(0), a(1), a(2)...a(127) in the first line
  • the coefficients include, for example, b(0), b(1), b(2) in the first line.
  • 128 unsigned multipliers such as a char ⁇ char multiplier
  • step 1 image operations, such as convolution operations
  • the multiplier circuit 123 needs to complete the following operations, where step refers to the interval between two adjacent pixels inside a single window, and stride is the number of pixels by which the window moves each time it slides.
  • the preprocessing circuit 122 is used to select the data source before the multiplier. For example, the data is arranged according to the above operation modes corresponding to step and stride, and then sent to the multiplier circuit 123 for multiplication and addition operations.
  • the preprocessing circuit 122 can realize the permutation and combination of multiplier data sources when step and stride are various situations, so as to realize the DSP processor's support for sliding window operation when step and stride are various numerical values.
  • the operation of the second row can be completed by referring to the operation process of the first row
  • the image data to be processed and the third row of the image data to be processed can be read.
  • the operation of the third row can be completed by referring to the operation process of the first row.
  • the multiplier circuit 123 can perform 64 multiplication and addition operations of 8bit ⁇ 8bit+8bit ⁇ 8bit, and can also perform 64 multiplication operations of 16bit ⁇ 16bit.
  • the accumulating circuit 124 completes the accumulating operation, and the output data of the multiplier circuit 123 is sent directly to the accumulator for accumulation. When all the coefficients of a coefficient matrix have been processed, the data in the accumulator is written out to the on-chip cache of the DSP processor.
  • the post-processing circuit 125 completes the post-processing of the output data, such as the interception of the output data.
  • the data loading circuit 121 in the data processing circuit 120 may be used to perform data sliding processing of sliding window instructions.
  • the data loading circuit 121 registers the data on the data exchange bus as the data to be processed, and performs data sliding processing of the sliding window instruction.
  • the input of the sliding window operation is an image data matrix to be processed and a coefficient matrix.
  • the operation of the sliding window operation is that the coefficient matrix slides on the image, and all the pixels and coefficients in the sliding area are multiplied and accumulated, and finally an output data is obtained.
  • for example, for a 258x258 image matrix and a 3x3 coefficient matrix, after the coefficient matrix is slid over the image matrix, a 256x256 output matrix is finally obtained.
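  • a plain software reference for the sliding window operation is sketched below, to make the multiply-accumulate pattern and the output size concrete. It assumes a square coefficient matrix and a step of 1 (adjacent pixels inside the window); the function names are hypothetical, and the code says nothing about how the hardware schedules these operations.

```python
from typing import List

Matrix = List[List[int]]


def output_size(image_size: int, window: int, stride: int = 1) -> int:
    """Number of window positions along one dimension."""
    return (image_size - window) // stride + 1


def sliding_window(image: Matrix, coeff: Matrix, stride: int = 1) -> Matrix:
    """Slide coeff over image; multiply and accumulate at each position."""
    k = len(coeff)  # assumes a square k x k coefficient matrix
    rows = output_size(len(image), k, stride)
    cols = output_size(len(image[0]), k, stride)
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0
            for i in range(k):
                for j in range(k):
                    acc += image[r * stride + i][c * stride + j] * coeff[i][j]
            out[r][c] = acc
    return out


# Small check: a 4x4 image of ones with a 3x3 kernel of ones.
small = sliding_window([[1] * 4 for _ in range(4)], [[1] * 3 for _ in range(3)])
```

  • each output element of a 3x3 window needs 9 multiply-accumulate operations, so a 256x256 output needs 256 × 256 × 9 of them, which is why the text stresses hardware support for this pattern.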
  • the sliding window operation includes a large number of multiply-accumulate operations. If the multiply-accumulate operation of a general-purpose processor is used, it needs to be called many times, and the instructions are scheduled at the software level. Because software has poor real-time performance, the real-time performance of this implementation is very low. In addition, this implementation needs to read and write the on-chip cache many times, and reading and writing the on-chip cache consumes a lot of power, so the power consumption of this implementation is high.
  • the MAC (multiply-accumulate) utilization of ordinary multiply-accumulate operations in general-purpose processors is usually not high: for a processor with a 10-bit data bus width of 512 bits, it is necessary to process the char type. For multiplication and accumulation, only 64 char type values can be loaded in the same clock cycle, while the processor can process up to 128 char type multiplication and accumulation, and the MAC utilization rate is only 50%.
  • O[0,1], O[1,0], etc. can also be calculated through the sliding window.
  • the data loading circuit 121, the preprocessing circuit 122, the multiplier circuit 123, the accumulating circuit 124, and the post-processing circuit 125 in the data processing circuit 120 use a fully pipelined design; each circuit operates independently and in parallel, so the DSP processor achieves higher performance. Reading two pieces of image data and sliding the window inside the circuit greatly reduces the number of on-chip cache reads, which keeps power consumption low.
  • in the DSP processor provided by the embodiments of the present application, the first decoding circuit and the second decoding circuit are arranged in the control circuit of the DSP processor, so the architecture is clearly partitioned, complex instructions are easy to map, and the first control signal and the second control signal can be separated. This makes the DSP processor more scalable and better able to support the operation of complex instructions. For example, when a new operation instruction needs to be added, the overall architecture of the DSP processor does not need to be adjusted; only the corresponding control signals, such as the first control signal and the second control signal for that operation instruction, need to be determined.
  • FIG. 8 is a schematic flowchart of a parallel computing method for a DSP processor provided by another embodiment of the present application.
  • the parallel computing method is applied in a DSP processor.
  • the DSP processor includes a data bus connected to the data memory, a program bus connected to the program memory, and a control circuit and a data processing circuit, wherein the control circuit includes a read request generating circuit, a first decoding circuit and a second decoding circuit.
  • the parallel computing method of this embodiment includes steps S210 to S240.
  • the read request generation circuit outputs a read data request to the data bus according to an operation instruction, the operation instruction is obtained from the program memory through the program bus, and the read data request is used by the data processing circuit to obtain data to be processed from the data memory through the data bus;
  • the first decoding circuit performs first-level decoding on the operation instruction, and transmits the first control signal obtained by the first-level decoding to the data processing circuit;
  • the second decoding circuit performs second-level decoding on the first control signal, and transmits the second control signal obtained by the second-level decoding to the data processing circuit;
  • the data processing circuit performs an operation on the data to be processed according to the first control signal and the second control signal, and stores the operation result to the data memory through the data bus;
  • the first control signal is uniquely determined by the operation instruction, and the second control signal is used to determine the processing flow of the data processing circuit according to the operation instruction.
  • the method further includes: the configuration bus interface circuit of the control circuit configures the read request generation circuit according to the configuration signal of the configuration bus, so that the read request generation circuit, based on the configuration, outputs a read data request to the data bus according to the operation instruction.
  • the read data request includes a source address, a target address, a read request valid signal, and a data amount.
  • the method further includes:
  • the configuration bus interface circuit outputs a register configuration signal to the data processing circuit.
  • the first control signal is fixed when the operation instruction is processed, and the second control signal changes dynamically when the operation instruction is processed.
  • the second control signal is used to control the states of a plurality of pipeline stages in the data processing circuit.
  • the data processing circuit includes:
  • a data loading circuit for registering the data to be processed
  • the post-processing circuit is used for truncating, saturating and/or rounding the data after the operation, and storing the processed data to the data memory through the data bus.
  • the second control signal controls the data loading circuit, the preprocessing circuit, the multiplier circuit, the accumulation circuit, and the post-processing circuit in the data processing circuit to perform corresponding data processing in a pipeline manner.
  • the data processing circuit is configured to obtain multiple sets of data to be processed from the data bus in parallel;
  • the second decoding circuit transmits the second control signal obtained by decoding to the data processing circuit, and when the acquisition of the multiple groups of data to be processed is not completed, the second decoding circuit suspends transmitting the decoded second control signal to the data processing circuit.
  • the second decoding circuit transmits the second control signal obtained by decoding to the data processing circuit when receiving the read data valid signals corresponding to the multiple groups of data to be processed, and each read data valid signal is obtained when the to-be-processed data of the corresponding group has been acquired.
  • the second decoding circuit suspends transmitting the decoded second control signal to the data processing circuit until the operation result has been stored in the data memory through the data bus, and then resumes transmitting the decoded second control signal to the data processing circuit.
  • the multiplier circuit in the data processing circuit includes a four-stage pipeline operation circuit, wherein the first-stage pipeline operation circuit is used for performing 16bit×8bit unsigned multiplication operations, the second-stage pipeline operation circuit is used for performing 16bit×16bit multiplication and splicing operations and 32bit×16bit multiplication and splicing operations, the third-stage pipeline operation circuit is used for performing 32bit×32bit multiplication and splicing operations, and the fourth-stage pipeline operation circuit is used for shifting the results of the 32bit×16bit multiplication and splicing operations and/or the 32bit×32bit multiplication and splicing operations.
  • the preprocessing circuit in the data processing circuit takes the absolute value of the signed data
  • the third-stage pipeline operation circuit converts the signed data into an absolute value.
  • the sign bit of the data is combined with the result of the 32bit×16bit multiplication and splicing operation or the result of the 32bit×32bit multiplication and splicing operation.
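
The sliding window operation described in the list above (a coefficient matrix slid over an image, with one multiply-accumulate result per window position) can be sketched in plain Python. This is only an illustrative software model, not the circuit of the application: the function name `sliding_window` and the sample coefficient values are assumptions, and a real DSP would compute many window positions in parallel rather than in nested loops.

```python
def sliding_window(image, coeff):
    """Slide coeff over image and multiply-accumulate at each fully covered position."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(coeff), len(coeff[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            acc = 0  # one accumulator per output element
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * coeff[i][j]
            row.append(acc)
        out.append(row)
    return out


# A 258x258 image and a 3x3 coefficient matrix give a 256x256 output.
image = [[1] * 258 for _ in range(258)]
coeff = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]  # hypothetical coefficients
out = sliding_window(image, coeff)
print(len(out), len(out[0]))  # prints "256 256"
```

Each output element here costs 9 multiply-accumulates, so the full 256x256 output needs 589,824 of them, which is why issuing them one software call at a time is slow.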

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Advance Control (AREA)

Abstract

A DSP, comprising a data bus connected to a data memory and a program bus connected to a program memory, and further comprising a control circuit and a data processing circuit. The control circuit comprises: a read request generation circuit, which is used for acquiring, from the data memory, data to be processed; a first decoding circuit, which is used for performing first-stage decoding on an operation instruction and transmitting, to the data processing circuit, a first control signal obtained by means of the first-stage decoding; and a second decoding circuit, which is used for performing second-stage decoding on the first control signal and transmitting, to the data processing circuit, a second control signal obtained by means of the second-stage decoding. The data processing circuit is used for performing an operation on said data according to the first control signal and the second control signal, and for storing an operation result in the data memory by means of the data bus. The present application can support the operation of a complex instruction. Further provided is a parallel computing method for a DSP.

Description

DSP processor and its parallel computing method
Technical field
The present application relates to the technical field of DSP, and in particular to a DSP processor and a parallel computing method thereof.
Background
Parallel computing refers to the process of using multiple computing resources at the same time to solve a computing problem, and is an effective means of improving the computing speed and processing capacity of a computer system. Processors such as a DSP (Digital Signal Processor) can implement parallel computing. However, current DSP processors cannot support the operation of complex instructions well.
Summary of the invention
The present application provides a DSP processor and a parallel computing method thereof, aiming to solve technical problems such as the inability of current DSP processors to support the operation of complex instructions well.
In a first aspect, an embodiment of the present application provides a DSP processor, where the DSP processor includes a data bus connected to a data memory and a program bus connected to a program memory;
The DSP processor further includes a control circuit and a data processing circuit, and the control circuit includes:
a read request generation circuit, configured to output a read data request to the data bus according to an operation instruction, where the operation instruction is obtained from the program memory through the program bus, and the read data request is used by the data processing circuit to obtain data to be processed from the data memory through the data bus;
a first decoding circuit, configured to perform first-level decoding on the operation instruction, and transmit the first control signal obtained by the first-level decoding to the data processing circuit;
a second decoding circuit, configured to perform second-level decoding on the first control signal, and transmit the second control signal obtained by the second-level decoding to the data processing circuit;
The data processing circuit is configured to perform an operation on the data to be processed according to the first control signal and the second control signal, and store the operation result to the data memory through the data bus;
wherein the first control signal is uniquely determined by the operation instruction, and the second control signal is used to determine the processing flow of the data processing circuit according to the operation instruction.
In a second aspect, an embodiment of the present application provides a parallel computing method for a DSP processor, where the DSP processor includes a data bus connected to a data memory, a program bus connected to a program memory, a control circuit and a data processing circuit, and the control circuit includes a read request generation circuit, a first decoding circuit and a second decoding circuit;
The parallel computing method includes:
the read request generation circuit outputs a read data request to the data bus according to an operation instruction, where the operation instruction is obtained from the program memory through the program bus, and the read data request is used by the data processing circuit to obtain data to be processed from the data memory through the data bus;
the first decoding circuit performs first-level decoding on the operation instruction, and transmits the first control signal obtained by the first-level decoding to the data processing circuit;
the second decoding circuit performs second-level decoding on the first control signal, and transmits the second control signal obtained by the second-level decoding to the data processing circuit;
the data processing circuit performs an operation on the data to be processed according to the first control signal and the second control signal, and stores the operation result to the data memory through the data bus;
wherein the first control signal is uniquely determined by the operation instruction, and the second control signal is used to determine the processing flow of the data processing circuit according to the operation instruction.
The embodiments of the present application provide a DSP processor and a parallel computing method thereof. A first decoding circuit and a second decoding circuit are arranged in the control circuit of the DSP processor, so the architecture is clearly partitioned, complex instructions are easy to map, and the first control signal and the second control signal can be separated. This makes the DSP processor more scalable and better able to support the operation of complex instructions. For example, when a new operation instruction needs to be added, the overall architecture of the DSP processor does not need to be adjusted; only the corresponding control signals, such as the first control signal and the second control signal for that operation instruction, need to be determined.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the disclosure of the embodiments of the present application.
Description of drawings
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic block diagram of a DSP processor provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of the data processing circuit pipeline;
FIG. 3 is a schematic structural diagram of a multiplier circuit in an embodiment;
FIG. 4 is a schematic diagram of the data processing circuit reading data;
FIG. 5 is a schematic diagram of a sliding window operation in one embodiment;
FIG. 6 is a schematic diagram of a sliding window operation in another embodiment;
FIG. 7 is a schematic diagram of a sliding window operation in yet another embodiment;
FIG. 8 is a schematic flowchart of a parallel computing method for a DSP processor provided by an embodiment of the present application.
Description of reference numerals: 10, data bus; 11, data memory; 20, program bus; 21, program memory; 30, configuration bus; 110, control circuit; 111, read request generation circuit; 112, first decoding circuit; 113, second decoding circuit; 114, configuration bus interface circuit; 120, data processing circuit; 121, data loading circuit; 122, preprocessing circuit; 123, multiplier circuit; 124, accumulation circuit; 125, post-processing circuit.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application.
The flowcharts shown in the figures are for illustration only; they do not necessarily include all contents and operations/steps, nor must they be performed in the order described. For example, some operations/steps can be decomposed, combined or partially merged, so the actual execution order may change according to the actual situation.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The embodiments described below and the features in the embodiments may be combined with each other where they do not conflict.
Please refer to FIG. 1, which is a schematic block diagram of a DSP processor provided by an embodiment of the present application. The core idea of a DSP processor is SIMD (single instruction, multiple data) parallel computing. Compared with a general-purpose processor (such as a CPU, Central Processing Unit), a DSP processor can provide stronger parallel computing capability. The vector execution circuit is the core processing circuit inside the DSP processor. It executes vector instructions; for example, it can complete the addition of two vectors according to a vector instruction, with a high degree of parallelism and strong computing power. Exemplarily, typical applications of DSP processors include vector multiply-accumulate operations; operations such as image filtering and feature extraction can all be mapped to vector multiply-accumulate operations. In some embodiments, the computing capability of a DSP processor can be measured by the multiply-accumulate parallelism it provides for different data types.
Exemplarily, the vector execution circuit of the DSP processor needs to support a variety of instructions, such as VMADD, VMADDS and VMADDC for indicating addition operations and VMMUL, VMMULS and VMMULC for indicating multiplication operations, although it is not limited to these. Each instruction also needs to support several data types, such as 8bit with 16bit operations, 32bit with 16bit operations, and 32bit with 32bit operations, and the sign types of the input data also come in various combinations: uchar+char, uchar+uchar, and so on, where uchar denotes an unsigned byte type and char denotes a signed byte type. The vector execution circuit therefore needs to support many instructions, each involving mixed operations over multiple data types and multiple sign types. How to map these instructions onto the hardware structure, and how to implement these complex instructions with the smallest area, are key considerations when designing the vector execution circuit.
As shown in FIG. 1, the vector execution circuit of the DSP processor includes a control circuit 110 and a data processing circuit 120, where the control circuit 110 may be called the control path (ctrl_path) and the data processing circuit 120 may be called the data path (data_path).
Specifically, as shown in FIG. 1, the DSP processor further includes a data bus 10 connected to the data memory 11 and a program bus 20 connected to the program memory 21. The data bus 10 includes, for example, a data exchange bus (crossbar) inside the DSP processor.
Referring to FIG. 1, the control circuit 110 includes a read request generation circuit 111, a first decoding circuit 112 and a second decoding circuit 113.
The read request generation circuit 111 (for example, called cbar_req_gen) is used to output a read data request to the data bus 10 according to the operation instruction; the read data request is used by the data processing circuit 120 to obtain data to be processed from the data memory 11 through the data bus 10. Exemplarily, the operation instruction is obtained from the program memory 21 through the program bus 20.
In some embodiments, the read data request may include a source address, a target address, a read request valid signal and a data amount. For example, the read data request includes a field indicating that the issued read request is valid, such as read_vld; a field indicating the data amount of the read request, such as read_cnt; and fields indicating the source address and target address, such as set_src/dst.
Exemplarily, as shown in FIG. 1, the control circuit 110 further includes a configuration bus interface circuit 114, which is used to configure the read request generation circuit 111 according to the configuration signal of the configuration bus 30, so that the read request generation circuit 111, based on the configuration, outputs a read data request to the data bus 10 according to the operation instruction.
The configuration bus 30 is, for example, a configuration bus (crf bus) inside the DSP processor. The configuration bus interface circuit 114 (for example, called crf_reg) can configure the read request generation circuit 111 according to the configuration signal of the configuration bus 30.
Optionally, the configuration bus interface circuit 114 is further used to output register configuration signals to the data processing circuit 120.
Exemplarily, the configuration bus interface circuit 114 outputs some register configuration signals to the data path according to the configuration signal of the configuration bus 30; such register configuration signals may uniformly use rctrl_ as a prefix. The read request generation circuit 111 can set the values of the corresponding registers according to the register configuration signals of the configuration bus interface circuit 114, and issue data requests according to the configured values.
Specifically, the first decoding circuit 112 (for example, called static_ctrl_gen) is used to perform first-level decoding on the operation instruction, and transmit the first control signal obtained by the first-level decoding to the data processing circuit 120.
The first control signal may specifically include control signals that remain unchanged throughout the execution of the instruction. It can be understood that the first control signal is uniquely determined by the operation instruction. The first control signal may use sctrl_ as a prefix. The first control signal includes, for example, at least one of a sign type, an instruction type and a data extension type.
For example, Table 1 shows some examples of the first control signal.
Table 1: Examples of the first control signal
Figure PCTCN2020141848-appb-000001
Figure PCTCN2020141848-appb-000002
On the one hand, the first control signal can be transmitted to the data processing circuit 120, where it can be used, for example, to let the data processing circuit 120 determine the type of data to be processed.
On the other hand, the first control signal can also be transmitted to the second decoding circuit 113 (for example, called dynamic_ctrl_gen) for second-level decoding.
The second decoding circuit 113 is used to perform second-level decoding on the first control signal, and transmit the second control signal obtained by the second-level decoding to the data processing circuit 120.
Specifically, the second control signal is used to determine the processing flow of the data processing circuit 120 according to the operation instruction, and the data processing circuit 120 operates on the data to be processed according to the second control signal. Exemplarily, the second control signal is a control signal that changes as the data stream flows. Exemplarily, the second control signal mainly controls the operation of each pipeline stage in the data processing circuit 120, that is, when each pipeline stage is turned on and when it is turned off. Exemplarily, the second control signal may use dctrl_ as a prefix.
In some embodiments, the second decoding circuit 113 starts second-level decoding of the first control signal when it receives the read data valid signal, and transmits the second control signal obtained by the second-level decoding to the data processing circuit 120. The read data valid signal indicates that the data processing circuit 120 has acquired the data to be processed.
In some embodiments, the data processing circuit 120 is used to perform an operation on the data to be processed according to the first control signal and the second control signal, and store the operation result to the data memory 11 through the data bus 10.
In some embodiments, as shown in FIG. 1, the data processing circuit 120 includes a data loading circuit 121, a preprocessing circuit 122, a multiplier circuit 123, an accumulation circuit 124 and a post-processing circuit 125. The data loading circuit 121 is used to register the data to be processed; the preprocessing circuit 122 is used to perform addition and subtraction, multiplier data source allocation and/or absolute value operations; the multiplier circuit 123 is used to perform multiplication; the accumulation circuit 124 is used to accumulate data; the post-processing circuit 125 is used to truncate, saturate and/or round the data after the operation, and store the processed data, that is, the operation result, to the data memory 11 through the data bus 10.
Exemplarily, the data loading circuit 121 (which may be called the load unit) is used to register data on the data exchange bus as the data to be processed. Exemplarily, referring to FIG. 1, the data loading circuit 121 supports three 512-bit-wide data inputs, such as master_a_rdata, master_b_rdata and master_c_rdata, and can buffer the data read back on the three data ports a/b/c. Exemplarily, referring to FIG. 2, the data loading circuit 121 has one pipeline stage, for example the stage labeled ireg0, ireg1 and ireg2. Exemplarily, once the data loading circuit 121 has registered the complete data to be processed, it may send a read data valid signal to the second decoding circuit 113, so that the second decoding circuit 113 starts outputting the second control signal to the data processing circuit 120.
Exemplarily, the second control signal is used to control the states of multiple pipeline stages in the data processing circuit 120. For the pipeline stages in the data processing circuit 120, see FIG. 2. The second control signal is used to control the data processing circuit 120 to process the data to be processed in a pipelined manner.
Exemplarily, the preprocessing circuit 122 (which may be called the pre_proc unit) is used for data preprocessing (pre_proc) operations: it completes the addition and subtraction operations that precede multiplication, as well as preprocessing operations such as allocating the multiplier data sources and taking absolute values. Exemplarily, referring to FIG. 2, the preprocessing circuit 122 has two pipeline stages, for example the stages labeled 0 and 1.
Exemplarily, the multiplier circuit 123 (which may be called the multi unit) performs the multiplication (mult) operation and the truncation that follows multiplication. Referring to FIG. 2, the multiplier circuit 123 has four pipeline stages, for example the stages labeled 2 to 5.
Exemplarily, the accumulation circuit 124 (which may be called the acc unit) performs the accumulation (acc) operation on data, as well as the tree accumulation (tree_add) operation across multiple accumulators in the accumulation circuit 124, for example 64 accumulators. Referring to FIG. 2, the accumulation circuit 124 has seven pipeline stages, for example the stages labeled 6 to 13.
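The tree accumulation (tree_add) mentioned above can be pictured as a pairwise reduction. The following is a minimal software sketch, not the circuit itself; the function name `tree_add` is borrowed from the text, while the zero padding for odd-length levels and the returned level count are assumptions made for illustration.

```python
def tree_add(values):
    """Pairwise (tree) reduction; returns (total, number_of_adder_levels)."""
    level = list(values)
    if not level:
        return 0, 0
    depth = 0
    while len(level) > 1:
        # Pad odd-length levels with zero so every adder has two inputs.
        if len(level) % 2:
            level.append(0)
        # One level of the tree: add neighbouring pairs.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        depth += 1
    return level[0], depth


total, depth = tree_add(range(64))
print(total, depth)  # 2016 6
```

For 64 accumulators the reduction needs six levels of adders (64, 32, 16, 8, 4, 2, 1), which is one plausible reason for spreading the accumulation circuit over several pipeline stages.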
Exemplarily, the post-processing circuit 125 (which may be called the post_proc unit) truncates, saturates, and rounds the computed data, selects among the data, and outputs the selected data to the corresponding data port of the data bus 10. As shown in FIG. 1, the operation result output by the post-processing circuit 125, such as master_a_wdata, may also be 512 bits wide, and the operation result can be stored in the data memory 11 through the data bus 10.
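To make the post-processing steps concrete, here is a small sketch of rounding, truncation by a right shift, and saturation applied to one accumulator value. The parameter names (`shift`, `out_bits`, `signed`) and the choice of round-half-up are assumptions for illustration; the text does not specify the exact rounding or saturation rules.

```python
def post_process(acc, shift, out_bits, signed=True):
    """Round-half-up right shift followed by saturation to out_bits."""
    if shift > 0:
        # Add half of the discarded range, then drop the low bits.
        acc = (acc + (1 << (shift - 1))) >> shift
    if signed:
        lo, hi = -(1 << (out_bits - 1)), (1 << (out_bits - 1)) - 1
    else:
        lo, hi = 0, (1 << out_bits) - 1
    # Clamp to the representable range instead of wrapping around.
    return max(lo, min(hi, acc))


print(post_process(10, 2, 8))    # 3   (10 / 4 = 2.5 rounds up)
print(post_process(1000, 2, 8))  # 127 (250 saturates to the int8 maximum)
```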
In some embodiments, the first control signal stays fixed while an operation instruction is processed, and the second control signal changes dynamically while the operation instruction is processed. For example, while an operation instruction is processed, its sign type, instruction type, data extension type, and so on all stay the same, so the first control signal can be generated from the information that remains fixed during execution of the instruction. The operation corresponding to the instruction can be decomposed into basic operations, such as multiplications and additions, that hardware multipliers and adders can execute, and these basic operations can be processed in a pipelined manner on the multiple pipeline stages of the data processing circuit 120. The second control signal controls how each pipeline stage runs, including when each stage is turned on and off, so as to realize the operation corresponding to the operation instruction.
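The split between a static first control signal and a dynamic second control signal can be sketched in software as follows. This is only an illustrative model: the field names in `FirstControl`, the instruction dictionary keys, and the simplification of each circuit to a single stage are all assumptions, not details taken from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirstControl:
    """Static per-instruction fields (hypothetical layout)."""
    instr_type: str
    signed: bool
    data_ext: str


# One enable flag per circuit; real circuits span several register stages.
STAGES = ("load", "pre", "mult", "acc", "post")


def first_decode(instr):
    # First-level decoding: fixed for the whole lifetime of the instruction.
    return FirstControl(instr["type"], instr["signed"], instr["ext"])


def second_decode(n_beats):
    """Second-level decoding: yield per-cycle stage-enable flags."""
    n_cycles = n_beats + len(STAGES) - 1
    for cycle in range(n_cycles):
        # Stage number i is busy while one of the n_beats items passes through it.
        yield {s: 0 <= cycle - i < n_beats for i, s in enumerate(STAGES)}


ctrl = first_decode({"type": "mac", "signed": True, "ext": "none"})
schedule = list(second_decode(2))
print(ctrl.instr_type, len(schedule))  # mac 6
```

The frozen dataclass mirrors the idea that the first control signal never changes during an instruction, while the generator produces a different set of stage enables on every cycle.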
Exemplarily, the second decoding circuit 113 controls, in a pipelined manner, the data loading circuit 121, the preprocessing circuit 122, the multiplier circuit 123, the accumulation circuit 124, and the post-processing circuit 125 in the data processing circuit 120 to perform their respective data processing.
Exemplarily, as shown in FIG. 2, the second decoding circuit 113 includes a data loading control circuit (which may be called load_stg), a preprocessing control circuit (pre_stg), a multiplication control circuit (mult_stg), an accumulation control circuit (acc_stg), and a post-processing control circuit (post_stg). These five control circuits respectively control the data loading circuit 121, the preprocessing circuit 122, the multiplier circuit 123, the accumulation circuit 124, and the post-processing circuit 125 in the data processing circuit 120 to perform the corresponding data processing.
Providing the first decoding circuit 112 and the second decoding circuit 113 in the control circuit 110 of the DSP processor gives a clear architectural division and makes complex instructions easier to map. The first control signal and the second control signal are separated, so the DSP processor is highly extensible and can better support complex instructions. For example, when a new operation instruction needs to be added, the overall architecture of the DSP processor does not need to be adjusted; only the corresponding control signals need to be added, that is, the first control signal and the second control signal for that instruction need to be determined.
In some embodiments, the data processing circuit 120 obtains multiple groups of data to be processed from the data bus 10 in parallel. Referring to FIG. 1, the data loading circuit 121 can support three 512-bit-wide data inputs, such as master_a_rdata, master_b_rdata, and master_c_rdata, and can buffer the data read back on the three data ports a/b/c.
Exemplarily, the second decoding circuit 113 transmits the decoded second control signal to the data processing circuit 120 once all of the groups of data to be processed have been obtained, and suspends transmitting it while any group has not yet been obtained. The entire pipeline can therefore be controlled correctly even when the data loading circuit 121 has not obtained the complete data to be processed. This prevents the data processing errors that would otherwise occur when, for example, a conflict while reading from the data memory 11 keeps the data to be processed from returning within the preset time period. It also removes the need for a step that first rearranges potentially conflicting data so that reads are guaranteed not to conflict, which improves instruction execution efficiency.
Exemplarily, the second decoding circuit 113 transmits the decoded second control signal to the data processing circuit 120 when it has received the read data valid signal for each of the groups of data to be processed; a read data valid signal is produced when the corresponding group of data has been completely obtained. Referring to FIG. 1 and FIG. 2, the three pipeline stages marked ireg0, ireg1, and ireg2 correspond one-to-one to the three data ports a/b/c of the data loading circuit 121, and can be used to indicate whether the read data on the corresponding data port is valid. When the read data on all three data ports is valid, the second decoding circuit 113 transmits the decoded second control signal to the data processing circuit 120 and the pipeline starts to flow; the second control signal mainly controls how each pipeline stage in the data processing circuit 120 runs. If, because of a conflict or for some other reason, the read data on one of the three data ports arrives late, the second decoding circuit 113 waits until the read data on all three data ports is valid before transmitting the decoded second control signal to the data processing circuit 120 and starting the pipeline. For example, if the data for ireg0 and ireg1 becomes valid in the 2nd clock cycle and the data for ireg2 becomes valid in the 5th clock cycle, the second decoding circuit 113 waits until the 5th clock cycle, finds that all three pieces of data are ready, and only then starts the pipeline, so that the data processing circuit 120 processes the data to be processed in a pipelined manner.
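The gating rule in the example above reduces to waiting for the slowest port. A minimal sketch, assuming the arrival cycle of each port's read data valid signal is known (the function names are hypothetical):

```python
def issue_second_control(valid_flags):
    """True only when ireg0, ireg1 and ireg2 all hold valid read data."""
    return all(valid_flags)


def pipeline_start_cycle(valid_cycle_by_port):
    """Cycle in which every port's read data valid signal has been raised."""
    return max(valid_cycle_by_port.values())


print(pipeline_start_cycle({"ireg0": 2, "ireg1": 2, "ireg2": 5}))  # 5
print(issue_second_control([True, True, False]))                   # False
```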
In some embodiments, while the operation result has not been stored to the data memory 11 through the data bus 10, the second decoding circuit 113 suspends transmitting the decoded second control signal to the data processing circuit 120; once the operation result has been stored to the data memory 11 through the data bus 10, the second decoding circuit 113 resumes transmitting the decoded second control signal to the data processing circuit 120.
Exemplarily, the post-processing circuit 125 can store the operation result in the data memory 11. If a conflict occurs at the data memory 11 and it cannot accept data, the second decoding circuit 113 suspends transmitting the second control signal to the data processing circuit 120 and pauses all pipeline stages; execution of the pipeline resumes once the external cache can accept the written data. In this way the second decoding circuit 113 controls the entire pipeline correctly and prevents data processing errors when the data memory 11 cannot accept the operation result in time. Nor is it necessary to store the operation result temporarily in some other storage space and then, after that store completes, rearrange it into the data memory 11, so instruction execution efficiency improves.
It can be understood that, through the input/output data handshake, the pipeline inside the vector execution circuit can be stopped immediately when the outside cannot accept data, and when the external data source is not yet ready the vector execution circuit waits for it to become ready before executing. This gives both high operational accuracy and high operational efficiency.
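The output side of this handshake can be illustrated with a toy cycle-level simulation in which every stage freezes whenever the sink cannot accept a result. This is a sketch under simplifying assumptions: a generic `depth`-stage pipeline, one item per stage, and a hypothetical `sink_ready` callback that stands in for the data memory's ability to accept a write.

```python
def run_pipeline(inputs, depth, sink_ready):
    """Simulate a depth-stage pipeline that freezes whenever the sink is busy.

    inputs     -- items to process, in order
    sink_ready -- function cycle -> bool, True if the memory can accept a write
    Returns (outputs, cycles_used).
    """
    stages = [None] * depth
    pending = list(inputs)
    outputs = []
    cycle = 0
    while pending or any(s is not None for s in stages):
        if stages[-1] is not None and not sink_ready(cycle):
            cycle += 1          # stall: every stage keeps its contents
            continue
        if stages[-1] is not None:
            outputs.append(stages[-1])
        # Advance all stages by one and feed the next input, if any.
        stages = [pending.pop(0) if pending else None] + stages[:-1]
        cycle += 1
    return outputs, cycle


print(run_pipeline([1, 2, 3], 2, lambda c: True))             # ([1, 2, 3], 5)
print(run_pipeline([1, 2, 3], 2, lambda c: c not in (2, 3)))  # ([1, 2, 3], 7)
```

In the second call the sink is busy during cycles 2 and 3, so the run takes two extra cycles, but the results still come out complete and in order without any intermediate buffer.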
For example, as shown in FIG. 3, the multiplier circuit 123 in the data processing circuit 120 includes four stages of pipeline operation circuits. The first-stage pipeline operation circuit performs 16bit×8bit unsigned multiplication; the second-stage pipeline operation circuit performs the 16bit×16bit multiplication splicing operation and the 32bit×16bit multiplication splicing operation; the third-stage pipeline operation circuit performs the 32bit×32bit multiplication splicing operation; and the fourth-stage pipeline operation circuit performs shift processing on the result of the 32bit×16bit multiplication splicing operation and/or the result of the 32bit×32bit multiplication splicing operation.
Exemplarily, the multiplier circuit 123 may include 128 unsigned 16bit×8bit=24bit multipliers, for example 16 groups of 8 such multipliers each. The 128 unsigned 16bit×8bit=24bit multipliers can be split into the following forms: 64 lanes of 8bit×8bit+8bit×8bit, 64 lanes of 16bit×16bit, 32 lanes of 32bit×16bit, or 16 lanes of 32bit×32bit.
Exemplarily, the multiplier circuit 123 can implement the operations shown in Table 2 through multiplier splicing:
Table 2 Multiplier splicing
Splicing mode number | Combination | Pipeline stages
0 | 4 × (8bit×8bit+8bit×8bit -> 17bit) | 2
1 | 4 × (16bit×16bit -> 32bit) | 2
2 | 2 × (32bit×16bit -> 48bit) | 4
3 | 1 × (32bit×32bit -> 64bit) | 4
Exemplarily, when the data processing circuit 120 operates on signed data, the preprocessing circuit 122 in the data processing circuit 120 takes the absolute value of the signed data, and the third-stage pipeline operation circuit combines the sign bit of the signed data with the result of the 32bit×16bit multiplication splicing operation or the result of the 32bit×32bit multiplication splicing operation (comb sign).
Exemplarily, referring to FIG. 3, the first-stage pipeline operation circuit computes the 16bit×8bit unsigned multiplications. The second-stage pipeline operation circuit contains two levels of addition: the first level completes the 16bit×16bit multiplication splicing and the second level completes the 32bit×16bit multiplication splicing. The third-stage pipeline operation circuit contains one level of addition, which completes the 32bit×32bit multiplication splicing, and finally combines the unsigned result with the sign bit. The fourth-stage pipeline operation circuit shifts the 32bit×16bit and 32bit×32bit results and contains two shifters: the first shifter takes 64-bit input data, shifts right by 0 to 32 bits, and outputs 48-bit data; the second shifter takes 48-bit input data, shifts right by 0 to 16 bits, and outputs 48-bit data.
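The splicing performed by the adder levels described above follows from splitting one operand into narrower parts and summing shifted partial products. The sketch below models that arithmetic in software; the function names are illustrative, and the exact partitioning used by the circuit in FIG. 3 is an assumption.

```python
def mul16x8(a, b):
    """Base unsigned 16-bit x 8-bit multiplier; the product fits in 24 bits."""
    assert 0 <= a < (1 << 16) and 0 <= b < (1 << 8)
    return a * b


def mul16x16(a, b):
    # Split b into bytes: a*b = a*b_lo + (a*b_hi << 8). Uses 2 base multipliers.
    return mul16x8(a, b & 0xFF) + (mul16x8(a, b >> 8) << 8)


def mul32x16(a, b):
    # Split a into 16-bit halves. Uses 4 base multipliers.
    return mul16x16(a & 0xFFFF, b) + (mul16x16(a >> 16, b) << 16)


def mul32x32(a, b):
    # Split b into 16-bit halves. Uses 8 base multipliers.
    return mul32x16(a, b & 0xFFFF) + (mul32x16(a, b >> 16) << 16)


print(hex(mul32x32(0xFFFFFFFF, 0xFFFFFFFF)))  # 0xfffffffe00000001
```

Counting calls, one 32bit×32bit product consumes eight 16bit×8bit multiplications, so 128 base multipliers give 16 lanes of 32bit×32bit, 32 lanes of 32bit×16bit, or 64 lanes of 16bit×16bit.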
Exemplarily, for a signed multiplication such as int×int, the absolute values are taken in the preprocessing stage before the operands are sent to the multiplier, and the product is combined with the sign bit after the multiplication completes. Signed and unsigned types thus reuse the same resources.
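The sign handling described above can be sketched as follows. This is a software model only; deriving the product sign as the XOR of the operand signs is the standard sign-magnitude rule and is assumed here rather than quoted from the application, and `unsigned_mul` is a hypothetical stand-in for the unsigned multiplier array.

```python
def signed_mul(a, b, unsigned_mul=lambda x, y: x * y):
    """Multiply signed integers using only an unsigned multiplier."""
    negative = (a < 0) != (b < 0)               # product sign: XOR of operand signs
    magnitude = unsigned_mul(abs(a), abs(b))    # preprocessing supplies |a| and |b|
    return -magnitude if negative else magnitude  # recombine with the sign


print(signed_mul(-3, 7))   # -21
print(signed_mul(-3, -7))  # 21
```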
It can be understood that only eight unsigned short×char multipliers, together with some adders, are needed to splice together the various combinations of data types and sign types. Multiplier resources are thus heavily reused, so complex mixed-data-type operations can be implemented in a relatively small area.
Exemplarily, as shown in FIG. 4, the data processing circuit 120 may read two 512-bit pieces of image data and one 512-bit piece of coefficients from the data memory 11 at a time. For example, the image data to be processed is stored in ireg0 of the on-chip cache and the coefficients in ireg1; the image data to be processed includes a(0), a(1), a(2), ..., a(127) of the first row, and the coefficients include, for example, b(0), b(1), b(2) of the first row. Exemplarily, the data processing circuit 120 includes 128 unsigned multipliers, such as char×char multipliers, and may perform the first calculation as follows: AR(0:63)=AR(0:63)+a(0:63)×b(0)+a(1:64)×b(1), where AR denotes the accumulation registers, AR(0:63) denotes 64 accumulation registers, a(0:63) denotes a(0) to a(63), and a(1:64) denotes a(1) to a(64). This makes full use of the 128 multipliers.
After the first calculation, one coefficient in the first row of the coefficient matrix, b(2), has not yet been used, so the second calculation follows: AR(0:63)=AR(0:63)+a(2:65)×b(2), where a(2:65) denotes a(2) to a(65).
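The two calculations for one row can be written out in software as a check on the index ranges. This is a sketch in plain Python lists; the function name and the sequential loops are illustrative only, since the hardware computes all 64 lanes of each pass in parallel.

```python
def sliding_mac_row(a, b, n_out=64):
    """Accumulate one image row against a 3-tap coefficient row in two passes."""
    ar = [0] * n_out  # AR(0:63), the accumulation registers
    # First calculation: two coefficients per output, i.e. 2 * 64 = 128 products.
    for i in range(n_out):
        ar[i] += a[i] * b[0] + a[i + 1] * b[1]
    # Second calculation: the remaining coefficient b(2).
    for i in range(n_out):
        ar[i] += a[i + 2] * b[2]
    return ar


row = list(range(128))          # a(0) .. a(127)
ar = sliding_mac_row(row, [1, 2, 3])
print(ar[0], ar[63])            # 8 386
```

With a(i) = i and coefficients 1, 2, 3, each output equals i + 2(i+1) + 3(i+2) = 6i + 8, which matches the printed values.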
Exemplarily, for an image operation with step=1 and stride=1, such as a convolution, the multiplier circuit 123 needs to complete the following operations, where step is the interval between two adjacent pixels inside a single window and stride is the number of pixels the window slides each time:
a(0)×b(0)+a(1)×b(1),
a(1)×b(0)+a(2)×b(1),
......
a(63)×b(0)+a(64)×b(1);
Exemplarily, for an image operation with step=2 and stride=1, the multiplier circuit 123 needs to complete the following operations:
a(0)×b(0)+a(2)×b(1),
a(1)×b(0)+a(3)×b(1),
......
a(63)×b(0)+a(65)×b(1);
Exemplarily, for an image operation with step=4 and stride=1, the multiplier circuit 123 needs to complete the following operations:
a(0)×b(0)+a(4)×b(1),
a(1)×b(0)+a(5)×b(1),
......
a(63)×b(0)+a(67)×b(1);
Exemplarily, for an image operation with step=1 and stride=2, the multiplier circuit 123 needs to complete the following operations:
a(0)×b(0)+a(1)×b(1),
a(2)×b(0)+a(3)×b(1),
......
a(62)×b(0)+a(63)×b(1);
The preprocessing circuit 122 selects the data sources ahead of the multipliers: for example, it arranges the data according to the operation pattern above that corresponds to the given step and stride, and then sends the data to the multiplier circuit 123 for the multiplication and addition operations. The preprocessing circuit 122 can produce the arrangement of multiplier data sources for the various step and stride cases, which is how the DSP processor supports sliding window operations for different values of step and stride.
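The four listings above follow one pattern: output k multiplies a(k·stride) by b(0) and a(k·stride+step) by b(1). A minimal sketch of that operand selection is given below; the function name is hypothetical, and the output count of 32 for stride=2 is inferred from the listing that ends at a(62)×b(0)+a(63)×b(1).

```python
def operand_pairs(step, stride, n_out):
    """Index pairs (i0, i1) such that output k = a(i0)*b(0) + a(i1)*b(1)."""
    return [(k * stride, k * stride + step) for k in range(n_out)]


print(operand_pairs(1, 1, 64)[-1])  # (63, 64)
print(operand_pairs(2, 1, 64)[-1])  # (63, 65)
print(operand_pairs(4, 1, 64)[-1])  # (63, 67)
print(operand_pairs(1, 2, 32)[-1])  # (62, 63)
```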
The second row of image data to be processed and the second row of coefficients can then be read, and the second row's operations completed in the same way as the first row's; likewise, the third row of image data and the third row of coefficients can be read and the third row's operations completed in the same way. After the operations for these three rows are complete, the data stored in AR(0:63) is the result to be output. Once the result has been output, the above operations can be repeated to process the subsequent image data.
Exemplarily, the multiplier circuit 123 can perform 64 multiply-add operations of the form 8bit×8bit+8bit×8bit, or 64 multiplications of 16bit×16bit. The accumulation circuit 124 performs the accumulation: the output data of the multiplier circuit 123 is sent directly to the accumulators, and once all the coefficients of a coefficient matrix have been processed, the data in the accumulators is written out to the on-chip cache of the DSP processor. The post-processing circuit 125 post-processes the output data, for example by truncating it.
In some embodiments, the data loading circuit 121 in the data processing circuit 120 may be used to perform the data sliding for sliding window instructions. For example, the data loading circuit 121 registers the data on the data exchange bus as the data to be processed and performs the data sliding for the sliding window instruction.
The inputs to a sliding window operation are a matrix of image data to be processed and a coefficient matrix. The coefficient matrix slides over the image; at each position, all the pixels it covers are multiplied by the corresponding coefficients and accumulated to produce one output value. For example, with a 258x258 image matrix and a 3x3 coefficient matrix, sliding the coefficient matrix over the whole image matrix yields a 256x256 output matrix.
A sliding window operation contains a large number of multiply-accumulate operations. If it is implemented by calling the multiply-accumulate of a general-purpose processor, many calls are needed and the instructions are scheduled in software; because software has poor real-time performance, this greatly increases the execution time of the sliding window operation, so the real-time performance of this approach is low. This approach also reads and writes the on-chip cache many times, and on-chip cache accesses consume considerable power, so its power consumption is high. In addition, limited by the width of the processor's internal data bus 10, the MAC (multiply-accumulate) utilization of ordinary multiply-accumulate operations in a general-purpose processor is usually low: for a processor whose data bus 10 is 512 bits wide, only 64 char values can be loaded in one clock cycle for a char-type multiply-accumulate, whereas the processor can handle up to 128 char-type multiply-accumulates internally, so MAC utilization is only 50%.
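The 50% figure follows directly from the bus width and the element width. A small arithmetic sketch (the helper name is hypothetical, and char is taken to be 8 bits):

```python
def mac_utilization(bus_bits, elem_bits, n_mac):
    """Fraction of MAC units kept busy when one bus word is loaded per cycle."""
    loaded = bus_bits // elem_bits      # elements delivered per clock cycle
    return min(loaded, n_mac) / n_mac


print(mac_utilization(512, 8, 128))  # 0.5
```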
FIG. 5 shows the sliding window operation when step=1 and stride=1: step=1 indicates that the pixels within a single window are adjacent, and stride=1 indicates that the window slides by 1 pixel each time.
where:
O[0,0]=P[0,0]×Coef[0,0]+P[0,1]×Coef[0,1]+P[0,2]×Coef[0,2]+
P[1,0]×Coef[1,0]+P[1,1]×Coef[1,1]+P[1,2]×Coef[1,2]+
P[2,0]×Coef[2,0]+P[2,1]×Coef[2,1]+P[2,2]×Coef[2,2];
As shown in FIG. 5, O[0,1], O[1,0], and so on can likewise be computed by sliding the window.
FIG. 6 shows the sliding window operation when step=2 and stride=1: step=2 indicates that the pixels within a single window are separated by one pixel, and stride=1 indicates that the window slides by 1 pixel each time.
where:
O[0,0]=P[0,0]×Coef[0,0]+P[0,2]×Coef[0,1]+P[0,4]×Coef[0,2]+
P[2,0]×Coef[1,0]+P[2,2]×Coef[1,1]+P[2,4]×Coef[1,2]+
P[4,0]×Coef[2,0]+P[4,2]×Coef[2,1]+P[4,4]×Coef[2,2];
FIG. 7 shows the sliding window operation when step=1 and stride=2: step=1 indicates that the pixels within a single window are adjacent, and stride=2 indicates that the window slides by 2 pixels each time.
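The three figure cases can be captured by one reference function in which output O[r,c] sums P[r·stride+i·step, c·stride+j·step]×Coef[i,j] over the window. The sketch below is a plain software model, not the hardware data path; applying the same step and stride to rows and columns and requiring a square coefficient matrix are assumptions.

```python
def sliding_window(P, coef, step=1, stride=1):
    """Reference sliding window: step spaces taps, stride moves the window."""
    k = len(coef)                       # square k x k coefficient matrix
    span = (k - 1) * step + 1           # footprint of one window in pixels
    rows = (len(P) - span) // stride + 1
    cols = (len(P[0]) - span) // stride + 1
    return [[sum(P[r * stride + i * step][c * stride + j * step] * coef[i][j]
                 for i in range(k) for j in range(k))
             for c in range(cols)]
            for r in range(rows)]


image = [[1] * 258 for _ in range(258)]
ones = [[1] * 3 for _ in range(3)]
out = sliding_window(image, ones)
print(len(out), len(out[0]), out[0][0])  # 256 256 9
```

With step=1 and stride=1, a 258x258 image and a 3x3 coefficient matrix give the 256x256 output mentioned earlier; with step=2 the same 3x3 coefficients cover a 5x5 footprint, as in the O[0,0] expression for FIG. 6.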
Because the above method of computing two coefficients per clock cycle is used, multiple coefficients are computed in parallel, hardware utilization can be maximized, and 128 char×char operations can be performed in one clock cycle. Preprocessing provides support for a variety of stride and step values, giving high flexibility; programmers do not need to do a large amount of extra processing, which improves efficiency and reduces power consumption. The data loading circuit 121, the preprocessing circuit 122, the multiplier circuit 123, the accumulation circuit 124, and the post-processing circuit 125 in the data processing circuit 120 form a fully pipelined structure; each circuit's processing is independent and can proceed in parallel, which gives the DSP processor high performance. Reading two pieces of image data and sliding the window inside the circuit greatly reduces the number of on-chip cache reads, so power consumption is low.
In the DSP processor provided by the embodiments of the present application, providing the first decoding circuit and the second decoding circuit in the control circuit of the DSP processor gives a clear architectural division and makes complex instructions easier to map. The first control signal and the second control signal are separated, so the DSP processor is highly extensible and can better support complex instructions. For example, when a new operation instruction needs to be added, the overall architecture of the DSP processor does not need to be adjusted; only the corresponding control signals need to be added, that is, the first control signal and the second control signal for that instruction need to be determined.
Referring to FIG. 8 in conjunction with the foregoing embodiments, FIG. 8 is a schematic flowchart of a parallel computing method for a DSP processor provided by another embodiment of the present application. The parallel computing method is applied in a DSP processor.
The DSP processor includes a data bus connected to a data memory, a program bus connected to a program memory, a control circuit, and a data processing circuit, where the control circuit includes a read request generation circuit, a first decoding circuit, and a second decoding circuit.
As shown in FIG. 8, the parallel computing method of this embodiment includes steps S210 to S240.
S210. The read request generation circuit outputs a read data request to the data bus according to an operation instruction, where the operation instruction is obtained from the program memory through the program bus, and the read data request is used by the data processing circuit to obtain data to be processed from the data memory through the data bus;
S220. The first decoding circuit performs first-level decoding on the operation instruction and transmits the first control signal obtained by the first-level decoding to the data processing circuit;
S230. The second decoding circuit performs second-level decoding on the first control signal and transmits the second control signal obtained by the second-level decoding to the data processing circuit;
S240. The data processing circuit operates on the data to be processed according to the first control signal and the second control signal, and stores the operation result to the data memory through the data bus;
where the first control signal is uniquely determined by the operation instruction, and the second control signal is used to determine the processing flow of the data processing circuit according to the operation instruction.
In some embodiments, the method further includes: a configuration bus interface circuit of the control circuit configures the read request generation circuit according to a configuration signal on a configuration bus, so that the read request generation circuit, based on that configuration, outputs the read data request to the data bus according to the operation instruction.
In some embodiments, the read data request includes a source address, a target address, a read request valid signal, and a data amount.
Exemplarily, the method further includes:
the configuration bus interface circuit outputs a register configuration signal to the data processing circuit.
In some embodiments, the first control signal stays fixed while the operation instruction is processed, and the second control signal changes dynamically while the operation instruction is processed.
Exemplarily, the second control signal is used to control the states of multiple pipeline stages in the data processing circuit.
In some embodiments, the data processing circuit includes:
a data loading circuit for registering the data to be processed;
a preprocessing circuit for performing addition and subtraction, assigning multiplier data sources, and/or taking absolute values;
a multiplier circuit for performing multiplication;
an accumulation circuit for accumulating data;
a post-processing circuit for truncating, saturating, and/or rounding the computed data, and for storing the processed data in the data memory through the data bus.
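The post-processing operations named above (truncation, saturation, rounding) can be illustrated on plain integers. The sketch below assumes one particular order (round, then truncate by right shift, then saturate) and round-half-up behavior; the application only lists the operations with "and/or", so the order, the function name `post_process`, and its parameters are assumptions.

```python
def post_process(value: int, shift: int, out_bits: int,
                 round_half_up: bool = True) -> int:
    """Round, truncate, and saturate a wide result to a signed out_bits value."""
    if round_half_up and shift > 0:
        # Adding half of the discarded weight turns truncation into rounding.
        value += 1 << (shift - 1)
    # Arithmetic right shift drops the low bits (truncates toward -infinity).
    value >>= shift
    # Saturate to the signed range representable in out_bits.
    hi = (1 << (out_bits - 1)) - 1
    lo = -(1 << (out_bits - 1))
    return max(lo, min(hi, value))
```

With `shift=1` the value 5 becomes 3 when rounding is enabled and 2 when it is disabled, and any result outside the 16-bit signed range is clamped to 32767 or -32768 when `out_bits=16`.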
Exemplarily, the second control signal controls, in a pipelined manner, the data loading circuit, the preprocessing circuit, the multiplier circuit, the accumulation circuit, and the post-processing circuit in the data processing circuit to perform the corresponding data processing.
In some embodiments, the data processing circuit is configured to acquire multiple groups of data to be processed from the data bus in parallel;
when acquisition of the multiple groups of data to be processed is complete, the second decoding circuit transmits the decoded second control signal to the data processing circuit; when acquisition of the multiple groups of data to be processed is not complete, the second decoding circuit suspends transmission of the decoded second control signal to the data processing circuit.
Exemplarily, the second decoding circuit transmits the decoded second control signal to the data processing circuit upon receiving the read data valid signal corresponding to each of the multiple groups of data to be processed, where each read data valid signal is obtained when acquisition of the corresponding group of data to be processed is complete.
In some embodiments, while the operation result has not been stored in the data memory through the data bus, the second decoding circuit suspends transmission of the decoded second control signal to the data processing circuit; once the operation result has been stored in the data memory through the data bus, the second decoding circuit transmits the decoded second control signal to the data processing circuit.
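The two suspension conditions described above (input groups not yet fully acquired, result not yet stored) amount to a gate on the second control signal. The following is a rough sketch of that gating; the names `may_issue` and `issue_cycles`, and the modelling of each condition as one boolean per cycle, are assumptions made for illustration.

```python
from typing import List, Sequence


def may_issue(read_valid: Sequence[bool], store_pending: bool) -> bool:
    # The second control signal is forwarded only when every group of
    # input data has raised its read data valid signal and no earlier
    # result is still waiting to be written to the data memory.
    return all(read_valid) and not store_pending


def issue_cycles(valid_trace: Sequence[Sequence[bool]],
                 store_trace: Sequence[bool]) -> List[int]:
    """Return the cycle indices in which the second control signal is sent."""
    return [
        cycle
        for cycle, (valids, pending) in enumerate(zip(valid_trace, store_trace))
        if may_issue(valids, pending)
    ]
```

In a three-cycle trace where one input group is late in cycle 0 and a store is still pending in cycle 1, only cycle 2 issues a control signal; in the other cycles the pipeline would hold its state.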
In some embodiments, the multiplier circuit in the data processing circuit includes a four-stage pipelined operation circuit, in which the first-stage pipelined operation circuit performs 16-bit × 8-bit unsigned multiplication, the second-stage pipelined operation circuit performs 16-bit × 16-bit multiplication splicing and 32-bit × 16-bit multiplication splicing, the third-stage pipelined operation circuit performs 32-bit × 32-bit multiplication splicing, and the fourth stage shifts the result of the 32-bit × 16-bit multiplication splicing and/or the result of the 32-bit × 32-bit multiplication splicing.
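One way to read "multiplication splicing" is that wider products are assembled from shifted sums of narrower partial products. The sketch below builds 16×16, 32×16, and 32×32 unsigned products from a 16×8 primitive in that way. The exact decomposition used by the hardware is not stated in this passage, so the split points and all function names here are assumptions.

```python
def mul16x8(a: int, b: int) -> int:
    # Stage 1 primitive: unsigned 16-bit by 8-bit multiplication.
    assert 0 <= a < (1 << 16) and 0 <= b < (1 << 8)
    return a * b


def mul16x16(a: int, b: int) -> int:
    # Stage 2: splice two 16x8 products, split on the bytes of b.
    return mul16x8(a, b & 0xFF) + (mul16x8(a, b >> 8) << 8)


def mul32x16(a: int, b: int) -> int:
    # Stage 2: splice two 16x16 products, split on the halves of a.
    return mul16x16(a & 0xFFFF, b) + (mul16x16(a >> 16, b) << 16)


def mul32x32(a: int, b: int) -> int:
    # Stage 3: splice two 32x16 products, split on the halves of b.
    return mul32x16(a, b & 0xFFFF) + (mul32x16(a, b >> 16) << 16)


def shift_result(product: int, shift: int) -> int:
    # Stage 4: shift the spliced product (for example, fixed-point scaling).
    return product >> shift
```

Each spliced product equals the ordinary integer product, so `mul32x32(a, b)` matches `a * b` for any unsigned 32-bit operands; the value of the structure is that every level reuses the narrow multiplier below it.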
Exemplarily, when the data processing circuit operates on signed data, the preprocessing circuit in the data processing circuit takes the absolute value of the signed data, and the third-stage pipelined operation circuit combines the sign bit of the signed data with the result of the 32-bit × 16-bit multiplication splicing or the result of the 32-bit × 32-bit multiplication splicing.
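The signed path described above is a sign-magnitude scheme: strip the signs, multiply the magnitudes as unsigned numbers, then reattach the sign. A minimal sketch follows, assuming the result sign is the XOR of the operand signs; the function name `signed_mul` is hypothetical, and Python's built-in `*` stands in for the unsigned multiplier.

```python
def signed_mul(a: int, b: int) -> int:
    """Multiply two signed integers via absolute values and a sign bit."""
    # Preprocessing: record the result sign and take absolute values.
    negative = (a < 0) != (b < 0)
    magnitude = abs(a) * abs(b)  # stand-in for the unsigned spliced multiplier
    # Recombination: attach the sign to the unsigned product.
    return -magnitude if negative else magnitude
```

Note that the magnitude of the most negative 32-bit value, 2^31, still fits in an unsigned 32-bit operand, which is one reason an unsigned multiplier can serve the signed case.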
The specific principles and implementations of the parallel computing method provided in the embodiments of the present application are similar to those of the DSP processor in the foregoing embodiments and are not repeated here.
It should be understood that the terminology used in this application is for the purpose of describing particular embodiments only and is not intended to limit the application.
It should also be understood that the term "and/or", as used in this application and the appended claims, refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and such modifications or substitutions shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (26)

  1. A DSP processor, characterized in that the DSP processor includes a data bus connected to a data memory and a program bus connected to a program memory;
    the DSP processor further includes a control circuit and a data processing circuit, the control circuit including:
    a read request generation circuit, configured to output a read data request to the data bus according to an operation instruction, the operation instruction being obtained from the program memory through the program bus, and the read data request being used to instruct the data processing circuit to obtain data to be processed from the data memory through the data bus;
    a first decoding circuit, configured to perform first-level decoding on the operation instruction and to transmit a first control signal obtained by the first-level decoding to the data processing circuit;
    a second decoding circuit, configured to perform second-level decoding on the first control signal and to transmit a second control signal obtained by the second-level decoding to the data processing circuit;
    the data processing circuit being configured to operate on the data to be processed according to the first control signal and the second control signal, and to store the operation result in the data memory through the data bus.
  2. The DSP processor according to claim 1, characterized in that the control circuit further includes a configuration bus interface circuit, the configuration bus interface circuit being configured to configure the read request generation circuit according to a configuration signal on a configuration bus, so that the read request generation circuit, based on that configuration, outputs the read data request to the data bus according to the operation instruction.
  3. The DSP processor according to claim 1 or 2, characterized in that the read data request includes a source address, a target address, a read request valid signal, and a data amount.
  4. The DSP processor according to claim 2, characterized in that the configuration bus interface circuit is further configured to output a register configuration signal to the data processing circuit.
  5. The DSP processor according to any one of claims 1-4, characterized in that the first control signal remains fixed while the operation instruction is processed, and the second control signal changes dynamically while the operation instruction is processed.
  6. The DSP processor according to claim 5, characterized in that the second control signal is used to control the states of multiple pipeline stages in the data processing circuit.
  7. The DSP processor according to any one of claims 1-6, characterized in that the data processing circuit includes:
    a data loading circuit for registering the data to be processed;
    a preprocessing circuit for performing addition and subtraction, assigning multiplier data sources, and/or taking absolute values;
    a multiplier circuit for performing multiplication;
    an accumulation circuit for accumulating data;
    a post-processing circuit for truncating, saturating, and/or rounding the computed data, and for storing the processed data in the data memory through the data bus.
  8. The DSP processor according to claim 7, characterized in that the second decoding circuit is configured to control, in a pipelined manner, the data loading circuit, the preprocessing circuit, the multiplier circuit, the accumulation circuit, and the post-processing circuit in the data processing circuit to perform the corresponding data processing.
  9. The DSP processor according to any one of claims 1-8, characterized in that the data processing circuit is configured to acquire multiple groups of data to be processed from the data bus in parallel;
    the second decoding circuit is configured to transmit the decoded second control signal to the data processing circuit when acquisition of the multiple groups of data to be processed is complete, and to suspend transmission of the decoded second control signal to the data processing circuit when acquisition of the multiple groups of data to be processed is not complete.
  10. The DSP processor according to claim 9, characterized in that the second decoding circuit is configured to transmit the decoded second control signal to the data processing circuit upon receiving the read data valid signal corresponding to each of the multiple groups of data to be processed, each read data valid signal being obtained when acquisition of the corresponding group of data to be processed is complete.
  11. The DSP processor according to any one of claims 1-10, characterized in that, while the operation result has not been stored in the data memory through the data bus, the second decoding circuit suspends transmission of the decoded second control signal to the data processing circuit, and once the operation result has been stored in the data memory through the data bus, the second decoding circuit transmits the decoded second control signal to the data processing circuit.
  12. The DSP processor according to any one of claims 1-11, characterized in that the multiplier circuit in the data processing circuit includes a four-stage pipelined operation circuit, in which the first-stage pipelined operation circuit performs 16-bit × 8-bit unsigned multiplication, the second-stage pipelined operation circuit performs 16-bit × 16-bit multiplication splicing and 32-bit × 16-bit multiplication splicing, the third-stage pipelined operation circuit performs 32-bit × 32-bit multiplication splicing, and the fourth stage shifts the result of the 32-bit × 16-bit multiplication splicing and/or the result of the 32-bit × 32-bit multiplication splicing.
  13. The DSP processor according to claim 12, characterized in that, when the data processing circuit operates on signed data, the preprocessing circuit in the data processing circuit takes the absolute value of the signed data, and the third-stage pipelined operation circuit combines the sign bit of the signed data with the result of the 32-bit × 16-bit multiplication splicing or the result of the 32-bit × 32-bit multiplication splicing.
  14. A parallel computing method for a DSP processor, characterized in that the DSP processor includes a data bus connected to a data memory, a program bus connected to a program memory, a control circuit, and a data processing circuit, wherein the control circuit includes a read request generation circuit, a first decoding circuit, and a second decoding circuit;
    the parallel computing method includes:
    the read request generation circuit outputs a read data request to the data bus according to an operation instruction, the operation instruction being obtained from the program memory through the program bus, and the read data request being used to instruct the data processing circuit to obtain data to be processed from the data memory through the data bus;
    the first decoding circuit performs first-level decoding on the operation instruction and transmits a first control signal obtained by the first-level decoding to the data processing circuit;
    the second decoding circuit performs second-level decoding on the first control signal and transmits a second control signal obtained by the second-level decoding to the data processing circuit;
    the data processing circuit operates on the data to be processed according to the first control signal and the second control signal, and stores the operation result in the data memory through the data bus.
  15. The parallel computing method according to claim 14, characterized in that the method further includes:
    the configuration bus interface circuit of the control circuit configures the read request generation circuit according to a configuration signal on a configuration bus, so that the read request generation circuit, based on that configuration, outputs the read data request to the data bus according to the operation instruction.
  16. The parallel computing method according to claim 14 or 15, characterized in that the read data request includes a source address, a target address, a read request valid signal, and a data amount.
  17. The parallel computing method according to claim 15, characterized in that the method further includes:
    the configuration bus interface circuit outputs a register configuration signal to the data processing circuit.
  18. The parallel computing method according to any one of claims 14-17, characterized in that the first control signal remains fixed while the operation instruction is processed, and the second control signal changes dynamically while the operation instruction is processed.
  19. The parallel computing method according to claim 18, characterized in that the second control signal is used to control the states of multiple pipeline stages in the data processing circuit.
  20. The parallel computing method according to any one of claims 14-19, characterized in that the data processing circuit includes:
    a data loading circuit for registering the data to be processed;
    a preprocessing circuit for performing addition and subtraction, assigning multiplier data sources, and/or taking absolute values;
    a multiplier circuit for performing multiplication;
    an accumulation circuit for accumulating data;
    a post-processing circuit for truncating, saturating, and/or rounding the computed data, and for storing the processed data in the data memory through the data bus.
  21. The parallel computing method according to claim 20, characterized in that the second control signal controls, in a pipelined manner, the data loading circuit, the preprocessing circuit, the multiplier circuit, the accumulation circuit, and the post-processing circuit in the data processing circuit to perform the corresponding data processing.
  22. The parallel computing method according to any one of claims 14-21, characterized in that the data processing circuit is configured to acquire multiple groups of data to be processed from the data bus in parallel;
    when acquisition of the multiple groups of data to be processed is complete, the second decoding circuit transmits the decoded second control signal to the data processing circuit; when acquisition of the multiple groups of data to be processed is not complete, the second decoding circuit suspends transmission of the decoded second control signal to the data processing circuit.
  23. The parallel computing method according to claim 22, characterized in that the second decoding circuit transmits the decoded second control signal to the data processing circuit upon receiving the read data valid signal corresponding to each of the multiple groups of data to be processed, each read data valid signal being obtained when acquisition of the corresponding group of data to be processed is complete.
  24. The parallel computing method according to any one of claims 14-23, characterized in that, while the operation result has not been stored in the data memory through the data bus, the second decoding circuit suspends transmission of the decoded second control signal to the data processing circuit, and once the operation result has been stored in the data memory through the data bus, the second decoding circuit transmits the decoded second control signal to the data processing circuit.
  25. The parallel computing method according to any one of claims 14-24, characterized in that the multiplier circuit in the data processing circuit includes a four-stage pipelined operation circuit, in which the first-stage pipelined operation circuit performs 16-bit × 8-bit unsigned multiplication, the second-stage pipelined operation circuit performs 16-bit × 16-bit multiplication splicing and 32-bit × 16-bit multiplication splicing, the third-stage pipelined operation circuit performs 32-bit × 32-bit multiplication splicing, and the fourth stage shifts the result of the 32-bit × 16-bit multiplication splicing and/or the result of the 32-bit × 32-bit multiplication splicing.
  26. The parallel computing method according to claim 25, characterized in that, when the data processing circuit operates on signed data, the preprocessing circuit in the data processing circuit takes the absolute value of the signed data, and the third-stage pipelined operation circuit combines the sign bit of the signed data with the result of the 32-bit × 16-bit multiplication splicing or the result of the 32-bit × 32-bit multiplication splicing.
PCT/CN2020/141848 2020-12-30 2020-12-30 Dsp and parallel computing method therefor WO2022141321A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/141848 WO2022141321A1 (en) 2020-12-30 2020-12-30 Dsp and parallel computing method therefor

Publications (1)

Publication Number Publication Date
WO2022141321A1 true WO2022141321A1 (en) 2022-07-07

Family

ID=82260040

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/141848 WO2022141321A1 (en) 2020-12-30 2020-12-30 Dsp and parallel computing method therefor

Country Status (1)

Country Link
WO (1) WO2022141321A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101957743A (en) * 2010-10-12 2011-01-26 中国电子科技集团公司第三十八研究所 Parallel digital signal processor
CN102750133A (en) * 2012-06-20 2012-10-24 中国电子科技集团公司第五十八研究所 32-Bit triple-emission digital signal processor supporting SIMD
CN107450888A (en) * 2016-05-30 2017-12-08 世意法(北京)半导体研发有限责任公司 Zero-overhead loop in embedded dsp
WO2018005718A1 (en) * 2016-06-30 2018-01-04 Intel Corporation System and method for out-of-order clustered decoding

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20967660

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20967660

Country of ref document: EP

Kind code of ref document: A1