WO2022141321A1 - DSP and parallel computing method thereof - Google Patents

DSP and parallel computing method thereof

Info

Publication number
WO2022141321A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
circuit
control signal
processing circuit
decoding
Prior art date
Application number
PCT/CN2020/141848
Other languages
English (en)
Chinese (zh)
Inventor
任子木
韩彬
Original Assignee
深圳市大疆创新科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司
Priority to PCT/CN2020/141848
Publication of WO2022141321A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead

Definitions

  • the present application relates to the technical field of DSP, and in particular, to a DSP processor and a parallel computing method thereof.
  • Parallel computing refers to the process of using multiple computing resources to solve computing problems at the same time, and it is an effective means to improve the computing speed and processing capacity of computer systems.
  • DSPs (Digital Signal Processors) and other processors can realize parallel computing.
  • the current DSP processor cannot support the operation of complex instructions very well.
  • the present application provides a DSP processor and a parallel computing method thereof, aiming to solve the technical problem that current DSP processors cannot adequately support the operation of complex instructions.
  • an embodiment of the present application provides a DSP processor, where the DSP processor includes a data bus connected to a data memory and a program bus connected to a program memory;
  • the DSP processor further includes a control circuit and a data processing circuit, and the control circuit includes:
  • a read request generating circuit configured to output a read data request to the data bus according to an operation instruction, where the operation instruction is obtained from the program memory through the program bus, and the read data request is used by the data processing circuit to obtain the data to be processed from the data memory through the data bus;
  • a first decoding circuit configured to perform first-level decoding on the operation instruction, and transmit the first control signal obtained by the first-level decoding to the data processing circuit;
  • a second decoding circuit configured to perform second-level decoding on the first control signal, and transmit the second control signal obtained by the second-level decoding to the data processing circuit;
  • the data processing circuit is configured to perform an operation on the data to be processed according to the first control signal and the second control signal, and store the operation result to the data memory through the data bus;
  • the first control signal is uniquely determined by the operation instruction, and the second control signal is used to determine the processing flow of the data processing circuit according to the operation instruction.
  • an embodiment of the present application provides a parallel computing method for a DSP processor, where the DSP processor includes a data bus connected to a data memory, a program bus connected to a program memory, a control circuit and a data processing circuit, wherein the control circuit includes a read request generating circuit, a first decoding circuit and a second decoding circuit;
  • the parallel computing method includes:
  • the read request generation circuit outputs a read data request to the data bus according to an operation instruction, where the operation instruction is obtained from the program memory through the program bus, and the read data request is used by the data processing circuit to obtain the data to be processed from the data memory through the data bus;
  • the first decoding circuit performs first-level decoding on the operation instruction, and transmits the first control signal obtained by the first-level decoding to the data processing circuit;
  • the second decoding circuit performs second-level decoding on the first control signal, and transmits the second control signal obtained by the second-level decoding to the data processing circuit;
  • the data processing circuit performs an operation on the data to be processed according to the first control signal and the second control signal, and stores the operation result to the data memory through the data bus;
  • the first control signal is uniquely determined by the operation instruction, and the second control signal is used to determine the processing flow of the data processing circuit according to the operation instruction.
  • the embodiments of the present application provide a DSP processor and a parallel computing method thereof.
  • by providing a first decoding circuit and a second decoding circuit in the control circuit of the DSP processor, the structure is clearly divided and the mapping of complex instructions is convenient.
  • the separation of the first control signal and the second control signal makes the DSP processor more scalable and better able to support the operation of complex instructions. For example, when a new operation instruction needs to be added, the overall architecture of the DSP processor does not need to be adjusted; only the corresponding control signals need to be added, that is, the first control signal and the second control signal corresponding to the new operation instruction need to be determined.
  • FIG. 1 is a schematic block diagram of a DSP processor provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a data processing circuit pipeline
  • FIG. 3 is a schematic structural diagram of a multiplier circuit in an embodiment
  • FIG. 4 is a schematic diagram of a data processing circuit reading data
  • FIG. 5 is a schematic diagram of a sliding window operation in one embodiment
  • FIG. 6 is a schematic diagram of a sliding window operation in another embodiment
  • FIG. 7 is a schematic diagram of a sliding window operation in yet another embodiment
  • FIG. 8 is a schematic flowchart of a parallel computing method for a DSP processor provided by an embodiment of the present application.
  • reference numerals: 10, data bus; 11, data memory; 20, program bus; 21, program memory; 30, configuration bus; 110, control circuit; 111, read request generation circuit; 112, first decoding circuit; 113, second decoding circuit; 114, configuration bus interface circuit; 120, data processing circuit; 121, data loading circuit; 122, preprocessing circuit; 123, multiplier circuit; 124, accumulation circuit; 125, post-processing circuit.
  • FIG. 1 is a schematic block diagram of a DSP processor provided by an embodiment of the present application.
  • the core idea of a DSP processor is SIMD (Single Instruction, Multiple Data) parallel computing. Compared with general-purpose processors (such as a CPU, Central Processing Unit), DSP processors can provide stronger parallel computing capabilities.
  • the vector execution circuit is the core processing circuit inside the DSP processor.
  • the vector execution circuit is used to execute vector instructions. For example, the addition of two vectors can be completed according to the vector instructions, with a high degree of parallelism and strong computing power.
  • common applications of DSP processors include vector multiply-accumulate operations, and operations such as image filtering and feature extraction can be mapped to vector multiply-accumulate operations.
  • the computing capability of the DSP processor can be measured by the multiply-accumulate parallelism that the DSP processor can provide for different data types.
  • the vector execution circuit of the DSP processor needs to support a variety of instructions, such as VMADD, VMADDS, VMADDC for instructing addition operations, and VMMUL, VMMULS, VMMULC for instructing multiplication operations, but of course not limited to this.
  • the symbol types of input data also have various combinations: uchar+char, uchar+uchar, etc., where uchar represents an unsigned byte type and char represents a signed byte type. Therefore, the vector execution circuit needs to support a variety of instructions, and each instruction contains a mixture of multiple data types and multiple symbol types. How to map these instructions to the hardware structure and how to implement these complex instructions with the smallest area is a
  • the vector execution circuit of the DSP processor includes a control circuit 110 and a data processing circuit 120, wherein the control circuit 110 may be referred to as a control path (ctrl_path), and the data processing circuit 120 may be referred to as a data path (data_path).
  • the DSP processor further includes a data bus 10 connected to the data memory 11 and a program bus 20 connected to the program memory 21 .
  • the data bus 10 includes, for example, a data crossbar inside the DSP processor.
  • control circuit 110 includes: a read request generating circuit 111 , a first decoding circuit 112 and a second decoding circuit 113 .
  • the read request generation circuit 111 (for example, called cbar_req_gen) is used for outputting a read data request to the data bus 10 according to the operation instruction.
  • the operation instruction is obtained from the program memory 21 through the program bus 20 .
  • a read data request may include a source address, a destination address, a read request valid signal, and an amount of data.
  • the read data request includes a field used to indicate that the issued read request is valid, such as read_vld, a field used to indicate the data amount of the read request, such as read_cnt, and a field used to indicate the source address and destination address, such as set_src /dst.
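The fields listed above can be pictured as a small record. The following is a minimal sketch only: the field names read_vld and read_cnt come from the text, while the src and dst names, the integer address representation, the byte unit for read_cnt, and the helper make_read_request are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadDataRequest:
    read_vld: bool  # marks the issued read request as valid
    read_cnt: int   # amount of data requested (unit assumed to be bytes)
    src: int        # source address (hypothetical integer address)
    dst: int        # destination address (hypothetical integer address)


def make_read_request(src: int, dst: int, count: int) -> ReadDataRequest:
    """Build a request; a non-positive count yields an invalid (idle) request."""
    if count <= 0:
        return ReadDataRequest(False, 0, src, dst)
    return ReadDataRequest(True, count, src, dst)
```

For example, make_read_request(0x100, 0, 64) would describe a valid 64-byte read under these assumptions.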
  • the control circuit 110 further includes a configuration bus interface circuit 114, which is configured to configure the read request generation circuit 111 according to the configuration signal of the configuration bus 30, so that the read request generation circuit 111 outputs a read data request to the data bus 10 based on that configuration and according to the operation instruction.
  • the configuration bus 30 is, for example, a configuration bus (crf bus) inside the DSP processor.
  • the configuration bus interface circuit 114 (referred to as crf_reg, for example) can configure the read request generation circuit 111 according to the configuration signal of the configuration bus 30 .
  • the configuration bus interface circuit 114 is further configured to output a register configuration signal to the data processing circuit 120 .
  • the configuration bus interface circuit 114 outputs some register configuration signals to the data path according to the configuration signals of the configuration bus 30 , and such register configuration signals may be uniformly prefixed by rctrl_.
  • the read request generating circuit 111 can configure the value of the corresponding register according to the register configuration signal of the configuration bus interface circuit 114, and perform a data request according to the configured value.
  • the first decoding circuit 112 (for example, called static_ctrl_gen) is used to perform first-level decoding on the operation instruction, and transmit the first control signal obtained by the first-level decoding to the data processing circuit 120 .
  • the first control signal may include a control signal that remains unchanged throughout the execution of the instruction. It can be understood that the first control signal is uniquely determined by the operation instruction.
  • the first control signal may be prefixed by sctrl_.
  • the first control signal includes at least one of a symbol type, an instruction type, and a data extension type.
  • the first control signal can be transmitted to the data processing circuit 120, for example, it can be used to control the data processing circuit 120 to determine the type of data to be processed, and the like.
  • the first control signal can also be transmitted to the second decoding circuit 113 (for example, called dynamic_ctrl_gen) for secondary decoding.
  • the second decoding circuit 113 is configured to perform second-level decoding on the first control signal, and transmit the second control signal obtained by the second-level decoding to the data processing circuit 120 .
  • the second control signal is used to determine the processing flow of the data processing circuit 120 according to the operation instruction, and the data processing circuit 120 performs operation on the data to be processed according to the second control signal.
  • the second control signal is a control signal that varies with the flow of the data stream.
  • the second control signal mainly controls the operation of each pipeline stage in the data processing circuit 120, and controls when the pipeline of each stage is turned on and off.
  • the second control signal may be prefixed by dctrl_.
  • the second decoding circuit 113 starts to perform second-level decoding on the first control signal when it receives the read data valid signal, and transmits the second control signal obtained by the second-level decoding to the data processing circuit 120.
  • the read data valid signal is used to indicate that the data processing circuit 120 has acquired the data to be processed.
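The split between the two decoding levels can be sketched in software. This is an illustrative model, not the circuit described in the application: the table contents, the five-entry stage list, the one-cycle-per-unit timing, and the function names first_level_decode and second_level_decode are all assumptions. Only the sctrl_ and dctrl_ prefixes, the instruction names, and the gating on the read data valid signal come from the text.

```python
# Hypothetical stage order: one enable per functional unit, one cycle each.
STAGES = ["load", "pre", "mult", "acc", "post"]

# Hypothetical static decode table: values are fixed for the whole instruction.
STATIC_TABLE = {
    "VMADD": {"sctrl_instr_type": "add", "sctrl_sign_type": "char+char"},
    "VMMUL": {"sctrl_instr_type": "mul", "sctrl_sign_type": "uchar+char"},
}


def first_level_decode(instruction):
    """First-level decode: depends only on the instruction itself."""
    return dict(STATIC_TABLE[instruction])


def second_level_decode(sctrl, read_data_valid, cycle):
    """Second-level decode: per-cycle stage enables, held back until data is valid."""
    if not read_data_valid:
        return None  # no dynamic control is issued before the data has arrived
    dctrl = {}
    for index, stage in enumerate(STAGES):
        enabled = cycle == index
        if stage == "mult" and sctrl["sctrl_instr_type"] != "mul":
            enabled = False  # an add-type instruction never enables the multiplier
        dctrl["dctrl_" + stage + "_en"] = enabled
    return dctrl
```

In this sketch, adding a new instruction only means adding one row to STATIC_TABLE and, if needed, one rule in second_level_decode, which mirrors the scalability argument made in the text.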
  • the data processing circuit 120 is configured to perform an operation on the data to be processed according to the first control signal and the second control signal, and store the operation result to the data memory 11 through the data bus 10.
  • the data processing circuit 120 includes: a data loading circuit 121 , a preprocessing circuit 122 , a multiplier circuit 123 , an accumulation circuit 124 and a post-processing circuit 125 .
  • the data loading circuit 121 is used to register the data to be processed; the preprocessing circuit 122 is used to perform addition and subtraction operations, allocate multiplier data sources and/or take absolute values; the multiplier circuit 123 is used to perform multiplication operations; the accumulation circuit 124 is used to accumulate data; the post-processing circuit 125 is used to truncate, saturate and/or round the data after the operation, and to store the processed data, that is, the operation result, to the data memory 11 through the data bus 10.
  • the data loading circuit 121 (which may be referred to as a load unit) is used to register data on the data exchange bus as data to be processed.
  • the data loading circuit 121 can support three sets of 512-bit wide data input, such as master_a_rdata, master_b_rdata, master_c_rdata, and can buffer the data read back from the three data ports (ports) a/b/c.
  • the number of pipeline stages of the data loading circuit 121 is one stage, such as the pipeline stages marked ireg0 , ireg1 , and ireg2 correspondingly.
  • a read data valid signal may be sent to the second decoding circuit 113, so that the second decoding circuit 113 starts to output the second control signal to the data processing circuit 120.
  • the second control signal is used to control the states of a plurality of pipeline stages in the data processing circuit 120 .
  • the multiple pipeline stages in the data processing circuit 120 can be referred to in FIG. 2 .
  • the second control signal is used to control the data processing circuit 120 to process the data to be processed in a pipeline manner.
  • the preprocessing circuit 122 (may be referred to as a pre_proc unit) is used for data preprocessing (pre_proc) operations, performing addition and subtraction operations before multiplication, and performing preprocessing such as assignment of multiplier data sources and taking absolute values.
  • the number of pipeline stages of the preprocessing circuit 122 is two, for example, the pipeline stages labeled 0 and 1.
  • the multiplier circuit 123 (which may be referred to as a multi unit) is used to perform a multiplication (mult) operation, and a truncation operation after the multiplication.
  • the number of pipeline stages of the multiplier circuit 123 is four, such as pipeline stages numbered 2 to 5 correspondingly.
  • the accumulation circuit 124 (which may be referred to as an acc unit) is used to perform an accumulation (acc) operation of data, and a tree-shaped accumulation (tree_add) operation of multiple accumulators in the accumulation circuit 124, such as 64 accumulators.
  • the number of pipeline stages of the accumulating circuit 124 is seven, such as pipeline stages numbered 6 to 13.
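The tree-shaped accumulation (tree_add) can be illustrated with a pairwise reduction. This is a sketch under assumptions: the function name tree_add is borrowed from the text, but the pairwise pairing order and the zero padding for odd counts are illustrative choices, and the mapping of adder levels onto pipeline stages is not modeled.

```python
def tree_add(values):
    """Sum values with a balanced adder tree; return (total, number of adder levels)."""
    level = list(values)
    if not level:
        return 0, 0
    depth = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad so every adder in this level has two inputs
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        depth += 1
    return level[0], depth
```

With 64 accumulator outputs, this reduction needs six adder levels (64, 32, 16, 8, 4, 2, 1).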
  • the post-processing circuit 125 (may be referred to as a post_proc unit) is used to truncate, saturate, and round the calculated data, select the data, and output the selected data to the corresponding data port of the data bus 10.
  • the operation result output by the post-processing circuit 125 such as master_a_wdata, may also have a data bit width of 512 bits, and the operation result may be stored in the data memory 11 through the data bus 10 .
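The rounding and saturation steps of post-processing can be sketched as follows. The text does not specify the rounding mode or the output widths, so the round-half-up rule, the shift parameter, and the function name post_process are assumptions made for illustration.

```python
def post_process(value, shift, out_bits, signed=True):
    """Round by dropping `shift` low bits, then saturate to an out_bits-wide result."""
    if shift > 0:
        # Assumed rounding mode: add half an LSB, then arithmetic right shift.
        value = (value + (1 << (shift - 1))) >> shift
    if signed:
        low, high = -(1 << (out_bits - 1)), (1 << (out_bits - 1)) - 1
    else:
        low, high = 0, (1 << out_bits) - 1
    # Saturation clamps out-of-range results instead of letting them wrap around.
    return max(low, min(high, value))
```

Under these assumptions, an accumulator value of 1000 reduced by 2 bits rounds to 250 and then saturates to 127 for a signed 8-bit output.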
  • the first control signal is fixed when processing the operation instruction, and the second control signal is dynamically changed when the operation instruction is processed.
  • the symbol type, instruction type, data extension type, etc. corresponding to the operation instruction are all the same, so the first control signal can be generated according to the fixed information when the operation instruction is executed;
  • the corresponding operations are decomposed into basic operations such as multiplication and addition that can be performed by hardware multipliers and adders, etc., and these basic operations can be processed in a pipelined manner on multiple pipeline stages in the data processing circuit 120.
  • the second control signal controls the operation of each pipeline stage, and controls the timing at which each pipeline stage is turned on and off, so as to realize the operation corresponding to the operation instruction.
  • the second decoding circuit 113 is configured to control the data loading circuit 121, the preprocessing circuit 122, the multiplier circuit 123, the accumulating circuit 124, and the post-processing circuit 125 in the data processing circuit 120 to perform the corresponding data processing in a pipeline manner.
  • the second decoding circuit 113 includes a data loading control circuit (may be called load_stg), a preprocessing control circuit (may be called pre_stg), a multiplication control circuit (may be called mult_stg), an accumulation control circuit (may be called acc_stg), and a post-processing control circuit (may be called post_stg).
  • the data loading control circuit, preprocessing control circuit, multiplication control circuit, accumulation control circuit, and post-processing control circuit are used to control, respectively, the data loading circuit 121, the preprocessing circuit 122, the multiplier circuit 123, the accumulation circuit 124, and the post-processing circuit 125 in the data processing circuit 120 to perform the corresponding data processing.
  • the structure is clearly divided, which facilitates the mapping of complex instructions, and the first control signal and the second control signal are separated, so that the DSP processor is highly extensible and can better support the operation of complex instructions. For example, when a new operation instruction needs to be added, the overall architecture of the DSP processor does not need to be adjusted; only the corresponding control signals need to be added, that is, the first control signal and the second control signal corresponding to the new operation instruction need to be determined.
  • the data processing circuit 120 is used to obtain multiple sets of data to be processed from the data bus 10 in parallel.
  • the data loading circuit 121 can support three sets of 512-bit wide data inputs, such as master_a_rdata, master_b_rdata, master_c_rdata, and can buffer the data read back from the three data ports a/b/c.
  • the second decoding circuit 113 is configured to transmit the decoded second control signal to the data processing circuit 120 when acquisition of the multiple groups of data to be processed is complete, and to suspend transmitting the decoded second control signal to the data processing circuit 120 when that acquisition is not complete. Therefore, the entire pipeline can be correctly controlled even when the data loading circuit 121 has not yet acquired all of the data to be processed. This prevents data processing errors when a conflict in reading from the data memory 11 keeps the data to be processed from being returned within a preset period of time. In addition, potentially conflicting data does not need to be organized into steps that guarantee conflict-free reads, which improves the efficiency of instruction execution.
  • the second decoding circuit 113 is configured to transmit the decoded second control signal to the data processing circuit 120 when it has received the read data valid signals corresponding to all of the multiple groups of data to be processed, where each read data valid signal is generated when acquisition of the corresponding group of data to be processed is complete.
  • the three pipeline stages marked ireg0 , ireg1 and ireg2 correspond to the three data ports a/b/c of the data loading circuit 121 one-to-one.
  • the three pipeline stages of ireg0, ireg1, and ireg2 can be used to indicate whether the read data on the corresponding data port is valid.
  • the second decoding circuit 113 transmits the decoded second control signal to the data processing circuit 120, and the pipeline starts to flow.
  • the second control signal mainly controls the operation of each pipeline stage in the data processing circuit 120. If the read data of one of the three data ports arrives late due to a conflict or another reason, the second decoding circuit 113 waits until the read data on all three data ports is valid, then transmits the decoded second control signal to the data processing circuit 120, and the pipeline begins to flow.
  • the second decoding circuit 113 waits until the 5th clock cycle, finds that all three groups of data are ready, and only then lets the pipeline flow again, so that the data processing circuit 120 processes the data to be processed in a pipeline manner.
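The wait-for-all-ports behavior can be sketched as a small cycle-by-cycle simulation. The function name pipeline_start_cycle and the per-port arrival cycles are hypothetical; only the rule that the pipeline starts once read data on all three ports is valid comes from the text.

```python
def pipeline_start_cycle(arrival_cycles):
    """Return the first cycle at which read data on every port is valid.

    arrival_cycles[i] is the (assumed) cycle at which port i's data arrives.
    """
    valid = [False] * len(arrival_cycles)
    cycle = 0
    while True:
        for port, arrival in enumerate(arrival_cycles):
            if cycle >= arrival:
                valid[port] = True  # models the per-port read data valid flag
        if all(valid):
            return cycle  # the second control signal is released here
        cycle += 1  # otherwise the pipeline stays stalled for this cycle
```

If ports a and b return data in cycle 1 but port c is delayed by a conflict until cycle 5, pipeline_start_cycle([1, 1, 5]) gives 5, matching the example of waiting until the 5th clock cycle.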
  • the second decoding circuit 113 suspends transmitting the decoded second control signal to the data processing circuit 120 until the operation result passes through the data bus.
  • the second decoding circuit 113 transmits the decoded second control signal to the data processing circuit 120 .
  • the post-processing circuit 125 can store the operation result in the data memory 11. If a conflict occurs at the data memory 11 and it cannot receive the data, the second decoding circuit 113 suspends transmitting the second control signal to the data processing circuit 120, suspends the execution of all pipeline stages, waits for the external buffer to receive the written data, and then restarts the execution of the pipeline. In this way, the second decoding circuit 113 correctly controls the entire pipeline and prevents data processing errors when the data memory 11 cannot receive the operation results in time. It is also unnecessary to temporarily store the operation results in some storage space and then move them into the data memory 11 afterwards, so the efficiency of instruction execution can be improved.
  • the multiplier circuit 123 in the data processing circuit 120 includes a four-stage pipeline operation circuit, wherein the first-stage pipeline operation circuit is used to perform the 16bit×8bit unsigned multiplication operation, the second-stage pipeline operation circuit is used to perform the 16bit×16bit multiplication and splicing operation and the 32bit×16bit multiplication and splicing operation, the third-stage pipeline operation circuit is used to perform the 32bit×32bit multiplication and splicing operation, and the fourth-stage pipeline operation circuit is used to perform shift processing on the result of the 32bit×16bit multiplication and splicing operation and/or the result of the 32bit×32bit multiplication and splicing operation.
  • the multiplier circuit 123 can implement the operations shown in Table 2 through multiplier splicing:
  • the preprocessing circuit 122 in the data processing circuit 120 takes the absolute value of the signed data, and the third-stage pipeline operation circuit combines the sign bit of the signed data with the result of the 32bit×16bit multiplication and splicing operation or the result of the 32bit×32bit multiplication and splicing operation (comb sign).
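Multiplier splicing builds wide products from narrow unsigned multipliers plus shifted additions, with the sign handled separately. The sketch below is one plausible decomposition, assuming each wider product is formed from two products of the next narrower kind; the function names and the exact partial-product arrangement are illustrative and are not taken from the application.

```python
def mul16x8_unsigned(a16, b8):
    """Base multiplier: 16-bit by 8-bit unsigned product."""
    assert 0 <= a16 < (1 << 16) and 0 <= b8 < (1 << 8)
    return a16 * b8


def mul16x16_unsigned(a16, b16):
    """Splice two 16x8 products: low byte of b plus high byte shifted by 8."""
    return mul16x8_unsigned(a16, b16 & 0xFF) + (mul16x8_unsigned(a16, b16 >> 8) << 8)


def mul32x16_unsigned(a32, b16):
    """Splice two 16x16 products: low half of a plus high half shifted by 16."""
    low = mul16x16_unsigned(a32 & 0xFFFF, b16)
    high = mul16x16_unsigned(a32 >> 16, b16)
    return low + (high << 16)


def mul32x32_unsigned(a32, b32):
    """Splice two 32x16 products: low half of b plus high half shifted by 16."""
    low = mul32x16_unsigned(a32, b32 & 0xFFFF)
    high = mul32x16_unsigned(a32, b32 >> 16)
    return low + (high << 16)


def mul32x32_signed(a, b):
    """Take absolute values first, multiply unsigned, then recombine the sign."""
    negative = (a < 0) != (b < 0)
    magnitude = mul32x32_unsigned(abs(a), abs(b))
    return -magnitude if negative else magnitude
```

Working on magnitudes keeps every multiplier unsigned, so only one small step at the end needs to know about the sign.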
  • the first-stage pipeline operation circuit is used to calculate the 16bit×8bit unsigned multiplication;
  • the second-stage pipeline operation circuit includes two stages of addition: the first-stage addition completes the 16bit×16bit multiplication and splicing, and the second-stage addition completes the 32bit×16bit multiplication and splicing;
  • the third-stage pipeline operation circuit includes one stage of addition, completes the 32bit×32bit multiplication and splicing, and finally combines the unsigned result with the sign bit;
  • the fourth-stage pipeline operation circuit completes the shift function for the 32bit×16bit and 32bit×32bit results, and includes two shifters.
  • the input data of the first shifter is 64 bits wide, the shift range is a right shift of 0 to 32 bits, and the output data bit width is 48 bits;
  • the input data of the second shifter is 48 bits wide, the shift range is a right shift of 0 to 16 bits, and the output data bit width is 48 bits.
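The two shifters can be modeled with masks and right shifts. The widths and shift ranges below come from the text; treating the shift as logical on an unsigned value and keeping the low 48 bits of the shifted result are assumptions, as are the function names.

```python
MASK64 = (1 << 64) - 1
MASK48 = (1 << 48) - 1


def shifter1(value, shift):
    """64-bit input, right shift by 0..32 bits, 48-bit output (low bits kept)."""
    if not 0 <= shift <= 32:
        raise ValueError("shift must be in 0..32")
    return ((value & MASK64) >> shift) & MASK48


def shifter2(value, shift):
    """48-bit input, right shift by 0..16 bits, 48-bit output."""
    if not 0 <= shift <= 16:
        raise ValueError("shift must be in 0..16")
    return ((value & MASK48) >> shift) & MASK48
```

A 64-bit product only fits the 48-bit output without loss when the chosen shift removes enough low bits, which is why the first shifter offers the wider range.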
  • the data processing circuit 120 may read two 512-bit image data and one 512-bit coefficient from the data memory 11 each time.
  • the image data to be processed is stored in the ireg0 cached on-chip.
  • the coefficients are stored in ireg1 in the on-chip cache
  • the image data to be processed includes a(0), a(1), a(2)...a(127) in the first line
  • the coefficients include, for example, b(0), b(1), b(2) in the first line.
  • 128 unsigned multipliers, such as char×char multipliers
  • step 1 image operations, such as convolution operations
  • the multiplier circuit 123 needs to complete the following operations, where step refers to the interval between two adjacent pixels inside a single window, and stride is the number of pixels the window moves each time it slides.
  • the preprocessing circuit 122 is used to select the data source before the multiplier. For example, the data is arranged according to the above operation modes corresponding to step and stride, and then sent to the multiplier circuit 123 for multiplication and addition operations.
  • the preprocessing circuit 122 can realize the permutation and combination of multiplier data sources when step and stride are various situations, so as to realize the DSP processor's support for sliding window operation when step and stride are various numerical values.
  • the operation of the second row can be completed by referring to the operation process of the first row
  • the image data to be processed and the third row of the image data to be processed can be read.
  • the operation of the third row can be completed by referring to the operation process of the first row.
  • the multiplier circuit 123 can perform 64 multiplication and addition operations of 8bit×8bit+8bit×8bit, and can also perform 64 multiplication operations of 16bit×16bit.
  • the accumulating circuit 124 completes the accumulating operation, and the output data of the multiplier circuit 123 is sent directly to the accumulator for accumulation. When all the coefficients of a coefficient matrix have been processed, the data in the accumulator is written out to the on-chip cache of the DSP processor.
  • the post-processing circuit 125 completes the post-processing of the output data, such as truncation of the output data.
  • the data loading circuit 121 in the data processing circuit 120 may be used to perform data sliding processing of sliding window instructions.
  • the data loading circuit 121 registers the data on the data exchange bus as the data to be processed, and performs data sliding processing of the sliding window instruction.
  • the input of the sliding window operation is an image data matrix to be processed and a coefficient matrix.
  • the operation of the sliding window operation is that the coefficient matrix slides on the image, and all the pixels and coefficients in the sliding area are multiplied and accumulated, and finally an output data is obtained.
  • for example, for a 258x258 image matrix and a 3x3 coefficient matrix, after the coefficient matrix slides over the image matrix, a 256x256 output matrix is finally obtained.
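The sliding window operation and its output size can be written out directly. This is a plain reference sketch, not the hardware data flow: the function names are hypothetical, and step and stride follow the definitions given earlier (interval between adjacent pixels inside a window, and pixels moved per slide).

```python
def output_size(image_size, window_size, step=1, stride=1):
    """Number of window positions along one dimension."""
    span = (window_size - 1) * step + 1  # pixels covered by one window
    return (image_size - span) // stride + 1


def sliding_window(image, coeffs, step=1, stride=1):
    """Multiply-accumulate the coefficient matrix over every window position."""
    kh, kw = len(coeffs), len(coeffs[0])
    out_h = output_size(len(image), kh, step, stride)
    out_w = output_size(len(image[0]), kw, step, stride)
    out = [[0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r * stride + i * step][c * stride + j * step] * coeffs[i][j]
            out[r][c] = acc
    return out
```

With step and stride both equal to 1, output_size(258, 3) is 256, which matches the 258x258 image and 3x3 coefficient example.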
  • the sliding window operation includes a large number of multiply-accumulate operations. If it is implemented by calling the multiply-accumulate operation of a general-purpose processor, that operation must be called many times, and the instructions are scheduled at the software level. Because software has poor real-time behavior, this greatly increases the processing time of the sliding window operation, so the real-time performance of this implementation is very low. In addition, this implementation needs to read and write the on-chip cache many times, and reading and writing the on-chip cache consumes a lot of power, so this implementation has high power consumption.
  • the MAC (multiply-accumulate) utilization of ordinary multiply-accumulate operations in general-purpose processors is usually not high: for a processor with a data bus width of 512 bits that processes char-type multiply-accumulate operations, only 64 char-type values can be loaded in one clock cycle, while the processor can process up to 128 char-type multiply-accumulate operations, so the MAC utilization is only 50%.
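The 50% figure follows directly from the bus width. A minimal worked calculation, assuming a char is 8 bits wide (the constant names are illustrative only):

```python
BUS_WIDTH_BITS = 512        # data bus width quoted in the text
CHAR_BITS = 8               # assumption: one char value is 8 bits
PEAK_MACS_PER_CYCLE = 128   # peak char multiply-accumulates quoted in the text

values_loaded_per_cycle = BUS_WIDTH_BITS // CHAR_BITS
mac_utilization = values_loaded_per_cycle / PEAK_MACS_PER_CYCLE
print(values_loaded_per_cycle, mac_utilization)
```

The load bandwidth, not the multiplier array, is the limiting factor in this case.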
  • O[0,1], O[1,0], etc. can also be calculated through the sliding window.
  • the data loading circuit 121, the preprocessing circuit 122, the multiplier circuit 123, the accumulating circuit 124, and the post-processing circuit 125 in the data processing circuit 120 are designed with a full pipeline structure, and the processing of each circuit is independent and can be performed in parallel, so that the DSP processing device has higher performance. Reading 2 pieces of image data and sliding the window inside the circuit greatly reduces the number of on-chip cache reads, so the power consumption is low.
  • in the DSP processor provided by the embodiments of the present application, the first decoding circuit and the second decoding circuit are set in the control circuit of the DSP processor, so that the structure is clearly divided, complex instructions are easy to map, and the first control signal and the second control signal are separated.
  • the separation of control signals makes the DSP processor more scalable and better able to support complex instructions. For example, when a new operation instruction needs to be added, the overall architecture of the DSP processor does not need to be adjusted; only the corresponding control signals need to be added, that is, the first control signal and the second control signal corresponding to the new operation instruction are determined.
  • FIG. 8 is a schematic flowchart of a parallel computing method for a DSP processor provided by another embodiment of the present application.
  • the parallel computing method is applied in a DSP processor.
  • the DSP processor includes a data bus connected to the data memory, a program bus connected to the program memory, and a control circuit and a data processing circuit, wherein the control circuit includes a read request generating circuit, a first decoding circuit and a second decoding circuit.
  • the parallel computing method of this embodiment includes steps S210 to S240.
  • the read request generation circuit outputs a read data request to the data bus according to an operation instruction, the operation instruction is obtained from the program memory through the program bus, and the read data request is used by the data processing circuit to obtain the data to be processed from the data memory through the data bus;
  • the first decoding circuit performs first-level decoding on the operation instruction, and transmits the first control signal obtained by the first-level decoding to the data processing circuit;
  • the second decoding circuit performs second-level decoding on the first control signal, and transmits the second control signal obtained by the second-level decoding to the data processing circuit;
  • the data processing circuit performs an operation on the data to be processed according to the first control signal and the second control signal, and stores the operation result to the data memory through the data bus;
  • the first control signal is uniquely determined by the operation instruction, and the second control signal is used to determine the processing flow of the data processing circuit according to the operation instruction.
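The split between the two decoding levels can be illustrated with a small software model. It is a sketch under stated assumptions, not the patented circuit: the instruction names, the fields of `FirstControl`, the `num_beats` count and the five-stage schedule are all hypothetical. The point is only that the first control signal is a fixed function of the instruction, while the second control signal is derived from it and changes from cycle to cycle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirstControl:
    """Static control word: fixed while one instruction is processed."""
    opcode: str
    operand_bits: int
    num_beats: int  # hypothetical: number of data beats the instruction needs


# Hypothetical instruction table; supporting a new instruction adds a row here.
FIRST_LEVEL_TABLE = {
    "SLIDE_WIN_8": FirstControl("slide_win", 8, 4),
    "MUL_16": FirstControl("mul", 16, 2),
}

STAGES = ("load", "preprocess", "multiply", "accumulate", "postprocess")


def first_level_decode(instruction):
    """First-level decoding: the result depends only on the instruction."""
    return FIRST_LEVEL_TABLE[instruction]


def second_level_decode(first, cycle):
    """Second-level decoding: which pipeline stages are active in `cycle`."""
    # Stage number s works on beat (cycle - s); it is active while that beat exists.
    return {stage: 0 <= cycle - s < first.num_beats
            for s, stage in enumerate(STAGES)}


first = first_level_decode("SLIDE_WIN_8")
for cycle in range(first.num_beats + len(STAGES) - 1):
    enables = second_level_decode(first, cycle)
    print(cycle, [stage for stage in STAGES if enables[stage]])
```

In this model, adding an instruction touches only the table and, if needed, the schedule function, which mirrors the scalability argument made for separating the two control signals.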
  • the method further includes: the configuration bus interface circuit of the control circuit configures the read request generation circuit according to the configuration signal of the configuration bus, so that the read request generation circuit, based on the configuration, outputs a read data request to the data bus according to the operation instruction.
  • the read data request includes a source address, a target address, a read request valid signal, and a data amount.
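The four fields of the read data request can be pictured as a small record. The sketch below is illustrative only: the field types, the unit of `data_amount`, and the helper `make_read_requests` (which splits one transfer into bursts) are assumptions that do not come from the application.

```python
from dataclasses import dataclass


@dataclass
class ReadDataRequest:
    source_address: int   # where the data to be processed is read from
    target_address: int   # where the data is delivered
    valid: bool           # read request valid signal
    data_amount: int      # assumption: counted in bytes


def make_read_requests(source, target, total, burst):
    """Hypothetical helper: split `total` bytes into requests of at most `burst`."""
    requests = []
    offset = 0
    while offset < total:
        amount = min(burst, total - offset)
        requests.append(
            ReadDataRequest(source + offset, target + offset, True, amount))
        offset += amount
    return requests


requests = make_read_requests(0x1000, 0x0, total=150, burst=64)
for request in requests:
    print(request)
```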
  • the method further includes:
  • the configuration bus interface circuit outputs a register configuration signal to the data processing circuit.
  • the first control signal is fixed when the operation instruction is processed, and the second control signal changes dynamically when the operation instruction is processed.
  • the second control signal is used to control the states of a plurality of pipeline stages in the data processing circuit.
  • the data processing circuit includes:
  • a data loading circuit for registering the data to be processed
  • the post-processing circuit is used for truncating, saturating and/or rounding the data after the operation, and storing the processed data to the data memory through the data bus.
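The three post-processing steps (cutting off low-order bits, saturating, rounding) can be sketched as one function. The parameter names, the round-half-up rule and the default 8-bit signed output range are assumptions chosen for illustration; the application does not specify them here.

```python
def post_process(acc, shift, out_bits=8, signed=True, rounding=True):
    """Reduce a wide accumulator value to `out_bits` bits."""
    if rounding and shift > 0:
        acc += 1 << (shift - 1)   # add half an output LSB before shifting
    acc >>= shift                 # drop the low-order bits
    if signed:
        low, high = -(1 << (out_bits - 1)), (1 << (out_bits - 1)) - 1
    else:
        low, high = 0, (1 << out_bits) - 1
    return max(low, min(high, acc))  # saturate instead of wrapping around


print(post_process(1000, 4))                  # rounded
print(post_process(1000, 4, rounding=False))  # truncated only
print(post_process(5000, 4))                  # saturates at the upper bound
print(post_process(-5000, 4))                 # saturates at the lower bound
```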
  • the second control signal controls the data loading circuit, the preprocessing circuit, the multiplier circuit, the accumulation circuit, and the post-processing circuit in the data processing circuit to perform corresponding data processing in a pipeline manner.
  • the data processing circuit is configured to obtain multiple sets of data to be processed from the data bus in parallel;
  • when the acquisition of the multiple groups of data to be processed is completed, the second decoding circuit transmits the second control signal obtained by decoding to the data processing circuit; when the acquisition of the multiple groups of data to be processed is not completed, the second decoding circuit suspends transmitting the decoded second control signal to the data processing circuit.
  • the second decoding circuit transmits the second control signal obtained by decoding to the data processing circuit when receiving the read data valid signals corresponding to the multiple groups of data to be processed, and each read data valid signal is obtained when the to-be-processed data of the corresponding group is acquired.
  • the second decoding circuit suspends transmitting the decoded second control signal to the data processing circuit until the operation result is stored in the data memory through the data bus, after which the second decoding circuit transmits the decoded second control signal to the data processing circuit.
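The stall conditions above amount to a simple gating rule on the second control signal. A minimal sketch, assuming one read data valid flag per group and a single hypothetical `write_pending` flag that stays set while an operation result is still waiting to be stored:

```python
def second_control_enabled(read_valid, write_pending):
    """Return True when the second decoder may send its control signal.

    read_valid    -- one flag per group of data to be processed
    write_pending -- hypothetical flag: a result has not yet been stored
    """
    return all(read_valid) and not write_pending


# Only one of two groups has arrived: the decoder holds its output.
print(second_control_enabled([True, False], write_pending=False))
# Both groups have arrived and no store is outstanding: the decoder proceeds.
print(second_control_enabled([True, True], write_pending=False))
# Both groups have arrived but a store is outstanding: the decoder holds again.
print(second_control_enabled([True, True], write_pending=True))
```

Because the pipeline stages are driven by the second control signal, holding that signal stalls the whole data path without changing the first control signal.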
  • the multiplier circuit in the data processing circuit includes a four-stage pipeline operation circuit, wherein the first-stage pipeline operation circuit is used for performing 16bit×8bit unsigned multiplication operations, the second-stage pipeline operation circuit is used for performing 16bit×16bit multiplication and splicing operations and 32bit×16bit multiplication and splicing operations, the third-stage pipeline operation circuit is used for performing 32bit×32bit multiplication and splicing operations, and the fourth-stage pipeline operation circuit is used for shifting the results of the 32bit×16bit multiplication and splicing operations and/or the results of the 32bit×32bit multiplication and splicing operations.
  • the preprocessing circuit in the data processing circuit takes the absolute value of the signed data
  • the third-stage pipeline operation circuit combines the sign bit of the signed data that was converted into an absolute value with the result of the 32bit×16bit multiplication and splicing operation or the result of the 32bit×32bit multiplication and splicing operation.
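The idea of building wide products from 16bit×8bit unsigned multiplications, and of handling signed operands through their absolute values, can be sketched as follows. The exact partial-product arrangement in the circuit is not given in this section, so the decomposition below is an assumption; it shows one standard way to splice narrow products into wider ones and then reattach the sign.

```python
def mul16x8(a16, b8):
    """Basic building block: unsigned 16bit x 8bit product."""
    assert 0 <= a16 < (1 << 16) and 0 <= b8 < (1 << 8)
    return a16 * b8


def mul16x16(a16, b16):
    """Splice two 16x8 partial products into an unsigned 16x16 product."""
    low = mul16x8(a16, b16 & 0xFF)
    high = mul16x8(a16, b16 >> 8)
    return (high << 8) + low


def mul32x32(a32, b32):
    """Splice four 16x16 products into an unsigned 32x32 product."""
    a_low, a_high = a32 & 0xFFFF, a32 >> 16
    b_low, b_high = b32 & 0xFFFF, b32 >> 16
    cross = mul16x16(a_high, b_low) + mul16x16(a_low, b_high)
    return (mul16x16(a_high, b_high) << 32) + (cross << 16) + mul16x16(a_low, b_low)


def signed_mul32(x, y):
    """Multiply absolute values, then recombine the sign with the result."""
    negative = (x < 0) != (y < 0)
    magnitude = mul32x32(abs(x), abs(y))
    return -magnitude if negative else magnitude


print(signed_mul32(-123456, 7890) == -123456 * 7890)
```

Working on absolute values lets every stage reuse the same unsigned multiplier, at the cost of one sign-recombination step near the end of the pipeline.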

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Advance Control (AREA)

Abstract

The present invention relates to a digital signal processor (DSP) comprising a data bus connected to a data memory and a program bus connected to a program memory, and further comprising a control circuit and a data processing circuit. The control circuit comprises: a read request generation circuit, which is used to acquire data to be processed from the data memory; a first decoding circuit, which is used to perform first-level decoding on an operation instruction and transmit, to the data processing circuit, a first control signal obtained by the first-level decoding; and a second decoding circuit, which is used to perform second-level decoding on the first control signal and transmit, to the data processing circuit, a second control signal obtained by the second-level decoding. The data processing circuit is used to perform an operation on the data to be processed according to the first control signal and the second control signal, and to store an operation result in the data memory via the data bus. The present invention can support the operation of a complex instruction. The invention further relates to a parallel computing method for a DSP.
PCT/CN2020/141848 2020-12-30 2020-12-30 Dsp et son procédé de calcul parallèle WO2022141321A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/141848 WO2022141321A1 (fr) 2020-12-30 2020-12-30 Dsp et son procédé de calcul parallèle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/141848 WO2022141321A1 (fr) 2020-12-30 2020-12-30 Dsp et son procédé de calcul parallèle

Publications (1)

Publication Number Publication Date
WO2022141321A1 true WO2022141321A1 (fr) 2022-07-07

Family

ID=82260040

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/141848 WO2022141321A1 (fr) 2020-12-30 2020-12-30 Dsp et son procédé de calcul parallèle

Country Status (1)

Country Link
WO (1) WO2022141321A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101957743A (zh) * 2010-10-12 2011-01-26 中国电子科技集团公司第三十八研究所 并行数字信号处理器
CN102750133A (zh) * 2012-06-20 2012-10-24 中国电子科技集团公司第五十八研究所 支持simd的32位三发射的数字信号处理器
CN107450888A (zh) * 2016-05-30 2017-12-08 世意法(北京)半导体研发有限责任公司 嵌入式数字信号处理器中的零开销循环
WO2018005718A1 (fr) * 2016-06-30 2018-01-04 Intel Corporation Système et procédé de décodage agrégé dans le désordre

Similar Documents

Publication Publication Date Title
KR102443546B1 (ko) 행렬 곱셈기
CN109542515B (zh) 运算装置及方法
US10140124B2 (en) Reconfigurable microprocessor hardware architecture
JP6243000B2 (ja) マルチモードベクトル処理を提供するためのプログラム可能データ経路構成を有するベクトル処理エンジン、ならびに関連ベクトルプロセッサ、システム、および方法
US20120278590A1 (en) Reconfigurable processing system and method
CN111142938B (zh) 一种异构芯片的任务处理方法、任务处理装置及电子设备
CN111651205B (zh) 一种用于执行向量内积运算的装置和方法
US8595467B2 (en) Floating point collect and operate
CN111651203B (zh) 一种用于执行向量四则运算的装置和方法
CN116521229A (zh) 一种基于risc-v向量指令扩展的低硬件开销向量处理器架构
TW202217600A (zh) 向量運算裝置和方法
CN113407483B (zh) 一种面向数据密集型应用的动态可重构处理器
CN116888591A (zh) 一种矩阵乘法器、矩阵计算方法及相关设备
WO2022141321A1 (fr) Dsp et son procédé de calcul parallèle
US20190056941A1 (en) Reconfigurable microprocessor hardware architecture
WO2019023910A1 (fr) Procédé et dispositif de traitement de données
US20070198811A1 (en) Data-driven information processor performing operations between data sets included in data packet
CN112074810A (zh) 并行处理设备
CN111353124A (zh) 运算方法、装置、计算机设备和存储介质
US20130262819A1 (en) Single cycle compare and select operations
CN112463218B (zh) 指令发射控制方法及电路、数据处理方法及电路
US8332447B2 (en) Systems and methods for performing fixed-point fractional multiplication operations in a SIMD processor
Rettkowski et al. Application-specific processing using high-level synthesis for networks-on-chip
US20160162290A1 (en) Processor with Polymorphic Instruction Set Architecture
Pezzarossa et al. Interfacing hardware accelerators to a time-division multiplexing network-on-chip

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20967660

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20967660

Country of ref document: EP

Kind code of ref document: A1