CN105373367A - Vector single instruction multiple data-stream (SIMD) operation structure supporting synergistic working of scalar and vector


Info

Publication number
CN105373367A
Authority
CN
China
Prior art keywords
vector
vpe
processing unit
instruction
vectorial
Prior art date
Legal status
Granted
Application number
CN201510718729.7A
Other languages
Chinese (zh)
Other versions
CN105373367B (en)
Inventor
陈书明
彭元喜
雷元武
万江华
郭阳
田甜
彭浩
徐恩
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201510718729.7A priority Critical patent/CN105373367B/en
Publication of CN105373367A publication Critical patent/CN105373367A/en
Application granted granted Critical
Publication of CN105373367B publication Critical patent/CN105373367B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38: Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3885: Concurrent instruction execution using a plurality of independent parallel functional units
    • G06F 9/3887: Concurrent instruction execution using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]


Abstract

The invention discloses a vector single instruction multiple data-stream (SIMD) operation structure supporting the synergistic working of a scalar and a vector. The vector SIMD operation structure comprises a unified fetch and instruction issuing part, a scalar processing unit SPU, a vector processing unit VPU, a vector array memory AM, and a direct memory access (DMA) unit. The unified fetch and instruction issuing part is used for simultaneously issuing instructions to the scalar processing unit SPU, the vector processing unit VPU, and the vector array memory AM; the scalar processing unit SPU is used for processing serial tasks and controlling the execution of the vector processing unit VPU; the vector processing unit VPU is used for processing computation-intensive parallel tasks; the vector array memory AM is used for providing data and data-movement support for parallel, multi-width vector operations; and the DMA unit is used for providing instructions and data for the scalar processing unit SPU and the vector processing unit VPU. The vector SIMD operation structure can comprehensively improve execution efficiency and parallelism.

Description

Vector SIMD operation structure supporting scalar-vector cooperative work
Technical field
The present invention mainly relates to the field of microprocessor architecture and design, and in particular to a vector SIMD operation structure supporting scalar-vector cooperative work.
Background art
As a typical embedded microprocessor, the digital signal processor (Digital Signal Processor, DSP) is widely used in embedded systems. With its strong data-processing capability, good programmability, flexible use, and low power consumption, it has brought great opportunities to the development of signal processing, and its applications have extended to many aspects of military and economic development. In applications such as modern communications, image processing, and radar signal processing, growing data volumes and rising requirements on computational precision and real-time performance usually call for higher-performance microprocessors.
Unlike a traditional CPU, a DSP has the following characteristics: (1) strong computing capability, with more attention to real-time computation than to control and transaction processing; (2) dedicated hardware support for typical signal processing, such as multiply-accumulate operations and linear addressing; (3) the common features of embedded microprocessors: address and instruction paths of no more than 32 bits and most data paths of no more than 32 bits; imprecise interrupts; a working mode of short-term offline debugging and long-term online resident operation (rather than the debug-and-run method of a general-purpose CPU); (4) integrated peripheral interfaces oriented to fast off-chip I/O, which is especially convenient for online transmission of high-speed AD/DA data and also supports high-speed direct connection between DSPs.
General scientific computing needs high-performance DSPs, but a traditional DSP has the following shortcomings when used for scientific computing: (1) its bit width is small, so computational precision and addressing space are insufficient, while general scientific computing applications need at least 64-bit precision; (2) it lacks software and hardware support for task management, file control, process scheduling, and interrupt management, in other words an operating-system hardware environment, which makes the management of general, multi-channel computing tasks inconvenient; (3) it lacks support for a unified high-level-language programming model, so support for multi-core, vector, and data parallelism basically relies on assembly programming, which is inconvenient for general-purpose programming; (4) it does not support a local-host program debugging model and relies only on cross debugging and emulation from another machine. These problems severely limit the application of DSPs in the field of general scientific computing.
Practitioners have proposed a "general-purpose computing digital signal processor" (GPDSP), a new multi-core microprocessor architecture that both keeps the essential embedded characteristics and the high performance and low power of a DSP and efficiently supports general scientific computing. This architecture can overcome the above problems of common DSPs in scientific computing and can simultaneously provide efficient support for 64-bit high-performance computers and embedded high-precision signal processing. It has the following features: (1) direct representation of double-precision floating-point and 64-bit fixed-point data, with general-purpose registers, data buses, and instruction bit widths of 64 bits or more and address buses of 40 bits or more; (2) tightly coupled heterogeneous CPU and DSP cores, where the CPU core supports a complete operating system and the scalar unit of the DSP core supports an operating-system microkernel; (3) a unified programming model covering the CPU core, the DSP core, and the vector array structure inside the DSP core; (4) retention of cross debugging and emulation from another machine while also providing a local-CPU-host debugging mode; (5) retention of the essential characteristics of common DSPs except for the bit width.
Other practitioners have proposed a "data shuffling unit with a switch-matrix memory", which discloses an implementation structure of a data shuffling unit and a data shuffling method: the shuffle requests in a program are converted into switch matrices in the switch-matrix memory, thereby realizing data selection and recombination. This shuffling unit has the advantages of a simple structure, flexibility, efficiency, and arbitrary-node shuffling.
A GPDSP usually forms a processing array from multiple homogeneous 64-bit processing units to obtain high floating-point computing capability. However, when a GPDSP uses numerous processing units to exploit the parallelism of general scientific computing, the following problems remain: (1) how to organize the numerous homogeneous processing units so that they efficiently exploit the many levels of parallelism in general scientific computing; (2) how to effectively coordinate the scalar operation unit used for control with the vector operation units used for computation; (3) how to support the matrix-type operations in general scientific computing, using the massive data-reuse characteristics of matrix operations to improve the data-feeding capability to the numerous homogeneous processing units and thereby improve the computing efficiency of the GPDSP.
Summary of the invention
The technical problem to be solved by the present invention is: in view of the technical problems of the prior art, the present invention provides a vector SIMD operation structure supporting scalar-vector cooperative work that can improve execution efficiency and parallelism.
To solve the above technical problems, the present invention adopts the following technical solution:
A vector SIMD operation structure supporting scalar-vector cooperative work, which comprises:
a unified fetch and instruction dispatch unit, used to dispatch instructions simultaneously to the scalar processing unit SPU, the vector processing unit VPU, and the vector array memory AM;
a scalar processing unit SPU, responsible for processing serial tasks and for controlling the execution of the vector processing unit VPU;
a vector processing unit VPU, responsible for processing computation-intensive parallel tasks;
a vector array memory AM, used to provide data and data-movement support for parallel, multi-width vector operations;
a DMA unit, used to provide instructions and data for the scalar processing unit SPU and the vector processing unit VPU.
As a further improvement of the present invention: the unified fetch and instruction dispatch unit adopts, in execution, a variable-length (N_SI + N_VI)-issue VLIW instruction structure, simultaneously fetching and dispatching N_SI scalar instructions and N_VI vector instructions; these N_SI + N_VI instructions simultaneously support conditional execution, interrupts, and exception handling.
As a further improvement of the present invention: the scalar processing element SPE consists of N_SMAC MAC units and N_SIEU fixed-point execution units IEU; these N_SI pipelines execute in parallel the N_SI scalar instructions in a VLIW instruction packet, performing the serial operations in scientific applications, where N_SI = N_SMAC + N_SIEU.
As a further improvement of the present invention: the vector processing unit VPU consists of N_VPE homogeneous vector processing elements VPE, which perform identical operations on different data under the control of a unified instruction stream, where N_VPE is a power of 2.
As a further improvement of the present invention: each vector processing element VPE comprises N_VMAC MAC units and N_VIEU fixed-point execution units IEU; these N_VI pipelines execute in parallel the N_VI vector instructions in a VLIW instruction packet, performing the parallel operations in scientific applications, where N_VI = N_VMAC + N_VIEU.
As a further improvement of the present invention: the data interaction between the vector processing elements VPE is accomplished through a reduction network and a shuffle network.
As a further improvement of the present invention: 64-bit configuration paths are designed between the scalar processing unit SPU and both the vector processing unit VPU and the vector array memory AM, enabling global access to the control and configuration registers in the vector processing unit VPU and the vector array memory AM via MOV instructions.
As a further improvement of the present invention: between the scalar processing unit SPU and the vector processing unit VPU there are also two data-broadcast paths from the SPU to the VPU, supporting a single-word broadcast instruction and a double-word broadcast instruction respectively;
the single-word broadcast instruction broadcasts a single word in the SPU register file to the same position in the vector register files of the N_VPE VPEs; during execution it performs one write operation on the register files of the N_VPE VPEs, completing the transfer of 64*N_VPE bits of data;
the double-word broadcast instruction broadcasts a pair of data Src_o:Src_e in the SPU register file to Dst_o:Dst_e in the register files of the N_VPE VPEs, where a register pair is denoted by its even-numbered register, i.e. VR0 denotes VR1:VR0; during execution it performs one write operation on the register files of the N_VPE VPEs, completing the transfer of 128*N_VPE bits of data;
executing double-word broadcast operations in parallel on the two scalar-vector broadcast paths can achieve the transfer of 256*N_VPE bits of data.
Compared with the prior art, the advantages of the present invention are:
1. The present invention is a tightly coupled vector SIMD (Single Instruction Multiple Data-stream) operation structure with scalar-vector cooperative work, suitable for the multi-core microprocessor GPDSP. It adopts a variable-length multi-issue VLIW (Very Long Instruction Word) instruction structure, simultaneously fetching and dispatching N_SI scalar instructions and N_VI vector instructions, with the scalar processing element SPE and the vector processing elements VPE simultaneously executing the parallel instructions in the VLIW. The vector operation unit in this structure comprises N_VPE (a power of 2) homogeneous vector processing elements VPE that execute identical instructions on different data; the data interaction between the SPE and the VPEs is accomplished through register-level data sharing and a fast scalar-to-vector broadcast mechanism, while the data interaction between the VPEs is accomplished through a reduction network and a shuffle network. These data-interaction mechanisms efficiently support matrix-type and signal-processing applications.
2. In the vector SIMD operation structure supporting scalar-vector cooperative work of the present invention, the scalar processing unit and the vector processing unit are organized in a tightly coupled manner, realizing variable-length multi-issue VLIW instructions through the unified fetch and dispatch unit. This organization exploits the execution efficiency of the scalar and vector processing units to the greatest extent.
3. In the vector SIMD operation structure supporting scalar-vector cooperative work of the present invention, multiple parallel mechanisms fully exploit the parallelism in an application, including sub-word SIMD technology inside the arithmetic units, multi-execution-pipeline VLIW parallel technology inside the scalar and vector units, and vector SIMD technology across the vector processing elements.
4. In the vector SIMD operation structure supporting scalar-vector cooperative work of the present invention, several data-transfer mechanisms between the scalar and vector units and inside the vector unit realize fast data interaction between the arithmetic units, improving the execution efficiency of high-performance computing applications.
5. In the vector SIMD operation structure supporting scalar-vector cooperative work of the present invention, through the above structural organization, the multiple parallel mechanisms, and the several data-transfer mechanisms, the potential parallelism of core algorithms (such as matrix multiplication and FFT) is fully exploited, improving the execution efficiency of the GPDSP.
Brief description of the drawings
Fig. 1 is a structural schematic diagram of the present invention.
Fig. 2 is a structural schematic diagram of the scalar processing unit in the present invention.
Fig. 3 is a principle schematic diagram of the vector processing unit in the present invention.
Fig. 4 is an example diagram of matrix-vector multiplication performed by the present invention in a specific application example.
Fig. 5 is a schematic diagram of the shuffle-based FFT computation process of the present invention in a specific application example.
Fig. 6 is a schematic diagram of the arrangement of the VLIW instruction slots in the FFT computation process of the present invention in a specific application example.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
The present invention is a high-performance general-purpose digital signal processor (GPDSP) with a variable-length multi-issue very-long-instruction-word structure, oriented to high-performance computing and also suitable for wireless communication, video, and image processing. This GPDSP is a multi-core processor with a new architecture that is suitable for 64-bit general scientific computing while retaining the essential characteristics of an embedded DSP. Its advantage is that it maintains both the essential characteristics of a DSP and its high-performance, low-power advantage while efficiently supporting general scientific computing; it can overcome the general problems of common DSPs in scientific computing and simultaneously provide efficient support for 64-bit high-performance computers and embedded high-precision signal processing.
Fig. 1 shows the overall structure of a kernel in the vector SIMD operation structure supporting scalar-vector cooperative work of the present invention. The kernel adopts a Harvard architecture, storing instructions and data separately. The structure of the present invention comprises a unified fetch and instruction dispatch unit, a scalar processing unit SPU, a vector processing unit VPU, a vector array memory AM, and a DMA unit, wherein:
the unified fetch and instruction dispatch unit is used to dispatch instructions simultaneously to the scalar processing unit SPU, the vector processing unit VPU, and the vector array memory AM;
the scalar processing unit SPU is responsible for processing serial tasks and for controlling the execution of the vector processing unit VPU; the scalar processing element SPE consists of N_SMAC MAC units and N_SIEU fixed-point execution units IEU, and these N_SI (N_SI = N_SMAC + N_SIEU) pipelines execute in parallel the N_SI scalar instructions in a VLIW instruction packet, performing the serial operations in scientific applications;
the vector processing unit VPU is responsible for processing computation-intensive parallel tasks; the VPU consists of N_VPE (a power of 2) homogeneous vector processing elements (VPE: Vector Processing Element), which perform identical operations on different data under the control of a unified instruction stream; each VPE comprises N_VMAC MAC units and N_VIEU fixed-point execution units IEU, and these N_VI (N_VI = N_VMAC + N_VIEU) pipelines execute in parallel the N_VI vector instructions in a VLIW instruction packet, performing the parallel operations in scientific applications;
the vector array memory AM is used to provide data and data-movement support for parallel, multi-width vector operations;
the DMA unit is used to provide instructions and data for the scalar processing unit SPU and the vector processing unit VPU.
In a specific application example, the above kernel adopts a variable-length (N_SI + N_VI)-issue VLIW (Very Long Instruction Word) instruction structure and can simultaneously fetch and dispatch N_SI scalar instructions and N_VI vector instructions; these N_SI + N_VI instructions simultaneously support conditional execution, interrupts, and exception handling.
The instruction cache (ICache) is designed as a 2-way set-associative cache, adopting a read-allocate strategy and a least-recently-used cache-line replacement policy. The ICache size is 64 KB, and the access time on a hit is 1 cycle. The ICache fetches instruction packets through the EMI interface according to the requests of the fetch and dispatch unit.
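As a concrete illustration of this cache organization, the following C sketch models the index/tag split and the LRU replacement of a 64 KB, 2-way set-associative instruction cache. The 128-byte line size (one 1024-bit instruction packet per line) and all identifiers are assumptions made for illustration, not details taken from the patent.

```c
#include <stdint.h>
#include <stdbool.h>

#define CACHE_BYTES (64 * 1024)
#define WAYS        2
#define LINE_BYTES  128                                  /* assumed: one 1024-bit packet per line */
#define SETS        (CACHE_BYTES / (WAYS * LINE_BYTES))  /* 256 sets */

typedef struct {
    uint64_t tag[WAYS];
    bool     valid[WAYS];
    uint8_t  lru;            /* with 2 ways, a single index marks the LRU way */
} icache_set_t;

static icache_set_t sets[SETS];

/* Returns true on a hit (1-cycle access); on a miss the line is
 * read-allocated into the least-recently-used way, modelling a fetch
 * of the instruction packet over the EMI interface. */
bool icache_access(uint64_t addr)
{
    uint64_t set = (addr / LINE_BYTES) % SETS;
    uint64_t tag = addr / (LINE_BYTES * SETS);
    icache_set_t *s = &sets[set];

    for (int w = 0; w < WAYS; w++) {
        if (s->valid[w] && s->tag[w] == tag) {
            s->lru = (uint8_t)(1 - w);       /* the other way becomes LRU */
            return true;
        }
    }
    int victim = s->lru;                     /* LRU replacement */
    s->tag[victim]   = tag;
    s->valid[victim] = true;
    s->lru = (uint8_t)(1 - victim);
    return false;                            /* miss: line fetched via EMI */
}
```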
Within the unified fetch and instruction dispatch unit, the instruction fetch component generates new fetch addresses according to the addresses sent from the interrupt handling component, the exception handling component, the instruction flow-control component (branch addresses), and the ET component. Then, according to the global control, the no-operation information of the branch component, and the dispatch information of the DP component, it controls the fetch pipeline and performs instruction dispatch control.
Within the unified fetch and instruction dispatch unit, the dispatch component receives instruction packets from the instruction fetch component and the ICache, analyzes the instructions in an instruction packet according to the parallel flag bits and the functional-unit type fields, and dispatches the instructions to the corresponding functional units.
An instruction packet is 1024 bits; the instructions that can execute in parallel in one cycle form an execution packet, and an instruction packet may contain multiple execution packets. The present invention supports two instruction formats, 80-bit and 40-bit, and can simultaneously fetch and dispatch at most N_SI scalar instructions and N_VI vector instructions, where the N_SI scalar instructions are 2 scalar memory-access instructions plus N_SMAC + N_SIEU SPU operation instructions, and the N_VI vector instructions are 2 vector memory-access instructions plus N_VMAC + N_VIEU VPU operation instructions. An execution packet thus has a maximum length of 11 instructions and a minimum of 1; it contains at most four 80-bit instructions, the 80-bit instructions are located at the head of the execution packet, and the 40-bit instructions follow immediately.
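The grouping of instructions into execution packets can be pictured with the following minimal sketch. The encoding assumed here (a one-bit parallel flag per instruction, where flag == 1 means the next instruction executes in the same cycle) and all names are hypothetical; the patent states only that parallel flag bits and functional-unit type fields drive the dispatch.

```c
#include <stddef.h>

typedef struct {
    int p_flag;     /* 1: next instruction belongs to the same execution packet */
    int unit_type;  /* functional-unit type field (routing, not modelled here) */
} insn_t;

/* Splits the n instructions of one instruction packet into execution
 * packets of 1..11 instructions; returns the number of execution
 * packets and records each packet's first index in starts[]. */
size_t form_exec_packets(const insn_t *pkt, size_t n, size_t starts[])
{
    size_t n_exec = 0;
    for (size_t i = 0; i < n; ) {
        starts[n_exec++] = i;
        while (i < n && pkt[i].p_flag)   /* chain parallel-flagged instructions */
            i++;
        i++;                             /* consume the packet's last instruction */
    }
    return n_exec;
}
```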
Since the instructions of the scalar processing unit SPU and the vector processing unit VPU are located in the same execution packet, the two units cooperate in a tightly coupled manner under the control of the unified fetch and instruction dispatch unit to complete computing tasks.
In a specific application example, the main functions of the scalar processing unit SPU are scalar data memory access, scalar data computation, pipeline branch jumps, and interrupt operations. The SPU executes the serial operations in an application and at the same time controls the operation of the vector unit; it comprises an instruction flow-control unit (SBR), a scalar processing element (Scalar Process Element, SPE), scalar unit control registers (SUCR), and a scalar memory-access unit (SM). The SPE contains N_SIEU fixed-point execution units (IEU) and N_SMAC multiply-accumulate (MAC) units, which can execute simultaneously; the scalar memory-access unit can execute two scalar memory-access instructions simultaneously, reading two words of data from the data cache.
The present invention designs a 64-bit configuration path between the SPU and each of the VPU and the AM, enabling global access to the control and configuration registers in the VPU and the AM via MOV instructions.
In a specific application example, as shown in Fig. 3, the vector processing unit (Vector Process Unit, VPU) is an extensible vector operation cluster structure that mainly processes computation-intensive parallel tasks. It consists of N_VPE homogeneous vector processing elements (Vector Process Element, VPE), and each VPE contains N_VMAC vector multiply-accumulate (MAC) units and N_VIEU fixed-point execution units (IEU) to support large-scale parallel MAC computation. The VPU adopts vector SIMD parallel technology to execute identical VLIW instructions on different data, thereby realizing the vector computation in an application; it can execute N_VPE * (N_VMAC + N_VIEU) vector operations simultaneously.
A) Global shared registers
In the present invention, data interaction between the SPU and the VPU can be realized through the global shared registers (SVR); this interaction mechanism is implemented by scalar MOV instructions and vector MOV instructions. The global shared registers consist of N_VPE 64-bit registers; the VPU reads and writes them in vector form, and the SPU reads and writes them one by one in scalar form. Through N_VPE scalar MOV instructions, N_VPE data items (64 bits each) in the scalar register file are written into the SVR, and then one vector MOV instruction writes these N_VPE data items to the same position in the register files of the N_VPE VPEs; or, conversely, one vector MOV instruction reads the N_VPE data items at the same position in the register files of the N_VPE VPEs into the SVR, and then N_VPE scalar MOV instructions write these N_VPE data items into the scalar register file.
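A sequential model of the two SVR transfer directions may make the instruction counts concrete. The arrays standing in for the register files and the helper names are invented for illustration; only the MOV sequence follows the text above.

```c
#include <stdint.h>

#define N_VPE 16

static uint64_t svr[N_VPE];           /* global shared registers (SVR) */
static uint64_t scalar_rf[64];        /* model of the SPU register file */
static uint64_t vpe_rf[N_VPE][64];    /* model of one register file per VPE */

/* SPU -> VPUs: N_VPE scalar MOVs into the SVR, then one vector MOV. */
void spu_to_vpu(int sreg_base, int vreg)
{
    for (int i = 0; i < N_VPE; i++)       /* N_VPE scalar MOV instructions */
        svr[i] = scalar_rf[sreg_base + i];
    for (int i = 0; i < N_VPE; i++)       /* one vector MOV: all lanes at once */
        vpe_rf[i][vreg] = svr[i];
}

/* VPUs -> SPU: one vector MOV into the SVR, then N_VPE scalar MOVs. */
void vpu_to_spu(int vreg, int sreg_base)
{
    for (int i = 0; i < N_VPE; i++)       /* one vector MOV: all lanes at once */
        svr[i] = vpe_rf[i][vreg];
    for (int i = 0; i < N_VPE; i++)       /* N_VPE scalar MOV instructions */
        scalar_rf[sreg_base + i] = svr[i];
}
```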
B) Scalar-vector broadcast
Further, in the scalar-vector cooperative operation structure of the present invention, there are also two fast scalar-to-vector data-broadcast paths between the SPU and the VPU, supporting a single-word broadcast instruction and a double-word broadcast instruction respectively. 1) The single-word broadcast instruction broadcasts a single word (64 bits) in the SPU register file to the same position in the vector register files of the N_VPE VPEs; during execution it needs only one write operation on the register files of the N_VPE VPEs to complete the transfer of 64*N_VPE bits of data. 2) The double-word broadcast instruction broadcasts a pair of data Src_o:Src_e (a double word: 128 bits) in the SPU register file to Dst_o:Dst_e in the register files of the N_VPE VPEs, where a register pair is denoted by its even-numbered register, i.e. VR0 denotes VR1:VR0; during execution it needs only one write operation on the register files of the N_VPE VPEs to complete the transfer of 128*N_VPE bits of data. Executing double-word broadcast operations in parallel on the two scalar-vector broadcast paths can achieve the transfer of 256*N_VPE bits of data.
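The semantics of the two broadcast instructions can be sketched as follows, again with plain arrays in place of the real register files (N_VPE = 16 as in the application example below); the function names are illustrative.

```c
#include <stdint.h>

#define N_VPE 16

static uint64_t vpe_rf[N_VPE][64];    /* model of one vector register file per VPE */

/* Single-word broadcast: one write to every VPE register file,
 * transferring 64 * N_VPE bits in total. */
void broadcast_word(uint64_t src, int vreg)
{
    for (int i = 0; i < N_VPE; i++)
        vpe_rf[i][vreg] = src;
}

/* Double-word broadcast: Src_o:Src_e lands in Dst_o:Dst_e of every
 * VPE with a single write, transferring 128 * N_VPE bits; two such
 * broadcasts in parallel reach 256 * N_VPE bits. The pair is named
 * by its even register (VR0 denotes VR1:VR0). */
void broadcast_dword(uint64_t src_o, uint64_t src_e, int dst_e)
{
    for (int i = 0; i < N_VPE; i++) {
        vpe_rf[i][dst_e]     = src_e;   /* even register of the pair */
        vpe_rf[i][dst_e + 1] = src_o;   /* odd register of the pair */
    }
}
```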
C) Multi-width reduction and shuffle network
Further, the present invention designs a multi-width reduction tree and a shuffle network between the vector processing elements to realize data interaction between the VPE register files. The reduction tree can reduce the data of all the VPEs to a single scalar result, or perform grouped reductions to obtain multiple scalar results; a multi-width reduction operation divides all N_VPE VPEs into groups and executes the reduction of each group in parallel, where only equal-sized groups are supported and the group size is an integer power of 2. The shuffle network realizes data rearrangement across all the VPEs and thereby data communication between the VPEs, bringing great flexibility to vector data processing; it can perform shuffle operations on the data between the VPEs according to different shuffle granularities and shuffle modes, while the reduction network reduces the data in multiple VPEs into one or more VPEs according to the multi-width reduction modes.
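The grouped-reduction behaviour described above can be modelled sequentially as shown below; addition stands in for the reduction operator, and in hardware the inner loop is a tree.

```c
#include <assert.h>
#include <stdint.h>

#define N_VPE 16

/* Reduces N_VPE lanes in equal groups of group_size (a power of 2);
 * each group yields one scalar in results[]. group_size == N_VPE
 * reduces the whole vector to a single scalar. */
void grouped_reduce(const uint64_t lanes[N_VPE], int group_size,
                    uint64_t results[])
{
    /* only equal groups with power-of-2 size are supported */
    assert(group_size > 0 && (group_size & (group_size - 1)) == 0);
    assert(N_VPE % group_size == 0);

    for (int g = 0; g < N_VPE / group_size; g++) {
        uint64_t acc = 0;
        for (int i = 0; i < group_size; i++)   /* a tree in hardware */
            acc += lanes[g * group_size + i];
        results[g] = acc;
    }
}
```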
D) Vector data memory-access unit
Further, the present invention supports the parallel execution of two vector load/store memory-access instructions, providing high memory bandwidth for the N_VPE vector processing elements; the data bandwidth with the VPU is 2*N_VPE*8 B/cycle. The vector data memory adopts a single-port multi-bank organization, supporting parallel access; N_VPE dedicated base registers and address offset registers are provided, supporting two indirect addressing modes: linear addressing and cyclic addressing.
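The two indirect addressing modes can be sketched as follows. The struct fields and function names are illustrative, not the patent's register encoding; only the wrap-around behaviour of cyclic addressing is taken from the text.

```c
#include <stdint.h>

typedef struct {
    uint64_t base;     /* dedicated base register */
    uint64_t offset;   /* address offset register */
    uint64_t length;   /* circular-buffer length in bytes (cyclic mode) */
} vaddr_t;

/* Linear addressing: the offset advances monotonically. */
uint64_t next_linear(vaddr_t *a, uint64_t stride)
{
    uint64_t addr = a->base + a->offset;
    a->offset += stride;
    return addr;
}

/* Cyclic addressing: the offset wraps inside the buffer. */
uint64_t next_cyclic(vaddr_t *a, uint64_t stride)
{
    uint64_t addr = a->base + a->offset;
    a->offset = (a->offset + stride) % a->length;   /* wrap around */
    return addr;
}
```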
In this embodiment, the VPU is an extensible vector operation cluster structure consisting of N_VPE homogeneous vector processing elements VPE, which use vector SIMD parallel technology to improve performance. The VPU receives the vector operation instructions distributed by the dispatch unit and, after decoding, delivers them to the corresponding functional units for execution. Each VPE contains 64 local 64-bit general-purpose registers R0 to R63.
As shown in Figs. 2 and 3, each VPE integrates four functional sub-units, namely N_VMAC multiply-accumulate units (MAC) and N_VIEU fixed-point execution units (IEU), to support basic vector operations. A VPE improves performance through the parallel VLIW execution of these N_VMAC + N_VIEU sub-units, each of which executes one vector instruction in the VLIW instruction packet; that is, a VPE contains N_VMAC + N_VIEU pipelines capable of parallel execution. Every operation execution pipeline in the SPU and the VPEs adopts sub-word SIMD technology to realize 64-bit and 2×32-bit SIMD arithmetic, further improving the performance of 32-bit operations (such as single-precision floating point).
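Sub-word SIMD on a 64-bit datapath can be illustrated with an addition that treats the word either as one 64-bit lane or as two independent 32-bit lanes, with carries kept from crossing the lane boundary; this sketch is illustrative and is not the patent's circuit.

```c
#include <stdint.h>

/* One 64-bit lane. */
uint64_t add64(uint64_t a, uint64_t b)
{
    return a + b;
}

/* Two independent 32-bit lanes on the same 64-bit datapath: each lane
 * wraps within itself, so no carry crosses the 32-bit boundary. */
uint64_t add2x32(uint64_t a, uint64_t b)
{
    uint64_t lo = (uint32_t)((uint32_t)a + (uint32_t)b);        /* low lane */
    uint64_t hi = (uint32_t)((uint32_t)(a >> 32) + (uint32_t)(b >> 32)); /* high lane */
    return (hi << 32) | lo;
}
```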
Each MAC unit is composed of three units: a fixed-point MAC, a floating-point MAC, and a floating-point ALU short-instruction unit, where the floating-point MAC and the fixed-point MAC share a 64×64 fixed-point multiplier. The floating-point MAC unit, the fixed-point MAC unit, and the floating-point ALU short-instruction unit are independent units sharing the same data path; in the same cycle the three cannot start execution or write back simultaneously, but they can be scheduled in parallel by software pipelining.
An IEU unit is composed of a bit-processing unit (Bit Process, BP) and a fixed-point arithmetic logic unit (Arithmetic Logic Unit, ALU). The two are independent units sharing the same data path; in the same cycle they cannot start execution or write back simultaneously, but they can be scheduled in parallel by software pipelining.
A broadcast mechanism exists between the scalar execution unit SPU and the vector execution unit VPU of the GPDSP of the present invention, supporting at most two double-word broadcast operations and accelerating the filling of vector data. Data is broadcast from the scalar execution unit to the vector execution unit, and the implementation needs only one write operation to the VRF to complete the transfer of 128*N_VPE or 256*N_VPE bits of data.
Transferring data from the SPE into the N_VPE VPEs with the scalar-vector broadcast capability of the present invention takes only 4 beats, and the transfer proceeds in a fully pipelined manner. Completing the same process with the SVR takes 20 beats, and the SPE-VPE data interaction through the SVR executes serially. By this calculation, applying the scalar-vector broadcast capability makes the data filling speed about 20 times that of using the SVR, greatly improving the data filling speed; at the same time, using scalar-vector broadcast to realize data reuse reduces the memory bandwidth demand and improves the overall performance.
Many scientific and engineering applications involve matrix-type operations, and matrix-type operations have good data parallelism; the present invention can exploit the instruction-level parallelism in such operations through SIMD- and VLIW-based parallel methods. Below, taking N_VPE = 16 and N_VMAC = 3 as an example, the support of the parallel structure of the present invention for matrix multiplication and FFT applications is described.
While improving computing performance through large-scale parallel functional units, a huge challenge is also posed to the memory bandwidth demand. The present invention further exploits the good data reuse of matrix-type operations: the designed scalar-vector broadcast operation can complete a transfer of 2048 or 4096 bits of data with one write operation. This effectively exploits the data reuse in applications, reduces the memory bandwidth demand, and improves the utilization of the vector computation units; it can greatly improve the operation efficiency of matrix multiplication, reduce resource occupation, and improve overall performance.
As shown in Fig. 4, the effect of the scalar-vector broadcast operation on computing performance and storage demand is illustrated with the most basic operation in matrix computations, matrix-vector multiplication y = A × x, where A is an n×m matrix, x is a vector of length m, and y is a vector of length n. On the GPDSP operation structure of the present invention, the data matrix A is stored in the vector memory AM, x is stored in the scalar data memory SM, and the 16 VPEs compute round by round in SIMD parallel fashion, with VPE[i] accumulating the elements of y assigned to it. As can be seen from Fig. 4, every element of the result vector y reuses the vector x. On the GPDSP of the present invention, the vector x needs to be read only once; in each round the elements of x are broadcast in turn to the vector registers of the 16 VPEs by the scalar-vector broadcast operation, the data of the 16 corresponding rows of matrix A are read in turn from the AM, and the 48 MAC units in the 16 VPEs execute in parallel in pipelined fashion.
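A sequential C model of this mapping is sketched below, assuming for brevity that n is a multiple of 16; the inner loop over the 16 lanes stands in for the SIMD lockstep of the VPEs, the scalar read of x[j] stands in for the scalar-vector broadcast, and the distribution of work over the 48 MAC units is not modelled.

```c
#include <stddef.h>

#define NVPE 16    /* N_VPE = 16 as in the application example */

/* y = A * x with A row-major n x m; each round processes 16 rows,
 * one per VPE, reusing every x[j] across all 16 lanes. */
void matvec(const double *A, const double *x, double *y, int n, int m)
{
    for (int r = 0; r < n; r += NVPE) {          /* one round: 16 rows of A */
        double acc[NVPE] = {0};
        for (int j = 0; j < m; j++) {
            double xj = x[j];                    /* broadcast of x[j] to all lanes */
            for (int pe = 0; pe < NVPE; pe++)    /* SIMD across the 16 VPEs */
                acc[pe] += A[(size_t)(r + pe) * m + j] * xj;   /* MAC */
        }
        for (int pe = 0; pe < NVPE; pe++)
            y[r + pe] = acc[pe];
    }
}
```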
For matrix-type operations, which are used extremely widely in applications, the present invention is of great significance. Matrix-type operations such as matrix multiplication are needed in numerous scientific computing tasks, and the scalar-vector broadcast capability is more efficient in operation than the traditional scalar-vector shared register SVR. The scalar-vector broadcast capability can complete a 2048-bit or 4096-bit data transfer with a single write operation; reaching such a performance advantage relies on the support of the GPDSP operation structure of the present invention. The scalar processing unit SPU and the vector processing unit VPU containing 16 homogeneous vector processing elements VPE act together to realize the scalar-vector broadcast capability, which can significantly improve the operation performance of matrix multiplication and has broad application prospects.
The GPDSP operation structure of the present invention can equally be applied efficiently to the signal-processing field, described here through the most basic algorithm in this field, the double-precision floating-point FFT algorithm. Since the data must be accessed at different strides during FFT computation, the shuffle-network-based vector SIMD operation structure of the present invention can realize fast data interaction between the VPEs and thereby satisfy the data-access requirements at different strides.
As shown in Figs. 5 and 6, the Cooley-Tukey algorithm is adopted to decompose an FFT of arbitrary size into multiple small FFTs of no more than 128 points. For a 128-point FFT, the initial data, the twiddle factors, and the results can all be kept in the register files of the vector operation units; each VPE stores the data of 8 points, each point being a double-precision complex number. As shown in Fig. 5, the data are stored in the VPEs in sequence, and the radix-2 FFT of 128 points is divided into 7 stages of butterfly operations. In the 1st, 2nd, and 3rd stages, each VPE operates on the data in its own register file and stores the results back into its own register file. After the 3rd stage, the data between the VPEs must be exchanged, which this patent completes by executing 7 shuffle instructions in pipelined fashion; then the 4th, 5th, and 6th stages of butterflies are performed on the shuffled data. After the 6th stage of butterflies, the data interaction between the VPEs is completed by executing 1 shuffle instruction, after which the 7th stage of butterflies is executed. In every stage each VPE completes 4 butterfly operations, as shown in Fig. 6(A); each butterfly consists of 4 double-precision floating-point multiplications and 6 double-precision floating-point additions/subtractions, as shown in Fig. 6(B). Therefore, in each stage each VPE completes 16 floating-point multiplications and 24 floating-point additions (40 floating-point operations in total), and the distribution of these 40 floating-point operations over the 3 MAC instruction slots of the present invention is shown in Fig. 6(C). From the above analysis, a 128-point FFT takes a total of 106 clock cycles (14×7 + 8) on the VPU structure of the present invention.
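The quoted total can be checked with a one-line calculation: the 8 shuffle cycles (7 shuffles after stage 3, 1 after stage 6) are stated above, and the 14 cycles of butterfly work per stage is the per-stage figure implied by the stated breakdown 14×7 + 8.

```c
#include <stdio.h>

int main(void)
{
    int stages           = 7;      /* radix-2: log2(128) butterfly stages */
    int cycles_per_stage = 14;     /* 4 butterflies/VPE, 40 flops over 3 MAC slots */
    int shuffle_cycles   = 7 + 1;  /* pipelined shuffles between stage groups */

    printf("128-point FFT: %d cycles\n",
           stages * cycles_per_stage + shuffle_cycles);   /* prints 106 */
    return 0;
}
```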
The above are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical solutions under the idea of the present invention belong to the protection scope of the present invention. It should be pointed out that, for those skilled in the art, improvements and modifications without departing from the principles of the present invention should also be regarded as falling within the protection scope of the present invention.

Claims (8)

1. A vector SIMD operation structure supporting scalar-vector cooperative work, characterized by comprising:
a unified fetch and instruction dispatch unit, used to dispatch instructions simultaneously to the scalar processing unit SPU, the vector processing unit VPU, and the vector array memory AM;
a scalar processing unit SPU, responsible for processing serial tasks and for controlling the execution of the vector processing unit VPU;
a vector processing unit VPU, responsible for processing computation-intensive parallel tasks;
a vector array memory AM, used to provide data and data-movement support for parallel, multi-width vector operations;
a DMA unit, used to provide instructions and data for the scalar processing unit SPU and the vector processing unit VPU.
2. The vector SIMD operation structure supporting scalar-vector cooperative work according to claim 1, characterized in that the unified fetch and instruction dispatch unit adopts, in execution, a variable-length (N_SI + N_VI)-issue VLIW instruction structure, simultaneously fetching and dispatching N_SI scalar instructions and N_VI vector instructions; these N_SI + N_VI instructions simultaneously support conditional execution, interrupts, and exception handling.
3. The vector SIMD operation structure supporting scalar-vector cooperative work according to claim 1, characterized in that the scalar processing element SPE consists of N_SMAC MAC units and N_SIEU fixed-point execution units IEU; these N_SI pipelines execute in parallel the N_SI scalar instructions in a VLIW instruction packet, performing the serial operations in scientific applications, where N_SI = N_SMAC + N_SIEU.
4. The vector SIMD operation structure supporting scalar-vector cooperative work according to claim 1, characterized in that the vector processing unit VPU consists of N_VPE homogeneous vector processing elements VPE, which perform identical operations on different data under the control of a unified instruction stream, where N_VPE is a power of 2.
5. The vector SIMD operation structure supporting scalar-vector cooperative work according to claim 4, characterized in that each vector processing element VPE comprises N_VMAC MAC units and N_VIEU fixed-point execution units IEU; these N_VI pipelines execute in parallel the N_VI vector instructions in a VLIW instruction packet, performing the parallel operations in scientific applications, where N_VI = N_VMAC + N_VIEU.
6. The vector SIMD operation structure supporting scalar-vector cooperative work according to claim 5, characterized in that the data interaction between the vector processing elements VPE is accomplished through a reduction network and a shuffle network.
7. The vector SIMD operation structure supporting scalar-vector cooperative work according to claim 1, characterized in that 64-bit configuration paths are designed between the scalar processing unit SPU and both the vector processing unit VPU and the vector array memory AM, enabling global access to the control and configuration registers in the vector processing unit VPU and the vector array memory AM via MOV instructions.
8. The vector SIMD operation structure supporting scalar-vector cooperative work according to claim 1, characterized in that between the scalar processing unit SPU and the vector processing unit VPU there are also two data-broadcast paths from the scalar processing unit SPU to the vector processing unit VPU, supporting a single-word broadcast instruction and a double-word broadcast instruction respectively;
the single-word broadcast instruction broadcasts a single word in the SPU register file to the same position in the vector register files of the N_VPE VPEs; during execution it performs one write operation on the register files of the N_VPE VPEs, completing the transfer of 64*N_VPE bits of data;
the double-word broadcast instruction broadcasts a pair of data Src_o:Src_e in the SPU register file to Dst_o:Dst_e in the register files of the N_VPE VPEs, where a register pair is denoted by its even-numbered register, i.e. VR0 denotes VR1:VR0; during execution it performs one write operation on the register files of the N_VPE VPEs, completing the transfer of 128*N_VPE bits of data;
executing double-word broadcast operations in parallel on the two scalar-vector broadcast paths can achieve the transfer of 256*N_VPE bits of data.
CN201510718729.7A 2015-10-29 2015-10-29 Vector SIMD operation structure supporting scalar-vector cooperative work Active CN105373367B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510718729.7A CN105373367B (en) 2015-10-29 2015-10-29 Vector SIMD operation structure supporting scalar-vector cooperative work


Publications (2)

Publication Number Publication Date
CN105373367A 2016-03-02
CN105373367B CN105373367B (en) 2018-03-02

Family

ID=55375596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510718729.7A Active CN105373367B (en) 2015-10-29 2015-10-29 Vector SIMD operation structure supporting scalar-vector cooperative work

Country Status (1)

Country Link
CN (1) CN105373367B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101986264A (en) * 2010-11-25 2011-03-16 中国人民解放军国防科学技术大学 Multifunctional floating-point multiply and add calculation device for single instruction multiple data (SIMD) vector microprocessor
CN102012893A (en) * 2010-11-25 2011-04-13 中国人民解放军国防科学技术大学 Extensible vector operation cluster
CN102279818A (en) * 2011-07-28 2011-12-14 中国人民解放军国防科学技术大学 Vector data access and storage control method supporting limited sharing and vector memory
CN103440121A (en) * 2013-08-20 2013-12-11 中国人民解放军国防科学技术大学 Triangular matrix multiplication vectorization method of vector processor
CN104636315A (en) * 2015-02-06 2015-05-20 中国人民解放军国防科学技术大学 GPDSP-oriented matrix LU decomposition vectorization calculation method

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111651201A (en) * 2016-04-26 2020-09-11 中科寒武纪科技股份有限公司 Device and method for executing vector merging operation
CN111651201B (en) * 2016-04-26 2023-06-13 中科寒武纪科技股份有限公司 Apparatus and method for performing vector merge operation
CN109661647A (en) * 2016-09-13 2019-04-19 Arm有限公司 The multiply-add instruction of vector
CN109661647B (en) * 2016-09-13 2023-03-03 Arm有限公司 Data processing apparatus and method
CN111352894A (en) * 2018-12-20 2020-06-30 深圳市中兴微电子技术有限公司 Single-instruction multi-core system, instruction processing method and storage medium
CN112328958A (en) * 2020-11-10 2021-02-05 河海大学 Optimized data rearrangement method based on base-64 two-dimensional FFT architecture
WO2022121275A1 (en) * 2020-12-11 2022-06-16 上海阵量智能科技有限公司 Processor, multithread processing method, electronic device, and storage medium
CN115826910A (en) * 2023-02-07 2023-03-21 成都申威科技有限责任公司 Vector fixed point ALU processing system
CN117435259A (en) * 2023-12-20 2024-01-23 芯瞳半导体技术(山东)有限公司 VPU configuration method and device, electronic equipment and computer readable storage medium
CN117435259B (en) * 2023-12-20 2024-03-22 芯瞳半导体技术(山东)有限公司 VPU configuration method and device, electronic equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN105373367B (en) 2018-03-02


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant