CN104699458A - Fixed point vector processor and vector data access controlling method thereof - Google Patents



Publication number
CN104699458A
CN104699458A (application CN201510144307.3A)
Authority
CN
China
Prior art keywords
vector
alu
memory
control unit
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510144307.3A
Other languages
Chinese (zh)
Inventor
庞业勇
王少军
何永福
刘大同
彭宇
彭喜元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN201510144307.3A priority Critical patent/CN104699458A/en
Publication of CN104699458A publication Critical patent/CN104699458A/en
Pending legal-status Critical Current


Abstract

The invention discloses a fixed-point vector processor and a vector data access control method thereof, and relates to a vector processor for online time series prediction. It aims to solve the problems that existing vector processors, which cannot be optimized for a specific method, have poor generality and cannot meet the demands of online computation. The fixed-point vector processor comprises a program counter, a microcode memory, a vector memory, an arithmetic logic unit and a data control unit. The signal-processing procedures of the program counter, microcode memory, vector memory, arithmetic logic unit and data control unit together form a complete fixed-point vector processing flow. Through the ALU (arithmetic logic unit) design, the ALU structure of each data lane can be changed flexibly according to computational needs, achieving flexible configuration of the instruction set, so that the fixed-point vector processor and its vector data access control method are applicable to occasions requiring complex computation.

Description

Fixed point vector processor and vector data access control method thereof
Technical field
The present invention relates to a high-performance, low-power and low-latency fixed-point vector processor, and in particular to a vector processor for online time series prediction.
Background art
At present, online machine learning on embedded high-performance computing platforms has become a research hotspot. Cyber-physical system nodes with high-performance, low-power and low-latency online data processing capability, which highly integrate information acquisition, intelligent information processing and network communication functions, are widely used in fields such as environmental monitoring, industrial production and aerospace engineering. For online applications, however, nonlinear methods need to continually add new samples and update the model; the ever-increasing sample size and the computation required for model updating grow greatly, posing a serious challenge to the performance of embedded computing platforms. Therefore, applications that demand high performance, low latency, online machine learning and high-throughput data processing require a high-performance computing platform.
At present, existing FPGA-based computing systems are typically designed in an HDL language, mapping a specific algorithm onto the FPGA at the RTL level. There are many design examples of this dedicated-accelerator approach, and they achieve very high computational speed-up ratios.
However, although designs that map a special-purpose algorithm with HDL code achieve very high speed-up ratios, their generality is weak. This is mainly reflected in the fact that when the target algorithm changes, most or even all of the design must be changed manually; flexibility is very poor, and the design complexity and design cycle limit the applicable scope of such designs, i.e., scalability is poor. Moreover, existing vector processors are designed according to the structure of general-purpose processors and cannot be optimized for a specific method, so their performance cannot meet the demands of online computation.
In the prior art, the number of lanes of a vector processor is restricted to powers of 2 (16, 32, 64, 128). In the present design, the number of data lanes of the vector processor can be set flexibly according to the specific computational need; to reduce the complexity of the hardware design, a vector processor with any number of lanes from 1 to 128 can be realized according to the design herein. This vector processor design is highly portable and can easily be ported to other FPGA devices.
Summary of the invention
The present invention is proposed to solve the following problems: 1) software implementations of kernel methods on general-purpose processors have low computational performance, high power consumption and large latency; 2) existing vector processors cannot be optimized for a specific method, resulting in weak generality and failure to meet the demands of online computation; 3) existing vector processors cannot balance computational performance against on-chip FPGA computational resource consumption. A fixed-point vector processor and a vector data access control method thereof are now proposed.
The fixed-point vector processor comprises a program counter, a microcode memory, a vector memory, an arithmetic logic unit (ALU) and a data control unit;
The program counter 1 receives the counting instruction sent by the data control unit 5 and the destination address microcode field sent by the microcode memory 2, and outputs a count value to the microcode memory 2;
The microcode memory 2 receives and stores the count value sent by the program counter 1, outputs the lane index L microcode field to the data control unit 5, simultaneously outputs the OP microcode field to the ALU 4 and the data control unit 5, outputs the input vector address microcode fields to the vector memory 3, and outputs the destination address microcode field to the program counter 1 and the vector memory 3;
The vector memory 3 receives and stores the input vector address and destination address microcode fields sent by the microcode memory 2 and the output-vector-data command and write-enable command sent by the data control unit 5, and outputs vector data to the ALU 4;
The ALU 4 performs vector operations on the received vector data according to the OP microcode field sent by the microcode memory 2 and the ALU control instruction sent by the data control unit 5 to obtain a result vector, and outputs this vector to the data control unit 5;
The data control unit 5 generates the write-enable command and the output-vector-data command according to the lane index L microcode field and the OP microcode field sent by the microcode memory 2, and outputs them to the vector memory 3; it also generates the ALU control instruction and outputs it to the ALU 4 to make it perform exponential function, division and square-root operations; and it generates the counting instruction and outputs it to the program counter 1 to make it count.
The vector data access control method of the fixed-point vector processor comprises the following steps:
a step of receiving and storing the count value sent by the program counter 1, outputting the lane index L microcode field to the data control unit 5, simultaneously outputting the OP microcode field to the ALU 4 and the data control unit 5, outputting the input vector address microcode fields to the vector memory 3, and outputting the destination address microcode field to the program counter 1 and the vector memory 3;
a step of receiving and storing the input vector address and destination address microcode fields sent by the microcode memory 2 and the output-vector-data command and write-enable command sent by the data control unit 5, and outputting vector data to the ALU 4;
a step of performing vector operations on the received vector data according to the OP microcode field sent by the microcode memory 2 and the ALU control instruction sent by the data control unit 5 to obtain a result vector, and outputting this vector to the data control unit 5;
a step of generating the write-enable command and the output-vector-data command according to the lane index L microcode field and the OP microcode field sent by the microcode memory 2 and outputting them to the vector memory 3; generating the ALU control instruction and outputting it to the ALU 4 to make it perform exponential function, division and square-root operations; and generating the counting instruction and outputting it to the program counter 1 to make it count.
Beneficial effects: the fixed-point vector processor designed by the present invention is an FPGA-based vector processor with strong generality and scalability.
The innovations of the present invention are:
1) The processor adopts a new, scalable vector processor structure designed in a hardware description language (HDL); it has strong generality and can be configured as a floating-point or a variable-word-length fixed-point vector processor.
2) The number of lanes can be defined as 8 to 128 according to computational needs, and the overall width of the computing engine can reach 4096 bits. The lane design is optimized for the computational requirements of machine learning methods, improving computational performance. Moreover, through the heterogeneous ALU design, the ALU structure of each data lane can be changed flexibly according to computational needs, achieving flexible configuration of the instruction set and a balance among computational performance, power consumption and computational resource consumption.
3) By writing microcode programs, multiple machine learning methods can be implemented on this vector processor, solving the problems of poor generality and scalability in traditional FPGA computing designs; the reusability of the design is greatly enhanced.
4) On the premise of meeting computational accuracy requirements, this fixed-point vector processor obtains 2x and 9x computational performance improvements compared with a floating-point vector processor and a CPU respectively, while power consumption is reduced to 1/3 and 1/40, and computational latency to 1/2 and 1/9.
Brief description of the drawings
Fig. 1 is a structural diagram of the fixed-point vector processor of the present invention;
Fig. 2 is an internal structure diagram of ALU 2 to ALU N in embodiment four;
Fig. 3 is a structural diagram of ALU 1 in embodiment four;
Fig. 4 is a structural diagram of the microcode memory and vector memory of the fixed-point vector processor in the embodiments;
Fig. 5 is a timing comparison diagram of simple instructions and pipelined instructions of the fixed-point vector processor in the embodiments.
Detailed description of the embodiments
Embodiment one: this embodiment is described with reference to Fig. 1. The fixed-point vector processor described in this embodiment comprises a program counter 1, a microcode memory 2, a vector memory 3, an ALU 4 and a data control unit 5;
The program counter 1 receives the counting instruction sent by the data control unit 5 and the destination address microcode field sent by the microcode memory 2, and outputs a count value to the microcode memory 2;
The microcode memory 2 receives and stores the count value sent by the program counter 1, outputs the lane index L microcode field to the data control unit 5, simultaneously outputs the OP microcode field to the ALU 4 and the data control unit 5, outputs the input vector address microcode fields to the vector memory 3, and outputs the destination address microcode field to the program counter 1 and the vector memory 3;
The vector memory 3 receives and stores the input vector address and destination address microcode fields sent by the microcode memory 2 and the output-vector-data command and write-enable command sent by the data control unit 5, and outputs vector data to the ALU 4;
The ALU 4 performs vector operations on the received vector data according to the OP microcode field sent by the microcode memory 2 and the ALU control instruction sent by the data control unit 5 to obtain a result vector, and outputs this vector to the data control unit 5;
The data control unit 5 generates the write-enable command and the output-vector-data command according to the lane index L microcode field and the OP microcode field sent by the microcode memory 2, and outputs them to the vector memory 3; it also generates the ALU control instruction and outputs it to the ALU 4 to make it perform exponential function, division and square-root operations; and it generates the counting instruction and outputs it to the program counter 1 to make it count.
With reference to Fig. 1: the lane index L microcode field corresponds to L in the figure; the OP microcode field corresponds to OP; the input vector address microcode fields correspond to Vector A Address and Vector B Address; the destination address microcode field corresponds to Vector C Address; the output-vector-data command corresponds to VDATA; the write-enable command corresponds to WR; the count value sent by the program counter 1 corresponds to B ADDR; the counting instruction sent by the data control unit 5 corresponds to LOAD; the ALU control instruction corresponds to CON; and the vector data sent by the vector memory 3 correspond to QA1, QA2 ... QAN and QB1, QB2 ... QBN.
The working principle of the fixed-point vector processor is as follows: 1) when the vector processor executes instructions sequentially, the count value of the program counter is automatically incremented by 1 at the end of each machine cycle; when the program reaches a branch instruction (BRANCH), the jump target address is loaded into the program counter. The count value of the program counter is used directly as the address input of the microcode memory. 2) The microcode program implementing the specific algorithm is pre-stored in the microcode memory, and microcode instructions are fetched and executed in turn under the control of the program counter. 3) Under the control of a microcode instruction, the vector data stored at a particular address of the vector memory are fetched and sent to the arithmetic logic unit (ALU) for computation. 4) The ALU, under the control of the microcode instruction and the data control unit, completes all vector operations and transfers the results to the data control unit. 5) The data control unit generates the vector memory write-enable (WR) signal according to the different microcode instructions, completing the storage control of the vector memory; meanwhile, the data control unit (DCU) generates the ALU control signal (CON), which controls the heterogeneous computing units of the ALU to complete functions such as the exponential function (EXP), division (DIV) and square root (SQRT).
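As a behavioral illustration only (not the patent's HDL), the five-step working principle above can be sketched as a small software simulator; the field names (op, a_addr, etc.) and dictionary encoding are assumptions for the sake of the example:

```python
# Behavioral sketch of the microcode fetch/execute loop described above.
# The real design is an FPGA circuit; this only mirrors the control flow.

def run(microcode, vector_mem, steps):
    pc = 0                                        # program counter drives the microcode address
    for _ in range(steps):
        inst = microcode[pc]
        if inst["op"] == "BRANCH":                # jump target is loaded into the PC
            pc = inst["target"]
            continue
        a = vector_mem[inst["a_addr"]]            # fetch the two operand vectors
        b = vector_mem[inst["b_addr"]]
        if inst["op"] == "VADD":                  # ALU performs the vector operation
            result = [x + y for x, y in zip(a, b)]
        elif inst["op"] == "VMUL":
            result = [x * y for x, y in zip(a, b)]
        else:
            raise ValueError("only VADD/VMUL shown in this sketch")
        vector_mem[inst["c_addr"]] = result       # DCU asserts WR: result written back
        pc += 1                                   # sequential execution: PC increments
    return vector_mem
```

For example, running one VADD over vectors stored at addresses 0 and 1 writes their elementwise sum to address 2.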
Aimed at the problem of high-efficiency, low-latency online time series forecasting, and oriented toward online time series prediction applications based on KAF (kernel adaptive filtering) methods, the application proposes an FPGA-based vector processor structure for the class of KAF methods: a high-performance, low-latency fixed-point vector processor design that is relatively general within the scope of KAF methods. Multi-lane parallelism, pipelining and fixed-point techniques improve the processor's computational performance while addressing the problems of large computational latency and high power consumption. The vector processor adopts a microcode-based programming model, realizing computational optimization at the instruction level, so that its generality and scalability are greatly improved over the traditional RTL-level mapping approach. By fully mining the characteristics of the algorithms' computational requirements, a heterogeneous ALU design is proposed that guarantees the number of data lanes while also covering the vector division, square-root and exponential-function requirements, achieving a balance between computational performance and FPGA resource consumption. Since computational accuracy depends on the test data set used, the application adopts a variable-bit-width fixed-point processor design: when accuracy requirements are met, more FPGA resources can be saved and a higher operating frequency achieved, obtaining higher computing speed and lower computational latency than a floating-point vector processor. The application implements the three most classical KAF methods on this fixed-point vector processor. Experiments show that, on the premise of meeting computational accuracy requirements, this fixed-point vector processor obtains 2x and 9x performance improvements compared with a floating-point vector processor and a CPU respectively, while power consumption is reduced to 1/3 and 1/40, and computational latency to 1/2 and 1/9.
Embodiment two: this embodiment further describes the fixed-point vector processor of embodiment one. In this embodiment, the vector memory 3 comprises vector memory 1, vector memory 2, ..., vector memory N, and the ALU 4 comprises ALU 1, ALU 2, ..., ALU N, where N is an integer greater than or equal to 1 and less than or equal to 128. Vector memory 1 is connected to ALU 1 to form one data lane, and the two have the same data bit width; vector memory 2 is connected to ALU 2 to form one data lane, and the two have the same data bit width; vector memory N is connected to ALU N to form one data lane, and the two have the same data bit width.
Each vector memory and its ALU form one data lane; the two are directly connected and have the same data bit width M.
Embodiment three: this embodiment further describes the fixed-point vector processor of embodiment one. In this embodiment, the number of data lanes formed by the vector memory 3 and the ALU 4 is 1 to 128.
Embodiment four: this embodiment is described with reference to Figs. 2 and 3 and further describes the fixed-point vector processor of embodiment one. In this embodiment, ALU 2 to ALU N in the ALU 4 have the same structure, each comprising a fixed-point adder/subtractor 4-1 and a fixed-point multiplier 4-2;
the fixed-point adder/subtractor 4-1 receives the vector data sent by the vector memory 3, performs addition/subtraction on them, and outputs the result to the data control unit 5;
the fixed-point multiplier 4-2 receives the vector data sent by the vector memory 3, performs multiplication on them, and outputs the result to the data control unit 5;
ALU 1 comprises a fixed-point adder/subtractor 4-1 and a fixed-point multiplier 4-2, and further a fixed-point exponential function unit 4-3, a fixed-point divider 4-4, a fixed-point square-root unit 4-5 and a dot-product adder tree unit 4-6;
the fixed-point exponential function unit 4-3 receives the vector data sent by the vector memory 3, performs the exponential operation on them, and outputs the result to the data control unit 5;
the fixed-point divider 4-4 receives the vector data sent by the vector memory 3, performs division on them, and outputs the result to the data control unit 5;
the fixed-point square-root unit 4-5 receives the vector data sent by the vector memory 3, performs the square-root operation on them, and outputs the result to the data control unit 5;
the dot-product adder tree unit 4-6 sums the N per-lane dot-product partial results in a dot-product operation.
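A dot-product adder tree of this kind can be illustrated behaviorally as follows (a software sketch, not the patent's hardware): each of the N lanes contributes one product, and a pairwise reduction mirrors the log-depth tree of adders that sums them:

```python
# Behavioral sketch of a dot-product adder tree with one multiplier per lane.

def lane_products(a, b):
    # One fixed-point multiplier per lane computes a[i] * b[i] in parallel.
    return [x * y for x, y in zip(a, b)]

def adder_tree(values):
    # Pairwise reduction: each level halves the number of partial sums,
    # mirroring the log2(N)-depth structure of the hardware tree.
    while len(values) > 1:
        if len(values) % 2:                 # odd count: pad with a zero term
            values = values + [0]
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    return values[0]

def dot(a, b):
    return adder_tree(lane_products(a, b))
```

The pairwise structure is what lets the hardware sum N products in log2(N) adder stages instead of N-1 sequential additions.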
In this embodiment, as shown in Figs. 2 and 3, there are two kinds of arithmetic logic units in this heterogeneous design. The N-1 lanes of ALU 2 to ALU N have the same structure, each comprising one fixed-point adder/subtractor and one fixed-point multiplier, and support only fixed-point addition, subtraction and multiplication. The fixed-point multiplier is a single-precision multiplier.
In addition to one fixed-point adder/subtractor and one single-precision multiplier, ALU 1 also comprises a fixed-point exponential function unit, a fixed-point divider, a fixed-point square-root unit and a dot-product adder tree unit; besides fixed-point addition, subtraction and multiplication, it also supports division, the exponential operation and the square-root operation.
The arithmetic logic units comprise N multipliers in total. The data control unit (DCU) generates the ALU control signal (CON), which controls the heterogeneous computing units of the ALU to complete functions such as the exponential function (EXP), division (DIV) and square root (SQRT).
In Figs. 2 and 3, QA and QB denote the vector data output ports of the vector memory 3; the vector data of the vector memory 3 are input to the ALU 4 through QA and QB. VADD, VSUB and VMUL are simple instructions. Clock denotes the clock signal, which assists the operations. Add-sub denotes the addition/subtraction operation.
In Fig. 3, QA_1, QA_2 ... QA_N and QB_1, QB_2 ... QB_N denote the vector data sent by vector memory 1 to vector memory N within the vector memory. QA_O denotes the single output of QA_1, QA_2 ... QA_N through a multiplexer (MUX); QB_O denotes the single output of QB_1, QB_2 ... QB_N through a MUX. VMUL_1, VMUL_2 ... VMUL_N denote the outputs of the N multipliers. VDOT denotes the output of the dot product.
SEXP denotes the exponential function instruction; SDIV denotes the division instruction; SSQRT denotes the square-root instruction; S2V denotes expanding a scalar into a vector.
Although only ALU 1 comprises the division, exponential and square-root units, under the control of the data control unit (DCU) this fixed-point vector processor can still perform these vector operations, by looping in the manner of a scalar computer.
The heterogeneous ALU design described in this embodiment, by fully mining the characteristics of KAF algorithms, balances computational performance against on-chip FPGA computational resource consumption. Moreover, this heterogeneous lane organization gives the fixed-point vector processor strong instruction-extension capability.
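The loop-based fallback mentioned above (one shared scalar unit in ALU 1 serving the whole vector) can be illustrated behaviorally; the function names are hypothetical, and floating-point `math.exp` stands in for the fixed-point exponential unit:

```python
# Behavioral sketch: only "lane 1" owns an exponential unit, so a vector
# EXP is realized by cycling each element through that single unit, the
# way a scalar machine would.  Purely illustrative of the control idea.

import math

def scalar_exp_unit(x):
    # Stand-in for ALU 1's single exponential function unit.
    return math.exp(x)

def vector_exp(vec):
    # The DCU routes one element per iteration to ALU 1 and collects results.
    return [scalar_exp_unit(x) for x in vec]
```

The trade-off is clear: a vector EXP takes N passes through the one unit, but the other N-1 lanes stay free of expensive DSP-consuming hardware.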
Embodiment five: this embodiment further describes the fixed-point vector processor of embodiment four. In this embodiment, ALU 1 further comprises a trigonometric function Tan unit, a trigonometric function Atan unit and a logarithmic function Log unit.
The heterogeneous lane organization described in embodiment four gives this fixed-point vector processor strong instruction-extension capability: by modifying the structure of ALU 1, hardware units for many other function computations can be added, such as the trigonometric functions Tan/Atan and the logarithmic function Log; in principle any standard operation or user-defined operation can be realized, achieving extension of the vector instruction set.
Embodiment six: this embodiment is described on the basis of embodiment one, two, three, four or five. The vector data access control method of the fixed-point vector processor described in this embodiment comprises the following steps:
a step of receiving and storing the count value sent by the program counter 1, outputting the lane index L microcode field to the data control unit 5, simultaneously outputting the OP microcode field to the ALU 4 and the data control unit 5, outputting the input vector address microcode fields to the vector memory 3, and outputting the destination address microcode field to the program counter 1 and the vector memory 3;
a step of receiving and storing the input vector address and destination address microcode fields sent by the microcode memory 2 and the output-vector-data command and write-enable command sent by the data control unit 5, and outputting vector data to the ALU 4;
a step of performing vector operations on the received vector data according to the OP microcode field sent by the microcode memory 2 and the ALU control instruction sent by the data control unit 5 to obtain a result vector, and outputting this vector to the data control unit 5;
a step of generating the write-enable command and the output-vector-data command according to the lane index L microcode field and the OP microcode field sent by the microcode memory 2 and outputting them to the vector memory 3; generating the ALU control instruction and outputting it to the ALU 4 to make it perform exponential function, division and square-root operations; and generating the counting instruction and outputting it to the program counter 1 to make it count.
Embodiment seven: this embodiment further describes the vector data access control method of the fixed-point vector processor of embodiment six. In this embodiment, in the step of performing vector operations on the vector data to obtain a result vector and transferring it to the data control unit 5, the vector operation process comprises:
a step of receiving the vector data sent by the vector memory 3, performing addition/subtraction on them, and outputting the result to the data control unit 5;
a step of receiving the vector data sent by the vector memory 3, performing multiplication on them, and outputting the result to the data control unit 5;
a step of receiving the vector data sent by the vector memory 3, performing the exponential operation on them, and outputting the result to the data control unit 5;
a step of receiving the vector data sent by the vector memory 3, performing division on them, and outputting the result to the data control unit 5;
a step of receiving the vector data sent by the vector memory 3, performing the square-root operation on them, and outputting the result to the data control unit 5;
a step of summing the N per-lane dot-product partial results in a dot-product operation.
A vector processor, also called an array processor, can perform arithmetic operations on whole arrays of data synchronously, whereas most CPUs are scalar processors that can process only one element at a time. Vector processors are widely used in scientific computing and were the basis of most supercomputers of the 1980s and even the 1990s. Most current commercial CPUs include some vector processing instructions, typically SIMD. Vector processors also play a vital role in the architecture of video game consoles and consumer graphics hardware.
Embodiment eight: the fixed-point vector processor provided by the present invention is described in further detail below in conjunction with the drawings and embodiments.
Each vector memory and its ALU form one data lane; the two are directly connected and have the same data bit width M. With reference to Fig. 1, the vector memory 3 comprises vector memory 1, vector memory 2 ... vector memory N, and the ALU 4 comprises ALU 1, ALU 2 ... ALU N, where N is an integer greater than or equal to 1 and less than or equal to 128. Vector memory 1 is connected to ALU 1 to form one data lane, the two having the same data bit width; likewise vector memory 2 with ALU 2, and vector memory N with ALU N.
1.1 Memory interface
Traditional RISC processors generally use a register file to store intermediate data; a register file typically has dozens of simultaneously readable and writable ports to achieve very high computational performance. Since no multi-port (more than 2 ports) memory is available in an FPGA, the application uses the FPGA's on-chip dual-port RAM as the vector memory.
As shown in Fig. 4, each data lane contains two identical dual-port RAMs as vector memory; their read address ports are driven by the "A" and "B" code fields of the microcode instruction (Table 1) respectively, and the write address port is driven by "C". The data output ports "QA" and "QB" are coupled directly to the ALU inputs. The "WR" signal is the write-enable signal of each vector memory, generated by the data control unit (DCU), and is used to complete the storage control of vector data. The microcode memory is set to work in ROM mode and loads the pre-stored microcode program at power-up, so unlike the vector memory it cannot be rewritten online.
The storage depth of the vector memory is 2048, and the stored data width (M) can be set flexibly according to the width of the fixed-point ALU; before compilation, the integer-part and fraction-part widths are defined by macro variables in the form of parameters. The vector memory address space is divided into 3 sections: 0x0000 to 0x0BFF stores training and online test data, preset at compile time; 0x0C00 to 0x0FEF stores intermediate calculation results; and the last 16 words, 0x0FF0 to 0x0FFF, store constants needed during vector calculation.
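The three-section address map above can be captured, for illustration, as a small classifier; the address ranges are from the text, while the section names and function are assumptions of the sketch:

```python
# Sketch of the vector memory address map quoted above.

DATA_LO, DATA_HI   = 0x0000, 0x0BFF   # training and online test data (preset at compile time)
TEMP_LO, TEMP_HI   = 0x0C00, 0x0FEF   # intermediate calculation results
CONST_LO, CONST_HI = 0x0FF0, 0x0FFF   # last 16 words: vector-calculation constants

def section(addr):
    # Classify a vector memory address into its map section.
    if DATA_LO <= addr <= DATA_HI:
        return "data"
    if TEMP_LO <= addr <= TEMP_HI:
        return "intermediate"
    if CONST_LO <= addr <= CONST_HI:
        return "constant"
    raise ValueError(f"address 0x{addr:04X} outside the vector memory map")
```

Note that the constants section is exactly 16 words, matching the "last 16 words" in the text.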
1.2 Arithmetic logic unit
The application adopts a heterogeneous arithmetic logic unit (ALU) design. Common KAF methods require only a small number of division, exponential and square-root operations in their computation; these units are used very infrequently, yet they consume a large amount of on-chip FPGA DSP resources. Early experiments showed that if every ALU contained division and exponential modules, the maximum number of lanes of the vector processor would be only 17, which would seriously affect its performance. Therefore the application adopts a heterogeneous lane design which, without affecting overall performance, guarantees the number of data lanes and balances computational performance against FPGA computational resource consumption.
As shown in Figs. 2, 3 and 5, there are two kinds of arithmetic logic units in this heterogeneous design. The N-1 lanes of ALU 2 to ALU N have the same structure, each comprising one fixed-point adder/subtractor and one fixed-point multiplier, and support only fixed-point addition, subtraction and multiplication. In addition to one fixed-point adder/subtractor and one fixed-point multiplier, ALU 1 also comprises a fixed-point exponential function unit, a fixed-point divider, a fixed-point square-root unit and a dot-product adder tree unit, supporting not only fixed-point addition, subtraction, multiplication and dot product, but also division, the exponential operation and the square-root operation. Although only ALU 1 contains the division, exponential and square-root units, under DCU control the vector processor can still perform these vector operations by looping in the manner of a scalar computer.
1.3 Microcode instruction format
Microcode programs exist in the form of microcode instructions. The microcode length is determined by the number of code fields and the length of each field; the length of an address field depends on the storage depth the processor must address directly. For example, if the processor must address a 4 GByte range, the address width should be 32 bits. The microcode instruction format is shown in Table 1: "A" and "B" are the input vector addresses, and "C" is the destination address of the vector store. The widths of the "A", "B" and "C" fields are all set to 16 bits, addressing a 64 KByte space; although the full address space is not used at present, this leaves ample margin for follow-up designs with larger storage depth. The path index "L" is a 12-bit field that specifies which vector memory stores the (scalar) result of the "PVDOT" instruction. "OP" is the opcode specifying the instruction type; the current vector processor has only 12 instructions, so the width of "OP" is set to 4 bits. The overall microcode width is therefore 64 bits.
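The 64-bit word (16 + 16 + 16 + 12 + 4 bits) can be mimicked with simple bit packing. The patent gives only the field widths, so the field order below (OP | L | C | B | A from the high bits down) is an assumption for illustration:

```python
# Pack/unpack a 64-bit microcode word: OP(4) | L(12) | C(16) | B(16) | A(16).
# Field order is assumed; the patent text specifies only the widths.

def pack(op, l, c, b, a):
    assert op < (1 << 4) and l < (1 << 12)
    assert all(x < (1 << 16) for x in (a, b, c))
    return (op << 60) | (l << 48) | (c << 32) | (b << 16) | a

def unpack(word):
    return {
        "OP": (word >> 60) & 0xF,
        "L":  (word >> 48) & 0xFFF,
        "C":  (word >> 32) & 0xFFFF,
        "B":  (word >> 16) & 0xFFFF,
        "A":  word & 0xFFFF,
    }

w = pack(op=0x3, l=0x001, c=0x0C00, b=0x0400, a=0x0010)
assert w < (1 << 64)  # the overall width is 64 bits
```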
As shown in Table 2, this vector processor currently has 12 instructions; Table 2 also gives the concrete function and machine-cycle count of each instruction. Microcode instructions fall into two kinds, simple and pipelined: the simple instructions are VADD, VSUB, VMUL, VDIV and VEXP, and the pipelined instructions are PVADD, PVSUB, PVMUL and PVDOT. Here "pipelined" means that the N consecutive vector operations inside one instruction are fully pipelined; functionally, a pipelined instruction is equivalent to N consecutive simple instructions, and using pipelined instructions improves execution efficiency for matrix operations. For example, as shown in Fig. 5, the PVADD instruction performs N consecutive vector additions, equivalent to executing N VADD instructions in a row; the Execution stage of PVADD overlaps with the data Write Back stage, so PVADD executes more efficiently, and its speedup relative to N VADD instructions is 4N/(N+3).
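The claimed 4N/(N+3) speedup can be checked numerically. Under the cost model implied by the ratio (an assumption on my part: a simple VADD takes 4 cycles, while a pipelined PVADD overlaps execution with write-back so that N additions take N+3 cycles), the ratio follows directly:

```python
def speedup(n):
    """Speedup of one PVADD (N+3 cycles, fully pipelined) over N VADD (4N cycles).
    The 4-cycle VADD cost is an assumed model consistent with the 4N/(N+3) ratio."""
    simple_cycles = 4 * n      # N independent VADD instructions
    pipelined_cycles = n + 3   # pipeline fill (3 cycles) + one result per cycle
    return simple_cycles / pipelined_cycles

print(speedup(1))   # a single addition gains nothing: 1.0
```

As N grows, the speedup approaches the asymptotic limit of 4, which is why the pipelined instructions pay off for long vectors and matrix operations.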
Table 1 Microcode instruction format
Table 2 Microcode instruction list

Claims (7)

1. A fixed-point vector processor, characterized in that it comprises a program counter (1), a microcode memory (2), a vector memory (3), an ALU (4) and a data control unit (5);
The program counter (1) is configured to receive the counting instruction sent by the data control unit (5) and the input destination-address microcode instruction sent by the microcode memory (2), and to output a count value to the microcode memory (2);
The microcode memory (2) is configured to receive and store the count value sent by the program counter (1), to output the path-index-L microcode instruction to the data control unit (5), to output the OP microcode instruction to the ALU (4) and the data control unit (5), to output the input-vector-address microcode instruction to the vector memory (3), and to output the input destination-address microcode instruction to the program counter (1) and the vector memory (3);
The vector memory (3) is configured to receive and store the input-vector-address and destination-address microcode instructions sent by the microcode memory (2) and the output-vector-data command and enable command sent by the data control unit (5), and to output vector data to the ALU (4);
The ALU (4) is configured to perform a vector operation on the received vector data according to the OP microcode instruction sent by the microcode memory (2) and the ALU control instruction sent by the data control unit (5) to obtain a vector, and to output this vector to the data control unit (5);
The data control unit (5) is configured to produce the enable command and the output-vector-data command according to the path-index-L microcode instruction and the OP microcode instruction sent by the microcode memory (2), and to output them to the vector memory (3); to produce the ALU control instruction and output it to the ALU (4), causing it to perform exponential-function, division and square-root operations; and to produce the counting instruction and output it to the program counter (1), causing it to count.
2. The fixed-point vector processor according to claim 1, characterized in that the vector memory (3) comprises vector memory 1, vector memory 2, ..., vector memory N, and the ALU (4) comprises ALU 1, ALU 2, ..., ALU N, where N is an integer greater than or equal to 1 and less than or equal to 128; vector memory 1 is connected with ALU 1 to form one data path, vector memory 1 and ALU 1 having identical data bit widths; vector memory 2 is connected with ALU 2 to form one data path, vector memory 2 and ALU 2 having identical data bit widths; and vector memory N is connected with ALU N to form one data path, vector memory N and ALU N having identical data bit widths.
3. The fixed-point vector processor according to claim 1, characterized in that the number of data paths formed by the vector memory (3) and the ALU (4) is between 1 and 128.
4. The fixed-point vector processor according to claim 1, characterized in that ALU 2 to ALU N in the ALU (4) have an identical structure, each comprising a fixed-point adder/subtractor (4-1) and a fixed-point multiplier (4-2);
The fixed-point adder/subtractor (4-1) is configured to receive the vector data sent by the vector memory (3), perform addition/subtraction on it, and output the operation result to the data control unit (5);
The fixed-point multiplier (4-2) is configured to receive the vector data sent by the vector memory (3), perform multiplication on it, and output the operation result to the data control unit (5);
ALU 1 comprises a fixed-point adder/subtractor (4-1), a fixed-point multiplier (4-2), a fixed-point exponential function unit (4-3), a fixed-point divider (4-4), a fixed-point square-root unit (4-5) and a dot-product adder tree unit (4-6);
The fixed-point exponential function unit (4-3) is configured to receive the vector data sent by the vector memory (3), perform an exponential operation on it, and output the operation result to the data control unit (5);
The fixed-point divider (4-4) is configured to receive the vector data sent by the vector memory (3), perform division on it, and output the operation result to the data control unit (5);
The fixed-point square-root unit (4-5) is configured to receive the vector data sent by the vector memory (3), perform a square-root operation on it, and output the operation result to the data control unit (5);
The dot-product adder tree unit (4-6) is configured to sum the N element-wise products produced in the dot-product operation.
5. The fixed-point vector processor according to claim 1, characterized in that ALU 1 further comprises a trigonometric function Tan unit, a trigonometric function Atan unit and a logarithmic function Log unit.
6. A vector data access control method of the fixed-point vector processor according to claim 1, 2, 3, 4 or 5, characterized in that the method comprises:
a step of receiving the counting instruction sent by the data control unit (5) and the input destination-address microcode instruction sent by the microcode memory (2), and outputting a count value to the microcode memory (2);
a step of receiving and storing the count value sent by the program counter (1), outputting the path-index-L microcode instruction to the data control unit (5), outputting the OP microcode instruction to the ALU (4) and the data control unit (5), outputting the input-vector-address microcode instruction to the vector memory (3), and outputting the input destination-address microcode instruction to the program counter (1) and the vector memory (3);
a step of receiving and storing the input-vector-address and destination-address microcode instructions sent by the microcode memory (2) and the output-vector-data command and enable command sent by the data control unit (5), and outputting vector data to the ALU (4);
a step of performing a vector operation on the received vector data according to the OP microcode instruction sent by the microcode memory (2) and the ALU control instruction sent by the data control unit (5) to obtain a vector, and outputting this vector to the data control unit (5);
a step of producing the enable command and the output-vector-data command according to the path-index-L microcode instruction and the OP microcode instruction sent by the microcode memory (2) and outputting them to the vector memory (3), producing the ALU control instruction and outputting it to the ALU (4) to cause it to perform exponential-function, division and square-root operations, and producing the counting instruction and outputting it to the program counter (1) to cause it to count.
7. The vector data access control method of the fixed-point vector processor according to claim 6, characterized in that, in the step of obtaining a vector after performing the vector operation on the vector data and transferring it to the data control unit (5), the vector operation process comprises:
a step of receiving the vector data sent by the vector memory (3), performing addition/subtraction on it, and outputting the operation result to the data control unit (5);
a step of receiving the vector data sent by the vector memory (3), performing multiplication on it, and outputting the operation result to the data control unit (5);
a step of receiving the vector data sent by the vector memory (3), performing an exponential operation on it, and outputting the operation result to the data control unit (5);
a step of receiving the vector data sent by the vector memory (3), performing division on it, and outputting the operation result to the data control unit (5);
a step of receiving the vector data sent by the vector memory (3), performing a square-root operation on it, and outputting the operation result to the data control unit (5);
a step of summing the N element-wise products produced in the dot-product operation.
CN201510144307.3A 2015-03-30 2015-03-30 Fixed point vector processor and vector data access controlling method thereof Pending CN104699458A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510144307.3A CN104699458A (en) 2015-03-30 2015-03-30 Fixed point vector processor and vector data access controlling method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510144307.3A CN104699458A (en) 2015-03-30 2015-03-30 Fixed point vector processor and vector data access controlling method thereof

Publications (1)

Publication Number Publication Date
CN104699458A true CN104699458A (en) 2015-06-10

Family

ID=53346632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510144307.3A Pending CN104699458A (en) 2015-03-30 2015-03-30 Fixed point vector processor and vector data access controlling method thereof

Country Status (1)

Country Link
CN (1) CN104699458A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017124648A1 (en) * 2016-01-20 2017-07-27 北京中科寒武纪科技有限公司 Vector computing device
WO2017185385A1 (en) * 2016-04-26 2017-11-02 北京中科寒武纪科技有限公司 Apparatus and method for executing vector merging operation
WO2017185384A1 (en) * 2016-04-26 2017-11-02 北京中科寒武纪科技有限公司 Apparatus and method for executing vector circular shift operation
WO2017185395A1 (en) * 2016-04-26 2017-11-02 北京中科寒武纪科技有限公司 Apparatus and method for executing vector comparison operation
CN107315716A (en) * 2016-04-26 2017-11-03 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing Outer Product of Vectors computing
CN107873091A (en) * 2015-07-20 2018-04-03 高通股份有限公司 SIMD sliding window computings
CN108153514A (en) * 2017-12-19 2018-06-12 北京云知声信息技术有限公司 A kind of floating point vector accelerating method and device
CN108415728A (en) * 2018-03-01 2018-08-17 中国科学院计算技术研究所 A kind of extension floating-point operation instruction executing method and device for processor
CN108733408A (en) * 2017-04-21 2018-11-02 上海寒武纪信息科技有限公司 Counting device and method of counting
CN109388427A (en) * 2017-08-11 2019-02-26 龙芯中科技术有限公司 Vector processing method, vector processing unit and microprocessor
US10762164B2 (en) 2016-01-20 2020-09-01 Cambricon Technologies Corporation Limited Vector and matrix computing device
CN111651205A (en) * 2016-04-26 2020-09-11 中科寒武纪科技股份有限公司 Device and method for executing vector inner product operation
CN112470139A (en) * 2018-01-08 2021-03-09 阿特拉佐有限公司 Compact arithmetic accelerator for data processing apparatus, system and method
WO2021078212A1 (en) * 2019-10-25 2021-04-29 安徽寒武纪信息科技有限公司 Computing apparatus and method for vector inner product, and integrated circuit chip
US11507350B2 (en) 2017-04-21 2022-11-22 Cambricon (Xi'an) Semiconductor Co., Ltd. Processing apparatus and processing method
US11531540B2 (en) 2017-04-19 2022-12-20 Cambricon (Xi'an) Semiconductor Co., Ltd. Processing apparatus and processing method with dynamically configurable operation bit width

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6324638B1 (en) * 1999-03-31 2001-11-27 International Business Machines Corporation Processor having vector processing capability and method for executing a vector instruction in a processor
CN102411773A (en) * 2011-07-28 2012-04-11 中国人民解放军国防科学技术大学 Vector-processor-oriented mean-residual normalized product correlation vectoring method
CN102750133A (en) * 2012-06-20 2012-10-24 中国电子科技集团公司第五十八研究所 32-Bit triple-emission digital signal processor supporting SIMD

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YEYONG PANG et al.: "A low latency kernel recursive least squares processor using FPGA technology", 2013 International Conference on Field-Programmable Technology (FPT) *

Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107873091B (en) * 2015-07-20 2021-05-28 高通股份有限公司 Method and apparatus for sliding window arithmetic
CN107873091A (en) * 2015-07-20 2018-04-03 高通股份有限公司 SIMD sliding window computings
US10762164B2 (en) 2016-01-20 2020-09-01 Cambricon Technologies Corporation Limited Vector and matrix computing device
KR102304216B1 (en) 2016-01-20 2021-09-23 캠브리콘 테크놀로지스 코퍼레이션 리미티드 Vector computing device
US11734383B2 (en) 2016-01-20 2023-08-22 Cambricon Technologies Corporation Limited Vector and matrix computing device
KR102185287B1 (en) 2016-01-20 2020-12-01 캠브리콘 테크놀로지스 코퍼레이션 리미티드 Vector computing device
CN106990940B (en) * 2016-01-20 2020-05-22 中科寒武纪科技股份有限公司 Vector calculation device and calculation method
CN106990940A (en) * 2016-01-20 2017-07-28 南京艾溪信息科技有限公司 A kind of vector calculation device
CN111580863B (en) * 2016-01-20 2024-05-03 中科寒武纪科技股份有限公司 Vector operation device and operation method
CN111580865B (en) * 2016-01-20 2024-02-27 中科寒武纪科技股份有限公司 Vector operation device and operation method
WO2017124648A1 (en) * 2016-01-20 2017-07-27 北京中科寒武纪科技有限公司 Vector computing device
CN111580865A (en) * 2016-01-20 2020-08-25 中科寒武纪科技股份有限公司 Vector operation device and operation method
CN111580863A (en) * 2016-01-20 2020-08-25 中科寒武纪科技股份有限公司 Vector operation device and operation method
KR20190073593A (en) * 2016-01-20 2019-06-26 캠브리콘 테크놀로지스 코퍼레이션 리미티드 Vector computing device
KR20200058562A (en) * 2016-01-20 2020-05-27 캠브리콘 테크놀로지스 코퍼레이션 리미티드 Vector computing device
EP3451156A4 (en) * 2016-04-26 2020-03-25 Cambricon Technologies Corporation Limited Apparatus and method for executing vector circular shift operation
US10853069B2 (en) 2016-04-26 2020-12-01 Cambricon Technologies Corporation Limited Apparatus and methods for comparing vectors
CN107315716B (en) * 2016-04-26 2020-08-07 中科寒武纪科技股份有限公司 Device and method for executing vector outer product operation
WO2017185385A1 (en) * 2016-04-26 2017-11-02 北京中科寒武纪科技有限公司 Apparatus and method for executing vector merging operation
CN111651205B (en) * 2016-04-26 2023-11-17 中科寒武纪科技股份有限公司 Apparatus and method for performing vector inner product operation
CN107315563A (en) * 2016-04-26 2017-11-03 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing vectorial comparison operation
CN111651205A (en) * 2016-04-26 2020-09-11 中科寒武纪科技股份有限公司 Device and method for executing vector inner product operation
WO2017185384A1 (en) * 2016-04-26 2017-11-02 北京中科寒武纪科技有限公司 Apparatus and method for executing vector circular shift operation
CN107315716A (en) * 2016-04-26 2017-11-03 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing Outer Product of Vectors computing
WO2017185395A1 (en) * 2016-04-26 2017-11-02 北京中科寒武纪科技有限公司 Apparatus and method for executing vector comparison operation
US11720353B2 (en) 2017-04-19 2023-08-08 Shanghai Cambricon Information Technology Co., Ltd Processing apparatus and processing method
US11698786B2 (en) 2017-04-19 2023-07-11 Shanghai Cambricon Information Technology Co., Ltd Processing apparatus and processing method
US11531541B2 (en) 2017-04-19 2022-12-20 Shanghai Cambricon Information Technology Co., Ltd Processing apparatus and processing method
US11531540B2 (en) 2017-04-19 2022-12-20 Cambricon (Xi'an) Semiconductor Co., Ltd. Processing apparatus and processing method with dynamically configurable operation bit width
US11734002B2 (en) 2017-04-19 2023-08-22 Shanghai Cambricon Information Technology Co., Ltd Counting elements in neural network input data
CN108733408A (en) * 2017-04-21 2018-11-02 上海寒武纪信息科技有限公司 Counting device and method of counting
US11507350B2 (en) 2017-04-21 2022-11-22 Cambricon (Xi'an) Semiconductor Co., Ltd. Processing apparatus and processing method
CN109324826B (en) * 2017-04-21 2021-03-26 上海寒武纪信息科技有限公司 Counting device and counting method
CN109324826A (en) * 2017-04-21 2019-02-12 上海寒武纪信息科技有限公司 Counting device and method of counting
CN109388427A (en) * 2017-08-11 2019-02-26 龙芯中科技术有限公司 Vector processing method, vector processing unit and microprocessor
CN108153514A (en) * 2017-12-19 2018-06-12 北京云知声信息技术有限公司 A kind of floating point vector accelerating method and device
CN112470139B (en) * 2018-01-08 2022-04-08 阿特拉佐有限公司 Compact arithmetic accelerator for data processing apparatus, system and method
CN112470139A (en) * 2018-01-08 2021-03-09 阿特拉佐有限公司 Compact arithmetic accelerator for data processing apparatus, system and method
CN108415728B (en) * 2018-03-01 2020-12-29 中国科学院计算技术研究所 Extended floating point operation instruction execution method and device for processor
CN108415728A (en) * 2018-03-01 2018-08-17 中国科学院计算技术研究所 A kind of extension floating-point operation instruction executing method and device for processor
WO2021078212A1 (en) * 2019-10-25 2021-04-29 安徽寒武纪信息科技有限公司 Computing apparatus and method for vector inner product, and integrated circuit chip

Similar Documents

Publication Publication Date Title
CN104699458A (en) Fixed point vector processor and vector data access controlling method thereof
Singh et al. NERO: A near high-bandwidth memory stencil accelerator for weather prediction modeling
CN106940815B (en) Programmable convolutional neural network coprocessor IP core
CN106775599B (en) The more computing unit coarseness reconfigurable systems and method of recurrent neural network
CN108197705A (en) Convolutional neural networks hardware accelerator and convolutional calculation method and storage medium
CN105468335A (en) Pipeline-level operation device, data processing method and network-on-chip chip
CN112308222B (en) RRAM (remote radio access m) -based memory and calculation integrated full-system simulator and design method thereof
CN105468568B (en) Efficient coarseness restructurable computing system
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN102184092A (en) Special instruction set processor based on pipeline structure
CN103984560A (en) Embedded reconfigurable system based on large-scale coarseness and processing method thereof
CN108647779A (en) A kind of low-bit width convolutional neural networks Reconfigurable Computation unit
CN112446471B (en) Convolution acceleration method based on heterogeneous many-core processor
CN116710912A (en) Matrix multiplier and control method thereof
CN110018848A (en) A kind of mixing based on RISC-V is mixed to calculate system and method
CN102722472A (en) Complex matrix optimizing method
Zafar et al. Hardware architecture design and mapping of ‘Fast Inverse Square Root’algorithm
CN111178492B (en) Computing device, related product and computing method for executing artificial neural network model
CN116822600A (en) Neural network search chip based on RISC-V architecture
CN112051981A (en) Data pipeline computing path structure and single-thread data pipeline system
CN103761213A (en) On-chip array system based on circulating pipeline computation
Abdelhamid et al. MITRACA: A next-gen heterogeneous architecture
Diamantopoulos et al. A system-level transprecision FPGA accelerator for BLSTM using on-chip memory reshaping
Vishnu et al. 32-Bit RISC Processor Using VedicMultiplier
Daisaka et al. GRAPE-mp: An simd accelerator board for multi-precision arithmetic

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150610