CN104699458A - Fixed point vector processor and vector data access controlling method thereof - Google Patents
- Publication number
- CN104699458A CN104699458A CN201510144307.3A CN201510144307A CN104699458A CN 104699458 A CN104699458 A CN 104699458A CN 201510144307 A CN201510144307 A CN 201510144307A CN 104699458 A CN104699458 A CN 104699458A
- Authority
- CN
- China
- Prior art keywords
- vector
- alu
- memory
- control unit
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a fixed-point vector processor and its vector data access control method. It relates to a vector processor for online time-series prediction and aims to solve the problem that existing vector processors, which cannot be optimized for a specific method, have poor generality and cannot meet the requirements of online computation. The fixed-point vector processor comprises a program counter, a microcode memory, a vector memory, an arithmetic logic unit (ALU) and a data control unit. The signal-processing procedures of these five components together form a complete fixed-point vector processing flow. Through the ALU design, the ALU structure of each data lane can be changed flexibly according to computational needs, achieving flexible configuration of the instruction set; the fixed-point vector processor and its vector data access control method are therefore applicable to settings that require complex computation.
Description
Technical field
The present invention relates to a high-performance, low-power, low-latency fixed-point vector processor, and in particular to a vector processor for online time-series prediction.
Background art
At present, online machine learning on embedded high-performance computing platforms has become a research hotspot. Cyber-physical system nodes with high-performance, low-power, low-latency online data processing capability, which tightly integrate information acquisition, intelligent information processing and network communication, are widely used in fields such as environmental monitoring, industrial production and aerospace engineering. For online applications, however, nonlinear methods must continually add new samples and update the model; the ever-growing sample size and the computation required for model updating increase rapidly, posing a great challenge to the performance of embedded computing platforms. Applications that demand high performance, low latency, online machine learning and high-throughput data processing therefore require a high-performance computing platform.
Existing FPGA-based computing systems are typically designed in an HDL, mapping a specific algorithm onto the FPGA at the RTL level. This special-purpose accelerator approach has many design examples and achieves very high computational speed-ups.
However, although mapping a special-purpose algorithm with HDL code achieves a very high speed-up, generality is weak. This shows mainly in that when the target algorithm changes, most or even all of the design must be modified by hand; flexibility is very poor, and the design complexity and design cycle limit the applicability of such designs, i.e. scalability is poor. Moreover, existing vector processors are designed along the lines of general-purpose processors; they cannot be optimized for a specific method, and their performance cannot meet the demands of online computation.
In the prior art, the number of lanes of a vector processor is restricted to a power of two (16, 32, 64, 128). In the present embodiment, the number of data lanes of the vector processor can be set flexibly according to the specific computation; to reduce hardware design complexity, the design herein can realize a vector processor with any number of lanes from 1 to 128. This vector processor design is highly portable and can easily be ported to other FPGA devices.
Summary of the invention
The present invention is proposed to solve the following problems: 1) software implementations of kernel methods on general-purpose processors have low computational performance, high power consumption and long delay; 2) existing vector processors cannot be optimized for a specific method, resulting in weak generality and failure to meet the demands of online computation; 3) existing vector processors cannot balance computational performance against on-chip FPGA computational resource consumption. A fixed-point vector processor and its vector data access control method are therefore proposed.
The fixed-point vector processor comprises a program counter, a microcode memory, a vector memory, an arithmetic logic unit (ALU) and a data control unit;
The program counter 1 receives the counting instruction sent by the data control unit 5 and the input destination-address microinstruction sent by the microcode memory 2, and outputs a count value to the microcode memory 2;
The microcode memory 2 receives and stores the count value sent by the program counter 1, outputs the lane-index L microinstruction to the data control unit 5, simultaneously outputs the OP microinstruction to the ALU 4 and the data control unit 5, outputs the input-vector-address microinstruction to the vector memory 3, and outputs the input destination-address microinstruction to the program counter 1 and the vector memory 3;
The vector memory 3 receives and stores the input-vector-address and destination-address microinstructions sent by the microcode memory 2 as well as the output-vector-data command and enable command sent by the data control unit 5, and outputs vector data to the ALU 4;
The ALU 4 performs vector operations on the received vector data according to the OP microinstruction sent by the microcode memory 2 and the ALU control instruction sent by the data control unit 5 to obtain a vector, and outputs this vector to the data control unit 5;
The data control unit 5 produces the enable command and the output-vector-data command according to the lane-index L microinstruction and the OP microinstruction sent by the microcode memory 2, and outputs them to the vector memory 3; it also produces the ALU control instruction and outputs it to the ALU 4, causing it to perform exponential-function, division and square-root operations; it also produces the counting instruction and outputs it to the program counter 1, causing it to count.
The vector data access control method of the fixed-point vector processor comprises the following steps:
a step of receiving and storing the count value sent by the program counter 1, outputting the lane-index L microinstruction to the data control unit 5, simultaneously outputting the OP microinstruction to the ALU 4 and the data control unit 5, outputting the input-vector-address microinstruction to the vector memory 3, and outputting the input destination-address microinstruction to the program counter 1 and the vector memory 3;
a step of receiving and storing the input-vector-address and destination-address microinstructions sent by the microcode memory 2 as well as the output-vector-data command and enable command sent by the data control unit 5, and outputting vector data to the ALU 4;
a step of performing vector operations on the received vector data according to the OP microinstruction sent by the microcode memory 2 and the ALU control instruction sent by the data control unit 5 to obtain a vector, and outputting this vector to the data control unit 5;
a step of producing the enable command and the output-vector-data command according to the lane-index L microinstruction and the OP microinstruction sent by the microcode memory 2 and outputting them to the vector memory 3; also producing the ALU control instruction and outputting it to the ALU 4, causing it to perform exponential-function, division and square-root operations; and also producing the counting instruction and outputting it to the program counter 1, causing it to count.
Beneficial effects: the fixed-point vector processor of the present invention is an FPGA-based vector processor with strong generality and scalability.
The innovations of the present invention are:
1) The processor adopts a new, upgradeable vector processor architecture, designed in a hardware description language (HDL). It has strong generality and can be configured as a floating-point or variable-word-length fixed-point vector processor.
2) The number of lanes can be defined from 8 to 128 according to computational needs, for a total compute-engine width of up to 4096 bits. The lane design is optimized for the computational requirements of machine learning methods, improving computational performance. Through the heterogeneous ALU design, the ALU structure of each data lane can be changed flexibly according to computational needs, achieving flexible configuration of the instruction set and a balance among computational performance, power consumption and computational resource consumption.
3) By writing microcode programs, multiple machine learning methods can be realized on this vector processor, solving the weak generality and poor scalability of traditional FPGA computing designs; the reusability of the design is greatly enhanced.
4) Under the premise of meeting computational accuracy requirements, this fixed-point vector processor achieves 2x and 9x computational performance gains over a floating-point vector processor and a CPU respectively, reduces power consumption to 1/3 and 1/40, and reduces computational delay to 1/2 and 1/9.
Brief description of the drawings
Fig. 1 is a structural diagram of the fixed-point vector processor of the present invention;
Fig. 2 is an internal structural diagram of ALU 2 to ALU N in embodiment four;
Fig. 3 is a structural diagram of ALU 1 in embodiment four;
Fig. 4 is a structural diagram of the microcode memory and vector memory of the fixed-point vector processor in the embodiment;
Fig. 5 compares the sequencing of simple instructions and pipelined instructions of the fixed-point vector processor in the embodiment.
Detailed description of the embodiments
Embodiment one. This embodiment is described with reference to Fig. 1. The fixed-point vector processor of this embodiment comprises a program counter 1, a microcode memory 2, a vector memory 3, an ALU 4 and a data control unit 5;
The program counter 1 receives the counting instruction sent by the data control unit 5 and the input destination-address microinstruction sent by the microcode memory 2, and outputs a count value to the microcode memory 2;
The microcode memory 2 receives and stores the count value sent by the program counter 1, outputs the lane-index L microinstruction to the data control unit 5, simultaneously outputs the OP microinstruction to the ALU 4 and the data control unit 5, outputs the input-vector-address microinstruction to the vector memory 3, and outputs the input destination-address microinstruction to the program counter 1 and the vector memory 3;
The vector memory 3 receives and stores the input-vector-address and destination-address microinstructions sent by the microcode memory 2 as well as the output-vector-data command and enable command sent by the data control unit 5, and outputs vector data to the ALU 4;
The ALU 4 performs vector operations on the received vector data according to the OP microinstruction sent by the microcode memory 2 and the ALU control instruction sent by the data control unit 5 to obtain a vector, and outputs this vector to the data control unit 5;
The data control unit 5 produces the enable command and the output-vector-data command according to the lane-index L microinstruction and the OP microinstruction sent by the microcode memory 2, and outputs them to the vector memory 3; it also produces the ALU control instruction and outputs it to the ALU 4, causing it to perform exponential-function, division and square-root operations; it also produces the counting instruction and outputs it to the program counter 1, causing it to count.
With reference to Fig. 1: the lane-index L microinstruction corresponds to L in the figure; the OP microinstruction corresponds to OP; the input-vector-address microinstructions correspond to Vector A Address and Vector B Address; the input destination-address microinstruction corresponds to Vector C Address; the output-vector-data command corresponds to VDATA; the enable command corresponds to WR; the count value sent by the program counter 1 corresponds to B ADDR; the counting instruction sent by the data control unit 5 corresponds to LOAD; the ALU control instruction corresponds to CON; and the vector data sent by the vector memory 3 correspond to QA1, QA2, ..., QAN and QB1, QB2, ..., QBN.
The working principle of the fixed-point vector processor is as follows: 1) when the vector processor executes instructions in sequence, the count value of the program counter is automatically incremented by 1 at the end of each machine cycle; when the program reaches a branch instruction (BRANCH), the jump target address is loaded into the program counter. The count value of the program counter is used directly as the address input of the microcode memory. 2) The microcode program that implements a specific algorithm is pre-stored in the microcode memory, and microinstructions are fetched and executed in turn under the control of the program counter. 3) Under the control of a microinstruction, the vector data stored at a particular address of the vector memory is fetched and sent to the arithmetic logic unit (ALU) for computation. 4) The ALU, under the control of the microinstruction and the data control unit, completes all vector operations and transfers the results to the data control unit. 5) The data control unit, according to the different microinstructions, produces the vector-memory write-enable (WR) signal to control storage into the vector memory; meanwhile, the data control unit (DCU) produces the ALU control signal (CON), and the heterogeneous computing units of the ALU complete functions such as the exponential function (EXP), division (DIV) and square root (SQRT).
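The control flow in steps 1) and 2) above can be sketched in software. This is a minimal Python model, not the patent's hardware: the program counter auto-increments each cycle, a BRANCH microinstruction loads its jump target, and the counter value directly addresses the microcode memory. The opcode names and the `(op, arg)` encoding are illustrative assumptions.

```python
def run(microcode, max_cycles=100):
    """Fetch/execute loop: 'microcode' is a list of (op, arg) microinstructions.

    The PC value is used directly as the microcode-memory address, as in the
    patent's description; 'HALT' is an assumed stand-in for program end.
    """
    pc = 0
    trace = []                       # addresses fetched, in order
    for _ in range(max_cycles):
        op, arg = microcode[pc]      # PC directly addresses the microcode memory
        trace.append(pc)
        if op == "BRANCH":
            pc = arg                 # jump target is loaded into the PC
        elif op == "HALT":
            break
        else:
            pc += 1                  # auto-increment at the end of each cycle
    return trace
```

For example, a program that branches over one instruction fetches addresses 0, 1, then 3.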
For low-latency, high-efficiency online time-series prediction, and aimed at online time-series prediction applications based on kernel adaptive filtering (KAF) methods, the application proposes an FPGA-based vector processor architecture for the class of KAF methods: a high-performance, low-latency fixed-point vector processor that is relatively general within the scope of KAF methods. Multi-lane parallelism, pipelining and fixed-point techniques improve computational performance while addressing the problems of long computational delay and high power consumption. The vector processor adopts a microcode-based programming model, realizing computational optimization at the instruction level, so its generality and scalability are greatly improved over the traditional RTL-level mapping method.
By fully mining the features of the algorithms' computational requirements, a heterogeneous ALU design is proposed that preserves the number of data lanes while accommodating vector division, square-root and exponential-function computation, achieving a balance between computational performance and FPGA resource consumption. Because computational accuracy depends on the test data set used, the application adopts a variable-bit-width fixed-point processor design; when accuracy requirements are met, this saves more FPGA resources and achieves a higher operating frequency, yielding higher computing speed and lower computational delay than a floating-point vector processor. The application implements the three most classical KAF methods on this fixed-point vector processor. Experiments show that, under the premise of meeting computational accuracy requirements, this fixed-point vector processor achieves 2x and 9x performance gains over a floating-point vector processor and a CPU respectively, reduces power consumption to 1/3 and 1/40, and reduces computational delay to 1/2 and 1/9.
Embodiment two. This embodiment further describes the fixed-point vector processor of embodiment one. In this embodiment, the vector memory 3 comprises vector memory 1, vector memory 2, ..., vector memory N; the ALU 4 comprises ALU 1, ALU 2, ..., ALU N, where N is an integer greater than or equal to 1 and less than or equal to 128. Vector memory 1 is connected to ALU 1 to form one data lane, and the two have the same data bit width; vector memory 2 is connected to ALU 2 to form one data lane, and the two have the same data bit width; ...; vector memory N is connected to ALU N to form one data lane, and the two have the same data bit width.
Each vector memory and its ALU form one data lane; the two are directly connected and have the same data bit width M.
Embodiment three. This embodiment further describes the fixed-point vector processor of embodiment one. In this embodiment, the number of data lanes formed by the vector memory 3 and the ALU 4 is 1 to 128.
Embodiment four. This embodiment is described with reference to Fig. 2 and Fig. 3 and further describes the fixed-point vector processor of embodiment one. In this embodiment, ALU 2 to ALU N in the ALU 4 have identical structures, each comprising a fixed-point adder/subtractor 4-1 and a fixed-point multiplier 4-2;
The fixed-point adder/subtractor 4-1 receives the vector data sent by the vector memory 3, performs addition/subtraction on it and outputs the result to the data control unit 5;
The fixed-point multiplier 4-2 receives the vector data sent by the vector memory 3, performs multiplication on it and outputs the result to the data control unit 5;
ALU 1 comprises a fixed-point adder/subtractor 4-1 and a fixed-point multiplier 4-2, as well as a fixed-point exponential-function unit 4-3, a fixed-point divider 4-4, a fixed-point square-root unit 4-5 and a dot-product adder-tree unit 4-6;
The fixed-point exponential-function unit 4-3 receives the vector data sent by the vector memory 3, performs exponentiation on it and outputs the result to the data control unit 5;
The fixed-point divider 4-4 receives the vector data sent by the vector memory 3, performs division on it and outputs the result to the data control unit 5;
The fixed-point square-root unit 4-5 receives the vector data sent by the vector memory 3, performs square-root extraction on it and outputs the result to the data control unit 5;
The dot-product adder-tree unit 4-6 sums the N partial results produced during a dot-product operation.
In this embodiment, as shown in Fig. 2 and Fig. 3, this heterogeneous design has two kinds of arithmetic logic units. The N-1 lanes ALU 2 to ALU N have identical structures, each comprising one fixed-point adder/subtractor and one fixed-point multiplier, and support only fixed-point addition, subtraction and multiplication. The fixed-point multiplier is a single-precision multiplier.
Besides a fixed-point adder/subtractor and a single-precision multiplier, ALU 1 also comprises a fixed-point exponential-function unit, a fixed-point divider, a fixed-point square-root unit and a dot-product adder-tree unit; it supports fixed-point addition, subtraction and multiplication, and also division, exponentiation and square-root operations.
In total the arithmetic logic unit comprises N multipliers. The data control unit (DCU) produces the ALU control signal (CON), and the heterogeneous computing units of the ALU complete functions such as the exponential function (EXP), division (DIV) and square root (SQRT).
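The heterogeneous ALU array described above can be modeled as a minimal sketch: every lane supports fixed-point add/subtract/multiply, but only lane 1 also has exponential, division and square-root hardware. The dispatch function and the `SQRT` op name are assumptions for illustration (the patent lists VADD/VSUB/VMUL/VDIV/VEXP as simple instructions); floating-point arithmetic stands in for the fixed-point units.

```python
import math

# Operations available in every lane (ALU 1 .. ALU N).
COMMON_OPS = {
    "VADD": lambda a, b: a + b,
    "VSUB": lambda a, b: a - b,
    "VMUL": lambda a, b: a * b,
}

# Operations implemented only by ALU 1's extra hardware units.
LANE1_ONLY_OPS = {
    "VDIV": lambda a, b: a / b,
    "VEXP": lambda a, b: math.exp(a),
    "SQRT": lambda a, b: math.sqrt(a),
}

def alu(lane, op, a, b=0.0):
    """Model one lane's ALU: lane numbering starts at 1, matching the patent."""
    if op in COMMON_OPS:
        return COMMON_OPS[op](a, b)
    if op in LANE1_ONLY_OPS:
        if lane != 1:
            raise ValueError("only ALU 1 implements %s" % op)
        return LANE1_ONLY_OPS[op](a, b)
    raise ValueError("unknown op %s" % op)
```

This mirrors the resource trade-off in the text: the expensive units exist once, in lane 1, and requesting them from any other lane is an error that the DCU must route around.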
In Fig. 2 and Fig. 3, QA and QB denote the vector-data output ports of the vector memory 3; the vector data of the vector memory 3 enter the ALU 4 through QA and QB. VADD, VSUB and VMUL are simple instructions. Clock denotes the clock signal, which assists the computation. Add-sub denotes the addition/subtraction operation.
In Fig. 3, QA_1, QA_2, ..., QA_N and QB_1, QB_2, ..., QB_N denote the vector data sent by vector memory 1 through vector memory N. QA_O denotes the single output of QA_1, QA_2, ..., QA_N after the multiplexer; QB_O denotes the single output of QB_1, QB_2, ..., QB_N after the multiplexer. VMUL_1, VMUL_2, ..., VMUL_N denote the outputs of the N multipliers. VDOT denotes the dot-product output.
SEXP denotes the exponential-function instruction; SDIV denotes the division instruction; SSORT denotes the square-root instruction; S2V denotes expanding a scalar into a vector.
Although only ALU 1 contains division, exponentiation and square-root units, this fixed-point vector processor can still realize the corresponding vector operations through a scalar-computer-like round-robin scheme under the control of the data control unit (DCU).
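The round-robin scheme just described can be sketched as follows: for an operation only ALU 1 implements (square root, say), the DCU feeds the N lane operands through lane 1 one element per machine cycle, as a scalar computer would. This is a hypothetical software model of the data movement, not the patent's hardware.

```python
import math

def vector_sqrt_roundrobin(vec):
    """Compute an element-wise square root of an N-lane vector by routing each
    lane's operand through ALU 1's single SQRT unit in turn (one per cycle)."""
    result = []
    for x in vec:                    # one lane's element per machine cycle
        result.append(math.sqrt(x))  # ALU 1 is the only lane with this unit
    return result
```

The trade-off is the one the text names: an N-element square root costs on the order of N cycles instead of one, but the other N-1 lanes stay small.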
The heterogeneous ALU design described in this embodiment, by fully mining the features of KAF algorithms, balances computational performance against on-chip FPGA computational resource consumption. This heterogeneous lane approach also gives the fixed-point vector processor strong instruction-extension capability.
Embodiment five. This embodiment further describes the fixed-point vector processor of embodiment four. In this embodiment, ALU 1 also comprises a trigonometric-function Tan unit, a trigonometric-function Atan unit and a logarithmic-function Log unit.
The heterogeneous lane approach of embodiment four gives this fixed-point vector processor strong instruction-extension capability. By modifying the structure of ALU 1, hardware units for many other function computations can be added, such as the trigonometric functions Tan/Atan and the logarithmic function Log; in principle any standard or user-defined operation can be realized, achieving extension of the vector instruction set.
Embodiment six. This embodiment is described on the basis of embodiment one, two, three, four or five. The vector data access control method of the fixed-point vector processor of this embodiment comprises the following steps:
a step of receiving and storing the count value sent by the program counter 1, outputting the lane-index L microinstruction to the data control unit 5, simultaneously outputting the OP microinstruction to the ALU 4 and the data control unit 5, outputting the input-vector-address microinstruction to the vector memory 3, and outputting the input destination-address microinstruction to the program counter 1 and the vector memory 3;
a step of receiving and storing the input-vector-address and destination-address microinstructions sent by the microcode memory 2 as well as the output-vector-data command and enable command sent by the data control unit 5, and outputting vector data to the ALU 4;
a step of performing vector operations on the received vector data according to the OP microinstruction sent by the microcode memory 2 and the ALU control instruction sent by the data control unit 5 to obtain a vector, and outputting this vector to the data control unit 5;
a step of producing the enable command and the output-vector-data command according to the lane-index L microinstruction and the OP microinstruction sent by the microcode memory 2 and outputting them to the vector memory 3; also producing the ALU control instruction and outputting it to the ALU 4, causing it to perform exponential-function, division and square-root operations; and also producing the counting instruction and outputting it to the program counter 1, causing it to count.
Embodiment seven. This embodiment further describes the vector data access control method of embodiment six. In this embodiment, in the step of performing vector operations on the vector data to obtain a vector and transferring it to the data control unit 5, the vector operation process comprises:
a step of receiving the vector data sent by the vector memory 3, performing addition/subtraction on it and outputting the result to the data control unit 5;
a step of receiving the vector data sent by the vector memory 3, performing multiplication on it and outputting the result to the data control unit 5;
a step of receiving the vector data sent by the vector memory 3, performing exponentiation on it and outputting the result to the data control unit 5;
a step of receiving the vector data sent by the vector memory 3, performing division on it and outputting the result to the data control unit 5;
a step of receiving the vector data sent by the vector memory 3, performing square-root extraction on it and outputting the result to the data control unit 5;
a step of summing the N partial results produced during a dot-product operation.
A vector processor, also called an array processor, can perform arithmetic operations on whole data sets synchronously, whereas most CPUs are scalar processors that handle only one element at a time. Vector processors are widely used in scientific computing; they were the basis of most supercomputers of the 1980s and even the 1990s. Most current commercial CPUs include some vector-processing instructions, typically SIMD. Vector processors also play a vital role in the architectures of video-game consoles and consumer graphics hardware.
Embodiment eight. The fixed-point vector processor provided by the present invention is described in further detail below with reference to the drawings and embodiments.
Each vector memory and its ALU form one data lane; the two are directly connected and have the same data bit width M. Referring to Fig. 1, the vector memory 3 comprises vector memory 1, vector memory 2, ..., vector memory N; the ALU 4 comprises ALU 1, ALU 2, ..., ALU N, where N is an integer greater than or equal to 1 and less than or equal to 128. Vector memory 1 is connected to ALU 1 to form one data lane, and the two have the same data bit width; likewise vector memory 2 with ALU 2, through vector memory N with ALU N.
1.1 Memory interface
Traditional RISC processors generally use a register file to store intermediate data; the register file typically has dozens of ports that can be read and written simultaneously, to achieve very high computational performance. Since no multi-port (more than 2 ports) memory is available on an FPGA, the application uses on-chip dual-port RAM as the vector memory.
As shown in Fig. 4, each data lane contains two identical dual-port RAMs as its vector memory. Their read-address ports are driven respectively by the 'A' and 'B' code fields of the microinstruction (Table 1), and the write-address port is driven by 'C'. The data output ports 'QA' and 'QB' are coupled directly to the ALU inputs. The 'WR' signal is the write-enable signal of each vector memory, produced by the data control unit (DCU) to control the storage of vector data. The microcode memory is set to work in ROM mode and loads the pre-stored microcode program at power-up, so unlike the vector memory it cannot be rewritten online.
The storage depth of the vector memory is 2048, and the stored data width (M) can be set flexibly according to the width of the fixed-point ALU; the integer-part and fraction-part widths are defined before compilation by macro variables in the form of parameters. The vector-memory address space is divided into 3 segments: 0x0000-0x0BFF stores training and online-test data, preset at compile time; 0x0C00-0x0FEF stores intermediate calculation results; and the last 16 words, 0x0FF0-0x0FFF, store the constants needed during vector calculation.
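The three-segment address map above can be written out as constants to make the segment sizes explicit. The segment names are illustrative; the boundaries are the ones stated in the text. (Note that the stated map spans 0x0000-0x0FFF, i.e. 4096 addresses, while the stated depth is 2048; the discrepancy is in the source and is left as-is here.)

```python
# Vector-memory address map as stated in the text (inclusive ranges).
TRAIN_TEST_DATA = (0x0000, 0x0BFF)  # training and online-test data, preset at compile time
INTERMEDIATE    = (0x0C00, 0x0FEF)  # intermediate calculation results
CONSTANTS       = (0x0FF0, 0x0FFF)  # constants used during vector calculation

def segment_size(seg):
    """Number of words in an inclusive (lo, hi) address range."""
    lo, hi = seg
    return hi - lo + 1
```

The constants segment indeed comes out to the 16 words the text claims.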
1.2 Arithmetic logic unit
The application adopts a heterogeneous arithmetic logic unit (ALU) design. Common KAF methods need only a small number of division, exponentiation and square-root operations in the computation process; these units are used very infrequently, yet consume a large amount of on-chip FPGA DSP resources. Early experiments showed that if every ALU contained division and exponentiation modules, the maximum number of lanes of the vector processor would be only 17, which would seriously limit its performance. Therefore, the application adopts a heterogeneous lane design that preserves the number of data lanes without affecting overall performance, balancing computational performance against FPGA computational resource consumption.
As shown in Fig. 2, Fig. 3 and Fig. 5, this heterogeneous design has two kinds of arithmetic logic units. The N-1 lanes ALU 2 to ALU N have identical structures, each comprising one fixed-point adder/subtractor and one fixed-point multiplier, and support only fixed-point addition, subtraction and multiplication. Besides a fixed-point adder/subtractor and a fixed-point multiplier, ALU 1 also comprises a fixed-point exponential-function unit, a fixed-point divider, a fixed-point square-root unit and a dot-product adder-tree unit; it supports not only fixed-point addition, subtraction, multiplication and dot products, but also division, exponentiation and square-root operations. Although only ALU 1 contains division, exponentiation and square-root units, the vector processor can realize the corresponding vector operations through a scalar-computer-like round-robin scheme under DCU control.
1.3 Microinstruction structure
The microcode program exists in the form of microinstructions. The length of a microinstruction is determined by the number of code fields and the length of each field; in particular, the length of an address field depends on the storage depth the processor must address directly (for example, to address a 4 GByte range, the address width should be 32 bits). The microinstruction format is shown in Table 1: 'A' and 'B' are the input-vector addresses, and 'C' is the destination address for storing the vector. The widths of the 'A', 'B' and 'C' fields are all set to 16 bits, addressing a 64 KByte space; although the whole address space is not used at present, this leaves ample margin for follow-up designs with larger storage depths. The lane index 'L' is a 12-bit field that specifies which vector memory stores the (scalar) result of the 'PVDOT' instruction. 'OP' is the opcode specifying the instruction type; the vector processor currently has only 12 instructions, so the width of 'OP' is set to 4 bits. The overall width of a microinstruction is therefore 64 bits.
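The field widths above (A, B, C: 16 bits each; L: 12 bits; OP: 4 bits) sum exactly to 64 bits, and can be sketched as a pack/unpack pair. The ordering of fields within the word is an assumption here; the patent's Table 1 fixes the actual layout.

```python
def pack(a, b, c, lane, op):
    """Pack a 64-bit microinstruction: A|B|C (16 bits each), L (12), OP (4)."""
    assert a < 1 << 16 and b < 1 << 16 and c < 1 << 16
    assert lane < 1 << 12 and op < 1 << 4
    return (a << 48) | (b << 32) | (c << 16) | (lane << 4) | op

def unpack(word):
    """Inverse of pack(): recover (A, B, C, L, OP) from the 64-bit word."""
    return (word >> 48 & 0xFFFF, word >> 32 & 0xFFFF,
            word >> 16 & 0xFFFF, word >> 4 & 0xFFF, word & 0xF)
```

A round trip through pack/unpack shows the fields do not overlap and fill the word without waste (16 + 16 + 16 + 12 + 4 = 64).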
As shown in Table 2, the vector processor currently has 12 instructions; Table 2 also gives the function of each instruction and its execution time in machine cycles. The microinstructions fall into two kinds, simple and pipelined: the simple instructions are VADD, VSUB, VMUL, VDIV and VEXP, and the pipelined instructions are PVADD, PVSUB, PVMUL and PVDOT. "Pipelined" here means that the successive vector operations inside one instruction are fully pipelined; such an instruction is functionally equivalent to N consecutive simple instructions, and using pipelined instructions improves execution efficiency for matrix operations. For example, as shown in Fig. 5, the PVADD instruction performs N consecutive vector additions, equivalent to executing N VADD instructions back to back; because the Execution stage of PVADD overlaps with the data Write Back stage, PVADD executes more efficiently, achieving a speed-up of 4N/(N+3) relative to repeated VADD instructions.
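The 4N/(N+3) figure follows from a simple cycle count, sketched below under the assumption (consistent with the ratio, though not spelled out in the text) that one simple instruction occupies a 4-stage pipeline with no overlap between instructions, so N VADDs cost 4N cycles, while a fully pipelined PVADD retires one result per cycle once the pipe fills, costing N + 3 cycles.

```python
# Numeric check of the quoted speed-up 4N/(N+3): assumes a 4-stage pipeline
# where simple instructions do not overlap but PVADD keeps the pipe full.

def simple_cycles(n, depth=4):
    return depth * n           # N back-to-back VADDs, no overlap

def pipelined_cycles(n, depth=4):
    return n + (depth - 1)     # fill latency + one result per cycle

def speedup(n):
    return simple_cycles(n) / pipelined_cycles(n)

for n in (4, 16, 64):
    print(n, speedup(n))       # ratio approaches 4 as N grows
```

So the benefit of the pipelined instructions grows with the vector count N and saturates at the pipeline depth (4x here), which is why they pay off most for matrix operations built from long runs of identical vector ops.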
Table 1 Microcode format
Table 2 List of microcode program instructions
Claims (7)
1. A fixed-point vector processor, characterized in that it comprises a program counter (1), a microcode memory (2), a vector memory (3), an arithmetic logic unit (ALU) (4) and a data control unit (5);
The program counter (1) is configured to receive the counting instruction sent by the data control unit (5) and the input destination-address microinstruction sent by the microcode memory (2), and to output a count value to the microcode memory (2);
The microcode memory (2) is configured to receive and store the count value sent by the program counter (1), and to output the path-index-L microinstruction to the data control unit (5), the OP microinstruction to both the ALU (4) and the data control unit (5), the input-vector-address microinstruction to the vector memory (3), and the input destination-address microinstruction to the program counter (1) and the vector memory (3);
The vector memory (3) is configured to receive and store the input-vector-address and destination-address microinstructions sent by the microcode memory (2) and the output-vector-data command and enable command sent by the data control unit (5), and to output vector data to the ALU (4);
The ALU (4) is configured to perform a vector operation on the received vector data according to the OP microinstruction sent by the microcode memory (2) and the ALU control instruction sent by the data control unit (5), obtain a result vector, and output that vector to the data control unit (5);
The data control unit (5) is configured to generate the enable command and the output-vector-data command according to the path-index-L and OP microinstructions sent by the microcode memory (2) and output them to the vector memory (3); to generate the ALU control instruction and output it to the ALU (4) so that the ALU performs exponential function, division and square-root operations; and to generate the counting instruction and output it to the program counter (1) to make it count.
2. The fixed-point vector processor according to claim 1, characterized in that the vector memory (3) comprises vector memory 1, vector memory 2, ..., vector memory N, and the ALU (4) comprises ALU 1, ALU 2, ..., ALU N, where N is an integer greater than or equal to 1 and less than or equal to 128; vector memory 1 is connected with ALU 1 to form one data path, the two having the same data bit width; vector memory 2 is connected with ALU 2 to form one data path, the two having the same data bit width; and likewise vector memory N is connected with ALU N to form one data path, the two having the same data bit width.
3. The fixed-point vector processor according to claim 1, characterized in that the number of data paths formed by the vector memory (3) and the ALU (4) ranges from 1 to 128.
4. The fixed-point vector processor according to claim 1, characterized in that ALU 2 through ALU N in the ALU (4) have an identical structure, each comprising a fixed-point adder/subtractor (4-1) and a fixed-point multiplier (4-2);
The fixed-point adder/subtractor (4-1) is configured to receive the vector data sent by the vector memory (3), perform addition/subtraction on it, and output the operation result to the data control unit (5);
The fixed-point multiplier (4-2) is configured to receive the vector data sent by the vector memory (3), perform multiplication on it, and output the operation result to the data control unit (5);
ALU 1 comprises a fixed-point adder/subtractor (4-1), a fixed-point multiplier (4-2), a fixed-point exponential function unit (4-3), a fixed-point divider (4-4), a fixed-point square-root unit (4-5) and a dot-product adder-tree unit (4-6);
The fixed-point exponential function unit (4-3) is configured to receive the vector data sent by the vector memory (3), perform the exponential operation on it, and output the operation result to the data control unit (5);
The fixed-point divider (4-4) is configured to receive the vector data sent by the vector memory (3), perform division on it, and output the operation result to the data control unit (5);
The fixed-point square-root unit (4-5) is configured to receive the vector data sent by the vector memory (3), perform the square-root operation on it, and output the operation result to the data control unit (5);
The dot-product adder-tree unit (4-6) is configured to sum the N element-product results produced during a dot-product operation.
5. The fixed-point vector processor according to claim 1, characterized in that ALU 1 further comprises a trigonometric-function Tan unit, a trigonometric-function Atan unit and a logarithmic-function Log unit.
6. A vector data access control method for the fixed-point vector processor according to claim 1, 2, 3, 4 or 5, characterized in that the method comprises the following steps:
a step of receiving the counting instruction sent by the data control unit (5) and the input destination-address microinstruction sent by the microcode memory (2), and outputting a count value to the microcode memory (2);
a step of receiving and storing the count value sent by the program counter (1), and outputting the path-index-L microinstruction to the data control unit (5), the OP microinstruction to the ALU (4) and the data control unit (5), the input-vector-address microinstruction to the vector memory (3), and the input destination-address microinstruction to the program counter (1) and the vector memory (3);
a step of receiving and storing the input-vector-address and destination-address microinstructions sent by the microcode memory (2) and the output-vector-data command and enable command sent by the data control unit (5), and outputting vector data to the ALU (4);
a step of performing a vector operation on the received vector data according to the OP microinstruction sent by the microcode memory (2) and the ALU control instruction sent by the data control unit (5), obtaining a result vector, and outputting that vector to the data control unit (5);
a step of generating the enable command and the output-vector-data command according to the path-index-L and OP microinstructions sent by the microcode memory (2) and outputting them to the vector memory (3); generating the ALU control instruction and outputting it to the ALU (4) so that it performs exponential function, division and square-root operations; and generating the counting instruction and outputting it to the program counter (1) to make it count.
7. The vector data access control method of the fixed-point vector processor according to claim 6, characterized in that, in the step of obtaining a result vector from the vector operation and transferring it to the data control unit (5), the vector operation process comprises:
a step of receiving the vector data sent by the vector memory (3), performing addition/subtraction on it, and outputting the operation result to the data control unit (5);
a step of receiving the vector data sent by the vector memory (3), performing multiplication on it, and outputting the operation result to the data control unit (5);
a step of receiving the vector data sent by the vector memory (3), performing the exponential operation on it, and outputting the operation result to the data control unit (5);
a step of receiving the vector data sent by the vector memory (3), performing division on it, and outputting the operation result to the data control unit (5);
a step of receiving the vector data sent by the vector memory (3), performing the square-root operation on it, and outputting the operation result to the data control unit (5);
a step of summing the N element-product results produced during a dot-product operation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510144307.3A CN104699458A (en) | 2015-03-30 | 2015-03-30 | Fixed point vector processor and vector data access controlling method thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510144307.3A CN104699458A (en) | 2015-03-30 | 2015-03-30 | Fixed point vector processor and vector data access controlling method thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104699458A true CN104699458A (en) | 2015-06-10 |
Family
ID=53346632
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510144307.3A Pending CN104699458A (en) | 2015-03-30 | 2015-03-30 | Fixed point vector processor and vector data access controlling method thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104699458A (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017124648A1 (en) * | 2016-01-20 | 2017-07-27 | 北京中科寒武纪科技有限公司 | Vector computing device |
WO2017185385A1 (en) * | 2016-04-26 | 2017-11-02 | 北京中科寒武纪科技有限公司 | Apparatus and method for executing vector merging operation |
WO2017185384A1 (en) * | 2016-04-26 | 2017-11-02 | 北京中科寒武纪科技有限公司 | Apparatus and method for executing vector circular shift operation |
WO2017185395A1 (en) * | 2016-04-26 | 2017-11-02 | 北京中科寒武纪科技有限公司 | Apparatus and method for executing vector comparison operation |
CN107315716A (en) * | 2016-04-26 | 2017-11-03 | 北京中科寒武纪科技有限公司 | A kind of apparatus and method for performing Outer Product of Vectors computing |
CN107873091A (en) * | 2015-07-20 | 2018-04-03 | 高通股份有限公司 | SIMD sliding window computings |
CN108153514A (en) * | 2017-12-19 | 2018-06-12 | 北京云知声信息技术有限公司 | A kind of floating point vector accelerating method and device |
CN108415728A (en) * | 2018-03-01 | 2018-08-17 | 中国科学院计算技术研究所 | A kind of extension floating-point operation instruction executing method and device for processor |
CN108733408A (en) * | 2017-04-21 | 2018-11-02 | 上海寒武纪信息科技有限公司 | Counting device and method of counting |
CN109388427A (en) * | 2017-08-11 | 2019-02-26 | 龙芯中科技术有限公司 | Vector processing method, vector processing unit and microprocessor |
US10762164B2 (en) | 2016-01-20 | 2020-09-01 | Cambricon Technologies Corporation Limited | Vector and matrix computing device |
CN111651205A (en) * | 2016-04-26 | 2020-09-11 | 中科寒武纪科技股份有限公司 | Device and method for executing vector inner product operation |
CN112470139A (en) * | 2018-01-08 | 2021-03-09 | 阿特拉佐有限公司 | Compact arithmetic accelerator for data processing apparatus, system and method |
WO2021078212A1 (en) * | 2019-10-25 | 2021-04-29 | 安徽寒武纪信息科技有限公司 | Computing apparatus and method for vector inner product, and integrated circuit chip |
US11507350B2 (en) | 2017-04-21 | 2022-11-22 | Cambricon (Xi'an) Semiconductor Co., Ltd. | Processing apparatus and processing method |
US11531540B2 (en) | 2017-04-19 | 2022-12-20 | Cambricon (Xi'an) Semiconductor Co., Ltd. | Processing apparatus and processing method with dynamically configurable operation bit width |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6324638B1 (en) * | 1999-03-31 | 2001-11-27 | International Business Machines Corporation | Processor having vector processing capability and method for executing a vector instruction in a processor |
CN102411773A (en) * | 2011-07-28 | 2012-04-11 | 中国人民解放军国防科学技术大学 | Vector-processor-oriented mean-residual normalized product correlation vectoring method |
CN102750133A (en) * | 2012-06-20 | 2012-10-24 | 中国电子科技集团公司第五十八研究所 | 32-Bit triple-emission digital signal processor supporting SIMD |
2015
- 2015-03-30 CN CN201510144307.3A patent/CN104699458A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6324638B1 (en) * | 1999-03-31 | 2001-11-27 | International Business Machines Corporation | Processor having vector processing capability and method for executing a vector instruction in a processor |
CN102411773A (en) * | 2011-07-28 | 2012-04-11 | 中国人民解放军国防科学技术大学 | Vector-processor-oriented mean-residual normalized product correlation vectoring method |
CN102750133A (en) * | 2012-06-20 | 2012-10-24 | 中国电子科技集团公司第五十八研究所 | 32-Bit triple-emission digital signal processor supporting SIMD |
Non-Patent Citations (1)
Title |
---|
YEYONG PANG等: "A low latency kernel recursive least squares processor using FPGA technology", 《2013 INTERNATIONAL CONFERENCE ON FIELD-PROGRAMMABLE TECHNOLOGY(FPT)》 * |
Cited By (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107873091B (en) * | 2015-07-20 | 2021-05-28 | 高通股份有限公司 | Method and apparatus for sliding window arithmetic |
CN107873091A (en) * | 2015-07-20 | 2018-04-03 | 高通股份有限公司 | SIMD sliding window computings |
US10762164B2 (en) | 2016-01-20 | 2020-09-01 | Cambricon Technologies Corporation Limited | Vector and matrix computing device |
KR102304216B1 (en) | 2016-01-20 | 2021-09-23 | 캠브리콘 테크놀로지스 코퍼레이션 리미티드 | Vector computing device |
US11734383B2 (en) | 2016-01-20 | 2023-08-22 | Cambricon Technologies Corporation Limited | Vector and matrix computing device |
KR102185287B1 (en) | 2016-01-20 | 2020-12-01 | 캠브리콘 테크놀로지스 코퍼레이션 리미티드 | Vector computing device |
CN106990940B (en) * | 2016-01-20 | 2020-05-22 | 中科寒武纪科技股份有限公司 | Vector calculation device and calculation method |
CN106990940A (en) * | 2016-01-20 | 2017-07-28 | 南京艾溪信息科技有限公司 | A kind of vector calculation device |
CN111580863B (en) * | 2016-01-20 | 2024-05-03 | 中科寒武纪科技股份有限公司 | Vector operation device and operation method |
CN111580865B (en) * | 2016-01-20 | 2024-02-27 | 中科寒武纪科技股份有限公司 | Vector operation device and operation method |
WO2017124648A1 (en) * | 2016-01-20 | 2017-07-27 | 北京中科寒武纪科技有限公司 | Vector computing device |
CN111580865A (en) * | 2016-01-20 | 2020-08-25 | 中科寒武纪科技股份有限公司 | Vector operation device and operation method |
CN111580863A (en) * | 2016-01-20 | 2020-08-25 | 中科寒武纪科技股份有限公司 | Vector operation device and operation method |
KR20190073593A (en) * | 2016-01-20 | 2019-06-26 | 캠브리콘 테크놀로지스 코퍼레이션 리미티드 | Vector computing device |
KR20200058562A (en) * | 2016-01-20 | 2020-05-27 | 캠브리콘 테크놀로지스 코퍼레이션 리미티드 | Vector computing device |
EP3451156A4 (en) * | 2016-04-26 | 2020-03-25 | Cambricon Technologies Corporation Limited | Apparatus and method for executing vector circular shift operation |
US10853069B2 (en) | 2016-04-26 | 2020-12-01 | Cambricon Technologies Corporation Limited | Apparatus and methods for comparing vectors |
CN107315716B (en) * | 2016-04-26 | 2020-08-07 | 中科寒武纪科技股份有限公司 | Device and method for executing vector outer product operation |
WO2017185385A1 (en) * | 2016-04-26 | 2017-11-02 | 北京中科寒武纪科技有限公司 | Apparatus and method for executing vector merging operation |
CN111651205B (en) * | 2016-04-26 | 2023-11-17 | 中科寒武纪科技股份有限公司 | Apparatus and method for performing vector inner product operation |
CN107315563A (en) * | 2016-04-26 | 2017-11-03 | 北京中科寒武纪科技有限公司 | A kind of apparatus and method for performing vectorial comparison operation |
CN111651205A (en) * | 2016-04-26 | 2020-09-11 | 中科寒武纪科技股份有限公司 | Device and method for executing vector inner product operation |
WO2017185384A1 (en) * | 2016-04-26 | 2017-11-02 | 北京中科寒武纪科技有限公司 | Apparatus and method for executing vector circular shift operation |
CN107315716A (en) * | 2016-04-26 | 2017-11-03 | 北京中科寒武纪科技有限公司 | A kind of apparatus and method for performing Outer Product of Vectors computing |
WO2017185395A1 (en) * | 2016-04-26 | 2017-11-02 | 北京中科寒武纪科技有限公司 | Apparatus and method for executing vector comparison operation |
US11720353B2 (en) | 2017-04-19 | 2023-08-08 | Shanghai Cambricon Information Technology Co., Ltd | Processing apparatus and processing method |
US11698786B2 (en) | 2017-04-19 | 2023-07-11 | Shanghai Cambricon Information Technology Co., Ltd | Processing apparatus and processing method |
US11531541B2 (en) | 2017-04-19 | 2022-12-20 | Shanghai Cambricon Information Technology Co., Ltd | Processing apparatus and processing method |
US11531540B2 (en) | 2017-04-19 | 2022-12-20 | Cambricon (Xi'an) Semiconductor Co., Ltd. | Processing apparatus and processing method with dynamically configurable operation bit width |
US11734002B2 (en) | 2017-04-19 | 2023-08-22 | Shanghai Cambricon Information Technology Co., Ltd | Counting elements in neural network input data |
CN108733408A (en) * | 2017-04-21 | 2018-11-02 | 上海寒武纪信息科技有限公司 | Counting device and method of counting |
US11507350B2 (en) | 2017-04-21 | 2022-11-22 | Cambricon (Xi'an) Semiconductor Co., Ltd. | Processing apparatus and processing method |
CN109324826B (en) * | 2017-04-21 | 2021-03-26 | 上海寒武纪信息科技有限公司 | Counting device and counting method |
CN109324826A (en) * | 2017-04-21 | 2019-02-12 | 上海寒武纪信息科技有限公司 | Counting device and method of counting |
CN109388427A (en) * | 2017-08-11 | 2019-02-26 | 龙芯中科技术有限公司 | Vector processing method, vector processing unit and microprocessor |
CN108153514A (en) * | 2017-12-19 | 2018-06-12 | 北京云知声信息技术有限公司 | A kind of floating point vector accelerating method and device |
CN112470139B (en) * | 2018-01-08 | 2022-04-08 | 阿特拉佐有限公司 | Compact arithmetic accelerator for data processing apparatus, system and method |
CN112470139A (en) * | 2018-01-08 | 2021-03-09 | 阿特拉佐有限公司 | Compact arithmetic accelerator for data processing apparatus, system and method |
CN108415728B (en) * | 2018-03-01 | 2020-12-29 | 中国科学院计算技术研究所 | Extended floating point operation instruction execution method and device for processor |
CN108415728A (en) * | 2018-03-01 | 2018-08-17 | 中国科学院计算技术研究所 | A kind of extension floating-point operation instruction executing method and device for processor |
WO2021078212A1 (en) * | 2019-10-25 | 2021-04-29 | 安徽寒武纪信息科技有限公司 | Computing apparatus and method for vector inner product, and integrated circuit chip |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104699458A (en) | Fixed point vector processor and vector data access controlling method thereof | |
Singh et al. | NERO: A near high-bandwidth memory stencil accelerator for weather prediction modeling | |
CN106940815B (en) | Programmable convolutional neural network coprocessor IP core | |
CN106775599B (en) | The more computing unit coarseness reconfigurable systems and method of recurrent neural network | |
CN108197705A (en) | Convolutional neural networks hardware accelerator and convolutional calculation method and storage medium | |
CN105468335A (en) | Pipeline-level operation device, data processing method and network-on-chip chip | |
CN112308222B (en) | RRAM (remote radio access m) -based memory and calculation integrated full-system simulator and design method thereof | |
CN105468568B (en) | Efficient coarseness restructurable computing system | |
CN111105023B (en) | Data stream reconstruction method and reconfigurable data stream processor | |
CN102184092A (en) | Special instruction set processor based on pipeline structure | |
CN103984560A (en) | Embedded reconfigurable system based on large-scale coarseness and processing method thereof | |
CN108647779A (en) | A kind of low-bit width convolutional neural networks Reconfigurable Computation unit | |
CN112446471B (en) | Convolution acceleration method based on heterogeneous many-core processor | |
CN116710912A (en) | Matrix multiplier and control method thereof | |
CN110018848A (en) | A kind of mixing based on RISC-V is mixed to calculate system and method | |
CN102722472A (en) | Complex matrix optimizing method | |
Zafar et al. | Hardware architecture design and mapping of ‘Fast Inverse Square Root’algorithm | |
CN111178492B (en) | Computing device, related product and computing method for executing artificial neural network model | |
CN116822600A (en) | Neural network search chip based on RISC-V architecture | |
CN112051981A (en) | Data pipeline computing path structure and single-thread data pipeline system | |
CN103761213A (en) | On-chip array system based on circulating pipeline computation | |
Abdelhamid et al. | MITRACA: A next-gen heterogeneous architecture | |
Diamantopoulos et al. | A system-level transprecision FPGA accelerator for BLSTM using on-chip memory reshaping | |
Vishnu et al. | 32-Bit RISC Processor Using VedicMultiplier | |
Daisaka et al. | GRAPE-mp: An simd accelerator board for multi-precision arithmetic |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20150610 |