WO2018024094A1 - Operation device and operating method thereof - Google Patents

Operation device and operating method thereof

Info

Publication number
WO2018024094A1
WO2018024094A1 · PCT/CN2017/093161 · CN2017093161W
Authority
WO
WIPO (PCT)
Prior art keywords
data
instruction
module
unit
scale
Prior art date
Application number
PCT/CN2017/093161
Other languages
English (en)
French (fr)
Inventor
陈云霁
刘少礼
陈天石
Original Assignee
上海寒武纪信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海寒武纪信息科技有限公司 filed Critical 上海寒武纪信息科技有限公司
Priority to KR1020187034254A priority Critical patent/KR102467544B1/ko
Priority to EP17836276.0A priority patent/EP3495947B1/en
Publication of WO2018024094A1 publication Critical patent/WO2018024094A1/zh
Priority to US16/268,479 priority patent/US20190235871A1/en

Classifications

    • G: PHYSICS > G06: COMPUTING; CALCULATING OR COUNTING > G06F: ELECTRIC DIGITAL DATA PROCESSING > G06F 9/00: Arrangements for program control, e.g. control units > G06F 9/06: using stored programs > G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/345: Addressing or accessing the instruction operand or the result; Formation of operand address; Addressing modes (of multiple operands or results)
    • G06F 9/30112: Register structure comprising data of variable length
    • G06F 9/30007: Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F 9/30036: Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F 9/3001: Arithmetic instructions
    • G06F 9/30047: Prefetch instructions; cache control instructions
    • G06F 9/30065: Loop control instructions; iterative instructions, e.g. LOOP, REPEAT
    • G06F 9/30145: Instruction analysis, e.g. decoding, instruction word fields
    • G06F 9/3016: Decoding the operand specifier, e.g. specifier format
    • G06F 9/30192: Instruction operation extension or modification according to data descriptor, e.g. dynamic data typing
    • G06F 9/38: Concurrent instruction execution, e.g. pipeline or look ahead > G06F 9/3824: Operand accessing
    • G06F 9/3836: Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution > G06F 9/3838: Dependency mechanisms, e.g. register scoreboarding
    • G06N 3/00: Computing arrangements based on biological models > G06N 3/02: Neural networks > G06N 3/04: Architecture, e.g. interconnection topology
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present invention belongs to the field of computers, and in particular relates to an operation device and an operating method thereof for efficiently and flexibly performing operations on data of the same or different scales according to instructions. It addresses the fact that more and more algorithms involve large numbers of operations on data of the same or different scales, reduces the limitation imposed by the scale of the operation unit, and improves the flexibility and effectiveness of vector operations.
  • a known scheme for performing vector operations is to use a general-purpose processor (CPU) or a graphics processing unit (GPU).
  • however, such architectures are better suited to scalar operations and are therefore less efficient at vector operations; alternatively, their on-chip caches are too small to meet the requirements of efficiently performing large-scale vector operations.
  • in another prior art, vector calculations are performed using specially tailored vector computing devices, i.e., custom storage units and processing units dedicated to vector operations.
  • however, the existing dedicated vector operation devices are limited by their register files, can only support vector operations of the same length, and lack flexibility.
  • in addition, the instruction sets of the above devices can only operate on data of the same length and are limited by the size of the memory and the scale of the operation unit. For data of different lengths, or data that does not match the scale of the operation unit, one approach is to use multiple instructions that access the data in sequence, and another is to use loop instructions for repeated calls. This not only makes the structure of the instruction set complex, the instruction queue lengthy, and the execution efficiency low, but also imposes many runtime restrictions and poor flexibility, so large-scale vector operations are not well supported.
  • the present invention solves this problem of operating on data of the same or different scales and reduces the required scale of the operation unit.
  • the invention provides an arithmetic device, which comprises an instruction module, a data module and an operation module, wherein:
  • the instruction module is configured to cache instructions and provide instructions to the data module and the operation module;
  • the data module is configured to provide operation data to the operation module according to the instruction in the instruction module;
  • the arithmetic module is used for performing operations according to the instructions in the instruction module and the operation data provided by the data module.
  • the instruction module includes an instruction cache unit, an instruction processing unit, a dependency processing unit, and a storage queue unit, wherein:
  • the instruction cache unit is configured to store an instruction to be executed
  • the instruction processing unit is configured to acquire an instruction from the instruction cache unit, and process the instruction
  • the dependency processing unit is configured to determine whether the instruction accesses the same data as a previous instruction that is still executing:
  • if so, the dependency processing unit stores the instruction in the storage queue unit and provides it to the operation module only after the previous instruction has finished executing; otherwise, the instruction is provided to the operation module directly.
  • the instruction processing unit includes an instruction fetch part, a decoding part, and an instruction queue part:
  • the instruction fetch part obtains instructions from the instruction cache unit, the decoding part decodes the obtained instructions, and the instruction queue part sequentially stores the decoded instructions.
  • the data module includes a data I/O unit and a data temporary storage unit, wherein the data I/O unit is configured to read the operation data directly from memory, and the data temporary storage unit is configured to store the operation data, adjust it, and then provide it to the operation module.
  • the data temporary storage unit adjusting the operation data before providing it to the operation module includes the following cases:
  • when the lengths of both operation data involved in the operation are less than or equal to the operation scale of the operation module, the data temporary storage unit provides the two operation data to the operation module directly;
  • when the lengths of both operation data are greater than the operation scale, each operation datum is split into several sub-operation data whose lengths are less than or equal to the operation scale, and the sub-operation data are provided to the operation module over multiple passes;
  • when one operation datum is longer than the operation scale and the other is not, the operation datum whose length is greater than the operation scale is split into several sub-operation data whose lengths are less than or equal to the operation scale, and these sub-operation data, together with the operation datum whose length is less than or equal to the operation scale, are provided to the operation module over multiple passes.
  • the operation data are vectors
  • the operation module is used to perform vector logic operations or vector arithmetic operations (addition, subtraction, multiplication, and division).
  • the present invention also provides an operation method of an arithmetic device, the method comprising:
  • instructions are cached in the instruction module
  • the instruction in the instruction module is provided to the data module, and the data module provides the operation data to the operation module according to the instruction;
  • the instruction in the instruction module is provided to the operation module, and the operation module performs the operation according to the operation data provided by the instruction and the data module.
  • the instruction module includes an instruction cache unit, an instruction processing unit, a dependency processing unit, and a storage queue unit, and the step S1 includes:
  • the instruction processing unit acquires an instruction from the instruction cache unit, and processes the instruction.
  • the dependency processing unit determines whether the instruction accesses the same data as a previous instruction that is still executing; if so, the dependency processing unit stores the instruction in the storage queue unit and provides it to the operation module only after the previous instruction has finished executing; otherwise, the instruction is provided to the operation module directly.
  • the instruction processing unit includes an instruction fetch part, a decoding part, and an instruction queue part, wherein step S12 includes:
  • the fetching part acquires an instruction from the instruction cache unit.
  • the decoding part decodes the obtained instruction.
  • the instruction queue part sequentially stores the decoded instructions.
  • the data module includes a data I/O unit and a data temporary storage unit, wherein step S2 includes:
  • the data I/O unit directly reads the operation data from the memory, and stores the data in the data temporary storage unit;
  • the data temporary storage unit adjusts the stored operation data and provides the adjusted data to the operation module.
  • step S22 includes:
  • when the lengths of both operation data involved in the operation are less than or equal to the operation scale of the operation module, the data temporary storage unit provides the two operation data to the operation module directly;
  • when the lengths of both operation data are greater than the operation scale, each operation datum is split into several sub-operation data whose lengths are less than or equal to the operation scale, and the sub-operation data are provided to the operation module over multiple passes;
  • when one operation datum is longer than the operation scale and the other is not, the operation datum whose length is greater than the operation scale is split into several sub-operation data whose lengths are less than or equal to the operation scale, and these sub-operation data, together with the operation datum whose length is less than or equal to the operation scale, are provided to the operation module over multiple passes.
  • the operation data are vectors
  • the operation module is used to perform vector logic operations or vector arithmetic operations.
  • with the operation device and operating method provided by the invention, only one instruction needs to be issued: the operation data is read from memory and temporarily stored in the data temporary storage unit, and the data temporary storage unit adjusts the operation data according to its length before supplying it to the operation module. Operations on data of different lengths are thereby supported, and the required scale of the operation unit is reduced.
  • the present invention employs a dependency processing unit to resolve data-dependency problems in storage, thereby improving execution performance for workloads that contain a large number of computing tasks.
  • the instructions used in the present invention have a compact format, which makes the instruction set simple in structure, convenient to use, and supports flexible data length and operation scale.
  • the invention can be applied to the following scenarios (including but not limited to): data processing, robots, computers, printers, scanners, telephones, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, cloud servers, cameras, camcorders, projectors, watches, earphones, mobile storage, wearable devices and other electronic products; aircraft, ships, vehicles and other types of transportation; televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, range hoods and other household appliances; as well as medical equipment including nuclear magnetic resonance instruments, B-ultrasound machines and electrocardiographs.
  • FIG. 1 is a schematic structural view of an arithmetic device provided by the present invention.
  • FIG. 2 is a schematic structural view of an instruction module in the present invention.
  • FIG. 3 is a schematic structural view of a data module in the present invention.
  • FIG. 4 is a schematic structural view of an arithmetic module in the present invention.
  • Figure 5 is a flow chart of a method of supporting instructions for computing data of different lengths in the present invention.
  • FIG. 6 is a schematic diagram showing the operational relationship of cyclically reading a shorter vector for performing operations on different length operation vectors according to an embodiment of the present invention.
  • FIG. 1 is a schematic structural view of the operation device provided by the present invention. As shown in FIG. 1, the device includes an instruction module 10, a data module 20, and an operation module 30.
  • the instruction module 10 is used to cache instructions and provide instructions to the data module 20 and the arithmetic module 30.
  • the instructions in the instruction module 10 control the direction of the data flow in the data module 20, and the data in the data module 20 affects how dependencies are handled in the instruction module 10; at the same time, the instructions in the instruction module 10 control the specific operations of the operation module 30, and whether the operation module 30 has finished its operation controls whether the instruction module 10 reads a new instruction; the data module 20 provides the concrete operation data for the operation module 30, and the operation module 30 sends the operation result back to the data module 20 for storage.
  • the instruction module 10 includes an instruction cache unit 11, an instruction processing unit 12, a dependency processing unit 13, and a storage queue unit 14.
  • the instruction processing unit 12 is further divided into three parts: an instruction fetch part 121, a decoding part 122, and an instruction queue part 123.
  • the instruction cache unit 11 is configured to cache an instruction while it is being executed. When an instruction finishes executing, if it is also the earliest of the uncommitted instructions in the instruction cache unit 11, it is committed; once committed, the changes that the instruction's operation makes to the device state can no longer be undone.
  • the fetching portion 121 is configured to fetch the next instruction to be executed from the instruction buffer unit 11 and transfer the instruction to the decoding portion 122.
  • the decoding portion 122 is configured to decode the instruction and transmit the decoded instruction.
  • the instruction queue 123 is used to sequentially store the decoded instructions.
  • the dependency processing unit 13 is configured to handle data dependencies that may exist between the current instruction and the previous instruction. For example, when accessing data in the data module 20, consecutive instructions may access data in the same block of storage; if an instruction operates on that data before the previous instruction has finished executing, the consistency of the data is broken, which compromises the correctness of the operation result.
  • therefore, if the dependency processing unit 13 detects that the current instruction depends on data used by an earlier instruction, the instruction must wait in the storage queue unit 14 until the dependency is eliminated. The storage queue unit 14 is an ordered queue in which instructions that have data dependencies on earlier instructions are stored until those dependencies are eliminated.
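By way of illustration only, the sketch below models how such a dependency check and ordered storage queue could behave, assuming each instruction's operands are described by (start address, length) ranges in the scratchpad. The class name, the dictionary-based instruction representation, and the range-overlap test are assumptions made for this sketch, not the patented circuit.

```python
from collections import deque

def ranges_overlap(a_start, a_len, b_start, b_len):
    """Two address ranges conflict if they share at least one address."""
    return a_start < b_start + b_len and b_start < a_start + a_len

class DependencyUnit:
    """Illustrative model of the dependency processing unit 13 and storage queue unit 14."""
    def __init__(self):
        self.in_flight = []         # operand ranges of instructions still executing
        self.store_queue = deque()  # ordered queue of instructions waiting on dependencies

    def _conflicts(self, instr):
        return any(ranges_overlap(s, l, ps, pl)
                   for (s, l) in instr["ranges"]
                   for (ps, pl) in self.in_flight)

    def issue(self, instr):
        """Forward the instruction to the operation module, or park it in the queue."""
        if self.store_queue or self._conflicts(instr):
            self.store_queue.append(instr)   # wait until the dependency is eliminated
            return None
        self.in_flight.extend(instr["ranges"])
        return instr                         # would be sent on to the operation module

    def retire(self, instr):
        """Called when an instruction finishes: release its ranges, then drain the queue in order."""
        for r in instr["ranges"]:
            self.in_flight.remove(r)
        while self.store_queue and not self._conflicts(self.store_queue[0]):
            ready = self.store_queue.popleft()
            self.in_flight.extend(ready["ranges"])
            # 'ready' would now be forwarded to the operation module
```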
  • FIG. 3 is a schematic structural view of a data module in the present invention.
  • the data module 20 is composed of two parts, a data I/O unit 21 and a data temporary storage unit 22.
  • the data I/O unit 21 is used to interact with the memory, that is, it can read data directly from the memory or write data directly into the memory.
  • the data temporary storage unit 22 is composed of a scratch pad memory (Scratchpad Memory), which can be implemented by various memory devices (SRAM, eDRAM, DRAM, memristor, 3D-DRAM or nonvolatile memory, etc.).
  • the data temporary storage unit 22 is capable of storing operational data of different sizes, such as vector data of various sizes.
  • the data I/O unit 21 reads out the necessary operation data according to the instruction and temporarily stores it in the data temporary storage unit 22. Because a scratchpad memory is used, operation data of different lengths can be stored; moreover, during the operation process, the data temporary storage unit 22 can adjust the operation data according to the scale of the operation unit 30 and the length of the operation data before supplying it to the operation module 30.
  • specifically, when the lengths of both operation data involved in the operation are less than or equal to the operation scale of the operation module, the data temporary storage unit 22 supplies the two operation data directly to the operation module 30.
  • for example, suppose the operation scale of the operation unit 30 is an operation on two sets of vectors at a time, each set containing four elements, such as the operation between (A1, A2, A3, A4) and (B1, B2, B3, B4). If both operation data are vectors of fewer than four elements, such as (A1, A2, A3) and (B1, B2), then (A1, A2, A3) and (B1, B2) can be provided directly to the operation module 30 for the operation.
  • when the lengths of both operation data are greater than the operation scale of the operation module, the data temporary storage unit 22 splits each operation datum into several sub-operation data whose lengths are less than or equal to the operation scale and supplies these sub-operation data to the operation module over multiple passes.
  • for example, with the same operation scale of two four-element vectors per pass, if both operation data are larger than the operation scale, such as (A1, A2, A3, A4, A5) and (B1, B2, B3, B4, B5), then (A1, A2, A3, A4, A5) can be split into D1 (A1, A2, A3, A4) and D2 (A5), and (B1, B2, B3, B4, B5) into d1 (B1, B2, B3, B4) and d2 (B5); the data are then supplied to the operation unit 30 in two passes, D1 (A1, A2, A3, A4) with d1 (B1, B2, B3, B4) in the first pass and D2 (A5) with d2 (B5) in the second pass.
  • in the above example, each operation datum larger than the operation scale is split into two segments, and the sub-operation data of the corresponding segments are provided in each pass. When the two operation data are split into different numbers of segments, the one with fewer segments is provided cyclically.
  • for example, if the first operation datum is split into three segments D1, D2, and D3 and the second into two segments d1 and d2, the first operation datum D1, D2, D3 is provided to the operation unit in three passes, and over these three passes the second operation datum is provided cyclically: D1 and d1 in the first pass, D2 and d2 in the second pass, and D3 and d1 in the third pass.
  • as another example, if the first operation datum is split into five segments D1, D2, D3, D4, and D5 and the second into three segments d1, d2, and d3, the operation data are provided to the operation unit in five passes: D1 and d1 in the first pass, D2 and d2 in the second pass, D3 and d3 in the third pass, D4 and d1 in the fourth pass, and D5 and d2 in the fifth pass.
  • when, of the two operation data involved in the operation, one is longer than the operation scale and the other is not, the operation datum whose length is greater than the operation scale is split into several sub-operation data whose lengths are less than or equal to the operation scale, and these sub-operation data, together with the operation datum whose length is less than or equal to the operation scale, are supplied to the operation module over multiple passes.
  • briefly, if the first operation datum is longer than the operation scale and is split into three segments D1, D2, and D3, while the second operation datum is within the operation scale, needs no splitting, and is denoted d, then the first and second operation data are provided to the operation unit in three passes: D1 and d in the first pass, D2 and d in the second pass, and D3 and d in the third pass.
  • in general, the adjustment of the operation data by the data temporary storage unit 22 means the following: when the length of the operation data is not greater than the operation scale of the operation unit, the data to be operated on can be sent from this memory directly into the operation unit 30; otherwise, each pass sends data matching the operation scale of the operation unit 30 into it, and after that pass completes, or after that batch of data has entered the next pipeline stage, the memory sends a new batch of data matching the operation scale into the operation unit 30. In addition, when the two data to be operated on have the same length, they are sent into the operation unit 30 directly, or after splitting; otherwise, the segments of the longer datum are read sequentially while the segments of the shorter datum are read cyclically, until the operation ends.
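A minimal sketch of this adjustment logic follows, assuming the operands are plain Python lists and the operation scale is the number of elements the operation unit accepts per pass. The function names are assumptions made for illustration; when both operands already fit within the scale, the schedule degenerates to a single pass in which the two operands are provided directly.

```python
def split_to_scale(data, scale):
    """Split an operand into segments whose lengths do not exceed the operation scale."""
    return [data[i:i + scale] for i in range(0, len(data), scale)]

def schedule_passes(a, b, scale):
    """Pair the segments of two operands for the operation unit: the operand with more
    segments is read sequentially while the other is re-read cyclically."""
    seg_a, seg_b = split_to_scale(a, scale), split_to_scale(b, scale)
    passes = max(len(seg_a), len(seg_b))
    return [(seg_a[i % len(seg_a)], seg_b[i % len(seg_b)]) for i in range(passes)]

# Example matching the description: a 3-segment operand against a 2-segment operand
A = list(range(12))         # splits into D1, D2, D3 at scale 4
B = list(range(100, 108))   # splits into d1, d2 at scale 4
for lhs, rhs in schedule_passes(A, B, scale=4):
    print(lhs, rhs)         # passes: (D1, d1), (D2, d2), (D3, d1)
```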
  • the operation module is composed of several different kinds of operation components, such as vector addition components, vector subtraction components, vector AND components, vector dot product components, and so on, with several instances of each kind. With these operation components, the operation module can support a variety of vector operations.
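The composition of the operation module can be pictured as a dispatch from the opcode to the matching operation component. The sketch below is illustrative only: VA and VAV are the opcodes used in this document, while the other opcodes and the function names are assumptions.

```python
# Illustrative mapping from opcode to the element-wise operation performed by the
# corresponding operation component. Only VA and VAV appear in this document; the
# other opcodes and all names here are assumptions made for illustration.
COMPONENTS = {
    "VA":   lambda a, b: [x + y for x, y in zip(a, b)],      # vector addition component
    "VS":   lambda a, b: [x - y for x, y in zip(a, b)],      # vector subtraction component (assumed opcode)
    "VAV":  lambda a, b: [x & y for x, y in zip(a, b)],      # vector AND component
    "VDOT": lambda a, b: sum(x * y for x, y in zip(a, b)),   # vector dot product component (assumed opcode)
}

def execute(opcode, a, b):
    """Dispatch the two operands to the component selected by the opcode."""
    return COMPONENTS[opcode](a, b)

print(execute("VAV", [0b1100, 0b1010], [0b1010, 0b0110]))   # [8, 2]
print(execute("VA",  [1, 2, 3], [10, 20, 30]))              # [11, 22, 33]
```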
  • Figure 5 is a flow chart of the method in the present invention for supporting instructions that operate on data of different lengths. The process of executing such an instruction includes:
  • the fetching portion 121 in the instruction processing unit 12 fetches a vector operation instruction from the instruction buffer unit 11, and sends the instruction to the decoding portion 122 in the instruction processing unit.
  • the decoding part 122 decodes the instruction, and splits the instruction into an operation code and each different operation domain according to a custom instruction rule.
  • the custom instruction rule adopted here is that an instruction contains an opcode and at least one operation field; the opcode defines the type of the vector operation, and the operation fields hold the data values to be operated on, the addresses where the data are stored, the lengths of the data, the address where the operation result is to be stored, and so on, with the meaning of each operation field differing according to the opcode.
  • the arithmetic instruction is then sent to the instruction queue portion 123.
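As a concrete illustration of this rule, the sketch below splits the VAV instruction used in the first embodiment into its opcode and operation fields; the field names and the binary interpretation of the address and length fields are assumptions made for illustration.

```python
def decode(instruction: str):
    """Split an instruction into an opcode and its operation fields.
    For VAV/VA the five fields are: vin0 start address, vin0 length,
    vin1 start address, vin1 length, and the result address."""
    opcode, *fields = instruction.split()
    names = ["vin0_start", "vin0_len", "vin1_start", "vin1_len", "out_addr"]
    return opcode, {name: int(field, 2) for name, field in zip(names, fields)}

op, fields = decode("VAV 00001 01000 01001 01000 10001")
# op == "VAV"
# fields == {'vin0_start': 1, 'vin0_len': 8, 'vin1_start': 9, 'vin1_len': 8, 'out_addr': 17}
```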
  • the data to be calculated is acquired according to the operation code and the operation domain of the instruction, and sent to the dependency processing unit 13 for analyzing and judging the data dependency relationship.
  • in the dependency processing unit 13, it is analyzed whether the instruction has a data dependency on a previous instruction that has not yet finished executing. If there is no dependency, there is no need to wait; otherwise the instruction is stored in the storage queue unit until it no longer has any data dependency on unfinished previous instructions. The instruction is then sent to the operation unit 30.
  • when the instruction is sent to the operation unit 30 for execution, the data temporary storage unit 22 in the data module 20 adjusts the data according to the length of the data and the scale of the operation unit 30: when the vector length is not greater than the operation scale of the operation unit 30, the vector to be operated on is sent directly into the operation unit 30; otherwise, each pass sends data matching the operation scale of the operation unit 30 into it, and after the pass completes, a new batch of data matching the operation scale is sent in, until the operation ends.
  • when the two vectors to be operated on have the same length, they are sent directly into the operation unit for the operation; otherwise, the longer vector is read sequentially and the shorter vector is read cyclically until the operation ends. If the vectors to be operated on need to be adjusted both for the scale of the operation unit and for their lengths, it is sufficient to keep reading data matching the operation scale while preserving the order in which the longer vector is read sequentially and the shorter vector is read cyclically.
  • in the first embodiment, the inter-vector AND instruction consists of an opcode (VAV) and five operation fields: the start address and length of vector vin0, the start address and length of vector vin1, and the address where the operation result is stored.
  • it is assumed that each register address can store 16-bit data
  • the operation unit contains four AND operators, each of which can perform an AND operation on 16-bit data simultaneously.
  • the instruction VAV 00001 01000 01001 01000 10001 indicates that vector 0 and vector 1 perform the VAV operation, that is, the inter-vector AND operation.
  • the process of the inter-vector AND operation includes:
  • the instruction fetch portion 121 in the instruction processing unit 12 fetches a vector operation instruction, namely VAV 00001 01000 01001 01000 10001, from the instruction cache unit 11 and sends it to the decoding portion 122 in the instruction processing unit 12.
  • the decoding portion 122 decodes the instruction and obtains the opcode VAV, which indicates the inter-vector AND operation, and five operation fields representing the start address and length of the vector vin0 to be operated on, the start address and length of the vector vin1, and the storage address of the operation result; the operation instruction is then sent to the instruction queue portion 123.
  • the data to be operated is acquired according to the operation code of the instruction and the operation domain.
  • the instruction opcode is VAV, that is, the inter-vector and logical operations are performed, and the data address and data length to be operated are obtained from the operation fields 1, 2, 3, and 4, that is, the start address of the vector vin0 is 00001, and the length of the vector vin0 is 01000.
  • the starting address of the vector vin1 is 01001 and the length of the vector vin1 is 01000. That is, the vector vin0 starts from the address 00001, and reads data of length 8 addresses, that is, data of address 00001 to 01000; the vector vin1 starts from address 01001, and also reads length of 8 addresses.
  • the dependency processing unit 13 analyzes and judges the data dependency.
  • the data I/O unit 21 in the data module 20 acquires data from the external memory in advance, and stores the acquired data in the data temporary storage unit 22.
  • the data temporary storage unit 22 locates the corresponding data according to the data addresses indicated by the instruction and supplies it to the operation unit 30; before doing so, the data temporary storage unit 22 adjusts the data according to the length of the data and the operation scale of the operation unit 30.
  • here, the operation unit 30 can only process the AND operation of four groups of 16-bit data at a time.
  • the data sent to the operation unit 30 in the first pass is therefore the data of the first four addresses indicated by vin0 and the first four addresses indicated by vin1, that is, the data at addresses 00001 to 00100 and 01001 to 01100 is operated on.
  • after that pass completes, the data of the last four addresses of vin0 and vin1 is loaded and operated on, that is, the AND operation is performed on the data at addresses 00101 to 01000 and 01101 to 10000.
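The two-pass execution described above can be reproduced with the following sketch, which assumes the scratchpad is modeled as a Python dictionary from address to 16-bit value and that the operation unit performs four AND operations per pass; the function name and the memory model are assumptions made for illustration.

```python
SCALE = 4  # the operation unit processes four 16-bit AND operations per pass

def run_vav(mem, vin0_start, vin1_start, length, out_addr):
    """Element-wise AND of two equal-length vectors, fed to the unit in batches of SCALE."""
    for base in range(0, length, SCALE):
        width = min(SCALE, length - base)
        batch0 = [mem[vin0_start + base + i] for i in range(width)]
        batch1 = [mem[vin1_start + base + i] for i in range(width)]
        for i, (x, y) in enumerate(zip(batch0, batch1)):
            mem[out_addr + base + i] = x & y   # result written back to the scratchpad

# vin0 at addresses 1..8 (00001..01000), vin1 at 9..16 (01001..10000), result from 17 (10001)
mem = {addr: addr * 3 for addr in range(1, 17)}
run_vav(mem, vin0_start=1, vin1_start=9, length=8, out_addr=17)
# pass 1 covers addresses 1..4 with 9..12; pass 2 covers 5..8 with 13..16
```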
  • This embodiment describes a specific process of performing vector addition using an arithmetic device.
  • the defined vector addition instruction consists of an opcode (VA) and five operation fields: the start address and length of vector vin0, the start address and length of vector vin1, and the address where the operation result is stored.
  • it is assumed that each register address can store 16-bit data
  • the operation unit contains four adders, each of which can perform addition of 16-bit data simultaneously.
  • the instruction VA 00001 01000 01001 00010 10001 indicates that vector 0 and vector 1 perform the VA operation, that is, vector addition.
  • the process by which the arithmetic device executes the vector addition instruction includes:
  • the instruction fetch portion 121 in the instruction processing unit 12 fetches a vector operation instruction, namely VA 00001 01000 01001 00010 10001, from the instruction cache unit 11 and sends it to the decoding portion 122 in the instruction processing unit.
  • the decoding portion 122 decodes the instruction and obtains the opcode VA, which indicates vector addition, and five operation fields representing the start address and length of the vector vin0 to be operated on, the start address and length of the vector vin1, and the storage address of the operation result; the operation instruction is then sent to the instruction queue portion 123.
  • the data to be operated on is acquired according to the opcode and the operation fields of the instruction.
  • the opcode of the instruction is VA, i.e., vector addition is to be performed, and the addresses and lengths of the data to be operated on are obtained from operation fields 1, 2, 3, and 4: the start address of vector vin0 is 00001, the length of vin0 is 01000, the start address of vector vin1 is 01001, and the length of vin1 is 00010.
  • that is, vector vin0 starts at address 00001 and data spanning 8 addresses is read, namely the data at addresses 00001 to 01000; vector vin1 starts at address 01001 and data spanning 2 addresses is read.
  • the dependency processing unit 13 analyzes and judges the data dependency.
  • the vector addition instruction is sent to the arithmetic unit 30.
  • the arithmetic unit 30 takes out the required vector from the data temporary storage unit 22 in accordance with the address and length of the desired data, and then performs the addition operation in the arithmetic unit.
  • because the operation unit 30 can only process the addition of four groups of 16-bit data at a time, it is not possible to send all of the data to the operation unit at once; the operation must be performed over several passes. Moreover, because vin0 and vin1 have different lengths and vin1 is the shorter one, the data of vin1 must be read cyclically during the operation, as shown in FIG. 6.
  • the data sent to the operation unit 30 in the first pass is the data of the first four addresses indicated by vin0 and the data of the two addresses indicated by vin1, that is, the data at addresses 00001 to 00100 and 01001 to 01010, with the following pairing of the data being operated on:
  • the data at address 00001 is added to the data at address 01001
  • the data at address 00010 is added to the data at address 01010
  • the data at address 00011 is added to the data at address 01001
  • the data at address 00100 is added to the data at address 01010.
  • after that pass completes, the data sent to the operation unit 30 in the second pass is the data of the last four addresses indicated by vin0 and the data of the two addresses indicated by vin1, that is, the data at addresses 00101 to 01000 and 01001 to 01010 undergoes the addition.
  • the pairing is: the data at address 00101 is added to the data at address 01001, the data at address 00110 is added to the data at address 01010, the data at address 00111 is added to the data at address 01001, and the data at address 01000 is added to the data at address 01010.
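A minimal sketch of the cyclic pairing used in this addition, again assuming a dictionary-based scratchpad: indexing the shorter operand modulo its length reproduces the pairing of 00001 with 01001, 00010 with 01010, 00011 with 01001, and so on, described above. The function name and memory model are assumptions made for illustration.

```python
def run_va(mem, vin0_start, vin0_len, vin1_start, vin1_len, out_addr):
    """Vector addition in which the shorter vector (vin1 here) is read cyclically."""
    for i in range(vin0_len):
        a = mem[vin0_start + i]
        b = mem[vin1_start + (i % vin1_len)]   # cyclic read of the shorter vector
        mem[out_addr + i] = a + b

# vin0 at addresses 1..8 (00001..01000), vin1 at 9..10 (01001..01010), result from 17 (10001)
mem = {addr: addr for addr in range(1, 11)}
run_va(mem, vin0_start=1, vin0_len=8, vin1_start=9, vin1_len=2, out_addr=17)
# address pairs: (1,9), (2,10), (3,9), (4,10), (5,9), (6,10), (7,9), (8,10)
```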

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Advance Control (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Apparatus For Radiation Diagnosis (AREA)
  • Complex Calculations (AREA)

Abstract

An operation device and an operating method thereof. The device comprises an instruction module (10), a data module (20), and an operation module (30). The instruction module (10) operates on instructions, including caching instructions, processing instructions, and judging dependency relationships; the data module (20) operates on data, including reading data from or writing data to memory and feeding operation data to the operation module; the operation module (30) is used to perform the corresponding operations on the data according to the instructions. When executing an instruction, the device and method can make corresponding adjustments according to the length of the data to be operated on and the scale of the operation module, which improves execution performance for workloads containing a large number of vector computation tasks, and they have the advantages of a concise instruction structure and flexible, efficient data operations.

Description

Operation device and operating method thereof

Technical Field

The present invention belongs to the field of computers, and in particular relates to an operation device and an operating method thereof for efficiently and flexibly performing operations on data of the same or different scales according to instructions. It effectively addresses the fact that more and more algorithms today involve large numbers of operations on data of the same or different scales, reduces the limitation imposed by the scale of the operation unit, and improves the flexibility and effectiveness of vector operations.

Background Art

With the arrival of the big-data era, applications involving vector operations are steadily increasing, the amount of data participating in operations keeps growing, data formats and dimensions keep expanding, and the forms of operation keep multiplying. On the one hand, the scale of the operation unit cannot be expanded as dramatically as the data volume grows, which places demands on how the operation data is regulated during computation; on the other hand, these operations are no longer confined to data of a uniform format, and a large proportion of them are performed between data of different formats or different dimensions, which places higher demands on the flexibility of the operation device.

In the prior art, a known scheme for performing vector operations is to use a general-purpose processor (CPU) or a graphics processing unit (GPU). However, this approach is either inefficient at vector operations, because its architecture is better suited to scalar operations, or unable to meet the requirements of efficiently completing large-scale vector operations, because its on-chip cache is too small. In another prior art, vector computation is performed with specially tailored vector operation devices, i.e., with custom storage units and processing units dedicated to vector operations. However, the existing dedicated vector operation devices are limited by their register files, can only support vector operations of the same length, and lack flexibility.

In addition, the instruction sets corresponding to the above devices can only operate on data of the same length and are limited by the size of the memory and the scale of the operation unit. For data of different lengths, or data that does not match the scale of the operation unit, one approach is to use multiple instructions that access the data in sequence, and another is to use loop instructions for repeated calls. This not only makes the structure of the instruction set complex, the instruction queue lengthy, and the execution efficiency low, but also imposes many runtime restrictions and poor flexibility, so it cannot facilitate large-scale vector operations.
Summary of the Invention

(1) Technical problem to be solved

The object of the present invention is to provide an operation device and an operating method thereof for efficiently and flexibly performing operations on data of the same or different scales according to instructions, solving the problem that more and more algorithms today involve large numbers of operations on data of the same or different scales, and reducing the required scale of the operation unit.

(2) Technical solution

The present invention provides an operation device, the device comprising an instruction module, a data module, and an operation module, wherein:

the instruction module is configured to cache instructions and to provide instructions to the data module and the operation module;

the data module is configured to provide operation data to the operation module according to the instructions in the instruction module;

the operation module is configured to perform operations according to the instructions in the instruction module and the operation data provided by the data module.

Further, the instruction module includes an instruction cache unit, an instruction processing unit, a dependency processing unit, and a storage queue unit, wherein:

the instruction cache unit is configured to store instructions to be executed; the instruction processing unit is configured to obtain an instruction from the instruction cache unit and process it; and the dependency processing unit is configured to determine whether the instruction accesses the same data as a previous instruction that is still executing:

if so, the dependency processing unit stores the instruction in the storage queue unit and provides it to the operation module only after the previous instruction has finished executing;

otherwise, the instruction is provided to the operation module directly.

Further, the instruction processing unit includes:

an instruction fetch part, configured to obtain instructions from the instruction cache unit;

a decoding part, configured to decode the obtained instructions;

an instruction queue part, configured to sequentially store the decoded instructions.

Further, the data module includes a data I/O unit and a data temporary storage unit, wherein the data I/O unit is configured to read operation data directly from memory, and the data temporary storage unit is configured to store the operation data, adjust it, and then provide it to the operation module.

Further, the data temporary storage unit adjusting the operation data before providing it to the operation module includes:

when the lengths of both operation data involved in the operation are less than or equal to the operation scale of the operation module, the data temporary storage unit provides the two operation data to the operation module directly;

when the lengths of both operation data involved in the operation are greater than the operation scale of the operation module, splitting each operation datum into a plurality of sub-operation data whose lengths are all less than or equal to the operation scale, and providing the sub-operation data to the operation module over multiple passes;

when, of the two operation data involved in the operation, one is longer than the operation scale of the operation module and the other is not, splitting the operation datum whose length is greater than the operation scale into a plurality of sub-operation data whose lengths are all less than or equal to the operation scale, and providing the plurality of sub-operation data, together with the operation datum whose length is less than or equal to the operation scale, to the operation module over multiple passes.

Further, the operation data are vectors, and the operation module is configured to perform vector logic operations or vector arithmetic operations.
The present invention also provides an operating method of the operation device, the method comprising:

S1: caching instructions in the instruction module;

S2: providing an instruction in the instruction module to the data module, the data module providing operation data to the operation module according to the instruction;

S3: providing the instruction in the instruction module to the operation module, the operation module performing the operation according to the instruction and the operation data provided by the data module.

Further, the instruction module includes an instruction cache unit, an instruction processing unit, a dependency processing unit, and a storage queue unit, and step S1 includes:

S11: storing instructions to be executed in the instruction cache unit;

S12: the instruction processing unit obtaining an instruction from the instruction cache unit and processing it;

S13: the dependency processing unit determining whether the instruction accesses the same data as a previous instruction that is still executing; if so, the dependency processing unit stores the instruction in the storage queue unit and provides it to the operation module only after the previous instruction has finished executing; otherwise, the instruction is provided to the operation module directly.

Further, the instruction processing unit includes an instruction fetch part, a decoding part, and an instruction queue part, wherein step S12 includes:

S121: the instruction fetch part obtaining an instruction from the instruction cache unit;

S122: the decoding part decoding the obtained instruction;

S123: the instruction queue part sequentially storing the decoded instructions.

Further, the data module includes a data I/O unit and a data temporary storage unit, wherein step S2 includes:

S21: the data I/O unit reading operation data directly from memory and storing it in the data temporary storage unit;

S22: the data temporary storage unit adjusting the stored operation data and providing it to the operation module.

Further, step S22 includes:

when the lengths of both operation data involved in the operation are less than or equal to the operation scale of the operation module, the data temporary storage unit provides the two operation data to the operation module directly;

when the lengths of both operation data involved in the operation are greater than the operation scale of the operation module, splitting each operation datum into a plurality of sub-operation data whose lengths are all less than or equal to the operation scale, and providing the sub-operation data to the operation module over multiple passes;

when, of the two operation data involved in the operation, one is longer than the operation scale of the operation module and the other is not, splitting the operation datum whose length is greater than the operation scale into a plurality of sub-operation data whose lengths are all less than or equal to the operation scale, and providing the plurality of sub-operation data, together with the operation datum whose length is less than or equal to the operation scale, to the operation module over multiple passes.

Further, the operation data are vectors, and the operation module is configured to perform vector logic operations or vector arithmetic operations.

(3) Beneficial effects

With the operation device and operating method provided by the present invention, by issuing only a single instruction, the operation data can be read from memory and temporarily stored in the data temporary storage unit, and the data temporary storage unit adjusts the operation data according to its length before providing it to the operation module, so that operations on data of different lengths are supported and the required scale of the operation unit is reduced. In addition, the present invention uses a dependency processing unit to resolve dependency problems in data storage, thereby improving execution performance for workloads containing a large number of computing tasks. Moreover, the instructions used in the present invention have a compact format, which makes the instruction set simple in structure and convenient to use, and supports flexible data lengths and operation scales.

The present invention can be applied to the following scenarios (including but not limited to): data processing, robots, computers, printers, scanners, telephones, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, cloud servers, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices and other electronic products; aircraft, ships, vehicles and other means of transportation; televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, range hoods and other household appliances; as well as various medical equipment including nuclear magnetic resonance instruments, B-ultrasound machines and electrocardiographs.
Brief Description of the Drawings

FIG. 1 is a schematic structural diagram of the operation device provided by the present invention.

FIG. 2 is a schematic structural diagram of the instruction module in the present invention.

FIG. 3 is a schematic structural diagram of the data module in the present invention.

FIG. 4 is a schematic structural diagram of the operation module in the present invention.

FIG. 5 is a flow chart of the method in the present invention for supporting instructions that operate on data of different lengths.

FIG. 6 is a schematic diagram, provided by an embodiment of the present invention, of the computational correspondence when vectors of different lengths are operated on and the shorter vector is read cyclically.
Detailed Description

To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

FIG. 1 is a schematic structural diagram of the operation device provided by the present invention. As shown in FIG. 1, the device includes an instruction module 10, a data module 20, and an operation module 30. The instruction module 10 is used to cache instructions and to provide instructions to the data module 20 and the operation module 30. The instructions in the instruction module 10 control the direction of the data flow in the data module 20, and the data in the data module 20 affects how dependencies are handled in the instruction module 10; at the same time, the instructions in the instruction module 10 control the specific operations of the operation module 30, and whether the operation module 30 has finished its operation controls whether the instruction module 10 reads a new instruction; the data module 20 provides the concrete operation data for the operation module 30, and the operation module 30 sends the operation result back to the data module 20 for storage.

FIG. 2 is a schematic diagram of the instruction module of the device provided by the present invention. As shown in FIG. 2, the instruction module 10 includes an instruction cache unit 11, an instruction processing unit 12, a dependency processing unit 13, and a storage queue unit 14. The instruction processing unit 12 is further divided into three parts: an instruction fetch part 121, a decoding part 122, and an instruction queue part 123. The instruction cache unit 11 is used to cache an instruction while it is being executed; when an instruction finishes executing, if it is also the earliest of the uncommitted instructions in the instruction cache unit 11, it is committed, and once committed, the changes that the instruction's operation makes to the device state can no longer be undone. The instruction fetch part 121 is used to take the next instruction to be executed out of the instruction cache unit 11 and pass it to the decoding part 122; the decoding part 122 is used to decode the instruction and pass the decoded instruction to the instruction queue part 123; the instruction queue part 123 is used to store the decoded instructions in order. The dependency processing unit 13 is used to handle data dependencies that may exist between the current instruction and the previous instruction; for example, when accessing data in the data module 20, consecutive instructions may access data in the same block of storage, and if an instruction operates on that data before the previous instruction has finished executing, the consistency of the data is affected, which compromises the correctness of the operation result. Therefore, if the dependency processing unit 13 detects that the current instruction has a data dependency on an earlier instruction, the instruction must wait in the storage queue unit 14 until the dependency is eliminated; the storage queue unit 14 is an ordered queue in which instructions that have data dependencies on earlier instructions are stored until the dependencies are eliminated.

FIG. 3 is a schematic structural diagram of the data module in the present invention. As shown in FIG. 3, the data module 20 consists of two parts, namely a data I/O unit 21 and a data temporary storage unit 22. The data I/O unit 21 is used to interact with the memory, that is, it can read data directly from the memory or write data directly into the memory. The data temporary storage unit 22 is composed of a scratchpad memory, which can be implemented with various memory devices (SRAM, eDRAM, DRAM, memristors, 3D-DRAM, non-volatile memory, etc.). The data temporary storage unit 22 can store operation data of different sizes, such as vector data of various scales. The data I/O unit 21 reads out the necessary operation data according to the instruction and temporarily stores it in the data temporary storage unit 22; because a scratchpad memory is used, operation data of different lengths can be stored, and during the operation process the data temporary storage unit 22 can adjust the operation data according to the scale of the operation unit 30 and the length of the operation data before providing it to the operation module 30.

Specifically, when the lengths of both operation data involved in the operation are less than or equal to the operation scale of the operation module, the data temporary storage unit 22 provides the two operation data directly to the operation module 30. For example, suppose the operation scale of the operation unit 30 is an operation on two sets of vectors at a time, each set containing four elements, such as the operation between (A1, A2, A3, A4) and (B1, B2, B3, B4); if both operation data are vectors of fewer than four elements, such as (A1, A2, A3) and (B1, B2), then (A1, A2, A3) and (B1, B2) can be provided directly to the operation module 30 for the operation.

When the lengths of both operation data involved in the operation are greater than the operation scale of the operation module, the data temporary storage unit 22 splits each operation datum into several sub-operation data whose lengths are less than or equal to the operation scale and provides the sub-operation data to the operation module over multiple passes. For example, the operation scale of the operation unit 30 is an operation on two sets of vectors at a time, each set containing four elements, such as the operation between (A1, A2, A3, A4) and (B1, B2, B3, B4); if both operation data are larger than the operation scale, such as (A1, A2, A3, A4, A5) and (B1, B2, B3, B4, B5), then (A1, A2, A3, A4, A5) can be split into D1 (A1, A2, A3, A4) and D2 (A5), and (B1, B2, B3, B4, B5) into d1 (B1, B2, B3, B4) and d2 (B5), which are then provided to the operation unit 30 in two passes: D1 (A1, A2, A3, A4) and d1 (B1, B2, B3, B4) in the first pass, and D2 (A5) and d2 (B5) in the second pass. In the above example, each operation datum larger than the operation scale is split into two segments and the sub-operation data of the corresponding segments are provided in each pass. When the two operation data are split into different numbers of segments, for example the first operation datum into three segments D1, D2, D3 and the second into two segments d1 and d2, the first operation datum D1, D2, D3 is provided to the operation unit in three passes, and over these three passes the second operation datum d1, d2 must be provided cyclically, that is, D1 and d1 in the first pass, D2 and d2 in the second pass, and D3 and d1 in the third pass. As another example, if the first operation datum is split into five segments D1, D2, D3, D4, D5 and the second operation datum into three segments d1, d2, d3, the operation data are provided to the operation unit in five passes, that is, D1 and d1 in the first pass, D2 and d2 in the second pass, D3 and d3 in the third pass, D4 and d1 in the fourth pass, and D5 and d2 in the fifth pass.

When, of the two operation data involved in the operation, one is longer than the operation scale of the operation module and the other is not, the operation datum whose length is greater than the operation scale is split into several sub-operation data whose lengths are less than or equal to the operation scale, and these sub-operation data, together with the operation datum whose length is less than or equal to the operation scale, are provided to the operation module over multiple passes. Briefly, if the first operation datum is longer than the operation scale and is split into three segments D1, D2, and D3, while the second operation datum is within the operation scale, needs no splitting, and is denoted d, then the first and second operation data are provided to the operation unit in three passes: D1 and d in the first pass, D2 and d in the second pass, and D3 and d in the third pass.

In general, the adjustment of the operation data by the data temporary storage unit 22 means that when the length of the operation data is not greater than the operation scale of the operation unit, the data to be operated on can be sent from this memory directly into the operation unit 30; otherwise, each pass sends data matching the operation scale of the operation unit 30 into the operation unit 30, and after the pass completes, or after that batch of data has entered the next pipeline stage, this memory sends a new batch of data matching the operation scale of the operation unit 30 into the operation unit 30 for processing. In addition, when the two data to be operated on have the same length, they are sent directly, or after splitting, into the operation unit 30 for the operation; otherwise, the segments of the longer datum are read sequentially and the segments of the shorter datum are read cyclically, until the operation ends.

FIG. 4 is a schematic structural diagram of the operation module of the device provided by the present invention. As shown in FIG. 4, the operation module is composed of several different kinds of operation components, such as vector addition components, vector subtraction components, vector AND components, vector dot product components, and so on, with several instances of each kind. With these operation components, the operation module can support a variety of vector operations.
FIG. 5 is a flow chart of the method in the present invention for supporting instructions that operate on data of different lengths. The process of executing such an instruction includes:

S1: the instruction fetch part 121 in the instruction processing unit 12 takes a vector operation instruction out of the instruction cache unit 11 and sends it to the decoding part 122 in the instruction processing unit.

S2: the decoding part 122 decodes the instruction and splits it, according to the custom instruction rule, into an opcode and various operation fields. The custom instruction rule adopted here is that an instruction contains an opcode and at least one operation field; the opcode defines the type of the vector operation, and the operation fields hold the data values to be operated on, the addresses where the data are stored, the lengths of the data, the address where the operation result is to be stored, and so on, with the meaning of each operation field differing according to the opcode. The operation instruction is then sent to the instruction queue part 123.

S3: in the instruction queue part 123, the data to be operated on is obtained according to the opcode and operation fields of the instruction and sent to the dependency processing unit 13 for analysis and judgment of the data dependencies.

S4: in the dependency processing unit 13, it is analyzed whether the instruction has a data dependency on a previous instruction that has not yet finished executing. If there is no dependency, there is no need to wait; otherwise the instruction is stored in the storage queue unit until it no longer has any data dependency on unfinished previous instructions. The instruction is then sent to the operation unit 30.

S5: when the instruction is sent to the operation unit 30 for execution, the data temporary storage unit 22 in the data module 20 adjusts the data according to the length of the data and the scale of the operation unit 30: when the vector length is not greater than the operation scale of the operation unit 30, the vector to be operated on can be sent directly into the operation unit 30; otherwise, each pass sends data matching the operation scale of the operation unit 30 into the operation unit 30, and after the pass completes, a new batch of data matching the operation scale is sent into the operation unit 30, until the operation ends. When the two vectors to be operated on have the same length, they are sent directly into the operation unit for the operation; otherwise, the longer vector is read sequentially and the shorter vector is read cyclically until the operation ends. If the vectors to be operated on need to be adjusted both for the scale of the operation unit and for their lengths, it is sufficient to keep reading data matching the operation scale while preserving the order in which the longer vector is read sequentially and the shorter vector is read cyclically.

S6: after the operation is completed, the result is written back to the specified address in the data temporary storage unit 22, and the instruction in the instruction cache unit 11 is committed.

To make this process clearer, a specific embodiment is provided below and the flow is described in further detail with reference to the drawings.
Embodiment 1

This embodiment describes the specific process of performing an inter-vector AND operation with the operation device. First, the format of the inter-vector AND instruction in this embodiment is:

(Instruction format table, original image PCTCN2017093161-appb-000001: the opcode VAV followed by five operation fields, namely the start address of vin0, the length of vin0, the start address of vin1, the length of vin1, and the storage address of the operation result.)

Assume that each register address can store 16-bit data and that the operation unit contains four AND operators, each of which can perform an AND operation on 16-bit data simultaneously. Taking the operation instruction VAV 00001 01000 01001 01000 10001 as an example, the instruction indicates that vector 0 and vector 1 perform the VAV operation, that is, the inter-vector AND operation. Specifically, the process of the inter-vector AND operation includes:

S1: the instruction fetch part 121 in the instruction processing unit 12 takes a vector operation instruction, namely VAV 00001 01000 01001 01000 10001, out of the instruction cache unit 11 and sends it to the decoding part 122 in the instruction processing unit 12.

S2: the decoding part 122 decodes the instruction and obtains the opcode VAV, which indicates that the inter-vector AND operation is to be performed, and five operation fields, which respectively represent the start address and length of the vector vin0 to be operated on, the start address and length of the vector vin1, and the storage address of the operation result; the operation instruction is sent to the instruction queue part 123.

S3: in the instruction queue part 123, the data to be operated on is obtained according to the opcode and operation fields of the instruction. The opcode of the instruction is VAV, that is, the inter-vector AND logic operation is to be performed, and the addresses and lengths of the data to be operated on are obtained from operation fields 1, 2, 3, and 4, namely the start address 00001 of vector vin0, the length 01000 of vector vin0, the start address 01001 of vector vin1, and the length 01000 of vector vin1. That is, vector vin0 starts at address 00001 and data spanning 8 addresses is read, namely the data at addresses 00001 to 01000; vector vin1 starts at address 01001 and likewise data spanning 8 addresses is read. The data is then sent to the dependency processing unit 13 for analysis and judgment of the data dependencies.

S4: in the dependency processing unit 13, it is analyzed whether the instruction has a data dependency on a previous instruction that has not yet finished executing. If there is no dependency, there is no need to wait; otherwise the instruction is stored in the storage queue unit 14 until it no longer has any data dependency on unfinished previous instructions. The instruction is then sent to the operation unit 30.

S5: the data I/O unit 21 in the data module 20 obtains the data from the external memory in advance and stores the obtained data in the data temporary storage unit 22. When the instruction is sent to the operation unit 30 for execution, the data temporary storage unit 22 locates the corresponding data according to the data addresses indicated by the instruction and provides it to the operation unit 30; before doing so, the data temporary storage unit 22 adjusts the data according to the length of the data and the operation scale of the operation unit 30. Here, the operation unit 30 can only process the AND operation of four groups of 16-bit data at a time, so the data sent into the operation unit 30 in the first pass is the data of the first four addresses indicated by vin0 and the first four addresses indicated by vin1, that is, the data at addresses 00001 to 00100 and 01001 to 01100 is operated on. After that pass completes, the data of the last four addresses of vin0 and of vin1 is loaded and operated on, that is, the AND operation is performed on the data at addresses 00101 to 01000 and 01101 to 10000.

S6: after the operation is completed, the result is written back to the specified address 10001 in the data temporary storage unit 22, and the inter-vector AND logic instruction in the instruction cache unit is committed.
Embodiment 2

This embodiment describes the specific process of performing a vector addition operation with the operation device. First, in this embodiment the format of the vector addition instruction is defined as:

(Instruction format table, original image PCTCN2017093161-appb-000002: the opcode VA followed by five operation fields, namely the start address of vin0, the length of vin0, the start address of vin1, the length of vin1, and the storage address of the operation result.)

Assume that each register address can store 16-bit data and that the operation unit contains four adders, each of which can perform addition of 16-bit data simultaneously. Taking VA 00001 01000 01001 00010 10001 as an example, the instruction indicates that vector 0 and vector 1 perform the VA operation, that is, vector addition. The process by which the operation device executes this vector addition instruction includes:

S1: the instruction fetch part 121 in the instruction processing unit 12 takes a vector operation instruction, namely VA 00001 01000 01001 00010 10001, out of the instruction cache unit 11 and sends it to the decoding part 122 in the instruction processing unit.

S2: the decoding part 122 decodes the instruction and obtains the opcode VA, which indicates that vector addition is to be performed, and five operation fields, which respectively represent the start address and length of the vector vin0 to be operated on, the start address and length of the vector vin1, and the storage address of the operation result; the operation instruction is sent to the instruction queue part 123.

S3: in the instruction queue part 123, the data to be operated on is obtained according to the opcode and operation fields of the instruction. The opcode of the instruction is VA, that is, vector addition is to be performed, and the addresses and lengths of the data to be operated on are obtained from operation fields 1, 2, 3, and 4, namely the start address 00001 of vector vin0, the length 01000 of vector vin0, the start address 01001 of vector vin1, and the length 00010 of vector vin1. That is, vector vin0 starts at address 00001 and data spanning 8 addresses is read, namely the data at addresses 00001 to 01000; vector vin1 starts at address 01001 and data spanning 2 addresses is read. The data is then sent to the dependency processing unit 13 for analysis and judgment of the data dependencies.

S4: in the dependency processing unit 13, it is analyzed whether the instruction has a data dependency on a previous instruction that has not yet finished executing. If there is no dependency, there is no need to wait; otherwise the instruction is stored in the storage queue unit until it no longer has any data dependency on unfinished previous instructions. The instruction is then sent to the operation unit.

S5: once no dependency remains, the vector addition instruction is sent to the operation unit 30. The operation unit 30 takes the required vectors out of the data temporary storage unit 22 according to the addresses and lengths of the required data and then completes the addition in the operation unit. Here, because the operation unit 30 can only process the addition of four groups of 16-bit data at a time, all of the data cannot be sent to the operation unit at once, and the operation must be performed over several passes; moreover, because vin0 and vin1 have different lengths and vin1 is the shorter one, the data of vin1 must be read cyclically during the operation. As shown in FIG. 6, the data sent into the operation unit 30 in the first pass is the data of the first four addresses indicated by vin0 and the data of the two addresses indicated by vin1, that is, the data at addresses 00001 to 00100 and 01001 to 01010, with the following correspondence of the data being operated on: the data at address 00001 is added to the data at address 01001, the data at address 00010 is added to the data at address 01010, the data at address 00011 is added to the data at address 01001, and the data at address 00100 is added to the data at address 01010. After that pass completes, the data sent into the operation unit 30 in the second pass is the data of the last four addresses indicated by vin0 and the data of the two addresses indicated by vin1, that is, the data at addresses 00101 to 01000 and 01001 to 01010 undergoes the addition, with the correspondence: the data at address 00101 is added to the data at address 01001, the data at address 00110 is added to the data at address 01010, the data at address 00111 is added to the data at address 01001, and the data at address 01000 is added to the data at address 01010.

S6: after the operation is completed, the result is written back to the specified address 10001 in the data temporary storage unit 22, and the vector addition instruction in the instruction cache unit 11 is committed.

The specific embodiments described above further explain the objects, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit the present invention; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (12)

  1. An operation device, characterized in that the device comprises an instruction module, a data module, and an operation module, wherein:
    the instruction module is configured to cache instructions and to provide instructions to the data module and the operation module;
    the data module is configured to provide operation data to the operation module according to the instructions in the instruction module;
    the operation module is configured to perform operations according to the instructions in the instruction module and the operation data provided by the data module.
  2. The operation device according to claim 1, characterized in that the instruction module comprises an instruction cache unit, an instruction processing unit, a dependency processing unit, and a storage queue unit, wherein:
    the instruction cache unit is configured to store instructions to be executed; the instruction processing unit is configured to obtain an instruction from the instruction cache unit and process it; and the dependency processing unit is configured to determine whether the instruction accesses the same data as a previous instruction that is still executing:
    if so, the dependency processing unit stores the instruction in the storage queue unit and provides it to the operation module only after the previous instruction has finished executing;
    otherwise, the instruction is provided to the operation module directly.
  3. The operation device according to claim 2, characterized in that the instruction processing unit comprises:
    an instruction fetch part, configured to obtain instructions from the instruction cache unit;
    a decoding part, configured to decode the obtained instructions;
    an instruction queue part, configured to sequentially store the decoded instructions.
  4. The operation device according to claim 1, characterized in that the data module comprises a data I/O unit and a data temporary storage unit, wherein the data I/O unit is configured to read operation data directly from memory, and the data temporary storage unit is configured to store the operation data, adjust it, and then provide it to the operation module.
  5. The operation device according to claim 4, characterized in that the data temporary storage unit adjusting the operation data before providing it to the operation module comprises:
    when the lengths of both operation data involved in the operation are less than or equal to the operation scale of the operation module, the data temporary storage unit provides the two operation data to the operation module directly;
    when the lengths of both operation data involved in the operation are greater than the operation scale of the operation module, splitting each operation datum into a plurality of sub-operation data whose lengths are all less than or equal to the operation scale, and providing the sub-operation data to the operation module over multiple passes;
    when, of the two operation data involved in the operation, one operation datum is longer than the operation scale of the operation module and the other operation datum is not, splitting the operation datum whose length is greater than the operation scale into a plurality of sub-operation data whose lengths are all less than or equal to the operation scale, and providing the plurality of sub-operation data, together with the operation datum whose length is less than or equal to the operation scale, to the operation module over multiple passes.
  6. The operation device according to claim 1, characterized in that the operation data are vectors, and the operation module is configured to perform vector logic operations or vector arithmetic operations.
  7. An operating method of an operation device, the operation device being the operation device according to any one of claims 1 to 6, characterized in that the method comprises:
    S1: caching instructions in the instruction module;
    S2: providing an instruction in the instruction module to the data module, the data module providing operation data to the operation module according to the instruction;
    S3: providing the instruction in the instruction module to the operation module, the operation module performing the operation according to the instruction and the operation data provided by the data module.
  8. The operating method of an operation device according to claim 7, characterized in that the instruction module comprises an instruction cache unit, an instruction processing unit, a dependency processing unit, and a storage queue unit, and step S1 comprises:
    S11: storing instructions to be executed in the instruction cache unit;
    S12: the instruction processing unit obtaining an instruction from the instruction cache unit and processing it;
    S13: the dependency processing unit determining whether the instruction accesses the same data as a previous instruction that is still executing; if so, the dependency processing unit stores the instruction in the storage queue unit and provides it to the operation module only after the previous instruction has finished executing; otherwise, the instruction is provided to the operation module directly.
  9. The operating method of an operation device according to claim 8, characterized in that the instruction processing unit comprises an instruction fetch part, a decoding part, and an instruction queue part, wherein step S12 comprises:
    S121: the instruction fetch part obtaining an instruction from the instruction cache unit;
    S122: the decoding part decoding the obtained instruction;
    S123: the instruction queue part sequentially storing the decoded instructions.
  10. The operating method of an operation device according to claim 7, characterized in that the data module comprises a data I/O unit and a data temporary storage unit, wherein step S2 comprises:
    S21: the data I/O unit reading operation data directly from memory and storing it in the data temporary storage unit;
    S22: the data temporary storage unit adjusting the stored operation data and providing it to the operation module.
  11. The operating method of an operation device according to claim 10, characterized in that step S22 comprises:
    when the lengths of both operation data involved in the operation are less than or equal to the operation scale of the operation module, the data temporary storage unit provides the two operation data to the operation module directly;
    when the lengths of both operation data involved in the operation are greater than the operation scale of the operation module, splitting each operation datum into a plurality of sub-operation data whose lengths are all less than or equal to the operation scale, and providing the sub-operation data to the operation module over multiple passes;
    when, of the two operation data involved in the operation, one operation datum is longer than the operation scale of the operation module and the other operation datum is not, splitting the operation datum whose length is greater than the operation scale into a plurality of sub-operation data whose lengths are all less than or equal to the operation scale, and providing the plurality of sub-operation data, together with the operation datum whose length is less than or equal to the operation scale, to the operation module over multiple passes.
  12. The operating method of an operation device according to claim 7, characterized in that the operation data are vectors, and the operation module is configured to perform vector logic operations or vector arithmetic operations.
PCT/CN2017/093161 2016-08-05 2017-07-17 Operation device and operating method thereof WO2018024094A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
KR1020187034254A KR102467544B1 (ko) 2016-08-05 2017-07-17 연산 장치 및 그 조작 방법
EP17836276.0A EP3495947B1 (en) 2016-08-05 2017-07-17 Operation device and method of operating same
US16/268,479 US20190235871A1 (en) 2016-08-05 2019-02-05 Operation device and method of operating same

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610640115.6 2016-08-05
CN201610640115.6A CN107688466B (zh) 2016-08-05 2016-08-05 一种运算装置及其操作方法

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/268,479 Continuation-In-Part US20190235871A1 (en) 2016-08-05 2019-02-05 Operation device and method of operating same

Publications (1)

Publication Number Publication Date
WO2018024094A1 true WO2018024094A1 (zh) 2018-02-08

Family

ID=61072478

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/093161 WO2018024094A1 (zh) 2016-08-05 2017-07-17 一种运算装置及其操作方法

Country Status (6)

Country Link
US (1) US20190235871A1 (zh)
EP (1) EP3495947B1 (zh)
KR (1) KR102467544B1 (zh)
CN (3) CN112214244A (zh)
TW (1) TWI752068B (zh)
WO (1) WO2018024094A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111258646B (zh) * 2018-11-30 2023-06-13 上海寒武纪信息科技有限公司 指令拆解方法、处理器、指令拆解装置及存储介质


Family Cites Families (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4135242A (en) * 1977-11-07 1979-01-16 Ncr Corporation Method and processor having bit-addressable scratch pad memory
JPS5994173A (ja) * 1982-11-19 1984-05-30 Hitachi Ltd ベクトル・インデツクス生成方式
NL9400607A (nl) * 1994-04-15 1995-11-01 Arcobel Graphics Bv Dataverwerkingscircuit, vermenigvuldigingseenheid met pijplijn, ALU en schuifregistereenheid ten gebruike bij een dataverwerkingscircuit.
US6088783A (en) * 1996-02-16 2000-07-11 Morton; Steven G DPS having a plurality of like processors controlled in parallel by an instruction word, and a control processor also controlled by the instruction word
JP3525209B2 (ja) * 1996-04-05 2004-05-10 株式会社 沖マイクロデザイン べき乗剰余演算回路及びべき乗剰余演算システム及びべき乗剰余演算のための演算方法
WO2000017788A1 (en) * 1998-09-22 2000-03-30 Vectorlog Devices and techniques for logical processing
JP3779540B2 (ja) * 2000-11-08 2006-05-31 株式会社ルネサステクノロジ 複数レジスタ指定が可能なsimd演算方式
AU2002338616A1 (en) * 2001-02-06 2002-10-28 Victor Demjanenko Vector processor architecture and methods performed therein
US6901422B1 (en) * 2001-03-21 2005-05-31 Apple Computer, Inc. Matrix multiplication in a vector processing system
US6922716B2 (en) * 2001-07-13 2005-07-26 Motorola, Inc. Method and apparatus for vector processing
JP3886870B2 (ja) * 2002-09-06 2007-02-28 株式会社ルネサステクノロジ データ処理装置
FI118654B (fi) * 2002-11-06 2008-01-31 Nokia Corp Menetelmä ja järjestelmä laskuoperaatioiden suorittamiseksi ja laite
US7146486B1 (en) * 2003-01-29 2006-12-05 S3 Graphics Co., Ltd. SIMD processor with scalar arithmetic logic units
CN100545804C (zh) * 2003-08-18 2009-09-30 上海海尔集成电路有限公司 一种基于cisc结构的微控制器及其指令集的实现方法
CN1277182C (zh) * 2003-09-04 2006-09-27 台达电子工业股份有限公司 具有辅助处理单元的可编程逻辑控制器
US7594102B2 (en) * 2004-12-15 2009-09-22 Stmicroelectronics, Inc. Method and apparatus for vector execution on a scalar machine
US20070283129A1 (en) * 2005-12-28 2007-12-06 Stephan Jourdan Vector length tracking mechanism
KR100859185B1 (ko) * 2006-05-18 2008-09-18 학교법인 영광학원 유한체 GF(2m)상의 곱셈기
CN100470571C (zh) * 2006-08-23 2009-03-18 北京同方微电子有限公司 一种用于密码学运算的微处理器内核装置
US8667250B2 (en) * 2007-12-26 2014-03-04 Intel Corporation Methods, apparatus, and instructions for converting vector data
JP5481793B2 (ja) * 2008-03-21 2014-04-23 富士通株式会社 演算処理装置および同装置の制御方法
US20100115234A1 (en) * 2008-10-31 2010-05-06 Cray Inc. Configurable vector length computer processor
CN101399553B (zh) * 2008-11-12 2012-03-14 清华大学 一种可在线编程的准循环ldpc码编码器装置
CN101826142B (zh) * 2010-04-19 2011-11-09 中国人民解放军信息工程大学 一种可重构椭圆曲线密码处理器
US8645669B2 (en) * 2010-05-05 2014-02-04 International Business Machines Corporation Cracking destructively overlapping operands in variable length instructions
CN102799800B (zh) * 2011-05-23 2015-03-04 中国科学院计算技术研究所 一种安全加密协处理器及无线传感器网络节点芯片
CN102253919A (zh) * 2011-05-25 2011-11-23 中国石油集团川庆钻探工程有限公司 基于gpu和cpu协同运算的并行数值模拟方法和***
CN102262525B (zh) * 2011-08-29 2014-11-19 孙瑞玮 基于矢量运算的矢量浮点运算装置及方法
US8572131B2 (en) * 2011-12-08 2013-10-29 Oracle International Corporation Techniques for more efficient usage of memory-to-CPU bandwidth
CN102750133B (zh) * 2012-06-20 2014-07-30 中国电子科技集团公司第五十八研究所 支持simd的32位三发射的数字信号处理器
CN103699360B (zh) * 2012-09-27 2016-09-21 北京中科晶上科技有限公司 一种向量处理器及其进行向量数据存取、交互的方法
CN103778069B (zh) * 2012-10-18 2017-09-08 深圳市中兴微电子技术有限公司 高速缓冲存储器的高速缓存块长度调整方法及装置
US9557993B2 (en) * 2012-10-23 2017-01-31 Analog Devices Global Processor architecture and method for simplifying programming single instruction, multiple data within a register
CN107577614B (zh) * 2013-06-29 2020-10-16 华为技术有限公司 数据写入方法及内存***
CN103440227B (zh) * 2013-08-30 2016-06-22 广州天宁信息技术有限公司 一种支持并行运行算法的数据处理方法及装置
US10331583B2 (en) * 2013-09-26 2019-06-25 Intel Corporation Executing distributed memory operations using processing elements connected by distributed channels
CN104636397B (zh) * 2013-11-15 2018-04-20 阿里巴巴集团控股有限公司 用于分布式计算的资源分配方法、计算加速方法以及装置
US10768930B2 (en) * 2014-02-12 2020-09-08 MIPS Tech, LLC Processor supporting arithmetic instructions with branch on overflow and methods
US10452971B2 (en) * 2015-06-29 2019-10-22 Microsoft Technology Licensing, Llc Deep neural network partitioning on servers
WO2017031630A1 (zh) * 2015-08-21 2017-03-02 中国科学院自动化研究所 基于参数量化的深度卷积神经网络的加速与压缩方法
US10482372B2 (en) * 2015-12-23 2019-11-19 Intel Corporation Interconnection scheme for reconfigurable neuromorphic hardware
CN107636640B (zh) * 2016-01-30 2021-11-23 慧与发展有限责任合伙企业 点积引擎、忆阻器点积引擎以及用于计算点积的方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1349159A (zh) * 2001-11-28 2002-05-15 中国人民解放军国防科学技术大学 微处理器向量处理方法
US20060112159A1 (en) * 2004-11-22 2006-05-25 Sony Corporation Processor
CN101986265A (zh) * 2010-10-29 2011-03-16 浙江大学 一种基于Atom处理器的指令并行分发方法
CN102495719A (zh) * 2011-12-15 2012-06-13 中国科学院自动化研究所 一种向量浮点运算装置及方法
CN104375993A (zh) * 2013-08-12 2015-02-25 阿里巴巴集团控股有限公司 一种数据处理的方法及装置

Also Published As

Publication number Publication date
CN107688466A (zh) 2018-02-13
CN112214244A (zh) 2021-01-12
EP3495947A4 (en) 2020-05-20
CN107688466B (zh) 2020-11-03
EP3495947A1 (en) 2019-06-12
US20190235871A1 (en) 2019-08-01
TWI752068B (zh) 2022-01-11
KR20190032282A (ko) 2019-03-27
CN111857822A (zh) 2020-10-30
EP3495947B1 (en) 2022-03-30
TW201805802A (zh) 2018-02-16
KR102467544B1 (ko) 2022-11-16
CN111857822B (zh) 2024-04-05

Similar Documents

Publication Publication Date Title
CN111310910B (zh) 一种计算装置及方法
CN111857820B (zh) 一种用于执行矩阵加/减运算的装置和方法
CN111651205B (zh) 一种用于执行向量内积运算的装置和方法
CN107315717B (zh) 一种用于执行向量四则运算的装置和方法
WO2017185395A1 (zh) 一种用于执行向量比较运算的装置和方法
CN107315568B (zh) 一种用于执行向量逻辑运算的装置
CN107315575B (zh) 一种用于执行向量合并运算的装置和方法
WO2017185384A1 (zh) 一种用于执行向量循环移位运算的装置和方法
WO2017185405A1 (zh) 一种用于执行向量外积运算的装置和方法
CN111651204B (zh) 一种用于执行向量最大值最小值运算的装置和方法
WO2017185388A1 (zh) 一种用于生成服从一定分布的随机向量的装置和方法
WO2018024094A1 (zh) 一种运算装置及其操作方法

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 20187034254

Country of ref document: KR

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17836276

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2017836276

Country of ref document: EP

Effective date: 20190305