WO2018107476A1 - Memory access device, computing device, and device applied to convolutional neural network operations - Google Patents

Memory access device, computing device, and device applied to convolutional neural network operations

Info

Publication number
WO2018107476A1
Authority
WO
WIPO (PCT)
Prior art keywords
multiply
unit
data block
accumulate
data
Prior art date
Application number
PCT/CN2016/110436
Other languages
English (en)
French (fr)
Inventor
汪涛
宋风龙
刘武龙
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to PCT/CN2016/110436 priority Critical patent/WO2018107476A1/zh
Priority to CN201680091648.1A priority patent/CN110073329B/zh
Publication of WO2018107476A1 publication Critical patent/WO2018107476A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/57 Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations

Definitions

  • the present application relates to the field of computers, and in particular, to a memory access device, a computing device, and a device applied to a convolutional neural network operation in the computer field.
  • CNN Convolutional neural network
  • For current neural network processors, research generally focuses on two aspects: computation and storage.
  • the core of the convolution operation is the multiply-accumulate operation.
  • Convolution operations usually contain a large amount of special data such as -1, 0, and 2^n, which takes up a large part of the computing resources.
  • Such special data (-1, 0, 2^n) is generated at runtime, and the compiler can only perform static optimization and cannot optimize this data during operation, resulting in lower calculation rates and throughput.
  • the present application provides a memory access device, a computing device, and a device applied to a convolutional neural network operation to improve memory access efficiency and computational throughput while reducing computational power consumption.
  • A memory access device is provided, including: an input buffer unit for buffering a data block to be calculated; and a cascading unit connected to the input buffer unit. The cascading unit reads the data block to be calculated from the input buffer unit, where the data block to be calculated includes a first data block and a second data block; connects the first data block and the second data block end to end to obtain a concatenated data block; and intercepts a third data block from the concatenated data block, where the third data block is a piece of continuous data in the concatenated data block and the length of the third data block is equal to the length of one data block in the input buffer unit.
  • In this way, the cascading unit can connect the first data block and the second data block read from the input buffer unit end to end to obtain a concatenated data block, and intercept from the concatenated data block a third data block of one data block length starting at any position. Therefore, fast address-unaligned access can be realized by arbitrarily intercepting data from the concatenated data block, improving the efficiency of address-unaligned access.
  • The memory access device further includes a control unit. The control unit is connected to the cascading unit and configured to send a first control instruction to the cascading unit, where the first control instruction is used to indicate the interception manner of the concatenated data block; the cascading unit intercepts the third data block from the concatenated data block according to the first control instruction.
  • In this way, one vector-length piece of data can be quickly obtained from any two data blocks of the input buffer unit according to the first control instruction; that is, arbitrary address-unaligned access is supported by a single instruction, which reduces the number of address-unaligned access instructions and improves access efficiency.
  • the first control instruction includes first indication information, where the first indication information is used to indicate a starting position of the third data block in the concatenated data block.
  • the first indication information includes a data sequence number of a start position of the third data block
  • The first control instruction further includes second indication information, where the second indication information is used to indicate the data format of the data block to be calculated; the cascading unit determines the starting position of the third data block in the concatenated data block according to the data sequence number and the data format.
  • The input buffer unit includes a read port, and the read port is connected to a first control register. The first control register stores first configuration information, where the first configuration information is used to indicate the address range of the data blocks to be read in the input buffer unit, and the start address and step size within that address range. The read port starts from the start address, the addresses of two adjacent read operations grow by the step size, and the data blocks in the address range are read cyclically.
  • The input buffer unit includes a write port, and the write port is connected to a second control register. The second control register stores second configuration information, where the second configuration information is used to indicate the address range in the input buffer unit in which new data blocks are stored, and the start address and step size within that address range. The write port starts from the start address, the addresses of two adjacent write operations grow by the step size, and new data blocks are written cyclically into the address range.
  • In this way, the control register corresponding to the read port or the write port only needs to store the address range of the data blocks to be accessed, and the start address and step size within that range, for the corresponding data to be accessed. This makes it possible to streamline the instructions of the write port or the read port. Further, in the cyclic self-index access mode, the address range and step size can be configured, which improves the flexibility of accessing data in the input buffer unit.
  • a computing device comprising a multiplication buffer unit, a multiplication scheduling unit, and an addition unit, the multiplication buffer unit for buffering a multiply and accumulate instruction to be processed;
  • The multiplication scheduling unit is configured to obtain a first multiply-accumulate instruction from the multiplication buffer unit; when a source operand of the multiplication operation in the first multiply-accumulate instruction includes an optimizable operand, it determines the operation result of the multiplication operation by an optimization operation and sends the operation result of the multiplication operation in the first multiply-accumulate instruction directly to the addition unit, where n is an integer greater than or equal to 0, the optimizable operand includes -1 or 2^n, and the optimization operation includes a sign inversion operation or a shift operation; the addition unit performs the addition operation in the first multiply-accumulate instruction according to the operation result of the multiplication operation in the first multiply-accumulate instruction, to obtain the operation result of the multiply-accumulate operation corresponding to the first multiply-accumulate instruction.
  • In this way, the calculation result of the multiplication operation is determined by a sign inversion operation or a shift operation and sent directly to the addition unit without using the multiplier, thereby increasing the rate and throughput of the multiply-accumulate operation and reducing the power consumption of the multiply-accumulate operation.
  • The multiplication scheduling unit is configured to schedule, in one clock cycle, multiple multiply-accumulate instructions acquired from the multiplication buffer unit. The multiple multiply-accumulate instructions include a first type multiply-accumulate instruction and at least one second type multiply-accumulate instruction; the source operands of the multiplication operation in the first type multiply-accumulate instruction do not include any of -1, 0, and 2^n, while the source operands of the multiplication operation in the second type multiply-accumulate instruction include -1, 0, or 2^n.
  • the computing device in the embodiment of the present application can process multiple multiply and accumulate instructions in one clock cycle, thereby improving the speed and throughput of the multiply and accumulate operations.
  • The addition unit further includes an addition buffer unit, an addition scheduling unit, an adder, and at least one accumulation register, where the addition buffer unit is configured to buffer source operands for the addition operation, and the source operands include the operation results of the multiplication operations in the multiply-accumulate instructions to be processed;
  • The addition scheduling unit determines a first source operand and a second source operand of the addition operation of the first multiply-accumulate instruction, where the first source operand and the second source operand correspond to the same target accumulation register, and the second source operand comes from the addition buffer unit or the target accumulation register; the adder sums the first source operand and the second source operand to obtain a summation result;
  • the addition scheduling unit writes the summation result to the addition buffer unit or the target accumulation register.
  • In this way, the addition scheduling unit uses the adder to sum the multiplication results of multiply-accumulate instructions corresponding to the same accumulation register in the addition buffer unit, which reduces the number of accesses to the accumulation register, reduces the pipeline stalls caused by accessing the accumulation register, and increases the rate and throughput of processing multiply-accumulate operations.
  • When the addition buffer unit stores target data corresponding to the same target accumulation register, the addition scheduling unit determines the target data as the second source operand and writes the summation result to the addition buffer unit; when the addition buffer unit does not store such target data, the addition scheduling unit uses the multiply-accumulate result stored in the target accumulation register as the second source operand and writes the summation result to the target accumulation register.
  • In this way, the addition scheduling unit first uses the adder to sum the multiplication results of the multiply-accumulate instructions corresponding to the same accumulation register in the addition buffer unit, which reduces the number of accesses to the accumulation register, reduces the pipeline stalls caused by accessing the accumulation register, and increases the rate and throughput of processing multiply-accumulate operations.
  • The multiplication scheduling unit is configured to identify a new target accumulation register for a first group of multiply-accumulate instructions, where the operation results of the multiplication operations in the multiply-accumulate instructions of the first group correspond to the same accumulation register.
  • A device for use in convolutional neural network operations is provided, comprising the memory access device of the first aspect or any of the possible implementations of the first aspect, and the computing device of the second aspect or any of the possible implementations of the second aspect.
  • FIG. 1 is a schematic diagram of a process of convolutional neural network operation according to an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a memory access device according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a process in which a cascading unit performs a cascading operation according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a concatenation unit applied to a convolution operation according to an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of an input buffer unit according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a method for accessing an input buffer unit according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of data written to the input buffer unit in two adjacent convolution operations according to an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a computing device according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a computing device according to still another embodiment of the present application.
  • FIG. 10 is a schematic flow chart of the multiply and accumulate operation in the embodiment of the present application.
  • FIG. 11 is a schematic flow chart of a multiply and accumulate operation according to another embodiment of the present application.
  • FIG. 12 is a schematic diagram of an apparatus applied to a convolutional neural network according to an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of an apparatus applied to a convolutional neural network according to still another embodiment of the present application.
  • FIG. 14 is a schematic diagram of the operation of the weight buffer in the embodiment of the present application.
  • FIG. 15 is a schematic diagram of a broadcast unit applied to a convolution operation in the embodiment of the present application.
  • FIG. 16 is a structural diagram showing the relationship between a multiply-accumulate array, a cascading unit, and a broadcast unit in the embodiment of the present application.
  • The embodiment of the present application proposes a memory access device, whose central idea is that fast address-unaligned block access can be realized by providing an input buffer unit and a cascading unit.
  • The embodiment of the present application further provides a computing device, whose central idea is to introduce a multiplication scheduling unit into the computing device to perform fast multiply-accumulate operations on special data such as -1, 0, and 2^n generated during operation, which can increase the rate and throughput of multiply-accumulate operations. Further, by introducing an addition scheduling unit into the computing device, the computing device operates on multiple multiply-accumulate instructions simultaneously and handles the data dependency between them, where data dependency means that data of one instruction depends on another; for example, the B instruction depends on the operation result of the A instruction.
  • the multiply-accumulate operation is applied to a convolution algorithm, a two-dimensional filtering algorithm, or a finite impulse response (FIR) algorithm.
  • the embodiment of the present application further provides a device applied to a convolutional neural network operation, where the device includes the above-mentioned memory access device and computing device.
  • The device can optimize the convolution operation in terms of both computation and storage. In terms of computation, by providing the computing device it supports fast multiply-accumulate operations on special data such as -1, 0, and 2^n generated during operation, improving the rate and throughput of multiply-accumulate operations.
  • In terms of storage, it provides the input buffer unit and the cascading unit because convolution input data overlaps heavily, that is, has strong data locality; they realize data caching and fast address-unaligned access, which reduces the number of cache accesses and improves the efficiency of address-unaligned access.
  • each device in the embodiment of the present application may be applied to a convolutional neural network operation.
  • Convolutional neural network is a kind of artificial neural network, which has become a research hotspot in the field of speech analysis and image recognition. Its weight-sharing network structure makes it more similar to biological neural networks, reducing the complexity of the network model and reducing the number of weights. This advantage is more obvious when the input of the network is a multi-dimensional image, so that the image can be directly used as the input of the network, avoiding the complicated feature extraction and data reconstruction process in the traditional recognition algorithm.
  • a convolutional network is a multi-layer perceptron specifically designed to recognize two-dimensional shapes that is highly invariant to translation, scaling, tilting, or other forms of deformation.
  • the convolution operation is actually a process of weighted summation. For example, each element in the used image region is multiplied by each element in the convolution kernel, and the sum of all products is used as the new value of the center pixel of the region.
  • a convolution kernel is a matrix of fixed size and consisting of numerical parameters. The reference point of the matrix is usually at the center of the matrix, and the size of the matrix is the size of the kernel.
  • The convolution kernel matrix G performs a dot-product operation with a data block of the same size in the input matrix R to obtain one element of the output matrix O; the convolution kernel then moves across the input matrix with the specified step size, traversing all the data, to obtain the output matrix O.
  • For example, when the convolution kernel size is 3*3 and the convolution kernel movement step is 1, one output element is computed as R1*G1+R2*G2+R3*G3+R4*G4+R5*G5+R6*G6+R7*G7+R8*G8+R9*G9. If a convolution operation is performed on an image, a 3*3 convolution kernel whose reference point is the center of the array can be used.
  • the reference point of the kernel is first positioned at the first pixel of the image, and the remaining elements of the kernel cover its corresponding partial pixel in the image. For each kernel point, we can get the value of this point and the value of the corresponding image point in the image. Multiply and sum these values and place the result at the position corresponding to the input image reference point. This operation is repeated for each point of the image by scanning the convolution kernel over the entire image. Finally, a convolved image of the image can be obtained.
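  • As an illustration of the weighted-sum process described above, the following Python sketch computes a convolution with a 3*3 kernel and a movement step of 1; the image values, kernel values, and function names are examples only and are not part of this application:

        def convolve2d(image, kernel):
            kh, kw = len(kernel), len(kernel[0])                    # kernel size, e.g. 3*3
            oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1    # output size for step 1
            output = [[0] * ow for _ in range(oh)]
            for i in range(oh):
                for j in range(ow):
                    acc = 0
                    for u in range(kh):                             # R1*G1 + R2*G2 + ... + R9*G9
                        for v in range(kw):
                            acc += image[i + u][j + v] * kernel[u][v]
                    output[i][j] = acc                              # new value at the reference point
            return output

        image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
        kernel = [[0, 1, 0], [1, -1, 1], [0, 1, 0]]
        print(convolve2d(image, kernel))                            # 2*2 output matrix O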
  • the convolution calculation accounts for more than 90% of the entire operation, and is the main component of the entire CNN operation.
  • For neural network processors currently applied to convolutional neural network operations, research generally focuses on two aspects: computation and storage.
  • In terms of computation, the design mainly targets application algorithms that are computationally intensive and contain a large amount of sparse data, using dedicated parallel computing paths (such as fast multiply-accumulate circuits) to increase the rate of convolution operations and the computational throughput.
  • In terms of storage, because the application algorithm has strong locality and frequent address-unaligned memory accesses, a dedicated storage path is designed to reduce data transmission and increase data transmission bandwidth.
  • FIG. 2 is a schematic structural diagram of a memory access device 100 according to an embodiment of the present application. As shown in FIG. 2, the memory access device 100 includes:
  • the input buffer unit 110 is configured to cache a data block to be calculated.
  • an input buffer unit when applied to a convolutional neural network operation, can be used to buffer input data for a convolution operation.
  • The cascading unit 120 is connected to the input buffer unit 110. The cascading unit 120 reads the data block to be calculated from the input buffer unit 110, where the data block to be calculated includes a first data block and a second data block; connects the first data block and the second data block end to end to obtain a concatenated data block; and intercepts a third data block from the concatenated data block, where the third data block is a piece of continuous data in the concatenated data block and the length of the third data block is equal to the length of any one data block in the input buffer unit 110.
  • The first data block and the second data block may be data blocks belonging to different storage lines in the input buffer unit, or the first data block and the second data block may be data blocks in the same storage line in the input buffer unit.
  • The length of the first data block and the second data block may each be the length of one vector of data. That is, the cascading unit can quickly obtain one vector-length piece of data from the concatenated data block starting at any address; in other words, the cascading unit can support arbitrary address-unaligned access with a single instruction.
  • Connecting the first data block and the second data block end to end may mean taking the first data block as the high-order part and the second data block as the low-order part and joining them together to obtain the concatenated data block.
  • In this way, the cascading unit can connect the first data block and the second data block read from the input buffer unit end to end to obtain a concatenated data block, and intercept from it a third data block of one data block length starting at any position. Therefore, fast address-unaligned access can be realized by arbitrarily intercepting data from the concatenated data block, improving the efficiency of address-unaligned access.
  • Specifically, two data blocks are first acquired and concatenated to obtain the concatenated data block, and the required third data block is intercepted directly from the concatenated data block without accessing the input buffer unit multiple times. This reduces the number of accesses to the data cache unit, thereby reducing the power consumption of accessing the data cache unit, and reduces the time for address-unaligned access, improving the efficiency of address-unaligned access.
  • data required for address non-aligned access can be generated in one clock cycle.
  • The memory access device 100 further includes a control unit 130. The control unit 130 is connected to the cascading unit 120 and is configured to send a first control instruction to the cascading unit 120, where the first control instruction is used to indicate the interception manner of the concatenated data block; the cascading unit 120 intercepts the third data block from the concatenated data block according to the first control instruction.
  • the control unit 130 may be configured to receive a decoding circuit signal, and generate corresponding control logic to control a unit in the memory access device according to the decoding circuit signal.
  • the first control instruction described above may be used to indicate the manner in which the cascading data block is intercepted.
  • the first control instruction may include first indication information, which may be used to indicate a starting position of the third data block in the concatenated data block.
  • the cascading unit may intercept the third data block from the starting position according to the first indication information.
  • In this way, one vector-length piece of data can be quickly obtained from any two data blocks of the input buffer unit according to the first control instruction; that is, arbitrary address-unaligned access is supported by a single instruction, which reduces the number of address-unaligned access instructions and improves access efficiency.
  • the first indication information may include a data sequence number of a start position of the third data block
  • The first control instruction may further include second indication information, where the second indication information is used to indicate the data format of the data block to be calculated; the cascading unit determines the starting position of the third data block in the concatenated data block according to the data sequence number of the starting position and the data format of the data block to be calculated.
  • the above data format may indicate the width of each element in the data block.
  • the above data sequence number is used to indicate the sequence number of the element in the data block.
  • CAS represents the instruction opcode.
  • TYPE is the data format of the concatenation operation.
  • TYPE may be 8, 16, 32, or 64 bits.
  • TYPE can represent the width of an element in the vector.
  • VRm and VRn respectively represent two vector registers before cascading.
  • Rs represents the starting position of the intercepted block within the concatenated data block, and Rs together with TYPE can determine the starting position and interception length of the intercepted data block.
  • FIG. 3 shows a process in which a cascade unit performs a cascade operation.
  • the cascading unit can read data from the input buffer unit and read the data into a vector register.
  • the unit length of one data block may be equal to the length of the vector register.
  • Data blocks can also be referred to as vector data.
  • the cascading unit needs to cascade the two vector registers VRm and VRn to obtain a vector data of twice the vector length.
  • the vector data stored by the VRm and the VRn respectively correspond to the first data block and the second data block.
  • the cascading unit may determine to intercept the starting position and length of the third data block according to the first indication information.
  • For example, Rs = 4.
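  • The following Python sketch illustrates the assumed behaviour of the CAS operation described above: the two vector registers are connected end to end and one vector-length block is intercepted starting at element Rs (register contents and the Rs value are examples only):

        def cas(vrm, vrn, rs):
            assert len(vrm) == len(vrn)            # both registers hold one vector of data
            concatenated = vrm + vrn               # connect the two registers end to end
            return concatenated[rs:rs + len(vrm)]  # intercept one vector-length third data block

        vrm = ["a0", "a1", "a2", "a3"]             # first data block
        vrn = ["b0", "b1", "b2", "b3"]             # second data block
        print(cas(vrm, vrn, 2))                    # ['a2', 'a3', 'b0', 'b1'], an unaligned block in one step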
  • FIG. 4 shows a schematic diagram of a cascade unit applied to a convolution operation.
  • In this way, one vector-length piece of data can be quickly obtained from any two data blocks of the input buffer unit according to the first control instruction; that is, arbitrary address-unaligned access is supported by a single instruction, which reduces the number of address-unaligned access instructions and improves access efficiency.
  • the access device in the embodiment of the present application can be applied to a convolution operation.
  • The foregoing functions can be completed by setting multiple values of Rs through multiple cascade instructions, reducing the number of accesses to the input cache unit.
  • the input buffer unit can configure the read/write active area through a control register (Control Register, CR).
  • the input buffer unit can include a read port and a write port.
  • the write port is used to write data to the input buffer unit according to the instruction.
  • the read port is used to read data from the input buffer unit according to the instruction.
  • the above read port and write port may be one or more ports, respectively.
  • Each read port or write port can correspond to a control register for storing instructions.
  • the read port or write port performs a read or write operation according to the configuration of the corresponding control register.
  • the input cache unit can support multiple read and write modes.
  • an input cache unit can support a loop self-index or an immediate index.
  • the loop self-index can automatically maintain the pointer I through hardware to determine the location of accessing the input buffer unit.
  • the loop self-indexing can determine the specific address of the access input buffer unit based on the address range, the start address and the step size of the address range.
  • The input buffer unit includes a read port, and the read port is connected to a first control register. The first control register stores first configuration information, and the first configuration information is used to indicate the address range of the data blocks to be read in the input buffer unit, and the start address and step size within that address range. The read port starts from the start address, the addresses of two adjacent read operations grow by the step size, and the data blocks in the address range are read cyclically.
  • the above starting address may also be referred to as a loop start address (for example, represented by start), and the address range may refer to a partial address range in the input buffer unit.
  • the address range can also be referred to as the loop window length (for example, represented by Winlen).
  • the above step size can refer to the address growth step size of the read port each time.
  • the above step size may also be referred to as a cyclic address growth step size (eg, expressed in steps).
  • FIG. 5 is a schematic structural diagram of an input buffer unit according to an embodiment of the present application.
  • the input buffer unit includes two read ports, read port 0 and read port 1.
  • the loop window is 6 cache lines in length.
  • The addresses accessed by read port 0 in the input buffer unit are IB[1], IB[3], IB[5], IB[1], IB[3], ..., corresponding to the data d0, d2, d4, d0, d2, ...; the addresses accessed by read port 1 are IB[2], IB[4], IB[6], IB[2], IB[4], ..., corresponding to the data d1, d3, d5, d1, d3, ....
  • FIG. 6 is a schematic diagram showing a method of accessing an input buffer unit in an embodiment of the present application.
  • Addr indicates the specific address of the read port access input buffer, Start indicates the start address, I indicates the internal pointer, Winlen indicates the length of the loop window, and step indicates the step size.
  • the loop window can be any contiguous range of addresses within the input buffer unit.
  • the read port cyclically reads the data in the loop window according to the step size.
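  • The following Python sketch shows one plausible form of this cyclic self-index addressing (the exact hardware address formula is not reproduced here); the parameter values reproduce the read port 0 and read port 1 patterns of FIG. 5 and are examples only:

        def cyclic_addresses(start, step, winlen, count):
            i, addrs = 0, []
            for _ in range(count):
                addrs.append(start + (i % winlen))   # address stays inside the loop window
                i += step                            # hardware-maintained pointer update
            return addrs

        print(cyclic_addresses(start=1, step=2, winlen=6, count=5))   # [1, 3, 5, 1, 3] -> read port 0
        print(cyclic_addresses(start=2, step=2, winlen=6, count=5))   # [2, 4, 6, 2, 4] -> read port 1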
  • The input buffer unit includes a write port, and the write port is connected to a second control register. The second control register stores second configuration information, and the second configuration information is used to indicate the address range in the input buffer unit in which new data blocks are stored, and the start address and step size within that address range. The write port starts from the start address, the addresses of two adjacent write operations grow by the step size, and new data blocks are written cyclically into the address range.
  • the specific manner of writing data to the write port is similar to the manner in which the read port reads data, and details are not described herein again.
  • The instruction format of the cyclic self-indexed read operation can be expressed as MOV IB[I++], Dest, which means that data is read from the internal self-indexed address of the input buffer unit into the destination register (called Dest) and the self-indexing pointer is updated as I = I + step, where MOV represents data movement, the data direction is input buffer unit → register, and IB represents the input buffer unit.
  • LD represents a data load; the data direction is memory → input buffer unit, and IB represents the input buffer unit.
  • the instruction format of an immediate index read operation may be represented as MOV IB[imm], Dest, indicating that data is read from IB[imm] to dest.
  • the instruction format of the immediate index write operation can be expressed as LD Addr, IB[imm], indicating that data is loaded from the memory Addr address and written to IB[imm].
  • In this way, the input buffer unit supports a cyclic self-index access mode. The control register corresponding to the read port or the write port only needs to store the address range of the data blocks to be accessed, and the start address and step size within that range, for the corresponding data to be accessed. This makes it possible to streamline the instructions of the write port or the read port. In addition, the address range and step size can be configured, which improves the flexibility of accessing data in the input buffer unit.
  • The upper and lower parts of FIG. 7 respectively show how data is written to the input buffer unit for two adjacent convolution operations in the row direction.
  • the write operation to the input buffer unit can be divided into two phases of initialization and update.
  • 2*k vectors are loaded during the initialization phase; due to the overlapping nature of the data, only two vectors need to be loaded during the update phase, and the previous 2k-2 vectors are reused, thereby reducing the number of accesses from the input buffer to system memory.
  • the system memory can be, for example, a Dynamic Random Access Memory (DRAM).
  • For the first convolution operation, the input data is d0 to d5.
  • For the second convolution operation, the input data is d2 to d7.
  • The input data d2 to d5 of the two adjacent convolution operations overlap; therefore, in the second convolution operation, only data d6 and d7 need to be written, overwriting data d0 and d1.
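  • The following Python sketch illustrates this update phase: the input buffer already holds d0 to d5, and only the two new vectors d6 and d7 are written, cyclically overwriting the oldest entries (buffer size and vector names are examples only):

        buffer = ["d0", "d1", "d2", "d3", "d4", "d5"]   # contents after the initialization phase
        write_ptr = 0                                    # cyclic write pointer

        for new_vector in ["d6", "d7"]:                  # update phase: only two new vectors
            buffer[write_ptr] = new_vector               # overwrite the oldest data in place
            write_ptr = (write_ptr + 1) % len(buffer)

        print(buffer)   # ['d6', 'd7', 'd2', 'd3', 'd4', 'd5'] -> d2..d5 are reused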
  • the access device of the embodiment of the present application is described above with reference to FIG. 1 to FIG. 7.
  • the computing device of the embodiment of the present application will be described below with reference to FIG. 8 to FIG.
  • FIG. 8 shows a schematic structural diagram of a computing device 300 of an embodiment of the present application.
  • Computing device 300 can be used for multiply and accumulate operations.
  • computing device 300 can be a multiply-accumulate array or a device that includes a multiply-accumulate array.
  • computing device 300 includes a multiplication buffer unit 310, a multiplication scheduling unit 320, and an addition unit 330,
  • the multiplication buffer unit 310 is configured to buffer a multiply and accumulate instruction to be processed.
  • The multiplication buffer unit 310 may also be referred to as a multiplication buffer (English: Mul_Buffer).
  • the multiply buffer unit can buffer the multiply and accumulate instructions prepared by the operand after decoding.
  • each entry in the multiply cache unit may include 3 fields. The three fields are "instruction type (English: opcode)", “source operand 0 value (English: src0)", “source operand 1 value (src1)”.
  • the depth of the multiplication buffer unit can be set according to the width of the instruction transmission.
  • The multiplication scheduling unit 320 is configured to obtain a first multiply-accumulate instruction from the multiplication buffer unit 310 and, when the source operand of the multiplication operation in the first multiply-accumulate instruction includes an optimizable operand, determine the operation result of the multiplication operation by an optimization operation and send the operation result of the multiplication operation in the first multiply-accumulate instruction directly to the addition unit, where n is an integer greater than or equal to 0, the optimizable operand includes -1, 0, or 2^n, and the optimization operation includes a sign inversion operation, a shift operation, or a cancel operation.
  • the adding unit 330 performs an addition operation in the first multiply-accumulate instruction according to an operation result of the multiplication operation in the first multiply-accumulate instruction, and obtains an operation result of the multiply-accumulate operation corresponding to the first multiply-accumulate instruction .
  • The multiplication scheduling unit 320 may also be referred to as a multiplication scheduler (English: Mul_Scheduler). The multiplication scheduling unit 320 may schedule special data such as -1, 0, and 2^n (n ≥ 0) generated during operation in the multiply-accumulate instruction according to the instruction type and the source operands in the multiplication buffer unit 310. For example, in a specific scheduling process, when a source operand contains 0 the multiplication can be cancelled; when a source operand contains -1 the multiplication result can be obtained by inverting the sign bit; when a source operand contains 2^n the multiplication result can be obtained by a shift operation. The operation result of the multiplication is sent directly to the addition unit 330 so that the addition unit performs the addition operation.
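  • The following Python sketch illustrates these scheduling rules on plain integers; it is a simplified model rather than the hardware implementation (negative powers of two and fixed-width effects are ignored):

        def schedule_multiply(src0, src1):
            for a, b in ((src0, src1), (src1, src0)):
                if a == 0:
                    return 0                             # cancel: anything times 0 is 0
                if a == -1:
                    return -b                            # sign inversion instead of a multiply
                if a > 0 and a & (a - 1) == 0:           # a is a power of two, i.e. 2^n
                    return b << (a.bit_length() - 1)     # shift by n instead of a multiply
            return None                                  # no optimizable operand: use the multiplier

        print(schedule_multiply(-1, 7))    # -7, by sign inversion
        print(schedule_multiply(8, 5))     # 40, by shifting 5 left by 3
        print(schedule_multiply(3, 5))     # None, sent to the multiplier instead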
  • In this way, the calculation result of the multiplication operation is determined by a sign inversion operation or a shift operation and sent directly to the addition unit without using the multiplier, thereby increasing the rate and throughput of the multiply-accumulate operation and reducing the power consumption of the multiply-accumulate operation.
  • The multiplication scheduling unit is configured to schedule, in one clock cycle, multiple multiply-accumulate instructions acquired from the multiplication buffer unit. The multiple multiply-accumulate instructions include a first type multiply-accumulate instruction and at least one second type multiply-accumulate instruction; the source operands of the multiplication operation in the first type multiply-accumulate instruction do not include any of -1, 0, and 2^n, while the source operands of the multiplication operation in the second type multiply-accumulate instruction include -1, 0, or 2^n.
  • the computing device may further include a multiplier connected to the multiplication scheduling unit 320 and the adding unit 330.
  • The multiplication scheduling unit 320 transmits the source operands to the multiplier, the multiplier obtains the operation result of the multiplication operation according to the source operands, and the result is sent to the addition unit 330.
  • the multiplier can process a first type of multiply-accumulate instruction in one clock cycle
  • the multiply scheduling unit can process a plurality of second type multiply-accumulate instructions without using a multiplier in one clock cycle.
  • the computing device in the embodiment of the present application can process multiple multiply and accumulate instructions in one clock cycle, thereby improving the speed and throughput of the multiply and accumulate operations.
  • After the multiplication scheduling unit 320 performs the above scheduling processing on -1, 0, or 2^n (n ≥ 0), the next instruction can be read from the multiplication buffer unit 310 and the scheduling can continue, until the source operands of the multiply-accumulate instruction no longer contain -1, 0, or 2^n (n ≥ 0), or until no multiply-accumulate instruction to be processed remains in the multiplication buffer unit 310.
  • The multiplication scheduling unit 320 may also send a multiply-accumulate instruction to the multiplier; after the multiplier performs the multiplication operation, the obtained result is sent to the addition unit. Therefore, in the embodiment of the present application, the computing device 300 can process multiple multiply-accumulate instructions simultaneously, which improves the rate and throughput of multiply-accumulate operations.
  • The multiplication scheduling unit may sequentially acquire multiply-accumulate instructions from the multiplication buffer unit and schedule them according to the above scheduling method. For example, in one clock cycle, after obtaining a first type multiply-accumulate instruction, the multiplication scheduling unit may send it to the multiplier; when the next instruction obtained by the multiplication scheduling unit is a second type multiply-accumulate instruction, that instruction may be handled by a shift, sign inversion, or cancel operation, and the operation result is then sent directly to the addition unit.
  • The multiplication scheduling unit may then stop acquiring multiply-accumulate instructions from the multiplication buffer unit; the remaining multiply-accumulate instructions are processed in the next clock cycle.
  • the adding unit 330 further includes an adding buffer unit, an adding scheduling unit, an adder, and at least one accumulating register.
  • the addition buffer unit is configured to buffer a source operand for an add operation, the source operand including an operation result of the multiplication operation in the multiply-accumulate instruction to be processed;
  • The addition scheduling unit determines a first source operand and a second source operand of the addition operation of the first multiply-accumulate instruction, where the first source operand and the second source operand correspond to the same target accumulation register, and the second source operand comes from the addition buffer unit or the target accumulation register; the addition scheduling unit uses the adder to sum the first source operand and the second source operand to obtain a summation result; and the addition scheduling unit writes the summation result to the addition buffer unit or the target accumulation register.
  • The multiplication scheduling unit may allocate a corresponding accumulation register tag (English: tag) to the multiply-accumulate instructions.
  • a set of multiply-accumulate instructions corresponds to the same accumulator register.
  • the result of the multiplication operation in the group multiply-accumulate instruction needs to be summed, and the summation result is written into the accumulation register corresponding to the group multiply-accumulate instruction.
  • the first source operand may be the bottommost data of the addition buffer unit (ie, the earliest data entering the addition buffer unit).
  • The choice of the second source operand includes two modes. In the first mode, if the addition buffer unit also stores target data corresponding to the same accumulation register as the first source operand, the target data is used as the second source operand of the addition operation, and the result of the addition operation is written to the addition buffer unit. In the second mode, if the addition buffer unit does not contain data corresponding to the same accumulation register as the first source operand, the data stored in the accumulation register corresponding to the first source operand is used as the second source operand of the addition operation, and the result of the addition operation is written to that accumulation register.
  • In both modes, the first source operand and the second source operand correspond to the same accumulation register.
  • the second source operand may be an operation result of the multiplication operation in the multiply-accumulate instruction, or may be a summation result obtained by summing the operation results of the multiplication operation in the multiply-accumulate instruction.
  • the data stored in the accumulation register may be a multiply-accumulate operation result of the multiply-accumulate instruction in the same group multiply-accumulate instruction as the first multiply-accumulate instruction.
  • The addition buffer unit can simultaneously buffer the multiplication results of multiple multiply-accumulate instructions, and the addition scheduling unit first uses the adder to sum the multiplication results of multiply-accumulate instructions corresponding to the same accumulation register in the addition buffer unit, which reduces the number of accesses to the accumulation register, reduces the pipeline stalls caused by accessing the accumulation register, and improves the rate and throughput of processing multiply-accumulate operations.
  • The multiplication scheduling unit is configured to identify a new target accumulation register for a first group of multiply-accumulate instructions, where the operation results of the multiplication operations in the multiply-accumulate instructions of the first group correspond to the same accumulation register.
  • Identifying a new target accumulation register may mean allocating a new accumulation register tag for the first group of multiply-accumulate instructions.
  • the above-mentioned addition buffer unit may also be referred to as an addition buffer (English: Arithmetic Logic Unit Buffer, ALU_Buffer).
  • the addition buffer unit can be used to cache the multiplication result of the multiply-accumulate instruction.
  • the data of the addition buffer unit may be derived from a multiplication scheduling unit or a multiplier.
  • the depth of the addition buffer unit can be determined according to the width of the instruction transmission.
  • The addition scheduling unit may be referred to as an addition scheduler (English: ALU_Scheduler), which schedules the addition operations of the multiply-accumulate instructions.
  • the addition scheduling unit schedules multiple multiply-accumulate instructions to avoid pipeline stalls caused by data correlation between multiple consecutive multiply-accumulate instructions.
  • The at least one accumulation register may be multiple accumulation registers, and multiple accumulation registers can ensure that multiple groups of multiply-accumulate instructions run simultaneously in the computing device.
  • the number of accumulated registers can be set according to the instruction transmit width.
  • the addition buffer unit caches an operation result of the multiplication operation of the multiply-accumulate instruction to be processed.
  • the multiply-accumulate instruction to be processed may include a plurality of multiply-accumulate instructions.
  • The multiple multiply-accumulate instructions may include at least one group of multiply-accumulate instructions, where the multiplication results of the multiply-accumulate instructions in each group are summed together, each group of multiply-accumulate instructions corresponds to one of the at least one accumulation register, and the summation result of each group is written to the corresponding accumulation register.
  • Before an accumulation register completes its write-back operation, it does not enter the addition pipeline again as the second source operand, which ensures that no data dependency arises between multiple multiply-accumulate instructions within the same group.
  • The addition buffer unit can simultaneously buffer the multiplication results of multiple multiply-accumulate instructions. The addition scheduling unit first uses the adder to sum the multiplication results of all multiply-accumulate instructions in the addition buffer unit that correspond to the same accumulation register, and after obtaining the summation result, sums it with the multiply-accumulate result already in the accumulation register and writes the result back to the accumulation register, thereby reducing the number of accesses to the accumulation register and increasing the rate and throughput of processing multiply-accumulate operations.
  • FIG. 9 shows a computing device 500 of an embodiment of the present application.
  • Computing device 500 can also be referred to as a multiply accumulator.
  • computing device 500 is based on a multiplier (represented by Mul) and an adder (represented by ALU), and may also include the following elements:
  • Multiplication buffer unit (represented by Mul_Buffer): Can be the multiplication buffer unit in Figure 8.
  • Multiply-accumulate instructions whose operands are ready are buffered in the multiplication buffer unit after decoding. Each entry of the multiplication buffer unit includes three fields: instruction type (opcode), source operand 0 value (src0), and source operand 1 value (src1). The depth of the multiplication buffer unit can be set according to the width of the instruction transmission.
  • Addition cache unit (represented by ALU_Buffer): ALU_Buffer caches the multiplication results of the multiply-accumulate instructions; the data in ALU_Buffer can come from Mul_Scheduler or Mul. Similar to Mul_Buffer, the depth of ALU_Buffer can be set according to the width of the instruction transmission.
  • Addition scheduler (represented by ALU_Scheduler): ALU_Scheduler schedules the addition of the multiply-accumulate instruction to avoid pipeline stalls caused by data correlation between multiple consecutive multiply-accumulate instructions.
  • Accumulating register banks (represented by ACC): Multiple multiply accumulate registers ensure that multiple sets of multiply accumulate instructions are simultaneously run in the multiply accumulator. Among them, a set of multiply and accumulate instructions is defined as an instruction sequence containing one MACC instruction. For the MACC instruction, please refer to the following detailed description. The number of accumulating registers can be set according to the instruction transmission width.
  • the computing device 500 can support two multiply-accumulate instructions, which are:
  • computing device 500 can be divided into three steps:
  • ALU_Scheduler schedules the addition operation.
  • FIG. 10 and FIG. 11 are schematic flowcharts showing the multiply and accumulate operations of the embodiment of the present application, respectively.
  • FIG. 10 is a flowchart of the multiplication operation in the multiply and accumulate operation in the embodiment of the present application.
  • Fig. 11 is a flow chart showing the addition operation of the multiply-accumulate operation of the embodiment of the present application. The specific flow of performing the multiply accumulate instruction by the computing device 500 will be described in detail below with reference to FIGS. 9 through 11.
  • The multiply-accumulate instruction is decoded and, after its operands are read, enters Mul_Buffer from the instruction buffer: as long as the instruction buffer contains multiply-accumulate instructions whose two source operands (other than the accumulation register) are ready, these instructions are sent to Mul_Buffer.
  • Mul_Buffer stops accepting multiply-accumulate instructions when: 1) Mul_Buffer is full; or 2) there is currently no decoded multiply-accumulate instruction.
  • When an instruction enters Mul_Scheduler, the data dependency between the multiply-accumulate instruction and other instructions may be considered, and corresponding scheduling performed.
  • the Mul_Buffer buffers a plurality of multiply and accumulate instructions that are ready for operation and decoded.
  • the multiplication result is directly obtained by modifying the sign bit or the shift operation, and the multiplication result is sent to the ALU_Buffer.
  • the accumulating register tag (English: Tag) to be written to, such as Tag(ACC0), indicates that the multiply accumulate instruction finally needs to write the result to the accumulating register ACC0.
  • the result of the multiplication in the multiply-accumulate instruction is buffered in the ALU_Buffer, and these multiplication results need to be added to the accumulating register.
  • the ALU_Scheduler schedules the addition operation.
  • the specific scheduling method is as follows:
  • The data at the bottom of ALU_Buffer (the earliest data to enter ALU_Buffer) is used as the first source operand of the ALU addition, and the second source operand of the ALU addition is selected as follows:
  • If ALU_Buffer contains data with the same accumulation register tag as the first source operand, that data is used as the second source operand of the ALU, the ALU addition result is written back to ALU_Buffer, and the accumulation register tag is retained.
  • If ALU_Buffer does not contain data with the same accumulation register tag as the first source operand, the accumulation register with the same tag as the first source operand is used as the second source operand of the ALU, and the ALU addition result is written back to the corresponding accumulation register.
  • After an accumulation register enters the ALU pipeline as the ALU second source operand, it cannot enter the ALU pipeline as the second source operand again until the accumulation register completes its write-back operation, which ensures that no data dependency arises between multiple multiply-accumulate instructions of the same group that have entered the ALU pipeline.
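  • The following Python sketch models the ALU_Scheduler rule just described; the list entries stand for ALU_Buffer entries (accumulation register tag, multiplication result) and the dictionary stands for the accumulation registers (all names and values are examples only):

        def schedule_add(alu_buffer, acc):
            tag, first = alu_buffer.pop(0)                    # oldest entry: first source operand
            for i, (t, value) in enumerate(alu_buffer):
                if t == tag:                                  # same accumulation register tag found
                    del alu_buffer[i]
                    alu_buffer.append((tag, first + value))   # partial sum stays in ALU_Buffer
                    return
            acc[tag] += first                                 # otherwise sum with ACC and write back

        alu_buffer = [("ACC0", 3), ("ACC1", 4), ("ACC0", 5)]
        acc = {"ACC0": 0, "ACC1": 0}
        while alu_buffer:
            schedule_add(alu_buffer, acc)
        print(acc)   # {'ACC0': 8, 'ACC1': 4}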
  • In this way, the computing device 500 can optimize values such as -1, 0, and 2^n in the multiplication operation; in the addition operation, data in the addition buffer with the same accumulation register tag is summed first and then summed with the multiply-accumulate result in the accumulation register, thereby reducing the number of accesses to the accumulation register, reducing pipeline stalls, and improving the efficiency and throughput of processing multiply-accumulate operations.
  • the memory access device and the computing device of the embodiment of the present application are described above with reference to FIG. 1 to FIG. 11 .
  • The device applied to convolutional neural network operations according to the embodiment of the present application will be described below with reference to FIG. 12 to FIG. 15.
  • FIG. 12 is a schematic diagram of an apparatus 700 applied to a convolutional neural network in accordance with an embodiment of the present application.
  • the device 700 applied to the convolutional neural network operation includes the memory access device 710 and the computing device 720 in the embodiment of the present application.
  • The memory access device 710 may be any of the memory access devices in the embodiments of the present application.
  • the computing device 720 may be any computing device in the embodiment of the present application.
  • The device applied to convolutional neural network operations includes a memory access device, which may be the memory access device described above. Therefore, fast address-unaligned access can be realized by intercepting data from the concatenated data block, thereby improving the efficiency of address-unaligned access.
  • The computing device included in the convolutional neural network computing device determines the operation result of the multiplication operation by a sign inversion operation or a shift operation when the source operand of the multiplication operation in the first multiply-accumulate instruction includes -1 or 2^n, and sends it directly to the addition unit without using the multiplier, thereby increasing the rate and throughput of the multiply-accumulate operation and reducing the power consumption of the multiply-accumulate operation.
  • FIG. 13 is a schematic structural diagram of a device 800 according to an embodiment of the present application.
  • Device 800 can be applied to convolutional neural network operations.
  • The memory access device 710 of FIG. 12 may include the input buffer 830 and the cascading unit 850 of FIG. 13. Further, the memory access device 710 may also include the control unit 810, the weight buffer 840, and the broadcast unit 860. The computing device 720 of FIG. 12 may include the fast multiply-accumulate array 870 of FIG. 13.
  • Specifically, as shown in FIG. 13, the device 800 includes:
  • Control Unit (CU) 810: receives the decoding circuit signal and generates the corresponding control logic to control the entire system.
  • Memory 820: stores the input data, the weight data, and the final convolution result. The memory 820 may be system memory, for example a DRAM.
  • Input Buffer (IB) 830: connected to the control unit 810, the memory 820, and the cascading unit 850; it buffers the input data of the convolution operation according to the parallel computation direction of the convolution and supports two access modes, cyclic self-indexing and immediate indexing.
  • The input buffer may be the input buffer unit described in FIG. 2 to FIG. 7; for example, it may be a cache.
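The cyclic self-indexing mode can be illustrated with a small model of a read port. The address formula Addr = Start + I % WinLen, with the pointer advancing by Step after every read, follows the scheme described earlier in this application; the class and field names are illustrative.

```python
class ReadPort:
    """Cyclic self-indexing read port of the input buffer (software model)."""

    def __init__(self, buffer, start, win_len, step):
        self.buffer = buffer      # the input buffer, modeled as a list of vector rows
        self.start = start        # start address of the circular window
        self.win_len = win_len    # length of the circular window
        self.step = step          # address increment between two consecutive reads
        self.i = 0                # internal self-index pointer I

    def read(self):
        addr = self.start + self.i % self.win_len   # Addr = Start + I % WinLen
        self.i += self.step                          # I = I + Step
        return self.buffer[addr]

# With start=1, win_len=6, step=2 the port cycles through addresses 1, 3, 5, 1, 3, ...
ib = ["--", "d0", "d1", "d2", "d3", "d4", "d5"]      # IB[1]..IB[6] hold d0..d5
port = ReadPort(ib, start=1, win_len=6, step=2)
print([port.read() for _ in range(5)])               # ['d0', 'd2', 'd4', 'd0', 'd2']
```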
  • Weight Buffer (WB) 840: connected to the control unit 810, the memory 820, and the broadcast unit 860; it buffers the weight data of the convolution operation according to the parallel computation direction of the convolution. The weight buffer can be a cache.
  • Cascading Unit (CaU) 850: concatenates two vectors and extracts the appropriate segment from them to generate new vector data for the convolution (multiply-accumulate) operation, reducing the number of input buffer accesses when the same pair of storage rows is accessed repeatedly. The cascading unit may be the cascading unit described in FIG. 2 to FIG. 7.
  • Broadcasting Unit (BU) 860: broadcasts a single weight value of the convolution kernel to generate vector data.
  • Multiply-Accumulate Array (MAC Array, MACA) 870: performs the multiply-accumulate operation on the input data and the convolution weights using the algorithm and control scheduling method described above. The multiply-accumulate array 870 may be the computing device described above, for example the computing device 500 or the computing device 700.
  • Partial-Sum Buffer (PB) 880: buffers the multiply-accumulate results generated by the fast multiply-accumulate array 870 and, according to the decoded control signal generated by the control unit 810, either outputs the data in the partial-sum buffer 880 back to the fast multiply-accumulate array for accumulation with new multiplication results, or outputs it to the memory 820 as the final convolution result.
  • The input buffer 830 can be used to read the input data of each convolution operation. For its structure, reference may be made to FIG. 5. Assume the convolution kernel size is 3*3 and the processor parallelism PS is 4. The shaded portion in FIG. 5 indicates all the data involved in one column-direction parallel convolution (four convolution operations performed simultaneously in the column direction), and the dotted-line frame marks the position of the convolution kernel for the first convolution operation. It can be seen that such a convolution operation involves 6 vectors, denoted d0 to d5 in the column direction, and these 6 vectors are stored in the input buffer unit (that is, the input buffer 830).
  • FIG. 14 is a schematic diagram showing the operation of the weight buffer in the embodiment of the present application.
  • The weight buffer caches the convolution kernel weights according to the parallel direction of the convolution operation: if the operations are parallel in the row direction, the weights are stored in row sequence; if they are parallel in the column direction, they are stored in column sequence. The weight buffer has 1 write port and 1 read port, and its depth can be set flexibly.
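A hypothetical helper can make the two storage orders concrete; the function name, the string arguments, and the square-kernel assumption are illustrative, not part of the original text.

```python
def buffer_weights(kernel, parallel_direction):
    """Linearize a k x k convolution kernel into the weight buffer in the
    order required by the convolution's parallel direction."""
    if parallel_direction == "row":
        return [w for row in kernel for w in row]                 # row sequence
    return [kernel[r][c]                                          # column sequence
            for c in range(len(kernel[0]))
            for r in range(len(kernel))]

k = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
print(buffer_weights(k, "row"))     # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(buffer_weights(k, "column"))  # [1, 4, 7, 2, 5, 8, 3, 6, 9]
```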
  • The cascading unit 850 can splice two vector registers end to end and extract one vector's worth of consecutive values from the spliced double-length register. For the application of the cascading unit 850 to the convolution operation, reference may be made to the description of FIG. 4; details are not repeated here.
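The operation reduces to a concatenate-and-slice, as in the sketch below. Treating each vector register as a Python list of PS elements and letting the offset run from the first register into the second is an assumption made for illustration.

```python
def cascade(vrm, vrn, rs):
    """CAS-style operation: concatenate two vector registers end to end and
    cut out one vector-length window starting at element offset rs.
    rs = 0 is the aligned case; rs = 1 or 2 gives the unaligned accesses
    needed when a 3x3 kernel walks across two storage rows."""
    assert len(vrm) == len(vrn)
    ps = len(vrm)
    concatenated = list(vrm) + list(vrn)    # the 2x-length concatenated data block
    return concatenated[rs:rs + ps]         # one vector-length slice at offset rs

# PS = 8 (e.g. a 256-bit vector of 32-bit elements); rs = 4 takes the last
# four elements of the first register and the first four of the second.
print(cascade(list(range(8)), list(range(8, 16)), 4))   # [4, 5, 6, 7, 8, 9, 10, 11]
```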
  • Figure 15 shows a schematic diagram of a broadcast unit applied to a convolution operation.
  • As shown in FIG. 15, the broadcast unit replicates a single element of a vector register into vector form; for a convolution operation, each weight element of the convolution kernel is broadcast into a vector. The instruction format may be "VRt = BRO.TYPE VRm, Rs", where BRO is the opcode, TYPE indicates the data format and may be 8, 16, 32 or 64 bits (that is, the width of one element in the vector), and VRm denotes a vector register. In FIG. 15, as an example, the broadcast unit broadcasts element number 4 of the vector register to form vector data.
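In software the broadcast amounts to a single replication, as in this short sketch; the helper name and the fixed vector length are illustrative.

```python
def broadcast(vrm, rs, ps):
    """BRO-style operation: replicate element rs of vector register vrm into
    a vector of ps identical elements, e.g. one convolution-kernel weight
    replicated so it can multiply a whole input vector at once."""
    return [vrm[rs]] * ps

# Example in the spirit of FIG. 15: broadcasting element number 4 of an 8-element vector.
weights = [3, 1, 4, 1, 5, 9, 2, 6]
print(broadcast(weights, 4, 8))   # [5, 5, 5, 5, 5, 5, 5, 5]
```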
  • The multiply-accumulate array 870 may include PS multiply-accumulators, where PS denotes the processor's parallel granularity.
  • FIG. 16 shows the relationship between the multiply-accumulate array 870, the cascading unit and the broadcast unit. As shown in FIG. 16, the multiply-accumulate array 870 receives the input data generated by the cascading unit 850 and the convolution kernel weight data generated by the broadcast unit 860, and performs the multiply-accumulate operation on them. It optimizes special values such as -1, 0 and 2^n in the multiplication; since convolution operations usually contain a large number of these special values, the speed of the multiply-accumulate operation is improved. It can also resolve data dependencies by itself in hardware, and the accumulation register value can be read out with a dedicated instruction; for the specific structure of the multiply-accumulators, reference may be made to the computing devices described above.
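One cycle of that dataflow can be modeled as below. The sketch assumes, for illustration, that a cycle pairs one cascaded input vector with one broadcast weight and that the partial sums live in a plain list; the function name is hypothetical.

```python
def mac_array_step(input_vec, weight, partial_sum):
    """One cycle of the multiply-accumulate array (software model): each of
    the PS lanes multiplies one element of the cascaded input vector by the
    broadcast weight and accumulates into its partial sum."""
    if weight == 0:
        return partial_sum                 # a zero weight is cancelled outright
    for lane, x in enumerate(input_vec):
        # the hardware would apply the -1 / 2**n shortcuts here instead of a full multiply
        partial_sum[lane] += x * weight
    return partial_sum

# Example: 4 parallel lanes (PS = 4) consuming one broadcast weight.
print(mac_array_step([1, 2, 3, 4], 3, [0, 0, 0, 0]))   # [3, 6, 9, 12]
```

Iterating this step over all kernel weights, with the cascading unit supplying the matching (possibly unaligned) input vector each time, accumulates the partial sums that are finally flushed to the partial-sum buffer 880.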
  • The device 800 provided by this embodiment of the present application can improve the convolution speed and throughput: the input buffer caches reused input data, reducing the number of accesses to the slower memory; the cascading unit generates vector data that spans storage rows, avoiding frequent accesses to the input buffer; and the multiply-accumulators perform fast multiplication for special values such as -1, 0 and 2^n and handle data dependencies automatically.
  • The terms "system" and "network" are used interchangeably herein. The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist: for example, A and/or B may mean that only A exists, that both A and B exist, or that only B exists. The character "/" herein generally indicates an "or" relationship between the objects before and after it.
  • It should be understood that "B corresponding to A" means that B is associated with A and that B can be determined according to A. However, determining B according to A does not mean that B is determined based only on A; B may also be determined based on A and/or other information.
  • In the several embodiments provided in this application, it should be understood that the disclosed systems, devices and methods may be implemented in other manners. The device embodiments described above are merely illustrative; for example, the division into units is only a division by logical function, and in actual implementation there may be other division manners: multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in another form. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the objectives of the embodiments of the present application.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as a standalone product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of this application essentially, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Complex Calculations (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

一种访存设备、计算设备和应用于卷积神经网络运算的设备,以提高访存效率和计算运算吞吐量,同时降低计算功耗。包括:输入缓存单元,用于缓存待计算的数据块;级联单元,与输入缓存单元相连,级联单元从输入缓存单元中读取待计算的数据块,待计算的数据块包括第一数据块和第二数据块;将第一数据块和第二数据块首尾相连,得到级联数据块;从级联数据块中截取第三数据块,第三数据块包含级联数据块中的一段连续的数据,且第三数据块的长度与输入缓存单元中的数据块的长度相等。

Description

访存设备、计算设备和应用于卷积神经网络运算的设备 技术领域
本申请涉及计算机领域,尤其涉及计算机领域中的访存设备、计算设备和应用于卷积神经网络运算的设备。
背景技术
卷积神经网络(convolutional neural network,CNN)是深度学习中使用最广泛的算法,它广泛应用于图像分类、语音识别、视频理解、人脸检测等多种应用中。针对神经网络数据密集性的计算特点,卷积神经网络运算通常采用定制的神经网络处理器。近年来,神经网络处理器成为学术界和工业界的研究热点。
对于当前的神经网络处理器,概括而言,其研究方向主要包含计算和存储两方面。其中在计算方面,卷积运算的核心是乘累加运算。卷积运算中通常包含大量的-1,0,2n等特殊数据,这些数据占用了很大一部分的计算资源。但是-1,0,2n等特殊数据是在运行时产生的,而编译器只能进行静态优化,不能对运行中的数据进行优化。导致计算的速率和吞吐量较低。
在存储方面,由于卷积算法的数据局部性强,所以存在频繁地址非对齐访存。而在对缓存进行地址非对齐访问时,缓存需要同时访问连续的两个访存块,且经过复杂的地址译码、数据选通、旋转移位等多个操作,功耗较高,同时难以在一个时钟周期内产生所需要的访问数据。
发明内容
本申请提供了一种访存设备、计算设备和应用于卷积神经网络运算的设备,以提高访存效率和计算运算吞吐量,同时降低计算功耗。
第一方面,提供了一种访存设备,包括:输入缓存单元,用于缓存待计算的数据块;级联单元,与所述输入缓存单元相连,所述级联单元用于从所述输入缓存单元中读取所述待计算的数据块,所述待计算的数据块包括第一数据块和第二数据块;将所述第一数据块和所述第二数据块首尾相连,得到级联数据块;从所述级联数据块中截取第三数据块,所述第三数据块包含所述级联数据块中的一段连续的数据,且所述第三数据块的长度与所述输入缓 存单元中的数据块的长度相等。
级联单元可以将从输入缓存单元中读取的第一数据块和第二数据块首尾相连,得到级联数据块。并从级联数据中截取任意起始位置的一个数据块长度的第三数据块。从而能够通过任意截取级联数据块中数据的方法实现快速的地址非对齐访问,提高了地址非对齐访问的效率。
在一种可能的实现方式中,所述访存设备还包括:控制单元,所述控制单元与所述级联单元相连,用于向所述级联单元发送第一控制指令,所述第一控制指令用于指示所述级联数据块的截取方式;所述级联单元根据所述第一控制指令,从所述级联数据块中截取所述第三数据块。
在本申请实施例中,可以根据第一控制指令从输入缓存单元的两个数据块中按照任意起始地址快速取得一个向量长度数据,即通过一条指令支持任意地址非对齐访问,能够精简地址非对齐访问的指令,提高访存效率。
在一种可能的实现方式中,所述第一控制指令包含第一指示信息,所述第一指示信息用于指示所述第三数据块在所述级联数据块中的起始位置。
在一种可能的实现方式中,所述第一指示信息包含所述第三数据块的起始位置的数据序号,所述第一控制指令还包括第二指示信息,所述第二指示信息用于指示所述待计算的数据块的数据格式;所述级联设备根据所述数据序号以及所述数据格式,确定所述第三数据块在所述级联数据块中的起始位置。
在一种可能的实现方式中,所述输入缓存单元包括读端口,所述读端口与第一控制寄存器相连,所述第一控制寄存器存储有第一配置信息,所述第一配置信息用于指示所述输入缓存单元中的待读取数据块的地址范围、在所述地址范围内的起始地址和步长,所述读端口从所述起始地址开始,以所述步长为相邻两次读操作的地址增长步长,循环读取所述地址范围内的数据块。
在一种可能的实现方式中,所述输入缓存单元包括写端口,所述写端口与第二控制寄存器相连,所述第二控制寄存器存储有第二配置信息,所述第二配置信息用于指示所述输入缓存单元中的存储新的数据块的地址范围、在所述地址范围的起始地址和步长,所述写端口从所述起始地址开始,以所述步长为相邻两次写操作的地址增长步长,将新的数据块循环写入所述地址范围中。
在本申请实施例中,读端口或写端口对应的控制寄存器只需存储待读取 数据块的地址范围、在所述地址范围的起始地址和步长,便可以访问相应的数据。从而能够精简写端口或读端口的指令。进一步地,在循环自索引的访问方式下,可以配置访问数据的地址范围和步长,提高了访问输入缓存单元的数据的灵活性。
第二方面,提供了在一种计算设备,所述计算设备包括乘法缓存单元、乘法调度单元和加法单元,所述乘法缓存单元用于缓存待处理的乘累加指令;所述乘法调度单元用于从所述乘法缓存单元获取第一乘累加指令,当所述第一乘累加指令中的乘法运算的源操作数包括可优化操作数时,通过优化操作确定所述乘法运算的运算结果,并将所述第一乘累加指令中的乘法运算的运算结果直接发送至所述加法单元,n为大于等于0的整数,所述可优化操作数包括-1或2n,所述优化操作包括符号取反操作或移位操作;所述加法单元根据所述第一乘累加指令中的乘法运算的运算结果,执行所述第一乘累加指令中的加法运算,得到所述第一乘累加指令对应的乘累加运算的运算结果。
在本申请实施例中,计算设备在第一乘累加指令中的乘法运算的源操作数包含-1或2n时,在通过符号取反操作或移位操作确定所述乘法运算的运算结果,直接发送给加法单元,而无需通过乘法器进行乘法运算,从而提高乘累加运算的速率和吞吐量以及降低乘累加运算的功耗。
在一种可能的实现方式中,所述乘法调度单元用于在一个时钟周期内调度从所述乘法缓存单元获取的多个乘累加指令,所述多个乘累加指令包含一个第一类型乘累加指令和至少一个第二类型乘累加指令,所述第一类型乘累加指令中的乘法运算的源操作数不包括-1,0和2n中的任一项,所述第二类型乘累加指令中的乘法运算的源操作数包括-1,0或2n
本申请实施例中的计算设备可以在一个时钟周期内处理多条乘累加指令,从而提高了乘累加运算的速度和吞吐量。
在一种可能的实现方式中,所述加法单元还包括加法缓存单元,加法调度单元、加法器和至少一个累加寄存器,所述加法缓存单元用于缓存用于加法运算的源操作数,所述源操作数包括所述待处理的乘累加指令中的乘法运算的运算结果;所述加法调度单元确定所述第一乘累加指令的加法运算的第一源操作数和第二源操作数,其中,所述第一源操作数与所述第二源操作数对应相同的目标累加寄存器,所述第二源操作数来自所述加法缓存单元或所述目标累加寄存器;所述加法调度单元对所述第一源操作数和第二源操作数 进行求和,得到求和结果;所述加法调度单元将所述求和结果写入所述加法缓存单元或所述目标累加寄存器。
在本申请实施例中,加法调度单元利用加法器对加法缓存单元中的对应相同累加寄存器的乘累加指令的乘法运算的运算求和,从而能够减少了访问累加寄存器次数,减少访问累加寄存器产生的流水线停顿,提高了处理乘累加运算的速率和吞吐量。
在一种可能的实现方式中,当所述加法缓存单元存储有对应于所述目标累加寄存器的目标数据时,所述加法调度单元将所述目标数据确定为所述第二源操作数,并将所述求和结果写入所述加法缓存单元;当所述加法缓存单元未存储所述目标数据时,所述加法调度单元将所述目标累加寄存器存储的乘累加结果作为所述第二源操作数,并将所述求和结果写入所述目标累加寄存器。
在本申请实施例中,加法调度单元利用加法器首先对加法缓存单元中的对应相同累加寄存器的乘累加指令的乘法运算的运算求和,从而能够减少了访问累加寄存器次数,减少访问累加寄存器产生的流水线停顿,提高了处理乘累加运算的速率和吞吐量。
在一种可能的实现方式中,当所述第一乘累加指令是第一组乘累加指令中的第一个乘累加指令时,所述乘法调度单元用于为所述第一组乘累加指令标识新的目标累加寄存器,所述第一组乘累加指令中的乘累加指令中的乘法运算的运算结果对应相同的累加寄存器。
第三方面,提供了一种应用于卷积神经网络运算的设备,包括如第一方面或第一方面中任一种可能的实现方式中的访存设备以及包括如第二方面或第二方面中任一种可能的实现方式中的计算设备。
附图说明
图1是本申请实施例的卷积神经网络运算的过程示意图。
图2是本申请实施例的访存设备的结构示意图。
图3是本申请实施例的级联单元进行级联运算的过程示意图。
图4是本申请实施例的级联单元应用于卷积运算的示意图。
图5是本申请实施例的输入缓存单元的结构示意图。
图6是本申请实施例的访问输入缓存单元的方法的示意图。
图7是本申请实施例的输入缓存单元相邻两次卷积运算写数据的示意图。
图8是本申请实施例的计算设备的示意性结构图。
图9是本申请又一实施例的计算设备的示意性结构图。
图10是本申请实施例的乘累加运算的流程示意图。
图11是本申请另一实施例的乘累加运算的流程示意图。
图12是本申请实施例的应用于卷积神经网络的设备的示意图。
图13是本申请又一实施例的应用于卷积神经网络的设备的结构示意图。
图14是本申请实施例中的权值缓冲区的工作示意图。
图15是本申请实施例中的广播单元应用于卷积运算的示意图。
图16是本申请实施例中的乘累加阵列与级联单元、广播单元的关系结构图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述。
针对上述提出的神经网络设备访问地址非对齐数据操作复杂、效率较低的问题,本申请实施例提出了一种访存设备,其中心思想在于能通过设置输入缓存单元和级联单元实现快速地址非对齐的数据块访问。
本申请实施例还提出了一种计算设备,其中心思想在于通过在计算设备中引入乘法调度单元,其用于对运行中产生的-1,0,2n等特殊数据进行快速乘累加运算,能够提高乘累加运算的速率和吞吐率。进一步地,通过在计算设备中引入加法调度单元,实现计算设备同时对多条乘累加指令进行操作,以及自行处理多条乘累加指令之间的数据相关,这里的数据相关是指指令间的数据存在依赖性,例如需要B指令需要依赖于A指令的运算结果。其中,乘累加运算应用于卷积算法、二维滤波算法或有限脉冲响应(Finite impulse response,FIR)算法等。
本申请实施例还提出了一种应用于卷积神经网络运算的设备,该设备包含上述的访存设备和计算设备。该设备能够从计算和存储两方面对卷积运算的过程进行优化,在计算方面,其通过设置计算设备对运行中产生的-1,0,2n等特殊数据进行快速乘累加运算,能够提高乘累加运算的速率和吞吐率。在存储方面,其通过设置访存设备,针对卷积运算数据重叠,即数据局部性强 的特点,设置输入缓存单元以及级联单元,实现数据缓存和快速的地址非对齐访问的功能。从而减少了访问缓存的次数,以及提高了地址非对齐访存效率。
可选地,本申请实施例中的各设备可以应用于卷积神经网络运算。为了便于理解,下文首先介绍卷积神经网络和应用于卷积神经网络运算的设备。卷积神经网络是人工神经网络的一种,已成为当前语音分析和图像识别领域的研究热点。它的权值共享网络结构使之更类似于生物神经网络,降低了网络模型的复杂度,减少了权值的数量。该优点在网络的输入是多维图像时表现的更为明显,使图像可以直接作为网络的输入,避免了传统识别算法中复杂的特征提取和数据重建过程。卷积网络是为识别二维形状而特殊设计的一个多层感知器,这种网络结构对平移、比例缩放、倾斜或者其他形式的变形具有高度不变性。
为了便于理解,首先结合图1对卷积计算的过程进行简要介绍。卷积运算,其实就是加权求和的过程,例如,使用到的图像区域中的每个元素分别与卷积核中的每个元素对应相乘,所有乘积之和作为区域中心像素的新值。卷积核即一个大小固定、由数值参数构成的矩阵,矩阵的参考点通常位于矩阵的中心,矩阵的大小为核尺寸。如图1所示,卷积核矩阵G与输入矩阵R中相同大小的数据块进行点积运算,得到输出矩阵O中的一个的计算结果;然后卷积核以指定的移动步长在输入矩阵中不断移动,遍历所有的数据,得到输出矩阵O。其中,卷积核尺寸为3*3,卷积核移动步长为1,
G5=R1*G1+R2*G2+R3*G3+R4*G4+R5*G5+R6*G6+R7*G7+R8*G8+R9*G9,如果对一幅图像进行卷积运算,可利用以数组为核的参考点的3*3卷积核。首先将核的参考点定位于图像的第一个像素点,核的其余元素覆盖图像中其对应的局部像素点。对于每一个核点,我们可以得到这个点的值以及图像中对应图像点的值。将这些值相乘并求和,并将这个结果放在与输入图像参考点所对应的位置。通过在整个图像上扫描卷积核,对图像的每个点重复此操作。最终可以得到图像的卷积图像。
在CNN网络中,卷积计算占整个运算90%以上的运算量,是整个CNN运算的主要组成部分。
对于当前应用于卷积神经网络运算的神经网络处理器,概括而言,其研究方向主要包含计算和存储两方面。其中在计算方面,主要是针对应用算法 计算密集型且包含大量稀疏数据的特点,设计专用并行计算通路(如快速乘累加电路),提高卷积运算的速率和计算吞吐量。在存储方面,则针对应用算法数据局部性强、存在频繁地址非对齐访存等特点,设计专用存储通路,减少数据传输,增加数据传输带宽。
下文结合图2,首先介绍本申请实施例的访存设备。图2示出了本申请实施例的访存设备100的结构示意图。如图2所示,访存设备100包括:
输入缓存单元110,用于缓存待计算的数据块。
例如,在应用于卷积神经网络运算时,输入缓存单元可以用于缓存卷积运算的输入数据。
级联单元120,与所述输入缓存单元110相连,所述级联单元120从所述输入缓存单元110中读取所述待计算的数据块,所述待计算的数据块包括第一数据块和第二数据块;将所述第一数据块和所述第二数据块首尾相连,得到级联数据块;从所述级联数据块中截取第三数据块,所述第三数据块包含所述级联数据块中的一段连续的数据,且所述第三数据块的长度与所述输入缓存单元110中的任意一个数据块的长度相等。
可选地,上述第一数据块和第二数据块可以是属于输入缓存单元中不同存储行的数据块,或者,上述第一数据块和第二数据块也可以是输入缓存单元中相同存储行的数据块。上述第一数据块和第二数据块的长度可以是一个向量数据的长度。即级联单元可以从级联数据块中按照任意起始地址快速取得一个向量长度数据。或者说级联单元可以根据一条指令支持任意的地址非对齐访问。
其中,上述第一数据块和第二数据块首尾相连,可以指将第一数据块作为高位,第二数据块作为低位,连接在一起,得到级联数据块。
在本申请实施例中,级联单元可以将从输入缓存单元中读取的第一数据块和第二数据块首尾相连,得到级联数据块。并从级联数据中截取任意起始位置的一个数据块长度的第三数据块。从而能够通过任意截取级联数据块中数据的方法实现快速的地址非对齐访问,提高了地址非对齐访问的效率。
本申请实施例中,对于需要多次访问输入缓存单元相同的两个数据块中的不同数据时,可以首先获取两个数据块,然后将两个数据块进行级联后得到级联数据块,并直接从级联后的级联数据块中截取所需的第三数据块,无需多次访问输入缓存单元,减少了访问数据缓存单元的次数,从而减少了访 问数据缓存单元的功耗,以及减少了地址非对齐访问的时间,提高了地址非对齐访问的效率。
例如,在本申请实施例中,可以在一个时钟周期内产生地址非对齐访问所需的数据。
可选地,访存设备100还包括控制单元130,所述控制单元130与所述级联单元120相连,用于向所述级联单元120发送第一控制指令,所述第一控制指令用于指示所述级联数据块的截取方式;所述级联单元120根据所述第一控制指令,从所述级联数据块中截取所述第三数据块。
其中,上述控制单元130可以用于接收译码电路信号,并根据译码电路信号产生相应的控制逻辑控制访存设备中的单元。
上述第一控制指令可以用于指示截取级联数据块的方式。例如,第一控制指令可以包含第一指示信息,该第一指示信息可以用于指示第三数据块在级联数据块中的起始位置。级联单元可以根据第一指示信息,从该起始位置开始截取所述第三数据块。
在本申请实施例中,可以根据第一控制指令从输入缓存单元的两个数据块中按照任意起始地址快速取得一个向量长度数据,即通过一条指令支持任意地址非对齐访问,能够精简地址非对齐访问的指令,提高访存效率。
又例如,所述第一指示信息可以包含所述第三数据块的起始位置的数据序号,所述第一控制指令还可以包括第二指示信息,所述第二指示信息用于确定所述待计算的数据块的数据格式,所述级联设备根据所述起始位置的数据序号以及所述待计算的数据块的数据格式,确定所述第三数据块在所述级联数据块中的起始位置。其中,上述数据格式可以指示数据块中的每个元素的宽度。上述数据序号用于指示数据块中的元素的序号。
作为一个示例,该指令可以表示为“VRt=CAS.TYPE(VRm,VRn),Rs”。其中,其中,CAS表示为指令操作码,TYPE表示级联运算的数据格式,例如,TYPE可以是8、16、32、64、比特(英文:bits)。或者,TYPE可以表示向量中一个元素的宽度。VRm、VRn分别表示级联前的两个向量寄存器。Rs表示对级联后的数据块截取的初始位置,Rs可以配合TYPE,确定截取数据块的起始位置和截取长度。
图3示出了级联单元进行级联运算的过程。级联单元可以从输入缓存单元读取数据,并将数据读入向量寄存器中。在本申请实施例中,一个数据块 的单位长度可以等于向量寄存器的长度。数据块也可以称为向量数据。根据该指令,级联单元需要对VRm和VRn这两个向量寄存器进行级联,得到一个两倍向量长度的向量数据。其中,VRm和VRn储存的向量数据分别对应第一数据块和第二数据块。接下来,级联单元可以根据第一指示信息,确定截取第三数据块的起始位置和长度。例如,当RS=4时,表示从级联数据块的第4个单位长度的元素开始截取一个向量长度的第三数据块。具体地,假定向量长度是256bits,TYPE是32bits的。定义处理器并行粒度PS(Parallelism Size),表示共享局部存储单元的同类运算单元的数量。例如,可以是单指令多数据流(Single Instruction Multiple Data,SIMD)或向量处理器中算数逻辑单元(arithmetic and logic unit,ALU)运算单元数量,或者图像处理器(Graphics Processing Unit,GPU)中的多流处理器(streaming multiprocessor,SM)中流处理器(streaming processor,SP)的数量。可以根据TYPE确定PS的大小,用公式表示为:PS=256/32=8。即一个向量数据包括8个单位长度的元素。或者说一个向量寄存器包含8个单位长度。则级联后的数据块包括16个单位长度。当Rs=4时,表示级联单元把级联产生的16个单位的向量从地址4开始截取8个单位的元素,以得到新的向量,即得到第三数据块。
作为一个示例,图4示出了级联单元应用于卷积运算的示意图。对于3*3的卷积核,需要在列方向上连续访问这个跨存储行的向量数据时,只需把Rs分别设置成0、1、2。其中Rs=0为地址对齐访问,Rs=1或2为地址非对齐访问。
在本申请实施例中,可以根据第一控制指令从输入缓存单元的两个数据块中按照任意起始地址快速取得一个向量长度数据,即通过一条指令支持任意地址非对齐访问,能够精简地址非对齐访问的指令,提高访存效率。
本申请实施例的访问设备可以应用于卷积运算,尤其对于多次跨行访问相同的两个存储行时,可以通过多条级联指令,设置不同的Rs即可以完成上述功能,而不需要多次访问输入缓存单元。
可选地,输入缓存单元可以通过控制寄存器(Control Register,CR)对读写有效区域进行配置。输入缓存单元可以包括读端口和写端口。写端口用于根据指令将数据写入输入缓存单元。读端口用于根据指令将数据从输入缓存单元中读取出来。上述读端口和写端口可以分别是一个或多个端口。每个 读端口或写端口可以对应一个用于存储指令的控制寄存器。读端口或写端口根据对应的控制寄存器的配置进行读操作或写操作。
输入缓存单元可以支持多种读写方式。例如,输入缓存单元可以支持循环自索引或立即数索引。其中,循环自索引可以通过硬件自动维护指针I,确定访问输入缓存单元的位置。例如,循环自索引可以根据地址范围、在所述地址范围的起始地址和步长确定访问输入缓存单元的具体地址。
例如,以读端口为例,所述输入缓存单元包括读端口,所述读端口与第一控制寄存器相连,所述第一控制寄存器存储有第一配置信息,所述第一配置信息用于指示所述输入缓存单元中的待读取数据块地址范围、在所述地址范围内的起始地址和步长,所述读端口从所述起始地址开始,以所述步长为相邻两次读操作的地址增长步长,循环读取所述地址范围内的数据块。
其中,上述起始地址也可以称为通过循环起始地址(例如,用start表示),上述地址范围可以指输入缓存单元中的部分地址范围。地址范围也可以称为循环窗口长度(例如,用Winlen表示)。上述步长可以指读端口每次读取的地址增长步长。例如,上述步长也可以称为循环地址增长步长(例如,用step表示)。
例如,图5示出了本申请实施例的输入缓存单元的结构示意图。如图5所示,假设输入缓存单元包括读端口0和读端口1两个读端口。循环窗口长度为6个缓存行。循环自索引起始地址为1,读端口0需要读数据d0/d2/d4,读端口1需要读数据d1/d3/d5,那么对读端口0和1分别配置为“start=1,WinLen=6,Step=2”,“start=2,WinLen=6,Step=2”。这样在连续的时钟周期内,读端口0访问输入缓存单元的地址为“IB[1]-IB[3]-IB[5]-IB[1]-IB[3]…”,对应访问的数据为“d0-d2-d4-d0-d2…”;读端口1访问输入缓存单元的地址为“IB[2]-IB[4]-IB[6]-IB[2]-IB[4]…”,对应访问的数据为“d1-d3-d5-d1-d3…”。
图6示出了本申请实施例的访问输入缓存单元的方法的示意图。如图6所示,访问输入缓存单元的方式可以表示为Addr=Start+I%WinLen,其中I=I+step,%表示取余。Addr表示读端口访问输入缓存器的具体地址,Start表示起始地址,I表示内部指针,Winlen表示循环窗口的长度,step表示步长。循环窗口可以是输入缓存单元内部的任一部分连续的地址范围。读端口根据步长依次循环读取循环窗口内的数据。
又例如,以写端口为例,所述输入缓存单元包括写端口,所述写端口与第二控制寄存器相连,所述第二控制寄存器存储有第二配置信息,所述第二配置信息用于指示所述输入缓存单元中的存储新的数据块的地址范围、在所述地址范围的起始地址和步长,所述写端口从所述起始地址,以所述步长为相邻两次写操作的地址增长步长,将新的数据块循环写入所述地址范围中。其中,写端口写入数据的具体方式与读端口读取数据的方式相似,此处不再赘述。
作为一个示例,循环自索引读操作的指令格式可以表示为:MOV IB[I++],Dest,表示从输入缓存单元的内部自索引地址处读出数据到目的寄存器(可以称为Dest),同时更新自索引指针I=I+step,其中,MOV表示数据搬运,数据方向是数据缓存单元→寄存器。IB表示输入缓存单元。
循环自索引写操作的指令格式可以表示为LD Addr,IB[I++],表示从存储器的Addr地址处加载数据,写往输入缓存单元内部自索引地址处,同时更新自索引指针I=I+Step,其中,LD表示数据搬运,数据方向是存储器→输入缓存单元,IB表示输入缓存单元。
作为一个示例,立即数索引的读操作的指令格式可以表示为MOV IB[imm],Dest,表示从IB[imm]处读取数据到dest。
立即数索引的写操作的指令格式可以表示为LD Addr,IB[imm],表示从存储器Addr地址处加载数据,写往IB[imm]处。
在本申请实施例中,输入缓存单元支持循环自索引的访问方式,在该访问方式下,读端口或写端口对应的控制寄存器只需存储待读取数据块的地址范围、在所述地址范围的起始地址和步长,便可以访问相应的数据。从而能够精简写端口或读端口的指令。进一步地,在循环自索引的访问方式下,可以配置访问数据的地址范围和步长,提高了访问输入缓存单元的数据的灵活性。
可选地,如图7所示,图7中的上下两部分别示出了输入缓存单元在行方向上相邻两次卷积运算写数据的示意图。从图7中可以看出,当卷积核在输入行数据行方向上移动时,相邻两次卷积运算输入数据具有重叠部分。可以对输入缓存单元的写操作分成初始化和更新两个阶段。对于k*k的卷积核,在初始化阶段加载2*k个向量;由于数据的重叠特性,更新阶段则只需要加载2个向量,重复利用之前的2k-2个向量,由此可以减少输入缓冲区访问 ***内存的次数。***内存例如可以是动态随机存储器(Dynamic Random Access Memory,DRAM)。
例如,在图7中,在第一次卷积运算时,输入数据为d0~d5。第二次卷积运算时,随着卷积核沿着行方向移动,输入数据为d2~d7。两次相邻的卷积运算的输入数据d3~d5是重叠的。因此,在第二次卷积运算时,只需写入数据d6和d7,并覆盖数据d0和d1。
上文结合图1至图7介绍了本申请实施例的访存设备,下文将结合图8至图11介绍本申请实施例的计算设备。
图8示出了本申请实施例的计算设备300的示意性结构图。计算设备300可以用于乘累加运算。例如,计算设备300可以是乘累加阵列或包含乘累加阵列的设备。如图8所示,计算设备300包括乘法缓存单元310、乘法调度单元320和加法单元330,
所述乘法缓存单元310用于缓存待处理的乘累加指令。
可选地,乘法缓存单元310也可以称为乘法缓存单元(英文:Mul_Buffer)。乘法缓存单元可以缓存译码之后操作数准备好的乘累加指令。可选地,乘法缓存单元中的每个表项可以包括3个域。3个域分别为“指令类型(英文:opcode)”,“源操作数0值(英文:src0)”,“源操作数1值(src1)”。乘法缓存单元的深度可以根据指令发射的宽度设置。
所述乘法调度单元320用于从所述乘法缓存单元310获取第一乘累加指令,当所述第一乘累加指令中的乘法运算的源操作数包括可优化操作数时,通过优化操作确定所述乘法运算的运算结果,并将所述第一乘累加指令中的乘法运算的运算结果直接发送至所述加法单元,n为大于等于0的整数,所述可优化操作数包括-1、0或2n,所述优化操作包括符号取反操作、移位操作或取消操作。
所述加法单元330根据所述第一乘累加指令中的乘法运算的运算结果,执行所述第一乘累加指令中的加法运算,得到所述第一乘累加指令对应的乘累加运算的运算结果。
上述乘法调度单元320也可以称作乘法调度器(英文:Mul_Scheduler),乘法调度单元320可以根据乘法缓存单元310中的指令类型和源操作数对乘累加指令中的乘法运算-1,0,2n(n≥0)等运行中产生的特殊数据进行调度。例如,在具体调度过程中,对于源操作数包含0的情况,可以取消乘法结果。 对于源操作数包含-1的情况,可以通过修改符号位取反获得乘法结果。对于源操作数包含2n的情况下,可以通过移位操作获得乘法运算的结果,并将上述乘法运算的运算结果直接发送至加法单元330,以便于加法单元执行加法操作。
在本申请实施例中,计算设备在第一乘累加指令中的乘法运算的源操作数包含-1或2n时,在通过符号取反操作或移位操作确定所述乘法运算的运算结果,直接发送给加法单元,而无需通过乘法器进行乘法运算,从而提高乘累加运算的速率和吞吐量以及降低乘累加运算的功耗。
可选地,所述乘法调度单元用于在一个时钟周期内调度从所述乘法缓存单元获取的多个乘累加指令,所述多个乘累加指令包含一个第一类型乘累加指令和至少一个第二类型乘累加指令,所述第一类型乘累加指令中的乘法运算的源操作数不包括-1,0和2n中的任一项,所述第二类型乘累加指令中的乘法运算的源操作数包括-1,0或2n
可选地,上述计算设备还可以包括乘法器,乘法器与乘法调度单元320以及加法单元330相连。当第一乘累加指令中的乘法运算的源操作数中不包含-1,0,或2n时,乘法调度单元320将源操作数发送至乘法器,乘法器根据源操作数获得乘法运算的运算结果,并将该运算结果输送至加法单元330。
应理解,乘法器在一个时钟周期内可以处理一条第一类型乘累加指令,而乘法调度单元在一个时钟周期内可以不利用乘法器处理多条第二类型乘累加指令。
本申请实施例中的计算设备可以在一个时钟周期内处理多条乘累加指令,从而提高了乘累加运算的速度和吞吐量。
可选地,当乘累加调度指令的源操作数包含-1,0或2n(n≥0)时,乘法调度单元320在对-1,0或2n(n≥0)进行上述调度处理的时候,可以继续从乘法缓存单元310读取下一条指令,并继续进行上述调度。直至乘累加指令的源操作数不包含-1,0或2n(n≥0),或者直至乘法缓存单元310中不包含待处理的乘累加指令。在乘累加指令的源操作数不包含-1,0或2n(n≥0)时,乘法调度单元320可以将该乘累加指令发送给乘法器,由乘法器进行乘法运算的处理之后,将获取的乘法运算的运算结果发送给加法单元。因此,本申请实施例中,计算设备300可以同时处理多条乘累加指令,提高了乘累加运算的速率和吞吐量。
可选地,在一个时钟周期内,乘法调度单元可以顺序从乘法缓存单元获取乘累加指令,并根据上述调度方法对乘累加指令进行调度。例如,在一个时钟周期内,乘法调度单元在获取到第一类型乘累加指令后,可以将该第一类型乘累加指令发送至乘法器,乘法调度单元获取的下一条指令为第二类型乘累加指令时,可以将该第二类型乘累加指令进行移位、取反操作或取消操作之后,将运算结果直接发送至加法单元。若乘法调度单元获取的下一条指令还是第一类型乘累加指令时,由于乘法器已经在处理第一类型乘累加指令,则乘法调度单元可以停止从乘法缓存单元获取乘累加指令。直至下一个时钟周期,再开始处理乘累加指令。
可选地,上述加法单元330还包括加法缓存单元,加法调度单元、加法器和至少一个累加寄存器。所述加法缓存单元用于缓存用于加法运算的源操作数,所述源操作数包括所述待处理的乘累加指令中的乘法运算的运算结果;所述加法调度单元确定所述第一乘累加指令的加法运算的第一源操作数和第二源操作数,其中,所述第一源操作数与所述第二源操作数对应相同的目标累加寄存器,所述第二源操作数来自所述加法缓存单元或所述目标累加寄存器;所述加法调度单元利用所述加法器对所述第一源操作数和第二源操作数进行求和,得到求和结果;所述加法调度单元将所述求和结果写入所述加法缓存单元或所述目标累加寄存器。
可选地,乘法调度单元可以对乘累加指令分配对应的累加寄存器标签(英文;tag)。一组乘累加指令对应于相同的累加寄存器。该组乘累加指令中的乘法运算的运算结果需要进行求和,并将求和结果写入该组乘累加指令对应的累加寄存器中。
可选地,第一源操作数可以是加法缓存单元最底部的数据(即进入加法缓冲单元最早的数据)。第二源操作数的选择包括两种方式。第一种方式中,若加法缓存单元中还存储有与第一源操作数对应相同累加寄存器的目标数据,则可以将该目标数据作为加法运算的第二源操作数,并将加法运算的运算结果写往加法缓存单元。第二种方式中,若加法缓存单元不包含与第一源操作数对应相同累加寄存器的数据,则将第一源操作数对应的累加寄存器存储的数据作为加法运算的第二源操作数,并将加法运算的运算结果写往累加寄存器。
可选地,在第一种方式中,当把加法运算的运算结果写回加法缓存单元 时,可以保留该运算结果对应的累加寄存器的标签,以便于该运算结果再次作为加法运算的源操作数时,确定其对应的累加寄存器,即该加法运算的运算结果与第一源操作数、第二源操作数对应相同的累加寄存器。并且第二源操作数可以是乘累加指令中乘法运算的运算结果,也可以是乘累加指令中的乘法运算的运算结果求和后的求和结果。
在第二种方式中,存储于累加寄存器中的数据可以是与第一乘累加指令属于同一组乘累加指令中的乘累加指令的乘累加运算结果。
在本申请实施例中,由于加法缓存单元可以同时缓存多个乘累加指令的乘法运算的结果,并且加法调度单元利用加法器首先对加法缓存单元中的对应相同累加寄存器的乘累加指令的乘法运算的运算求和,从而能够减少了访问累加寄存器次数,减少访问累加寄存器产生的流水线停顿,提高了处理乘累加运算的速率和吞吐量。
可选地,当所述第一乘累加指令是第一组乘累加指令中的第一个乘累加指令时,所述乘法调度单元用于为所述第一乘累加指令标识新的目标累加寄存器,所述第一组乘累加指令中的乘累加指令中的乘法运算的运算结果对应相同的累加寄存器。其中,上述标识新的目标累加寄存器,可以是为第一组乘累加指令分配新的累加寄存器标签。
作为一个示例,上述加法缓存单元也可以称为加法缓冲器(英文:Arithmetic Logic Unit Buffer,ALU_Buffer)。加法缓存单元可以用于缓存乘累加指令的乘法结果。加法缓存单元的数据可以来源于乘法调度单元或乘法器。加法缓存单元的深度可以根据指令发射的宽度确定。
作为一个示例,上述加法调度单元可以称为加法调度器(英文:ALU_scheduler)对乘累加指令的加法运算进行调度。加法调度单元通过对多条乘累加指令进行调度,以避免多条连续乘累加指令之间数据相关引起的流水线停顿。
作为一个示例,上述至少一个累加寄存器可以是多个累加寄存器,上述多个乘累加寄存器可以保证在该计算设备中同时运行多组乘累加指令。累加寄存器的数目可以根据指令发射宽度设置。
可选地,上述加法缓存单元缓存待处理的乘累加指令的乘法运算的运算结果。待处理的乘累加指令可以包括多个乘累加指令。该多个乘累加指令可以包含至少一组乘累加指令。其中,至少一组乘累加指令中的每组乘累加指 令的乘法运算的结果用于求和,每组乘累加指令对应于至少一个累加寄存器中的一个累加寄存器。每组乘累加指令的求和结果用于写入对应的累加寄存器。
可选地,在累加寄存器存储的数据充当加法运算的第二源操作数进入加法运算流水线时,在累加寄存器未完成写回操作之前,该累加寄存器不再作为第二源操作数进入加法运算的流水线,以保证同一组内部的多条乘累加指令之间不发生数据相关。
应理解,在现有技术中,当乘累加运算设备在乘累加指令存在数据相关时,需要流水线停顿。且现有技术中的加法单元中在求和时,每处理一个目标乘累加指令的乘累加运算,都需要将对应的累加寄存器内的乘累加运算结果读取到加法器中,与该条目标乘累加指令的乘法运算的结果进行求和,并将得到的求和结果作为更新的乘累加运算的运算结果存入乘累加寄存器中。该乘累加寄存器在没有完成写回操作之前,不能再次充当第二源操作数进入加法器的流水线,从而导致乘累加运算的速率和吞吐量较低。
在本申请实施例中,由于加法缓存单元可以同时缓存多个乘累加指令的乘法运算的结果,并且加法调度单元利用加法器首先对加法缓存单元中的对应同一累加寄存器的乘累加指令的乘法运算的运算求和,在对加法缓存单元中的对应同一累加寄存器的所有乘累加指令进行求和,并在获得求和结果之后,再与累加寄存器中的乘累加结果求和,并将结果写回累加寄存器,从而能够减少访问累加寄存器的次数,提高了处理乘累加运算的速率和吞吐量。
作为一个具体示例,图9示出了本申请实施例的计算设备500。计算设备500也可以称为乘累加器。如图9所示,计算设备500以乘法器(用Mul表示)和加法器(用ALU表示)为基础,还可以包括以下各单元:
乘法缓存单元(用Mul_Buffer表示):可以是图8中的乘法缓存单元。乘法缓存单元中缓存了译码之后操作数准备好的乘累加指令,乘法缓存单元的每个表项包含{“指令类型opcode”,“源操作数0值src0”,“源操作数1值src1”}3个域,乘法缓存单元的深度可以根据指令发射的宽度自行设置。
乘法调度器(用Mul_Scheduler表示):可以是图8中的乘法调度单元。根据Mul_Buffer中的指令类型和源操作数数据对乘累加指令中乘法运算-1/0/2n(n>=0)等特殊值进行调度,调度完成的指令可以发往Mul和Mul流水线后面的加法缓存单元(用ALU_Buffer表示)。
加法缓存单元(用ALU_Buffer表示):ALU_Buffer中缓存乘累加指令中的乘法结果,ALU_Buffer的数据可以来自Mul_Scheduler和Mul。同Mul_Buffer类似,ALU_Buffer的深度可以根据指令发射的宽度自行设定。
加法调度器(用ALU_Scheduler表示):ALU_Scheduler对乘累加指令的加法运算进行调度,避免多条连续乘累加指令之间数据相关引起的流水线停顿。
累加寄存器组(用ACC表示):多个乘累加寄存器保证在该乘累加器中同时运行多组乘累加指令。其中,一组乘累加指令定义为包含1个MACC指令的指令序列,关于MACC指令请参见后文的详细介绍,累加寄存器的数目可以根据指令发射宽度自行设定。
通过增加以上硬件逻辑,该计算设备500可以支持两种乘累加指令,它们分别为:
1)正常乘累加指令,定义为“MAC X,Y”。其功能为把输入数据X和Y进行乘法运算,并把该乘法结果与当前的累加寄存器进行相加,最后把加法结果写回至乘累加寄存器,即ACC+=X*Y.
2)设置乘累加寄存器初值为零的乘累加指令,定义为“MACC X,Y”。其功能为把输入数据X和Y进行乘法运算,把乘法结果写回至累加寄存器(相当于先设定乘累加寄存器初值为0,然后进行乘累加运算),即ACC=X*Y。
计算设备500的工作可以分为三个步骤:
1)译码且读操作数完成之后的乘累加指令进入Mul_Buffer;
2)Mul_Scheduler对乘法运算进行调度;
3)ALU_Scheduler对加法运算进行调度。
图10和图11分别示出了本申请实施例的乘累加运算的流程示意图。其中图10是本申请实施例的乘累加运算中乘法运算的流程图。图11是本申请实施例的乘累加运算的加法运算的流程图。下文将结合图9至图11,详细介绍利用计算设备500执行乘累加指令的具体流程。
a.译码且读操作数完成之后的乘累加指令进入Mul_Buffer
如图10所示,乘累加指令经译码、读操作数之后由指令缓冲区进入Mul_Scheduler。指令在进入Mul_Scheduler时,只要指令缓冲区存在乘累加指令,且这些乘累加指令都准备好了除累加寄存器外的两个源操作数,这些 指令就发往Mul_Buffer。Mul_Bufffer停止接受乘累加指令的条件为:1)Mul_Buffer已满;2)当前没有译码完成的乘累加指令。
可选地,指令在进入Mul_Scheduler时,可以考虑乘累加指令与其他指令之间的数据相关,并进行相应的调度。
经过本步骤之后,Mul_Buffer中缓存了多条操作数准备好、译码完成的乘累加指令。
b.Mul_Scheduler对乘法运算进行调度。
如图10所示,Mul_Scheduler对Mul_Buffer中乘累加指令的乘法运算进行调度。其用于对乘累加指令的源操作数进行判断,并根据判断结果确定调度方式。其主要包括两种情况:第一种情况是乘累加指令的源操作数含有-1/0/2n(n>=0);第二种情况是是乘累加指令的源操作数不含有-1/0/2n(n>=0)。这两种情况的调度方式分别如下所述。
1.若乘累加指令的源操作数含有-1/0/2n(n>=0)时,对该指令进行如下处理,并继续从Mul_Buffer中取下一条指令进行判断。
1)若源操作数含有0,直接把该指令取消掉。
2)若源操作数含有-1//2n(n>=0),通过修改符号位或移位操作直接得到乘法结果,并把该乘法结果发往ALU_Buffer。同时标记需要写往的累加寄存器标签(英文:Tag),如Tag(ACC0),表示该条乘累加指令最终需要把结果写往累加寄存器ACC0。
2.若指令源操作数不含-1/0/2n(n>=0)时,该指令正常发往Mul,标记需要写往的累加寄存器Tag。继续从Mul_Buffer中取指令进行判断,如源操作数含有-1/0/2n(n>=0)则重复步骤1,直到乘累加指令的乘法源操作数不含有-1/0/2n(n>=0)为止。
正常的进入Mul流水线的MAC/MACC指令,经过乘法器的流水线延时,最终把乘法运算结果写往ALU_Buffer,并标记需要写往的累加寄存器Tag。对于MACC指令,设定相应Tag的累加寄存器值为0。
经过该步骤之后,ALU_Buffer中缓存了多条乘累加指令中乘法运算的结果,这些乘法结果需要与累加寄存器进行加法运算。
c.ALU_Scheduler对加法运算进行调度
如图11所示,ALU_Scheduler对加法运算进行调度。具体调度方法如下所示:
以ALU_Buffer最底部(进入ALU_Buffer最早)的数据作为ALU加法运算第一源操作数,ALU加法运算第二源操作数按照如下方法选择:
1.若ALU_Buffer中含有与第一源操作数具有相同累加寄存器Tag的数据,则把该数据作为ALU第二源操作数,ALU加法运算结果写回ALU_Buffer,保留累加寄存器Tag。
2.若ALU_Buffer中不含有与第一源操作数具有相同累加寄存器Tag的数据,则把与第一源操作数相同标示Tag的累加寄存器作为ALU第二源操作数,ALU加法运算结果写回到对应的累加寄存器。
当累加寄存器充当ALU第二源操作数进入ALU流水线时,由于产生数据相关,在该累加寄存器没有完成写回操作之前,该累加寄存器不能再次充当第二源操作数而进入ALU流水线,这保证了同一组内部的多条乘累加指令之间不会发生数据相关。
在本申请实施例中,计算设备500可以对乘法运算中的-1、0、2n等数值值进行优化处理,对加法运算中的操作时,首先将加法缓存单元的具有相同累加寄存器标签的数据求和,然后再与累加寄存器中的乘累加结果求和,从而能够减少访问累加寄存器的次数,进而减少了流水线卡顿,提高了处理乘累加运算的效率和吞吐量。
可选地,上文结合图1至图11介绍了本申请实施例的访存设备和计算设备,下文将结合图12至图15,介绍本申请实施例的应用于卷积神经网络运算的设备。
图12是本申请实施例的应用于卷积神经网络的设备700的示意图。如图12所示,应用于卷积神经网络运算的设备700包括本申请实施例中访存设备710和计算设备720。其中,访存设备710可以是本申请实施例中的任一访存设备,计算设备720可以是本申请实施例中的任一计算设备。
在本申请实施例中,应用于卷积神经网络运算的设备包括访存单元,访存单元可以。从而能够通过截取级联数据块中数据的方法实现快速的地址非对齐访问,提高了地址非对齐访问的效率。并且卷积神经网络运算设备包含的计算设备在第一乘累加指令中的乘法运算的源操作数包含-1或2n时,在通过符号取反操作或移位操作确定所述乘法运算的运算结果,直接发送给加法单元,而无需通过乘法器进行乘法运算,从而提高乘累加运算的速率和吞吐量以及降低乘累加运算的功耗。
作为一个具体实施例,图13示出了本申请实施例的设备800的结构示意图。设备800可以应用于卷积神经网络运算。其中图7中的访存设备710可以包括图13中的输入缓存区830,级联单元850。进一步地,访存设备710还可以包括控制单元810,权值缓冲区840,广播单元860。图7中的计算设备720可以包括图13中的快速乘累加阵列870。具体地,如图13所示,设备800包括:
控制单元(CU,Control Unit)810:接收译码电路信号,并产生相应的控制逻辑控制整个***。
存储器(Memory)820:存储输入数据、权值数据以及最终的卷积结果。其中,存储器820可以是***内存,例如,存储器820可以是DRAM。
输入缓冲区(Input Buffer,IB)830:与控制单元810、存储器820、级联单元850相连,按照卷积的并行计算方向,对卷积运算的输入数据进行缓冲,可以支持循环自索引或立即数索引两种访问方式。其中,输入缓冲区可以是图2至图7中所述的输入缓存单元。例如,输入缓冲区可以是缓存。
权值缓冲区(Weight Buffer,WB)840:与控制单元810、存储器820、广播单元820相连,按照卷积的并行计算方向,对卷积运算的权值数据进行缓冲。例如,权值缓冲区可以是缓存。
级联单元(Cascading Unit,CaU)850:对两个向量进行级联,并从两个向量中截取合适的位段,产生新的向量数据,用于卷积(乘累加)运算,以减少多次访问相同的跨存储行时对输入缓冲区的访问次数。其中,级联单元可以是图2至图7中所述的级联单元。
广播单元(Broadcasting Unit,BU)860:对卷积核的单个权值数据进行广播,产生向量数据。
乘累加阵列(MAC Array,MACA)870:采用算法和控制调度方法,对输入数据和卷积权值进行乘累加运算。其中,乘累加阵列870可以是上述计算设备。例如,乘累加阵列可以是上述计算设备500或计算设备700。
部分和缓冲区(Partial-Sum Buffer,PB,)880:缓存快速乘累加阵列870产生的乘累加结果,根据控制单元810产生的译码控制信号,把部分和缓冲区880中的数据输出至快速乘累加阵列,以用于与新的乘法结果进行累加操作;或输出至存储器820,以作为最终的卷积结果。
可选地,输入缓存区830可以用于读取每次卷次运算时的输入数据。输 入缓存830的结构示意图可以参考图5所示的输入缓存单元的结构示意图。如图5所示,假设卷积核大小为3*3,处理器并行度PS=4。图5中的阴影部分表示一个列方向上卷积并行运算(在列方向上同时执行4个卷积操作)所有相关的数据,其中虚线框内表示的第一个卷积运算卷积核的位置。可以看出,一个卷积运算与6个向量相关,在列方向表示为d0~d5,分别把这6个向量存放在输入缓存单元(即相当于输入缓存区830)中。
图14示出了本申请实施例中的权值缓冲区的工作示意图。如图14所示,可选地,权值缓冲区根据卷积运算的并行方向,对卷积核权值进行缓存。如果在行方向并行,则按照行序列进行存储;如果在行方向并行,则按照列序列进行存储。权值缓冲区设置为1个写端口,1个读端口,缓冲区的深度可以灵活设置。
可选地,级联单元850可以把两个向量寄存器首尾相连拼接,并在拼接后的2x长度向量寄存器中截取连续的一个单位的向量值。级联单元850应用于卷积运算的示意图可以参考图4的相关内容的描述,此处不再赘述。
图15示出了广播单元应用于卷积运算的示意图。如图15所示,广播单元把向量寄存器中的单个元素广播乘向量的形式,对于卷积运算,则把卷积核中的每个权值元素广播成向量形式。其指令格式可以为“VRt=BRO.TYPE VRm,Rs”,其中BRO表示为操作码,TYPE表示级联运算的数据格式,例如,TYPE可以是8、16、32、64、比特(英文:bits)。或者,TYPE可以表示向量中一个元素的宽度。VRm表示向量寄存器。在图15中,作为一个示例,广播单元把向量寄存器中的4号元素进行广播,形成向量数据。
可选地,乘累加阵列870可以包括PS个乘累加器。PS表示处理器并行粒度。图16示出了乘累加阵列870与级联单元、广播单元的关系结构图。如图16所示,乘累加阵列870接收级联单元850产生的输入数据,以及广播单元860产生的卷积核权值数据进行乘累加运算。它对乘法运算中的-1/0/2n等特殊值进行优化处理,由于卷积运算中通常包含大量的-1/0/2n特殊值,因此能提高乘累加运算的速度。同时,它可以通过硬件自行处理数据相关,并通过专用的指令读出累加寄存器的值。乘累加阵列870包括的乘累加器的具体结构可以参考图7至图11中的计算设备的相关内容,此处不再赘述。
本申请实施例提供的设备800,能够提高卷积运算速度和吞吐量,其输 入缓冲区可以缓存重复使用的输入数据,减少访问慢速存储器的次数;级联单元可以产生跨存储行的向量数据,避免频繁访问输入缓冲区;乘累加器对乘法运算器中-1、0、2n等特殊值进行快速乘法运算,并能自动处理数据相关。
另外,本文中术语“***”和“网络”在本文中常被可互换使用。本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。
应理解,在本申请实施例中,“与A相应的B”表示B与A相关联,根据A可以确定B。但还应理解,根据A确定B并不意味着仅仅根据A确定B,还可以根据A和/或其它信息确定B。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,上述描述的***、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的***、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,该单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个***,或一些特征可以忽略,或不执行。另外,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口、装置或单元的间接耦合或通信连接,也可以是电的,机械的或其它的形式连接。
该作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或 者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本申请实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以是两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
该集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分,或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例该方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。
以上某一实施例中的技术特征和描述,为了使申请文件简洁清楚,可以理解应用于其他实施例,在其他实施例不再一一赘述。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。

Claims (12)

  1. 一种访存设备,其特征在于,包括:
    输入缓存单元,用于缓存待计算的数据块;
    级联单元,与所述输入缓存单元相连,所述级联单元用于从所述输入缓存单元中读取所述待计算的数据块,所述待计算的数据块包括第一数据块和第二数据块;将所述第一数据块和所述第二数据块首尾相连,得到级联数据块;从所述级联数据块中截取第三数据块,所述第三数据块包含所述级联数据块中的一段连续的数据,且所述第三数据块的长度与所述输入缓存单元中的数据块的长度相等。
  2. 如权利要求1所述的访存设备,其特征在于,所述访存设备还包括:
    控制单元,所述控制单元与所述级联单元相连,用于向所述级联单元发送第一控制指令,所述第一控制指令用于指示所述级联数据块的截取方式;
    所述级联单元根据所述第一控制指令,从所述级联数据块中截取所述第三数据块。
  3. 如权利要求2所述的访存设备,其特征在于,所述第一控制指令包含第一指示信息,所述第一指示信息用于指示所述第三数据块在所述级联数据块中的起始位置。
  4. 如权利要求3所述的访存设备,其特征在于,所述第一指示信息包含所述第三数据块的起始位置的数据序号,所述第一控制指令还包括第二指示信息,所述第二指示信息用于指示所述待计算的数据块的数据格式;
    所述级联设备根据所述数据序号以及所述数据格式,确定所述第三数据块在所述级联数据块中的起始位置。
  5. 如权利要求1-4中任一项所述的访存设备,其特征在于,所述输入缓存单元包括读端口,所述读端口与第一控制寄存器相连,所述第一控制寄存器存储有第一配置信息,所述第一配置信息用于指示所述输入缓存单元中的待读取数据块的地址范围、在所述地址范围内的起始地址和步长,所述读端口从所述起始地址开始,以所述步长为相邻两次读操作的地址增长步长,循环读取所述地址范围内的数据块。
  6. 如权利要求1-4中任一项所述的访存设备,其特征在于,所述输入缓存单元包括写端口,所述写端口与第二控制寄存器相连,所述第二控制寄存器存储有第二配置信息,所述第二配置信息用于指示所述输入缓存单元中 的存储新的数据块的地址范围、在所述地址范围的起始地址和步长,所述写端口从所述起始地址开始,以所述步长为相邻两次写操作的地址增长步长,将新的数据块循环写入所述地址范围中。
  7. 一种计算设备,其特征在于,所述计算设备包括乘法缓存单元、乘法调度单元和加法单元,
    所述乘法缓存单元用于缓存待处理的乘累加指令;
    所述乘法调度单元用于从所述乘法缓存单元获取第一乘累加指令,当所述第一乘累加指令中的乘法运算的源操作数包括可优化操作数时,通过优化操作确定所述乘法运算的运算结果,并将所述第一乘累加指令中的乘法运算的运算结果直接发送至所述加法单元,n为大于等于0的整数,所述可优化操作数包括-1或2n,所述优化操作包括符号取反操作或移位操作;
    所述加法单元根据所述第一乘累加指令中的乘法运算的运算结果,执行所述第一乘累加指令中的加法运算,得到所述第一乘累加指令对应的乘累加运算的运算结果。
  8. 如权利要求7所述的计算设备,其特征在于,所述乘法调度单元用于在一个时钟周期内调度从所述乘法缓存单元获取的多个乘累加指令,所述多个乘累加指令包含一个第一类型乘累加指令和至少一个第二类型乘累加指令,所述第一类型乘累加指令中的乘法运算的源操作数不包括-1,0和2n中的任一项,所述第二类型乘累加指令中的乘法运算的源操作数包括-1,0或2n
  9. 如权利要求7或8所述的计算设备,其特征在于,所述加法单元还包括加法缓存单元,加法调度单元、加法器和至少一个累加寄存器,
    所述加法缓存单元用于缓存用于加法运算的源操作数,所述源操作数包括所述待处理的乘累加指令中的乘法运算的运算结果;
    所述加法调度单元确定所述第一乘累加指令的加法运算的第一源操作数和第二源操作数,其中,所述第一源操作数与所述第二源操作数对应相同的目标累加寄存器,所述第二源操作数来自所述加法缓存单元或所述目标累加寄存器;
    所述加法调度单元对所述第一源操作数和第二源操作数进行求和,得到求和结果;
    所述加法调度单元将所述求和结果写入所述加法缓存单元或所述目标 累加寄存器。
  10. 如权利要求9所述的计算设备,其特征在于,当所述加法缓存单元存储有对应于所述目标累加寄存器的目标数据时,所述加法调度单元将所述目标数据确定为所述第二源操作数,并将所述求和结果写入所述加法缓存单元;当所述加法缓存单元未存储所述目标数据时,所述加法调度单元将所述目标累加寄存器存储的乘累加结果作为所述第二源操作数,并将所述求和结果写入所述目标累加寄存器。
  11. 如权利要求9或10所述的计算设备,其特征在于,当所述第一乘累加指令是第一组乘累加指令中的第一个乘累加指令时,所述乘法调度单元用于为所述第一组乘累加指令标识新的目标累加寄存器,所述第一组乘累加指令中的乘累加指令中的乘法运算的运算结果对应相同的累加寄存器。
  12. 一种应用于卷积神经网络运算的设备,包括如权利要求1至权利要求6中任一项所述的访存设备,以及如权利要求7至权利要求11中任一项所述的计算设备。
PCT/CN2016/110436 2016-12-16 2016-12-16 访存设备、计算设备和应用于卷积神经网络运算的设备 WO2018107476A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2016/110436 WO2018107476A1 (zh) 2016-12-16 2016-12-16 访存设备、计算设备和应用于卷积神经网络运算的设备
CN201680091648.1A CN110073329B (zh) 2016-12-16 2016-12-16 访存设备、计算设备和应用于卷积神经网络运算的设备

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/110436 WO2018107476A1 (zh) 2016-12-16 2016-12-16 访存设备、计算设备和应用于卷积神经网络运算的设备

Publications (1)

Publication Number Publication Date
WO2018107476A1 true WO2018107476A1 (zh) 2018-06-21

Family

ID=62557794

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/110436 WO2018107476A1 (zh) 2016-12-16 2016-12-16 访存设备、计算设备和应用于卷积神经网络运算的设备

Country Status (2)

Country Link
CN (1) CN110073329B (zh)
WO (1) WO2018107476A1 (zh)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109062611A (zh) * 2018-02-05 2018-12-21 上海寒武纪信息科技有限公司 神经网络处理装置及其执行向量缩放指令的方法
CN110780921A (zh) * 2019-08-30 2020-02-11 腾讯科技(深圳)有限公司 数据处理方法和装置、存储介质及电子装置
CN110991619A (zh) * 2019-12-09 2020-04-10 Oppo广东移动通信有限公司 神经网络处理器、芯片和电子设备
CN111008040A (zh) * 2019-11-27 2020-04-14 厦门星宸科技有限公司 缓存装置及缓存方法、计算装置及计算方法
CN111242293A (zh) * 2020-01-13 2020-06-05 腾讯科技(深圳)有限公司 一种处理部件、数据处理的方法以及电子设备
CN111290698A (zh) * 2018-12-07 2020-06-16 上海寒武纪信息科技有限公司 数据存取方法、数据处理方法、数据存取电路和运算装置
CN111782580A (zh) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 复杂计算装置、方法、人工智能芯片和电子设备
CN111814972A (zh) * 2020-07-08 2020-10-23 上海雪湖科技有限公司 一种基于fpga的神经网络卷积运算加速方法
CN112424745A (zh) * 2018-07-19 2021-02-26 赛灵思公司 在mac电路中使用不同核对一组数据执行连续mac运算
CN112445525A (zh) * 2019-09-02 2021-03-05 中科寒武纪科技股份有限公司 数据处理方法、相关设备及计算机可读介质
CN112559046A (zh) * 2020-12-09 2021-03-26 清华大学 数据处理装置及人工智能处理器
CN112613053A (zh) * 2020-12-25 2021-04-06 北京天融信网络安全技术有限公司 一种数据加解密方法及装置
CN112631955A (zh) * 2020-12-18 2021-04-09 北京地平线机器人技术研发有限公司 数据处理方法、装置、电子设备以及介质
US11398086B2 (en) 2020-06-01 2022-07-26 Hcl Technologies Limited System and method for performing a convolution operation with functional safety mechanism
CN117057403A (zh) * 2023-10-10 2023-11-14 苏州元脑智能科技有限公司 一种运算模块、基于脉冲神经网络的加速器及方法

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329910B (zh) * 2020-10-09 2024-06-04 东南大学 一种面向结构剪枝结合量化的深度卷积神经网络压缩方法
CN114581281A (zh) * 2020-11-30 2022-06-03 北京君正集成电路股份有限公司 一种基于第一层4bit卷积计算的优化方法
CN114581280A (zh) * 2020-11-30 2022-06-03 北京君正集成电路股份有限公司 一种基于4bit普通卷积计算的优化方法
CN113448624B (zh) * 2021-07-15 2023-06-27 安徽聆思智能科技有限公司 数据存取方法及装置、***、ai加速器

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101150358A (zh) * 2006-09-21 2008-03-26 大唐移动通信设备有限公司 增强上行控制信道的处理和复用方法
CN101147684A (zh) * 2007-11-15 2008-03-26 上海交通大学 多源马鞍线轨迹锥形束ct近似重建方法
CN101404555A (zh) * 2008-08-07 2009-04-08 北京九方中实电子科技有限责任公司 数字传输中的一种卷积交织解交织的方法
CN101610141A (zh) * 2008-06-18 2009-12-23 中兴通讯股份有限公司 多天线多用户数据的联合检测方法及其处理装置
CN102629189A (zh) * 2012-03-15 2012-08-08 湖南大学 基于fpga的流水浮点乘累加方法
CN103944535A (zh) * 2014-04-22 2014-07-23 天津大学 一种利用频响特性配置的全相位滤波器组的方法及其装置

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6757705B1 (en) * 1998-08-14 2004-06-29 Microsoft Corporation Method and system for client-side caching
US6421682B1 (en) * 1999-07-26 2002-07-16 Microsoft Corporation Catalog management system architecture having data table objects and logic table objects
US20030105837A1 (en) * 2001-11-30 2003-06-05 Yury Kamen Interception for optimal caching of distributed applications
US7664879B2 (en) * 2004-11-23 2010-02-16 Cisco Technology, Inc. Caching content and state data at a network element
CN1964227B (zh) * 2005-11-11 2012-03-07 华为技术有限公司 一种数据交互方法及数据收发模块
CN100583024C (zh) * 2008-01-04 2010-01-20 清华大学 一种用于浮点除法和平方根运算的预处理电路结构
CN101547019B (zh) * 2008-03-25 2012-10-03 卓胜微电子(上海)有限公司 Dtmb***中信道估计方法及实现该方法的装置
CN101605116B (zh) * 2008-06-10 2012-03-14 卓胜微电子(上海)有限公司 帧结构保护间隔的构成方法、循环卷积重构方法及装置
CN101882216B (zh) * 2009-05-08 2012-11-21 成都市华为赛门铁克科技有限公司 构建数据指纹的方法、装置及电子设备
CN102388385B (zh) * 2011-09-28 2013-08-28 华为技术有限公司 数据处理的方法和装置
CN104077233B (zh) * 2014-06-18 2017-04-05 百度在线网络技术(北京)有限公司 多通道卷积层处理方法和装置
CN106203621B (zh) * 2016-07-11 2019-04-30 北京深鉴智能科技有限公司 用于卷积神经网络计算的处理器

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101150358A (zh) * 2006-09-21 2008-03-26 大唐移动通信设备有限公司 增强上行控制信道的处理和复用方法
CN101147684A (zh) * 2007-11-15 2008-03-26 上海交通大学 多源马鞍线轨迹锥形束ct近似重建方法
CN101610141A (zh) * 2008-06-18 2009-12-23 中兴通讯股份有限公司 多天线多用户数据的联合检测方法及其处理装置
CN101404555A (zh) * 2008-08-07 2009-04-08 北京九方中实电子科技有限责任公司 数字传输中的一种卷积交织解交织的方法
CN102629189A (zh) * 2012-03-15 2012-08-08 湖南大学 基于fpga的流水浮点乘累加方法
CN103944535A (zh) * 2014-04-22 2014-07-23 天津大学 一种利用频响特性配置的全相位滤波器组的方法及其装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIANG, RONG, 32-BIT MICROCOMPUTER PRINCIPLE , ASSEMBLY LANGUAGE, AND INTERFACING, 31 July 2009 (2009-07-31), pages 55 *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109062611A (zh) * 2018-02-05 2018-12-21 上海寒武纪信息科技有限公司 神经网络处理装置及其执行向量缩放指令的方法
CN109062611B (zh) * 2018-02-05 2023-05-23 上海寒武纪信息科技有限公司 神经网络处理装置及其执行向量缩放指令的方法
CN112424745A (zh) * 2018-07-19 2021-02-26 赛灵思公司 在mac电路中使用不同核对一组数据执行连续mac运算
CN112424745B (zh) * 2018-07-19 2024-01-26 赛灵思公司 在mac电路中使用不同核对一组数据执行连续mac运算
CN111290698A (zh) * 2018-12-07 2020-06-16 上海寒武纪信息科技有限公司 数据存取方法、数据处理方法、数据存取电路和运算装置
CN111290698B (zh) * 2018-12-07 2022-05-03 上海寒武纪信息科技有限公司 数据存取方法、数据处理方法、数据存取电路和运算装置
CN110780921B (zh) * 2019-08-30 2023-09-26 腾讯科技(深圳)有限公司 数据处理方法和装置、存储介质及电子装置
CN110780921A (zh) * 2019-08-30 2020-02-11 腾讯科技(深圳)有限公司 数据处理方法和装置、存储介质及电子装置
CN112445525A (zh) * 2019-09-02 2021-03-05 中科寒武纪科技股份有限公司 数据处理方法、相关设备及计算机可读介质
CN111008040B (zh) * 2019-11-27 2022-06-14 星宸科技股份有限公司 缓存装置及缓存方法、计算装置及计算方法
CN111008040A (zh) * 2019-11-27 2020-04-14 厦门星宸科技有限公司 缓存装置及缓存方法、计算装置及计算方法
CN110991619A (zh) * 2019-12-09 2020-04-10 Oppo广东移动通信有限公司 神经网络处理器、芯片和电子设备
CN111242293A (zh) * 2020-01-13 2020-06-05 腾讯科技(深圳)有限公司 一种处理部件、数据处理的方法以及电子设备
US11398086B2 (en) 2020-06-01 2022-07-26 Hcl Technologies Limited System and method for performing a convolution operation with functional safety mechanism
CN111782580B (zh) * 2020-06-30 2024-03-01 北京百度网讯科技有限公司 复杂计算装置、方法、人工智能芯片和电子设备
CN111782580A (zh) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 复杂计算装置、方法、人工智能芯片和电子设备
CN111814972A (zh) * 2020-07-08 2020-10-23 上海雪湖科技有限公司 一种基于fpga的神经网络卷积运算加速方法
CN111814972B (zh) * 2020-07-08 2024-02-02 上海雪湖科技有限公司 一种基于fpga的神经网络卷积运算加速方法
CN112559046A (zh) * 2020-12-09 2021-03-26 清华大学 数据处理装置及人工智能处理器
CN112631955A (zh) * 2020-12-18 2021-04-09 北京地平线机器人技术研发有限公司 数据处理方法、装置、电子设备以及介质
CN112631955B (zh) * 2020-12-18 2024-01-19 北京地平线机器人技术研发有限公司 数据处理方法、装置、电子设备以及介质
CN112613053A (zh) * 2020-12-25 2021-04-06 北京天融信网络安全技术有限公司 一种数据加解密方法及装置
CN112613053B (zh) * 2020-12-25 2024-04-23 北京天融信网络安全技术有限公司 一种数据加解密方法及装置
CN117057403A (zh) * 2023-10-10 2023-11-14 苏州元脑智能科技有限公司 一种运算模块、基于脉冲神经网络的加速器及方法
CN117057403B (zh) * 2023-10-10 2024-02-13 苏州元脑智能科技有限公司 一种运算模块、基于脉冲神经网络的加速器及方法

Also Published As

Publication number Publication date
CN110073329B (zh) 2021-06-22
CN110073329A (zh) 2019-07-30

Similar Documents

Publication Publication Date Title
WO2018107476A1 (zh) 访存设备、计算设备和应用于卷积神经网络运算的设备
CN110582785B (zh) 配置用于执行层描述符列表的具有功率效率的深度神经网络模块
CN109542515B (zh) 运算装置及方法
WO2019109795A1 (zh) 卷积运算处理方法及相关产品
CN107315574B (zh) 一种用于执行矩阵乘运算的装置和方法
WO2019128404A1 (zh) 矩阵乘法器
EP3832499B1 (en) Matrix computing device
CN111580865B (zh) 一种向量运算装置及运算方法
EP3664093A1 (en) Semiconductor memory device employing processing in memory (pim) and method of operating the semiconductor memory device
US20180121386A1 (en) Super single instruction multiple data (super-simd) for graphics processing unit (gpu) computing
CN107315717B (zh) 一种用于执行向量四则运算的装置和方法
US8595467B2 (en) Floating point collect and operate
JP7387017B2 (ja) アドレス生成方法及びユニット、深層学習処理器、チップ、電子機器並びにコンピュータプログラム
CN107315716B (zh) 一种用于执行向量外积运算的装置和方法
KR102371844B1 (ko) 인공 지능 칩에 적용되는 산출 방법 및 인공 지능 칩
JP2014197433A (ja) 画素速度での画像処理のための方法および装置
CN111651205A (zh) 一种用于执行向量内积运算的装置和方法
CN111651202A (zh) 一种用于执行向量逻辑运算的装置
CN112348182A (zh) 一种神经网络maxout层计算装置
US11334358B2 (en) Hardware accelerator having reconfigurable instruction set and reconfigurable decoder
US11841792B1 (en) Instructions with multiple memory access modes
JP2005071351A (ja) プロセッサおよびプロセッサの動作方法
US20130297908A1 (en) Decomposing Operations in More than One Dimension into One Dimensional Point Operations
US20140160135A1 (en) Memory Cell Array with Dedicated Nanoprocessors
WO2022220835A1 (en) Shared register for vector register file and scalar register file

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16923863

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16923863

Country of ref document: EP

Kind code of ref document: A1