WO2019041251A1 - 芯片装置及相关产品 - Google Patents

芯片装置及相关产品 Download PDF

Info

Publication number
WO2019041251A1
Authority
WO
WIPO (PCT)
Prior art keywords
data block
unit
basic
main unit
data
Prior art date
Application number
PCT/CN2017/099991
Other languages
English (en)
French (fr)
Inventor
刘少礼
陈天石
王秉睿
张尧
Original Assignee
北京中科寒武纪科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to CN202010628834.2A priority Critical patent/CN111860815A/zh
Priority to CN201910102972.4A priority patent/CN109902804B/zh
Priority to EP19212365.1A priority patent/EP3654209A1/en
Priority to CN201910534118.5A priority patent/CN110231958B/zh
Priority to EP19211995.6A priority patent/EP3651030A1/en
Priority to EP19212002.0A priority patent/EP3651031A1/en
Priority to CN201780002287.3A priority patent/CN109729734B8/zh
Priority to JP2019553977A priority patent/JP7065877B2/ja
Priority to CN201910530860.9A priority patent/CN110245751B/zh
Priority to KR1020197029020A priority patent/KR102467688B1/ko
Priority to CN201910531031.2A priority patent/CN110222308B/zh
Priority to EP17923228.5A priority patent/EP3605402B1/en
Priority to CN201910534527.5A priority patent/CN110083390B/zh
Priority to KR1020197037903A priority patent/KR102477404B1/ko
Application filed by 北京中科寒武纪科技有限公司 filed Critical 北京中科寒武纪科技有限公司
Priority to CN201910534528.XA priority patent/CN110245752B/zh
Priority to KR1020197037895A priority patent/KR102481256B1/ko
Priority to CN201811462676.7A priority patent/CN109615061B/zh
Priority to PCT/CN2017/099991 priority patent/WO2019041251A1/zh
Priority to EP19212368.5A priority patent/EP3654210A1/en
Priority to EP19212010.3A priority patent/EP3654208A1/en
Priority to TW107125681A priority patent/TWI749249B/zh
Priority to US16/168,778 priority patent/US11409535B2/en
Publication of WO2019041251A1 publication Critical patent/WO2019041251A1/zh
Priority to US16/663,164 priority patent/US11531553B2/en
Priority to US16/663,181 priority patent/US11561800B2/en
Priority to US16/663,206 priority patent/US11334363B2/en
Priority to US16/663,174 priority patent/US11775311B2/en
Priority to US16/663,210 priority patent/US11354133B2/en
Priority to US16/663,205 priority patent/US11347516B2/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3818Decoding for concurrent execution
    • G06F9/3822Parallel decoding, e.g. parallel decode units
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/02Preprocessing

Definitions

  • the present disclosure relates to the field of communication and chip technologies, and in particular to a chip device and related products.
  • ANN Artificial Neural Network
  • a neural network is an operational model consisting of a large number of nodes (or neurons) connected to each other.
  • existing neural network calculations are based on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit), and such calculations have high power consumption and long calculation times.
  • the embodiments of the present disclosure provide a neural network operation method and related products, which can reduce computation time and reduce power consumption of the module.
  • an embodiment of the present disclosure provides a neural network computation method applied to a chip device, where the chip device includes a main unit and a plurality of basic units. The method includes the following steps: the main unit acquires a data block to be calculated and an operation instruction, and divides the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction; the main unit splits the distribution data block to obtain a plurality of basic data blocks, distributes the plurality of basic data blocks to the plurality of basic units, and broadcasts the broadcast data block to the plurality of basic units; each basic unit performs an inner product operation on the basic data block and the broadcast data block to obtain an operation result and sends the operation result to the main unit; and the main unit processes the operation results to obtain the instruction result of the operation instruction for the data block to be calculated.
  • the main unit broadcasts the broadcast data block to the multiple basic units, including:
  • the main unit broadcasts the broadcast data block to the plurality of basic units in a single broadcast.
  • the basic unit performs an inner product operation on the basic data block and the broadcast data block to obtain an operation result, and sends the operation result to the main unit, including:
  • the basic unit performs inner product processing on the basic data block and the broadcast data block to obtain an inner product processing result, accumulates the inner product processing result to obtain an operation result, and transmits the operation result to the main unit.
  • the operation result is a result of the inner product processing
  • the main unit processes the operation result to obtain the instruction result of the operation instruction for the data block to be calculated, including:
  • the main unit accumulates the operation results to obtain an accumulation result, and arranges the accumulation result to obtain the instruction result of the operation instruction for the data block to be calculated.
  • the main unit broadcasts the broadcast data block to the multiple basic units, including: the main unit divides the broadcast data block into a plurality of partial broadcast data blocks, and broadcasts the plurality of partial broadcast data blocks to the plurality of basic units over multiple broadcasts.
  • the basic unit performs an inner product operation on the basic data block and the broadcast data block to obtain an operation result, and sends the operation result to the main unit, including:
  • the basic unit performs inner product processing on the partial broadcast data block and the basic data block to obtain an inner product processing result, accumulates the inner product processing result to obtain a partial operation result, and sends the partial operation result to the main unit.
  • the basic unit performs an inner product operation on the basic data block and the broadcast data block to obtain an operation result, and sends the operation result to the main unit, including:
  • the basic unit multiplexes the partial broadcast data block n times, performing inner product operations between the partial broadcast data block and n basic data blocks to obtain n partial processing results; it accumulates the n partial processing results separately to obtain n partial operation results, and sends the n partial operation results to the main unit, where n is an integer greater than or equal to 2.
  • In a second aspect, a chip device is provided, including a main unit and a plurality of basic units. The main unit is configured to acquire a data block to be calculated and an operation instruction, divide the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction, split the distribution data block to obtain a plurality of basic data blocks, distribute the plurality of basic data blocks to the plurality of basic units, and broadcast the broadcast data block to the plurality of basic units. Each basic unit is configured to perform an inner product operation on the basic data block and the broadcast data block to obtain an operation result and transmit the operation result to the main unit. The main unit is further configured to process the operation results to obtain the instruction result of the operation instruction for the data block to be calculated.
  • the chip device further includes: a branching unit, the branching unit is disposed between the main unit and the basic unit; and the branching unit is configured to forward data.
  • the main unit is specifically configured to broadcast the broadcast data block to the plurality of basic units in a single broadcast.
  • the basic unit is specifically configured to perform inner product processing on the basic data block and the broadcast data block to obtain an inner product processing result, accumulate the inner product processing result to obtain an operation result, and send the operation result to the main unit.
  • the main unit is configured to, when the operation result is the result of the inner product processing, accumulate the operation results to obtain an accumulation result, and arrange the accumulation result to obtain the instruction result of the operation instruction for the data block to be calculated.
  • the main unit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks, and broadcast the plurality of partial broadcast data blocks to the plurality of basic units by multiple times.
  • the basic unit is specifically configured to perform inner product processing on the partial broadcast data block and the basic data block to obtain an inner product processing result, accumulate the inner product processing result to obtain a partial operation result, and transmit the partial operation result to the main unit.
  • the basic unit is specifically configured to multiplex the partial broadcast data block n times, performing inner product operations between the partial broadcast data block and n basic data blocks to obtain n partial processing results; it accumulates the n partial processing results separately to obtain n partial operation results, and sends the n partial operation results to the main unit, where n is an integer greater than or equal to 2.
  • the main unit includes: one or any combination of a main register or a main on-chip buffer circuit;
  • the base unit includes one or any combination of a basic register or a basic on-chip buffer circuit.
  • the main unit includes one or any combination of a vector operator circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, or a data rearrangement circuit.
  • the basic unit includes one or any combination of an inner product operator circuit or an accumulator circuit.
  • there are a plurality of branch units; the main unit is connected to each of the plurality of branch units, and each branch unit is connected to at least one basic unit.
  • there are a plurality of branch units; the plurality of branch units are connected in series and then connected to the main unit, and each branch unit is connected to at least one basic unit.
  • the branching unit is specifically configured to forward data between the primary unit and the basic unit.
  • the branching unit is specifically configured to forward data between the primary unit and the basic unit or other branch units.
  • the data is: one or any combination of a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.
  • if the operation instruction is a multiplication instruction, the multiplier data block is determined to be the broadcast data block, and the multiplicand data block is determined to be the distribution data block;
  • if the operation instruction is a convolution instruction, the input data block is determined to be the broadcast data block, and the convolution kernel is determined to be the distribution data block.
  • a method of applying the chip device provided by the second aspect, the chip device being configured to perform one or any combination of a matrix-multiplied-by-matrix operation, a matrix-multiplied-by-vector operation, a convolution operation, or a fully connected operation.
  • a chip is provided that integrates the chip device provided by the second aspect.
  • a smart device comprising the chip provided by the sixth aspect.
  • the data is divided into distribution data and broadcast data, and the distribution data is split into basic data blocks and distributed to a plurality of basic units to perform inner product operations.
  • the inner product operation, which accounts for the largest amount of computation, is distributed to a plurality of basic units for simultaneous execution, which has the advantages of reducing calculation time and saving power consumption.
  • FIG. 1a is a schematic structural diagram of a chip device provided by the present disclosure.
  • FIG. 1b is a schematic structural diagram of another chip device provided by the present disclosure.
  • FIG. 1c is a schematic diagram of data distribution of the chip device provided by the present disclosure.
  • FIG. 1d is a schematic diagram of data back transmission of a chip device.
  • FIG. 2 is a schematic flow chart of a method for computing a neural network according to an embodiment of the present disclosure.
  • 2a is a schematic diagram of matrix A multiplied by matrix B provided by an embodiment of the present disclosure.
  • FIG. 3 is a schematic flowchart diagram of a method for computing a neural network according to an embodiment of the present disclosure.
  • Figure 3a is a schematic diagram of single sample data for Full Connection 1.
  • Figure 3b is a schematic diagram of multi-sample data for full connection 2.
  • Figure 3c is a schematic diagram of M convolution kernel data for convolution 1.
  • Figure 3d is a schematic diagram of convolution 2 input data.
  • Figure 3e is a schematic diagram of the operation window of a three-dimensional data block of input data.
  • Figure 3f is a schematic diagram of another operational window of a three-dimensional data block of input data.
  • Figure 3g is a schematic diagram of yet another operational window of a three-dimensional data block of input data.
  • references to "an embodiment" herein mean that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present disclosure.
  • the appearances of this phrase in various places in the specification do not necessarily refer to the same embodiment, nor are they separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art will explicitly and implicitly understand that the embodiments described herein can be combined with other embodiments.
  • the CPU is taken as an example to illustrate the operation method of the neural network.
  • the multiplication of the matrix and the matrix is widely used in the neural network.
  • the steps taken in calculating C can be to first complete the calculation for the first row, then the second row, and finally the third row; that is, the CPU finishes the calculation of one row of data before starting the calculation of the next row.
  • for the CPU to complete the calculation of the first row, it needs to compute a11*b11 + a12*b21 + a13*b31, a11*b12 + a12*b22 + a13*b32, and a11*b13 + a12*b23 + a13*b33;
  • for a CPU or GPU, the calculation proceeds row by row: after the first row is calculated, the second row is calculated, and then the third row, until all rows are calculated.
  • a matrix may have thousands of rows of data, so the calculation time is very long, and during the calculation the CPU remains in a working state for a long time, so the energy consumption is also high.
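The row-by-row CPU computation described above can be sketched in plain Python (an illustrative sketch only; the function name and loop structure are our own, not part of the disclosure):

```python
def matmul_row_by_row(a, b):
    """Naive CPU-style matrix multiply: row i of C is fully
    computed before row i+1 is started, exactly as described above."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    c = [[0] * n for _ in range(m)]
    for i in range(m):            # rows computed strictly one after another
        for j in range(n):
            for p in range(k):    # inner product of row i of A with column j of B
                c[i][j] += a[i][p] * b[p][j]
    return c
```

Because every row waits for the previous one to finish, the total time grows linearly with the number of rows; this serial bottleneck is what the chip device's parallel basic units remove.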
  • FIG. 1b is a schematic structural diagram of a chip device.
  • the device includes: a main unit circuit, a basic unit circuit, and a branch unit circuit.
  • the main unit circuit may include a register and/or an on-chip buffer circuit, and may further include one or any combination of a vector operator circuit, an ALU (arithmetic and logic unit) circuit, an accumulator circuit, a matrix transposition circuit, a DMA (direct memory access) circuit, and a data rearrangement circuit.
  • each basic unit may include a basic register and/or a basic on-chip buffer circuit; each basic unit may further include one or any combination of an inner product operator circuit, a vector operator circuit, an accumulator circuit, and the like.
  • the circuits can all be integrated circuits. If branch units are present, the main unit is connected to the branch units and the branch units are connected to the basic units: the basic units are used for performing inner product operations between data blocks, the main unit is used for transmitting and receiving external data and distributing the external data to the branch units, and the branch units are used for transmitting and receiving data of the main unit or the basic units.
  • the structure shown in Figure 1b is suitable for the calculation of complex data: because the number of units the main unit can connect to directly is limited, branch units need to be added between the main unit and the basic units so that more basic units can be connected, enabling the calculation of complex data blocks.
  • the connection structure between the branch units and the basic units may be arbitrary and is not limited to the H-type structure of FIG. 1b.
  • the connection from the main unit to the basic units is a structure for broadcasting or distribution;
  • the connection from the basic units to the main unit is a structure for gathering (collection).
  • the definitions of broadcasting, distribution and collection are as follows:
  • the data transfer manner of the main unit to the base unit may include:
  • the main unit is connected to a plurality of branch units, and each branch unit is connected to a plurality of base units.
  • the main unit is connected to a branch unit, which is connected to a branch unit, and so on, and a plurality of branch units are connected in series, and then each branch unit is connected to a plurality of base units.
  • the main unit is connected to a plurality of branch units, and each branch unit is connected in series with a plurality of base units.
  • the main unit is connected to a branch unit, which is connected to a branch unit, and so on, and a plurality of branch units are connected in series, and then each branch unit is connected in series with a plurality of base units.
  • when distributing data, the main unit transmits data to some or all of the basic units, and the data received by each receiving basic unit may be different;
  • when broadcasting data, the main unit transmits data to some or all of the basic units, and each receiving basic unit receives the same data.
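As a minimal sketch of the two transfer manners just defined (plain Python with hypothetical names; each basic unit's local storage is modeled as a list):

```python
def distribute(blocks, units):
    # Distribution: each basic unit may receive a DIFFERENT data block.
    for unit, block in zip(units, blocks):
        unit.append(block)

def broadcast(block, units):
    # Broadcast: every basic unit receives the SAME data block.
    for unit in units:
        unit.append(block)

# Three basic units: distribute one row each, then broadcast matrix B to all.
units = [[] for _ in range(3)]
distribute(["row0", "row1", "row2"], units)
broadcast("matrix_B", units)
```

After these two calls every unit holds a distinct basic data block plus the shared broadcast data block, which is exactly the operand pair each unit needs for its inner product.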
  • the chip device shown in FIG. 1a or FIG. 1b may be a single physical chip. Of course, in practical applications, the chip device may also be integrated in other chips (for example, CPU, GPU). The specific embodiment does not limit the physical representation of the above chip device.
  • FIG. 1c is a schematic diagram of data distribution of a chip device. As shown by the arrows in FIG. 1c, which indicate the data distribution direction, after the main unit receives external data, it splits the data and distributes the split data to the plurality of branch units, and the branch units transmit the split data to the basic units.
  • FIG. 1d is a schematic diagram of data return of a chip device. As shown by the arrows in FIG. 1d, which indicate the data return direction, the basic units pass data (for example, inner product calculation results) back to the branch units, and the branch units pass the data back to the main unit.
  • FIG. 1a is a schematic structural diagram of another chip device.
  • the chip device includes a main unit and basic units, and the main unit is connected to the basic units. Since in the structure shown in Fig. 1a the basic units are directly physically connected to the main unit, the number of basic units that can be connected is limited, so this structure is suitable for simple data calculations.
  • FIG. 2 provides a method for computing a neural network using the above chip device.
  • the method is implemented by using a chip device as shown in FIG. 1a or as shown in FIG. 1b.
  • the method is shown in FIG. 2 and includes the following steps:
  • Step S201 The main unit of the chip device acquires a data block to be calculated and an operation instruction.
  • the data block to be calculated in the above step S201 may specifically be a matrix, a vector, three-dimensional data, four-dimensional data, multi-dimensional data, or the like.
  • the specific embodiments of the present disclosure do not limit the specific form of the data block; the operation instruction may specifically be a multiplication instruction, a convolution instruction, an addition instruction, a subtraction instruction, a BLAS (Basic Linear Algebra Subprograms) function, an activation function, or the like.
  • Step S202 The main unit divides the data block to be calculated into the distribution data block and the broadcast data block according to the operation instruction.
  • the implementation method of the foregoing step S202 may specifically be:
  • if the operation instruction is a multiplication instruction, the multiplier data block is determined to be the broadcast data block, and the multiplicand data block is determined to be the distribution data block.
  • if the operation instruction is a convolution instruction, the input data block is determined to be the broadcast data block, and the convolution kernel is determined to be the distribution data block.
  • Step S2031 The main unit performs split processing on the distributed data block to obtain a plurality of basic data blocks, and distributes the plurality of basic data blocks to multiple basic units.
  • step S2032 the main unit broadcasts the broadcast data block to a plurality of basic units.
  • step S2031 and step S2032 may also be performed in a loop.
  • the main unit splits the distribution data block to obtain a plurality of basic data blocks, and each basic data block is further split into m basic data sub-blocks; the broadcast data block is likewise split into m broadcast data sub-blocks. The main unit distributes one basic data sub-block and broadcasts one broadcast data sub-block at a time; the basic data sub-blocks and broadcast data sub-blocks are all data blocks on which parallel neural network calculations can be performed.
  • for example, the basic data block may be the z-th row of data of matrix A, and the basic data sub-block may be the first 20 columns of data in the z-th row of matrix A; the broadcast data sub-block may be the first 20 rows of data in the z-th column of matrix B.
  • the basic data block in the above step S203 may specifically be a minimum data block capable of performing an inner product operation.
  • the basic data block may be a row of data of a matrix.
  • the basic data block may be the weights of a convolution kernel.
  • for the manner of performing the foregoing step S203, refer to the description of the following embodiments; details are not described herein again.
  • for the method of broadcasting the broadcast data block, refer to the description of the following embodiments; details are not described herein again.
  • Step S2041 The basic unit of the chip device performs an inner product operation on the basic data block and the broadcast data block to obtain an operation result (possibly an intermediate result).
  • Step S2042 If the operation result is not an intermediate result, the operation result is transmitted back to the main unit.
  • Step S205 The main unit processes the operation results to obtain the instruction result of the operation instruction for the data block to be calculated.
  • the processing in the foregoing step S205 may be an accumulation, a sort, or the like.
  • the disclosure is not limited to the specific manner of the foregoing processing.
  • the specific manner needs to be configured according to different operation instructions, for example, may also include performing a nonlinear transformation or the like.
  • In the technical solution provided by the present disclosure, the main unit receives external data, which includes the data block to be calculated and the operation instruction. The main unit determines the distribution data block and the broadcast data block of the data block to be calculated according to the operation instruction, splits the distribution data block into a plurality of basic data blocks, broadcasts the broadcast data block to the plurality of basic units, and distributes the plurality of basic data blocks to the plurality of basic units. The basic units perform inner product operations on the basic data blocks and the broadcast data block to obtain operation results and return the operation results to the main unit, and the main unit obtains the instruction result of the operation instruction from the returned operation results.
  • The technical point of this solution is that, for a neural network, the bulk of the computation lies in the inner product operations between data blocks; the overhead of the inner product operation is large and its calculation time is long. The embodiments of the present disclosure therefore first use the operation instruction to distinguish, within the data block to be calculated, the distribution data block from the broadcast data block: the broadcast data block is the data block that must be used as a whole when implementing the inner product operation, while the distribution data block can be split into basic data blocks and processed in parallel.
  • the matrix multiplication is taken as an example.
  • the data block to be calculated is matrix A and matrix B
  • the operation instruction is a multiplication instruction (A*B), and the roles of the data blocks are determined according to the rules of matrix multiplication.
  • matrix A is determined to be the splittable (distribution) data block, and matrix B is determined to be the broadcast data block, because for matrix multiplication the multiplicand matrix A can be split into multiple basic data blocks while the multiplier matrix B must be used as a whole.
  • each row of the multiplicand matrix A needs to perform an inner product operation with the multiplier matrix B. The technical solution of the present application therefore splits matrix A into M basic data blocks, each basic data block being one row of data of matrix A. The time-consuming inner product operations are thus performed by multiple basic units separately, so that the multiple basic units can quickly calculate the results in parallel, reducing the calculation time; the shorter calculation time also reduces the operating time of the chip device, thereby reducing power consumption.
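A minimal software sketch of this row-split scheme, assuming Python and hypothetical names (on the actual chip the per-row inner products run on separate basic units in parallel; here they run sequentially for illustration):

```python
def chip_style_matmul(a, b):
    # Main unit: split the distribution data block (matrix A) into
    # M basic data blocks, one row of A per basic unit.
    basic_blocks = list(a)

    # Each "basic unit" holds one row of A plus the broadcast matrix B
    # and computes the inner product of its row with every column of B.
    def basic_unit(row, b):
        return [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]

    # Main unit: gather the returned operation results and arrange them
    # into the instruction result C. On hardware the calls below would
    # execute simultaneously on M basic units.
    return [basic_unit(row, b) for row in basic_blocks]
```

The gather step here is trivial (row i of the result comes from basic unit i); the parallel speedup comes entirely from the M inner-product calls being independent of one another.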
  • a matrix A is multiplied by a vector B.
  • the matrix A has M rows and L columns, and the vector B has L rows. Suppose the time required for a basic unit to compute the inner product of one row of matrix A with the vector B is t1.
  • the matrix A is split into M basic data blocks, each basic data block being one row of data of matrix A, and the M basic units execute their inner product operations in parallel, so the parallel calculation time is t1; t2 can be the time for the main unit to split the data, and t3 can be the time for the main unit to process the results of the inner product operations.
  • the chip device provided by the present disclosure has a short working time; experiments show that when the working time of the chip device is very short, its energy consumption is much lower than that of long working times, so it has the advantage of saving energy.
  • the main unit can broadcast the broadcast data block to the multiple basic units in multiple manners.
  • Mode A: broadcasting the broadcast data block to the plurality of basic units in a single broadcast.
  • broadcasting refers to "one-to-many" data transmission, that is, the main unit simultaneously sends the same data block to a plurality of (all or part of the) basic units. For example, for matrix A * matrix B, where matrix B is the broadcast data block, matrix B is broadcast to the plurality of basic units in a single broadcast.
  • for a convolution, the input data block is the broadcast data block, and the input data block is broadcast to the plurality of basic units at one time.
  • Mode B: dividing the broadcast data block into a plurality of partial broadcast data blocks and broadcasting the plurality of partial broadcast data blocks to the plurality of basic units over multiple broadcasts; for example, matrix B is broadcast to the plurality of basic units over multiple broadcasts, with part of matrix B broadcast each time.
  • the advantage of this method is that the configuration of the basic unit can be reduced, because the storage space of the registers configured for the basic unit is unlikely to be large, and if the matrix B is sent to the basic unit once for the matrix B with a relatively large amount of data, then the basic The storage of these data requires a relatively large register space. Because the number of basic units is large, increasing the register space inevitably has a great impact on the cost increase. Therefore, the scheme of broadcasting the broadcast data block multiple times is used, that is, for the basic unit. In other words, it only needs to store part of the data of the broadcast data block that is broadcasted each time, thereby reducing the cost.
  • The method of distributing the plurality of basic data blocks to the plurality of basic units in step S203 may likewise adopt Mode A or Mode B above, except that the transmission mode is unicast and the transmitted data consists of the basic data blocks.
  • The implementation of the foregoing step S204 may specifically be:
  • the basic unit performs inner product processing on the basic data block and the broadcast data block to obtain an inner product processing result, that is, it executes the inner product operation of one row at a time, sends the inner product processing result (one kind of operation result) to the main unit, and the main unit accumulates the inner product processing results.
  • Alternatively, the basic unit may accumulate the inner product processing results itself and then send the accumulated result (another kind of operation result) to the main unit.
  • The latter method can reduce the amount of data transmitted between the main unit and the basic units, thereby increasing the calculation speed.
  • With Mode B, each time the basic unit receives a partial broadcast data block, it performs a partial inner product operation on the basic data block and the partial broadcast data block to obtain a partial processing result,
  • sends the processing result to the main unit, and the main unit accumulates the processing results.
  • In another implementation, if the basic unit has received n basic data blocks, it multiplexes the partial broadcast data block to perform the inner product operation of that broadcast data block with the n basic data blocks, obtaining n partial processing results; the basic unit sends the n processing results to the main unit, and the main unit accumulates the n processing results respectively.
  • The above accumulation can also be performed in the basic unit.
  • In practice the amount of data in the broadcast data block is generally very large, as is the distribution data block, because for the chip device, which is a hardware configuration, the number of configured basic units is theoretically unbounded but in practice limited, generally tens of basic units; this number may change, for example increase, as technology develops.
  • In a matrix multiplication, the matrix A may have thousands of rows and the matrix B may have thousands of columns, so sending the matrix B to the basic units in a single broadcast is infeasible.
  • One implementation is therefore to broadcast part of the data of the matrix B at a time, for example the data of the first five columns; a similar manner may be adopted for the matrix A. The basic unit then performs a partial inner product calculation each time, stores the result of the partial inner product calculation in its register, and after all the inner product operations of a row have been executed, accumulates the results of the inner product calculations of that row to obtain an operation result, which it sends to the main unit.
  • This approach has the advantage of increasing the calculation speed.
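The partial-broadcast scheme just described can be sketched as follows (a Python simulation under the assumption that B is broadcast in chunks along the L dimension and each basic unit accumulates its partial inner products locally; the function name and chunk size are illustrative):

```python
def matmul_partial_broadcast(A, B, rows_per_broadcast=2):
    """Broadcast B in chunks along the L dimension (Mode B); each basic unit
    computes a partial inner product per chunk and accumulates it locally."""
    M, L, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]              # per-unit accumulators
    for k0 in range(0, L, rows_per_broadcast):     # one partial broadcast block
        k1 = min(k0 + rows_per_broadcast, L)
        for i in range(M):                         # row i = one basic data block
            for j in range(N):
                C[i][j] += sum(A[i][k] * B[k][j] for k in range(k0, k1))
    return C

print(matmul_partial_broadcast([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# -> [[19.0, 22.0], [43.0, 50.0]]
```

Only one chunk of B needs to live in a basic unit's registers at a time, which is the cost saving the text attributes to Mode B.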
  • FIG. 3 provides a calculation method of a neural network; the calculation in this embodiment is based on a matrix
  • multiplication, namely matrix A * matrix B, where the matrices may be as shown in FIG. 3a.
  • The calculation method of the neural network shown in FIG. 3 uses the chip device shown in FIG. 1b.
  • The chip device has 16 basic units.
  • The value of M as shown in FIG. 3a may be 32, and the value of N may be 15.
  • The value of L may be 20. It will of course be understood that the computing device can have any number of basic units.
  • The method is shown in FIG. 3 and includes the following steps:
  • Step S301: the main unit receives the matrix A, the matrix B, and the multiplication operation instruction A*B.
  • Step S302: the main unit determines, according to the multiplication operation instruction A*B, that the matrix B is the broadcast data block and the matrix A is the distribution data block, and splits the matrix A into 32 basic data blocks, each basic data block being one row of data of the matrix A.
  • Step S303: the main unit evenly allocates the 32 basic data blocks to the 16 basic units, that is, each basic unit receives 2 basic data blocks; the two blocks
  • may be allocated in any non-repeating order.
  • The allocation in the foregoing step S303 may also adopt other allocation methods.
  • For example, when the data blocks cannot be evenly divided among the basic units, the data blocks may be allocated unevenly to the basic units.
  • The embodiment of the present disclosure does not limit the manner in which the above basic data blocks are allocated to the plurality of basic units.
  • Step S304: the main unit extracts the data of the first few columns (for example the first five columns) of the matrix B and broadcasts that part of the matrix B to the 16 basic units.
  • Step S305: the 16 basic units each multiplex the data of the first 5 columns with their 2 basic data blocks, performing the inner product and accumulation operations to obtain 32*5 front processing results, and send the 32*5 front processing results to the main unit.
  • Step S306: the main unit extracts the data of the middle five columns of the matrix B and broadcasts that part of the matrix B to the 16 basic units.
  • Step S307: the 16 basic units each multiplex the data of the middle 5 columns with their 2 basic data blocks, performing the inner product and accumulation operations to obtain 32*5 middle processing results, and send the 32*5 middle processing results to the main unit.
  • Step S308: the main unit extracts the data of the last five columns of the matrix B and broadcasts that part of the matrix B to the 16 basic units.
  • Step S309: the 16 basic units each multiplex the data of the last 5 columns with their 2 basic data blocks, performing the inner product and accumulation operations to obtain 32*5 rear processing results, and send the 32*5 rear processing results to the main unit.
  • Step S310: the main unit combines the 32*5 front, middle, and rear processing results in order to obtain a 32*15 matrix C, which is the instruction result of matrix A * matrix B.
  • The technical solution shown in FIG. 3 splits the matrix A into 32 basic data blocks and then broadcasts the matrix B in batches, so that the basic units can obtain the instruction result in batches. Since the inner products are split across 16 basic units for calculation, the calculation time can be greatly reduced, giving the advantages of short calculation time and low energy consumption.
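The arithmetic of steps S301 to S310 can be modeled compactly (a Python sketch that abstracts away the 16 physical basic units and keeps only the column batching and the final splicing; the function name is mine):

```python
def matmul_column_batches(A, B, batch=5):
    """Broadcast B a few columns at a time, compute the inner products for each
    batch, then splice the per-batch results left to right into the matrix C."""
    M, L, N = len(A), len(A[0]), len(B[0])
    pieces = []
    for c0 in range(0, N, batch):                  # one broadcast batch of columns
        cols = range(c0, min(c0 + batch, N))
        pieces.append([[sum(A[i][k] * B[k][j] for k in range(L)) for j in cols]
                       for i in range(M)])
    # main unit combines the per-batch results in order (front, middle, rear)
    return [sum((p[i] for p in pieces), []) for i in range(M)]
```

With M = 32, N = 15, and batch = 5 this produces exactly the three 32*5 result groups of steps S305, S307, and S309, and the 32*15 matrix C of step S310.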
  • FIG. 1a shows a chip device according to the disclosure. The chip device includes a main unit and basic units; the main unit is a hardware chip unit, and each basic unit is also a hardware chip unit.
  • The main unit is configured to perform each continuous operation in a neural network operation and to transmit data with the basic units.
  • The basic units are configured to perform the operations accelerated in parallel in the neural network according to the data transmitted by the main unit, and to transmit the operation results to the main unit.
  • The above parallel accelerated operations include, but are not limited to, large-scale, parallelizable operations such as multiplication between data blocks and convolution operations.
  • The above continuous operations include, but are not limited to, operations such as accumulation, matrix transposition, and data sorting.
  • Specifically, the main unit is configured to acquire a data block to be calculated and an operation instruction, and divide the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction; split the distribution data block to obtain a plurality of basic data blocks, distribute the plurality of basic data blocks to the plurality of basic units, and broadcast the broadcast data block to the plurality of basic units. The basic units are configured to perform inner product operations on the basic data blocks and the broadcast data block to obtain operation results and send the operation results to the main unit; the main unit is configured to process the operation results to obtain the instruction result of the data block to be calculated and the operation instruction.
  • Optionally, the chip device further includes a branch unit disposed between the main unit and the basic units; the branch unit is configured to forward data.
  • Optionally, the main unit is specifically configured to broadcast the broadcast data block to the plurality of basic units in a single broadcast.
  • Optionally, the basic unit is specifically configured to perform inner product processing on the basic data block and the broadcast data block to obtain an inner product processing result, accumulate the inner product processing results to obtain an operation result, and send the operation
  • result to the main unit.
  • Optionally, the main unit is configured to, when the operation result is a result of inner product processing, accumulate the operation results to obtain an accumulated result, and arrange the accumulated result to obtain the instruction result of the data block to be calculated and the operation instruction.
  • Optionally, the main unit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks and broadcast the plurality of partial broadcast data blocks to the plurality of basic units over multiple broadcasts.
  • Optionally, the basic unit is specifically configured to perform one inner product processing of the partial broadcast data block with the basic data block to obtain an inner product processing result, accumulate the inner product processing results to obtain a partial operation result, and send the partial operation result to the main unit.
  • Optionally, the basic unit is specifically configured to multiplex the partial broadcast data block n times to perform the inner product operations of the partial broadcast data block with n basic data blocks, obtaining n partial processing results; after the n partial processing results are accumulated respectively, n partial operation results are obtained and sent to the main unit, where n is an integer greater than or equal to 2.
  • A specific embodiment of the present disclosure further provides an application method of the chip device shown in FIG. 1a; the application method may specifically be used to perform one or any combination of a matrix-multiply-matrix operation, a matrix-multiply-vector operation, a convolution operation, and a fully connected operation.
  • The main unit may also perform pooling operations, regularization (normalization) operations such as batch normalization and LRN, and other neural network operation steps.
  • A specific embodiment of the present application also provides a chip including the chip device shown in FIG. 1a or FIG. 1b.
  • A specific implementation of the present application further provides a smart device, the smart device including the foregoing chip,
  • which integrates a chip device as shown in FIG. 1a or FIG. 1b.
  • The smart device includes, but is not limited to, a smart phone, a tablet computer, a personal digital assistant, a smart watch, a smart camera, a smart TV, a smart refrigerator, and the like.
  • The foregoing devices are listed for illustrative purposes only; the specific embodiments of the present application are not limited to the specific forms of the devices above.
  • When the input data of a fully connected layer is a vector of length L (such as the vector B in "FIG. 3a fully connected 1 - single sample"), that is, the input of the neural network is a single sample,
  • the output of the fully connected layer is a vector of length M,
  • and the weight of the fully connected layer is an M*L matrix (such as the matrix A in "FIG. 3a fully connected 1 - single sample"),
  • then the weight matrix of the fully connected layer is used as the matrix A (i.e., the distribution data block),
  • the input data is used as the vector B (i.e., the broadcast data block),
  • and the operation is performed in accordance with the method shown in FIG. 2.
  • The specific operation method can also be:
  • when the input data of the fully connected layer is a matrix (that is, the input of the neural network is a batch of multiple samples operated on together),
  • the input data of the fully connected layer represents N input samples, each sample being a vector of length L,
  • and the input data is represented by an L*N matrix, as shown by the matrix B in "FIG. 3b fully connected 2 - multiple samples";
  • since the output of the fully connected layer for each sample is a vector of length M,
  • the output data of the fully connected layer is an M*N matrix, such as the result matrix in "FIG. 3b fully connected 2 - multiple samples".
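The fully connected layer described above reduces to a single matrix multiplication; a minimal sketch (Python, with an illustrative function name):

```python
def fully_connected(W, X):
    """FC layer as a matrix multiply: W is the M x L weight matrix (the
    distribution data block), X is an L x N batch of input samples (the
    broadcast data block); the output is M x N."""
    M, L, N = len(W), len(W[0]), len(X[0])
    return [[sum(W[i][k] * X[k][j] for k in range(L)) for j in range(N)]
            for i in range(M)]

# one sample (N = 1): an L-vector in, an M-vector out
print(fully_connected([[1, 0], [0, 1], [1, 1]], [[2], [3]]))  # -> [[2], [3], [5]]
```

For N = 1 this is the matrix-multiply-vector case of FIG. 2; for N > 1 it is the matrix-multiply-matrix case.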
  • When using the chip device for artificial neural network operations, for the convolution layer, the pooling layer, and the regularization layer (also called the normalization layer, such as a BN (batch normalization) layer or an LRN (local response normalization) layer) in the neural network, the procedure is as follows.
  • For each sample of the input data, the main unit uses its data rearrangement circuit to place the input data in a certain order.
  • The order may be an arbitrary order;
  • for example, the input data may be placed in an arrangement such as NHWC or NWHC, in which the coordinate of the C dimension, as represented in the above schematic diagrams, changes fastest.
  • C is the dimension of the innermost layer of the data block,
  • N is the dimension of the outermost layer of the data block,
  • and H and W are the dimensions of the middle layers.
  • H and W are the dimensions along which the operation window slides for convolution and pooling operations (examples of the sliding of the operation window in the W dimension are shown in "FIG. 3e convolution 3 - sliding a" and "FIG. 3f convolution 3 - sliding b").
  • FIG. 3g shows that the size of the operation window equals the size of one of the M convolution kernels.
  • FIG. 3c shows the M convolution kernels.
  • Each convolution kernel is a 5*3*3 three-dimensional data block,
  • so its operation window is also a 5*3*3 three-dimensional data block, as shown in FIG. 3c.
  • For the KH and KW of the M convolution kernels shown, KH corresponds to the H dimension of the input data and KW corresponds to the W dimension of the input data.
  • The gray parts of the figures in FIGS. 3e, 3f, and 3g are the data used for the calculation at each position of the sliding operation window; the window may slide first along H and then along W, or first along W and then along H.
  • Specifically, for convolution, the operation at each sliding window position is the inner product of the data block represented by the gray part of the figure with each of the M convolution kernels shown in FIG. 3c; the convolution outputs one value for each convolution kernel at each sliding window position, that is, M output values per sliding window.
  • For pooling, the operation at each sliding window position is performed on the data block represented by the gray part in the H and W dimensions (in the example in the figure, the maximum of the 9 numbers of the gray data block lying on the same plane is selected, or their average value is calculated); the pooling outputs C values for each sliding window position.
  • C is the third dimension, besides H and W, of the three-dimensional data block of a single sample; N means that a total of N samples simultaneously undergo the operation of this layer.
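The sliding-window convolution described above can be sketched directly (a Python rendering for a single sample, with stride 1 and no padding, which are simplifying assumptions of mine):

```python
def conv_layer(inputs, kernels):
    """inputs: one sample, C x H x W; kernels: M x C x KH x KW. Each window
    position yields M values, one inner product per convolution kernel."""
    C, H, W = len(inputs), len(inputs[0]), len(inputs[0][0])
    M = len(kernels)
    KH, KW = len(kernels[0][0]), len(kernels[0][0][0])
    out = [[[0.0] * (W - KW + 1) for _ in range(H - KH + 1)] for _ in range(M)]
    for m in range(M):                       # one output value per kernel ...
        for h in range(H - KH + 1):          # ... at every sliding window position
            for w in range(W - KW + 1):
                out[m][h][w] = sum(
                    inputs[c][h + i][w + j] * kernels[m][c][i][j]
                    for c in range(C) for i in range(KH) for j in range(KW))
    return out
```

Each assignment to out[m][h][w] is one of the inner products that the text distributes to the basic units.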
  • For LRN, the C dimension is defined as follows: each basic LRN operation selects a continuous data block along the C dimension (i.e., a data block of size Y*1*1), where the Y in Y*1*1 is the extent along the C dimension, the value of Y is less than or equal to the maximum value of the C dimension, the first 1 represents the H dimension, and the second 1 represents the W dimension; the remaining two dimensions are the H and W dimensions. That is, for the three-dimensional data block of each sample, each LRN regularization operation is performed on a continuous portion of data with the same W coordinate and the same H coordinate but different C coordinates.
  • For the regularization algorithm BN, the mean and the variance (or standard deviation) are computed over all the values with the same C-dimension coordinate in the three-dimensional data blocks of the N samples.
  • In the schematic diagrams, a square is used to represent one numerical value, which may also be called a weight; the numbers used in the diagrams are only examples.
  • In practice, each dimension of the data may take any numerical value (including the case where some
  • dimension is 1, in which case the four-dimensional data block automatically becomes a three-dimensional data block; for example, when the number of samples is 1, the input data is a three-dimensional data block, and when the number of convolution kernels is 1, the kernel data is a three-dimensional data block).
  • For a convolutional layer, its weight (all the convolution kernels) is as shown in "FIG. 3c convolution 1 - convolution kernels"; the number of convolution kernels is M, and each convolution kernel consists of C matrices of KH rows
  • and KW columns, so the weight of the convolutional layer can be expressed as a four-dimensional data block with the four dimensions M, C, KH, and KW. The input data of the convolutional layer is a four-dimensional data block composed of N three-dimensional
  • data blocks, each of those three-dimensional data blocks consisting of C feature matrices of H rows and W columns (i.e., a data block with the four dimensions N, C, H, W), as shown in "FIG. 3d convolution 2 - input data".
  • In one implementation, each convolution kernel can be a basic data block.
  • The basic data block can also be chosen at a smaller granularity, such as one planar matrix of a convolution kernel.
  • The set of convolution kernel weights distributed to the i-th basic unit is denoted Ai and contains a subset of the convolution kernels.
  • The i-th basic unit stores the convolution kernel weights Ai distributed to it by the main unit
  • in its register and/or on-chip cache; the parts of the input data (i.e., the operation windows as shown in FIG. 3e, FIG. 3f, or FIG. 3g) are transmitted to each basic unit in a broadcast manner (the broadcasting may use Mode A or Mode B above).
  • The data of the operation window can be broadcast to all the basic units over multiple broadcasts.
  • Part of the operation window data can be broadcast each time; for example, one planar matrix can be broadcast at a time, and as shown in FIG. 3e, the KH*KW matrix of one C plane can be broadcast each time.
  • The data of the first n rows or the first n columns of the KH*KW matrix of one C plane can also be broadcast at a time.
  • The present disclosure does not limit the manner of transmitting the partial data or the arrangement of the partial data; the placement of the input data may be converted into an arrangement in any dimension order, and then each part of the input data is broadcast to the basic units in sequence.
  • The foregoing distribution data may also be sent in a manner similar to the operation windows of the input data, and details are not described here again.
  • In one implementation, the input data is converted into an arrangement in which C is the innermost loop. The effect of this is that the data along the C dimension is packed together, which increases the degree of parallelism of the convolution operation and makes it easier to perform parallel operations on multiple feature maps.
  • In one implementation, the placement of the input data is converted into a layout order of NHWC or NWHC, and each basic unit, for example the i-th basic unit, calculates the inner product of the convolution kernels in the weight set Ai
  • and the corresponding part of the received broadcast data (i.e., the operation window); the data of the corresponding part of the weight Ai can be read directly from the on-chip cache, or can be read into the register first for multiplexing.
  • In one implementation, the inner product results of each basic unit are accumulated and then transmitted back to the main unit.
  • Alternatively, the partial sum obtained each time a basic unit performs an inner product operation can be transferred to the main unit for accumulation; or the partial sums obtained by the inner product operations of each basic unit can be stored in the register and/or on-chip cache of the basic unit, accumulated there, and transferred back to the main unit after the accumulation is completed.
  • It is also possible that the partial sums obtained by the inner product operations of each basic unit are partly stored and accumulated in the register and/or on-chip cache of the basic unit and, in some cases, partly transmitted to the main unit for accumulation, and transferred back to the main unit after the accumulation is completed.
  • A GEMM calculation refers to the operation of matrix-matrix multiplication in the BLAS library, of the general form C = alpha*op(A)*op(B) + beta*C.
  • Auxiliary integers serve as parameters describing the width and height of the matrices A and B.
  • The input matrices A and B are first subjected to their respective op operations; the op operation may be a matrix transposition or, of course, another operation such as a non-linear function operation or pooling.
  • The matrix op operations are implemented using the vector operation function of the main unit; the op of a certain matrix may be empty, in which case the main unit performs no op operation on that matrix.
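For concreteness, the GEMM semantics can be sketched as follows (Python; op is assumed here to be restricted to an optional transposition, signaled by trans_a/trans_b flags that are introduced only for illustration):

```python
def gemm(alpha, A, B, beta, C, trans_a=False, trans_b=False):
    """BLAS-style GEMM: returns alpha * op(A) * op(B) + beta * C."""
    op = lambda X, t: [list(r) for r in zip(*X)] if t else X  # transpose or identity
    A2, B2 = op(A, trans_a), op(B, trans_b)
    M, K, N = len(A2), len(A2[0]), len(B2[0])
    return [[alpha * sum(A2[i][k] * B2[k][j] for k in range(K)) + beta * C[i][j]
             for j in range(N)] for i in range(M)]

print(gemm(2, [[1, 2]], [[3], [4]], 1, [[5]]))  # -> [[27]]
```

In the device, the op step and the alpha/beta scaling use the vector operation function of the main unit, while the op(A)*op(B) multiplication follows the method of FIG. 3.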
  • A GEMV calculation refers to the operation of matrix-vector multiplication in the BLAS library, of the general form alpha*op(A)*B + beta*C.
  • The corresponding op operation is performed on the input matrix A; the chip device uses the method shown in FIG. 2 to complete the matrix-vector multiplication between op(A) and the vector B; the vector operation function of the main unit is used to multiply each value of op(A)*B by alpha; and the vector operation function of the main unit is used to add the corresponding positions of alpha*op(A)*B and beta*C.
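Analogously, the GEMV semantics can be sketched as (Python; the trans_a flag standing in for op is an illustrative simplification):

```python
def gemv(alpha, A, x, beta, y, trans_a=False):
    """BLAS-style GEMV: returns alpha * op(A) * x + beta * y."""
    A2 = [list(r) for r in zip(*A)] if trans_a else A  # op = transpose or identity
    return [alpha * sum(a * v for a, v in zip(row, x)) + beta * yi
            for row, yi in zip(A2, y)]

print(gemv(1, [[1, 2], [3, 4]], [1, 1], 0, [0, 0]))  # -> [3, 7]
```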
  • An activation function operation usually refers to performing a nonlinear operation on each value of a data block (which can be a vector or a multidimensional matrix).
  • The chip device uses the vector calculation function of the main unit to compute the activation of an input vector: the main unit passes each value of the input vector through an activation function (whose input is a single value and whose output is also a single value) and writes the resulting value to the corresponding position of the output vector.
  • The sources of the above input vector include, but are not limited to, external data of the chip device and calculation result data of the basic units forwarded by the branch units of the chip device.
  • The calculation result data may specifically be the operation result of a matrix-multiply-vector operation, or the operation result of a matrix-multiply-matrix operation; the input data may also be a calculation result obtained after the main unit adds a bias.
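The element-wise activation step can be sketched as (Python; ReLU is chosen here only as an illustrative activation function, not one mandated by the text):

```python
def activate(vec, f=lambda v: max(v, 0.0)):
    """Main unit: pass each value of the input vector through the scalar
    activation function f and write it to the corresponding output position."""
    return [f(v) for v in vec]

print(activate([-1.0, 2.0, -0.5, 3.0]))  # -> [0.0, 2.0, 0.0, 3.0]
```

Any scalar function (one value in, one value out) can be substituted for f.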
  • The function of adding two vectors or two matrices can be realized using the main unit; the main unit can also be used to add a vector to each row, or to each column, of a matrix.
  • The matrix may come from a matrix-multiply-matrix operation performed by the device, or from a matrix-multiply-vector operation performed by the device, or from data received externally by the main unit of the device.
  • The vector may come from data received externally by the main unit of the device.
  • the disclosed apparatus may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the unit is only a logical function division, and the actual implementation may have another division manner. Multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be electrical or otherwise.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated units/modules are implemented in the form of hardware.
  • the hardware can be a circuit, including a digital circuit, an analog circuit, and the like.
  • Physical implementations of hardware structures include, but are not limited to, physical devices including, but not limited to, transistors, memristors, and the like.
  • the computing modules in the computing device can be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, ASIC, and the like.
  • the storage unit may be any suitable magnetic storage medium or magneto-optical storage medium such as RRAM, DRAM, SRAM, EDRAM, HBM, HMC, and the like.
  • The described units may or may not be physically separate; that is, they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.


Abstract

A chip device and related products. The chip device includes a main unit and a plurality of basic units in communication with it. The functions of the main unit include: acquiring a data block to be calculated and an operation instruction (S201), and dividing the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction (S202); splitting the distribution data block to obtain a plurality of basic data blocks, distributing the plurality of basic data blocks to the plurality of basic units, and broadcasting the broadcast data block to the plurality of basic units (S203). The functions of the basic units include: performing inner product operations on the basic data blocks and the broadcast data block to obtain operation results, and sending the operation results to the main unit (S204). The main unit processes the operation results to obtain the instruction result of the data block to be calculated and the operation instruction (S205). The scheme has the advantages of short calculation and processing time and low energy consumption.

Description

Chip device and related products. Technical field
The present disclosure relates to the field of communication and chip technology, and in particular to a chip device and related products.
Background
Artificial neural networks (ANNs) have been a research hotspot in the field of artificial intelligence since the 1980s. An ANN abstracts the neuron network of the human brain from the perspective of information processing, establishes a simple model, and forms different networks according to different connection manners. In engineering and academia it is often referred to simply as a neural network or neural-like network. A neural network is a computational model composed of a large number of interconnected nodes (also called neurons). Existing neural network operations are implemented on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit); such operations have high power consumption and long calculation times.
Summary
Embodiments of the present disclosure provide an operation method of a neural network and related products, which can reduce calculation time and lower the power consumption of the module.
In a first aspect, an embodiment of the present disclosure provides an operation method of a neural network, the method being applied in a chip device, the chip device including a main unit and a plurality of basic units, the method including the following steps: the main unit acquires a data block to be calculated and an operation instruction, and divides the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction; the main unit splits the distribution data block to obtain a plurality of basic data blocks, and distributes the plurality of basic data blocks to the plurality of basic units; the main unit broadcasts the broadcast data block to the plurality of basic units; the basic units perform inner product operations on the basic data blocks and the broadcast data block to obtain operation results, and send the operation results to the main unit; the main unit processes the operation results to obtain the instruction result of the data block to be calculated and the operation instruction.
Optionally, the main unit broadcasting the broadcast data block to the plurality of basic units includes:
the main unit broadcasting the broadcast data block to the plurality of basic units in a single broadcast.
Optionally, the basic unit performing an inner product operation on the basic data block and the broadcast data block to obtain an operation result and sending the operation result to the main unit includes:
the basic unit performing inner product processing on the basic data block and the broadcast data block to obtain an inner product processing result, accumulating the inner product processing results to obtain an operation result, and sending the operation result to the main unit.
Optionally, when the operation result is a result of inner product processing, the main unit processing the operation results to obtain the instruction result of the data block to be calculated and the operation instruction includes:
the main unit accumulating the operation results to obtain an accumulated result, and arranging the accumulated result to obtain the instruction result of the data block to be calculated and the operation instruction.
Optionally, the main unit broadcasting the broadcast data block to the plurality of basic units includes:
dividing the broadcast data block into a plurality of partial broadcast data blocks, and broadcasting the plurality of partial broadcast data blocks to the plurality of basic units over multiple broadcasts.
Optionally, the basic unit performing an inner product operation on the basic data block and the broadcast data block to obtain an operation result and sending the operation result to the main unit includes:
the basic unit performing one inner product processing of the partial broadcast data block with the basic data block to obtain an inner product processing result, accumulating the inner product processing results to obtain a partial operation result, and sending the partial operation result to the main unit.
Optionally, the basic unit performing an inner product operation on the basic data block and the broadcast data block to obtain an operation result and sending the operation result to the main unit includes:
the basic unit multiplexing the partial broadcast data block n times to perform the inner product operations of the partial broadcast data block with n basic data blocks to obtain n partial processing results, accumulating the n partial processing results respectively to obtain n partial operation results, and sending the n partial operation results to the main unit, where n is an integer greater than or equal to 2.
In a second aspect, a chip device is provided, the chip device including a main unit and a plurality of basic units. The main unit is configured to acquire a data block to be calculated and an operation instruction, divide the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction, split the distribution data block to obtain a plurality of basic data blocks, distribute the plurality of basic data blocks to the plurality of basic units, and broadcast the broadcast data block to the plurality of basic units. The basic units are configured to perform inner product operations on the basic data blocks and the broadcast data block to obtain operation results and send the operation results to the main unit. The main unit is configured to process the operation results to obtain the instruction result of the data block to be calculated and the operation instruction.
Optionally, the chip device further includes a branch unit disposed between the main unit and the basic units; the branch unit is configured to forward data.
Optionally, the main unit is specifically configured to broadcast the broadcast data block to the plurality of basic units in a single broadcast.
Optionally, the basic unit is specifically configured to perform inner product processing on the basic data block and the broadcast data block to obtain an inner product processing result, accumulate the inner product processing results to obtain an operation result, and send the operation result to the main unit.
Optionally, the main unit is configured to, when the operation result is a result of inner product processing, accumulate the operation results to obtain an accumulated result, and arrange the accumulated result to obtain the instruction result of the data block to be calculated and the operation instruction.
Optionally, the main unit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks and broadcast the plurality of partial broadcast data blocks to the plurality of basic units over multiple broadcasts.
Optionally, the basic unit is specifically configured to perform one inner product processing of the partial broadcast data block with the basic data block to obtain an inner product processing result, accumulate the inner product processing results to obtain a partial operation result, and send the partial operation result to the main unit.
Optionally, the basic unit is specifically configured to multiplex the partial broadcast data block n times to perform the inner product operations of the partial broadcast data block with n basic data blocks, obtaining n partial processing results; after the n partial processing results are accumulated respectively, n partial operation results are obtained and sent to the main unit, where n is an integer greater than or equal to 2.
Optionally, the main unit includes one or any combination of a main register and a main on-chip cache circuit;
the basic unit includes one or any combination of a basic register and a basic on-chip cache circuit.
Optionally, the main unit includes one or any combination of a vector operator circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, and a data rearrangement circuit.
Optionally, the basic unit includes one or any combination of an inner product operator circuit, an accumulator circuit, and the like.
Optionally, the branch units are a plurality of branch units, the main unit is respectively connected to the plurality of branch units, and each branch unit is connected to at least one basic unit.
Optionally, the branch units are a plurality of branch units connected in series and then connected to the main unit, and each branch unit is respectively connected to at least one basic unit.
Optionally, the branch unit is specifically configured to forward data between the main unit and the basic units.
Optionally, the branch unit is specifically configured to forward data between the main unit and the basic units or other branch units.
Optionally, the data is one or any combination of a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.
Optionally, if the operation instruction is a multiplication instruction, the multiplier data block is determined to be the broadcast data block and the multiplicand data block is determined to be the distribution data block;
if the operation instruction is a convolution instruction, the input data block is determined to be the broadcast data block and the convolution kernel is determined to be the distribution data block.
In a third aspect, an application method of the chip device provided in the second aspect is provided, the chip device being used to perform one or any combination of a matrix-multiply-matrix operation, a matrix-multiply-vector operation, a convolution operation, and a fully connected operation.
In a fourth aspect, a chip is provided, the chip integrating the chip device provided in the second aspect.
In a fifth aspect, a smart device is provided, the smart device including the chip provided in the fourth aspect.
Implementing the embodiments of the present disclosure has the following beneficial effects:
It can be seen that, according to the embodiments of the present disclosure, after data and an operation instruction are received, the data is divided into distribution data and broadcast data, and the distribution data is split into basic data blocks and distributed to a plurality of basic units for inner product operations. In this way, the inner product operations, which account for the largest amount of computation, are distributed to the plurality of basic units for simultaneous execution, so the scheme has the advantages of reducing calculation time and saving power consumption.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present disclosure more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are some embodiments of the present disclosure; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1a is a schematic structural diagram of a chip device provided by the present disclosure.
FIG. 1b is a schematic structural diagram of another chip device provided by the present disclosure.
FIG. 1c is a schematic diagram of data distribution of the chip device provided by the present disclosure.
FIG. 1d is a schematic diagram of data return of a chip device.
FIG. 2 is a schematic flowchart of an operation method of a neural network provided by an embodiment of the present disclosure.
FIG. 2a is a schematic diagram of matrix A multiplied by matrix B provided by an embodiment of the present disclosure.
FIG. 3 is a schematic flowchart of an operation method of a neural network provided by an embodiment of the present disclosure.
FIG. 3a is a schematic diagram of single-sample data of fully connected 1.
FIG. 3b is a schematic diagram of multi-sample data of fully connected 2.
FIG. 3c is a schematic diagram of the M convolution kernel data of convolution 1.
FIG. 3d is a schematic diagram of the input data of convolution 2.
FIG. 3e is a schematic diagram of an operation window of a three-dimensional data block of input data.
FIG. 3f is a schematic diagram of another operation window of a three-dimensional data block of input data.
FIG. 3g is a schematic diagram of yet another operation window of a three-dimensional data block of input data.
Detailed description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are some rather than all of the embodiments of the present disclosure. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
The terms "first", "second", "third", "fourth", and the like in the specification, claims, and drawings of the present disclosure are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device comprising a series of steps or units is not limited to the listed steps or units, but optionally further includes steps or units that are not listed, or optionally further includes other steps or units inherent to the process, method, product, or device.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present disclosure. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are they independent or alternative embodiments mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
下面以CPU为例来说明神经网络的运算方法。在神经网络中,矩阵与矩阵的乘法被大量使用,这里以矩阵A与矩阵B的乘法为例来说明CPU的运算方式。假设矩阵A与矩阵B的结果为C,即C=A*B,如下所示:
其中,A=[a11 a12 a13; a21 a22 a23; a31 a32 a33],B=[b11 b12 b13; b21 b22 b23; b31 b32 b33]。
对于CPU来说,其计算得到C所采用的步骤可以为:首先完成第一行的计算,然后完成第二行的计算,最后完成第三行的计算,即对于CPU来说,其运算是一行数据计算完毕以后再执行第二行数据的计算。以上述公式为例,具体的,首先,CPU对第一行完成计算,即需要完成a11*b11+a12*b21+a13*b31、a11*b12+a12*b22+a13*b32和a11*b13+a12*b23+a13*b33;计算完上述以后,再计算a21*b11+a22*b21+a23*b31、a21*b12+a22*b22+a23*b32和a21*b13+a22*b23+a23*b33;最后再计算a31*b11+a32*b21+a33*b31、a31*b12+a32*b22+a33*b32和a31*b13+a32*b23+a33*b33。
所以对于CPU或GPU来说,其需要一行一行的计算,即对第一行计算完毕以后再进行第二行的计算,然后再执行第三行的计算直至所有行计算完毕,对于神经网络来说,其行数可能有上千行的数据,所以其计算的时间很长,并且在计算时,CPU长期处于工作状态,能耗也高。
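上述逐行串行计算的过程,可以用如下Python草图示意(仅为帮助理解的示意代码,并非本披露装置的实现,函数名为举例假设):

```python
def matmul_row_by_row(A, B):
    # 逐行计算C=A*B:第一行算完再算第二行,每个输出元素是一次内积
    rows, inner, cols = len(A), len(B), len(B[0])
    C = []
    for i in range(rows):          # 行与行之间完全串行
        row = []
        for j in range(cols):
            acc = 0
            for k in range(inner): # 最内层循环即一次内积运算
                acc += A[i][k] * B[k][j]
            row.append(acc)
        C.append(row)
    return C
```

三重循环中行与行之间完全串行,这正是行数达到上千行时计算时间很长的原因。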
参阅图1b,图1b为一种芯片装置的结构示意图,如图1b所示,该芯片装置包括:主单元电路、基本单元电路和分支单元电路。其中,主单元电路可以包括寄存器和/或片上缓存电路,该主单元还可以包括:向量运算器电路、ALU(arithmetic and logic unit,算数逻辑单元)电路、累加器电路、矩阵转置电路、DMA(Direct Memory Access,直接内存存取)电路、数据重排电路等的一个或任意组合;每个基础单元可以包括基础寄存器和/或基础片上缓存电路;每个基础单元还可以包括:内积运算器电路、向量运算器电路、累加器电路等中一个或任意组合。所述电路都可以是集成电路。如具有分支单元时,其中主单元与分支单元连接,该分支单元与基本单元连接,该基本单元用于执行数据块之间的内积运算,该主单元用于收发外部数据,以及将外部数据分发至分支单元,该分支单元用于收发主单元或基本单元的数据。如图1b所示的结构适合复杂数据的计算,因为对于主单元来说,其连接的单元的数量有限,所以需要在主单元与基本单元之间添加分支单元以实现更多的基本单元的接入,从而实现对复杂数据块的计算。
分支单元和基础单元的连接结构可以是任意的,不局限在图1b的H型结构。可选的,主单元到基础单元是广播或分发的结构,基础单元到主单元是收集(gather)的结构。广播,分发和收集的定义如下:
所述主单元到基础单元的数据传递方式可以包括:
主单元与多个分支单元分别相连,每个分支单元再与多个基础单元分别相连。
主单元与一个分支单元相连,该分支单元再连接一个分支单元,依次类推,串联多个分支单元,然后,每个分支单元再与多个基础单元分别相连。
主单元与多个分支单元分别相连,每个分支单元再串联多个基础单元。
主单元与一个分支单元相连,该分支单元再连接一个分支单元,依次类推,串联多个分支单元,然后,每个分支单元再串联多个基础单元。
分发数据时,主单元向部分或者全部基础单元传输数据,各个接收数据的基础单元收到的数据可以不同;
广播数据时,主单元向部分或者全部基础单元传输数据,各个接收数据的基础单元收到相同的数据。
收集数据时,部分或全部基础单元向主单元传输数据。需要说明的是,如图1a或如图1b所示的芯片装置可以是一个单独的物理芯片,当然在实际应用中,该芯片装置也可以集成在其他的芯片内(例如CPU,GPU),本申请具体实施方式并不限制上述芯片装置的物理表现形式。
参阅图1c,图1c为一种芯片装置的数据分发示意图,如图1c的箭头所示,该箭头为数据的分发方向,如图1c所示,主单元接收到外部数据以后,将外部数据拆分以后,分发至多个分支单元,分支单元将拆分数据发送至基本单元。
参阅图1d,图1d为一种芯片装置的数据回传示意图,如图1d的箭头所示,该箭头为数据的回传方向,如图1d所示,基本单元将数据(例如内积计算结果)回传给分支单元,分支单元再回传至主单元。
参阅图1a,图1a为另一种芯片装置的结构示意图,该芯片装置包括:主单元以及基本单元,该主单元与基本单元连接。如图1a所示的结构由于基本单元与主单元直接物理连接,所以该结构连接的基本单元的数量有限,其适合简单的数据的计算。
参阅图2,图2提供了一种使用上述芯片装置进行神经网络的运算方法,该方法采用如图1a或如图1b所示的芯片装置来执行,该方法如图2所示,包括如下步骤:
步骤S201、芯片装置的主单元获取待计算的数据块以及运算指令。
上述步骤S201中的待计算的数据块具体可以为,矩阵、向量、三维数据、四维数据、多维数据等等,本披露具体实施方式并不限制上述数据块的具体表现形式,该运算指令具体可以为,乘法指令、卷积指令、加法指令、减法指令、BLAS(英文:Basic Linear Algebra Subprograms,基础线性代数子程序)函数或激活函数等等。
步骤S202、主单元依据该运算指令对该待计算的数据块划分成分发数据块以及广播数据块。
上述步骤S202的实现方法具体可以为:
如该运算指令为乘法指令,确定乘数数据块为广播数据块,被乘数数据块 为分发数据块。
如该运算指令为卷积指令,确定输入数据块为广播数据块,卷积核为分发数据块。
步骤S2031、主单元对该分发数据块进行拆分处理得到多个基本数据块,将该多个基本数据块分发至多个基本单元,
步骤S2032,主单元将该广播数据块广播至多个基本单元。
可选的,上述步骤S2031以及步骤S2032也可以采用循环执行,对数据量比较大的情况下,主单元对该分发数据块进行拆分处理得到多个基本数据块,将每个基本数据块拆分成m个基本数据子块,对广播数据块也拆分成m个广播数据子块,主单元每次分发一个基本数据子块以及广播一个广播数据子块,该基本数据子块与广播数据子块均为能够执行并行神经网络计算的数据块。例如,以一个1000*1000的矩阵A*1000*1000的矩阵B为例,该基本数据块可以为矩阵A的第z行数据,该基本数据子块可以为矩阵A第z行数据中的前20列数据,该广播数据子块可以为矩阵B第z列中的前20行数据。
上述步骤S203中的基本数据块具体可以为,能够执行内积运算的最小数据块,以矩阵乘法为例,该基本数据块可以为矩阵的一行数据,以卷积为例,该基本数据块可以为一个卷积核的权值。
上述步骤S203中的分发的方式可以参见下述实施例的描述,这里不再赘述,广播该广播数据块的方法也可以参见下述实施例的描述,这里不再赘述。
步骤S2041、芯片装置的基本单元对该基本数据块与广播数据块执行内积运算得到运算结果(可能是中间结果)。
步骤S2042、如果运算结果不是中间结果,将运算结果回传至主单元。
上述步骤S204中的回传方式可以参见下述实施例的描述,这里不再赘述。
步骤S205、主单元对该运算结果处理得到该待计算的数据块以及运算指令的指令结果。
上述步骤S205中的处理方式可以为累加、排序等等方式,本披露并不限于上述处理的具体方式,该具体的方式需要依据不同的运算指令来配置,例如还可以包括执行非线性变换等。
本披露提供的技术方案在执行运算时,由主单元接收外部数据,该外部数据包括待计算的数据块以及运算指令,获取到待计算的数据块以及运算指令以后,依据该运算指令将该待计算的数据块划分为分发数据块以及广播数据块,将分发数据块拆分成多个基本数据块,将广播数据块广播给多个基本单元,将多个基本数据块分发至多个基本单元,多个基本单元分别对该基本数据块以及广播数据块执行内积运算得到运算结果,多个基本单元将该运算结果返回给主单元,主单元根据返回的运算结果得到该运算指令的指令结果。此技术方案的技术点在于,针对神经网络,其很大的运算量在于数据块与数据块之间的内积运算,内积运算的开销大,计算时间长,所以本披露实施例依据该运算指令以及待计算的数据块首先区分该待计算的数据块中的分发数据块以及广播数据块,广播数据块即实现内积运算时必须使用的数据块,而分发数据块属于在内积运算中可以拆分的数据块。以矩阵乘法为例,如待计算的数据块为矩阵A和矩阵B,其运算指令为乘法指令(A*B),依据矩阵乘法的规则,确定矩阵A为可以拆分的分发数据块,确定矩阵B为广播数据块,因为对于矩阵乘法来说,被乘数矩阵A可以被拆分成多个基本数据块,乘数矩阵B可以为广播数据块。依据矩阵乘法的定义,被乘数矩阵A的每行数据需要分别与乘数矩阵B执行内积运算,所以本申请的技术方案将矩阵A分成M个基本数据块,M个基本数据块中,每个基本数据块可以为矩阵A的一行数据。所以对于矩阵乘法来说,其耗时较大的内积运算由多个基本单元分别执行,多个基本单元可以快速地并行运算出结果,从而减少计算时间,较少的计算时间也能够减少芯片装置的工作时间,从而降低功耗。
下面通过实际的例子来说明本披露提供的技术方案的效果。如图2a所示,为一种矩阵A乘以向量B的示意图,如图2a所示,矩阵A具有M行,L列,向量B具有L行,假设运算器运算矩阵A的一行与向量B的内积所需时间为t1,如采用CPU或GPU计算,其需要计算完一行以后再进行下一行,那么对于GPU或CPU计算的方法计算的时间T0=m*t1。而采用本披露具体实施例提供的技术方案,这里假设基本单元具有M个,则矩阵A会被拆分成M个基本数据块,每个基本数据块为矩阵A的一行数据,M个基本单元同时执行内积运算,那么其计算时间为t1,采用本披露具体实施例提供的技术方案所需要的时间T1=t1+t2+t3,其中t2可以为主单元拆分数据的时间,t3可以为处理内积运算的运算结果得到指令结果所需的时间,由于拆分数据以及处理运算结果的计算量非常小,所以花费的时间非常少,所以T0>>T1,采用本披露具体实施方式的技术方案能够非常明显地减少计算时间。同时,由于T0>>T1,本披露提供的芯片装置的工作时间特别短,通过实验证明,当芯片装置的工作时间非常短时,其能耗会远低于工作时间长时的能耗,所以其具有节省能耗的优点。
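上述“拆分-分发-广播-内积-合并”的流程,可以用如下Python草图模拟(基本单元以循环模拟,实际硬件中各基本单元并行执行;函数名与分配方式均为举例假设):

```python
def simulate_chip(A, B, num_units):
    # 主单元:依据乘法指令,矩阵A为分发数据块,按行拆成基本数据块
    units = [[] for _ in range(num_units)]
    for i, row in enumerate(A):
        units[i % num_units].append((i, row))   # 轮流分发基本数据块
    # 矩阵B为广播数据块,一次广播给所有基本单元(方式甲)
    cols = list(zip(*B))
    results = {}
    for unit in units:                          # 各基本单元在硬件中并行执行
        for i, row in unit:
            results[i] = [sum(a * b for a, b in zip(row, c)) for c in cols]
    # 主单元将回传的运算结果排列成指令结果
    return [results[i] for i in range(len(A))]
```

逐行的内积互不依赖,因此M个基本单元可以同时各算一行,这就是T0=m*t1缩短为T1≈t1的来源。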
上述步骤S203中主单元将该广播数据块广播至多个基本单元的实现方式有多种,具体可以为:
方式甲、将广播数据块通过一次广播至该多个基本单元。(所述广播是指进行“一对多”的数据传输,即由主单元同时向多个(全部或者一部分)基础单元发送相同的数据块)例如,矩阵A*矩阵B,其中矩阵B为广播数据块,将矩阵B通过一次广播至该多个基本单元,又如,在卷积中,该输入数据为广播数据块,将该输入数据块一次广播至该多个基本单元。此方式的优点在于能够节省主单元与基本单元的数据传输量,即只经过一次广播即能够将所有的广播数据传输至多个基本单元。
方式乙、将广播数据块分成多个部分广播数据块,将多个部分广播数据块通过多次广播至该多个基本单元,例如,矩阵B通过多次广播至该多个基本单元,具体的,每次广播矩阵B的N列数据。此方式的优点在于可以降低基本单元的配置,因为对于基本单元其配置的寄存器的存储空间不可能很大,如果对于数据量比较大的矩阵B,一次将矩阵B下发给基本单元,那么基本单元存储这些数据就需要比较大的寄存器空间,因为基本单元的数量众多,提高寄存器空间必然对成本的增加产生很大影响,所以此时采用多次广播该广播数据块的方案,即对于基本单元来说,其只需要存储每次广播的广播数据块的部分数据即可,从而降低成本。
需要说明的是,上述步骤S203中的将多个基本数据块分发至多个基本单元也可以采用上述方式甲或方式乙,不同点仅仅在于,其传输的方式为单播方式并且传输的数据为基本数据块。
上述步骤S204的实现方法具体可以为:
如采用方式甲广播该广播数据块以及方式甲的方式分发基本数据块(如图3a所示),基本单元对该基本数据块与广播数据块执行内积处理得到内积处理结果,即一次执行一行的内积运算,将该内积处理结果(运算结果中一种)发送至主单元,主单元将内积处理结果累加,当然在实际应用中,该基本单元可以将该内积处理结果累加后,将累加后的结果(运算结果中的另一种)发送至主单元。上述方式可以减少主单元与基本单元之间的数据传输量,进而提高计算的速度。
如采用方式乙广播数据块,在一种可选的技术方案中,基本单元每接收到部分广播数据块,执行一次基本数据块与部分广播数据块的部分内积运算得到部分处理结果,基本单元将该处理结果发送至主单元,主单元将处理结果累加。在另一种可选方案中,如基本单元接收的基本数据块为n个,复用该广播数据块执行该广播数据块与该n个基本数据块内积运算得到n个部分处理结果,基本单元将该n个处理结果发送至主单元,主单元将n个处理结果分别累加。当然上述累加也可以在基本单元内执行。
对于上述情况一般为广播数据块的数据量非常大且分发数据块也较大,因为对于芯片装置来说,由于其属于硬件的配置,所以其配置的基本单元虽然在理论上可以无数个,但是在实际中其数量有限,一般为几十个基本单元,该数量随着技术发展,可能会不断变化,比如增加。但是对于神经网络的矩阵乘矩阵的运算中,该矩阵A的行数可能有数千行,矩阵B的列数也有数千列,那么一次广播数据将矩阵B下发给基本单元就无法实现,那么其实现的方式可以为,一次广播矩阵B的部分数据,例如前5列数据,对于矩阵A来说也可以采用类似的方式,对于基本单元来说,其就可以每次进行部分内积计算,然后将部分内积计算的结果存储在寄存器内,等该行所有的内积运算执行完毕后,将该行所有的部分内积计算的结果累加即可以得到一种运算结果,将该运算结果发送至主单元。此种方式具有提高计算速度的优点。
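方式乙下基本单元的部分内积累加过程,可以用如下草图示意(chunk对应每次广播的部分广播数据块的长度,名称均为举例假设):

```python
def chunked_inner_product(row, col, chunk):
    # 基本单元每次只收到部分广播数据,执行部分内积并在"寄存器"中累加
    acc = 0
    for s in range(0, len(row), chunk):
        acc += sum(a * b for a, b in zip(row[s:s + chunk], col[s:s + chunk]))
    return acc  # 该行所有部分内积执行完毕后,累加值即一种运算结果
```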
参阅图3,图3提供了一种神经网络的计算方法,本实施例中的计算以矩阵A*矩阵B的计算方式来说明,该矩阵A*矩阵B可以为图2a所示的矩阵示意图,为了方便说明,如图3所示的神经网络的计算方法在如图1b所示的芯片装置内执行,如图1b所示,该芯片装置具有16个基本单元,为了方便描述以及分配,这里设置如图2a所示的M的取值可以为32,该N的取值可以为15,L的取值可以为20。当然可以理解,计算装置可以有任意多个基本单元。该方法如图3所示,包括如下步骤:
步骤S301、主单元接收矩阵A、矩阵B以及乘法运算指令A*B。
步骤S302、主单元依据乘法运算指令A*B确定矩阵B为广播数据块,矩阵A为分发数据块,将矩阵A拆分成32个基本数据块,每个基本数据块为矩阵A的一行数据。
步骤S303、主单元将32个基本数据块均匀分配给16个基本单元,即每个基本单元接收2个基本数据块,这两个数据块的分配方式可以是任意不重复的分配顺序。
上述步骤S303的分配方式可以采用一些其他分配方式,例如当数据块数量无法恰好均分给每个基础单元的时候,可以不平均分配数据块给每个基础单元;也可以对其中的一些无法均分的数据块进行分割然后平均分配等方式,本披露具体实施方式并不限制上述基本数据块如何分配给多个基本单元的方式。
步骤S304、主单元提取矩阵B的前几列(比如前5列)的部分数据,将矩阵B前5列的部分数据广播至16个基本单元。
步骤S305、16个基本单元二次复用该前5列的部分数据与2个基本数据块执行内积运算以及累加运算得到32*5个前处理结果,将32*5个前处理结果发送至主单元。
步骤S306、主单元提取矩阵B的中间5列的部分数据,将矩阵B中间5列的部分数据广播至16个基本单元。
步骤S307、16个基本单元二次复用该中5列的部分数据与2个基本数据块执行内积运算以及累加运算得到32*5个中处理结果,将32*5个中处理结果发送至主单元。
步骤S308、主单元提取矩阵B的后5列的部分数据,将矩阵B后5列的部分数据广播至16个基本单元。
步骤S309、16个基本单元二次复用该后5列的部分数据与2个基本数据块执行内积运算以及累加运算得到32*5个后处理结果,将32*5个后处理结果发送至主单元。
步骤S310、主单元将32*5个前处理结果、32*5个中处理结果以及32*5个后处理结果按前、中、后组合在一起得到一个32*15的矩阵C,该矩阵C即为矩阵A*矩阵B的指令结果。
如图3所示的技术方案将矩阵A拆分成32个基本数据块,然后分批次广播矩阵B,使得基本单元能够分批次的得到指令结果,由于该内积拆分成16个基本单元来计算,所以能够极大的降低计算的时间,所以其具有计算时间短,能耗低的优点。
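上述步骤S304至S310分批广播、分批计算再拼接的过程,可以抽象为如下草图(此处不区分16个基本单元,只验证“按列分批计算再按前、中、后拼接”与一次性计算结果一致,名称均为举例假设):

```python
def matmul_by_column_batches(A, B, batch):
    # 每次只广播矩阵B的batch列:每批得到指令结果的一段列
    n = len(B[0])
    parts = []
    for s in range(0, n, batch):
        cols = list(zip(*B))[s:s + batch]
        parts.append([[sum(a * b for a, b in zip(row, c)) for c in cols]
                      for row in A])
    # 主单元将前、中、后处理结果按批次拼接成指令结果
    return [sum((p[i] for p in parts), []) for i in range(len(A))]
```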
参阅图1a,图1a为本披露提供的一种芯片装置,所述芯片装置包括:主单元以及基本单元,所述主单元为硬件芯片单元,所述基本单元也为硬件芯片单元;
所述主单元,用于执行神经网络运算中的各个连续的运算以及与所述基本单元传输数据;
所述基本单元,用于依据所述主单元传输的数据执行神经网络中并行加速的运算,并将运算结果传输给所述主单元。
上述并行加速的运算包括但不限于:数据块与数据块之间的乘法运算、卷积运算等等大规模并且可以并行的运算。
上述各个连续的运算包括但不限于:累加运算、矩阵转置运算、数据排序运算等等连续的运算。
所述芯片装置包括主单元以及多个基本单元:所述主单元,用于获取待计算的数据块以及运算指令,依据该运算指令对所述待计算的数据块划分成分发数据块以及广播数据块;对所述分发数据块进行拆分处理得到多个基本数据块,将所述多个基本数据块分发至所述多个基本单元,将所述广播数据块广播至所述多个基本单元;所述基本单元,用于对所述基本数据块与所述广播数据块执行内积运算得到运算结果,将所述运算结果发送至所述主单元;所述主单元,用于对所述运算结果处理得到所述待计算的数据块以及运算指令的指令结果。
可选的,所述芯片装置还包括:分支单元,所述分支单元设置在主单元与基本单元之间;所述分支单元,用于转发数据。
可选的,所述主单元,具体用于将所述广播数据块通过一次广播至所述多个基本单元。
可选的,所述基本单元,具体用于将所述基本数据块与所述广播数据块执行内积处理得到内积处理结果,将所述内积处理结果累加得到运算结果,将所述运算结果发送至所述主单元。
可选的,所述主单元,用于在所述运算结果为内积处理的结果时,对所述运算结果累加后得到累加结果,将该累加结果排列得到所述待计算的数据块以及运算指令的指令结果。
可选的,所述主单元,具体用于将所述广播数据块分成多个部分广播数据块,将所述多个部分广播数据块通过多次广播至所述多个基本单元。
可选的,所述基本单元,具体用于将所述部分广播数据块与所述基本数据块执行一次内积处理后得到内积处理结果,将所述内积处理结果累加得到部分运算结果,将所述部分运算结果发送至所述主单元。
可选的,所述基本单元,具体用于复用n次该部分广播数据块执行该部分广播数据块与该n个基本数据块内积运算得到n个部分处理结果,将n个部分处理结果分别累加后得到n个部分运算结果,将所述n个部分运算结果发送至主单元,所述n为大于等于2的整数。
本披露具体实施方式还提供一种如图1a所示的芯片装置的应用方法,该应用方法具体可以用于执行矩阵乘矩阵运算、矩阵乘向量运算、卷积运算或全连接运算中的一种或任意组合。
具体地,所述主单元还可以执行pooling(池化)运算,规则化(归一化)运算,如batch normalization,lrn等神经网络运算步骤。
本申请具体实施方式还提供一种芯片,该芯片包括如图1a或如1b所示的芯片装置。
本申请具体实施方式还提供一种智能设备,该智能设备包括上述芯片,该 芯片集成有如图1a或如图1b所示的芯片装置。该智能设备包括但不限于:智能手机、平板电脑、个人数字助理、智能手表、智能摄像头、智能电视、智能冰箱等等智能设备,上述设备仅仅为了举例说明,本申请具体实施方式并不局限上述设备的具体表现形式。
上述矩阵乘矩阵运算可以参见如图3所示实施例的描述。这里不再赘述。
使用芯片装置进行全连接运算
如果全连接层的输入数据是一个长度为L的向量(如“图3a全连接1-单样本”中向量B)(即神经网络的输入是单个样本的情况),全连接层的输出是一个长度为M的向量,全连接层的权值是一个M*L的矩阵(如“图3a全连接1-单样本”中矩阵A),则以全连接层的权值矩阵作为矩阵A(即分发数据块),输入数据作为向量B(即广播数据块),按照上述如图2所示的方法执行运算。具体的运算方法可以为:
如果全连接层的输入数据是一个矩阵(即神经网络的输入是多个样本作为batch一起进行运算的情况)(全连接层的输入数据表示N个输入样本,每个样本是一个长度为L的向量,则输入数据用一个L*N的矩阵表示,如“图3b全连接2-多样本”中矩阵B所示),全连接层对每一个样本的输出是一个长度为M的向量,则全连接层的输出数据是一个M*N的矩阵,如“图3b全连接2-多样本”中的结果矩阵,全连接层的权值是一个M*L的矩阵(如“图3b全连接2-多样本”中矩阵A),则以全连接层的权值矩阵作为矩阵A(即分发数据块),输入数据矩阵作为矩阵B(即广播数据块),或者以全连接层的权值矩阵作为矩阵B(即广播数据块),输入数据矩阵作为矩阵A(即分发数据块),按照上述如图2所示的方法执行运算。
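多样本全连接层归结为矩阵乘矩阵的计算,可以用如下草图示意(权值W为M*L的矩阵按行拆分,输入X为L*N的矩阵作广播,输出为M*N,名称均为举例假设):

```python
def fully_connected(W, X):
    # 输出矩阵的每个元素是一行权值与一列输入样本的内积
    return [[sum(w * x for w, x in zip(w_row, x_col)) for x_col in zip(*X)]
            for w_row in W]
```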
芯片装置
使用所述芯片装置进行人工神经网络运算时候,神经网络中的卷积层,池化层,规则化层(也叫归一化层,如BN(Batch normalization)或者LRN(Local Response Normalization))等的输入数据如“图3d卷积2-输入数据”所示(为了表示清楚,这里对表示每个样本的三维数据块使用C=5,H=10,W=12作为示例进行说明,实际使用中N,C,H,W的大小不局限在图3d中所示的数值),图3d中的每一个三维数据块表示一个样本对应于这一层的输入数据,每个三维数据块的三个维度分别是C、H和W,共有N个这样的三维数据块。
在进行上述这些神经网络层的计算时,主单元接收到输入数据后,对每一个输入数据的样本,使用主单元的数据重排电路,将输入数据按照一定的顺序摆放,该顺序可以是任意的顺序;
可选的,该顺序将按上述示意图代表的C维度坐标变化最快的方式摆放输入数据,例如NHWC和NWHC等。其中,C表示数据块最内层的维度,该N表示数据块最外层的维度,H和W是中间层的维度。这样的效果是C的数据是挨在一起的,由此易于提高运算的并行度,更易于多个特征图(Feature map)进行并行运算。
以下解释对于不同的神经网络运算,C、H和W如何理解。对于卷积和池化来说,H和W是进行卷积和池化运算时的相关运算窗口滑动维度(运算窗口在W维度上滑动的示例图如“图3e卷积3-滑动a”和“图3f卷积3-滑动b”这两个图表示,运算窗口在H维度上滑动的示意图如图3g所示),其中运算窗口的大小与M个卷积核中的一个卷积核的大小一致,如图3c所示的M个卷积核,每个卷积核为5*3*3的三维数据块,那么其运算窗口也为5*3*3的三维数据块,对于如图3c所示的M个卷积核中的KH以及KW,KH对应的维度为输入数据的H维度,KW对应的维度为输入数据的W维度。图3e、3f、3g中灰色部分方块是每一次滑动运算窗口进行运算使用的数据,其滑动的方向可以是先以H为滑动方向、完成以后再以W为滑动方向,或者先以W为滑动方向、完成以后再以H为滑动方向。
具体地,对于卷积来说,每一个滑动窗口处的运算是图中灰色部分方块表示的数据块与“图3c卷积1-卷积核”表示的M个卷积核数据块分别进行内积运算,卷积将对每一个滑动窗口位置对应每一个卷积核输出一个数值,即对于每个滑动窗口具有M个输出数值;对于池化来说,每一个滑动窗口处的运算是图中灰色方块表示的数据块在H和W维度(在图中的例子里是该灰色数据块中处于同一个平面上的9个数中)进行选取最大值,或者计算平均值等运算,池化将对每一个滑动窗口位置输出C个数值。C是单个样本的三维数据块中除了H和W之外的另一个维度,N代表一共有N个样本同时进行这一层的运算。
对于规则化算法中的LRN来说,C维度的定义是:每一次基本的LRN运算沿着C维度选取一个连续的数据块(即Y*1*1的数据块),其中Y*1*1的数据块中的Y为C维度上的取值,Y的取值小于等于C维度的最大值,第一个1表示H维度,第二个1表示W维度;剩下的两个维度定义成H和W维度,即,对每一个样本的三维数据块中,每一次进行LRN规则化的运算时,要对相同的W坐标和相同的H坐标中不同C坐标中连续的一部分数据进行运算。对于规则化算法BN来说,将N个样本的三维数据块中所有的具有相同的C维度上的坐标的数值求平均值和方差(或者标准差)。
所述“图3c-图3g”中均使用一个方块表示一个数值,也可以称为一个权值;示意图中所使用的数字均仅限举例说明,实际情况中维度数据可能是任意数值(包括某个维度为1的情况,这种情况下,所述四维数据块自动成为三维数据块,例如,当同时计算的样本数量为1的情况下,输入数据就是一个三维数据块;再例如,当卷积核数量为1的情况下,卷积核数据为一个三维数据块)。
下面说明使用所述芯片装置进行输入数据B和卷积核A之间的卷积运算。
对于一个卷积层,其权值(所有的卷积核)如“图3c卷积1-卷积核”所示,记其卷积核的数量为M,每个卷积核由C个KH行KW列的矩阵组成,所以卷积层的权值可以表示为一个四个维度分别是M,C,KH,KW的四维数据块;卷积层的输入数据为四维数据块,由N个三维数据块组成,每个所述三维数据块由C个H行W列的特征矩阵组成(即四个维度分别是N,C,H,W的数据块),如“图3d卷积2-输入数据”所示。将M个卷积核中的每一个卷积核的权值从主单元分发到K个基础单元中的某一个上,保存在基础单元的片上缓存和/或寄存器中(此时的M个卷积核为分发数据块,每个卷积核可以是一个基本数据块,当然在实际应用中,也可以将该基本数据块变更成更小的粒度,例如一个卷积核的一个平面的矩阵);具体的分发方法可以为:如果卷积核的个数M<=K,则给M个基础单元分别分发一个卷积核的权值;如果卷积核的个数M>K,则给每个基础单元分别分发一个或多个卷积核的权值(分发到第i个基础单元的卷积核权值集合为Ai,共有Mi个卷积核)。在每个基础单元中,例如第i个基础单元中:将收到的由主单元分发的卷积核权值Ai保存在其寄存器和/或片上缓存中;将输入数据中各部分(即如图3e、图3f或如图3g所示的滑动窗口)以广播的方式传输给各个基础单元(上述广播的方式可以采用上述方式甲或方式乙),在广播时,可以通过多次广播的方式将运算窗口的数据广播至所有的基本单元,具体的,可以每次广播部分运算窗口的数据,例如每次广播一个平面的矩阵,以图3e为例,每次可以广播一个C平面的KH*KW矩阵,当然在实际应用中,还可以一次广播一个C平面的KH*KW矩阵中的前n行或前n列的数据,本披露并不限制上述部分数据的发送方式以及部分数据的排列方式;可以将输入数据的摆放方式变换为任意维度顺序的摆放方式,然后按顺序依次广播各部分输入数据给基础单元。可选的,上述分发数据即卷积核的发送方式也可以采用与输入数据的运算窗口类似的发送方式,这里不再赘述。可选的,将输入数据的摆放方式变换为C为最内层循环的摆放方式。这样的效果是C的数据是挨在一起的,由此提高卷积运算的并行度,更易于多个特征图(Feature map)进行并行运算。可选的,将输入数据的摆放方式变换为维度顺序是NHWC或者NWHC的摆放方式。每个基础单元,例如第i个基础单元,计算权值Ai中的卷积核和接收到的广播的数据对应部分(即运算窗口)的内积;权值Ai中对应部分的数据可以直接从片上缓存中读出来使用,也可以先读到寄存器中以便进行复用。每个基础单元将内积运算的结果进行累加并传输回主单元。可以将每次基础单元执行内积运算得到的部分和传输回主单元进行累加;也可以将每次基础单元执行的内积运算得到的部分和保存在基础单元的寄存器和/或片上缓存中,累加结束之后传输回主单元;还可以将每次基础单元执行的内积运算得到的部分和在部分情况下保存在基础单元的寄存器和/或片上缓存中进行累加,部分情况下传输到主单元进行累加,累加结束之后传输回主单元。
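单个样本(C*H*W)与M个卷积核(C*KH*KW)在每个滑动窗口处做内积的过程,可以用如下草图示意(步长为1、无填充,是为帮助理解而做的简化假设,并非本披露装置的实现):

```python
def conv2d(inp, kernels):
    # inp: C*H*W 的三维数据块;kernels: M 个 C*KH*KW 的卷积核
    # 每个滑动窗口与每个卷积核做一次内积,每个窗口输出 M 个数值
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    M = len(kernels)
    KH, KW = len(kernels[0][0]), len(kernels[0][0][0])
    out = []
    for m in range(M):
        plane = []
        for i in range(H - KH + 1):        # 运算窗口在 H 维度上滑动
            row = []
            for j in range(W - KW + 1):    # 运算窗口在 W 维度上滑动
                acc = 0
                for c in range(C):
                    for ki in range(KH):
                        for kj in range(KW):
                            acc += inp[c][i + ki][j + kj] * kernels[m][c][ki][kj]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out
```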
采用芯片装置实现BLAS(英文:Basic Linear Algebra Subprograms,基础线性代数子程序)函数的方法
GEMM
GEMM计算是指:BLAS库中的矩阵-矩阵乘法的运算。该运算的通常表示形式为:C=alpha*op(A)*op(B)+beta*C,其中,A和B为输入的两个矩阵,C为输出矩阵,alpha和beta为标量,op代表对矩阵A或B的某种操作,此外,还会有一些辅助的整数作为参数来说明矩阵A和B的宽高;
使用所述装置实现GEMM计算的步骤为:
对输入矩阵A和矩阵B进行各自相应的op操作;该op操作可以为矩阵的转置操作,当然还可以是其他的操作,例如,非线性函数运算,池化等。利用主单元的向量运算功能,实现矩阵op操作;如某个矩阵的op可以为空,则主单元对该矩阵不执行任何操作;
采用如图2所示的方法完成op(A)与op(B)之间的矩阵乘法计算;
利用主单元的向量运算功能,对op(A)*op(B)的结果中的每一个值进行乘以alpha的操作;
利用主单元的向量运算功能,实现矩阵alpha*op(A)*op(B)和beta*C之间对应位置相加的步骤;
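上述GEMM的计算步骤可以用如下草图示意(op以转置为例,op为空时不做任何操作,函数名与参数名均为举例假设):

```python
def gemm(alpha, A, B, beta, C, op_a=None, op_b=None):
    # C = alpha*op(A)*op(B) + beta*C
    transpose = lambda M: [list(col) for col in zip(*M)]
    A = transpose(A) if op_a == "T" else A       # 对A执行op操作
    B = transpose(B) if op_b == "T" else B       # 对B执行op操作
    AB = [[sum(a * b for a, b in zip(ra, cb)) for cb in zip(*B)] for ra in A]
    # 逐元素乘alpha,再与beta*C对应位置相加
    return [[alpha * ab + beta * c for ab, c in zip(r1, r2)]
            for r1, r2 in zip(AB, C)]
```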
GEMV
GEMV计算是指:BLAS库中的矩阵-向量乘法的运算。该运算的通常表示形式为:C=alpha*op(A)*B+beta*C,其中,A为输入矩阵,B为输入的向量,C为输出向量,alpha和beta为标量,op代表对矩阵A的某种操作;
使用所述装置实现GEMV计算的步骤为:
对输入矩阵A进行相应的op操作;芯片装置的使用如图2所示的方法完成矩阵op(A)与向量B之间的矩阵-向量乘法计算;利用主单元的向量运算功能,对op(A)*B的结果中的每一个值进行乘以alpha的操作;利用主单元的向量运算功能,实现矩阵alpha*op(A)*B和beta*C之间对应位置相加的步骤。
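GEMV的计算步骤同理,可以用如下草图示意(op以转置为例,名称均为举例假设):

```python
def gemv(alpha, A, x, beta, y, op_a=None):
    # y = alpha*op(A)*x + beta*y
    if op_a == "T":
        A = [list(col) for col in zip(*A)]       # 对A执行op操作
    Ax = [sum(a * b for a, b in zip(row, x)) for row in A]   # 矩阵乘向量
    return [alpha * v + beta * w for v, w in zip(Ax, y)]
```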
采用芯片装置实现激活函数的方法
激活函数通常是指对一个数据块(可以是向量或者多维矩阵)中的每个数执行非线性运算。比如,激活函数可以是:y=max(m,x),其中x是输入数值,y是输出数值,m是一个常数;激活函数还可以是:y=tanh(x),其中x是输入数值,y是输出数值;激活函数也可以是:y=sigmoid(x),其中x是输入数值,y是输出数值;激活函数也可以是一个分段线性函数;激活函数可以是任意输入一个数,输出一个数的函数。
实现激活函数时,芯片装置利用主单元的向量计算功能,输入一向量,计算出该向量的激活向量;主单元将输入向量中的每一个值通过一个激活函数(激活函数的输入是一个数值,输出也是一个数值),计算出一个数值输出到输出向量的对应位置;
上述输入向量的来源包括但不限于:芯片装置的外部数据、芯片装置的分支单元转发的基本单元的计算结果数据。
上述计算结果数据具体可以为进行矩阵乘向量的运算结果;上述计算结果数据具体还可以进行矩阵乘矩阵的运算结果;上述输入数据可以为主单元实现加偏置之后的计算结果。
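激活函数逐元素作用于向量的过程,可以用如下草图示意(m取0时即常见的ReLU形式y=max(0,x),函数名均为举例假设):

```python
import math

def activate(vec, fn):
    # 主单元的向量运算功能:对输入向量中每个数值套用激活函数,逐元素输出
    return [fn(x) for x in vec]

def relu(x, m=0.0):
    # y = max(m, x),m为常数
    return max(m, x)

def sigmoid(x):
    # y = 1/(1+e^(-x))
    return 1.0 / (1.0 + math.exp(-x))
```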
采用芯片装置实现加偏置操作
利用主单元可以实现两个向量或者两个矩阵相加的功能;利用主单元可以实现把一个向量加到一个矩阵的每一行上,或者每一个列上的功能。
可选的,上述矩阵可以来自所述设备执行矩阵乘矩阵运算的结果;所述矩阵可以来自所述装置执行矩阵乘向量运算的结果;所述矩阵可以来自所述装置的主单元从外部接受的数据。所述向量可以来自所述装置的主单元从外部接受的数据。
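加偏置操作(把一个向量加到矩阵的每一行或者每一列上)可以用如下草图示意(函数名为举例假设):

```python
def add_bias_to_rows(M, bias):
    # 把一个向量加到矩阵的每一行上(偏置长度等于列数)
    return [[v + b for v, b in zip(row, bias)] for row in M]

def add_bias_to_cols(M, bias):
    # 把一个向量加到矩阵的每一列上(偏置长度等于行数)
    return [[v + bias[i] for v in row] for i, row in enumerate(M)]
```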
上述输入数据以及计算结果数据仅仅是举例说明,在实际应用中,还可以是其他类型或来源的数据,本披露具体实施方式对上述数据的来源方式以及表达方式并不限定。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本披露并不受所描述的动作顺序的限制,因为依据本披露,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于可选实施例,所涉及的动作和模块并不一定是本披露所必须的。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置,可通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性或其它的形式。
另外,在本披露各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元/模块都是以硬件的形式实现。比如该硬件可以是电路,包括数字电路,模拟电路等等。硬件结构的物理实现包括但不局限于物理器件,物理器件包括但不局限于晶体管,忆阻器等等。所述计算装置中的计算模块可以是任何适当的硬件处理器,比如CPU、GPU、FPGA、DSP和ASIC等等。所述存储单元可以是任何适当的磁存储介质或者磁光存储介质,比如RRAM,DRAM,SRAM,EDRAM,HBM,HMC等等。
所述作为说明的单元可以是或者也可以不是物理上分开的,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
以上对本披露实施例进行了详细介绍,本文中应用了具体个例对本披露的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本披露的方法及其核心思想;同时,对于本领域的一般技术人员,依据本披露的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本披露的限制。

Claims (22)

  1. 一种芯片装置,其特征在于,所述芯片装置包括:主单元以及多个基本单元,所述主单元为硬件芯片单元,所述基本单元也为硬件芯片单元;
    所述主单元,用于执行神经网络运算中的各个连续的运算以及与所述基本单元传输数据;
    所述基本单元,用于依据所述主单元传输的数据执行神经网络中并行加速的运算,并将运算结果传输给所述主单元。
  2. 根据权利要求1所述的芯片装置,其特征在于,
    所述主单元,用于获取待计算的数据块以及运算指令,依据该运算指令对所述待计算的数据块划分成分发数据块以及广播数据块;对所述分发数据块进行拆分处理得到多个基本数据块,将所述多个基本数据块分发至所述多个基本单元,将所述广播数据块广播至所述多个基本单元;
    所述基本单元,用于对所述基本数据块与所述广播数据块执行内积运算得到运算结果,将所述运算结果发送至所述主单元;
    所述主单元,用于对所述运算结果处理得到所述待计算的数据块以及运算指令的指令结果。
  3. 根据权利要求2所述的芯片装置,其特征在于,所述芯片装置还包括:分支单元,所述分支单元设置在主单元与至少一个基本单元之间;
    所述分支单元,用于在主单元和上述至少一个基本单元之间转发数据。
  4. 根据权利要求2或3所述的芯片装置,其特征在于,
    所述主单元,具体用于将所述广播数据块通过一次广播至所述多个基本单元。
  5. 根据权利要求4所述的芯片装置,其特征在于,
    所述基本单元,具体用于将所述基本数据块与所述广播数据块执行内积处理得到内积处理结果,将所述内积处理结果累加得到运算结果,将所述运算结果发送至所述主单元。
  6. 根据权利要求4所述的芯片装置,其特征在于,
    所述主单元,用于在所述运算结果为内积处理的结果时,对所述运算结果累加后得到累加结果,将该累加结果排列得到所述待计算的数据块以及运算指令的指令结果。
  7. 根据权利要求2或3所述的芯片装置,其特征在于,
    所述主单元,具体用于将所述广播数据块分成多个部分广播数据块,将所述多个部分广播数据块通过多次广播至所述多个基本单元。
  8. 根据权利要求7所述的芯片装置,其特征在于,
    所述基本单元,具体用于将所述部分广播数据块与所述基本数据块执行一次内积处理后得到内积处理结果,将所述内积处理结果累加得到部分运算结果,将所述部分运算结果发送至所述主单元。
  9. 根据权利要求8所述的芯片装置,其特征在于,
    所述基本单元,具体用于复用n次该部分广播数据块执行该部分广播数据块与该n个基本数据块内积运算得到n个部分处理结果,将n个部分处理结果分别累加后得到n个部分运算结果,将所述n个部分运算结果发送至主单元,所述n为大于等于2的整数。
  10. 根据权利要求1所述的芯片装置,其特征在于,
    所述主单元包括:主寄存器或主片上缓存电路的一种或任意组合;
    所述基础单元包括:基本寄存器或基本片上缓存电路的一种或任意组合。
  11. 根据权利要求10所述的芯片装置,其特征在于,
    所述主单元包括:向量运算器电路、算数逻辑单元电路、累加器电路、矩阵转置电路、直接内存存取电路或数据重排电路中的一种或任意组合。
  12. 根据权利要求10或11所述的芯片装置,其特征在于,
    所述基本单元包括:内积运算器电路或累加器电路中的一种或任意组合。
  13. 根据权利要求2所述的芯片装置,所述分支单元为多个分支单元,所述主单元与所述多个分支单元分别连接,每个分支单元与至少一个基础单元连接。
  14. 根据权利要求2所述的芯片装置,所述分支单元为多个分支单元,所述多个分支单元串联连接后与所述主单元连接,每个分支单元分别连接至少一个基础单元。
  15. 根据权利要求13所述的芯片装置,其特征在于,
    所述分支单元具体用于转发所述主单元与所述基础单元之间的数据。
  16. 根据权利要求14所述的芯片装置,其特征在于,
    所述分支单元具体用于转发所述主单元与所述基础单元或其他分支单元之间的数据。
  17. 根据权利要求1所述的芯片装置,其特征在于,
    所述数据为:向量、矩阵、三维数据块、四维数据块以及n维数据块中一种或任意组合。
  18. 根据权利要求2所述的芯片装置,其特征在于,
    如所述运算指令为乘法指令,确定乘数数据块为广播数据块,被乘数数据块为分发数据块;
    如所述运算指令为卷积指令,确定输入数据块为广播数据块,卷积核为分发数据块。
  19. 一种芯片,其特征在于,所述芯片集成如权利要求1-18任意一项所述的芯片装置。
  20. 一种智能设备,其特征在于,所述智能设备包括如权利要求19所述的芯片。
  21. 一种神经网络的运算方法,其特征在于,所述方法应用在芯片装置内,所述芯片装置包括:主单元以及至少一个基础单元,所述方法包括如下步骤:
    所述主单元执行神经网络运算中的各个连续的运算以及与所述基础单元传输数据;
    所述基础单元依据所述主单元传输的数据执行神经网络中并行加速的运算,并将运算结果传输给所述主单元。
  22. 根据权利要求21所述的方法,其特征在于,所述方法具体包括:
    所述主单元获取待计算的数据块以及运算指令,依据该运算指令对所述待计算的数据块划分成分发数据块以及广播数据块;主单元对所述分发数据块进行拆分处理得到多个基本数据块,将所述多个基本数据块分发至所述至少一个基础单元,所述主单元将所述广播数据块广播至所述至少一个基础单元;
    所述基础单元对所述基本数据块与所述广播数据块执行内积运算得到运算结果,将所述运算结果发送至主单元;主单元对所述运算结果处理得到所述待计算的数据块以及运算指令的指令结果。
PCT/CN2017/099991 2017-08-31 2017-08-31 芯片装置及相关产品 WO2019041251A1 (zh)

Publications (1)

Publication Number Publication Date
WO2019041251A1 true WO2019041251A1 (zh) 2019-03-07

Family

ID=65436282

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/099991 WO2019041251A1 (zh) 2017-08-31 2017-08-31 芯片装置及相关产品

Country Status (7)

Country Link
US (7) US11409535B2 (zh)
EP (6) EP3654210A1 (zh)
JP (1) JP7065877B2 (zh)
KR (3) KR102477404B1 (zh)
CN (8) CN110245752B (zh)
TW (1) TWI749249B (zh)
WO (1) WO2019041251A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126582A (zh) * 2019-12-20 2020-05-08 上海寒武纪信息科技有限公司 数据处理方法和相关产品
CN111161705A (zh) * 2019-12-19 2020-05-15 上海寒武纪信息科技有限公司 语音转换方法及装置
CN113743598A (zh) * 2020-05-27 2021-12-03 杭州海康威视数字技术股份有限公司 一种ai芯片的运行方式的确定方法和装置
CN114936633A (zh) * 2022-06-15 2022-08-23 北京爱芯科技有限公司 用于转置运算的数据处理单元及图像转置运算方法

Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859273A (zh) * 2017-12-29 2020-10-30 华为技术有限公司 矩阵乘法器
CN110162162B (zh) * 2018-02-14 2023-08-18 上海寒武纪信息科技有限公司 处理器的控制装置、方法及设备
CN110210610B (zh) * 2018-03-27 2023-06-20 腾讯科技(深圳)有限公司 卷积计算加速器、卷积计算方法及卷积计算设备
US11277455B2 (en) 2018-06-07 2022-03-15 Mellanox Technologies, Ltd. Streaming system
US20200106828A1 (en) * 2018-10-02 2020-04-02 Mellanox Technologies, Ltd. Parallel Computation Network Device
CN110162799B (zh) * 2018-11-28 2023-08-04 腾讯科技(深圳)有限公司 模型训练方法、机器翻译方法以及相关装置和设备
US11175946B2 (en) * 2018-12-06 2021-11-16 Advanced Micro Devices, Inc. Pipelined matrix multiplication at a graphics processing unit
US11657119B2 (en) * 2018-12-10 2023-05-23 Advanced Micro Devices, Inc. Hardware accelerated convolution
US11625393B2 (en) 2019-02-19 2023-04-11 Mellanox Technologies, Ltd. High performance computing system
EP3699770A1 (en) 2019-02-25 2020-08-26 Mellanox Technologies TLV Ltd. Collective communication system and methods
WO2021009901A1 (ja) * 2019-07-18 2021-01-21 技術研究組合光電子融合基盤技術研究所 並列計算方法およびシステム
US11481471B2 (en) * 2019-08-16 2022-10-25 Meta Platforms, Inc. Mapping convolution to a matrix processor unit
CN110516793B (zh) * 2019-08-27 2022-06-17 Oppo广东移动通信有限公司 一种池化处理方法及装置、存储介质
CN110826687B (zh) * 2019-08-30 2023-11-21 安谋科技(中国)有限公司 数据处理方法及其装置、介质和***
KR20210071471A (ko) * 2019-12-06 2021-06-16 삼성전자주식회사 뉴럴 네트워크의 행렬 곱셈 연산을 수행하는 장치 및 방법
US11750699B2 (en) 2020-01-15 2023-09-05 Mellanox Technologies, Ltd. Small message aggregation
US11252027B2 (en) 2020-01-23 2022-02-15 Mellanox Technologies, Ltd. Network element supporting flexible data reduction operations
US10713493B1 (en) * 2020-02-06 2020-07-14 Shenzhen Malong Technologies Co., Ltd. 4D convolutional neural networks for video recognition
US11876885B2 (en) 2020-07-02 2024-01-16 Mellanox Technologies, Ltd. Clock queue with arming and/or self-arming features
CN114115995A (zh) * 2020-08-27 2022-03-01 华为技术有限公司 人工智能芯片及运算板卡、数据处理方法及电子设备
CN112491555B (zh) * 2020-11-20 2022-04-05 山西智杰软件工程有限公司 医疗电子签名的处理方法及电子设备
CN112416433B (zh) * 2020-11-24 2023-01-17 中科寒武纪科技股份有限公司 一种数据处理装置、数据处理方法及相关产品
US11556378B2 (en) 2020-12-14 2023-01-17 Mellanox Technologies, Ltd. Offloading execution of a multi-task parameter-dependent operation to a network device
CN112953701B (zh) * 2021-02-04 2023-10-31 沈阳建筑大学 一种四维混沌电路装置
CN112799598B (zh) * 2021-02-08 2022-07-15 清华大学 一种数据处理方法、处理器及电子设备
CN113240570B (zh) * 2021-04-13 2023-01-06 华南理工大学 一种GEMM运算加速器及基于GoogLeNet的图像处理加速方法
CN112990370B (zh) * 2021-04-26 2021-09-10 腾讯科技(深圳)有限公司 图像数据的处理方法和装置、存储介质及电子设备
CN115481713A (zh) * 2021-06-15 2022-12-16 瑞昱半导体股份有限公司 改进卷积神经网络进行计算的方法
KR20230068572A (ko) * 2021-11-11 2023-05-18 삼성전자주식회사 메모리 어레이 내의 연결 회로
CN116150555A (zh) * 2021-11-19 2023-05-23 中科寒武纪科技股份有限公司 计算装置、利用计算装置实施卷积运算的方法及相关产品
US11922237B1 (en) 2022-09-12 2024-03-05 Mellanox Technologies, Ltd. Single-step collective operations
CN117974417B (zh) * 2024-03-28 2024-07-02 腾讯科技(深圳)有限公司 Ai芯片、电子设备及图像处理方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105488565A (zh) * 2015-11-17 2016-04-13 中国科学院计算技术研究所 加速深度神经网络算法的加速芯片的运算装置及方法
CN105608490A (zh) * 2015-07-29 2016-05-25 上海磁宇信息科技有限公司 细胞阵列计算***以及其中的通信方法
CN105930902A (zh) * 2016-04-18 2016-09-07 中国科学院计算技术研究所 一种神经网络的处理方法、***
CN105956659A (zh) * 2016-05-11 2016-09-21 北京比特大陆科技有限公司 数据处理装置和***、服务器
US20170193368A1 (en) * 2015-12-30 2017-07-06 Amazon Technologies, Inc. Conditional parallel processing in fully-connected neural networks

Family Cites Families (86)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5023833A (en) * 1987-12-08 1991-06-11 California Institute Of Technology Feed forward neural network for unary associative memory
US5956703A (en) * 1995-07-28 1999-09-21 Delco Electronics Corporation Configurable neural network integrated circuit
JPH117438A (ja) * 1997-06-18 1999-01-12 Fuji Xerox Co Ltd 積和演算処理方法、装置及び記録媒体
JP2001188767A (ja) * 1999-12-28 2001-07-10 Fuji Xerox Co Ltd ニューラルネットワーク演算装置及びニューラルネットワークの演算方法
US7672952B2 (en) * 2000-07-13 2010-03-02 Novell, Inc. System and method of semantic correlation of rich content
US6925479B2 (en) * 2001-04-30 2005-08-02 Industrial Technology Research Institute General finite-field multiplier and method of the same
US7065544B2 (en) * 2001-11-29 2006-06-20 Hewlett-Packard Development Company, L.P. System and method for detecting repetitions in a multimedia stream
US7737994B1 (en) * 2003-09-26 2010-06-15 Oracle America, Inc. Large-kernel convolution using multiple industry-standard graphics accelerators
US20050125477A1 (en) * 2003-12-04 2005-06-09 Genov Roman A. High-precision matrix-vector multiplication on a charge-mode array with embedded dynamic memory and stochastic method thereof
US7634137B2 (en) * 2005-10-14 2009-12-15 Microsoft Corporation Unfolded convolution for fast feature extraction
US7805386B2 (en) * 2006-05-16 2010-09-28 Greer Douglas S Method of generating an encoded output signal using a manifold association processor having a plurality of pairs of processing elements trained to store a plurality of reciprocal signal pairs
US8644643B2 (en) * 2006-06-14 2014-02-04 Qualcomm Incorporated Convolution filtering in a graphics processor
JP4942095B2 (ja) * 2007-01-25 2012-05-30 インターナショナル・ビジネス・マシーンズ・コーポレーション マルチコア・プロセッサにより演算を行う技術
US20080288756A1 (en) * 2007-05-18 2008-11-20 Johnson Timothy J "or" bit matrix multiply vector instruction
US8190543B2 (en) * 2008-03-08 2012-05-29 Tokyo Electron Limited Autonomous biologically based learning tool
US9152427B2 (en) * 2008-10-15 2015-10-06 Hyperion Core, Inc. Instruction issue to array of arithmetic cells coupled to load/store cells with associated registers as extended register file
US20100122070A1 (en) * 2008-11-07 2010-05-13 Nokia Corporation Combined associative and distributed arithmetics for multiple inner products
US20110025816A1 (en) * 2009-07-31 2011-02-03 Microsoft Corporation Advertising as a real-time video call
US8577950B2 (en) * 2009-08-17 2013-11-05 International Business Machines Corporation Matrix multiplication operations with data pre-conditioning in a high performance computing architecture
US8583896B2 (en) * 2009-11-13 2013-11-12 Nec Laboratories America, Inc. Massively parallel processing core with plural chains of processing elements and respective smart memory storing select data received from each chain
US20110314256A1 (en) * 2010-06-18 2011-12-22 Microsoft Corporation Data Parallel Programming Model
US8577820B2 (en) * 2011-03-04 2013-11-05 Tokyo Electron Limited Accurate and fast neural network training for library-based critical dimension (CD) metrology
US10078620B2 (en) * 2011-05-27 2018-09-18 New York University Runtime reconfigurable dataflow processor with multi-port memory access module
CN102214160B (zh) * 2011-07-08 2013-04-17 中国科学技术大学 一种基于龙芯3a的单精度矩阵乘法优化方法
CN103631761B (zh) * 2012-08-29 2018-02-27 睿励科学仪器(上海)有限公司 并行处理架构进行矩阵运算并用于严格波耦合分析的方法
DE102013104567A1 (de) * 2013-05-03 2014-11-06 Infineon Technologies Ag Chipanordnung, Chipkartenanordnung und Verfahren zum Herstellen einer Chipanordnung
CN103440121B (zh) * 2013-08-20 2016-06-29 中国人民解放军国防科学技术大学 一种面向向量处理器的三角矩阵乘法向量化方法
DE102013109200A1 (de) * 2013-08-26 2015-02-26 Infineon Technologies Austria Ag Chip, Chip-Anordnung und Verfahren zum Herstellen eines Chips
CN107451077B (zh) * 2013-08-27 2020-08-18 珠海艾派克微电子有限公司 测试头、芯片加工装置及显示芯片类型号的方法
US20150324686A1 (en) * 2014-05-12 2015-11-12 Qualcomm Incorporated Distributed model learning
CN104036451B (zh) * 2014-06-20 2018-12-11 深圳市腾讯计算机***有限公司 基于多图形处理器的模型并行处理方法及装置
CN104317352B (zh) * 2014-10-13 2017-10-24 中国科学院光电技术研究所 一种自适应光学控制***快速去倾斜分量处理方法
CN104346318B (zh) * 2014-10-15 2017-03-15 中国人民解放军国防科学技术大学 面向通用多核dsp的矩阵乘加速方法
CN104463324A (zh) * 2014-11-21 2015-03-25 长沙马沙电子科技有限公司 一种基于大规模高性能集群的卷积神经网络并行处理方法
CN105701120B (zh) * 2014-11-28 2019-05-03 华为技术有限公司 确定语义匹配度的方法和装置
CN104992430B (zh) * 2015-04-14 2017-12-22 杭州奥视图像技术有限公司 基于卷积神经网络的全自动的三维肝脏分割方法
CN104866855A (zh) * 2015-05-07 2015-08-26 华为技术有限公司 一种图像特征提取方法及装置
US10489703B2 (en) 2015-05-20 2019-11-26 Nec Corporation Memory efficiency for convolutional neural networks operating on graphics processing units
US10417555B2 (en) * 2015-05-29 2019-09-17 Samsung Electronics Co., Ltd. Data-optimized neural network traversal
CN104866904B (zh) * 2015-06-16 2019-01-01 中电科软件信息服务有限公司 一种基于spark的遗传算法优化的BP神经网络并行化方法
CN106293893B (zh) * 2015-06-26 2019-12-06 阿里巴巴集团控股有限公司 作业调度方法、装置及分布式***
CN105005911B (zh) * 2015-06-26 2017-09-19 深圳市腾讯计算机***有限公司 深度神经网络的运算***及运算方法
WO2017031630A1 (zh) * 2015-08-21 2017-03-02 中国科学院自动化研究所 基于参数量化的深度卷积神经网络的加速与压缩方法
CN105260776B (zh) * 2015-09-10 2018-03-27 华为技术有限公司 神经网络处理器和卷积神经网络处理器
CN106548124B (zh) * 2015-09-17 2021-09-07 松下知识产权经营株式会社 主题推定***、主题推定方法
EP3154001B1 (en) * 2015-10-08 2019-07-17 VIA Alliance Semiconductor Co., Ltd. Neural network unit with neural memory and array of neural processing units that collectively shift row of data received from neural memory
CN106485319B (zh) * 2015-10-08 2019-02-12 上海兆芯集成电路有限公司 具有神经处理单元可动态配置以执行多种数据尺寸的神经网络单元
CN105373517A (zh) * 2015-11-09 2016-03-02 南京大学 基于Spark的分布式稠密矩阵求逆并行化运算方法
CN105608056A (zh) * 2015-11-09 2016-05-25 南京大学 一种基于Flink的大规模矩阵并行化的计算方法
CN105426344A (zh) * 2015-11-09 2016-03-23 南京大学 基于Spark的分布式大规模矩阵乘法的矩阵计算方法
US11024024B2 (en) * 2015-12-15 2021-06-01 The Regents Of The University Of California Systems and methods for analyzing perfusion-weighted medical imaging using deep neural networks
CN105512723B (zh) * 2016-01-20 2018-02-16 南京艾溪信息科技有限公司 一种用于稀疏连接的人工神经网络计算装置和方法
CN111353588B (zh) * 2016-01-20 2024-03-05 中科寒武纪科技股份有限公司 用于执行人工神经网络反向训练的装置和方法
CN111353589B (zh) * 2016-01-20 2024-03-01 中科寒武纪科技股份有限公司 用于执行人工神经网络正向运算的装置和方法
US11055063B2 (en) * 2016-05-02 2021-07-06 Marvell Asia Pte, Ltd. Systems and methods for deep learning processor
US10796220B2 (en) * 2016-05-24 2020-10-06 Marvell Asia Pte, Ltd. Systems and methods for vectorized FFT for multi-dimensional convolution operations
KR102459854B1 (ko) * 2016-05-26 2022-10-27 삼성전자주식회사 심층 신경망용 가속기
CN106126481B (zh) * 2016-06-29 2019-04-12 华为技术有限公司 一种计算***和电子设备
CN106203621B (zh) * 2016-07-11 2019-04-30 北京深鉴智能科技有限公司 用于卷积神经网络计算的处理器
CN106228240B (zh) * 2016-07-30 2020-09-01 复旦大学 基于fpga的深度卷积神经网络实现方法
US10891538B2 (en) * 2016-08-11 2021-01-12 Nvidia Corporation Sparse convolutional neural network accelerator
US20180046903A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Deep processing unit (dpu) for implementing an artificial neural network (ann)
CN106407561B (zh) * 2016-09-19 2020-07-03 复旦大学 一种并行gpdt算法在多核soc上的划分方法
CN106446546B (zh) * 2016-09-23 2019-02-22 西安电子科技大学 基于卷积自动编解码算法的气象数据填补方法
CN106650922B (zh) * 2016-09-29 2019-05-03 清华大学 硬件神经网络转换方法、计算装置、软硬件协作***
CN106504232B (zh) * 2016-10-14 2019-06-14 北京网医智捷科技有限公司 一种基于3d卷积神经网络的肺部结节自动检测***
US9779786B1 (en) * 2016-10-26 2017-10-03 Xilinx, Inc. Tensor operations and acceleration
CN107239824A (zh) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 用于实现稀疏卷积神经网络加速器的装置和方法
WO2018103736A1 (en) * 2016-12-09 2018-06-14 Beijing Horizon Information Technology Co., Ltd. Systems and methods for data management
CN106844294B (zh) * 2016-12-29 2019-05-03 华为机器有限公司 卷积运算芯片和通信设备
US10417364B2 (en) * 2017-01-04 2019-09-17 Stmicroelectronics International N.V. Tool to create a reconfigurable interconnect framework
IT201700008949A1 (it) * 2017-01-27 2018-07-27 St Microelectronics Srl Procedimento di funzionamento di reti neurali, rete, apparecchiatura e prodotto informatico corrispondenti
CN106951395B (zh) * 2017-02-13 2018-08-17 上海客鹭信息技术有限公司 面向压缩卷积神经网络的并行卷积运算方法及装置
CN106940815B (zh) * 2017-02-13 2020-07-28 西安交通大学 一种可编程卷积神经网络协处理器ip核
US11663450B2 (en) * 2017-02-28 2023-05-30 Microsoft Technology Licensing, Llc Neural network processing with chained instructions
CN107066239A (zh) * 2017-03-01 2017-08-18 智擎信息***(上海)有限公司 Hardware structure for implementing forward computation of convolutional neural network
US10528147B2 (en) * 2017-03-06 2020-01-07 Microsoft Technology Licensing, Llc Ultrasonic based gesture recognition
WO2018174926A1 (en) * 2017-03-20 2018-09-27 Intel Corporation Systems, methods, and apparatuses for tile transpose
CN106970896B (zh) * 2017-03-30 2020-05-12 中国人民解放军国防科学技术大学 Vectorized implementation method of two-dimensional matrix convolution for vector processors
US10186011B2 (en) * 2017-04-28 2019-01-22 Intel Corporation Programmable coarse grained and sparse matrix compute hardware with advanced scheduling
US10169298B1 (en) * 2017-05-11 2019-01-01 NovuMind Limited Native tensor processor, using outer product unit
WO2018222896A1 (en) * 2017-05-31 2018-12-06 Intel Corporation Gradient-based training engine for quaternion-based machine-learning systems
US10167800B1 (en) * 2017-08-18 2019-01-01 Microsoft Technology Licensing, Llc Hardware node having a matrix vector unit with block-floating point processing
US10963780B2 (en) * 2017-08-24 2021-03-30 Google Llc Yield improvements for three-dimensionally stacked neural network accelerators
US20190102671A1 (en) * 2017-09-29 2019-04-04 Intel Corporation Inner product convolutional neural network accelerator
US11222256B2 (en) * 2017-10-17 2022-01-11 Xilinx, Inc. Neural network processing system having multiple processors and a neural network accelerator

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105608490A (zh) * 2015-07-29 2016-05-25 上海磁宇信息科技有限公司 Cell array computing system and communication method therein
CN105488565A (zh) * 2015-11-17 2016-04-13 中国科学院计算技术研究所 Operation device and method of an acceleration chip for accelerating deep neural network algorithms
US20170193368A1 (en) * 2015-12-30 2017-07-06 Amazon Technologies, Inc. Conditional parallel processing in fully-connected neural networks
CN105930902A (zh) * 2016-04-18 2016-09-07 中国科学院计算技术研究所 Neural network processing method and system
CN105956659A (zh) * 2016-05-11 2016-09-21 北京比特大陆科技有限公司 Data processing device, system and server

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111161705A (zh) * 2019-12-19 2020-05-15 上海寒武纪信息科技有限公司 Speech conversion method and device
CN111126582A (zh) * 2019-12-20 2020-05-08 上海寒武纪信息科技有限公司 Data processing method and related products
CN111126582B (zh) * 2019-12-20 2024-04-05 上海寒武纪信息科技有限公司 Data processing method and related products
CN113743598A (zh) * 2020-05-27 2021-12-03 杭州海康威视数字技术股份有限公司 Method and device for determining the operation mode of an AI chip
CN113743598B (zh) * 2020-05-27 2023-08-04 杭州海康威视数字技术股份有限公司 Method and device for determining the operation mode of an AI chip
CN114936633A (zh) * 2022-06-15 2022-08-23 北京爱芯科技有限公司 Data processing unit for transpose operation and image transpose operation method

Also Published As

Publication number Publication date
EP3654210A1 (en) 2020-05-20
EP3654209A1 (en) 2020-05-20
CN110245752A (zh) 2019-09-17
CN109902804B (zh) 2020-12-18
EP3605402A1 (en) 2020-02-05
US11347516B2 (en) 2022-05-31
EP3605402B1 (en) 2022-08-31
CN109729734B (zh) 2020-10-27
KR20200008544A (ko) 2020-01-28
US11354133B2 (en) 2022-06-07
US11561800B2 (en) 2023-01-24
US11409535B2 (en) 2022-08-09
KR102481256B1 (ko) 2022-12-23
CN111860815A (zh) 2020-10-30
EP3605402A4 (en) 2020-10-21
US11531553B2 (en) 2022-12-20
CN109729734A (zh) 2019-05-07
US20200057652A1 (en) 2020-02-20
US11334363B2 (en) 2022-05-17
US20200057647A1 (en) 2020-02-20
CN109902804A (zh) 2019-06-18
JP2020530916A (ja) 2020-10-29
CN110083390B (zh) 2020-08-25
CN110231958B (zh) 2020-10-27
CN110245751B (zh) 2020-10-09
CN109729734B8 (zh) 2020-11-24
CN110083390A (zh) 2019-08-02
TW201913460A (zh) 2019-04-01
TWI749249B (zh) 2021-12-11
US20200057648A1 (en) 2020-02-20
US20200057651A1 (en) 2020-02-20
CN110245752B (zh) 2020-10-09
EP3651031A1 (en) 2020-05-13
US11775311B2 (en) 2023-10-03
US20190065208A1 (en) 2019-02-28
CN110231958A (zh) 2019-09-13
KR20200037749A (ko) 2020-04-09
US20200057649A1 (en) 2020-02-20
US20200057650A1 (en) 2020-02-20
KR102477404B1 (ko) 2022-12-13
CN110245751A (zh) 2019-09-17
CN110222308A (zh) 2019-09-10
EP3654208A1 (en) 2020-05-20
KR20200037748A (ko) 2020-04-09
KR102467688B1 (ko) 2022-11-15
JP7065877B2 (ja) 2022-05-12
EP3651030A1 (en) 2020-05-13
CN110222308B (zh) 2020-12-29

Similar Documents

Publication Publication Date Title
TWI749249B (zh) Chip device, chip, intelligent device and neural network operation method
CN109615061B (zh) Convolution operation method and device
JP6888073B2 (ja) Chip device and related products
JP6888074B2 (ja) Chip device and related products
CN109615062B (zh) Convolution operation method and device

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 17923228; Country of ref document: EP; Kind code of ref document: A1
ENP Entry into the national phase
    Ref document number: 2019553977; Country of ref document: JP; Kind code of ref document: A
ENP Entry into the national phase
    Ref document number: 20197029020; Country of ref document: KR; Kind code of ref document: A
ENP Entry into the national phase
    Ref document number: 2017923228; Country of ref document: EP; Effective date: 20191024
NENP Non-entry into the national phase
    Ref country code: DE