CN112348180A - Data processing device and configuration method, neural network processor, chip and equipment - Google Patents

Data processing device and configuration method, neural network processor, chip and equipment Download PDF

Info

Publication number
CN112348180A
CN112348180A
Authority
CN
China
Prior art keywords
level
unit
logic
control unit
units
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011353547.1A
Other languages
Chinese (zh)
Inventor
刘君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202011353547.1A priority Critical patent/CN112348180A/en
Publication of CN112348180A publication Critical patent/CN112348180A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks


Abstract

The embodiments of the application disclose a data processing apparatus, including: a control unit; and M levels of logic units arranged in a stack, the first-level logic unit of the M levels being connected to the control unit. A through channel is arranged inside each level of logic unit to form a transmission channel for transmitting data from the Mth-level logic unit to the control unit, and some of the M levels of logic units include a beat register. The embodiments of the application also disclose a method of configuring the data processing apparatus, a neural network processor, a chip, and an electronic device.

Description

Data processing device and configuration method, neural network processor, chip and equipment
Technical Field
The embodiments of the application relate to, but are not limited to, the field of electronic information technology, and in particular to a data processing apparatus and configuration method, a neural network processor, a chip, and a device.
Background
In recent years, Convolutional Neural Networks (CNNs) have been widely used in many fields, such as face recognition, intelligent video surveillance, and automatic driving.
In data processing apparatuses for convolutional neural network computation in the related art, after the last-stage logic unit obtains target data, it transmits the target data to the control unit directly through external wiring connected between the last-stage logic unit and the control unit, so that the control unit can act on the target data.
Disclosure of Invention
The embodiment of the application provides a data processing device, a configuration method, a neural network processor, a chip and equipment.
In a first aspect, a data processing apparatus is provided, including: a control unit; and
M levels of logic units, where M is an integer greater than 1, the M levels of logic units are arranged in a stacked manner, and the first-level logic unit of the M levels is connected with the control unit;
wherein a through channel is arranged inside each level of logic unit to form a transmission channel for transmitting data from the Mth-level logic unit to the control unit, and some of the M levels of logic units include a beat register.
In a second aspect, there is provided a method of configuring a data processing apparatus, the method comprising:
setting a control unit;
arranging M levels of logic units in a stacked manner, where M is an integer greater than 1, and connecting the first-level logic unit of the M levels with the control unit;
arranging a through channel inside each level of logic unit to form a transmission channel for transmitting data from the Mth-level logic unit to the control unit; and
setting a beat register in some of the M levels of logic units.
In a third aspect, a neural network processor is provided, which includes the data processing apparatus described above.
In a fourth aspect, a chip is provided, which includes the neural network processor.
In a fifth aspect, an electronic device is provided, which includes the chip.
In the embodiments of the application, because the M levels of logic units are stacked and a through channel is arranged inside each level of logic unit, the Mth-level logic unit can transmit data through the through channels inside each level of logic unit. This avoids arranging wiring outside the data processing apparatus and keeps the overall size of the apparatus small. In addition, because beat registers are arranged in only some of the M levels of logic units, timing violations can be eliminated.
Drawings
Fig. 1 is a schematic diagram illustrating a process of matrix multiply-add calculation of a convolutional neural network according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a physical implementation model of a data processing apparatus according to an embodiment of the present application;
FIG. 3a is a schematic diagram of a physical implementation structure of a data processing apparatus provided in the related art;
FIG. 3b is a diagram illustrating a physical implementation structure of another data processing apparatus provided in the related art;
FIG. 3c is a diagram illustrating a physical implementation structure of another data processing apparatus provided in the related art;
fig. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of another data processing apparatus according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating a method for configuring a data processing apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a neural network processor according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a chip according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solution of the present application will be specifically described below by way of examples with reference to the accompanying drawings. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
It should be noted that: in the present examples, "first", "second", etc. are used for distinguishing similar objects and are not necessarily used for describing a particular order or sequence.
The technical means described in the embodiments of the present application may be arbitrarily combined without conflict.
As neural network processing units (NPUs) are increasingly applied in the intelligent-terminal field, they are becoming increasingly important. The core operation of an NPU is convolution, which is in essence matrix multiply-add computation; accelerating matrix multiply-add in hardware therefore accelerates deep learning training and inference.
Fig. 1 is a schematic diagram of the matrix multiply-add calculation process of a convolutional neural network according to an embodiment of the present application. As shown in fig. 1, because matrix multiply-add naturally has a 3-dimensional structure, the matrix elements can be multiplied in parallel by designing a three-dimensional hardware calculation unit 101, after which the additions are performed. The three-dimensional hardware calculation unit 101 in fig. 1 is a 4 × 4 × 4 three-dimensional calculation unit, used to perform a multiply-add operation on two 4 × 4 matrices (matrix A and matrix B). The process by which the 4 × 4 × 4 three-dimensional calculation unit performs the multiply-add operation is as follows:
First, the 4 × 4 × 4 three-dimensional calculation unit computes the products of the elements in parallel: each of its 64 cells acquires one element of matrix A and one element of matrix B and multiplies them.
Where the matrices are

$$A=\begin{pmatrix}A_{11}&A_{12}&A_{13}&A_{14}\\A_{21}&A_{22}&A_{23}&A_{24}\\A_{31}&A_{32}&A_{33}&A_{34}\\A_{41}&A_{42}&A_{43}&A_{44}\end{pmatrix},\qquad B=\begin{pmatrix}B_{11}&B_{12}&B_{13}&B_{14}\\B_{21}&B_{22}&B_{23}&B_{24}\\B_{31}&B_{32}&B_{33}&B_{34}\\B_{41}&B_{42}&B_{43}&B_{44}\end{pmatrix}.$$

For example, the four cells in column 0 of the front surface of the 4 × 4 × 4 three-dimensional calculation unit compute $A_{11}\times B_{11}$, $A_{21}\times B_{11}$, $A_{31}\times B_{11}$ and $A_{41}\times B_{11}$ respectively; the four cells in the last column of the front surface compute $A_{11}\times B_{14}$, $A_{21}\times B_{14}$, $A_{31}\times B_{14}$ and $A_{41}\times B_{14}$; the four cells in column 0 of the upper surface compute $A_{14}\times B_{41}$, $A_{13}\times B_{31}$, $A_{12}\times B_{21}$ and $A_{11}\times B_{11}$; and the four cells in the last column of the upper surface compute $A_{14}\times B_{44}$, $A_{13}\times B_{34}$, $A_{12}\times B_{24}$ and $A_{11}\times B_{14}$.
Next, the calculated 64 products are summed in a certain dimension, for example, in the direction from the front surface to the back surface, resulting in a matrix a × B, which includes 16 elements.
Finally, the obtained matrix A × B and the matrix C are summed to obtain the matrix A × B + C. The matrix summation may also be performed in parallel.
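The three steps above can be sketched in software as follows. This is a minimal illustration, not part of the patent; the hardware computes the 64 products and the sums in parallel, whereas this sketch performs them sequentially.

```python
def matmul_add_4x4(A, B, C):
    """Compute A x B + C for 4x4 matrices, mirroring the 4x4x4 unit's flow."""
    N = 4
    # Step 1: each of the 64 cells computes one product A[i][k] * B[k][j].
    products = [[[A[i][k] * B[k][j] for k in range(N)]
                 for j in range(N)] for i in range(N)]
    # Step 2: sum the products along the k dimension (front-to-back surface),
    # yielding the 16 elements of A x B.
    AB = [[sum(products[i][j]) for j in range(N)] for i in range(N)]
    # Step 3: add the bias matrix C element-wise (also parallel in hardware).
    return [[AB[i][j] + C[i][j] for j in range(N)] for i in range(N)]
```

The function name and the list-based matrix representation are our own choices for illustration.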
The above describes the multiplication of two 4 × 4 matrices. In the case where matrix A' and matrix B' are both 8 × 8 matrices, the manner in which the three-dimensional hardware calculation unit 101 in fig. 1 multiplies the two 8 × 8 matrices is as follows:
Where

$$A'=\begin{pmatrix}A'_{11}&A'_{12}\\A'_{21}&A'_{22}\end{pmatrix},\qquad B'=\begin{pmatrix}B'_{11}&B'_{12}\\B'_{21}&B'_{22}\end{pmatrix},$$

and $A'_{11}$, $A'_{12}$, $A'_{21}$, $A'_{22}$, $B'_{11}$, $B'_{12}$, $B'_{21}$ and $B'_{22}$ are all 4 × 4 matrices. The result A' × B' is therefore:

$$A'\times B'=\begin{pmatrix}A'_{11}B'_{11}+A'_{12}B'_{21}&A'_{11}B'_{12}+A'_{12}B'_{22}\\A'_{21}B'_{11}+A'_{22}B'_{21}&A'_{21}B'_{12}+A'_{22}B'_{22}\end{pmatrix}.$$

In this way, the multiplication of two 8 × 8 matrices is converted into addition and multiplication operations on 4 × 4 matrices.
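The block decomposition just described can be sketched as follows (an illustrative sketch, not from the patent; function names are our own): an 8 × 8 multiply is carried out as eight 4 × 4 multiplies plus 4 × 4 additions, which is exactly what a 4 × 4 × 4 hardware unit natively supports.

```python
def mat_mul(A, B):
    """Naive n x n matrix multiply (stands in for the 4x4x4 hardware unit)."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_add(A, B):
    """Element-wise n x n matrix addition."""
    n = len(A)
    return [[A[i][j] + B[i][j] for j in range(n)] for i in range(n)]

def block(M, r, c):
    """Extract the 4x4 block at block-row r, block-column c of an 8x8 matrix."""
    return [row[4 * c:4 * c + 4] for row in M[4 * r:4 * r + 4]]

def block_matmul_8x8(A, B):
    """Multiply two 8x8 matrices via the 2x2 block formula above:
    C'[p][q] = A'[p][1] x B'[1][q] + A'[p][2] x B'[2][q]."""
    C = [[0] * 8 for _ in range(8)]
    for p in range(2):
        for q in range(2):
            blk = mat_add(mat_mul(block(A, p, 0), block(B, 0, q)),
                          mat_mul(block(A, p, 1), block(B, 1, q)))
            for i in range(4):
                for j in range(4):
                    C[4 * p + i][4 * q + j] = blk[i][j]
    return C
```

The same scheme extends recursively to 16 × 16 and larger matrices whose dimensions are multiples of 4.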
In the embodiment of the present application, the three-dimensional hardware calculation unit 101 is a 4 × 4 × 4 three-dimensional calculation unit, which can perform parallel multiplication on matrices whose dimensions are multiples of 4. In other embodiments, the three-dimensional hardware calculation unit 101 may be a 3 × 3 × 3 three-dimensional calculation unit handling matrices whose dimensions are multiples of 3, or a 2 × 2 × 2 or 5 × 5 × 5 three-dimensional calculation unit, and so on; this is not limited in the embodiments.
In the case where the three-dimensional hardware calculation unit 101 is a 4 × 4 × 4 three-dimensional calculation unit, if the dimensions of the matrix to be computed are not multiples of 4, rows and columns of zeros may be appended to the matrix to pad it to dimensions that are multiples of 4.
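The zero-padding step can be sketched as follows (a minimal illustration under our own naming, not from the patent):

```python
def pad_to_multiple_of_4(M):
    """Append rows and columns of zeros so both dimensions become multiples of 4."""
    rows, cols = len(M), len(M[0])
    new_rows = -(-rows // 4) * 4  # ceiling division to the next multiple of 4
    new_cols = -(-cols // 4) * 4
    padded = [row + [0] * (new_cols - cols) for row in M]
    padded += [[0] * new_cols for _ in range(new_rows - rows)]
    return padded
```

Because the appended entries are zero, they contribute nothing to the products or sums, so the padded multiply yields the original result in its top-left corner.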
In some embodiments, the computational power of the NPU may be increased by increasing the number of cells in the three-dimensional hardware calculation unit 101. For example, an 8 × 4 × 4 three-dimensional calculation unit doubles the computational power of a 4 × 4 × 4 unit, and a 16 × 4 × 4 unit doubles that of an 8 × 4 × 4 unit.
Fig. 2 is a schematic diagram of a physical implementation model of a data processing apparatus according to an embodiment of the present disclosure. As shown in fig. 2, a data processing apparatus 200 may include a control unit and M levels of logic units, where M may be an integer greater than or equal to 3; for example, the value of M in fig. 2 may be 4, 8, or 16. The data processing apparatus 200 may further include M levels of computing units and M levels of storage units. The two sides of each of the M levels of logic units may be connected to a computing unit and a storage unit, respectively. The ith-level computing unit and the ith-level logic unit can communicate with each other, as can the ith-level logic unit and the ith-level storage unit, where i is any integer from 1 to M.
The ith-level logic unit acquires the ith-level data processing result calculated by the ith-level computing unit and transmits it to the (i+1)th-level logic unit. The (i+1)th-level logic unit transmits both the (i+1)th-level data processing result calculated by the (i+1)th-level computing unit and the ith-level result received from the ith-level logic unit to the (i+2)th-level logic unit. In this way, the last (Mth) level logic unit acquires the M data processing results calculated by the M levels of computing units and can transmit them to the control unit, so that the control unit can act on them accordingly.
In the embodiment of the present application, the operations of acquiring data, processing data, receiving data, or forwarding data of any logic unit may be performed through cooperation of some element or some components in the logic unit.
In the embodiment of the present application, the computing unit connected to the ith-level logic unit may be referred to as an ith-level computing unit, and the storage unit connected to the ith-level logic unit may be referred to as an ith-level storage unit.
In some embodiments, the value of M may be related to the computational power of the NPU, e.g., when the number of cells in a three-dimensional hardware computation cell is 4, the value of M is 4; when the number of units in the three-dimensional hardware computing unit is 8, the value of M is 8; when the number of cells in the three-dimensional hardware calculation unit is 16, the value of M is 16. Of course, the value of M may be other, and is not limited herein.
In order to enable the last-stage logic unit to send the obtained M data processing results back to the control unit, the related art provides the following three approaches, shown in figs. 3a to 3c below.
Fig. 3a is a schematic diagram of a physical implementation structure of a data processing apparatus provided in the related art. As shown in fig. 3a, the last-stage logic unit needs to send data with a large bit width to the control unit, so a routing channel is reserved between the M levels of logic units and the M levels of computing units, and all logic units adopt a multiplexing mode, allowing the last-stage logic unit to send feature-map data to the control unit through the routing channel. In this approach, the M levels of logic units share the same layout; each receives data from its upper side and outputs data from its lower side.
However, with this approach the data processing apparatus becomes large, because routing channels must be reserved both between the M levels of logic units and the M levels of computing units and at the bottom of the logic units.
In order to solve the problem of large size of the data processing device, the skilled person proposes the solution of fig. 3 b.
Fig. 3b is a schematic diagram of a physical implementation structure of another data processing apparatus provided in the related art. As shown in fig. 3b, the concept of a feedthrough path is introduced: a through channel is arranged inside each level of logic unit to form a transmission channel for data from the Mth-level logic unit to the control unit, the through channel being a feedthrough path. Because the through channel is inside each level of logic unit, the wiring can be placed within the logic unit itself. The designer then no longer needs to reserve a routing channel between the M levels of logic units and the M levels of computing units, nor at the bottom of the logic units, which reduces the size of the data processing apparatus.
However, as the multiplication matrix grows from 4 × 4 × 4 to 8 × 8 × 8 or 16 × 16 × 16 or beyond, that is, as the number of logic-unit stages grows from 4 to 8 or 16, the timing for the last-stage logic unit to send the M data processing results to the control unit becomes difficult to meet, causing timing violations.
In order to solve the problem that the timing sequence for the last-stage logic unit to send M data processing results to the control unit is difficult to satisfy, which causes a timing sequence violation, the technical staff provides the scheme in fig. 3 c.
Fig. 3c is a schematic diagram of a physical implementation structure of another data processing apparatus provided in the related art. As shown in fig. 3c, a beat register, or source register (RS), is added in each level of logic unit, and data is output from the logic unit via the source register. This avoids timing violations and reduces path delay, which helps timing convergence.
It should be noted that, although the logic units shown in fig. 3a to 3c may include 8-level logic units, 16-level logic units, 32-level logic units, etc., they do not constitute a limitation to the embodiments of the present application, and any number of logic units applied to the embodiments of the present application should be within the scope of the embodiments of the present application.
However, with this related-art approach a further problem arises: because every level of logic unit is provided with a beat register, the beat registers are redundant and therefore wasted.
Fig. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, and as shown in fig. 4, the data processing apparatus 400 includes: the device comprises a control unit and an M-level logic unit, wherein M is an integer larger than 1.
The M levels of logic units are arranged in a stacked manner, and the first-level logic unit of the M levels is connected with the control unit. A through channel is arranged inside each level of logic unit to form a transmission channel for transmitting data from the Mth-level logic unit to the control unit, and some of the M levels of logic units include a beat register.
In the embodiment of the present application, the data processing apparatus 400 may be the same apparatus as the data processing apparatus 200. It should be noted that although fig. 4 shows the beat register in the Mth-level logic unit, in practice beat registers may be provided in any subset of the M levels of logic units.
The M levels of logic units may have at least one of the following characteristics: they may be the same size; they may be cascaded; any two adjacent logic units may abut each other; and their sides may be flush when stacked.
The architectures of the logic units for setting the beat registers in the M-level logic units may be the same, and each logic unit for setting the beat registers may be configured by using the same configuration method. The architectures of the logic units without the beat registers in the M-level logic units may be the same, and each logic unit without the beat registers may be configured by using the same configuration method.
The through channel arranged inside each level of logic unit in the embodiment of the application can be used as a data transmission channel between the mth level of logic unit and the control unit, so that the mth level of logic unit can transmit data to the control unit through the through channel in each level of logic unit.
Each logic unit in the M-level logic units may be provided with a through channel at the same position, so that in the case of stacking the M-level logic units, the M through channels may be sequentially butted, thereby making a transmission channel for transmitting data from the M-level logic unit to the control unit shortest, and further reducing a transmission path of the data during transmission.
In the embodiment of the present application, in the case that the M-level logic units are stacked, the transmission channel of data formed by the M-level through channel is long, and in order to avoid timing violations, in a physical design, the timing violations can be repaired by setting a beat register in the transmission channel of data.
The beat register in the embodiment of the present application may be a source register, or may be a buffer unit capable of buffering data, and the buffer unit may be another type of register or a memory.
In the embodiment of the application, a beat register is not set in every logic unit; instead, some of the M levels of logic units include a beat register. Setting beat registers in only this subset of logic units not only effectively avoids timing violations but also avoids beat-register redundancy.
In the embodiment of the present application, since the M levels of logic units are stacked and each level has an internal through channel, the Mth-level logic unit can transmit data through the through channels inside each level of logic unit, avoiding routing arranged outside the data processing apparatus 400 and keeping its size small. In addition, because beat registers are arranged in some of the M levels of logic units, timing violations can be eliminated.
Referring to fig. 4, in the embodiment of the present application, the first-level logic unit to the M-1-level logic unit are respectively provided with a feedthrough port (not shown) at a side facing the control unit and a side facing away from the control unit, the M-level logic unit is provided with a feedthrough port at a side facing the control unit, and the through channel forms a transmission channel through a plurality of feedthrough ports.
The side of the control unit facing the first level logic unit may also be provided with feed-through ports.
In the embodiment of the application, when the control unit and the M-level logic unit are stacked, two adjacent feed-through ports are connected. For example, a feed-through port arranged on the side of the M-th level logic unit facing the control unit is connected with a feed-through port arranged on the side of the M-1-th level logic unit facing away from the control unit; the feed-through port arranged on one side, facing the control unit, of the M-1 level logic unit is connected with the feed-through port arranged on one side, back to the control unit, of the M-2 level logic unit; feed-through ports provided up to the side of the level 1 logic unit facing the control unit are connected to the feed-through ports of the control unit.
The transmission channel may be formed by a through channel within the logic unit, a feedthrough port on the control unit, a feedthrough port on the logic unit, and a beat register in the logic unit in common.
Referring to fig. 4, transmission traces (not shown) may be disposed in the transmission channels, and the transmission traces connect the mth logic unit and the control unit to implement data transmission from the mth logic unit to the control unit.
The transmission trace in the embodiment of the present application may be a through signal line.
In this way, the transmission wires are arranged inside each of the M levels of logic units and data is transmitted through them, which largely avoids the size increase that arranging transmission wires outside the logic units would cause.
In some embodiments, the transmission trace can be divided into multiple segments, and the multiple segments can be connected between two adjacent beat registers.
Fig. 5 is a schematic structural diagram of another data processing apparatus according to an embodiment of the present application, and as shown in fig. 5, the data processing apparatus 400 further includes M-level computing units and M-level storage units, where each level of the M-level logic units is respectively connected to one computing unit and one storage unit. Fig. 5 shows that the data processing apparatus 400 may include 4-level logic units, 16-level logic units, 32-level logic units, and the like, and in other embodiments, the value of M may be other, which is not limited in this embodiment.
In some embodiments, the computing unit and the storage unit may be respectively connected at opposite sides of each stage of the logic unit. The M-level computing units may be stacked, and the M-level storage units may be stacked. In other embodiments, the computing unit and the storage unit may be respectively connected to two adjacent sides of each stage of logic unit.
The logic unit can be respectively and electrically connected with the computing unit and the storage unit, so that data interaction between the data logic unit and the computing unit and between the logic unit and the storage unit can be realized.
In some embodiments, each level of logic unit may abut one computing unit and/or one storage unit. In the embodiments of the present application, two units abutting means that the two units are connected in contact, with no wiring between them other than the wires used to connect them.
The M-level computing units are side flush in the case of stacking, and/or the M-level storage units are side flush in the case of stacking.
In some embodiments, two adjacent computing units among the M levels of computing units may be electrically connected to each other, and/or two adjacent storage units among the M levels of storage units may be electrically connected to each other, so that data interaction between adjacent computing units or between adjacent storage units can be implemented.
The size and/or architecture of the M-level computing units may be the same, and the M-level computing units may be configured by the same configuration method, and/or the size and/or architecture of the M-level storage units may be the same, and the M-level storage units may be configured by the same configuration method.
In the embodiment of the present application, any one of the M-level computing units includes: an addition calculation unit and/or a multiplication calculation unit.
The computing unit may perform addition and/or multiplication on the data to be computed. Each level of computing unit obtains its data to be computed, computes it to obtain a data processing result, and sends the result through its own level's logic unit and the logic units below it to the last (Mth) level logic unit. The Mth-level logic unit thus obtains the M data processing results computed by the M levels of computing units and transmits them to the control unit through the through channel.
The M levels of storage units in the embodiment of the application comprise one or a combination of the following: a register, a register group consisting of at least two registers, random access memory (RAM), read-only memory (ROM), cache (CACHE), flash memory (FLASH), and double data rate synchronous dynamic random access memory (DDR SDRAM).
In the embodiment of the application, M is a multiple of 4, and the last level of each group of 4 levels among the M levels of logic units includes a beat register. A larger M indicates a larger amount of data that the data processing apparatus 400 can process at once, or in one clock cycle. M may be equal to 4, 8, 16, 32, and so on.
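Under the placement rule just described, the levels that receive a beat register can be enumerated as follows (an illustrative sketch; the function name and parameterisation are our own, not from the patent):

```python
def beat_register_levels(M, group=4):
    """Return the 1-indexed levels whose logic unit includes a beat register:
    the last level of each group of `group` consecutive levels. With group=4
    and M a multiple of 4, that is every 4th level: 4, 8, ..., M."""
    assert M % group == 0, "M must be a multiple of the group size"
    return [level for level in range(1, M + 1) if level % group == 0]
```

For M = 16 this yields 4 beat registers instead of 16, which is the redundancy saving over the fig. 3c scheme.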
The following describes the process of processing data by each stage (i-th stage) of logic unit, each stage (i-th stage) of calculation unit, and each stage of storage unit in the data processing apparatus 400:
The control unit is used for acquiring data to be processed and transmitting it to the M levels of logic units. The ith-level storage unit stores the ith-level data to be processed. The ith-level logic unit acquires the ith-level data to be processed from the ith-level storage unit and sends it to the ith-level computing unit. The ith-level computing unit computes the ith-level data to be processed to obtain an ith-level data processing result and sends it to the ith-level logic unit. When i is smaller than M, the ith-level logic unit also transmits the ith-level data processing result to the (i+1)th-level logic unit; when i is equal to M, the Mth-level logic unit also transmits the Mth-level data processing result to the control unit.
The data to be processed may be the matrices A (or A'), B (or B'), and C described in the above embodiments.
The data to be processed may include two parts: the first part may include image feature data, and the second part may include convolution data and/or a bias term for performing convolution processing on the image feature data.
In some embodiments, the first part of the data may include the matrix A or A' described above, and the second part may include the matrix B or B'. In other embodiments, the first part may include the matrix B or B', and the second part may include the matrix A or A'. In some embodiments, the matrix C may be included in either the first part or the second part.
In some embodiments, all data to be processed may be sent to the first-level logic unit. The first-level logic unit extracts the data needed by the first-level computing unit, stores this first-level data to be processed in the first-level storage unit, and sends the remaining data to the second-level logic unit. The second-level logic unit likewise extracts the data needed by the second-level computing unit, stores it in the second-level storage unit, and sends the remainder on to the third-level logic unit, and so on, until the Mth-level logic unit obtains the Mth-level data to be processed.
The i-th-level data to be processed may be all or part of the data that the i-th-level computing unit needs to process. For example, it may be only the second part of the i-th-level data.
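The cascading distribution described above can be sketched as follows (a hypothetical even split, purely for intuition — the real apparatus determines each level's share from the workload):

```python
# Sketch of the cascading distribution: each logic unit keeps the
# slice its computing unit needs (stored in that level's storage
# unit) and forwards the remainder to the next level.

def distribute(all_data, m):
    chunk = len(all_data) // m            # assume an even split
    stored, remaining = [], list(all_data)
    for _ in range(m):
        stored.append(remaining[:chunk])  # this level's share
        remaining = remaining[chunk:]     # passed to the next level
    return stored

print(distribute([1, 2, 3, 4, 5, 6, 7, 8], 4))
# [[1, 2], [3, 4], [5, 6], [7, 8]]
```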
In this way, each of the M levels of computing units produces one level of data processing result, and each result is passed through its own level's logic unit and the logic units below it down to the Mth-level logic unit, so that the Mth-level logic unit obtains all M levels of data processing results.
The M levels of data processing results may constitute the target data transmitted from the Mth-level logic unit to the control unit as described above. The Mth-level logic unit sends the target data to the control unit through the through channels arranged in the M levels of logic units and the beat registers arranged in some of those logic units.
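Under the stated grouping (a beat register in the last logic unit of every group of 4 levels), the transmission latency over the through channel can be modelled roughly as below. This is an assumption-laden sketch: unregistered hops are treated as combinational within one clock cycle, and each beat register adds one cycle.

```python
# Hypothetical latency model for the through channel: logic units
# without a beat register pass target data combinationally, while
# each beat register adds one clock cycle of delay.

def through_channel_cycles(m, group=4):
    """Clock cycles for target data to travel from the Mth-level
    logic unit to the control unit (M a multiple of `group`)."""
    return m // group   # one registered beat per group of 4 levels

for m in (4, 8, 16, 32):
    print(m, through_channel_cycles(m))   # 1, 2, 4, 8 cycles
```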
In some embodiments, the computing units, logic units, and storage units of every level have the same dimension in the stacking direction, which makes the data processing apparatus 400 compact in structure and small in size.
The control unit in the embodiment of the present application may be configured to perform subsequent processing on the target data. In some embodiments, the control unit may, for example, input the obtained target data into an activation layer, send the activated data into a pooling layer to obtain pooled data, and then either feed the pooled data back into the M levels of logic units through the control unit or input the pooled data into a fully connected layer.
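The post-processing chain mentioned above (activation, then pooling) can be sketched in plain Python. ReLU and average pooling are assumptions chosen for illustration — the embodiment does not specify which activation or pooling is used:

```python
# Illustrative activation -> pooling chain applied to target data.

def relu(xs):
    return [max(0.0, x) for x in xs]

def avg_pool(xs, window=2):
    return [sum(xs[i:i + window]) / window
            for i in range(0, len(xs), window)]

target = [-1.0, 2.0, 3.0, -4.0]   # hypothetical target data
pooled = avg_pool(relu(target))
print(pooled)                     # [1.0, 1.5]
```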
In the embodiment of the present application, since the data processing apparatus 400 further includes M levels of computing units and M levels of storage units, and there is no gap between the computing units and the logic units or between the logic units and the storage units, the data processing apparatus 400 has a compact structure in which each unit performs its own role in cooperation with the others, making the apparatus easier to produce and manufacture.
According to the embodiment of the application, by arranging the through channel, the wiring that would otherwise occupy a routing channel is placed inside the logic units, eliminating the channel and the area wasted by it. Meanwhile, taking every 4 levels of logic units as a group and placing a beat register (RS) in the last logic unit of each group eliminates timing violations and allows the computing capacity of an NPU that includes the data processing apparatus 400 to be scaled up continuously.
The embodiment of the application can achieve the following beneficial effects. With only minor changes to the module code, rearranging the connection relations eliminates the wasted channel area inside the NPU, shortens the channel, and reduces the number of RSs on the timing path, which facilitates timing closure and effectively reduces chip power consumption. Adding one stage of RS after every 3 levels of logic units satisfies the timing requirements and allows the computing power of the NPU to be extended without limit.
In the embodiment of the application, using feed-through ports eliminates the wasted channel area, achieves good results, and enriches the physical implementation techniques available for NPUs. Shortening the channel reduces the number of RSs on the timing path, facilitating timing closure and effectively reducing chip power consumption. Inserting one stage of RS (one beat in the timing) within a certain distance enables unlimited extension of the NPU's computing power and provides a good solution for the physical implementation of an NPU with a large multiplication matrix. Because this scheme does not blindly introduce an RS into every logic unit, it effectively saves chip area and power.
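The area saving claimed here — not placing an RS in every logic unit — can be put into rough numbers. This is a back-of-the-envelope sketch: the one-register-per-group count follows the grouping above, while the "naive" count is the hypothetical alternative of registering every level.

```python
# Rough register-count comparison: selective RS placement (one per
# group of 4 logic units) versus naively registering every level.

def rs_selective(m, group=4):
    return m // group

def rs_naive(m):
    return m

for m in (8, 16, 32):
    saved = rs_naive(m) - rs_selective(m)
    print(f"M={m}: selective={rs_selective(m)}, "
          f"naive={rs_naive(m)}, registers saved={saved}")
```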
Fig. 6 is a flowchart illustrating a method for configuring a data processing apparatus according to an embodiment of the present application, where as shown in fig. 6, the method includes:
s601, setting a control unit.
S602, M levels of logic units are stacked, wherein M is an integer larger than 1, and a first level of logic unit in the M levels of logic units is connected with the control unit.
S603, arranging a through channel in each level of logic unit to form a transmission channel for data transmission from the Mth-level logic unit to the control unit.
S604, setting a beat register in some of the M levels of logic units.
The steps of the method for configuring a data processing apparatus in the embodiment of the present application may be performed by a configuration device, which may be any device that can be used to configure the data processing apparatus in any of the embodiments described above.
In some embodiments, the method further comprises: feed-through ports are provided on the sides of the first-stage logic unit to the M-1 th-stage logic unit facing the control unit and the side facing away from the control unit, respectively, and the feed-through ports are provided on the side of the M-th-stage logic unit facing the control unit, so that the through channels form transmission channels via the plurality of feed-through ports.
In some embodiments, the method further comprises: and transmission wires are arranged in the transmission channel and are connected with the Mth-level logic unit and the control unit so as to realize data transmission from the Mth-level logic unit to the control unit.
In some embodiments, the method further comprises: a computing unit and a storage unit are respectively arranged on two sides of each level of logic unit in the M levels of logic units.
In some embodiments, any one of the M-level computing units comprises: an addition calculation unit and/or a multiplication calculation unit.
In some embodiments, the M-level memory cells include one or a combination of: the device comprises a register, a register group consisting of at least two registers, a random access memory RAM, a read-only memory ROM, a CACHE memory CACHE, a flash memory and a double-rate synchronous dynamic random access memory DDR.
In some embodiments, M is a multiple of 4, and setting a beat register in a portion of the M-level logic units includes: and setting a beating register in the last level logic unit of every 4 levels of logic units in the M levels of logic units.
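The placement rule in this step can be expressed as a simple 1-indexed predicate (an illustrative configuration-time check, not part of the claimed method):

```python
# Which levels receive a beat register when M is a multiple of 4:
# the last level of every 4-level group, i.e. levels 4, 8, 12, ...

def has_beat_register(level, group=4):
    return level % group == 0

m = 8
print([lvl for lvl in range(1, m + 1) if has_beat_register(lvl)])
# [4, 8]
```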
It should be noted that the embodiment of the present application does not limit the order in which the above steps are executed.
Fig. 7 is a schematic structural diagram of a neural network processor according to an embodiment of the present disclosure, and as shown in fig. 7, the neural network processor 700 may include the data processing apparatus 400 according to any of the embodiments.
The neural network processor 700 may be a programmable logic device, such as a Field Programmable Gate Array (FPGA), or an Application Specific Integrated Circuit (ASIC).
In an embodiment of the present application, a chip may be provided, where the chip includes the neural network processor described above.
Fig. 8 is a schematic structural diagram of a chip provided in an embodiment of the present disclosure. As shown in Fig. 8, the chip 800 includes not only the neural network processor 700 but also a Central Processing Unit (CPU) 801 and a Graphics Processing Unit (GPU) 802. The graphics processor 802, the central processor 801, and the neural network processor 700 may be packaged together by a packaging process, where the packaging process includes one of the following: Chip On Board (COB) packaging, System In a Package (SiP), System on a Chip (SoC) packaging, and chip stacking.
In some embodiments, chip 800 may include neural network processor 700 and central processor 801, but not graphics processor 802. In other embodiments, chip 800 may include neural network processor 700 and graphics processor 802, rather than central processor 801.
In some implementations, the chip 800 may also include an input interface (not shown). The graphic processor 802, the central processing unit 801 or the neural network processor 700 may control the input interface to communicate with other devices or chips 800, and specifically, may obtain information or data sent by other devices or chips.
In some implementations, the chip 800 may also include an output interface (not shown). The graphics processor 802, the central processing unit 801 or the neural network processor 700 may control the output interface to communicate with other devices or chips 800, and in particular, may output information or data to other devices or chips 800.
It should be understood that the chips mentioned in the embodiments of the present application may also be referred to as a system-level chip, a chip system, or a system-on-chip, etc.
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure, and as shown in fig. 9, an electronic device 900 according to an embodiment of the present disclosure includes any one of the chips 800 described above.
Electronic device 900 may be a terminal device, which may include: a server, a mobile phone, a tablet computer, a notebook computer, a handheld computer, a personal digital assistant, a portable media player, a smart speaker, a navigation device, a display device, a wearable device such as a smart watch, a Virtual Reality (VR) device, an Augmented Reality (AR) device, a pedometer, a digital TV, a desktop computer, a device for intelligent driving, a Wireless Fidelity (Wi-Fi) access point, an evolved base station, a base station for next-generation communication such as a 5G base station, a small cell, a micro cell, or a Transmission Reception Point (TRP). It may also be any device capable of performing convolution processing on data, such as a relay station, an access point, or a vehicle-mounted device.
The first storage unit and/or the second storage unit may, in addition to the random access memory RAM, read-only memory ROM, cache CACHE, flash memory, and double data rate synchronous dynamic random access memory DDR described above, be one of the following: one or at least two registers, a memory block, a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), a Synchronous Dynamic Random Access Memory (SDRAM), a Double Data Rate synchronous SDRAM (DDR SDRAM), an Enhanced Synchronous SDRAM (ESDRAM), a Synchronous Link DRAM (SLDRAM), a Direct Rambus RAM (DR RAM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a ferromagnetic random access memory (FRAM), or a magnetic surface memory, among others. That is, the second storage unit, the third storage unit, or the sixth storage unit in the embodiments of the present application is intended to include, but not be limited to, these and any other suitable types of memory.
Here, it should be noted that: the description of the embodiments of the method, the neural network processor, the chip and the electronic device for configuring a data processing apparatus is similar to the description of the embodiments of the data processing apparatus described above, and the same embodiments have the same or similar advantageous effects. For technical details not disclosed in the embodiments of the method for configuring a data processing apparatus, the neural network processor, the chip and the electronic device of the present application, reference is made to the description of the embodiments of the data processing apparatus of the present application for understanding.
In the embodiments of the present application, unless otherwise specified, "a plurality of" means two or more, and "a plurality of times" means two or more times.
It should be appreciated that reference throughout this specification to "one embodiment", "some embodiments", "an embodiment of the present application", or "the preceding embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of these phrases in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present application, the sequence numbers of the above processes do not imply an execution order; the execution order of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation of the embodiments of the present application. The serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the description of the present application, it is to be understood that the terms "center," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," and the like indicate the orientations and positional relationships shown in the drawings, are used only for convenience and simplicity of description, and are not intended to indicate or imply that the referenced devices or elements must have a particular orientation or be constructed and operated in a particular orientation; they are therefore not to be construed as limiting the present application. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features. Thus, features defined as "first" or "second" may explicitly or implicitly include one or more of the described features. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
In the description of the present application, it is to be noted that, unless otherwise explicitly specified or limited, the terms "mounted" and "connected" are to be construed broadly, e.g., as meaning a fixed connection, a removable connection, or an integral connection; a mechanical connection, an electrical connection, or a communication connection; a direct connection, or an indirect connection through an intermediate medium, or an internal communication between two elements. The specific meanings of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate.
In this application, unless expressly stated or limited otherwise, the first feature "on" or "under" the second feature may comprise direct contact of the first and second features, or may comprise contact of the first and second features not directly but through another feature in between. Also, the first feature being "on," "above" and "over" the second feature includes the first feature being directly on and obliquely above the second feature, or merely indicating that the first feature is at a higher level than the second feature. A first feature being "under," "below," and "beneath" a second feature includes the first feature being directly under and obliquely below the second feature, or simply meaning that the first feature is at a lesser elevation than the second feature.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as a removable Memory device, a Read Only Memory (ROM), a magnetic disk, or an optical disk.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the related art may be embodied in the form of a software product stored in a storage medium, and including several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.
It should be noted that the drawings in the embodiments of the present application are only for illustrating schematic positions of the respective devices on the terminal device, and do not represent actual positions in the terminal device, actual positions of the respective devices or the respective areas may be changed or shifted according to actual conditions (for example, a structure of the terminal device), and a scale of different parts in the terminal device in the drawings does not represent an actual scale.
The above description is only for the embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. A data processing apparatus, comprising:
a control unit; and
m-level logic units, wherein M is an integer greater than 1, the M-level logic units are arranged in a stacked manner, and a first-level logic unit in the M-level logic units is connected with the control unit;
and a through channel is arranged in each level of logic unit to realize a transmission channel for transmitting data from the Mth-level logic unit to the control unit, and some of the M levels of logic units comprise a beat register.
2. The apparatus according to claim 1, wherein the first-stage to M-1-stage logic units are provided with feed-through ports at a side facing the control unit and a side facing away from the control unit, respectively, the M-stage logic unit is provided with a feed-through port at a side facing the control unit, and the through channel forms the transmission channel via a plurality of feed-through ports.
3. The apparatus according to claim 2, wherein a transmission trace is disposed in the transmission channel, and the transmission trace connects the mth logic unit and the control unit to implement data transmission from the mth logic unit to the control unit.
4. The apparatus of claim 1, further comprising M-level computing units and M-level storage units, wherein each level of the M-level logic units is connected to one computing unit and one storage unit respectively.
5. The apparatus of claim 4, wherein any one of the M-level computing units comprises: an addition calculation unit and/or a multiplication calculation unit.
6. The apparatus of claim 5, wherein the M-level memory cells comprise one or a combination of: the device comprises a register, a register group consisting of at least two registers, a random access memory RAM, a read-only memory ROM, a CACHE memory CACHE, a flash memory and a double-rate synchronous dynamic random access memory DDR.
7. The apparatus of any of claims 1 to 6, wherein M is a multiple of 4, wherein a last stage logic unit of every 4 stages of logic units in the M stages of logic units comprises the beat register.
8. The apparatus of claim 4,
the control unit is used for acquiring data to be processed and transmitting the data to be processed to the M-level logic unit;
the ith-level storage unit is used for storing the ith-level data to be processed;
the ith-level logic unit is used for acquiring ith-level data to be processed from the ith-level storage unit and sending the ith-level data to the ith-level calculation unit;
the ith-level computing unit is used for computing the ith-level data to be processed to obtain an ith-level data processing result and sending the ith-level data processing result to the ith-level logic unit;
when i is smaller than M, the ith-level logic unit is also used for transmitting the ith-level data processing result to the (i + 1) th-level logic unit; and
and when i is equal to M, the M-level logic unit is also used for transmitting the M-level data processing result to the control unit.
9. A method of configuring a data processing apparatus, the method comprising:
setting a control unit;
m-level logic units are arranged in a stacked mode, M is an integer larger than 1, and a first-level logic unit in the M-level logic units is connected with the control unit;
a through channel is arranged in each level of logic unit to realize a transmission channel for transmitting data from the Mth level of logic unit to the control unit; and
and setting a beat register in some of the M levels of logic units.
10. The method of claim 9, further comprising:
feed-through ports are provided on the sides of the first-stage logic unit to the M-1 th-stage logic unit facing the control unit and the side facing away from the control unit, respectively, and a feed-through port is provided on the side of the M-th-stage logic unit facing the control unit, so that the through channel forms the transmission channel via a plurality of feed-through ports.
11. The method of claim 10, further comprising:
and arranging transmission wires in the transmission channel, wherein the transmission wires are connected with the Mth-level logic unit and the control unit so as to realize data transmission from the Mth-level logic unit to the control unit.
12. The method of claim 9, wherein M is a multiple of 4, and wherein setting a beat register in a portion of the M-level logic units comprises:
and setting the beating register in the last level logic unit of each 4 levels of logic units in the M levels of logic units.
13. A neural network processor, comprising the data processing apparatus of any one of claims 1 to 8.
14. A chip comprising the neural network processor of claim 13.
15. An electronic device comprising the chip of claim 14.
CN202011353547.1A 2020-11-27 2020-11-27 Data processing device and configuration method, neural network processor, chip and equipment Pending CN112348180A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011353547.1A CN112348180A (en) 2020-11-27 2020-11-27 Data processing device and configuration method, neural network processor, chip and equipment

Publications (1)

Publication Number Publication Date
CN112348180A true CN112348180A (en) 2021-02-09


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115236493A (en) * 2022-07-28 2022-10-25 摩尔线程智能科技(北京)有限责任公司 DFT test circuit, test system and test method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101133390A (en) * 2005-03-04 2008-02-27 爱特梅尔公司 Single-cycle low-power cpu architecture
CN111553461A (en) * 2019-02-11 2020-08-18 希侬人工智能公司 Customizable chip for artificial intelligence applications


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ALEXEY LOPICH et al.: "Architecture and design of a programmable 3D-integrated cellular processor array for image processing", 2011 IEEE/IFIP 19th International Conference on VLSI and System-on-Chip, 5 October 2011, pages 1-5
HE Wantao et al.: "Surface Structured Light Projection Three-Dimensional Measurement Technology" (in Chinese), Harbin Institute of Technology Press, 31 August 2020, pages 139-142



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination