CN114997389A - Convolution calculation method, AI chip and electronic equipment - Google Patents

Convolution calculation method, AI chip and electronic equipment

Info

Publication number
CN114997389A
Authority
CN
China
Prior art keywords
data
weight
block
convolution
calculation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210838417.XA
Other languages
Chinese (zh)
Inventor
高聪
李智
李国嵩
何浩
王刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Denglin Technology Co ltd
Chengdu Denglin Technology Co ltd
Original Assignee
Shanghai Denglin Technology Co ltd
Chengdu Denglin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Denglin Technology Co ltd, Chengdu Denglin Technology Co ltd filed Critical Shanghai Denglin Technology Co ltd
Priority to CN202210838417.XA
Publication of CN114997389A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Complex Calculations (AREA)

Abstract

The application relates to a convolution calculation method, an AI chip and electronic equipment, and belongs to the technical field of computers. The method comprises the following steps: performing a convolution calculation with each weight block and the same data block to obtain the convolution data of each weight block relative to the data block, wherein each weight block is a part of the complete weight required by the convolution calculation, and the number of weights in each weight block is not greater than the total number of calculation units; and adding the convolution data of the weight blocks relative to the same data block to obtain the convolution calculation result. By dividing the complete weight required by the convolution calculation into a plurality of weight blocks, performing a convolution calculation with each weight block and the same data block, and then adding the convolution data of the weight blocks relative to that data block, the convolution calculation result can be obtained, thereby supporting convolution calculations in which the number of weights required is greater than the total number of calculation units.

Description

Convolution calculation method, AI chip and electronic equipment
Technical Field
The application belongs to the technical field of computers, and particularly relates to a convolution calculation method, an AI chip and electronic equipment.
Background
In recent years, Artificial Intelligence (AI) technology has developed rapidly and achieved remarkable results; in particular, in directions such as image detection and recognition and speech recognition, the recognition rate of AI now exceeds that of humans. Neural network processing is an important processing technology for realizing artificial intelligence.
With the wide application of large-scale convolutional neural networks, higher requirements are placed on their hardware implementation. Current convolution calculation methods only support calculations in which the number of weights required by the convolution calculation is less than or equal to the number of hardware parallel calculation units. However, due to constraints on hardware area and power consumption, the amount of parallel computation per cycle is always limited in a hardware implementation, and when the number of weights required by a convolution calculation is greater than the number of hardware parallel calculation units, the conventional convolution calculation method cannot perform the calculation.
Disclosure of Invention
In view of the above, an object of the present application is to provide a convolution calculation method, an AI chip and an electronic device, so as to solve the problem that the conventional convolution calculation method cannot handle cases in which the number of weights required by the convolution calculation is greater than the number of hardware parallel calculation units.
The embodiment of the application is realized as follows:
in a first aspect, an embodiment of the present application provides a convolution calculation method, which is applied to an AI chip including a plurality of calculation units, where the method includes: performing a convolution calculation with each weight block and the same data block to obtain the convolution data of each weight block relative to the data block, wherein each weight block is a part of the complete weight required by the convolution calculation, the number of weights in each weight block is not greater than the total number of calculation units, each data block is a part of the complete data required by the convolution calculation, and each data block is obtained by segmenting the complete data according to the maximum output data size supported by the hardware and the size of the complete weight; and adding the convolution data of the weight blocks relative to the same data block to obtain a convolution calculation result.
In the embodiment of the application, the complete weight required by the convolution calculation is divided into weight blocks, a convolution calculation is carried out with each weight block and the same data block, and the convolution data of the weight blocks relative to that data block are then added to obtain the convolution calculation result, thereby supporting convolution calculations in which the number of weights required is greater than the total number of calculation units. Meanwhile, the weight blocks are divided according to the number of calculation units supported by the hardware, so that the number of weights in each weight block is not greater than the total number of calculation units; the calculation units provided by the hardware can thus be fully utilized while realizing the convolution calculation, further improving the efficiency of the convolution operation. Similarly, the complete data is segmented according to the maximum output data size supported by the hardware and the size of the complete weight, so that the number of data blocks is as small as possible while each divided data block still meets the hardware requirements; this reduces the number of interactions between the AI chip and the external memory and improves the efficiency of the convolution operation.
With reference to a possible implementation manner of the embodiment of the first aspect, performing a convolution calculation with each weight block and the same data block to obtain the convolution data of each weight block relative to the data block includes: performing a convolution calculation with the ith weight block and the target data corresponding to the ith weight block in the data block to obtain the convolution data of the ith weight block relative to the data block, where i takes 1 to n in sequence and n is an integer greater than 1; the target data is the portion corresponding to the ith weight block, selected from the data obtained by simulating sliding the complete weight in the data block at a preset step length.
In the embodiment of the application, the convolution calculation is performed serially with the ith weight block and the target data corresponding to the ith weight block in the data block, so that convolutions of the largest possible size can be supported with a limited number of calculation units.
With reference to a possible implementation manner of the embodiment of the first aspect, adding the convolution data of each weight block relative to the same data block to obtain a convolution calculation result includes: adding the convolution data of the (i+1)th weight block relative to the data block to the accumulated convolution data of the ith weight block to obtain the accumulated convolution data of the (i+1)th weight block, where the accumulated convolution data of the 1st weight block equals the convolution data of the 1st weight block relative to the data block, i takes 1 to n in sequence, and n is an integer greater than 1.
In the embodiment of the application, the convolution data of the (i+1)th weight block relative to the data block is serially added to the accumulated convolution data of the ith weight block, which reduces the storage space needed to hold the convolution data of each weight block relative to the data block.
With reference to a possible implementation manner of the embodiment of the first aspect, before performing a convolution calculation with each weight block and the same data block to obtain the convolution data of each weight block relative to the data block, the method further includes: dividing the complete weight into a plurality of weight blocks according to the number of calculation units supported by the hardware, where the number of weights in each weight block is not greater than the total number of calculation units.
In the embodiment of the present application, the complete weight is divided into a plurality of weight blocks according to the number of calculation units supported by the hardware, thereby making it possible to support convolution calculations in which the number of weights required is greater than the total number of calculation units. The inventors of the present application found that, compared with dividing the weight blocks according to the size of the buffer inside the AI chip, the calculation units are more likely to become the bottleneck in high-performance hardware computation; by dividing the weight blocks according to the number of calculation units supported by the hardware, the calculation units provided by the hardware can be fully utilized while realizing the convolution calculation, further improving the efficiency of the convolution calculation.
With reference to a possible implementation manner of the embodiment of the first aspect, performing a convolution calculation with each weight block and the same data block to obtain the convolution data of each weight block relative to the data block includes: performing a convolution calculation with each weight block and each data block to obtain the convolution data of each weight block relative to each data block.
In the embodiment of the application, considering the constraints of hardware area and power consumption, the number of outputs of the accumulation module is limited; by splitting the complete data required by the convolution calculation into a plurality of data blocks, the method can be adapted to more scenarios.
With reference to one possible implementation manner of the embodiment of the first aspect, the method further includes: dividing the complete data into a plurality of data blocks according to the maximum output data size supported by the hardware and the size of the complete weight.
In the embodiment of the application, the data blocks are divided according to the maximum output data size supported by the hardware and the size of the complete weight, so that the number of data blocks is as small as possible while the divided data blocks still meet the hardware requirements, which further reduces the number of interactions between the AI chip and the external memory.
In a second aspect, an embodiment of the present application further provides an AI chip, including a convolution module and an accumulation module. The convolution module is used for performing a convolution calculation with each weight block and the same data block to obtain the convolution data of each weight block relative to the data block, wherein each weight block is a part of the complete weight required by the convolution calculation, the number of weights in each weight block is not greater than the number of calculation units in the convolution module, each data block is a part of the complete data required by the convolution calculation, and each data block is obtained by segmenting the complete data according to the maximum output data size supported by the hardware and the size of the complete weight. The accumulation module is used for adding the convolution data of the weight blocks relative to the same data block to obtain a convolution calculation result.
With reference to one possible implementation manner of the embodiment of the second aspect, the convolution module includes a plurality of calculation units and a summing unit: each calculation unit is used for multiplying a corresponding weight by a corresponding data element, and the summing unit is used for summing the calculation results of the calculation units.
With reference to one possible implementation manner of the embodiment of the second aspect, the AI chip further includes: a first buffer area for storing the input data blocks; and/or a second buffer area for storing the input weights.
In the embodiment of the application, by providing the first buffer area and/or the second buffer area inside the AI chip, the frequency of data interaction between the AI chip and the external memory can be reduced, which saves chip performance overhead and improves convolution calculation efficiency.
With reference to one possible implementation manner of the embodiment of the second aspect, the AI chip further includes:
the control module is used for replacing the data block in the first buffer area with the next data block after the convolution calculations of that data block with each weight block have been completed; and for replacing the weight block in the second buffer area with the next weight block after the convolution calculation of the weight block stored in the second buffer area with the data block stored in the first buffer area has been completed.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a memory and an AI chip as provided in the second aspect and/or any possible implementation manner of the second aspect, the AI chip being connected to the memory, and the memory being configured to store the complete data and the complete weights required for the convolution calculation.
Additional features and advantages of the present application will be set forth in the description that follows. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application or of the prior art, the drawings needed in the embodiments are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present application, and that those of ordinary skill in the art can obtain other drawings from them without creative effort. The above and other objects, features and advantages of the present application will become more apparent from the accompanying drawings. Like reference numerals refer to like parts throughout the drawings. The drawings are not necessarily drawn to scale, emphasis instead being placed upon illustrating the principles of the application.
Fig. 1 is a schematic diagram illustrating a structure of an AI chip connected to a memory according to an embodiment of the present disclosure.
Fig. 2 shows a schematic structural diagram of another AI chip provided in the embodiment of the present application.
Fig. 3 is a schematic structural diagram of another AI chip provided in the embodiment of the present application.
Fig. 4 shows a flowchart of a convolution calculation method according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that like reference numbers and letters refer to like items in the following figures; thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures. Meanwhile, relational terms such as "first" and "second" are used herein solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between the entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises it.
In the description of the present application, it should also be noted that, unless otherwise explicitly specified or limited, the terms "connected" and "connection" are to be interpreted broadly: a connection may be fixed, detachable, or integral; it may be an electrical connection; and it may be direct, indirect through an intermediate medium, or an internal communication between two elements. The specific meanings of the above terms in the present application can be understood by those of ordinary skill in the art on a case-by-case basis.
Further, the term "and/or" in the present application merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone. The term "plurality" means two or more.
Given that the conventional convolution calculation method does not support calculations in which the number of weights required by the convolution calculation is greater than the number of hardware parallel calculation units, the embodiment of the present application provides a convolution calculation method applied to an AI chip to support such calculations.
For better understanding, an AI (Artificial Intelligence) chip provided in the embodiment of the present application will be described below with reference to fig. 1. The AI chip includes a convolution module and an accumulation module. A memory external to the AI chip is used to store the complete data and complete weights needed for the convolution calculations. In fig. 1, W represents a weight, i.e., W1, W2, ..., Wn all represent weights; D represents data, i.e., D1, D2, ..., Dn all represent data.
The memory may be a memory commonly available on the market, and may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The convolution module is used for performing a convolution calculation with each weight block and the same data block to obtain the convolution data of each weight block relative to the data block, where each weight block is a part of the complete weight required by the convolution calculation and the number of weights in each weight block is not greater than the number of calculation units in the convolution module.
The convolution module includes a plurality of calculation units and a summing unit: each calculation unit is configured to multiply a corresponding weight by a corresponding data element, and the summing unit is used for summing the calculation results of the calculation units.
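For illustration only (this sketch is not part of the patent text), the behavior of such a convolution module can be expressed in a few lines of Python; the function name and the flattened one-weight-per-unit layout are assumptions of the sketch:

```python
# A minimal behavioral sketch of the convolution module: each calculation unit
# multiplies one weight by one data element, and the summing unit adds the
# products, yielding one dot product per sliding position.
def convolution_module(weight_block_flat, target_data_flat):
    products = [w * d for w, d in zip(weight_block_flat, target_data_flat)]  # calculation units
    return sum(products)  # summing unit
```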
The accumulation module is used for adding the convolution data of each weight block relative to the same data block to obtain a convolution calculation result, and includes a storage area for storing the convolution data of the weight blocks relative to the data block.
In order to support convolution calculations in which the number of weights required is greater than the total number of calculation units, the complete weight required by the convolution calculation is divided into a plurality of weight blocks, so that the weight block used in each convolution calculation is only a part of the complete weight. A convolution calculation is performed with each weight block and the same data block to obtain the convolution data of each weight block relative to the data block, and the convolution data of the weight blocks relative to the same data block are then added to obtain the convolution calculation result.
Each weight block used for a convolution calculation is a part of the complete weight required by the convolution calculation. When dividing the weight, in one embodiment, the complete weight may be divided into a plurality of weight blocks according to the number of calculation units supported by the hardware, with the number of weights in each weight block not greater than the total number of calculation units. For example, assuming that the number of calculation units is 100 and the size of the complete weight is 10 × 100 (i.e., 10 × 100 weights in total), the complete weight may be divided into 10 weight blocks of 100 weights each, where each weight block may have the size 1 × 100, 100 × 1, 10 × 10, 4 × 25, or 25 × 4. For another example, assuming that the size of the complete weight is 100 × 100, it may be split into 100 weight blocks of 100 weights each. The division illustrated here applies regardless of whether each weight block is convolved with a data block in sequence (serial mode) or the weight blocks are convolved with a data block simultaneously (parallel mode).
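A minimal sketch of this splitting rule, assuming rectangular tiles and illustrative names (the embodiment does not prescribe a particular implementation):

```python
import numpy as np

# Cut the complete weight into tile_h x tile_w blocks, each holding no more
# weights than there are calculation units (one weight per unit).
def split_full_weight(full_w, tile_h, tile_w, num_units):
    assert tile_h * tile_w <= num_units, "a weight block must fit the calculation units"
    blocks = []
    for r in range(0, full_w.shape[0], tile_h):
        for c in range(0, full_w.shape[1], tile_w):
            blocks.append(((r, c), full_w[r:r + tile_h, c:c + tile_w]))
    return blocks  # list of (offset within the complete weight, weight block)

# e.g. a 100 x 100 complete weight with 100 calculation units -> 100 blocks of 10 x 10
blocks = split_full_weight(np.ones((100, 100)), tile_h=10, tile_w=10, num_units=100)
assert len(blocks) == 100
```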
The number of outputs of the accumulation module is also limited by the constraints of hardware area and power consumption. Optionally, in one embodiment, the data block used in each convolution calculation is not the complete data required by the convolution calculation but a part of it: the complete data is also segmented, so that the data block input each time is only a part of the complete data, thereby meeting the requirements of more scenarios. When dividing the complete data, it may be split into a plurality of data blocks according to the maximum output data size supported by the hardware (which depends on the size of the buffer in the accumulation module) and the size of the complete weight. That is, each data block is a part of the complete data required by the convolution calculation, obtained by slicing the complete data according to the maximum output data size supported by the hardware and the size of the complete weight.
For example, assume the size of the complete weight is 100 × 100 and the maximum output data amount supported by the hardware is 1024, which may correspond to various output shapes such as 1 × 1024, 2 × 512, 4 × 256, 8 × 128, 16 × 64, 32 × 32, etc. If the data block is divided according to an output data size of 1 × 1024, the size of the maximum data block is (100 + 1 - 1) × (100 + 1024 - 1) = 100 × 1123; if divided according to an output data size of 16 × 64, the size of the maximum data block is (100 + 16 - 1) × (100 + 64 - 1) = 115 × 163; if divided according to an output data size of 32 × 32, the size of the maximum data block is (100 + 32 - 1) × (100 + 32 - 1) = 131 × 131. The data blocks for the remaining output sizes are divided on the same principle as in the above examples and are not illustrated here.
The minimum size of a data block is determined by the data remaining after the maximum-size data blocks have been divided off (the remaining data is smaller than the maximum data block). For better understanding, assume the size of the complete data is 1310 × 1000: if it is divided into 131 × 131 blocks, the size of the maximum data block is 131 × 131 and the size of the minimum data block is 131 × 83. For another example, assuming the size of the complete data is 1000 × 1310 and dividing into 131 × 131 blocks, the size of the maximum data block is 131 × 131 and the size of the minimum data block is 83 × 131.
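For stride 1, the sizing rule above reduces to input = output + kernel - 1 in each dimension; a small sketch (function name assumed) reproduces the worked examples:

```python
# Maximum data block for a stride-1 convolution: producing an out_h x out_w
# output tile with a k_h x k_w kernel requires an input block of
# (out_h + k_h - 1) x (out_w + k_w - 1).
def max_data_block_shape(out_h, out_w, k_h, k_w):
    return out_h + k_h - 1, out_w + k_w - 1

# the three examples from the text, with a 100 x 100 complete weight
assert max_data_block_shape(1, 1024, 100, 100) == (100, 1123)
assert max_data_block_shape(16, 64, 100, 100) == (115, 163)
assert max_data_block_shape(32, 32, 100, 100) == (131, 131)
```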
When a convolution calculation is performed with each weight block and the same data block to obtain the convolution data of each weight block relative to the data block, either a serial or a parallel manner may be used. If a serial manner is adopted, the specific process may be: performing a convolution calculation with the ith weight block and the target data corresponding to the ith weight block in the data block to obtain the convolution data of the ith weight block relative to the data block, where i takes 1 to n in sequence and n is an integer greater than 1.
The target data is the portion corresponding to the ith weight block, selected from the data obtained by simulating sliding the complete weight in the data block at a preset step length (for example, a step length of 1). For example, assuming the size of the complete weight is 100 × 100 and the size of the data block is 131 × 131, a total of 32 × 32 sliding positions are involved, and the data obtained at each sliding position has the size 100 × 100; the target data is then the portion (of size 25 × 25) corresponding to the ith weight block within the 100 × 100 data obtained at each sliding position. The process of sliding the complete weight in the data block at a preset step length (e.g., a step length of 1) is the same as the sliding process in a conventional convolution calculation and is not described in detail here.
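The selection of target data can be sketched as follows; the (row, column) offset convention and the names are assumptions made for the sketch:

```python
import numpy as np

# For one weight block located at block_offset inside the complete k x k
# weight, slide the complete weight over the data block at the preset step
# length and keep, at each sliding position, only the patch portion that
# lines up with that weight block.
def target_data_for_block(data_block, k, block_offset, tile, step=1):
    r0, c0 = block_offset
    h, w = data_block.shape
    patches = []
    for y in range(0, h - k + 1, step):
        for x in range(0, w - k + 1, step):
            patches.append(data_block[y + r0:y + r0 + tile, x + c0:x + c0 + tile])
    return patches

# e.g. a 131 x 131 data block, a 100 x 100 complete weight, a 25 x 25 weight block:
patches = target_data_for_block(np.zeros((131, 131)), k=100, block_offset=(0, 0), tile=25)
assert len(patches) == 32 * 32 and patches[0].shape == (25, 25)
```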
It should be noted that, when each data block is only a part of the complete data required by the convolution calculation, the convolution module needs, for each data block, to perform a convolution calculation with every weight block and that data block, so as to obtain the convolution data of each weight block relative to each data block.
When adding the convolution data of each weight block relative to the same data block to obtain the convolution calculation result, either a serial or a parallel manner may be used. If a serial manner is adopted, the specific process may be: adding the convolution data of the (i+1)th weight block relative to the data block to the accumulated convolution data of the ith weight block to obtain the accumulated convolution data of the (i+1)th weight block, where the accumulated convolution data of the 1st weight block equals the convolution data of the 1st weight block relative to the data block, i takes 1 to n in sequence, and n is an integer greater than 1. That is, the convolution data of the 2nd weight block relative to the data block is added to the convolution data of the 1st weight block relative to the data block to obtain the accumulated convolution data of the 2nd weight block; the convolution data of the 3rd weight block relative to the data block is added to the accumulated convolution data of the 2nd weight block to obtain the accumulated convolution data of the 3rd weight block; and so on, until the convolution data of the nth weight block relative to the data block is added to the accumulated convolution data of the (n-1)th weight block to obtain the accumulated convolution data of the nth weight block.
For a better understanding, a specific example is given below. Assume the size of the complete weight is 100 × 100 and the size of each weight block is 25 × 25, giving 16 weight blocks (numbered the 1st to the 16th in sequence). Assume the size of the complete data is 1310 × 1000 and the maximum data block size is 131 × 131, giving 10 × 8 = 80 data blocks (numbered 1 to 80 in sequence), of which 10 × 7 = 70 data blocks have the size 131 × 131 and 10 data blocks have the size 131 × 83. The process of the above convolution calculation is:
for data block 1, sliding of the complete weight over data block 1 at a step length of 1 is simulated; the target data corresponding to the 1st weight block in the data obtained at each sliding position is convolved with the 1st weight block, and the result is stored in the accumulation module. For example, the result of convolving the target data of the 1st sliding position with the 1st weight block is stored at address 0 in the accumulation module; the result of convolving the target data of the 2nd sliding position with the 1st weight block is stored at address 1; and so on, until the result of convolving the target data of the 1024th sliding position with the 1st weight block is stored at address 1023.
After the convolution calculation of the 1st weight block with data block 1 is completed, sliding of the complete weight over data block 1 at a step length of 1 is again simulated; the target data corresponding to the 2nd weight block in the data obtained at each sliding position is convolved with the 2nd weight block, and each result is added to the result already stored in the accumulation module. For example, the target data of the 1st sliding position is convolved with the 2nd weight block, and the data at address 0 is updated with this result, i.e., the updated data at address 0 equals the previously stored data plus the current result; the target data of the 2nd sliding position is convolved with the 2nd weight block, and the data at address 1 is updated likewise; and so on, until the target data of the 1024th sliding position is convolved with the 2nd weight block and the data at address 1023 is updated, i.e., the updated data at address 1023 equals the previously stored data plus the current result.
Operations similar to those for data block 1 are repeated until the convolution calculations of all weight blocks (16 in total) with data block 1 are completed and the convolution data of the weight blocks relative to data block 1 have been added; processing then switches to the next data block (data block 2), the same operations are repeated until the convolution calculations of all weight blocks with data block 2 are completed and their convolution data have been added, processing switches to data block 3, and so on, until the convolution calculations of all data blocks (80 in total) are completed.
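Putting the pieces together, the serial procedure just described can be sketched end to end; the accumulator array stands in for the accumulation module's addresses 0 to 1023, and all names are illustrative rather than prescribed by the embodiment:

```python
import numpy as np

# Serial blocked convolution: iterate the weight blocks one by one and, for
# each sliding position, add the weight block's partial product sum into the
# accumulator, mirroring the per-address updates described above.
def blocked_conv(data_block, full_w, tile, step=1):
    k = full_w.shape[0]
    out_h = (data_block.shape[0] - k) // step + 1
    out_w = (data_block.shape[1] - k) // step + 1
    acc = np.zeros((out_h, out_w))  # stands in for the accumulation module
    for r0 in range(0, k, tile):  # weight blocks, taken serially
        for c0 in range(0, k, tile):
            wb = full_w[r0:r0 + tile, c0:c0 + tile]
            for y in range(out_h):  # simulated sliding of the complete weight
                for x in range(out_w):
                    patch = data_block[y * step + r0:y * step + r0 + tile,
                                       x * step + c0:x * step + c0 + tile]
                    acc[y, x] += np.sum(wb * patch)  # update the stored partial sum
    return acc

# sanity check against a direct full-kernel convolution on small sizes
rng = np.random.default_rng(0)
d, w = rng.random((12, 12)), rng.random((8, 8))
direct = np.array([[np.sum(w * d[y:y + 8, x:x + 8]) for x in range(5)] for y in range(5)])
assert np.allclose(blocked_conv(d, w, tile=4), direct)
```

Because summing the partial products over all weight blocks reconstructs the full dot product at each sliding position, the result matches a direct convolution with the complete weight.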
In order to reduce data interaction between the AI chip and the external memory, in an optional embodiment the AI chip further includes a first buffer area and/or a second buffer area, as shown in fig. 2. The first buffer area is used for storing the input data block, and the second buffer area is used for storing the input weights. It should be noted that the first buffer area and the second buffer area need not both be present, so the specific example shown in fig. 2 is not to be construed as limiting the present application. By providing the first buffer area and the second buffer area inside the AI chip, the frequency of data interaction between the AI chip and the external memory can be reduced, which saves chip performance overhead and improves convolution calculation efficiency.
In order to better control the weight blocks and data blocks input to the AI chip during the convolution calculation, in an optional embodiment the AI chip further includes a control module connected to the first buffer area, the second buffer area, and the external memory, as shown in fig. 3. The control module is used for controlling the reading and writing of data in the first buffer area and/or the reading and writing of weights in the second buffer area. For example, the control module replaces the data block in the first buffer area with the next data block after the convolution calculations of that data block with each weight block are completed, and replaces the weight block in the second buffer area with the next weight block after the convolution calculation of that weight block with the data block stored in the first buffer area is completed. The control module is further used for reading the data in the first buffer area and the weights in the second buffer area and outputting them to the convolution module, so that the convolution module can perform the convolution calculation on the corresponding weights and data; a sketch of this scheduling follows.
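A short sketch of the scheduling order just described (the callback name and loop structure are assumptions; they mirror the worked example above, in which every weight block is run against the resident data block before the data block is replaced):

```python
# Control-module scheduling: for each data block resident in the first buffer,
# run all weight blocks against it (replacing the weight block in the second
# buffer after each), then replace the data block with the next one.
def control_loop(data_blocks, weight_blocks, run_conv):
    for d in data_blocks:          # data block resident in the first buffer area
        for wb in weight_blocks:   # weight block resident in the second buffer area
            run_conv(wb, d)        # convolve and accumulate, then fetch the next weight block
```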
For example, when a convolution calculation is performed with the data in the first buffer area and the weight block stored in the second buffer area, the control module simulates sliding the complete weight in the data block stored in the first buffer area at a preset step length, selects from the data obtained at each sliding position the target data corresponding to the weight block stored in the second buffer area, and sends the target data and the weight block stored in the second buffer area to the convolution module, so that the convolution module can perform the convolution calculation on the corresponding weights and data.
Following the above description, assume the size of the weight block stored in the second buffer area is 25 × 25. If sliding the complete weight (of size 100 × 100) in the data block (of size 131 × 131) stored in the first buffer area at the preset step length is simulated 32 × 32 times, 32 × 32 pieces of target data are obtained, each containing 25 × 25 parameters; these 32 × 32 pieces of target data can be regarded as one data group. Correspondingly, if there are a plurality of weight blocks, there are a plurality of corresponding data groups.
In addition, the control module may be further connected to the accumulation module, for adding, in the accumulation module, the convolution data of the weight blocks relative to the same data block to obtain the convolution calculation result, and then controlling the accumulation module to output the resulting convolution calculation result.
The AI chip provided by the embodiment of the present application may be any AI chip capable of performing the above convolution calculation, and may be a processor. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The various methods, steps and logic blocks disclosed in the embodiments of the present application may be implemented or performed by such a processor. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Based on the same inventive concept, the embodiment of the present application further provides an electronic device, which includes a memory and the AI chip described above. The memory is used for storing complete data and complete weight required by convolution calculation.
The electronic device may support the above convolution calculation and includes, but is not limited to, products such as mobile phones, tablets and computers.
The implementation principle and technical effects of the electronic device provided in the embodiment of the present application are the same as those of the AI chip embodiment described above; for brevity, for any part of the electronic device embodiment not mentioned here, reference may be made to the corresponding content in the AI chip embodiment.
Based on the same inventive concept, the embodiment of the present application provides a convolution calculation method applied to an AI chip including a plurality of calculation units, and the process thereof will be described below with reference to fig. 4.
S1: performing a convolution calculation with each weight block and the same data block to obtain the convolution data of each weight block relative to the data block.
The convolution module can be used for performing convolution calculation on each weight block and the same data block respectively to obtain the convolution data of each weight block relative to the data block.
Each weight block is a part of the complete weight required by the convolution calculation, and the number of weights in each weight block is not greater than the total number of calculation units; each data block is a part of the complete data required by the convolution calculation, obtained by segmenting the complete data according to the maximum output data size supported by the hardware and the size of the complete weight. Through such coordinated blocking and matching of the weight blocks and data blocks, the running performance of the hardware can be further improved.
The process of performing a convolution calculation with each weight block and the same data block to obtain the convolution data of each weight block relative to the data block may be as follows: with i taking 1 to n in sequence (n being an integer greater than 1), a convolution calculation is performed with the ith weight block and the target data corresponding to the ith weight block in the data block to obtain the convolution data of the ith weight block relative to the data block; the target data is the portion corresponding to the ith weight block, selected from the data obtained by simulating sliding the complete weight in the data block at a preset step length.
Optionally, if each data block is a part of the complete data required by the convolution calculation, then performing a convolution calculation with each weight block and the same data block to obtain the convolution data of each weight block relative to the data block includes: performing a convolution calculation with each weight block and each data block to obtain the convolution data of each weight block relative to each data block.
In an alternative embodiment, before S1, the convolution calculation method further includes: dividing the complete weight into a plurality of weight blocks according to the number of calculation units supported by the hardware, where the number of weights in each weight block is not greater than the total number of calculation units.
S2: adding the convolution data of each weight block relative to the same data block to obtain a convolution calculation result.
After the convolution data of each weight block relative to the data block is obtained, the convolution data of the weight blocks relative to the same data block are added; the accumulation module may be used for this addition to obtain the convolution calculation result.
Optionally, the process of adding the convolution data of each weight block relative to the same data block to obtain the convolution calculation result may be: with i taking 1 to n in sequence (n being an integer greater than 1), the convolution data of the (i+1)th weight block relative to the data block is added to the accumulated convolution data of the ith weight block to obtain the accumulated convolution data of the (i+1)th weight block, where the accumulated convolution data of the 1st weight block equals the convolution data of the 1st weight block relative to the data block.
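Equivalently, acc(1) = conv(1) and acc(i+1) = acc(i) + conv(i+1). A one-function sketch of this serial order, assuming the partial results are NumPy arrays and the function name is illustrative:

```python
# Keep one running buffer instead of storing every weight block's convolution
# data separately; the final buffer is the convolution calculation result.
def accumulate_serial(partial_results):
    acc = partial_results[0].copy()  # convolution data of the 1st weight block
    for nxt in partial_results[1:]:
        acc += nxt                   # acc(i+1) = acc(i) + conv(i+1)
    return acc
```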
The implementation principle and technical effects of the convolution calculation method provided in the embodiment of the present application are the same as those of the foregoing AI chip embodiment; for brevity, for the parts of the method embodiment not mentioned here, reference may be made to the corresponding contents in the foregoing AI chip embodiment.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A convolution calculation method applied to an AI chip including a plurality of calculation units, the method comprising:
performing convolution calculation on each weight block and the same data block respectively to obtain convolution data of each weight block relative to the data block, wherein each weight block is a part of the complete weight required by the convolution calculation, the number of weights in each weight block is not greater than the total number of calculation units, each data block is a part of the complete data required by the convolution calculation, and each data block is obtained by segmenting the complete data according to the maximum output data size supported by hardware and the size of the complete weight;
and adding the convolution data of each weight block relative to the same data block to obtain a convolution calculation result.
2. The method of claim 1, wherein the obtaining of the convolution data of each weight block with respect to the data block by performing convolution calculation on each weight block and the same data block respectively comprises:
performing convolution calculation on the ith weight block and target data corresponding to the ith weight block in the data block to obtain convolution data of the ith weight block relative to the data block, wherein i takes 1 to n in sequence, and n is an integer greater than 1; and the target data is the portion corresponding to the ith weight block, selected from the data obtained by simulating sliding the complete weight in the data block at a preset step length.
3. The method of claim 1, wherein adding the convolved data of each of the weight blocks with respect to the same data block to obtain a convolution calculation comprises:
and adding the convolution data of the (i+1)th weight block relative to the data block to the accumulated convolution data of the ith weight block to obtain the accumulated convolution data of the (i+1)th weight block, wherein the accumulated convolution data of the 1st weight block equals the convolution data of the 1st weight block relative to the data block, i takes 1 to n in sequence, and n is an integer greater than 1.
4. The method of claim 1, wherein before performing the convolution calculation using each weight block and the same data block to obtain the convolution data of each weight block relative to the data block, the method further comprises:
and dividing the complete weight into a plurality of weight blocks according to the number of computing units supported by hardware, wherein the weight number in each weight block is not more than the total number of computing units.
5. The method according to any one of claims 1-4, wherein obtaining the convolution data of each weight block with respect to the data block by performing convolution calculation on each weight block and the same data block respectively comprises:
and performing convolution calculation on each weight block and each data block respectively to obtain convolution data of each weight block relative to the data block.
6. An AI chip, comprising:
the convolution module is used for performing convolution calculation on each weight block and the same data block respectively to obtain convolution data of each weight block relative to the data block, wherein each weight block is a part of the complete weight required by the convolution calculation, the number of weights in each weight block is not greater than the number of calculation units in the convolution module, each data block is a part of the complete data required by the convolution calculation, and each data block is obtained by segmenting the complete data according to the maximum output data size supported by hardware and the size of the complete weight;
and the accumulation module is used for adding the convolution data of each weight block relative to the same data block to obtain a convolution calculation result.
7. The AI chip of claim 6, wherein the convolution module comprises:
a plurality of calculation units, each calculation unit for multiplying a corresponding weight by data;
and the summation unit is used for adding the calculation results of the calculation units.
8. The AI chip of claim 6, further comprising:
the first buffer area is used for storing input data blocks; and/or the presence of a gas in the gas,
and the second buffer area is used for storing the input weight.
9. The AI chip of claim 8, further comprising:
the control module is used for replacing the data block in the first cache region with the next data block after convolution calculation is finished by using each weight block and the data block in the first cache region respectively; and after the convolution calculation of the weight blocks stored in the second cache region and the data blocks stored in the first cache region is completed, replacing the weight blocks in the second cache region with the next weight blocks.
10. An electronic device, comprising:
a memory for storing complete data and complete weights required for convolution calculations;
and the AI chip of any of claims 6-9, the AI chip coupled to the memory.
CN202210838417.XA 2022-07-18 2022-07-18 Convolution calculation method, AI chip and electronic equipment Pending CN114997389A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210838417.XA CN114997389A (en) 2022-07-18 2022-07-18 Convolution calculation method, AI chip and electronic equipment


Publications (1)

Publication Number Publication Date
CN114997389A true CN114997389A (en) 2022-09-02

Family

ID=83020876

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210838417.XA Pending CN114997389A (en) 2022-07-18 2022-07-18 Convolution calculation method, AI chip and electronic equipment

Country Status (1)

Country Link
CN (1) CN114997389A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112200300A (en) * 2020-09-15 2021-01-08 厦门星宸科技有限公司 Convolutional neural network operation method and device
CN112215745A (en) * 2020-09-30 2021-01-12 深圳云天励飞技术股份有限公司 Image processing method and device and electronic equipment

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115983337A (en) * 2022-12-14 2023-04-18 北京登临科技有限公司 Convolution calculation unit, AI operation array and related equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination