CN113138957A - Chip for neural network inference and method for accelerating neural network inference - Google Patents

Info

Publication number
CN113138957A
Authority
CN
China
Prior art keywords
neural network
storage
bit
convolution
parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110336218.4A
Other languages
Chinese (zh)
Inventor
聂玉虎
林龙
崔文朋
史存存
刘瑞
王岳
郑哲
万能
汪晓
章海斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
Global Energy Interconnection Research Institute
Beijing Smartchip Microelectronics Technology Co Ltd
Overhaul Branch of State Grid Anhui Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
Global Energy Interconnection Research Institute
Beijing Smartchip Microelectronics Technology Co Ltd
Overhaul Branch of State Grid Anhui Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Information and Telecommunication Co Ltd, Global Energy Interconnection Research Institute, Beijing Smartchip Microelectronics Technology Co Ltd, Overhaul Branch of State Grid Anhui Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202110336218.4A priority Critical patent/CN113138957A/en
Publication of CN113138957A publication Critical patent/CN113138957A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/76 Architectures of general purpose stored program computers
    • G06F 15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F 15/781 On-chip cache; Off-chip memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/76 Architectures of general purpose stored program computers
    • G06F 15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F 15/7817 Specially adapted for signal processing, e.g. Harvard architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the field of artificial intelligence and provides a chip for neural network inference and a method for accelerating neural network inference. The chip for neural network inference comprises a storage-compute unit that includes a plurality of storage-compute arrays with different input lengths, and the plurality of storage-compute arrays are used to deploy convolution kernels corresponding to the input lengths of the storage-compute arrays. By providing storage-compute arrays with different input lengths in the storage-compute unit to match the pruned convolution kernels, the invention reduces power consumption while making maximal use of computing resources, so that high compute-resource utilization and low power consumption are achieved together.

Description

Chip for neural network inference and method for accelerating neural network inference
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a chip for neural network inference and a method for accelerating neural network inference.
Background
The convolutional neural network algorithm contains a large number of parameters: a model obtained by conventional training is typically very large, often hundreds of megabytes, and its inference consumes substantial computing resources, a storage and computation burden that embedded platforms with scarce hardware resources cannot bear.
To accelerate inference of convolutional neural network algorithms on a chip, various model light-weighting methods have been developed. Network pruning is a widely used method for deep neural network compression. For example, the Optimal Brain Damage (OBD) method treats every weight in the network as an individual parameter and, under the diagonal, extremal and quadratic assumptions, uses second derivatives to approximate parameter saliency during optimization, removing unimportant weights to improve the accuracy and generalization of the network. However, because single-weight pruning is unstructured, it yields a sparse network with irregular convolution kernels, which makes it difficult for a chip to achieve both high compute-resource utilization and low power consumption.
Disclosure of Invention
The invention aims to provide a chip for neural network inference and a method for accelerating neural network inference, so as to solve the problem that a chip can hardly achieve both high compute-resource utilization and low power consumption.
To achieve the above object, one aspect of the present invention provides a chip for neural network inference comprising a storage-compute unit, where the storage-compute unit includes a plurality of storage-compute arrays with different input lengths, and the plurality of storage-compute arrays are used to deploy convolution kernels corresponding to the input lengths of the storage-compute arrays.
Further, the convolution kernels deployed on the storage-compute arrays have been pruned and clustered.
Further, the storage-compute unit comprises four storage-compute arrays, and the input lengths of the four storage-compute arrays are 1 bit, 3 bits, 6 bits and 9 bits, respectively.
Further, the 1-bit storage-compute array is used to deploy convolution kernels with 1-bit parameters;
the 3-bit storage-compute array is used to deploy convolution kernels with 2-bit or 3-bit parameters;
the 6-bit storage-compute array is used to deploy convolution kernels with 4-bit to 6-bit parameters;
and the 9-bit storage-compute array is used to deploy convolution kernels with 7-bit to 9-bit parameters.
Furthermore, each storage-compute array corresponds to one convolution kernel, and the plurality of storage-compute arrays operate in parallel.
Another aspect of the present invention provides a method for accelerating neural network inference based on the chip for neural network inference described above, the method comprising:
pruning and clustering the convolution kernel parameters of each layer of the convolutional neural network; and
allocating the clustered convolution kernels to the storage-compute arrays of the chip for neural network inference that correspond to the parameter bits of the convolution kernels.
Further, the pruning and clustering of the convolution kernel parameters of each layer of the convolutional neural network includes: pruning the convolution kernel parameters of each layer of the convolutional neural network; quantizing the pruned convolution kernel parameters of each layer; and clustering the quantized convolution kernel parameters of each layer.
Further, the pruning of the convolution kernel parameters of each layer of the convolutional neural network includes:
obtaining the parameter values of each layer's convolution kernels and cutting off the parameters of each convolution kernel that are smaller than a preset threshold.
Further, the allocating of the clustered convolution kernels to the storage-compute arrays of the chip for neural network inference that correspond to the parameter bits of the convolution kernels includes:
allocating convolution kernels with 1-bit parameters to the 1-bit storage-compute array;
allocating convolution kernels with 2-bit or 3-bit parameters to the 3-bit storage-compute array;
allocating convolution kernels with 4-bit to 6-bit parameters to the 6-bit storage-compute array;
and allocating convolution kernels with 7-bit to 9-bit parameters to the 9-bit storage-compute array.
The present invention also provides a storage medium having stored thereon computer program instructions which, when executed, implement the method of accelerating neural network inference described above.
According to the chip for neural network inference of the present invention, storage-compute arrays with different input lengths are provided in the storage-compute unit to match the pruned convolution kernels, which reduces power consumption while making maximal use of computing resources, so that high compute-resource utilization and low power consumption are achieved together.
Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the embodiments of the invention without limiting the embodiments of the invention. In the drawings:
FIG. 1 is a block diagram of a storage-compute unit of a chip for neural network inference provided by an embodiment of the present invention;
FIGS. 2-5 are exemplary diagrams of the storage-compute arrays of a chip for neural network inference provided by an embodiment of the present invention;
FIG. 6 is an exemplary diagram of convolution kernels corresponding to the storage-compute arrays shown in FIGS. 2-5;
FIG. 7 is a flow chart of a method for accelerating neural network inference provided by an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
Typically, convolutional neural networks (CNNs) use 3 × 3 convolution kernels. To accelerate neural network inference, chips for inference are designed to maximize the parallelism of convolution kernel operations. After pruning, however, the input dimension of the operation unit is no longer fixed, because the number of parameters retained in the convolution kernels of different convolution layers varies (i.e., the input length of the convolution kernels is not fixed). If every convolution kernel is computed with operation units of the same width, computing resources cannot be fully utilized, power consumption increases, and inference efficiency suffers.
Fig. 1 is a block diagram of the storage-compute unit of a chip for neural network inference provided by an embodiment of the present invention. The embodiment provides a storage-compute integrated (compute-in-memory) chip for neural network inference that comprises a storage-compute unit. As shown in Fig. 1, the storage-compute unit includes a plurality of storage-compute arrays with different input lengths, and the storage-compute arrays are used to deploy convolution kernels corresponding to their input lengths. The convolution kernels deployed on each storage-compute array have been pruned and clustered. The chip provided by this embodiment reduces power consumption while making maximal use of computing resources, meeting the requirements of both compute-resource utilization and low power consumption.
Figs. 2-5 are exemplary diagrams of the storage-compute arrays of a chip for neural network inference provided by an embodiment of the present invention. The storage-compute unit of this embodiment includes four storage-compute arrays with input lengths of 1 bit, 3 bits, 6 bits and 9 bits, respectively: Fig. 2 shows the array with an input length of 1 bit, Fig. 3 the array with an input length of 3 bits, Fig. 4 the array with an input length of 6 bits, and Fig. 5 the array with an input length of 9 bits. The 1-bit storage-compute array is used to deploy convolution kernels with 1-bit parameters; the 3-bit array deploys kernels with 2-bit or 3-bit parameters; the 6-bit array deploys kernels with 4-bit to 6-bit parameters; and the 9-bit array deploys kernels with 7-bit to 9-bit parameters. Fig. 6 shows example convolution kernels corresponding to the arrays of Figs. 2-5: from left to right, the first kernel corresponds to the 1-bit array of Fig. 2, the second to the 3-bit array of Fig. 3, the third to the 6-bit array of Fig. 4, and the fourth to the 9-bit array of Fig. 5. During convolutional neural network inference, each pruned and quantized convolution kernel is assigned to its corresponding storage-compute array, i.e. each storage-compute array corresponds to one convolution kernel, and the storage-compute arrays operate in parallel.
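The following Python sketch is purely illustrative (an assumption for exposition, not the patented circuit): it models one storage-compute array as a multiply-accumulate over only the retained weights of a pruned 3 × 3 kernel, so a kernel deployed on the 3-input array of Fig. 3 costs three multiplications where a fixed 9-input unit would cost nine. The helper name array_conv_step and the int8 data layout are hypothetical.

```python
import numpy as np

def array_conv_step(weights, positions, window):
    """Model one storage-compute array of input length L = len(weights):
    multiply-accumulate only the L retained int8 weights against the
    matching pixels of a flattened 3x3 input window."""
    w = np.asarray(weights, dtype=np.int32)
    x = np.asarray(window, dtype=np.int32).ravel()[positions]
    return int(np.dot(w, x))

# A kernel with two surviving weights, deployed on the 3-input array:
weights = np.array([5, -3], dtype=np.int8)     # retained weights
positions = np.array([1, 8])                   # their indices in the 3x3 window
window = np.arange(9, dtype=np.int8)           # one 3x3 input patch, flattened
print(array_conv_step(weights, positions, window))   # 5*1 + (-3)*8 = -19
```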
According to the storage-compute integrated chip for neural network inference of this embodiment, storage-compute arrays with different input lengths are provided in the storage-compute unit to match the pruned convolution kernels, which reduces power consumption while making maximal use of computing resources, so that high compute-resource utilization and low power consumption are achieved together.
FIG. 7 is a flowchart of a method for accelerating neural network inference provided by an embodiment of the present invention. As shown in FIG. 7, the method for accelerating neural network inference provided in this embodiment is based on the above chip for neural network inference and includes the following steps:
S1, pruning and clustering the convolution kernel parameters of each layer of the convolutional neural network.
In a specific embodiment, pruning and clustering comprise the following sub-steps:
S11, pruning the convolution kernel parameters of each layer of the convolutional neural network, for example, obtaining the parameter values of each layer's convolution kernels and cutting off the parameters of each kernel that are smaller than a preset threshold.
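A minimal sketch of this pruning sub-step, assuming magnitude-based thresholding on float32 kernel tensors of shape (out_channels, in_channels, 3, 3); the function name and the threshold value are hypothetical and not fixed by the patent:

```python
import numpy as np

def prune_kernels(kernels: np.ndarray, threshold: float) -> np.ndarray:
    """Zero every convolution kernel parameter whose magnitude falls
    below the preset threshold, keeping the array shape unchanged."""
    pruned = kernels.copy()
    pruned[np.abs(pruned) < threshold] = 0.0
    return pruned

# Example: prune one layer's kernels with a preset threshold of 0.05.
layer_kernels = (np.random.randn(16, 8, 3, 3) * 0.1).astype(np.float32)
sparse_kernels = prune_kernels(layer_kernels, threshold=0.05)
```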
S12, quantizing the pruned convolution kernel parameters of each layer, for example, applying int8 quantization to the parameters retained after pruning, so that float32 convolution operations (multiply-add instructions) are converted into int8 convolution operations and the amount of computation is reduced.
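The description only states that float32 multiply-add is replaced by int8 convolution; one common way to realise this, shown here as a hedged sketch, is symmetric per-layer quantization with a single scale factor (the scale formula and function name are assumptions):

```python
import numpy as np

def quantize_int8(pruned_kernels: np.ndarray):
    """Symmetric per-layer int8 quantization: map float32 weights to
    int8 codes plus one scale factor used to dequantize results later."""
    max_abs = float(np.max(np.abs(pruned_kernels)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(pruned_kernels / scale), -128, 127).astype(np.int8)
    return q, scale

# sparse_kernels comes from the pruning sketch above (step S11).
q_kernels, scale = quantize_int8(sparse_kernels)
```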
S13, clustering the quantized convolution kernel parameters of each layer, i.e. clustering the retained parameters one by one to obtain cluster0, cluster1, cluster2 and cluster3, where cluster0 corresponds to the fewest retained parameters and cluster3 to the most.
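A sketch of this clustering sub-step under the assumption that kernels are grouped by how many parameters survive pruning, with bucket boundaries matching the four-array embodiment (1, 3, 6, 9); the patent does not specify the clustering algorithm, so this grouping rule is illustrative:

```python
import numpy as np

def cluster_kernels(q_kernels: np.ndarray) -> dict:
    """Group quantized 3x3 kernels (shape (num_kernels, 3, 3)) into
    cluster0..cluster3 by the number of retained (non-zero) parameters."""
    bounds = [1, 3, 6, 9]                      # upper bounds for cluster0..cluster3
    clusters = {f"cluster{i}": [] for i in range(4)}
    for kernel in q_kernels:
        n = int(np.count_nonzero(kernel))
        idx = next(i for i, b in enumerate(bounds) if n <= b)
        clusters[f"cluster{idx}"].append(kernel)
    return clusters

# q_kernels comes from the quantization sketch above (step S12).
clusters = cluster_kernels(q_kernels.reshape(-1, 3, 3))
```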
S2, allocating the clustered convolution kernels to the storage-compute arrays of the chip for neural network inference that correspond to the parameter bits of the convolution kernels. Specifically, convolution kernels with 1-bit parameters are allocated to the 1-bit storage-compute array; kernels with 2-bit or 3-bit parameters to the 3-bit array; kernels with 4-bit to 6-bit parameters to the 6-bit array; and kernels with 7-bit to 9-bit parameters to the 9-bit array. For example, cluster0, cluster1, cluster2 and cluster3 are fed into the 1-bit, 3-bit, 6-bit and 9-bit storage-compute arrays, respectively.
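Continuing the hypothetical cluster0 to cluster3 grouping from the sketch above, a software analogue of this dispatch rule routes each cluster to the array whose input length matches its parameter count, so the multiply-accumulate width per kernel equals the array's input length rather than always nine:

```python
CLUSTER_TO_ARRAY = {"cluster0": 1, "cluster1": 3, "cluster2": 6, "cluster3": 9}

def dispatch(clusters: dict) -> dict:
    """Return {array input length: kernels deployed on that array}."""
    deployment = {length: [] for length in (1, 3, 6, 9)}
    for name, kernels in clusters.items():
        deployment[CLUSTER_TO_ARRAY[name]].extend(kernels)
    return deployment

# clusters comes from the clustering sketch above (step S13).
deployment = dispatch(clusters)
for length, kernels in deployment.items():
    print(f"{length}-bit array: {len(kernels)} kernels")
```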
According to the method for accelerating neural network inference of this embodiment, the pruned and clustered convolution kernels of the trained network are deployed to storage-compute arrays of matching input lengths, which makes maximal use of computing resources, reduces power consumption and accelerates neural network inference.
Embodiments of the present invention also provide a machine-readable storage medium having stored thereon computer program instructions which, when executed, implement the above-described method of accelerating neural network inference.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, systems and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that modifications and equivalents may be made to the embodiments without departing from the spirit and scope of the invention, and such modifications and equivalents are intended to be covered by the claims.

Claims (10)

1. A chip for neural network inference, comprising a storage-compute unit, characterized in that the storage-compute unit comprises a plurality of storage-compute arrays with different input lengths, and the plurality of storage-compute arrays are used to deploy convolution kernels corresponding to the input lengths of the storage-compute arrays.
2. The chip for neural network inference of claim 1, wherein the convolution kernels deployed on the storage-compute arrays are pruned and clustered.
3. The chip for neural network inference of claim 1, wherein the storage-compute unit comprises four storage-compute arrays, and the input lengths of the four storage-compute arrays are 1 bit, 3 bits, 6 bits and 9 bits, respectively.
4. The chip for neural network inference of claim 3, wherein the 1-bit storage-compute array is used to deploy convolution kernels with 1-bit parameters;
the 3-bit storage-compute array is used to deploy convolution kernels with 2-bit or 3-bit parameters;
the 6-bit storage-compute array is used to deploy convolution kernels with 4-bit to 6-bit parameters;
and the 9-bit storage-compute array is used to deploy convolution kernels with 7-bit to 9-bit parameters.
5. The chip for neural network inference of claim 1, wherein each storage-compute array corresponds to one convolution kernel, and the plurality of storage-compute arrays operate in parallel.
6. A method for accelerating neural network inference based on the chip for neural network inference of claim 1, the method comprising:
pruning and clustering the convolution kernel parameters of each layer of the convolutional neural network; and
allocating the clustered convolution kernels to the storage-compute arrays of the chip for neural network inference that correspond to the parameter bits of the convolution kernels.
7. The method for accelerating neural network inference of claim 6, wherein the pruning and clustering of the convolution kernel parameters of each layer of the convolutional neural network comprises:
pruning the convolution kernel parameters of each layer of the convolutional neural network;
quantizing the pruned convolution kernel parameters of each layer; and
clustering the quantized convolution kernel parameters of each layer.
8. The method for accelerating neural network inference of claim 7, wherein the pruning of the convolution kernel parameters of each layer of the convolutional neural network comprises:
obtaining the parameter values of each layer's convolution kernels and cutting off the parameters of each convolution kernel that are smaller than a preset threshold.
9. The method for accelerating neural network inference of claim 6, wherein the allocating of the clustered convolution kernels to the storage-compute arrays of the chip for neural network inference that correspond to the parameter bits of the convolution kernels comprises:
allocating convolution kernels with 1-bit parameters to the 1-bit storage-compute array;
allocating convolution kernels with 2-bit or 3-bit parameters to the 3-bit storage-compute array;
allocating convolution kernels with 4-bit to 6-bit parameters to the 6-bit storage-compute array;
and allocating convolution kernels with 7-bit to 9-bit parameters to the 9-bit storage-compute array.
10. A storage medium having computer program instructions stored thereon that, when executed, implement the method of accelerating neural network inference of any of claims 6-9.
CN202110336218.4A 2021-03-29 2021-03-29 Chip for neural network inference and method for accelerating neural network inference Pending CN113138957A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110336218.4A CN113138957A (en) 2021-03-29 2021-03-29 Chip for neural network inference and method for accelerating neural network inference

Publications (1)

Publication Number Publication Date
CN113138957A true CN113138957A (en) 2021-07-20

Family

ID=76810135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110336218.4A Pending CN113138957A (en) 2021-03-29 2021-03-29 Chip for neural network inference and method for accelerating neural network inference

Country Status (1)

Country Link
CN (1) CN113138957A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256640A (en) * 2016-12-28 2018-07-06 上海磁宇信息科技有限公司 Convolutional neural networks implementation method
CN109409512A (en) * 2018-09-27 2019-03-01 西安交通大学 A kind of neural computing unit, computing array and its construction method of flexibly configurable
CN109635944A (en) * 2018-12-24 2019-04-16 西安交通大学 A kind of sparse convolution neural network accelerator and implementation method
WO2020216227A1 (en) * 2019-04-24 2020-10-29 华为技术有限公司 Image classification method and apparatus, and data processing method and apparatus
CN110222818A (en) * 2019-05-13 2019-09-10 西安交通大学 A kind of more bank ranks intertexture reading/writing methods for the storage of convolutional neural networks data
CN111985602A (en) * 2019-05-24 2020-11-24 华为技术有限公司 Neural network computing device, method and computing device
CN110334799A (en) * 2019-07-12 2019-10-15 电子科技大学 Integrated ANN Reasoning and training accelerator and its operation method are calculated based on depositing
CN112418388A (en) * 2019-08-23 2021-02-26 中兴通讯股份有限公司 Method and device for realizing deep convolutional neural network processing
CN110991608A (en) * 2019-11-25 2020-04-10 合肥恒烁半导体有限公司 Convolutional neural network quantitative calculation method and system
CN111242277A (en) * 2019-12-27 2020-06-05 中国电子科技集团公司第五十二研究所 Convolutional neural network accelerator supporting sparse pruning and based on FPGA design
CN112395247A (en) * 2020-11-18 2021-02-23 北京灵汐科技有限公司 Data processing method and storage and calculation integrated chip

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李永博; 王琴; 蒋剑飞: "Design of a sparse convolutional neural network accelerator" [稀疏卷积神经网络加速器设计], Microelectronics & Computer (微电子学与计算机), no. 06, pages 34-38 *

Similar Documents

Publication Publication Date Title
CN110097186B (en) Neural network heterogeneous quantitative training method
CN109445935B (en) Self-adaptive configuration method of high-performance big data analysis system in cloud computing environment
CN111814973B (en) Memory computing system suitable for neural ordinary differential equation network computing
US11928599B2 (en) Method and device for model compression of neural network
CN112272102B (en) Method and device for unloading and scheduling edge network service
CN112153145A (en) Method and device for unloading calculation tasks facing Internet of vehicles in 5G edge environment
CN116644804B (en) Distributed training system, neural network model training method, device and medium
CN113283587B (en) Winograd convolution operation acceleration method and acceleration module
CN115776524A (en) Internet of things mass data multistage scheduling transmission system for intelligent manufacturing
CN114580636A (en) Neural network lightweight deployment method based on three-target joint optimization
CN112861996A (en) Deep neural network model compression method and device, electronic equipment and storage medium
CN110263917B (en) Neural network compression method and device
CN116502691A (en) Deep convolutional neural network mixed precision quantization method applied to FPGA
CN111860867A (en) Model training method and system for hybrid heterogeneous system and related device
CN116962176B (en) Data processing method, device and system of distributed cluster and storage medium
CN113886092A (en) Computation graph execution method and device and related equipment
CN117521752A (en) Neural network acceleration method and system based on FPGA
CN113138957A (en) Chip for neural network inference and method for accelerating neural network inference
CN111860810A (en) Neural network operation method, device and equipment based on FPGA
CN116302481B (en) Resource allocation method and system based on sparse knowledge graph link prediction
CN112612601A (en) Intelligent model training method and system for distributed image recognition
CN115130672B (en) Software and hardware collaborative optimization convolutional neural network calculation method and device
CN114065923A (en) Compression method, system and accelerating device of convolutional neural network
CN111260049A (en) Neural network implementation method based on domestic embedded system
CN110728372A (en) Cluster design method and cluster architecture for dynamic loading of artificial intelligence model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination