CN112598129A - Adjustable hardware-aware pruning and mapping framework based on ReRAM neural network accelerator - Google Patents

Adjustable hardware-aware pruning and mapping framework based on ReRAM neural network accelerator

Info

Publication number
CN112598129A
CN112598129A
Authority
CN
China
Prior art keywords
pruning
neural network
reram
actor
decision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110236303.3A
Other languages
Chinese (zh)
Inventor
何水兵
杨斯凌
陈伟剑
陈平
陈帅犇
银燕龙
任祖杰
曾令仿
杨弢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Zhejiang Lab
Original Assignee
Zhejiang University ZJU
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU, Zhejiang Lab filed Critical Zhejiang University ZJU
Priority to CN202110236303.3A priority Critical patent/CN112598129A/en
Publication of CN112598129A publication Critical patent/CN112598129A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides an adjustable hardware-aware pruning and mapping framework based on a ReRAM neural network accelerator, comprising a DDPG agent and the ReRAM neural network accelerator. The DDPG agent consists of a behavior decision module Actor and a Critic judgment module; the Actor makes pruning decisions on the neural network. The ReRAM neural network accelerator maps the model formed under the pruning decision generated by the Actor, and feeds the performance parameters of the mapped model under that decision back to the Critic as signals; the performance parameters include the simulator's energy consumption, latency, and model accuracy. The Critic judgment module updates the reward-function value according to the fed-back performance parameters and guides the Actor's next-stage pruning decision. Using the reinforcement-learning DDPG agent, the method derives the most efficient pruning scheme best matched to the hardware and to user requirements, improving latency and energy-consumption performance on the hardware while preserving accuracy.

Description

Adjustable hardware-aware pruning and mapping framework based on ReRAM neural network accelerator
Technical Field
The invention relates to the field of artificial intelligence in computer science, and in particular to an adjustable hardware-aware pruning and mapping framework based on a ReRAM neural network accelerator.
Background
Deep neural networks have driven major advances in computer vision, natural language processing, robotics, and related fields, and with the growth of mobile Internet-of-Things platforms their deployment on IoT devices is expanding rapidly. Because neural networks are compute-intensive and move large volumes of data, deploying them can incur high energy consumption and high latency, while IoT platforms have limited computational resources and energy budgets; IoT devices therefore need more efficient neural network mapping schemes that reduce energy consumption and latency. Thanks to its very low leakage, high-density storage, and in-memory computing characteristics, resistive random-access memory (ReRAM) enables neural network accelerators that address these IoT constraints. At the same time, today's highly sparse neural network models keep growing, causing substantial unnecessary resource waste and increased latency; pruning a model before mapping it onto the ReRAM neural network accelerator can greatly reduce the model size and lower both hardware energy consumption and application latency. However, when the hardware specification and type of the ReRAM neural network accelerator differ, or users demand different levels of latency and energy consumption, traditional deep-learning pruning schemes cannot perceive the changed hardware or user requirements and produce the same pruning scheme regardless, leading to inefficient model mapping on the ReRAM accelerator hardware and constraining the accelerator's performance advantages.
Disclosure of Invention
To explore the mapping of a convolutional neural network onto a ReRAM neural network accelerator more efficiently according to the requirements of mobile-device users, the invention provides an adjustable, intelligent hardware-aware pruning and mapping framework. Feedback obtained by a reinforcement-learning agent from the ReRAM accelerator hardware (e.g., latency, energy consumption, energy efficiency) replaces signals that cannot express performance on the hardware accelerator (e.g., model size, number of floating-point operations), and a deep deterministic policy gradient (DDPG) is used to search for and decide the pruning strategy. This determines a pruning strategy better suited to the ReRAM neural network accelerator, reduces the latency and energy consumption of the pruned and mapped neural network model on the hardware accelerator, enables wearable mobile IoT devices to run deep-learning applications under limited resources, and, when hardware and user requirements for latency and energy consumption differ, finds the pruning and mapping framework best suited to those requirements.
The technical scheme adopted by the invention is as follows:
An adjustable hardware-aware pruning and mapping framework based on a ReRAM neural network accelerator comprises a DDPG agent and the ReRAM neural network accelerator. The DDPG agent consists of a behavior decision module Actor and a Critic judgment module; the Actor is used for making pruning decisions on a neural network model.
The ReRAM neural network accelerator is used for mapping the model formed under a pruning decision generated by the behavior decision module Actor, and for feeding the performance parameters of the mapped model under that pruning decision back to the Critic judgment module as signals; the performance parameters comprise the energy consumption, latency, and model accuracy of the ReRAM neural network accelerator.
The Critic judgment module is used for updating the reward-function value according to the fed-back performance parameters, evaluating the performance of the behavior decision module Actor, and guiding the Actor's next-stage pruning decision so that the reward-function value converges.
the value of the reward function is selected according to the requirements of the userReward1 (energy consumption) and/orReward2 (deferred) update, the actual performance in hardware has been achieved:
Reward1=-Error×log(Energy)
Reward2=-Error×log(Latency)
where Error = 1 − accuracy, accuracy is the model accuracy, Energy is the energy-consumption performance of the ReRAM neural network accelerator, and Latency is the delay performance of the ReRAM neural network accelerator.
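The two reward formulas above can be sketched directly in code. This is an illustrative helper, not part of the patent; the only logic it encodes is Reward1 = -Error × log(Energy) and Reward2 = -Error × log(Latency) with Error = 1 − accuracy:

```python
import math

def reward1(accuracy: float, energy: float) -> float:
    """Energy-aware reward: Reward1 = -Error * log(Energy), Error = 1 - accuracy."""
    return -(1.0 - accuracy) * math.log(energy)

def reward2(accuracy: float, latency: float) -> float:
    """Latency-aware reward: Reward2 = -Error * log(Latency)."""
    return -(1.0 - accuracy) * math.log(latency)
```

Both rewards increase as accuracy rises and as the measured energy or latency falls, which is what lets the Critic steer the Actor toward hardware-friendly pruning decisions.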
Further, the behavior decision module Actor makes a pruning decision on the neural network model as follows:
according to the input state parameter s_k characterizing the k-th layer of the neural network, the Actor outputs a sparsity ratio, and the neural network model is compressed layer by layer with a compression algorithm according to each layer's sparsity ratio. Namely: the current layer is compressed using a specified compression algorithm (e.g., channel pruning). The agent then moves to the next layer k+1 and receives state s_{k+1}, until the last layer is completed.
Further, the state parameter s_k is characterized by 8 features:
(k, type, in_channels, out_channels, stride, kernelsize, flops[k], a_{k−1})
where k is the layer index, type is the layer kind, in_channels is the number of input channels, out_channels is the number of output channels, stride is the convolution stride, and kernelsize is the convolution kernel length, so the convolution kernel size is in_channels × out_channels × kernelsize × stride; flops[k] is the number of floating-point operations of the k-th layer, scaled into [0, 1] before being passed to the behavior decision module Actor; a_{k−1} is the pruning action taken at the previous layer, expressed as a compression rate.
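The 8-feature state vector above can be assembled as follows. This is a sketch: the function name is invented for illustration, and dividing by the network's largest per-layer FLOP count is an assumed normalization, since the patent only says flops[k] is scaled into [0, 1]:

```python
def layer_state(k, layer_type, in_channels, out_channels, stride,
                kernel_size, flops_k, max_flops, prev_action):
    """Assemble the 8-feature state s_k for the Actor.

    layer_type is a numeric code for the layer kind (0 = convolutional,
    1 = fully connected, following the detailed description below).
    flops_k / max_flops scales the FLOP count into [0, 1]; the choice of
    max_flops as the scaling constant is an assumption.
    prev_action is a_{k-1}, the previous layer's compression rate.
    """
    return (k, layer_type, in_channels, out_channels, stride,
            kernel_size, flops_k / max_flops, prev_action)
```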
Traditional deep-learning pruning optimization schemes guide the reinforcement-learning agent's pruning decisions using the number of floating-point operations or the model size as the pruning signal. Consequently, when the hardware specification and type of the ReRAM neural network accelerator differ, or users demand different levels of latency and energy consumption, such schemes cannot perceive the changed hardware or user requirements and produce the same pruning scheme, leading to inefficient model mapping on the ReRAM accelerator hardware. The invention provides an adjustable, intelligent hardware-aware pruning scheme and mapping framework. Unlike the original approach of mapping the neural network directly onto the ReRAM neural network accelerator, the framework uses the deep deterministic policy gradient (DDPG) in reinforcement learning to search for and decide the pruning strategy, and feeds actual hardware performance (e.g., latency, energy consumption) back to the reinforcement-learning agent according to the user's requirements. After hardware-aware pruning according to user requirements, the model is mapped onto the ReRAM neural network accelerator, which greatly reduces the latency and energy consumption of the neural network model on the accelerator and improves mapping performance.
Drawings
FIG. 1 is an overall block diagram of the hardware-aware pruning and mapping framework of the present invention;
FIG. 2 is a flow chart of an experiment;
fig. 3 is a histogram comparing pruning strategies searched under the NeuroSim hardware configurations of three types of simulators, i.e., configuration 2, configuration 3, and configuration 4, under the delay perception scheme adopted by the VGG-16.
Detailed Description
FIG. 1 is an overall block diagram of the hardware-aware pruning and mapping framework of the present invention. As shown, it includes a DDPG agent and a ReRAM neural network accelerator. The ReRAM neural network accelerator comprises a plurality of processing elements (PEs), each consisting of a crossbar array of ReRAM cells, an on-chip cache, a nonlinear activation processing unit, an analog-to-digital converter, and other peripheral circuits (only the crossbar array, on-chip cache, and nonlinear activation processing unit are drawn in the figure). The DDPG agent consists of a behavior decision module Actor and a Critic judgment module. The whole pruning and mapping framework contains two levels. In the first level, the Actor of the DDPG agent makes pruning decisions, from the first layer to the last, on the neural network model according to hardware feedback; in the second level, the model formed under the pruning decision is mapped onto the ReRAM neural network accelerator, and the performance parameters of the mapped model under that pruning scheme are fed back as signals to the Critic judgment module in the first-level DDPG agent. The Critic judgment module evaluates the performance of the Actor, updates the reward-function value for the given hardware type and user requirements, and guides the Actor's next-stage pruning decision. After a certain number of epochs the reward-function value converges and the system finds an optimal pruning decision scheme; a CKPT model is then exported after pruning according to the hardware-aware pruning strategy, and the CKPT model is fine-tuned to guarantee accuracy.
As a preferred scheme, the behavior decision module Actor of the DDPG agent makes pruning decisions from the first layer to the last on the neural network model according to the performance-parameter feedback as follows:
the Actor receives from the environment the state parameter s_k of the k-th layer of the neural network model, outputs a sparsity ratio a_k, compresses the current layer with the specified compression algorithm according to a_k, and then the agent moves to the next layer k+1 and receives state s_{k+1}.
For each layer k, the state parameter s_k is characterized by 8 features:
(k, type, in_channels, out_channels, stride, kernelsize, flops[k], a_{k−1}) (1)
where k is the layer index; type is the layer kind (convolutional layers and fully connected layers, denoted 0 and 1, respectively); in_channels is the number of input channels; out_channels is the number of output channels; stride is the convolution stride; kernelsize is the convolution kernel length, so the convolution kernel size is in_channels × out_channels × kernelsize × stride; flops[k] is the number of floating-point operations of the k-th layer, scaled into [0, 1] before being passed to the behavior decision module Actor; a_{k−1} is the pruning action taken at the previous layer.
To realize finer-grained pruning decisions, the pruning action of the behavior decision module Actor is expressed as a compression rate, a_k ∈ (0, 1]. That is, using a channel pruning algorithm, the action is rounded to the nearest fraction that ultimately yields an integer number of channels, taken as a_k, and the current layer is compressed with a_k.
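One reasonable way to implement the rounding step above is sketched here; the patent does not spell out tie-breaking or a floor, so keeping at least one channel is an assumption:

```python
def round_action_to_channels(a_k: float, n_channels: int) -> float:
    """Round a continuous action a_k in (0, 1] to the nearest fraction
    that keeps an integer number of channels (at least one), as channel
    pruning requires."""
    kept = max(1, round(a_k * n_channels))  # integer channel count
    return kept / n_channels                # back to a compression rate
```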
After the pruning decision for the final layer is completed, the accuracy and the hardware performance parameters (latency or energy consumption) are evaluated on the ReRAM neural network accelerator using a validation set, and the reward-function value is computed and returned to the Critic judgment module.
The accuracy computed in this way approximates the fine-tuned accuracy, so the reward function can be evaluated without fine-tuning, enabling fast search.
The reward-function value is updated via Reward1 and/or Reward2:
Reward1=-Error×log(Energy)
Reward2=-Error×log(Latency)
where Error = 1 − accuracy, accuracy is the model accuracy, Energy is the energy-consumption performance of the ReRAM neural network accelerator, and Latency is the delay performance of the ReRAM neural network accelerator.
When energy support is very limited in an IoT device and the user's demand for latency is not urgent (i.e., the user places more emphasis on energy performance in the pruning mapping), the reward function may consider post-pruning accuracy together with the energy performance measured on the ReRAM neural network accelerator hardware. Conversely, when an IoT device has sufficient energy support and the user's latency demand is high (i.e., the user emphasizes latency performance), the reward function may consider post-pruning accuracy together with the latency performance measured on the hardware. The method can therefore design the pruning scheme and mapping framework more efficiently and accurately for different hardware and user requirements.
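The adjustability described above amounts to choosing between Reward1 and Reward2 per deployment. A minimal sketch, with the `emphasis` knob being a hypothetical name introduced here for illustration:

```python
import math

def select_reward(accuracy, energy, latency, emphasis="latency"):
    """Pick the reward per the user's emphasis: "energy" for
    energy-limited IoT devices (Reward1), "latency" for
    latency-critical ones (Reward2)."""
    error = 1.0 - accuracy
    if emphasis == "energy":
        return -error * math.log(energy)    # Reward1
    return -error * math.log(latency)       # Reward2
```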
In addition, the configuration of the ReRAM neural network accelerator can be adjusted, hardware perception is achieved, and a pruning scheme and a mapping framework under the optimal configuration and user requirements are obtained.
The invention is further illustrated below with reference to a specific example, in which the following experiments were carried out:
experimental configuration:
(1) Operating system: Ubuntu 18.04.3 LTS;
(2) CPU: 8-core Intel(R) Xeon(R) Gold 6126 @ 2.60GHz, with 32GB DRAM;
(3) GPU: Tesla V100, 32GB video memory;
(4) Storage: 512GB SK hynix SC311 SATA SSD; Western Digital WDC WD40EZRZ-75G HDD;
configuring a neural network model:
(1) Neural network models: CIFAR10, Plain20, and VGG-16; their structures are shown in Table 1.
TABLE 1 neural network model and structural representation thereof
[Table 1 is reproduced as an image in the original and is not shown here.]
(2) Data set: cifar10, comprising 60000 color images, 32 × 32 in size, divided into 10 classes of 6000 images each, wherein 50000 images were used for training and 10000 images were used for testing;
(3) batch size: 1024 pictures/batch (CIFAR10, Plain20), 512 pictures/batch (VGG-16);
(4) Number of training rounds: 70 epochs;
(5) Number of fine-tuning (finetune) rounds: 50 epochs;
ReRAM neural network accelerator configuration:
experiments were performed using the simulator NeuroSim of the ReRAM neural network accelerator, the configuration of which is shown in table 2.
TABLE 2 simulator NeuroSim configuration
[Table 2 is reproduced as an image in the original and is not shown here.]
Experimental procedure
FIG. 2 is a flow chart of the whole pruning experiment, which comprises the following steps:
Step 1: the user writes the neural network model code and inputs it into the pruning and mapping framework of the invention; the model is pre-trained and saved, producing a CKPT file;
Step 2: an optimal pruning strategy is searched out using reinforcement learning and hardware awareness;
Step 3: pruning is performed according to the optimal pruning strategy obtained by reinforcement learning in step 2, and the pruned parameters are saved to a CKPT file;
Step 4: to guarantee post-pruning accuracy, the parameters of the CKPT model are fine-tuned;
Step 5: simulation on the NeuroSim simulator yields the post-pruning accuracy, completing the final pruning-strategy mapping.
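The five experimental steps can be sketched as one pipeline. All method names here (`pretrain`, `search`, `prune`, `finetune`, `evaluate`) are hypothetical stand-ins for the framework's and NeuroSim's interfaces; none come from the patent:

```python
def run_pruning_flow(model, framework, simulator):
    """Five-step flow from FIG. 2, under assumed interface names."""
    ckpt = framework.pretrain(model)               # step 1: pretrain, save CKPT
    strategy = framework.search(ckpt, simulator)   # step 2: RL hardware-aware search
    pruned = framework.prune(ckpt, strategy)       # step 3: prune with best strategy
    tuned = framework.finetune(pruned)             # step 4: fine-tune for accuracy
    return simulator.evaluate(tuned)               # step 5: simulate, get accuracy
```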
The final test results are:
Under the original direct neural-network mapping scheme (using the number of floating-point operations or the model size as the signal guiding the reinforcement-learning agent's pruning decision), with the corresponding hardware configurations, the hardware-simulator latencies for CIFAR10 (configuration 2), Plain20 (configuration 1), and VGG-16 (configuration 2) are 957150.6 ns/image, 4830571.6 ns/image, and 1026976.8 ns/image respectively, and the simulator energy consumptions are 23488814 pJ, 15816979.0 pJ, and 8058001.0 pJ respectively. With the hardware-aware latency-perception scheme, the latency performance of the three models improves by 57.358%, 6.771%, and 38.017% respectively, and the Top-5 accuracy improves by 0.210%, 0.190%, and 0.290% respectively. With the energy-consumption-perception scheme, the energy performance of the three models improves by 76.833%, 5.615%, and 38.425% respectively, and the Top-5 accuracy improves by 0.270%, 0.230%, and 0.420% respectively. The hardware-aware pruning scheme and mapping framework thus improve the performance the user cares about while maintaining accuracy.
Table 3 shows the pruning strategies searched under three NeuroSim hardware configurations (configuration 2, configuration 3, and configuration 4) when VGG-16 uses the latency-perception scheme. The list in each pruning strategy represents the channel retention of each group of filters.
Table 3 pruning strategy searched under three simulator NeuroSim hardware configurations under delay perception scheme adopted by VGG-16
[Table 3 is reproduced as an image in the original and is not shown here.]
Fig. 3 is a histogram comparing the pruning strategies searched under the three NeuroSim hardware configurations (configuration 2, configuration 3, and configuration 4) when VGG-16 uses the latency-perception scheme. The abscissa is the filter group index and the ordinate is the channel retention rate under the pruning strategy. The histogram shows that the channel-retention distribution across filter groups follows different trends under different hardware configurations, which confirms the need for hardware-aware strategies when facing different hardware configurations.
It should be understood that the above examples are given only for clarity of illustration and do not limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments exhaustively. Obvious variations or modifications derived from the invention remain within its scope of protection.

Claims (3)

1. An adjustable hardware-aware pruning and mapping framework based on a ReRAM neural network accelerator, characterized by comprising a DDPG agent and the ReRAM neural network accelerator; the DDPG agent consists of a behavior decision module Actor and a Critic judgment module, wherein the behavior decision module Actor is used for making pruning decisions on a neural network model;
the ReRAM neural network accelerator is used for mapping the model formed under the pruning decision generated by the behavior decision module Actor, and for feeding the performance parameters of the mapped model under that pruning decision back to the Critic judgment module as signals; the performance parameters comprise the energy consumption, latency, and model accuracy of the ReRAM neural network accelerator;
the Critic judgment module is used for updating the reward-function value according to the fed-back performance parameters, evaluating the performance of the behavior decision module Actor, and guiding the Actor's next-stage pruning decision so that the reward-function value converges;
the value of the reward function is passedReward1 and/orReward2, updating:
Reward1=-Error×log(Energy)
Reward2=-Error×log(Latency)
where Error = 1 − accuracy, accuracy is the model accuracy, Energy is the energy-consumption performance of the ReRAM neural network accelerator, and Latency is the delay performance of the ReRAM neural network accelerator.
2. The pruning and mapping framework of claim 1, wherein the behavior decision module Actor is configured to make a pruning decision for the neural network model by:
the behavior decision module Actor is used for characterizing the second time according to the inputkState parameters of a layer neural networks k Output the sparse rate, and according to each layerThe sparsity ratio uses a compression algorithm to compress the neural network model layer by layer.
3. The pruning and mapping framework of claim 2, wherein the state parameter s_k is characterized by 8 features:
(k, type, in_channels, out_channels, stride, kernelsize, flops[k], a_{k−1})
where k is the layer index, type is the layer kind, in_channels is the number of input channels, out_channels is the number of output channels, stride is the convolution stride, and kernelsize is the convolution kernel length, so the convolution kernel size is in_channels × out_channels × kernelsize × stride; flops[k] is the number of floating-point operations of the k-th layer, scaled into [0, 1] before being passed to the behavior decision module Actor; a_{k−1} is the pruning action taken at the previous layer.
CN202110236303.3A 2021-03-03 2021-03-03 Adjustable hardware-aware pruning and mapping framework based on ReRAM neural network accelerator Pending CN112598129A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110236303.3A CN112598129A (en) 2021-03-03 2021-03-03 Adjustable hardware-aware pruning and mapping framework based on ReRAM neural network accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110236303.3A CN112598129A (en) 2021-03-03 2021-03-03 Adjustable hardware-aware pruning and mapping framework based on ReRAM neural network accelerator

Publications (1)

Publication Number Publication Date
CN112598129A true CN112598129A (en) 2021-04-02

Family

ID=75210318

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110236303.3A Pending CN112598129A (en) 2021-03-03 2021-03-03 Adjustable hardware-aware pruning and mapping framework based on ReRAM neural network accelerator

Country Status (1)

Country Link
CN (1) CN112598129A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113392969A (en) * 2021-05-14 2021-09-14 宁波物栖科技有限公司 Model pruning method for reducing power consumption of CNN accelerator based on ReRAM
CN114240192A (en) * 2021-12-21 2022-03-25 特斯联科技集团有限公司 Equipment optimization configuration method and system for park energy efficiency improvement based on reinforcement learning
CN116069512A (en) * 2023-03-23 2023-05-05 之江实验室 Serverless efficient resource allocation method and system based on reinforcement learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340227A (en) * 2020-05-15 2020-06-26 支付宝(杭州)信息技术有限公司 Method and device for compressing business prediction model through reinforcement learning model
CN111600851A (en) * 2020-04-27 2020-08-28 浙江工业大学 Feature filtering defense method for deep reinforcement learning model
CN112101534A (en) * 2019-06-17 2020-12-18 英特尔公司 Reconfigurable memory compression techniques for deep neural networks

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101534A (en) * 2019-06-17 2020-12-18 英特尔公司 Reconfigurable memory compression techniques for deep neural networks
CN111600851A (en) * 2020-04-27 2020-08-28 浙江工业大学 Feature filtering defense method for deep reinforcement learning model
CN111340227A (en) * 2020-05-15 2020-06-26 支付宝(杭州)信息技术有限公司 Method and device for compressing business prediction model through reinforcement learning model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KUAN WANG et al.: "HAQ: Hardware-Aware Automated Quantization with Mixed Precision", 2019 IEEE CVPR *
YIHUI HE et al.: "AMC: AutoML for Model Compression and Acceleration on Mobile Devices", ECCV 2018 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113392969A (en) * 2021-05-14 2021-09-14 宁波物栖科技有限公司 Model pruning method for reducing power consumption of CNN accelerator based on ReRAM
CN113392969B (en) * 2021-05-14 2022-05-03 宁波物栖科技有限公司 Model pruning method for reducing power consumption of CNN accelerator based on ReRAM
CN114240192A (en) * 2021-12-21 2022-03-25 特斯联科技集团有限公司 Equipment optimization configuration method and system for park energy efficiency improvement based on reinforcement learning
CN114240192B (en) * 2021-12-21 2022-06-24 特斯联科技集团有限公司 Equipment optimization configuration method and system for park energy efficiency improvement based on reinforcement learning
CN116069512A (en) * 2023-03-23 2023-05-05 之江实验室 Serverless efficient resource allocation method and system based on reinforcement learning
CN116069512B (en) * 2023-03-23 2023-08-04 之江实验室 Serverless efficient resource allocation method and system based on reinforcement learning

Similar Documents

Publication Publication Date Title
CN110378468B (en) Neural network accelerator based on structured pruning and low bit quantization
Sohoni et al. Low-memory neural network training: A technical report
CN112598129A (en) Adjustable hardware-aware pruning and mapping framework based on ReRAM neural network accelerator
Yuan et al. High performance CNN accelerators based on hardware and algorithm co-optimization
CN110413255B (en) Artificial neural network adjusting method and device
CN111079899A (en) Neural network model compression method, system, device and medium
WO2020238237A1 (en) Power exponent quantization-based neural network compression method
JP2023523029A (en) Image recognition model generation method, apparatus, computer equipment and storage medium
CN114677548B (en) Neural network image classification system and method based on resistive random access memory
Deng et al. Reduced-precision memory value approximation for deep learning
TW202022798A (en) Method of processing convolution neural network
CN117574976B (en) Large language model software and hardware collaborative quantization acceleration calculation method and system
Struharik et al. Conna–compressed cnn hardware accelerator
CN111563160A (en) Text automatic summarization method, device, medium and equipment based on global semantics
CN114970853A (en) Cross-range quantization convolutional neural network compression method
Guan et al. Recursive binary neural network training model for efficient usage of on-chip memory
Qi et al. Learning low resource consumption cnn through pruning and quantization
CN112528598B (en) Automatic text abstract evaluation method based on pre-training language model and information theory
CN113240090A (en) Image processing model generation method, image processing device and electronic equipment
CN117151178A (en) FPGA-oriented CNN customized network quantification acceleration method
CN112183744A (en) Neural network pruning method and device
CN116956997A (en) LSTM model quantization retraining method, system and equipment for time sequence data processing
CN113554097B (en) Model quantization method and device, electronic equipment and storage medium
TW202145078A (en) Computing method with dynamic minibatch sizes and computing system and computer-readable storage media for performing the same
CN109829054A (en) A kind of file classification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210402