CN109657780A - Model compression method based on active learning of the pruning order - Google Patents

Model compression method based on active learning of the pruning order Download PDF

Info

Publication number
CN109657780A
Authority
CN
China
Prior art keywords
pruning
model
network
layer
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811501702.2A
Other languages
Chinese (zh)
Inventor
丁贵广
钟婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Publication of CN109657780A
Legal status: Pending (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention provides a model compression method based on active learning of the pruning order. An end-to-end pruning framework based on sequential active learning is proposed, which can actively learn the importance of each network layer, generate pruning priorities and make reasonable pruning decisions, solving the problem that existing simple sequential pruning methods are unreasonable. Pruning is applied first to the network layers with the least influence and proceeds gradually from easy to hard, which minimizes the model accuracy loss of the pruning process. At the same time, guided by the final model loss, the importance of convolution kernels is evaluated from multiple angles, efficiently, flexibly and rapidly, which guarantees the correctness and validity of the whole model compression process and provides technical support for subsequently porting large models to portable devices. Experimental results show that the model compression method based on active learning of the pruning order provided by the invention performs at a leading level on multiple data sets and multiple models, and can greatly compress the model size while maintaining model accuracy, so it has strong prospects for practical application.

Description

Model compression method based on active learning of the pruning order
Technical field
The invention belongs to the technical field of neural network models, and in particular relates to a model compression method based on active learning of the pruning order.
Background technique
In recent years, with the rapid development of deep neural networks, academia and industry have jointly witnessed important breakthroughs of deep learning in fields such as computer vision and natural language processing. In some visual tasks, the expressive power of convolutional neural networks (CNNs) has even surpassed the visual processing ability of humans.
Although deep networks have achieved great breakthroughs in the visual field, model size and computational cost have become bottlenecks in practical applications. In real-world scenarios, deep network applications depend on fast hardware computing power, large memory and battery capacity. Large-scale neural networks can run efficiently on servers in a machine room and be computed quickly on GPUs, but they are difficult to apply on resource-limited mobile devices with low-frequency CPUs, such as smart phones and wearable devices. Constrained in this way, many scientific achievements of deep learning are difficult to transform into practically applicable scenarios. To solve this problem, researchers have recently proposed many model compression methods, which aim to compress model size and improve model running speed while keeping model accuracy as unchanged as possible, so that the compressed model can be ported to small devices. The essence of model compression is to produce a small model that has the same expressive power as the large-scale network. This is a great challenge for researchers: a large-scale network is a structure carefully designed and verified to be effective by experts and scholars, and every learned parameter contributes to the overall performance of the network, so discarding parameters inappropriately damages the model's effectiveness. If an extremely small model is needed in a practical scenario, the trade-off between model complexity and model effectiveness must be balanced, and appropriate sacrifices must be made in some respects.
Mainstream model compression methods fall into several branches. The first class is "neural network pruning", which includes sparse pruning and structured pruning. Pruning methods focus on evaluating the network connection weights, cutting the weights that have little influence on the network, and restoring model accuracy by retraining. Sparse pruning sporadically cuts connections of low importance in the network, which can greatly compress model size and reduce memory overhead; however, it is limited by the implementation of the underlying libraries and still has difficulty accelerating the network. Structured pruning, in contrast, keeps the regular shape of the convolution kernel structure well, usually taking a whole convolution kernel as the basic pruning unit; the network model after structured pruning has a regular and complete structure and can be directly accelerated with conventional convolution methods. The second class is "neural network parameter transfer and sharing", i.e. compressing the network model by means such as parameter quantization, low-rank approximation or knowledge distillation. Parameter transfer and sharing methods are usually used as a compression step after model pruning to further compress the network model; used alone, they cannot greatly compress the network size while maintaining model accuracy. The third class is "neural network structure design", i.e. directly designing a miniature neural network, either by manually designing a new network structure or by machine-automated network architecture search. Automated network design can free up manual labor and tailor the most suitable network structure for a specific scenario and specific data, but its complexity is very high: it consumes a large amount of computing resources, and otherwise the set goal cannot be reached.
At present, structured pruning is the most effective method for compressing network size and improving model running speed. Existing structured pruning techniques mainly use either sequential pruning or global pruning. Sequential pruning has a preset pruning order and prunes layer by layer, from front to back or from back to front, removing the relatively unimportant convolution kernels in each layer according to a certain ratio. Global pruning sets an importance threshold, and in each pruning round all convolution kernels in the network whose importance is below the threshold are cut at the same time. In fact, deep model compression is a system-level task, and pruning decisions need to be made according to a global view of the model. Existing methods place their research emphasis on evaluating convolution kernel importance, but their pruning strategies are too simple, leading to unsatisfactory results. An important phenomenon is ignored: each convolutional layer has a different importance. If only a few convolution kernels are removed from an important convolutional layer, the accuracy of the overall model may drop substantially; conversely, on an unimportant convolutional layer, even cutting off a large number of convolution kernels hardly affects accuracy. Clearly, neither sequential pruning nor global pruning takes into account the influence of the importance of each convolutional layer on the pruning result.
Summary of the invention
In order to solve the above technical problem, the present invention provides a model compression method based on active learning of the pruning order, comprising:
S1. Use an LSTM to learn the temporal characteristics of the network and decide whether each network layer needs to be pruned;
S2. In the selected network layer, evaluate and cut the network layer parameters of that layer, and apply a recovery mechanism to immediately restore the model accuracy after pruning;
S3. Perform accelerated retraining of the pruned model using a teacher network;
S4. Obtain a reward R according to the performance and complexity of the retrained pruned model, and update the LSTM with a reinforcement learning method;
S5. Select the 5 models with the highest reward as the input of a new round of the LSTM, and repeat steps S1-S4 until the LSTM no longer produces a better pruning decision, at which point training ends and the optimal pruned model is obtained.
Further, step S1 includes:
(1) First, the neural network model is represented as a string, which serves as the input of the LSTM, in the following way:
(m_i, n_i) represents the i-th node ξ_i of the neural network, where m indicates the node type and takes a value in {0, 1, 2}, representing a convolution, pooling or fully-connected operation respectively; n indicates the node attribute value: when the node is a convolution, n is the number of convolution kernels in that layer; when the node is a pooling operation, n is the pooling stride; when the node is a fully-connected layer, n is the number of neurons in that layer;
(2) The LSTM pruning decision is obtained, which specifically includes:
At each time step, a main node and its next node serve as the input of a multi-layer LSTM, and this input can be represented as [m_i, n_i, m_{i+1}, n_{i+1}]. The LSTM uses a softmax function to decide whether the currently input main node should be pruned; the auxiliary node only provides auxiliary information and no pruning prediction is made for it.
Further, the network layer includes a convolutional layer and/or a fully-connected layer; the network layer parameters of a convolutional layer are its convolution kernels, and the network layer parameters of a fully-connected layer are its fully-connected parameters.
Further, in step S2, the method for evaluating and cutting convolution kernels in a convolutional layer includes:
The importance score s_j is obtained by computing the L2 norm of each channel set in the (i+1)-th convolutional layer, as in the following formula:
s_j = ||C_{i+1,j}||_2,  s.t. j ∈ [1, x_i]
where C_{i+1,j} denotes the j-th channel set of the (i+1)-th convolutional layer, s.t. is the abbreviation of "subject to", and x_i denotes the number of convolution kernels in the i-th convolutional layer;
According to the compression ratio, the channel sets with the smallest importance scores s_j in the (i+1)-th convolutional layer, together with the corresponding convolution kernels in the i-th convolutional layer, are selected and cut.
Further, the recovery mechanism includes: selecting a part of the convolution kernels of the (i+1)-th convolutional layer and amplifying the kernel parameters in a certain proportion, as in the following formula:
where F_{i+1,j} denotes the j-th convolution kernel of the (i+1)-th convolutional layer, F'_{i+1,j} denotes the j-th convolution kernel of the pruned (i+1)-th convolutional layer, and a is a hyper-parameter used to select the convolution kernels with larger deviation.
Further, in step S3, the retraining process is accelerated using a knowledge distillation method: the input model of the LSTM serves as the teacher network, the pruned model serves as the student network, and the student network learns the class probabilities z of all categories output by the teacher network; compared with the training labels, z contains rich knowledge, including the similarities and differences between categories.
Further, in step S4, the reward R is calculated by the following formula:
R = performance - λ × complexity
where the performance term is measured by the accuracy of the model on the validation set or its loss on the training set; the complexity term is measured by the model's FLOPs or total number of parameters; and λ is a hyper-parameter whose optimal value needs to be selected by cross-validation in experiments.
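By way of illustration only, the reward of step S4 can be sketched as follows in Python; the value of λ, the use of validation accuracy for the performance term and of the parameter count for the complexity term are assumptions chosen for this example rather than values fixed by the invention.

```python
def reward(accuracy, num_params, lam=1e-8):
    # R = performance - λ × complexity
    # performance: validation accuracy (the training-set loss could be used instead)
    # complexity: total parameter count (FLOPs could be used instead)
    return accuracy - lam * num_params

# toy usage: a pruned model with 5 million parameters and 93.1% validation accuracy
print(reward(accuracy=0.931, num_params=5_000_000))   # 0.881
```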
Compared with the prior art, the beneficial effects of the present invention are:
The model compression method based on active learning of the pruning order provided by the invention proposes an end-to-end pruning framework based on sequential active learning, which can actively learn the importance of each network layer, generate pruning priorities and make reasonable pruning decisions, solving the problem that existing simple sequential pruning methods are unreasonable. Pruning is applied first to the network layers with the least influence and proceeds gradually from easy to hard, which minimizes the model accuracy loss of the pruning process. At the same time, guided by the final model loss, the importance of convolution kernels is evaluated from multiple angles, efficiently, flexibly and rapidly, which guarantees the correctness and validity of the whole model compression process and provides technical support for subsequently porting large models to portable devices. Experimental results show that the method performs at a leading level on multiple data sets and multiple models and can greatly compress the model size while maintaining model accuracy, so it has strong prospects for practical application.
Brief description of the drawings
Fig. 1 is a flow chart of the model compression method based on active learning of the pruning order.
Detailed description of embodiments
In the following, the model compression method provided by the invention is explained in detail, taking the case where pruning is performed only on convolutional layers as an example.
A model compression method based on active learning of the pruning order, as shown in Fig. 1, comprising:
S1. Use an LSTM (Long Short-Term Memory) network to learn the temporal characteristics of the network and decide whether each convolutional layer needs to be pruned;
S2. In the selected convolutional layer, evaluate and cut the convolution kernels. The kernel evaluation method considers the correlation between two consecutive convolutional layers and uses a data-free method to rapidly evaluate kernel importance, and a recovery mechanism is proposed to immediately restore model accuracy after pruning;
S3. Perform accelerated retraining of the pruned model using a teacher network;
S4. Compute the performance and complexity of the retrained pruned model to obtain the reward R, and update the LSTM with a reinforcement learning method;
S5. Save the retrained pruned models locally, select the 5 models with the highest reward as the input of a new round of the LSTM, and repeat steps S1-S4 until the LSTM no longer produces a better pruning decision, at which point training ends and the optimal pruned model is obtained. Through repeated iteration, the LSTM can better analyze the network structure and propose a correct pruning order, so that the whole pruning process is precise and effective.
Step S1 includes:
(1) First, the neural network model is represented as a string, which serves as the input of the LSTM, in the following way:
(m_i, n_i) represents the i-th node ξ_i of the neural network, where m indicates the node type and takes a value in {0, 1, 2}, representing a convolution, pooling or fully-connected operation respectively; n indicates the node attribute value: when the node is a convolution, n is the number of convolution kernels in that layer; when the node is a pooling operation, n is the pooling stride; when the node is a fully-connected layer, n is the number of neurons in that layer. In this way each node is represented by two values, so a neural network can be represented as a string;
A convolution node can be pruned and is called a main node; the other nodes remain unchanged and are called auxiliary nodes, which provide auxiliary information about the neural network;
(2) The LSTM pruning decision is obtained, which specifically includes:
At each time step, a main node and its next node (which may be either a main node or an auxiliary node) serve as the input of a multi-layer LSTM, and this input can be represented as [m_i, n_i, m_{i+1}, n_{i+1}]. The LSTM uses a softmax function to decide whether the currently input main node should be pruned; auxiliary nodes only provide auxiliary information and no pruning prediction is made for them. For a network structure with N main nodes, the LSTM input comprises N time steps in total, i.e. the above step is repeated N times. Auxiliary nodes are fed to the LSTM together with main nodes at each time step, but no pruning decision is made for them; they only serve as auxiliary information to help the LSTM better understand the whole network.
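As an illustration of this node encoding and of the per-step pruning decision, the following PyTorch-style sketch is given under stated assumptions: the example network string, the hidden size, the number of LSTM layers and the two-class (keep/prune) softmax head are choices made for demonstration and are not specified by the invention.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the (m, n) node encoding and the per-step keep/prune decision.
# Node types assumed: 0 = convolution, 1 = pooling, 2 = fully connected.
nodes = [(0, 64), (1, 2), (0, 128), (1, 2), (2, 10)]    # example network string

lstm = nn.LSTM(input_size=4, hidden_size=32, num_layers=2, batch_first=True)
head = nn.Linear(32, 2)                                  # softmax over {keep, prune}

decisions, state = [], None
for i in range(len(nodes) - 1):
    m_i, n_i = nodes[i]
    m_next, n_next = nodes[i + 1]
    x = torch.tensor([[[m_i, n_i, m_next, n_next]]], dtype=torch.float)  # [m_i, n_i, m_{i+1}, n_{i+1}]
    out, state = lstm(x, state)
    if m_i == 0:                                         # only main (convolution) nodes get a decision
        probs = torch.softmax(head(out[:, -1]), dim=-1)
        decisions.append(bool(torch.argmax(probs).item()))

print(decisions)   # one keep/prune decision per convolutional layer
```

Only the convolution nodes (m = 0) receive a decision in this sketch, mirroring the distinction between main nodes and auxiliary nodes described above.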
In step S2, evaluating and cutting convolution kernels specifically includes:
A hyper-parameter R_prune is defined, and (R_prune × x_i) convolution kernels are removed from the convolutional layer i that needs pruning, where x_i denotes the number of convolution kernels of the i-th convolutional layer.
A convolution operation can be represented by a triple ⟨I_i, W_i, O_i⟩, where I_i is the input tensor of the i-th convolution, with x_{i-1} channels, height h and width w, and takes values in the set of real numbers R; W_i is the kernel tensor, in which the kernel shape is k × k and the number of input channels is x_{i-1}; and the output tensor O_i has x_i channels.
From the kernel perspective, W_i contains x_i convolution kernels F_{i,j}; from the channel perspective, W_i contains x_{i-1} channel sets C_{i,j}.
When the j-th convolution kernel of a convolutional layer is cut, the corresponding j-th channel set C_{i+1,j} of its next convolutional layer becomes invalid and must be cut at the same time; the other convolutional layers in the network are not affected by the pruning and their structure remains unchanged.
After convolution kernels are cut from the i-th convolutional layer, the output deviation of the (i+1)-th convolutional layer propagates to the final loss function and directly causes a loss of network accuracy. Removing the i-th-layer convolution kernels of low importance together with the corresponding (i+1)-th-layer channel sets minimizes the output deviation ΔO_{i+1}. Since there are usually activation functions, batch normalization or pooling operations between two convolutional layers, the channel sets C_{i+1} of the (i+1)-th layer have a more direct influence on the output value O_{i+1} than the kernels F_i of the i-th layer; therefore the importance of the i-th-layer convolution kernels can be evaluated indirectly by evaluating the importance of the (i+1)-th-layer channel sets.
The present invention evaluates the importance of a channel set with the L2 norm, because the L2 norm takes into account not only the magnitude of the values but also the differences between them. The L2 norm of a channel set reflects, to some extent, the magnitude of the output feature values: channel sets with a small L2 norm tend to produce weakly activated outputs, while those with large absolute values tend to produce strongly activated outputs.
The importance score s_j is obtained by computing the L2 norm of each channel set in the (i+1)-th convolutional layer, as in the following formula:
s_j = ||C_{i+1,j}||_2,  s.t. j ∈ [1, x_i]
where C_{i+1,j} denotes the j-th channel set of the (i+1)-th convolutional layer, s.t. is the abbreviation of "subject to", and x_i denotes the number of convolution kernels in the i-th convolutional layer;
According to the compression ratio, the channel sets with the smallest importance scores s_j in the (i+1)-th convolutional layer, together with the corresponding convolution kernels in the i-th convolutional layer, are selected and cut.
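As a concrete illustration of this scoring rule, the following PyTorch-style sketch computes s_j on two consecutive convolutional layers and removes the lowest-scoring channel sets together with the corresponding kernels; the layer sizes and the value of R_prune are assumptions chosen for demonstration.

```python
import torch

# Hypothetical sketch: score channel sets of layer i+1 by L2 norm and prune
# the corresponding kernels of layer i. Weights follow the usual (out, in, k, k) layout.
W_i  = torch.randn(64, 32, 3, 3)     # i-th layer: x_i = 64 kernels
W_i1 = torch.randn(128, 64, 3, 3)    # (i+1)-th layer: 64 channel sets C_{i+1,j}
R_prune = 0.25                        # assumed pruning ratio

# s_j = ||C_{i+1,j}||_2 for j in [1, x_i]; C_{i+1,j} = W_i1[:, j, :, :]
scores = torch.norm(W_i1.permute(1, 0, 2, 3).reshape(W_i.shape[0], -1), p=2, dim=1)

num_prune = int(R_prune * W_i.shape[0])
keep = torch.argsort(scores)[num_prune:]           # drop the lowest-scoring channel sets

W_i_pruned  = W_i[keep]                            # remove kernels of layer i
W_i1_pruned = W_i1[:, keep]                        # remove matching channel sets of layer i+1
print(W_i_pruned.shape, W_i1_pruned.shape)         # torch.Size([48, 32, 3, 3]) torch.Size([128, 48, 3, 3])
```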
In step S2, after the channel sets and convolution kernels are pruned, F_{i+1} becomes F'_{i+1} and C_{i+1} becomes C'_{i+1}; the reduction of model parameters inevitably causes a loss of accuracy. To reduce this loss and minimize the output deviation ΔO_{i+1}, the recovery mechanism is applied immediately after a convolution kernel is pruned, so as to recover part of the loss.
The recovery mechanism specifically includes:
Selecting a part of the convolution kernels of the (i+1)-th layer and amplifying the kernel parameters in a certain proportion, as in the following formula:
where F_{i+1,j} denotes the j-th convolution kernel of the (i+1)-th convolutional layer, F'_{i+1,j} denotes the j-th convolution kernel of the pruned (i+1)-th convolutional layer, and a is a hyper-parameter used to select the convolution kernels with larger deviation. The optimal a is obtained by cross-validation.
In step S3, in order to improve the efficiency of model retraining, a knowledge distillation method is used to accelerate the retraining process. The input model of the LSTM serves as the teacher network and the pruned model serves as the student network, and the student network learns the class probabilities z of all categories output by the teacher network. Compared with the training labels, z contains rich knowledge, including the similarities and differences between categories. The pruned model can therefore save training time while learning as much knowledge as possible.
The following formula presents the loss function g for training the pruned model, which uses the L2 norm to minimize the distance between the output class probabilities of the teacher and the student.
The cross-entropy loss function f(x, y, θ) based on the data labels and the loss function g(x, z, θ) based on the teacher are combined with a weight to obtain the final loss function L, as in the following formula:
L = β·f(x, y, θ) + g(x, z, θ)
where x denotes the data, y denotes the data labels, θ denotes the model parameters, and β is a weight hyper-parameter whose optimal value is obtained experimentally.
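By way of illustration, a PyTorch-style sketch of the combined loss might look as follows; since the exact form of g is not reproduced in this text, a plain squared L2 distance between the class probabilities is assumed here, and β = 0.5 and the tensor sizes are arbitrary choices.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, beta=0.5):
    # f(x, y, θ): cross-entropy against the data labels
    f = F.cross_entropy(student_logits, labels)
    # g(x, z, θ): assumed here to be the squared L2 distance between the
    # teacher's and student's class probabilities
    z_teacher = F.softmax(teacher_logits, dim=-1)
    z_student = F.softmax(student_logits, dim=-1)
    g = torch.sum((z_student - z_teacher) ** 2, dim=-1).mean()
    # L = β·f + g
    return beta * f + g

# toy usage
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distill_loss(student_logits, teacher_logits, labels)
loss.backward()
```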
Step S4 includes:
(1) The reward R is calculated according to the following formula:
R = performance - λ × complexity
where the performance term is measured by the accuracy of the model on the validation set or its loss on the training set; the complexity term is measured by the model's FLOPs (floating-point operations) or total number of parameters; and λ is a hyper-parameter whose optimal value needs to be selected by cross-validation in experiments.
(2) The LSTM is trained with the standard policy gradient algorithm of reinforcement learning so that it generates better pruning decisions, specifically as follows:
For the details of how the standard policy gradient algorithm of reinforcement learning is used to train the LSTM and the concrete meaning of each part of the formula, refer to Ronald J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning", Machine Learning, 8(3-4):229-256, 1992.
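As an illustration of the policy gradient (REINFORCE) update referred to above, the following PyTorch-style sketch treats each per-layer keep/prune choice as an action and scales the log-probabilities of the sampled actions by the reward R; the baseline-free form, the learning rate and the tensor shapes are assumptions made for this example rather than details given by the invention.

```python
import torch

# Hypothetical REINFORCE-style update of the pruning controller.
# `action_probs` are the softmax outputs for each main node,
# `actions` are the sampled keep/prune decisions, and `R` is the reward.
def policy_gradient_step(action_probs, actions, R, optimizer):
    log_probs = torch.log(action_probs.gather(1, actions.unsqueeze(1)).squeeze(1))
    loss = -(R * log_probs).sum()        # maximize the expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# toy usage with a dummy controller parameter
logits = torch.randn(6, 2, requires_grad=True)          # 6 main nodes, {keep, prune}
optimizer = torch.optim.SGD([logits], lr=0.01)
probs = torch.softmax(logits, dim=-1)
actions = torch.multinomial(probs, 1).squeeze(1)         # sampled decisions
R = 0.72                                                  # reward from R = performance - λ × complexity
policy_gradient_step(probs, actions, R, optimizer)
```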
In the model compression method provided by the invention, according to the structural characteristics of the neural network, pruning may also be performed only on the fully-connected layers, or on the convolutional layers and the fully-connected layers at the same time. The objects cut in a fully-connected layer are the fully-connected parameters.
Effect assessment
The effectiveness of the model compression method based on active learning of the pruning order provided by the invention is demonstrated by experiments on three standard image classification data sets (CIFAR10, CIFAR100, MNIST) and three common network structures (VGG, ResNet, and a three-layer fully-connected network). Specifically, on CIFAR10 the compression ratio of the VGG19 network reaches 84.7% with essentially unchanged accuracy, and the compression ratio of ResNet reaches 34.1% with an accuracy improvement of 0.56%; on CIFAR100 the compression ratio of the VGG19 network reaches 70.1% with unchanged accuracy; on MNIST the compression ratio of the three-layer fully-connected network reaches 87.27% with essentially unchanged accuracy. The VGG19 experiment on CIFAR10 also shows that the LSTM-guided pruning captures the sensitivity of each network layer well, and experiments verify that the method provided by the invention is better than data-free methods based only on convolution kernels. These experimental results lead existing pruning methods and prove the effectiveness of the model compression method based on active learning of the pruning order provided by the invention.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solution of the present invention and are not limiting. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that the technical solution of the present invention may be modified or equivalently replaced without departing from its spirit and scope, and all such modifications and replacements should fall within the scope of the claims of the present invention.

Claims (7)

1. A model compression method based on active learning of the pruning order, characterized in that the method comprises:
S1. using an LSTM to learn the temporal characteristics of the network and deciding whether each network layer needs to be pruned;
S2. in the selected network layer, evaluating and cutting the network layer parameters of that layer, and applying a recovery mechanism to immediately restore the model accuracy after pruning;
S3. performing accelerated retraining of the pruned model using a teacher network;
S4. obtaining a reward R according to the performance and complexity of the retrained pruned model, and updating the LSTM with a reinforcement learning method;
S5. selecting the 5 models with the highest reward as the input of a new round of the LSTM, and repeating steps S1-S4 until the LSTM no longer produces a better pruning decision, at which point training ends and the optimal pruned model is obtained.
2. The method according to claim 1, characterized in that step S1 includes:
(1) first representing the neural network model as a string, which serves as the input of the LSTM, in the following way:
(m_i, n_i) represents the i-th node ξ_i of the neural network, where m indicates the node type and takes a value in {0, 1, 2}, representing a convolution, pooling or fully-connected operation respectively; n indicates the node attribute value: when the node is a convolution, n is the number of convolution kernels in that layer; when the node is a pooling operation, n is the pooling stride; when the node is a fully-connected layer, n is the number of neurons in that layer;
(2) obtaining the LSTM pruning decision, which specifically includes:
at each time step, a main node and its next node serve as the input of a multi-layer LSTM, and this input can be represented as [m_i, n_i, m_{i+1}, n_{i+1}]; the LSTM uses a softmax function to decide whether the currently input main node should be pruned; the auxiliary node only provides auxiliary information and no pruning prediction is made for it.
3. The method according to claim 1 or 2, characterized in that the network layer includes a convolutional layer and/or a fully-connected layer; the network layer parameters of a convolutional layer are its convolution kernels, and the network layer parameters of a fully-connected layer are its fully-connected parameters.
4. The method according to claim 3, characterized in that, in step S2, the method for evaluating and cutting convolution kernels in a convolutional layer includes:
obtaining an importance score s_j by computing the L2 norm of each channel set in the (i+1)-th convolutional layer, as in the following formula:
s_j = ||C_{i+1,j}||_2,  s.t. j ∈ [1, x_i]
where C_{i+1,j} denotes the j-th channel set of the (i+1)-th convolutional layer, s.t. is the abbreviation of "subject to", and x_i denotes the number of convolution kernels in the i-th convolutional layer;
according to the compression ratio, selecting and cutting the channel sets with the smallest importance scores s_j in the (i+1)-th convolutional layer together with the corresponding convolution kernels in the i-th convolutional layer.
5. The method according to claim 4, characterized in that the recovery mechanism includes: selecting a part of the convolution kernels of the (i+1)-th convolutional layer and amplifying the kernel parameters in a certain proportion, as in the following formula:
where F_{i+1,j} denotes the j-th convolution kernel of the (i+1)-th convolutional layer, F'_{i+1,j} denotes the j-th convolution kernel of the pruned (i+1)-th convolutional layer, and a is a hyper-parameter used to select the convolution kernels with larger deviation.
6. The method according to claim 5, characterized in that, in step S3, the retraining process is accelerated using a knowledge distillation method: the input model of the LSTM serves as the teacher network, the pruned model serves as the student network, and the student network learns the class probabilities z of all categories output by the teacher network; compared with the training labels, z contains rich knowledge, including the similarities and differences between categories.
7. The method according to claim 6, characterized in that, in step S4, the reward R is calculated by the following formula:
R = performance - λ × complexity
where the performance term is measured by the accuracy of the model on the validation set or its loss on the training set; the complexity term is measured by the model's FLOPs or total number of parameters; and λ is a hyper-parameter whose optimal value needs to be selected by cross-validation in experiments.
CN201811501702.2A 2018-06-15 2018-12-10 Model compression method based on active learning of the pruning order Pending CN109657780A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2018106163994 2018-06-15
CN201810616399 2018-06-15

Publications (1)

Publication Number Publication Date
CN109657780A (en) 2019-04-19

Family

ID=66113957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811501702.2A Pending CN109657780A (en) Model compression method based on active learning of the pruning order

Country Status (1)

Country Link
CN (1) CN109657780A (en)

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309847A (en) * 2019-04-26 2019-10-08 深圳前海微众银行股份有限公司 A kind of model compression method and device
CN110766131A (en) * 2019-05-14 2020-02-07 北京嘀嘀无限科技发展有限公司 Data processing device and method and electronic equipment
CN110619385A (en) * 2019-08-31 2019-12-27 电子科技大学 Structured network model compression acceleration method based on multi-stage pruning
CN110619385B (en) * 2019-08-31 2022-07-29 电子科技大学 Structured network model compression acceleration method based on multi-stage pruning
WO2021043193A1 (en) * 2019-09-04 2021-03-11 华为技术有限公司 Neural network structure search method and image processing method and device
CN110555417A (en) * 2019-09-06 2019-12-10 福建中科亚创动漫科技股份有限公司 Video image recognition system and method based on deep learning
CN110795993A (en) * 2019-09-12 2020-02-14 深圳云天励飞技术有限公司 Method and device for constructing model, terminal equipment and medium
CN110647990A (en) * 2019-09-18 2020-01-03 无锡信捷电气股份有限公司 Cutting method of deep convolutional neural network model based on grey correlation analysis
CN110796177A (en) * 2019-10-10 2020-02-14 温州大学 Method for effectively reducing neural network overfitting in image classification task
WO2021077744A1 (en) * 2019-10-25 2021-04-29 浪潮电子信息产业股份有限公司 Image classification method, apparatus and device, and computer readable storage medium
CN111062382A (en) * 2019-10-30 2020-04-24 北京交通大学 Channel pruning method for target detection network
CN110929849B (en) * 2019-11-22 2023-09-01 迪爱斯信息技术股份有限公司 Video detection method and device based on neural network model compression
CN110929849A (en) * 2019-11-22 2020-03-27 迪爱斯信息技术股份有限公司 Neural network model compression method and device
CN111210017B (en) * 2019-12-24 2023-09-26 北京迈格威科技有限公司 Method, device, equipment and storage medium for determining layout sequence and data processing
CN111210017A (en) * 2019-12-24 2020-05-29 北京迈格威科技有限公司 Method, device, equipment and storage medium for determining layout sequence and processing data
WO2021129570A1 (en) * 2019-12-25 2021-07-01 神思电子技术股份有限公司 Network pruning optimization method based on network activation and sparsification
CN111222629A (en) * 2019-12-31 2020-06-02 暗物智能科技(广州)有限公司 Neural network model pruning method and system based on adaptive batch normalization
CN113128661A (en) * 2020-01-15 2021-07-16 富士通株式会社 Information processing apparatus, information processing method, and computer program
CN111242287A (en) * 2020-01-15 2020-06-05 东南大学 Neural network compression method based on channel L1 norm pruning
CN112001483A (en) * 2020-08-14 2020-11-27 广州市百果园信息技术有限公司 Method and device for pruning neural network model
CN112734036A (en) * 2021-01-14 2021-04-30 西安电子科技大学 Target detection method based on pruning convolutional neural network
CN112766491A (en) * 2021-01-18 2021-05-07 电子科技大学 Neural network compression method based on Taylor expansion and data driving
WO2022198606A1 (en) * 2021-03-26 2022-09-29 深圳市大疆创新科技有限公司 Deep learning model acquisition method, system and apparatus, and storage medium
CN113344182A (en) * 2021-06-01 2021-09-03 电子科技大学 Network model compression method based on deep learning
CN115238893A (en) * 2022-09-23 2022-10-25 北京航空航天大学 Neural network model quantification method and device for natural language processing
CN116698410A (en) * 2023-06-29 2023-09-05 重庆邮电大学空间通信研究院 Rolling bearing multi-sensor data monitoring method based on convolutional neural network
CN116698410B (en) * 2023-06-29 2024-03-12 重庆邮电大学空间通信研究院 Rolling bearing multi-sensor data monitoring method based on convolutional neural network

Similar Documents

Publication Publication Date Title
CN109657780A (en) Model compression method based on active learning of the pruning order
Zhou et al. Rethinking bottleneck structure for efficient mobile network design
CN108846445B (en) Image processing method
CN104751842B (en) The optimization method and system of deep neural network
Li et al. Large scale recurrent neural network on GPU
CN109460817A (en) A kind of convolutional neural networks on piece learning system based on nonvolatile storage
CN107844784A (en) Face identification method, device, computer equipment and readable storage medium storing program for executing
CN110175628A (en) A kind of compression algorithm based on automatic search with the neural networks pruning of knowledge distillation
CN106355248A (en) Deep convolution neural network training method and device
CN107229757A (en) The video retrieval method encoded based on deep learning and Hash
CN107392224A (en) A kind of crop disease recognizer based on triple channel convolutional neural networks
CN114758180B (en) Knowledge distillation-based lightweight flower identification method
CN106816147A (en) Speech recognition system based on binary neural network acoustic model
CN109063719A (en) A kind of image classification method of co-ordinative construction similitude and category information
CN112163671A (en) New energy scene generation method and system
CN106897744A (en) A kind of self adaptation sets the method and system of depth confidence network parameter
CN109840595A (en) A kind of knowledge method for tracing based on group study behavior feature
WO2023134142A1 (en) Multi-scale point cloud classification method and system
CN109325513A (en) A kind of image classification network training method based on magnanimity list class single image
CN112258557A (en) Visual tracking method based on space attention feature aggregation
CN109145107A (en) Subject distillation method, apparatus, medium and equipment based on convolutional neural networks
CN110188978B (en) University student professional recommendation method based on deep learning
Peng et al. An industrial-grade solution for agricultural image classification tasks
CN110297894A (en) A kind of Intelligent dialogue generation method based on auxiliary network
Wang et al. Towards efficient convolutional neural networks through low-error filter saliency estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190419)