CN113255899B

CN113255899B - Knowledge distillation method and system with self-correlation of channels

Info

Publication number: CN113255899B
Application number: CN202110673166.XA
Authority: CN
Inventors: 唐乾坤; 徐晓刚; 王军; 徐冠雷; 何鹏飞; 曹卫强
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2021-06-17
Filing date: 2021-06-17
Publication date: 2021-10-12
Anticipated expiration: 2041-06-17
Also published as: CN113255899A

Abstract

The invention discloses a knowledge distillation method and a knowledge distillation system for channel self-correlation, which comprises the following steps: step S1: inputting the same picture data into the teacher model and the student models to obtain picture characteristics of the student models and the teacher model, and selecting characteristic layers needing knowledge distillation in the student models and the teacher model; step S2: performing channel self-association on the channels of the selected student model and teacher model feature layers; step S3: the self-associated teacher model channel transmits knowledge to the student model channel in a weighting mode; step S4: distilling knowledge according to the associated channels, training, and optimizing a self-associated two-dimensional matrix and a student model during training; s5: and deploying the trained student model, and inputting picture data to perform reasoning test.

Description

Knowledge distillation method and system with self-correlation of channels

Technical Field

The invention relates to the field of computer vision, in particular to a knowledge distillation method and a knowledge distillation system for channel self-correlation.

Background

Although high performance is achieved in current neural networks, these neural networks consume a lot of memory and computational resources. Therefore, in order to deploy the neural networks with good performance to resource-limited platforms such as mobile phones and embedded platforms, model compression is an effective method. Knowledge distillation is one of the current research hotspots in the existing model compression algorithm.

The principle of knowledge distillation is: a complex network with better performance is used as a teacher model, a network with poorer performance and light weight is used as a student model, and when the student model is trained, the output of the teacher model or the output of an intermediate network layer is used as a soft label to supervise the training of the student model. If the number of channels of the middle network layer of the teacher model is inconsistent with the number of channels of the middle network layer of the student model, the prior art uses a conversion layer (usually a convolutional layer) to convert the number of channels of the student model into the number of channels of the student model, which is the same as that of the teacher model, so that although the operation is simple, the conversion layer contains more parameters and calculation amount, the training and optimization burden is increased, and the adoption of a one-to-one manual association mode after the conversion is not beneficial to learning the discriminant characteristics from the teacher model.

The invention provides a knowledge distillation method and a knowledge distillation device, which are characterized in that a student model channel and a teacher model channel can be automatically associated and transmit knowledge when the knowledge is distilled.

Disclosure of Invention

In order to solve the defects of the prior art and realize the purpose of self-correlation of a teacher model and a student model, the invention adopts the following technical scheme:

a method of knowledge distillation of channel self-association comprising the steps of:

step S1: inputting the same picture data into the teacher model and the student models to obtain picture characteristics of the student models and the teacher model, and selecting characteristic layers needing knowledge distillation in the student models and the teacher model;

step S2: performing channel self-association on the channels of the selected student model and teacher model feature layers;

step S3: the self-associated teacher model channel transmits knowledge to the student model channel in a weighting mode;

step S4: training is carried out according to the associated channel distillation knowledge, wherein the knowledge can be example relations, activation values or attention, the knowledge distillation loss, specific task loss and the like are used during training, and a self-associated two-dimensional integer matrix and a student model are optimized during training:

wherein,

the function of the loss is represented by,x _iwhich represents the data of the picture that was input,y _ia label representing the authenticity of the tag,

a predicted output value representing the student model,Wthe parameters that represent the model of the student,Nwhich indicates the number of input pictures,Mrepresenting a two-dimensional integer matrix;

step S5: and deploying the trained student model, and inputting picture data to perform reasoning test.

Further, in step S1, the teacher model and the student model select any existing convolutional neural network model, input the same picture data into the teacher model and the student model, and select one or more feature layers from the intermediate convolutional layers of the teacher model and the student model respectively;

further, in step S1, the intermediate feature layers of the selected student model are:

and the selected middle characteristic layer of the teacher model is as follows:

whereinC _s/tthe number of channels is indicated as such,H _s/tthe height of the feature map is shown,W _s/trepresenting the feature map width.

Further, in step S2, the channels are self-associated as follows:

setting a two-dimensional integer matrix

Wherein

，

The median of the two-dimensional integer matrix is a positive integer and only comprises a 0 or 1 two-dimensional integer matrix, the row of the matrix represents the number of channels of the selected student model feature layer, the column represents the number of channels of the selected teacher model feature layer, when the matrix value is 0, the channel corresponding to the row of the student model feature layer is represented, knowledge is not learned from the channel corresponding to the column of the teacher model feature layer, when the matrix value is 1, the channel corresponding to the row of the student model feature layer is represented, and knowledge is learned from the channel corresponding to the column of the teacher model feature layer; each channel of the student model may be associated with multiple channels of the teacher model, and each channel of the teacher model may transmit knowledge to multiple channels of the student model.

Further, in step S3, when fusing the teacher model channel characteristics, each channel of the student model adopts a weighting method:

wherein,Rthe function of the deformation is represented by,F _t[c _t]representing a feature layerF _tFirst, thec _tCharacteristic of the channel (0)<c _t<C _t），F _s[c _s]Representing a feature layerF _sFirst, thec _sCharacteristic of the channel (0)<c _s<C _s），||•||₂Represents a 2-dimensional norm;

weights in the transfer of knowledge, including but not limited to, by calculating semantic relevance of each associated teacher model and student model channel;

further, in step S4, the loss function of knowledge distillation when training is performed is:

wherein,αindicates weight, dist indicates distance function, an indicates multiplication by element;

further, the invention is the overall loss function in training

The formalization is as follows:

wherein,

and representing a student model task-related loss function, such as an image classification problem, which is cross entropy loss or Softmax loss and the like. Therefore, when training optimization is carried out, the two-dimensional integer matrix and the student model in self-correlation can be simultaneously optimized.

Furthermore, the student model is optimized, the parameters of the student model are optimized by using a random gradient descent method,Was parameters of the student model, the firsttLoss function at sub-iteration

AboutWThe partial derivatives of (a) are:

wherein N represents the number of pictures input during gradient update, thentGradient of secondary update:

update parameters using gradient descent:

wherein

Is the learning rate.

Further, the two-dimensional integer matrix is decomposed by using a Kronecker multiplier in a matrix decomposition mode by optimizing the two-dimensional integer matrixMIs composed ofKSub-matrix:

wherein,

，

thereby, a matrixMExpressed as:

wherein,

which represents a Kronecker multiplication, the method,fas a function without parameters, a two-dimensional integer matrixMThe number of the parameters is as follows:

。

further, the

As a binary gate function:

wherein 1 represents a matrix with 2 rows and 2 columns and all 1 values,Irepresenting a matrix with a diagonal value of 1 for 2 rows and 2 columns and a remainder of 0,

representing learnable gate functions, two-dimensional integer matricesMThe parameter quantities of (a) are reduced to:

wherein

indicating a ceiling operation.

A channel self-associated knowledge distillation system comprising: the knowledge distillation module is respectively connected with the student model module, the teacher model module and the model optimization module, and the student model module and the teacher model module are connected with the model optimization module;

the student model module is a neural network model used for learning knowledge and deployment;

the teacher model module is a neural network model used for extracting and transmitting knowledge;

the knowledge distillation module is used for weighting and extracting and learning knowledge from the middle characteristic layer of the teacher model by the student model and automatically associating the characteristic layer channels;

the model optimization module is used for optimizing parameters of the student model and a two-dimensional integer matrix involved in channel self-correlation, the two-dimensional integer matrix only comprises 0 or 1 two-dimensional integer matrix, rows of the matrix represent the number of channels of a selected student model characteristic layer, columns represent the number of channels of a selected teacher model characteristic layer, when the matrix value is 0, channels corresponding to the rows of the student model characteristic layer are represented, knowledge is not learned from the channels corresponding to the columns of the teacher model characteristic layer, when the matrix value is 1, channels corresponding to the rows of the student model characteristic layer are represented, and knowledge is learned from the channels corresponding to the columns of the teacher model characteristic layer.

The invention has the advantages and beneficial effects that:

the method is independent of a specific neural network model, can be easily applied to the existing neural network model, only needs few parameters and calculated amount compared with the existing manual correlation method, and can obviously improve the performance of the knowledge post-distillation student model, and is superior to the existing technology. The method can be applied to visual tasks such as target classification, target detection, target segmentation and the like.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a schematic diagram of the distillation process of the present invention.

FIG. 3 is a schematic diagram of the association of a teacher model with a student model channel in accordance with the present invention.

Fig. 4 is a schematic diagram of the system of the present invention.

Detailed Description

The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.

As shown in fig. 1, a knowledge distillation method of channel self-correlation comprises the following steps:

s1: inputting the same picture data into the teacher model and the student models to obtain picture characteristics of the student models and the teacher model, and selecting characteristic layers needing knowledge distillation in the student models and the teacher model;

in a preferred embodiment, the teacher model and the student model select any existing convolutional neural network model, and input the same picture data into the teacher model and the student model, so as to obtain the picture characteristics of the student model and the teacher model, as shown in fig. 2. The middle feature layer of the student model is selected as follows:

selecting the middle characteristic layer of the teacher model as follows:

S2: performing channel self-association on the channels of the selected student model and teacher model feature layers;

in a preferred embodiment, when the channel is self-correlated, a two-dimensional integer matrix is set, and the matrix only contains 0 or 1;

in a preferred embodiment, the two-dimensional integer matrix is represented as:

wherein

，

. The matrix median is a positive integer and is only 0 or 1.

In a preferred embodiment, a value of 1 in the two-dimensional integer matrix is a corresponding channel for selecting the teacher model feature layer;

in a preferred embodiment, the matrix represents that each channel of the student model may be associated with multiple channels of the teacher model, and each channel of the teacher model may transfer knowledge to multiple channels of the student model, as shown in FIG. 3.

S3: the self-associated teacher model channel transmits knowledge to the student model channel in a weighting mode;

in a preferred embodiment, each channel of the student model can adopt a weighting mode when fusing the channel characteristics of the teacher model, and the weighting can be obtained by semantic similarity measurement and is formalized as:

whereinRThe function of the deformation is represented by,F _t[c _t]representing a feature layerF _tFirst, thec _tCharacteristic of the channel (0)<c _t<C _t），F _s[c _s]Representing a feature layerF _sFirst, thec _sCharacteristic of the channel (0)<c _s<C _s），||•||₂Representing a 2-dimensional norm.

S4: distilling knowledge according to the selected channel, training, and simultaneously optimizing a self-associated two-dimensional integer matrix and a student model during training;

in a preferred embodiment, the loss function of knowledge distillation when training is performed can be formulated as:

wherein,

represents a distance function, which indicates a multiplication by an element;

as a preferred embodiment, the overall loss function of the invention during training is as follows:

wherein,

and representing a student model task-related loss function, such as an image classification problem, which is cross entropy loss or Softmax loss and the like. Therefore, the self-associated two-dimensional integer matrix and the student model can be simultaneously optimized during training optimization, namely:

wherein,

the function of the loss is represented by,x _iwhich represents the data of the picture that was input,

a label representing the authenticity of the tag,

a predicted output value representing the student model,

parameters representing the student model, and N represents the number of pictures entered.

As a preferred embodiment, the two-dimensional integer matrix is optimizedMIn order to reduce the parameter quantity and the optimization difficulty, a matrix decomposition mode can be selected, and optionally, a Kronecker multiplier decomposition matrix is used for decomposing the matrixMIs composed ofKA submatrix, formalized as:

wherein

，

. Thus the matrix M can be expressed as:

wherein

Which represents a Kronecker multiplication, the method,fis a function without parameters. At this timeMThe parameters of the matrix are:

。

alternatively, as a preferred embodiment, the method may be implemented by

Set as a binary gate function to further reduce the number of parameters, formalized as:

where 1 represents a matrix of 2 rows and 2 columns with values of all 1's,Ia matrix with a diagonal value of 1 for 2 rows and 2 columns and a remainder of 0 is shown.

Representing learnable gate functions, such as matricesMThe parameter quantities of (a) are reduced to:

，（

operation of rounding up the representation

In a preferred embodiment, the parameters of the student model are optimized using a stochastic gradient descent method,Was parameters of the student model, the firsttLoss function at sub-iteration

AboutWThe partial derivatives of (a) are:

wherein N represents the number of pictures input during gradient update, thentGradient of secondary update

Is defined as:

the parameters are updated using the gradient descent,

wherein

Is the learning rate;

s5: and deploying the trained student model, and inputting picture data to perform reasoning test.

A knowledge distillation system with self-correlation of channels, as shown in fig. 4, specifically comprising: the system comprises a student model module, a teacher model module, a knowledge distillation module and a model optimization module.

The student model module is used for learning knowledge and a deployed neural network model; the teacher model module is used for extracting and transmitting a neural network model of knowledge; the knowledge distillation module is used for extracting and learning knowledge from the middle characteristic layer of the teacher model by the student model and automatically associating the characteristic layer channels; and the model optimization module is used for optimizing parameters of the student model and the self-associated two-dimensional integer matrix of the first aspect.

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A knowledge distillation method of channel self-correlation, characterized by comprising the steps of:

step S1: inputting the same picture data into the teacher model and the student models to obtain the picture characteristics of the student models and the teacher model, selecting a convolution characteristic layer which needs knowledge distillation in the student models and the teacher model, wherein the middle characteristic layer of the selected student models is as follows:

whereinC _s/tthe number of channels is indicated as such,H _s/tthe height of the feature map is shown,W _s/trepresenting a feature map width;

step S2: and performing channel self-association on the channels of the convolution feature layers of the selected student model and the teacher model, wherein the channel self-association mode is as follows:

setting a two-dimensional integer matrix

Wherein

，

The median of the two-dimensional integer matrix is a positive integer and is only 0 or 1, the rows of the two-dimensional integer matrix represent the number of channels of the selected student model feature layer, the columns represent the number of channels of the selected teacher model feature layer, when the matrix value is 0, the channels corresponding to the rows of the student model feature layer are represented, knowledge is not learned from the channels corresponding to the columns of the teacher model feature layer, when the matrix value is 1, the channels corresponding to the rows of the student model feature layer are represented, and knowledge is learned from the channels corresponding to the columns of the teacher model feature layer; each of the student modelsThe channels may be associated with multiple channels of the teacher model, and each channel of the teacher model may transmit knowledge to multiple channels of the student model;

step S4: and (3) distilling knowledge according to the associated channels, training, and simultaneously optimizing a self-associated two-dimensional integer matrix and a student model during training:

wherein,

2. The method of claim 1, wherein in step S1, one or more feature layers are selected from the intermediate convolutional layers of the teacher model and the student model respectively.

3. The method of claim 1, wherein in step S3, each channel of the student model is weighted by the characteristics of the teacher model channel, and the weights include but are not limited to those obtained by calculating semantic correlations between each associated teacher model and student model channel, and are expressed as:

wherein,Rthe function of the deformation is represented by,F _t[c _t]representing a feature layerF _tFirst, thec _tCharacteristic of the channel (0)<c _t<C _t），F _s[c _s]Representing a feature layerF _sFirst, thec _sCharacteristic of the channel (0)<c _s<C _s），||•||₂Representing a 2-dimensional norm.

4. The knowledge distillation method of claim 3, wherein in step S4, the loss function of knowledge distillation in training is:

wherein,αrepresents weight, dist represents distance function, and indicates multiplication by element, integral loss function during training:

wherein,

and the loss function related to the student model task is represented, and the two-dimensional integer matrix and the student model in self-correlation are optimized simultaneously during training optimization.

5. The channel self-correlation knowledge distillation method as claimed in claim 4, wherein the student model is optimized, and the parameters of the student model use random gradientThe optimization of the descending method is carried out,Was parameters of the student model, the firsttLoss function at sub-iteration

AboutWThe partial derivatives of (a) are:

update parameters using gradient descent:

wherein

Is the learning rate.

6. The knowledge distillation method of channel self-correlation as claimed in claim 1, wherein the two-dimensional integer matrix of optimized self-correlation is decomposed by Kronecker multiplier in matrix decompositionMIs composed ofKSub-matrix:

wherein,

，

thereby, a matrixMExpressed as:

wherein,

。

7. a method of knowledge distillation of channel self-correlation as claimed in claim 6, wherein said method is characterized by

Is a binary gate function, and is expressed in the form:

wherein,1representing a matrix of 2 rows and 2 columns with values of all 1,Irepresenting a matrix with a diagonal value of 1 for 2 rows and 2 columns and a remainder of 0,

wherein

indicating a ceiling operation.

8. A channel self-associated knowledge distillation system comprising: the teaching model comprises a student model module, a teacher model module and a model optimization module, and is characterized by also comprising a knowledge distillation module which is respectively connected with the student model module, the teacher model module and the model optimization module, wherein the student model module is connected with the model optimization module;

the knowledge distillation module is used for extracting and learning knowledge from the middle characteristic layer of the teacher model by the student model and automatically associating the characteristic layer channels;