CN113255899B - Knowledge distillation method and system with self-correlation of channels - Google Patents

Knowledge distillation method and system with self-correlation of channels Download PDF

Info

Publication number
CN113255899B
CN113255899B CN202110673166.XA CN202110673166A CN113255899B CN 113255899 B CN113255899 B CN 113255899B CN 202110673166 A CN202110673166 A CN 202110673166A CN 113255899 B CN113255899 B CN 113255899B
Authority
CN
China
Prior art keywords
model
student
knowledge
matrix
channels
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110673166.XA
Other languages
Chinese (zh)
Other versions
CN113255899A (en
Inventor
唐乾坤
徐晓刚
王军
徐冠雷
何鹏飞
曹卫强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202110673166.XA priority Critical patent/CN113255899B/en
Publication of CN113255899A publication Critical patent/CN113255899A/en
Application granted granted Critical
Publication of CN113255899B publication Critical patent/CN113255899B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a knowledge distillation method and a knowledge distillation system for channel self-correlation, which comprises the following steps: step S1: inputting the same picture data into the teacher model and the student models to obtain picture characteristics of the student models and the teacher model, and selecting characteristic layers needing knowledge distillation in the student models and the teacher model; step S2: performing channel self-association on the channels of the selected student model and teacher model feature layers; step S3: the self-associated teacher model channel transmits knowledge to the student model channel in a weighting mode; step S4: distilling knowledge according to the associated channels, training, and optimizing a self-associated two-dimensional matrix and a student model during training; s5: and deploying the trained student model, and inputting picture data to perform reasoning test.

Description

Knowledge distillation method and system with self-correlation of channels
Technical Field
The invention relates to the field of computer vision, in particular to a knowledge distillation method and a knowledge distillation system for channel self-correlation.
Background
Although high performance is achieved in current neural networks, these neural networks consume a lot of memory and computational resources. Therefore, in order to deploy the neural networks with good performance to resource-limited platforms such as mobile phones and embedded platforms, model compression is an effective method. Knowledge distillation is one of the current research hotspots in the existing model compression algorithm.
The principle of knowledge distillation is: a complex network with better performance is used as a teacher model, a network with poorer performance and light weight is used as a student model, and when the student model is trained, the output of the teacher model or the output of an intermediate network layer is used as a soft label to supervise the training of the student model. If the number of channels of the middle network layer of the teacher model is inconsistent with the number of channels of the middle network layer of the student model, the prior art uses a conversion layer (usually a convolutional layer) to convert the number of channels of the student model into the number of channels of the student model, which is the same as that of the teacher model, so that although the operation is simple, the conversion layer contains more parameters and calculation amount, the training and optimization burden is increased, and the adoption of a one-to-one manual association mode after the conversion is not beneficial to learning the discriminant characteristics from the teacher model.
The invention provides a knowledge distillation method and a knowledge distillation device, which are characterized in that a student model channel and a teacher model channel can be automatically associated and transmit knowledge when the knowledge is distilled.
Disclosure of Invention
In order to solve the defects of the prior art and realize the purpose of self-correlation of a teacher model and a student model, the invention adopts the following technical scheme:
a method of knowledge distillation of channel self-association comprising the steps of:
step S1: inputting the same picture data into the teacher model and the student models to obtain picture characteristics of the student models and the teacher model, and selecting characteristic layers needing knowledge distillation in the student models and the teacher model;
step S2: performing channel self-association on the channels of the selected student model and teacher model feature layers;
step S3: the self-associated teacher model channel transmits knowledge to the student model channel in a weighting mode;
step S4: training is carried out according to the associated channel distillation knowledge, wherein the knowledge can be example relations, activation values or attention, the knowledge distillation loss, specific task loss and the like are used during training, and a self-associated two-dimensional integer matrix and a student model are optimized during training:
Figure 849151DEST_PATH_IMAGE001
wherein,
Figure 493759DEST_PATH_IMAGE002
the function of the loss is represented by,x i which represents the data of the picture that was input,y i a label representing the authenticity of the tag,
Figure 576246DEST_PATH_IMAGE003
a predicted output value representing the student model,Wthe parameters that represent the model of the student,Nwhich indicates the number of input pictures,Mrepresenting a two-dimensional integer matrix;
step S5: and deploying the trained student model, and inputting picture data to perform reasoning test.
Further, in step S1, the teacher model and the student model select any existing convolutional neural network model, input the same picture data into the teacher model and the student model, and select one or more feature layers from the intermediate convolutional layers of the teacher model and the student model respectively;
further, in step S1, the intermediate feature layers of the selected student model are:
Figure 534975DEST_PATH_IMAGE004
and the selected middle characteristic layer of the teacher model is as follows:
Figure 127630DEST_PATH_IMAGE005
whereinC s/t the number of channels is indicated as such,H s/t the height of the feature map is shown,W s/t representing the feature map width.
Further, in step S2, the channels are self-associated as follows:
setting a two-dimensional integer matrix
Figure 818506DEST_PATH_IMAGE006
Wherein
Figure 762191DEST_PATH_IMAGE007
Figure 711561DEST_PATH_IMAGE008
The median of the two-dimensional integer matrix is a positive integer and only comprises a 0 or 1 two-dimensional integer matrix, the row of the matrix represents the number of channels of the selected student model feature layer, the column represents the number of channels of the selected teacher model feature layer, when the matrix value is 0, the channel corresponding to the row of the student model feature layer is represented, knowledge is not learned from the channel corresponding to the column of the teacher model feature layer, when the matrix value is 1, the channel corresponding to the row of the student model feature layer is represented, and knowledge is learned from the channel corresponding to the column of the teacher model feature layer; each channel of the student model may be associated with multiple channels of the teacher model, and each channel of the teacher model may transmit knowledge to multiple channels of the student model.
Further, in step S3, when fusing the teacher model channel characteristics, each channel of the student model adopts a weighting method:
Figure 424303DEST_PATH_IMAGE009
wherein,Rthe function of the deformation is represented by,F t [c t ]representing a feature layerF t First, thec t Characteristic of the channel (0)<c t <C t ),F s [c s ]Representing a feature layerF s First, thec s Characteristic of the channel (0)<c s <C s ),||•||2Represents a 2-dimensional norm;
weights in the transfer of knowledge, including but not limited to, by calculating semantic relevance of each associated teacher model and student model channel;
further, in step S4, the loss function of knowledge distillation when training is performed is:
Figure 20500DEST_PATH_IMAGE010
wherein,αindicates weight, dist indicates distance function, an indicates multiplication by element;
further, the invention is the overall loss function in training
Figure 451481DEST_PATH_IMAGE011
The formalization is as follows:
Figure 174849DEST_PATH_IMAGE012
wherein,
Figure 7676DEST_PATH_IMAGE013
and representing a student model task-related loss function, such as an image classification problem, which is cross entropy loss or Softmax loss and the like. Therefore, when training optimization is carried out, the two-dimensional integer matrix and the student model in self-correlation can be simultaneously optimized.
Furthermore, the student model is optimized, the parameters of the student model are optimized by using a random gradient descent method,Was parameters of the student model, the firsttLoss function at sub-iteration
Figure 774775DEST_PATH_IMAGE013
AboutWThe partial derivatives of (a) are:
Figure 693052DEST_PATH_IMAGE014
wherein N represents the number of pictures input during gradient update, thentGradient of secondary update:
Figure 718646DEST_PATH_IMAGE015
update parameters using gradient descent:
Figure 140400DEST_PATH_IMAGE016
wherein
Figure 671875DEST_PATH_IMAGE017
Is the learning rate.
Further, the two-dimensional integer matrix is decomposed by using a Kronecker multiplier in a matrix decomposition mode by optimizing the two-dimensional integer matrixMIs composed ofKSub-matrix:
Figure 687236DEST_PATH_IMAGE018
wherein,
Figure 391887DEST_PATH_IMAGE019
Figure 559825DEST_PATH_IMAGE020
thereby, a matrixMExpressed as:
Figure 58940DEST_PATH_IMAGE021
wherein,
Figure 827175DEST_PATH_IMAGE022
which represents a Kronecker multiplication, the method,fas a function without parameters, a two-dimensional integer matrixMThe number of the parameters is as follows:
Figure 69938DEST_PATH_IMAGE023
further, the
Figure 590918DEST_PATH_IMAGE024
As a binary gate function:
Figure 260934DEST_PATH_IMAGE025
wherein 1 represents a matrix with 2 rows and 2 columns and all 1 values,Irepresenting a matrix with a diagonal value of 1 for 2 rows and 2 columns and a remainder of 0,
Figure 250886DEST_PATH_IMAGE026
representing learnable gate functions, two-dimensional integer matricesMThe parameter quantities of (a) are reduced to:
Figure 235023DEST_PATH_IMAGE027
wherein
Figure 751455DEST_PATH_IMAGE028
indicating a ceiling operation.
A channel self-associated knowledge distillation system comprising: the knowledge distillation module is respectively connected with the student model module, the teacher model module and the model optimization module, and the student model module and the teacher model module are connected with the model optimization module;
the student model module is a neural network model used for learning knowledge and deployment;
the teacher model module is a neural network model used for extracting and transmitting knowledge;
the knowledge distillation module is used for weighting and extracting and learning knowledge from the middle characteristic layer of the teacher model by the student model and automatically associating the characteristic layer channels;
the model optimization module is used for optimizing parameters of the student model and a two-dimensional integer matrix involved in channel self-correlation, the two-dimensional integer matrix only comprises 0 or 1 two-dimensional integer matrix, rows of the matrix represent the number of channels of a selected student model characteristic layer, columns represent the number of channels of a selected teacher model characteristic layer, when the matrix value is 0, channels corresponding to the rows of the student model characteristic layer are represented, knowledge is not learned from the channels corresponding to the columns of the teacher model characteristic layer, when the matrix value is 1, channels corresponding to the rows of the student model characteristic layer are represented, and knowledge is learned from the channels corresponding to the columns of the teacher model characteristic layer.
The invention has the advantages and beneficial effects that:
the method is independent of a specific neural network model, can be easily applied to the existing neural network model, only needs few parameters and calculated amount compared with the existing manual correlation method, and can obviously improve the performance of the knowledge post-distillation student model, and is superior to the existing technology. The method can be applied to visual tasks such as target classification, target detection, target segmentation and the like.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of the distillation process of the present invention.
FIG. 3 is a schematic diagram of the association of a teacher model with a student model channel in accordance with the present invention.
Fig. 4 is a schematic diagram of the system of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
As shown in fig. 1, a knowledge distillation method of channel self-correlation comprises the following steps:
s1: inputting the same picture data into the teacher model and the student models to obtain picture characteristics of the student models and the teacher model, and selecting characteristic layers needing knowledge distillation in the student models and the teacher model;
in a preferred embodiment, the teacher model and the student model select any existing convolutional neural network model, and input the same picture data into the teacher model and the student model, so as to obtain the picture characteristics of the student model and the teacher model, as shown in fig. 2. The middle feature layer of the student model is selected as follows:
Figure 206752DEST_PATH_IMAGE004
selecting the middle characteristic layer of the teacher model as follows:
Figure 808635DEST_PATH_IMAGE005
whereinC s/t the number of channels is indicated as such,H s/t the height of the feature map is shown,W s/t representing the feature map width.
S2: performing channel self-association on the channels of the selected student model and teacher model feature layers;
in a preferred embodiment, when the channel is self-correlated, a two-dimensional integer matrix is set, and the matrix only contains 0 or 1;
in a preferred embodiment, the two-dimensional integer matrix is represented as:
Figure 268566DEST_PATH_IMAGE006
wherein
Figure 639504DEST_PATH_IMAGE007
Figure 775957DEST_PATH_IMAGE008
. The matrix median is a positive integer and is only 0 or 1.
In a preferred embodiment, a value of 1 in the two-dimensional integer matrix is a corresponding channel for selecting the teacher model feature layer;
in a preferred embodiment, the matrix represents that each channel of the student model may be associated with multiple channels of the teacher model, and each channel of the teacher model may transfer knowledge to multiple channels of the student model, as shown in FIG. 3.
S3: the self-associated teacher model channel transmits knowledge to the student model channel in a weighting mode;
in a preferred embodiment, each channel of the student model can adopt a weighting mode when fusing the channel characteristics of the teacher model, and the weighting can be obtained by semantic similarity measurement and is formalized as:
Figure 271660DEST_PATH_IMAGE009
whereinRThe function of the deformation is represented by,F t [c t ]representing a feature layerF t First, thec t Characteristic of the channel (0)<c t <C t ),F s [c s ]Representing a feature layerF s First, thec s Characteristic of the channel (0)<c s <C s ),||•||2Representing a 2-dimensional norm.
S4: distilling knowledge according to the selected channel, training, and simultaneously optimizing a self-associated two-dimensional integer matrix and a student model during training;
in a preferred embodiment, the loss function of knowledge distillation when training is performed can be formulated as:
Figure 863178DEST_PATH_IMAGE010
wherein,
Figure 88623DEST_PATH_IMAGE029
represents a distance function, which indicates a multiplication by an element;
as a preferred embodiment, the overall loss function of the invention during training is as follows:
Figure 897442DEST_PATH_IMAGE030
wherein,
Figure 739496DEST_PATH_IMAGE013
and representing a student model task-related loss function, such as an image classification problem, which is cross entropy loss or Softmax loss and the like. Therefore, the self-associated two-dimensional integer matrix and the student model can be simultaneously optimized during training optimization, namely:
Figure 541229DEST_PATH_IMAGE001
wherein,
Figure 621181DEST_PATH_IMAGE002
the function of the loss is represented by,x i which represents the data of the picture that was input,
Figure 443643DEST_PATH_IMAGE031
a label representing the authenticity of the tag,
Figure 366469DEST_PATH_IMAGE003
a predicted output value representing the student model,
Figure 362107DEST_PATH_IMAGE032
parameters representing the student model, and N represents the number of pictures entered.
As a preferred embodiment, the two-dimensional integer matrix is optimizedMIn order to reduce the parameter quantity and the optimization difficulty, a matrix decomposition mode can be selected, and optionally, a Kronecker multiplier decomposition matrix is used for decomposing the matrixMIs composed ofKA submatrix, formalized as:
Figure 171931DEST_PATH_IMAGE018
wherein
Figure 962032DEST_PATH_IMAGE033
Figure 450783DEST_PATH_IMAGE034
. Thus the matrix M can be expressed as:
Figure 345052DEST_PATH_IMAGE021
wherein
Figure 399595DEST_PATH_IMAGE035
Which represents a Kronecker multiplication, the method,fis a function without parameters. At this timeMThe parameters of the matrix are:
Figure 501543DEST_PATH_IMAGE023
alternatively, as a preferred embodiment, the method may be implemented by
Figure 8748DEST_PATH_IMAGE024
Set as a binary gate function to further reduce the number of parameters, formalized as:
Figure 205243DEST_PATH_IMAGE025
where 1 represents a matrix of 2 rows and 2 columns with values of all 1's,Ia matrix with a diagonal value of 1 for 2 rows and 2 columns and a remainder of 0 is shown.
Figure 583135DEST_PATH_IMAGE026
Representing learnable gate functions, such as matricesMThe parameter quantities of (a) are reduced to:
Figure 449460DEST_PATH_IMAGE027
,(
Figure 584906DEST_PATH_IMAGE028
operation of rounding up the representation
In a preferred embodiment, the parameters of the student model are optimized using a stochastic gradient descent method,Was parameters of the student model, the firsttLoss function at sub-iteration
Figure 460458DEST_PATH_IMAGE013
AboutWThe partial derivatives of (a) are:
Figure 115693DEST_PATH_IMAGE014
wherein N represents the number of pictures input during gradient update, thentGradient of secondary update
Figure 356181DEST_PATH_IMAGE036
Is defined as:
Figure 103557DEST_PATH_IMAGE015
the parameters are updated using the gradient descent,
Figure 658166DEST_PATH_IMAGE016
wherein
Figure 541809DEST_PATH_IMAGE017
Is the learning rate;
s5: and deploying the trained student model, and inputting picture data to perform reasoning test.
A knowledge distillation system with self-correlation of channels, as shown in fig. 4, specifically comprising: the system comprises a student model module, a teacher model module, a knowledge distillation module and a model optimization module.
The student model module is used for learning knowledge and a deployed neural network model; the teacher model module is used for extracting and transmitting a neural network model of knowledge; the knowledge distillation module is used for extracting and learning knowledge from the middle characteristic layer of the teacher model by the student model and automatically associating the characteristic layer channels; and the model optimization module is used for optimizing parameters of the student model and the self-associated two-dimensional integer matrix of the first aspect.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A knowledge distillation method of channel self-correlation, characterized by comprising the steps of:
step S1: inputting the same picture data into the teacher model and the student models to obtain the picture characteristics of the student models and the teacher model, selecting a convolution characteristic layer which needs knowledge distillation in the student models and the teacher model, wherein the middle characteristic layer of the selected student models is as follows:
Figure DEST_PATH_IMAGE001
and the selected middle characteristic layer of the teacher model is as follows:
Figure DEST_PATH_IMAGE002
whereinC s/t the number of channels is indicated as such,H s/t the height of the feature map is shown,W s/t representing a feature map width;
step S2: and performing channel self-association on the channels of the convolution feature layers of the selected student model and the teacher model, wherein the channel self-association mode is as follows:
setting a two-dimensional integer matrix
Figure DEST_PATH_IMAGE003
Wherein
Figure DEST_PATH_IMAGE004
Figure DEST_PATH_IMAGE005
The median of the two-dimensional integer matrix is a positive integer and is only 0 or 1, the rows of the two-dimensional integer matrix represent the number of channels of the selected student model feature layer, the columns represent the number of channels of the selected teacher model feature layer, when the matrix value is 0, the channels corresponding to the rows of the student model feature layer are represented, knowledge is not learned from the channels corresponding to the columns of the teacher model feature layer, when the matrix value is 1, the channels corresponding to the rows of the student model feature layer are represented, and knowledge is learned from the channels corresponding to the columns of the teacher model feature layer; each of the student modelsThe channels may be associated with multiple channels of the teacher model, and each channel of the teacher model may transmit knowledge to multiple channels of the student model;
step S3: the self-associated teacher model channel transmits knowledge to the student model channel in a weighting mode;
step S4: and (3) distilling knowledge according to the associated channels, training, and simultaneously optimizing a self-associated two-dimensional integer matrix and a student model during training:
Figure DEST_PATH_IMAGE006
wherein,
Figure DEST_PATH_IMAGE007
the function of the loss is represented by,x i which represents the data of the picture that was input,y i a label representing the authenticity of the tag,
Figure DEST_PATH_IMAGE008
a predicted output value representing the student model,Wthe parameters that represent the model of the student,Nwhich indicates the number of input pictures,Mrepresenting a two-dimensional integer matrix;
step S5: and deploying the trained student model, and inputting picture data to perform reasoning test.
2. The method of claim 1, wherein in step S1, one or more feature layers are selected from the intermediate convolutional layers of the teacher model and the student model respectively.
3. The method of claim 1, wherein in step S3, each channel of the student model is weighted by the characteristics of the teacher model channel, and the weights include but are not limited to those obtained by calculating semantic correlations between each associated teacher model and student model channel, and are expressed as:
Figure DEST_PATH_IMAGE009
wherein,Rthe function of the deformation is represented by,F t [c t ]representing a feature layerF t First, thec t Characteristic of the channel (0)<c t <C t ),F s [c s ]Representing a feature layerF s First, thec s Characteristic of the channel (0)<c s <C s ),||•||2Representing a 2-dimensional norm.
4. The knowledge distillation method of claim 3, wherein in step S4, the loss function of knowledge distillation in training is:
Figure DEST_PATH_IMAGE010
wherein,αrepresents weight, dist represents distance function, and indicates multiplication by element, integral loss function during training:
Figure DEST_PATH_IMAGE011
wherein,
Figure DEST_PATH_IMAGE012
and the loss function related to the student model task is represented, and the two-dimensional integer matrix and the student model in self-correlation are optimized simultaneously during training optimization.
5. The channel self-correlation knowledge distillation method as claimed in claim 4, wherein the student model is optimized, and the parameters of the student model use random gradientThe optimization of the descending method is carried out,Was parameters of the student model, the firsttLoss function at sub-iteration
Figure 735592DEST_PATH_IMAGE012
AboutWThe partial derivatives of (a) are:
Figure DEST_PATH_IMAGE013
wherein N represents the number of pictures input during gradient update, thentGradient of secondary update:
Figure DEST_PATH_IMAGE014
update parameters using gradient descent:
Figure DEST_PATH_IMAGE015
wherein
Figure DEST_PATH_IMAGE016
Is the learning rate.
6. The knowledge distillation method of channel self-correlation as claimed in claim 1, wherein the two-dimensional integer matrix of optimized self-correlation is decomposed by Kronecker multiplier in matrix decompositionMIs composed ofKSub-matrix:
Figure DEST_PATH_IMAGE017
wherein,
Figure DEST_PATH_IMAGE018
Figure DEST_PATH_IMAGE019
thereby, a matrixMExpressed as:
Figure DEST_PATH_IMAGE020
wherein,
Figure DEST_PATH_IMAGE021
which represents a Kronecker multiplication, the method,fas a function without parameters, a two-dimensional integer matrixMThe number of the parameters is as follows:
Figure DEST_PATH_IMAGE022
7. a method of knowledge distillation of channel self-correlation as claimed in claim 6, wherein said method is characterized by
Figure DEST_PATH_IMAGE023
Is a binary gate function, and is expressed in the form:
Figure DEST_PATH_IMAGE024
wherein,1representing a matrix of 2 rows and 2 columns with values of all 1,Irepresenting a matrix with a diagonal value of 1 for 2 rows and 2 columns and a remainder of 0,
Figure DEST_PATH_IMAGE025
representing learnable gate functions, two-dimensional integer matricesMThe parameter quantities of (a) are reduced to:
Figure DEST_PATH_IMAGE026
wherein
Figure DEST_PATH_IMAGE027
indicating a ceiling operation.
8. A channel self-associated knowledge distillation system comprising: the teaching model comprises a student model module, a teacher model module and a model optimization module, and is characterized by also comprising a knowledge distillation module which is respectively connected with the student model module, the teacher model module and the model optimization module, wherein the student model module is connected with the model optimization module;
the student model module is a neural network model used for learning knowledge and deployment;
the teacher model module is a neural network model used for extracting and transmitting knowledge;
the knowledge distillation module is used for extracting and learning knowledge from the middle characteristic layer of the teacher model by the student model and automatically associating the characteristic layer channels;
the model optimization module is used for optimizing parameters of the student model and a two-dimensional integer matrix involved in channel self-correlation, the two-dimensional integer matrix only comprises 0 or 1 two-dimensional integer matrix, rows of the matrix represent the number of channels of a selected student model characteristic layer, columns represent the number of channels of a selected teacher model characteristic layer, when the matrix value is 0, channels corresponding to the rows of the student model characteristic layer are represented, knowledge is not learned from the channels corresponding to the columns of the teacher model characteristic layer, when the matrix value is 1, channels corresponding to the rows of the student model characteristic layer are represented, and knowledge is learned from the channels corresponding to the columns of the teacher model characteristic layer.
CN202110673166.XA 2021-06-17 2021-06-17 Knowledge distillation method and system with self-correlation of channels Active CN113255899B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110673166.XA CN113255899B (en) 2021-06-17 2021-06-17 Knowledge distillation method and system with self-correlation of channels

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110673166.XA CN113255899B (en) 2021-06-17 2021-06-17 Knowledge distillation method and system with self-correlation of channels

Publications (2)

Publication Number Publication Date
CN113255899A CN113255899A (en) 2021-08-13
CN113255899B true CN113255899B (en) 2021-10-12

Family

ID=77188543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110673166.XA Active CN113255899B (en) 2021-06-17 2021-06-17 Knowledge distillation method and system with self-correlation of channels

Country Status (1)

Country Link
CN (1) CN113255899B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113963022B (en) * 2021-10-20 2023-08-18 哈尔滨工业大学 Multi-outlet full convolution network target tracking method based on knowledge distillation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674714A (en) * 2019-09-13 2020-01-10 东南大学 Human face and human face key point joint detection method based on transfer learning
CN112529178A (en) * 2020-12-09 2021-03-19 中国科学院国家空间科学中心 Knowledge distillation method and system suitable for detection model without preselection frame

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11620515B2 (en) * 2019-11-07 2023-04-04 Salesforce.Com, Inc. Multi-task knowledge distillation for language model
CN112199535B (en) * 2020-09-30 2022-08-30 浙江大学 Image classification method based on integrated knowledge distillation
CN112418343B (en) * 2020-12-08 2024-01-05 中山大学 Multi-teacher self-adaptive combined student model training method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674714A (en) * 2019-09-13 2020-01-10 东南大学 Human face and human face key point joint detection method based on transfer learning
CN112529178A (en) * 2020-12-09 2021-03-19 中国科学院国家空间科学中心 Knowledge distillation method and system suitable for detection model without preselection frame

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"PAYING MORE ATTENTION TO ATTENTION: IMPROVING THE PERFORMANCE OF CONVOLUTIONAL NEURAL NETWORKS VIA ATTENTION TRANSFER";Sergey等;《ICLR 2017》;20170212;第1-9页 *
"Structured Knowledge Distillation for Dense Prediction";Yifan等;《arXiv》;20200614;第1-12页 *
"基于知识蒸馏的胡萝卜外观品质等级智能检测";倪建功等;《农业工程学报》;20200930;第36卷(第18期);第181-185页 *

Also Published As

Publication number Publication date
CN113255899A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
KR101880901B1 (en) Method and apparatus for machine learning
CN112634276A (en) Lightweight semantic segmentation method based on multi-scale visual feature extraction
CN111695467A (en) Spatial spectrum full convolution hyperspectral image classification method based on superpixel sample expansion
WO2017163759A1 (en) System and computer-implemented method for semantic segmentation of image, and non-transitory computer-readable medium
CN111339818B (en) Face multi-attribute recognition system
AU2020200338B2 (en) Image searching apparatus, classifier training method, and program
CN109117894B (en) Large-scale remote sensing image building classification method based on full convolution neural network
CN112132149A (en) Semantic segmentation method and device for remote sensing image
CN112308081B (en) Image target prediction method based on attention mechanism
WO2021042857A1 (en) Processing method and processing apparatus for image segmentation model
CN113780292A (en) Semantic segmentation network model uncertainty quantification method based on evidence reasoning
CN113516133A (en) Multi-modal image classification method and system
CN113255899B (en) Knowledge distillation method and system with self-correlation of channels
CN115546196A (en) Knowledge distillation-based lightweight remote sensing image change detection method
CN114943859B (en) Task related metric learning method and device for small sample image classification
CN115328319A (en) Intelligent control method and device based on light-weight gesture recognition
CN118154867A (en) Semi-supervised remote sensing image semantic segmentation method and system
CN115376317A (en) Traffic flow prediction method based on dynamic graph convolution and time sequence convolution network
CN113590971B (en) Interest point recommendation method and system based on brain-like space-time perception characterization
US20210286544A1 (en) Economic long short-term memory for recurrent neural networks
CN110866866B (en) Image color imitation processing method and device, electronic equipment and storage medium
CN116109945A (en) Remote sensing image interpretation method based on ordered continuous learning
CN115081516A (en) Internet of things flow prediction method based on biological connection group time-varying convolution network
CN112926517B (en) Artificial intelligence monitoring method
JPH08305855A (en) Method and device for pattern recognition of image

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant