CN114282690A - Model distillation method, device, equipment and storage medium - Google Patents

Model distillation method, device, equipment and storage medium

Info

Publication number
CN114282690A
CN114282690A (Application CN202210101415.2A)
Authority
CN
China
Prior art keywords
characteristic
sample
transformation
disturbance
sub
Prior art date
Legal status
Pending
Application number
CN202210101415.2A
Other languages
Chinese (zh)
Inventor
杨馥魁
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210101415.2A
Publication of CN114282690A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The present disclosure provides a model distillation method, apparatus, device and storage medium, which relates to the technical field of artificial intelligence, in particular to the technical field of deep learning and computer vision, and can be used in scenes such as image processing and image detection. The specific implementation scheme is as follows: respectively inputting the sample images into a teacher model and a student model to obtain a first sample characteristic and a second sample characteristic; respectively carrying out physical transformation on the first sample characteristic and the second sample characteristic to obtain a first disturbance characteristic and a second disturbance characteristic; determining a distillation loss from the first sample characteristic, the second sample characteristic, the first perturbation characteristic, and the second perturbation characteristic; training the student model according to the distillation loss. According to the technology disclosed by the invention, the training precision of the student model can be improved.

Description

Model distillation method, device, equipment and storage medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical field of deep learning and computer vision, and can be applied to scenes such as image processing, image detection and the like.
Background
With the development of artificial intelligence technology, knowledge distillation is applied more and more widely in model training. Knowledge distillation is a technique in which a pre-trained Teacher Model with a complex structure is used to train a Student Model with a simple structure, so that the student model acquires the capabilities of the teacher model. How to train a high-precision student model based on the knowledge distillation technology is therefore crucial.
Disclosure of Invention
The present disclosure provides a model distillation method, apparatus, device and storage medium.
According to an aspect of the present disclosure, there is provided a model distillation method, the method comprising:
respectively inputting the sample images into a teacher model and a student model to obtain a first sample characteristic and a second sample characteristic;
respectively carrying out physical transformation on the first sample characteristic and the second sample characteristic to obtain a first disturbance characteristic and a second disturbance characteristic;
determining a distillation loss from the first sample characteristic, the second sample characteristic, the first perturbation characteristic, and the second perturbation characteristic;
training the student model according to the distillation loss.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the model distillation method of any of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the model distillation method of any one of the embodiments of the present disclosure.
According to the technology disclosed by the invention, the training precision of the student model can be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow diagram of a model distillation method provided in accordance with an embodiment of the present disclosure;
FIG. 2 is a flow diagram of another model distillation method provided in accordance with an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a model distillation process provided in accordance with an embodiment of the present disclosure;
FIG. 4 is a flow diagram of yet another model distillation method provided in accordance with an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a model distillation apparatus provided in accordance with an embodiment of the present disclosure;
FIG. 6 is a block diagram of an electronic device for implementing the model distillation method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a flow chart of a model distillation method according to an embodiment of the present disclosure, which is applicable to the case of training a student model based on the knowledge distillation technology. The method may be performed by a model distillation apparatus, which may be implemented in software and/or hardware and may be integrated in an electronic device that carries the model distillation function. As shown in fig. 1, the model distillation method of the present embodiment may include:
and S101, respectively inputting the sample image into a teacher model and a student model to obtain a first sample characteristic and a second sample characteristic.
In this embodiment, the teacher model is a pre-trained model with a complex structure, and the student model is an untrained model with a simple structure. Optionally, the student model is trained through a knowledge distillation technology, and finally the student model has the same function as the teacher model. Furthermore, the teacher model can have a feature extraction function and can be applied to scenes such as face recognition, target object detection and the like.
The sample image can be an image used in model training, such as a face image; the sample image is subjected to feature extraction to obtain sample features, and the sample features can be represented in a matrix form.
Optionally, the sample images are respectively input into the teacher model and the student model, so that a first sample feature output by the teacher model and a second sample feature output by the student model can be obtained.
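As a concrete illustration of step S101, the sketch below assumes PyTorch and uses two toy backbones as stand-ins for the teacher and student models; the network definitions, tensor shapes, and variable names here are illustrative assumptions and are not prescribed by this description.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the pre-trained teacher (complex) and the
# untrained student (simple); in practice these would be real
# feature-extraction networks, e.g. for face recognition.
teacher_model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten())     # outputs 64-dim features
student_model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 64))                         # projected to 64 dims as well

sample_images = torch.randn(8, 3, 112, 112)    # a batch of sample images

with torch.no_grad():                          # the teacher is frozen
    t0 = teacher_model(sample_images)          # first sample feature T0
s0 = student_model(sample_images)              # second sample feature S0
```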
And S102, respectively carrying out physical transformation on the first sample characteristic and the second sample characteristic to obtain a first disturbance characteristic and a second disturbance characteristic.
In this embodiment, the physical transformation is a transformation that does not change the essential nature of the feature. Optionally, the physical transformation may include both linear transformation and nonlinear transformation; the linear transformation includes, but is not limited to, scaling transformation, rotation transformation, and the like, and the nonlinear transformation includes, but is not limited to, gaussian transformation, polynomial transformation, and the like. That is, the physical transformation includes at least one of a scaling transformation, a rotation transformation, a gaussian transformation, a polynomial transformation, and the like. Further, in this embodiment, the physical transformation performed on the first sample feature is the same as that performed on the second sample feature.
Optionally, if the physical transformation is one of scaling transformation, rotation transformation, gaussian transformation, polynomial transformation, and the like, the first sample feature may be subjected to physical transformation to obtain a first perturbation feature, and the second sample feature may be subjected to physical transformation to obtain a second perturbation feature.
Further, if the physical transformation is at least two of scaling transformation, rotation transformation, gaussian transformation, polynomial transformation, and the like, a target mode for the physical transformation may be selected from a serial mode and a parallel mode according to the arrangement information and/or the type of the sample image; the first sample feature is then physically transformed in the target mode to obtain the first perturbation feature, and the second sample feature is physically transformed in the target mode to obtain the second perturbation feature.
And S103, determining the distillation loss according to the first sample characteristic, the second sample characteristic, the first disturbance characteristic and the second disturbance characteristic.
Specifically, the first sample characteristic, the second sample characteristic, the first perturbation characteristic, and the second perturbation characteristic may be input into a predetermined loss function, so as to obtain the distillation loss.
Alternatively, the first sample characteristic and the first perturbation characteristic may be added to obtain the first characteristic; adding the second sample characteristic and the second disturbance characteristic to obtain a second characteristic; and inputting the first characteristic and the second characteristic into a preset loss function to obtain the distillation loss.
And S104, training the student model according to the distillation loss.
Specifically, the distillation loss is used to train the student model, and the network parameters of the student model are continuously optimized until the model converges. In this embodiment, the student model needs to be iteratively trained multiple times with the above method on multiple groups of sample images; when a preset training stop condition is reached, the adjustment of the network parameters of the student model is stopped, and the trained student model is obtained. The training stop condition may include: the number of training iterations reaches a preset number, the distillation loss converges, or the like.
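Putting S101 to S104 together, the following is a minimal training-loop sketch under stated assumptions: PyTorch, the toy `teacher_model` and `student_model` above, a hypothetical `loader` that yields image batches, a hypothetical `max_steps` stop condition, an illustrative scaling-plus-translation transformation, and the L2 (mean squared error) loss as one of the loss options named later in this description.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.SGD(student_model.parameters(), lr=0.01)

def physical_transform(feat, scale, offset):
    # One simple physical transformation (scaling followed by translation);
    # the same transformation is applied to teacher and student features.
    return scale * feat + offset

for step, sample_images in enumerate(loader):
    with torch.no_grad():
        t0 = teacher_model(sample_images)            # first sample feature
    s0 = student_model(sample_images)                # second sample feature

    scale, offset = 5.0, 0.5                         # illustrative values
    t1 = physical_transform(t0, scale, offset)       # first perturbation feature
    s1 = physical_transform(s0, scale, offset)       # second perturbation feature

    distillation_loss = F.mse_loss(s0, t0) + F.mse_loss(s1, t1)
    optimizer.zero_grad()
    distillation_loss.backward()
    optimizer.step()

    if step + 1 >= max_steps:                        # preset training-stop condition
        break
```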
It should be noted that, in existing model distillation, the first sample features output by the teacher model are used to directly supervise the second sample features output by the student model, so that the features output by the student model are as consistent as possible with the features output by the teacher model. This model distillation approach ignores the subtle differences between the features output by the teacher model and the features output by the student model, so the training precision of the student model is not high.
In this embodiment, the subtle differences between the features output by the teacher model and the features output by the student model are fully considered; by physically transforming the first sample feature output by the teacher model and the second sample feature output by the student model, these subtle differences are amplified, so that the student model achieves higher training precision.
According to the technical scheme provided by the embodiment of the disclosure, the first sample characteristic output by the teacher model and the second sample characteristic output by the student model can be obtained by respectively inputting the sample images into the teacher model and the student model; then, respectively carrying out physical transformation on the first sample characteristic and the second sample characteristic to obtain a first disturbance characteristic and a second disturbance characteristic; and determining distillation loss based on the first sample characteristic, the second sample characteristic, the first disturbance characteristic and the second disturbance characteristic, and training the student model by adopting the distillation loss. In the scheme, the first sample characteristic output by the teacher model and the second sample characteristic output by the student model are physically changed to amplify the slight difference between the characteristic output by the student model and the characteristic output by the teacher model, namely, the first disturbance characteristic and the second disturbance characteristic are introduced; and training the student model based on the first sample characteristic, the second sample characteristic, the first disturbance characteristic and the second disturbance characteristic, so that the student model has higher training precision.
Alternatively, on the basis of the above embodiment, the determining of the distillation loss according to the first sample characteristic, the second sample characteristic, the first disturbance characteristic and the second disturbance characteristic may further be: determining a first loss from the first sample characteristic and the second sample characteristic; determining a second loss according to the first disturbance characteristic and the second disturbance characteristic; determining a distillation loss according to the first loss and the second loss.
Specifically, the first sample feature and the second sample feature are input into a preset loss function to obtain the first loss; the first perturbation feature and the second perturbation feature are input into the same loss function to obtain the second loss; and the first loss and the second loss are added to obtain the distillation loss. The loss function may be a cross-entropy loss function, a squared loss function (i.e., an L2 loss function), or the like.
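For instance, with the squared (L2) loss named above, the two-term distillation loss could look like the following sketch (PyTorch assumed; the function name is illustrative):

```python
import torch.nn.functional as F

def distillation_loss(t0, s0, t1, s1):
    # t0/s0: first/second sample features; t1/s1: first/second perturbation features.
    first_loss = F.mse_loss(s0, t0)     # loss before the feature transformation
    second_loss = F.mse_loss(s1, t1)    # loss after the feature transformation
    return first_loss + second_loss     # distillation loss
```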
It can be understood that this embodiment trains the student model with both the loss before feature transformation (i.e., the first loss) and the loss after feature transformation (i.e., the second loss), so that the features output by the student model and the teacher model remain consistent both before and after the feature transformation.
Fig. 2 is a flow chart of another model distillation method provided according to an embodiment of the present disclosure, and this embodiment further explains in detail how to obtain the first disturbance characteristic and the second disturbance characteristic based on the above embodiment. As shown in fig. 2, the model distillation method of the present embodiment may include:
s201, inputting the sample image into a teacher model and a student model respectively to obtain a first sample characteristic and a second sample characteristic.
S202, respectively performing scaling transformation on the first sample feature and the second sample feature to obtain a first sub-perturbation feature and a second sub-perturbation feature.
In one implementation, the first sample feature may be subjected to a random scaling transformation to obtain the first sub-perturbation feature, and the second sample feature may be scaled in the same way as the first sample feature to obtain the second sub-perturbation feature. Optionally, the scaling ratio used in the random scaling transformation in this embodiment may be any value from 0 to 10.
In yet another implementation, the scaling ratio may be determined according to dimension information of the first sample feature and/or the second sample feature; the first sample feature is then scaled according to the scaling ratio to obtain the first sub-perturbation feature, and the second sample feature is scaled according to the scaling ratio to obtain the second sub-perturbation feature.
In another implementation manner, scaling transformation and translation transformation may be performed on the first sample feature to obtain a first sub-perturbation feature; and carrying out scaling transformation and translation transformation on the second sample characteristic to obtain a second sub-disturbance characteristic.
For example, the first sample feature may be scaled first, and the scaled first sample feature may then be translated to obtain the first sub-perturbation feature. Specifically, the first sub-perturbation feature may be determined by the formula scale * T0 + offset, where T0 denotes the first sample feature, scale ∈ (0, 10) characterizes scaling T0 by any value from 0 to 10, and offset ∈ (0, 1) characterizes shifting the scaled first sample feature by any value from 0 to 1.
Similarly, the second sample feature may be scaled, and the scaled second sample feature may then be translated to obtain the second sub-perturbation feature. It should be noted that, in this embodiment, the first sample feature and the second sample feature are scaled in the same manner, while the two sample features may be translated in different manners, that is, the translation offsets may differ. It can be understood that, when the first sub-perturbation feature and the second sub-perturbation feature are determined in this embodiment, introducing both scaling transformation and translation transformation enriches the transformation types, so that the difference between the features output by the student model and the features output by the teacher model is amplified more comprehensively.
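A minimal sketch of this scaling-plus-translation step (PyTorch assumed; the shared random scale and the per-feature offsets follow the ranges given above, but the function name is illustrative):

```python
import torch

def scale_then_translate(t0, s0):
    # Shared random scaling for both features, scale in (0, 10); the
    # translation offsets in (0, 1) may differ between the two features.
    scale = torch.empty(1).uniform_(0.0, 10.0)
    offset_t = torch.empty(1).uniform_(0.0, 1.0)
    offset_s = torch.empty(1).uniform_(0.0, 1.0)
    t_sub = scale * t0 + offset_t       # first sub-perturbation feature
    s_sub = scale * s0 + offset_s       # second sub-perturbation feature
    return t_sub, s_sub
```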
And S203, respectively carrying out rotation transformation on the first sample characteristic and the second sample characteristic to obtain a third sub-disturbance characteristic and a fourth sub-disturbance characteristic.
Optionally, the first sample feature and the second sample feature may be rotation transformed by any one of the following ways:
in the first mode, random rotation transformation can be performed on the first sample characteristic to obtain a third sub-disturbance characteristic; and performing the same rotation mode as the first sample characteristic on the second sample characteristic to obtain a fourth sub-perturbation characteristic.
In a second mode, the first sample feature may be subjected to rotation transformation and translation transformation to obtain the third sub-perturbation feature, and the second sample feature may be subjected to rotation transformation and translation transformation to obtain the fourth sub-perturbation feature. For example, the first sample feature is rotated, and the rotated first sample feature is then translated to obtain the third sub-perturbation feature; similarly, the second sample feature is rotated, and the rotated second sample feature is then translated to obtain the fourth sub-perturbation feature. It should be noted that, in this embodiment, the first sample feature and the second sample feature are rotated in the same manner, while they may be translated in different manners, that is, with different translation offsets.
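The description does not pin down a concrete rotation operator for feature matrices. One plausible sketch, which as an assumption treats rotation as multiplying both [batch, dim] feature matrices by a shared random orthogonal matrix and then translating them by different offsets, is:

```python
import torch

def rotate_then_translate(t0, s0):
    # t0/s0 are assumed to be [batch, dim] feature matrices. A shared random
    # orthogonal matrix stands in for the rotation; the translation offsets differ.
    dim = t0.shape[1]
    q, _ = torch.linalg.qr(torch.randn(dim, dim))   # random orthogonal matrix
    t_sub = t0 @ q + torch.rand(1)                  # third sub-perturbation feature
    s_sub = s0 @ q + torch.rand(1)                  # fourth sub-perturbation feature
    return t_sub, s_sub
```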
And S204, respectively carrying out Gaussian transformation on the first sample characteristic and the second sample characteristic to obtain a fifth sub-disturbance characteristic and a sixth sub-disturbance characteristic.
Optionally, the first sample feature may be input into a set gaussian function to obtain the fifth sub-perturbation feature. For example, the fifth sub-perturbation feature may be determined by the formula G(T0, 0 + u, 1 + v), where T0 represents the first sample feature, G(·) represents a gaussian function with a mean of 0 and a variance of 1, u represents a random perturbation variable of the mean, and v represents a random perturbation variable of the variance.
Correspondingly, the second sample feature may also be input into the same gaussian function to obtain the sixth sub-perturbation feature. For example, the sixth sub-perturbation feature may also be determined by the formula G(S0, 0 + u, 1 + v), where S0 denotes the second sample feature.
It can be understood that, in the present embodiment, performing gaussian transformation on the first sample feature and the second sample feature is equivalent to adding gaussian noise to the first sample feature and the second sample feature to amplify the slight difference between the two.
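Since the paragraph above treats the gaussian transformation as equivalent to adding gaussian noise, the following is a minimal sketch (PyTorch assumed) that draws noise from N(0 + u, 1 + v), with u and v as the random perturbations of the mean and variance; the magnitudes of u and v and the function name are illustrative assumptions.

```python
import torch

def gaussian_transform(t0, s0):
    # Interprets G(T0, 0 + u, 1 + v) as adding noise drawn from a gaussian
    # whose mean (0 + u) and variance (1 + v) are randomly perturbed.
    u = 0.1 * torch.randn(1)                      # random perturbation of the mean
    v = 0.1 * torch.rand(1)                       # random perturbation of the variance
    std = torch.sqrt(1.0 + v)
    t_sub = t0 + u + std * torch.randn_like(t0)   # fifth sub-perturbation feature
    s_sub = s0 + u + std * torch.randn_like(s0)   # sixth sub-perturbation feature
    return t_sub, s_sub
```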
And S205, taking at least one of the first sub-disturbance feature, the third sub-disturbance feature and the fifth sub-disturbance feature as the first disturbance feature.
And S206, taking at least one of the second sub-disturbance feature, the fourth sub-disturbance feature and the sixth sub-disturbance feature as a second disturbance feature.
And S207, determining the distillation loss according to the first sample characteristic, the second sample characteristic, the first disturbance characteristic and the second disturbance characteristic.
Optionally, if there is a single first perturbation feature and a single second perturbation feature, the first loss may be determined according to the first sample feature and the second sample feature; the second loss is determined according to the first perturbation feature and the second perturbation feature; and the distillation loss is determined according to the first loss and the second loss.
Further, if the first disturbance feature and the second disturbance feature are both multiple, a loss is calculated from corresponding sub-disturbance features in the first disturbance feature and the second disturbance feature. For example, if the first perturbation characteristic includes a first sub-perturbation characteristic and a third sub-perturbation characteristic, and the second perturbation characteristic includes a second sub-perturbation characteristic and a fourth sub-perturbation characteristic, the first loss may be determined according to the first sample characteristic and the second sample characteristic; determining a second loss according to the first sub-disturbance characteristic and the second sub-disturbance characteristic; determining a third loss according to the third sub-disturbance characteristic and the fourth sub-disturbance characteristic; determining a distillation loss according to the first loss, the second loss and the third loss.
It should be noted that, in the case that there are a plurality of first perturbation features and a plurality of second perturbation features, this embodiment essentially provides a process of performing physical transformation in a parallel manner, specifically, performing physical transformation on a first sample feature in a parallel manner, and performing physical transformation on a second sample feature in a parallel manner.
And S208, training the student model according to the distillation loss.
According to the technical scheme provided by the embodiment of the disclosure, the first sample characteristic output by the teacher model and the second sample characteristic output by the student model can be obtained by respectively inputting the sample images into the teacher model and the student model; and then, respectively carrying out scaling transformation on the first sample characteristic and the second sample characteristic, respectively carrying out rotation transformation on the first sample characteristic and the second sample characteristic, respectively carrying out Gaussian transformation on the first sample characteristic and the second sample characteristic, determining distillation loss based on the first sample characteristic, the second sample characteristic and the first disturbance characteristic and the second disturbance characteristic obtained by carrying out physical transformation, and training a student model by adopting the distillation loss. According to the scheme, the subtle differences between the characteristics output by the student model and the characteristics output by the teacher model are amplified from different angles by adopting various transformation modes such as scaling transformation, rotation transformation and Gaussian transformation, so that the determined distillation loss is more accurate, and the student model has higher training precision.
The present embodiment provides a preferred example based on the above-described embodiments. The following describes the whole process of student model training in detail, taking the case where the physical transformations are scaling transformation, translation transformation and gaussian transformation, with the scaling transformation and the translation transformation applied serially in one branch and the gaussian transformation applied in a parallel branch. Referring to fig. 3, the model distillation process of this example is as follows:
and respectively inputting the sample images into a teacher model and a student model to obtain a first sample characteristic T0 output by the teacher model and a second sample characteristic S0 output by the student model.
Carrying out scaling transformation on the first sample characteristic, and then carrying out translation transformation on the first sample characteristic after scaling transformation to obtain a disturbance characteristic T1; similarly, the second sample characteristic is subjected to scaling transformation, and then the second sample characteristic after scaling transformation is subjected to translation transformation, so as to obtain the disturbance characteristic S1.
Meanwhile, the first sample characteristic can be subjected to Gaussian transformation to obtain a disturbance characteristic T2; and performing Gaussian transformation on the second sample characteristic to obtain a disturbance characteristic S2.
Loss 1 between T0 and S0, loss 2 between T1 and S1, and loss 3 between T2 and S2 are calculated respectively using an L2 loss function; loss 1, loss 2 and loss 3 are then added to obtain the total loss, i.e., the distillation loss. The student model is trained with the distillation loss, and the network parameters of the student model are continuously optimized until the model converges.
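Under the same assumptions as the earlier sketches (PyTorch, illustrative parameter ranges, and the gaussian transformation interpreted as additive noise), the whole Fig. 3 example can be summarized as follows; the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def fig3_distillation_loss(t0, s0):
    # Branch 1 (serial): scaling then translation, with a shared scale.
    scale = torch.empty(1).uniform_(0.0, 10.0)
    t1 = scale * t0 + torch.rand(1)          # perturbation feature T1
    s1 = scale * s0 + torch.rand(1)          # perturbation feature S1

    # Branch 2 (parallel): gaussian transformation with perturbed statistics.
    u, v = 0.1 * torch.randn(1), 0.1 * torch.rand(1)
    t2 = t0 + u + torch.sqrt(1.0 + v) * torch.randn_like(t0)   # perturbation feature T2
    s2 = s0 + u + torch.sqrt(1.0 + v) * torch.randn_like(s0)   # perturbation feature S2

    loss1 = F.mse_loss(s0, t0)               # loss 1 between T0 and S0
    loss2 = F.mse_loss(s1, t1)               # loss 2 between T1 and S1
    loss3 = F.mse_loss(s2, t2)               # loss 3 between T2 and S2
    return loss1 + loss2 + loss3             # total distillation loss
```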
It can be understood that, in the embodiment, the characteristics output by the teacher model and the characteristics output by the student model are respectively subjected to scaling transformation and gaussian transformation to amplify the slight difference between the characteristics output by the student model and the characteristics output by the teacher model, so that the determined distillation loss is more accurate, and the student model has higher training precision.
FIG. 4 is a flow chart of yet another method for model distillation according to an embodiment of the present disclosure, which is based on the above embodiment and provides an alternative way to determine the first and second perturbation characteristics. As shown in fig. 4, the model distillation method of the present embodiment may include:
s401, inputting the sample image into a teacher model and a student model respectively to obtain a first sample characteristic and a second sample characteristic.
S402, carrying out scaling transformation and/or rotation transformation on the first sample characteristic to obtain a first intermediate characteristic.
Optionally, scaling transformation may be performed on the first sample feature, and the transformed first sample feature is taken as a first intermediate feature; or, the first sample feature may be subjected to rotation transformation, and the transformed first sample feature is taken as the first intermediate feature; or, the first intermediate feature may be obtained by performing scaling transformation on the first sample feature and then performing rotation transformation on the scaled and transformed first sample feature.
Further, in an embodiment, the first sample feature may be subjected to scaling transformation, and then the feature after scaling transformation may be subjected to translation transformation to obtain the first intermediate feature.
Alternatively, the first sample feature may be subjected to rotation transformation, and then the feature after the rotation transformation may be subjected to translation transformation, so as to obtain the first intermediate feature.
Or, the first intermediate feature may be obtained by performing scaling transformation on the first sample feature, performing translation transformation on the feature after scaling transformation, and performing rotation transformation on the feature after translation transformation.
And S403, performing scaling transformation and/or rotation transformation on the second sample characteristic to obtain a second intermediate characteristic.
Optionally, scaling transformation may be performed on the second sample feature, and the transformed second sample feature is used as a second intermediate feature; or, the second sample feature may be subjected to rotation transformation, and the transformed second sample feature is taken as a second intermediate feature; or, the second sample feature may be scaled and transformed first, and then the scaled and transformed second sample feature may be rotated and transformed to obtain the second intermediate feature.
Further, in an implementation manner, scaling transformation may be performed on the second sample feature, and then translation transformation may be performed on the feature after scaling transformation, so as to obtain a second intermediate feature.
Or, the second sample feature may be subjected to rotation transformation, and then the feature after the rotation transformation is subjected to translation transformation, so as to obtain a second intermediate feature.
Or, the second sample feature may be scaled, then the scaled feature may be translated, and then the translated feature may be rotated to obtain the second intermediate feature.
S404, respectively carrying out Gaussian transformation on the first intermediate feature and the second intermediate feature to obtain a first disturbance feature and a second disturbance feature.
Optionally, the first intermediate feature may be input into a set gaussian function to obtain the first perturbation feature. For example, the first perturbation feature may be determined by the formula G(Tz, 0 + u, 1 + v), where Tz represents the first intermediate feature, G(·) represents a gaussian function with a mean of 0 and a variance of 1, u represents a random perturbation variable of the mean, and v represents a random perturbation variable of the variance.
Correspondingly, the second intermediate feature may also be input into the same gaussian function to obtain the second perturbation feature. For example, the second perturbation feature may also be determined by the formula G(Sz, 0 + u, 1 + v), where Sz represents the second intermediate feature.
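In code terms, and reusing the illustrative `scale_then_translate` and `gaussian_transform` helpers sketched earlier in this description, the serial mode simply chains the two steps:

```python
# Serial mode: linear transformation first, then the gaussian transformation
# (the helpers are the illustrative sketches from the earlier sections).
tz, sz = scale_then_translate(t0, s0)                    # first/second intermediate features
t_perturbed, s_perturbed = gaussian_transform(tz, sz)    # first/second perturbation features
```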
And S405, determining the distillation loss according to the first sample characteristic, the second sample characteristic, the first disturbance characteristic and the second disturbance characteristic.
And S406, training the student model according to the distillation loss.
It should be noted that, in essence, this embodiment provides a process of performing the physical transformation in a serial manner; specifically, the first sample feature is physically transformed in a serial manner, and the second sample feature is physically transformed in a serial manner. Further, based on the properties of linear and nonlinear transformations, when the physical transformation includes both a linear transformation and a nonlinear transformation, the feature is first subjected to the linear transformation and then to the nonlinear transformation. For example, the first sample feature may be scaled first, and the scaled first sample feature may then be subjected to the gaussian transformation.
According to the technical scheme provided by the embodiment of the disclosure, the first sample characteristic output by the teacher model and the second sample characteristic output by the student model can be obtained by respectively inputting the sample images into the teacher model and the student model; and then, carrying out physical transformation such as scaling transformation, rotation transformation, Gaussian transformation and the like on the first sample characteristic, carrying out physical transformation such as scaling transformation, rotation transformation, Gaussian transformation and the like on the second sample characteristic, determining distillation loss based on the first sample characteristic, the second sample characteristic, the first disturbance characteristic and the second disturbance characteristic obtained by carrying out physical transformation, and training a student model by adopting the distillation loss. According to the scheme, the subtle differences between the characteristics output by the student model and the characteristics output by the teacher model are amplified from different angles by adopting various transformation modes such as scaling transformation, rotation transformation and Gaussian transformation, so that the determined distillation loss is more accurate, and the student model has higher training precision.
Fig. 5 is a schematic structural diagram of a model distillation apparatus provided according to an embodiment of the present disclosure. The embodiment of the disclosure is suitable for the situation of how to train a student model based on knowledge distillation technology. The apparatus may be implemented in software and/or hardware, and may implement the model distillation method described in any embodiment of the present disclosure. As shown in fig. 5, the model distillation apparatus includes:
a sample feature determining module 501, configured to input the sample image to a teacher model and a student model respectively to obtain a first sample feature and a second sample feature;
a perturbation feature determining module 502, configured to perform physical transformation on the first sample feature and the second sample feature respectively to obtain a first perturbation feature and a second perturbation feature;
a distillation loss determination module 503, configured to determine a distillation loss according to the first sample characteristic, the second sample characteristic, the first perturbation characteristic, and the second perturbation characteristic;
and the training module 504 is used for training the student model according to the distillation loss.
According to the technical scheme provided by the embodiment of the disclosure, the first sample characteristic output by the teacher model and the second sample characteristic output by the student model can be obtained by respectively inputting the sample images into the teacher model and the student model; then, respectively carrying out physical transformation on the first sample characteristic and the second sample characteristic to obtain a first disturbance characteristic and a second disturbance characteristic; and determining distillation loss based on the first sample characteristic, the second sample characteristic, the first disturbance characteristic and the second disturbance characteristic, and training the student model by adopting the distillation loss. In the scheme, the first sample characteristic output by the teacher model and the second sample characteristic output by the student model are physically changed to amplify the slight difference between the characteristic output by the student model and the characteristic output by the teacher model, namely, the first disturbance characteristic and the second disturbance characteristic are introduced; and training the student model based on the first sample characteristic, the second sample characteristic, the first disturbance characteristic and the second disturbance characteristic, so that the student model has higher training precision.
Illustratively, the physical transformation includes at least one of a scaling transformation, a rotation transformation, and a gaussian transformation, and the physical transformation performed on the first sample feature and the second sample feature is the same.
Illustratively, the disturbance characteristic determination module 502 includes:
the scaling transformation unit is used for respectively carrying out scaling transformation on the first sample characteristic and the second sample characteristic to obtain a first sub-disturbance characteristic and a second sub-disturbance characteristic;
the rotation transformation unit is further used for respectively performing rotation transformation on the first sample characteristic and the second sample characteristic to obtain a third sub-disturbance characteristic and a fourth sub-disturbance characteristic;
the Gaussian transformation unit is further used for respectively carrying out Gaussian transformation on the first sample characteristic and the second sample characteristic to obtain a fifth sub-disturbance characteristic and a sixth sub-disturbance characteristic;
a first disturbance characteristic determination unit, configured to use at least one of the first sub-disturbance characteristic, the third sub-disturbance characteristic, and the fifth sub-disturbance characteristic as a first disturbance characteristic;
and the second disturbance characteristic determination unit is used for taking at least one of the second sub-disturbance characteristic, the fourth sub-disturbance characteristic and the sixth sub-disturbance characteristic as the second disturbance characteristic.
Illustratively, the scaling transform unit is specifically configured to:
carrying out scaling transformation and translation transformation on the first sample characteristic to obtain a first sub-disturbance characteristic;
and carrying out scaling transformation and translation transformation on the second sample characteristic to obtain a second sub-disturbance characteristic.
Illustratively, the disturbance characteristic determination module 502 is specifically configured to:
carrying out scaling transformation and/or rotation transformation on the first sample characteristic to obtain a first intermediate characteristic;
carrying out scaling transformation and/or rotation transformation on the second sample characteristic to obtain a second intermediate characteristic;
and respectively carrying out Gaussian transformation on the first intermediate characteristic and the second intermediate characteristic to obtain a first disturbance characteristic and a second disturbance characteristic.
Illustratively, the distillation loss determination module 503 is specifically configured to:
determining a first loss from the first sample characteristic and the second sample characteristic;
determining a second loss according to the first disturbance characteristic and the second disturbance characteristic;
determining a distillation loss according to the first loss and the second loss.
In the technical solution of the present disclosure, the acquisition, storage and application of the sample images involved all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the electronic device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Various components in the electronic device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the electronic device 600 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the respective methods and processes described above, such as the model distillation method. For example, in some embodiments, the model distillation method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the model distillation method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the model distillation method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
Artificial intelligence is the discipline that studies how to make computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, planning, etc.); it includes technologies at both the hardware level and the software level. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning technology, big data processing technology, knowledge graph technology, and the like.
Cloud computing (cloud computing) refers to a technology system that accesses a flexibly extensible shared physical or virtual resource pool through a network, where resources may include servers, operating systems, networks, software, applications, storage devices, and the like, and may be deployed and managed in a self-service manner as needed. Through the cloud computing technology, high-efficiency and strong data processing capacity can be provided for technical application and model training of artificial intelligence, block chains and the like.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (15)

1. A model distillation method comprising:
respectively inputting the sample images into a teacher model and a student model to obtain a first sample characteristic and a second sample characteristic;
respectively carrying out physical transformation on the first sample characteristic and the second sample characteristic to obtain a first disturbance characteristic and a second disturbance characteristic;
determining a distillation loss from the first sample characteristic, the second sample characteristic, the first perturbation characteristic, and the second perturbation characteristic;
training the student model according to the distillation loss.
2. The method of claim 1, wherein the physical transformation comprises at least one of a scaling transformation, a rotation transformation, and a gaussian transformation, and the physical transformation performed on the first sample feature and the second sample feature is the same.
3. The method of claim 2, wherein the physically transforming the first and second sample features to obtain first and second perturbation features comprises:
respectively carrying out scaling transformation on the first sample characteristic and the second sample characteristic to obtain a first sub-disturbance characteristic and a second sub-disturbance characteristic;
respectively carrying out rotation transformation on the first sample characteristic and the second sample characteristic to obtain a third sub-disturbance characteristic and a fourth sub-disturbance characteristic;
respectively carrying out Gaussian transformation on the first sample characteristic and the second sample characteristic to obtain a fifth sub-disturbance characteristic and a sixth sub-disturbance characteristic;
taking at least one of the first sub-perturbation characteristic, the third sub-perturbation characteristic and the fifth sub-perturbation characteristic as a first perturbation characteristic;
and taking at least one of the second sub-perturbation characteristic, the fourth sub-perturbation characteristic and the sixth sub-perturbation characteristic as a second perturbation characteristic.
4. The method of claim 3, wherein the scaling the first sample feature and the second sample feature to obtain a first sub-perturbation feature and a second sub-perturbation feature respectively comprises:
carrying out scaling transformation and translation transformation on the first sample characteristic to obtain a first sub-disturbance characteristic;
and carrying out scaling transformation and translation transformation on the second sample characteristic to obtain a second sub-disturbance characteristic.
5. The method of claim 2, wherein the physically transforming the first and second sample features to obtain first and second perturbation features comprises:
carrying out scaling transformation and/or rotation transformation on the first sample characteristic to obtain a first intermediate characteristic;
carrying out scaling transformation and/or rotation transformation on the second sample characteristic to obtain a second intermediate characteristic;
and respectively carrying out Gaussian transformation on the first intermediate feature and the second intermediate feature to obtain a first disturbance feature and a second disturbance feature.
6. The method of claim 1, wherein the determining a distillation loss from the first sample characteristic, the second sample characteristic, the first perturbation characteristic, and the second perturbation characteristic comprises:
determining a first loss from the first sample characteristic and the second sample characteristic;
determining a second loss according to the first perturbation characteristic and the second perturbation characteristic;
determining a distillation loss based on the first loss and the second loss.
7. A model distillation apparatus comprising:
the sample characteristic determining module is used for respectively inputting the sample images into the teacher model and the student model to obtain a first sample characteristic and a second sample characteristic;
the disturbance characteristic determination module is used for respectively carrying out physical transformation on the first sample characteristic and the second sample characteristic to obtain a first disturbance characteristic and a second disturbance characteristic;
a distillation loss determination module for determining a distillation loss based on the first sample characteristic, the second sample characteristic, the first perturbation characteristic, and the second perturbation characteristic;
and the training module is used for training the student model according to the distillation loss.
8. The apparatus of claim 7, wherein the physical transformation comprises at least one of a scaling transformation, a rotation transformation, and a gaussian transformation, and the physical transformation performed on the first sample feature and the second sample feature is the same.
9. The apparatus of claim 8, wherein the disturbance characteristic determination module comprises:
the scaling transformation unit is used for respectively carrying out scaling transformation on the first sample characteristic and the second sample characteristic to obtain a first sub-disturbance characteristic and a second sub-disturbance characteristic;
the rotation transformation unit is used for respectively carrying out rotation transformation on the first sample characteristic and the second sample characteristic to obtain a third sub-disturbance characteristic and a fourth sub-disturbance characteristic;
the Gaussian transformation unit is used for respectively carrying out Gaussian transformation on the first sample characteristic and the second sample characteristic to obtain a fifth sub-disturbance characteristic and a sixth sub-disturbance characteristic;
a first disturbance characteristic determination unit, configured to use at least one of the first sub-disturbance characteristic, the third sub-disturbance characteristic and the fifth sub-disturbance characteristic as a first disturbance characteristic;
and a second disturbance characteristic determination unit, configured to use at least one of the second sub-disturbance characteristic, the fourth sub-disturbance characteristic and the sixth sub-disturbance characteristic as a second disturbance characteristic.
10. The apparatus of claim 9, wherein the scaling transformation unit is specifically configured to:
carrying out scaling transformation and translation transformation on the first sample characteristic to obtain a first sub-disturbance characteristic;
and carrying out scaling transformation and translation transformation on the second sample characteristic to obtain a second sub-disturbance characteristic.
11. The apparatus of claim 8, wherein the disturbance characteristic determination module is specifically configured to:
carrying out scaling transformation and/or rotation transformation on the first sample characteristic to obtain a first intermediate characteristic;
carrying out scaling transformation and/or rotation transformation on the second sample characteristic to obtain a second intermediate characteristic;
and respectively carrying out Gaussian transformation on the first intermediate characteristic and the second intermediate characteristic to obtain a first disturbance characteristic and a second disturbance characteristic.
12. The apparatus of claim 7, wherein the distillation loss determination module is specifically configured to:
determining a first loss from the first sample characteristic and the second sample characteristic;
determining a second loss according to the first perturbation characteristic and the second perturbation characteristic;
determining a distillation loss based on the first loss and the second loss.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the model distillation method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the model distillation method of any of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the model distillation method according to any one of claims 1-6.
CN202210101415.2A 2022-01-27 2022-01-27 Model distillation method, device, equipment and storage medium Pending CN114282690A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210101415.2A CN114282690A (en) 2022-01-27 2022-01-27 Model distillation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210101415.2A CN114282690A (en) 2022-01-27 2022-01-27 Model distillation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114282690A true CN114282690A (en) 2022-04-05

Family

ID=80881732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210101415.2A Pending CN114282690A (en) 2022-01-27 2022-01-27 Model distillation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114282690A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019204824A1 (en) * 2018-04-20 2019-10-24 XNOR.ai, Inc. Improving image classification through label progression
US20200364502A1 (en) * 2018-05-29 2020-11-19 Tencent Technology (Shenzhen) Company Limited Model training method, storage medium, and computer device
CA3076424A1 (en) * 2019-03-22 2020-09-22 Royal Bank Of Canada System and method for knowledge distillation between neural networks
US20210082399A1 (en) * 2019-09-13 2021-03-18 International Business Machines Corporation Aligning spike timing of models
US20210150340A1 (en) * 2019-11-18 2021-05-20 Salesforce.Com, Inc. Systems and Methods for Distilled BERT-Based Training Model for Text Classification
CN111242303A (en) * 2020-01-14 2020-06-05 北京市商汤科技开发有限公司 Network training method and device, and image processing method and device
CN112183492A (en) * 2020-11-05 2021-01-05 厦门市美亚柏科信息股份有限公司 Face model precision correction method, device and storage medium
CN113344213A (en) * 2021-05-25 2021-09-03 北京百度网讯科技有限公司 Knowledge distillation method, knowledge distillation device, electronic equipment and computer readable storage medium
CN113963176A (en) * 2021-10-28 2022-01-21 北京百度网讯科技有限公司 Model distillation method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LU HAIWEI; YUAN XIAOTONG: "Dynamic Network Structured Pruning Based on Layer Fusion Feature Coefficients", Pattern Recognition and Artificial Intelligence, no. 11, 15 November 2019 (2019-11-15) *
BAI MOYU; LIU HAO; CHEN HAOCHUAN; ZHANG ZHENHUA: "Deep Neural Network Beamforming Algorithm Using Knowledge Distillation", Journal of Telemetry, Tracking and Command, no. 01, 15 January 2020 (2020-01-15) *

Similar Documents

Publication Publication Date Title
CN113343803A (en) Model training method, device, equipment and storage medium
CN114202076B (en) Training method of deep learning model, natural language processing method and device
CN114187459A (en) Training method and device of target detection model, electronic equipment and storage medium
US20230013796A1 (en) Method and apparatus for acquiring pre-trained model, electronic device and storage medium
CN112580733A (en) Method, device and equipment for training classification model and storage medium
JP7446359B2 (en) Traffic data prediction method, traffic data prediction device, electronic equipment, storage medium, computer program product and computer program
CN113656590A (en) Industry map construction method and device, electronic equipment and storage medium
CN112528995A (en) Method for training target detection model, target detection method and device
CN112949433B (en) Method, device and equipment for generating video classification model and storage medium
CN114581732A (en) Image processing and model training method, device, equipment and storage medium
CN115840867A (en) Generation method and device of mathematical problem solving model, electronic equipment and storage medium
CN114120454A (en) Training method and device of living body detection model, electronic equipment and storage medium
CN113344213A (en) Knowledge distillation method, knowledge distillation device, electronic equipment and computer readable storage medium
CN113641829A (en) Method and device for training neural network of graph and complementing knowledge graph
CN117351299A (en) Image generation and model training method, device, equipment and storage medium
CN114743586B (en) Mirror image storage implementation method and device of storage model and storage medium
US20230111511A1 (en) Intersection vertex height value acquisition method and apparatus, electronic device and storage medium
CN113408304B (en) Text translation method and device, electronic equipment and storage medium
CN113361621B (en) Method and device for training model
CN114817476A (en) Language model training method and device, electronic equipment and storage medium
CN114282690A (en) Model distillation method, device, equipment and storage medium
CN115292467A (en) Information processing and model training method, apparatus, device, medium, and program product
CN114385829A (en) Knowledge graph creating method, device, equipment and storage medium
CN113361575A (en) Model training method and device and electronic equipment
CN113641724A (en) Knowledge tag mining method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination