CN112819050B - Knowledge distillation and image processing method, apparatus, electronic device and storage medium - Google Patents

Knowledge distillation and image processing method, apparatus, electronic device and storage medium

Info

Publication number
CN112819050B
CN112819050B (application number CN202110090849.2A)
Authority
CN
China
Prior art keywords
feature
output
model
channels
image processing
Prior art date
Legal status
Active
Application number
CN202110090849.2A
Other languages
Chinese (zh)
Other versions
CN112819050A (en)
Inventor
高梦雅
王宇杰
李全全
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN202110090849.2A priority Critical patent/CN112819050B/en
Publication of CN112819050A publication Critical patent/CN112819050A/en
Priority to PCT/CN2021/130895 priority patent/WO2022156331A1/en
Application granted granted Critical
Publication of CN112819050B publication Critical patent/CN112819050B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74: Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/75: Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V 10/757: Matching configurations of points or features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The application provides a knowledge distillation method, an image processing method, corresponding apparatuses, an electronic device, and a storage medium. The method may include: performing image processing on an image data set with a student model and a teacher model respectively to obtain a first output feature and a second output feature; and determining, based on these, the correspondence between the channel numbers of matched feature map pairs, i.e., pairs matched between the feature maps of the channels included in the first output feature and the feature maps of the channels included in the second output feature. The student model is then trained: in each training round, a feature alignment operation is performed on the output features of the student model and the teacher model using the determined correspondence, and knowledge distillation is carried out on the feature-aligned output features.

Description

Knowledge distillation and image processing method, apparatus, electronic device and storage medium
Technical Field
The present application relates to computer technology, and more particularly, to knowledge distillation and image processing methods, apparatuses, electronic devices, and storage media.
Background
Currently, neural network models are evolving rapidly. For example, in image processing tasks, operations such as image classification, object detection, and semantic segmentation may be implemented using deep convolutional neural network models such as R-CNN (Region-based Convolutional Neural Network) and Fast R-CNN.
However, as tasks become more complex and the requirements on processing results rise, the structure of the neural network model grows more complex and occupies more space. This can consume significant computing resources and storage, and may even make the model unusable on devices such as cell phones.
Therefore, a model compression method is needed that lets a student model with a simple structure learn from a teacher model with a complex structure, bringing the student model's results as close to the teacher model's as possible.
Disclosure of Invention
Accordingly, the present application discloses at least one knowledge distillation method, which comprises:
performing image processing on an image data set with a student model and a teacher model respectively to obtain a first output feature and a second output feature;
determining, based on the first output feature and the second output feature, the correspondence between the channel numbers of matched feature map pairs between the feature maps of the channels included in the first output feature and the feature maps of the channels included in the second output feature;
training the student model, where in each training round: image processing is performed on an input sample image with the student model and the teacher model respectively to obtain a third output feature and a fourth output feature; an error between the third output feature and the real feature corresponding to the sample image is determined; a feature alignment operation is performed using the determined correspondence, so that the feature maps at the same channel number in the third output feature and in the fourth output feature match each other; a gap between the feature-aligned third output feature and fourth output feature is then determined; and model parameters of the student model are updated based on the error and the gap.
The application also discloses an image processing method, which comprises the following steps:
acquiring a target image;
performing image processing on the target image by using an image processing model to obtain an image processing result;
the image processing model comprises a model which is obtained by training according to the knowledge distillation method shown in any embodiment.
The application also discloses a knowledge distillation device, which comprises:
the image processing module is used for performing image processing on the image data set by using the student model and the teacher model respectively to obtain a first output characteristic and a second output characteristic;
the correspondence determining module is configured to determine, based on the first output feature and the second output feature, the correspondence between the channel numbers of matched feature map pairs between the feature maps of the channels included in the first output feature and the feature maps of the channels included in the second output feature;
the training module is configured to train the student model, where in each training round: image processing is performed on an input sample image with the student model and the teacher model respectively to obtain a third output feature and a fourth output feature; an error between the third output feature and the real feature corresponding to the sample image is determined; a feature alignment operation is performed using the determined correspondence so that the feature maps at the same channel number in the third output feature and in the fourth output feature match each other; a gap between the feature-aligned third and fourth output features is determined; and model parameters of the student model are updated based on the error and the gap.
The application also discloses an image processing device, which comprises:
the acquisition module is used for acquiring a target image;
the image processing module is used for performing image processing on the target image by utilizing the image processing model to obtain an image processing result;
the image processing model comprises a model which is obtained by training according to the knowledge distillation method shown in any embodiment.
The application also discloses an electronic device, which comprises:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to invoke the executable instructions stored in the memory to implement the knowledge distillation method or the image processing method.
The application also discloses a computer readable storage medium storing a computer program for executing the aforementioned knowledge distillation method or image processing method.
In the present application, before the gap between the output features of the student model and the teacher model is determined, a feature alignment operation is performed so that the feature maps at the same channel number in the student's output features and in the teacher's output features match each other, i.e., carry the same or similar interpretable meaning. When the gap is then determined, errors caused by mismatched feature maps are reduced, the determined gap is more faithful and accurate, model convergence becomes easier, the output features of the student model approach those of the teacher model more readily, and training efficiency improves.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
In order to more clearly illustrate one or more embodiments of the present application or the technical solutions in the related art, the drawings that are required for the description of the embodiments or the related art will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments described in one or more embodiments of the present application, and other drawings may be obtained according to the drawings without inventive effort to those of ordinary skill in the art.
FIG. 1 is a flow chart of a model training method of the present application;
FIG. 2 is a method flow diagram of a model training method of the present application;
FIG. 3 is a schematic flow chart of a model training according to the present application;
FIG. 4 is a schematic diagram of a transformation matrix according to the present application;
FIG. 5 is a flow chart of a feature alignment method of the present application;
FIG. 6 is a schematic diagram of a knowledge distillation apparatus according to the present application;
fig. 7 is a schematic diagram of a hardware structure of an electronic device according to the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items. It will also be appreciated that the term "if," as used herein, may be interpreted as "when," "upon," or "in response to determining," depending on the context.
Before describing the embodiment of the application, a method for achieving model compression through model training in the related art is described. The image processing task will be described below as an example.
Referring to fig. 1, fig. 1 is a schematic flow chart of a model training method according to the present application. It should be noted that the flow description shown in fig. 1 is only a schematic description of the flow of the model training method, and fine adjustment may be performed in practical applications.
As shown in fig. 1, model training generally begins with S102: preparing a training sample set.
In the field of image classification, the training sample set described above may typically be a collection of images annotated with the classification types of the objects appearing in them. In preparing the training sample set, the original images are usually labeled with ground-truth values by manual annotation or machine-assisted annotation. For example, after an original image is acquired, the classification type of each object appearing in it (for example, whether the object is a person, an automobile, or a tree) may be annotated using image labeling software, yielding a number of training samples. When feature-encoding the training samples, one-hot encoding or the like may be used; the specific encoding method is not limited by the present application.
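As a minimal illustration of one-hot encoding (the three-class scheme below is a hypothetical placeholder, not taken from the application):

```python
import numpy as np

# Hypothetical label scheme: 0 = person, 1 = automobile, 2 = tree.
NUM_CLASSES = 3

def one_hot(label: int, num_classes: int = NUM_CLASSES) -> np.ndarray:
    """Encode an integer class label as a one-hot vector."""
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[label] = 1.0
    return vec

print(one_hot(1))  # [0. 1. 0.] -> the "automobile" class
```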
And after the training sample set is obtained, performing model training on the student model by using the training sample set.
In each training round, S104 may be executed first: the same training sample is fed to both the student model and the teacher model for forward propagation, obtaining the output features corresponding to each of the two models.
The model complexity of the student model may be lower than that of the teacher model. The student model and the teacher model can be any type of model; the purpose of model training is to let the student model learn from the teacher model so that the student model's output approaches the teacher model's, thereby compressing the model.
The teacher model may be a pre-trained model. It is to be understood that the training sample set used for the pre-training may be the same as or different from the sample set constructed in S102 above, which is not limited herein.
After the output features are obtained, S106 may be executed to determine the gap between the output features of the two models.
In some examples, the gap may be obtained using a predetermined gap function. In the present application, the structure of the above-described gap function is not particularly limited. In some examples, the gap function may be determined with reference to a common knowledge distillation function.
The above-mentioned knowledge distillation function specifically includes a loss function used in the knowledge distillation algorithm. For example, the loss function may be a cross entropy loss function, an exponential loss function, or the like.
After the output characteristics are obtained, S108 may be executed to determine a loss error based on the output characteristics corresponding to the student model.
In some examples, a predetermined loss function may be used to determine an error between the output feature corresponding to the student model and the real feature corresponding to the training sample. In the present application, the structure of the loss function is not particularly limited. In some examples, the loss function may be determined with reference to a common knowledge distillation function.
After determining the loss error and the gap, S110 may be performed to update the model parameters of the student model based on the result of the weighted summation of the loss error and the gap, completing one round of model training.
In this step, the loss may be determined as the weighted sum of the loss error and the gap, and a gradient descent method may then be used to back-propagate through the student model and update its model parameters.
The backpropagation-based update may use stochastic gradient descent (SGD), batch gradient descent (BGD), or mini-batch gradient descent (MBGD), which is not particularly limited herein.
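A hedged sketch of S104-S110 follows; the weighting factor alpha and the concrete loss choices (cross entropy for the loss error, MSE for the gap) are illustrative assumptions, not prescribed by the application:

```python
import torch
import torch.nn.functional as F

def training_step(student, teacher, images, labels, optimizer, alpha=0.5):
    """One round of related-art distillation: forward both models,
    weight-sum the loss error and the gap, update the student."""
    student_out = student(images)               # S104: student forward pass
    with torch.no_grad():
        teacher_out = teacher(images)           # teacher is not updated

    loss_error = F.cross_entropy(student_out, labels)  # S108: error vs. ground truth
    gap = F.mse_loss(student_out, teacher_out)         # S106: gap between models

    loss = loss_error + alpha * gap             # S110: weighted summation
    optimizer.zero_grad()
    loss.backward()                             # backpropagate through the student
    optimizer.step()
    return loss.item()
```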
After performing the training once, the above-mentioned S104-S110 may be repeatedly performed until the above-mentioned model converges.
The above is the model compression method commonly used in the related art. In practical applications it still has problems: the model converges slowly, and the output features of the student model and the teacher model are difficult to bring sufficiently close together.
In view of this, the present application proposes a knowledge distillation method. In this method, before the gap between the output features of the student model and the teacher model is determined, a feature alignment operation is performed so that the feature maps at the same channel number in the student's output features and in the teacher's output features match each other and carry the same or similar interpretable meanings. When the gap is then determined, errors caused by mismatched feature maps are reduced, the determined gap is more faithful and accurate, model convergence becomes easier, the output features of the student model approach those of the teacher model more readily, and training efficiency improves.
Referring to fig. 2, fig. 2 is a flowchart of a model training method according to the present application.
The model training method shown in fig. 2 can be applied to an electronic device. The electronic device may execute the model training method by running a software system corresponding to it. In the embodiments of the present application, the electronic device may be a notebook computer, a server, a mobile phone, a tablet (PAD) terminal, etc., which is not particularly limited.
It can be understood that the above model training method may be performed solely by the terminal device or the server device, or may be performed by the terminal device and the server device in cooperation.
For example, the model training method described above may be integrated with the client. After receiving the model training request, the terminal equipment carrying the client can execute the model training method by providing computing power through the hardware environment.
For another example, the model training method described above may be integrated into a system platform. After receiving the model training request, the server device carrying the system platform can execute the model training method by providing calculation power through the hardware environment.
Also for example, the above model training method can be divided into two tasks: constructing a training sample set, and performing model training based on that set. The sample-construction task can be integrated into the client and run on the terminal device; the model training task can be integrated into the server side and run on the server device. After constructing the training sample set, the terminal device may initiate a model training request to the server device, which, upon receiving the request, trains the model based on the training sample set.
Hereinafter, an execution subject will be described as an example of an electronic device (hereinafter, referred to as a device).
As shown in fig. 2, the method may include:
s202, performing image processing on the image data set by using the student model and the teacher model respectively to obtain a first output characteristic and a second output characteristic.
The student model and the teacher model may be any type of model. For example, in an object detection task, the student model and the teacher model may be models such as R-CNN and Fast R-CNN. In an instance segmentation task, they may be Mask R-CNN models. It should be noted that the present application describes the model training method with an image processing task as the example; in practice, the method can also be applied to tasks such as word processing and speech processing, which are not described in detail here.
The first output feature is specifically an output feature obtained by processing the image dataset through the student model. The second output feature is specifically an output feature obtained by processing the image data set through the teacher model.
In some examples, the first output feature and the second output feature may each include a multi-channel feature map, where the feature map of each channel characterizes the image along one interpretable dimension. For example, some feature maps may characterize the texture features an image has; others may characterize its contour features.
In some examples, when executing S202, on the one hand the student model may be used to perform image processing on at least part of the images in the image data set, obtaining the output features corresponding to those images. Pixel values at the same positions in these per-image output features are then weighted and summed to obtain the first output feature.
On the other hand, the teacher model may be used to process the same images, obtaining their corresponding output features; pixel values at the same positions in those output features are then weighted and summed to obtain the second output feature.
It will be appreciated that, after obtaining the output features corresponding to the partial images, a maximum value or a minimum value may be selected from the output features to obtain the first output feature, which will not be described in detail herein.
Obtaining the first output feature by averaging the per-image output features yields a relatively faithful, balanced picture of how the student model processes the images in the image data set, which in turn safeguards the training effect.
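A minimal sketch of this averaging, assuming the models expose their multi-channel feature maps directly and that equal weights are used (the equal-weight case of the weighted summation above):

```python
import torch

@torch.no_grad()
def dataset_output_feature(model, images: torch.Tensor) -> torch.Tensor:
    """Average a model's per-image output feature maps over part of the dataset.

    images: (N, 3, H, W) batch drawn from the image data set.
    Returns a (C, H', W') tensor: the first (or second) output feature.
    """
    feats = model(images)        # (N, C, H', W') multi-channel feature maps
    return feats.mean(dim=0)     # equal-weight combination over the same positions
```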
After the first output feature and the second output feature are determined, S204 may be executed to determine, based on them, the correspondence between the channel numbers of matched feature map pairs between the feature maps of the channels included in the first output feature and the feature maps of the channels included in the second output feature.
The pair of feature maps specifically refers to a matched pair of feature maps. For example, if the feature map a included in the first output feature matches the feature map B included in the second output feature, the feature map a and the feature map B form a pair of feature map pairs.
When determining feature map pairs, the feature maps of the channels included in the first output feature may be taken in turn as the current feature map, and the following steps performed: vectorize the current feature map and each candidate feature map among the channels included in the second output feature, obtaining a first vector and second vectors; compute a similarity score between the first vector and each second vector; and take the feature map corresponding to the second vector with the highest similarity score, together with the current feature map, as a matched pair. The similarity may be computed by, for example, Euclidean distance or cosine distance, which is not limited herein.
In some examples, when determining the pair of feature maps, the feature maps of the channels included in the second output feature may be sequentially used as the current feature maps, and a method similar to the foregoing steps is performed, and specific processes are not described in detail herein.
The correspondence specifically refers to the correspondence between the channel numbers at which the feature map pairs sit. For example, if feature map A in the 5th channel of the first output feature matches feature map B in the 3rd channel of the second output feature, the correspondence may be recorded as a mapping between 1-5 and 2-3, where 1-5 denotes the 5th channel of the first output feature and 2-3 denotes the 3rd channel of the second output feature. It will be appreciated that other ways of maintaining the correspondence may also be used.
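As a sketch of the scoring step, using cosine similarity (one of the measures the text allows) over vectorized channel feature maps; computing all pairs at once is an implementation convenience, not a requirement of the method:

```python
import torch
import torch.nn.functional as F

def channel_similarity(first_feat: torch.Tensor, second_feat: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between every channel pair of two output features.

    first_feat:  (C, H, W) first output feature (student side)
    second_feat: (C, H, W) second output feature (teacher side)
    Entry (i, j) of the returned (C, C) matrix scores student channel i
    against teacher channel j after vectorizing each channel's feature map.
    """
    a = F.normalize(first_feat.flatten(1), dim=1)   # (C, H*W) unit vectors
    b = F.normalize(second_feat.flatten(1), dim=1)
    return a @ b.t()                                # (C, C) similarity scores
```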
After the correspondence is determined, S206 may be executed to train the student model. In each training round: image processing is performed on an input sample image with the student model and the teacher model respectively to obtain a third output feature and a fourth output feature; an error between the third output feature and the real feature corresponding to the sample image is determined; a feature alignment operation is performed using the determined correspondence so that the feature maps at the same channel number in the third output feature and in the fourth output feature match each other; a gap between the feature-aligned third and fourth output features is then determined; and model parameters of the student model are updated based on the error and the gap.
The above-mentioned real features are in particular features for determining loss errors. In some examples, the real features may be obtained from the pre-trained student model. For example, in an image classification task, the student model may be an image classification model. At this time, the image classification model may be pre-trained using training samples. After the pre-training is completed, the sample image marked with the real classification can be input into the pre-trained student model for forward propagation, and then the output characteristics of the student model are used as the real characteristics of the sample image. In some examples, the real features may also be features derived by algorithms such as spatial geometrical constraints, using known real features of images preceding the sample image. For example, the sample image may be an image of a sequence of images. It will be appreciated that the sample images in the image sequence are typically successive images whose objects appearing in the images satisfy the spatial geometrical constraint, so that the true characteristics of the sample image can be deduced from the images preceding the sample image.
The error may specifically be a loss error between the third output feature and the real feature corresponding to the sample image. In some examples, the error may be determined using a pre-constructed loss function (e.g., a cross entropy loss function).
The feature alignment operation makes the feature maps at the same channel number in the third output feature and in the fourth output feature match each other.
In practical applications, the third output feature or the fourth output feature may be subjected to feature transformation based on the correspondence relationship, so as to complete the feature alignment operation.
For example, when the correspondence relationship includes a correspondence relationship between the number of channels in which the pairs of feature maps of the channels included in the output features of the student model and the teacher model are located, the position of the feature maps of the channels of the third output feature may be adjusted according to the correspondence relationship so that the feature maps of the channels included in the third output feature and the feature maps of the channels included in the fourth output feature are matched.
The gap is specifically the difference between the feature-aligned third output feature and fourth output feature. In some examples, the gap may be determined using a pre-constructed gap function (e.g., a cross-entropy loss function). It can be understood that, because the feature alignment operation is performed before the gap is determined, errors caused by mismatched feature maps are reduced when the gap is computed; the determined gap is more faithful and accurate, model convergence becomes easier, the output features of the student model approach those of the teacher model more readily, and training efficiency improves.
Referring to fig. 3, fig. 3 is a schematic flow chart of model training according to the present application.
As shown in fig. 3, in each training round, S2062 may be performed first, and a sample image may be input into the student model and the teacher model to obtain a third output characteristic output by the student model and a fourth output characteristic output by the teacher model.
Then, S2064 may be performed to determine an error between the third output feature and the real feature corresponding to the sample image based on a preset loss function.
In contrast with the related art, S2066 may be performed before determining the gap: an alignment operation is performed so that the feature maps at the same channel number in the third output feature and in the fourth output feature match each other.
Thereafter, step S2068 is performed to determine a gap between the third output feature and the fourth output feature after the feature alignment.
Finally, updating model parameters of the student model based on the error and the gap by using a back propagation method.
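A hedged sketch of one training round (S2062-S2068) follows. The use of MSE for both the error and the gap, the weighting factor alpha, and the index-based alignment are illustrative assumptions; perm[i] denotes the teacher channel matched to student channel i under the maintained correspondence:

```python
import torch
import torch.nn.functional as F

def distillation_round(student, teacher, image, real_feature, perm, optimizer, alpha=0.5):
    """One training round with feature alignment before computing the gap.

    perm: LongTensor of length C (C = number of channels).
    """
    third = student(image)                    # S2062: third output feature
    with torch.no_grad():
        fourth = teacher(image)               # S2062: fourth output feature

    error = F.mse_loss(third, real_feature)   # S2064: error vs. real feature

    aligned_fourth = fourth[:, perm]          # S2066: reorder teacher channels
    gap = F.mse_loss(third, aligned_fourth)   # S2068: gap after alignment

    loss = error + alpha * gap                # weighted summation
    optimizer.zero_grad()
    loss.backward()                           # update the student only
    optimizer.step()
    return loss.item()
```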
In the above scheme, before the gap between the output features of the student model and the teacher model is determined, the feature alignment operation is performed so that the feature maps at the same channel number in the two output features match each other and carry the same or similar interpretable meaning. When the gap is then determined, errors caused by mismatched feature maps are reduced, the determined gap is more faithful and accurate, model convergence becomes easier, the output features of the student model approach those of the teacher model more readily, and training efficiency improves.
The following description of the embodiments is made in connection with a scenario in which model compression is performed using a knowledge distillation algorithm.
In this case, the student model may be a compressed student model, and the teacher model may be a teacher model before compression.
In some examples, the student model and the teacher model may be pre-trained by a training sample set prior to performing the step S202. The pre-training process is not described in detail here.
Here, a pre-trained student model and a teacher model may be acquired.
In some examples, the initialization parameters of the student model may be recorded prior to pre-training the student model. The initialization parameters may include model parameters that the model includes before pre-training.
Here, model parameters of the student model prior to pre-training may be recorded. Therefore, when the student model is subsequently trained, the recorded initialization parameters can be used for initializing the student model, and then the model is trained, so that the model change trend of the student model in the subsequent training process (in the learning process) can be ensured to be the same as that of the pre-training process, and the learning effect of the student model is improved by effectively utilizing the information contained in the initialization parameters of the student model.
After the pre-training is completed, the pre-trained student model and the pre-trained teacher model can be used for performing image processing on the image data set to obtain a first output feature and a second output feature.
After the first output feature and the second output feature are obtained, a bipartite graph matching algorithm or a greedy matching algorithm may be used to determine, based on them, the correspondence between the channel numbers of matched feature map pairs between the feature maps of the channels included in the first output feature and the feature maps of the channels included in the second output feature.
Because both the bipartite graph matching algorithm and the greedy algorithm can determine the matched feature map pairs between the feature maps of the channels included in the first output feature and those included in the second output feature, the correspondence can be determined accurately with either algorithm.
In some examples, when determining the correspondence with a bipartite graph matching algorithm, the feature maps of the channels included in the first output feature may be taken in turn as the current feature map, and the following steps performed:
First, according to the maintained correspondence, delete from the feature maps of the channels included in the second output feature those already matched to feature maps of the first output feature.
Then, among the feature maps of the remaining channels of the second output feature, determine the feature map matching the current feature map, and record the sub-correspondence between the channel number of the current feature map and that of the matched feature map.
After matching has been completed for the feature maps of all channels of the first output feature, the correspondence is determined based on the recorded sub-correspondences.
Here, by the bipartite graph matching algorithm, a feature graph pair that matches between a feature graph included in the output features of the student model and a feature graph included in the output features of the teacher model can be determined. And then, based on the determined feature map pairs, determining the corresponding relation between the channel numbers of the two feature maps included in each feature map pair.
In some examples, when determining the correspondence with a greedy matching algorithm, the feature maps of the channels included in the first output feature are taken in turn as the current feature map, and the following steps performed: among the feature maps of the channels included in the second output feature, determine the feature map matching the current feature map; record the sub-correspondence between the channel number of the current feature map and that of the matched feature map.
After matching has been completed for the feature maps of all channels of the first output feature, the correspondence is determined based on the recorded sub-correspondences.
Here, by a greedy matching algorithm, pairs of feature maps that match between feature maps included in the output features of the student model and feature maps included in the output features of the teacher model may be determined. And then, based on the determined feature map pairs, determining the corresponding relation between the channel numbers of the two feature maps included in each feature map pair.
It should be noted that, when determining the feature map pairs, algorithms other than the bipartite graph matching algorithm and the greedy algorithm are also within the scope of the present application.
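A sketch of both variants over the (C, C) similarity scores from channel_similarity above; these loops are one reading of the procedures described, and a globally optimal bipartite matching could equally be obtained with, e.g., the Hungarian algorithm (scipy.optimize.linear_sum_assignment):

```python
import numpy as np

def bipartite_style_correspondence(scores: np.ndarray) -> dict:
    """Deletion-based sequential matching: each student channel takes the
    best-scoring teacher channel not already matched (then deleted)."""
    remaining = set(range(scores.shape[1]))
    mapping = {}
    for i in range(scores.shape[0]):
        j = max(remaining, key=lambda k: scores[i, k])
        mapping[i] = j           # sub-correspondence: student ch i <-> teacher ch j
        remaining.remove(j)      # delete the matched feature map
    return mapping

def greedy_correspondence(scores: np.ndarray) -> dict:
    """Greedy variant: each student channel takes its best-scoring teacher
    channel, with no removal of previously matched channels."""
    return {i: int(scores[i].argmax()) for i in range(scores.shape[0])}
```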
In some examples, to facilitate recording the correspondence, a transformation matrix may be generated based on the determined correspondence.
The conversion matrix is used for representing the corresponding relation between the matched characteristic diagram pairs of the characteristic diagrams of all channels included in the first output characteristic and the characteristic diagrams of all channels included in the second output characteristic.
In some examples, the conversion matrix may be a 0-1 matrix for convenience in performing feature alignment operations.
Referring to fig. 4, fig. 4 is a schematic diagram of a transformation matrix according to the present application.
The conversion matrix shown in fig. 4 characterizes the correspondence between the channel numbers of matched feature map pairs between the feature maps of the channels included in the first output feature and those included in the second output feature. The row index of an element denotes the channel number of a feature map in the second output feature, the column index denotes the channel number of a feature map in the first output feature, and the element's value indicates whether the two feature maps match, e.g., 0 for a mismatch and 1 for a match.
As shown in fig. 4, the 3 rd element in the first row is 1, which may indicate that the feature map of the 3 rd channel in the first output feature matches the feature map of the 1 st channel in the second output feature. As shown in fig. 4, the 2 nd element in the second row is 1, which may indicate that the feature map of the 2 nd channel in the first output feature matches the feature map of the 2 nd channel in the second output feature. Similarly, if the letter M is used to represent the first output feature, the letter N is used to represent the second output feature, and the letter M1 is used to represent the feature map of the 1 st channel of the first output feature, the transformation matrix shown in fig. 4 represents that M3 matches N1, M2 matches N2, M4 matches N3, M5 matches N4, and M1 matches N5.
On the one hand, the corresponding relation can be conveniently recorded through the conversion matrix. On the other hand, the subsequent feature alignment can be facilitated through the transformation matrix.
It can be understood that the number of rows of the conversion matrix may also represent the number of channels in which the feature map included in the first output feature is located, and the number of columns represents the number of channels in which the feature map included in the second output feature is located.
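A minimal sketch of building the 0-1 conversion matrix of fig. 4 from a recorded correspondence (the dict layout and 0-based indexing are assumptions for illustration):

```python
import numpy as np

def conversion_matrix(mapping: dict, num_channels: int) -> np.ndarray:
    """Rows index the second output feature's (teacher's) channels, columns
    index the first output feature's (student's) channels; an entry is 1
    iff the two feature maps at those channel numbers match."""
    P = np.zeros((num_channels, num_channels), dtype=np.float32)
    for student_ch, teacher_ch in mapping.items():
        P[teacher_ch, student_ch] = 1.0
    return P

# The fig. 4 example (M3<->N1, M2<->N2, M4<->N3, M5<->N4, M1<->N5),
# written 0-based: student channel 2 -> teacher channel 0, and so on.
P = conversion_matrix({2: 0, 1: 1, 3: 2, 4: 3, 0: 4}, 5)
```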
After determining the correspondence, S206 may be continued to train the student model.
In some examples, the student model may be initialized with the recorded initialization parameters while the student model is being trained. And then training the initialized student model.
Here, the student model can be initialized by using the recorded initialization parameters, and then model training is performed, so that the model change trend of the student model in the subsequent training process (in the learning process) can be ensured to be the same as that in the pre-training process, and the learning effect of the student model is improved by effectively utilizing the information contained in the initialization parameters of the student model.
In some examples, when the rows of the conversion matrix index the channels of the feature maps included in the second output feature and the columns index those of the first output feature, performing the feature alignment operation with the determined correspondence means converting the fourth output feature with the conversion matrix, so that the feature maps at the same channel number in the third output feature and in the fourth output feature match each other.
For example, the feature maps included in the fourth output feature may be numbered from top to bottom, and a column vector constructed from these numbers. Multiplying the conversion matrix by this column vector yields a result that characterizes the ordering of the feature maps in the feature-aligned fourth output feature; the feature maps included in the fourth output feature are then reordered according to that result, giving the feature-aligned fourth output feature.
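Continuing the sketch with P as built above, the multiplication assigns each teacher channel its destination slot in the aligned feature; a gather with the transposed matrix is the equivalent index-based form (an illustrative reading, not the only possible one):

```python
import numpy as np

numbers = np.arange(5, dtype=np.float32).reshape(-1, 1)  # channel numbers, top to bottom
dest = (P @ numbers).astype(int).ravel()                 # destination slot per teacher channel

fourth = np.random.rand(5, 8, 8).astype(np.float32)      # (C, H, W) fourth output feature
aligned_fourth = np.empty_like(fourth)
aligned_fourth[dest] = fourth                            # scatter channels into aligned order

# Equivalent gather form: aligned channel i is the teacher channel matched
# to student channel i.
perm = (P.T @ numbers).astype(int).ravel()
assert np.array_equal(aligned_fourth, fourth[perm])
```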
In this way, the feature maps at the same channel number in the third output feature and in the feature-aligned fourth output feature match each other, i.e., feature alignment of the fourth output feature with the third output feature is completed. When the gap between them is determined, errors caused by mismatched feature maps are therefore reduced, the determined gap is more faithful and accurate, model convergence becomes easier, the student's output features approach the teacher's more readily, and training efficiency improves.
In some examples, when the rows of the conversion matrix index the channels of the feature maps included in the first output feature and the columns index those of the second output feature, performing the feature alignment operation with the determined correspondence means converting the third output feature with the conversion matrix, so that the feature maps at the same channel number in the third output feature and in the fourth output feature match each other.
After the error between the third output feature and the real feature corresponding to the sample image, and the gap between the feature-aligned third and fourth output features, have been determined, the model parameters of the student model may be updated based on the error and the gap, completing one round of parameter updating. During training of the student model, when the gap between the student's output and the teacher's output is determined, the feature alignment operation is performed first, so that the feature maps at the same channel number in the two output features match each other and carry the same or similar interpretable meaning. When the gap is then determined, errors caused by mismatched feature maps are reduced, the determined gap is more faithful and accurate, model convergence becomes easier, the student's output features approach the teacher's more readily, and both training efficiency and the model compression effect improve.
In some examples, model parameters of the student model may be updated based on the result of the weighted summation of the error and the gap.
The weight of the weighted summation can be set according to the actual situation.
Updating the model parameters of the student model based on the result of the weighted summation of the error and the gap lets training exploit what both quantities characterize, ensuring that the output features of the trained student model stay close to those of the teacher model.
In some examples, to further improve the prediction performance of the student model, correspondences may be determined separately for different classification types according to the image classification type. During training, the correspondence matching the classification type of the input sample image is then selected for feature alignment, improving the student model's prediction performance across classification types.
In some examples, at least part of the images included in the image data set used in S202 may comprise images of multiple classification types.
The classification type may be set according to the actual situation. For example, in an autopilot scenario, the classification types described above may be people, walls, vehicles, and so forth. For another example, in an animal classification scenario, the classification type may include animals such as dogs, cats, pigs, and the like.
At this time, when the output features are subjected to the averaging process to obtain the first output features, the output features corresponding to the images of the respective classification types may be subjected to the averaging process to obtain the first output features corresponding to the respective classification types.
When the output features are subjected to average processing to obtain the second output features, the output features corresponding to the images of the classification types can be subjected to average processing to obtain the second output features corresponding to the classification types.
Here, a first output characteristic of the student model output and a second output characteristic of the teacher model output for the images of the different classification types may be determined.
Then, for each classification type, based on the first output feature and the second output feature corresponding to that type, the correspondence between the channel numbers of matched feature map pairs between the feature maps of the channels included in the first output feature and the feature maps of the channels included in the second output feature is determined.
In this way, the correspondence between the channel numbers of matched feature map pairs between the student's and teacher's output features can be determined per classification type. Because each correspondence is tied to a classification type, errors caused by differences between the output features of images of different classification types can be eliminated, improving the accuracy of the determined correspondences.
Thereafter, at the time of performing the feature alignment operation, the following method may be performed.
Referring to fig. 5, fig. 5 is a flow chart of a feature alignment method according to the present application.
As shown in fig. 5, when performing the feature alignment operation, S502 may be performed first, to determine the classification type corresponding to the sample image.
In some examples, the corresponding classification type may be determined by determining the annotation type of the sample image.
After determining the classification type, S504 may be executed: the feature alignment operation is performed using the correspondence corresponding to the determined classification type, so that the feature maps at the same channel number in the third output feature and in the fourth output feature match each other.
Here, the feature alignment operation may be performed according to the correspondence corresponding to the classification type of the input sample image, so that the accuracy of the feature alignment operation may be improved, thereby improving the training effect of the student model, and further improving the prediction effect of the student model.
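A minimal sketch of S502-S504; the class names and permutation values below are hypothetical placeholders, with each permutation giving, per student channel, the matched teacher channel for that class:

```python
import numpy as np

# Hypothetical per-class correspondences, built offline per S202-S204.
perms_by_class = {
    "person": np.array([4, 1, 0, 2, 3]),
    "car":    np.array([0, 2, 1, 4, 3]),
}

def align_fourth(fourth: np.ndarray, sample_class: str) -> np.ndarray:
    """S502/S504: select the correspondence for the sample's annotated class
    and reorder the teacher's channels before the gap is computed."""
    perm = perms_by_class[sample_class]   # correspondence for this class
    return fourth[perm]                   # (C, H, W) aligned fourth feature
```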
The application further provides an image processing method, which may be applied to any type of electronic device. In this method, image processing is performed with an image processing model (i.e., the student model) trained by the model training method of any of the foregoing embodiments, so that a good prediction effect is achieved with a low-complexity model, raising the image processing rate without reducing prediction quality.
The method may include:
a target image is acquired.
And performing image processing on the target image by using an image processing model to obtain an image processing result. The image processing model comprises a model which is obtained by training according to the knowledge distillation method shown in any embodiment.
The image processing model may be any type of model. For example, the image processing model may be an image classification model, an object detection model, an object tracking model, or the like. The image processing model can be obtained by training the knowledge distillation method shown in any one of the embodiments, so that the model has the characteristics of simple structure and good prediction effect, and further, the image processing rate is improved on the basis of not reducing the prediction effect.
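A minimal inference sketch, assuming the distilled student model takes a single preprocessed image tensor (the preprocessing itself is task-specific and omitted):

```python
import torch

@torch.no_grad()
def process_image(image_processing_model, target_image: torch.Tensor):
    """Acquire a target image and run the distilled model to get the result."""
    image_processing_model.eval()                              # inference mode
    return image_processing_model(target_image.unsqueeze(0))   # add batch dim
```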
Corresponding to any of the above embodiments, the present application also provides a knowledge distillation apparatus.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a knowledge distillation apparatus according to the present application. As shown in fig. 6, the apparatus 600 may include:
the image processing module 610 is configured to perform image processing on the image dataset by using the student model and the teacher model respectively, so as to obtain a first output feature and a second output feature;
a correspondence determining module 620, configured to determine, based on the first output feature and the second output feature, the correspondence between the channel numbers of matched feature map pairs between the feature maps of the channels included in the first output feature and the feature maps of the channels included in the second output feature;
the training module 630 is configured to train the student model, where in each training round: image processing is performed on an input sample image with the student model and the teacher model respectively to obtain a third output feature and a fourth output feature; an error between the third output feature and the real feature corresponding to the sample image is determined; a feature alignment operation is performed using the determined correspondence so that the feature maps at the same channel number in the third output feature and in the fourth output feature match each other; a gap between the feature-aligned third and fourth output features is determined; and model parameters of the student model are updated based on the error and the gap.
In some embodiments shown, the image processing module 610 is specifically configured to:
performing image processing on sample images in the image data set by using a student model to obtain output features respectively corresponding to the sample images;
the pixel values at the same position in the output characteristics respectively corresponding to the sample images are weighted and summed to obtain the first output characteristic;
performing image processing on the sample images by using a teacher model to obtain output features respectively corresponding to the sample images;
and carrying out weighted summation on pixel values at the same positions in the output features respectively corresponding to the sample images to obtain the second output features.
In some embodiments shown, the correspondence determination module 620 is specifically configured to:
and determining the corresponding relation by using a bipartite graph matching algorithm or a greedy matching algorithm.
In some embodiments shown, the correspondence determination module 620 is specifically configured to:
sequentially taking the feature map of each channel included in the first output feature as the current feature map, and executing: determining, among the feature maps of the channels included in the second output feature, the feature map matching the current feature map; and recording a sub-correspondence between the channel number of the current feature map and the channel number of the matched feature map;
after matching has been completed for the feature maps of all channels of the first output feature, determining the correspondence based on the recorded sub-correspondences.
In some embodiments shown, the correspondence determination module 620 is specifically configured to:
sequentially taking the feature map of each channel included in the first output feature as the current feature map, and executing: deleting, from the feature maps of the channels included in the second output feature, the feature maps already determined to match feature maps of the first output feature according to the maintained correspondence; determining, among the feature maps of the remaining channels of the second output feature, the feature map matching the current feature map; and recording a sub-correspondence between the channel number of the current feature map and the channel number of the matched feature map;
after matching has been completed for the feature maps of all channels of the first output feature, determining the correspondence based on the recorded sub-correspondences.
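Both variants can be sketched with a single helper (cosine similarity between flattened feature maps is an assumed matching criterion; `exclusive=True` removes already-matched maps as in the second variant, while a globally optimal one-to-one assignment could instead be computed with scipy.optimize.linear_sum_assignment):

```python
import torch
import torch.nn.functional as F

def match_channels(first, second, exclusive=True):
    # first: (C1, H, W), second: (C2, H, W); returns {first_channel: second_channel}.
    sim = F.cosine_similarity(              # (C1, C2) similarity matrix
        first.flatten(1).unsqueeze(1),      # (C1, 1, H*W)
        second.flatten(1).unsqueeze(0),     # (1, C2, H*W)
        dim=-1,
    )
    correspondence, used = {}, set()
    for i in range(sim.shape[0]):           # each channel of the first feature
        for j in torch.argsort(sim[i], descending=True).tolist():
            if not exclusive or j not in used:   # exclusive: delete matched maps
                correspondence[i] = j
                used.add(j)
                break
    return correspondence
```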
In some embodiments shown, the apparatus further comprises:
the pre-training module is configured to pre-train the student model and the teacher model on a training sample set, before image processing is performed on the image data set using the student model and the teacher model respectively to obtain the first output feature and the second output feature;
The device further comprises:
the recording module is used for recording initialization parameters corresponding to the student model before the pre-training is carried out on the student model;
the training module 630 is specifically configured to:
initializing the student model by using the recorded initialization parameters;
and training the initialized student model.
In some embodiments shown, the apparatus further comprises:
the generation module is configured to generate a conversion matrix based on the determined correspondence, where the conversion matrix indicates, for the feature map of each channel included in the first output feature, the channel number of its matching feature map among the channels of the second output feature, or, conversely, for the feature map of each channel included in the second output feature, the channel number of its matching feature map among the channels of the first output feature.
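A sketch of generating such a matrix as a 0/1 permutation-style matrix (here rows index channels of the second output feature and columns index channels of the first, which is one of the two orientations described; the correspondence format follows the earlier matching sketch):

```python
import torch

def build_conversion_matrix(correspondence, num_first_ch, num_second_ch):
    # M[j, i] = 1 if channel j of the second output feature matches
    # channel i of the first output feature.
    m = torch.zeros(num_second_ch, num_first_ch)
    for i, j in correspondence.items():   # {first_channel: second_channel}
        m[j, i] = 1.0
    return m
```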
In some embodiments shown, the training module described above is specifically configured to:
when the rows of the conversion matrix index the channel numbers of the feature maps included in the second output feature and the columns index the channel numbers of the feature maps included in the first output feature, convert the fourth output feature using the conversion matrix, so that the feature maps of the channels included in the third output feature and the feature maps of the channels included in the fourth output feature match at the same channel numbers;
or,
when the rows of the conversion matrix index the channel numbers of the feature maps included in the first output feature and the columns index the channel numbers of the feature maps included in the second output feature, convert the third output feature using the conversion matrix, so that the feature maps of the channels included in the third output feature and the feature maps of the channels included in the fourth output feature match at the same channel numbers.
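Under the first orientation above, converting the fourth output feature reduces to a channel-dimension multiplication; a sketch (the reshaping strategy is an assumption):

```python
import torch

def convert_fourth_feature(conversion_matrix, fourth):
    # fourth: (N, C_second, H, W); returns (N, C_first, H, W) with the channels
    # reordered so channel i lines up with channel i of the third output feature.
    n, c, h, w = fourth.shape
    flat = fourth.reshape(n, c, h * w)                        # (N, C_second, H*W)
    aligned = torch.einsum("ji,njk->nik", conversion_matrix, flat)
    return aligned.reshape(n, -1, h, w)
```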
In some embodiments shown, the training module described above is specifically configured to:
determine a loss as a weighted summation of the error and the gap;
and backpropagate through the student model according to the loss to update the model parameters of the student model.
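In symbols, with weights λ1 and λ2 that are assumptions not fixed by this disclosure:

```latex
\mathcal{L} = \lambda_1 \cdot \mathrm{error} + \lambda_2 \cdot \mathrm{gap}
```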
In some embodiments shown, the sample images include images of a plurality of classification types; the correspondence determining module 620 is specifically configured to:
determine, for each classification type, based on the first output feature and the second output feature corresponding to that classification type, a correspondence between the channel numbers at which the feature maps of the channels included in the first output feature match the feature maps of the channels included in the second output feature;
The training module 630 is specifically configured to:
determining the classification type corresponding to the sample image;
and performing the feature alignment operation using the correspondence corresponding to the determined classification type, so that the feature maps of the channels included in the third output feature and the feature maps of the channels included in the fourth output feature match at the same channel numbers.
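A sketch of building the per-class correspondences (the per-class aggregation by mean and the `match_fn` hook reuse assumptions from the earlier sketches):

```python
import torch

def build_per_class_correspondences(student, teacher, images_by_class, match_fn):
    # images_by_class: {class_id: list of (1, 3, H, W) tensors}.
    # Returns {class_id: {first_channel: second_channel}}; at training time the
    # sample's classification type selects which correspondence to apply.
    out = {}
    with torch.no_grad():
        for cls, imgs in images_by_class.items():
            first = torch.cat([student(x) for x in imgs]).mean(dim=0)   # (C1, H, W)
            second = torch.cat([teacher(x) for x in imgs]).mean(dim=0)  # (C2, H, W)
            out[cls] = match_fn(first, second)
    return out
```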
The application also provides an image processing device, which comprises:
the acquisition module is used for acquiring a target image;
the image processing module is used for performing image processing on the target image by utilizing the image processing model to obtain an image processing result;
the image processing model comprises a model which is obtained by training according to the knowledge distillation method shown in any embodiment.
The embodiments of the knowledge distillation apparatus or the image processing apparatus shown in the present application can be applied to an electronic device. Accordingly, the application discloses an electronic device, which may include: a processor.
a memory for storing processor-executable instructions;
Wherein the processor is configured to invoke the executable instructions stored in the memory to implement the knowledge distillation method or the image processing method.
Referring to fig. 7, fig. 7 is a schematic diagram of a hardware structure of an electronic device according to the present application.
As shown in fig. 7, the electronic device may include a processor for executing instructions, a network interface for making a network connection, a memory for storing operating data for the processor, and a non-volatile memory for storing corresponding instructions for the knowledge distillation apparatus or the image processing apparatus.
The apparatus embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking software implementation as an example, the apparatus, in a logical sense, is formed by the processor of the electronic device where it is located reading the corresponding computer program instructions from non-volatile memory into memory and running them. In terms of hardware, in addition to the processor, memory, network interface, and non-volatile memory shown in fig. 7, the electronic device where the apparatus is located typically includes other hardware according to its actual function, which is not described further here.
It should be understood that, to increase the processing speed, the instructions corresponding to the knowledge distillation apparatus or the image processing apparatus may also be stored directly in the memory, which is not limited herein.
The present application proposes a computer-readable storage medium storing a computer program for executing the aforementioned knowledge distillation method or image processing method.
One skilled in the relevant art will recognize that one or more embodiments of the application may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the application may take the form of a computer program product on one or more computer-usable storage media (which may include, but are not limited to, magnetic disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
"and/or" in the present application means that there is at least one of them, for example, "a and/or B" may include three schemes: A. b, and "a and B".
The embodiments of the present application are described in a progressive manner; for the same or similar parts, the embodiments may refer to each other, and each embodiment focuses on its differences from the others. In particular, the data processing apparatus embodiments are described relatively briefly because they are substantially similar to the method embodiments; for the relevant parts, refer to the description of the method embodiments.
The foregoing describes certain embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the acts or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Embodiments of the subject matter and functional operations described in this disclosure may be implemented in the following: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware which may include the structures disclosed in the present application and structural equivalents thereof, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on a manually-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this application can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows described above may also be performed by, and apparatus may also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
A computer suitable for executing a computer program may include, for example, a general-purpose and/or special-purpose microprocessor, or any other type of central processing unit. Typically, the central processing unit receives instructions and data from a read-only memory and/or a random access memory. The essential components of a computer include a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Typically, a computer also includes, or is operatively coupled to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, in order to receive data from them, transfer data to them, or both. However, a computer does not have to have such devices. Furthermore, the computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, by way of example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry.
While the application contains many specific implementation details, these should not be construed as limiting the scope of any disclosure or of the claims, but rather as describing features of particular embodiments of a particular disclosure. Certain features that are described in this application in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, the various features described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may function in certain combinations and even be initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Furthermore, the processes depicted in the accompanying drawings are not necessarily required to be in the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The foregoing description of the preferred embodiments of the application is merely illustrative and is not intended to limit the embodiments of the application to the particular forms disclosed; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the embodiments of the application.

Claims (15)

1. A method of knowledge distillation, the method comprising:
pre-training a student model and a teacher model, and recording initialization parameters of the student model before pre-training;
respectively utilizing the pre-trained student model and the pre-trained teacher model to perform image processing on an image data set to obtain a first output feature and a second output feature;
based on the first output feature and the second output feature, determining a correspondence between the channel numbers of matching feature-map pairs, each pair consisting of the feature map of one channel included in the first output feature and the feature map of its matching channel included in the second output feature;
initializing the student model with the recorded initialization parameters, and training the initialized student model; in each training round, respectively utilizing the student model and the teacher model to perform image processing on an input sample image to obtain a third output feature and a fourth output feature; determining an error between the third output feature and a ground-truth feature corresponding to the sample image; performing a feature alignment operation using the determined correspondence, so that the feature maps of the channels included in the third output feature and the feature maps of the channels included in the fourth output feature match at the same channel numbers; further determining a gap between the third output feature and the aligned fourth output feature; and updating model parameters of the student model based on the error and the gap.
2. The method of claim 1, wherein performing image processing on the image dataset using the student model and the teacher model, respectively, to obtain the first output feature and the second output feature comprises:
performing image processing on sample images in the image data set by using a student model to obtain output features respectively corresponding to the sample images;
performing weighted summation on the pixel values at the same positions in the output features respectively corresponding to the sample images to obtain the first output feature;
performing image processing on the sample images by using a teacher model to obtain output features respectively corresponding to the sample images;
and carrying out weighted summation on pixel values at the same positions in the output features respectively corresponding to the sample images to obtain the second output features.
3. The method according to claim 1 or 2, wherein determining, based on the first output feature and the second output feature, the correspondence between the channel numbers at which the feature maps of the channels included in the first output feature match the feature maps of the channels included in the second output feature comprises:
determining the correspondence using a bipartite graph matching algorithm or a greedy matching algorithm.
4. A method according to claim 3, wherein determining the correspondence using a greedy matching algorithm comprises:
sequentially taking the feature map of each channel included in the first output feature as the current feature map, and executing: determining, among the feature maps of the channels included in the second output feature, the feature map matching the current feature map; and recording a sub-correspondence between the channel number of the current feature map and the channel number of the matched feature map;
after matching has been completed for the feature maps of all channels of the first output feature, determining the correspondence based on the recorded sub-correspondences.
5. A method according to claim 3, wherein determining the correspondence using a bipartite graph matching algorithm comprises:
sequentially taking the feature map of each channel included in the first output feature as the current feature map, and executing: deleting, from the feature maps of the channels included in the second output feature, the feature maps already determined to match feature maps of the first output feature according to the maintained correspondence; determining, among the feature maps of the remaining channels of the second output feature, the feature map matching the current feature map; and recording a sub-correspondence between the channel number of the current feature map and the channel number of the matched feature map;
after matching has been completed for the feature maps of all channels of the first output feature, determining the correspondence based on the recorded sub-correspondences.
6. The method according to any one of claims 1-5, further comprising:
before performing image processing on the image data set using the student model and the teacher model respectively to obtain the first output feature and the second output feature, pre-training the student model and the teacher model on a training sample set;
the method further comprises the steps of:
before the pre-training is carried out on the student model, recording initialization parameters corresponding to the student model;
the training of the student model comprises:
initializing the student model by using the recorded initialization parameters;
and training the initialized student model.
7. The method according to any one of claims 1-6, further comprising:
generating a conversion matrix based on the determined correspondence, wherein the conversion matrix indicates, for the feature map of each channel included in the first output feature, the channel number of its matching feature map among the channels of the second output feature, or, conversely, for the feature map of each channel included in the second output feature, the channel number of its matching feature map among the channels of the first output feature.
8. The method according to claim 7, wherein performing the feature alignment operation using the determined correspondence, so that the feature maps of the channels included in the third output feature and the feature maps of the channels included in the fourth output feature match at the same channel numbers, comprises:
when the rows of the conversion matrix index the channel numbers of the feature maps included in the second output feature and the columns index the channel numbers of the feature maps included in the first output feature, converting the fourth output feature using the conversion matrix, so that the feature maps of the channels included in the third output feature and the feature maps of the channels included in the fourth output feature match at the same channel numbers;
or,
when the rows of the conversion matrix index the channel numbers of the feature maps included in the first output feature and the columns index the channel numbers of the feature maps included in the second output feature, converting the third output feature using the conversion matrix, so that the feature maps of the channels included in the third output feature and the feature maps of the channels included in the fourth output feature match at the same channel numbers.
9. The method of any of claims 1-8, wherein updating model parameters of the student model based on the error and the gap comprises:
determining a loss as a weighted summation of the error and the gap;
and backpropagating through the student model according to the loss to update the model parameters of the student model.
10. The method of claim 2, wherein the sample images comprise images of a plurality of classification types;
determining, based on the first output feature and the second output feature, the correspondence between the channel numbers at which the feature maps of the channels included in the first output feature match the feature maps of the channels included in the second output feature comprises:
determining, for each classification type, based on the first output feature and the second output feature corresponding to that classification type, a correspondence between the channel numbers at which the feature maps of the channels included in the first output feature match the feature maps of the channels included in the second output feature;
and performing the feature alignment operation using the determined correspondence, so that the feature maps of the channels included in the third output feature and the feature maps of the channels included in the fourth output feature match at the same channel numbers, comprises:
determining the classification type corresponding to the sample image;
and performing the feature alignment operation using the correspondence corresponding to the determined classification type, so that the feature maps of the channels included in the third output feature and the feature maps of the channels included in the fourth output feature match at the same channel numbers.
11. An image processing method, the method comprising:
acquiring a target image;
performing image processing on the target image by using an image processing model to obtain an image processing result;
wherein the image processing model comprises a model trained in accordance with the knowledge distillation method of any one of claims 1-10.
12. A knowledge distillation apparatus, the apparatus comprising:
the image processing module is configured to pre-train a student model and a teacher model and record the initialization parameters of the student model before pre-training; and to respectively utilize the pre-trained student model and the pre-trained teacher model to perform image processing on an image data set to obtain a first output feature and a second output feature;
the correspondence determining module is configured to determine, based on the first output feature and the second output feature, a correspondence between the channel numbers of matching feature-map pairs, each pair consisting of the feature map of one channel included in the first output feature and the feature map of its matching channel included in the second output feature;
the training module is configured to initialize the student model with the recorded initialization parameters and train the initialized student model; in each training round, respectively utilize the student model and the teacher model to perform image processing on an input sample image to obtain a third output feature and a fourth output feature; determine an error between the third output feature and a ground-truth feature corresponding to the sample image; perform a feature alignment operation using the determined correspondence, so that the feature maps of the channels included in the third output feature and the feature maps of the channels included in the fourth output feature match at the same channel numbers; further determine a gap between the third output feature and the aligned fourth output feature; and update model parameters of the student model based on the error and the gap.
13. An image processing apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a target image;
the image processing module is used for performing image processing on the target image by utilizing an image processing model to obtain an image processing result;
Wherein the image processing model comprises a model trained in accordance with the knowledge distillation method of any one of claims 1-10.
14. An electronic device, the device comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to invoke executable instructions stored in the memory to implement the knowledge distillation method of any of claims 1-10 or the image processing method of claim 11.
15. A computer-readable storage medium, characterized in that the storage medium stores a computer program for executing the knowledge distillation method of any one of claims 1-10 or the image processing method of claim 11.
CN202110090849.2A 2021-01-22 2021-01-22 Knowledge distillation and image processing method, apparatus, electronic device and storage medium Active CN112819050B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110090849.2A CN112819050B (en) 2021-01-22 2021-01-22 Knowledge distillation and image processing method, apparatus, electronic device and storage medium
PCT/CN2021/130895 WO2022156331A1 (en) 2021-01-22 2021-11-16 Knowledge distillation and image processing method and apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110090849.2A CN112819050B (en) 2021-01-22 2021-01-22 Knowledge distillation and image processing method, apparatus, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN112819050A CN112819050A (en) 2021-05-18
CN112819050B (en) 2023-10-27

Family

Family ID: 75858950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110090849.2A Active CN112819050B (en) 2021-01-22 2021-01-22 Knowledge distillation and image processing method, apparatus, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN112819050B (en)
WO (1) WO2022156331A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819050B (en) * 2021-01-22 2023-10-27 北京市商汤科技开发有限公司 Knowledge distillation and image processing method, apparatus, electronic device and storage medium
CN115565021A (en) * 2022-09-28 2023-01-03 北京大学 Neural network knowledge distillation method based on learnable feature transformation
CN117726884B (en) * 2024-02-09 2024-05-03 腾讯科技(深圳)有限公司 Training method of object class identification model, object class identification method and device

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN108830288A (en) * 2018-04-25 2018-11-16 北京市商汤科技开发有限公司 Image processing method, the training method of neural network, device, equipment and medium
CN112115783B (en) * 2020-08-12 2023-11-14 中国科学院大学 Depth knowledge migration-based face feature point detection method, device and equipment
CN112819050B (en) * 2021-01-22 2023-10-27 北京市商汤科技开发有限公司 Knowledge distillation and image processing method, apparatus, electronic device and storage medium

Patent Citations (7)

Publication number Priority date Publication date Assignee Title
CN107247989A (en) * 2017-06-15 2017-10-13 北京图森未来科技有限公司 A kind of neural network training method and device
CN110263842A (en) * 2019-06-17 2019-09-20 北京影谱科技股份有限公司 For the neural network training method of target detection, device, equipment, medium
CN111242297A (en) * 2019-12-19 2020-06-05 北京迈格威科技有限公司 Knowledge distillation-based model training method, image processing method and device
CN111260056A (en) * 2020-01-17 2020-06-09 北京爱笔科技有限公司 Network model distillation method and device
KR102191351B1 (en) * 2020-04-28 2020-12-15 아주대학교산학협력단 Method for semantic segmentation based on knowledge distillation
CN111598923A (en) * 2020-05-08 2020-08-28 腾讯科技(深圳)有限公司 Target tracking method and device, computer equipment and storage medium
CN111898735A (en) * 2020-07-14 2020-11-06 上海眼控科技股份有限公司 Distillation learning method, distillation learning device, computer equipment and storage medium

Non-Patent Citations (1)

Title
Local correlation consistency for knowledge distillation; Xiaojie Li et al.; ECCV 2020: Computer Vision - ECCV 2020; full text *

Also Published As

Publication number Publication date
WO2022156331A1 (en) 2022-07-28
CN112819050A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN112819050B (en) Knowledge distillation and image processing method, apparatus, electronic device and storage medium
CN111523621B (en) Image recognition method and device, computer equipment and storage medium
CN111860573B (en) Model training method, image category detection method and device and electronic equipment
CN109522942B (en) Image classification method and device, terminal equipment and storage medium
CN109643383B (en) Domain split neural network
WO2019100724A1 (en) Method and device for training multi-label classification model
WO2019100723A1 (en) Method and device for training multi-label classification model
CN114529825B (en) Target detection model, method and application for fire fighting access occupied target detection
CN110717099B (en) Method and terminal for recommending film
CN110969250A (en) Neural network training method and device
CN109919183B (en) Image identification method, device and equipment based on small samples and storage medium
CN109902763B (en) Method and device for generating feature map
CN114358203A (en) Training method and device for image description sentence generation module and electronic equipment
CN111553182A (en) Ship retrieval method and device and electronic equipment
CN112446888A (en) Processing method and processing device for image segmentation model
CN114842546A (en) Action counting method, device, equipment and storage medium
CN109978058B (en) Method, device, terminal and storage medium for determining image classification
CN115393606A (en) Method and system for image recognition
CN111340057B (en) Classification model training method and device
CN114359592A (en) Model training and image processing method, device, equipment and storage medium
US20230410465A1 (en) Real time salient object detection in images and videos
CN113140012A (en) Image processing method, image processing apparatus, image processing medium, and electronic device
CN112307243B (en) Method and apparatus for retrieving images
CN114155388B (en) Image recognition method and device, computer equipment and storage medium
CN109919249B (en) Method and device for generating feature map

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40049924)
GR01 Patent grant