CN114611696A - Model distillation method, device, electronic equipment and readable storage medium

Info

Publication number
CN114611696A
Authority
CN
China
Prior art keywords
student
model
student model
models
target
Prior art date
Legal status
Pending
Application number
CN202210332387.5A
Other languages
Chinese (zh)
Inventor
舒红乔
王奇刚
李远辉
Current Assignee
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date
Filing date
Publication date
Application filed by Lenovo Beijing Ltd
Priority to CN202210332387.5A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a model distillation method, a model distillation device, an electronic device and a readable storage medium. The method includes the following steps: determining at least two teacher models and at least two student models, wherein the student models correspond to the teacher models, the at least two student models are the same, and any two teacher models are different; in the process of controlling the at least two student models to learn the knowledge of their corresponding teacher models, controlling each student model to fuse the target knowledge of the other student models among the at least two student models; and selecting, from the at least two student models, a student model meeting an agreement condition as the target student model. In this scheme, each teacher model only needs to teach its knowledge to one student model, and the student models learn from one another, so that the student model with the best learning effect can be found quickly, achieving rapid distillation of the multiple teacher models.

Description

Model distillation method, device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of information technology, and more particularly, to a model distillation method, apparatus, electronic device, and readable storage medium.
Background
The core idea of knowledge distillation is to first train a complex network model and then use the output of the complex network, together with the true labels of the data, to train a smaller network. A knowledge distillation framework therefore usually consists of a complex model (called the teacher model) and a small model (called the student model).
There are currently various distillation schemes, such as distillation of intermediate feature maps, distillation of the last hidden layer, and so on, and different kinds of distillation are implemented with different algorithms.
In order to obtain, for different tasks, a student model that combines the strengths of the teacher models (one teacher model corresponding to one algorithm), the teacher model of each task needs to teach each student model in turn, and the student model that best integrates the knowledge of all teachers is finally selected.
Generally, an exhaustive approach is adopted when a plurality of teacher models teach the student models.
Assume there are three teacher models. The student is taught by the teachers in some order, such as teacher 1, then teacher 2, and finally teacher 3, or teacher 1, then teacher 3, and then teacher 2. Three teachers can teach a student in 6 different orders, and for n teachers there are n factorial (n!) possible teaching orders. For the distillation of three teacher models, one distillation process therefore requires carrying out distillation training with the three teacher models once in each of the six orders. For n teacher models, far more rounds of distillation training are needed, which takes a long time.
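Purely as an illustration of this growth (not part of the claimed method), a short Python sketch counts the possible teaching orders:

    import math
    from itertools import permutations

    teachers = ["teacher 1", "teacher 2", "teacher 3"]

    # Every order in which the teachers could take turns teaching one student model.
    orders = list(permutations(teachers))
    print(len(orders))             # 6, i.e. 3!
    print(math.factorial(10))      # 3628800 orders already for 10 teacher models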
Disclosure of Invention
In view of the above, the present application provides a model distillation method, a model distillation device, an electronic device and a readable storage medium.
a model distillation method comprising:
determining at least two teacher models and at least two student models, wherein the student models correspond to the teacher models, the at least two student models are the same, and any two teacher models are different;
in the process of controlling the at least two student models to learn the knowledge of the corresponding teacher model, controlling each student model to respectively fuse the target knowledge of other student models in the at least two student models;
and selecting a student model meeting an agreement condition as a target student model from the at least two student models.
Optionally, in the method, in the process of controlling the at least two student models to learn the knowledge of the corresponding teacher model, the step of controlling each student model to respectively fuse the target knowledge of other student models in the at least two student models includes:
controlling the at least two student models to learn the knowledge of the corresponding teacher models according to an appointed period;
and in each appointment cycle, determining a first student model of the cycle in the at least two student models, and controlling a second student model to fuse the target knowledge of the first student model so that the second student model fuses the knowledge of a corresponding teacher model and the knowledge of a corresponding teacher model of the first student model, wherein the learning effect of the first student model is better than that of the second student model, and the second student model is the other student model except for the first student model in the at least two student models.
Optionally, the method for determining a first student model of the present period among the at least two student models and controlling a second student model to fuse the target knowledge of the first student model includes:
determining the accuracy of the at least two student models based on agreed assessment rules;
selecting a student model with accuracy meeting the condition from the at least two student models as a first student model, and determining a target parameter for a second student model to learn from the first student model;
and controlling the second student model to fuse the target parameters.
Optionally, the method for determining the accuracy of the at least two student models based on the agreed assessment rules includes:
if the training times of the at least two student models respectively reach the target training times corresponding to the appointed period, respectively inputting the evaluation content into the at least two student models to obtain the recognition results of the at least two student models;
and analyzing the recognition results of the at least two student models to obtain the accuracy of the at least two student models.
Optionally, in the above method, in the first period, determining the target parameters for the second student model to learn from the first student model includes:
determining a target parameter proportion for the second student model to learn from the first student model based on the accuracy of the first student model and the accuracy of the second student model;
selecting the target parameters corresponding to the target parameter proportion in the first student model based on an agreed ranking rule.
Optionally, in the above method, in a non-first period, determining a target parameter for the second student model to learn from the first student model includes at least one of:
based on the first student model being different from the first student model of the previous period, determining a designated target parameter for the second student model to learn from the first student model according to a first accuracy of the first student model and a second accuracy of the second student model in the present period;
based on the fact that the first student model is the same as the first student model of the previous period, and the accuracy difference value between the first student model of the previous period and the second student model is smaller than the accuracy difference value between the first student model of the current period and the second student model, adjusting the target parameter ratio of the second student model to the first student model to be 1; selecting a target parameter of the target parameter proportion in the first student model based on an agreed sorting rule;
based on the fact that the first student model is the same as the first student model of the previous period and that the accuracy difference value between the first student model and the second student model in the previous period is larger than the accuracy difference value between the first student model and the second student model in the current period, increasing the target parameter proportion that the second student model learns from the first student model based on the historical target parameter proportion of the previous period, the first accuracy of the first student model and the second accuracy of the second student model in the current period; and selecting a target parameter of the target parameter proportion in the first student model based on an agreed ranking rule.
Optionally, the method further includes:
based on the fact that the first student model is the same as the first student model of the previous period, that the accuracy difference value between the first student model and the second student model in the previous period is larger than the accuracy difference value between them in the current period, and that the first student model of the current period has been determined as the first student model in a target number of consecutive periods before the current period, determining the target parameter proportion of the second student model with respect to the first student model to be 1, and controlling the second student model to give up the target knowledge fused from the first student model in the target number of periods before the current period.
A model distillation apparatus comprising:
a determining module, a fusion module and a selection module, wherein the determining module is used for determining at least two teacher models and at least two student models, the student models correspond to the teacher models, the at least two student models are the same, and any two teacher models are different;
the fusion module is used for controlling each student model to fuse the target knowledge of other student models in the at least two student models respectively in the process of controlling the at least two student models to learn the knowledge of the corresponding teacher model;
and the selection module is used for selecting the student model meeting the appointed conditions from the at least two student models as the target student model.
An electronic device, comprising: a memory, a processor;
wherein, the memory stores a processing program;
the processor is configured to load and execute the processing program stored in the memory to implement the steps of the model distillation method as described in any one of the above.
A readable storage medium having stored thereon a computer program for being invoked and executed by a processor to perform the steps of the model distillation method as set forth in any one of the preceding claims.
As can be seen from the above technical solutions, the present application provides a model distillation method, including: determining at least two teacher models and at least two student models, wherein the student models correspond to the teacher models, the at least two student models are the same, and any two teacher models are different; in the process of controlling the at least two student models to learn the knowledge of their corresponding teacher models, controlling each student model to fuse the target knowledge of the other student models among the at least two student models; and selecting, from the at least two student models, a student model meeting an agreement condition as the target student model. In this scheme, each teacher model only needs to teach its knowledge to one student model, and the student models learn from one another, so that the student model with the best learning effect can be found quickly, achieving rapid distillation of the multiple teacher models.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only embodiments of the present application, and other drawings can be obtained by those skilled in the art based on the provided drawings without creative effort.
FIG. 1 is a flow chart of embodiment 1 of a model distillation method provided herein;
FIG. 2 is a flow chart of embodiment 2 of the model distillation method provided herein;
FIG. 3 is a schematic diagram of the learning process in embodiment 2 of the model distillation method provided herein;
FIG. 4 is a flow chart of embodiment 3 of the model distillation method provided herein;
FIG. 5 is a flow chart of embodiment 4 of the model distillation method provided herein;
FIG. 6 is a flow chart of embodiment 5 of the model distillation method provided herein;
FIG. 7 is a flow chart of embodiment 6 of the model distillation method provided herein;
FIG. 8 is a schematic structural diagram of an embodiment of a model distillation apparatus provided herein.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. The described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art based on the embodiments given herein without creative effort shall fall within the protection scope of the present application.
As shown in fig. 1, which is a flow chart of embodiment 1 of a model distillation method provided herein, the method is applied to an electronic device and includes the following steps:
step S101: determining at least two teacher models and at least two student models;
the student models correspond to the teacher models, wherein the at least two student models are the same, and any two teacher models are different.
The student models correspond to the teacher models one to one, and one student model only learns knowledge from one teacher model.
The teacher models are different, and specifically, the models may be different in type or attribute parameters.
The student models are initially the same; as they learn different knowledge from different teacher models (for example, different parameters), the parameters of the student models change accordingly.
Step S102: in the process of controlling the at least two student models to learn the knowledge of the corresponding teacher model, controlling each student model to respectively fuse the target knowledge of other student models in the at least two student models;
the student models have different performances in the process of learning the knowledge corresponding to the teacher model, the student models are well expressed to learn more knowledge, and other student models use partial knowledge for reference from the well-expressed students to continue learning.
Wherein, each student model all can regard as to perform well in the learning process, and each student model realizes that each student model directly or indirectly learns the knowledge of each teacher model through studying each other, has realized the distillation to these a plurality of teacher models.
In particular, the knowledge involved in this application is convolution weights in the teacher model and the student model.
Step S103: and selecting a student model meeting an agreement condition as a target student model from the at least two student models.
After learning ends, the student model meeting the appointed condition is selected from the plurality of student models.
The appointed condition is the best performance, that is, the best learning effect.
In this case, the same detection data may be used to detect the performance of a plurality of student models whose learning has been completed.
Specifically, the detection data is input into each of the student models that have finished learning to obtain the accuracy of each student model; the accuracy corresponds to the performance, and the higher the accuracy, the better the performance.
The student model with the highest accuracy is selected as the target student model; that is, the student model that best integrates the knowledge of all the teachers is selected from the plurality of student models.
In summary, the present embodiment provides a model distillation method, which includes: determining at least two teacher models and at least two student models, wherein the student models correspond to the teacher models, the at least two student models are the same, and any two teacher models are different; in the process of controlling the at least two student models to learn the knowledge of their corresponding teacher models, controlling each student model to fuse the target knowledge of the other student models among the at least two student models; and selecting, from the at least two student models, a student model meeting an agreement condition as the target student model. In this scheme, each teacher model only needs to teach its knowledge to one student model, and the student models learn from one another, so that the student model with the best learning effect can be found quickly, achieving rapid distillation of the multiple teacher models.
As shown in fig. 2, which is a flow chart of embodiment 2 of the model distillation method provided herein, the method includes the following steps:
step S201: determining at least two teacher models and at least two student models;
step S201 is the same as step S101 in embodiment 1, and details in this embodiment are not repeated.
Step S202: controlling the at least two student models to learn the knowledge of the corresponding teacher models according to an appointed period;
In each appointed period, the student models learn the knowledge of their teacher models, that is, the teacher models train the student models.
Specifically, the same training data is fed to a teacher model and its student model, and both output results. A loss between the two outputs is determined; this loss represents the gap between the student model and the teacher model. The teacher model trains the student model with the goal of minimizing this loss, so the convolution weights in the student model are updated based on the loss, and training is repeated until the number of training iterations set for the period is reached.
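As a non-limiting sketch of what one such training step could look like (PyTorch-style code under assumptions of this description: the models are classifiers, the knowledge lies in the student's convolution weights, and a soft-label KL-divergence loss stands in for the unspecified teacher-student loss; the name distill_step is illustrative):

    import torch
    import torch.nn.functional as F

    def distill_step(teacher, student, optimizer, batch, temperature=2.0):
        """One training iteration: update the student to reduce its gap to the teacher."""
        teacher.eval()
        with torch.no_grad():
            teacher_logits = teacher(batch)      # teacher output for the same training data
        student_logits = student(batch)          # student output for the same training data
        # Loss representing the gap between student and teacher outputs (assumed form).
        loss = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
        optimizer.zero_grad()
        loss.backward()      # gradients flow only into the student's parameters
        optimizer.step()     # the student's convolution weights are updated based on the loss
        return loss.item()

Repeating this step until the number of iterations set for the period is reached corresponds to one appointed period of training.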
Step S203: in each appointed period, a first student model of the period is determined in the at least two student models, and a second student model is controlled to fuse the target knowledge of the first student model;
the second student model fuses the target knowledge of the first student model, so that the second student model fuses the knowledge of the corresponding teacher model and the knowledge of the corresponding teacher model of the first student model;
wherein the first student model has a better learning effect than the second student model, which is the other of the at least two student models except the first student model.
Over the plurality of appointed periods, the student model with the best learning effect in each period (the first student model) is determined as the model to be fused from; the remaining student models fuse its target knowledge, and by doing so they indirectly learn the knowledge of the teacher model corresponding to the first student model.
The appointed periods can be a fixed number of periods; the number needs to ensure that at least one student model can directly or indirectly learn the knowledge of every teacher model within that many periods, such as 15 or 20 periods, and the specific number of periods is not limited in this application.
The appointed periods may also be set according to the learning situation; for example, learning ends when the learning effect of one student model reaches a target effect, or when the learning effects of two or more student models reach the target effect, where reaching the target effect may mean that the accuracy exceeds a threshold.
Wherein, in different periods, the determined first student model may be different, and is specifically determined according to the actual learning effect.
Fig. 3 is a schematic diagram of a learning process involving three teacher models A, B and C and three student models a, b and c, where student model a learns from teacher model A, student model b learns from teacher model B, and student model c learns from teacher model C. After learning in the first appointed period, it is determined that the learning effect of student model c is better than that of the other two student models, so student models b and a fuse the target knowledge of student model c. After the fusion is completed, the three student models continue to learn from their teacher models. During learning in the second appointed period, it is determined that the learning effect of student model b is better than that of the other two student models, so student models c and a fuse the target knowledge of student model b. After the fusion is completed, the three student models continue to learn from their teacher models, and so on until the end.
Through these two periods, student model a has fused, in order, the knowledge that teacher model C taught student model c, and then the knowledge that student model b learned from teacher model B while drawing on student model c's knowledge; it thus directly and indirectly learns the knowledge of all three teacher models.
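A compact sketch of the overall process of this embodiment, under the same assumptions (it reuses the distill_step helper above and the evaluate and fuse_target_knowledge helpers sketched later in this description; the optimizer choice and learning rate are placeholders):

    import torch

    def mutual_distillation(teachers, students, num_periods, steps_per_period,
                            train_loader, eval_set):
        """Each student learns from its own teacher; after every appointed period the
        other students fuse target knowledge of the best-performing student."""
        optimizers = [torch.optim.SGD(s.parameters(), lr=0.01) for s in students]
        for period in range(num_periods):
            # Train every student with its corresponding teacher for the agreed number of steps.
            # train_loader is assumed to yield batches of input data.
            for _, batch in zip(range(steps_per_period), train_loader):
                for teacher, student, opt in zip(teachers, students, optimizers):
                    distill_step(teacher, student, opt, batch)
            # Evaluate all students on the same evaluation content and pick the first student model.
            accuracies = [evaluate(s, eval_set) for s in students]
            first = max(range(len(students)), key=lambda i: accuracies[i])
            # Every second student model fuses target knowledge of the first student model.
            for i, student in enumerate(students):
                if i != first:
                    fuse_target_knowledge(students[first], student,
                                          accuracies[first], accuracies[i])
        # Select the student model that best integrates all teacher knowledge.
        final_acc = [evaluate(s, eval_set) for s in students]
        return students[max(range(len(students)), key=lambda i: final_acc[i])]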
The first student model is determined multiple times over the plurality of periods; after the second student models fuse the target knowledge of the first student model for the last time, the student model meeting the agreement condition is selected from the plurality of student models.
In a specific implementation, after the learning effects are determined in the last period, the second student models fuse the target knowledge of the first student model, and the student model with the highest accuracy on the set detection data is selected as the target student model; alternatively, the first student model determined in the last period may be directly used as the target student model, in which case the second student models of that period do not learn knowledge from the first student model.
Step S204: and selecting a student model meeting an agreement condition as a target student model from the at least two student models.
Step S204 is the same as step S103 in embodiment 1, and details are not described in this embodiment.
In summary, the model distillation method provided in this embodiment includes: controlling the at least two student models to learn the knowledge of their corresponding teacher models according to appointed periods; and in each appointed period, determining a first student model of the period among the at least two student models and controlling the second student models to fuse the target knowledge of the first student model, so that each second student model fuses the knowledge of its own corresponding teacher model and the knowledge of the teacher model corresponding to the first student model, wherein the learning effect of the first student model is better than that of the second student models, and the second student models are the student models other than the first student model among the at least two student models. In this scheme, over a plurality of appointed periods the student models learn the knowledge of their corresponding teacher models, and in every appointed period the other student models fuse the target knowledge of the student model with the best learning effect in that period, thereby indirectly learning the knowledge of the other teacher models. Each teacher model only needs to teach one student model, and the student models fuse knowledge from one another, which reduces the time spent on sequential teaching by the teacher models while ensuring the learning effect.
As shown in fig. 4, which is a flow chart of embodiment 3 of the model distillation method provided herein, the method includes the following steps:
step S401: determining at least two teacher models and at least two student models;
step S402: controlling the at least two student models to learn the knowledge of the corresponding teacher models according to an appointed period;
steps S401 to 402 are the same as steps S201 to 202 in embodiment 2, and are not described in detail in this embodiment.
Step S403: determining the accuracy of the at least two student models based on agreed assessment rules in each agreed period;
and when each appointed period is finished, evaluating the learning effect of the student model based on the appointed evaluation rule.
Specifically, the accuracy of the student model processing data can be analytically determined based on the evaluation rule.
Step S404: selecting a student model with accuracy meeting the condition as a first student model from the at least two student models;
the condition for judging whether the accuracy meets the condition is specifically the one with the highest accuracy.
Specifically, the most accurate one of the plurality of student models is selected as the first student model.
Specifically, the accuracies may be sorted from large to small (or from small to large) and the student model with the highest accuracy is determined as the first student model; a bubbling method may also be adopted, in which the accuracies of two randomly selected student models are compared and the larger one is kept, the kept one is then compared with the remaining accuracies, and the process is repeated until the last one kept is the highest accuracy among the plurality of student models.
It should be noted that the first student models determined and evaluated in different periods may be the same or different.
Step S405: determining target parameters for the second student model to learn from the first student model;
Since the first student model has the best learning effect, the gap between each second student model and the first student model may differ.
Specifically, target parameters learned by each second student model to the first student model are determined, wherein the target parameters learned by different second student models may be different or the same.
Step S406: controlling the second student model to fuse the target parameters;
after the target parameters of each student model for learning to the first student model are determined, the target parameters are taught to the second student model so as to control the second student model to learn and fuse the target parameters, and therefore knowledge of the teacher model corresponding to the first student model can be learned.
The second student model can learn knowledge of the corresponding teacher model only by fusing target parameters of the first student model, and the teacher model corresponding to the first student model does not need to train the first student model.
Step S407: and selecting a student model meeting an agreement condition as a target student model from the at least two student models.
Step S407 is the same as step S204 in embodiment 2, and details are not described in this embodiment.
In summary, the model distillation method provided in this embodiment includes: determining the accuracy of the at least two student models based on an agreed evaluation rule; selecting, from the at least two student models, a student model whose accuracy meets the condition as the first student model, and determining target parameters for the second student models to learn from the first student model; and controlling the second student models to fuse the target parameters. In this scheme, after learning from its corresponding teacher model for an appointed period, each student model has its learning effect evaluated, and the student model with the best learning effect is selected; the other student models with poorer learning effects learn the specific knowledge of the student model with the good learning effect, so that the knowledge of the teacher model corresponding to that student model can be fused into the other student models, which shortens the time the teacher models spend teaching the student models sequentially.
As shown in fig. 5, which is a flow chart of embodiment 4 of the model distillation method provided herein, the method includes the following steps:
step S501: determining at least two teacher models and at least two student models;
step S502: controlling the at least two student models to learn the knowledge of the corresponding teacher models according to an appointed period;
steps S501 to 502 are the same as steps S401 to 402 in embodiment 3, and are not described in detail in this embodiment.
Step S503: in each appointed period, if the training times of the at least two student models respectively reach the target training times corresponding to the appointed period, respectively inputting evaluation contents into the at least two student models to obtain the recognition results of the at least two student models;
The number of training iterations is set for each period.
Specifically, the number of training iterations in each appointed period may be a fixed number, such as 200 or 500; or it may be calculated from the total number of training iterations of the distillation and the number of periods. For example, if the distillation needs 10000 training iterations and 20 periods are set, the number of training iterations set for each period is 500.
In each appointed period, the student models learn the knowledge of the teacher models, that is, the teacher models train the student models.
In each appointed period, the teacher models train the student models and the number of training iterations within the period is accumulated; when it reaches the set target number of training iterations, evaluation of the student models begins.
The student models use the same evaluation content, such as an evaluation data set; each evaluation data set comprises a plurality of frames of images, and the student models identify the content in these frames of images.
As an example, the evaluation data set consists of 100 pictures whose content is various dogs and whose standard label is "dog". The evaluation data set is input into each student model, which recognizes the 100 pictures and produces an output result; for example, student model a outputs "dog" for 98 pictures and "cat" for 2, while student model b outputs "dog" for 78 pictures and "cat" for 22.
Step S504: analyzing the recognition results of the at least two student models to obtain the accuracy of the at least two student models;
the same evaluation data set is input into each student model, the student models output identification results, and the accuracy of each student model is determined based on comparison between the output results of the student models and standard labels.
As an example, the evaluation data set is 100 pictures, the content in the pictures is various dogs, the standard label corresponding to the pictures is "dog", the result of "dog" in the result output by the student model a is 98, the result of "cat" is 2, the identification result of "dog" is determined to be accurate by comparing the result of the standard label "dog", the identification result of "cat" is wrong, the accuracy of the student model a is 98%, and similarly, the accuracy of the student model b is calculated to be 78%.
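A minimal sketch of such an evaluation, assuming the evaluation content is a set of labelled images and each student model outputs class scores (the helper name evaluate and the data layout are assumptions):

    import torch

    def evaluate(student, eval_set):
        """Accuracy = correctly recognised samples / total samples."""
        student.eval()
        correct = 0
        with torch.no_grad():
            for image, standard_label in eval_set:    # e.g. 100 dog pictures labelled "dog"
                prediction = student(image.unsqueeze(0)).argmax(dim=-1).item()
                if prediction == standard_label:
                    correct += 1
        return correct / len(eval_set)                # 98 correct out of 100 gives 0.98

For student model a in the example above this returns 0.98, and for student model b it returns 0.78.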
Step S505: selecting a student model with accuracy meeting the condition as a first student model from the at least two student models;
step S506: determining target parameters for learning from a second student model to the first student model;
step S507: controlling the second student model to fuse the target parameters;
step S508: and selecting a student model meeting an agreement condition as a target student model from the at least two student models.
Steps S505 to 508 are the same as steps S404 to 407 in embodiment 3, and are not described in detail in this embodiment.
In summary, the model distillation method provided in this embodiment includes: if the numbers of training iterations of the at least two student models respectively reach the target number corresponding to the appointed period, respectively inputting the evaluation content into the at least two student models to obtain their recognition results; and analyzing the recognition results of the at least two student models to obtain their accuracies. In this scheme, within an appointed period, after each student model has been trained by its corresponding teacher model for the target number of iterations, the same evaluation content is input into each student model and the accuracy of each student model is obtained from its recognition results; since the learning effect is evaluated on the same evaluation content after the same number of training iterations, it can be determined accurately.
As shown in fig. 6, which is a flow chart of embodiment 5 of the model distillation method provided herein, the method includes the following steps:
step S601: determining at least two teacher models and at least two student models;
step S602: controlling the at least two student models to learn the knowledge of the corresponding teacher models according to an appointed period;
step S603: determining the accuracy of the at least two student models based on agreed assessment rules in each agreed period;
step S604: selecting a student model with accuracy meeting the condition as a first student model from the at least two student models;
steps S601 to 604 are the same as steps S401 to 404 in embodiment 3, and are not described in detail in this embodiment.
Step S605: in the first period, determining a target parameter proportion for the second student model to learn from the first student model based on the accuracy of the first student model and the accuracy of the second student model;
In the first period, after the accuracies of the plurality of student models are determined, the target knowledge (target parameters) that each second student model learns from the first student model is determined respectively.
Specifically, a target parameter proportion for the second student model to learn from the first student model is calculated based on the accuracies of the two student models; the proportion can be positively correlated with how much the accuracy of the first student model exceeds that of the second student model.
For example, if the accuracy of the first student model is S1 and the accuracy of the second student model is S2, the target parameter proportion may be (S1 - S2)/S2.
Step S606: selecting a target parameter of the target parameter proportion in the first student model based on an agreed sorting rule;
Each student model has a plurality of parameters, which may be positive or negative; the target parameters corresponding to the target parameter proportion are selected according to their degree of importance.
Specifically, the larger the absolute value of a parameter, the more important it is; the parameters of the first student model are sorted by absolute value from large to small, and the parameters ranked within the target parameter proportion are selected as the target parameters.
For example, if the target parameter proportion is 20% and the first student model has 30 parameters, then after the parameters are sorted by absolute value from large to small, the 6 parameters whose absolute values rank in the top 20% are determined as the target parameters.
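The proportion formula and the importance-ordered selection could be sketched as follows. This is a simplified illustration: the student's parameters are flattened into one vector, and the fusion is shown as copying the selected parameters from the first student model into the second, which is an assumption, since this embodiment only states that the target parameters are fused:

    import torch

    def fuse_target_knowledge(first_student, second_student, acc_first, acc_second,
                              proportion=None):
        """Fuse the most important parameters of the first (better) student into the second."""
        if proportion is None:
            # First period: proportion (S1 - S2) / S2 grows with the first student's lead.
            proportion = (acc_first - acc_second) / acc_second
        proportion = min(max(proportion, 0.0), 1.0)

        src = torch.nn.utils.parameters_to_vector(first_student.parameters())
        dst = torch.nn.utils.parameters_to_vector(second_student.parameters())

        # Agreed ranking rule: larger absolute value means greater importance.
        k = int(proportion * src.numel())          # e.g. 20% of 30 parameters selects 6 of them
        if k > 0:
            top = torch.topk(src.abs(), k).indices   # indices of the target parameters
            dst[top] = src[top]                      # the second student fuses the target parameters
            torch.nn.utils.vector_to_parameters(dst, second_student.parameters())
        return proportion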
Step S607: controlling the second student model to fuse the target parameters;
step S608: and selecting a student model meeting an agreement condition as a target student model from the at least two student models.
Steps S607 to 608 are the same as steps S406 to 407 in embodiment 3, and are not described in detail in this embodiment.
In summary, in the model distillation method provided in this embodiment, in the first period, the target parameter proportion learned by the second student model from the first student model is determined based on the accuracy of the first student model and the accuracy of the second student model, and the target parameters corresponding to that proportion are selected from the first student model based on an agreed ranking rule. In this scheme, in the first period, the target parameter proportion that each second student model learns from the first student model is determined respectively, the corresponding target parameters are selected from the first student model, and each second student model fuses those target parameters, so that a student model with a poorer learning effect can improve its learning efficiency by learning the specific knowledge of the student model with a good learning effect.
As shown in fig. 7, which is a flow chart of embodiment 6 of the model distillation method provided herein, the method includes the following steps:
step S701: determining at least two teacher models and at least two student models;
step S702: controlling the at least two student models to learn the knowledge of the corresponding teacher models according to an appointed period;
step S703: determining the accuracy of the at least two student models based on agreed assessment rules in each agreed period;
step S704: selecting a student model with accuracy meeting the condition as a first student model from the at least two student models;
steps S701 to 704 are the same as steps S401 to 404 in embodiment 3, and details are not described in this embodiment.
Step S705: in a non-first period, judging whether the first student model is the same as the first student model in the previous period;
in the non-first period, the target learning parameters of the period are determined by combining the conditions of the history period.
Specifically, the gap change between the second student model and the first student model is determined based on whether the first student model in the present period is the same as the first student model in the history period.
Firstly, whether the first student model determined in the period and the first student model in the previous period are the same student model is determined.
If the period is different from the first student model of the previous period, step S706 is executed, and if the period is the same as the first student model of the previous period, step S707 is executed.
If the first student model determined in the period is not the same student model, the second student model representing the previous period is better in performance by fusing the target knowledge of the first student model determined in the period, and is determined as the first student model in the period, and the first student model of the previous period is determined as the second student model in the period.
In the upper period, the performance of the student model 1 is better than that of the student model 2, but the learning is a long-term process, in the period, the student model 2 is better than that of the student model 1 after the comprehension of the teacher model 2, because the student model 2 learns some knowledge from the teacher model 2 and the learning effect of the student model 2 is better than that of the student model 1 by using part of important information learned from the teacher model 1 by the student model 1.
If the first student model determined in the period is the same student model, the second student model representing the previous period is not superior to the first student model in the previous period in performance through fusing target knowledge of the first student model determined in the period, and the second student model is still used as the second student model to learn knowledge from the first student model.
Step S706: determining a designated target parameter for the second student model to learn from the first student model based on a first accuracy of the first student model and a second accuracy of the second student model in the present period;
the calculation process of the period is similar to the process of calculating the target parameter of the first period.
The target parameter proportion of the second student model with respect to the first student model is calculated based on the accuracies of the first and second student models, and the designated target parameters in the first student model are determined from that proportion, without considering the historical accuracy.
Specifically, the target parameter proportion for the second student model to learn from the first student model is calculated based on the accuracies of the two student models, and it reflects how much the accuracy of the first student model exceeds that of the second student model.
For example, if the accuracy of the first student model is S1 and the accuracy of the second student model is S2, the target parameter proportion is (S1 - S2)/S2.
For example, in the previous period the performance of student model 1 was better than that of student model 2, but learning is a long-term process; in the present period, student model 2, under the guidance of teacher model 2, performs better than student model 1, because it has learned some knowledge from teacher model 2 and has used part of the important information that student model 1 learned from teacher model 1. At the end of the training in the present period, student model 1 (the second student model of the present period) therefore learns important knowledge from the better-performing student model 2 (the first student model of the present period).
Step S707: judging whether the accuracy difference value of the first student model and the second student model in the previous period is smaller than the accuracy difference value of the first student model and the second student model in the current period;
if the first student models of the present period and the previous periods are the same student model, whether the second student model has an improved condition after learning the target knowledge of the first student model is further judged.
Specifically, the fusion effect of the second student model is determined by calculating the accuracy difference value change of the first student model and the second student model, and if the accuracy difference value is reduced, the representation fusion is effective; if the accuracy difference is larger, the fusion is represented to be invalid.
If the difference between the accuracies of the first student model and the second student model in the previous cycle is smaller than the difference between the accuracies of the first student model and the second student model in the current cycle, executing step S708; otherwise, step S710 is performed.
Step S708: increasing the target parameter proportion that the second student model learns from the first student model based on the historical target parameter proportion of the previous period and the first accuracy of the first student model and the second accuracy of the second student model in the present period;
If the accuracy difference between the first student model and the second student model in the current period is smaller than that in the previous period, it means that fusing the target knowledge of the first student model into the second student model is effective, so the target parameter proportion that the second student model learns from the first student model is further increased.
For example, in the previous period the performance of student model 1 was better than that of student model 2, but learning is a long-term process; in the present period, with the teaching of teacher model 2 plus the important information obtained from student model 1, the performance of student model 2 gradually catches up with student model 1. This learning mode is therefore judged effective for student model 2, so the proportion borrowed from student model 1 is increased on top of the previous proportion, and learning continues in the next stage.
For example, if the learning proportion of the previous period is P, the accuracy of the first student model is S1 and the accuracy of the second student model is S2, the target parameter proportion determined in the present period is P + (S1 - S2)/S2.
Step S709: selecting a target parameter of the target parameter proportion in the first student model based on an agreed sorting rule;
Each student model has a plurality of parameters, which may be positive or negative; the target parameters corresponding to the target parameter proportion are selected according to their degree of importance.
Specifically, the larger the absolute value of a parameter, the more important it is; the parameters of the first student model are sorted by absolute value from large to small, and the parameters ranked within the target parameter proportion are selected as the target parameters.
After the first student model is determined in a period, the student model serving as the first student model is recorded; if the same student model is determined as the first student model in consecutive periods, the number of consecutive periods for which it has served as the first student model is accumulated.
After step S709, the method further includes: if the first student model of the present period has been determined as the first student model in a target number of consecutive periods before the current period, determining the target parameter proportion of the second student model with respect to the first student model to be 1, and controlling the second student model to give up the target knowledge fused from the first student model in the target number of periods before the current period.
If knowledge has been learned from the same first student model many times in a row and the proportion has been increased more than once, it means the gap between the second student model and the first student model keeps getting smaller but the second student model never catches up with the first student model; the second student model is then judged to have hit a bottleneck in this learning mode, so the mode is abandoned, all of the knowledge learned by the first student model is taught directly to the second student model, and a completely new round of learning continues.
Step S710: adjusting the target parameter proportion that the second student model learns from the first student model to be 1, and selecting the target parameters corresponding to the target parameter proportion in the first student model based on an agreed sorting rule;
If the accuracy difference between the first student model and the second student model in the previous period is not larger than the accuracy difference between them in the current period, the fusion is deemed ineffective.
If the target parameter proportion is 1, the first student model teaches all of its knowledge to the second student model.
For example, in the previous period the performance of student model 1 was better than that of student model 2; in the present period, even with the teaching of teacher model 2 and the important information obtained from student model 1, student model 2 still does not perform better than student model 1, and the gap between them has grown. The learning mode is therefore judged ineffective for student model 2, so the knowledge learned by student model 1 is taught directly to student model 2 and a completely new round of learning continues.
Likewise, if the target parameter proportion of the second student model with respect to the first student model was increased in the previous period but the performance of the second student model drops after fusing the target parameters of the first student model, the fusion is deemed ineffective, and in the present period the first student model teaches all of its knowledge to the second student model.
For example, if the proportion borrowed from student model 1 was increased in the previous period, and in the present period, with the teaching of teacher model 2 and the increased proportion borrowed from student model 1, the gap between student model 2 and student model 1 has grown compared with last time, it is determined that student model 2 is starting to get confused in this mode and cannot integrate the two kinds of knowledge well. Therefore, the knowledge learned by student model 1 is taught directly to student model 2, and a completely new round of learning continues.
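The decision logic of this embodiment for a non-first period could be summarised by a sketch like the following; the bookkeeping arguments and the bottleneck count target_number are illustrative assumptions, and the returned proportion could then be passed to the fuse_target_knowledge sketch above:

    def decide_proportion(same_first_as_previous_period, acc_first, acc_second,
                          gap_previous, previous_proportion,
                          consecutive_periods_as_first, target_number=3):
        """Return (target_parameter_proportion, give_up_previously_fused_knowledge)."""
        gap_current = acc_first - acc_second

        if not same_first_as_previous_period:
            # The ranking flipped: compute the proportion from scratch, (S1 - S2) / S2.
            return gap_current / acc_second, False

        if gap_previous > gap_current:
            # The gap is shrinking, so the fusion is effective.
            if consecutive_periods_as_first >= target_number:
                # Bottleneck: the same first student model for too many consecutive periods.
                # Take everything and discard the previously fused target knowledge.
                return 1.0, True
            # Increase the proportion: P + (S1 - S2) / S2.
            return previous_proportion + gap_current / acc_second, False

        # The gap did not shrink: the fusion is ineffective, teach everything.
        return 1.0, False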
Step S711: controlling the second student model to fuse the target parameters;
step S712: and selecting a student model meeting an agreement condition as a target student model from the at least two student models.
Steps S711 to 712 are the same as steps S406 to 407 in embodiment 3, and are not described in detail in this embodiment.
In summary, in the model distillation method provided in this embodiment, in a non-first period the target parameters to be learned in the period are determined in combination with the historical periods. Specifically, the determination is based on whether the first student model of the present period is the same as that of the historical period and on how the gap between the second student model and the first student model has changed. In this scheme, the learning situation of the second student model in the historical periods is taken into account, which is more accurate than processing each period in isolation.
Corresponding to the embodiments of the model distillation method provided by the application, the application also provides an embodiment of a model distillation apparatus applying the model distillation method.
Fig. 8 is a schematic structural diagram of an embodiment of a model distillation apparatus provided in the present application, the apparatus including the following structure: a determination module 801, a fusion module 802 and a selection module 803;
the determining module 801 is configured to determine at least two teacher models and at least two student models, where the student models correspond to the teacher models, where the at least two student models are the same, and any two teacher models are different;
the fusion module 802 is configured to control each student model to respectively fuse target knowledge of other student models in the at least two student models in the process of controlling the at least two student models to learn knowledge of corresponding teacher models;
the selecting module 803 is configured to select, from the at least two student models, a student model meeting an agreement condition as a target student model.
Optionally, the fusion module includes:
the learning unit is used for controlling the at least two student models to learn the knowledge of the corresponding teacher model according to an appointed period;
and the fusion unit is used for determining a first student model of the period in the at least two student models in each appointed period, and controlling a second student model to fuse the target knowledge of the first student model so that the second student model fuses the knowledge of a corresponding teacher model and the knowledge of a corresponding teacher model of the first student model, wherein the learning effect of the first student model is better than that of the second student model, and the second student model is the other student model except the first student model in the at least two student models.
Optionally, the fusion unit includes:
an evaluation subunit, configured to determine accuracy of the at least two student models based on an agreed evaluation rule;
the selection subunit is used for selecting a student model with the accuracy meeting the condition from the at least two student models as a first student model and determining a target parameter for a second student model to learn from the first student model;
and the fusion subunit is used for controlling the second student model to fuse the target parameters.
Optionally, the evaluation subunit is specifically configured to:
if the training times of the at least two student models respectively reach the target training times corresponding to the appointed period, respectively inputting the evaluation content into the at least two student models to obtain the recognition results of the at least two student models;
and analyzing the recognition results of the at least two student models to obtain the accuracy of the at least two student models.
Optionally, in the first period, the selecting subunit is configured to:
determining a target parameter proportion for the second student model to learn from the first student model based on the accuracy of the first student model and the accuracy of the second student model;
selecting a target parameter of the target parameter proportion in the first student model based on an agreed ranking rule.
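For the first period, one plausible reading of the selecting subunit is sketched below: the proportion of parameters the second student learns grows with the accuracy gap, and the agreed ranking rule is modelled here as sorting parameters by their largest absolute value. Both the formula and the ranking rule are assumptions; the patent leaves them as agreed rules.

```python
def first_period_targets(first_params, acc_first, acc_second):
    """Return (proportion, names of target parameters) for the first period."""
    # Assumed formula: the larger the accuracy gap, the more is learned.
    gap = max(0.0, acc_first - acc_second)
    proportion = min(1.0, gap / max(acc_first, 1e-8))
    # Assumed agreed ranking rule: parameters with the largest magnitude first.
    ranked = sorted(first_params,
                    key=lambda name: -max(abs(v) for v in first_params[name]))
    count = int(round(proportion * len(ranked)))
    return proportion, ranked[:count]
```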
Optionally, in a non-first period, the selection subunit is configured to perform at least one of the following:
determining, based on the first student model of the current period being different from the first student model of the previous period, the designated target parameter for the second student model to learn from the first student model according to a first accuracy of the first student model and a second accuracy of the second student model in the current period;
based on the first student model of the current period being the same as the first student model of the previous period, and the accuracy difference between the first student model and the second student model in the previous period being smaller than the accuracy difference between the first student model and the second student model in the current period, adjusting the target parameter proportion that the second student model learns from the first student model to 1, and selecting target parameters of the target parameter proportion in the first student model based on an agreed ranking rule;
based on the first student model of the current period being the same as the first student model of the previous period, and the accuracy difference between the first student model and the second student model in the previous period being larger than the accuracy difference between the first student model and the second student model in the current period, increasing the target parameter proportion that the second student model learns from the first student model according to the historical target parameter proportion of the previous period, the first accuracy of the first student model, and the second accuracy of the second student model in the current period, and selecting target parameters of the target parameter proportion in the first student model based on the agreed ranking rule.
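The three non-first-period branches above can be summarised in the sketch below. How the proportion is raised in the last branch is not specified by the patent; here it is scaled by how much the accuracy gap shrank relative to the previous period, which is purely an assumption.

```python
def non_first_period_proportion(first_id, prev_first_id,
                                acc_first, acc_second,
                                prev_gap, prev_proportion):
    """Return the target parameter proportion for the current (non-first) period."""
    gap = acc_first - acc_second
    if first_id != prev_first_id:
        # A different first student: decide from this period's accuracies alone.
        return min(1.0, max(0.0, gap) / max(acc_first, 1e-8))
    if prev_gap < gap:
        # Same first student and the gap widened: learn all target parameters.
        return 1.0
    # Same first student and the gap narrowed: raise the historical proportion.
    shrink = (prev_gap - gap) / max(prev_gap, 1e-8)
    return min(1.0, prev_proportion * (1.0 + shrink))
```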
Optionally, in a non-first period, the selecting unit is further configured to:
based on the first student model of the current period being the same as the first student model of the previous period, the accuracy difference between the first student model and the second student model in the previous period being larger than the accuracy difference between the first student model and the second student model in the current period, and the first student model of the current period having been determined as the first student model in a target number of consecutive periods before the current period, determining the target parameter proportion that the second student model learns from the first student model as 1, and controlling the second student model to give up the target knowledge fused from the first student model in the target number of periods before the current period.
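The rollback case above can be sketched as follows, under the assumption that the second student's parameters are snapshotted before each fusion; giving up the fused target knowledge then amounts to restoring the snapshot taken before the target number of most recent fusions, after which the proportion is set to 1. The snapshot mechanism is an implementation assumption, not something the patent prescribes.

```python
import copy

def rollback_and_relearn(snapshots, target_number):
    """Restore the second student's parameters from before the last
    `target_number` fusions and learn the full parameter set again."""
    # snapshots[-k] holds the parameters saved just before the k-th most
    # recent fusion with the first student model.
    restored = copy.deepcopy(snapshots[-target_number])
    proportion = 1.0  # learn every target parameter from the first student
    return restored, proportion
```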
The functions of the components of the model distillation apparatus are explained with reference to the method embodiment, which is not repeated herein.
In summary, this embodiment provides a model distillation apparatus, which includes: a determining module, configured to determine at least two teacher models and at least two student models, where the student models correspond to the teacher models; a fusion module, configured to control each student model to fuse the target knowledge of the other student models among the at least two student models while controlling the at least two student models to learn the knowledge of their corresponding teacher models; and a selection module, configured to select, from the at least two student models, a student model meeting the agreement condition as the target student model. In this scheme, each teacher model only needs to teach its knowledge to one student model, and the student models learn from one another, so the student model with the best learning effect can be found quickly, achieving rapid distillation of the multiple teacher models.
Corresponding to the embodiment of the model distillation method provided by the application, the application further provides an electronic device and a readable storage medium corresponding to the model distillation method.
The electronic device includes: a memory and a processor;
wherein the memory stores a processing program;
and the processor is configured to load and execute the processing program stored in the memory to implement the steps of the model distillation method according to any of the above embodiments.
Specifically, for the model distillation method implemented by the electronic device, reference may be made to the foregoing model distillation method embodiments.
The readable storage medium stores a computer program, which is called and executed by a processor to implement the steps of the model distillation method according to any of the above embodiments.
Specifically, for the model distillation method implemented by the computer program stored in the readable storage medium, reference may be made to the foregoing model distillation method embodiments.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts among the embodiments, reference may be made to one another. Since the apparatus provided in the embodiments corresponds to the method provided in the embodiments, its description is relatively brief, and for relevant details reference may be made to the description of the method.
The foregoing description of the provided embodiments enables any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A model distillation method comprising:
determining at least two teacher models and at least two student models, wherein the student models correspond to the teacher models, the at least two student models are the same, and any two teacher models are different;
in the process of controlling the at least two student models to learn the knowledge of the corresponding teacher model, controlling each student model to respectively fuse the target knowledge of other student models in the at least two student models;
and selecting a student model meeting an agreement condition as a target student model from the at least two student models.
2. The method of claim 1, wherein controlling each student model to respectively fuse the target knowledge of the other student models among the at least two student models, in the process of controlling the at least two student models to learn the knowledge of the corresponding teacher models, comprises:
controlling the at least two student models to learn the knowledge of the corresponding teacher models according to an appointed period;
and in each appointed period, determining a first student model of the period among the at least two student models, and controlling a second student model to fuse the target knowledge of the first student model, so that the second student model fuses the knowledge of its corresponding teacher model with the knowledge of the teacher model corresponding to the first student model, wherein the learning effect of the first student model is better than that of the second student model, and the second student model is a student model other than the first student model among the at least two student models.
3. The method of claim 2, wherein determining a first student model of the period among the at least two student models and controlling a second student model to fuse the target knowledge of the first student model comprises:
determining the accuracy of the at least two student models based on an agreed assessment rule;
selecting, from the at least two student models, a student model whose accuracy meets the condition as the first student model, and determining the target parameter for the second student model to learn from the first student model;
and controlling the second student model to fuse the target parameters.
4. The method of claim 3, wherein determining the accuracy of the at least two student models based on the agreed assessment rule comprises:
if the number of training iterations of each of the at least two student models reaches the target number of training iterations corresponding to the appointed period, respectively inputting evaluation content into the at least two student models to obtain recognition results of the at least two student models;
and analyzing the recognition results of the at least two student models to obtain the accuracy of the at least two student models.
5. The method of claim 3, wherein determining the target parameter for the second student model to learn from the first student model in a first period comprises:
determining a target parameter proportion learned by the second student model to the first student model based on the accuracy of the first student model and the accuracy of the second student model;
selecting a target parameter of the target parameter proportion in the first student model based on an agreed ranking rule.
6. The method of claim 5, wherein determining the target parameter for the second student model to learn from the first student model in a non-first period comprises at least one of the following:
determining, based on the first student model of the current period being different from the first student model of the previous period, the designated target parameter for the second student model to learn from the first student model according to a first accuracy of the first student model and a second accuracy of the second student model in the current period;
based on the first student model of the current period being the same as the first student model of the previous period, and the accuracy difference between the first student model and the second student model in the previous period being smaller than the accuracy difference between the first student model and the second student model in the current period, adjusting the target parameter proportion that the second student model learns from the first student model to 1, and selecting target parameters of the target parameter proportion in the first student model based on an agreed ranking rule;
based on the first student model of the current period being the same as the first student model of the previous period, and the accuracy difference between the first student model and the second student model in the previous period being larger than the accuracy difference between the first student model and the second student model in the current period, increasing the target parameter proportion that the second student model learns from the first student model according to the historical target parameter proportion of the previous period, the first accuracy of the first student model, and the second accuracy of the second student model in the current period, and selecting target parameters of the target parameter proportion in the first student model based on the agreed ranking rule.
7. The method of claim 6, further comprising:
based on the first student model of the current period being the same as the first student model of the previous period, the accuracy difference between the first student model and the second student model in the previous period being larger than the accuracy difference between the first student model and the second student model in the current period, and the first student model of the current period having been determined as the first student model in a target number of consecutive periods before the current period, determining the target parameter proportion that the second student model learns from the first student model as 1, and controlling the second student model to give up the target knowledge fused from the first student model in the target number of periods before the current period.
8. A model distillation apparatus comprising:
a determining module, configured to determine at least two teacher models and at least two student models, wherein the student models correspond to the teacher models, the at least two student models are the same, and any two teacher models are different;
the fusion module is used for controlling each student model to respectively fuse the target knowledge of other student models in the at least two student models in the process of controlling the at least two student models to learn the knowledge of the corresponding teacher model;
and the selection module is used for selecting the student model meeting the agreement condition from the at least two student models as the target student model.
9. An electronic device, comprising: a memory, a processor;
wherein, the memory stores a processing program;
the processor is configured to load and execute the processing program stored in the memory to implement the steps of the model distillation method as claimed in any one of claims 1 to 7.
10. A readable storage medium having stored thereon a computer program for being invoked and executed by a processor for carrying out the steps of the model distillation method according to any one of claims 1 to 7.
CN202210332387.5A 2022-03-31 2022-03-31 Model distillation method, device, electronic equipment and readable storage medium Pending CN114611696A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210332387.5A CN114611696A (en) 2022-03-31 2022-03-31 Model distillation method, device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210332387.5A CN114611696A (en) 2022-03-31 2022-03-31 Model distillation method, device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN114611696A true CN114611696A (en) 2022-06-10

Family

ID=81866643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210332387.5A Pending CN114611696A (en) 2022-03-31 2022-03-31 Model distillation method, device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114611696A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20240137286A1 (en) * 2022-10-25 2024-04-25 International Business Machines Corporation Drift detection in edge devices via multi-algorithmic deltas
US11991050B2 (en) * 2022-10-25 2024-05-21 International Business Machines Corporation Drift detection in edge devices via multi-algorithmic deltas

Similar Documents

Publication Publication Date Title
CN110569443B (en) Self-adaptive learning path planning system based on reinforcement learning
CN107273490B (en) Combined wrong question recommendation method based on knowledge graph
CN111143569A (en) Data processing method and device and computer readable storage medium
CN111047563B (en) Neural network construction method applied to medical ultrasonic image
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
CN111046188A (en) User preference degree determining method and device, electronic equipment and readable storage medium
CN109857872A (en) The information recommendation method and device of knowledge based map
CN107544960A (en) A kind of inference method activated based on Variable-Bindings and relation
CN117035074B (en) Multi-modal knowledge generation method and device based on feedback reinforcement
CN111382572A (en) Named entity identification method, device, equipment and medium
CN110110899A (en) Prediction technique, adaptive learning method and the electronic equipment of acquisition of knowledge degree
CN113609337A (en) Pre-training method, device, equipment and medium of graph neural network
CN110308658A (en) A kind of pid parameter setting method, device, system and readable storage medium storing program for executing
CN113642652A (en) Method, device and equipment for generating fusion model
CN114611696A (en) Model distillation method, device, electronic equipment and readable storage medium
CN114398556A (en) Learning content recommendation method, device, equipment and storage medium
CN118020097A (en) Method and apparatus for prerequisite relationship discovery for concepts of multiple courses
CN111813941A (en) Text classification method, device, equipment and medium combining RPA and AI
CN115098583A (en) User portrait depicting method for energy user
CN113238947B (en) Man-machine collaborative dialogue system evaluation method and system
CN112365302B (en) Product recommendation network training method, device, equipment and medium
CN117557426B (en) Work data feedback method and learning evaluation system based on intelligent question bank
CN117725191B (en) Guide information generation method and device of large language model and electronic equipment
CN116415839B (en) Crowd-sourced task allocation method and system based on interpretable machine learning
CN118036756B (en) Method, device, computer equipment and storage medium for large model multi-round dialogue

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination