CN114943336A - Model pruning method, device, equipment and storage medium - Google Patents

Model pruning method, device, equipment and storage medium

Info

Publication number
CN114943336A
Authority
CN
China
Prior art keywords
model
parameter
training
matrix
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210471355.3A
Other languages
Chinese (zh)
Inventor
高大伟
谢悦湘
周子慕
王桢
李雅亮
丁博麟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202210471355.3A
Publication of CN114943336A
Status: Pending

Classifications

    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N 20/00 Machine learning
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    (all under G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a model pruning method, apparatus, device and storage medium. The method comprises: acquiring random initialization parameters of a model and a plurality of training tasks; performing multiple rounds of training on the model by using training samples corresponding to the plurality of training tasks, to obtain a first model parameter and a second model parameter corresponding to the model after two different rounds of training; determining a first mask matrix corresponding to a first parameter matrix of a target layer in the first model parameter, and a second mask matrix corresponding to a second parameter matrix of the target layer in the second model parameter; and if the similarity between the first mask matrix and the second mask matrix is greater than a set threshold, pruning the random initialization parameters according to the mask matrices respectively corresponding to the parameter matrices of each layer in the second model parameter, to obtain a third model parameter. The scheme reduces the computation overhead of model training.

Description

Model pruning method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a model pruning method, apparatus, device and storage medium.
Background
User terminal devices such as smart phones, tablet computers and embedded devices have become an indispensable part of daily life, and user data accumulates on these devices accordingly. This gives rise to a need to train deep learning models on the terminal devices using locally collected user data.
Because of data security and privacy considerations, and because of resource constraints such as the small amount of user data and the limited computing power of terminal devices, how to train a deep learning model on a user terminal device using the user data has become an urgent and difficult problem.
Disclosure of Invention
The embodiment of the invention provides a model pruning method, a model pruning device, model pruning equipment and a storage medium, which are used for reducing the calculation overhead required by terminal side model training.
In a first aspect, an embodiment of the present invention provides a model pruning method, which is applied to a user terminal device, and the method includes:
acquiring random initialization parameters of a model and a plurality of training tasks for training the model;
performing multiple rounds of training on the model by using training samples respectively corresponding to the multiple training tasks, to obtain a first model parameter and a second model parameter corresponding to the model after two different rounds of training, wherein the first model parameter and the second model parameter both comprise parameter matrices of each layer in the model;
determining a first mask matrix corresponding to a first parameter matrix of a target layer in the first model parameter, and a second mask matrix corresponding to a second parameter matrix of the target layer in the second model parameter;
and if the similarity between the first mask matrix and the second mask matrix is greater than a set threshold, pruning the random initialization parameters according to the mask matrices respectively corresponding to the parameter matrices of each layer in the second model parameter, to obtain a third model parameter.
In a second aspect, an embodiment of the present invention provides a model pruning apparatus, which is applied to a user terminal device, and the apparatus includes:
an acquisition module, configured to acquire random initialization parameters of a model and a plurality of training tasks for training the model;
a pre-training module, configured to perform multiple rounds of training on the model by using training samples respectively corresponding to the training tasks, to obtain a first model parameter and a second model parameter corresponding to the model after two different rounds of training, wherein the first model parameter and the second model parameter both comprise parameter matrices of each layer in the model;
a pruning module, configured to determine a first mask matrix corresponding to a first parameter matrix of a target layer in the first model parameter and a second mask matrix corresponding to a second parameter matrix of the target layer in the second model parameter, and, if the similarity between the first mask matrix and the second mask matrix is greater than a set threshold, prune the random initialization parameters according to the mask matrices respectively corresponding to the parameter matrices of each layer in the second model parameter to obtain a third model parameter.
In a third aspect, an embodiment of the present invention provides a user terminal device, comprising: a memory, a processor and a communication interface; wherein the memory stores executable code which, when executed by the processor, causes the processor to perform the model pruning method of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of a user terminal device, causes the processor to implement at least the model pruning method according to the first aspect.
In the embodiment of the invention, in the process of pre-training the model in the meta-learning manner (to obtain better-performing initialization parameters of the model), the model parameters are first randomly initialized to obtain the random initialization parameters of the model. Then, training samples respectively corresponding to a plurality of training tasks are used to perform multiple rounds of training on the model, to obtain a first model parameter and a second model parameter corresponding to the model after two successive, different rounds of training, where the first model parameter and the second model parameter each comprise the parameter matrices of each layer in the model. A first mask matrix corresponding to the first parameter matrix of the target layer in the first model parameter is determined, a second mask matrix corresponding to the second parameter matrix of the target layer in the second model parameter is determined in the same way, and the similarity between the first mask matrix and the second mask matrix is computed. If the similarity between the first mask matrix and the second mask matrix is greater than the set threshold, the parameter matrix of the target layer of the model has tended to be stable; training may then be suspended, and the random initialization parameters are pruned according to the mask matrices respectively corresponding to the parameter matrices of each layer in the second model parameter, to obtain a third model parameter of the model. The model with the third model parameter is a sparse model. Training of the model then continues based on the third model parameter, and a target initialization parameter of the model is obtained when the model is trained to convergence.
During training, only the change of the pruning positions of a specific layer in the model (reflected by its mask matrix) needs to be monitored. Once the pruning positions of that layer are stable, the random initialization parameters can be pruned based on those positions to obtain sparse parameters that serve as the starting point for continued training. The sparse parameters place smaller demands on storage and computing resources, and training the model parameters from this sparse starting point allows the model to be trained more quickly at a lower computation cost.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show some embodiments of the present invention, and that those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a flow chart of a model pruning method according to an embodiment of the present invention;
fig. 2 is a schematic diagram illustrating an implementation process of a model pruning method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating an application of a model pruning method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of another application of the model pruning method provided by the embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating an application of a model pruning method according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a model pruning device according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a user terminal device according to this embodiment;
fig. 8 is a schematic structural diagram of another user terminal device provided in this embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In addition, the sequence of steps in each method embodiment described below is only an example and is not strictly limited.
Few-shot learning is a problem faced in many practical scenarios. For example, the user data collected on a user terminal device is relatively limited, yet there is a need to train a neural network model that provides a certain service based on this small amount of user data. As another example, the medical history information of some rare diseases is relatively limited, yet there is a need to train a neural network model that provides a disease prediction function based on this small amount of medical history information.
Meta-learning, also called learning to learn, uses past knowledge and experience to guide the learning of new tasks, so that the network acquires the ability to learn how to learn; it is a common approach to the few-shot problem. Common meta-learning methods include, for example, the MAML algorithm and the Reptile algorithm.
First, meta-learning is briefly introduced. By contrast, traditional machine learning first tunes the model parameters manually and then trains a neural network model directly on a specific task. Meta-learning instead trains better model initialization parameters through other tasks, and then trains on the specific task to obtain the model parameters for that task. In other words, meta-learning trains a model over a large number of tasks to obtain an initialized model (i.e., better model initialization parameters) that can learn faster on a new task with a small amount of data. The initialized model has the ability to converge quickly on unknown tasks.
In machine learning, the training unit is the sample data corresponding to a certain task; the model is trained on this sample data, which can be divided into a training set, a test set and a validation set. In meta-learning, the training unit is a task, and there are generally two kinds of tasks: training tasks (Train Tasks) and test tasks (Test Tasks). A number of training tasks are prepared for learning better hyper-parameters (a better initialization), and the test task then uses the hyper-parameters learned from the training tasks to train on the specific task. The sample data of each training task is divided into a Support set and a Query set; the data of the test task is divided into a training set and a test set.
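As an illustrative sketch only (the names MetaTask, support and query are assumptions for illustration, not terms defined in this application), the division of a training task into a support set and a query set can be represented as follows:

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

# One labeled example: (input, label). The concrete types are placeholders.
Example = Tuple[Any, int]

@dataclass
class MetaTask:
    """A single training task in meta-learning. Its sample data is split into a
    support set (used to adapt the model to the task) and a query set (used to
    compute the test loss that drives the meta-update)."""
    name: str
    support: List[Example]
    query: List[Example]

# The training unit of meta-learning is a collection of such tasks, e.g.
# speech recognition, image recognition, ..., text classification.
train_tasks: List[MetaTask] = []
```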
For example: task 1 is a speech recognition task, task 2 is an image recognition task, ..., task 100 is a text classification task, and task 101 differs in type from the previous 100 tasks; the training tasks are those 100 different tasks, and the test task is the 101st task.
Meta-learning does not depend heavily on the number of samples per task; that is, the number of samples corresponding to each task may be small. In summary, in the meta-learning training process, the model parameters are first randomly initialized; then, starting from the random initialization parameters, a plurality of training tasks are used to perform multiple rounds of iterative training on the model until training converges, at which point the obtained model parameters are the better initialization parameters of the model. Then, taking the better initialization parameters as the starting point, model training continues for the new task to obtain model parameters suitable for the new task.
In the conventional meta-learning process, training the model from randomly initialized parameters to convergence consumes a large amount of computing resources and takes a long time. To this end, a pruning strategy may be considered to reduce the computation overhead of model training and improve training efficiency.
A basic pruning solution comprises the following steps: 1) randomly initialize the model parameters; 2) perform meta-learning training until the model converges; 3) determine mask matrices based on the model parameters at convergence, and prune the whole model accordingly; 4) retrain to restore the model's performance.
A mask matrix records the pruned parameter positions in a model parameter matrix. It is a matrix composed of 0s and 1s, where 0 indicates that the parameter value at the corresponding position is pruned and 1 indicates that it is kept. In practice, a model has many parameter matrices; for example, each layer in the model has its own parameter matrix. Model pruning actually determines, for each layer, a mask matrix based on that layer's parameter matrix, and then prunes the layer's parameter matrix according to its mask matrix, i.e., the parameter values at the positions where the mask matrix is 0 are removed (set to 0).
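A minimal numerical sketch of this operation (with arbitrary illustrative values) is given below; multiplying the 0/1 mask element-wise with a layer's parameter matrix zeroes out the pruned positions:

```python
import numpy as np

# Toy 3x3 parameter matrix of one layer (arbitrary values for illustration).
weights = np.array([[ 0.80, -0.10,  0.50],
                    [ 0.02,  0.90, -0.70],
                    [-0.30,  0.04,  0.60]])

# Mask matrix: 0 means the parameter at this position is pruned, 1 means it is kept.
mask = np.array([[1, 0, 1],
                 [0, 1, 1],
                 [1, 0, 1]])

pruned = weights * mask  # element-wise product sets the masked positions to 0
print(pruned)
```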
Although continuing model training in step 4) from the pruned model parameters can reduce the computation overhead and let the model converge more quickly, this solution has the following disadvantages. In step 2), training from the random initialization parameters to convergence incurs a high meta-learning training and computation cost, the convergence speed is very slow, and the amount of computation of the whole pruning procedure is large. In addition, the traditional model pruning technique in step 3) was developed for general machine learning tasks and is not necessarily suitable for meta-learning scenarios.
As a result, the conventional pruning scheme is not applicable in situations that are sensitive to computation overhead, for example when model training is performed on a user terminal device with limited computing power and storage capacity. In particular, in the scenario where model training uses user data collected on the user terminal device, data security and privacy require the meta-learning-based training to be completed locally on the device, which makes the need to reduce the computation overhead even more urgent.
To this end, the embodiment of the present invention provides, based on the Lottery Ticket Hypothesis (LTH), an efficient model pruning scheme suitable for the meta-learning process, which can effectively reduce the computation overhead and improve model training efficiency. In the scheme provided by the embodiment of the present invention, sparse model pruning is performed on the meta-learning model to obtain a sparse model with fast convergence capability, thereby enabling deep learning model training on end-side user terminal devices. It should be noted that the model pruning scheme provided by the embodiment of the present invention is applicable not only to model training in the meta-learning manner but also to model training in other training manners.
The lottery ticket hypothesis states that a randomly initialized deep learning model contains at least one submodel that can be trained, on a given target learning task, to the same effect (accuracy/error rate, etc.) as the original model; this submodel is the winning ticket, i.e., the winning submodel. The winning submodel is actually a model in which some parameter values in the parameter matrices of the original model are set to 0 (i.e., a model from which some parameter values are deleted). In practice, a model parameter matrix is also commonly called a model weight matrix, and a model has many parameter matrices; for example, each layer in the model has a corresponding parameter matrix. Since the purpose of model training is to obtain the model parameters, where there is no ambiguity the randomly initialized model in the embodiment of the present invention is equivalent to the randomly initialized parameters of the model.
The inventors have verified that the lottery ticket hypothesis still holds in meta-learning. Therefore, in the process of model pruning, finding the winning submodel can be taken as the goal of model pruning.
The model pruning method proposed in the embodiment of the present invention in combination with the lottery ticket hypothesis is described below.
Fig. 1 is a flowchart of a model pruning method according to an embodiment of the present invention, and as shown in fig. 1, the method includes the following steps:
101. Acquire random initialization parameters of the model and a plurality of training tasks for training the model.
102. Perform multiple rounds of training on the model by using training samples respectively corresponding to the multiple training tasks, to obtain a first model parameter and a second model parameter corresponding to the model after two different rounds of training, where the first model parameter and the second model parameter each comprise the parameter matrices of all layers in the model.
103. Determine a first mask matrix corresponding to a first parameter matrix of the target layer in the first model parameter and a second mask matrix corresponding to a second parameter matrix of the target layer in the second model parameter; and if the similarity between the first mask matrix and the second mask matrix is greater than a set threshold, prune the random initialization parameters according to the mask matrices respectively corresponding to the parameter matrices of each layer in the second model parameter, to obtain a third model parameter.
The model to be trained in the embodiment of the present invention may be a neural network model with a certain structure, such as a convolutional neural network model, a recurrent neural network model or a long short-term memory neural network model.
In an alternative embodiment, the model may be trained in the meta-learning manner. In that case, the meta-learning training process can be divided into two stages: a pre-training stage and a training stage for the target task. In the pre-training stage, the model is trained with training samples corresponding to a plurality of training tasks so as to obtain better initialization parameters; that is, the pre-training stage optimizes the model from the random initialization parameters to better initialization parameters. When the model is then trained for the target task, taking the better model initialization parameters as the starting point allows the model to converge quickly.
Therefore, when the model training is performed in the meta learning manner, after step 103, the following steps may be further performed:
104. Train the model based on the third model parameter to obtain a target initialization parameter of the model when the model is trained to convergence.
Generally, the plurality of training tasks and the target task are different tasks, but they have a certain similarity and follow a certain set probability distribution.
The scheme provided by the embodiment shown in fig. 1 corresponds to a pre-training stage of meta-learning, that is, the target initialization parameter is a better initialization parameter finally obtained in the pre-training stage.
However, it should be noted that, as described above, the model pruning scheme provided in the embodiment of the present invention can be applied not only to the meta-learning training process but also to other model training processes. In that case, the training samples used for training the model are not limited to those corresponding to multiple training tasks and may correspond to only one task (such as an image classification task). Since the model still undergoes multiple rounds of iterative training from randomly initialized parameters, the first model parameter and the second model parameter may be the model parameters after any two adjacent or non-adjacent iterations. The third model parameter corresponds to the overall model parameters at the time when the parameters of the target layer in the model tend to be stable. On this basis, executing the model pruning scheme means pruning the randomly initialized model parameters based on the mask matrices obtained when the parameters of the target layer tend to be stable, so as to obtain a sparse model parameter, namely the third model parameter. After the third model parameter is obtained, if the model has not yet converged, iterative training of the model can be continued based on the third model parameter.
It can be seen that, regardless of whether the meta-learning training manner is adopted, the execution of the model pruning scheme is similar, and steps similar to steps 101-103 above need to be executed. Therefore, for ease of understanding and description, the following description takes the meta-learning training manner as an example.
In practical application, a plurality of corresponding training tasks and training samples corresponding to the training tasks are collected according to the requirements of a new target task.
For example, if the target task is a classification task, the training tasks are classification tasks different from the target task. For example, the plurality of training tasks may respectively be: a task of classifying cats and dogs, a task of classifying dogs and wolves, and a task of classifying cats and tigers; the target task may be a task of classifying birds.
In the pre-training stage, the model parameters are first randomly initialized to obtain the random initialization parameters of the model. In practice, the model parameters form a tensor composed of a number of parameter matrices. The model referred to herein is a deep neural network model; structurally, it generally includes an input layer, a plurality of hidden layers and an output layer, and each layer has a corresponding parameter matrix, so the model parameters are actually formed by the parameter matrices corresponding to the layers. The process of model training is thus actually the process of determining the model parameters.
After the random initialization parameters of the model and the training tasks are obtained, the model is trained in multiple rounds based on the training samples corresponding to the training tasks, and the model parameters are updated after each round of training.
The meta-learning training process can be implemented with reference to the related prior art and is not described in detail in the embodiment of the present invention; briefly, it is as follows. Training samples of the plurality of training tasks are sampled; for each training task, starting from the same random initialization parameters, the test loss produced after the training samples of that task train the model is computed; the test losses corresponding to the training tasks are weighted and summed to obtain the total loss corresponding to the whole set of training tasks, and the model parameters are updated once based on the total loss. This is iterated many times.
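A rough sketch of such a training round is given below, written as a simplified first-order variant in PyTorch; the task dictionary keys ('support'/'query'), the equal task weighting and the hyper-parameter values are assumptions for illustration, not details prescribed by this application:

```python
import copy
import torch
import torch.nn.functional as F

def meta_training_round(model, tasks, inner_lr=0.01, meta_lr=0.001, inner_steps=1):
    """One meta-update: adapt a copy of the model on each task's support set,
    accumulate the test (query) losses, and apply a single update to the shared
    initialization (a simplified first-order sketch)."""
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]

    for task in tasks:
        adapted = copy.deepcopy(model)                 # every task starts from the same parameters
        inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                   # adaptation on the support set
            x, y = task["support"]
            inner_opt.zero_grad()
            F.cross_entropy(adapted(x), y).backward()
            inner_opt.step()

        xq, yq = task["query"]                         # test loss on the query set
        query_loss = F.cross_entropy(adapted(xq), yq)
        grads = torch.autograd.grad(query_loss, list(adapted.parameters()))
        for acc, g in zip(meta_grads, grads):
            acc += g / len(tasks)                      # equally weighted sum over tasks

    with torch.no_grad():                              # one update of the shared initialization
        for p, g in zip(model.parameters(), meta_grads):
            p -= meta_lr * g
    return model
```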
A round interval k for reading the model parameters may be preset; for example, k = 1 indicates that the model parameters are read after every round of training, and k = 3 indicates that the model parameters are read every three rounds.
The first model parameter and the second model parameter corresponding to the model after two different rounds of training are obtained according to the set value of k. For example, if k = 1, then starting from the random initialization parameters the model parameters are updated once after each round of training. The parameters after the first update may be taken as the first model parameter and the parameters after the second update as the second model parameter; the first mask matrix corresponding to the first parameter matrix of the target layer in the first model parameter and the second mask matrix corresponding to the second parameter matrix of the target layer in the second model parameter are then determined, and the similarity between the two mask matrices is computed. If the similarity is smaller than the set threshold, the third round of training is performed to obtain the parameters after the third update. At that point the parameters after the second update are taken as the first model parameter and the parameters after the third update as the second model parameter, their corresponding mask matrices and the similarity between them are computed again, and if the computed similarity is greater than the set threshold, training is suspended.
In summary, for any two adjacent sets of model parameters taken as the first model parameter and the second model parameter respectively, both contain the parameter matrices corresponding to each layer in the model, as described above. However, at first only the parameter matrices of the set target layer need to be extracted from the two sets of parameters; these are called the first parameter matrix and the second parameter matrix. The first mask matrix corresponding to the first parameter matrix and the second mask matrix corresponding to the second parameter matrix are determined, and whether the next round of training is performed is decided according to the similarity between the two mask matrices.
Determining the first mask matrix corresponding to the first parameter matrix of the target layer in the first model parameter may be implemented as: sorting the parameter values contained in the first parameter matrix of the target layer from large to small; determining the parameter values ranked in the last set proportion, or determining the parameter values smaller than a set value; and generating the first mask matrix according to the positions of the determined parameter values in the first parameter matrix.
The second mask matrix is generated in the same manner, which is not repeated. The first parameter matrix has the same dimensions as the second parameter matrix, so the first mask matrix and the second mask matrix also have the same dimensions, equal to those of the parameter matrices.
For example, assume that the first parameter matrix is a 10 × 10 matrix and its 100 parameter values are sorted from large to small. If the set proportion is 10%, the last 10 parameter values are determined and their corresponding target positions in the first parameter matrix are recorded. A first mask matrix with the same dimensions as the first parameter matrix is then generated, in which the element values at the target positions are set to 0 and the element values at the other positions are set to 1.
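A sketch of this mask-generation step is given below; ranking here is by absolute value, which is a common pruning convention and an assumption on top of the description above, and the 10% proportion matches the example:

```python
import numpy as np

def build_mask(param_matrix: np.ndarray, prune_ratio: float = 0.1) -> np.ndarray:
    """Return a 0/1 mask of the same shape: 0 for the positions of the
    `prune_ratio` smallest-magnitude parameter values (the tail of a
    large-to-small ordering), 1 everywhere else."""
    flat_mag = np.abs(param_matrix).ravel()
    num_pruned = int(flat_mag.size * prune_ratio)
    flat_mask = np.ones(flat_mag.size, dtype=np.int8)
    if num_pruned > 0:
        flat_mask[np.argsort(flat_mag)[:num_pruned]] = 0
    return flat_mask.reshape(param_matrix.shape)

# Example matching the text: a 10 x 10 matrix with the last 10% (10 values) masked.
first_parameter_matrix = np.random.randn(10, 10)
first_mask = build_mask(first_parameter_matrix, prune_ratio=0.10)
print(first_mask.sum())  # 90 positions kept
```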
After the first mask matrix and the second mask matrix are obtained, their similarity is computed. Optionally, a preset type of distance between the first mask matrix and the second mask matrix, such as the Euclidean distance or the Hamming distance, is determined as the similarity. Alternatively, the element values at the same positions in the two mask matrices may be compared, the number of positions at which the element values are identical is counted, and the ratio of this number to the size of the mask matrix is taken as the similarity of the two mask matrices. Or the sum of the differences of the element values at the same positions in the two mask matrices may be computed as their similarity.
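Two of the similarity measures mentioned above, sketched for 0/1 mask matrices (the function names are illustrative):

```python
import numpy as np

def hamming_distance(mask_a: np.ndarray, mask_b: np.ndarray) -> int:
    """Number of positions at which the two 0/1 mask matrices differ;
    a small distance corresponds to a high similarity."""
    return int((mask_a != mask_b).sum())

def agreement_ratio(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Fraction of positions with identical element values, i.e. the number of
    matching positions divided by the size of the mask matrix."""
    return float((mask_a == mask_b).mean())
```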
If the similarity between the first mask matrix and the second mask matrix is high, for example greater than the set threshold, the first parameter matrix of the target layer in the first model parameter and the second parameter matrix of the target layer in the second model parameter are considered highly similar, which indicates that training of the target layer has tended to be stable. Training may then be suspended, and the random initialization parameters are pruned using the mask matrices corresponding to the parameter matrices of each layer in the second model parameter at this time.
The mask matrix corresponding to the parameter matrix of each layer in the second model parameter is determined in the same manner as the second mask matrix corresponding to the second parameter matrix of the target layer, which is not repeated.
The pruning processing may be as follows: the mask matrix corresponding to the parameter matrix of a certain layer in the second model parameter is multiplied element-wise by the parameter matrix of the same layer in the random initialization parameters, so that the parameter values at the positions corresponding to the 0-valued elements of the mask matrix are set to 0 while the other parameter values remain unchanged.
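A sketch of this pruning step, assuming (purely for illustration) that both the random initialization parameters and the masks are kept as dictionaries keyed by layer name:

```python
import numpy as np

def prune_initialization(init_params: dict, masks: dict) -> dict:
    """Element-wise multiply each layer's randomly initialized parameter matrix
    by the mask derived from the same layer of the second model parameter:
    positions where the mask is 0 become 0, all other values stay unchanged."""
    return {layer: init_params[layer] * masks[layer] for layer in init_params}

# e.g. third_model_parameter = prune_initialization(random_init_params, layer_masks)
```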
The result of pruning the random initialization parameters is called the third model parameter. The parameter matrix of each layer in the third model parameter contains many 0-valued elements, so the third model parameter is sparse. Model training then continues from the third model parameter until the model converges, and the target initialization parameter of the model is obtained at convergence, completing the pre-training stage of the model.
It can be understood that, because the third model parameter is sparse, when subsequent meta-learning training is performed based on it, the 0-valued elements in each parameter matrix do not need to participate in the related computations of the subsequent training process, which reduces the computation overhead. The 0-valued elements also do not need to be stored; only the values of the non-zero elements and their positions in the parameter matrix need to be stored, which reduces the storage amount. Starting from the sparse third model parameter, the model can be trained to convergence more quickly, and the target initialization parameter obtained when the model is trained to convergence is also sparse, i.e., a sparse model is obtained, which in turn converges faster on the subsequent target task.
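One simple way to store only the non-zero values and their positions is a coordinate-style (COO) representation, sketched below; this is an illustrative choice, the application does not prescribe a storage format:

```python
import numpy as np

def to_sparse(matrix: np.ndarray):
    """Keep only the non-zero values and their (row, column) positions."""
    rows, cols = np.nonzero(matrix)
    return rows, cols, matrix[rows, cols], matrix.shape

def to_dense(rows, cols, values, shape) -> np.ndarray:
    """Rebuild the dense parameter matrix when it is needed for computation."""
    dense = np.zeros(shape, dtype=values.dtype)
    dense[rows, cols] = values
    return dense
```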
In fact, the theoretical basis of the above pruning idea is the lottery ticket hypothesis, according to which the original model (i.e., the model with the random initialization parameters) contains a sparse winning submodel from the start; pruning can therefore aim at finding this winning submodel. Meanwhile, the inventors found through research that, among all layers, the mask matrix corresponding to the parameter matrix of the first layer of the winning submodel is the most important, i.e., how the parameter matrix of the first layer is pruned plays a decisive role in the pruning of the whole model. Therefore, the goal of finding the winning submodel becomes finding the mask matrix corresponding to its first layer. Moreover, the mask matrix corresponding to the first layer stabilizes quickly during pre-training and is often fixed early in training, which makes it possible to reduce the computation cost of pre-training.
In summary, the embodiment of the present invention determines when to suspend training and obtain the final sparse initialized model, i.e., the final target initialization parameter of the model, by monitoring when the mask matrix corresponding to the first layer of the model becomes stable. On this basis, the target layer above may be the first layer; of course, in practical applications the target layer may also be several layers including the first layer, such as the first layer and the second layer, but the number of target layers is much smaller than the total number of layers of the model. For example, for a deep neural network model composed of an input layer, several hidden layers and an output layer, the first layer is the input layer.
To facilitate understanding of the pruning scheme provided by the above embodiment, the implementation of the above steps is described in simplified form with reference to fig. 2 so that the scheme can be understood more intuitively. As shown in fig. 2, the model parameters are first randomly initialized, and the randomly initialized deep neural network model is trained with a meta-learning algorithm. Let k denote the training round. In the k-th round, the first layer of the deep neural network model (i.e., the parameter matrix of the first layer) is pruned to obtain the mask matrix corresponding to the first layer, a 0/1 matrix m_k in which 0 marks the parameter positions to be pruned. The Hamming distance between the current mask matrix and the previous one, hamming(m_k, m_{k-1}), is computed, where m_{k-1} denotes the first-layer mask matrix computed in round k-1. When this distance is smaller than a given threshold (denoted a in the figure), training is stopped: the mask/pruning positions of the first layer of the model have stabilized. Then the whole model (i.e., the initial randomly initialized model) is pruned based on the mask matrices corresponding to the respective layers obtained in round k (including the first-layer mask matrix m_k), and a sparsified model is obtained. At this point the model has the fast-convergence property of a meta-learning model and is sparse, i.e., the amount of computation required for training is relatively small. Finally, the model is retrained (meta-learning retraining): with the sparse model as the starting point, training continues until convergence so as to restore the model performance and obtain the final target initialization parameter.
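Putting the pieces together, the flow of fig. 2 can be sketched as follows, reusing the illustrative build_mask and hamming_distance helpers from the earlier snippets and assuming a meta_step callable that performs one round of meta-learning training on a layer-name-keyed parameter dictionary; all names and default values here are assumptions for illustration:

```python
def meta_pruning_pretraining(init_params, train_tasks, meta_step,
                             build_mask, hamming_distance,
                             prune_ratio=0.1, distance_threshold=5,
                             max_rounds=1000, first_layer="layer_0"):
    """Monitor the first-layer mask across training rounds; once its pruning
    positions stabilize, prune the *random initialization* with the masks of
    the current round to obtain the sparse third model parameter."""
    random_init = {name: mat.copy() for name, mat in init_params.items()}
    params = init_params
    prev_mask = None
    for k in range(max_rounds):
        params = meta_step(params, train_tasks)      # one round of meta-learning training
        mask_k = build_mask(params[first_layer], prune_ratio)
        if prev_mask is not None and hamming_distance(prev_mask, mask_k) < distance_threshold:
            break                                    # first-layer pruning positions have stabilized
        prev_mask = mask_k

    # Masks of every layer in the current round, applied to the random initialization.
    masks = {name: build_mask(mat, prune_ratio) for name, mat in params.items()}
    sparse_init = {name: random_init[name] * masks[name] for name in random_init}

    # Meta-learning retraining would continue from sparse_init until convergence,
    # yielding the target initialization parameter.
    return sparse_init, masks
```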
As shown in fig. 3 and fig. 4, the pruning scheme provided by the embodiment of the present invention may be executed on the user terminal device side, or on a server or in the cloud. Because the pruning scheme is used in the model pre-training stage, the computation overhead and storage overhead can be greatly reduced, which makes it possible to complete model training on the user terminal device. Of course, if the pre-training stage is executed in the cloud or on a server, the model with the target initialization parameter obtained when the pre-training stage is completed may also be sent to the user terminal device, so that the user terminal device can continue to complete the training of the model under its local target task; at this point, training under the target task may be completed in a conventional machine learning manner.
In summary, whether the meta-learning pre-training stage is completed on the user terminal device or in the cloud, the user terminal device eventually obtains the model with the target initialization parameter, and the user terminal device can then train the model based on locally acquired training samples corresponding to the target task to obtain the target parameters of the model under the target task.
With this method, when the training samples corresponding to the target task collected on the user terminal device involve user privacy and security concerns, it can be ensured that the training sample data never leaves the user terminal device while the user's model training needs for the target task are still met; in addition, the computation and storage overhead of model training is effectively reduced.
The model pruning scheme provided by the embodiment of the present invention is applicable to any few-sample model training scenario, and can use the meta-learning manner to train a model for a target task at a lower computation cost.
As described above, the target task may be a classification task, such as an image classification task or a voice classification task. In a practical application, for example, a user is a photography amateur who often photographs landscapes, animals and people in daily life and needs to classify and store the photos. This produces an object classification task: identifying which of a plurality of classification labels the object in a photo belongs to, and storing the photos corresponding to the same classification label in one folder.
In this example scenario, a plurality of other classification tasks may be collected in advance as the plurality of training tasks, and a model with the target initialization parameter is obtained based on the model pruning scheme described in the foregoing embodiment. The user terminal device of the user further trains the model with the target initialization parameter using the training samples of the target classification task (which may be a number of locally stored photos labeled with classification labels), finally obtaining a model suitable for the target classification task. Afterwards, when the user takes new photos, the model can be invoked to classify and store them.
Besides a classification task, the target task may be a recommendation task, such as a recommendation task for items of a target category. An item may be a commodity, a movie, a song, a literary work, etc. Correspondingly, the plurality of training tasks for meta-learning training of the model may be recommendation tasks for items of a plurality of categories.
In the recommendation task scenario, the model to be trained may be called a recommendation model. Taking commodities as the items, as shown in fig. 5, the plurality of training tasks for meta-learning training of the recommendation model may include, for example, recommendation tasks for electronic products, clothing and food; the target task may be a recommendation task for cosmetics.
In the above example, the training samples corresponding to each training task may be obtained by collecting operation behaviors, such as purchasing, reviewing, following and adding to a shopping cart, of the same user (the user corresponding to the user terminal device that executes the complete model training process) or of a large number of users (whose purchase and other behavior records may be obtained when the pre-training stage of the model is executed by an e-commerce server) on the corresponding commodity category. For the target task, the training samples may be obtained by collecting the above kinds of operation behavior information of the user corresponding to the user terminal device on commodities of the corresponding category.
In the example scenario of the recommendation task, with the recommendation tasks of multiple item categories collected in advance and the model pruning scheme introduced in the above embodiment, a model with the target initialization parameter is obtained; this model has learned the preference characteristics of users with different profiles for items of different categories. The user terminal device of a certain user further trains the model with the target initialization parameter using the training samples of the target recommendation task, finally obtaining a model for the target recommendation task that is suitable for that user. The model ultimately runs on the user terminal device of that user and can provide the user with a recommendation service for items of the target category corresponding to the target recommendation task.
The above description takes only classification tasks and recommendation tasks as examples; the application scenarios of the solution provided by the embodiment of the present invention are not limited thereto.
The model pruning apparatus of one or more embodiments of the present invention is described in detail below. Those skilled in the art will appreciate that these apparatuses can be constructed using commercially available hardware components configured through the steps taught in this solution.
Fig. 6 is a schematic structural diagram of a model pruning apparatus according to an embodiment of the present invention. As shown in fig. 6, the apparatus includes: an obtaining module 11, a pre-training module 12 and a pruning module 13.
An obtaining module 11, configured to obtain random initialization parameters of a model and a plurality of training tasks for training the model.
The pre-training module 12 is configured to perform multiple rounds of training on the model by using training samples corresponding to the multiple training tasks, so as to obtain a first model parameter and a second model parameter corresponding to the model after two different rounds of training, where the first model parameter and the second model parameter both include parameter matrices of each layer in the model.
A pruning module 13, configured to determine a first mask matrix corresponding to a first parameter matrix of a target layer in the first model parameter and a second mask matrix corresponding to a second parameter matrix of the target layer in the second model parameter; and, if the similarity between the first mask matrix and the second mask matrix is greater than a set threshold, prune the random initialization parameters according to the mask matrices respectively corresponding to the parameter matrices of each layer in the second model parameter to obtain a third model parameter.
Optionally, the pre-training module 12 is further configured to train the model based on the third model parameter to obtain a target initialization parameter of the model when the model is trained to converge.
Optionally, the apparatus further comprises: and the target task training module is used for training the model based on a training sample which is locally acquired and corresponds to the target task and the target initialization parameter so as to obtain the target parameter of the model under the target task.
Optionally, the plurality of training tasks include a plurality of classification tasks, and the target task is a classification task different from the plurality of classification tasks.
Optionally, the plurality of training tasks include recommended tasks of a plurality of categories of items, the target task being a recommended task of a target category of items different from the plurality of categories of items; the training sample corresponding to the target task comprises the use information of the target category item by the user.
Optionally, the target layer is a first layer in the model.
Optionally, the pruning module 13 is specifically configured to: determine a distance of a preset type between the first mask matrix and the second mask matrix as the similarity.
Optionally, the pruning module 13 is specifically configured to: sort the parameter values contained in the first parameter matrix of the target layer from large to small; determine the parameter values ranked in the last set proportion, or determine the parameter values smaller than a set value; and generate the first mask matrix according to the positions of the determined parameter values in the first parameter matrix.
The apparatus shown in fig. 6 may perform the steps performed by the user terminal device in the foregoing embodiment, and the detailed performing process and technical effect refer to the description in the foregoing embodiment, which are not described herein again.
In one possible design, the structure of the model pruning apparatus shown in fig. 6 may be implemented as a user terminal device, such as a smart phone, a PC, a tablet computer, and so on. As shown in fig. 7, the user terminal device may include: a processor 21, a memory 22, and a communication interface 23. The memory 22 stores executable code which, when executed by the processor 21, causes the processor 21 to implement at least the model pruning method provided in the foregoing embodiments.
Fig. 8 is a schematic structural diagram of another user terminal device provided in this embodiment, and as shown in fig. 8, the user terminal device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operations of the user terminal device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the method steps described above as being performed by the user terminal device. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the user terminal device 800. Examples of such data include instructions for any application or method operating on user terminal device 800, contact data, phonebook data, messages, pictures, videos, and the like. The memory 804 may be implemented by any type or combination of volatile and non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power component 806 provides power to the various components of the user terminal device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the user terminal device 800.
The multimedia component 808 includes a screen providing an output interface between the user terminal device 800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. When the user terminal device 800 is in an operation mode, such as a photographing mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive an external audio signal when the user terminal apparatus 800 is in an operation mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The input/output interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 814 includes one or more sensors for providing various aspects of state assessment for the user terminal device 800. For example, the sensor component 814 may detect an open/closed state of the user terminal apparatus 800, a relative positioning of components such as a display and a keypad of the electronic apparatus 800, the sensor component 814 may detect a change in position of the user terminal apparatus 800 or a component of the user terminal apparatus 800, presence or absence of user contact with the user terminal apparatus 800, orientation or acceleration/deceleration of the user terminal apparatus 800, and a temperature change of the user terminal apparatus 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the user terminal device 800 and other devices. The user terminal device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the user terminal device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the user terminal device 800 to perform the above-described method is also provided. For example, the non-transitory computer-readable storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic or optical disk.
In addition, an embodiment of the present invention provides a non-transitory machine-readable storage medium having stored thereon executable code which, when executed by a processor of a user terminal device, causes the processor to implement at least the model pruning method provided in the foregoing embodiments.
The above-described apparatus embodiments are merely illustrative, and the units described as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by means of a necessary general hardware platform, and of course can also be implemented by a combination of hardware and software. Based on this understanding, the above technical solutions, in essence or in the part that contributes to the prior art, may be embodied in the form of a computer program product, which may be stored on one or more computer-usable storage media (including, without limitation, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A model pruning method, applied to a user terminal, the method comprising:
acquiring random initialization parameters of a model and a plurality of training tasks for training the model;
performing multiple rounds of training on the model by using training samples corresponding to the multiple training tasks respectively to obtain a first model parameter and a second model parameter corresponding to the model after two different rounds of training, wherein the first model parameter and the second model parameter both comprise parameter matrices of each layer in the model;
determining a first shielding matrix corresponding to a first parameter matrix of a target layer in the first model parameters and a second shielding matrix corresponding to a second parameter matrix of the target layer in the second model parameters;
and if the similarity between the first shielding matrix and the second shielding matrix is greater than a set threshold, pruning the random initialization parameters according to the shielding matrices corresponding to the parameter matrices of each layer in the second model parameters to obtain third model parameters.
2. The method of claim 1, further comprising:
and training the model based on the third model parameter to obtain a target initialization parameter of the model when the model is trained to convergence.
3. The method of claim 2, further comprising:
training the model based on a training sample corresponding to the target task and the target initialization parameter, so as to obtain the target parameter of the model under the target task.
4. The method of claim 3, wherein the plurality of training tasks comprise a plurality of classification tasks, and the target task is a classification task different from the plurality of classification tasks; or,
the plurality of training tasks comprise recommendation tasks for items of a plurality of categories, and the target task is a recommendation task for items of a target category different from the plurality of categories; the training sample corresponding to the target task comprises operation behavior information of the user on the items of the target category.
5. The method of claim 1, wherein the target layer is a first layer in the model.
6. The method of claim 1, further comprising:
determining a distance of a preset type between the first shielding matrix and the second shielding matrix as the similarity.
7. The method of claim 1, wherein determining a first shielding matrix corresponding to a first parameter matrix of a target layer in the first model parameters comprises:
sorting the parameter values contained in the first parameter matrix of the target layer from large to small;
determining the parameter values ranked in the last set proportion of the sorted order, or determining the parameter values smaller than a set value;
and generating the first shielding matrix according to the corresponding position of the determined parameter value in the first parameter matrix.
8. A model pruning device, comprising:
an acquisition module, configured to acquire random initialization parameters of a model and a plurality of training tasks for training the model;
a pre-training module, configured to perform multiple rounds of training on the model by using training samples corresponding to the plurality of training tasks respectively, to obtain a first model parameter and a second model parameter corresponding to the model after two different rounds of training, wherein the first model parameter and the second model parameter both comprise parameter matrices of each layer in the model;
a pruning module, configured to determine a first shielding matrix corresponding to a first parameter matrix of a target layer in the first model parameters, and a second shielding matrix corresponding to a second parameter matrix of the target layer in the second model parameters; and if the similarity between the first shielding matrix and the second shielding matrix is greater than a set threshold, pruning the random initialization parameters according to the shielding matrices corresponding to the parameter matrices of each layer in the second model parameters to obtain third model parameters.
9. A user terminal device, comprising: a memory, a processor, a communication interface; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform the model pruning method of any of claims 1 to 7.
10. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of a user terminal device, causes the processor to perform the model pruning method of any one of claims 1 to 7.
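To make the pruning flow recited in claims 1, 6 and 7 easier to follow, the following Python sketch gives one possible, non-authoritative reading of the claims. It assumes ranking by absolute parameter value when building the shielding (mask) matrices, a normalized Hamming distance as the 'distance of a preset type' of claim 6, one training round per training task, and the hypothetical helper names build_shielding_matrix, mask_similarity, prune_random_init and iterative_pruning; none of these specifics are fixed by the patent text.

```python
import numpy as np


def build_shielding_matrix(param_matrix, prune_ratio=0.2, set_value=None):
    """Claim 7 sketch: sort the layer's parameter values from large to small and
    shield (mask out) either the trailing set proportion or the values below a
    set value.  Ranking by absolute value is an assumption of this sketch."""
    magnitudes = np.abs(param_matrix)
    if set_value is None:
        sorted_desc = np.sort(magnitudes.ravel())[::-1]            # large -> small
        keep = int(round(sorted_desc.size * (1.0 - prune_ratio)))  # number kept
        cutoff = sorted_desc[max(keep - 1, 0)]
    else:
        cutoff = set_value
    # 1 keeps a parameter, 0 shields (prunes) it.
    return (magnitudes >= cutoff).astype(np.float32)


def mask_similarity(mask_a, mask_b):
    """Claim 6 sketch: similarity derived from a normalized Hamming distance;
    the patent only requires a 'distance of a preset type'."""
    return 1.0 - float(np.mean(mask_a != mask_b))


def prune_random_init(random_init, second_model_params, prune_ratio=0.2):
    """Last step of claim 1: prune the random initialization parameters with the
    shielding matrices of every layer of the second model parameters."""
    return {
        layer: init * build_shielding_matrix(second_model_params[layer], prune_ratio)
        for layer, init in random_init.items()
    }


def iterative_pruning(random_init, train_one_round, training_tasks, target_layer,
                      similarity_threshold=0.95, prune_ratio=0.2):
    """Overall flow of claim 1: train the model round by round over the training
    tasks, compare the target layer's shielding matrices obtained after two
    different rounds, and prune the random initialization once they are
    sufficiently similar.  train_one_round(params, task) is user-supplied."""
    params = {layer: matrix.copy() for layer, matrix in random_init.items()}
    previous_mask = None
    for task in training_tasks:
        params = train_one_round(params, task)  # one round of training on one task
        current_mask = build_shielding_matrix(params[target_layer], prune_ratio)
        stable = (previous_mask is not None and
                  mask_similarity(previous_mask, current_mask) > similarity_threshold)
        if stable:
            # "Third model parameters" in the sense of claim 1.
            return prune_random_init(random_init, params, prune_ratio)
        previous_mask = current_mask
    return None  # masks never stabilized within the available rounds (sketch only)
```

Under these assumptions, random_init is a dictionary mapping layer names to weight matrices and train_one_round is a user-supplied function performing one round of training on one task's samples; the returned pruned initialization corresponds to the third model parameters, which claim 2 would then retrain to convergence to obtain the target initialization parameters before fine-tuning on the target task as in claim 3.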
CN202210471355.3A 2022-04-28 2022-04-28 Model pruning method, device, equipment and storage medium Pending CN114943336A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210471355.3A CN114943336A (en) 2022-04-28 2022-04-28 Model pruning method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210471355.3A CN114943336A (en) 2022-04-28 2022-04-28 Model pruning method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114943336A true CN114943336A (en) 2022-08-26

Family

ID=82906749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210471355.3A Pending CN114943336A (en) 2022-04-28 2022-04-28 Model pruning method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114943336A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115186937A (en) * 2022-09-09 2022-10-14 闪捷信息科技有限公司 Prediction model training and data prediction method and device based on multi-party data cooperation

Similar Documents

Publication Publication Date Title
CN108197327B (en) Song recommendation method, device and storage medium
CN109543714B (en) Data feature acquisition method and device, electronic equipment and storage medium
WO2019184471A1 (en) Image tag determination method and device, and terminal
CN110598504B (en) Image recognition method and device, electronic equipment and storage medium
CN109389162B (en) Sample image screening technique and device, electronic equipment and storage medium
CN110781957A (en) Image processing method and device, electronic equipment and storage medium
WO2022166069A1 (en) Deep learning network determination method and apparatus, and electronic device and storage medium
US11816876B2 (en) Detection of moment of perception
CN111753895A (en) Data processing method, device and storage medium
CN111800445B (en) Message pushing method and device, storage medium and electronic equipment
CN111259967A (en) Image classification and neural network training method, device, equipment and storage medium
CN110889489A (en) Neural network training method, image recognition method and device
CN114943336A (en) Model pruning method, device, equipment and storage medium
CN112884040B (en) Training sample data optimization method, system, storage medium and electronic equipment
CN114168734A (en) Client event list classification method, device, equipment and storage medium
CN111797867A (en) System resource optimization method and device, storage medium and electronic equipment
CN111428806B (en) Image tag determining method and device, electronic equipment and storage medium
CN112801116B (en) Image feature extraction method and device, electronic equipment and storage medium
CN115393755A (en) Visual target tracking method, device, equipment and storage medium
CN111626398B (en) Operation method, device and related product
CN114677623A (en) Model training method, video processing method, computer device, and medium
CN114358097A (en) Intrusion detection method and device based on deep neural network DNN and readable storage medium
CN117034094B (en) Account type prediction method and account type prediction device
CN112149653A (en) Information processing method, information processing device, electronic equipment and storage medium
CN111753266A (en) User authentication method, multimedia content pushing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination