CN116090543A - Model compression method and device, computer readable medium and electronic equipment - Google Patents

Model compression method and device, computer readable medium and electronic equipment

Info

Publication number
CN116090543A
CN116090543A
Authority
CN
China
Prior art keywords
neural network
network model
pruning
model
initial neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310028847.XA
Other languages
Chinese (zh)
Inventor
樊欢欢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Oppo Communication Technology Co., Ltd.
Original Assignee
Xi'an Oppo Communication Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Oppo Communication Technology Co., Ltd.
Priority to CN202310028847.XA
Publication of CN116090543A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The disclosure provides a model compression method and device, a computer readable medium and electronic equipment, and relates to the technical field of artificial intelligence. The method comprises the following steps: acquiring a pre-trained initial neural network model; performing deep pruning on the initial neural network model to obtain an initial neural network model after deep pruning; performing width pruning on the initial neural network model after deep pruning to obtain an initial neural network model after width pruning; and determining the initial neural network model after width pruning that satisfies preset model compression parameters as a target neural network model. The method and device can effectively compress redundant structures in the initial neural network model, effectively reduce the model volume of the target neural network model while guaranteeing its performance and precision, and thereby broaden the scenarios to which the target neural network model can be applied.

Description

Model compression method and device, computer readable medium and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and in particular, to a model compression method, a model compression device, a computer readable medium, and an electronic apparatus.
Background
With continuing advances in science and technology, Deep Learning (DL) has developed rapidly. The objective of deep learning is to learn the internal laws and representation hierarchies of sample data, and its implementation generally relies on various types of neural network structures. As task complexity and data volume grow, the neural network structures in deep learning models become increasingly complex and often contain redundant networks; how to optimize these redundant networks in deep learning models is a major difficulty today.
At present, neural network optimization mainly adopts channel pruning schemes. However, the degree of compression this approach can achieve on a neural network model is limited, the resulting model still contains redundant networks, and the performance and precision of the pruned network cannot be effectively guaranteed.
Disclosure of Invention
The object of the present disclosure is to provide a model compression method, a model compression device, a computer-readable medium, and an electronic apparatus, thereby effectively reducing the model volume of a target neural network model while ensuring the performance and accuracy of the target neural network model.
According to a first aspect of the present disclosure, there is provided a model compression method, comprising:
acquiring a pre-trained initial neural network model;
performing deep pruning on the initial neural network model to obtain an initial neural network model after deep pruning;
performing width pruning on the initial neural network model after deep pruning to obtain an initial neural network model after width pruning;
and determining the initial neural network model after width pruning that satisfies preset model compression parameters as a target neural network model.
According to a second aspect of the present disclosure, there is provided a model compression apparatus comprising:
the model acquisition module is used for acquiring a pre-trained initial neural network model;
the deep pruning module is used for performing deep pruning on the initial neural network model to obtain an initial neural network model after deep pruning;
the width pruning module is used for performing width pruning on the initial neural network model after deep pruning to obtain an initial neural network model after width pruning;
and the model output module is used for determining the initial neural network model after width pruning that satisfies preset model compression parameters as a target neural network model.
According to a third aspect of the present disclosure, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements the method described above.
According to a fourth aspect of the present disclosure, there is provided an electronic apparatus, comprising:
a processor; and
and a memory for storing one or more programs that, when executed by the processor, cause the processor to implement the method described above.
According to the model compression method provided by the embodiments of the disclosure, a pre-trained initial neural network model can be obtained; deep pruning can be performed on the initial neural network model to obtain an initial neural network model after deep pruning; width pruning can then be performed on the initial neural network model after deep pruning to obtain an initial neural network model after width pruning; and finally, the initial neural network model after width pruning that satisfies the preset model compression parameters is determined as the target neural network model. On the one hand, performing deep pruning on the initial neural network model and then performing width pruning realizes compression of the neural network model from coarse granularity to fine granularity, increases the degree to which the initial neural network model is compressed, and effectively reduces the model volume of the target neural network model. On the other hand, compressing gradually from coarse granularity to fine granularity allows the performance and precision of the neural network model to be better controlled during compression, and the pruned neural network model is further verified against the preset model compression parameters, further improving the performance and precision of the output target neural network model.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort. In the drawings:
FIG. 1 illustrates a schematic diagram of an exemplary system architecture to which embodiments of the present disclosure may be applied;
FIG. 2 schematically illustrates a flow diagram of a model compression method in an exemplary embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow diagram for implementing deep pruning in an exemplary embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow diagram for determining an optimal sub-network by neural network structure search in an exemplary embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow diagram for implementing a width pruning process in an exemplary embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow chart of pruning weight channels of an optimal sub-network obtained by searching in an exemplary embodiment of the disclosure;
FIG. 7 schematically illustrates a composition diagram of a model compression apparatus in an exemplary embodiment of the present disclosure;
FIG. 8 shows a schematic diagram of an electronic device to which embodiments of the present disclosure may be applied.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
FIG. 1 illustrates a schematic diagram of a system architecture of an exemplary application environment in which a model compression method and apparatus of embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of the terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others. The terminal devices 101, 102, 103 may be a variety of electronic devices with artificial intelligence computing capabilities including, but not limited to, desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 105 may be a server cluster formed by a plurality of servers.
The model compression method provided by the embodiment of the present disclosure is generally performed in the server 105, and accordingly, the model compression device is generally disposed in the server 105. However, it will be readily understood by those skilled in the art that the model compression method provided in the embodiment of the present disclosure may be performed by the terminal devices 101, 102, 103, and accordingly, the model compression apparatus may be provided in the terminal devices 101, 102, 103, which is not particularly limited in the present exemplary embodiment.
In one related technical scheme, a model pruning method combining layer pruning and channel pruning is provided, namely: step 1, sparsification training; step 2, searching a hierarchical adaptive threshold; step 3, determining the sparsity of each layer according to the threshold; step 4, calculating a layer importance ranking; step 5, layer pruning; step 6, channel pruning; and step 7, fine tuning. This scheme mainly uses pruning to optimize a model in the convolutional-layer dimension and the weight-channel dimension, computing the importance of the convolutional layers and channels with a statistical algorithm and then pruning according to that importance. In practice, however, because a convolutional layer is a coarse-grained unit, statistical calculation has difficulty truly characterizing and measuring the importance of a layer, which often leads to incorrect pruning and harms model accuracy.
In another technical scheme, a neural network model compression method based on structure search and channel pruning is provided. For a given data set and task, a lightweight network with lower precision and a smaller parameter count and calculation amount than a conventional convolutional neural network is trained; each hierarchical structure in the network is assigned a scaling factor that measures its importance; the scaling factors are initialized with a large-variance Gaussian distribution, L1-norm regularization is applied, and sparse training is performed with a sub-gradient optimization algorithm; the hierarchies whose scaling factors are close to 0 are then cut, the model is fine-tuned, and the next round of compression pruning proceeds if the model returns to the baseline precision or falls within 5% of it; otherwise the procedure ends. This scheme mainly builds an initial model by network search and prunes on the basis of the searched model. The method generates a new network structure during model search, which requires a large amount of computation and time, cannot optimize an already-designed artificial network, has limited usage scenarios, and on many tasks the searched model struggles to reach the accuracy of a manually designed model.
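As an illustration of the sparse training step described above, one common way to drive scaling factors toward zero is to apply an L1 penalty to the BatchNorm scaling factors; the following is a minimal PyTorch sketch of that idea only, and is not asserted to be the exact procedure of the cited scheme:

```python
import torch
import torch.nn as nn

def bn_l1_penalty(model: nn.Module, lam: float = 1e-4) -> torch.Tensor:
    """L1 regularization on BatchNorm scaling factors (gamma); channels or
    hierarchies whose gamma is pushed toward zero become pruning candidates."""
    terms = [m.weight.abs().sum() for m in model.modules()
             if isinstance(m, nn.BatchNorm2d)]
    if not terms:
        return torch.tensor(0.0)
    return lam * torch.stack(terms).sum()

# During sparse training (illustrative): total_loss = task_loss + bn_l1_penalty(model)
```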
Neural Architecture Search (NAS) automates the manual process of tuning a neural network in order to find more capable architectures: a search strategy is used to test and evaluate a large number of network architectures in a search space, and the architecture that best meets the given objective is selected by maximizing a fitness function.
At present, the related art mainly compresses and optimizes models with either a pruning scheme or a NAS scheme. For an excellent manually trained model, current approaches only apply pruning, but the accuracy of the model output by a pruning scheme alone cannot be guaranteed. Meanwhile, models searched by NAS in the related art often perform worse, and even have lower precision, than manually designed models on many tasks, which limits NAS in practical applications; it is therefore difficult for the related art to compress a manually designed model while guaranteeing its precision.
Based on one or more of the problems in the related art, the present disclosure provides a model compression method. The model compression method and model compression apparatus of exemplary embodiments of the present disclosure are described in detail below, taking a server executing the method as an example.
Fig. 2 shows a flow chart of a model compression method in the present exemplary embodiment, which may include the following steps S210 to S240:
in step S210, a pre-trained initial neural network model is acquired.
In an exemplary embodiment, the initial neural network model is a convolutional neural network (Convolutional Neural Network, CNN) model that has been constructed and trained for a specified deep learning task; for example, the initial neural network model may be a convolutional neural network for a target detection task or a convolutional neural network for a speech recognition task, which is not particularly limited in this exemplary embodiment. In general, a manually designed and trained initial neural network model contains computational redundancy when processing a specific deep learning task and is often accompanied by a massive amount of calculation, yet it is constrained by the computing capability of the terminal device or server, so an initial neural network model with a large calculation amount is difficult to deploy in practice.
A convolutional neural network is a feedforward neural network that involves convolution calculations and has a deep structure; it has feature learning capability and can perform shift-invariant classification of input information according to its hierarchical structure. Convolutional neural networks can be built by imitating the biological mechanism of visual perception and support both supervised and unsupervised learning; the sharing of convolution kernel parameters within the hidden layers and the sparsity of inter-layer connections allow a convolutional neural network to learn grid-like features (such as pixels and audio) with a small amount of calculation.
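For illustration only, a toy convolutional network of the kind that could stand in for the pre-trained initial neural network model in the sketches below is shown here; the architecture is an assumption made for the example, not a model from this disclosure:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """A toy convolutional network standing in for a pre-trained initial model."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

model = TinyCNN()
logits = model(torch.randn(2, 3, 32, 32))  # output shape: (2, 10)
```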
In step S220, the initial neural network model is subjected to deep pruning, so as to obtain the initial neural network model after deep pruning.
In an exemplary embodiment, the deep pruning process refers to a process of pruning the deep level of the initial neural network model, and for the deep level of the neural network model, it generally refers to the number of layers of the convolutional layer of the model in the longitudinal direction.
The deep pruning processing can be a process of pruning the convolution layer in the initial neural network model, for example, whether the convolution layer is useful or not can be determined according to the calculated amount of the convolution layer in the initial neural network model, the useful convolution layer is reserved, and the convolution layer which does not participate in calculation is deleted, so that the pruning processing of the convolution layer in the initial neural network model is realized; of course, the deep pruning processing may also be a process of performing network search on the convolutional layer in the initial neural network model based on the neural network structure search NAS technology, and screening out the optimal sub-network, and the method for implementing the deep pruning processing in this example embodiment is not limited in any way.
Compared with the initial neural network model before deep pruning, the initial neural network model after deep pruning has fewer convolutional layers, which reduces the redundancy of the initial neural network model at a coarse granularity.
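The following is a minimal PyTorch sketch of the depth-pruning idea itself, assuming convolutional blocks that keep the channel count unchanged so that dropped blocks can simply be bypassed; the block and helper names are hypothetical and do not represent the patented implementation:

```python
import torch
import torch.nn as nn

class SkippableBlock(nn.Module):
    """A conv block that can be bypassed (depth-pruned) via a keep flag."""
    def __init__(self, channels: int, keep: bool = True):
        super().__init__()
        self.keep = keep
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # When the block is pruned, it acts as an identity mapping.
        return self.body(x) if self.keep else x

def apply_depth_pruning(blocks: nn.ModuleList, keep_mask: list) -> nn.Sequential:
    """Build a shallower model that keeps only the blocks flagged in keep_mask."""
    kept = [b for b, k in zip(blocks, keep_mask) if k]
    return nn.Sequential(*kept)

# Example: a 6-block backbone depth-pruned down to 4 blocks.
blocks = nn.ModuleList([SkippableBlock(32) for _ in range(6)])
pruned = apply_depth_pruning(blocks, [1, 1, 0, 1, 0, 1])
out = pruned(torch.randn(1, 32, 56, 56))
```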
In step S230, the width pruning process is performed on the initial neural network model after deep pruning, so as to obtain the initial neural network model after width pruning.
In an exemplary embodiment, the width pruning process refers to a process of pruning the width level of the initial neural network model, and for the width level of the neural network model, it generally refers to the number of channels of each convolutional layer of the model in the lateral direction.
The width pruning processing may be a process of pruning the convolution channels of the convolutional layers in the initial neural network model. For example, the convolution channels may be pruned according to the weights of the convolution channels of each convolutional layer, with more important convolution channels retained and unimportant convolution channels deleted, thereby pruning the convolution channels of each convolutional layer in the initial neural network model. Alternatively, the pruning rate of each convolutional layer may be determined by a preset search agent (a reinforcement learning network), and convolution channels of each convolutional layer may be deleted based on that pruning rate; the manner of implementing the width pruning process is not limited in this example embodiment.
Compared with the initial neural network model before the width pruning treatment, the initial neural network model after the width pruning effectively reduces the number of convolution channels of a convolution layer, reduces the redundancy of the initial neural network model on the fine granularity, and reduces the calculated amount of the initial neural network model.
In step S240, the initial neural network model after pruning the width that satisfies the preset model compression parameter is determined as the target neural network model.
In an exemplary embodiment, the preset model compression parameter is a condition parameter for verifying whether the initial neural network model after the width pruning meets the performance and accuracy requirements, for example, the preset model compression parameter may be a model execution efficiency threshold, an output result accuracy threshold, a model compression rate threshold, or the like, and the preset model compression parameter may be specifically and custom set according to an actual situation, and the specific expression form of the preset model compression parameter is not limited in any way in this exemplary embodiment.
The target neural network model is the neural network model obtained after model compression of the initial neural network model. Compared with the initial neural network model, the target neural network model has a simpler network structure and less redundancy, while its performance and precision remain within a certain range of those of the initial neural network model. In other words, the target neural network model can accomplish the same task as the pre-trained initial neural network model, but with less redundancy and lower computing requirements, which effectively broadens the scenarios to which the target neural network model can be applied.
The initial neural network model is subjected to deep pruning, and then the initial neural network model is subjected to width pruning, so that the compression of the neural network model from coarse granularity to fine granularity can be realized, the compression strength of the initial neural network model is improved, and the model volume of the target neural network model is effectively reduced; meanwhile, the performance and the precision of the neural network model can be better controlled in the compression process by gradually compressing from coarse granularity to fine granularity, and the pruned neural network model is further verified through preset model compression parameters, so that the performance and the precision of the output target neural network model are further improved.
Next, the technical contents in step S210 to step S240 will be described in detail.
In an exemplary embodiment, the preset model compression parameters may include a model execution efficiency threshold and an output result accuracy threshold, and specific parameters of the model execution efficiency threshold and the output result accuracy threshold may be set in a customized manner according to actual situations, which is not limited in any way in this exemplary embodiment.
Optionally, determining the initial neural network model after width pruning that satisfies the preset model compression parameters as the target neural network model may include the following steps: a verification operation may be performed on the initial neural network model after width pruning to determine its model execution efficiency and output result accuracy; if the model execution efficiency is greater than or equal to the model execution efficiency threshold and the output result accuracy is greater than or equal to the output result accuracy threshold, the initial neural network model after width pruning is determined as the target neural network model.
The verification operation refers to an operation of verifying the initial neural network model after width pruning with verification data: a preset verification data set may be input into the initial neural network model after width pruning to determine its model execution efficiency and output result accuracy.
When the model execution efficiency is detected to be greater than or equal to the model execution efficiency threshold and the accuracy of the output result is detected to be greater than or equal to the accuracy threshold of the output result, determining that the initial neural network model after pruning of the width of the round meets the requirements, and taking the initial neural network model as a final output target neural network model.
When it is detected that the model execution efficiency is smaller than the model execution efficiency threshold, or that the output result accuracy is smaller than the output result accuracy threshold, or both, the initial neural network model after width pruning in this round can be considered not to meet the requirements. Model compression, that is, deep pruning processing and width pruning processing, can then continue to be performed on the initial neural network model after width pruning in this round, until the model execution efficiency of the initial neural network model after width pruning is greater than or equal to the model execution efficiency threshold and the output result accuracy is greater than or equal to the output result accuracy threshold, at which point that model is taken as the finally output target neural network model.
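A minimal sketch of this verify-and-iterate loop is given below; `prune_depth`, `prune_width` and `evaluate` are placeholders for the concrete routines described in this disclosure, and the bound on the number of rounds is an added assumption:

```python
def compress_until_satisfied(model, efficiency_threshold, accuracy_threshold,
                             prune_depth, prune_width, evaluate, max_rounds=5):
    """Repeat deep pruning + width pruning until the compressed model meets
    both thresholds. `evaluate` is assumed to return
    (model_execution_efficiency, output_result_accuracy)."""
    for _ in range(max_rounds):
        model = prune_width(prune_depth(model))
        efficiency, accuracy = evaluate(model)
        if efficiency >= efficiency_threshold and accuracy >= accuracy_threshold:
            return model  # satisfies the preset model compression parameters
    return model  # fall back to the last candidate if thresholds are never met
```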
By setting the preset model compression parameters, the performance and the accuracy of the neural network model in the model compression process can be monitored, and the performance (model execution efficiency) and the accuracy (output result accuracy) of the finally output target neural network model are effectively ensured.
In an exemplary embodiment, the deep pruning of the initial neural network model may be achieved by the steps in FIG. 3:
step S310, constructing a neural network model search space based on the availability of the convolutional layers in the initial neural network model, the neural network model search space comprising 2^N sub-networks, wherein N is the number of convolutional layers;
step S320, carrying out random sampling and network training on the sub-network in the neural network model search space to obtain a trained sub-network model;
step S330, searching the sub-network in the sub-network model based on the evolution searching mode and the preset model compression parameters, determining a target sub-network, and taking the target sub-network as the initial neural network model after deep pruning.
Wherein the neural network model search space refers to a predefined space, defined based on the convolutional layers in the initial neural network model, for searching for and determining the optimal sub-network structure for completing the deep learning task; the neural network model search space may include 2^N sub-networks, where N may represent the number of convolutional layers in the initial neural network model.
The neural network model search space may be constructed based on the availability of the convolutional layers in the initial neural network model. For example, the calculation amount (i.e., the multiply-accumulate count) of a convolutional layer may be determined, and the availability of the convolutional layer may be determined based on that calculation amount; alternatively, the availability of the convolutional layer may be determined according to its convolution channels and their weights, which is not limited in this example embodiment.
Optionally, the calculated amount (Multiply Accumulate, MAC, i.e. the cumulative multiply-accumulate number) of each sub-network in the neural network model search space may be determined, and the sub-networks with calculated amounts greater than or equal to the preset calculated amount threshold may be randomly sampled and network trained to obtain a trained sub-network model.
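For illustration, the MAC-based candidate filtering described above can be sketched as follows; the helper names and the representation of a sub-network as a binary layer keep-mask are assumptions made for the example:

```python
def estimate_mask_macs(mask, per_layer_macs):
    """Estimate the total MAC count of a sub-network described by a binary
    layer keep-mask, given the per-layer MAC counts of the full model."""
    return sum(m for m, keep in zip(per_layer_macs, mask) if keep)

def filter_candidates(masks, per_layer_macs, mac_threshold):
    """Keep only sub-networks whose estimated MAC count is at least the
    preset threshold; these are the candidates to sample and train."""
    return [m for m in masks
            if estimate_mask_macs(m, per_layer_macs) >= mac_threshold]
```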
The evolutionary search algorithm is a heuristic search algorithm whose main components are mutation, recombination and selection (including selection of parents and selection of individuals to discard); with these mutation, recombination and selection modules, an evolutionary search framework can be constructed for different deep learning tasks. The sub-networks in the trained sub-network model can be searched based on the evolutionary search mode and the preset model compression parameters to determine a target sub-network, i.e., a sub-network that can accomplish the deep learning task while meeting the precision requirement and the compression rate requirement; the target sub-network is taken as the initial neural network model after deep pruning.
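The evolutionary search over the 2^N layer keep-masks can be illustrated with the following Python sketch, where `fitness_fn` is a placeholder that trains or evaluates a sub-network and returns a scalar combining accuracy and compression rate (higher is better); this is an illustrative composition of mutation, recombination and selection, not the exact algorithm of this disclosure:

```python
import random

def evolutionary_layer_search(n_layers, fitness_fn, population_size=20,
                              generations=10, mutate_prob=0.1):
    """Search the 2**n_layers space of layer keep-masks with a simple
    mutation / recombination / selection loop."""
    population = [[random.randint(0, 1) for _ in range(n_layers)]
                  for _ in range(population_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness_fn, reverse=True)
        parents = scored[:population_size // 2]          # selection
        children = []
        while len(children) < population_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_layers)
            child = a[:cut] + b[cut:]                    # recombination
            child = [1 - g if random.random() < mutate_prob else g
                     for g in child]                     # mutation
            children.append(child)
        population = parents + children
    return max(population, key=fitness_fn)

# Usage sketch: best_mask = evolutionary_layer_search(12, my_fitness_fn)
```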
By pruning the convolutional layers of the initial neural network model through neural network architecture search (NAS), the network complexity is reduced without reducing the accuracy of the initial neural network model, so the pre-trained initial neural network model obtains a better network structure and improved network performance.
Fig. 4 schematically illustrates a flowchart of determining an optimal sub-network by searching a neural network structure in an exemplary embodiment of the present disclosure.
Referring to fig. 4, in step S410, a pre-trained initial neural network model is input.
Step S420, a neural network model search space is constructed according to whether each convolutional layer is available. The neural network model search space may be constructed based on the availability of each convolutional layer in the initial neural network model and may include 2^N sub-networks, where N may represent the number of convolutional layers in the initial neural network model;
step S430, training the sub-networks in the neural network model search space. The sub-network with the calculated amount meeting a certain range in the searching space of the neural network model can be randomly sampled and trained, and a trained sub-network model can be obtained;
Step S440, network searching is performed according to the precision and the compression rate index. The sub-networks in the search space of the neural network model can be searched by adopting an evolution search algorithm with the precision and the compression rate index as references, so that pruning of the convolution layer of the initial neural network model is realized;
and step S450, obtaining the optimal sub-network meeting the precision and compression rate indexes. And the optimal sub-network obtained by searching can be used as an initial neural network model after deep pruning.
When a manually designed neural network model already exists, using neural network architecture search (NAS) to search for a neural network model from scratch is relatively time-consuming, may not reach the accuracy of the manual model, and may introduce new network structures, which affects the efficiency of model design and training. Meanwhile, the manually designed model carries considerable redundancy, high power consumption and long running time, which limits the application of the neural network model on low-compute platforms such as mobile phone platforms. In this embodiment, the NAS search method is used to perform convolutional-layer pruning optimization on an existing trained neural network model, improving the performance of the model while guaranteeing its accuracy.
In an exemplary embodiment, the width pruning processing of the initial neural network model after deep pruning may be implemented through the steps in fig. 5, and referring to fig. 5, the method may specifically include:
Step S510, calculating the performance parameters of the convolution kernels of all the sub-networks in the initial neural network model after deep pruning;
and step S520, pruning treatment is carried out on the convolution channels of all convolution layers in the initial neural network model after deep pruning according to the performance parameters, and the initial neural network model after width pruning is obtained.
The performance parameter refers to a relevant parameter measuring the performance of the convolution kernel of each sub-network. For example, the performance parameter may be the convolution calculation amount of the convolution kernel, i.e., the multiply-accumulate count (MAC), or the weight parameter amount; for example, the weight parameter amount may be expressed as I x O x K, where I may represent the number of input channels, O may represent the number of output channels, and K may represent the size of the convolution kernel. Of course, the performance parameters of the convolution kernels of the respective sub-networks may also be expressed by other parameters, which is not particularly limited in the present exemplary embodiment.
Alternatively, the performance parameters of the convolution kernels of the sub-networks in the initial neural network model after deep pruning may be calculated as follows: the convolution calculation amount and the weight parameter amount of the convolution kernels of each sub-network in the initial neural network model after deep pruning are calculated, and the performance parameter of a convolution kernel is determined from the ratio of its convolution calculation amount to its weight parameter amount. The larger this ratio, the larger the calculation amount of the convolution kernel relative to its parameter amount and the better its performance, so the convolution channel corresponding to that convolution kernel can be retained; when the ratio of the convolution calculation amount to the weight parameter amount is smaller than a certain threshold, the convolution kernel can be regarded as performing poorly, and the convolution channel corresponding to that convolution kernel can be pruned.
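A minimal PyTorch sketch of computing the MAC/Weight ratio for every convolution via forward hooks is given below; counting the full kernel (I x O x K x K) for the weight parameter amount and using a random input to obtain the output sizes are assumptions made for illustration:

```python
import torch
import torch.nn as nn

def conv_mac_weight_ratios(model: nn.Module, input_size=(1, 3, 224, 224)):
    """Compute, for every Conv2d, the ratio of its multiply-accumulate count
    (MAC) to its weight parameter amount."""
    ratios = {}
    hooks = []

    def make_hook(name, module):
        def hook(_, __, output):
            weight_params = module.weight.numel()   # roughly I * O * K * K
            out_h, out_w = output.shape[-2:]
            mac = weight_params * out_h * out_w     # MACs for this layer
            ratios[name] = mac / weight_params
        return hook

    for name, m in model.named_modules():
        if isinstance(m, nn.Conv2d):
            hooks.append(m.register_forward_hook(make_hook(name, m)))
    with torch.no_grad():
        model(torch.randn(*input_size))
    for h in hooks:
        h.remove()
    return ratios
```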
Optionally, the absolute values of the weights of each convolutional layer can be calculated per convolution channel and sorted, and the convolution channels of each convolutional layer in the initial neural network model after deep pruning can then be pruned according to the performance parameters, to obtain the initial neural network model after width pruning.
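For the per-channel weight-magnitude ranking, a minimal PyTorch sketch is shown below; using the sum of absolute weights per output channel as the importance score and the keep ratio value are illustrative assumptions:

```python
import torch
import torch.nn as nn

def channel_importance(conv: nn.Conv2d) -> torch.Tensor:
    """Sum of absolute weights per output channel, used as a simple
    importance score for width (channel) pruning."""
    return conv.weight.detach().abs().sum(dim=(1, 2, 3))

def select_kept_channels(conv: nn.Conv2d, keep_ratio: float = 0.75):
    """Return indices of the output channels to keep, sorted by importance."""
    scores = channel_importance(conv)
    n_keep = max(1, int(round(keep_ratio * scores.numel())))
    return torch.argsort(scores, descending=True)[:n_keep]

# Usage sketch: kept = select_kept_channels(some_conv_layer, keep_ratio=0.5)
```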
By further pruning the convolution channels of the initial neural network model after its convolutional layers have been pruned through neural network architecture search (NAS), the redundancy of the channel dimension in the initial neural network model is further removed and the initial neural network model is optimized to the greatest extent, while its performance and precision can still be guaranteed.
Optionally, when pruning is performed on the convolution channels of the initial neural network model, a neural network model search space can be constructed, and then the best convolution channel is searched and determined through the neural network model search space, so that pruning of the convolution channel is realized.
In an exemplary embodiment, after obtaining the initial neural network model after the width pruning, which meets the preset model compression parameters, a training data set may be obtained, and fine tuning training is performed on the initial neural network model after the width pruning through the training data set, so as to further improve the accuracy of the initial neural network model after the width pruning, and obtain the final output target neural network model.
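A brief fine-tuning loop of the kind described above might look like the following sketch; the optimizer, loss function and hyperparameters are illustrative assumptions, not values specified by this disclosure:

```python
import torch
import torch.nn as nn

def finetune(model, train_loader, epochs: int = 3, lr: float = 1e-4):
    """Fine-tune the pruned model on the training data set to recover accuracy."""
    model.train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, targets in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()
    return model
```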
Fig. 6 schematically illustrates a flow chart of pruning a weight channel for an optimal sub-network obtained by searching in an exemplary embodiment of the disclosure.
Referring to fig. 6, in step S610, the searched optimal sub-network, that is, the initial neural network model after deep pruning is input;
in step S620, the ratio of the convolution calculation amount to the weight parameter amount of each convolution is calculated. After the optimal sub-network produced by NAS convolutional-layer pruning is obtained, the MAC calculation amount and the Weight parameter amount of each convolution in the network are calculated, where the Weight parameter amount may be expressed in terms of I (the input channels), O (the output channels) and K (the size of the convolution kernel); the ratio of the convolution calculation amount to the weight parameter amount of each convolution may then be expressed as MAC/Weight;
in step S630, the absolute values of the weights of each convolution are calculated and sorted: the absolute values of the weights of each convolutional layer are calculated per channel, and these absolute values are sorted;
step S640, selecting pruning channels according to the ratio of the convolution calculated quantity and the weight parameter quantity of each convolution. The convolved channels can be screened according to the size of the MAC/Weight, if the size is smaller than the ratio, the convolved channels are cut off, the larger the MAC/Weight is, the larger the calculated amount is, the number of parameters is small, and the larger the MAC/Weight is, and the reserved parameters are needed;
Step S650, performance and accuracy verification. The performance and precision of the initial neural network model with pruned channels can be verified; if the performance and precision requirements are met, the target neural network model is determined to have been obtained, and if they are not met, the process returns to step S610;
step S660, fine tuning training. The training data set can be obtained, and the finally obtained compressed target neural network model is subjected to fine tuning training through the training data set, so that the accuracy of the target neural network model is further improved.
In summary, in this exemplary embodiment, a pre-trained initial neural network model may be obtained; deep pruning may be performed on the initial neural network model to obtain an initial neural network model after deep pruning; width pruning may then be performed on the initial neural network model after deep pruning to obtain an initial neural network model after width pruning; and finally the initial neural network model after width pruning that satisfies the preset model compression parameters is determined as the target neural network model. On the one hand, performing deep pruning on the initial neural network model and then performing width pruning realizes compression of the neural network model from coarse granularity to fine granularity, increases the degree to which the initial neural network model is compressed, and effectively reduces the model volume of the target neural network model. On the other hand, compressing gradually from coarse granularity to fine granularity allows the performance and precision of the neural network model to be better controlled during compression, and the pruned neural network model is further verified against the preset model compression parameters, further improving the performance and precision of the output target neural network model.
In the exemplary embodiment, the network model is compressed from a thick model to a thin model by combining NAS convolution layer pruning and weight channel pruning, so that the problems of high power consumption and high time consumption of a model trained through a pre-training process in practical application are solved, and the pre-trained neural network model can be applied to hardware platforms with different performances, such as compression of semantic segmentation models in portrait blurring projects.
Existing portrait blurring projects are mainly applied to mobile phones with high-performance hardware; for low-end phone platforms, hardware performance limits currently prevent full application, and the semantic segmentation algorithm in the blurring technique involves a larger model with higher precision requirements, which restricts wide deployment of such projects. In this example embodiment, the convolutional layers are pruned through NAS search, so the complexity of the network is reduced without reducing model accuracy, and the pre-trained network model obtains a better network structure and improved performance. After NAS layer pruning is completed, the model is further pruned along the weight channels, further removing redundancy in the channel dimension and achieving maximum optimization of the model. For users, a model optimized by the model compression method provided by the embodiments of the present disclosure can make mobile phone functions run more smoothly, especially the many camera functions that rely on deep learning, and mid- and low-end phone models can also support various practical functions, effectively avoiding stalls and waiting during phone use and improving the user experience.
It is noted that the above-described figures are merely schematic illustrations of processes involved in a method according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
Further, referring to fig. 7, this exemplary embodiment further provides a model compression apparatus 700, which includes a model acquisition module 710, a deep pruning module 720, a width pruning module 730 and a model output module 740. Wherein:
the model acquisition module 710 is configured to acquire a pre-trained initial neural network model;
the deep pruning module 720 is configured to perform deep pruning on the initial neural network model to obtain a deep pruned initial neural network model;
the width pruning module 730 is configured to perform width pruning on the initial neural network model after deep pruning to obtain an initial neural network model after width pruning;
the model output module 740 is configured to determine, as a target neural network model, the initial neural network model after pruning the width that satisfies the preset model compression parameter.
In an exemplary embodiment, the preset model compression parameters may include a model execution efficiency threshold and an output result accuracy threshold; the model output module 740 may be configured to:
performing verification operation on the initial neural network model after the width pruning, and determining the model execution efficiency and the output result accuracy of the initial neural network model after the width pruning;
and if the model execution efficiency is greater than or equal to the model execution efficiency threshold and the output result accuracy is greater than or equal to the output result accuracy threshold, determining the initial neural network model after the width pruning as a target neural network model.
In an exemplary embodiment, the deep pruning module 720 may be configured to:
constructing a neural network model search space based on the availability of the convolutional layers in the initial neural network model, wherein the neural network model search space comprises 2^N sub-networks, where N is the number of convolutional layers;
randomly sampling and training the sub-network in the neural network model search space to obtain a trained sub-network model;
searching the sub-network in the sub-network model based on the evolution searching mode and the preset model compression parameters, determining a target sub-network, and taking the target sub-network as an initial neural network model after deep pruning.
In an exemplary embodiment, the deep pruning module 720 may be configured to:
determining the calculated amount of each sub-network in the neural network model search space;
and carrying out random sampling and network training on the sub-network with the calculated amount being greater than or equal to a preset calculated amount threshold value to obtain a trained sub-network model.
In an exemplary embodiment, the width pruning module 730 may be configured to:
calculating the performance parameters of convolution kernels of all sub-networks in the initial neural network model after deep pruning;
and pruning treatment is carried out on convolution channels of all convolution layers in the initial neural network model after deep pruning according to the performance parameters, so that the initial neural network model after width pruning is obtained.
In an exemplary embodiment, the width pruning module 730 may be configured to:
calculating the convolution calculated quantity and the weight parameter quantity of the convolution kernels of all the sub-networks in the initial neural network model after the deep pruning;
and determining the performance parameters of the convolution kernel according to the convolution calculated quantity and the ratio of the weight parameter quantity.
In an exemplary embodiment, the model output module 740 may be configured to:
acquiring a training data set;
and performing fine-tuning training, through the training data set, on the initial neural network model after width pruning that satisfies the preset model compression parameters, to obtain the target neural network model.
The specific details of each module in the above apparatus are already described in the method section, and the details that are not disclosed can be referred to the embodiment of the method section, so that they will not be described in detail.
Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, various aspects of the disclosure may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module" or "system."
Exemplary embodiments of the present disclosure also provide an electronic device. The electronic devices may be the above-described terminal devices 101, 102, 103 and server 105. In general, the electronic device may include a processor and a memory for storing executable instructions of the processor, the processor being configured to perform the above-described model compression method via execution of the executable instructions.
The configuration of the electronic device will be exemplarily described below taking the mobile terminal 800 in fig. 8 as an example. It will be appreciated by those skilled in the art that the configuration of fig. 8 can be applied to stationary type devices in addition to components specifically for mobile purposes.
As shown in fig. 8, the mobile terminal 800 may specifically include: processor 801, memory 802, bus 803, mobile communication module 804, antenna 1, wireless communication module 805, antenna 2, display 806, camera module 807, audio module 808, power module 809, and sensor module 810.
The processor 801 may include one or more processing units, for example: the processor 801 may include an AP (Application Processor), a modem processor, a GPU (Graphics Processing Unit), an ISP (Image Signal Processor), a controller, an encoder, a decoder, a DSP (Digital Signal Processor), a baseband processor, and/or an NPU (Neural-Network Processing Unit), etc. The model compression method in the present exemplary embodiment may be performed by the AP, GPU, or DSP, and may be performed by the NPU when the method involves neural network related processing; for example, the NPU may load neural network parameters and execute neural network related algorithm instructions.
An encoder may encode (i.e., compress) an image or video to reduce the data size for storage or transmission. The decoder may decode (i.e., decompress) the encoded data of the image or video to recover the image or video data. The mobile terminal 800 may support one or more encoders and decoders, for example: image formats such as JPEG (Joint Photographic Experts Group), PNG (Portable Network Graphics) and BMP (Bitmap), and video formats such as MPEG (Moving Picture Experts Group) 1, MPEG2, H.263, H.264, and HEVC (High Efficiency Video Coding).
The processor 801 may form a connection with the memory 802 or other components through a bus 803.
Memory 802 may be used to store computer-executable program code that includes instructions. The processor 801 performs various functional applications and data processing of the mobile terminal 800 by executing instructions stored in the memory 802. The memory 802 may also store application data, such as files that store images, videos, and the like.
The communication functions of the mobile terminal 800 may be implemented by the mobile communication module 804, the antenna 1, the wireless communication module 805, the antenna 2, a modem processor, a baseband processor, and the like. The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. The mobile communication module 804 may provide a mobile communication solution of 3G, 4G, 5G, etc. applied on the mobile terminal 800. The wireless communication module 805 may provide wireless communication solutions for wireless local area networks, bluetooth, near field communications, etc. that are applied on the mobile terminal 800.
The display screen 806 is used to implement display functions such as displaying user interfaces, images, video, and the like. The image capturing module 807 is configured to perform capturing functions, such as capturing images, videos, and the like. The audio module 808 is used to implement audio functions such as playing audio, capturing speech, etc. The power module 809 is used to implement power management functions such as charging the battery, powering the device, monitoring the battery status, etc.
The sensor module 810 may include one or more sensors for implementing corresponding sensing functionality. For example, the sensor module 810 may include an inertial sensor for detecting a motion pose of the mobile terminal 800, outputting inertial sensing data.
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification. In some possible implementations, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the disclosure as described in the "exemplary methods" section of this specification, when the program product is run on the terminal device.
It should be noted that the computer readable medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Furthermore, the program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of model compression, comprising:
acquiring a pre-trained initial neural network model;
performing deep pruning on the initial neural network model to obtain a deep-pruned initial neural network model;
performing width pruning on the deep-pruned initial neural network model to obtain a width-pruned initial neural network model; and
determining the width-pruned initial neural network model that satisfies preset model compression parameters as a target neural network model.
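By way of non-limiting illustration, the following Python sketch shows one possible orchestration of the flow recited in claim 1. The pruning and evaluation callables are supplied by the caller; their names and signatures are assumptions introduced here for exposition and are not part of the claimed method.

```python
from typing import Callable, Tuple

def compress_model(
    initial_model,
    deep_prune: Callable,                          # depth-level pruning step (cf. claim 3)
    width_prune: Callable,                         # channel-level pruning step (cf. claim 5)
    evaluate: Callable[..., Tuple[float, float]],  # returns (execution efficiency, accuracy)
    efficiency_threshold: float,
    accuracy_threshold: float,
):
    """One possible orchestration of the claimed compression flow (illustrative only)."""
    deep_pruned = deep_prune(initial_model)        # deep pruning of the initial model
    width_pruned = width_prune(deep_pruned)        # width pruning of the deep-pruned model
    efficiency, accuracy = evaluate(width_pruned)  # verification against the preset parameters
    if efficiency >= efficiency_threshold and accuracy >= accuracy_threshold:
        return width_pruned                        # accepted as the target neural network model
    return None                                    # preset model compression parameters not met
```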
2. The method of claim 1, wherein the preset model compression parameters comprise a model execution efficiency threshold and an output result accuracy threshold, and wherein determining the width-pruned initial neural network model that satisfies the preset model compression parameters as the target neural network model comprises:
performing a verification operation on the width-pruned initial neural network model to determine a model execution efficiency and an output result accuracy of the width-pruned initial neural network model; and
determining the width-pruned initial neural network model as the target neural network model if the model execution efficiency is greater than or equal to the model execution efficiency threshold and the output result accuracy is greater than or equal to the output result accuracy threshold.
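Purely as an illustrative assumption, the verification operation of claim 2 may be realized with PyTorch as below, approximating model execution efficiency by inference throughput (samples per second) and output result accuracy by top-1 classification accuracy; the claim itself does not fix these particular metrics.

```python
import time
import torch

@torch.no_grad()
def verify(model, data_loader, device="cpu"):
    """Return (execution efficiency, output accuracy) of a width-pruned model (illustrative)."""
    model.eval()
    model.to(device)
    correct, total, elapsed = 0, 0, 0.0
    for inputs, labels in data_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        start = time.perf_counter()
        outputs = model(inputs)               # forward pass only; timing excludes data loading
        elapsed += time.perf_counter() - start
        correct += (outputs.argmax(dim=1) == labels).sum().item()
        total += labels.size(0)
    efficiency = total / max(elapsed, 1e-9)   # samples processed per second
    accuracy = correct / max(total, 1)        # top-1 accuracy in [0, 1]
    return efficiency, accuracy
```

The returned pair can then be compared against the model execution efficiency threshold and the output result accuracy threshold to decide whether the width-pruned model becomes the target model.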
3. The method of claim 1, wherein performing deep pruning on the initial neural network model to obtain the deep-pruned initial neural network model comprises:
constructing a neural network model search space based on the availability of each convolutional layer in the initial neural network model, wherein the neural network model search space comprises 2^N sub-networks and N is the number of convolutional layers;
randomly sampling and training the sub-networks in the neural network model search space to obtain a trained sub-network model; and
searching the sub-networks in the trained sub-network model based on an evolutionary search and the preset model compression parameters to determine a target sub-network, and taking the target sub-network as the deep-pruned initial neural network model.
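A minimal sketch of the deep pruning search is given below, under the following assumptions: each sub-network in the 2^N search space is encoded as a binary mask over the N convolutional layers, the evolutionary search uses mutation only, and the `fitness` callable (which scores a sub-network against the preset model compression parameters) is a hypothetical helper supplied by the caller. Weight-sharing supernet training of the sampled sub-networks is omitted for brevity.

```python
import random
from typing import Callable, List, Tuple

Mask = Tuple[int, ...]   # one bit per convolutional layer: 1 = keep the layer, 0 = skip it

def sample_masks(num_layers: int, num_samples: int) -> List[Mask]:
    """Uniformly sample sub-networks from the 2**num_layers search space."""
    return [tuple(random.randint(0, 1) for _ in range(num_layers))
            for _ in range(num_samples)]

def evolutionary_search(
    num_layers: int,
    fitness: Callable[[Mask], float],   # assumed scorer against the preset compression parameters
    population_size: int = 16,
    generations: int = 20,
    mutation_rate: float = 0.1,
) -> Mask:
    """Mutation-only evolutionary search over layer-keep masks (illustrative only)."""
    population = sample_masks(num_layers, population_size)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: population_size // 2]           # keep the fitter half
        children = [
            tuple(bit if random.random() > mutation_rate else 1 - bit for bit in parent)
            for parent in parents
        ]
        population = parents + children
    return max(population, key=fitness)                    # the target sub-network
```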
4. The method of claim 3, wherein randomly sampling and training the sub-networks in the neural network model search space to obtain the trained sub-network model comprises:
determining a computation amount of each sub-network in the neural network model search space; and
randomly sampling and training the sub-networks whose computation amount is greater than or equal to a preset computation amount threshold, to obtain the trained sub-network model.
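The computation-amount filter of claim 4 can be sketched as follows; the per-layer shape tuples and the FLOPs approximation (multiply-accumulate count of a standard convolution) are assumptions introduced for illustration.

```python
def conv_flops(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> int:
    """Approximate multiply-accumulate count of one convolutional layer."""
    return c_in * c_out * k * k * h_out * w_out

def subnet_flops(mask, layer_specs) -> int:
    """Computation amount of a sub-network: the FLOPs of the layers its mask keeps.
    layer_specs is a list of (c_in, c_out, k, h_out, w_out) tuples (assumed shapes)."""
    return sum(conv_flops(*spec) for bit, spec in zip(mask, layer_specs) if bit)

def filter_by_computation(masks, layer_specs, flops_threshold: int):
    """Keep only sub-networks whose computation amount reaches the preset threshold,
    so that only those candidates are randomly sampled and trained."""
    return [m for m in masks if subnet_flops(m, layer_specs) >= flops_threshold]
```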
5. The method of claim 1, wherein performing width pruning on the deep-pruned initial neural network model to obtain the width-pruned initial neural network model comprises:
calculating performance parameters of the convolution kernels of each sub-network in the deep-pruned initial neural network model; and
pruning the convolution channels of each convolutional layer in the deep-pruned initial neural network model according to the performance parameters, to obtain the width-pruned initial neural network model.
6. The method of claim 5, wherein calculating the performance parameters of the convolution kernels of each sub-network in the deep-pruned initial neural network model comprises:
calculating a convolution computation amount and a weight parameter amount of the convolution kernels of each sub-network in the deep-pruned initial neural network model; and
determining the performance parameters of the convolution kernels according to the ratio of the convolution computation amount to the weight parameter amount.
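One possible reading of claims 5 and 6 is sketched below: the performance parameter of a convolution kernel is taken as the ratio of its convolution computation amount to its weight parameter amount (which, for a standard convolution, reduces to the spatial size of the output feature map), and channels are then pruned layer by layer. The L1-magnitude ranking used to pick which channels to drop is an assumed criterion, not one recited in the claims, and downstream layers whose input channels shrink would also have to be adjusted.

```python
import torch
import torch.nn as nn

def kernel_performance(conv: nn.Conv2d, out_h: int, out_w: int) -> float:
    """Performance parameter: convolution computation amount / weight parameter amount."""
    k_h, k_w = conv.kernel_size
    params = conv.in_channels * conv.out_channels * k_h * k_w   # groups/dilation ignored
    flops = params * out_h * out_w        # each weight is applied at every output position
    return flops / max(params, 1)         # reduces to out_h * out_w for a standard convolution

def prune_channels(conv: nn.Conv2d, keep_ratio: float) -> nn.Conv2d:
    """Prune output channels of a convolution, keeping the highest-magnitude ones."""
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    importance = conv.weight.detach().abs().sum(dim=(1, 2, 3))  # one score per output channel
    keep_idx = torch.argsort(importance, descending=True)[:n_keep]
    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep_idx].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep_idx].clone()
    return pruned
```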
7. The method of claim 1, wherein determining the width-pruned initial neural network model that satisfies the preset model compression parameters as the target neural network model comprises:
acquiring a training data set; and
performing fine-tuning training, through the training data set, on the width-pruned initial neural network model that satisfies the preset model compression parameters, to obtain the target neural network model.
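A minimal fine-tuning sketch, assuming a PyTorch model and a labeled classification data set; the optimizer, learning rate, and epoch count are illustrative choices, not features of the claim.

```python
import torch
import torch.nn as nn

def fine_tune(model, train_loader, epochs: int = 3, lr: float = 1e-4, device: str = "cpu"):
    """Briefly fine-tune the width-pruned model on the training data set (illustrative)."""
    model.train()
    model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()
            optimizer.step()
    return model   # the target neural network model after fine tuning
```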
8. A model compression apparatus, characterized by comprising:
a model acquisition module configured to acquire a pre-trained initial neural network model;
a deep pruning module configured to perform deep pruning on the initial neural network model to obtain a deep-pruned initial neural network model;
a width pruning module configured to perform width pruning on the deep-pruned initial neural network model to obtain a width-pruned initial neural network model; and
a model output module configured to determine the width-pruned initial neural network model that satisfies preset model compression parameters as a target neural network model.
9. A computer-readable medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any one of claims 1 to 7 via execution of the executable instructions.
CN202310028847.XA 2023-01-09 2023-01-09 Model compression method and device, computer readable medium and electronic equipment Pending CN116090543A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310028847.XA CN116090543A (en) 2023-01-09 2023-01-09 Model compression method and device, computer readable medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310028847.XA CN116090543A (en) 2023-01-09 2023-01-09 Model compression method and device, computer readable medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116090543A true CN116090543A (en) 2023-05-09

Family

ID=86198704

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310028847.XA Pending CN116090543A (en) 2023-01-09 2023-01-09 Model compression method and device, computer readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116090543A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117540780A (en) * 2024-01-09 2024-02-09 腾讯科技(深圳)有限公司 Compression method and related device of neural network model
CN117540780B (en) * 2024-01-09 2024-06-25 腾讯科技(深圳)有限公司 Compression method and related device of neural network model

Similar Documents

Publication Publication Date Title
CN110347873B (en) Video classification method and device, electronic equipment and storage medium
CN113436620B (en) Training method of voice recognition model, voice recognition method, device, medium and equipment
CN110413812B (en) Neural network model training method and device, electronic equipment and storage medium
CN107578453A (en) Compressed image processing method, apparatus, electronic equipment and computer-readable medium
CN113327599B (en) Voice recognition method, device, medium and electronic equipment
CN113505883A (en) Neural network training method and device
CN114418121A (en) Model training method, object processing method and device, electronic device and medium
CN116562600B (en) Water supply control method, device, electronic equipment and computer readable medium
CN113658122A (en) Image quality evaluation method, device, storage medium and electronic equipment
CN116090543A (en) Model compression method and device, computer readable medium and electronic equipment
CN116977885A (en) Video text task processing method and device, electronic equipment and readable storage medium
CN109359727B (en) Method, device and equipment for determining structure of neural network and readable medium
CN113591490B (en) Information processing method and device and electronic equipment
WO2022246986A1 (en) Data processing method, apparatus and device, and computer-readable storage medium
CN114139703A (en) Knowledge distillation method and device, storage medium and electronic equipment
CN115983349A (en) Method and device for quantizing convolutional neural network, electronic device and storage medium
CN115936092A (en) Neural network model quantization method and device, storage medium and electronic device
CN114330239A (en) Text processing method and device, storage medium and electronic equipment
CN116644783A (en) Model training method, object processing method and device, electronic equipment and medium
CN114501031B (en) Compression coding and decompression method and device
EP3683733A1 (en) A method, an apparatus and a computer program product for neural networks
CN115409150A (en) Data compression method, data decompression method and related equipment
CN118155270B (en) Model training method, face recognition method and related equipment
CN115952830B (en) Data processing method, device, electronic equipment and storage medium
CN116933857A (en) Pruning processing method, device, equipment and medium for neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination