CN114861910A - Neural network model compression method, device, equipment and medium - Google Patents

Neural network model compression method, device, equipment and medium

Info

Publication number
CN114861910A
Authority
CN
China
Prior art keywords
compression
neural network
target
network model
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210556816.7A
Other languages
Chinese (zh)
Other versions
CN114861910B (en)
Inventor
吕梦思
王豪爽
党青青
刘其文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210556816.7A priority Critical patent/CN114861910B/en
Publication of CN114861910A publication Critical patent/CN114861910A/en
Application granted granted Critical
Publication of CN114861910B publication Critical patent/CN114861910B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present disclosure provides a neural network model compression method, apparatus, device and medium, and relates to the technical field of artificial intelligence, in particular to the technical fields of deep learning and cloud services. The implementation scheme is as follows: acquiring target deployment environment information for deploying a target neural network model; determining a plurality of candidate compression strategies; performing a compression operation on the target neural network model based on each candidate compression strategy to obtain a plurality of candidate compression models respectively corresponding to the candidate compression strategies; determining inference time-consuming information corresponding to each candidate compression model; determining a target compression strategy from the plurality of candidate compression strategies based at least on the inference time-consuming information of the plurality of candidate compression models; and determining a compressed neural network model corresponding to the target neural network model based at least on the target compression strategy.

Description

Neural network model compression method, device, equipment and medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, in particular to the fields of deep learning and cloud service technologies, and more particularly to a neural network model compression method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking and planning), and it spans both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technology, and the like.
A complex neural network model performs well but occupies more storage space and computing resources. Model compression techniques reduce the number of parameters or the computational complexity of a neural network model in order to save the hardware resources used for model inference and to increase the model inference speed.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
The present disclosure provides a compression method, apparatus, electronic device, computer-readable storage medium, and computer program product for a neural network model.
According to an aspect of the present disclosure, there is provided a compression method of a neural network model, including: acquiring target deployment environment information for deploying a target neural network model; determining a plurality of candidate compression strategies; performing a compression operation on the target neural network model based on each of the plurality of candidate compression strategies to obtain a plurality of candidate compression models respectively corresponding to the plurality of candidate compression strategies; determining inference time-consuming information corresponding to each candidate compression model in the plurality of candidate compression models, wherein the inference time-consuming information can indicate the time length required for inference by using the candidate compression model in a target deployment environment; determining a target compression strategy from a plurality of candidate compression strategies at least based on inference time-consuming information of the plurality of candidate compression models respectively corresponding to the plurality of candidate compression strategies; and determining a corresponding compressed neural network model of the target neural network model based on at least the target compression strategy.
According to another aspect of the present disclosure, there is provided a compression apparatus of a neural network model, including: an acquisition unit configured to acquire target deployment environment information for deploying a target neural network model; a first determining unit configured to determine a plurality of candidate compression strategies; a compression model generator configured to perform a compression operation on the target neural network model based on each of the plurality of candidate compression strategies to obtain candidate compression models respectively corresponding to the plurality of candidate compression strategies; the inference time consumption predictor is configured for determining inference time consumption information corresponding to each candidate compression model in the plurality of candidate compression models, and the inference time consumption information can indicate the time length required for inference by using the candidate compression model in the target deployment environment; a second determining unit configured to determine a target compression policy from among a plurality of candidate compression policies, based on at least inference time-consuming information of a plurality of candidate compression models respectively corresponding to the plurality of candidate compression policies; and a compression model determination unit configured to determine a corresponding compression neural network model of the target neural network model based on at least the target compression policy.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of neural network model compression described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the above-described neural network model compression method.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program is capable of implementing the above-mentioned compression method of a neural network model when executed by a processor.
According to one or more embodiments of the present disclosure, the effectiveness and efficiency of neural network model compression can be improved, and a compressed neural network model with better-optimized inference performance can be obtained.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the embodiments and, together with the description, serve to explain the exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
FIG. 1 shows a schematic diagram of an exemplary system in which various methods described herein may be implemented, according to an exemplary embodiment of the present disclosure;
FIG. 2 shows a flow chart of a compression method of a neural network model according to an exemplary embodiment of the present disclosure;
FIG. 3 shows a flowchart of a portion of an example process of a compression method of a neural network model, according to an example embodiment of the present disclosure;
FIG. 4 illustrates a block diagram of a compression apparatus of a neural network model according to an exemplary embodiment of the present disclosure;
FIGS. 5A-5D illustrate structural schematic diagrams of a system that can be used to implement a compression method of a neural network model, according to an exemplary embodiment of the present disclosure;
FIG. 6 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, while in some cases they may refer to different instances based on the context of the description.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
In practical applications, a neural network model is deployed in an inference library for performing model inference, and an inference chip is used to carry out the inference computation. It should be understood that the inference performance of models with different parameter types and structures is influenced by the deployment environment of the model, and since model compression is realized by changing the parameters or structure of the model, the model compression effect is likewise influenced by the deployment environment. For example, the following possibility exists: the model deployment environment can only execute the inference process of the original neural network model but cannot execute the compressed neural network model; that is, different model deployment environments support model compression strategies to different degrees, so that a compressed model obtained with a particular compression strategy cannot be used for inference at all. As another example, the following possibility also exists: in a given model deployment environment, the difference between the time required for inference with the original neural network model and the time required for inference with the compressed neural network model is small; that is, different model deployment environments yield different degrees of optimization for a given compression strategy, so that a compressed model obtained with a particular compression strategy cannot achieve the expected inference optimization effect.
In the related art, a model compression strategy is usually selected according to manually determined empirical knowledge, the influence of different deployment environments on model compression cannot be fully considered, and an optimal model compression strategy adaptive to the deployment environment cannot be efficiently determined.
In order to solve the above problems, the present disclosure provides a compression method for a neural network model, which can preliminarily evaluate the inference optimization effect corresponding to each candidate compression strategy by obtaining the inference time-consuming information of a plurality of candidate compression models respectively corresponding to a plurality of candidate compression strategies in a target deployment environment, so as to determine an optimal target compression strategy, and can improve the efficiency and effect of model compression.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented in accordance with embodiments of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In embodiments of the present disclosure, the server 120 may run one or more services or software applications that enable the compression methods of the neural network model to be performed.
In some embodiments, the server 120 may also provide other services or software applications, which may include non-virtual environments and virtual environments. In certain embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof, which may be executed by one or more processors. A user operating a client device 101, 102, 103, 104, 105, and/or 106 may, in turn, utilize one or more client applications to interact with the server 120 to take advantage of the services provided by these components. It should be understood that a variety of different system configurations are possible, which may differ from system 100. Accordingly, fig. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
The user may use client devices 101, 102, 103, 104, 105, and/or 106 to send the target neural network model, target deployment environment information for deploying the target neural network model, and test data. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that any number of client devices may be supported by the present disclosure.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptops), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and so forth. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., GOOGLE Chrome OS); or include various Mobile operating systems such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, Android. Portable handheld devices may include cellular telephones, smart phones, tablets, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head-mounted displays (such as smart glasses) and other devices. The gaming system may include a variety of handheld gaming devices, internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a variety of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. By way of example only, one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a blockchain network, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, WIFI), and/or any combination of these and/or other networks.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-end servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture involving virtualization (e.g., one or more flexible pools of logical storage that may be virtualized to maintain virtual storage for the server). In various embodiments, the server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above, as well as any commercially available server operating systems. The server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, and the like.
In some implementations, the server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of the client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.
In some embodiments, the server 120 may be a server of a distributed system, or a server incorporating a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that addresses the drawbacks of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of the databases 130 may be used to store information such as audio files and video files. The database 130 may reside in various locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. The database 130 may be of different types. In certain embodiments, the database used by the server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data to and from the database in response to the command.
In some embodiments, one or more of the databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key-value stores, object stores, or regular stores supported by a file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
Fig. 2 shows a flow chart of a method 200 of compression of a neural network model according to an exemplary embodiment of the present disclosure. As shown in fig. 2, the method 200 includes:
step S210, obtaining target deployment environment information for deploying a target neural network model;
step S220, determining a plurality of candidate compression strategies;
step S230, based on each candidate compression policy of the plurality of candidate compression policies, performing a compression operation on the target neural network model to obtain a plurality of candidate compression models respectively corresponding to the plurality of candidate compression policies;
step S240, determining inference time-consuming information corresponding to each candidate compression model in the plurality of candidate compression models, wherein the inference time-consuming information can indicate the time length required for inference by using the candidate compression model in a target deployment environment;
step S250, determining a target compression strategy from a plurality of candidate compression strategies at least based on inference time consumption information of a plurality of candidate compression models respectively corresponding to the plurality of candidate compression strategies; and
step S260, determining a compressed neural network model corresponding to the target neural network model based at least on the target compression strategy.
Therefore, the corresponding reasoning performance of each candidate compression strategy can be indicated by acquiring the candidate compression models respectively corresponding to the candidate compression strategies and using the reasoning time consumption information of the candidate compression models in the target deployment environment, and the optimal target compression strategy can be determined based on the reasoning performance, so that the efficiency and the effect of model compression are improved. By utilizing the optimal target compression strategy, a compressed neural network model with better inference performance can be obtained, so that the computing resources occupied by the process of reasoning by utilizing the compressed neural network model can be saved, and the model reasoning efficiency is improved.
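For illustration, the flow of steps S210 to S260 can be sketched in Python as follows. All names (DeploymentEnv, CompressionStrategy, select_compression_strategy, estimate_latency) are assumptions introduced for this sketch rather than identifiers from the disclosure, and the compression and latency-estimation functions are left abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class DeploymentEnv:
    # S210: target deployment environment information
    inference_library: str
    operating_system: str
    inference_chip: str

@dataclass
class CompressionStrategy:
    # S220: one candidate compression strategy (e.g. quantization, sparsification)
    name: str
    apply: Callable  # performs the compression operation on a model

def select_compression_strategy(model,
                                env: DeploymentEnv,
                                candidates: List[CompressionStrategy],
                                estimate_latency: Callable) -> Tuple[CompressionStrategy, object]:
    # S230: compress the target model with each candidate strategy
    candidate_models: Dict[str, object] = {s.name: s.apply(model) for s in candidates}
    # S240: estimate how long each candidate model needs for inference in the target environment
    latencies = {name: estimate_latency(m, env) for name, m in candidate_models.items()}
    # S250: choose the strategy whose candidate model infers fastest
    target_name = min(latencies, key=latencies.get)
    target_strategy = next(s for s in candidates if s.name == target_name)
    # S260: the compressed model is then derived (and optionally further tuned) from this strategy
    return target_strategy, candidate_models[target_name]
```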
The target neural network model is a model that has been constructed or trained based on a deep learning method, and may be, for example, a neural network model of any of various structures, such as a convolutional neural network model, a feedback neural network model, or a feedforward neural network model. The target neural network model may be a model for performing various deep learning-based tasks, and may include, for example, at least one of a target detection model, an image classification model, a text content understanding model, and a visual language model.
In some embodiments, the target neural network model is a target detection model. In this case, the training data and the test data corresponding to the target detection model are to-be-detected images, and the model is used for outputting the relevant prediction information of the target object included in the to-be-detected image based on the input to-be-detected image. By compressing the target detection model by using the compression method of the neural network model provided by the embodiment of the disclosure, the compressed target detection model which occupies less storage space or occupies less computing resources in the reasoning process can be obtained. On the basis, when the compressed target detection model is used for executing the corresponding target detection task, hardware resources required by the target detection process can be effectively saved.
In other embodiments, the target neural network model is an image classification model. In this case, the training data and the test data corresponding to the image classification model are images to be classified, and the model is used for outputting prediction information of a corresponding category of the images to be classified based on the input images to be classified.
In other embodiments, the target neural network model is a textual content understanding model. In this case, the training data and the test data corresponding to the text content understanding model are text information or document information, and the model is used for outputting corresponding semantic understanding content based on the input text information or document information. Illustratively, the semantic understanding content may include various types, such as text category information derived based on text content understanding, text semantic keyword information derived based on text content understanding, text summary paragraph information derived based on text content understanding, and the like.
In other embodiments, the target neural network model is a visual language model. In this case, the training data and the test data corresponding to the visual language model are images to be recognized, and the model is used for outputting text semantic features corresponding to image features included in the images to be recognized based on the input images to be recognized.
Similar to the foregoing description of performing the target detection task by using the compressed neural network model, in the above embodiments, at least one of the image classification model, the text content understanding model, or the visual language model may also be compressed by using the method provided by the exemplary embodiment of the present disclosure, so as to obtain the compressed neural network model that occupies less storage space or occupies less computing resources in the inference process, and further, the compressed neural network model can be used to perform the corresponding deep learning-based task, so as to save the hardware resources occupied by the task.
Illustratively, the target deployment environment information for deploying the target neural network model may include, but is not limited to, an inference library type for performing model inference, an operating system type corresponding to a computer for performing model inference, an inference chip type, and the like.
According to some embodiments, the candidate compression strategies comprise at least one of a quantization strategy and a sparsification strategy. Therefore, according to the requirements of practical application scenes, various model compression strategies or the combination of various model compression strategies can be flexibly utilized to realize the compression of the neural network model, so that the efficiency and the effect of model compression are improved, and the storage space occupied by the obtained compressed neural network model and the hardware computing resources occupied by the corresponding model reasoning process are further saved.
The quantization strategy refers to storing or computing parameters of the target neural network model, which are originally stored or computed with a higher bit width, using a lower bit width instead. For example, parameters stored as 32-bit floating-point numbers may instead be stored as 8-bit integers, which saves the storage space occupied by the target neural network model. As another example, parameters computed as 32-bit floating-point numbers may instead be computed as 8-bit integers, which saves the hardware computing resources occupied by the inference process of the target neural network model and increases the model inference speed.
Illustratively, the quantization strategy may include different quantization parameters, which may include, for example, a quantization rate and the bit width of the quantized data type. The quantization rate indicates the proportion of the parameters of the target neural network model to be quantized, for example 5% or 10% of the parameters in the target neural network model. As another example, the quantization parameters may also indicate the locations of the parameters to be quantized; for example, only one or more specific network layers in the target neural network model may be quantized, corresponding to different types of quantization strategies such as uniform quantization, symmetric quantization, and the like. Therefore, different quantization strategies can be used flexibly according to the requirements of the actual application scenario, further improving the effect of model compression, so that the obtained compressed neural network model occupies less storage space and has better inference performance.
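The quantization idea can be pictured with a generic symmetric int8 sketch; this particular scheme is only one common example and is not prescribed by the disclosure.

```python
import numpy as np

def quantize_symmetric_int8(weights: np.ndarray):
    # Map the largest magnitude onto 127 and round the rest onto the int8 grid.
    scale = max(float(np.abs(weights).max()), 1e-8) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale  # parameters are now stored with 8 bits each, plus one scale factor

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate float32 values when a floating-point computation is needed.
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)  # parameters stored as 32-bit floats
q, scale = quantize_symmetric_int8(w)
print(w.nbytes, q.nbytes)  # 262144 vs 65536 bytes: roughly a 4x reduction in storage
```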
The sparsification strategy prunes network units from the target neural network model according to a certain rule, so that the operations corresponding to the pruned network units are skipped when the neural network model runs. This saves the storage space occupied by the target neural network model while also saving the hardware computing resources occupied by its inference process.
Illustratively, the sparsification strategy may include different sparsification parameters, which may include, for example, a sparsification ratio indicating the proportion of the parameters of the target neural network model to be pruned, such as 5% or 10% of the parameters in the target neural network model. As another example, the sparsification parameters may also indicate the locations of the parameters to be pruned, corresponding to different types of sparsification strategies such as structured sparsification, unstructured sparsification, and the like. Therefore, different sparsification strategies can be used flexibly according to the requirements of the actual application scenario, further improving the effect of model compression, so that the obtained compressed neural network model occupies less storage space and has better inference performance.
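For the unstructured case, a generic magnitude-based pruning sketch illustrates the effect of a sparsification ratio; the criterion used here is an assumption for illustration and is not mandated by the disclosure.

```python
import numpy as np

def prune_by_magnitude(weights: np.ndarray, sparsity_ratio: float) -> np.ndarray:
    # Zero out the smallest-magnitude fraction of the weights (unstructured sparsification).
    flat = np.abs(weights).ravel()
    k = int(len(flat) * sparsity_ratio)  # e.g. 5% or 10% of the parameters
    if k == 0:
        return weights
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

w = np.random.randn(128, 128).astype(np.float32)
w_sparse = prune_by_magnitude(w, sparsity_ratio=0.10)
print(1.0 - np.count_nonzero(w_sparse) / w.size)  # approximately 0.10
```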
In this case, according to some embodiments, the compression operation in step S230 includes: in response to the candidate compression strategy comprising a quantization strategy, performing a quantization operation on the target neural network model to enable at least one parameter comprised by the target neural network model to be stored or calculated in a lower bit quantity. And further, the compressing operation comprises: in response to the candidate compression strategy comprising a sparsification strategy, performing a sparsification operation on the target neural network model to remove at least one parameter comprised by the target neural network model. Therefore, various model compression strategies can be comprehensively applied in the compression process of the neural network model, so that the effect of model compression is improved.
Further, according to some embodiments, when the candidate compression policy includes a quantization policy and a thinning policy, the candidate compression policy further includes compression policy configuration information capable of indicating an execution order of the quantization operation and the thinning operation, and the compression operation includes: and performing quantization operation and sparsification operation on the target neural network model based on the compression strategy configuration information. Therefore, the compression of the neural network model can be realized by flexibly utilizing the ordered combination of various model compression strategies according to the requirements of practical application scenes, so that the efficiency and the effect of model compression are improved.
In one example, the program code corresponding to each compression policy may be pre-configured, and the program codes corresponding to the various compression policies are combined together based on the compression policy configuration information to obtain the program code for performing the compression process of the neural network model. Illustratively, the process of pre-configuring the program code corresponding to each compression policy may be manually implemented, and it should be understood that in this case, the compatibility between the program codes corresponding to the various compression policies may be fully considered in the pre-configuration process. Therefore, the model compression can be realized more simply, conveniently and rapidly, and the model compression efficiency is improved.
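A minimal sketch of combining the two operations under compression strategy configuration information is given below; the configuration keys, function names, and dictionary-based model stand-in are assumptions made for illustration.

```python
def sparsify(model, ratio):
    # Placeholder: a real implementation would prune model parameters.
    return {**model, "sparsity": ratio}

def quantize(model, bits):
    # Placeholder: a real implementation would re-encode model parameters at a lower bit width.
    return {**model, "bits": bits}

def compress_with_config(model, config):
    ops = {
        "sparsify": lambda m: sparsify(m, config["sparsity_ratio"]),
        "quantize": lambda m: quantize(m, config["quant_bits"]),
    }
    for op_name in config["order"]:  # execution order taken from the configuration information
        model = ops[op_name](model)
    return model

compressed = compress_with_config(
    model={"name": "target_model"},
    config={"order": ["sparsify", "quantize"], "sparsity_ratio": 0.1, "quant_bits": 8},
)
```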
It should be understood that the above is only an exemplary illustration of the candidate compression strategies described in the present disclosure, and those skilled in the art can select other types of compression strategies or other combinations of compression strategies as the candidate compression strategies according to actual needs, for example, a parameter sharing strategy, a low rank decomposition strategy, an attention decoupling strategy, and the like, so as to save storage space occupied by the model or accelerate the model inference process, and can save computing resources occupied by the model inference process.
According to some embodiments, the determining, in step S240, for each candidate compression model of the plurality of candidate compression models, inference time-consuming information corresponding to the candidate compression model includes: determining structure information of the candidate compression model, wherein the structure information comprises a plurality of computing units included in the candidate compression model and attribute information corresponding to each computing unit in the plurality of computing units; determining calculation time consumption information corresponding to each calculation unit based on a plurality of calculation units included in the candidate compression model, attribute information corresponding to each calculation unit in the plurality of calculation units and the target deployment environment information; and determining inference time-consuming information corresponding to the candidate compression model based on the computation time-consuming information corresponding to each computation unit. Therefore, by analyzing the structure of the candidate compression model, the inference time-consuming information of the model can be obtained based on the computation time-consuming information respectively corresponding to each computation unit included in the candidate compression model, and the method is simpler, more convenient and quicker.
Illustratively, the computing units may include a convolution computing unit, a normalization computing unit, a fully connected unit, and the like. According to some embodiments, the attribute information corresponding to each computing unit includes at least the amount of computation data corresponding to that computing unit. For example, the amount of computation data corresponding to the normalization computing unit can be represented by the number of parameters to be normalized and the bit width of each parameter. Based on this attribute information, the computation time consumption information corresponding to the computing unit can be determined more accurately.
For example, the attribute information corresponding to each computing unit may further include other information; for instance, the attribute information corresponding to the convolution computing unit may include the size and shape of the data input to the convolution computing unit, the size and shape of the filter, the size and shape of the convolution kernel, and the like. Based on this attribute information, the computation time consumption information corresponding to the computing unit can likewise be determined more accurately.
Illustratively, when a candidate compression model includes three computing units A, B, and C, and the time required for A, B, and C to perform their respective computations in the target deployment environment is determined to be a1, b1, and c1, the sum of a1, b1, and c1 may be determined as the inference time-consuming information corresponding to the candidate compression model. It should be understood that the inference time-consuming information corresponding to the candidate compression model may also be obtained in other ways, for example by combining a1, b1, and c1 according to another formula, which is not limited here.
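This summation can be sketched under assumed data structures as follows; the ComputeUnit fields and the toy per-unit latency function are illustrative and not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ComputeUnit:
    kind: str    # e.g. "conv", "batch_norm", "fc"
    attrs: dict  # attribute information, e.g. parameter count, bit width, kernel shape

def estimate_inference_time(units: List[ComputeUnit], env: str,
                            unit_latency: Callable[[ComputeUnit, str], float]) -> float:
    # Sum the per-unit computation times, as in the A/B/C example above (a1 + b1 + c1).
    return sum(unit_latency(u, env) for u in units)

units = [ComputeUnit("conv", {"params": 9408}),
         ComputeUnit("batch_norm", {"params": 128}),
         ComputeUnit("fc", {"params": 512000})]
print(estimate_inference_time(units, env="gpu/int8",
                              unit_latency=lambda u, e: 1e-6 * u.attrs["params"]))
```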
It should be understood that the above process is only one example way of determining inference time-consuming information corresponding to each of the plurality of candidate compression models, and may also be another way of determining the inference time-consuming information. For example, a plurality of candidate compression models can be used for reasoning in the target deployment environment, and the required time duration is recorded, so as to obtain the reasoning time consumption information corresponding to the plurality of candidate compression models respectively.
According to some embodiments, determining the computation time consumption information corresponding to each computing unit based on the plurality of computing units included in the candidate compression model, the attribute information corresponding to each of the plurality of computing units, and the target deployment environment information includes: acquiring the computation time consumption information corresponding to each computing unit from a time-consuming information database, where the time-consuming information database records a mapping among the attribute information of a plurality of computing units, a plurality of pieces of deployment environment information, and the computation time consumption information corresponding to the plurality of computing units. Therefore, the computation time consumption information corresponding to each computing unit can be determined more simply and quickly by querying the time-consuming information database.
It should be understood that the above process is only one example way of determining the calculation time-consuming information corresponding to each of the plurality of calculation units, and the calculation time-consuming information may also be determined in other ways. For example, in the target deployment environment, a plurality of computing units are respectively used for computing, and the required time length is recorded, so as to obtain computing time consumption information corresponding to the plurality of computing units respectively.
According to some embodiments, the time-consuming information database is obtained using the following method: calculating by utilizing a plurality of calculating units with different attributes in a plurality of deployment environments respectively to obtain calculating time-consuming information respectively corresponding to the plurality of calculating units in the plurality of deployment environments; and recording the mapping relation among the attribute information of the plurality of computing units, the plurality of deployment environment information and the computing time consumption information corresponding to the plurality of computing units. Therefore, the mapping relation among the attribute information of the plurality of computing units, the plurality of deployment environment information and the calculation time-consuming information corresponding to the plurality of computing units can be accurately determined through actual tests, the accuracy of the inference time-consuming information of the determined candidate compression model is further improved, and based on the method, a more appropriate target compression strategy can be determined, and the model compression effect is improved.
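A dictionary-backed sketch of such a time-consuming information database is shown below; the key layout and the example values are assumptions, since the disclosure only requires that the mapping be recorded.

```python
latency_db = {}  # (unit kind, unit attribute tuple, deployment environment) -> seconds

def record_latency(unit_kind: str, unit_attrs: tuple, env: str, seconds: float) -> None:
    # Populated by actually running each computing unit in each deployment environment.
    latency_db[(unit_kind, unit_attrs, env)] = seconds

def lookup_latency(unit_kind: str, unit_attrs: tuple, env: str) -> float:
    return latency_db[(unit_kind, unit_attrs, env)]

attrs = (("in_channels", 64), ("out_channels", 128), ("kernel", 3))
record_latency("conv", attrs, "arm_cpu/int8", 0.0021)
print(lookup_latency("conv", attrs, "arm_cpu/int8"))  # 0.0021
```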
According to some embodiments, the determining the target compression policy from the plurality of candidate compression policies based on at least inference time-consuming information of a plurality of candidate compression models respectively corresponding to the plurality of candidate compression policies in step S250 includes: sequencing the inference time-consuming information of the candidate compression models respectively corresponding to the candidate compression strategies; and determining a target compression policy from the plurality of candidate compression policies based at least on the ranking result. Therefore, the compression effect corresponding to each candidate compression strategy can be more clearly indicated by sequencing the inference time-consuming information, so that a more appropriate target compression strategy can be determined, a compressed neural network model with higher inference speed can be obtained, the processing speed of inference by using the compressed neural network model can be correspondingly increased, and the model compression effect can be improved.
Illustratively, the candidate compression models can be sorted by their inference time-consuming information, and the candidate compression strategy corresponding to the candidate compression model with the shortest inference time can be determined as the target compression strategy, so that a compressed neural network model with a faster inference speed can be obtained and the computing resources occupied by the model inference process can be saved.
According to some embodiments, the determining a target compression policy from the plurality of candidate compression policies based on at least inferred time-consuming information of a plurality of candidate compression models corresponding to the plurality of candidate compression policies, respectively, further comprises: obtaining preset priority information of the plurality of candidate compression strategies, and determining a target compression strategy from the plurality of candidate compression strategies based on the sorting result and the preset priority information. The preset priority information can be manually configured in advance, so that the determined target compression strategy can better meet the requirements of actual application scenarios.
As described above, since the model compression is implemented by changing the parameters or structure of the model, the compression models obtained by using different model compression strategies may have different performance degradation, and the a priori knowledge may indicate the influence of the different model compression strategies on the performance of the model, in other words, indicate the performance of the compression models obtained by using the different model compression strategies. In this case, the preset priority information may be manually configured according to the corresponding a priori knowledge. In an example, the inference time delays of the candidate compression models respectively corresponding to the candidate compression policy M and the candidate compression policy N may be the same, but since the candidate compression policy M has a smaller influence on the model performance, the candidate compression policy M may be determined as the target compression policy by configuring the preset priority information, so as to improve the model compression effect.
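The combination of the latency ranking with preset priority information can be expressed as a two-key sort; the numbers below are invented for the strategy M and strategy N example above.

```python
# Estimated inference latencies (ms) and manually configured priorities; values are made up.
latencies = {"strategy_M": 12.0, "strategy_N": 12.0, "strategy_P": 15.5}
priority = {"strategy_M": 0, "strategy_N": 1, "strategy_P": 2}  # lower value = preferred

# Sort primarily by latency and break ties with the preset priority.
target = min(latencies, key=lambda name: (latencies[name], priority[name]))
print(target)  # "strategy_M": same latency as strategy_N, but higher preset priority
```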
According to some embodiments, determining the corresponding compressed neural network model of the target neural network model based on at least the target compression policy in step S260 includes: adjusting the target compression strategy based on a preset optimization strategy; and determining the compressed neural network model corresponding to the target neural network model based on the adjusted target compression strategy. The preset optimization strategy is used for reducing the influence of the model compression process on the model performance, so that the prediction precision of the model can be ensured under the condition of improving the reasoning speed of the target neural network model.
As described previously, the target compression strategy may comprise a quantization strategy and a sparsification strategy, and the quantization strategy and the sparsification strategy comprise different quantization parameters and sparsification parameters, respectively. The method according to the exemplary embodiment of the present disclosure may use a plurality of compression strategies in combination to achieve compression of the target neural network model, and may indicate an execution order of the plurality of compression strategies based on the compression strategy configuration information. In this case, the preset optimization strategy may include a parameter adjustment strategy for adjusting the quantization parameter and the sparsification parameter and a configuration adjustment strategy for adjusting the compression strategy configuration information, so that the influence of the model compression process on the model performance can be reduced by flexibly adjusting the target compression strategy, and the prediction accuracy of the corresponding compressed neural network model can be improved.
Further, the preset optimization strategy may further include a model parameter adjusting strategy for adjusting parameters of the target neural network model, and the prediction accuracy of the corresponding compressed neural network model is improved by performing parameter adjusting optimization on the target neural network model.
According to some embodiments, the adjusting of the target compression strategy based on a preset optimization strategy includes: inputting test data into the target neural network model, and acquiring a first prediction result output by the target neural network model; inputting the test data into a candidate compression model obtained by compression based on the target compression strategy, and acquiring a second prediction result output by the candidate compression model; calculating a first compression loss value based on the first prediction result and the second prediction result; and adjusting the target compression strategy and the target neural network model based on the first compression loss value, wherein a corresponding compressed neural network model is determined based on the adjusted target neural network model and the adjusted target compression strategy. Therefore, the influence of the model compression process on the model performance can be indicated based on the output of the original target neural network model and the output of the candidate compression model, and the adjustment can be made accordingly, thereby improving the prediction accuracy of the corresponding compressed neural network model.
For example, the determining a corresponding compression neural network model based on the adjusted target neural network model and the adjusted target compression policy may include: and executing compression operation on the adjusted target neural network model based on the adjusted target compression strategy, and taking the compressed target neural network model as the compressed neural network model.
As described previously, the target neural network model may be a model for performing various deep learning-based tasks, which according to some embodiments may include at least one of: an object detection model, an image classification model, a text content understanding model, and a visual language model. Accordingly, the type of the test data may include image data, text data, document data, etc., corresponding to the functional type of the target neural network model.
In one embodiment, the target neural network model is an image classification model, the test data is a test image, the first prediction result includes first prediction category information corresponding to the test image and a first category confidence thereof, and the second prediction result includes second prediction category information corresponding to the test image and a second category confidence thereof. Therefore, the model compression method according to the exemplary embodiment of the disclosure can be applied to the field of image classification, so that a compressed image classification model with better inference performance can be utilized, the image classification speed is increased, and the occupation of hardware resources is reduced.
In another embodiment, the target neural network model is a target detection model, the test data is a test image containing a target object, the first prediction result includes first prediction category information, first prediction position information, a first category confidence degree and a first position confidence degree corresponding to the target object in the test image, and the second prediction result includes second prediction category information, second prediction position information, a second category confidence degree and a second position confidence degree corresponding to the target object in the test image. Therefore, the model compression method according to the exemplary embodiment of the disclosure can be applied to the field of target detection, so that a compressed target detection model with better inference performance can be utilized, the speed of target detection is increased, and the occupation of hardware resources is reduced.
In another embodiment, the target neural network model is a text content understanding model, the test data is a test text, the first prediction result includes first predicted semantic understanding content corresponding to the test text, and the second prediction result includes second predicted semantic understanding content corresponding to the test text. Illustratively, the semantic understanding content may include various types, such as text category information derived based on text content understanding, text semantic keyword information derived based on text content understanding, text summary paragraph information derived based on text content understanding, and the like. Therefore, the model compression method according to the exemplary embodiment of the disclosure can be applied to the field of natural language processing, so that the compressed text content understanding model with better reasoning performance can be utilized, semantic understanding content corresponding to a text can be more efficiently acquired, and occupation of hardware resources is reduced.
It should be understood that the method provided by the exemplary embodiment of the present disclosure may also be used to compress other functional types of target neural network models, such as visual language models, etc., as long as the specific type of the corresponding test data and the specific representation form of the prediction result are determined according to the requirements of the actual scene, which is not limited herein.
Further, the adjusting step may be performed again on the compressed neural network model obtained in the foregoing step, for example, in some embodiments, the adjusting the target compression policy based on the preset optimization policy further includes: inputting the test data into the compressed neural network model, and acquiring a third prediction result output by the compressed neural network model; calculating a second compression loss value based on the first prediction result and the third prediction result; and adjusting the target compression strategy and the compression neural network model based on the second compression loss value, and wherein the adjusted compression neural network model is updated based on the adjusted target compression strategy. Thus, the performance of the compressed neural network model can be further optimized through multiple iterations.
Fig. 3 shows a flow chart of part of an example process of a compression method of a neural network model according to an exemplary embodiment of the present disclosure. As shown in fig. 3, step S260 may be implemented by the following process:
step S261: inputting test data into the target neural network model, and acquiring a first prediction result output by the target neural network model;
step S262: inputting the test data into a candidate compression model obtained by compression based on the target compression strategy, and acquiring a second prediction result output by the candidate compression model;
step S263: calculating a first compression loss value based on the first prediction result and the second prediction result;
step S264: adjusting the target compression strategy and the target neural network model based on the first compression loss value;
step S265: determining a corresponding compressed neural network model based on the adjusted target neural network model and the adjusted target compression strategy.
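For illustration only, the following sketch outlines steps S261 to S265 in Python. It assumes, as a simplification, that the compression operation is exposed as a callable compress_fn(model, strategy) whose result shares the target model's parameters (so that the first compression loss can back-propagate into the target model), and that the strategy is a plain dictionary with a sparsity_ratio entry; none of these names come from this disclosure, and the loss and the heuristic strategy adjustment are merely examples.

```python
import torch
import torch.nn.functional as F

def adjust_strategy_and_model(target_model, compress_fn, strategy, test_loader,
                              loss_budget=0.05, lr=1e-4):
    """Illustrative version of steps S261-S265."""
    optimizer = torch.optim.Adam(target_model.parameters(), lr=lr)
    for x, _ in test_loader:
        with torch.no_grad():
            first_pred = target_model(x)                  # S261: first prediction result
        candidate = compress_fn(target_model, strategy)   # candidate compression model
        second_pred = candidate(x)                        # S262: second prediction result
        first_loss = F.mse_loss(second_pred, first_pred)  # S263: first compression loss
        optimizer.zero_grad()
        first_loss.backward()                             # S264: adjust the target model ...
        optimizer.step()
        if first_loss.item() > loss_budget:               # ... and the target compression strategy
            strategy["sparsity_ratio"] = max(0.0, strategy.get("sparsity_ratio", 0.5) - 0.05)
    # S265: the compressed neural network model follows from the adjusted model and strategy
    return compress_fn(target_model, strategy), strategy
```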
In one example, the target neural network model includes a plurality of network layers, and the target compression strategy includes a plurality of local compression strategies respectively corresponding to the plurality of network layers. In this case, the adjusting and optimizing steps may be performed for the plurality of network layers respectively and may, for example, include the following steps:
for each network layer of the plurality of network layers comprised by the target neural network model:
step S2611: inputting the test data into the network layer and acquiring a first local output result output by the network layer;
step S2612: inputting the test data into a compressed network layer obtained by compression based on the local compression strategy corresponding to the network layer, and acquiring a second local output result output by the compressed network layer;
step S2613: calculating a local compression loss value based on the first local output result and the second local output result;
step S2614: adjusting the local compression strategy and the network layer based on the local compression loss value;
step S2615: determining a corresponding compressed network layer based on the adjusted network layer and the adjusted local compression strategy.
Through the above steps, a plurality of compressed network layers respectively corresponding to the plurality of network layers included in the target neural network model can be obtained, and the corresponding compressed neural network model can then be assembled from them. In this way, the local compression loss value is calculated from the local output result of each network layer, so that the influence of the compression strategy on the model performance can be accurately indicated, which improves the efficiency of model compression.
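For illustration only, the per-layer variant (steps S2611 to S2615) might be sketched as follows in Python, assuming a purely sequential model whose layers are visited in order, a hypothetical local_compress_fn(layer, strategy) callable, and a simple heuristic that relaxes a layer's local strategy when its local compression loss is too large; these assumptions are not part of this disclosure.

```python
import copy
import torch.nn as nn
import torch.nn.functional as F

def layerwise_adjust(target_model, local_compress_fn, local_strategies, x,
                     loss_threshold=0.01):
    """Illustrative per-layer adjustment (steps S2611-S2615)."""
    compressed_layers = []
    feature = x
    for name, layer in target_model.named_children():
        first_local = layer(feature)                               # S2611: first local output
        strategy = local_strategies[name]
        compressed_layer = local_compress_fn(copy.deepcopy(layer), strategy)
        second_local = compressed_layer(feature)                   # S2612: second local output
        local_loss = F.mse_loss(second_local, first_local)         # S2613: local compression loss
        if local_loss.item() > loss_threshold:                     # S2614: relax the local strategy
            strategy["sparsity_ratio"] = max(0.0, strategy.get("sparsity_ratio", 0.5) - 0.1)
            compressed_layer = local_compress_fn(copy.deepcopy(layer), strategy)
        compressed_layers.append(compressed_layer)                 # S2615: keep the compressed layer
        feature = first_local                                      # feed the next layer
    return nn.Sequential(*compressed_layers)
```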
According to some embodiments, the method further comprises: obtaining description information of the target neural network model, wherein the description information can indicate structural characteristics of the target neural network model, and the preset optimization strategy is determined based on the description information of the target neural network model. Further, the description information includes at least one of: structure type information of the target neural network and network layer information included in the target neural network.
Illustratively, the type of the target neural network may include a feed-forward neural network, a recurrent neural network, a convolutional neural network, and the like. The network layer information included in the target neural network model can indicate the type and attribute information of each network layer in the model: the types of network layers may include a convolutional layer, a fully-connected layer, a pooling layer, and the like, and the attribute information of a network layer may include information such as the number of parameters it contains and their data types. A preset optimization strategy adapted to the target neural network can therefore be determined according to its description information, improving the efficiency of model compression.
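For illustration only, the description information and the choice of a preset optimization strategy based on it could look like the following Python sketch; the field names, the threshold and the two example strategies are assumptions made here for readability, not details of this disclosure.

```python
import torch.nn as nn

def build_description_info(model):
    """Illustrative description information: structure type plus the type and
    parameter count of each network layer of interest."""
    layers = []
    for name, layer in model.named_modules():
        if isinstance(layer, (nn.Conv2d, nn.Linear, nn.MaxPool2d)):
            layers.append({
                "name": name,
                "type": type(layer).__name__,
                "num_params": sum(p.numel() for p in layer.parameters()),
            })
    has_conv = any(isinstance(m, nn.Conv2d) for m in model.modules())
    return {
        "structure_type": "convolutional" if has_conv else "feed_forward",
        "layers": layers,
    }

def choose_preset_optimization_strategy(description_info):
    """Pick a preset optimization strategy adapted to the model description,
    e.g. layer-wise adjustment for deep models, whole-model adjustment otherwise."""
    if len(description_info["layers"]) > 50:
        return {"granularity": "per_layer", "rounds": 3}
    return {"granularity": "whole_model", "rounds": 5}
```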
According to another aspect of the present disclosure, there is also provided a compression apparatus of a neural network model. Fig. 4 shows a block diagram of a compression apparatus 400 of a neural network model according to an exemplary embodiment of the present disclosure. As shown in fig. 4, the apparatus 400 includes:
an obtaining unit 401 configured to obtain target deployment environment information for deploying a target neural network model;
a first determining unit 402 configured to determine a plurality of candidate compression strategies;
a compression model generator 403 configured to perform a compression operation on the target neural network model based on each of the plurality of candidate compression strategies to obtain candidate compression models respectively corresponding to the plurality of candidate compression strategies;
an inference time consumption predictor 404 configured to determine, for each candidate compression model of the plurality of candidate compression models, inference time-consuming information corresponding to the candidate compression model, where the inference time-consuming information can indicate the time length required for inference with the candidate compression model in the target deployment environment;
a second determining unit 405 configured to determine a target compression policy from a plurality of candidate compression policies, based on at least inference time-consuming information of the plurality of candidate compression models respectively corresponding to the plurality of candidate compression policies; and
a compression model determination unit 406 configured to determine a corresponding compressed neural network model of the target neural network model based at least on the target compression strategy.
The operations of units 401 to 406 of the compression apparatus 400 of the neural network model are similar to those of steps S201 to S206 described above and are not repeated here.
Fig. 5A-5D illustrate structural schematics of a system that can be used to implement a compression method of a neural network model according to an exemplary embodiment of the present disclosure.
As shown in fig. 5A, the system that can be used to implement the compression method of the neural network model includes a compression model generator, an inference time consumption predictor, a prior knowledge base, a sorting unit and a compression model determination unit.
Fig. 5B illustrates a schematic structural diagram of a compression model generator according to an exemplary embodiment of the present disclosure. As shown in fig. 5B, the target neural network model and the plurality of candidate compression strategies are input into the compression model generator, which performs a compression operation on the target neural network model based on each of the plurality of candidate compression strategies to obtain a plurality of candidate compression models respectively corresponding to the plurality of candidate compression strategies.
Fig. 5C shows a schematic structural diagram of the inference time consumption predictor according to an exemplary embodiment of the present disclosure. As shown in fig. 5C, the inference time consumption predictor includes a computing unit generator, an automatic construction tool and a time consumption prediction unit. The computing unit generator can generate computing units with different attributes, such as convolution computing units, normalization computing units, fully-connected computing units, and the like. The automatic construction tool can run these computing units with different attributes in a plurality of deployment environments to obtain the computation time-consuming information of each computing unit in each deployment environment, and record the mapping relationship among the attribute information of the computing units, the deployment environment information and the corresponding computation time-consuming information, thereby building a time-consuming information database. Once the time-consuming information database is available, in the actual model compression process the time consumption prediction unit determines the computing units included in each candidate compression model and the attribute information of each of these computing units, obtains the computation time-consuming information corresponding to each computing unit from the time-consuming information database, and then determines the inference time-consuming information of the candidate compression model based on the computation time-consuming information of its computing units.
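For illustration only, the time-consuming information database and the time consumption prediction step could be sketched as follows in Python. The measurement here simply times each computing unit locally and treats the deployment environment as a label; a real system would run the units on the corresponding target hardware. All names are assumptions made for this sketch.

```python
import time

def build_time_consuming_database(unit_factories, sample_inputs, environments,
                                  repeats=50):
    """Map (deployment environment, computing-unit key) -> average latency.
    The key is assumed to encode the unit type and its attribute information."""
    database = {}
    for env in environments:
        for key, make_unit in unit_factories.items():
            unit = make_unit()
            x = sample_inputs[key]
            unit(x)                                   # warm-up run
            start = time.perf_counter()
            for _ in range(repeats):
                unit(x)
            database[(env, key)] = (time.perf_counter() - start) / repeats
    return database

def predict_inference_time(candidate_unit_keys, env, database):
    """Estimate a candidate compression model's inference time in the target
    deployment environment by summing the recorded per-unit latencies."""
    return sum(database[(env, key)] for key in candidate_unit_keys)
```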
The sorting unit can sort the inference time-consuming information of the candidate compression models respectively corresponding to the candidate compression strategies and, at the same time, obtain preset priority information from the prior knowledge base, so that the target compression strategy can be determined from the candidate compression strategies based on the sorting result and the preset priority information.
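For illustration only, the sorting-plus-priority selection could look like the following Python sketch, where predicted_latency maps each candidate compression strategy to the predicted inference time of its candidate compression model and preset_priority comes from the prior knowledge base; both names are assumptions of this sketch.

```python
def select_target_strategy(predicted_latency, preset_priority):
    """Sort candidate strategies by predicted inference time and break ties
    with the preset priority information (higher priority wins)."""
    ranked = sorted(
        predicted_latency.items(),
        key=lambda item: (item[1], -preset_priority.get(item[0], 0)),
    )
    return ranked[0][0]

# Example: sparsification is predicted to be faster, so it is selected.
target = select_target_strategy(
    predicted_latency={"int8_quantization": 3.1, "75%_sparsification": 2.8},
    preset_priority={"int8_quantization": 2, "75%_sparsification": 1},
)
```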
Fig. 5D illustrates a schematic structural diagram of a compression model determination unit according to an exemplary embodiment of the present disclosure. As shown in fig. 5D, the compression model determination unit includes a multi-strategy combiner and an optimizer.
As described above, the target compression strategy may be a combination of multiple compression strategies. In this case, the program code corresponding to each compression strategy may be configured in advance, and the multi-strategy combiner can combine the program code of the various compression strategies based on the compression strategy configuration information corresponding to the target compression strategy. The multi-strategy combiner can also add corresponding optimization logic, based on the preset optimization strategy, on top of the combined program code, thereby obtaining program code for executing the compression and optimization process of the neural network model. The optimizer can then obtain the target neural network model and the test data and realize the compression and optimization of the model by executing the program code generated by the multi-strategy combiner, thereby obtaining the corresponding compressed neural network model.
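For illustration only, a multi-strategy combiner that chains pre-configured compression passes according to the compression strategy configuration information could be sketched as follows in Python; the two placeholder passes and the configuration format are assumptions of this sketch rather than details of this disclosure.

```python
def quantize(model, num_bits=8):
    """Placeholder quantization pass (illustrative only)."""
    print(f"quantizing parameters to {num_bits}-bit")
    return model

def sparsify(model, ratio=0.75):
    """Placeholder sparsification pass (illustrative only)."""
    print(f"removing {ratio:.0%} of parameters")
    return model

PASS_REGISTRY = {"quantization": quantize, "sparsification": sparsify}

def combine_strategies(config):
    """Chain the configured compression passes in the configured order,
    returning a single pipeline callable."""
    passes = [PASS_REGISTRY[name] for name in config["order"]]
    def pipeline(model):
        for compress_pass in passes:
            model = compress_pass(model)
        return model
    return pipeline

# e.g. sparsify first, then quantize, as indicated by the configuration information
pipeline = combine_strategies({"order": ["sparsification", "quantization"]})
```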
Through the above process, a plurality of candidate compression models respectively corresponding to a plurality of candidate compression strategies can be obtained, and the inference time-consuming information of these candidate compression models in the target deployment environment indicates the inference optimization effect of each candidate compression strategy. An optimal target compression strategy can then be determined based on this inference optimization effect, and the compression and optimization of the model can be executed based on the target compression strategy and the preset optimization strategy. By flexibly adjusting the target compression strategy and performing parameter tuning on the target neural network model, the influence of the compression process on model performance is reduced and the prediction accuracy of the resulting compressed neural network model is improved. A compressed neural network model with optimized inference performance and prediction performance can therefore be obtained efficiently.
According to another aspect of the present disclosure, there is also provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of neural network model compression described above.
According to another aspect of the present disclosure, there is also provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to execute the above-mentioned compression method of a neural network model.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program, wherein the computer program, when being executed by a processor, implements the above-mentioned method of compression of a neural network model.
Referring to fig. 6, a block diagram of an electronic device 600, which may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 602 or a computer program loaded from a storage unit 608 into a random access memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606, an output unit 607, a storage unit 608, and a communication unit 609. The input unit 606 may be any type of device capable of inputting information to the device 600, and the input unit 606 may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote control. Output unit 607 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 608 may include, but is not limited to, a magnetic disk, an optical disk. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers, and/or chipsets, such as bluetooth (TM) devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the respective methods and processes described above, such as the compression method of the neural network model. For example, in some embodiments, the compression method of the neural network model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the compression method of the neural network model described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured by any other suitable means (e.g. by means of firmware) to perform the compression method of the neural network model.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose and which can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems and apparatus are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples but only by the claims as issued and their equivalents. Various elements in the embodiments or examples may be omitted or may be replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure. Further, various elements in the embodiments or examples may be combined in various ways. Notably, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims (24)

1. A method of compressing a neural network model, comprising:
acquiring target deployment environment information for deploying a target neural network model;
determining a plurality of candidate compression strategies;
performing a compression operation on the target neural network model based on each of the plurality of candidate compression strategies to obtain a plurality of candidate compression models respectively corresponding to the plurality of candidate compression strategies;
determining inference time-consuming information corresponding to each candidate compression model in the plurality of candidate compression models, wherein the inference time-consuming information can indicate the time length required for inference by using the candidate compression model in a target deployment environment;
determining a target compression strategy from a plurality of candidate compression strategies at least based on inference time-consuming information of the plurality of candidate compression models respectively corresponding to the plurality of candidate compression strategies; and
determining a corresponding compressed neural network model of the target neural network model at least based on the target compression strategy.
2. The method of claim 1, wherein determining, based at least on the target compression policy, a corresponding compressed neural network model for the target neural network model comprises:
adjusting the target compression strategy based on a preset optimization strategy; and
determining the compressed neural network model corresponding to the target neural network model based on the adjusted target compression strategy.
3. The method of claim 2, wherein said adjusting the target compression policy based on a preset optimization policy comprises:
inputting test data into the target neural network model, and acquiring a first prediction result output by the target neural network model;
inputting the test data into a candidate compression model obtained by compression based on the target compression strategy, and acquiring a second prediction result output by the candidate compression model;
calculating a first compression loss value based on the first prediction result and the second prediction result; and
adjusting the target compression strategy and the target neural network model based on the first compression loss value,
and wherein a corresponding compressed neural network model is determined based on the adjusted target neural network model and the adjusted target compression strategy.
4. The method of any one of claims 1-3, wherein the target neural network model includes at least one of:
a target detection model, an image classification model, a text content understanding model, and a visual language model.
5. The method of claim 3, wherein the target neural network model is an image classification model, the test data is a test image,
the first prediction result comprises first prediction category information and a first category confidence degree thereof corresponding to the test image, and the second prediction result comprises second prediction category information and a second category confidence degree thereof corresponding to the test image.
6. The method of claim 3, wherein the target neural network model is a target detection model, the test data is a test image containing a target object,
the first prediction result comprises first prediction category information, first prediction position information, a first category confidence degree and a first position confidence degree corresponding to the target object in the test image, and the second prediction result comprises second prediction category information, second prediction position information, a second category confidence degree and a second position confidence degree corresponding to the target object in the test image.
7. The method of claim 3, wherein the target neural network model is a text content understanding model, the test data is test text,
the first prediction result comprises first prediction semantic understanding content corresponding to the test text, and the second prediction result comprises second prediction semantic understanding content corresponding to the test text.
8. The method of any of claims 3, 5-7, wherein said adjusting the target compression policy based on a preset optimization policy further comprises:
inputting the test data into the compressed neural network model, and acquiring a third prediction result output by the compressed neural network model;
calculating a second compression loss value based on the first prediction result and the third prediction result; and
adjusting the target compression strategy and the compressed neural network model based on the second compression loss value,
and wherein the adjusted compressed neural network model is updated based on the adjusted target compression policy.
9. The method of any of claims 2-8, further comprising:
obtaining description information of the target neural network model, wherein the description information can indicate structural features of the target neural network model,
and wherein the preset optimization strategy is determined based on the description information of the target neural network model.
10. The method of claim 9, wherein the description information comprises at least one of: structure type information of the target neural network and network layer information included in the target neural network.
11. The method of any of claims 1-10, wherein determining, for each candidate compression model of the plurality of candidate compression models, inference time-consuming information corresponding to the candidate compression model comprises:
determining structure information of the candidate compression model, wherein the structure information comprises a plurality of computing units included in the candidate compression model and attribute information corresponding to each computing unit in the plurality of computing units;
determining computation time-consuming information corresponding to each computing unit based on the plurality of computing units included in the candidate compression model, the attribute information corresponding to each computing unit in the plurality of computing units, and the target deployment environment information; and
determining inference time-consuming information corresponding to the candidate compression model based on the computation time-consuming information corresponding to each computing unit.
12. The method of claim 11, wherein the determining the computation time-consuming information corresponding to each computing unit based on the plurality of computing units included in the candidate compression model, the attribute information corresponding to each computing unit in the plurality of computing units, and the target deployment environment information comprises:
acquiring the computation time-consuming information corresponding to each computing unit from a time-consuming information database, wherein the time-consuming information database comprises a mapping relationship among attribute information of a plurality of computing units, a plurality of pieces of deployment environment information, and computation time-consuming information corresponding to the plurality of computing units.
13. The method of claim 12, wherein the time-consuming information database is obtained by:
performing computation in a plurality of deployment environments with a plurality of computing units having different attributes, respectively, to obtain computation time-consuming information respectively corresponding to the plurality of computing units in the plurality of deployment environments; and
recording the mapping relationship among the attribute information of the plurality of computing units, the plurality of pieces of deployment environment information, and the computation time-consuming information corresponding to the plurality of computing units.
14. The method according to any one of claims 11-13, wherein the attribute information corresponding to each computing unit at least includes the amount of computing data corresponding to the computing unit.
15. The method of any of claims 1-14, wherein the candidate compression strategies include at least one of a quantization strategy and a sparsification strategy.
16. The method of claim 15, wherein the compressing operation comprises:
in response to the candidate compression strategy comprising a quantization strategy, performing a quantization operation on the target neural network model so that at least one parameter comprised by the target neural network model can be stored or calculated with a lower bit width.
17. The method of claim 15, wherein the compressing operation comprises:
in response to the candidate compression strategy comprising a sparsification strategy, performing a sparsification operation on the target neural network model to remove at least one parameter comprised by the target neural network model.
18. The method according to any of claims 15-17, wherein, when the candidate compression strategies include a quantization strategy and a sparsification strategy, the candidate compression strategies further include compression strategy configuration information capable of indicating an execution order of the quantization operation and the sparsification operation, and the compression operation comprises:
performing the quantization operation and the sparsification operation on the target neural network model based on the compression strategy configuration information.
19. The method of any of claims 1-18, wherein the determining a target compression strategy from the plurality of candidate compression strategies based at least on the inference time-consuming information of the plurality of candidate compression models respectively corresponding to the plurality of candidate compression strategies comprises:
sorting the inference time-consuming information of the plurality of candidate compression models respectively corresponding to the plurality of candidate compression strategies; and
determining the target compression strategy from the plurality of candidate compression strategies based at least on the sorting result.
20. The method of claim 19, wherein the determining a target compression strategy from the plurality of candidate compression strategies based at least on the inference time-consuming information of the plurality of candidate compression models respectively corresponding to the plurality of candidate compression strategies further comprises:
acquiring preset priority information of the plurality of candidate compression strategies,
wherein the target compression strategy is determined from the plurality of candidate compression strategies based on the sorting result and the preset priority information.
21. An apparatus for compressing a neural network model, comprising:
an acquisition unit configured to acquire target deployment environment information for deploying a target neural network model;
a first determining unit configured to determine a plurality of candidate compression strategies;
a compression model generator configured to perform a compression operation on the target neural network model based on each of the plurality of candidate compression strategies to obtain candidate compression models respectively corresponding to the plurality of candidate compression strategies;
the inference time consumption predictor is configured for determining inference time consumption information corresponding to each candidate compression model in the plurality of candidate compression models, and the inference time consumption information can indicate the time length required for inference by using the candidate compression model in the target deployment environment;
a second determining unit configured to determine a target compression policy from among a plurality of candidate compression policies, based on at least inference time-consuming information of a plurality of candidate compression models respectively corresponding to the plurality of candidate compression policies; and
a compression model determination unit configured to determine a corresponding compressed neural network model of the target neural network model based at least on the target compression strategy.
22. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-20.
23. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-20.
24. A computer program product comprising a computer program, wherein the computer program realizes the method according to any of claims 1-20 when executed by a processor.
CN202210556816.7A 2022-05-19 2022-05-19 Compression method, device, equipment and medium of neural network model Active CN114861910B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210556816.7A CN114861910B (en) 2022-05-19 2022-05-19 Compression method, device, equipment and medium of neural network model

Publications (2)

Publication Number Publication Date
CN114861910A true CN114861910A (en) 2022-08-05
CN114861910B CN114861910B (en) 2023-07-04

Family

ID=82638790

Country Status (1)

Country Link
CN (1) CN114861910B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210109991A1 (en) * 2019-10-10 2021-04-15 International Business Machines Corporation Domain specific model compression
CN111222629A (en) * 2019-12-31 2020-06-02 暗物智能科技(广州)有限公司 Neural network model pruning method and system based on adaptive batch normalization
CN111275190A (en) * 2020-02-25 2020-06-12 北京百度网讯科技有限公司 Neural network model compression method and device, image processing method and processor
CN113128682A (en) * 2021-04-14 2021-07-16 北京航空航天大学 Automatic neural network model adaptation method and device
CN113807496A (en) * 2021-05-31 2021-12-17 华为技术有限公司 Method, apparatus, device, medium and program product for constructing neural network model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
尹文枫 et al.: "Research progress on convolutional neural network compression and acceleration techniques", 计算机***应用, no. 09, pages 20 - 29 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115049057A (en) * 2022-08-11 2022-09-13 浙江大华技术股份有限公司 Model deployment method and device, electronic equipment and storage medium
CN115049057B (en) * 2022-08-11 2022-11-18 浙江大华技术股份有限公司 Model deployment method and device, electronic equipment and storage medium
CN115543945A (en) * 2022-11-29 2022-12-30 支付宝(杭州)信息技术有限公司 Model compression method and device, storage medium and electronic equipment
CN115600653A (en) * 2022-12-07 2023-01-13 荣耀终端有限公司(Cn) Deployment method and device of neural network model
CN115600653B (en) * 2022-12-07 2023-05-12 荣耀终端有限公司 Neural network model deployment method and device

Also Published As

Publication number Publication date
CN114861910B (en) 2023-07-04

Similar Documents

Publication Publication Date Title
CN114861910B (en) Compression method, device, equipment and medium of neural network model
CN112579909A (en) Object recommendation method and device, computer equipment and medium
CN113411645B (en) Information recommendation method and device, electronic equipment and medium
CN112559721B (en) Method, device, equipment, medium and program product for adjusting man-machine dialogue system
CN115511779B (en) Image detection method, device, electronic equipment and storage medium
CN113485820A (en) Task scheduling system and implementation method, device and medium thereof
CN112784985A (en) Training method and device of neural network model, and image recognition method and device
CN114443989B (en) Ranking method, training method and device of ranking model, electronic equipment and medium
CN115600646B (en) Language model training method, device, medium and equipment
CN116029346A (en) Method, apparatus, device and medium for deep learning model reasoning
CN116401462A (en) Interactive data analysis method and system applied to digital sharing
CN113722594B (en) Training method and device of recommendation model, electronic equipment and medium
CN114676062A (en) Method and device for testing difference data of interface, electronic equipment and medium
CN112784912A (en) Image recognition method and device, and training method and device of neural network model
CN113779559A (en) Method, apparatus, electronic device, and medium for identifying cheating websites
CN113420227B (en) Training method of click rate estimation model, click rate estimation method and device
CN116881485B (en) Method and device for generating image retrieval index, electronic equipment and medium
CN113326417B (en) Method and device for updating webpage library
CN113284484B (en) Model training method and device, voice recognition method and voice synthesis method
CN115713071B (en) Training method for neural network for processing text and method for processing text
CN115809364B (en) Object recommendation method and model training method
CN114842474B (en) Character recognition method, device, electronic equipment and medium
CN116612200A (en) Image processing method, device, equipment and medium
CN115115051A (en) Quantification method and device of neural network model, electronic equipment and storage medium
CN116050543A (en) Data processing method, device, electronic equipment, medium and chip

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant