CN116187387A - Neural network model quantization method, device, computer equipment and storage medium - Google Patents

Neural network model quantization method, device, computer equipment and storage medium

Info

Publication number
CN116187387A
Authority
CN
China
Prior art keywords
neural network, network model, initial neural, layer, target
Prior art date
Legal status
Pending
Application number
CN202310153930.XA
Other languages
Chinese (zh)
Inventor
黄威豪
Current Assignee
Zeku Technology Shanghai Corp Ltd
Original Assignee
Zeku Technology Shanghai Corp Ltd
Priority date
Filing date
Publication date
Application filed by Zeku Technology Shanghai Corp Ltd
Priority to CN202310153930.XA
Publication of CN116187387A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)
  • Image Analysis (AREA)
  • Facsimile Image Signal Circuits (AREA)

Abstract

The application relates to a neural network model quantization method, apparatus, computer device, and storage medium. The method comprises the following steps: acquiring second-order partial derivative information corresponding to a target layer of an initial neural network model according to a reverse computation graph of the initial neural network model and a training sample, the target layer being a network layer that participates in training in the initial neural network model; and obtaining a target bit of each network layer of the quantized initial neural network model according to the second-order partial derivative information of the target layer and a hardware constraint condition corresponding to the initial neural network model. The method can improve the quantization precision of the neural network model.

Description

Neural network model quantization method, device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of neural networks, and in particular, to a neural network model quantization method, apparatus, computer device, and storage medium.
Background
With the wide application of deep learning algorithms in fields such as computer vision, speech processing, and autonomous driving, neural network models have grown ever larger to serve complex, high-precision, multi-task application scenarios. However, when a neural network model is deployed on an edge device or a mobile platform, where resources such as power, latency, and memory are tightly constrained, the model usually has to be quantized first.
In the conventional technique, when a neural network model is quantized, higher-precision bit widths are allocated to the more sensitive layers of the model and lower-precision bit widths to the less sensitive layers, yielding the quantized neural network model.
However, the conventional method suffers from low quantization accuracy.
Disclosure of Invention
The embodiments of the present application provide a neural network model quantization method, apparatus, computer device, and storage medium that can improve the quantization precision of a neural network model.
In a first aspect, an embodiment of the present application provides a neural network model quantization method, including:
acquiring second-order partial derivative information corresponding to a target layer of an initial neural network model according to a reverse computation graph of the initial neural network model and a training sample, wherein the target layer is a network layer participating in training in the initial neural network model; and
obtaining a target bit of each network layer of the quantized initial neural network model according to the second-order partial derivative information of the target layer and a hardware constraint condition corresponding to the initial neural network model.
In a second aspect, an embodiment of the present application provides a neural network model quantization apparatus, including:
a first acquisition module, configured to acquire second-order partial derivative information corresponding to a target layer of an initial neural network model according to a reverse computation graph of the initial neural network model and a training sample, wherein the target layer is a network layer participating in training in the initial neural network model; and
a second acquisition module, configured to obtain a target bit of each network layer of the quantized initial neural network model according to the second-order partial derivative information of the target layer and a hardware constraint condition corresponding to the initial neural network model.
In a third aspect, an embodiment of the present application provides a computer device including a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the neural network model quantization method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method according to the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of the first aspect.
According to the neural network model quantization method, apparatus, computer device, and storage medium described above, second-order partial derivative information corresponding to the target layer participating in training in the initial neural network model can be acquired from the reverse computation graph of the initial neural network model and a training sample. The target bit of each network layer of the quantized initial neural network model can then be obtained from this second-order partial derivative information together with the hardware constraint condition corresponding to the initial neural network model. Because both the hardware constraint condition and the second-order partial derivative information of the layers that participate in training are taken into account when determining the target bits, the bit width of each network layer can be configured at a finer granularity under the given hardware constraint, improving the quantization precision.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required by the embodiments or the description of the prior art are briefly introduced below. The drawings in the following description are only some embodiments of the present application; other drawings can be derived from them by a person of ordinary skill in the art without inventive effort.
FIG. 1 is a flowchart of a neural network model quantization method in one embodiment;
FIG. 2 is a flowchart of a neural network model quantization method in another embodiment;
FIG. 3 is a flowchart of a neural network model quantization method in another embodiment;
FIG. 4 is a block diagram of a neural network model quantization apparatus in one embodiment;
FIG. 5 is a block diagram of a neural network model quantization apparatus according to another embodiment;
FIG. 6 is a block diagram of a neural network model quantization apparatus according to another embodiment;
FIG. 7 is a block diagram of a neural network model quantization apparatus according to another embodiment;
FIG. 8 is a block diagram of a neural network model quantization apparatus according to another embodiment;
FIG. 9 is a block diagram of a neural network model quantization apparatus according to another embodiment;
FIG. 10 is a block diagram of a neural network model quantization apparatus according to another embodiment;
FIG. 11 is a schematic diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a neural network model quantization method is provided. This embodiment is illustrated by applying the method to a computer device; it will be understood that the method may also be applied to a server. In this embodiment, the method includes the following steps:
s201, obtaining second partial derivative information corresponding to a target layer of an initial neural network model according to a reverse calculation diagram and a training sample of the initial neural network model; the target layer is a network layer participating in training in the initial neural network model.
First, it should be noted that the initial neural network model in this embodiment is a trained neural network model, that is, its network structure has already been determined. Optionally, the initial neural network model may be a model used in fields such as image processing, speech processing, natural language processing, or control. It may be a floating-point model (its parameters are floating-point numbers), or a fixed-point or integer model of any bit width (e.g., 16-bit, 12-bit, 10-bit, or 8-bit). In addition, the reverse computation graph of the initial neural network model in this embodiment is obtained by performing backward differentiation on the weight parameters and the input parameters of each layer of the initial neural network model.
It can be understood that some of the network layers constituting a neural network model do not participate in its training process. Optionally, in this embodiment, the network layers that participate in training in the initial neural network model may be identified as target layers according to the labels of the network layers. Further, as an optional implementation, the training sample may be input into the initial neural network model, and the second-order partial derivative information corresponding to the target layer may be obtained from the reverse computation graph and the output of each network layer of the initial neural network model. The training sample may be a sample used when the initial neural network model was trained, a portion of samples randomly drawn from those training samples, or a new sample different from the ones used in training.
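By way of illustration only, the following PyTorch-style sketch shows one way such target layers might be selected programmatically. The gradient-based test stands in for the label-based selection described above, and the function name is hypothetical.

    import torch.nn as nn

    def find_target_layers(model: nn.Module):
        """Collect the layers that participate in training: here, modules whose
        weight tensors require gradients (a stand-in for selection by layer labels)."""
        return {
            name: module
            for name, module in model.named_modules()
            if getattr(module, "weight", None) is not None
            and module.weight.requires_grad
        }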
S202, obtaining the target bit of each network layer of the quantized initial neural network model according to the second-order partial derivative information of the target layer and the hardware constraint condition corresponding to the initial neural network model.
It can be understood that in this embodiment the target bit of each network layer is used to quantize the initial neural network model. Quantization compresses the model, making it lighter and better suited for deployment on devices where resources such as power, latency, and memory are tightly constrained.
Optionally, in this embodiment, an optimization function may be constructed from the second-order partial derivative information of the target layer and the quantized bits of each network layer of the initial neural network model, with the hardware constraint condition corresponding to the initial neural network model acting as the constraint; solving this optimization function yields the target bit of each network layer of the quantized initial neural network model. Optionally, the target bit of each network layer may be the bit corresponding to that layer when the second-order partial derivative information of the target layer is minimized under the above hardware constraint condition. The target bits of the network layers may be the same or different; for example, they may be bits that are integer multiples of 2, such as 2, 4, 6, or 8.
According to the above neural network model quantization method, the second-order partial derivative information corresponding to the target layer participating in training in the initial neural network model can be obtained from the reverse computation graph of the initial neural network model and a training sample, and the target bit of each network layer of the quantized initial neural network model can then be obtained from that information together with the hardware constraint condition corresponding to the initial neural network model. Because both the hardware constraint condition and the second-order partial derivative information of the trained target layers are considered when determining the target bits, the bit width of each network layer can be configured at a finer granularity under the hardware constraint, which improves quantization precision.
This embodiment describes in detail the process of obtaining the target bit of each network layer of the quantized initial neural network model according to the second-order partial derivative information of the target layer and the hardware constraint condition corresponding to the initial neural network model. In one embodiment, S202 includes: minimizing the second-order partial derivative information of the target layer under the hardware resource constraint condition corresponding to the initial neural network model to obtain the target bit of each network layer of the quantized initial neural network model.
Optionally, in this embodiment, the hardware resource constraint condition corresponding to the initial neural network model may be taken as the constraint, and under this constraint an optimization function may be established that involves the second-order partial derivative information of the target layer and the bits of each network layer of the initial neural network model. By minimizing the second-order partial derivative information of the target layer while solving the optimization function, the bits of each network layer are obtained, and the bits corresponding to that minimum are determined as the target bits of the respective network layers.
As an optional implementation, the amount of hardware resources consumed by each network layer of the initial neural network model at different bits, including but not limited to power consumption and memory, may be obtained in advance. Then, according to the target performance index that the initial neural network model is required to achieve and these per-layer, per-bit resource amounts, the hardware resource constraint condition corresponding to the initial neural network model when reaching the target performance index is determined. For example, the minimum amount of hardware resources among those consumed at different bits while the model still reaches the target performance index may be taken as the hardware resource constraint condition.
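Purely as an illustrative sketch, and not the optimization procedure claimed here, the following Python function shows one greedy heuristic for assigning per-layer bits under such a resource budget. The inputs (a per-layer sensitivity score derived from the second-order information, and a per-layer, per-bit resource cost table) and the squared-error proxy are assumptions for the example.

    def allocate_bits(sensitivity, cost, bit_options=(2, 4, 8), budget=1.0):
        """Greedy sketch: start every layer at the widest bit width, then
        repeatedly downgrade the layer whose estimated accuracy penalty per
        unit of resource saved is smallest, until the budget is met.

        sensitivity[l] -- second-order (e.g., Hessian-based) score of layer l
        cost[l][b]     -- hardware resources layer l consumes at b bits
        """
        bits = {l: max(bit_options) for l in sensitivity}
        err = lambda k: 4.0 ** -k  # proxy for squared quantization error at k bits

        def total_cost():
            return sum(cost[l][bits[l]] for l in bits)

        while total_cost() > budget:
            best, best_ratio = None, float("inf")
            for l in bits:
                lower = [b for b in bit_options if b < bits[l]]
                if not lower:
                    continue  # layer already at the narrowest width
                b = max(lower)
                saved = cost[l][bits[l]] - cost[l][b]
                penalty = sensitivity[l] * (err(b) - err(bits[l]))
                ratio = penalty / max(saved, 1e-12)
                if ratio < best_ratio:
                    best, best_ratio = (l, b), ratio
            if best is None:
                break  # budget unreachable even at the narrowest widths
            bits[best[0]] = best[1]
        return bits

In practice the application describes solving an optimization function; an integer program or exhaustive search over bit assignments could replace this heuristic.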
In this embodiment, by minimizing the second-order partial derivative information of the target layers participating in training under the hardware resource constraint condition corresponding to the initial neural network model, each network layer can be configured at a finer granularity, yielding more accurate target bits for each network layer of the quantized initial neural network model and improving the precision of the resulting bit assignment.
In some scenarios, after the target bits of each network layer of the quantized initial neural network model are obtained, the initial neural network model may be quantized using these target bits; when the accuracy of the quantized neural network model does not meet a preset accuracy condition, the target bits are adjusted. In one embodiment, as shown in fig. 2, the method further includes:
s301, each network layer of the initial neural network model is quantized by utilizing each target bit, and a quantized neural network model is obtained.
In this embodiment, after the target bit of each network layer of the quantized initial neural network model is obtained, each network layer may be quantized with its target bit, that is, the bit of each network layer is changed to the corresponding target bit, yielding the quantized neural network model. Further, the accuracy of the quantized model may be tested on a test data set to determine whether it meets a preset accuracy condition. For example, if the error between the output obtained by feeding the test data set into the quantized model and the gold-standard data corresponding to the test data set is small, the accuracy may be judged to meet the preset condition; conversely, if that error is large, the accuracy may be judged not to meet the preset condition.
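For concreteness, here is a minimal PyTorch-style sketch of applying per-layer target bits with uniform symmetric fake quantization. The rounding scheme and the name-keyed target_bits mapping are assumptions of the example, not the scheme prescribed above.

    import torch

    def quantize_weight(w, bit):
        """Uniform symmetric fake quantization of a weight tensor to `bit` bits."""
        qmax = 2 ** (bit - 1) - 1
        scale = w.abs().max().clamp(min=1e-12) / qmax
        return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

    def quantize_model(model, target_bits):
        """Overwrite each listed layer's weights with their quantized values."""
        with torch.no_grad():
            for name, module in model.named_modules():
                if name in target_bits and getattr(module, "weight", None) is not None:
                    module.weight.copy_(
                        quantize_weight(module.weight, target_bits[name]))
        return model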
S302, if the precision of the quantized neural network model does not meet the preset precision condition, taking a new training sample as the training sample and returning to the step of acquiring second-order partial derivative information corresponding to the target layer of the initial neural network model according to the reverse computation graph and the training sample.
Optionally, in this embodiment, the new training sample may be some portion of the samples used when the initial neural network model was trained, a portion randomly drawn from those samples, or a sample different from the ones used in training; this is not limited here.
In this embodiment, if the accuracy of the quantized neural network model is determined not to meet the preset accuracy condition, the new training sample is used as the training sample of step S201, and step S201 is executed again: new second-order partial derivative information corresponding to the target layer is obtained, new target bits of each network layer of the quantized initial neural network model are derived from that information and the hardware constraint condition corresponding to the initial neural network model, and the model is re-quantized with the new target bits. This repeats until the accuracy of the quantized neural network model meets the preset accuracy condition.
In this embodiment, quantizing each network layer of the initial neural network model with its target bit yields a quantized model; whenever the precision of the quantized model falls short of the preset condition, new second-order partial derivative information corresponding to the target layer can be re-acquired from the reverse computation graph and a new training sample, and new target bits for each network layer can be determined again.
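Tying these steps together, a hypothetical driver loop for this retry procedure might look as follows. The callables estimate_sensitivities, layer_costs, and evaluate are placeholders supplied by the caller, while allocate_bits and quantize_model refer to the sketches above.

    import copy

    def quantize_until_ok(model, sample_stream, estimate_sensitivities,
                          layer_costs, evaluate, budget, acc_threshold,
                          max_rounds=5):
        """Re-estimate second-order information on a fresh sample and redo the
        bit allocation until the quantized model meets the accuracy condition.

        estimate_sensitivities(model, x, y) -> per-layer second-order scores (S201)
        layer_costs(model)                  -> per-layer, per-bit resource table
        evaluate(model)                     -> accuracy on the test data set
        """
        q_model, bits = None, None
        for _ in range(max_rounds):
            x, y = next(sample_stream)                  # take a new training sample (S302)
            sens = estimate_sensitivities(model, x, y)  # second-order information
            bits = allocate_bits(sens, layer_costs(model), budget=budget)
            q_model = quantize_model(copy.deepcopy(model), bits)
            if evaluate(q_model) >= acc_threshold:      # preset precision condition
                break
        return q_model, bits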
In some scenarios, after the quantized neural network model is obtained, it may be compressed further. In one embodiment, the method further comprises: compressing the quantized neural network model, where the compression processing includes at least one of pruning processing and parameter sharing processing.
Pruning a neural network model rests on the premise that a trained model has a certain amount of redundancy. On that premise, a compression ratio is assigned, by some allocation algorithm, to every weight layer that needs pruning (such as a convolutional layer Conv or a fully connected layer Dense). Within each weight layer, the importance of the weights along different dimensions is measured by some metric; weights of lower importance are deleted and weights of higher importance are kept, so that the original model becomes a new model with fewer parameters. The weights of the new model are then retrained and fine-tuned so that its performance approaches, or even exceeds, that of the original model. Parameter sharing likewise rests on the premise that a trained model contains parameter redundancy: network layers with redundant parameters share parameters, reducing the memory and computation required to run the model.
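As one concrete, purely illustrative instance of the pruning just described, the sketch below removes the globally smallest-magnitude weights from convolutional and fully connected layers. The sparsity level and the magnitude metric are assumptions of the example, and the retraining/fine-tuning pass would follow separately.

    import torch

    def magnitude_prune(model, sparsity=0.5):
        """Zero out the smallest-magnitude fraction `sparsity` of weights
        across all Conv/Dense-type layers (global magnitude pruning)."""
        weights = [m.weight for m in model.modules()
                   if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))]
        magnitudes = torch.cat([w.detach().abs().flatten() for w in weights])
        threshold = torch.quantile(magnitudes, sparsity)
        with torch.no_grad():
            for w in weights:
                w.mul_((w.abs() > threshold).float())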
In this embodiment, after the quantized neural network model is obtained, it may be pruned, or subjected to parameter sharing, or both at the same time. Further, as an optional implementation, the quantized model may also undergo deployment-specific preprocessing according to the actual deployment environment.
In this embodiment, compressing the quantized neural network model further reduces the memory and computation required to run it, making the processed model better suited for deployment on devices where resources such as power, latency, and memory are tightly constrained.
In the above scenario, acquiring the second-order partial derivative information corresponding to the target layer requires the reverse computation graph of the initial neural network model first, so this embodiment describes in detail how that graph is obtained. In one embodiment, the method further comprises: performing differentiation on the weight parameters and input parameters of each layer of the initial neural network model according to the value of the model's loss function and its network structure, thereby constructing the reverse computation graph.
In this embodiment, according to the network structure of the initial neural network model, the output of the model and the value of its loss function are taken as the input parameters of the last network layer, and backward differentiation is performed on the weight parameters and input parameters of that layer. The input of the last network layer is then taken as the input parameter of the preceding network layer, whose weight and input parameters are differentiated in the same way. Each layer of the model is traversed in this manner, backward differentiation is applied to the weight parameters and input parameters of every layer, and the per-layer differentiation results are assembled into the reverse computation graph of the initial neural network model.
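In a framework with automatic differentiation, retaining the graph of the backward pass is one way to realize such a reverse computation graph. The following PyTorch sketch shows this setup under that assumption (the model, loss function, and data are placeholders): create_graph=True records the backward pass itself, so second-order quantities can later be differentiated from the returned gradients.

    import torch

    def build_reverse_graph(model, loss_fn, x, y):
        """One forward/backward pass whose backward computation is itself
        recorded, so the first-order gradients can be differentiated again."""
        loss = loss_fn(model(x), y)
        params = [p for p in model.parameters() if p.requires_grad]
        # create_graph=True keeps the backward pass as a differentiable graph.
        grads = torch.autograd.grad(loss, params, create_graph=True)
        return loss, params, grads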
In this embodiment, differentiating the weight parameters and input parameters of each layer according to the value of the loss function and the network structure of the initial neural network model is a relatively simple procedure, so the reverse computation graph can be constructed quickly, improving the efficiency of its construction.
This embodiment describes in detail the process of acquiring the second-order partial derivative information corresponding to the target layer of the initial neural network model from the reverse computation graph and the training sample. In one embodiment, step S201 includes: differentiating the weight parameters of the target layer according to the reverse computation graph and the training sample to obtain the second-order partial derivative information corresponding to the target layer.
In this embodiment, the training sample may be input into the initial neural network model, and, using the obtained reverse computation graph, the weight parameters of the target layers participating in training are differentiated to obtain the second-order partial derivative information of the loss function with respect to the weight parameters of each target layer.
In this embodiment, the weight parameters of the target layer can be differentiated quickly from the reverse computation graph and the training sample of the initial neural network model, improving the efficiency of obtaining the second-order partial derivative information corresponding to the target layer.
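By way of example, second-order information of the loss with respect to a target layer's weights could be summarized by the Hessian trace, estimated with Hutchinson's method from the retained backward graph. The probe count and the use of a trace (rather than the full second-order tensor) are assumptions of this sketch.

    import torch

    def layer_hessian_trace(loss, weight, n_probes=8):
        """Estimate tr(H) for one target layer via Hutchinson's estimator:
        the expectation of v^T H v over random Rademacher vectors v equals
        the trace of the Hessian H of the loss w.r.t. the weights."""
        grad, = torch.autograd.grad(loss, weight, create_graph=True)
        trace = 0.0
        for _ in range(n_probes):
            v = torch.randint_like(weight, high=2) * 2.0 - 1.0  # entries in {-1, +1}
            # Hessian-vector product: differentiate the gradient along v.
            hv, = torch.autograd.grad(grad, weight, grad_outputs=v,
                                      retain_graph=True)
            trace += (v * hv).sum().item()
        return trace / n_probes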
For ease of understanding by those skilled in the art, the neural network model quantization method provided in the present disclosure is described below in detail, as shown in fig. 3. The method may include:
S1, performing differentiation on the weight parameters and input parameters of each layer of the initial neural network model according to the value of the model's loss function and its network structure, and constructing the reverse computation graph of the initial neural network model.
S2, differentiating the weight parameters of the target layer according to the reverse computation graph and the training sample to obtain second-order partial derivative information corresponding to the target layer of the initial neural network model; the target layer is a network layer participating in training in the initial neural network model.
S3, determining the hardware resource constraint condition corresponding to the initial neural network model according to the amount of hardware resources consumed by each network layer at different bits and the target performance index of the initial neural network model.
S4, minimizing the second-order partial derivative information of the target layer under the hardware resource constraint condition to obtain the target bit of each network layer of the quantized initial neural network model.
S5, quantizing each network layer of the initial neural network model with its target bit to obtain a quantized neural network model.
S6, if the precision of the quantized neural network model does not meet the preset precision condition, taking a new training sample as the training sample and returning to the step of differentiating the weight parameters of the target layer according to the reverse computation graph and the training sample.
S7, compressing the quantized neural network model; the compression includes at least one of pruning and parameter sharing.
It should be noted that, for the description of the above steps, reference may be made to the description related to the above embodiments, and the effects thereof are similar, which is not repeated herein.
It should be understood that although the steps in the flowcharts of the above embodiments are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and they may be executed in other orders. Moreover, at least some of the steps may comprise several sub-steps or stages, which need not be completed at the same moment but may be executed at different times, and whose order likewise need not be sequential: they may be executed in turn or alternately with other steps or with sub-steps or stages of other steps.
Based on the same inventive concept, an embodiment of the present application further provides a neural network model quantization apparatus for implementing the above neural network model quantization method. The solution it provides is similar to that described in the method embodiments, so for the specific limitations of the one or more apparatus embodiments below, reference may be made to the limitations of the neural network model quantization method above, which are not repeated here.
In one embodiment, as shown in fig. 4, there is provided a neural network model quantization apparatus, including: a first acquisition module 10 and a second acquisition module 11, wherein:
the first acquisition module 10 is configured to acquire second-order partial derivative information corresponding to a target layer of the initial neural network model according to a reverse computation graph of the initial neural network model and a training sample; the target layer is a network layer participating in training in the initial neural network model.
The second acquisition module 11 is configured to obtain the target bit of each network layer of the quantized initial neural network model according to the second-order partial derivative information of the target layer and the hardware constraint condition corresponding to the initial neural network model.
The neural network model quantization device provided in this embodiment may execute the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
On the basis of the above embodiment, as shown in fig. 5, optionally, the above second obtaining module 11 includes: a first acquisition unit 111, wherein:
the first obtaining unit 111 is configured to minimize the second-order partial derivative information of the target layer under the hardware resource constraint condition corresponding to the initial neural network model to obtain the target bit of each network layer of the quantized initial neural network model.
The neural network model quantization device provided in this embodiment may execute the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
On the basis of the above embodiment, as shown in fig. 6, optionally, the above apparatus further includes: a determination module 12, wherein:
the determining module 12 is configured to determine the hardware resource constraint condition corresponding to the initial neural network model according to the amount of hardware resources consumed by each network layer of the initial neural network model at different bits and the target performance index of the initial neural network model.
The neural network model quantization device provided in this embodiment may execute the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
On the basis of the above embodiment, as shown in fig. 7, optionally, the above apparatus further includes: a quantization module 13 and an operation module 14, wherein:
and the quantization module 13 is configured to perform quantization processing on each network layer of the initial neural network model by using each target bit, so as to obtain a quantized neural network model.
and an operation module 14, configured to take a new training sample as the training sample if the precision of the quantized neural network model does not meet the preset precision condition, and return to the step of acquiring second-order partial derivative information corresponding to the target layer of the initial neural network model according to the reverse computation graph and the training sample.
The neural network model quantization device provided in this embodiment may execute the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
On the basis of the above embodiment, as shown in fig. 8, optionally, the above apparatus further includes: a processing module 15, wherein:
the processing module 15 is used for compressing the quantized neural network model; the compression process includes at least one of pruning process and parameter sharing process.
The neural network model quantization device provided in this embodiment may execute the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
On the basis of the above embodiment, as shown in fig. 9, optionally, the above apparatus further includes: build module 16, wherein:
and the construction module 16 is configured to perform differentiation on the weight parameters and input parameters of each layer of the initial neural network model according to the value of the model's loss function and its network structure, thereby constructing the reverse computation graph.
The neural network model quantization device provided in this embodiment may execute the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
On the basis of the above embodiment, as shown in fig. 10, optionally, the above first obtaining module 10 includes: a second acquisition unit 101, wherein:
the second obtaining unit 101 is configured to differentiate the weight parameters of the target layer according to the reverse computation graph and the training sample to obtain the second-order partial derivative information corresponding to the target layer.
The neural network model quantization device provided in this embodiment may execute the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
The above-described respective modules in the neural network model quantization apparatus may be implemented in whole or in part by software, hardware, and a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 11. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing target bits of each network layer of the initial neural network model. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a neural network model quantization method.
It will be appreciated by those skilled in the art that the structure shown in fig. 11 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application applies, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
Embodiments of the present application also provide a computer-readable storage medium: one or more non-transitory computer-readable storage media containing computer-executable instructions that, when executed by one or more processors, cause the processors to perform the steps of the neural network model quantization method.
Embodiments of the present application also provide a computer program product containing instructions that, when run on a computer, cause the computer to perform a neural network model quantization method.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by a computer program stored on a non-transitory computer-readable storage medium which, when executed, may include the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric memory (Ferroelectric Random Access Memory, FRAM), phase change memory (Phase Change Memory, PCM), graphene memory, and the like. Volatile memory may include random access memory (Random Access Memory, RAM), external cache memory, and the like. By way of illustration, and not limitation, RAM is available in a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational and non-relational databases. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum-computing-based data processing logic units, and the like, without being limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of those technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered to be within the scope of this specification.
The above examples express only a few embodiments of the present application; their descriptions are specific and detailed, but they should not therefore be construed as limiting the scope of the present application. It should be noted that a person of ordinary skill in the art could make various modifications and improvements without departing from the concept of the present application, and these would all fall within the protection scope of the present application. Accordingly, the protection scope of the present application shall be subject to the appended claims.

Claims (11)

1. A neural network model quantization method, comprising:
acquiring second-order partial derivative information corresponding to a target layer of an initial neural network model according to a reverse computation graph of the initial neural network model and a training sample, wherein the target layer is a network layer participating in training in the initial neural network model; and
obtaining a target bit of each network layer of the quantized initial neural network model according to the second-order partial derivative information of the target layer and a hardware constraint condition corresponding to the initial neural network model.
2. The method according to claim 1, wherein the obtaining the target bit of each network layer of the quantized initial neural network model according to the second-order partial derivative information of the target layer and the hardware constraint condition corresponding to the initial neural network model comprises:
minimizing the second-order partial derivative information of the target layer under the hardware resource constraint condition corresponding to the initial neural network model to obtain the target bit of each network layer of the quantized initial neural network model.
3. The method according to claim 2, wherein the method further comprises:
determining the hardware resource constraint condition corresponding to the initial neural network model according to the amount of hardware resources consumed by each network layer of the initial neural network model at different bits and the target performance index of the initial neural network model.
4. A method according to any one of claims 1-3, wherein the method further comprises:
performing quantization processing on each network layer of the initial neural network model by using the respective target bits to obtain a quantized neural network model; and
if the precision of the quantized neural network model does not meet a preset precision condition, taking a new training sample as the training sample, and returning to the step of acquiring second-order partial derivative information corresponding to the target layer of the initial neural network model according to the reverse computation graph and the training sample.
5. The method according to claim 4, wherein the method further comprises:
compressing the quantized neural network model, wherein the compression processing comprises at least one of pruning processing and parameter sharing processing.
6. A method according to any one of claims 1-3, wherein the method further comprises:
performing differentiation processing on the weight parameters of each layer of the initial neural network model and the input parameters of each layer according to the value of the loss function of the initial neural network model and the network structure of the initial neural network model, and constructing the reverse computation graph.
7. A method according to any one of claims 1 to 3, wherein the acquiring second-order partial derivative information corresponding to the target layer of the initial neural network model according to the reverse computation graph of the initial neural network model and the training sample comprises:
performing differentiation processing on the weight parameters of the target layer according to the reverse computation graph and the training sample to obtain the second-order partial derivative information corresponding to the target layer.
8. A neural network model quantization apparatus, comprising:
a first acquisition module, configured to acquire second-order partial derivative information corresponding to a target layer of an initial neural network model according to a reverse computation graph of the initial neural network model and a training sample, wherein the target layer is a network layer participating in training in the initial neural network model; and
a second acquisition module, configured to obtain a target bit of each network layer of the quantized initial neural network model according to the second-order partial derivative information of the target layer and a hardware constraint condition corresponding to the initial neural network model.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the computer program, when executed by the processor, causes the processor to perform the steps of the neural network model quantization method of any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
11. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
CN202310153930.XA 2023-02-22 2023-02-22 Neural network model quantization method, device, computer equipment and storage medium Pending CN116187387A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310153930.XA 2023-02-22 2023-02-22 Neural network model quantization method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310153930.XA 2023-02-22 2023-02-22 Neural network model quantization method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116187387A (en) 2023-05-30

Family

ID=86444090

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310153930.XA (Pending, published as CN116187387A) 2023-02-22 2023-02-22 Neural network model quantization method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116187387A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117787358A (en) * 2024-01-16 2024-03-29 之江实验室 Model quantization method, device and equipment based on resistive random access memory


Similar Documents

Publication Publication Date Title
KR102170105B1 (en) Method and apparatus for generating neural network structure, electronic device, storage medium
KR20210032140A (en) Method and apparatus for performing pruning of neural network
CN111126668A (en) Spark operation time prediction method and device based on graph convolution network
CN115168240B (en) Variable combination time sequence function coverage rate-based test method and system
CN116187387A (en) Neural network model quantization method, device, computer equipment and storage medium
Evermann et al. XES tensorflow-Process prediction using the tensorflow deep-learning framework
CN115689018A (en) Material demand prediction method, device, equipment, storage medium and program product
CN112465140A (en) Convolutional neural network model compression method based on packet channel fusion
CN116188878A (en) Image classification method, device and storage medium based on neural network structure fine adjustment
CN113886092A (en) Computation graph execution method and device and related equipment
CN117371508A (en) Model compression method, device, electronic equipment and storage medium
CN116737373A (en) Load balancing method, device, computer equipment and storage medium
CN115860099A (en) Neural network model compression method and device, computer equipment and storage medium
CN116306879A (en) Data processing method, device, electronic equipment and storage medium
CN115392441A (en) Method, apparatus, device and medium for on-chip adaptation of quantized neural network model
CN111931930A (en) Model pruning method and device and electronic equipment
CN114826951B (en) Service automatic degradation method, device, computer equipment and storage medium
CN117474050A (en) Model quantization method, apparatus, computer device, and readable storage medium
CN111767204A (en) Overflow risk detection method, device and equipment
CN118130508B (en) Excitation pouring bus quality detection method, device and computer equipment
Chen et al. Fine-grained channel pruning for deep residual neural networks
CN117911794B (en) Model obtaining method and device for image classification, electronic equipment and storage medium
CN117669449B (en) De-excitation circuit determining method, de-excitation circuit determining device, computer equipment and storage medium
CN118069478A (en) Chip calculation force determining method, device, computer equipment and storage medium
CN117424198A (en) Power load prediction method and device based on data knowledge combined driving

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination