CN113326920A - Quantization method, apparatus and device for a neural network model - Google Patents

Quantization method, apparatus and device for a neural network model

Info

Publication number
CN113326920A
CN113326920A (application CN202110581957.XA)
Authority
CN
China
Prior art keywords
branch
precision
neural network
network model
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110581957.XA
Other languages
Chinese (zh)
Inventor
康瑞鹏
游亮
龙欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Innovation Co
Original Assignee
Alibaba Singapore Holdings Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Singapore Holdings Pte Ltd
Priority: CN202110581957.XA
Publication: CN113326920A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/048 Activation functions
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a quantization method for a neural network model, which comprises the following steps: obtaining a first operation branch and a second operation branch contained in a target operation module of a neural network model to be quantized, wherein the number of operation layers of the first operation branch is greater than that of the second operation branch; quantizing the first operation branch to a first precision; and quantizing the second operation branch to a second precision, the first precision being lower than the second precision. This method solves the prior-art problems that, when hybrid quantization is applied to a neural network model, the hybrid strategy is obtained inefficiently and is not optimal.

Description

Quantization method, apparatus and device for a neural network model
Technical Field
This application relates to the field of computer technology, and in particular to a quantization method and apparatus for a neural network model, as well as to a related electronic device and storage medium.
Background
Inference workloads for neural network models on public clouds are increasingly important, and there is a corresponding need to optimize inference performance. Newer GPUs (graphics processing units) on public clouds support INT8 (8-bit signed integer) operations, whose nominal compute capability is twice that of FP16 (half-precision floating point). To improve inference speed, a trained FP16 neural network model therefore needs to be quantized to an INT8 model. However, for tasks with strict accuracy requirements, directly quantizing the neural network model to an INT8 model with a PTQ (Post-Training Quantization) scheme incurs an unacceptable accuracy loss.
In the prior art, for a customer's neural network model that requires acceleration, a hybrid quantization scheme is adopted to further improve inference performance while preserving accuracy and to exploit the compute capability of the target hardware to the greatest extent. When the model is quantized to INT8 and its accuracy falls short of the requirement, the quantization sensitivity of each individual operation layer is measured: layers sensitive to quantization are kept at high precision, and insensitive operation layers are quantized to INT8.
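As an illustrative sketch only, the layer-by-layer search described above can be pictured as follows; build_engine and evaluate are hypothetical stand-ins for a real compile-and-benchmark pipeline (for example, an inference-engine build followed by a validation pass) and are assumptions rather than anything specified in this disclosure:

    # Sketch of the prior-art layer-by-layer sensitivity search (hypothetical helpers).
    def sensitivity_search(layers, build_engine, evaluate, accuracy_floor):
        """Quantize one layer at a time to INT8 and record the resulting accuracy."""
        accuracy = {}
        for layer in layers:
            # Rebuilding the model once per candidate layer is what makes this
            # search slow: each build compiles a new mixed-precision model.
            engine = build_engine(int8_layers={layer})
            accuracy[layer] = evaluate(engine)
        # Layers whose solo quantization drops accuracy below the floor are
        # deemed sensitive and kept at high precision (FP16).
        return {layer for layer, acc in accuracy.items() if acc < accuracy_floor}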
This layer-by-layer solution has several drawbacks. First, for a large-scale model, measuring the sensitivity of every operation layer on a dataset takes a long time, much of which is spent compiling candidate mixed-precision models. Second, when two operation layers are both highly sensitive to quantization, it is not necessarily the case that both must be kept at high precision; in some cases, keeping only one of them at high precision is enough to guarantee accuracy.
In summary, in the prior art, when hybrid quantization is applied to a neural network model, obtaining the hybrid strategy is inefficient and the resulting strategy is suboptimal.
Disclosure of Invention
The application provides a quantization method and apparatus for a neural network model, and an electronic device, aiming to solve the prior-art problems that, when hybrid quantization is applied to a neural network model, the hybrid strategy is obtained inefficiently and is not optimal.
The application provides a quantization method for a neural network model, which comprises the following steps:
obtaining a first operation branch and a second operation branch contained in a target operation module of a neural network model to be quantized; the number of the operation layers of the first operation branch is greater than that of the second operation branch;
quantizing the first operational branch to a first precision;
quantizing the second operation branch to a second precision; the first precision is less than the second precision.
As an embodiment, the first operation branch is an operation branch comprising multiple operation layers; the second operation branch is a shortcut branch.
As an embodiment, the first precision is eight-bit integer (INT8) precision; the second precision is half-precision floating-point (FP16) precision.
As an embodiment, the obtaining a first operation branch and a second operation branch included in a target operation module of a neural network model to be quantized includes:
determining a target operation module of a neural network model to be quantized;
and according to the target operation module, obtaining a first operation branch and a second operation branch contained in the target operation module of the neural network model to be quantized.
As an embodiment, the determining a target operation module of the neural network model to be quantized includes:
traversing the neural network model to be quantized;
judging whether the output of the current operation passes through two operation branches, if so, taking the output of the current operation as the initial position of a target operation module;
determining a merging position of a target operation module;
and taking the operation from the initial position to the merging position of the target operation module as the target operation module.
As an embodiment, the determining the merging position of the target operation module includes:
acquiring a preset merging position identifier;
and determining the merging position of the target operation module according to the merging position identifier.
As an embodiment, further comprising:
and quantizing operation modules except the target operation module in the neural network model into second precision.
The present application further provides a quantization apparatus of a neural network model, including:
the branch obtaining unit is used for obtaining a first operation branch and a second operation branch contained in a target operation module of the neural network model to be quantized; the number of the operation layers of the first operation branch is greater than that of the second operation branch;
a first operation branch quantization unit for quantizing the first operation branch to a first precision;
a second operation branch quantization unit configured to quantize the second operation branch to a second precision; the first precision is less than the second precision.
As an embodiment, the first operation branch is an operation branch comprising multiple operation layers; the second operation branch is a shortcut branch.
As an embodiment, the first precision is eight-bit integer (INT8) precision; the second precision is half-precision floating-point (FP16) precision.
As an embodiment, the branch obtaining unit is specifically configured to:
determining a target operation module of a neural network model to be quantized;
and according to the target operation module, obtaining a first operation branch and a second operation branch contained in the target operation module of the neural network model to be quantized.
As an embodiment, the branch obtaining unit is specifically configured to:
traversing the neural network model to be quantized;
judging whether the output of the current operation passes through two operation branches, if so, taking the output of the current operation as the initial position of a target operation module;
determining a merging position of a target operation module; and taking the operation from the initial position to the merging position of the target operation module as the target operation module.
As an embodiment, the branch obtaining unit is specifically configured to:
acquiring a preset merging position identifier;
and determining the merging position of the target operation module according to the merging position identifier.
As an embodiment, the apparatus further comprises:
and the other operation module quantization units are used for quantizing operation modules except the target operation module into second precision.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program of a quantization method of a neural network model; after the device is powered on and the program of the quantization method of the neural network model is run by the processor, the device performs the following steps:
obtaining a first operation branch and a second operation branch contained in a target operation module of a neural network model to be quantized; the number of the operation layers of the first operation branch is greater than that of the second operation branch;
quantizing the first operational branch to a first precision;
quantizing the second operation branch to a second precision; the first precision is less than the second precision.
As an embodiment, the first operation branch is an operation branch comprising multiple operation layers; the second operation branch is a shortcut branch.
As an embodiment, the first precision is eight-bit integer (INT8) precision; the second precision is half-precision floating-point (FP16) precision.
As an embodiment, the obtaining a first operation branch and a second operation branch included in a target operation module of a neural network model to be quantized includes:
determining a target operation module of a neural network model to be quantized;
and according to the target operation module, obtaining a first operation branch and a second operation branch contained in the target operation module of the neural network model to be quantized.
As an embodiment, the determining a target operation module of the neural network model to be quantized includes:
traversing the neural network model to be quantized;
judging whether the output of the current operation passes through two operation branches, if so, taking the output of the current operation as the initial position of a target operation module;
determining a merging position of a target operation module; and taking the operation from the initial position to the merging position of the target operation module as the target operation module.
As an embodiment, the determining the merging position of the target operation module includes:
acquiring a preset merging position identifier;
and determining the merging position of the target operation module according to the merging position identifier.
As an embodiment, the electronic device further performs the following steps:
and quantizing operation modules except the target operation module into second precision.
The present application further provides a storage medium storing a program of a quantization method of a neural network model, the program being executed by a processor to perform the steps of:
obtaining a first operation branch and a second operation branch contained in a target operation module of a neural network model to be quantized; the number of the operation layers of the first operation branch is greater than that of the second operation branch;
quantizing the first operational branch to a first precision;
quantizing the second operation branch to a second precision; the first precision is less than the second precision.
Compared with the prior art, the method has the following advantages:
the application provides a quantification method of a neural network model, which comprises the following steps: obtaining a first operation branch and a second operation branch contained in a target operation module of a neural network model to be quantized; the number of the operation layers of the first operation branch is greater than that of the second operation branch; quantizing the first operational branch to a first precision; quantizing the second operation branch to a second precision; the first precision is less than the second precision. The quantization method of the neural network model, the first branch with the large number of operation layers in the target operation module of the neural network model to be quantized is quantized to low precision, the second branch with the small number of operation layers is directly quantized to high precision, and the measurement of quantization sensitivity of each operation layer is not needed, so that the efficiency of hybrid strategy acquisition when the neural network model is subjected to hybrid quantization is high. The method and the device solve the problems of low efficiency and insufficient optimization of hybrid strategy acquisition when the neural network model is subjected to hybrid quantization.
Drawings
Fig. 1A is an application scenario diagram of a quantization method of a neural network model provided in the present application.
Fig. 1 is a flowchart of a quantization method of a neural network model according to a first embodiment of the present application.
Fig. 2 is a schematic diagram of an operation branch of a target operation module according to a first embodiment of the present application.
Fig. 3 is a schematic diagram of a target operation module according to a first embodiment of the present application.
Fig. 4 is a schematic diagram of a quantization apparatus of a neural network model according to a second embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; these embodiments are provided so that this disclosure will be thorough and complete.
In order to show the present application more clearly, an application scenario of the quantization method of the neural network model provided in the first embodiment of the present application is briefly described.
The quantization method of the neural network model provided in the first embodiment of the present application may be applied to a scenario in which a client and a server interact. As shown in fig. 1A, when a quantized neural network model is needed, a connection is usually established between the client and the server. After the connection is established, the client sends the neural network model to be quantized to the server. Once the server obtains the model, the branch obtaining unit 101 first obtains a first operation branch and a second operation branch contained in a target operation module of the neural network model to be quantized, where the number of operation layers of the first operation branch is greater than that of the second operation branch. Then the first operation branch quantization unit 102 quantizes the first operation branch to a first precision, and the second operation branch quantization unit 103 quantizes the second operation branch to a second precision.
A first embodiment of the present application provides a method for quantizing a neural network model, which is described below with reference to fig. 1.
As shown in fig. 1, in step S101, a first operation branch and a second operation branch included in a target operation module of a neural network model to be quantized are obtained; the number of operation layers of the first operation branch is larger than that of the second operation branch.
The neural network model to be quantized refers to a neural network model that has been trained on training data. In the prior art, when such a model is quantized via PTQ (Post-Training Quantization) directly into an INT8 model, there is an unacceptable accuracy loss. This application instead quantizes some operation layers to high precision and others to low precision; this hybrid quantization can satisfy the required accuracy.
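For intuition about why direct INT8 PTQ loses accuracy, the following minimal sketch quantizes a single tensor with symmetric max-abs calibration; real PTQ pipelines use richer calibration such as percentile or entropy methods, and this sketch is illustrative rather than the method of this application:

    import numpy as np

    def ptq_quantize(x, num_bits=8):
        """Symmetric per-tensor PTQ: float -> INT8, returning values and scale."""
        qmax = 2 ** (num_bits - 1) - 1                 # 127 for INT8
        scale = np.abs(x).max() / qmax                 # max-abs calibration
        q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale

    x = np.random.randn(1000).astype(np.float32)
    q, scale = ptq_quantize(x)
    print("max round-trip error:", np.abs(x - q.astype(np.float32) * scale).max())

The rounding error shown here accumulates across layers and, for accuracy-critical tasks, motivates keeping part of the network at FP16.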
The target operation module may refer to a module in the neural network model that contains two operation branches. For example, the portion from the mul at the top of fig. 2 to the concat is a target operation module. Fig. 2 shows only part of a neural network model; the whole model comprises a plurality of similar target operation modules.
An operation layer may refer to a single operation in the neural network model. For example, conv (convolution), sigmoid, and mul (multiplication) in fig. 2 can each be referred to as an operation layer.
The number of operation layers of the first operation branch may refer to the number of individual operations included in the first operation branch.
The number of operation layers of the second operation branch may refer to the number of individual operations included in the second operation branch. As shown in fig. 2, the number of operation layers of the second operation branch is 3.
Fig. 2, which shows part of a YoloV5 network, can be regarded as an example of a target operation module: 2-1 is the first operation branch, and it contains far more operation layers than the second operation branch 2-2.
The first operation branch may be a branch comprising multiple operation layers; the second operation branch may be a shortcut branch.
Referring to fig. 2, 2-1 is a first operation branch containing multi-layer operations, and 2-2 is a shortcut branch.
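To make the fork-merge structure concrete, here is a PyTorch-style sketch of such a module; the exact layer layout is an assumption loosely modeled on a YoloV5-style block, not taken from the patent figures:

    import torch
    import torch.nn as nn

    class ForkMergeBlock(nn.Module):
        """Illustrative target operation module: a long branch (2-1) and a
        shortcut branch (2-2) that merge at a concat."""
        def __init__(self, channels):
            super().__init__()
            # First operation branch: many operation layers.
            self.branch1 = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.SiLU(),     # SiLU = x * sigmoid(x), the conv/sigmoid/mul pattern
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.SiLU(),
            )
            # Second operation branch: a shortcut with no operation layers.

        def forward(self, x):
            # Merging position: concat of the long branch and the shortcut.
            return torch.cat([self.branch1(x), x], dim=1)

    block = ForkMergeBlock(16)
    print(block(torch.randn(1, 16, 32, 32)).shape)   # torch.Size([1, 32, 32, 32])

Under the quantization scheme of this application, branch1 would be quantized to the first (lower) precision while the shortcut stays at the second (higher) precision.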
The obtaining a first operation branch and a second operation branch contained in the target operation module of the neural network model to be quantized comprises:
determining a target operation module of a neural network model to be quantized;
and according to the target operation module, obtaining a first operation branch and a second operation branch contained in the target operation module of the neural network model to be quantized.
In specific implementation, the target operation module of the neural network model to be quantized may be determined first, and then the first operation branch and the second operation branch included in the target operation module of the neural network model to be quantized are obtained according to the target operation module.
The determining a target operation module of the neural network model to be quantized comprises:
traversing the neural network model to be quantized;
judging whether the output of the current operation passes through two operation branches, if so, taking the output of the current operation as the initial position of a target operation module;
determining a merging position of a target operation module;
and taking the operation from the initial position to the merging position of the target operation module as the target operation module.
For example, the output of the mul in fig. 2 feeds two operation branches, 2-1 and 2-2, so the output of the mul is taken as the starting position of the target operation module.
The determining the merging position of the target operation module includes:
acquiring a preset merging position identifier;
and determining the merging position of the target operation module according to the merging position identifier.
The preset merging position identifier may be an identifier such as concat, add, or mul; the concat in fig. 2 is a merging position identifier.
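A minimal sketch of this traversal follows; the graph is assumed to be available as a map from each node to its consumers plus a map from node to op type, and the node names below are hypothetical:

    MERGE_OPS = {"concat", "add", "mul"}       # preset merging position identifiers

    def reachable(consumers, start):
        """All nodes reachable from start along the data flow."""
        seen, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(consumers.get(n, []))
        return seen

    def find_target_modules(consumers, op_type):
        """Return (start, merge) pairs for forks that re-merge at a merge op."""
        modules = []
        for node, outs in consumers.items():
            if len(outs) == 2:                 # output feeds two operation branches
                common = reachable(consumers, outs[0]) & reachable(consumers, outs[1])
                merges = [n for n in common if op_type[n] in MERGE_OPS]
                if merges:                           # a full pass would pick the
                    modules.append((node, merges[0]))  # topologically first merge
        return modules

    # Toy graph mirroring fig. 2: mul forks into a long branch and a shortcut.
    consumers = {"mul": ["conv1", "concat"], "conv1": ["conv2"],
                 "conv2": ["concat"], "concat": []}
    op_type = {"mul": "mul", "conv1": "conv", "conv2": "conv", "concat": "concat"}
    print(find_target_modules(consumers, op_type))   # [('mul', 'concat')]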
As shown in fig. 1, in step S102, the first operation branch is quantized to a first precision.
As shown in fig. 1, in step S103, the second operation branch is quantized to a second precision; the first precision is less than the second precision.
The first precision refers to the lower precision employed when performing neural network model hybrid quantization.
The second precision refers to the higher precision adopted when carrying out the neural network model hybrid quantization.
Hybrid quantization means that different parts of the neural network model operate at different precisions.
Specifically, the first precision may be eight-bit integer (INT8) precision, and the second precision may be half-precision floating-point (FP16) precision. Depending on the scenario, the first precision and the second precision may be other precisions, as long as the first precision is lower than the second precision.
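Once target modules are located (for example with the find_target_modules sketch above), assigning precisions reduces to a simple mapping. The branch_layers helper below, which lists the operation layers on each branch of a module, is an assumption rather than anything specified in this disclosure:

    FIRST_PRECISION, SECOND_PRECISION = "int8", "fp16"   # first precision < second

    def build_precision_map(target_modules, branch_layers):
        """branch_layers: (start, merge) -> (first_branch_layers, second_branch_layers)."""
        precision = {}
        for module in target_modules:
            long_branch, short_branch = branch_layers[module]
            for layer in long_branch:
                precision[layer] = FIRST_PRECISION       # many layers -> INT8
            for layer in short_branch:
                precision[layer] = SECOND_PRECISION      # shortcut -> FP16
        return precision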
The advantages of the first embodiment of the present application over the prior-art hybrid strategy based on measuring sensitivity layer by layer are described below with reference to fig. 3:
1. This scheme does not need to measure the quantization sensitivity of each layer; a hybrid strategy for hybrid quantization can be derived quickly by analyzing the network structure. In fig. 3, a target operation module is obtained by analyzing the network structure, branch 3-1 is directly set to the first precision, and branch 3-2 is set to the second precision.
2. When two operation layers are both highly sensitive to quantization, it is not necessarily the case that both must be kept at high precision; keeping only one of them at high precision can still guarantee accuracy, which is a better choice than the layer-by-layer sensitivity algorithm makes. As shown in fig. 3, the operations in box 3-3 are sensitive to quantization when tested alone, as are the operations in box 3-2. Because boxes 3-3 and 3-2 each individually satisfy the FP16 accuracy criterion, the layer-by-layer hybrid algorithm keeps the operations in both boxes at FP16 precision; the algorithm in this application keeps only the operations in box 3-2 at FP16, while the operations in box 3-3 run at INT8 precision. Experimental results show that the two algorithms have essentially the same effect on final accuracy, while the model produced by this application's hybrid quantization has lower inference latency.
In this method, the target operation module is determined from the structural characteristics of the neural network model, the first operation branch is quantized to the first precision, and the second operation branch is quantized to the second precision; there is no need to measure quantization sensitivity layer by layer, and the inference speed of the model is significantly improved.
It should be noted that fig. 2 shows only part of the neural network model, not the whole network; a neural network model may comprise a plurality of target operation modules (fork-merge structures) as well as modules other than target operation modules.
As an implementation manner, the first embodiment of the present application may further include:
quantizing the operation modules in the neural network model other than the target operation module to the second precision.
As an implementation manner, the first embodiment of the present application may further include:
quantizing one part of the operation modules in the neural network model other than the target operation module to the second precision, and quantizing the other part to the first precision.
An operation module may refer to an operation layer.
Specifically, among the operation modules outside the target operation module in the neural network model, one part may be quantized to the second precision and another part to the first precision: modules with high sensitivity to quantization are quantized to the second precision, and modules with low sensitivity to quantization are quantized to the first precision.
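Continuing the sketch above, layers outside any target operation module can then be filled in by a default rule; the is_sensitive predicate is a hypothetical stand-in for whatever sensitivity test an implementation chooses:

    def complete_precision_map(all_layers, precision, is_sensitive):
        """Assign precisions to operation layers outside any target module."""
        for layer in all_layers:
            if layer not in precision:
                precision[layer] = (SECOND_PRECISION if is_sensitive(layer)
                                    else FIRST_PRECISION)
        return precision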
In the first embodiment of the present application, only the second operation branch of the target operation module, which has few operations, is quantized to the second precision, while the first operation branch, which has many operations, is quantized to the first precision, so inference speed can be increased significantly. According to experiments on the YoloV5l network, compared with quantizing everything to FP16, the hybrid quantization proposed in this scheme improves inference speed by 13.6% with an accuracy loss of less than 1%.
The first embodiment of the present application can be applied to scenarios such as image classification, target object detection, scene segmentation, and face recognition based on convolutional neural network models. Taking target object detection as an example, adopting this scheme reduces the processing time per frame and improves throughput.
This completes the description of the first embodiment of the present application. In this quantization method, the first branch with many operation layers in the target operation module of the neural network model to be quantized is quantized to low precision, and the second branch with few operation layers is directly quantized to high precision; no per-layer measurement of quantization sensitivity is needed, so the hybrid strategy for hybrid quantization is obtained efficiently. The method thus solves the prior-art problems of inefficient and suboptimal hybrid-strategy derivation during hybrid quantization of a neural network model.
Corresponding to the quantization method of a neural network model provided in the first embodiment of the present application, a second embodiment of the present application provides a quantization apparatus of a neural network model.
As shown in fig. 4, the quantization apparatus of the neural network model includes:
a branch obtaining unit 401, configured to obtain a first operation branch and a second operation branch included in a target operation module of the neural network model to be quantized; the number of the operation layers of the first operation branch is greater than that of the second operation branch;
a first operation branch quantization unit 402 for quantizing the first operation branch to a first precision;
a second operation branch quantization unit 403 for quantizing the second operation branch to a second precision; the first precision is less than the second precision.
As an embodiment, the first operation branch is an operation branch comprising multiple operation layers; the second operation branch is a shortcut branch.
As an embodiment, the first precision is eight-bit integer (INT8) precision; the second precision is half-precision floating-point (FP16) precision.
As an embodiment, the branch obtaining unit is specifically configured to:
determining a target operation module of a neural network model to be quantized;
and according to the target operation module, obtaining a first operation branch and a second operation branch contained in the target operation module of the neural network model to be quantized.
As an embodiment, the branch obtaining unit is specifically configured to:
traversing the neural network model to be quantized;
judging whether the output of the current operation passes through two operation branches, if so, taking the output of the current operation as the initial position of a target operation module;
determining a merging position of a target operation module; and taking the operation from the initial position to the merging position of the target operation module as the target operation module.
As an embodiment, the branch obtaining unit is specifically configured to:
acquiring a preset merging position identifier;
and determining the merging position of the target operation module according to the merging position identifier.
As an embodiment, the apparatus further comprises:
and the other operation module quantization units are used for quantizing operation modules except the target operation module into second precision.
In the quantization apparatus of a neural network model provided by this application, the first branch with many operation layers in the target operation module of the neural network model to be quantized is quantized to low precision, and the second branch with few operation layers is directly quantized to high precision; no per-layer measurement of quantization sensitivity is needed, so the hybrid strategy for hybrid quantization is obtained efficiently. Moreover, when the first operation branch contains an operation layer that is sensitive to quantization, that layer need not be quantized to high precision; keeping only the operation layers in the second operation branch at high precision is sufficient, so accuracy is guaranteed while only one branch is kept at high precision. The apparatus thus solves the prior-art problems of inefficient and suboptimal hybrid-strategy derivation during hybrid quantization of a neural network model.
It should be noted that, for a detailed description of the quantization apparatus provided in the second embodiment of the present application, reference may be made to the related description of the first embodiment of the present application, and details are not repeated here.
A third embodiment of the present application provides an electronic device corresponding to the quantization method of a neural network model provided in the first embodiment of the present application.
The electronic device includes:
a processor; and
a memory for storing a program of a quantization method of a neural network model; after the device is powered on and the program of the quantization method of the neural network model is run by the processor, the device performs the following steps:
obtaining a first operation branch and a second operation branch contained in a target operation module of a neural network model to be quantized; the number of the operation layers of the first operation branch is greater than that of the second operation branch;
quantizing the first operational branch to a first precision;
quantizing the second operation branch to a second precision; the first precision is less than the second precision.
As an embodiment, the first operation branch is an operation branch comprising multiple operation layers; the second operation branch is a shortcut branch.
As an embodiment, the first precision is eight-bit integer (INT8) precision; the second precision is half-precision floating-point (FP16) precision.
As an embodiment, the obtaining a first operation branch and a second operation branch included in a target operation module of a neural network model to be quantized includes:
determining a target operation module of a neural network model to be quantized;
and according to the target operation module, obtaining a first operation branch and a second operation branch contained in the target operation module of the neural network model to be quantized.
As an embodiment, the determining a target operation module of the neural network model to be quantized includes:
traversing the neural network model to be quantized;
judging whether the output of the current operation passes through two operation branches, if so, taking the output of the current operation as the initial position of a target operation module;
determining a merging position of a target operation module; and taking the operation from the initial position to the merging position of the target operation module as the target operation module.
As an embodiment, the determining the merging position of the target operation module includes:
acquiring a preset merging position identifier;
and determining the merging position of the target operation module according to the merging position identifier.
As an embodiment, the electronic device further performs the following steps:
and quantizing operation modules except the target operation module into second precision.
After the electronic device provided by this application is powered on and the processor runs the program of the quantization method of the neural network model stored in the memory, the first branch with many operation layers in the target operation module of the neural network model to be quantized is quantized to low precision, and the second branch with few operation layers is directly quantized to high precision; no per-layer measurement of quantization sensitivity is needed. The device thus solves the prior-art problems of inefficient and suboptimal hybrid-strategy derivation during hybrid quantization of a neural network model.
It should be noted that, for the detailed description of the electronic device provided in the third embodiment of the present application, reference may be made to the related description of the first embodiment of the present application, and details are not repeated here.
Corresponding to the quantization method of a neural network model provided in the first embodiment of the present application, a fourth embodiment of the present application provides a storage medium storing a program of the quantization method, the program being executed by a processor to perform the steps of:
obtaining a first operation branch and a second operation branch contained in a target operation module of a neural network model to be quantized; the number of the operation layers of the first operation branch is greater than that of the second operation branch;
quantizing the first operational branch to a first precision;
quantizing the second operation branch to a second precision; the first precision is less than the second precision.
It should be noted that, for a detailed description of the storage medium provided in the fourth embodiment of the present application, reference may be made to the related description of the first embodiment of the present application, and details are not described here again.
Although the present application has been described with reference to preferred embodiments, these are not intended to limit the present application. Those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application; therefore, the scope of protection of the present application should be determined by the appended claims.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (10)

1. A method for quantizing a neural network model, comprising:
obtaining a first operation branch and a second operation branch contained in a target operation module of a neural network model to be quantized; the number of the operation layers of the first operation branch is greater than that of the second operation branch;
quantizing the first operational branch to a first precision;
quantizing the second operation branch to a second precision; the first precision is less than the second precision.
2. The method of claim 1, wherein the first operation branch is an operation branch comprising multiple operation layers, and the second operation branch is a shortcut branch.
3. The method of claim 1, wherein the first precision is eight-bit integer (INT8) precision and the second precision is half-precision floating-point (FP16) precision.
4. The method according to claim 1, wherein the obtaining a first operation branch and a second operation branch contained in a target operation module of a neural network model to be quantized comprises:
determining a target operation module of a neural network model to be quantized;
and according to the target operation module, obtaining a first operation branch and a second operation branch contained in the target operation module of the neural network model to be quantized.
5. The method of claim 4, wherein the determining a target operation module of the neural network model to be quantized comprises:
traversing the neural network model to be quantized;
judging whether the output of the current operation passes through two operation branches, if so, taking the output of the current operation as the initial position of a target operation module;
determining a merging position of a target operation module; and taking the operation from the initial position to the merging position of the target operation module as the target operation module.
6. The method of claim 5, wherein determining the merge location of the target computing module comprises:
acquiring a preset merging position identifier;
and determining the merging position of the target operation module according to the merging position identifier.
7. The method of claim 1, further comprising:
and quantizing operation modules except the target operation module in the neural network model into second precision.
8. An apparatus for quantizing a neural network model, comprising:
the branch obtaining unit is used for obtaining a first operation branch and a second operation branch contained in a target operation module of the neural network model to be quantized; the number of the operation layers of the first operation branch is greater than that of the second operation branch;
a first operation branch quantization unit for quantizing the first operation branch to a first precision;
a second operation branch quantization unit configured to quantize the second operation branch to a second precision; the first precision is less than the second precision.
9. An electronic device, comprising:
a processor; and
a memory for storing a program of a quantization method of a neural network model, the apparatus performing the following steps after being powered on and running the program of the quantization method of the neural network model by the processor:
obtaining a first operation branch and a second operation branch contained in a target operation module of a neural network model to be quantized; the number of the operation layers of the first operation branch is greater than that of the second operation branch;
quantizing the first operational branch to a first precision;
quantizing the second operation branch to a second precision; the first precision is less than the second precision.
10. A storage medium storing a program of a quantization method of a neural network model, the program being executed by a processor to perform the steps of:
obtaining a first operation branch and a second operation branch contained in a target operation module of a neural network model to be quantized; the number of the operation layers of the first operation branch is greater than that of the second operation branch;
quantizing the first operational branch to a first precision;
quantizing the second operation branch to a second precision; the first precision is less than the second precision.
CN202110581957.XA 2021-05-25 2021-05-25 Quantization method, apparatus and device for a neural network model Pending CN113326920A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110581957.XA CN113326920A (en) 2021-05-25 2021-05-25 Quantization method, apparatus and device for a neural network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110581957.XA CN113326920A (en) 2021-05-25 2021-05-25 Quantization method, apparatus and device for a neural network model

Publications (1)

Publication Number Publication Date
CN113326920A 2021-08-31

Family

ID=77421568

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110581957.XA Pending CN113326920A (en) Quantization method, apparatus and device for a neural network model

Country Status (1)

Country Link
CN (1) CN113326920A (en)

Similar Documents

Publication Publication Date Title
CN111652367B (en) Data processing method and related product
US11704553B2 (en) Neural network system for single processing common operation group of neural network models, application processor including the same, and operation method of neural network system
US20220067512A1 (en) Fine-grained per-vector scaling for neural network quantization
US11574239B2 (en) Outlier quantization for training and inference
JP7231731B2 (en) Adaptive quantization method and apparatus, device, medium
CN111126558B (en) Convolutional neural network calculation acceleration method and device, equipment and medium
US20160277756A1 (en) Method, apparatus and coder for selecting optimal reference frame in hevc coding
US10324644B2 (en) Memory side accelerator thread assignments
CN114528924B (en) Image classification model reasoning method, device, equipment and medium
US12045307B2 (en) Fine-grained per-vector scaling for neural network quantization
US20220261650A1 (en) Machine learning training in logarithmic number system
CN109542541A (en) Unserializing method and device
CN116702835A (en) Neural network reasoning acceleration method, target detection method, device and storage medium
CN110309877B (en) Feature map data quantization method and device, electronic equipment and storage medium
CN113326920A (en) Quantification method, device and equipment of neural network model
CN116957024A (en) Method and device for reasoning by using neural network model
CN112418388A (en) Method and device for realizing deep convolutional neural network processing
US20230325670A1 (en) Augmenting legacy neural networks for flexible inference
US20210312012A1 (en) Neural network device, method of operating the neural network device, and application processor including the neural network device
CN110377262B (en) Data storage method and device, storage medium and processor
CN111709996A (en) Method and device for detecting position of container
US20220398413A1 (en) Quantization method and device for neural network model, and computer-readable storage medium
CN110543549A (en) semantic equivalence judgment method and device
US20240119615A1 (en) Tracking three-dimensional geometric shapes
CN113361677B (en) Quantification method and device for neural network model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40069588

Country of ref document: HK

TA01 Transfer of patent application right

Effective date of registration: 20240304

Address after: # 03-06, Lazada One, 51 Bras Basah Road, Singapore

Applicant after: Alibaba Innovation Co.

Country or region after: Singapore

Address before: Room 01, 45th Floor, AXA Tower, 8 Shenton Way, Singapore

Applicant before: Alibaba Singapore Holdings Ltd.

Country or region before: Singapore