CN115496181A - Chip adaptation method, device, chip and medium of deep learning model


Info

Publication number: CN115496181A
Application number: CN202211097767.1A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: target, quantization, matrix, input, gate structure
Legal status: Pending
Inventors: 郭敬明, 张克俭, 田宏泽, 周晨君, 孙清阁, 梁维斌
Current Assignee: Beijing Suiyuan Intelligent Technology Co., Ltd.
Original Assignee: Beijing Suiyuan Intelligent Technology Co., Ltd.
Application filed by Beijing Suiyuan Intelligent Technology Co., Ltd.
Publication of CN115496181A


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods


Abstract

The invention discloses a chip adaptation method, apparatus, chip and medium for a deep learning model. The method comprises the following steps: loading a target model to be adapted, and verifying whether a chip adaptation condition is met according to the numerical relationship between the target precision of a target computing-power computing unit and the parameter precision of the target model; if so, loading each input vector set and hidden state vector data range, transferring them to a target memory, invoking each built-in calculation instruction, and executing: calculating, according to the target precision and the input weight matrix and recurrent weight matrix of each gate structure, the corresponding input weight quantization matrix, recurrent weight quantization matrix and quantization scale, and outputting the input weight quantization matrices, recurrent weight quantization matrices and quantization scales to an engine file to generate the adapted target model. The technical scheme of the embodiments of the invention reduces the complexity of quantization calculation and improves the efficiency of parameter quantization.

Description

Chip adaptation method, device, chip and medium of deep learning model
Technical Field
The embodiments of the invention relate to neural network technology, and in particular to a chip adaptation method, apparatus, chip and medium for a deep learning model, especially in chip adaptation scenarios for recurrent neural networks.
Background
Currently, recurrent neural networks are widely used in text recognition, speech recognition, and natural language processing. However, the gains in model accuracy come with a huge number of parameters and a large computational load. Model quantization can reduce memory bandwidth and storage occupation, lower power consumption, improve throughput, and reduce latency.
Existing parameter quantization methods for recurrent neural networks generally need to determine an iteration interval according to the amplitude of variation of the quantized data, and then adjust the quantization parameters in the neural network according to that interval.
When applied to on-chip training or fine-tuning of a recurrent neural network, this requires repeated iterations, making the process time-consuming and complex and imposing huge computing-power and time costs on the machine learning chip. How to improve the efficiency of parameter quantization and reduce storage consumption with little precision loss is therefore an urgent problem.
Disclosure of Invention
The embodiments of the invention provide a chip adaptation method, apparatus, chip and medium for a deep learning model, which reduce the amount of calculation in the model quantization process and improve parameter quantization efficiency.
In a first aspect, an embodiment of the present invention provides a chip adaptation method for a deep learning model, which is performed by a machine learning chip, and includes:
loading a target model to be adapted, wherein the target model comprises at least one recurrent neural network, the recurrent neural network comprises at least one network unit, the network unit comprises at least one gate structure, and the gate structure is provided with a matched input weight matrix and a matched recurrent weight matrix;
verifying whether a chip adaptation condition for the target model is met according to the numerical relationship between the target precision of a target computing-power computing unit in the machine learning chip and the parameter precision of the target model to be adapted;
if yes, loading an input vector set and a hidden state vector data range respectively matched with each recurrent neural network;
transferring each recurrent neural network, input vector set and hidden state vector data range to a target memory arranged close to the computing units in the machine learning chip, by way of at least one stage of memory transfer;
invoking each built-in calculation instruction in the machine learning chip through each computing unit according to the data in the target memory, and executing the following operations:
calculating an input vector quantization scale and a hidden state vector quantization scale according to the input vector set and hidden state vector data range matched with the recurrent neural network;
calculating the input weight quantization matrix, recurrent weight quantization matrix and quantization scale respectively corresponding to each gate structure according to the target precision, the input vector quantization scale, the hidden state vector quantization scale, and the input weight matrix and recurrent weight matrix of each gate structure;
and outputting the input weight quantization matrix, recurrent weight quantization matrix and quantization scale respectively corresponding to each gate structure to an engine file to generate the adapted target model.
In a second aspect, an embodiment of the present invention further provides a chip adaptation apparatus for a deep learning model, where the apparatus includes:
a target model loading module, used for loading a target model to be adapted, wherein the target model comprises at least one recurrent neural network, the recurrent neural network comprises at least one network unit, the network unit comprises at least one gate structure, and the gate structure has a matching input weight matrix and recurrent weight matrix;
an adaptation condition verification module, used for verifying whether the chip adaptation condition for the target model is met according to the numerical relationship between the target precision of the target computing-power computing unit in the machine learning chip and the parameter precision of the target model to be adapted;
a data loading module, used for loading, if the condition is met, the input vector set and hidden state vector data range respectively matched with each recurrent neural network;
a data transfer module, used for transferring each recurrent neural network, input vector set and hidden state vector data range to a target memory arranged close to the computing units in the machine learning chip, by way of at least one stage of memory transfer;
an instruction invocation module, used for invoking each built-in calculation instruction in the machine learning chip through each computing unit according to the data in the target memory, and executing the following operations:
a quantization scale calculation module, used for calculating an input vector quantization scale and a hidden state vector quantization scale according to the input vector set and hidden state vector data range matched with the recurrent neural network;
a quantization parameter calculation module, used for calculating the input weight quantization matrix, recurrent weight quantization matrix and quantization scale respectively corresponding to each gate structure according to the target precision, the input vector quantization scale, the hidden state vector quantization scale, and the input weight matrix and recurrent weight matrix of each gate structure;
and a target model adaptation module, used for outputting the input weight quantization matrix, recurrent weight quantization matrix and quantization scale respectively corresponding to each gate structure to an engine file to generate the adapted target model.
In a third aspect, an embodiment of the present invention further provides a machine learning chip, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the chip adaptation method of the deep learning model according to any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a storage medium storing computer-executable instructions; when executed by a processor, the program implements the chip adaptation method of the deep learning model according to any embodiment of the present invention.
In the technical scheme of the embodiments of the invention, a target model comprising at least one recurrent neural network is loaded, and whether the chip adaptation condition for the target model is met is verified according to the numerical relationship between the target precision of the target computing-power computing unit in the machine learning chip and the parameter precision of the target model to be adapted. If so, the input vector set and hidden state vector data range respectively matched with each recurrent neural network are loaded and, together with each recurrent neural network, transferred to a target memory arranged close to the computing units in the machine learning chip by way of at least one stage of memory transfer. Each computing unit then invokes the calculation instructions built into the machine learning chip according to the data in the target memory and executes the following: calculating the input vector quantization scale and hidden state vector quantization scale from the matched input vector set and hidden state vector data range; calculating the input weight quantization matrix, recurrent weight quantization matrix and quantization scale respectively corresponding to each gate structure from the target precision, the input vector quantization scale, the hidden state vector quantization scale, and the input weight matrix and recurrent weight matrix of each gate structure; and outputting these to an engine file to generate the adapted target model. For the case where the recurrent neural network loaded in a machine learning chip does not match the chip's computing power, this provides a new on-chip quantization mode for model parameters, solving the prior-art problems of a long, time-consuming process, high computational complexity, and huge computing-power and time costs imposed on the machine learning chip. It ensures that the quantized model is adapted to the chip's computing power so that the chip's maximum computing performance is exerted, greatly reduces the complexity and time consumption of on-chip quantization calculation with little precision loss, and improves the efficiency of parameter quantization in the recurrent neural network. In model inference scenarios with high real-time requirements, computation time can be minimized while accuracy is guaranteed.
In particular, the whole parameter quantization and inference method can be implemented in hardware: with 2-dimensional and 1-dimensional computing capabilities configured to match the quantized precision, and support for dedicated 1-dimensional calculation instructions such as round and clip, inference capability is greatly improved and on-chip storage occupation is greatly reduced.
Drawings
FIG. 1a is a flowchart of a chip adaptation method for a deep learning model according to a first embodiment of the present invention;
FIG. 1b is a schematic diagram of a network unit in a long short-term memory network according to the first embodiment of the present invention;
FIG. 2a is a flowchart of a chip adaptation method of a deep learning model according to a second embodiment of the present invention;
FIG. 2b is a schematic diagram of a parameter quantization process of a recurrent neural network according to a second embodiment of the present invention;
FIG. 3a is a flowchart of a chip adaptation method of a deep learning model according to a third embodiment of the present invention;
FIG. 3b is a diagram of a language translation model according to a third embodiment of the present invention;
FIG. 3c is a diagram of an LSTM network unit iterative computation in the third embodiment of the present invention;
FIG. 3d is a schematic diagram of an inference flow in an adapted language translation model according to a third embodiment of the present invention;
FIG. 3e is a schematic structural diagram of a quantization calculation of an adapted language translation model according to a third embodiment of the present invention;
FIG. 3f is a flowchart of an inference method in an adapted language translation model according to a third embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a chip adaptation apparatus for deep learning model according to a fourth embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a machine learning chip in the fifth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1a is a flowchart of a chip adaptation method for a deep learning model according to a first embodiment of the present invention. The embodiment is applicable to quantizing model parameters in a recurrent neural network. The method may be performed by a chip adaptation apparatus for a deep learning model, where the apparatus may be implemented by software and/or hardware (firmware) and may generally be integrated in a tool providing a recurrent neural network parameter quantization service, for example in a machine learning chip. The method specifically includes the following steps:
s110, loading a target model to be adapted, wherein the target model comprises at least one recurrent neural network, the recurrent neural network comprises at least one network unit, the network unit comprises at least one gate structure, and the gate structure is provided with a matched input weight matrix and a matched recurrent weight matrix.
The target model may be a model that needs to be loaded and computed in a machine learning chip after the model training process is completed. A recurrent neural network may refer to a class of neural networks that take sequence data as input, recurse along the evolution direction of the sequence, and whose nodes (network units) are connected in a chain. Common recurrent neural networks mainly include the Bidirectional Recurrent Neural Network (Bi-RNN), the Gated Recurrent Unit (GRU), and the Long Short-Term Memory network (LSTM).
In a specific example, a recurrent neural network may include a plurality of chain-connected network units, each network unit including a plurality of gate structures, each gate structure having a matching input weight matrix and recurrent weight matrix.
Taking the LSTM network as an example, an LSTM network may generally include one or more LSTM network units. Fig. 1b is a schematic structural diagram of an LSTM network unit, which mainly includes four gate structures: an input gate, an output gate, a forget gate and a cell gate.
The input weight matrix may be a matrix of values that scale the input entering the gate structure while the gate structure generates its output. The recurrent weight matrix may refer to a matrix of values that scale the corresponding components of the recurrent (hidden state) vector.
In this embodiment, the recurrent neural network to be quantized specifically refers to a network obtained by model training on a training sample set and used to fulfil a set data processing function. All model parameters in the recurrent neural network meet the precision requirement of model training. Correspondingly, the data precision of each matrix element in the input weight matrix and recurrent weight matrix of each gate structure of each network unit in the recurrent neural network meets the precision requirement of model training. For example, the data precision of each matrix element may be fp64 (double precision), fp32 (single precision), fp16 (half precision), etc.
S120, verifying whether the chip adaptation condition of the target model is met or not according to the numerical relation between the target precision of the target computing power computing unit in the machine learning chip and the parameter precision of the target model to be adapted.
The target computing-power computing unit may refer to a computing unit of a set computing power in the machine learning chip. The target precision may refer to the operation precision corresponding to the target computing-power computing unit, for example double precision (64-bit, FP64), single precision (32-bit, FP32), half precision (16-bit, FP16), or integer (INT8 or INT4). Generally, the more bits, the higher the precision, the more complex the operations that can be supported, and the wider the range of adapted application scenarios.
Parameter precision may refer to the numerical precision of each model parameter in the target model on completing model training. Similarly, the parameter precision may be double precision (64-bit, FP64), single precision (32-bit, FP32), half precision (16-bit, FP16), integer (INT8 or INT4), etc.
Generally, a machine learning chip includes a plurality of computing units with different accuracies, for example, a double-accuracy computing power (64 bits, FP 64), a single-accuracy computing power (32 bits, FP 32), a half-accuracy computing power (16 bits, FP 16), and an integer computing power (INT 8 or INT 4).
In actual use, it must be decided, according to the actual application scenario of the target model, to which precision of computing unit in the machine learning chip the target model should be adapted. In a specific example, when the target model is applied in a language translation scenario, it has high real-time requirements, so the computing unit with the highest computing power (i.e., the target computing-power computing unit) in the machine learning chip can be adapted to the target model to guarantee real-time computation to the maximum extent. Accordingly, in the present embodiment, the target computing-power computing unit may be the computing unit of the highest computing power, and the target precision may be the precision of that unit.
Furthermore, it is necessary first to verify whether parameter quantization is required for the target model and, when it is, to synchronously calculate the corresponding quantization scale, e.g., quantizing from FP32 to INT8.
The chip adaptation condition may refer to a precision adaptation requirement between the target model and the machine learning chip. For example, it may be determined that the chip adaptation condition for the target model is satisfied when the parameter precision of the target model to be adapted is higher than the target precision. In that case, the target model needs to be quantized in the machine learning chip.
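A minimal sketch of this verification, assuming precisions are compared by bit width (the names and the bit-width table are illustrative, not part of the patent):

```python
PRECISION_BITS = {"fp64": 64, "fp32": 32, "fp16": 16, "int8": 8, "int4": 4}

def chip_adaptation_required(param_precision: str, target_precision: str) -> bool:
    """The chip adaptation condition holds when the model's parameter
    precision is higher (wider) than the target computing unit's precision."""
    return PRECISION_BITS[param_precision] > PRECISION_BITS[target_precision]

# e.g. chip_adaptation_required("fp32", "int8") -> True: quantize on chip
```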
It should be noted again that when the chip adaptation condition of the target model is determined to be satisfied, not all model parameters of the target model are quantized to the target precision. Rather, fully weighing accuracy against timeliness, the inventors select only the input weight matrix and recurrent weight matrix of each gate structure in each recurrent neural network for quantization, while the remaining model parameters keep the precision obtained from training the target model.
And S130, if so, loading an input vector set and a hidden state vector data range which are respectively matched with each recurrent neural network.
Wherein the set of input vectors may refer to a set of calibration data that matches the recurrent neural network. Specifically, the set of input vectors may be a subset of a set of training samples used for training each recurrent neural network in the target model.
The hidden state vector data range may be determined according to the type of the recurrent neural network; for example, if the recurrent neural network is an LSTM, the hidden state vector data range should be (-1, 1), since the range of the (tanh) activation function is (-1, 1). A quantization scale may refer to the standard scale required to map values from a larger value set to a smaller one. The input vector quantization scale may refer to the quantization scale required when quantizing an input vector to a predetermined set. The hidden state vector quantization scale may refer to the quantization scale required to quantize the hidden state vector to a predetermined set.
S140, transferring each recurrent neural network, input vector set and hidden state vector data range to a target memory arranged close to the computing units in the machine learning chip, by way of at least one stage of memory transfer.
The target memory may refer to a preset memory block closest to the physical distance of the computing unit.
Memory transfer may refer to moving data from lower-level memory to upper-level memory. Generally, the farther a memory is from the target memory, the larger its capacity but the slower the transfer speed.
S150, invoking, through each computing unit and according to the data in the target memory, the calculation instructions built into the machine learning chip to execute the subsequent operation steps (steps S160 to S180).
The calculation instruction may refer to a preset instruction for performing a quantitative calculation on the target model. Illustratively, instructions to perform steps S160 to S180 may be included.
And S160, calculating to obtain an input vector quantization scale and a hidden state vector quantization scale according to the input vector set and the hidden state vector data range matched with the recurrent neural network.
In an alternative embodiment, the calculating an input vector quantization scale and a hidden state vector quantization scale according to the input vector set and the hidden state vector data range matched with the recurrent neural network may include:
calculating to obtain an input vector value range corresponding to the input vector set according to the value domain distribution of the input vector set, and determining an input vector quantization threshold according to the input vector value range; calculating to obtain an input vector quantization scale according to the input vector quantization threshold, a preset quantization range and a quantization mode; and determining a hidden state vector quantization threshold according to the hidden state vector data range, and calculating to obtain a hidden state vector quantization scale according to the hidden state vector quantization threshold, a preset quantization range and a quantization mode.
The value domain distribution of the input vector set may refer to the value distribution information of each tensor in the input vector set. Illustratively, the quantization threshold of the input vector set may be calculated by the max-min method, relative entropy (KL divergence), the percentile method, etc.
The preset quantization range may refer to a preset quantization range matched with the target precision of the target computing-power computing unit in the machine learning chip. For example, when the target model is applied in a language translation scenario, its high real-time requirements mean the computing unit with the highest computing power in the machine learning chip may be fully utilized and adapted to the target model to guarantee real-time computation to the greatest extent. If the machine learning chip includes fp32, int8 and int32 computing power, the target precision of the target computing-power computing unit can be set to the highest computing power, int8; in this case the quantization range can be (0, 255) or (-127, 127), which is not limited in this embodiment. The quantization mode may refer to the mode used when quantizing the input weight matrix and recurrent weight matrix, such as symmetric or asymmetric quantization; in the embodiments of the present invention, symmetric quantization is preferred.
In a specific example, the input vector value range can be calculated from the value domain distribution of the input vector set, so that the input vector quantization threshold T_x can be determined as the larger of the positive maximum and the absolute value of the negative minimum. If the preset quantization range is (-127, 127), the input vector quantization scale is

S_x = T_x / 127

Similarly, if the hidden state vector data range is (-1, 1), the hidden state vector quantization threshold is 1, and with a preset quantization range of (-127, 127) the hidden state vector quantization scale is

S_h = 1 / 127

Thus the input vector quantization scale S_x is calculated from the input vector quantization threshold, the preset quantization range and the quantization mode, and the hidden state vector quantization scale S_h from the hidden state vector quantization threshold, the preset quantization range and the quantization mode, providing an effective data base for the subsequent gate-by-gate quantization.
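A minimal NumPy sketch of this scale computation, assuming the max-min calibration mode and the symmetric (-127, 127) range (all names are illustrative):

```python
import numpy as np

def input_vector_scale(calib_vectors, qmax=127.0):
    """Input vector quantization scale S_x via the max-min mode:
    the threshold T_x is the largest absolute value in the calibration set."""
    t_x = max(np.abs(v).max() for v in calib_vectors)
    return t_x / qmax

def hidden_state_scale(qmax=127.0):
    """For an LSTM the hidden state range is (-1, 1), so the threshold is 1."""
    return 1.0 / qmax

# Example: a calibration subset of the training samples (fp32)
calib = [np.random.randn(20, 64).astype(np.float32) for _ in range(8)]
s_x = input_vector_scale(calib)   # S_x
s_h = hidden_state_scale()        # S_h = 1/127
```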
S170, calculating the input weight quantization matrix, recurrent weight quantization matrix and quantization scale respectively corresponding to each gate structure according to the target precision, the input vector quantization scale, the hidden state vector quantization scale, and the input weight matrix and recurrent weight matrix of each gate structure.
Here, the input weight quantization matrix may refer to the quantized input weight matrix, and the recurrent weight quantization matrix to the quantized recurrent weight matrix. That is, the data precision of each matrix element in the input weight quantization matrix and the recurrent weight quantization matrix is the precision obtained by quantizing the matrix elements as a whole; for example, it may be int16, int8 or int4, where int denotes integer.
Because the recurrent neural network comprises a plurality of gate structures, the input weight matrix and recurrent weight matrix of each gate structure need to be quantized into a matching input weight quantization matrix and recurrent weight quantization matrix. In this embodiment, only one gate structure is taken as an example to describe the quantization process; the quantization of the other gate structures is exactly the same.
Accordingly, in an optional embodiment, calculating the input weight quantization matrix, recurrent weight quantization matrix and quantization scale respectively corresponding to each gate structure according to the target precision, the input vector quantization scale, the hidden state vector quantization scale, and the input weight matrix and recurrent weight matrix of each gate structure may include:
acquiring the target input weight matrix and target recurrent weight matrix corresponding to the currently processed target gate structure; splicing a first result, obtained by multiplying the target input weight matrix by the input vector quantization scale, with a second result, obtained by multiplying the target recurrent weight matrix by the hidden state vector quantization scale, to obtain a target splicing matrix; calculating the target quantization scale corresponding to the target gate structure according to the target splicing matrix; and quantizing the target splicing matrix according to the target precision and the target quantization scale corresponding to the target gate structure, to obtain the target input weight quantization matrix and target recurrent weight quantization matrix corresponding to the target gate structure.
The target gate structure may refer to the gate structure currently being processed in the recurrent neural network, for example any one of the input gate i, output gate o, forget gate f or cell gate c in a network unit. The target input weight matrix may refer to the input weight matrix W corresponding to the target gate structure; the target recurrent weight matrix may refer to the recurrent weight matrix R corresponding to the target gate structure.
The first result may be the result of multiplying the target input weight matrix by the input vector quantization scale, and the second result the result of multiplying the target recurrent weight matrix by the hidden state vector quantization scale. Illustratively, taking the input gate i as an example, with input weight matrix W_i and recurrent weight matrix R_i:

A_1 = W_i * S_x, A_2 = R_i * S_h

The target splicing matrix may refer to the new matrix obtained by splicing the first result and the second result:

W_i_concat_fp32 = concat(A_1, A_2) = concat(W_i * S_x, R_i * S_h)

Further, the target quantization scale corresponding to the target gate structure is obtained from the quantization threshold of the target splicing matrix, the preset quantization range and the quantization mode. Then, according to this target quantization scale and the target precision of the target computing-power computing unit in the machine learning chip, the target splicing matrix is quantized to obtain the target input weight quantization matrix and target recurrent weight quantization matrix corresponding to the target gate structure.
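A compact sketch of this gate-wise procedure, continuing the NumPy names above (concatenation along the column axis and the int8 output type are assumptions of this sketch, detailed in the second embodiment below):

```python
def gate_quantization(w_gate, r_gate, s_x, s_h, qmax=127.0):
    """Quantize one gate: splice the scaled weight matrices, derive the
    gate's target quantization scale, then quantize and de-splice."""
    concat_fp32 = np.concatenate([w_gate * s_x, r_gate * s_h], axis=1)
    s_w = np.abs(concat_fp32).max() / qmax             # target quantization scale
    q = np.clip(np.rint(concat_fp32 / s_w), -qmax, qmax).astype(np.int8)
    w_q, r_q = np.split(q, [w_gate.shape[1]], axis=1)  # de-splicing
    return w_q, r_q, s_w
```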
On the basis of the above embodiments, the embodiment of the present invention may further use dedicated calculation instructions to calculate the input vector quantization scale, the hidden state vector quantization scale, and the input weight quantization matrix and recurrent weight quantization matrix respectively corresponding to each gate structure; such a dedicated calculation instruction is obtained by encapsulating calculation logic built from any of the max-min, relative entropy (KL) divergence and percentile algorithms.
S180, outputting the input weight quantization matrix, recurrent weight quantization matrix and quantization scale respectively corresponding to each gate structure to an engine file to generate the adapted target model.
Specifically, the input weight quantization matrix, recurrent weight quantization matrix and quantization scale of the input gate i, output gate o, forget gate f and cell gate c in each network unit are obtained by the above method; the input weight quantization matrices and recurrent weight quantization matrices are then stored into the corresponding parameters of the machine learning chip (typically, a tensor processor), and the input weight quantization matrices, recurrent weight quantization matrices and quantization scales are output to an Engine file to generate the adapted target model.
The technical scheme of this embodiment thus loads a target model comprising at least one recurrent neural network, verifies the chip adaptation condition from the numerical relationship between the target precision of the target computing-power computing unit and the parameter precision of the target model, transfers the matched input vector sets and hidden state vector data ranges together with each recurrent neural network to the target memory close to the computing units, and, through the built-in calculation instructions, computes the quantization scales and the gate-wise quantization matrices before outputting them to an engine file. When the recurrent neural network loaded in a machine learning chip does not match the chip's computing power, this new on-chip quantization mode for model parameters solves the prior-art problems of a long, time-consuming process, high computational complexity, and huge computing-power and time costs; it ensures that the quantized model is adapted to the chip's computing power so that the chip's maximum computing performance is exerted, greatly reduces the complexity and time consumption of on-chip quantization calculation with little precision loss, and improves the efficiency of parameter quantization in the recurrent neural network. In particular, the whole parameter quantization and inference method can be implemented in hardware: with 2-dimensional and 1-dimensional computing capabilities configured to match the quantized precision, and support for dedicated 1-dimensional calculation instructions such as round and clip, inference capability is greatly improved and on-chip storage occupation is greatly reduced.
Example two
Fig. 2a is a flowchart of a chip adaptation method of a deep learning model according to a second embodiment of the present invention. This embodiment refines the above embodiment: calculating the target quantization scale corresponding to the target gate structure according to the target splicing matrix is refined as: obtaining the maximum matrix element value in the target splicing matrix after taking the absolute value of each matrix element in the target splicing matrix; and calculating the target quantization scale corresponding to the target gate structure according to the maximum matrix element value, the preset quantization range and the quantization mode.
Correspondingly, quantizing the target splicing matrix according to the target precision in the machine learning chip and the target quantization scale corresponding to the target gate structure, to obtain the target input weight quantization matrix and target recurrent weight quantization matrix corresponding to the target gate structure, is refined as the steps of: dividing the target splicing matrix by the target quantization scale to obtain an intermediate result matrix; rounding and truncating the intermediate result matrix according to the target precision to obtain a target quantized splicing matrix; and de-splicing the target quantized splicing matrix to obtain the target input weight quantization matrix and the target recurrent weight quantization matrix.
As shown in fig. 2a, the method comprises the following specific steps:
s210, loading a target model to be adapted, wherein the target model comprises at least one recurrent neural network, the recurrent neural network comprises at least one network unit, the network unit comprises at least one gate structure, and the gate structure is provided with a matched input weight matrix and a matched recurrent weight matrix.
Specifically, the recurrent neural network to be quantized is obtained, and accordingly the types of gate structures in each network unit of the recurrent neural network can be obtained. Each gate structure contains a corresponding input weight matrix and recurrent weight matrix.
S220, verifying whether the chip adaptation condition of the target model is met or not according to the numerical relation between the target precision of the target computing power computing unit in the machine learning chip and the parameter precision of the target model to be adapted.
And S230, if so, loading an input vector set and a hidden state vector data range which are respectively matched with each recurrent neural network.
S240, transferring each recurrent neural network, input vector set and hidden state vector data range to a target memory arranged close to the computing units in the machine learning chip, by way of at least one stage of memory transfer.
S250, invoking, through each computing unit and according to the data in the target memory, the calculation instructions built into the machine learning chip to execute the following S260 to S2160.
S260, calculating to obtain an input vector value range corresponding to the input vector set according to the value domain distribution of the input vector set, and determining an input vector quantization threshold according to the input vector value range.
And S270, calculating to obtain an input vector quantization scale according to the input vector quantization threshold, a preset quantization range and a quantization mode.
S280, determining a hidden state vector quantization threshold according to the hidden state vector data range, and calculating to obtain a hidden state vector quantization scale according to the hidden state vector quantization threshold, a preset quantization range and a quantization mode.
S290, acquiring the target input weight matrix and target recurrent weight matrix corresponding to the currently processed target gate structure.
S2100, splicing a first result, obtained by multiplying the target input weight matrix by the input vector quantization scale, with a second result, obtained by multiplying the target recurrent weight matrix by the hidden state vector quantization scale, to obtain a target splicing matrix.
S2110, obtaining the maximum matrix element value in the target splicing matrix after taking the absolute value of each matrix element in the target splicing matrix.
It should be noted that, in the embodiment of the present invention, each value in the input vector set is of type float32 (i.e., fp32); to improve the efficiency of parameter quantization, each value in the input vector set is expected to be quantized to the target precision of the target computing-power computing unit in the machine learning chip, which may for example be int8.
Taking the input gate i as an example, the maximum matrix element value can be calculated as

max(fabs(W_i_concat_fp32))

where W_i_concat_fp32 is the target splicing matrix and fabs takes the element-wise absolute value.
And S2120, calculating a target quantization scale corresponding to the target gate structure according to the maximum matrix element value, a preset quantization range and a quantization mode.
Specifically, the preset quantization range is determined by the target precision of the target computing power computing unit in the machine learning chip, and when the target precision of the target computing power computing unit in the machine learning chip is int8, the preset quantization range is (-127, 127).
Specifically, taking the input gate i as an example, the maximum matrix element value is max(fabs(W_i_concat_fp32)), the preset quantization range is (-127, 127), and the quantization method is symmetric quantization. The target quantization scale is therefore

s_wi = max(fabs(W_i_concat_fp32)) / 127
And S2130, dividing the target splicing matrix by the target quantization scale to obtain an intermediate result matrix.
The intermediate result matrix may be the matrix obtained by dividing the target splicing matrix by the target quantization scale. Specifically, taking the input gate i as an example, the intermediate result matrix is

W_i_concat_fp32 / s_wi
S2140, rounding and truncating the intermediate result matrix according to the target precision to obtain a target quantization splicing matrix.
Rounding may refer to the operation of rounding each matrix element in the intermediate result matrix so that only an integer is retained. Illustratively, the rounding operation may be performed by rounding half to even (round-to-even): when a matrix element is exactly of the form X.5, it is rounded to the nearest even number, e.g., 4.5 rounds to the even number 4 and 5.5 rounds to the even number 6; when a matrix element is not exactly X.5, ordinary rounding applies, e.g., 4.4 rounds to 4 and 5.6 rounds to 6.
Truncation may refer to the operation of clipping each matrix element of the rounded intermediate result matrix to a set range, for example using the minimum and maximum of the quantization range as the standard. Taking int8 as the target precision in the machine learning chip, values within (-127, 127) are retained for subsequent use: matrix elements that fall within (-127, 127) after rounding are kept as they are, while those that fall outside are truncated to -127 or 127. The target quantized splicing matrix may refer to the matrix obtained by rounding and truncating the intermediate result matrix.
It should be noted that, in the embodiment of the present invention, the rounding and truncation of the intermediate result matrix are preferably performed according to the target precision of the target computing-power computing unit in the machine learning chip.
Specifically, taking the input gate i as an example, the intermediate result matrix is rounded by

round(W_i_concat_fp32 / s_wi)

Further, the target quantized splicing matrix is

W_i_concat_int8 = clip(round(W_i_concat_fp32 / s_wi), -127, 127)
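Expressed with NumPy's native round-half-to-even (np.rint) and clip, as a stand-in for the chip's dedicated round and clip instructions mentioned above (continuing the earlier sketches):

```python
def round_and_truncate(m, qmax=127.0):
    """Round half to even, then truncate to the preset quantization range."""
    return np.clip(np.rint(m), -qmax, qmax).astype(np.int8)

# round_and_truncate(np.array([4.5, 5.5, 130.2, -140.0]))
# -> array([   4,    6,  127, -127], dtype=int8)
```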
S2150, de-splicing the target quantized splicing matrix to obtain the target input weight quantization matrix and the target recurrent weight quantization matrix.
The de-splicing process may refer to the inverse of the construction of the target splicing matrix, recovering the target input weight quantization matrix and target recurrent weight quantization matrix corresponding to the currently processed target gate structure.
S2160, outputting the target input weight quantization matrix, target recurrent weight quantization matrix and target quantization scale respectively corresponding to each gate structure to an engine file to generate the adapted target model.
Specifically, based on the above method, the input gate i, output gate o, forget gate f and cell gate c of each network unit are processed respectively to obtain a scale matrix [s_wi, s_wo, s_wf, s_wc] composed of the target quantization scales. Note that if the recurrent neural network is a bidirectional LSTM, the scale matrix includes both forward and backward scale matrices. Further, the scale matrix [s_wi, s_wo, s_wf, s_wc] is stored into the tensor processor of the LSTM; the quantized Wi_int8, Wo_int8, Wf_int8 and Wc_int8 are stored into W_int8 of the tensor processor; the quantized Ri_int8, Ro_int8, Rf_int8 and Rc_int8 are stored into R_int8 of the tensor processor. Finally, the stored W_int8, R_int8 and [s_wi, s_wo, s_wf, s_wc] are output to an Engine file, completing the generation of the adapted target model corresponding to the recurrent neural network.
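Putting the per-gate routine together over all four gates (a sketch with illustrative names; gate_quantization is the function sketched in the first embodiment):

```python
def quantize_lstm_unit(weights, recurrents, s_x, s_h):
    """weights / recurrents: dicts of fp32 matrices keyed by gate name
    (i, o, f, c) for one LSTM network unit."""
    w_int8, r_int8, scales = {}, {}, {}
    for gate in ("i", "o", "f", "c"):
        w_q, r_q, s_w = gate_quantization(weights[gate], recurrents[gate],
                                          s_x, s_h)
        w_int8[gate], r_int8[gate], scales[gate] = w_q, r_q, s_w
    # W_int8, R_int8 and the scale matrix [s_wi, s_wo, s_wf, s_wc]
    # would then be written to the Engine file.
    return w_int8, r_int8, [scales[g] for g in ("i", "o", "f", "c")]
```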
According to the technical scheme of this embodiment, a target model comprising at least one recurrent neural network is loaded, and whether the chip adaptation condition for the target model is met is verified from the numerical relationship between the target precision of the target computing-power computing unit in the machine learning chip and the parameter precision of the target model to be adapted. If so, the input vector set and hidden state vector data range respectively matched with each recurrent neural network are loaded and transferred, together with each recurrent neural network, to a target memory arranged close to the computing units in the machine learning chip by way of at least one stage of memory transfer. Each computing unit then invokes the chip's built-in calculation instructions according to the data in the target memory and executes the following: the input vector value range corresponding to the input vector set is calculated from its value domain distribution, and the input vector quantization threshold is determined from the value range; the input vector quantization scale is calculated from that threshold, the preset quantization range and the quantization mode; the hidden state vector quantization threshold is determined from the hidden state vector data range, and the hidden state vector quantization scale is calculated from that threshold, the preset quantization range and the quantization mode. Further, the first result, obtained by multiplying the target input weight matrix of the recurrent neural network by the input vector quantization scale, is spliced with the second result, obtained by multiplying the target recurrent weight matrix by the hidden state vector quantization scale, into the target splicing matrix; after taking the absolute value of each matrix element, the maximum matrix element value of the target splicing matrix is obtained; the target quantization scale corresponding to the target gate structure is calculated from the maximum matrix element value, the preset quantization range and the quantization mode; the target splicing matrix is divided by the target quantization scale to obtain the intermediate result matrix, which is rounded and truncated according to the target precision to obtain the target quantized splicing matrix. Finally, the target quantized splicing matrix is de-spliced into the target input weight quantization matrix and target recurrent weight quantization matrix, and the target input weight quantization matrix, target recurrent weight quantization matrix and target quantization scale respectively corresponding to each gate structure are output to an engine file to generate the adapted target model.
Fig. 2b is a schematic diagram of the parameter quantization process of a recurrent neural network according to an embodiment of the present invention. Specifically, the trained Open Neural Network Exchange (ONNX) model is first loaded and the corresponding input vector set obtained; the input vector value range and input vector quantization threshold corresponding to the input vector set are determined from its value domain distribution, and the input vector quantization scale is calculated. Further, the hidden state vector quantization scale is set to 1/127 according to the hidden state vector quantization threshold and the preset quantization range and quantization mode matched with the target precision of the target computing-power computing unit in the machine learning chip. Then, the target input weight matrix and target recurrent weight matrix corresponding to each target gate structure in the ONNX model are obtained and spliced into the target splicing matrix; after taking the absolute value of each matrix element, the maximum matrix element value of the target splicing matrix is obtained and the target quantization scale corresponding to the target gate structure is calculated; the target splicing matrix is divided by the target quantization scale and then rounded and truncated to obtain the target quantized splicing matrix; the target quantized splicing matrix is de-spliced into the target input weight quantization matrix and target recurrent weight quantization matrix, and this is repeated until all gate structures have been processed. After all gate structures are processed, all quantized target input weight quantization matrices Wi_int8, Wo_int8, Wf_int8 and Wc_int8 are stored into W_int8 in the tensor processor, and all quantized target recurrent weight quantization matrices Ri_int8, Ro_int8, Rf_int8 and Rc_int8 into R_int8 in the tensor processor. Finally, the stored W_int8, R_int8 and scale matrix [s_wi, s_wo, s_wf, s_wc] are output to an Engine file, completing the generation of the adapted target model corresponding to the recurrent neural network. In this way, the input weight matrix and recurrent weight matrix of the recurrent neural network are quantized independently for the input gate, output gate, forget gate and cell gate, and the hidden state vector is quantized symmetrically via its data range, so that the gate structure matrix multiplication is converted from floating point to int8 and the memory footprint is reduced to 1/4, improving on-chip parameter quantization efficiency with little precision loss and reducing on-chip storage consumption.
EXAMPLE III
Fig. 3a is a flowchart of a chip adaptation method of a deep learning model according to a third embodiment of the present invention, extended on the basis of the above embodiments. The operations after outputting the input weight quantization matrix, recurrent weight quantization matrix and quantization scale respectively corresponding to each gate structure to an engine file to generate the adapted target model are embodied as: loading a source language sequence and the adapted language translation model in the engine file; transferring the source language sequence and the adapted language translation model to the target memory by way of at least one stage of memory transfer; invoking each built-in calculation instruction in the machine learning chip through each computing unit according to the data in the target memory, and executing the following operations: converting the source language sequence into a quantized input sequence according to the target precision; inputting the quantized input sequence into the adapted language translation model; and, through each LSTM network unit in the adapted language translation model, iteratively computing stage by stage according to the quantized sequence value at the matching time in the quantized input sequence, the hidden state quantized value at the previous time, and the quantization parameters of each gate structure in the LSTM network unit, obtaining the calculation result output by each LSTM network unit so as to finally output the target translation sequence.
As shown in fig. 3a, the method specifically includes the following steps:
s310, loading a target model to be adapted, wherein the target model comprises at least one recurrent neural network, the recurrent neural network comprises at least one network unit, the network unit comprises at least one gate structure, and the gate structure is provided with an input weight matrix and a recurrent weight matrix.
In an alternative embodiment, the target model is a language translation model, and fig. 3b is a schematic diagram of the language translation model. The language translation model specifically comprises: the LSTM coding layer and the LSTM network decoding layer respectively comprise a plurality of LSTM network units which are connected in sequence, each LSTM network unit comprises four gate structures, and each gate structure respectively comprises an input gate, an output gate, a forgetting gate and a cell gate. Finally, the data in the target memory (GPU 8 in the figure) in the LSTM network decoding layer is processed by using the softmax function, and the final output target translation voice sequence (y) is obtained 1 ,y 2 Etc.).
S320, verifying whether the chip adaptation condition of the target model is met according to the numerical relationship between the target precision of the target computing power computing unit in the machine learning chip and the parameter precision of the target model to be adapted.
S330, if so, loading the input vector set and hidden state vector data range respectively matched with each recurrent neural network.
S340, carrying each recurrent neural network, input vector set and hidden state vector data range, by way of at least one stage of memory carrying, to a target memory arranged in the machine learning chip close to the computing units.
S350, calling each calculation instruction built into the machine learning chip through each computing unit according to the data in the target memory, and executing the following operations.
S360, calculating the input vector quantization scale and the hidden state vector quantization scale according to the input vector set and hidden state vector data range matched with the recurrent neural network.
S370, calculating the input weight quantization matrix, cyclic weight quantization matrix and quantization scale respectively corresponding to each gate structure according to the target precision, the input vector quantization scale, the hidden state vector quantization scale, and the input weight matrix and cyclic weight matrix of each gate structure.
S380, outputting the input weight quantization matrix, cyclic weight quantization matrix and quantization scale respectively corresponding to each gate structure to an engine file to generate the adapted target model.
S390, loading the source language sequence and the adapted language translation model in the engine file.
The source language sequence may refer to a language sequence to be translated, which is input into the adapted language translation model. The data precision of the source language sequence matches the data precision of each model parameter in the language translation model before adaptation, for example, the sequence may be in fp32 data format.
S3100, carrying the source language sequence and the adapted language translation model to the target memory by way of at least one stage of memory carrying.
S3110, calling each calculation instruction built in the machine learning chip through each calculation unit according to the data in the target memory.
S3120, converting the source language sequence into a quantized input sequence according to the target precision.
The quantized input sequence may refer to an input sequence that meets the data precision requirement of each model parameter in the adapted language translation model, and in an exemplary embodiment of the present invention, the data format of the quantized input sequence may be int8. Specifically, the quantized input sequence may be obtained by dividing the source language sequence by the input vector quantization scale.
It should be noted that, if the adapted language translation model is an independent and complete computing network required for implementing a computing process, the source language sequence is the most original data sequence provided by the user; if the adapted language translation model is part of a large computing network (intermediate network) required to implement a computing process, then the source language sequences are the output of a previous network (or computing node) in the large computing network adjacent to the adapted language translation model.
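As a minimal sketch of the conversion in S3120, assuming symmetric int8 quantization and using illustrative names, the source sequence could be quantized as follows:

```python
import numpy as np

def quantize_source_sequence(x_fp32, s_x, qmax=127):
    # Divide by the input vector quantization scale, then round and
    # truncate into the int8 range matched with the target precision.
    return np.clip(np.round(np.asarray(x_fp32) / s_x),
                   -qmax - 1, qmax).astype(np.int8)
```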
S3130, inputting the quantized input sequence into the adapted language translation model.
S3140, performing step-by-step iterative calculation through each LSTM network unit in the adapted language translation model according to the quantized sequence value at the matching time in the quantized input sequence, the hidden state quantized value at the time immediately before the matching time, and the quantization parameter of each gate structure in the LSTM network unit, to obtain the calculation result output by each LSTM network unit and finally output the target translated speech sequence.
The quantized sequence value at the matching time may refer to a quantized input sequence value corresponding to the current time in the quantized input sequence. The hidden-state quantized value at the time immediately before the matching time may refer to a quantized value of a hidden state generated at the time immediately before the current time. The quantization parameter for each gate structure may refer to a corresponding target quantization scale for each gate structure. The target translated speech sequence may refer to a sequence generated by translating a source language sequence through an adapted language translation model.
Fig. 3c is a schematic diagram of the iterative computation of the LSTM network unit. In an alternative embodiment, performing step-by-step iterative calculation through each LSTM network unit in the adapted language translation model according to the quantized sequence value at the matching time in the quantized input sequence, the hidden state quantized value at the time immediately before the matching time, and the quantization parameter of each gate structure in the LSTM network unit, to obtain the calculation result output by each LSTM network unit, may include:
acquiring the input target quantized sequence value and target hidden state value through the target LSTM network unit, and quantizing the target hidden state value through the target LSTM network unit to obtain the target hidden state quantized value; calculating, through the target LSTM network unit, the target quantization calculation result vector corresponding to each gate structure according to the input weight quantization matrix and cyclic weight quantization matrix of each internal gate structure, the target quantized sequence value and the target hidden state quantized value; converting the target quantization calculation result vector corresponding to each gate structure into a target calculation result vector through the target LSTM network unit, wherein the numerical precision of the target calculation result vector matches that of the recurrent neural network before post-training quantization; multiplying, through the target LSTM network unit, the target calculation result vector corresponding to each gate structure by the quantization scale in the quantization parameter of that gate structure, and correspondingly adding the gate offset of each gate structure, to obtain the gate calculation result vector corresponding to each gate structure; and processing each gate calculation result vector with the internal activation function through the target LSTM network unit, and outputting the calculation result; wherein the gate offset of each gate structure and the numerical precision of the activation function match the language translation model to be adapted.
Wherein the target LSTM network unit may refer to the network unit currently being processed in the adapted language translation model. The target quantized sequence value may refer to the quantized sequence value at the matching time; the target hidden state value may refer to the hidden state value at the time immediately before the matching time; the target hidden state quantized value may refer to the value obtained by quantizing the target hidden state value. For example, taking the current time t, the target hidden state quantized value can be calculated according to the formula h_t-1_int8 = round(h_t-1 / s_h), where h_t-1_int8 is the target hidden state quantized value, h_t-1 is the target hidden state value, and s_h is the hidden state vector quantization scale (1/127). The target quantization calculation result vector may refer to the vector calculated according to the formula h_t-1_int8 · R_int8^T + X_t_int8 · W_int8^T, where R_int8 is the cyclic weight quantization matrix and W_int8 is the input weight quantization matrix.
The gate offset may refer to the offset of each gate structure; specifically, the gate offset may include an input weight offset and a cyclic weight offset. The calculation result output by each network unit may refer to the result generated in the calculation process, for example the hidden state value h corresponding to the matching time and the corresponding cell state C. Fig. 3d is a schematic diagram illustrating an inference process in an adapted language translation model according to an embodiment of the present invention. Specifically, taking the current time t as an example, the input target quantized sequence value X_t_int8 and the hidden state value at the previous time t-1 (the target hidden state value) h_t-1 are acquired, and h_t-1 is quantized according to the formula h_t-1_int8 = round(h_t-1 / s_h) to obtain the target hidden state quantized value h_t-1_int8; notably, since the initial target hidden state value is 0, h_t-1 is initially quantized to 0. Further, for each of the four gate structures, h_t-1_int8 · R_int8^T and X_t_int8 · W_int8^T are calculated and summed to obtain a target quantization calculation result vector in the int32 data format; the target quantization calculation result vector is then converted through a converter into the data format fp32 matched with the recurrent neural network; further, the converted result is correspondingly multiplied by the quantization scale in the quantization parameters [s_wi, s_wo, s_wf, s_wc], and the gate offset of each gate structure is correspondingly added, e.g. b_i = Wb_i + Rb_i for the input gate i; since the numerical precision of b_i matches the recurrent neural network, its data format is fp32. Finally, the hidden state value h_t corresponding to the current time t and the cell state C_t are obtained through the activation functions whose numerical precision matches the recurrent neural network, as the calculation result output by the network unit matched with the current time t.
Correspondingly, as shown in fig. 3e, each gate structure of each network unit is calculated at int8 precision in the matrix multiplication stage (Matmul-D), and after the int32-precision result obtained by the matrix multiplication is converted into fp32, all subsequent calculations are performed at fp32 precision.
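The per-time-step inference flow described above can be sketched end to end as follows. This is a hedged illustration rather than the patent's implementation: the gate keys "i", "o", "f", "c", the dictionary layout and all names are assumptions, and the scales s_x and s_h are taken to be folded into the quantized weights (as in the splicing step above), so dequantization multiplies only by the per-gate scale.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_unit_int8_step(x_int8, h_prev, c_prev, W_int8, R_int8, s_w, bias,
                        s_h=1.0 / 127):
    # Quantize the previous hidden state symmetrically (scale 1/127);
    # the initial hidden state 0 therefore quantizes to 0.
    h_int8 = np.clip(np.round(h_prev / s_h), -128, 127).astype(np.int8)
    pre = {}
    for g in ("i", "o", "f", "c"):  # input, output, forget, cell gates
        # Matmul-D stage: int8 matrix multiplication accumulated in int32.
        acc = (x_int8.astype(np.int32) @ W_int8[g].astype(np.int32).T
               + h_int8.astype(np.int32) @ R_int8[g].astype(np.int32).T)
        # Convert to fp32, multiply by the gate's quantization scale and
        # add the fp32 gate offset bias[g] = Wb[g] + Rb[g].
        pre[g] = acc.astype(np.float32) * s_w[g] + bias[g]
    # All subsequent calculation runs at fp32 precision.
    c_t = sigmoid(pre["f"]) * c_prev + sigmoid(pre["i"]) * np.tanh(pre["c"])
    h_t = sigmoid(pre["o"]) * np.tanh(c_t)
    return h_t, c_t
```

Folding s_x and s_h into the weights at quantization time is what makes the single per-gate multiply sufficient here: x ≈ x_int8 · s_x and W · s_x ≈ W_int8 · s_gate, so x · W^T ≈ s_gate · (x_int8 · W_int8^T), and likewise for the recurrent term.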
According to the technical scheme of this embodiment of the invention, a target model comprising at least one recurrent neural network is loaded, and whether the chip adaptation condition of the target model is met is verified according to the numerical relationship between the target precision of the target computing power computing unit in the machine learning chip and the parameter precision of the target model to be adapted; if so, the input vector set and hidden state vector data range respectively matched with each recurrent neural network are loaded; each recurrent neural network, input vector set and hidden state vector data range is carried, by way of at least one stage of memory carrying, to a target memory arranged in the machine learning chip close to the computing units; each calculation instruction built into the machine learning chip is then called through each computing unit according to the data in the target memory, and the input vector quantization scale and hidden state vector quantization scale are calculated according to the input vector set and hidden state vector data range matched with the recurrent neural network; the input weight quantization matrix, cyclic weight quantization matrix and quantization scale respectively corresponding to each gate structure are then calculated according to the target precision, the input vector quantization scale, the hidden state vector quantization scale, and the input weight matrix and cyclic weight matrix of each gate structure; and the input weight quantization matrix, cyclic weight quantization matrix and quantization scale respectively corresponding to each gate structure are output to an engine file to generate the adapted target model. Further, the source language sequence and the adapted language translation model in the engine file are loaded and carried to the target memory by way of at least one stage of memory carrying; each calculation instruction built into the machine learning chip is called through each computing unit according to the data in the target memory; the source language sequence is converted into a quantized input sequence according to the target precision and input into the adapted language translation model; and step-by-step iterative calculation is performed through each LSTM network unit in the adapted language translation model, according to the quantized sequence value at the matching time in the quantized input sequence, the hidden state quantized value at the time immediately before the matching time, and the quantization parameter of each gate structure in the LSTM network unit, to obtain the calculation result output by each LSTM network unit and finally output the target translated speech sequence. This provides the calculation results of the quantization process to the operator and provides an effective data basis for the subsequent work flow.
Fig. 3f is a flowchart of an inference method in an adapted language translation model according to an embodiment of the present invention. Specifically, X_t_int8 is multiplied by W_int8 and the data is output in the int32 data format; the target hidden state value at time t-1 is acquired and quantized to obtain the target hidden state quantized value h_t-1_int8; h_t-1_int8 · R_int8 is calculated and summed with X_t_int8 · W_int8 to obtain the target quantization calculation result vector; the target quantization calculation result vector is then converted from int32 into a target calculation result vector of type fp32; the target calculation result vector of each gate structure is then multiplied in turn by the corresponding quantization parameter in [s_wi, s_wo, s_wf, s_wc]; the gate offset composed of W_bias and R_bias is then correspondingly added; the gate calculation result vector corresponding to each gate structure is thereby obtained, and the subsequent activation function and other calculations are performed at fp32 precision, outputting the hidden state value h_t and the cell state C_t at time t, until the entire sequence has been calculated and the hidden state sequence, final hidden state and final cell state are output. In this way, by converting floating-point operations into int8 operations, the calculation results of the quantization process can be provided to the operator, which improves calculation efficiency and provides an effective data basis for the subsequent work flow.
In particular, the whole parameter quantization and inference method can be realized in hardware: through the configured 2-dimensional and 1-dimensional computing capabilities matched to the quantized precision, dedicated 1-dimensional calculation instructions such as round and clip are supported, which greatly improves inference capability and greatly reduces storage space occupation.
EXAMPLE IV
Fig. 4 is a schematic structural diagram of a chip adaptation apparatus for a deep learning model according to a fourth embodiment of the present invention, which can execute the chip adaptation method of the deep learning model in the foregoing embodiments. The apparatus can be implemented in software and/or hardware, and as shown in fig. 4, the chip adaptation apparatus for the deep learning model specifically includes: a target model loading module 410, an adaptation condition verification module 420, a data loading module 430, a data handling module 440, an instruction calling module 450, a quantization scale calculation module 460, a quantization parameter calculation module 470 and a target model adaptation module 480.
The target model loading module 410 is configured to load a target model to be adapted, where the target model includes at least one recurrent neural network, the recurrent neural network includes at least one network unit, the network unit includes at least one gate structure, and the gate structure has a matched input weight matrix and a matched recurrent weight matrix;
the adaptation condition verification module 420 is configured to verify whether a chip adaptation condition for a target model is satisfied according to a numerical relationship between target accuracy of a target computing power computing unit in a machine learning chip and parameter accuracy of the target model to be adapted;
a data loading module 430, configured to load, if yes, the input vector set and the hidden state vector data range respectively matched with each recurrent neural network;
a data handling module 440, configured to carry each of the recurrent neural networks, the input vector sets and the hidden state vector data ranges, by way of at least one stage of memory carrying, to a target memory arranged in the machine learning chip close to the computing units;
the instruction calling module 450 is configured to call, by each computing unit, each computing instruction built in the machine learning chip according to the data in the target memory, and execute the following operations:
a quantization scale calculation module 460, configured to calculate an input vector quantization scale and a hidden state vector quantization scale according to the input vector set and the hidden state vector data range matched with the recurrent neural network;
a quantization parameter calculation module 470, configured to calculate, according to the target precision, the input vector quantization scale, the hidden state vector quantization scale, and the input weight matrix and the cyclic weight matrix of each gate structure, an input weight quantization matrix, a cyclic weight quantization matrix, and a quantization scale that respectively correspond to each gate structure;
and the target model adapting module 480 is configured to output the input weight quantization matrix, the circular weight quantization matrix and the quantization scale corresponding to each gate structure to an engine file, so as to generate an adapted target model.
According to this embodiment, a target model comprising at least one recurrent neural network is loaded, and whether the chip adaptation condition of the target model is met is verified according to the numerical relationship between the target precision of the target computing power computing unit in the machine learning chip and the parameter precision of the target model to be adapted; if so, the input vector set and hidden state vector data range respectively matched with each recurrent neural network are loaded; each recurrent neural network, input vector set and hidden state vector data range is carried, by way of at least one stage of memory carrying, to a target memory arranged in the machine learning chip close to the computing units; each calculation instruction built into the machine learning chip is then called through each computing unit according to the data in the target memory, and the input vector quantization scale and hidden state vector quantization scale are calculated according to the input vector set and hidden state vector data range matched with the recurrent neural network; the input weight quantization matrix, cyclic weight quantization matrix and quantization scale respectively corresponding to each gate structure are then calculated according to the target precision, the input vector quantization scale, the hidden state vector quantization scale, and the input weight matrix and cyclic weight matrix of each gate structure; and the input weight quantization matrix, cyclic weight quantization matrix and quantization scale respectively corresponding to each gate structure are output to an engine file to generate the adapted target model. This solves the problems of the parameter quantization methods for recurrent neural networks in the prior art, which are time-consuming and computationally complex and incur huge calculation and time costs; the complexity and time consumption of the quantization calculation are greatly reduced with little loss of precision, and the efficiency of parameter quantization in the recurrent neural network is improved.
In particular, the whole chip adaptation method of the deep learning model can be realized in hardware: through the configured 2-dimensional and 1-dimensional computing capabilities matched to the quantized precision, dedicated 1-dimensional calculation instructions such as round and clip are supported, which greatly improves inference capability and greatly reduces storage space occupation.
Optionally, the quantization scale calculation module 460 may be specifically configured to calculate, according to the value range distribution of the input vector set, an input vector value range corresponding to the input vector set, and determine an input vector quantization threshold according to the input vector value range; calculating to obtain an input vector quantization scale according to the input vector quantization threshold, a preset quantization range and a quantization mode; and determining a hidden state vector quantization threshold according to the hidden state vector data range, and calculating to obtain a hidden state vector quantization scale according to the hidden state vector quantization threshold, a preset quantization range and a quantization mode.
Optionally, the quantization scale calculation module 460 may be specifically configured to calculate, using special calculation instructions, the input vector quantization scale, the hidden state vector quantization scale, and the input weight quantization matrix and cyclic weight quantization matrix corresponding to each gate structure; the special calculation instructions encapsulate calculation logic constructed from any of the max-min, relative entropy (KL) divergence, or percentile algorithms.
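A hedged sketch of the scale calculation under these options is given below; the absolute-maximum and percentile threshold algorithms are shown, while a KL-divergence threshold search is omitted for brevity. All names are illustrative.

```python
import numpy as np

def quantization_scale(samples, qmax=127, percentile=None):
    # Quantization threshold from the value-range distribution: the
    # absolute maximum by default, or a percentile of the absolute
    # values as one of the optional threshold algorithms.
    mags = np.abs(np.concatenate([np.ravel(s) for s in samples]))
    threshold = (np.percentile(mags, percentile)
                 if percentile is not None else mags.max())
    # Symmetric int8 quantization: scale = threshold / quantization range.
    return threshold / qmax

# A tanh-bounded hidden state has threshold 1, which reproduces the
# 1/127 hidden state vector quantization scale used in this embodiment.
s_h = quantization_scale([np.array([1.0, -1.0])])
```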
Optionally, the quantization parameter calculation module 470 specifically includes a data acquisition unit, a splicing calculation unit, a quantization scale calculation unit, and a target matrix acquisition unit;
the data acquisition unit is used for acquiring a target input weight matrix and a target cyclic weight matrix corresponding to a currently processed target gate structure;
the splicing calculation unit is used for splicing a first result obtained by multiplying the target input weight matrix by the input vector quantization scale and a second result obtained by multiplying the target circular weight matrix by the hidden state vector quantization scale to obtain a target splicing matrix;
the quantization scale calculation unit is used for calculating the target quantization scale corresponding to the target gate structure according to the target splicing matrix;
and the target matrix acquisition unit is used for carrying out quantization processing on the target splicing matrix according to the target precision and the target quantization scale corresponding to the target gate structure to obtain a target input weight quantization matrix and a target cyclic weight quantization matrix corresponding to the target gate structure.
Optionally, the quantization scale calculation unit may be specifically configured to obtain a maximum value of matrix elements in the target mosaic matrix after performing absolute value taking processing on each matrix element in the target mosaic matrix; and calculating to obtain a target quantization scale corresponding to the target gate structure according to the maximum value of the matrix elements, a preset quantization range and a quantization mode.
Optionally, the target matrix obtaining unit may be specifically configured to divide the target mosaic matrix by the target quantization scale to obtain an intermediate result matrix; performing rounding and truncation processing on the intermediate result matrix according to the target precision to obtain a target quantitative splicing matrix; and performing de-stitching processing on the target quantized stitching matrix to obtain the target input weight quantization matrix and the target cyclic weight quantization matrix.
Optionally, the target model is a language translation model, and the language translation model specifically includes:
the long and short term memory network LSTM coding layer and the LSTM network decoding layer respectively comprise a plurality of LSTM network units which are sequentially connected, each LSTM network unit comprises four gate structures, and each gate structure respectively comprises an input gate, an output gate, a forgetting gate and a cell gate.
Optionally, the chip adaptation device for a deep learning model may further include a model running module, configured to: after the input weight quantization matrix, the cyclic weight quantization matrix and the quantization scale respectively corresponding to each gate structure are output to the engine file to generate the adapted target model, load the source language sequence and the adapted language translation model in the engine file; carry the source language sequence and the adapted language translation model to the target memory by way of at least one stage of memory carrying; and call each calculation instruction built into the machine learning chip through each computing unit according to the data in the target memory, executing the following operations: converting the source language sequence into a quantized input sequence according to the target precision; inputting the quantized input sequence into the adapted language translation model; and performing step-by-step iterative calculation through each LSTM network unit in the adapted language translation model according to the quantized sequence value at the matching time in the quantized input sequence, the hidden state quantized value at the time immediately before the matching time, and the quantization parameter of each gate structure in the LSTM network unit, to obtain the calculation result output by each LSTM network unit and finally output the target translated speech sequence.
Optionally, the model running module may be further configured to: acquire the input target quantized sequence value and target hidden state value through the target LSTM network unit; quantize the target hidden state value through the target LSTM network unit to obtain the target hidden state quantized value; calculate, through the target LSTM network unit, the target quantization calculation result vector corresponding to each gate structure according to the input weight quantization matrix and cyclic weight quantization matrix of each internal gate structure, the target quantized sequence value and the target hidden state quantized value; convert the target quantization calculation result vector corresponding to each gate structure into a target calculation result vector through the target LSTM network unit, wherein the numerical precision of the target calculation result vector matches that of the recurrent neural network before post-training quantization; multiply, through the target LSTM network unit, the target calculation result vector corresponding to each gate structure by the quantization scale in the quantization parameter of that gate structure, and correspondingly add the gate offset of each gate structure, to obtain the gate calculation result vector corresponding to each gate structure; and process each gate calculation result vector with the internal activation function through the target LSTM network unit, and output the calculation result; wherein the gate offset of each gate structure and the numerical precision of the activation function match the language translation model to be adapted.
The chip adaptation device of the deep learning model provided by the embodiment of the invention can execute the chip adaptation method of the deep learning model provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
EXAMPLE V
Fig. 5 is a schematic structural diagram of a machine learning chip according to a fifth embodiment of the present invention. As shown in fig. 5, the machine learning chip includes a processor 510, a memory 520, an input device 530 and an output device 540; the number of processors 510 in the machine learning chip may be one or more, and one processor 510 is taken as an example in fig. 5; the processor 510, the memory 520, the input device 530 and the output device 540 in the machine learning chip may be connected by a bus or other means, and connection by a bus is taken as an example in fig. 5.
The memory 520 is a computer-readable storage medium, and can be used for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the chip adaptation method of the deep learning model in the embodiment of the present invention (for example, the target model loading module 410, the adaptation condition verification module 420, the data loading module 430, the data handling module 440, the instruction calling module 450, the quantization scale calculation module 460, the quantization parameter calculation module 470, and the target model adaptation module 480 in the chip adaptation device of the deep learning model). The processor 510 executes various functional applications and data processing of the machine learning chip by executing software programs, instructions and modules stored in the memory 520, namely, implementing the chip adaptation method of the deep learning model described above.
The chip adaptation method of the deep learning model comprises the following steps:
loading a target model to be adapted, wherein the target model comprises at least one recurrent neural network, the recurrent neural network comprises at least one network unit, the network unit comprises at least one gate structure, and the gate structure is provided with an input weight matrix and a recurrent weight matrix;
verifying whether a chip adaptation condition for a target model is met or not according to a numerical relation between the target precision of a target computing power computing unit in a machine learning chip and the parameter precision of the target model to be adapted;
if yes, loading an input vector set and a hidden state vector data range which are respectively matched with each cyclic neural network;
carrying each of the recurrent neural networks, the input vector sets and the hidden state vector data ranges to a target memory arranged close to the computing unit in a machine learning chip in a mode of carrying at least one stage of memory;
calling each built-in calculation instruction in the machine learning chip through each calculation unit according to the data in the target memory, and executing the following operations:
according to the input vector set matched with the recurrent neural network and the data range of the hidden state vector, calculating to obtain an input vector quantization scale and a hidden state vector quantization scale;
according to the target precision, the input vector quantization scale, the hidden state vector quantization scale and the input weight matrix and the cyclic weight matrix of each gate structure, calculating to obtain an input weight quantization matrix, a cyclic weight quantization matrix and a quantization scale which respectively correspond to each gate structure;
and outputting the input weight quantization matrix, the circulation weight quantization matrix and the quantization scale which respectively correspond to each gate structure to an engine file so as to generate an adapted target model.
The memory 520 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 520 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 520 can further include memory located remotely from the processor 510, which can be connected to a computer device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 530 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the machine learning chip. The output device 540 may include a display device such as a display screen.
EXAMPLE VI
A sixth embodiment of the present invention further provides a storage medium containing computer-executable instructions, which when executed by a computer processor, perform a chip adaptation method for a deep learning model;
the chip adaptation method of the deep learning model comprises the following steps:
loading a target model to be adapted, wherein the target model comprises at least one recurrent neural network, the recurrent neural network comprises at least one network unit, the network unit comprises at least one gate structure, and the gate structure is provided with an input weight matrix and a recurrent weight matrix;
verifying whether a chip adaptation condition for a target model is met or not according to a numerical relation between the target precision of a target computing power computing unit in a machine learning chip and the parameter precision of the target model to be adapted;
if yes, loading an input vector set and a hidden state vector data range which are respectively matched with each cyclic neural network;
carrying each cyclic neural network, input vector set and hidden state vector data range, by way of at least one stage of memory carrying, to a target memory arranged in the machine learning chip close to the computing units;
calling each built-in calculation instruction in the machine learning chip according to the data in the target memory through each calculation unit, and executing the following operations:
according to the input vector set matched with the recurrent neural network and the data range of the hidden state vector, calculating to obtain an input vector quantization scale and a hidden state vector quantization scale;
according to the target precision, the input vector quantization scale, the hidden state vector quantization scale and the input weight matrix and the cyclic weight matrix of each gate structure, calculating to obtain an input weight quantization matrix, a cyclic weight quantization matrix and a quantization scale which respectively correspond to each gate structure;
and outputting the input weight quantization matrix, the circulation weight quantization matrix and the quantization scale which respectively correspond to each gate structure to an engine file so as to generate an adapted target model.
Of course, the storage medium containing the computer-executable instructions provided by the embodiments of the present invention is not limited to the method operations described above, and may also perform related operations in the chip adaptation method of the deep learning model provided by any embodiments of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention or portions thereof contributing to the prior art may be embodied in the form of a software product, which can be stored in a computer readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the embodiment of the chip adaptation apparatus for deep learning model, the included units and modules are only divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be realized; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
It is to be noted that the foregoing description is only exemplary of the invention and that the principles of the technology may be employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (12)

1. A chip adaptation method of a deep learning model, executed by a machine learning chip, characterized by comprising the following steps:
loading a target model to be adapted, wherein the target model comprises at least one recurrent neural network, the recurrent neural network comprises at least one network unit, the network unit comprises at least one gate structure, and the gate structure is provided with an input weight matrix and a recurrent weight matrix;
verifying whether a chip adaptation condition of a target model is met or not according to a numerical relation between the target precision of a target computing power computing unit in a machine learning chip and the parameter precision of the target model to be adapted;
if yes, loading an input vector set and a hidden state vector data range which are respectively matched with each cyclic neural network;
carrying each cyclic neural network, input vector set and hidden state vector data range, by way of at least one stage of memory carrying, to a target memory arranged in the machine learning chip close to the computing units;
calling each built-in calculation instruction in the machine learning chip through each calculation unit according to the data in the target memory, and executing the following operations:
according to the input vector set matched with the recurrent neural network and the data range of the hidden state vector, calculating to obtain an input vector quantization scale and a hidden state vector quantization scale;
according to the target precision, the input vector quantization scale, the hidden state vector quantization scale and the input weight matrix and the cyclic weight matrix of each gate structure, calculating to obtain an input weight quantization matrix, a cyclic weight quantization matrix and a quantization scale which respectively correspond to each gate structure;
and outputting the input weight quantization matrix, the circulation weight quantization matrix and the quantization scale which respectively correspond to each gate structure to an engine file so as to generate an adapted target model.
2. The method of claim 1, wherein computing the input vector quantization scale and the hidden state vector quantization scale based on the input vector set and the hidden state vector data range that match the recurrent neural network comprises:
calculating to obtain an input vector value range corresponding to the input vector set according to the value domain distribution of the input vector set, and determining an input vector quantization threshold according to the input vector value range;
calculating to obtain an input vector quantization scale according to the input vector quantization threshold, a preset quantization range and a quantization mode;
and determining a hidden state vector quantization threshold according to the hidden state vector data range, and calculating to obtain a hidden state vector quantization scale according to the hidden state vector quantization threshold, a preset quantization range and a quantization mode.
3. The method of claim 2, wherein:
calculating to obtain an input vector quantization scale, a hidden state vector quantization scale, an input weight quantization matrix and a circular weight quantization matrix which respectively correspond to each gate structure by using a special calculation instruction;
wherein the special calculation instruction encapsulates calculation logic constructed from any of the max-min, relative entropy (KL) divergence, or percentile algorithms.
4. The method according to claim 1 or 2, wherein the calculating an input weight quantization matrix, a circular weight quantization matrix and a quantization scale corresponding to each gate structure according to the target precision, the input vector quantization scale, the hidden state vector quantization scale and the input weight matrix and the circular weight matrix of each gate structure comprises:
acquiring a target input weight matrix and a target circular weight matrix corresponding to a currently processed target gate structure;
splicing a first result obtained by multiplying the target input weight matrix by the input vector quantization scale and a second result obtained by multiplying the target cyclic weight matrix by the hidden state vector quantization scale to obtain a target splicing matrix;
calculating to obtain a target quantization scale corresponding to the target gate structure according to the target splicing matrix;
and according to the target precision and a target quantization scale corresponding to the target gate structure, performing quantization processing on the target splicing matrix to obtain a target input weight quantization matrix and a target circular weight quantization matrix corresponding to the target gate structure.
5. The method of claim 4, wherein calculating a target quantization scale corresponding to the target gate structure from the target stitching matrix comprises:
obtaining the maximum value of the matrix elements in the target splicing matrix after carrying out absolute value processing on each matrix element in the target splicing matrix;
and calculating to obtain a target quantization scale corresponding to the target gate structure according to the maximum value of the matrix elements, a preset quantization range and a quantization mode.
6. The method of claim 4, wherein quantizing the target stitching matrix according to the target precision and a target quantization scale corresponding to the target gate structure to obtain a target input weight quantization matrix and a target cyclic weight quantization matrix corresponding to the target gate structure, comprises:
dividing the target splicing matrix by the target quantization scale to obtain an intermediate result matrix;
performing rounding and truncation processing on the intermediate result matrix according to the target precision to obtain a target quantized splicing matrix;
and performing de-stitching processing on the target quantized stitching matrix to obtain the target input weight quantization matrix and the target circular weight quantization matrix.
7. The method according to any of claims 1-6, wherein the target model is a language translation model, the language translation model specifically comprising:
a long short-term memory (LSTM) network coding layer and an LSTM network decoding layer, each comprising a plurality of sequentially connected LSTM network units, wherein each LSTM network unit comprises four gate structures, namely an input gate, an output gate, a forget gate and a cell gate.
8. The method of claim 7, further comprising, after outputting the input weight quantization matrix, the circular weight quantization matrix, and the quantization scale corresponding to each gate structure to an engine file to generate an adapted target model:
loading a source language sequence and the adapted language translation model in the engine file;
carrying the source language sequence and the adapted language translation model to the target memory in a mode of carrying at least one stage of memory;
calling each built-in calculation instruction in the machine learning chip through each calculation unit according to the data in the target memory, and executing the following operations:
converting the source language sequence into a quantitative input sequence according to the target precision;
inputting the quantized input sequence into the adapted language translation model;
and performing step-by-step iterative calculation through each LSTM network unit in the adapted language translation model according to the quantized sequence value at the matching time in the quantized input sequence, the hidden state quantized value at the time immediately before the matching time, and the quantization parameter of each gate structure in the LSTM network unit, to obtain the calculation result output by each LSTM network unit and finally output the target translated speech sequence.
9. The method of claim 8, wherein the step-by-step iterative computation of the computation result output by each LSTM network element in the adapted language translation model according to the quantized sequence value of the matching time in the quantized input sequence, the hidden state quantized value of the previous time of the matching time, and the quantized parameter of each gate structure in the LSTM network element comprises:
acquiring an input target quantization sequence value and a target hidden state value through a target LSTM network unit;
quantizing the target hidden state value through a target LSTM network unit to obtain a target hidden state quantized value;
calculating to obtain target quantization calculation result vectors respectively corresponding to each gate structure through a target LSTM network unit according to an input weight quantization matrix, a circulating weight quantization matrix, the target quantization sequence value and the target hidden state quantization value of each internal gate structure;
converting the target quantization calculation result vector corresponding to each gate structure into a target calculation result vector through the target LSTM network unit, wherein the numerical precision of the target calculation result vector matches that of the recurrent neural network before post-training quantization;
through a target LSTM network unit, correspondingly multiplying a target calculation result vector corresponding to each gate structure by a quantization scale in a quantization parameter of each gate structure, and correspondingly adding gate offset of each gate structure to obtain gate calculation result vectors respectively corresponding to each gate structure;
processing the calculation result vector of each gate by adopting an internal activation function through a target LSTM network unit, and outputting a calculation result;
wherein the gate offset of each gate structure and the numerical precision of the activation function match the language translation model to be adapted.
10. A chip adapting device of a deep learning model is characterized by comprising:
the target model loading module is used for loading a target model to be adapted, the target model comprises at least one recurrent neural network, the recurrent neural network comprises at least one network unit, the network unit comprises at least one gate structure, and the gate structure is provided with a matched input weight matrix and a matched recurrent weight matrix;
the adaptation condition verification module is used for verifying whether the chip adaptation condition of the target model is met or not according to the numerical relation between the target precision of the target computing power computing unit in the machine learning chip and the parameter precision of the target model to be adapted;
the data loading module is used for loading, if the chip adaptation condition is met, the input vector set and the hidden state vector data range respectively matched with each recurrent neural network;
the data carrying module is used for carrying each recurrent neural network, input vector set and hidden state vector data range, by way of at least one stage of memory carrying, to a target memory arranged in the machine learning chip close to the computing units;
the instruction calling module is used for calling each built-in computing instruction in the machine learning chip through each computing unit according to the data in the target memory, and executing the following operations:
the quantization scale calculation module is used for calculating to obtain an input vector quantization scale and a hidden state vector quantization scale according to the input vector set matched with the recurrent neural network and the data range of the hidden state vector;
the quantization parameter calculation module is used for calculating an input weight quantization matrix, a circulation weight quantization matrix and a quantization scale which respectively correspond to each gate structure according to the target precision, the input vector quantization scale, the hidden state vector quantization scale and the input weight matrix and the circulation weight matrix of each gate structure;
and the target model adaptation module is used for outputting the input weight quantization matrix, the circulation weight quantization matrix and the quantization scale which respectively correspond to each gate structure to an engine file so as to generate an adapted target model.
11. A machine learning chip comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements a chip adaptation method for deep learning models according to any one of claims 1 to 9 when executing the program.
12. A storage medium of computer-executable instructions, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out a chip adaptation method of a deep learning model according to any one of claims 1 to 9.
CN202211097767.1A 2022-03-31 2022-09-08 Chip adaptation method, device, chip and medium of deep learning model Pending CN115496181A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210345807 2022-03-31
CN2022103458073 2022-03-31

Publications (1)

Publication Number Publication Date
CN115496181A true CN115496181A (en) 2022-12-20

Family

ID=84469067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211097767.1A Pending CN115496181A (en) 2022-03-31 2022-09-08 Chip adaptation method, device, chip and medium of deep learning model

Country Status (1)

Country Link
CN (1) CN115496181A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117093432A (en) * 2023-10-16 2023-11-21 成都融见软件科技有限公司 Signal activation state judging method
CN117093432B (en) * 2023-10-16 2024-01-26 成都融见软件科技有限公司 Signal activation state judging method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination