CN115759237A - End-to-end deep neural network model compression and heterogeneous conversion system and method - Google Patents

End-to-end deep neural network model compression and heterogeneous conversion system and method

Info

Publication number
CN115759237A
CN115759237A (application CN202211292482.3A)
Authority
CN
China
Prior art keywords
network model
model
neural network
deep neural
heterogeneous
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211292482.3A
Other languages
Chinese (zh)
Inventor
王旭强
江黛茹
张倩宜
郑剑
金尧
杨一帆
郑阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Tianjin Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Tianjin Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Tianjin Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Tianjin Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Tianjin Electric Power Co Ltd, Information and Telecommunication Branch of State Grid Tianjin Electric Power Co Ltd
Priority to CN202211292482.3A
Publication of CN115759237A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to an end-to-end deep neural network model compression and heterogeneous conversion system and method. The system comprises: a deep neural network model compression module, which prunes the network model to be optimized, retrains it according to the network characteristics, obtains a compressed network model with small accuracy loss using several model quantization compression methods, and inputs the compressed network model to the edge-device-oriented heterogeneous model conversion module; and the edge-device-oriented heterogeneous model conversion module, which converts the network model from X86 to ARM and automates the heterogeneous conversion of the network model. The invention addresses the problem that, because power edge-side devices are not compatible with general server-side architectures, existing deep neural network modules either cannot be applied directly to power edge-side devices or cannot provide sufficient performance, which restricts the application of deep neural network technology at the edge.

Description

End-to-end deep neural network model compression and heterogeneous conversion system and method
Technical Field
The invention belongs to the technical field of intelligent power services and relates to a deep neural network model compression and heterogeneous conversion system, in particular to an end-to-end deep neural network model compression and heterogeneous conversion system and method.
Background
With the deep fusion of Internet-of-Things technology and the smart distribution grid, a large number of computing nodes have been connected to the smart distribution grid and applied to various power service scenarios. However, algorithm models based on deep neural networks have complex structures and huge parameter volumes, so they face challenges on low-compute-power edge-side devices. Generally, the computing and storage resources of edge-side devices for power services are limited, i.e., limited storage space, memory bandwidth and floating-point computing power. Meanwhile, as the performance of deep neural network models keeps improving and the model recognition error rate keeps falling, the network structures become more complex and contain large numbers of parameters and floating-point operations, mainly because a neural network usually stacks many feature-extraction layers; the space and time complexity of the network therefore keeps growing, occupying more storage space and introducing a large amount of floating-point computation. In addition, deep learning modules have been continuously researched and optimized in recent years and their performance has improved greatly, but they are mostly oriented to server-side architectures such as X86 processors and general-purpose GPUs. Power edge-side devices, by contrast, are usually based on the ARM architecture and, for reasons of volume, power consumption and cost, cannot be equipped with general-purpose accelerators such as GPUs. Although edge-side devices do use technologies such as customized accelerators and multi-core, multi-instruction parallelism, they are not compatible with general server-side architectures, so existing deep neural network modules either cannot be applied directly to power edge-side devices or cannot provide sufficient performance, which restricts the application of deep neural network technology at the edge. An end-to-end deep neural network model compression and heterogeneous conversion system and method therefore need to be designed to solve the above problems.
Disclosure of Invention
The invention aims to overcome the above deficiencies of the prior art and provides an end-to-end deep neural network model compression and heterogeneous conversion system and method. It addresses the problem that, because power edge-side devices are not compatible with general server-side architectures, existing deep neural network modules either cannot be applied directly to power edge-side devices or cannot provide sufficient performance, which restricts the application of deep neural network technology at the edge.
The invention solves the practical problem by adopting the following technical scheme:
an end-to-end deep neural network model compression and heterogeneous conversion system, comprising:
the deep neural network model compression module, which is used for pruning the network model to be optimized, retraining the network model according to the network characteristics, obtaining a compressed network model with small accuracy loss by using several model quantization compression methods, and inputting the compressed network model to the edge-device-oriented heterogeneous model conversion module;
and the edge-device-oriented heterogeneous model conversion module, which is used for converting the network model from X86 to ARM and automating the heterogeneous conversion of the network model.
An end-to-end deep neural network model compression and heterogeneous conversion method comprises the following steps:
s1, an end-to-end system needs to receive a network model to be optimized used for analyzing electric power big data during operation, and the network model to be optimized is input to a deep neural network model compression module facing edge side equipment;
s2, a deep neural network model compression module facing to edge side equipment realizes pruning, sparsification and quantization of a network model to be optimized, so that a compressed network model is obtained;
and S3, inputting the compressed network model into a heterogeneous model conversion module facing the edge side equipment, and realizing conversion of the network model from X86 to ARM so as to obtain an optimized model.
Moreover, the specific method of step S2 is:
the method comprises the steps that a threshold value is set in the process of training a network to be optimized to judge whether the weight of network connection is important and the importance degree, the unimportant connection weight is cut off by using a zero setting method, then the cut-off network is trained, the rest parameter weights are cut off, and the process is repeated to finally obtain a compressed network model with little parameter weight.
Moreover, the specific method of step S3 is:
generating target ARM CPU acceleration instructions through acceleration-instruction translation;
performing code tuning through TVM IR while still generating target peripheral code, realizing embedded SoC on-chip computation acceleration based on hardware such as an NPU, and finally ensuring that the migrated network model still uses the PyTorch front-end framework;
and using the ARM SIMD unit to perform instruction-set parallel acceleration on the preliminarily converted model, reducing the deep neural network inference latency and completing the model optimization (a sketch of this conversion flow is given below).
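A minimal sketch of such an X86-to-ARM conversion flow, assuming a TVM-based toolchain as described above; the input shape, file name and cross-compiler name are illustrative:

    import torch
    import tvm
    from tvm import relay
    from tvm.contrib import cc

    model.eval()                                   # compressed PyTorch model from step S2
    example = torch.randn(1, 3, 224, 224)          # assumed input shape
    scripted = torch.jit.trace(model, example)

    # Import the traced model into TVM's intermediate representation (Relay / TVM IR).
    mod, params = relay.frontend.from_pytorch(scripted, [("input0", list(example.shape))])

    # Build for a 64-bit ARM CPU with NEON SIMD support.
    target = "llvm -mtriple=aarch64-linux-gnu -mattr=+neon"
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)

    # Cross-compile the shared library that will be deployed on the ARM edge device.
    lib.export_library("model_arm.so", cc.cross_compiler("aarch64-linux-gnu-gcc"))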
The invention has the advantages and beneficial effects that:
compared with the prior art, the invention realizes the pruning of the model through a network model compression module, retrains the network model according to the network characteristics, then obtains a compressed network model with smaller precision loss by using various model quantization compression methods, and obtains the compressed network model after the compression; the module inputs the compressed network model into the edge side heterogeneous module by using Docker related tools, realizes the conversion of the network model from X86 to ARM, realizes the automation of the heterogeneous conversion of the network model, saves human intervention, reduces the maintenance cost of power big data application update iteration, inputs the compressed network model into the edge side heterogeneous module, realizes the conversion of the network model from X86 to ARM, and reduces the cost of edge side equipment; the module also realizes the run-time optimization of the network model by combining the characteristics of an ARM instruction set structure, and finally realizes the inference performance optimization of the power big data analysis application network model, so that the system can ensure the high-efficiency operation on the power edge side equipment for a given complex deep neural network model, thereby providing powerful support for subsequent edge calculation related projects, and further liberating computing resources and fully utilizing resource fragments for the existing projects.
2. The invention can make the complicated deep neural network dispose on the edge apparatus of the low cost, and achieve the performance guarantee that the business needs, thus promote the data processing ability and intelligence level of the apparatus of the edge side, expand the coverage of the intelligent business, promote the business ability, with the help of the popularization and application of the achievement of the invention, can reduce the existing data processing task to the need of the hardware performance of the edge side on one hand, can promote the utilization factor of resources, reduce the running cost of the system; on the other hand, the invention enables the edge side to support a more complex neural network, thereby improving the data processing capability of the edge side and the intelligent level of equipment, achieving the effects of expanding the coverage of intelligent services and improving the service capability, being applied to the edge equipment of national power grid companies, realizing the maximum utilization of model adaptability and resource utilization rate, improving the deployment capability and the calculation efficiency of the model on the edge equipment, and greatly improving the functions and the performance of the model capability.
Drawings
FIG. 1 is a system framework diagram of the present invention;
FIG. 2 is a network model heterogeneous module migration flow diagram of the present invention;
FIG. 3 is a block diagram of a three-stage model pruning of the present invention;
FIG. 4 is a diagram of the effect of the present invention before and after pruning;
FIG. 5 is a flow diagram of the convolution calculation of the present invention;
FIG. 6 is a diagram of weight sharing quantization in accordance with the present invention;
FIG. 7 is a diagram of the layer quantization process of the present invention;
FIG. 8 is a flow chart of neural network pruning according to the present invention.
Detailed Description
The embodiments of the invention are further described in the following with reference to the drawings:
An end-to-end deep neural network model compression and heterogeneous conversion system, as shown in FIGS. 1 to 8, includes:
the deep neural network model compression module, which is used for pruning the network model to be optimized, retraining the network model according to the network characteristics, obtaining a compressed network model with small accuracy loss by using several model quantization compression methods, and inputting the compressed network model to the edge-device-oriented heterogeneous model conversion module;
and the edge-device-oriented heterogeneous model conversion module, which is used for converting the network model from X86 to ARM and automating the heterogeneous conversion of the network model.
An end-to-end deep neural network model compression and heterogeneous conversion method comprises the following steps:
s1, an end-to-end system needs to receive a network model to be optimized used for analyzing electric power big data during operation, and the network model to be optimized is input to a deep neural network model compression module facing edge side equipment;
s2, a deep neural network model compression module facing to edge side equipment realizes pruning, sparsification and quantization of a network model to be optimized, so that a compressed network model is obtained;
the specific steps of the step S2 include: the method comprises the steps that a threshold value is set in the process of training a network to be optimized to judge whether the weight of network connection is important and the importance degree, the unimportant connection weight is cut off by using a zero setting method, then the cut-off network is trained, the rest parameter weights are cut off, and the process is repeated to finally obtain a compressed network model with little parameter weight.
Neural network model pruning eliminates redundant parameters from a trained model in order to reduce the storage space of the model parameters and accelerate the computation of the deep neural network. When a model is pruned appropriately, its accuracy may not drop and may even improve, because an over-fitted model benefits from pruning acting as a form of regularization: over-fitting is effectively suppressed and the performance of the deep neural network model improves. In a neural network the importance of each weight parameter differs: the larger the weight value, the larger its influence on overall network performance; conversely, some extremely small weight values have little influence on performance, so clipping them (i.e., setting them to zero) reduces the model storage space without affecting the accuracy of the network model, thereby reducing the network size. The choice of threshold is a key point of pruning research. The simplest heuristic uses the absolute value of a parameter as its importance index and then prunes with a greedy algorithm; some newer techniques use the absolute value of the derivative of a normalized objective function with respect to the parameter as the measure. Other work prunes the network by setting per-layer or global sparsity, for example the typical ADC method, which learns an appropriate and optimal pruning rate for each layer according to different accuracy or computation requirements and finally obtains the pruned model.
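A minimal sketch of per-layer magnitude pruning of this kind using PyTorch's pruning utilities; the layer names and per-layer rates are hypothetical, and model is an assumed PyTorch network:

    import torch.nn.utils.prune as prune

    # Hypothetical per-layer pruning rates, e.g. chosen per layer sensitivity or learned as in ADC.
    layer_sparsity = {"conv1": 0.3, "conv2": 0.5, "fc": 0.7}

    for name, module in model.named_modules():
        if name in layer_sparsity:
            # Remove the smallest weights (by absolute value) in this layer.
            prune.l1_unstructured(module, name="weight", amount=layer_sparsity[name])
            prune.remove(module, "weight")   # fold the pruning mask back into the weight tensor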
To reduce the computation of a model while ensuring that its accuracy does not suffer a large loss, a hardware-friendly deep neural network pruning and sparsification method is provided. The method effectively combines hierarchical channel pruning with power-of-two exponent quantization, greatly reducing the computing resources needed to deploy the neural network on hardware while keeping the drop in network model accuracy small. Hierarchical channel pruning groups the different layers of the pruned network, retrains the deep neural network after each layer is pruned in a specific order, and sets adjustable hyper-parameters so as to meet the different pruning rates needed in practical applications. The hierarchical channel pruning method avoids the irreversibility of permanent pruning, makes the network parameters more regular and facilitates network adjustment. After some convolutional layers are pruned and retrained, the weights of the remaining convolutional layers change; with one-shot pruning these changes would be irreversible in subsequent training, whereas hierarchical pruning can extract the weights that play a decisive role in the convolutional layers and ensure they are not pruned, so the accuracy drop after pruning stays as small as possible. In addition, the storage and computation of the network parameters in hardware are greatly reduced.
Group power-exponent quantization quantizes all full-precision weights into low-precision discrete values consisting of zero and powers of two. This allows a simple shifter to replace the multiplication operation on hardware, greatly reducing the power and computation consumption of the hardware and allowing the layer weights of the network to be stored and computed more efficiently. The convolution kernels are divided into two groups: the weights are first sorted by absolute value from large to small, group A contains the larger weights to be quantized, and group B contains the smaller weights to be retrained. After group A is quantized, its quantized weights are frozen during the subsequent retraining of group B. The still-unquantized weights are then again divided into two groups and quantized and retrained respectively, and this grouping, quantization and retraining continue until all weights are quantized.
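A minimal sketch of one grouping-and-quantization round under these assumptions; the grouping ratio is illustrative and the power-of-two mapping used is the standard sign(w) * 2^round(log2|w|) form:

    import torch

    def power_of_two_quantize(w: torch.Tensor) -> torch.Tensor:
        # Map each nonzero weight to sign(w) * 2^round(log2(|w|)); zeros stay zero.
        q = torch.zeros_like(w)
        nz = w != 0
        q[nz] = torch.sign(w[nz]) * torch.exp2(torch.round(torch.log2(w[nz].abs())))
        return q

    def grouped_quantize_step(weight: torch.Tensor, ratio: float = 0.5):
        # Group A (larger |w|) is quantized and frozen; group B (smaller |w|) stays full precision.
        k = max(1, int(ratio * weight.numel()))
        threshold = torch.topk(weight.abs().flatten(), k).values.min()
        group_a = weight.abs() >= threshold
        quantized = torch.where(group_a, power_of_two_quantize(weight), weight)
        return quantized, group_a      # group_a marks the weights frozen in later retraining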
Convolutional neural networks are a branch of artificial neural networks, also known as shift-invariant or space-invariant artificial neural networks (SIANN), and are commonly used for analyzing visual images. Yann LeCun first used convolutional neural networks for handwritten digit recognition, and in recent years they have continued to play a positive role in many directions. A convolutional neural network mainly consists of an input layer, convolutional layers, batch normalization (BN) layers, activation layers, pooling layers and fully connected layers. Stacking these layer structures forms a convolutional neural network that converts the original image into category scores. The convolutional layers and fully connected layers have parameters, which are stored in the network model, whereas the activation and pooling layers have no parameters; parameter updates in the neural network are realized through back propagation.
(1) Convolutional layer
Convolutional layers are the core building blocks of convolutional neural networks and perform most of the computation. A convolutional layer is composed of a series of convolution kernels, each of which is usually used to extract a certain feature. Convolution is a process of performing a linear transformation at each position of an image and mapping it to a new value. If the weights of a convolution kernel are represented as a vector w, the pixel vector at the corresponding image position as x and the offset as b, then the convolution at that position transforms x into y as shown in formula (2-1), i.e., the result of the convolution is the inner product of w and x plus the offset. A convolutional neural network has multiple convolutional layers; mapping layer by layer through several convolutional layers forms a complex function overall, and network training can be expressed as a function-fitting process.
y = w · x + b    (2-1)
Convolutional layers have several necessary components that participate in the convolution computation, namely the input, the convolution kernels and the feature map. In a convolutional neural network the input data is a four-dimensional tensor whose dimensions are the number of inputs, the input height, the input width and the number of input channels. After passing through the convolutional layer, the input image is abstracted into a feature map, also called an activation map, which is likewise a four-dimensional tensor whose dimensions represent the number of inputs, the feature-map height, the feature-map width and the number of feature-map channels; the feature map serves as the input of the next layer. The weights of the convolutional layer reside in the convolution kernel, which moves across the image's field of view checking for the presence of features, a process called convolution. The convolution calculation flow is shown in FIG. 5, where H and W are the height and width of the input feature map; if the size of the convolution kernel is K × M, the kernel moves with a given stride until it has swept the whole image, producing the output feature map. In FIG. 5 the convolution stride is 1 and there are 4 convolution kernels, so the output feature map size is H × W × 4. The convolution kernel is typically a 3 × 3 matrix and determines the size of the receptive field.
The number of convolution kernels, the kernel stride and zero padding all affect the output feature map. The number of kernels affects the depth of the output: 3 different kernels generate 3 different feature maps, changing the output depth. The stride is the number of pixels the kernel moves over the input matrix; larger strides yield smaller outputs. Zero padding is typically used when the convolution kernel does not fit the input image exactly. The convolutional layer applies the convolution operation to the input and passes the result to the next layer; the final output of the convolutional layer is a vector. Parameter sharing is used in convolutional layers: all spatial positions in a convolutional layer share the same convolution kernel, and the kernel weights stay unchanged as the kernel moves across the image, greatly reducing the number of parameters the layer requires. Through different combinations of layer structures, convolutional layers provide strong representational power in visual recognition tasks.
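A minimal sketch of the shapes involved in such a convolutional layer, assuming PyTorch and the example of FIG. 5 (stride 1, 4 kernels); zero padding of 1 is used here so the spatial size stays H × W:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 3, 32, 32)          # (number of inputs, channels, height H, width W)

    # 4 convolution kernels of size 3 x 3, stride 1, zero padding 1.
    conv = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, stride=1, padding=1)

    y = conv(x)                            # each output value is w · x + b, as in formula (2-1)
    print(y.shape)                         # torch.Size([1, 4, 32, 32]), i.e. H x W x 4 feature maps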
(2) Batch normalization layer
The batch normalization layer is a standard component of current convolutional neural networks. It normalizes the output of the preceding linear layer and then scales it and adds an offset. Batch normalization was originally designed to alleviate internal covariate shift, a common problem in convolutional neural network training. As shown in equation (2-2), the layer first normalizes the mini-batch data and then learns a scale and a shift for each mini-batch. Given an input x of the batch normalization layer, its output y can be represented as:
y = γ · (x − μ) / σ + β    (2-2)
where μ and σ denote the mean and standard deviation, respectively, computed as exponential moving averages of the batch statistics during training, and γ and β are learned affine parameters for each channel. The detailed calculation of the batch normalization layer is given by equations (2-3) to (2-6).
First, the mean μ_B of the input data is computed, where x_i is the input data of the batch normalization layer and m is the batch size:
μ_B = (1/m) · Σ_{i=1..m} x_i    (2-3)
Similarly, the variance σ_B² of the input data is computed, giving equation (2-4):
σ_B² = (1/m) · Σ_{i=1..m} (x_i − μ_B)²    (2-4)
Using the mean and variance obtained from equations (2-3) and (2-4), the data are then normalized by equation (2-5), where x̂_i is the normalized value and ε is a small constant for numerical stability:
x̂_i = (x_i − μ_B) / √(σ_B² + ε)    (2-5)
The normalized data are then shifted and scaled using the two learnable parameters γ and β:
y_i = γ · x̂_i + β    (2-6)
Here x and y are the input and output of a neuron response in a data sample. The transformation of the batch normalization layer keeps the input distribution of each layer stable across different mini-batches. When the stochastic gradient descent (SGD) algorithm is used for optimization during back propagation, this stable input distribution promotes convergence of the network model and speeds up training of the convolutional neural network. Furthermore, if the training data are reshuffled in each training round, different transformations are applied to the same training samples, so the overall training process produces a more comprehensive augmentation. In the inference phase, global statistics are used for normalization. Extensive experiments show that networks with batch normalization layers need markedly fewer iterations to converge and achieve better final performance. Batch normalization layers have become standard components of the best-performing convolutional neural network architectures, such as the residual network ResNet-50 and the lightweight network MobileNet V2.
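A minimal sketch of the training-time batch normalization computation of equations (2-3) to (2-6), assuming a 2-D input of shape (batch size m, channels); the small constant eps is the usual numerical-stability term:

    import torch

    def batch_norm_forward(x, gamma, beta, eps=1e-5):
        mu = x.mean(dim=0)                          # (2-3) mini-batch mean
        var = ((x - mu) ** 2).mean(dim=0)           # (2-4) mini-batch variance
        x_hat = (x - mu) / torch.sqrt(var + eps)    # (2-5) normalization
        return gamma * x_hat + beta                 # (2-6) scale and shift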
(3) Active layer
The activation function usually follows a convolutional layer in a convolutional neural network and is also called the activation layer. The idea of adding activation layers to neural networks comes from the action potential in neuroscience: when the potential difference between the inside and the outside of a neuron exceeds a certain value, the neuron transmits a signal to neighboring neurons, and the activation sequence generated by the action potentials is called a spike train. Similarly, an activation function in a neural network outputs a small value for small input data and a larger value once a threshold is exceeded: if the input is large enough the activation function fires, otherwise nothing is triggered. The activation function acts like a gate that checks whether the input value exceeds a critical value. The activation layer plays an important role in the network, because the activation function adds nonlinearity to the neural network and gives the convolutional neural network its strong learning capacity.
Many activation functions are widely used in current convolutional neural networks. The simplest is the rectified linear unit (ReLU), a piecewise linear function that outputs zero if the input is negative and otherwise outputs the original value directly. Another commonly used activation function is the sigmoid function, whose gradient is defined everywhere and whose output lies between 0 and 1 for all inputs. However, the exponential function is computationally expensive in practice, so a simpler activation function such as ReLU is usually chosen.
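A minimal sketch of the two activation functions mentioned above:

    import torch

    def relu(x):
        return torch.clamp(x, min=0)          # zero for negative inputs, identity otherwise

    def sigmoid(x):
        return 1.0 / (1.0 + torch.exp(-x))    # output always lies between 0 and 1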
(4) Pooling layer
The pooling layer in a convolutional neural network, also known as downsampling, performs dimensionality reduction and reduces the number of parameters in the input. Similar to a convolutional layer, the pooling operation scans the input data with a kernel, but the pooling kernel has no weights; it applies an aggregation function to the input data to populate the output array. Although much information is lost in the pooling layer, pooling also has many benefits for convolutional neural networks, helping to reduce complexity, improve computational efficiency and reduce the risk of overfitting.
There are two main types of pooling:
Max pooling: assume a 4 × 4 matrix represents the initial input and a 2 × 2 kernel is used, moved with a stride of 2 and without overlapping regions. As the kernel moves across the input matrix, the pixel with the maximum value is sent to the output matrix. Max pooling tends to be applied more often than average pooling. The downsampled feature map is created by computing the maximum of the feature map. Pooling layers are typically used after the convolutional layer, and pooling increases translation invariance, meaning that a small translation of the image does not significantly affect most of the pooled output values.
Global average pooling: as the kernel moves over the input matrix, the global average pooling operation computes the average of the input matrix and passes it to the output matrix. The advantage of average pooling is that it maintains the integrity of the overall data characteristics and preserves more image background information.
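A minimal sketch of the two pooling operations, assuming PyTorch and a small random input:

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 3, 4, 4)                         # (inputs, channels, height, width)

    # Max pooling with a 2 x 2 window and stride 2 (no overlap), as in the example above.
    y_max = F.max_pool2d(x, kernel_size=2, stride=2)    # shape (1, 3, 2, 2)

    # Global average pooling: one average value per channel.
    y_gap = F.adaptive_avg_pool2d(x, output_size=1)     # shape (1, 3, 1, 1)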
(5) Full connection layer
A fully connected layer in a convolutional neural network is a layer that connects all inputs of one layer to every activation unit of the next layer. In most popular machine learning models the last layer is a fully connected layer, which compiles the data extracted by the previous layers to form the final output; it is the second most time-consuming layer after the convolutional layer. In partially connected layers, the pixel values of the input image are not directly connected to the output layer, whereas in the fully connected layer every node of the output layer is directly connected to the nodes of the previous layer. The fully connected layer performs the classification task according to the features extracted by the previous layers and their different convolution kernels. While convolutional and pooling layers tend to use the ReLU activation function, fully connected layers typically apply the softmax activation function to the input, yielding probabilities between 0 and 1. The fully connected layer multiplies the input matrix by the weight matrix and then adds a bias vector. The calculation is given by formula (2-7):
y_FC = x · w^T + bias    (2-7)
where x represents the input of the fully connected layer, y_FC represents the output of the fully connected layer, w represents the weights of the fully connected layer, and bias represents the bias value of the fully connected layer.
The main function of the fully connected layer is to map the feature space computed by the preceding convolutional, pooling and other layers into the sample label space. Simply put, it integrates the feature representation into a single value, which reduces the influence of feature positions on the classification result and improves the robustness of the whole network.
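A minimal sketch of formula (2-7) followed by a softmax; the feature and class dimensions are illustrative:

    import torch

    def fully_connected(x, w, bias):
        return x.matmul(w.t()) + bias          # y_FC = x · w^T + bias, formula (2-7)

    x = torch.randn(1, 512)                    # flattened features from the previous layers
    w = torch.randn(10, 512)                   # one weight row per output class
    bias = torch.randn(10)

    scores = fully_connected(x, w, bias)
    probs = torch.softmax(scores, dim=1)       # class probabilities between 0 and 1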
FIG. 4 shows the effect before and after pruning. In practice, the values of the pruned parameters in the neural network are set to zero during the training process so that the neural network adapts to the change, thereby eliminating unnecessary connections between the layers of the neural network. Pruning with either L1-norm or L2-norm regularization requires more iterations than the general method to reach convergence. In addition, the pruning criteria for each layer need to be set manually according to the sensitivity of the layer, and the network parameters need fine-tuning, which may not be feasible for some applications. Neural network model pruning can generally reduce the model size, but by itself it does not improve efficiency, i.e., it does not reduce training or inference time.
In this embodiment, the edge-device-oriented deep neural network model compression module uses a neural network algorithm to accurately analyze big power data while simplifying the inference process of a complex neural network structure. To meet the high accuracy requirements of power information recognition algorithms, the module overcomes the efficiency bottleneck caused by the huge number of neural network parameters: the network is pruned and sparsified to reduce the storage space and computing power occupied by the deep neural network model, and pruning, sparsification and quantization of the model are realized through the network model compression module to obtain the compressed network model.
In this embodiment, the edge-device-oriented deep neural network model compression module adopts network model compression technology, which comprises pruning and sparsification techniques and model quantization techniques. To address the problems of complex deep neural network models, huge parameter volumes and high floating-point computing requirements, the deep neural network model is adjusted and compressed with a pruning-and-quantization approach so that it can fit the limited storage space, memory bandwidth and floating-point computing power of power edge-side devices. The compression optimization of the model mainly uses pruning and quantization. Hierarchical channel pruning separates the layers according to sensitivity, where sensitivity indicates how much the accuracy of the whole network is affected after a convolutional layer is pruned. Research shows that if the convolutional layers with larger sensitivity values are pruned first, the accuracy drop of the network is smaller; therefore the layers are pruned in order of sensitivity from large to small, determined before compression, which gives a smaller accuracy drop than pruning in an arbitrary order. The grouped quantization strategy avoids the loss caused by quantizing the whole network at once and more effectively keeps the accuracy drop small.
The main operations of convolutional neural networks are linear and nonlinear transformations. Assume that, in a neural network, w is the weight vector, a is the input activation vector, σ(·) is a nonlinear function and z is the output activation vector. The convolutional layer is composed of multiple convolution kernels w ∈ R^(C×H×W), where C, H and W are the number of convolution kernel channels, the kernel height and the kernel width, respectively. The computation of a neural network layer is shown in formula (3-1):
z = σ(w^T a)    (3-1)
The goal of neural network model quantization is to use low-precision integer arithmetic to perform the convolution and fully connected layer computations at inference time; therefore the weights and activations of the convolutional and fully connected layers need to be quantized. At inference, the quantized weights and activation values are used as the inputs of the low-precision integer matrix multiplication unit of a convolutional or fully connected layer, and the layer output is then rescaled by multiplying with the step size, as shown in FIG. 7.
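A minimal sketch of symmetric uniform quantization with a step size and of the rescaling after an integer matrix product; the tensor shapes are illustrative, and the product is done in floating point here for simplicity, whereas real hardware would use an integer matrix unit:

    import torch

    def quantize(t, num_bits=8):
        # Map floating-point values to signed integers plus a per-tensor step size.
        qmax = 2 ** (num_bits - 1) - 1
        step = t.abs().max() / qmax
        q = torch.clamp(torch.round(t / step), -qmax, qmax).to(torch.int32)
        return q, step

    weight = torch.randn(16, 64)
    activation = torch.randn(64, 32)

    qw, sw = quantize(weight)
    qa, sa = quantize(activation)

    # Integer product, then rescale the layer output by the product of the two step sizes (cf. FIG. 7).
    out = (qw.to(torch.float32) @ qa.to(torch.float32)) * (sw * sa)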
As shown in FIG. 6, the number of distinct weights is limited by letting multiple connections share the same weight. With the convolutional-layer weights of the AlexNet network quantized to 8 bits and the fully-connected-layer weights quantized to 4 bits, the accuracy loss stays within 0.01%. Ristretto approximates a convolutional neural network with a dynamic fixed-point quantization strategy; with weights and activations both quantized to 8 bits, the accuracy loss remains within 1%.
Learned offset parameters are used for activation quantization to reduce the accuracy loss on network architectures that use activation functions such as Swish. EfficientNet-B0 [20] with 2-bit activation and 2-bit weight quantization on the ImageNet [21] dataset yields an accuracy gain of up to 5.6% over LSQ [22]. DoReFa-Net [23] performs the convolution with bitwise operations using low-bit-width quantized weights and activations, the weights being transformed by a hyperbolic tangent function; with 1-bit weight quantization and 2-bit activation quantization, DoReFa-Net obtains a Top-1 accuracy of 46.1% on the ImageNet validation set. AdaQuant [24] proposes a post-training quantization method based on layer-wise calibration and integer programming; with both activations and weights quantized to 4 bits, the ResNet-50 network loses less than 1% accuracy. Current research on low-bit-width quantization mainly focuses on 4-bit quantization, and 4-bit quantized networks are gradually approaching the classification results of full-precision networks.
Quantizing the network weights or activations to 1 bit is called network binarization, and such a quantized network is also called a binary neural network, first introduced in the pioneering BNN work of Hubara et al., which established an end-to-end gradient back-propagation framework for training discrete binary weights and activations. It can save 32x the memory footprint and obtain up to 64x CPU acceleration. XNOR-Net introduces a real-valued scaling factor multiplied with each binary weight kernel, improving the Top-1 accuracy of the ResNet-18 network to 51.2% and reducing the gap to the real-valued network to 18%. XNOR-Net++ improves the way the scaling factor is computed by treating it as a model parameter that can be learned end-to-end from the target loss, improving accuracy by 5% over XNOR-Net. IR-Net proposes an information-retention network to retain information in the forward activations and backward gradients. Bi-Real Net proposes adding shortcuts to propagate real values along the feature maps, further improving the Top-1 accuracy of the binary ResNet-18 network to 56.4%.
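A minimal sketch of XNOR-Net-style weight binarization, i.e. a real-valued scaling factor times the sign of each weight:

    import torch

    def binarize_xnor(w: torch.Tensor) -> torch.Tensor:
        alpha = w.abs().mean()          # real-valued scaling factor for this weight kernel
        return alpha * torch.sign(w)    # each weight becomes +alpha or -alpha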
Model quantization can generally be divided into two categories: precision reduction and weight sharing. Precision reduction refers to converting high-precision floating-point values into low-bit-width fixed-point values, i.e., approximating a 32-bit floating-point number with an 8-bit fixed-point number or fewer bits; this kind of quantization finally yields a low-bit-width neural network model whose stored and computed parameters are no longer single-precision floating-point values but low-bit-width data. Weight sharing means that when an input picture is convolved with one filter, every position in the picture is convolved by that same filter, so the weights used at all positions are identical, i.e., shared; weight sharing is thus realized within the operation of a single convolutional layer as its kernel scans the picture.
Quantization techniques for deep neural networks are mainly of two types: quantization after full training and quantization during training. Unlike network pruning, which changes the density of the network, quantization changes the diversity of the network's values. A grouped quantization strategy is chosen to avoid the loss caused by quantizing the whole network at once: after one part of the weights is quantized and frozen, the remaining weights are retrained, which effectively keeps the accuracy drop small. Group power-exponent quantization quantizes all full-precision weights into low-precision discrete values consisting of zero and powers of two, which allows a simple shifter to replace multiplication on hardware, greatly reducing the power and computation consumption of the hardware and allowing the layer weights of the network to be stored and computed more efficiently on hardware.
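A minimal sketch of the retraining step between grouping rounds, in which the already-quantized (frozen) group receives no gradient update; the learning rate is illustrative, and the frozen mask is the one returned by the grouped_quantize_step sketch above:

    import torch

    def masked_sgd_step(weight: torch.Tensor, frozen: torch.Tensor, lr: float = 0.01) -> None:
        # Plain SGD update that leaves the frozen (quantized) weights untouched.
        with torch.no_grad():
            weight.grad.mul_(~frozen)
            weight.add_(weight.grad, alpha=-lr)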
And S3, the compressed network model is input to the edge-device-oriented heterogeneous model conversion module, and the network model is converted from X86 to ARM, thereby obtaining an optimized model.
The specific steps of the step S3 include:
the front end of the reasoning process of the model under the traditional X86 instruction set uses a Pythroch frame, the back end realizes the quick execution of operations such as convolution and the like in the network model through the acceleration instruction of an X86 CPU and the GPU parallel acceleration of a CUDA computational library, for the CPU acceleration instruction, the frame generates a target ARM CPU acceleration instruction through acceleration instruction translation, for the calculation acceleration part which originally uses a high-performance computing card and a CUDA, the frame realizes code tuning and optimization and still generates target peripheral codes through TVM IR, realizes embedded SoC on-chip calculation acceleration based on hardware such as NPU and the like, finally ensures that the migrated network model still uses the front end frame of the Pythroch, and utilizes an ARM SIMD unit to carry out instruction set parallel acceleration on the model after primary conversion, reduces deep neural network reasoning time delay, and completes model optimization.
An existing electric power big data model is usually trained by using an X86 platform, a training process is accelerated by means of a complex instruction of an X86 instruction set, a high-performance graph computing card and a corresponding CUDA computing library to generate a network model with high accuracy, the reasoning process does not need too high computing power, the reasoning calculation of the network model can be completed by using low-cost embedded equipment, the embedded equipment is often an ARM architecture and cannot directly run the trained model under the X86 architecture, the migration framework realizes the basic heterogeneous migration of the model by realizing the environmental migration and dependence solution from the X86 instruction set to the ARM instruction set, and fine-grained optimization is carried out on the deep neural network model running process from the instruction set layer, namely the instruction set SIMD technology optimization. On the edge equipment of the ARM architecture, the SIMD unit has the advantages of good universality, low cost and the like, so that the ARM SIMD unit is fully utilized to carry out instruction set parallel acceleration on the model after preliminary conversion, and the reasoning efficiency of the deep neural network model can be effectively improved. The SIMD optimization process can be summarized into two parts of access optimization and calculation optimization, wherein the access optimization is used for arranging input data into a format convenient for SIMD instruction access; the computation optimization accelerates the computation in the deep neural network by using multi-channel multiplication and addition instructions. The SIMD instruction optimization reduces deep neural network reasoning time delay, and therefore execution efficiency of various electric power services based on the deep neural network is improved.
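A minimal sketch of the memory-access part of such SIMD optimization, assuming a numpy NCHW tensor that is repacked into channel blocks so that the values consumed by one multi-lane multiply-add are contiguous in memory; the block width is illustrative:

    import numpy as np

    def pack_channel_blocks(x: np.ndarray, block: int = 4) -> np.ndarray:
        # NCHW -> N, C/block, H, W, block: the innermost axis now holds `block` adjacent
        # channel values, which a SIMD load can fetch in a single contiguous access.
        n, c, h, w = x.shape
        assert c % block == 0
        return x.reshape(n, c // block, block, h, w).transpose(0, 1, 3, 4, 2).copy()

    x = np.random.randn(1, 8, 16, 16).astype(np.float32)
    packed = pack_channel_blocks(x)      # shape (1, 2, 16, 16, 4), contiguous in memory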
In this embodiment, the compressed network model is input to the edge-device-oriented heterogeneous model conversion module, which converts the network model from X86 to ARM. The model conversion is realized through the migration performed by the edge-side heterogeneous conversion module: a model oriented to the ARM architecture is produced and preliminarily deployed, the inference latency of the deep neural network model is optimized using resources provided by the edge-side device such as the SIMD unit and heterogeneous multi-core, finer-grained model conversion is completed, and the execution efficiency of various applications is improved.
In this embodiment, the edge-device-oriented heterogeneous model conversion module includes network model migration and runtime optimization. A model conversion method based on heterogeneous instruction sets is studied for hardware resources of the edge-side device such as the ARM instruction-set architecture, multi-core multi-instruction parallelism and customized accelerators; fine-grained parallelism and architecture-oriented optimization are applied to the converted model, finally improving the execution efficiency of the computing tasks, fully exploiting the computing power of the edge side and further improving the usability of various deep neural network models. Model conversion across instruction sets can thus be achieved, the converted network model can be deployed directly onto edge-side power devices, and once the deep neural network has landed on power edge-side devices it can be applied to more power scenarios.
And S4, the optimized model is input to the edge-side device.
The optimized deep neural network model is applied to fault diagnosis of edge power equipment and the power system. Online monitoring, offline test and operation-and-maintenance data from the power production process are fully utilized, and the fault (abnormal) state characteristics of the edge power equipment and the system are mined from the bottom-layer raw data without relying on signal processing techniques or manual diagnosis experience, thereby improving the accuracy of fault diagnosis and the adaptability to newly added fault categories.
The working principle of the invention is as follows:
the invention relates to a deep neural network model compression and heterogeneous conversion system, which comprises: the deep neural network model compression and heterogeneous conversion system comprises a deep neural network model compression module facing to edge-side equipment and a heterogeneous model conversion module facing to edge-side equipment, the model is often modified appropriately according to the change of a new scene along with the operation of the smart grid, the requirement of the whole model optimization process on automation is urgent, end-to-end deployment of the whole model compression and heterogeneous conversion system is realized by means of technologies such as the model compression module, an automatic script of model heterogeneous conversion, a Docker container and the like, the deployment problem of the deep neural network model on low-computing-force power edge-side equipment is effectively solved, and the application range of artificial intelligence in a power scene is improved:
the system comprises an offline system and an online system.
The offline system is based on the Python programming language and the PyTorch deep learning module. An optimized deep neural network model is trained with the pruning-and-quantization method: the offline system takes the original network model as input and outputs the optimized deep neural network model, which has the characteristics of sparsity and fixed-point representation; the optimized network model structure is stored as a Python file and the network model parameters are stored as a numpy file.
The online system takes the network model structure file and the network model parameter file as input and outputs an executable high-performance deep neural network model for the edge-side device; it additionally outputs the inference latency and inference throughput of the network model after the preliminary conversion and after the deep optimization.
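A minimal sketch of the two interface points described above: storing the parameters as a numpy file and measuring inference latency and throughput. The file name, input shape and run count are illustrative, and the timing loop here wraps the PyTorch model for simplicity; on the device the same loop would wrap the converted runtime module:

    import time
    import numpy as np
    import torch

    # Offline side: store the optimized parameters as numpy arrays, one entry per tensor.
    np.savez("model_params.npz",
             **{name: t.cpu().numpy() for name, t in model.state_dict().items()})

    # Online side: measure inference latency and throughput.
    x = torch.randn(1, 3, 224, 224)
    runs = 100
    with torch.no_grad():
        start = time.time()
        for _ in range(runs):
            model(x)
        elapsed = time.time() - start

    print("latency: %.2f ms, throughput: %.1f inferences/s"
          % (1000 * elapsed / runs, runs / elapsed))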
It should be emphasized that the examples described herein are illustrative and not restrictive, and thus the present invention includes, but is not limited to, those examples described in this detailed description, as well as other embodiments that can be derived from the teachings of the present invention by those skilled in the art and that are within the scope of the present invention.

Claims (4)

1. An end-to-end deep neural network model compression and heterogeneous conversion system, characterized by comprising:
the deep neural network model compression module, which is used for pruning the network model to be optimized, retraining the network model according to the network characteristics, obtaining a compressed network model with small accuracy loss by using several model quantization compression methods, and inputting the compressed network model to the edge-device-oriented heterogeneous model conversion module;
and the edge-device-oriented heterogeneous model conversion module, which is used for converting the network model from X86 to ARM and automating the heterogeneous conversion of the network model.
2. An end-to-end deep neural network model compression and heterogeneous conversion method, characterized by comprising the following steps:
S1, at runtime, the end-to-end system receives a network model to be optimized that is used for power big data analysis, and the network model to be optimized is input to the edge-device-oriented deep neural network model compression module;
S2, the edge-device-oriented deep neural network model compression module performs pruning, sparsification and quantization of the network model to be optimized, thereby obtaining a compressed network model;
S3, the compressed network model is input to the edge-device-oriented heterogeneous model conversion module, and the network model is converted from X86 to ARM, thereby obtaining an optimized model;
and S4, the optimized model is input to the edge-side device.
3. The end-to-end deep neural network model compression and heterogeneous conversion method according to claim 2, characterized in that the specific method of step S2 comprises:
during training of the network to be optimized, a threshold is set to judge whether each network connection weight is important and how important it is; unimportant connection weights are cut off by setting them to zero, the pruned network is then retrained, further parameter weights are cut, and this process is repeated until a compressed network model with few parameter weights is finally obtained.
4. The end-to-end deep neural network model compression and heterogeneous conversion method according to claim 2, characterized in that the specific method of step S3 is:
generating target ARM CPU acceleration instructions through acceleration-instruction translation;
performing code tuning through TVM IR while still generating target peripheral code, realizing embedded SoC on-chip computation acceleration based on hardware such as an NPU, and finally ensuring that the migrated network model still uses the PyTorch front-end framework;
and using the ARM SIMD unit to perform instruction-set parallel acceleration on the preliminarily converted model, reducing the deep neural network inference latency and completing the model optimization.
CN202211292482.3A 2022-10-21 2022-10-21 End-to-end deep neural network model compression and heterogeneous conversion system and method Pending CN115759237A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211292482.3A CN115759237A (en) 2022-10-21 2022-10-21 End-to-end deep neural network model compression and heterogeneous conversion system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211292482.3A CN115759237A (en) 2022-10-21 2022-10-21 End-to-end deep neural network model compression and heterogeneous conversion system and method

Publications (1)

Publication Number Publication Date
CN115759237A true CN115759237A (en) 2023-03-07

Family

ID=85352523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211292482.3A Pending CN115759237A (en) 2022-10-21 2022-10-21 End-to-end deep neural network model compression and heterogeneous conversion system and method

Country Status (1)

Country Link
CN (1) CN115759237A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116128046A (en) * 2023-04-14 2023-05-16 杭州国芯科技股份有限公司 Storage method of multi-input neural network model serial block of embedded equipment
CN116166341A (en) * 2023-04-25 2023-05-26 中国人民解放军军事科学院***工程研究院 Static cloud edge collaborative architecture function calculation unloading method based on deep learning
CN116702861A (en) * 2023-06-19 2023-09-05 北京百度网讯科技有限公司 Compression method, training method, processing method and device of deep learning model
CN116702861B (en) * 2023-06-19 2024-03-01 北京百度网讯科技有限公司 Compression method, training method, processing method and device of deep learning model
CN116841911A (en) * 2023-08-24 2023-10-03 北京智芯微电子科技有限公司 Heterogeneous platform-based model test method, heterogeneous chip, equipment and medium
CN116841911B (en) * 2023-08-24 2024-01-16 北京智芯微电子科技有限公司 Heterogeneous platform-based model test method, heterogeneous chip, equipment and medium

Similar Documents

Publication Publication Date Title
US10983754B2 (en) Accelerated quantized multiply-and-add operations
Goel et al. A survey of methods for low-power deep learning and computer vision
US20210089922A1 (en) Joint pruning and quantization scheme for deep neural networks
CN110175671B (en) Neural network construction method, image processing method and device
CN109949255B (en) Image reconstruction method and device
CN110084281B (en) Image generation method, neural network compression method, related device and equipment
CN115759237A (en) End-to-end deep neural network model compression and heterogeneous conversion system and method
CN113326930B (en) Data processing method, neural network training method, related device and equipment
WO2023231794A1 (en) Neural network parameter quantification method and apparatus
CN112529146B (en) Neural network model training method and device
WO2018228399A1 (en) Computing device and method
CN112733693B (en) Multi-scale residual error road extraction method for global perception high-resolution remote sensing image
CN117437494B (en) Image classification method, system, electronic equipment and storage medium
CN112183742A (en) Neural network hybrid quantization method based on progressive quantization and Hessian information
CN110647990A (en) Cutting method of deep convolutional neural network model based on grey correlation analysis
Lyu et al. A GPU‐free real‐time object detection method for apron surveillance video based on quantized MobileNet‐SSD
Kulkarni et al. AI model compression for edge devices using optimization techniques
CN117636298A (en) Vehicle re-identification method, system and storage medium based on multi-scale feature learning
CN115376195B (en) Method for training multi-scale network model and face key point detection method
CN116109868A (en) Image classification model construction and small sample image classification method based on lightweight neural network
WO2023059723A1 (en) Model compression via quantized sparse principal component analysis
CN113378866B (en) Image classification method, system, storage medium and electronic device
CN113919479B (en) Method for extracting data features and related device
CN114677545A (en) Lightweight image classification method based on similarity pruning and efficient module
CN113743593A (en) Neural network quantization method, system, storage medium and terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination