CN117195961A - Improved multi-scale weight sharing convolution method

Info

Publication number: CN117195961A
Application number: CN202311042653.1A
Authority: CN
Inventors: 陈彦如; 程健峰; 吴迪智; 罗富玮; 金正�; 陈是澎; 袁道华; 陈良银
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2023-08-18
Filing date: 2023-08-18
Publication date: 2023-12-08

Abstract

The present patent relates to an improved multi-scale weight sharing convolutional network that effectively captures features on different time scales and introduces a multi-head attention mechanism and an improved variational autoencoder (VAE) to enhance the performance of the model. The multi-channel 1D CNN model provided by the patent uses a plurality of independent convolution kernels to perform a convolution operation on each channel, and the results of the channels are merged to improve the performance and accuracy of the model. To reduce the number of parameters while preserving accuracy, a multi-channel local weight sharing scheme is adopted, and multi-scale convolution kernels are proposed to adapt to features on different time scales. The improved VAE models the attention output as a whole, and a KL divergence optimization method is used to improve feature extraction. The innovation of the patent lies in combining multi-scale weight sharing, the multi-channel 1D CNN, the multi-head attention mechanism and the improved VAE, which improves model performance and accuracy and suits fields such as factory data analysis.

Description

Improved multi-scale weight sharing convolution method

Technical Field

The present application relates to a multi-scale weight sharing convolutional network for processing multivariate time series data in a specific application scenario. The network structure combines a multi-scale convolutional neural network, a multi-head attention mechanism and an improved variational autoencoder (VAE), and aims to improve feature extraction capability and accuracy on multivariate time series data.

Background

In factory data analysis, multivariate time series data are widespread, but a traditional one-dimensional convolutional neural network (1D CNN) is only suitable for single-channel time series data and cannot fully utilize the information of multiple associated channels. Furthermore, a fixed-size convolution kernel can only capture features of a particular scale and may miss critical information in time series data with long-term dependencies. Meanwhile, the traditional global weight sharing method suffers from insufficient learning in multi-channel convolution.

Disclosure of Invention

The present application proposes an improved multi-scale weight sharing convolution method for processing multivariate time series data. The method comprises the following parts:

Part one: the multi-scale weight sharing convolutional network. The invention provides a multi-scale convolutional neural network model for processing multivariate time series data. The model can effectively capture features on different time scales. By using smaller convolution kernels to capture short-term timing features and larger convolution kernels to capture long-term timing features, the model characterizes time series data more fully.

Part two: the multi-head attention mechanism. The present invention also introduces a multi-head attention mechanism for flexibly focusing on interactions between different data dimensions. With multi-head attention, the model better captures the relationships between dimensions, which improves its ability to model multivariate time series data.

Part three: the improved variational autoencoder. To better model the attention output and improve feature extraction, the present invention uses an improved variational autoencoder (VAE) to model the attention output as a whole from a global perspective. A KL divergence optimization method is introduced to improve the training of the VAE model. In addition, the probability distribution of the VAE model is used for anomaly interpretation, which enhances the anomaly detection capability of the model.

In summary, the present invention proposes an improved multi-scale weight sharing convolution method built on multi-channel local weight sharing and multi-scale convolution kernels. To make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments are described in detail below with reference to the accompanying figures.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a block diagram of an electronic terminal according to a preferred embodiment of the present invention.

Fig. 2 is a flowchart of a data processing method based on a situation change model according to a preferred embodiment of the present invention.

Detailed Description

Because most of the data in the factory scenario are multivariate time series, the traditional 1D CNN model, which is only applicable to single-channel time series data, cannot fully utilize the information of multiple associated channels. The multi-channel 1D CNN model performs a convolution operation on each channel using a plurality of independent convolution kernels and then merges the convolution results of all channels. This extracts timing features from multiple dimensions at the same time and, through the merging operation, makes full use of the information in all channels, improving the performance and accuracy of the model. To achieve this, the algorithm creates a multi-channel kernel k that processes each channel of the input data independently. The convolution of the multi-channel kernel k over the multi-channel data x is:

where $n_c$ is the number of channels in the data and $x_{h,t-i+1}$ is the measurement of the h-th channel at time step $t-i+1$. Convolution on multi-channel data is almost the same as convolution on a single channel; the only difference is the additional summation over all channels. However, this section notes that multi-channel convolution assigns a convolution kernel to every channel, and each kernel carries many weights, so the model has a very large number of parameters, which slows training and increases the risk of overfitting.
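The multi-channel convolution described above can be sketched in plain Python. This is a minimal illustration, not the patent's implementation: the function name `multichannel_conv1d`, the kernel length `m`, the "valid" (no padding) boundary handling and the sample values are all assumptions made for the example.

```python
def multichannel_conv1d(x, k):
    """Valid-mode multi-channel 1D convolution (illustrative sketch).

    x: list of n_c channels, each a list of T measurements.
    k: list of n_c kernels, each a list of m weights (one kernel per channel).
    Returns a list of length T - m + 1.
    """
    n_c = len(x)
    m = len(k[0])
    T = len(x[0])
    out = []
    for t in range(m - 1, T):            # 0-based index of the newest sample in the window
        total = 0.0
        for h in range(n_c):             # the additional summation over all channels
            for i in range(m):           # kernel tap i pairs with the sample i steps back
                total += k[h][i] * x[h][t - i]
        out.append(total)
    return out


# Hypothetical data: two channels, four time steps, kernels of length 2.
x = [[1.0, 2.0, 3.0, 4.0],
     [0.0, 1.0, 0.0, 1.0]]
k = [[1.0, -1.0],
     [0.5, 0.5]]
y = multichannel_conv1d(x, k)
print(y)
```

The innermost two loops hold one weight per channel per tap, which is where the parameter growth noted in the text comes from.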

In addition, when the existing model uses 1D CNN to process time sequence data, a common usage mode is to use convolution kernels with fixed sizes to extract the characteristics of each subsequence and capture time and space information. However, this section notes that since 1D CNN can only capture local timing features, for timing data that is long-term dependent, 1D CNN may not capture critical information. Therefore, a convolution kernel of a fixed size can only capture features of a specific scale and cannot adapt to different time scales.

In view of the above, this section proposes an improved multi-scale weight sharing convolution method. To balance the parameter count against feature extraction quality, a multi-channel local weight sharing method is provided. To address the inability to adapt to different time scales, a multi-scale convolution kernel method is provided.

First, the multi-channel local weight sharing method proposed in this section is introduced. Shared weights mean that, in a multi-channel convolutional neural network, the convolution kernels at different locations use the same parameters. This greatly reduces the number of parameters of the network, because the parameters of all kernels are compressed to the number needed for a single kernel. In a multi-channel convolutional neural network, each convolution kernel has to convolve with multiple channels and therefore requires more parameters. With shared weights, the kernels at the same locations on all channels use the same parameters, which greatly reduces the parameter count and effectively addresses the large parameter count of multi-channel convolutional neural networks.

Specifically, assume that one convolution layer contains k convolution kernels, each of size $h \times w$, with $c_{in}$ input channels and $c_{out}$ output channels. Without weight sharing, the number of parameters of the layer is $k \times h \times w \times c_{in} \times c_{out}$. With weight sharing, the number of parameters of the layer is only $k \times h \times w$. Weight sharing therefore greatly reduces the number of parameters, which reduces the complexity of the model and improves its training speed and generalization performance.
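As a worked example of the two parameter counts stated above, the following sketch evaluates both expressions for one layer. The layer dimensions are hypothetical and chosen only to make the arithmetic concrete.

```python
def param_counts(k, h, w, c_in, c_out):
    """Return (unshared, shared) parameter counts using the two expressions in the text."""
    unshared = k * h * w * c_in * c_out   # no weight sharing
    shared = k * h * w                    # global weight sharing
    return unshared, shared


# Hypothetical layer: 8 kernels of size 1x3, 10 input channels, 4 output channels.
unshared, shared = param_counts(k=8, h=1, w=3, c_in=10, c_out=4)
print(unshared, shared, unshared // shared)
```

For these assumed dimensions the shared layer needs 24 parameters instead of 960, a reduction by the factor `c_in * c_out`.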

However, this section notes that global sharing may lead to insufficient learning. Because all convolution kernels use the same weight matrix, the parameter count is greatly reduced, but channels whose raw data differ greatly in type are learned poorly. That is, if the data characteristics differ strongly between dimensions before the data is input, a globally shared convolution kernel cannot adapt to all of them, so the model learns insufficiently and cannot extract features accurately.

In response to this problem, this section proposes a local weight sharing method. Specifically, local free weight sharing only performs weight sharing in specific areas, so that each area can freely adjust the weight, and therefore the characteristics of the data are better adapted. That is, conventional global weight sharing is achieved by sharing one parameter matrix, while local free weight sharing is achieved by decomposing a weight matrix into a plurality of local weight matrices, each of which can be independently adjusted. The structure ensures good feature extraction capability while reducing the number of parameters. Traditional calculation formula for global weight sharing:

$$y_{i,j,k} = \sum_{m,n,l} W_{m,n,l,k}\, X_{i+m,\,j+n,\,l}$$

local free weight sharing formula:

where the weight term denotes the weights of the local weight matrix located at coordinates (i, j). To take full advantage of multi-channel free weight sharing, this chapter further proposes analyzing the correlation between the data first and letting strongly correlated dimensions use the same local weight matrix, which reduces the parameter count without excessive loss of feature extraction capability. For example, in the data set used in this chapter, the features AIT502, UV401, AIT501, and AIT502 all exhibit large jumps and, over periods of 2000 time points, show relatively obvious similarity, so the same local free weight sharing can be used for these channels.
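One way to read the correlation-based sharing described above is that channels are assigned to groups and every channel in a group reuses that group's kernel. The sketch below illustrates this reading only; the grouping `group_of`, the kernels and the data are hypothetical, and the patent's actual local weight formula is not reproduced here.

```python
def conv1d_valid(series, kernel):
    """Single-channel valid-mode 1D convolution."""
    m = len(kernel)
    return [sum(kernel[i] * series[t - i] for i in range(m))
            for t in range(m - 1, len(series))]


def grouped_shared_conv(x, group_of, group_kernels):
    """Sum per-channel convolutions, where channels in the same group share one kernel."""
    n_out = len(x[0]) - len(group_kernels[0]) + 1
    out = [0.0] * n_out
    for h, series in enumerate(x):
        kernel = group_kernels[group_of[h]]   # shared within the group, free across groups
        for t, value in enumerate(conv1d_valid(series, kernel)):
            out[t] += value
    return out


x = [
    [1.0, 2.0, 3.0],   # channel 0
    [2.0, 4.0, 6.0],   # channel 1, assumed strongly correlated with channel 0
    [5.0, 5.0, 5.0],   # channel 2, assumed to behave differently
]
group_of = [0, 0, 1]                        # channels 0 and 1 share group 0
group_kernels = [[1.0, -1.0], [0.5, 0.5]]   # one kernel per group

y = grouped_shared_conv(x, group_of, group_kernels)
n_params_local = sum(len(kernel) for kernel in group_kernels)
n_params_independent = len(x) * len(group_kernels[0])
print(y, n_params_local, n_params_independent)
```

In this toy setting the grouped layer stores 4 weights instead of the 6 needed for fully independent kernels, while still letting the dissimilar channel keep its own kernel.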

To address the problem that fixed convolution kernels can only capture features of a specific scale, this section proposes a multi-scale convolution kernel (Multi-scale Convolutional Kernel) to extract highly expressive latent features on different time scales from a data set. This section constructs three bypass convolution layers with different convolution kernel sizes, each using the multi-channel convolution method described in the first module above. The reason is that factory data contains short-term regularities (on the scale of a single sensor, such as the transient when a product enters or exits a process), mid-term patterns (the production steps occurring within a single process), and long-term variations (the entire cycle from one product to the next, including transportation, production and detection).

To cover these three cases, and based on the characteristics of the data set used in this section, this section selects Conv1D convolution kernels of size 3 for the short period, size 7 for the middle period and size 15 for the long period, all sliding along the time axis. The multi-scale convolution module can therefore extract data trends and hidden interactions with periods of 3, 7 and 15 time units in the sequence at the same time. Furthermore, the size of a one-dimensional convolution kernel affects what the network learns. A small kernel favors the detection of point anomalies, because point anomalies are short (caused by a single point in time) and local feature information is captured adequately. A large kernel favors longer anomaly types such as pattern anomalies, because pattern anomalies last long (caused by a series of consecutive points). The multi-scale design of the proposed model thus combines the advantages of large and small kernels and helps the model learn jointly on different scales. At the end of the module, the feature vectors obtained after convolution and pooling are concatenated to form a new global feature vector matrix. In this way, the feature maps extracted from three different periods are fused, and the subsequent attention mechanism adaptively extracts useful information from these hierarchical features.

After extracting features with convolution layers of different scales, this section applies max pooling to each feature map $c_k$ for feature selection and downsampling, which reduces the spatial dimension of the data while preserving the most significant features:

where $c_k$ denotes the feature map extracted with a kernel of size k, and the pooled result is the downsampled feature map. The three feature vectors obtained after the convolution and pooling operations are concatenated into a global feature vector matrix T. The global feature vector matrix T is then input to a multi-head attention mechanism for further processing.
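The multi-scale branch structure described above (kernel sizes 3, 7 and 15, followed by max pooling and concatenation into T) can be sketched as follows. This is a single-channel illustration under stated assumptions: zero "same" padding so that all branches have equal length, a pooling window of 2, and all-ones placeholder kernels in place of learned weights. None of these choices is specified by the source.

```python
def conv1d_same(series, kernel):
    """1D convolution with zero padding so the output length equals the input length."""
    m = len(kernel)                       # assumed odd
    pad = (m - 1) // 2
    padded = [0.0] * pad + list(series) + [0.0] * pad
    return [sum(kernel[i] * padded[t + m - 1 - i] for i in range(m))
            for t in range(len(series))]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling with the given window size."""
    return [max(feature_map[j:j + size])
            for j in range(0, len(feature_map) - size + 1, size)]


def multi_scale_features(series, kernel_sizes=(3, 7, 15)):
    """Run one convolution branch per kernel size and stack the pooled feature maps."""
    rows = []
    for k in kernel_sizes:
        kernel = [1.0] * k                # placeholder weights; a trained model learns these
        rows.append(max_pool(conv1d_same(series, kernel)))
    return rows                           # one row per scale: the matrix T in this sketch


series = [float(v) for v in range(32)]    # hypothetical sensor readings
T = multi_scale_features(series)
print(len(T), [len(row) for row in T])
```

Each row of `T` summarizes the same signal at a different temporal scale, which is the form the attention stage consumes in this sketch.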

To capture correlation features between dimensions, this chapter proposes using a multi-head attention mechanism to further extract useful information from the convolutional features obtained in stage one.

Although the attention mechanism achieves better effects in different fields, when facing high-dimensional data, the single-head attention mechanism has the problems of large calculation amount, easy overfitting and the like. The advantage of the multi-headed attention mechanism is that it enables flexible combination and aggregation of input sequences, more suitable for modeling and processing long and multi-dimensional sequences. Meanwhile, the multi-head attention mechanism can pay attention to different time steps, and the robustness and the stability of the model are improved.

To further extract the inter-dimension correlation with the multi-head attention mechanism, this section uses the global feature vector matrix T obtained from the convolutional network to initialize the matrices Q, K and V, which are the key parameters of the single-head attention mechanism. The core of the single-head attention mechanism is scaled dot-product attention (SDA): first compute the dot product of Q and K to measure similarity, then divide by $\sqrt{d_k}$ ($d_k$ is the dimension of matrix K) so that the dot product does not become too large. The result is normalized by the Softmax function and multiplied by the matrix V to obtain the attention output. The SDA is computed as:

$$\mathrm{SDA}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
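The scaled dot-product attention steps described above (dot product of Q and K, scaling by the square root of $d_k$, Softmax, multiplication by V) can be sketched with plain Python lists. The helper names and the small input matrices are assumptions for illustration only.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def softmax(row):
    """Numerically stable softmax over one row."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Return (attention output, attention weights)."""
    d_k = len(K[0])                                  # key dimension
    K_t = [list(col) for col in zip(*K)]             # transpose of K
    scores = matmul(Q, K_t)                          # similarity of every query to every key
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights


# Hypothetical 2x2 input used for Q, K and V alike.
M = [[1.0, 0.0],
     [0.0, 1.0]]
output, weights = scaled_dot_product_attention(M, M, M)
print(weights)
```

Each row of `weights` sums to 1, and each query attends most strongly to the key it is most similar to.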

Unlike the standard attention mechanism, the multi-head attention mechanism introduces multiple combinations of queries, keys and values, which strengthens its ability to extract different information from an input sequence. The idea is to apply linear transformations with different parameter matrices $W_i^Q$, $W_i^K$ and $W_i^V$ to the matrices Q, K and V, feed the transformed matrices into SDA, and denote the result by $\mathrm{head}_i$:

$$\mathrm{head}_i = \mathrm{SDA}(QW_i^Q,\; KW_i^K,\; VW_i^V)$$

The results $\mathrm{head}_1$ to $\mathrm{head}_h$ are concatenated into one matrix, which is multiplied by a parameter matrix W to complete the final linear transformation and obtain the final output of the multi-head attention mechanism:

$$\mathrm{Head} = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W$$
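The multi-head computation above (per-head linear transformations, SDA, concatenation, final multiplication by W) can be sketched as follows. The weights are random placeholders rather than trained parameters, and the names `multi_head_attention`, `n_heads` and `d_head` as well as the 3x4 input are assumptions made for this illustration.

```python
import math
import random


def matmul(a, b):
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sda(Q, K, V):
    """Scaled dot-product attention."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)


def random_matrix(rows, cols, rng):
    """Placeholder for a learned parameter matrix."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(Q, K, V, n_heads, d_head, rng):
    d_model = len(Q[0])
    heads = []
    for _ in range(n_heads):
        W_q = random_matrix(d_model, d_head, rng)   # per-head query projection
        W_k = random_matrix(d_model, d_head, rng)   # per-head key projection
        W_v = random_matrix(d_model, d_head, rng)   # per-head value projection
        heads.append(sda(matmul(Q, W_q), matmul(K, W_k), matmul(V, W_v)))
    # Concatenate the heads row by row, then apply the output projection W.
    concat = [[v for head in heads for v in head[r]] for r in range(len(Q))]
    W_o = random_matrix(n_heads * d_head, d_model, rng)
    return matmul(concat, W_o)


# Hypothetical feature matrix standing in for T (3 rows, 4 features per row).
T = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.1, 0.0, 0.2],
     [0.3, 0.3, 0.9, 0.1]]
out = multi_head_attention(T, T, T, n_heads=2, d_head=2, rng=random.Random(0))
print(len(out), len(out[0]))
```

Using T for Q, K and V mirrors the initialization described in the text; the output keeps the shape of T, so it can be passed on to the next stage unchanged in size.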

To model the multivariate information as a whole from a global point of view while better capturing dependencies between different dimensions, this section proposes mapping the output of the attention mechanism into the latent space of a VAE model. The stochastic nature of the VAE lets it learn the characteristics of the data from the latent space, including the distribution and variation patterns of normal data, and thus adapt better to complex data distributions. The reconstruction probability of the VAE also provides anomaly interpretability.

However, this section notes that when a conventional VAE is trained with an optimization method such as stochastic gradient descent (SGD), the encoder tends to map each input to a fixed point in its latent distribution, so the KL divergence term of the VAE becomes constant, no longer depends on the latent variables, and cannot be optimized effectively. To address this problem, this section proposes an improved VAE with a modified ELBO expression and reparameterization-based optimization.

The basic theory of the VAE is as follows. Let X be the input data and z the latent variable. The VAE aims to learn a conditional distribution $p_\theta(z|X)$ so that, for a given X, it can sample from the latent space z and generate new samples similar to X. To achieve this, the VAE first maps the input data X into the latent space z, which yields the encoder $q_\phi(z|X)$. It then samples a latent variable z from $q_\phi(z|X)$ and decodes it into data X' similar to X, which yields the decoder $p_\theta(X|z)$. To ensure that the data X' generated by the decoder is similar to the input X, the VAE introduces a reconstruction error term representing the difference between X and X'. At the same time, to give the learned latent space z structure and continuity, the VAE introduces a regularization term based on the prior distribution p(z) of the latent variable z.

Since computing $p_\theta(X|z)$ directly is difficult, the VAE trains the model using a technique called variational inference (Variational Inference). Specifically, the loss function of the VAE (the ELBO) is:

$$\mathrm{ELBO} = \mathbb{E}_{q_\phi(z|X)}\big[\log p_\theta(X|z)\big] - \mathrm{KL}\big(q_\phi(z|X)\,\|\,p(z)\big)$$

where $\mathbb{E}_{q_\phi(z|X)}[\cdot]$ denotes the expectation over z given an input X, and $\mathrm{KL}(q_\phi(z|X)\,\|\,p(z))$ is the KL divergence between the posterior distribution $q_\phi(z|X)$ and the prior distribution $p(z)$. By optimizing this objective, the VAE can learn the latent distribution of the data and generate new samples from it. However, this section notes that in practice, if the KL divergence term is weighted too heavily, or if the distribution of the data set differs greatly from the prior, the model tends to map the input samples to a small region of the latent space and ignores the diversity between samples. Specifically, the KL divergence term in the VAE objective penalizes the difference between the latent distribution and the prior. To minimize the KL divergence, the model pushes the learned latent distribution toward the prior, which typically gathers the sample points in the latent space into a compact cluster. The samples that the generator network decodes from these points then tend to be similar or repetitive, because the model maps inputs to regions close to the prior to minimize the KL term rather than exploring a broader distribution in the latent space.

To address this problem, this section optimizes the ELBO first term (i.e., the reconstruction error) and gives it more weight. Modified ELBO definition:

$$\mathrm{ELBO} = \mathbb{E}_{q_\phi(z|X)}\big[\log p_\theta(X|z)\big] - \beta\,\mathrm{KL}\big(q_\phi(z|X)\,\|\,p(z)\big)$$

where β is a hyperparameter that controls the relative importance of the reconstruction error and the KL divergence term. Adjusting β balances the reconstruction accuracy of the model against the continuity of the latent space.
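The effect of β on the modified ELBO can be illustrated with a small numeric sketch. It assumes a diagonal Gaussian encoder, a standard normal prior (for which the KL divergence has a closed form) and a squared-error stand-in for the log-likelihood term; these modelling choices and all numbers are assumptions, not details given in the source.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))


def beta_elbo(x, x_recon, mu, sigma, beta):
    """Reconstruction term minus beta times the KL term (one-sample estimate)."""
    # Assumed Gaussian decoder with unit variance: log-likelihood up to a constant.
    log_lik = -0.5 * sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return log_lik - beta * kl_to_standard_normal(mu, sigma)


# Hypothetical encoder output and a perfect reconstruction.
x = [1.0, 2.0]
x_recon = [1.0, 2.0]
mu = [1.0, 0.0]
sigma = [1.0, 1.0]

elbo_standard = beta_elbo(x, x_recon, mu, sigma, beta=1.0)
elbo_relaxed = beta_elbo(x, x_recon, mu, sigma, beta=0.5)
print(elbo_standard, elbo_relaxed)
```

Choosing β below 1 shrinks the KL penalty relative to the reconstruction term, which matches the stated intent of giving the reconstruction error more weight.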

Furthermore, there are two methods of computing the ELBO: the maximum likelihood method and the stochastic gradient descent (SGD) algorithm. The maximum likelihood method is difficult to optimize directly, because the gradient of the likelihood function usually has to be computed over the whole data set, which is costly. At the same time, sampling directly from the Gaussian distribution is not a differentiable operation, so conventional SGD cannot be used for training directly: the randomness of the sampling step is not captured by the gradient, and training is difficult to converge. To solve this problem, this section proposes a stochastic gradient variational estimation method based on the reparameterization technique. The idea is to separate the sampling process from the differentiable operations, turning the non-differentiable sampling operation into a differentiable one so that conventional SGD can be applied directly. Specifically, the original random variable z is re-expressed by introducing a new random variable $\epsilon \sim \mathcal{N}(0, I)$, so that z can be rewritten as $z(\epsilon) = \mu_z + \sigma_z \epsilon$, where $\mu_z$ and $\sigma_z$ are the mean and standard deviation of z. In this way, the computation of $z(\epsilon)$ is separated from the sampling step, can be treated as a deterministic operation, and can be optimized directly with SGD.
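The reparameterization step described above can be sketched in a few lines. The function name `reparameterize` and the example values are assumptions; a real training loop would use an automatic differentiation framework, which is not shown here.

```python
import random


def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I); return both z and eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]          # all randomness lives in eps
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return z, eps


rng = random.Random(0)
mu = [1.0, -2.0]
sigma = [0.5, 2.0]
z, eps = reparameterize(mu, sigma, rng)

# Given eps, z is a deterministic function of mu and sigma:
# dz/dmu = 1 and dz/dsigma = eps, so gradients can flow to the encoder outputs.
z_zero_noise, _ = reparameterize(mu, [0.0, 0.0], random.Random(0))
print(z, z_zero_noise)
```

Because the noise is sampled independently of the parameters, the gradient of z with respect to `mu` and `sigma` is well defined for every draw.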

Claims

1. A multi-scale weight sharing convolutional network that captures inter-dimensional correlations, comprising:

the multichannel 1D CNN module is used for carrying out convolution operation on multichannel time sequence data and merging the results together;

a multi-scale convolution kernel module comprising a plurality of convolution kernels of different sizes for extracting potential features of different time scales;

a multi-headed attention mechanism module for focusing on interactions between different data dimensions;

an improved variational autoencoder.

2. The multi-channel convolutional network of claim 1, wherein the multi-channel 1D CNN module convolves each channel using a plurality of independent convolution kernels and merges the results together:

3. the multi-scale convolution network of claim 1, wherein the multi-scale convolution kernel module comprises at least two convolution kernels of different sizes for extracting potential features of different time scales:

4. the local weight sharing convolutional network of claim 1, wherein local free weight sharing only performs weight sharing in specific areas, so that each area can freely adjust weight, thereby better adapting to the characteristics of data; that is, the conventional global weight sharing is realized by sharing one parameter matrix, while the local free weight sharing is realized by decomposing the weight matrix into a plurality of local weight matrices, and each local weight matrix can be independently adjusted; the structure ensures good feature extraction capability while reducing the parameter quantity:

5. A method for capturing inter-dimension correlation features and providing anomaly interpretability, comprising the steps of:

further extracting global feature vectors in the convolutional network using a multi-headed attention mechanism, wherein the multi-headed attention mechanism calculates similarity by scaling the dot product attention mechanism and introduces a combination of multiple queries, keys and values for linear transformation and the attention mechanism;

maintaining a distribution function of data points for indicating data points of interest to a next encoder

Updating the distribution function weights of the data points according to the mean square error between the input and the decoder output:

6. an improved enhancement encoder as in claim 3, for neural network training, comprising:

and the weighted sampling method is used for focusing on the abnormal sample and the normal sample, so that the learning ability of the feature expression is improved:

the adaptive output weight is given according to the reconstruction error of the encoder, the performance difference of each encoder is captured, and the detection performance is improved:

7. a combined deep and shallow training method as claimed in claim 2, for improving the detection accuracy of the model, comprising:

using the generator to generate dummy data to confuse the model, minimizing differences from the real data;

using adversarial training so that the model discriminates between real data and reconstructed data, maximizing the difference.

8. A method of merged training for simultaneously training the enhancement encoder and performing adversarial training, comprising:

combining the training process of the enhancement encoder with that of the adversarial training, reducing the number of iterations;

using an adaptive weight function to control the weights of the enhancement encoder and the adversarial training, so that the stability and the accuracy of the training are improved:

9. a method of ensemble learning as claimed in claim 3, a method of calculating encoder importance based on error values, comprising:

evaluating the importance of the encoder according to the encoder reconstruction error value;

determining the relative weight of the encoder by using the magnitude of the reconstructed error value;

the reconstructed error value is used as an index to adjust the contribution degree of the encoder, so that the performance of the model is improved.

10. A method of distribution function based sample selection as recited in claim 3, for enhancing training of an encoder, comprising:

maintaining a distribution function to indicate data points of interest to the encoder;

sample selection is carried out according to the distribution function weight of the data points, and important data points are concerned;

and the attention degree and the learning ability of the model to different samples are improved by dynamically adjusting the distribution function weight of the data points.