CN112329865A - Data anomaly identification method and device based on self-encoder and computer equipment - Google Patents

Data anomaly identification method and device based on self-encoder and computer equipment

Info

Publication number
CN112329865A
CN112329865A (application CN202011242143.5A)
Authority
CN
China
Prior art keywords
self-encoder
vector
encoders
time sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011242143.5A
Other languages
Chinese (zh)
Other versions
CN112329865B (en)
Inventor
邓悦
郑立颖
徐亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011242143.5A priority Critical patent/CN112329865B/en
Publication of CN112329865A publication Critical patent/CN112329865A/en
Priority to PCT/CN2021/097550 priority patent/WO2022095434A1/en
Application granted granted Critical
Publication of CN112329865B publication Critical patent/CN112329865B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

The application relates to the technical field of artificial intelligence, and provides a data anomaly identification method and device based on a self-encoder, as well as computer equipment and a storage medium. The method comprises the following steps: receiving an input time series to be detected; based on the time series, performing integrated training processing on a preset number of sparsely connected self-encoders according to a preset rule to generate a corresponding self-encoder integrated framework; calculating, through the self-encoder integrated framework, an anomaly score value corresponding to each vector contained in the time series; and identifying, according to the anomaly score values, whether abnormal data values exist in the time series. With the method and device, whether abnormal data values exist in the time series can be accurately identified, effectively improving the identification accuracy of abnormal data values in the time series. The present application also relates to the field of blockchains, where the self-encoder integrated framework may be stored in a blockchain.

Description

Data anomaly identification method and device based on self-encoder and computer equipment
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a data anomaly identification method and device based on an autoencoder and computer equipment.
Background
With the advent of the big data era, emerging topics such as cloud computing and the Internet of Things have arisen, and mining the potentially useful data that people ultimately need from massive data has become increasingly important. Conventional data mining is primarily concerned with data models covering large amounts of data, while the detection of anomalous data receives less attention. In fact, although analysing and mining useful data is inherently important, abnormal values in which important data deviates also contain a large amount of useful information; they may affect the data, distort it and prevent correct results from being obtained, so the detection of abnormal data cannot be neglected either.
In the prior art, most current anomaly detection methods are based on statistics and mainly include deviation-based methods, distribution-based methods, distance-based methods, density-based methods and the like. However, these methods need to know the distribution of the data in advance; in addition, most statistics-based anomaly detection algorithms are only suitable for mining numerical data of single variables and are not suitable for time series data, give less than ideal results if applied directly to time series data, and have low identification accuracy for anomalous data.
Disclosure of Invention
The application mainly aims to provide a data anomaly identification method and device based on a self-encoder, as well as a computer device and a storage medium, so as to solve the technical problems that existing anomaly detection methods are not suitable for time series data, give less than ideal results if applied directly to time series data, and have low identification accuracy for anomalous data.
The application provides a data anomaly identification method based on an autoencoder, which comprises the following steps:
receiving an input time sequence to be detected;
based on the time sequence, performing integrated training processing on a preset number of self-encoders which are generated in advance and connected sparsely according to a preset rule to generate a corresponding self-encoder integrated framework, wherein the self-encoders which are generated in advance are respectively generated after unit connection deletion processing is performed on the preset number of self-encoders based on the recurrent neural network;
calculating an abnormal score value corresponding to each vector contained in the time sequence through the self-encoder integration framework;
and identifying whether an abnormal data value exists in the time sequence or not according to the abnormal score value.
Optionally, the step of performing, based on the time series, an integrated training process on a preset number of sparsely connected self-encoders generated in advance according to a preset rule to generate a corresponding self-encoder integrated frame includes:
acquiring all first vectors contained in the time sequence; and
acquiring first reconstruction vectors which are generated by each sparsely connected self-encoder based on each first vector and correspond one to one;
generating a corresponding first objective function based on the first vector and the first reconstruction vector;
training each self-encoder in sparse connection based on the first objective function to obtain a trained first self-encoder, wherein the number of the first self-encoders is the same as that of the self-encoders in sparse connection;
performing integrated processing on all the first self-encoders to generate corresponding independent frames, wherein the independent frames contain a specified number of the first self-encoders, and interaction is not generated between the first self-encoders;
determining the independent framework as the self-encoder integrated framework.
Optionally, the step of performing, based on the time series, an integrated training process on a preset number of sparsely connected self-encoders generated in advance according to a preset rule to generate a corresponding self-encoder integrated frame includes:
acquiring a preset sharing layer, wherein the sharing layer comprises a sharing hidden state;
carrying out weight sharing processing on all the sparsely connected self-coders through the sharing layer;
performing L1 regularization processing on the shared hidden state to obtain a processed shared hidden state;
acquiring all second vectors contained in the time sequence; and
acquiring a one-to-one corresponding second reconstruction vector generated by each sparsely connected self-encoder based on each second vector;
generating a corresponding second objective function according to the processed shared hidden state, the second vector and the second reconstruction vector;
performing joint training on all the self-encoders in sparse connection based on the second objective function to obtain a trained second self-encoder, wherein the number of the second self-encoders is the same as that of the self-encoders in sparse connection;
performing integrated processing on all the second self-encoders to generate a corresponding shared frame, wherein the shared frame contains a specified number of the second self-encoders, and interaction exists between the second self-encoders;
determining the shared frame as the self-encoder integrated frame.
Optionally, the step of calculating, by the self-encoder integration framework, an anomaly score value corresponding to each vector included in the time series includes:
calculating and generating a reconstruction error corresponding to a specified vector by each self-encoder contained in the self-encoder integrated framework, wherein the specified vector is any one of all vectors contained in the time sequence;
calculating the median of all the reconstruction errors;
determining the median as a specified anomaly score value corresponding to the specified vector in the time series.
Optionally, the step of generating a reconstruction error corresponding to the specified vector by each self-encoder calculation included in the self-encoder integration framework includes:
reconstructing the time sequence through a specific self-encoder to obtain a specific reconstructed time sequence corresponding to the time sequence, wherein the specific self-encoder is any one of all self-encoders included in the self-encoder integrated frame;
extracting a specific reconstruction vector corresponding to the specified vector from the specific reconstruction time series;
and calculating a specific reconstruction error corresponding to the specified vector according to the specified vector and the specific reconstruction vector.
Optionally, the step of identifying whether there is an abnormal data value in the time series according to the abnormal score value includes:
acquiring a preset abnormal threshold;
judging whether a designated score value with a numerical value larger than the abnormality threshold value exists in all the abnormality score values;
if so, screening the designated score value from all the abnormal score values;
finding a third vector corresponding to the specified score value from the time series;
determining the third vector as the anomalous data value.
Optionally, after the step of determining the third vector as the abnormal data value, the method further comprises:
screening out a fourth vector except the third vector from the time sequence;
marking the fourth vector as a normal data value;
obtaining a first number corresponding to the third vector; and
acquiring a second number corresponding to the fourth vector;
generating an abnormal analysis report corresponding to the time sequence according to the abnormal data value, the first number, the normal data value and the second number;
and displaying the abnormal analysis report.
The present application further provides a data anomaly identification device based on an autoencoder, including:
the receiving module is used for receiving an input time sequence to be detected;
the training module is used for performing, based on the time sequence, integrated training processing on a preset number of pre-generated sparsely connected self-encoders according to a preset rule to generate a corresponding self-encoder integrated framework, wherein the sparsely connected self-encoders are generated by respectively performing unit connection deletion processing on the preset number of self-encoders based on the recurrent neural network;
the computing module is used for computing an abnormal score value corresponding to each vector contained in the time sequence through the self-encoder integrated framework;
and the identification module is used for identifying whether an abnormal data value exists in the time sequence or not according to the abnormal score value.
The present application further provides a computer device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the above method when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the above-mentioned method.
The data anomaly identification method and device based on the self-encoder, the computer equipment and the storage medium have the following beneficial effects:
the data abnormity identification method, the device, the computer equipment and the storage medium based on the self-encoder provided in the application are different from the existing abnormity detection method, the self-encoder integrated frame is adopted to carry out data abnormity identification processing on a time sequence, when the input time sequence to be detected is received, the self-encoder which is improved on the original self-encoder based on the recurrent neural network to generate sparse connection is firstly obtained, then the self-encoder which is generated in advance and connected sparsely is subjected to integrated training processing on the basis of the time sequence to generate the self-encoder integrated frame which can be used for time sequence data abnormity value identification, so that the self-encoder integrated frame can be used for calculating the abnormity score value corresponding to each vector contained in the time sequence, and further, whether the abnormal data value exists in the time sequence can be quickly and accurately identified according to the abnormity score value, the method and the device effectively improve the identification accuracy of the abnormal data values in the time sequence, and have higher identification processing efficiency of the abnormal data values in the time sequence.
Drawings
FIG. 1 is a flow chart of a data anomaly identification method based on a self-encoder according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an apparatus for recognizing data anomalies based on a self-encoder according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the purpose of explanation of the embodiments of the present application, some concepts will be briefly described below:
the Recurrent Neural Network (RNN), its essence is: the RNN output is dependent on the current input and memory, since it has the ability to remember like a human. RNN networks introduce directed loops that can handle the problem of contextual associations between those inputs. The full connection between the structural layers of the traditional neural network is broken, the state transition of no connection between nodes of each layer is not in an input-hidden-output mode. Purpose of RNN: processing the contents of the sequence data RNN: the current output of a sequence is also related to the previous input. The specific implementation of RNN: the network memorizes the previous information and applies the previous information to the calculation of the current output, namely, the nodes between the hidden layers are not connected any more, and the input of the hidden layers not only comprises the output of the input layer, but also comprises the output of the hidden layer at the last moment. Functional characteristics of RNN: 1. hidden layer nodes can be interconnected or self-connected; 2. in the RNN network, the output of each step is not necessary, nor is the input of each step necessary. RNN use: language model and text generation research, machine translation, speech recognition, image description generation.
A self-encoder (auto-encoder) is a neural network which, after training, attempts to copy its input to its output. A self-encoder contains a hidden layer h that produces a coded representation of the input, and the network can be viewed as two parts: an encoder represented by the function h = f(x), and a decoder r = g(h) that generates the reconstruction. A conventional self-encoder processes a time series as follows: for a time series T = <s_1, s_2, …, s_C>, each vector s_t in the time series is fed to an RNN unit in the encoder of the self-encoder, which performs the following calculation:

h_t^E = f(s_t, h_{t-1}^E)

where s_t is the vector at time step t in the time series, the hidden state h_{t-1}^E is the output of the previous RNN unit of the encoder at time step t-1, and f(·) is a non-linear function. Through the above formula, the hidden state h_t^E of the current RNN unit of the encoder is obtained at time step t and is then fed into the next RNN unit at time step t+1. In the decoder of the self-encoder, the time series is reconstructed in reverse order, i.e. as <ŝ_C, ŝ_{C-1}, …, ŝ_1>. First, the last hidden state of the encoder is used as the first hidden state of the decoder. Then, based on the previous hidden state h_{t+1}^D and the previously reconstructed vector ŝ_{t+1}, the decoder reconstructs the current vector ŝ_t and calculates the current hidden state h_t^D = g(ŝ_{t+1}, h_{t+1}^D), where g(·) is a non-linear function.
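For illustration, the following Python sketch (NumPy; the weight names, sizes and output projection are illustrative assumptions rather than the patent's reference implementation) shows the encode-then-reverse-reconstruct pattern described above.

import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 8                      # vector dimension, hidden size (assumed)

# Randomly initialised parameters for illustration only.
W_in, W_hh, b_h = rng.normal(size=(H, D)), rng.normal(size=(H, H)), np.zeros(H)
W_dec, W_out, b_o = rng.normal(size=(H, H)), rng.normal(size=(D, H)), np.zeros(D)
W_sh = rng.normal(size=(H, D))   # feeds the previous reconstruction back into the decoder

def f(s_t, h_prev):
    """Encoder RNN unit: h_t^E = f(s_t, h_{t-1}^E)."""
    return np.tanh(W_in @ s_t + W_hh @ h_prev + b_h)

def g(s_hat_next, h_next):
    """Decoder RNN unit: h_t^D = g(s_hat_{t+1}, h_{t+1}^D)."""
    return np.tanh(W_sh @ s_hat_next + W_dec @ h_next + b_h)

def reconstruct(T):
    """Encode the series T, then reconstruct it in reverse order."""
    h = np.zeros(H)
    for s_t in T:                # forward pass through the encoder
        h = f(s_t, h)
    recon = []
    s_hat = W_out @ h + b_o      # last vector reconstructed from the last encoder state
    recon.append(s_hat)
    for _ in range(len(T) - 1):  # reverse-order reconstruction
        h = g(s_hat, h)
        s_hat = W_out @ h + b_o
        recon.append(s_hat)
    return list(reversed(recon)) # <s_hat_1, ..., s_hat_C>

T = [rng.normal(size=D) for _ in range(10)]
T_hat = reconstruct(T)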
Referring to fig. 1, a data anomaly identification method based on a self-encoder according to an embodiment of the present application includes:
s1: receiving an input time sequence to be detected;
s2: based on the time sequence, performing integrated training processing on a preset number of self-encoders which are generated in advance and connected sparsely according to a preset rule to generate a corresponding self-encoder integrated framework, wherein the self-encoders which are generated in advance are respectively generated after unit connection deletion processing is performed on the preset number of self-encoders based on the recurrent neural network;
s3: calculating an abnormal score value corresponding to each vector contained in the time sequence through the self-encoder integration framework;
s4: and identifying whether an abnormal data value exists in the time sequence or not according to the abnormal score value.
As described in the above steps S1 to S4, the execution subject of this method embodiment is a data anomaly identification apparatus based on a self-encoder. In practical applications, the apparatus may be implemented as a virtual device, such as software code, or as a physical device in which the relevant execution code is written or integrated, and it may interact with a user through a keyboard, a mouse, a remote controller, a touch panel or a voice control device. The apparatus in this embodiment can quickly and accurately identify abnormal data values in a time series to be detected. Specifically, an input time series to be detected is first received. The time series to be detected is a time series in which it is to be determined whether abnormal data values exist; for example, it may be a Key Performance Indicator (KPI) time series from a server, and the data contained in the time series is in vector form. Then, based on the time series, a preset number of pre-generated sparsely connected self-encoders are subjected to integrated training processing according to a preset rule to generate a corresponding self-encoder integrated framework, where the sparsely connected self-encoders are generated by performing unit connection deletion processing on the preset number of self-encoders based on the recurrent neural network. Specifically, the generation process of the sparsely connected self-encoders may include: first obtaining a specified number of recurrent-neural-network-based self-encoders. Each such self-encoder may specifically be a recurrent neural network self-encoder with additional auxiliary connections (RSCN), which adds an auxiliary connection between RNN units; the specified number is not specifically limited and can be set according to actual requirements, and in this embodiment the specified number is taken as N. Then, unit connection deletion processing is performed on each recurrent-neural-network-based self-encoder to generate the corresponding number of sparsely connected self-encoders. Because such a self-encoder adds an auxiliary connection between RNN units, cutting off the auxiliary connections between some of the RNN units yields network layers that differ from one another. Specifically, the unit connection deletion processing may be controlled by introducing a sparse weight vector that determines which auxiliary connections should be deleted at each time step t:
w_t = (w_t^(1), w_t^(2))

where w_t represents the sparse weight vector, and w_t^(1) and w_t^(2) represent the elements contained in the sparse weight vector. At least one element of the sparse weight vector w_t is not equal to 0, i.e. w_t takes one of the three cases (0,1), (1,0) and (1,1). Based on the sparse weight vector w_t, a sparsely connected self-encoder can thus be generated, and the hidden state of each RNN unit within the sparsely connected self-encoder is calculated as follows:

h_t = f(s_t, (w_t^(1) · h_{t-1} + w_t^(2) · h_{t-L}) / ||w_t||_0)
where s_t is the input vector at time step t in the time series data, h_{t-1} is the hidden state at time step t-1 in the encoder of the sparsely connected self-encoder, h_{t-L} is the hidden state at time step t-L in that encoder, w_t is the sparse weight vector, and ||w_t||_0 denotes the number of non-zero elements of w_t. Furthermore, the unit connection deletion processing can be performed by randomly deleting connections according to actual requirements: for each recurrent-neural-network-based self-encoder, a sparsely connected self-encoder is obtained by randomly deleting the connections of some RNN units, so that the reconstruction errors obtained after the sparsely connected self-encoders reconstruct a time series differ from one another, which effectively broadens the application range of the self-encoders and enhances their reliability, accuracy and generalization. In addition, assuming the specified number is N, N sparsely connected self-encoders are obtained, each consisting of an encoder E_i and a decoder D_i with 1 ≤ i ≤ N, and each sparsely connected self-encoder has a different sparse weight vector.

The self-encoder integrated framework may be an independent framework or a shared framework. Specifically, a corresponding first objective function may be generated based on all vectors contained in the time series and the reconstruction vector, generated by a sparsely connected self-encoder, corresponding to each such vector, and each sparsely connected self-encoder may be trained based on the first objective function to obtain the self-encoder integrated framework. Alternatively, a corresponding second objective function may be generated based on all vectors contained in the time series, the reconstruction vectors generated by the sparsely connected self-encoders for each such vector, and a preset shared hidden state, and all sparsely connected self-encoders may be jointly trained based on the second objective function to obtain the self-encoder integrated framework.

After the self-encoder integrated framework is obtained, the anomaly score value corresponding to each vector contained in the time series is calculated through the framework. The reconstruction error corresponding to each vector in the time series can be calculated by each self-encoder contained in the framework; then, for any specified vector in the time series, the median of all reconstruction errors corresponding to that vector is calculated, giving the anomaly score value for that vector. Finally, whether an abnormal data value exists in the time series is identified according to the anomaly score values: using a preset anomaly threshold, if the anomaly score value corresponding to any specified vector in the time series is greater than the anomaly threshold, that vector is determined to be an abnormal data value; if the anomaly score value corresponding to the specified vector is not greater than the anomaly threshold, the vector is determined to be a normal data value, i.e. it does not belong to the abnormal data values.

Different from existing anomaly detection methods, this embodiment uses a self-encoder integrated framework to perform data anomaly identification on a time series. When the input time series to be detected is received, the sparsely connected self-encoders, obtained by improving the original recurrent-neural-network-based self-encoders, are first acquired; the pre-generated sparsely connected self-encoders are then subjected to integrated training processing based on the time series to generate a self-encoder integrated framework that can be used for identifying abnormal values in time series data. The framework is then used to calculate the anomaly score value corresponding to each vector contained in the time series, and whether abnormal data values exist in the time series can be quickly and accurately identified from those scores. This effectively improves the identification accuracy of abnormal data values in the time series, and the identification processing is also efficient.
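The following Python sketch (NumPy; weight names, sizes and the tanh cell are illustrative assumptions) shows the sparsely connected hidden-state update h_t = f(s_t, (w_t^(1)·h_{t-1} + w_t^(2)·h_{t-L}) / ||w_t||_0) described above, with the sparse weight vector drawn at random from {(0,1), (1,0), (1,1)} so that different ensemble members drop different auxiliary connections.

import numpy as np

rng = np.random.default_rng(1)
D, H, L = 4, 8, 3                          # vector dim, hidden size, skip length (assumed)
W_in, W_h, b = rng.normal(size=(H, D)), rng.normal(size=(H, H)), np.zeros(H)

def sparse_step(s_t, h_prev, h_skip, w_t):
    """One RNN step with a sparse weight vector w_t = (w1, w2)."""
    w1, w2 = w_t
    nnz = np.count_nonzero(w_t)            # ||w_t||_0, at least 1 by construction
    mixed = (w1 * h_prev + w2 * h_skip) / nnz
    return np.tanh(W_in @ s_t + W_h @ mixed + b)

def encode_sparse(T, seed):
    """Run one sparsely connected encoder over the series T."""
    rng_i = np.random.default_rng(seed)    # a different seed per ensemble member
    patterns = [(0, 1), (1, 0), (1, 1)]    # the three admissible sparse patterns
    states = [np.zeros(H)]
    for t, s_t in enumerate(T):
        w_t = patterns[rng_i.integers(len(patterns))]
        idx = t + 1 - L                    # index of h_{t-L}
        h_skip = states[idx] if idx >= 0 else np.zeros(H)
        states.append(sparse_step(s_t, states[-1], h_skip, w_t))
    return states[-1]                      # last hidden state of the encoder

T = [rng.normal(size=D) for _ in range(10)]
last_hidden = encode_sparse(T, seed=7)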
Further, in an embodiment of the present application, the step S2 includes:
s200: acquiring all first vectors contained in the time sequence; and
s201: acquiring first reconstruction vectors which are generated by each sparsely connected self-encoder based on each first vector and correspond one to one;
s202: generating a corresponding first objective function based on the first vector and the first reconstruction vector;
s203: training each self-encoder in sparse connection based on the first objective function to obtain a trained first self-encoder, wherein the number of the first self-encoders is the same as that of the self-encoders in sparse connection;
s204: performing integrated processing on all the first self-encoders to generate corresponding independent frames, wherein the independent frames contain a specified number of the first self-encoders, and interaction is not generated between the first self-encoders;
s205: determining the independent framework as the self-encoder integrated framework.
As described in steps S200 to S205, the self-encoder integrated framework may be an independent framework generated from all the sparsely connected self-encoders. In the training process of the independent framework, each different sparsely connected self-encoder is trained separately, so the self-encoders do not interact during the training phase, and no interaction occurs between the self-encoders contained in the generated independent framework. Specifically, the step of performing integrated training processing on the pre-generated specified number of sparsely connected self-encoders according to the preset rule based on the time series to generate the corresponding self-encoder integrated framework may include the following. First, all first vectors contained in the time series are obtained. The input time series to be detected may be T = <s_1, s_2, …, s_C>, and the vectors s_1, s_2, …, s_C contained in T are regarded as the first vectors. At the same time, the one-to-one corresponding first reconstruction vectors generated by each sparsely connected self-encoder based on each first vector are obtained: any sparsely connected self-encoder, after reconstructing the time series, generates a corresponding reconstructed time series <ŝ_1, ŝ_2, …, ŝ_C>, and the vectors ŝ_1, ŝ_2, …, ŝ_C contained in the reconstructed time series are regarded as the first reconstruction vectors corresponding to the respective first vectors.

A corresponding first objective function is then generated based on the first vectors and the first reconstruction vectors. Minimizing the difference between each input vector in the time series and the corresponding reconstructed vector generated by a sparsely connected self-encoder can be taken as the first objective function J_i, and the first objective function J_i is used to train each sparsely connected self-encoder independently. Specifically, the first objective function may be:

J_i = Σ_{t=1}^{C} || s_t − ŝ_t^{(D_i)} ||_2^2

where J_i is the first objective function, s_t is the vector at time step t in the time series, ŝ_t^{(D_i)} denotes the reconstruction of the vector s_t generated at time step t by the decoder D_i contained in the sparsely connected self-encoder, and ||·||_2 is the L2-norm of a vector.

After the first objective function is obtained, each sparsely connected self-encoder is trained based on the first objective function to obtain the trained first self-encoders, the number of which is the same as the number of sparsely connected self-encoders. After the first self-encoders are obtained, all the first self-encoders are integrated to generate the corresponding independent framework. The independent framework contains the specified number of first self-encoders, and no interaction occurs between the first self-encoders. Specifically, all the first self-encoders may be integrated into a preset integration framework to generate the independent framework. In addition, each decoder D_i in the independent framework obtains its initial hidden state as a linear combination of its own independent (last encoder) hidden state with a corresponding weight matrix. Finally, when the independent framework is obtained, it is determined as the self-encoder integrated framework.

In this embodiment, an independent framework composed of the specified number of sparsely connected self-encoders with different network structures is generated through training. Because reconstruction errors from multiple self-encoders are considered when the independent framework is used for anomaly detection, the variance of the overall reconstruction error is reduced, so that the anomaly score value corresponding to each vector contained in the time series can subsequently be calculated accurately from the independent framework, and whether abnormal data values exist in the time series can then be quickly and accurately identified from the anomaly score values, effectively improving the identification efficiency and accuracy of abnormal data values in the time series.
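As a sketch of the first objective function, the following Python example (NumPy; the placeholder reconstructions and sizes are assumptions) evaluates J_i = Σ_t ||s_t − ŝ_t||_2^2 once per ensemble member, reflecting that the members of the independent framework are trained separately and do not interact.

import numpy as np

def first_objective(T, T_hat):
    """J_i: sum of squared L2 reconstruction errors over the series."""
    return sum(np.sum((s_t - s_hat_t) ** 2) for s_t, s_hat_t in zip(T, T_hat))

rng = np.random.default_rng(2)
C, D, N = 10, 4, 5                        # series length, vector dim, ensemble size (assumed)
T = [rng.normal(size=D) for _ in range(C)]

# One reconstruction per ensemble member; each member is trained on its own J_i,
# so in the independent framework the members never interact.
reconstructions = [[s + 0.1 * rng.normal(size=D) for s in T] for _ in range(N)]
J = [first_objective(T, T_hat_i) for T_hat_i in reconstructions]
print(J)   # N independent objective values, one per first self-encoder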
Further, in an embodiment of the present application, the step S2 includes:
s210: acquiring a preset sharing layer, wherein the sharing layer comprises a sharing hidden state;
s211: carrying out weight sharing processing on all the sparsely connected self-coders through the sharing layer;
s212: performing L1 regularization processing on the shared hidden state to obtain a processed shared hidden state;
s213: acquiring all second vectors contained in the time sequence; and
s214: acquiring a one-to-one corresponding second reconstruction vector generated by each sparsely connected self-encoder based on each second vector;
s215: generating a corresponding second objective function according to the processed shared hidden state, the second vector and the second reconstruction vector;
s216: performing joint training on all the self-encoders in sparse connection based on the second objective function to obtain a trained second self-encoder, wherein the number of the second self-encoders is the same as that of the self-encoders in sparse connection;
s217: performing integrated processing on all the second self-encoders to generate a corresponding shared frame, wherein the shared frame contains a specified number of the second self-encoders, and interaction exists between the second self-encoders;
s218: determining the shared frame as the self-encoder integrated frame.
As described in steps S210 to S218, the self-encoder integrated framework may also be a shared framework, generated from all the sparsely connected self-encoders together with a preset shared layer. Because the shared framework includes interaction between the different self-encoders, it can further improve the accuracy of identifying abnormal data values in the time series compared with the independent framework. Specifically, the step of performing integrated training processing on the pre-generated specified number of sparsely connected self-encoders according to the preset rule based on the time series to generate the corresponding self-encoder integrated framework may include the following. First, a preset shared layer is obtained, and weight sharing processing is performed on all the sparsely connected self-encoders through the shared layer, where the shared layer contains a shared hidden state. The shared layer connects the last hidden states h_C^{E_i} of the encoders of all the sparsely connected self-encoders with the corresponding weight matrices W^{E_i}; in particular, the shared layer, i.e. the shared hidden state, can be written as

ĥ_C = Σ_{i=1}^{N} W^{E_i} · h_C^{E_i}

Then L1 regularization processing is performed on the shared hidden state to obtain the processed shared hidden state. Performing L1 regularization on the shared hidden state makes ĥ_C sparse, which prevents some encoders from over-fitting the time series, so that the decoders have a wider application range and are less easily influenced by abnormal data values.

After the processed shared hidden state is obtained, all second vectors contained in the time series are acquired. The input time series to be detected may be T = <s_1, s_2, …, s_C>, and the vectors s_1, s_2, …, s_C contained in T are regarded as the second vectors. At the same time, the one-to-one corresponding second reconstruction vectors generated by each sparsely connected self-encoder based on each second vector are acquired: each sparsely connected self-encoder, by reconstructing the time series, generates a corresponding reconstructed time series <ŝ_1, ŝ_2, …, ŝ_C>, and the vectors ŝ_1, ŝ_2, …, ŝ_C contained in the reconstructed time series are regarded as the second reconstruction vectors corresponding to the respective second vectors.

A corresponding second objective function is then generated from the processed shared hidden state, the second vectors and the second reconstruction vectors. Specifically, the second objective function may be:

J = Σ_{i=1}^{N} J_i + λ · ||ĥ_C||_1 = Σ_{i=1}^{N} Σ_{t=1}^{C} || s_t − ŝ_t^{(D_i)} ||_2^2 + λ · ||ĥ_C||_1

where λ is the weight parameter controlling the importance of the L1 regularization term, s_t is the vector at time step t in the time series, ŝ_t^{(D_i)} denotes the vector reconstructed by decoder D_i at time step t, ĥ_C is the shared hidden state subject to L1 regularization, ||·||_2 is the L2-norm of a vector, and J_i is the first objective function.

After the second objective function is obtained, all the sparsely connected self-encoders are jointly trained based on the second objective function to obtain the trained second self-encoders, the number of which is the same as the number of sparsely connected self-encoders. All the second self-encoders are then integrated to generate the corresponding shared framework, where the shared framework contains the specified number of second self-encoders and interaction exists between the second self-encoders. In addition, all the second self-encoders may be integrated into a preset integration framework to generate the shared framework. Finally, the shared framework is determined as the self-encoder integrated framework.

In this embodiment, a shared framework composed of the specified number of sparsely connected self-encoders with different network structures is generated through training. Because reconstruction errors from multiple self-encoders are considered when the shared framework is used for anomaly detection, and because the sparsely connected self-encoders can interact with one another, the variance of the overall reconstruction error is further reduced, so that the anomaly score value corresponding to each vector contained in the time series can be calculated accurately from the shared framework, and whether abnormal data values exist in the time series can then be quickly and accurately identified from the anomaly score values, effectively improving the identification efficiency and accuracy of abnormal data values in the time series.
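The following Python sketch (NumPy; placeholder values and assumed sizes) illustrates the joint second objective J = Σ_i J_i + λ·||ĥ_C||_1 described above, in which the L1 term on the shared hidden state couples the ensemble members.

import numpy as np

def first_objective(T, T_hat):
    """J_i: sum of squared L2 reconstruction errors over the series."""
    return sum(np.sum((s_t - s_hat_t) ** 2) for s_t, s_hat_t in zip(T, T_hat))

def second_objective(T, reconstructions, h_shared, lam=0.1):
    """Joint objective over all ensemble members plus the L1 sparsity term."""
    data_term = sum(first_objective(T, T_hat_i) for T_hat_i in reconstructions)
    return data_term + lam * np.sum(np.abs(h_shared))

rng = np.random.default_rng(3)
C, D, H, N = 10, 4, 8, 5                  # illustrative sizes (assumptions)
T = [rng.normal(size=D) for _ in range(C)]
reconstructions = [[s + 0.1 * rng.normal(size=D) for s in T] for _ in range(N)]
h_shared = rng.normal(size=H)             # stands in for the shared hidden state
print(second_objective(T, reconstructions, h_shared))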
Further, in an embodiment of the present application, the step S3 includes:
s300: calculating and generating a reconstruction error corresponding to a specified vector by each self-encoder contained in the self-encoder integrated framework, wherein the specified vector is any one of all vectors contained in the time sequence;
s301: calculating the median of all the reconstruction errors;
s302: determining the median as a specified anomaly score value corresponding to the specified vector in the time series.
As described in the above steps S300 to S302, the step of calculating, through the self-encoder integrated framework, the anomaly score value corresponding to each vector contained in the time series may specifically include the following. First, the reconstruction error corresponding to a specified vector is calculated and generated by each self-encoder contained in the self-encoder integrated framework, where the specified vector is any one of all the vectors contained in the time series. Specifically, assuming the specified number is N, then for any vector s_k in the original time series T = <s_1, s_2, …, s_C>, the N self-encoders contained in the self-encoder integrated framework generate N reconstruction errors {a_1, a_2, …, a_N} corresponding to s_k. The generation process of a reconstruction error may include: each of the N self-encoders contained in the framework generates a reconstructed time series corresponding to the time series, the reconstruction vector corresponding to s_k is extracted from that reconstructed time series, and a calculation formula relating s_k to the corresponding reconstruction vector is invoked to compute the reconstruction error for s_k. The median of all these reconstruction errors is then calculated, which can be done through the formula OS(s_k) = median{a_1, a_2, …, a_N}. Finally, the median is determined as the specified anomaly score value corresponding to the specified vector in the time series. Here, the median of the N reconstruction errors is used as the final anomaly score value of the vector s_k in order to reduce the influence of reconstruction errors from individual self-encoders; the independent framework and the shared framework both use the same formula to calculate the anomaly score value corresponding to each vector contained in the time series. In this embodiment, the reconstruction errors corresponding to a specified vector are calculated by each self-encoder contained in the self-encoder integrated framework, and the median of all those reconstruction errors is the specified anomaly score value corresponding to the specified vector in the time series, so that the anomaly score value corresponding to each vector contained in the time series is calculated accurately; whether abnormal data values exist in the time series can then be quickly and accurately identified from the anomaly score values, effectively improving the identification efficiency and accuracy of abnormal data values in the time series.
Further, in an embodiment of the present application, the step S300 includes:
s3000: reconstructing the time sequence through a specific self-encoder to obtain a specific reconstructed time sequence corresponding to the time sequence, wherein the specific self-encoder is any one of all self-encoders included in the self-encoder integrated frame;
s3001: extracting a specific reconstruction vector corresponding to the specified vector from the specific reconstruction time series;
s3002: and calculating a specific reconstruction error corresponding to the specified vector according to the specified vector and the specific reconstruction vector.
As described in steps S3000 to S3002, the step of calculating and generating, by each self-encoder contained in the self-encoder integrated framework, the reconstruction error corresponding to the specified vector may specifically include the following. First, the time series is reconstructed by a specific self-encoder to obtain the specific reconstructed time series corresponding to the time series, where the specific self-encoder is any one of all the self-encoders contained in the self-encoder integrated framework. The input time series to be detected may be T = <s_1, s_2, …, s_C>, and by reconstructing the time series the specific self-encoder can generate the corresponding reconstructed time series <ŝ_1^{(i)}, ŝ_2^{(i)}, …, ŝ_C^{(i)}>, with 1 ≤ i ≤ N. Then, the specific reconstruction vector corresponding to the specified vector is extracted from the specific reconstructed time series: for the specified vector s_k in the time series, the specific reconstruction vector ŝ_k^{(i)} corresponding to s_k is extracted from the reconstructed time series generated by the specific self-encoder. Finally, the specific reconstruction error corresponding to the specified vector is calculated from the specified vector and the specific reconstruction vector, which can be done through the formula

a_i = || s_k − ŝ_k^{(i)} ||_2^2

Further, the specified anomaly score value corresponding to the specified vector in the time series can be calculated through the formula

OS(s_k) = median{ || s_k − ŝ_k^{(1)} ||_2^2, || s_k − ŝ_k^{(2)} ||_2^2, …, || s_k − ŝ_k^{(N)} ||_2^2 }

In this way, the reconstruction error corresponding to the specified vector can subsequently be calculated and generated by each self-encoder contained in the self-encoder integrated framework, the anomaly score value corresponding to each vector contained in the time series can be calculated quickly, and whether abnormal data values exist in the time series can then be quickly and accurately identified from the anomaly score values, effectively improving the identification efficiency and accuracy of abnormal data values in the time series.
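The per-vector scoring described above can be sketched in Python as follows (NumPy; the sizes and synthetic reconstructions are assumptions): each ensemble member's squared L2 reconstruction error is computed for every vector, and the median over the N members gives OS(s_k).

import numpy as np

def anomaly_scores(T, reconstructions):
    """Return OS(s_k) for every vector in T given N reconstructed series."""
    T = np.asarray(T)                       # shape (C, D)
    R = np.asarray(reconstructions)         # shape (N, C, D)
    errors = np.sum((R - T[None, :, :]) ** 2, axis=2)  # a_i per vector, shape (N, C)
    return np.median(errors, axis=0)        # one score per time step

rng = np.random.default_rng(4)
C, D, N = 10, 4, 5                           # illustrative sizes (assumptions)
T = rng.normal(size=(C, D))
reconstructions = T[None] + 0.1 * rng.normal(size=(N, C, D))
print(anomaly_scores(T, reconstructions))    # larger values = more anomalous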
Further, in an embodiment of the present application, the step S4 includes:
s400: acquiring a preset abnormal threshold;
s401: judging whether a designated score value with a numerical value larger than the abnormality threshold value exists in all the abnormality score values;
s402: if so, screening the designated score value from all the abnormal score values;
s403: finding a third vector corresponding to the specified score value from the time series;
s404: determining the third vector as the anomalous data value.
As described in the foregoing steps S400 to S404, the step of identifying whether abnormal data values exist in the time series according to the anomaly score values may specifically include the following. First, a preset anomaly threshold is obtained; the value of the anomaly threshold is not specifically limited and may be generated from corresponding statistical calculation over historical time series data, or may be set according to actual requirements. Then it is determined whether, among all the anomaly score values, there exist specified score values greater than the anomaly threshold. If so, those specified score values are screened out from all the anomaly score values, the third vectors corresponding to the specified score values are then found from the time series, and finally the third vectors are determined as the abnormal data values. In this embodiment, the self-encoder integrated framework is used to calculate the anomaly score value corresponding to each vector contained in the time series; by comparing the anomaly score values with the preset anomaly threshold, the specified score values greater than the anomaly threshold are found, and the corresponding third vectors in the time series are determined as the abnormal data values, so that accurate identification of the abnormal data values contained in the time series is achieved and the identification efficiency of abnormal data in the time series is effectively improved.
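A minimal sketch of the thresholding step, assuming the score array produced by the previous example:

import numpy as np

def flag_anomalies(scores, threshold):
    """Return indices of vectors whose anomaly score exceeds the threshold."""
    scores = np.asarray(scores)
    return np.nonzero(scores > threshold)[0]

scores = np.array([0.12, 0.10, 0.95, 0.11, 0.13])
threshold = 0.5                      # could also be derived from historical data
anomalous_indices = flag_anomalies(scores, threshold)
print(anomalous_indices)             # -> [2]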
Further, in an embodiment of the application, after the step S404, the method includes:
s405: screening out a fourth vector except the third vector from the time sequence;
s406: marking the fourth vector as a normal data value;
s407: obtaining a first number corresponding to the third vector; and
s408: acquiring a second number corresponding to the fourth vector;
s409: generating an abnormal analysis report corresponding to the time sequence according to the abnormal data value, the first number, the normal data value and the second number;
s410: and displaying the abnormal analysis report.
As described in steps S405 to S410, after the abnormal data values in the time series are obtained, a corresponding abnormal analysis report may further be generated from the abnormal data values and the related data. Specifically, after the step of determining the third vectors as the abnormal data values, the method may further include: first, the fourth vectors other than the third vectors are screened out from the time series, and the fourth vectors are marked as normal data values. Then the first number corresponding to the third vectors is obtained, and at the same time the second number corresponding to the fourth vectors is acquired. An abnormal analysis report corresponding to the time series is generated from the abnormal data values, the first number, the normal data values and the second number, where the abnormal analysis report at least contains the abnormal data values, the first number, the normal data values and the second number. Finally, after the abnormal analysis report is obtained, it is displayed, so that the user can clearly learn from the report the specific distribution and scale of the abnormal data values and of the normal data values contained in the time series to be detected. The display mode of the abnormal analysis report is not specifically limited and may be set according to implementation requirements.
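A small illustrative sketch (the function and field names are assumptions, not the patent's API) of assembling such an abnormal analysis report from the flagged indices:

import numpy as np

def build_report(T, anomalous_indices):
    """Split the series into normal/abnormal values and count both groups."""
    T = np.asarray(T)
    anomalous = set(int(i) for i in anomalous_indices)
    abnormal_values = [T[i].tolist() for i in sorted(anomalous)]
    normal_values = [T[i].tolist() for i in range(len(T)) if i not in anomalous]
    return {
        "abnormal_data_values": abnormal_values,
        "abnormal_count": len(abnormal_values),     # the "first number"
        "normal_data_values": normal_values,
        "normal_count": len(normal_values),         # the "second number"
    }

T = np.arange(12, dtype=float).reshape(6, 2)        # a toy time series of 6 vectors
report = build_report(T, anomalous_indices=[2, 5])
print(report["abnormal_count"], report["normal_count"])  # -> 2 4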
The data anomaly identification method based on the self-encoder in the embodiment of the application can also be applied to the field of block chains, for example, the data such as the self-encoder integrated framework is stored on the block chains. By storing and managing the self-encoder integrated framework by using the block chain, the security and the non-tamper property of the self-encoder integrated framework can be effectively ensured.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism and an encryption algorithm. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The blockchain underlying platform may include processing modules such as user management, basic services, smart contracts and operation monitoring. The user management module is responsible for the identity information management of all blockchain participants, including the generation and maintenance of public and private keys (account management), key management, and maintenance of the correspondence between users' real identities and blockchain addresses (authority management); when authorized, it supervises and audits the transactions of certain real identities and provides rule configuration for risk control (risk-control audit). The basic service module is deployed on all blockchain node devices and is used to verify the validity of service requests and to record valid requests to storage after consensus is reached; for a new service request, the basic service first performs interface adaptation, parsing and authentication (interface adaptation), then encrypts the service information through a consensus algorithm (consensus management), transmits it completely and consistently to the shared ledger (network communication) after encryption, and records and stores it. The smart contract module is responsible for contract registration and issuance, contract triggering and contract execution; developers can define contract logic through a programming language and publish it to the blockchain (contract registration), the contract is triggered and executed by keys or other events according to the logic of its clauses, and the module also provides functions for upgrading and cancelling contracts. The operation monitoring module is mainly responsible for deployment, configuration modification, contract setting and cloud adaptation during product release, as well as visual output of real-time status during product operation, such as alarms, monitoring of network conditions and monitoring of node device health status.
Referring to fig. 2, an embodiment of the present application further provides an apparatus for identifying data anomalies based on an autoencoder, including:
the receiving module 1 is used for receiving an input time sequence to be detected;
the training module 2 is configured to perform integrated training processing on a specified number of pre-generated sparsely connected self-encoders according to a preset rule based on the time sequence to generate a corresponding self-encoder integrated frame, where the sparsely connected self-encoders are generated by respectively performing unit connection deletion processing on the specified number of self-encoders based on the recurrent neural network (a minimal sketch of this connection deletion is given after this module list);
the calculating module 3 is used for calculating an abnormal score value corresponding to each vector contained in the time sequence through the self-encoder integrated framework;
and the identification module 4 is used for identifying whether an abnormal data value exists in the time sequence according to the abnormal score value.
In this embodiment, the implementation processes of the functions and actions of the receiving module, the training module, the calculating module and the identifying module in the data anomaly identification apparatus based on the self-encoder are specifically described in the implementation processes corresponding to steps S1 to S4 in the data anomaly identification method based on the self-encoder, and are not described herein again.
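As noted for the training module, the sparsely connected self-encoders are obtained by deleting unit connections of self-encoders based on the recurrent neural network. The following is a minimal Python sketch of one possible reading of that unit connection deletion, in which a fixed random mask removes a fraction of the recurrent connections of a plain RNN cell; the masking scheme, the removal ratio and the dimensions are illustrative assumptions, since the embodiment does not fix a particular deletion rule here.

    # Minimal sketch of "unit connection deletion" on a recurrent cell.
    # The random-mask scheme and drop_ratio are assumptions for illustration.
    import numpy as np

    def make_sparse_rnn_cell(input_dim: int, hidden_dim: int,
                             drop_ratio: float = 0.3, seed: int = 0):
        rng = np.random.default_rng(seed)
        W_in = rng.standard_normal((hidden_dim, input_dim)) * 0.1
        W_rec = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
        # Fixed binary mask: a fraction of recurrent unit connections is removed once, up front.
        mask = (rng.random((hidden_dim, hidden_dim)) > drop_ratio).astype(float)
        W_rec *= mask

        def step(x_t: np.ndarray, h_prev: np.ndarray) -> np.ndarray:
            return np.tanh(W_in @ x_t + W_rec @ h_prev)

        return step

    step = make_sparse_rnn_cell(input_dim=1, hidden_dim=8)
    h = np.zeros(8)
    for x_t in [0.1, 0.2, 0.15]:        # feed a short time series through the sparse cell
        h = step(np.array([x_t]), h)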
Further, in an embodiment of the present application, the training module includes:
a first obtaining unit, configured to obtain all first vectors included in the time series; and the number of the first and second groups,
a second obtaining unit, configured to obtain a one-to-one corresponding first reconstruction vector generated by each self-encoder of the sparse connection based on each first vector;
a first generating unit, configured to generate a corresponding first objective function based on the first vector and the first reconstruction vector;
a first training unit, configured to train each of the sparsely connected self-encoders based on the first objective function, respectively, to obtain trained first self-encoders, where the number of the first self-encoders is the same as the number of the sparsely connected self-encoders;
the first processing unit is used for performing integrated processing on all the first self-encoders to generate corresponding independent frames, wherein the independent frames contain a specified number of the first self-encoders, and interaction does not occur between the first self-encoders;
a first determining unit for determining the independent frame as the self-encoder integrated frame.
In this embodiment, the implementation process of the functions and actions of the first obtaining unit, the second obtaining unit, the first generating unit, the first training unit, the first processing unit and the first determining unit in the data anomaly identification device based on the self-encoder is specifically described in the implementation process corresponding to steps S200 to S205 in the data anomaly identification method based on the self-encoder, and is not described herein again.
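The following is a minimal Python sketch of the independent framework described by these units: each self-encoder is trained separately on its own reconstruction objective (the first objective function), and the trained first self-encoders are simply collected together without any interaction. The GRU-based architecture, the optimizer and the hyperparameters are illustrative assumptions.

    # Minimal sketch of independent training: one reconstruction objective per self-encoder,
    # no interaction between the trained encoders. Architecture/hyperparameters are assumptions.
    import torch
    import torch.nn as nn

    class GRUAutoencoder(nn.Module):
        def __init__(self, n_features: int = 1, hidden: int = 16):
            super().__init__()
            self.encoder = nn.GRU(n_features, hidden, batch_first=True)
            self.decoder = nn.GRU(hidden, hidden, batch_first=True)
            self.out = nn.Linear(hidden, n_features)

        def forward(self, x):                   # x: (batch, time, n_features)
            _, h = self.encoder(x)              # h: (1, batch, hidden)
            z = h.transpose(0, 1).repeat(1, x.size(1), 1)
            y, _ = self.decoder(z)
            return self.out(y)                  # reconstruction of x

    def train_independent_framework(series, n_encoders: int = 3, epochs: int = 50):
        x = torch.tensor(series, dtype=torch.float32).view(1, -1, 1)
        framework = []
        for _ in range(n_encoders):
            ae = GRUAutoencoder()
            opt = torch.optim.Adam(ae.parameters(), lr=1e-2)
            for _ in range(epochs):             # first objective: reconstruction error
                opt.zero_grad()
                loss = nn.functional.mse_loss(ae(x), x)
                loss.backward()
                opt.step()
            framework.append(ae)                # independent frame: no interaction between encoders
        return framework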
Further, in an embodiment of the present application, the training module includes:
a third obtaining unit, configured to obtain a preset shared layer, where the shared layer includes a shared hidden state;
the second processing unit is used for carrying out weight sharing processing on all the sparsely connected self-coders through the sharing layer;
the third processing unit is configured to perform L1 regularization processing on the shared hidden state to obtain a processed shared hidden state;
a fourth obtaining unit, configured to obtain all second vectors included in the time series; and the number of the first and second groups,
a fifth obtaining unit, configured to obtain a one-to-one second reconstruction vector generated by each of the sparsely connected self-encoders based on each of the second vectors;
a second generating unit, configured to generate a corresponding second objective function according to the processed shared hidden state, the second vector, and the second reconstruction vector;
a second training unit, configured to perform joint training on all the sparsely connected self-encoders based on the second objective function to obtain second trained self-encoders, where the number of the second self-encoders is the same as the number of the sparsely connected self-encoders;
the fourth processing unit is configured to perform integrated processing on all the second self-encoders to generate a corresponding shared frame, where the shared frame includes a specified number of the second self-encoders, and there is interaction between the second self-encoders;
a second determining unit, configured to determine the shared frame as the self-encoder integrated frame.
In this embodiment, the implementation processes of the functions and actions of the third obtaining unit, the second processing unit, the third processing unit, the fourth obtaining unit, the fifth obtaining unit, the second generating unit, the second training unit, the fourth processing unit and the second determining unit in the data anomaly identification apparatus based on the self-encoder are specifically described in the implementation processes corresponding to steps S210 to S218 in the data anomaly identification method based on the self-encoder, and are not described herein again.
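The following is a minimal Python sketch of the shared framework described by these units: the self-encoders share one encoding layer, the shared hidden state is L1-regularized, and all of them are trained jointly on a combined objective (the second objective function). The specific architecture, the L1 weight and the optimizer are illustrative assumptions.

    # Minimal sketch of the shared framework: weight sharing through one shared layer,
    # L1 regularization on the shared hidden state, joint training on a combined objective.
    import torch
    import torch.nn as nn

    class SharedFrameworkAE(nn.Module):
        def __init__(self, n_features=1, hidden=16, n_decoders=3):
            super().__init__()
            self.shared_encoder = nn.GRU(n_features, hidden, batch_first=True)   # shared layer
            self.decoders = nn.ModuleList(
                [nn.GRU(hidden, hidden, batch_first=True) for _ in range(n_decoders)]
            )
            self.heads = nn.ModuleList(
                [nn.Linear(hidden, n_features) for _ in range(n_decoders)]
            )

        def forward(self, x):
            _, h = self.shared_encoder(x)                 # shared hidden state
            z = h.transpose(0, 1).repeat(1, x.size(1), 1)
            recons = [head(dec(z)[0]) for dec, head in zip(self.decoders, self.heads)]
            return recons, h

    def joint_train(series, l1_weight=1e-3, epochs=50):
        x = torch.tensor(series, dtype=torch.float32).view(1, -1, 1)
        model = SharedFrameworkAE()
        opt = torch.optim.Adam(model.parameters(), lr=1e-2)
        for _ in range(epochs):
            opt.zero_grad()
            recons, h = model(x)
            # Second objective: sum of reconstruction errors plus L1 penalty on the shared hidden state.
            loss = sum(nn.functional.mse_loss(r, x) for r in recons) + l1_weight * h.abs().sum()
            loss.backward()
            opt.step()
        return model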
Further, in an embodiment of the present application, the calculating module includes:
a first calculation unit, configured to calculate and generate, by each self-encoder included in the self-encoder integrated framework, a reconstruction error corresponding to a specified vector, where the specified vector is any one of all vectors included in the time series;
a second calculation unit for calculating the median of all the reconstruction errors;
a third determination unit configured to determine the median as a specified abnormality score value corresponding to the specified vector in the time series.
In this embodiment, the implementation processes of the functions and functions of the first calculating unit, the second calculating unit and the third determining unit in the data abnormality identification device based on the self-encoder are specifically described in the implementation processes corresponding to steps S300 to S302 in the data abnormality identification method based on the self-encoder, and are not described herein again.
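The following is a minimal Python sketch of the score computation described by these units: each self-encoder in the integrated framework contributes one reconstruction error for the specified vector, and the median of those errors is taken as the anomaly score value of that vector. The example error values are illustrative only.

    # Minimal sketch of the anomaly score: median of the per-encoder reconstruction errors.
    import numpy as np

    def anomaly_score(errors_per_encoder: list) -> float:
        """errors_per_encoder: reconstruction errors of one vector, one per self-encoder."""
        return float(np.median(errors_per_encoder))

    # e.g. three self-encoders reconstructing the same specified vector
    print(anomaly_score([0.02, 0.85, 0.04]))   # the median damps a single outlying error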
Further, in an embodiment of the application, the first calculating unit includes:
a processing subunit, configured to perform reconstruction processing on the time sequence through a specific self-encoder to obtain a specific reconstructed time sequence corresponding to the time sequence, where the specific self-encoder is any one of all self-encoders included in the self-encoder integrated frame;
an extraction subunit, configured to extract a specific reconstruction vector corresponding to the specified vector from the specific reconstruction time series;
and the calculating subunit is used for calculating a specific reconstruction error corresponding to the specified vector according to the specified vector and the specific reconstruction vector.
In this embodiment, the implementation processes of the functions and functions of the processing subunit, the extracting subunit, and the calculating subunit in the data abnormality identification device based on the self-encoder are specifically described in the implementation processes corresponding to steps S3000 to S3002 in the data abnormality identification method based on the self-encoder, and are not described herein again.
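The following is a minimal Python sketch of the per-encoder reconstruction error: the whole time series is reconstructed by one specific self-encoder, the reconstruction at the specified position is extracted, and the error between the specified vector and its specific reconstruction vector is computed. The squared Euclidean distance is an illustrative choice of error measure; the embodiment does not fix the exact distance here.

    # Minimal sketch of the specific reconstruction error for one specified vector.
    # The squared Euclidean distance is an assumed error measure.
    import numpy as np

    def reconstruction_error(series: np.ndarray, reconstructed: np.ndarray, index: int) -> float:
        specified = series[index]            # specified vector
        specific = reconstructed[index]      # specific reconstruction vector
        return float(np.sum((specified - specific) ** 2))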
Further, in an embodiment of the present application, the identification module includes:
a sixth obtaining unit, configured to obtain a preset abnormal threshold;
a judging unit configured to judge whether there is a specified score value having a value larger than the abnormality threshold value among all the abnormality score values;
the first screening unit is used for screening out the specified score value from all the abnormal score values if such a specified score value exists;
a search unit configured to search for a third vector corresponding to the designated score value from the time series;
a fourth determination unit configured to determine the third vector as the abnormal data value.
In this embodiment, the implementation processes of the functions and functions of the sixth obtaining unit, the judging unit, the first screening unit, the searching unit and the fourth determining unit in the data abnormality identification device based on the self-encoder are specifically described in the implementation processes corresponding to steps S400 to S404 in the data abnormality identification method based on the self-encoder, and are not described herein again.
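The following is a minimal Python sketch of the identification step described by these units: each anomaly score value is compared with the preset abnormal threshold, and the positions whose scores exceed the threshold are returned as the locations of the third vectors, i.e. the abnormal data values. The threshold value in the usage line is an illustrative assumption.

    # Minimal sketch of threshold-based identification of abnormal data values.
    import numpy as np

    def identify_anomalies(scores: np.ndarray, threshold: float) -> list:
        """Return the positions whose anomaly score value exceeds the preset threshold."""
        return [i for i, s in enumerate(scores) if s > threshold]

    print(identify_anomalies(np.array([0.03, 0.05, 0.92, 0.04]), threshold=0.5))   # -> [2]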
Further, in an embodiment of the present application, the identification module includes:
a second screening unit, configured to screen out a fourth vector except the third vector from the time series;
a marking unit for marking the fourth vector as a normal data value;
a seventh obtaining unit configured to obtain a first number corresponding to the third vector; and the number of the first and second groups,
an eighth acquiring unit configured to acquire a second number corresponding to the fourth vector;
a third generating unit, configured to generate an abnormal analysis report corresponding to the time series according to the abnormal data value, the first quantity, the normal data value and the second quantity;
and the display unit is used for displaying the abnormal analysis report.
In this embodiment, the implementation processes of the functions and functions of the second screening unit, the marking unit, the seventh obtaining unit, the eighth obtaining unit, the third generating unit and the displaying unit in the data abnormality identification device based on the self-encoder are specifically described in the implementation processes corresponding to steps S405 to S410 in the data abnormality identification method based on the self-encoder, and are not described herein again.
Referring to fig. 3, an embodiment of the present application further provides a computer device, which may be a server and whose internal structure may be as shown in fig. 3. The computer device comprises a processor, a memory, a network interface, a display screen, an input device and a database which are connected through a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database, and the internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data such as the time series to be detected, the sparsely connected self-encoders, the self-encoder integrated framework, the anomaly score values and the abnormal data values. The network interface of the computer device is used for communicating with an external terminal through a network connection. The display screen of the computer device is an image-and-text output device used for converting digital signals into optical signals so that characters and figures can be displayed on the screen. The input device of the computer device is the main device for information exchange between the computer and a user or other equipment, and is used for inputting data, instructions, mark information and the like into the computer. The computer program, when executed by the processor, implements the data anomaly identification method based on the self-encoder.
The processor executes the data anomaly identification method based on the self-encoder, which comprises the following steps:
receiving an input time sequence to be detected;
based on the time sequence, performing integrated training processing on a preset number of pre-generated sparsely connected self-encoders according to a preset rule to generate a corresponding self-encoder integrated framework, wherein the sparsely connected self-encoders are generated in advance by respectively performing unit connection deletion processing on the preset number of self-encoders based on the recurrent neural network;
calculating an abnormal score value corresponding to each vector contained in the time sequence through the self-encoder integration framework;
and identifying whether an abnormal data value exists in the time sequence or not according to the abnormal score value.
Those skilled in the art will appreciate that the structure shown in fig. 3 is only a block diagram of a part of the structure related to the present application, and does not constitute a limitation to the apparatus and the computer device to which the present application is applied.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a data anomaly identification method based on a self-encoder, and specifically includes:
receiving an input time sequence to be detected;
based on the time sequence, performing integrated training processing on a preset number of pre-generated sparsely connected self-encoders according to a preset rule to generate a corresponding self-encoder integrated framework, wherein the sparsely connected self-encoders are generated in advance by respectively performing unit connection deletion processing on the preset number of self-encoders based on the recurrent neural network;
calculating an abnormal score value corresponding to each vector contained in the time sequence through the self-encoder integration framework;
and identifying whether an abnormal data value exists in the time sequence or not according to the abnormal score value.
In summary, the data anomaly identification method, apparatus, computer device and storage medium based on the self-encoder provided in the embodiments of the present application differ from existing anomaly detection methods in that a self-encoder integrated framework is used to perform data anomaly identification on a time series. When an input time series to be detected is received, sparsely connected self-encoders, generated in advance by improving the original recurrent-neural-network-based self-encoders, are obtained first; integrated training processing based on the time series is then performed on these pre-generated sparsely connected self-encoders to produce a self-encoder integrated framework that can be used for identifying time-series data values. The framework is used to calculate an anomaly score value corresponding to each vector contained in the time series, and whether an abnormal data value exists in the time series can then be identified quickly and accurately according to the anomaly score values. The method and the device effectively improve the identification accuracy of abnormal data values in a time series and provide higher processing efficiency for identifying such values.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by instructing relevant hardware through a computer program, which may be stored on a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, apparatus, article or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article or method that includes the element.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims (10)

1. A data anomaly identification method based on an auto-encoder is characterized by comprising the following steps:
receiving an input time sequence to be detected;
based on the time sequence, performing integrated training processing on a preset number of pre-generated sparsely connected self-encoders according to a preset rule to generate a corresponding self-encoder integrated framework, wherein the sparsely connected self-encoders are generated in advance by respectively performing unit connection deletion processing on the preset number of self-encoders based on the recurrent neural network;
calculating an abnormal score value corresponding to each vector contained in the time sequence through the self-encoder integration framework;
and identifying whether an abnormal data value exists in the time sequence or not according to the abnormal score value.
2. The self-encoder-based data anomaly identification method according to claim 1, wherein the step of performing integrated training processing on a preset number of sparsely connected self-encoders according to a preset rule based on the time series to generate a corresponding self-encoder integrated frame comprises:
acquiring all first vectors contained in the time sequence; and the number of the first and second groups,
acquiring a first reconstruction vector which is generated by each self-encoder of the sparse connection based on each first vector and corresponds to one;
generating a corresponding first objective function based on the first vector and the first reconstruction vector;
training each self-encoder in sparse connection based on the first objective function to obtain a trained first self-encoder, wherein the number of the first self-encoders is the same as that of the self-encoders in sparse connection;
performing integrated processing on all the first self-encoders to generate corresponding independent frames, wherein the independent frames contain a specified number of the first self-encoders, and interaction is not generated between the first self-encoders;
determining the independent framework as the self-encoder integrated framework.
3. The self-encoder-based data anomaly identification method according to claim 1, wherein the step of performing integrated training processing on a preset number of sparsely connected self-encoders according to a preset rule based on the time series to generate a corresponding self-encoder integrated frame comprises:
acquiring a preset sharing layer, wherein the sharing layer comprises a sharing hidden state;
carrying out weight sharing processing on all the sparsely connected self-coders through the sharing layer;
performing L1 regularization processing on the shared hidden state to obtain a processed shared hidden state;
acquiring all second vectors contained in the time sequence; and the number of the first and second groups,
acquiring a one-to-one corresponding second reconstruction vector generated by each sparsely connected self-encoder based on each second vector;
generating a corresponding second objective function according to the processed shared hidden state, the second vector and the second reconstruction vector;
performing joint training on all the self-encoders in sparse connection based on the second objective function to obtain a trained second self-encoder, wherein the number of the second self-encoders is the same as that of the self-encoders in sparse connection;
performing integrated processing on all the second self-encoders to generate a corresponding shared frame, wherein the shared frame contains a specified number of the second self-encoders, and interaction exists between the second self-encoders;
determining the shared frame as the self-encoder integrated frame.
4. The self-encoder based data anomaly identification method according to claim 1, wherein the step of calculating, by the self-encoder integration framework, the anomaly score value corresponding to each vector included in the time series comprises:
calculating and generating a reconstruction error corresponding to a specified vector by each self-encoder contained in the self-encoder integrated framework, wherein the specified vector is any one of all vectors contained in the time sequence;
calculating the median of all the reconstruction errors;
determining the median as a specified anomaly score value corresponding to the specified vector in the time series.
5. The self-encoder based data anomaly identification method according to claim 4, wherein said step of generating a reconstruction error corresponding to a given vector by each self-encoder calculation included in said self-encoder integration framework comprises:
reconstructing the time sequence through a specific self-encoder to obtain a specific reconstructed time sequence corresponding to the time sequence, wherein the specific self-encoder is any one of all self-encoders included in the self-encoder integrated frame;
extracting a specific reconstruction vector corresponding to the specified vector from the specific reconstruction time series;
and calculating a specific reconstruction error corresponding to the specified vector according to the specified vector and the specific reconstruction vector.
6. The self-encoder based data anomaly identification method according to claim 1, wherein said step of identifying whether an anomalous data value exists in said time series according to said anomaly score value comprises:
acquiring a preset abnormal threshold;
judging whether a designated score value with a numerical value larger than the abnormality threshold value exists in all the abnormality score values;
if so, screening the designated score value from all the abnormal score values;
finding a third vector corresponding to the specified score value from the time series;
determining the third vector as the anomalous data value.
7. The self-encoder based data anomaly identification method according to claim 6, wherein said step of determining said third vector as said anomalous data value is followed by:
screening out a fourth vector except the third vector from the time sequence;
marking the fourth vector as a normal data value;
obtaining a first number corresponding to the third vector; and the number of the first and second groups,
acquiring a second quantity corresponding to the fourth vector;
generating an abnormal analysis report corresponding to the time sequence according to the abnormal data value, the first quantity, the normal data value and the second quantity;
and displaying the abnormal analysis report.
8. A data anomaly identification device based on a self-encoder, comprising:
the receiving module is used for receiving an input time sequence to be detected;
the training module is used for performing integrated training processing on a preset number of pre-generated sparsely connected self-encoders according to a preset rule based on the time sequence to generate a corresponding self-encoder integrated frame, wherein the sparsely connected self-encoders are generated in advance by respectively performing unit connection deletion processing on the preset number of self-encoders based on the recurrent neural network;
the computing module is used for computing an abnormal score value corresponding to each vector contained in the time sequence through the self-encoder integrated framework;
and the identification module is used for identifying whether an abnormal data value exists in the time sequence or not according to the abnormal score value.
9. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202011242143.5A 2020-11-09 2020-11-09 Data anomaly identification method and device based on self-encoder and computer equipment Active CN112329865B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011242143.5A CN112329865B (en) 2020-11-09 2020-11-09 Data anomaly identification method and device based on self-encoder and computer equipment
PCT/CN2021/097550 WO2022095434A1 (en) 2020-11-09 2021-05-31 Auto-encoder-based data anomaly identification method and apparatus and computer device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011242143.5A CN112329865B (en) 2020-11-09 2020-11-09 Data anomaly identification method and device based on self-encoder and computer equipment

Publications (2)

Publication Number Publication Date
CN112329865A true CN112329865A (en) 2021-02-05
CN112329865B CN112329865B (en) 2023-09-08

Family

ID=74316541

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011242143.5A Active CN112329865B (en) 2020-11-09 2020-11-09 Data anomaly identification method and device based on self-encoder and computer equipment

Country Status (2)

Country Link
CN (1) CN112329865B (en)
WO (1) WO2022095434A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112839059A (en) * 2021-02-22 2021-05-25 北京六方云信息技术有限公司 WEB intrusion detection processing method and device and electronic equipment
CN113114529A (en) * 2021-03-25 2021-07-13 清华大学 KPI (Key Performance indicator) abnormity detection method and device based on condition variation automatic encoder
CN113671917A (en) * 2021-08-19 2021-11-19 中国科学院自动化研究所 Detection method, system and equipment for abnormal state of multi-modal industrial process
CN114066435A (en) * 2021-11-10 2022-02-18 广东工业大学 Block chain illegal address detection method and system
WO2022095434A1 (en) * 2020-11-09 2022-05-12 平安科技(深圳)有限公司 Auto-encoder-based data anomaly identification method and apparatus and computer device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116165353B (en) * 2023-04-26 2023-07-25 江西拓荒者科技有限公司 Industrial pollutant monitoring data processing method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130204553A1 (en) * 2011-08-03 2013-08-08 President And Fellows Of Harvard College System and method for detecting integrated circuit anomalies
CN107480777A (en) * 2017-08-28 2017-12-15 北京师范大学 Sparse self-encoding encoder Fast Training method based on pseudo- reversal learning
CN109902564A (en) * 2019-01-17 2019-06-18 杭州电子科技大学 A kind of accident detection method based on the sparse autoencoder network of structural similarity
CN110119447A (en) * 2019-04-26 2019-08-13 平安科技(深圳)有限公司 From coding Processing with Neural Network method, apparatus, computer equipment and storage medium
CN111178523A (en) * 2019-08-02 2020-05-19 腾讯科技(深圳)有限公司 Behavior detection method and device, electronic equipment and storage medium
CN111724074A (en) * 2020-06-23 2020-09-29 华中科技大学 Pavement lesion detection early warning method and system based on deep learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11509671B2 (en) * 2017-06-09 2022-11-22 British Telecommunications Public Limited Company Anomaly detection in computer networks
CN107798340B (en) * 2017-09-29 2018-10-26 中国地质大学(武汉) Multiple Geochemical abnormality recognition method based on the more self-encoding encoders of space constraint
CN112329865B (en) * 2020-11-09 2023-09-08 平安科技(深圳)有限公司 Data anomaly identification method and device based on self-encoder and computer equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130204553A1 (en) * 2011-08-03 2013-08-08 President And Fellows Of Harvard College System and method for detecting integrated circuit anomalies
CN107480777A (en) * 2017-08-28 2017-12-15 北京师范大学 Sparse self-encoding encoder Fast Training method based on pseudo- reversal learning
CN109902564A (en) * 2019-01-17 2019-06-18 杭州电子科技大学 A kind of accident detection method based on the sparse autoencoder network of structural similarity
CN110119447A (en) * 2019-04-26 2019-08-13 平安科技(深圳)有限公司 From coding Processing with Neural Network method, apparatus, computer equipment and storage medium
CN111178523A (en) * 2019-08-02 2020-05-19 腾讯科技(深圳)有限公司 Behavior detection method and device, electronic equipment and storage medium
CN111724074A (en) * 2020-06-23 2020-09-29 华中科技大学 Pavement lesion detection early warning method and system based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
袁静 (Yuan Jing); 章毓晋 (Zhang Yujin): "Application of sparse denoising autoencoder network fusing gradient difference information in abnormal behavior detection", Acta Automatica Sinica, vol. 43, no. 04, pages 114-120 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022095434A1 (en) * 2020-11-09 2022-05-12 平安科技(深圳)有限公司 Auto-encoder-based data anomaly identification method and apparatus and computer device
CN112839059A (en) * 2021-02-22 2021-05-25 北京六方云信息技术有限公司 WEB intrusion detection processing method and device and electronic equipment
CN112839059B (en) * 2021-02-22 2022-08-30 北京六方云信息技术有限公司 WEB intrusion detection self-adaptive alarm filtering processing method and device and electronic equipment
CN113114529A (en) * 2021-03-25 2021-07-13 清华大学 KPI (Key Performance indicator) abnormity detection method and device based on condition variation automatic encoder
CN113114529B (en) * 2021-03-25 2022-05-24 清华大学 KPI (Key Performance indicator) anomaly detection method and device based on condition variation automatic encoder and computer storage medium
CN113671917A (en) * 2021-08-19 2021-11-19 中国科学院自动化研究所 Detection method, system and equipment for abnormal state of multi-modal industrial process
CN113671917B (en) * 2021-08-19 2022-08-02 中国科学院自动化研究所 Detection method, system and equipment for abnormal state of multi-modal industrial process
CN114066435A (en) * 2021-11-10 2022-02-18 广东工业大学 Block chain illegal address detection method and system

Also Published As

Publication number Publication date
WO2022095434A1 (en) 2022-05-12
CN112329865B (en) 2023-09-08

Similar Documents

Publication Publication Date Title
CN112329865B (en) Data anomaly identification method and device based on self-encoder and computer equipment
CN113516297B (en) Prediction method and device based on decision tree model and computer equipment
CN111176990A (en) Test data generation method and device based on data decision and computer equipment
CN112016279A (en) Electronic medical record structuring method and device, computer equipment and storage medium
CN112464117A (en) Request processing method and device, computer equipment and storage medium
CN112036172B (en) Entity identification method and device based on abbreviated data of model and computer equipment
CN112131888A (en) Method, device and equipment for analyzing semantic emotion and storage medium
CN112163131A (en) Configuration method and device of business data query platform, computer equipment and medium
CN111506710B (en) Information sending method and device based on rumor prediction model and computer equipment
CN113918526A (en) Log processing method and device, computer equipment and storage medium
CN113642039A (en) Configuration method and device of document template, computer equipment and storage medium
CN110011990A (en) Intranet security threatens intelligent analysis method
CN113807728A (en) Performance assessment method, device, equipment and storage medium based on neural network
CN112036749A (en) Method and device for identifying risk user based on medical data and computer equipment
CN113656588B (en) Knowledge graph-based data code matching method, device, equipment and storage medium
CN114978968A (en) Micro-service anomaly detection method and device, computer equipment and storage medium
CN114817055A (en) Regression testing method and device based on interface, computer equipment and storage medium
CN111488585A (en) Attack vector generation method based on deep learning
CN113986581A (en) Data aggregation processing method and device, computer equipment and storage medium
CN113672654A (en) Data query method and device, computer equipment and storage medium
CN111679953B (en) Fault node identification method, device, equipment and medium based on artificial intelligence
CN113327037A (en) Model-based risk identification method and device, computer equipment and storage medium
CN113051372A (en) Material data processing method and device, computer equipment and storage medium
CN113449718A (en) Method and device for training key point positioning model and computer equipment
CN113077185B (en) Workload evaluation method, workload evaluation device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40040481

Country of ref document: HK

GR01 Patent grant