CN111931587A - Video anomaly detection method based on interpretable space-time self-encoder - Google Patents

Video anomaly detection method based on interpretable space-time self-encoder

Info

Publication number
CN111931587A
Authority
CN
China
Prior art keywords
interpretable
encoder
video
level
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010678292.XA
Other languages
Chinese (zh)
Other versions
CN111931587B (en)
Inventor
丰江帆
梁渝坤
熊伟
张莉
李皓辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aerospace Guosheng Technology Co ltd
Shenzhen Hongyue Enterprise Management Consulting Co ltd
Original Assignee
Chongqing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN202010678292.XA
Publication of CN111931587A
Application granted
Publication of CN111931587B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/048 Activation functions
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video anomaly detection method based on an interpretable spatio-temporal autoencoder. The method comprises: preprocessing a video; performing feature learning on the preprocessed data with a deep learning model based on an interpretable spatio-temporal autoencoder to obtain a reconstructed video sequence; calculating a regularity score for the reconstructed video sequence; and comparing the calculated regularity score with a predefined threshold to judge whether an anomaly has occurred.

Description

Video anomaly detection method based on interpretable space-time self-encoder
Technical Field
The invention belongs to the technical field of video processing and relates to a method for detecting anomalies in video, in particular to a video anomaly detection method based on an interpretable spatio-temporal autoencoder.
Background
With the growing deployment of video surveillance equipment and the increasing importance attached to security work, the demand for analysing surveillance video, in particular for automatically detecting abnormal events or behaviors in video, has become ever more urgent.
In recent years, many researchers have contributed to this field. For example, Xu et al. propose a deep model for abnormal event detection that uses a stacked autoencoder for feature learning and a linear classifier for event classification. Tian Wang et al. propose an algorithm for detecting abnormal events in video streams, based on the orientation histogram of the optical flow descriptor and a one-class support vector machine (SVM) classifier, and demonstrate its effectiveness on a large amount of data. Mahmudul Hasan et al. address the problem of extracting valid motion features from long video sequences by learning a generative model of regular motion patterns, termed regularity. Yong Shean Chong et al. propose an effective video anomaly detection method applicable to the spatio-temporal structure of videos, including crowded scenes.
The above methods solve the anomaly detection problem to some extent, but the prior art overlooks one issue: the internal logic that explains the detection process. Without making the feature-learning process transparent, we cannot fully trust the accuracy of the results, nor take the detection result as the basis for a final decision. The interpretability of deep learning therefore needs to be combined with the anomaly detection method; this greatly improves the reliability of anomaly detection, strengthens the judgment on which the final decision is based, and thus enhances the reliability and accuracy of a security system.
Disclosure of Invention
The invention aims to provide a video anomaly detection method, in particular a video anomaly detection method based on an interpretable spatio-temporal autoencoder.
In order to achieve this purpose, the invention provides the following technical solution:
a video anomaly detection method based on an interpretable spatio-temporal autoencoder, comprising the steps of:
preprocessing a video;
performing feature learning on the processed data, wherein the feature learning comprises a deep learning model based on an interpretable space-time self-encoder, and acquiring a reconstructed video sequence;
a step of calculating a regularity score for the reconstructed video sequence;
and comparing the calculated regularity score with a predefined threshold value, and judging whether an abnormality occurs.
Preferably, the video anomaly detection method further comprises a step of visualizing the convolution kernels, wherein visualizing a convolution kernel comprises calculating the receptive field of its neural activation and enlarging it to image resolution.
Preferably, the deep learning based on the interpretable spatio-temporal autoencoder comprises processing the preprocessed video sequence successively through a spatial encoder, a temporal autoencoder and a spatial decoder, wherein the spatial encoder consists of at least 2 interpretable convolutional layers, the temporal autoencoder consists of at least 3 interpretable convolutional LSTM layers, and the spatial decoder consists of at least 2 deconvolution layers.
Preferably, of the 2 interpretable convolutional layers of the spatial encoder, the first layer uses 11 × 11 kernels with a stride of 4 and contains 128 convolution kernels, and the second layer uses 5 × 5 kernels with a stride of 2 and contains 64 convolution kernels.
Preferably, of the 2 deconvolution layers of the spatial decoder, the first layer uses 5 × 5 kernels with a stride of 2 and contains 128 convolution kernels, and the second layer uses 11 × 11 kernels with a stride of 4 and contains 1 convolution kernel.
Preferably, the interpretable convolutional layer and/or the interpretable convolutional LSTM layer contains at least one mask, and a particular mask is selected from the at least one mask to filter out noise activations.
Preferably, the particular mask is selected from the at least one mask by calculating the optimal activation position on the object part.
The method applies interpretable deep learning to video anomaly detection: by visualizing the representations of the convolutional neural network and the convolution kernels in its convolutional layers, the network is interpreted directly and the reliability of the anomaly detection result is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
In order to make the object, technical solution and beneficial effects of the invention clearer, the following drawings are provided for explanation:
FIG. 1 is a block diagram of a conventional video anomaly detection method;
FIG. 2 is a block diagram of a video anomaly detection method according to the present invention;
FIG. 3 is a block diagram of an interpretable spatio-temporal autoencoder model according to the present invention;
FIG. 4 is a block diagram of a method for implementing an interpretable convolutional layer according to the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in FIG. 1, the flow diagram 100 of a conventional video anomaly detection method includes blocks 101 to 107. In block 101, an input video is preprocessed, which includes a series of conventional operations such as decomposing the video into a sequence of frames, converting the frames to grayscale, and reducing dimensionality. The preprocessed video frames are then input into block 103 for feature learning, followed by calculating regularity scores in block 105 and detecting abnormal events in block 107.
FIG. 2 shows the flow chart 200 of the anomaly detection method based on the interpretable spatio-temporal autoencoder according to the present invention. In block 201, the video data is first preprocessed to convert the video into input acceptable to the interpretable spatio-temporal autoencoder model: the original video is decomposed into a frame-by-frame sequence, the frames are set to a uniform size, for example 224 × 224 pixels, and the images are converted to grayscale to reduce dimensionality. In block 203, a deep learning method based on the interpretable spatio-temporal autoencoder is introduced for feature learning, including obtaining a reconstructed video sequence through the spatio-temporal autoencoder and visualizing the semantics of the convolution kernels in the interpretable convolutional layers; the specific method is described further with reference to FIG. 3. Then, in optional block 205, after model training is complete, the semantics of a convolution kernel are visualized by computing the receptive field of its neural activation and enlarging it to image resolution. In block 207 the regularity score of each video frame is calculated, and in block 209 the anomaly detection result is finally obtained.
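By way of illustration only (the patent discloses no code), the preprocessing of block 201 might be sketched in Python with OpenCV as follows; the function name and the normalization to [0, 1] are assumptions:

```python
import cv2
import numpy as np

def preprocess_video(path, size=224):
    """Decompose a video into uniform grayscale frames (illustrative sketch)."""
    frames = []
    cap = cv2.VideoCapture(path)
    while True:
        ok, frame = cap.read()
        if not ok:                                      # end of video
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # convert to grayscale
        gray = cv2.resize(gray, (size, size))           # uniform 224 x 224 frames
        frames.append(gray.astype(np.float32) / 255.0)  # assumed scaling to [0, 1]
    cap.release()
    return np.stack(frames)                             # shape: (T, 224, 224)
```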
As the comparison between FIG. 2 and FIG. 1 shows, the present invention introduces a deep learning method based on the interpretable spatio-temporal autoencoder for feature learning and, optionally, visualizes the semantics of the convolution kernels in the interpretable convolutional layers.
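The optional visualization of block 205 can be approximated as in the following sketch, which upsamples a kernel's activation map to image resolution as a stand-in for an exact receptive-field computation (an assumption; PyTorch and the function name are illustrative):

```python
import torch.nn.functional as F

def visualize_kernel(feature_map, image_size=224):
    """Enlarge one convolution kernel's activation map to image resolution.

    feature_map: tensor of shape (1, 1, h, w), the activation of a single
    kernel after ReLU. The returned (image_size, image_size) map can be
    overlaid on the input frame to show which region activated the kernel.
    """
    heat = F.interpolate(feature_map, size=(image_size, image_size),
                         mode='bilinear', align_corners=False)
    return heat[0, 0]
```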
FIG. 3 shows a block diagram of the interpretable spatio-temporal autoencoder model and its processing flow provided by the present invention, i.e. the processing flow of the feature-learning method of block 203 in FIG. 2. In FIG. 3, the spatio-temporal autoencoder is divided into a spatial encoder (blocks 303-305), a temporal autoencoder (blocks 307-311) and a spatial decoder (blocks 313-315), which together perform feature learning, obtain the reconstructed video sequence, and visualize the semantics of the convolution kernels in the interpretable layers.
First, at block 301, a video sequence is input. The video sequence first passes through the spatial encoder, which consists of at least two interpretable convolutional layers 303 and 305; an ordinary convolutional layer is turned into an interpretable one by adding a mask and setting a specific loss function for each convolution kernel, as shown in detail in FIG. 4. The first interpretable convolutional layer 303 may use 11 × 11 kernels with a stride of 4 and contain 128 convolution kernels, and the second interpretable convolutional layer 305 may use 5 × 5 kernels with a stride of 2 and contain 64 convolution kernels. After the interpretable convolutional layers have learned the spatial features of the video sequence, the encoded spatial structure is input into the temporal autoencoder.
Within the temporal autoencoder, the spatial feature map may be processed by 3 interpretable convolutional LSTM layers 307-311. Compared with the common fully connected LSTM model, the interpretable convolutional LSTM adds convolution operations, so a better spatial feature map can be obtained with far fewer weights. Finally, the spatial feature map is passed through the spatial decoder, namely two deconvolution layers 313-315, to reconstruct the video sequence; the first deconvolution layer 313 uses 5 × 5 kernels with a stride of 2 and contains 128 convolution kernels, and the second deconvolution layer 315 uses 11 × 11 kernels with a stride of 4 and contains only one convolution kernel. At this point, the model training process ends.
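Since the patent publishes no code, the following PyTorch sketch only illustrates the described topology: two convolutional layers (here plain convolutions; the interpretability masks and filter losses described with FIG. 4 are omitted), three convolutional LSTM layers, and two deconvolution layers. The ConvLSTM cell also shows why fewer weights are needed than in a fully connected LSTM: the gates are convolutions, so the weight count is independent of the spatial resolution. All class and parameter names are assumptions.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: the four LSTM gates are computed by one
    convolution over [input, hidden] instead of dense matrix products."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class SpatioTemporalAutoencoder(nn.Module):
    """Spatial encoder -> temporal autoencoder -> spatial decoder (sketch)."""

    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(                                  # blocks 303-305
            nn.Conv2d(1, 128, 11, stride=4), nn.ReLU(),
            nn.Conv2d(128, 64, 5, stride=2), nn.ReLU(),
        )
        self.lstm = nn.ModuleList([ConvLSTMCell(64, 64)            # blocks 307-311
                                   for _ in range(3)])
        self.dec = nn.Sequential(                                  # blocks 313-315
            nn.ConvTranspose2d(64, 128, 5, stride=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 1, 11, stride=4, output_padding=1),
        )

    def forward(self, clip):                     # clip: (B, T, 1, 224, 224)
        states, outs = None, []
        for t in range(clip.shape[1]):           # step through the frame sequence
            x = self.enc(clip[:, t])
            if states is None:                   # zero initial hidden/cell states
                states = [(torch.zeros_like(x), torch.zeros_like(x))
                          for _ in self.lstm]
            for j, cell in enumerate(self.lstm):
                h, c = cell(x, *states[j])
                states[j] = (h, c)
                x = h
            outs.append(self.dec(x))
        return torch.stack(outs, dim=1)          # reconstructed video sequence
```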
To test, the video frames to be examined are input into the trained model, and the obtained reconstructed video sequence is combined with the initial input frames to calculate the regularity score. The regularity score may be calculated according to the prior art, for example as follows:
calculating the reconstruction error of the intensity value I of the pixel at position (x, y) in the t-th frame of the video sequence:
e(x, y, t) = ‖I(x, y, t) − f_W(I(x, y, t))‖₂
where f_W is the learned model of the spatio-temporal autoencoder.
This formula gives the error of a single pixel; the reconstruction error of a frame is obtained by summing the errors of all its pixels:
e(t) = Σ_(x, y) e(x, y, t)
Finally, the regularity score of each frame is calculated:
s(t) = 1 − (e(t) − min_t e(t)) / max_t e(t)
where min_t e(t) denotes the reconstruction error of the frame with the smallest error in the test video, and max_t e(t) denotes that of the frame with the largest error. Through this operation, the regularity score converted from the reconstruction error is limited to the range 0-1. A suitable threshold is then set: when the regularity score of the image sequence falls below the threshold, abnormal behavior has appeared in the video, and an alarm is sent to remind security personnel, so that serious accidents are prevented.
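As an illustration of the scoring step (not the patent's code; NumPy, the squared per-pixel error standing in for the norm above, and the 0.5 threshold are assumptions):

```python
import numpy as np

def regularity_scores(frames, recon):
    """Per-frame regularity scores from reconstruction errors (sketch).

    frames, recon: arrays of shape (T, H, W) holding the input frames and
    the model's reconstruction. Follows the formulas above.
    """
    e = np.sum((frames - recon) ** 2, axis=(1, 2))   # e(t): sum of pixel errors
    s = 1.0 - (e - e.min()) / (e.max() + 1e-8)       # s(t) in [0, 1]
    return s

# Frames whose regularity score falls below a chosen threshold are flagged:
# anomalous_frames = np.where(regularity_scores(frames, recon) < 0.5)[0]
```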
Referring to FIG. 4, the method of implementing the interpretable convolutional layer and the interpretable convolutional LSTM layer provided by the present invention is described. As shown in the figure, compared with the ordinary convolutional layers and ordinary convolutional LSTM layers of the prior art, the interpretable layers of the present invention add a loss on the feature map X of an ordinary convolution kernel f after the ReLU activation layer. X is an n × n matrix. Since the object part corresponding to a convolution kernel may appear at different locations in the image, n² templates are provided for the kernel f, one for each spatial position; each template is also an n × n matrix describing the desired activation profile of the feature map X, and the mask described above is the template selected from among them.
Therefore, as shown in FIG. 4, in the forward propagation of deep learning (from left to right), the input image I first passes through the Conv layer and the ReLU activation layer; then, in the interpretable convolutional layer, a specific template is selected as the mask to filter noise activations out of the feature map X. The mask operation also supports gradient back-propagation during learning. In back-propagation, the loss for the convolution kernel (Loss for filter) pushes the kernel to represent a specific object part of one category and to remain silent on images of other categories. Consequently, each convolution kernel in the interpretable convolutional layer represents only a specific object part instead of being activated jointly by several object parts of the image, which reduces the ambiguity of kernel activation and greatly improves the interpretability of the convolutional layer. The interpretable convolutional LSTM layer also contains many convolution kernels, so a similar operation is performed on them to convert them into interpretable kernels, thereby increasing the interpretability of the convolutional LSTM.
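A sketch may make the template and mask operation concrete. Following the description above, each template peaks at one spatial position and decays away from it, and the mask is the template whose peak coincides with the strongest activation of X; the exact functional form of the templates is an assumption borrowed from interpretable-CNN practice, not taken from the patent:

```python
import torch

def make_templates(n, tau=0.5, beta=4.0):
    """Build n*n candidate templates, one peaked at each position mu; the
    decay tau * max(1 - beta * ||x - mu||_1 / n, -1) is an assumption."""
    grid = torch.stack(torch.meshgrid(torch.arange(n), torch.arange(n),
                                      indexing='ij'), dim=-1).float()
    pos = grid.reshape(-1, 2)                         # all n*n peak positions
    d = torch.cdist(pos, pos, p=1).reshape(-1, n, n)  # L1 distance to each peak
    return tau * torch.clamp(1.0 - beta * d / n, min=-1.0)

def apply_mask(X, templates):
    """Filter noise activations from feature map X (n x n) with the mask,
    i.e. the template whose peak matches the strongest activation of X."""
    mask = templates[torch.argmax(X)]   # select template by peak position
    return torch.relu(X * mask)         # suppress activations off the part
```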
The loss for the convolution kernel is calculated as follows:
Loss_f = −MI(X; T) = −Σ_T p(T) Σ_X p(X|T) log [ p(X|T) / p(X) ]
That is, the loss of the convolution kernel f is the negative mutual information between the feature map X and the templates T. Here X denotes the feature map of the kernel f after the ReLU operation, and T denotes the n² + 1 templates, each of which is an n × n matrix describing the ideal activation profile of the feature map X. The prior probability p(T) is defined as a constant, and p(X|T) is a conditional probability expressing the fitness between the feature map X and the template T.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: ROM, RAM, magnetic or optical disks, and the like.
The above embodiments further illustrate the objects, technical solutions and advantages of the present invention. It should be understood that they are only preferred embodiments of the invention and should not be construed as limiting it; any modifications, equivalent substitutions and improvements made within the spirit and principles of the invention shall fall within its protection scope.

Claims (7)

1. A video anomaly detection method based on an interpretable spatio-temporal autoencoder, comprising the steps of:
preprocessing a video;
performing feature learning on the processed data with a deep learning model based on an interpretable spatio-temporal autoencoder, and acquiring a reconstructed video sequence;
calculating a regularity score for the reconstructed video sequence;
and comparing the calculated regularity score with a predefined threshold to judge whether an anomaly has occurred.
2. The method of claim 1, further characterized by the step of visualizing the convolution kernels, wherein visualizing a convolution kernel includes computing the receptive field of its neural activation and magnifying it to image resolution.
3. The method of claim 1 or 2, further characterized in that the deep learning based on the interpretable spatio-temporal autoencoder comprises processing the preprocessed video sequence successively through a spatial encoder, a temporal autoencoder and a spatial decoder, wherein the spatial encoder consists of at least 2 interpretable convolutional layers, the temporal autoencoder consists of at least 3 interpretable convolutional LSTM layers, and the spatial decoder consists of at least 2 deconvolution layers.
4. The method of claim 3, further characterized in that, of the 2 interpretable convolutional layers of the spatial encoder, the first layer uses 11 × 11 kernels with a stride of 4 and contains 128 convolution kernels, and the second layer uses 5 × 5 kernels with a stride of 2 and contains 64 convolution kernels.
5. The method of claim 3, further characterized in that, of the 2 deconvolution layers of the spatial decoder, the first layer uses 5 × 5 kernels with a stride of 2 and contains 128 convolution kernels, and the second layer uses 11 × 11 kernels with a stride of 4 and contains 1 convolution kernel.
6. The method of any preceding claim, further characterized in that the interpretable convolutional layer and/or the interpretable convolutional LSTM layer contains at least one mask, and a particular mask is selected from the at least one mask to filter noise activations.
7. The method of claim 6, further characterized in that the particular mask is selected from the at least one mask by calculating the optimal activation position on the object part.
CN202010678292.XA 2020-07-15 2020-07-15 Video anomaly detection method based on interpretable space-time self-encoder Active CN111931587B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010678292.XA CN111931587B (en) 2020-07-15 2020-07-15 Video anomaly detection method based on interpretable space-time self-encoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010678292.XA CN111931587B (en) 2020-07-15 2020-07-15 Video anomaly detection method based on interpretable space-time self-encoder

Publications (2)

Publication Number Publication Date
CN111931587A 2020-11-13
CN111931587B 2022-10-25

Family

ID=73312394

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010678292.XA Active CN111931587B (en) 2020-07-15 2020-07-15 Video anomaly detection method based on interpretable space-time self-encoder

Country Status (1)

Country Link
CN (1) CN111931587B (en)


Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190188212A1 (en) * 2016-07-27 2019-06-20 Anomalee Inc. Prioritized detection and classification of clusters of anomalous samples on high-dimensional continuous and mixed discrete/continuous feature spaces
CN107273880A (en) * 2017-07-31 2017-10-20 秦皇岛玥朋科技有限公司 A kind of multi-storied garage safety-protection system and method based on intelligent video monitoring
CN109670446A (en) * 2018-12-20 2019-04-23 泉州装备制造研究所 Anomaly detection method based on linear dynamic system and depth network
CN109615019A (en) * 2018-12-25 2019-04-12 吉林大学 Anomaly detection method based on space-time autocoder
WO2020142483A1 (en) * 2018-12-31 2020-07-09 Futurewei Technologies, Inc. Explicit address signaling in video coding
CN109902562A (en) * 2019-01-16 2019-06-18 重庆邮电大学 A kind of driver's exception attitude monitoring method based on intensified learning
CN109871799A (en) * 2019-02-02 2019-06-11 浙江万里学院 A kind of driver based on deep learning plays the detection method of mobile phone behavior
CN110889328A (en) * 2019-10-21 2020-03-17 大唐软件技术股份有限公司 Method, device, electronic equipment and storage medium for detecting road traffic condition
CN111008570A (en) * 2019-11-11 2020-04-14 电子科技大学 Video understanding method based on compression-excitation pseudo-three-dimensional network
CN111079539A (en) * 2019-11-19 2020-04-28 华南理工大学 Video abnormal behavior detection method based on abnormal tracking
CN111325347A (en) * 2020-02-19 2020-06-23 山东大学 Automatic danger early warning description generation method based on interpretable visual reasoning model
CN111401526A (en) * 2020-03-20 2020-07-10 厦门渊亭信息科技有限公司 Model-universal deep neural network representation visualization method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JEFFERSON RYAN MEDEL et al.: "Anomaly Detection in Video Using Predictive Convolutional Long Short-Term Memory Networks", arXiv *
周培培 et al.: "Crowd abnormal behavior detection and localization in video surveillance" (视频监控中的人群异常行为检测与定位), Acta Optica Sinica (光学学报) *
朱辉辉: "Research on anomaly detection algorithms based on video object analysis in surveillance scenes" (监控场景下基于视频目标分析的异常检测算法研究), China Master's Theses Full-text Database, Information Science and Technology (中国优秀硕士学位论文全文数据库 信息科技辑) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117911930A (en) * 2024-03-15 2024-04-19 释普信息科技(上海)有限公司 Data security early warning method and device based on intelligent video monitoring
CN117911930B (en) * 2024-03-15 2024-06-04 释普信息科技(上海)有限公司 Data security early warning method and device based on intelligent video monitoring

Also Published As

Publication number Publication date
CN111931587B 2022-10-25

Similar Documents

Publication Publication Date Title
CN108805015B (en) Crowd abnormity detection method for weighted convolution self-coding long-short term memory network
CN108256562B (en) Salient target detection method and system based on weak supervision time-space cascade neural network
EP2377044B1 (en) Detecting anomalous events using a long-term memory in a video analysis system
CN111680614A (en) Abnormal behavior detection method based on video monitoring
CN111079539B (en) Video abnormal behavior detection method based on abnormal tracking
CN110232361B (en) Human behavior intention identification method and system based on three-dimensional residual dense network
CN111738054B (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
Chen et al. Adaptive convolution for object detection
CN105913002B (en) The accident detection method of online adaptive under video scene
CN111784624B (en) Target detection method, device, equipment and computer readable storage medium
CN110930378B (en) Emphysema image processing method and system based on low data demand
CN113378775B (en) Video shadow detection and elimination method based on deep learning
CN111709300A (en) Crowd counting method based on video image
CN116092179A (en) Improved Yolox fall detection system
CN114913606A (en) YOLO-based violation detection method for deep learning industrial field production work area
CN111626134A (en) Dense crowd counting method, system and terminal based on hidden density distribution
Prawiro et al. Abnormal event detection in surveillance videos using two-stream decoder
CN115205604A (en) Improved YOLOv 5-based method for detecting wearing of safety protection product in chemical production process
CN114677349B (en) Image segmentation method and system for enhancing edge information of encoding and decoding end and guiding attention
CN111931587B (en) Video anomaly detection method based on interpretable space-time self-encoder
CN113807203A (en) Hyperspectral anomaly detection method based on tensor decomposition network
CN112149596A (en) Abnormal behavior detection method, terminal device and storage medium
CN117333753A (en) Fire detection method based on PD-YOLO
CN115984568A (en) Target detection method in haze environment based on YOLOv3 network
Zhao et al. Layer-wise multi-defect detection for laser powder bed fusion using deep learning algorithm with visual explanation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240529

Address after: 100000 Room 601, 6th floor, building 5, Lianhua yuan, Haidian District, Beijing

Patentee after: Aerospace Guosheng Technology Co.,Ltd.

Country or region after: China

Address before: 518000 1104, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Patentee before: Shenzhen Hongyue Enterprise Management Consulting Co.,Ltd.

Country or region before: China

Effective date of registration: 20240528

Address after: 518000 1104, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Patentee after: Shenzhen Hongyue Enterprise Management Consulting Co.,Ltd.

Country or region after: China

Address before: 400065 Chongwen Road, Nanshan Street, Nanan District, Chongqing

Patentee before: CHONGQING University OF POSTS AND TELECOMMUNICATIONS

Country or region before: China
