CN113705490A - Anomaly detection method based on reconstruction and prediction - Google Patents

Anomaly detection method based on reconstruction and prediction

Info

Publication number
CN113705490A
Authority
CN
China
Prior art keywords
frame
anomaly detection
reconstruction
video sequence
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111016334.4A
Other languages
Chinese (zh)
Other versions
CN113705490B (en)
Inventor
Yuanhong Zhong
Xia Chen
Dong Zhu
Jian Zhang
Yi Yang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN202111016334.4A priority Critical patent/CN113705490B/en
Publication of CN113705490A publication Critical patent/CN113705490A/en
Application granted granted Critical
Publication of CN113705490B publication Critical patent/CN113705490B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of video and image processing, and in particular to an anomaly detection method based on reconstruction and prediction, comprising the following steps: acquiring a test video sequence to be detected; inputting the test video sequence into a pre-trained anomaly detection model, which first extracts the spatial appearance features and temporal motion features of the test video sequence, fuses them into the corresponding spatio-temporal features, obtains the corresponding reconstructed frames from those features, and finally calculates the corresponding anomaly scores from the reconstructed frames; and taking the anomaly score of the test video sequence as the anomaly detection result. The method achieves both anomaly detection performance and accuracy, improving the effect and efficiency of anomaly detection.

Description

Anomaly detection method based on reconstruction and prediction
Technical Field
The invention relates to the technical field of video and image processing, in particular to an anomaly detection method based on reconstruction and prediction.
Background
Video anomaly detection is an important research task in computer vision with many applications, such as traffic accident detection, violence detection and abnormal crowd behavior detection. Owing to the uncertainty and diversity of anomalies, accurately distinguishing abnormal events from normal ones remains challenging despite years of research. Moreover, in the real world it is difficult to enumerate all abnormal events in order to learn the various abnormal patterns. Many studies therefore detect anomalies with one-class classification methods rather than with supervised binary classification. One-class anomaly detection learns the distribution of normal patterns from normal data and computes the probability that a test sample obeys that distribution to reflect abnormality.
Addressing the sensitivity of existing anomaly detection methods to noise and time intervals, Chinese patent publication No. CN111680614A discloses "an abnormal behavior detection method based on video monitoring", which extracts features from each target object in a video frame image, clusters the features, and inputs them into an SVM classifier; the highest abnormal score is taken as the score of the target object, and the highest abnormal score over all target objects in the frame image is taken as the abnormal score of that frame image.
The anomaly (behavior) detection method in the above existing scheme uses target detection to find foreground objects in each video frame, inputs them into a convolutional autoencoder network for reconstruction, and judges anomalies by classifying the reconstruction error. However, conventional anomaly detection methods treat all pixels in a frame equally, so the model loses focus and cannot preferentially learn to reconstruct the complex regions that are hard to reconstruct during training; consequently the model cannot effectively obtain reconstructions with a high-quality foreground (simple background pixels dominate the optimization of the model), which degrades detection performance, since the foreground matters more than the static background in anomaly detection. Meanwhile, existing reconstruction methods try to minimize the difference between a reconstructed frame and its ground-truth label; although this guarantees similarity in pixel space and even in latent space, it is a one-to-one constraint that ignores the similarity among different normal frames in the same scene, so the accuracy of anomaly detection is limited. How to design an anomaly detection method that achieves both detection performance and accuracy is therefore a pressing technical problem.
Disclosure of Invention
Aiming at the deficiencies of the prior art, the technical problem to be solved by the invention is: how to provide an anomaly detection method that achieves both anomaly detection performance and accuracy, thereby improving the effect and efficiency of anomaly detection.
In order to solve the technical problems, the invention adopts the following technical scheme:
an anomaly detection method based on reconstruction and prediction, comprising the steps of:
s1: acquiring a test video sequence to be detected;
s2: inputting the test video sequence into a pre-trained anomaly detection model; the anomaly detection model firstly extracts spatial appearance characteristics and temporal motion characteristics of a test video sequence respectively, then fuses the spatial appearance characteristics and the temporal motion characteristics to obtain corresponding spatio-temporal characteristics, then obtains corresponding reconstructed frames based on the spatio-temporal characteristics, and finally calculates corresponding anomaly scores according to the reconstructed frames;
s3: and taking the abnormal score of the test video sequence as an abnormal detection result.
Preferably, the anomaly detection model includes a reconstruction encoder for extracting spatial appearance features, a prediction encoder for extracting temporal motion features, a fusion module connected to outputs of the reconstruction encoder and the prediction encoder and used for fusing to obtain spatio-temporal features, and a decoder connected to an output of the fusion module and used for obtaining a reconstructed frame.
Preferably, in step S2, the current frame of the test video sequence is input to the reconstruction encoder to extract the corresponding spatial appearance feature; a number of frames preceding a current frame of the test video sequence are input to the predictive encoder to extract corresponding temporal motion features.
Preferably, when the anomaly detection model is trained, the video sequence input in the current round is reversely erased based on the reconstruction error of the previous round of the anomaly detection model, so as to remove pixels with reconstruction errors smaller than a preset threshold value in the video sequence, and obtain a corresponding erased frame.
Preferably, I_t denotes the t-th frame in a video sequence and I_{t−Δ} denotes the frame Δ steps before I_t;
the reverse erasure refers to: after each training round except the first, first computing the pixel-level error between the original frame I_t and the reconstructed frame Î_t; then setting the corresponding pixel value in the mask to 1 or 0 according to whether the pixel-level error is greater than a preset threshold, to obtain the corresponding mask; and finally, before the current round of training, multiplying the original frames from I_{t−Δ} to I_t pixel-by-pixel with the mask to obtain the erased frames of the current round of the anomaly detection model, denoted I′_{t−Δ} to I′_t.
Preferably, when the anomaly detection model is trained, a deep SVDD module is connected to the output of the decoder; the deep SVDD module is used to search for the hypersphere of smallest volume containing all or most high-level features of reconstructed frames of normal events, and to use this compactness constraint on the high-level features of the reconstructed frames to make reconstructed normal frames similar, so as to increase the reconstruction gap between normal and abnormal frames.
Preferably, the deep SVDD module comprises a mapping encoder connected to the output of the decoder and a hypersphere connected to the output of the mapping encoder; the mapping encoder first maps the reconstructed frame Î_t into a low-dimensional latent representation, and the low-dimensional representations are then fitted into a hypersphere of minimal volume to force the anomaly detection model to learn and extract the common factors of normal events;
the target function of the depth SVDD module is defined as:
Figure BDA0003240312900000023
in the formula: c and R represent the center and radius of the hyper-sphere, respectively, n represents the number of frames,
Figure BDA0003240312900000031
representing reconstructed frames output by a network with parameter W
Figure BDA0003240312900000032
Is represented by argmax {. cndot.) represents a function taking the maximum value.
Preferably, the anomaly detection model is optimized with a training loss function; the reconstructed frame Î_t is constrained both in pixel space and in the latent space of the deep SVDD module: in pixel space the model is optimized with an intensity loss and a weighted RGB loss, and in the latent space with a feature compactness loss.
Preferably, the training loss function is given by:

L = λ_int L_int + λ_rgb L_rgb + λ_compact L_compact

where L_int denotes the intensity loss, L_rgb the weighted RGB loss, L_compact the feature compactness loss, and λ_int, λ_rgb, λ_compact the hyperparameters of the respective losses, which determine their contribution to the total training loss;

the intensity loss L_int is computed as:

L_int = ‖Î_t − I_t‖₂²

where t denotes the t-th frame of the video sequence and ‖·‖₂ denotes the ℓ2 norm;

the weighted RGB loss L_rgb is computed as:

L_rgb = Σ_{i=1}^{N} ((N−i+1)/N) ‖ |Î_t − I_{t−i}| − |I_t − I_{t−i}| ‖₁

where ‖·‖₁ denotes the ℓ1 norm, N denotes the number of previous frames, and frame I_{t−i} carries the weight (N−i+1)/N;

the feature compactness loss is computed as:

L_compact = R² + (1/(νn)) Σ_{t=1}^{n} max{0, ‖φ(Î_t; W) − c‖² − R²}

where c and R denote the center and radius of the hypersphere, respectively, n denotes the number of frames, and φ(Î_t; W) denotes the low-dimensional representation of the reconstructed frame Î_t output by the network with parameters W.
Preferably, the anomaly detection model calculates the corresponding anomaly score through the following steps:

S201: the partial score of each image block in the test video sequence is defined as:

S(P) = (1/|P|) Σ_{(i,j)∈P} (Î_t(i,j) − I_t(i,j))²

where P denotes an image block in frame I_t, i and j denote the spatial position of a pixel within the block, |P| denotes the number of pixels in the block, and the blocks are obtained by sliding a window with stride 4;

S202: the anomaly score of a frame in the test video sequence is calculated as:

Score = max{S(P₁), S(P₂), ..., S(P_m)}

where the size of P is set to 16 × 16 and m denotes the number of image blocks;

S203: after the score of every frame in the test video sequence is obtained, the scores of all frames are normalized to the range [0, 1] to obtain the frame-level anomaly score:

s_t = (Score_t − min_Score) / (max_Score − min_Score)

where min_Score and max_Score denote the minimum and maximum scores in the test video sequence, respectively;

S204: the frame-level anomaly scores are smoothed in the time dimension with a Gaussian filter to obtain the anomaly scores corresponding to the test video sequence.
Compared with the prior art, the anomaly detection method has the following beneficial effects:
in the invention, the spatial features and the temporal features of the video sequence are respectively extracted through a reconstruction method and a prediction method, and the corresponding spatio-temporal features are obtained through fusion to calculate the reconstruction frame, so that the model does not lose focus, complex regions which are difficult to reconstruct during the prior learning and reconstruction training can be preferentially learned, the reconstructed image with high quality prospect can be effectively obtained, and the anomaly detection performance of the anomaly detection model is further improved; meanwhile, the spatial features and the temporal features are extracted, and the similarity of different normal frames in the same scene is considered, so that the anomaly detection accuracy of the anomaly detection model can be improved. Therefore, the abnormality detection method in the invention has both the performance and the accuracy of abnormality detection, thereby improving the effect and the efficiency of abnormality detection.
In the invention, the input data of the model (the erased frames) are created by erasing some pixels from the original frames through reverse erasure. This retains the pixels with larger reconstruction errors from the previous training round and removes those with smaller errors, forcing the model to focus on the pixels that were not well reconstructed in the previous round, so that both the simple background and the complex foreground are reconstructed with high quality. Most foreground pixels are retained in the erased frames while most background pixels are discarded, which helps the model automatically form a focusing mechanism on the foreground and allows detection performance and accuracy to be achieved together.
In the invention, the deep SVDD module acts directly on the reconstructed frames: it searches for the hypersphere of smallest volume containing all or most high-level features of reconstructed frames of normal events, and guarantees the similarity between the reconstructed images of normal frames through similar low-dimensional features in the latent space, which effectively increases the reconstruction gap between normal and abnormal frames and further improves the accuracy of anomaly detection.
Drawings
For purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made in detail to the present invention as illustrated in the accompanying drawings, in which:
FIG. 1 is a logic block diagram of an anomaly detection method;
FIG. 2 is a diagram of a network architecture during an anomaly detection model test;
FIG. 3 is a network architecture diagram during anomaly detection model training;
FIG. 4 is a partial example graph of three data sets (exceptional events are marked by bounding boxes);
FIG. 5 is a network architecture diagram of an encoder and decoder;
FIG. 6 is a graph showing qualitative results of frame reconstruction for three data sets (lighter colors represent greater error);
FIG. 7 is a graph showing a comparison of anomaly scores;
FIG. 8 is a diagram illustrating a comparison of average scores for normal and abnormal frames;
FIG. 9 is a diagram illustrating the visualization of reverse erasure during different training sessions;
FIG. 10 is a graph showing a comparison of model training loss with and without reverse erase on Ped 2;
FIG. 11 is a diagram of the model visualization with and without reverse erase on Avenue and Ped 2;
FIG. 12 is a diagram of a t-SNE visualization of a low-dimensional representation of reconstructed frames in Avenue and Ped 2.
Detailed Description
The following is further detailed by the specific embodiments:
Embodiment:
the embodiment discloses an anomaly detection method based on reconstruction and prediction.
As shown in fig. 1, the anomaly detection method based on reconstruction and prediction includes the following steps:
s1: acquiring a test video sequence to be detected;
s2: inputting a test video sequence into a pre-trained anomaly detection model; the anomaly detection model firstly extracts spatial appearance characteristics and temporal motion characteristics of a test video sequence respectively, then fuses the spatial appearance characteristics and the temporal motion characteristics to obtain corresponding spatio-temporal characteristics, then obtains corresponding reconstructed frames based on the spatio-temporal characteristics, and finally calculates corresponding anomaly scores according to the reconstructed frames;
s3: and taking the abnormal score of the test video sequence as an abnormal detection result.
In a specific implementation process, as shown in fig. 2, the anomaly detection model (Dual-Encoder Single-Decoder network, DESDnet) comprises a reconstruction encoder for extracting spatial appearance features, a prediction encoder for extracting temporal motion features, a fusion module connected to the outputs of the two encoders to fuse them into spatio-temporal features, and a decoder connected to the output of the fusion module to produce the reconstructed frame. Specifically, the fusion module consists of a two-dimensional convolution layer and a Tanh activation layer; the convolution kernel is 1 × 1 in size with 512 channels. The current frame of the test video sequence is input to the reconstruction encoder to extract the corresponding spatial appearance features, and several frames preceding the current frame are input to the prediction encoder to extract the corresponding temporal motion features. In the testing phase, the frames from I_{t−Δ} to I_t are input to the reconstruction encoder and the prediction encoder to extract the spatial and temporal features of the video sequence, respectively. The appearance feature a_t and the motion feature m_t are concatenated and input to the fusion module to obtain the corresponding spatio-temporal features; compared with fusion by simple feature concatenation, this reduces computation and improves the expressive capability of the model. The spatio-temporal features are then input to the decoder, and the reconstructed frame Î_t is obtained by deconvolution.
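The following is a minimal PyTorch sketch of this dual-encoder single-decoder layout. Only the fusion module (a 1 × 1 convolution with 512 channels followed by Tanh) is specified above; the encoder and decoder stacks are illustrative placeholders, since the exact layer configuration of FIG. 5 is not reproduced in this text.

```python
import torch
import torch.nn as nn

class DESDNet(nn.Module):
    """Sketch of the dual-encoder single-decoder network: a reconstruction
    encoder for the current frame, a prediction encoder for the Delta
    preceding frames, a 1x1 fusion convolution with Tanh, and a decoder."""

    def __init__(self, delta=4, feat_ch=256):
        super().__init__()
        self.rec_enc = self._encoder(3, feat_ch)           # takes I_t
        self.pred_enc = self._encoder(3 * delta, feat_ch)  # takes I_{t-delta}..I_{t-1}
        # Fusion module as described: 1x1 convolution, 512 channels, Tanh.
        self.fusion = nn.Sequential(
            nn.Conv2d(2 * feat_ch, 512, kernel_size=1), nn.Tanh())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(512, 128, 4, stride=2, padding=1), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh())

    @staticmethod
    def _encoder(in_ch, out_ch):
        # Placeholder convolutional stack standing in for the encoders of FIG. 5.
        return nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(True),
            nn.Conv2d(128, out_ch, 3, stride=2, padding=1), nn.ReLU(True))

    def forward(self, cur_frame, prev_frames):
        a_t = self.rec_enc(cur_frame)               # spatial appearance feature
        m_t = self.pred_enc(prev_frames)            # temporal motion feature
        st = self.fusion(torch.cat([a_t, m_t], 1))  # spatio-temporal feature
        return self.decoder(st)                     # reconstructed frame
```

With 256 × 256 inputs, the placeholder encoders downsample to 32 × 32 feature maps and the decoder deconvolves back to 256 × 256.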
In the invention, the spatial features and temporal features of a video sequence are extracted by reconstruction and prediction, respectively, and fused into spatio-temporal features from which the reconstructed frame is computed, so the model does not lose focus and can preferentially learn to reconstruct the complex regions that are hard to reconstruct during training; it can thus effectively produce reconstructions with a high-quality foreground, improving the detection performance of the anomaly detection model. Meanwhile, because both spatial and temporal features are extracted and the similarity of different normal frames in the same scene is taken into account, the detection accuracy of the model is also improved. The anomaly detection method of the invention therefore achieves both detection performance and accuracy, improving the effect and efficiency of anomaly detection.
In a specific implementation process, as shown in fig. 3, when the anomaly detection model is trained, the video sequence input in the current round is reversely erased based on the reconstruction error in the previous round of the anomaly detection model, so as to remove pixels in the video sequence whose reconstruction error is smaller than the preset threshold value, and obtain a corresponding erased frame.
In particular, I_t denotes the t-th frame in the video sequence and I_{t−Δ} denotes the frame Δ steps before I_t.

Reverse erasure refers to the following: after each training round except the first, the pixel-level error between the original frame I_t and the reconstructed frame Î_t is computed first; a mask is then obtained by setting each of its pixel values to 1 or 0 according to whether the corresponding pixel-level error exceeds a preset threshold; finally, before the current round of training, the original frames from I_{t−Δ} to I_t are multiplied pixel-by-pixel with the mask to obtain the erased frames of the current round of the anomaly detection model, denoted I′_{t−Δ} to I′_t. In the training phase, given the erased frames from I′_{t−Δ} to I′_t, I′_t is input to the reconstruction encoder to extract the appearance features in the spatial domain, denoted a_t, and the frames from I′_{t−Δ} to I′_{t−1} are input to the prediction encoder to extract the motion features in the temporal domain, denoted m_t; compared with capturing motion patterns with optical flow, this avoids the inaccuracy and high computational cost of optical-flow estimation.
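A minimal sketch of this reverse erasure step, assuming an absolute pixel error and a placeholder threshold value (the patent specifies only that pixels whose previous-round reconstruction error is below a preset threshold are erased):

```python
import torch

def reverse_erase(frames, recon_prev, threshold=0.1):
    """frames:     (delta+1, C, H, W) original frames I_{t-delta}..I_t
    recon_prev:    (C, H, W) previous round's reconstruction of I_t
    threshold:     preset error threshold (the value 0.1 is a placeholder)
    Returns the erased frames I'_{t-delta}..I'_t."""
    # Pixel-level error between the original frame I_t and its reconstruction.
    err = (frames[-1] - recon_prev).abs()
    # Mask pixel = 1 where the error exceeds the threshold, 0 elsewhere.
    mask = (err > threshold).float()
    # Multiply each frame pixel-by-pixel with the mask.
    return frames * mask.unsqueeze(0)
```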
In the invention, the input data of the model (the erased frames) are created by erasing some pixels from the original frames through reverse erasure. This retains the pixels with larger reconstruction errors from the previous training round and removes those with smaller errors, forcing the model to focus on the pixels that were not well reconstructed in the previous round, so that both the simple background and the complex foreground are reconstructed with high quality. Most foreground pixels are retained in the erased frames while most background pixels are discarded, which helps the model automatically form a focusing mechanism on the foreground and allows detection performance and accuracy to be achieved together. Meanwhile, the uncertainty introduced into the input makes the anomaly detection model more robust to noise: the model does not lose focus, preferentially learns to reconstruct the complex regions that are hard to reconstruct during training, and can effectively produce reconstructions with a high-quality foreground, improving its detection performance.
In a specific implementation process, as shown in fig. 3, when the anomaly detection model is trained, a deep SVDD module is connected to the output of the decoder; the deep SVDD module is used to search for the hypersphere of smallest volume containing all or most high-level features of reconstructed frames of normal events, and to use this compactness constraint on the high-level features of the reconstructed frames to make reconstructed normal frames similar, so as to increase the reconstruction gap between normal and abnormal frames.
Specifically, the deep SVDD module comprises a mapping encoder connected to the output of the decoder and a hypersphere connected to the output of the mapping encoder; the mapping encoder first maps the reconstructed frame Î_t into a low-dimensional latent representation, and the low-dimensional representations are then fitted into a hypersphere of minimal volume to force the anomaly detection model to learn and extract the common factors of normal events.

The objective function of the deep SVDD module is defined as:

min_{W,R} R² + (1/(νn)) Σ_{t=1}^{n} max{0, ‖φ(Î_t; W) − c‖² − R²}

where c and R denote the center and radius of the hypersphere, respectively, n denotes the number of frames, φ(Î_t; W) denotes the low-dimensional representation of the reconstructed frame Î_t output by the network with parameters W, and max{·} takes the maximum value. In the objective function, the first term minimizes the volume of the hypersphere and the second term penalizes samples lying outside it; the hyperparameter ν ∈ (0, 1] trades off the volume of the hypersphere against boundary violations: a large ν allows some samples to fall outside the hypersphere, while a small ν penalizes such samples heavily. The network parameters W and the radius R are optimized by block coordinate descent with alternating minimization: with R fixed, the network is iterated k times to optimize the parameters W; after k iterations, R is optimized again using the latest W.
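A sketch of this objective and of the alternating update, assuming the quantile-based radius step commonly used in deep SVDD implementations (the description above states only that W and R are minimized alternately by block coordinate descent):

```python
import torch

def svdd_loss(z, c, R, nu=0.1):
    """Soft-boundary deep SVDD objective for a batch of latent codes z:
    R^2 + (1/(nu*n)) * sum_t max{0, ||z_t - c||^2 - R^2}."""
    dist_sq = ((z - c) ** 2).sum(dim=1)              # squared distance to the centre c
    penalty = torch.clamp(dist_sq - R ** 2, min=0)   # only samples outside the sphere
    return R ** 2 + penalty.mean() / nu

def update_radius(z, c, nu=0.1):
    """With W fixed, a common closed-form step sets R to the (1 - nu)-quantile
    of the distances between the latent codes and the centre."""
    dist = ((z - c) ** 2).sum(dim=1).sqrt()
    return torch.quantile(dist, 1.0 - nu)
```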
In the invention, the deep SVDD module acts directly on the reconstructed frames: it searches for the hypersphere of smallest volume containing all or most high-level features of reconstructed frames of normal events, and guarantees the similarity between the reconstructed images of normal frames through similar low-dimensional features in the latent space, which effectively increases the reconstruction gap between normal and abnormal frames and further improves the accuracy of anomaly detection.
In the specific implementation process, the anomaly detection model is optimized with a training loss function; the reconstructed frame Î_t is constrained both in pixel space and in the latent space of the deep SVDD module: in pixel space the model is optimized with an intensity loss and a weighted RGB loss, and in the latent space with a feature compactness loss.
Specifically, the training loss function is given by:

L = λ_int L_int + λ_rgb L_rgb + λ_compact L_compact

where L_int denotes the intensity loss, L_rgb the weighted RGB loss, L_compact the feature compactness loss, and λ_int, λ_rgb, λ_compact the hyperparameters of the respective losses, which determine their contribution to the total training loss.

The intensity loss L_int is computed as:

L_int = ‖Î_t − I_t‖₂²

where t denotes the t-th frame of the video sequence and ‖·‖₂ denotes the ℓ2 norm.

The weighted RGB loss L_rgb is computed as:

L_rgb = Σ_{i=1}^{N} ((N−i+1)/N) ‖ |Î_t − I_{t−i}| − |I_t − I_{t−i}| ‖₁

where ‖·‖₁ denotes the ℓ1 norm, N denotes the number of previous frames, and frame I_{t−i} carries the weight (N−i+1)/N.

The feature compactness loss is computed as:

L_compact = R² + (1/(νn)) Σ_{t=1}^{n} max{0, ‖φ(Î_t; W) − c‖² − R²}

where c and R denote the center and radius of the hypersphere, respectively, n denotes the number of frames, and φ(Î_t; W) denotes the low-dimensional representation of the reconstructed frame Î_t output by the network with parameters W.
To constrain the reconstructions of all normal frames within a reachable range, the mean of the feature vectors of the reconstructed frames extracted by the first-round training model is taken as the center c. In subsequent training, the Euclidean distance between the feature representation of each reconstructed frame and the center c is calculated, and the feature compactness loss is obtained from this distance.
In the invention, by minimizing the feature compactness loss, the model automatically maps the reconstructions of normal frames close to the center of the hypersphere, obtaining a compact description of normal events. The features of reconstructed frames containing normal events thus lie near the center of the hypersphere, while the features of abnormal events lie far from the center or even outside the hypersphere. This means that the reconstructed images of all normal frames are more similar to one another in pixel space, while the reconstructed image of an abnormal frame differs more from those of normal frames, which increases the distinguishability of anomalies and improves both the detection performance and the accuracy of the anomaly detection model.
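A sketch of the total training loss with the hyperparameter values used in the embodiment (λ_int = 1, λ_rgb = 0.2, λ_compact = 0.01, ν = 0.1); the algebraic form of the weighted RGB loss is an assumption reconstructed from the description above:

```python
import torch
import torch.nn.functional as F

def training_loss(recon, target, prev_frames, z, c, R,
                  lam_int=1.0, lam_rgb=0.2, lam_compact=0.01, nu=0.1):
    """Total loss L = lam_int*L_int + lam_rgb*L_rgb + lam_compact*L_compact.
    recon/target: (C, H, W); prev_frames: (N, C, H, W) frames I_{t-N}..I_{t-1};
    z: (n, d) latent codes of reconstructed frames; c: (d,) centre; R: radius."""
    # Intensity loss: squared l2 error between reconstruction and ground truth.
    l_int = F.mse_loss(recon, target)

    # Weighted RGB loss: l1 distance between the absolute RGB-difference maps
    # of the reconstructed and real frame w.r.t. each previous frame, with
    # frame I_{t-i} weighted by (N - i + 1) / N (form assumed, see lead-in).
    n_prev = prev_frames.shape[0]
    l_rgb = recon.new_zeros(())
    for i in range(1, n_prev + 1):
        w = (n_prev - i + 1) / n_prev
        prev = prev_frames[-i]  # I_{t-i}
        l_rgb = l_rgb + w * F.l1_loss((recon - prev).abs(), (target - prev).abs())

    # Feature compactness loss: soft-boundary deep SVDD term.
    dist_sq = ((z - c) ** 2).sum(dim=1)
    l_compact = R ** 2 + torch.clamp(dist_sq - R ** 2, min=0).mean() / nu

    return lam_int * l_int + lam_rgb * l_rgb + lam_compact * l_compact
```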
In a specific implementation process, the anomaly detection model calculates the corresponding anomaly score through the following steps:

S201: the partial score of each image block in the test video sequence is defined as:

S(P) = (1/|P|) Σ_{(i,j)∈P} (Î_t(i,j) − I_t(i,j))²

where P denotes an image block in frame I_t, i and j denote the spatial position of a pixel within the block, |P| denotes the number of pixels in the block, and the blocks are obtained by sliding a window with stride 4;

S202: the anomaly score of a frame in the test video sequence is calculated as:

Score = max{S(P₁), S(P₂), ..., S(P_m)}

where the size of P is set to 16 × 16 and m denotes the number of image blocks;

S203: after the score of every frame in the test video sequence is obtained, the scores of all frames are normalized to the range [0, 1] to obtain the frame-level anomaly score:

s_t = (Score_t − min_Score) / (max_Score − min_Score)

where min_Score and max_Score denote the minimum and maximum scores in the test video sequence, respectively;

S204: the frame-level anomaly scores are smoothed in the time dimension with a Gaussian filter to obtain the anomaly scores corresponding to the test video sequence.
Through the above steps, the invention can effectively calculate the anomaly score of a test video sequence, and abnormal behaviors or events in the sequence can then be detected based on that score, assisting and improving the effect of anomaly detection.
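A sketch of this scoring procedure, assuming a per-frame mean-squared-error map; the Gaussian smoothing width is a placeholder, as the patent does not state it:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def frame_scores(error_maps, patch=16, stride=4):
    """Per-frame score: the maximum mean error over 16x16 blocks obtained by
    sliding a window with stride 4 (error_maps: list of (H, W) arrays)."""
    scores = []
    for err in error_maps:
        h, w = err.shape
        best = 0.0
        for i in range(0, h - patch + 1, stride):
            for j in range(0, w - patch + 1, stride):
                best = max(best, err[i:i + patch, j:j + patch].mean())
        scores.append(best)
    return np.asarray(scores)

def sequence_scores(scores, sigma=3):
    """Normalize the frame scores of a test sequence to [0, 1], then smooth
    them along the time dimension with a Gaussian filter."""
    s = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
    return gaussian_filter1d(s, sigma=sigma)
```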
In order to better illustrate the advantages of the anomaly detection method of the present invention, the present embodiment also discloses the following experiments:
the experiment was performed on three publicly available data sets, shown in fig. 4, which are a CUHK Avenue data set, a UCSD pedistrin data set, and a university campus anomaly detection data set, respectively.
According to the network structure parameters in fig. 5, the model of the invention is implemented in PyTorch.
To train the model, the Adam optimizer is used with an initial learning rate of 0.0002, and the learning rate is decayed with cosine annealing. The batch size is set to 4, and the numbers of training rounds on CUHK Avenue, UCSD Ped2 and the university campus anomaly detection dataset are 60, 60 and 10, respectively. For all datasets, frames are resized to 256 × 256 pixels, with pixel intensities normalized to the range [−1, 1]. The total length of the input sequence is set to 5 frames, i.e., Δ = 4.
In the training loss function, the hyperparameters λ_int, λ_rgb and λ_compact are set to 1, 0.2 and 0.01, respectively. The ν of the deep SVDD module is set to 0.1 to ensure the model's tolerance to various normal modes. To reduce the memory required for computation, this embodiment does not compute a dedicated mask for every frame in the training set; instead, an OR operation is performed over the masks to generate a common mask for erasure in the next training round. All experiments are conducted on a computer running Ubuntu 16.04 with an Intel(R) Core(TM) i7-7800X CPU @ 3.50 GHz and a GeForce GTX 1080 graphics card with 8 GB of memory.
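As a rough illustration, the stated optimizer settings map onto PyTorch as follows (DESDNet refers to the architecture sketch given earlier; scheduler arguments beyond the 60 training rounds are defaults):

```python
import torch

model = DESDNet()  # the architecture sketch given earlier
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)  # initial LR 0.0002
# Cosine annealing of the learning rate over the 60 training rounds
# used for CUHK Avenue and UCSD Ped2 (10 rounds for the campus dataset).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=60)
```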
The CUHK Avenue dataset comprises 37 videos: 16 videos with 15328 frames are used to train the model, and the remaining 21 videos with 15324 frames are used to evaluate its anomaly detection performance. The dataset contains 47 abnormal events, including loitering, throwing objects and running, at a resolution of 640 × 360 per frame.
The UCSD Pedestrian dataset comprises the Ped1 (UCSD Pedestrian 1) and Ped2 (UCSD Pedestrian 2) datasets. Experiments are performed on Ped2 but not on Ped1, since the 158 × 238 frame resolution of Ped1 is rather low. Ped2 contains 16 training videos and 12 test videos, each no longer than 200 frames, at a resolution of 360 × 240. The Ped2 dataset contains 12 abnormal events, mainly objects with an abnormal appearance, such as bicycles and trucks on sidewalks.
The university campus anomaly detection dataset is a very challenging video anomaly detection dataset consisting of 13 scenes and more than 270,000 training frames; it contains 330 training videos and 107 test videos, at a resolution of 856 × 480 per frame. It includes 130 abnormal events, such as the appearance of bicycles and skateboards.
This embodiment evaluates anomaly detection performance by the AUC (Area Under the ROC Curve).
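Frame-level AUC can be computed from the anomaly scores and ground-truth frame labels, for example with scikit-learn (the arrays below are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

labels = np.array([0, 0, 1, 1, 0])            # 1 = abnormal frame, 0 = normal
scores = np.array([0.1, 0.2, 0.9, 0.8, 0.3])  # frame-level anomaly scores
print("frame-level AUC:", roc_auc_score(labels, scores))
```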
First, the anomaly detection model of the present invention
The anomaly detection model of the invention is compared with typical conventional methods and recent deep-learning-based methods, including DeepOC, Stacked RNN, Liu et al., Lu et al., MESDnet, MemAE, STAE, ST-CaAE, Kim et al., and the like. The AUC performance of each model is shown in Table 1.
TABLE 1AUC Performance comparison results
As can be seen from Table 1, the model of the invention achieves good AUC performance on three different datasets, showing strong competitiveness against state-of-the-art methods. On the CUHK Avenue and UCSD Ped2 datasets, the AUC of the model reaches 89.9% and 97.5%, respectively, outperforming the other methods. The university campus anomaly detection dataset is a new dataset in video anomaly detection, so only a few studies provide test results on it.
On the university campus anomaly detection dataset, the model of the invention does not reach the best AUC, but is only 1.1% below the highest value. Furthermore, to observe the detection performance visually, fig. 6 provides qualitative frame-reconstruction results of the model on the three datasets; as fig. 6 shows, normal regions are reconstructed well while abnormal regions are not.
Second, reconstruction and prediction models relating to the present invention
To evaluate the effect of fusing reconstruction and prediction in the invention, the reconstruction encoder, prediction encoder and decoder are combined into three different models: 1) a reconstruction model consisting of the reconstruction encoder and the decoder, with frame I_t as input; 2) a prediction model consisting of the prediction encoder and the decoder, with the frames from I_{t−Δ} to I_{t−1} as input; 3) the proposed model consisting of the reconstruction encoder, the prediction encoder and the decoder, with the frames from I_{t−Δ} to I_t as input. To remain consistent with the proposed model, a skip connection is used between the encoder and decoder of the prediction model. The training of each model is supervised with the pixel intensity loss, the weighted RGB loss and the feature compactness loss. With these models, the performance of the reconstruction model and the prediction model in detecting anomalies independently can be obtained.
FIG. 7 shows the anomaly scores of video sequences from the Avenue and Ped2 datasets under the three models above. The results show that the model of the invention consistently produces larger reconstruction errors for abnormal frames and smaller errors for normal frames; the averages of the normal and abnormal scores and the gaps between them are shown in fig. 8. Overall, the score gap of the model of the invention is the largest on each dataset, indicating better detection performance. In addition, the AUCs listed in Table 2 also demonstrate that neither the reconstruction model nor the prediction model alone reaches the AUC performance achieved by their combination in the model of the invention.
TABLE 2 AUC comparison of different models
Third, the reconstruction error reverse erasure related to the present invention
Fig. 9 shows the masks used for erasure in different training periods, along with the frame images before and after erasure. As can be seen from fig. 9, the pixels erased in each round are mainly background pixels, which helps the model focus more on the complex foreground; and as the number of training rounds increases, more background pixels remain in the erased frames, indicating that the reconstruction-error gap between foreground and background is shrinking. This reflects that reverse erasure effectively guides the model to reduce the reconstruction error of foreground pixels, which is also verified in the reconstruction error maps provided in fig. 9.
To better demonstrate the advantages of reverse erasure, this embodiment performs an ablation experiment on it. The training losses of the models with and without reverse erasure on Ped2 are shown in fig. 10. Although fig. 10 shows that the model with reverse erasure does not reduce the training loss dramatically, comparison with fig. 9 reveals that its loss reduction is dominated by foreground pixels rather than background pixels; conversely, the model without reverse erasure loses this guidance and treats all regions alike, so model convergence is dominated by the simple background. Finally, the AUC performance of the models with and without reverse erasure is listed in Table 3, with a visual comparison in fig. 11. The results show that the reverse erasure model of the invention has better detection performance.
TABLE 3 AUC comparison of models without and with reverse erase
Fourth, the depth SVDD module related to the invention
Based on t-distributed Stochastic Neighbor Embedding (t-SNE), fig. 12 provides a t-SNE visualization of the low-dimensional representations of reconstructed frames on the Avenue and Ped2 datasets. It can be observed that, in three dimensions, most normal data cluster in the form of a compact sphere, especially on the Ped2 dataset, while abnormal data scatter outside the sphere. This result is attributed to the feature compactness loss based on deep SVDD, which aims to find a minimal-volume hypersphere that contains the normal data but not the abnormal data.
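A sketch of how such a visualization can be produced with scikit-learn's t-SNE; the random array stands in for the mapping encoder's low-dimensional representations, and its dimensionality is an assumption:

```python
import numpy as np
from sklearn.manifold import TSNE

z = np.random.rand(500, 128).astype(np.float32)  # stand-in for latent codes
emb = TSNE(n_components=3, init="pca", random_state=0).fit_transform(z)
print(emb.shape)  # (500, 3): coordinates for a 3-D scatter as in FIG. 12
```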
To verify the advantage of applying deep SVDD after the decoder, three configurations are explored in this experiment: 1) the mapping encoder after the decoder is removed, leaving no constraint on the features; this is a plain dual-encoder single-decoder structure, denoted DESD; 2) deep SVDD is applied at the bottleneck between the encoders and the decoder, i.e., the spatio-temporal representation of the input frames is mapped into a compact hypersphere, denoted DE-SVDD-SD; 3) deep SVDD is applied after the decoder, denoted DESD-SVDD.
The AUC performance of the different configurations is summarized in Table 4. In the table, the feature-based AUC is calculated from the distance between the low-dimensional feature of a frame and the center of the hypersphere. This distance is defined as:

D(Î_t) = ‖φ(Î_t; W*) − c‖²

where W* denotes the parameters of the pre-trained network; a large distance means the low-dimensional feature of the frame deviates strongly from the normal mode. The feature-based anomaly score is then derived from this distance.
From Table 4, it can be observed that DESD-SVDD achieves the highest AUC on both datasets, whether frame-based or feature-based. The frame-based AUC of DE-SVDD-SD is lower than that of DESD-SVDD, confirming that, owing to the strong representation capability of CNNs, abnormal frames reconstructed by the decoder may still not approach normal frames even when the high-level features at the bottleneck are constrained.
TABLE 4 AUC comparison of potential feature spaces under different constraints
Fifth, weighted RGB loss with respect to the invention
The effect of the weighted RGB loss is studied by comparison with a motion loss that computes the RGB difference between two adjacent frames. Table 5 shows that the weighted RGB loss gives a higher AUC on both the Ped2 and Avenue datasets.
TABLE 5 AUC Performance under different motion constraints
Furthermore, the experiments found that fixing the weighted RGB loss weight λ_rgb at 0.2 achieves good detection performance across different datasets. Taking the Ped2 dataset as an example, a sensitivity experiment on λ_rgb was performed; the results are summarized in Table 6.
TABLE 6 AUC comparison of weighted RGB loss for different weights on Ped2 dataset
Sixth, conclusion
This experiment investigates the problems that, in conventional deep-learning-based video anomaly detection, network optimization lacks a focus and the similarity between different normal frames is ignored. In the invention, each frame of a video is reconstructed by the dual-encoder single-decoder network of the anomaly detection module, and a training strategy comprising reconstruction-error-based reverse erasure and deep SVDD is proposed to regularize the training of the network. During training, pixels with smaller errors are removed from the original frames according to the reconstruction error of the previous training round before the frames are input to the model, so that the model concentrates on the pixels with larger errors and the reconstruction quality improves; in addition, deep SVDD maps the reconstructions of normal frames into a minimal-volume hypersphere, making the reconstructions of abnormal frames easier to identify. Experimental results on three datasets show that the method of the invention is competitive with existing methods.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that, while the invention has been described with reference to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Meanwhile, the detailed structures, characteristics and the like of the common general knowledge in the embodiments are not described too much. Finally, the scope of the claims should be determined by the content of the claims, and the description of the embodiments and the like in the specification should be used for interpreting the content of the claims.

Claims (10)

1. An anomaly detection method based on reconstruction and prediction, characterized by comprising the steps of:
s1: acquiring a test video sequence to be detected;
s2: inputting the test video sequence into a pre-trained anomaly detection model; the anomaly detection model firstly extracts spatial appearance characteristics and temporal motion characteristics of a test video sequence respectively, then fuses the spatial appearance characteristics and the temporal motion characteristics to obtain corresponding spatio-temporal characteristics, then obtains corresponding reconstructed frames based on the spatio-temporal characteristics, and finally calculates corresponding anomaly scores according to the reconstructed frames;
s3: and taking the abnormal score of the test video sequence as an abnormal detection result.
2. The reconstruction and prediction based anomaly detection method according to claim 1, characterized by: the anomaly detection model comprises a reconstruction encoder used for extracting spatial appearance characteristics, a prediction encoder used for extracting temporal motion characteristics, a fusion module which is connected with the outputs of the reconstruction encoder and the prediction encoder and is used for obtaining space-time characteristics through fusion, and a decoder which is connected with the output of the fusion module and is used for obtaining a reconstruction frame.
3. The reconstruction and prediction based anomaly detection method according to claim 2, characterized by: in step S2, inputting the current frame of the test video sequence to the reconstruction encoder to extract the corresponding spatial appearance feature; a number of frames preceding a current frame of the test video sequence are input to the predictive encoder to extract corresponding temporal motion features.
4. The reconstruction and prediction based anomaly detection method according to claim 2, characterized by: and when the anomaly detection model is trained, reversely erasing the video sequence input in the current round based on the reconstruction error of the previous round of the anomaly detection model so as to remove pixels with the reconstruction error smaller than a preset threshold value in the video sequence and obtain a corresponding erased frame.
5. The reconstruction and prediction based anomaly detection method according to claim 4, characterized by:
I_t denotes the t-th frame in a video sequence and I_{t−Δ} denotes the frame Δ steps before I_t;
the reverse erasure refers to: after each training round except the first, first computing the pixel-level error between the original frame I_t and the reconstructed frame Î_t; then setting the corresponding pixel value in the mask to 1 or 0 according to whether the value of the pixel-level error is greater than a preset threshold, to obtain the corresponding mask; and finally, before the current round of training, multiplying the original frames from I_{t−Δ} to I_t pixel-by-pixel with the mask to obtain the erased frames of the current round of the anomaly detection model, denoted I′_{t−Δ} to I′_t.
6. The reconstruction and prediction based anomaly detection method according to claim 4, characterized by: when the anomaly detection model is trained, a deep SVDD module is connected to the output of the decoder; the deep SVDD module is used to search for the hypersphere of smallest volume containing all or most high-level features of reconstructed frames of normal events, and to use the compactness constraint on the high-level features of the reconstructed frames to make reconstructed normal frames similar, so as to increase the reconstruction gap between normal and abnormal frames.
7. The reconstruction and prediction based anomaly detection method according to claim 6, characterized by: the deep SVDD module comprises a mapping encoder connected to an output of the decoder, and a hypersphere connected to an output of the mapping encoder; the mapping encoder first maps the reconstructed frame Î_t into a low-dimensional latent representation, and the low-dimensional representations are then fitted into a hypersphere of minimal volume to force the anomaly detection model to learn to extract common factors of normal events;

the objective function of the deep SVDD module is defined as:

min_{W,R} R² + (1/(νn)) Σ_{t=1}^{n} max{0, ‖φ(Î_t; W) − c‖² − R²}

where c and R denote the center and radius of the hypersphere, respectively, n denotes the number of frames, φ(Î_t; W) denotes the low-dimensional representation of the reconstructed frame Î_t output by the network with parameters W, and max{·} takes the maximum value.
8. The reconstruction and prediction based anomaly detection method according to claim 6, characterized by: the anomaly detection model is optimized by a training loss function; the reconstructed frame Î_t is constrained in pixel space and in the latent space of the deep SVDD module; in pixel space, the anomaly detection model is optimized based on an intensity loss and a weighted RGB loss; in the latent space, the anomaly detection model is optimized based on a feature compactness loss.
9. The reconstruction and prediction based anomaly detection method according to claim 8, characterized by: the training loss function is given by:

L = λ_int L_int + λ_rgb L_rgb + λ_compact L_compact

where L_int denotes the intensity loss, L_rgb the weighted RGB loss, L_compact the feature compactness loss, and λ_int, λ_rgb, λ_compact the hyperparameters of the respective losses, which determine their contribution to the total training loss;

the intensity loss L_int is computed as:

L_int = ‖Î_t − I_t‖₂²

where t denotes the t-th frame of the video sequence and ‖·‖₂ denotes the ℓ2 norm;

the weighted RGB loss L_rgb is computed as:

L_rgb = Σ_{i=1}^{N} ((N−i+1)/N) ‖ |Î_t − I_{t−i}| − |I_t − I_{t−i}| ‖₁

where ‖·‖₁ denotes the ℓ1 norm, N denotes the number of previous frames, and frame I_{t−i} carries the weight (N−i+1)/N;

the feature compactness loss is computed as:

L_compact = R² + (1/(νn)) Σ_{t=1}^{n} max{0, ‖φ(Î_t; W) − c‖² − R²}

where c and R denote the center and radius of the hypersphere, respectively, n denotes the number of frames, and φ(Î_t; W) denotes the low-dimensional representation of the reconstructed frame Î_t output by the network with parameters W.
10. The reconstruction and prediction based anomaly detection method according to claim 1, characterized by: the anomaly detection model calculates the corresponding anomaly score through the following steps:

S201: the partial score of each image block in the test video sequence is defined as:

S(P) = (1/|P|) Σ_{(i,j)∈P} (Î_t(i,j) − I_t(i,j))²

where P denotes an image block in frame I_t, i and j denote the spatial position of a pixel within the block, |P| denotes the number of pixels in the block, and the blocks are obtained by sliding a window with stride 4;

S202: the anomaly score of a frame in the test video sequence is calculated as:

Score = max{S(P₁), S(P₂), ..., S(P_m)}

where the size of P is set to 16 × 16 and m denotes the number of image blocks;

S203: after the score of every frame in the test video sequence is obtained, the scores of all frames are normalized to the range [0, 1] to obtain the frame-level anomaly score:

s_t = (Score_t − min_Score) / (max_Score − min_Score)

where min_Score and max_Score denote the minimum and maximum scores in the test video sequence, respectively;

S204: the frame-level anomaly scores are smoothed in the time dimension with a Gaussian filter to obtain the anomaly scores corresponding to the test video sequence.
CN202111016334.4A 2021-08-31 2021-08-31 Anomaly detection method based on reconstruction and prediction Active CN113705490B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111016334.4A CN113705490B (en) 2021-08-31 2021-08-31 Anomaly detection method based on reconstruction and prediction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111016334.4A CN113705490B (en) 2021-08-31 2021-08-31 Anomaly detection method based on reconstruction and prediction

Publications (2)

Publication Number Publication Date
CN113705490A true CN113705490A (en) 2021-11-26
CN113705490B CN113705490B (en) 2023-09-12

Family

ID=78658335

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111016334.4A Active CN113705490B (en) 2021-08-31 2021-08-31 Anomaly detection method based on reconstruction and prediction

Country Status (1)

Country Link
CN (1) CN113705490B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114896307A (en) * 2022-06-30 2022-08-12 北京航空航天大学杭州创新研究院 Time series data enhancement method and device and electronic equipment
CN115527151A (en) * 2022-11-04 2022-12-27 南京理工大学 Video anomaly detection method and system, electronic equipment and storage medium
CN116450880A (en) * 2023-05-11 2023-07-18 湖南承希科技有限公司 Intelligent processing method for vehicle-mounted video of semantic detection

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109116834A (en) * 2018-09-04 2019-01-01 湖州师范学院 A kind of batch process fault detection method based on deep learning
CN109359519A (en) * 2018-09-04 2019-02-19 杭州电子科技大学 A kind of video anomaly detection method based on deep learning
US20190080210A1 (en) * 2017-09-13 2019-03-14 Hrl Laboratories, Llc Independent component analysis of tensors for sensor data fusion and reconstruction
CN109615019A (en) * 2018-12-25 2019-04-12 吉林大学 Anomaly detection method based on space-time autocoder
US20190135300A1 (en) * 2018-12-28 2019-05-09 Intel Corporation Methods and apparatus for unsupervised multimodal anomaly detection for autonomous vehicles
CN111402237A (en) * 2020-03-17 2020-07-10 山东大学 Video image anomaly detection method and system based on space-time cascade self-encoder
US20200292608A1 (en) * 2019-03-13 2020-09-17 General Electric Company Residual-based substation condition monitoring and fault diagnosis
CN112990279A (en) * 2021-02-26 2021-06-18 西安电子科技大学 Radar high-resolution range profile library outside target rejection method based on automatic encoder
CN113052831A (en) * 2021-04-14 2021-06-29 清华大学 Brain medical image anomaly detection method, device, equipment and storage medium
CN113240011A (en) * 2021-05-14 2021-08-10 烟台海颐软件股份有限公司 Deep learning driven abnormity identification and repair method and intelligent system
CN113255518A (en) * 2021-05-25 2021-08-13 神威超算(北京)科技有限公司 Video abnormal event detection method and chip

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190080210A1 (en) * 2017-09-13 2019-03-14 Hrl Laboratories, Llc Independent component analysis of tensors for sensor data fusion and reconstruction
CN109116834A (en) * 2018-09-04 2019-01-01 湖州师范学院 A kind of batch process fault detection method based on deep learning
CN109359519A (en) * 2018-09-04 2019-02-19 杭州电子科技大学 A kind of video anomaly detection method based on deep learning
CN109615019A (en) * 2018-12-25 2019-04-12 吉林大学 Anomaly detection method based on space-time autocoder
US20190135300A1 (en) * 2018-12-28 2019-05-09 Intel Corporation Methods and apparatus for unsupervised multimodal anomaly detection for autonomous vehicles
US20200292608A1 (en) * 2019-03-13 2020-09-17 General Electric Company Residual-based substation condition monitoring and fault diagnosis
CN111402237A (en) * 2020-03-17 2020-07-10 山东大学 Video image anomaly detection method and system based on space-time cascade self-encoder
CN112990279A (en) * 2021-02-26 2021-06-18 西安电子科技大学 Radar high-resolution range profile library outside target rejection method based on automatic encoder
CN113052831A (en) * 2021-04-14 2021-06-29 清华大学 Brain medical image anomaly detection method, device, equipment and storage medium
CN113240011A (en) * 2021-05-14 2021-08-10 烟台海颐软件股份有限公司 Deep learning driven abnormity identification and repair method and intelligent system
CN113255518A (en) * 2021-05-25 2021-08-13 神威超算(北京)科技有限公司 Video abnormal event detection method and chip

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
NANJUN LI et al.: "Spatial-Temporal Cascade Autoencoder for Video Anomaly Detection in Crowded Scenes", IEEE Transactions on Multimedia, vol. 23, pages 203-215, XP011826794, DOI: 10.1109/TMM.2020.2984093 *
PENG WU et al.: "A Deep One-Class Neural Network for Anomalous Event Detection in Complex Scenes", IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 07, pages 2609-2622, XP011797552, DOI: 10.1109/TNNLS.2019.2933554 *
YUANHONG ZHONG et al.: "Reverse erasure guided spatio-temporal autoencoder with compact feature representation for video anomaly detection", Science China Information Sciences, vol. 65, no. 09, pages 1-3 *
XIA Huosong et al.: "Semi-supervised anomaly detection algorithm based on autoencoder and ensemble learning", Computer Engineering & Science, vol. 42, no. 08, pages 1440-1447 *
ZHANG Li: "Research on video abnormal event detection in surveillance scenes", China Masters' Theses Full-text Database, Information Science and Technology, no. 2021, pages 136-431 *
WANG Fenghua et al.: "Mechanical fault diagnosis of on-load tap changers based on Bayes estimation phase-space fusion and CM-SVDD", Proceedings of the CSEE, vol. 40, no. 01, pages 358-368 *
DENG Miao et al.: "An anomaly detection method based on feature regularization constraints", Journal of Sichuan University (Natural Science Edition), vol. 57, no. 06, pages 1077-1083 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114896307A (en) * 2022-06-30 2022-08-12 北京航空航天大学杭州创新研究院 Time series data enhancement method and device and electronic equipment
CN114896307B (en) * 2022-06-30 2022-09-27 北京航空航天大学杭州创新研究院 Time series data enhancement method and device and electronic equipment
CN115527151A (en) * 2022-11-04 2022-12-27 南京理工大学 Video anomaly detection method and system, electronic equipment and storage medium
CN115527151B (en) * 2022-11-04 2023-07-11 南京理工大学 Video anomaly detection method, system, electronic equipment and storage medium
CN116450880A (en) * 2023-05-11 2023-07-18 湖南承希科技有限公司 Intelligent processing method for vehicle-mounted video of semantic detection
CN116450880B (en) * 2023-05-11 2023-09-01 湖南承希科技有限公司 Intelligent processing method for vehicle-mounted video of semantic detection

Also Published As

Publication number Publication date
CN113705490B (en) 2023-09-12

Similar Documents

Publication Publication Date Title
Li et al. Flow-grounded spatial-temporal video prediction from still images
Zhao et al. Spatio-temporal autoencoder for video anomaly detection
CN109101896B (en) Video behavior identification method based on space-time fusion characteristics and attention mechanism
Abu Farha et al. When will you do what?-anticipating temporal occurrences of activities
CN109146921B (en) Pedestrian target tracking method based on deep learning
Sun et al. Lattice long short-term memory for human action recognition
CN108805015B (en) Crowd abnormity detection method for weighted convolution self-coding long-short term memory network
Ahuja et al. Probabilistic modeling of deep features for out-of-distribution and adversarial detection
CN108875624B (en) Face detection method based on multi-scale cascade dense connection neural network
CN113705490B (en) Anomaly detection method based on reconstruction and prediction
CN106127804B (en) The method for tracking target of RGB-D data cross-module formula feature learnings based on sparse depth denoising self-encoding encoder
CN112329685A (en) Method for detecting crowd abnormal behaviors through fusion type convolutional neural network
CN112115769A (en) Unsupervised sparse population abnormal behavior detection algorithm based on video
Feng et al. Online learning with self-organizing maps for anomaly detection in crowd scenes
CN107590427B (en) Method for detecting abnormal events of surveillance video based on space-time interest point noise reduction
CN110580472A (en) video foreground detection method based on full convolution network and conditional countermeasure network
CN112906631B (en) Dangerous driving behavior detection method and detection system based on video
CN110188668B (en) Small sample video action classification method
CN112801019B (en) Method and system for eliminating re-identification deviation of unsupervised vehicle based on synthetic data
CN113312973A (en) Method and system for extracting features of gesture recognition key points
Guo et al. Exposing deepfake face forgeries with guided residuals
CN117237994B (en) Method, device and system for counting personnel and detecting behaviors in oil and gas operation area
Katircioglu et al. Self-supervised human detection and segmentation via background inpainting
CN110751005B (en) Pedestrian detection method integrating depth perception features and kernel extreme learning machine
CN113989709A (en) Target detection method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant