CN113762007B - Abnormal behavior detection method based on appearance and action feature double prediction - Google Patents

Abnormal behavior detection method based on appearance and action feature double prediction Download PDF

Info

Publication number
CN113762007B
CN113762007B (Application CN202011263894.5A)
Authority
CN
China
Prior art keywords
appearance
frame
network
video frame
memory
Prior art date
Legal status
Active
Application number
CN202011263894.5A
Other languages
Chinese (zh)
Other versions
CN113762007A (en)
Inventor
陈洪刚
李自强
王正勇
何小海
刘强
吴晓红
熊书琪
Current Assignee
Sichuan University
Original Assignee
Sichuan University
Priority date
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202011263894.5A
Publication of CN113762007A
Application granted
Publication of CN113762007B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods


Abstract

The invention discloses an abnormal behavior detection method based on appearance and action feature double prediction, relating to the fields of computer vision and artificial intelligence. The method comprises the following steps: (1) sequentially read the video frame sequence and compute the inter-frame difference of adjacent images, obtaining a fixed-length video frame sequence and the corresponding frame difference map sequence; (2) using a dual-stream network model with a memory enhancement module, extract the appearance and action features characteristic of normal behavior through an appearance sub-network and an action sub-network respectively, and predict the video frame image and the frame difference map; (3) add and fuse the predicted video frame and frame difference map to obtain the final predicted video frame; (4) obtain the frame anomaly score by evaluating the action and appearance features extracted by the memory enhancement module together with the quality of the final predicted image. The invention adopts a deep learning method based on a prediction model, can effectively detect video frames containing abnormal behavior, and improves the accuracy of anomaly detection.

Description

Abnormal behavior detection method based on appearance and action feature double prediction
Technical Field
The invention relates to an abnormal behavior detection method based on appearance and action feature double prediction, and belongs to the field of computer vision and security monitoring.
Background
Abnormal behavior detection is a computer vision technology whose purpose is to detect behavior in video that is considered abnormal. Public safety attracts ever more attention, and large numbers of monitoring devices are deployed at all kinds of sites, generating a huge volume of video; monitoring all of these pictures in real time by hand is difficult and consumes great human resources. An abnormal behavior detection algorithm can detect abnormal behavior in surveillance video and give timely warning, greatly reducing labor cost and improving efficiency. Abnormal behavior detection therefore has broad application prospects in video surveillance, intelligent security, transportation and other fields.
For abnormal behavior detection in video, because abnormal behavior occurs rarely and such data are difficult to collect, most current methods adopt semi-supervised learning trained only on normal video, and reconstruction- or prediction-based methods have become the mainstream owing to their good detection effect. Such methods feed several continuous video frames into an autoencoder network or a generative adversarial network to reconstruct the input frames or predict the next frame, and judge abnormality from the quality of the reconstruction or prediction. Although this type of approach achieves good results, it still suffers from the following problems: (1) abnormal behavior may be abnormal in appearance, in action, or in both, yet current reconstruction and prediction methods do not fully utilize appearance and action information; (2) the diversity of normal behavior and the complexity of backgrounds prevent the network from correctly learning the characteristic features of normal samples; in addition, the strong generative capacity of convolutional neural networks means that abnormal samples may also be reconstructed or predicted well, which harms the final anomaly detection accuracy.
Disclosure of Invention
To remedy the defects of the prior art, the invention provides an abnormal behavior detection method based on appearance and action feature double prediction, designing a dual-stream network structure containing a memory enhancement module to predict appearance and action features, so that abnormal video frames obtain larger prediction errors and the accuracy of abnormal behavior detection is improved.
The invention adopts the following technical scheme: an abnormal behavior detection method based on appearance and action feature double prediction, the method comprising the following steps:
(1) Sequentially reading video frames, calculating the inter-frame difference of adjacent images, and obtaining a video frame sequence with fixed length and a corresponding frame difference image sequence;
(2) Extracting special appearance and action characteristics belonging to normal behaviors through an appearance sub-network and an action sub-network respectively by utilizing a double-flow network model introduced with a memory enhancement module, and predicting a video frame image and a frame difference image;
(3) Adding and fusing the predicted video frame and the frame difference map to obtain a final predicted video frame;
(4) The anomaly score for the frame is obtained by measuring the motion and appearance characteristics extracted by the memory enhancement module and the final predicted image quality.
Compared with the prior art, the invention has the beneficial effects that:
1. the invention simultaneously feeds the video frame sequence and the RGB frame difference map sequence into a dual-stream convolutional autoencoder network for prediction; compared with existing methods that use optical flow images to extract action features, using frame difference maps reduces network complexity and computation;
2. the invention improves the network structure of the encoder and the decoder in the self-coding network, so that the characteristics are better extracted, and the image prediction quality is improved;
3. according to the invention, the memory enhancement module is added, so that the characteristics of a normal sample are better learned, the robustness of a network is enhanced, and the abnormal video can obtain higher abnormal score;
4. according to the invention, the abnormal score considers the quality of the predicted image, and the feature similarity score of the extracted sample features and the normal behavior features is used as an evaluation basis, so that the effect of abnormal detection is effectively improved, and the false detection rate is reduced.
Drawings
FIG. 1 is a flow chart of an abnormal behavior detection method of the present invention;
FIG. 2 is a network structure diagram of the abnormal behavior detection based on appearance and action feature double prediction of the present invention;
fig. 3 is a block diagram of upsampling and downsampling modules in an encoder and decoder of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the detailed description is presented merely to illustrate the invention and is not intended to limit the invention.
As shown in figs. 1-2, the abnormal behavior detection method based on appearance and action feature double prediction includes the following steps:
(1) Sequentially reading video frames, calculating the inter-frame difference of adjacent images, and obtaining a video frame sequence with fixed length and a corresponding frame difference image sequence;
(2) Extracting special appearance and action characteristics belonging to normal behaviors through an appearance sub-network and an action sub-network respectively by utilizing a double-flow network model introduced with a memory enhancement module, and predicting a video frame image and a frame difference image;
(3) Adding and fusing the predicted video frame and the frame difference map to obtain a final predicted video frame;
(4) The frame anomaly score is obtained by measuring the motion and appearance characteristics extracted by the memory enhancement module and the final predicted image quality.
The detailed steps are as follows:
step 1: and obtaining a video frame with a fixed length and a frame difference map. A video stream is obtained from a fixed camera, after the video is subjected to framing treatment, a continuous video frame sequence with a fixed length of t is selected, wherein the previous t-1 frame image is directly sent into an appearance subnetAnd (5) collaterals. For the video stream of the fixed camera, the background image I of the video can be obtained by an OpenCV method B Then subtracting I from t frame RGB video image B Obtaining a foreground image I 'without background noise' 1 ,I′ 2 ,…,I′ t Finally subtracting the previous frame from the next frame of the foreground image sequence to obtain t-1 continuous frame difference image sequence X required by the action sub-network 1 ,X 2 ,…,X t-1
Step 2: and respectively sending the video frames with fixed lengths and the frame difference images into a double-flow network which is introduced with the memory enhancement module for prediction, and generating predicted video frames and RGB frame difference images.
For the network architecture, as shown in fig. 2, the network consists of two structurally identical autoencoder sub-networks; autoencoders are widely used for feature extraction and for image reconstruction and prediction tasks. Taking the appearance sub-network as an example: the sub-network is a cascade of an encoder E_a, a memory enhancement module M_a and a decoder D_a. The encoder and decoder are connected by skip connections at feature layers of the same resolution, and the memory enhancement module enhances the features extracted by the encoder with normal-sample features before sending them to the decoder for reconstruction. The invention improves the up-sampling and down-sampling layers of the encoder and decoder, as shown in fig. 3: both improved modules adopt a residual-like structure. The two branches of the down-sampling module pass, respectively, through convolutions with different kernels and a max-pooling operation; the up-sampling module uses deconvolutions with convolution kernels of different sizes. The improved convolution kernels capture more information and extract more effective semantic features. Let the input of the appearance sub-network be I_1, I_2, ..., I_t. The encoder E_a down-samples it to extract deep features Z_a such as image scene and target appearance information; the memory enhancement module M_a then performs normal-sample memory enhancement on Z_a to obtain the enhanced feature Z'_a; finally the decoder D_a takes Z'_a as input and predicts the (t+1)-th frame, as shown in formula (1):

Î_{t+1} = D_a(Z'_a; θ_D),  Z'_a = M_a(Z_a; θ_M),  Z_a = E_a(I_1, I_2, ..., I_t; θ_E)    (1)

where θ_E, θ_M and θ_D denote the parameters of the encoder E_a, the memory enhancement module M_a and the decoder D_a, respectively.
The memory enhancement module is specifically described as follows:
The module contains memory items storing M normal-sample feature vectors locally. During the training phase, the encoder feeds all extracted normal-sample features into the module, which learns the M features most representative of normal samples and stores them locally. The function of the module is realized by two operations: reading and updating.
The read operation generates the enhanced features used by the decoder for reconstruction; it exists in both the training and testing phases of the network. It proceeds as follows: for each encoder output feature z_k, compute the cosine similarity between z_k and every feature p_m stored in the memory items, as shown in formula (2):

s(z_k, p_m) = (z_k · p_m) / (||z_k|| ||p_m||)    (2)

where k and m are the indices of the features z and p, respectively. Applying a softmax function to s(z_k, p_m) over the memory items yields the read weight ω_{k,m}, as shown in formula (3):

ω_{k,m} = exp(s(z_k, p_m)) / Σ_{m'=1}^{M} exp(s(z_k, p_{m'}))    (3)

Applying the computed weights ω_{k,m} to the memory item features p_m gives the memory-enhanced feature ẑ_k, computed as formula (4):

ẑ_k = Σ_{m=1}^{M} ω_{k,m} p_m    (4)
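The read operation above (cosine similarity, softmax read weights, weighted sum of memory items) can be sketched in NumPy as follows; the function name is illustrative:

```python
import numpy as np

def memory_read(z, memory):
    """Memory read: cosine similarity -> softmax read weights -> weighted
    sum of memory items (formulas (2)-(4)). z: (K, D) queries, memory: (M, D)."""
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    pn = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    sim = zn @ pn.T                        # s(z_k, p_m), shape (K, M)
    w = np.exp(sim)
    w /= w.sum(axis=1, keepdims=True)      # softmax over the M memory items
    return w @ memory                      # enhanced features, shape (K, D)

z = np.array([[1.0, 0.0], [0.0, 1.0]])
memory = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
z_hat = memory_read(z, memory)
print(z_hat.shape)  # (2, 2)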
the updating operation only exists in the training stage and is used for learning the characteristic features of the normal sample, the cosine similarity is calculated by using the formula (1) firstly, and then the updating weight v is calculated m,k The calculation method is as shown in formula (5):
the calculation method of the updated local memory is as shown in formula (6):
in order for the memory item to truly memorize the characteristics of the normal sample, the module introduces a characteristic compression loss L c And feature separation loss L s Two loss functions. Characteristic compression loss L c As shown in formula (7):
p in the above τ Representing all memory items and z k One item with the highest similarity.
Feature separation loss L s The calculation method of (2) is shown in the formula (9):
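A minimal NumPy sketch of the two memory losses (the helper names are illustrative; selecting p_τ and p_γ by L2 distance rather than by the read weights, and the default margin value, are simplifying assumptions):

```python
import numpy as np

def compression_loss(z, memory):
    """L_c: distance of each query to its nearest memory item (formula (7))."""
    d = np.linalg.norm(z[:, None] - memory[None], axis=2)  # (K, M) pairwise L2
    return d.min(axis=1).sum()

def separation_loss(z, memory, alpha=1.0):
    """L_s: pull queries toward the nearest item while pushing them away
    from the second-nearest by a margin alpha (formula (9))."""
    d = np.linalg.norm(z[:, None] - memory[None], axis=2)
    d_sorted = np.sort(d, axis=1)
    nearest, second = d_sorted[:, 0], d_sorted[:, 1]
    return np.maximum(nearest - second + alpha, 0.0).sum()

z = np.array([[0.0, 0.0]])
memory = np.array([[1.0, 0.0], [0.0, 3.0]])
lc = compression_loss(z, memory)
ls = separation_loss(z, memory)
print(lc, ls)  # 1.0 0.0
```

Minimizing L_c makes memory items compact prototypes of normal features, while L_s keeps distinct items from collapsing onto one another.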
where τ and γ denote the index values of m at which ω_{k,m} in formula (3) takes its maximum and second maximum, respectively.

Step 3: the appearance sub-network and the action sub-network of step 2 predict, respectively, the predicted video frame and the predicted RGB frame difference map; adding the two predicted images gives the network's final predicted (t+1)-th frame Î_{t+1}.

Step 4: the anomaly score is computed as follows:
First compute the peak signal-to-noise ratio (PSNR) between the predicted (t+1)-th frame Î_{t+1} and the real frame I_{t+1}, as shown in formula (10):

P(I_{t+1}, Î_{t+1}) = 10 · log_10( [max(Î_{t+1})]² / ( (1/N) Σ_i (I_{t+1,i} − Î_{t+1,i})² ) )    (10)

where N is the total number of pixels of the (t+1)-th frame image I_{t+1} and max(Î_{t+1}) is the maximum possible pixel value.
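A short NumPy sketch of the PSNR computation of formula (10), assuming images normalized to [0, 1] so the peak value is 1.0:

```python
import numpy as np

def psnr(real, pred, peak=1.0):
    """PSNR between the real and predicted frame (formula (10));
    peak is the maximum possible pixel value."""
    mse = np.mean((real - pred) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

real = np.ones((4, 4))
pred = np.full((4, 4), 0.9)
print(round(psnr(real, pred), 2))  # 20.0
```

A poorly predicted (abnormal) frame has a large MSE and hence a low PSNR, which later raises its anomaly score.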
Next compute, for each encoder output feature z_k of the appearance sub-network and of the action sub-network, the L2 distance to the memory item feature p_τ of the memory enhancement module as the feature similarity score of the two sub-networks, as shown in formula (11):

D(z_k, p_τ) = ||z_k − p_τ||_2    (11)

where τ is the index of the memory item feature with the highest similarity to z_k.
Finally, after the three scores are normalized to [0, 1], the weight of each score is balanced by a hyper-parameter β, as shown in formula (12):

S_t = β(1 − P′(Î_{t+1}, I_{t+1})) + (1 − β)(D′_a(z_a, p_a) + D′_m(z_m, p_m))    (12)

where P′(Î_{t+1}, I_{t+1}), D′_a(z_a, p_a) and D′_m(z_m, p_m) denote the normalized PSNR, appearance feature similarity score and action feature similarity score, respectively; a higher score indicates a greater possibility that the frame is abnormal.
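The fusion of step 4 can be sketched as follows; the min-max normalization over the video's per-frame scores and the specific weighted form are assumptions consistent with the description (low PSNR and large feature distances both raise the score):

```python
import numpy as np

def minmax(x):
    """Normalize a sequence of per-frame scores to [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def anomaly_scores(psnrs, d_app, d_act, beta=0.5):
    """Fuse normalized PSNR and the two feature-distance scores into a
    per-frame anomaly score in the spirit of formula (12)."""
    p = minmax(psnrs)
    return beta * (1.0 - p) + (1.0 - beta) * (minmax(d_app) + minmax(d_act))

# frame 1 has low PSNR and large feature distances -> highest anomaly score
scores = anomaly_scores([30.0, 10.0, 28.0], [0.1, 0.9, 0.2], [0.2, 0.8, 0.1])
print(int(np.argmax(scores)))  # 1
```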
To verify the effectiveness of the method, the invention was trained and tested on three public data sets commonly used in video anomaly detection: Avenue, UCSD-Ped2 and ShanghaiTech. Four deep-learning-based abnormal behavior detection methods were selected for comparison:
Method 1: the method of Abati et al.; see D. Abati, A. Porrello, S. Calderara, and R. Cucchiara, "Latent space autoregression for novelty detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 481-490.
Method 2: the method of Nguyen et al.; see T.-N. Nguyen and J. Meunier, "Anomaly detection in video sequence with appearance-motion correspondence," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1273-1283.
Method 3: the method of Liu et al.; see W. Liu, W. Luo, D. Lian, and S. Gao, "Future frame prediction for anomaly detection - a new baseline," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6536-6545.
Method 4: the method of Gong et al.; see D. Gong et al., "Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1705-1714.
As shown in Table 1, with AUC as the evaluation index on the three data sets, the recognition accuracy of the proposed method is considerably higher than that of the other four methods.
Table 1: Comparison of the evaluation index (AUC) with other methods
Finally, it should be noted that the above embodiments are intended only to illustrate the technical scheme of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will appreciate that the described techniques may be modified, or some or all of their technical features replaced by equivalents, and that such modifications and substitutions do not depart from the spirit of the invention.

Claims (4)

1. An abnormal behavior detection method based on appearance and action feature double prediction, characterized by comprising the following steps:
(1) Sequentially reading video frames, calculating the inter-frame difference of adjacent images, and obtaining a video frame sequence with fixed length and a corresponding frame difference image sequence;
(2) Extracting special appearance and action characteristics belonging to normal behaviors through an appearance sub-network and an action sub-network respectively by utilizing a double-flow network model introduced with a memory enhancement module, and predicting a video frame image and a frame difference image;
the double-flow network structure introduced with the memory enhancement module comprises an appearance sub-network and an action sub-network which are two paths of convolutional neural networks, wherein the appearance sub-network and the action sub-network are composed of self-encoder networks with the same structure; the self-encoder network consists of an encoder, a decoder and a memory enhancement module, wherein the memory enhancement module is cascaded between the encoder and the decoder;
the memory enhancement module comprises M memory items for storing normal sample feature vectors locally, and the memory enhancement module is divided into two operations of reading and updating:
the read operation has both training and testing phases of the network; the reading operation steps are as follows: for the encoder output feature z, calculating the cosine similarity of the feature p stored in the memory term and the z, wherein the calculation formula is shown in (1):
where k, m are the indices of the features z and p, respectively, for s (z k ,p m ) Obtaining the reading weight omega by applying softmax function k,m The calculation formula is as follows (2):
applying the calculated corresponding weights ω to the memory term features p k,m Obtaining the characteristics after memory enhancementThe calculation method is as follows:
the update operation only exists in the training stage, and the cosine similarity is calculated by using the formula (1) and then the update weight v is calculated m,k The calculation method is as shown in formula (4):
the calculation method of the updated local memory is as shown in formula (5):
the updated local memory items will be saved locally for use in the read operations of training and testing;
(3) Adding and fusing the predicted video frame and the frame difference map to obtain a final predicted video frame;
(4) The frame anomaly score is obtained by measuring the motion and appearance characteristics extracted by the memory enhancement module and the final predicted image quality.
2. The abnormal behavior detection method based on appearance and action feature double prediction according to claim 1, characterized in that the frame difference map in step (1) is calculated as follows:
for each video segment, firstly, extracting a background image of the video segment; secondly, subtracting a background image from a video frame sequence with a fixed length of t to obtain a foreground target image with a background removed; and finally, subtracting the front and back adjacent frames to obtain a frame difference map sequence with the length of t-1.
3. The abnormal behavior detection method based on appearance and action feature double prediction according to claim 1, characterized by the following network structure of the encoder and the decoder:
the encoder and the decoder comprise three down-sampling layers and three up-sampling layers, respectively; the down-sampling layer adopts a residual structure whose two branches use max pooling and convolution, respectively, to reduce the resolution and increase the number of channels; the two branches of the up-sampling layer use deconvolutions with convolution kernels of different sizes to increase the resolution and reduce the number of channels; the encoder and the decoder are connected by skip connections at feature layers of the same resolution.
4. The abnormal behavior detection method based on appearance and action feature double prediction according to claim 1, characterized in that:
the final predicted video frame in step (3) is obtained as follows:
the continuous t−1 video frame images are input into the appearance sub-network for prediction to obtain the t-th frame Î_t; simultaneously, the t−1 frame difference maps are input into the action sub-network for prediction to obtain the t-th frame difference map X̂_t; finally Î_t and X̂_t are added and fused to obtain the final predicted video frame;
the anomaly score in step (4) is computed as follows:
(4.1) compute the peak signal-to-noise ratio (PSNR) between the final predicted frame Î and the real frame I;
(4.2) compute, for the encoder output features z_k in the appearance sub-network and the action sub-network respectively, the L2 distance to the memory item feature p_τ of the memory enhancement module as the feature similarity score of each sub-network, as shown in formula (6):

D(z_k, p_τ) = ||z_k − p_τ||_2    (6)

where τ is the index of the memory item feature with the highest similarity to z_k;
(4.3) normalize the three scores of step (4.1) and step (4.2) to [0, 1] and fuse them by weighted addition to obtain the final anomaly score; the higher the score, the greater the possibility that the video frame is abnormal; the score is computed as shown in formula (7):

S = β(1 − P′(Î, I)) + (1 − β)(D′_a(z_a, p_a) + D′_m(z_m, p_m))    (7)

where P′(Î, I), D′_a(z_a, p_a) and D′_m(z_m, p_m) denote the normalized PSNR, appearance feature similarity score and action feature similarity score, respectively, and β is a hyper-parameter balancing the terms.
CN202011263894.5A 2020-11-12 2020-11-12 Abnormal behavior detection method based on appearance and action feature double prediction Active CN113762007B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011263894.5A CN113762007B (en) 2020-11-12 2020-11-12 Abnormal behavior detection method based on appearance and action feature double prediction


Publications (2)

Publication Number Publication Date
CN113762007A CN113762007A (en) 2021-12-07
CN113762007B 2023-08-01

Family

ID=78785994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011263894.5A Active CN113762007B (en) 2020-11-12 2020-11-12 Abnormal behavior detection method based on appearance and action feature double prediction

Country Status (1)

Country Link
CN (1) CN113762007B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117640988A (en) * 2023-12-04 2024-03-01 书行科技(北京)有限公司 Video processing method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107358195A (en) * 2017-07-11 2017-11-17 成都考拉悠然科技有限公司 Nonspecific accident detection and localization method, computer based on reconstruction error
CN110415236A (en) * 2019-07-30 2019-11-05 深圳市博铭维智能科技有限公司 A kind of method for detecting abnormality of the complicated underground piping based on double-current neural network
CN111414876A (en) * 2020-03-26 2020-07-14 西安交通大学 Violent behavior identification method based on time sequence guide space attention
CN111860229A (en) * 2020-07-01 2020-10-30 上海嘉沃光电科技有限公司 Intelligent abnormal behavior identification method and device and storage medium
CN111897913A (en) * 2020-07-16 2020-11-06 浙江工商大学 Semantic tree enhancement based cross-modal retrieval method for searching video from complex text


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Hyunjong Park et al., "Learning Memory-Guided Normality for Anomaly Detection," 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 14360-14369. *
Limin Wang et al., "Temporal Segment Networks for Action Recognition in Videos," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 11, 2019, pp. 2740-2755. *
Trong-Nguyen Nguyen et al., "Anomaly Detection in Video Sequence With Appearance-Motion Correspondence," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 1273-1283. *
Xu Chenyang et al., "Research on nerve segmentation methods for ultrasound images," Journal of Beijing University of Posts and Telecommunications, vol. 41, no. 1, 2018, pp. 115-120. *
Li Ziqiang et al., "Video abnormal behavior detection based on a dual prediction model of appearance and action features," Journal of Computer Applications, vol. 41, no. 10, 2021, pp. 2997-3003. *
Fan Yaxiang, "Research on deep-learning-based video abnormal event detection methods," China Doctoral Dissertations Full-text Database, Information Science and Technology, no. 02, 2020, pp. I138-110. *

Also Published As

Publication number Publication date
CN113762007A (en) 2021-12-07

Similar Documents

Publication Publication Date Title
CN109961019B (en) Space-time behavior detection method
CN110135366B (en) Shielded pedestrian re-identification method based on multi-scale generation countermeasure network
CN111563557B (en) Method for detecting target in power cable tunnel
CN114022432B (en) Insulator defect detection method based on improved yolov5
Chen et al. Local patch network with global attention for infrared small target detection
CN111626245A (en) Human behavior identification method based on video key frame
CN110826429A (en) Scenic spot video-based method and system for automatically monitoring travel emergency
CN113128360A (en) Driver driving behavior detection and identification method based on deep learning
CN112446242A (en) Acoustic scene classification method and device and corresponding equipment
CN112489072B (en) Vehicle-mounted video perception information transmission load optimization method and device
CN111223087A (en) Automatic bridge crack detection method based on generation countermeasure network
CN113762007B (en) Abnormal behavior detection method based on appearance and action feature double prediction
Ren et al. A lightweight object detection network in low-light conditions based on depthwise separable pyramid network and attention mechanism on embedded platforms
CN112633100B (en) Behavior recognition method, behavior recognition device, electronic equipment and storage medium
CN114359167A (en) Insulator defect detection method based on lightweight YOLOv4 in complex scene
CN115909144A (en) Method and system for detecting abnormity of surveillance video based on counterstudy
CN115761667A (en) Unmanned vehicle carried camera target detection method based on improved FCOS algorithm
CN115393714A (en) Power transmission line bolt missing pin detection method based on fusion graph theory reasoning
CN115601674A (en) Power transmission project smoke hidden danger identification method and device and storage medium
Kalshetty et al. Abnormal event detection model using an improved ResNet101 in context aware surveillance system
CN114937222A (en) Video anomaly detection method and system based on double branch network
Wei et al. Pedestrian anomaly detection method using autoencoder
Tran et al. Anomaly detection using prediction error with spatio-temporal convolutional LSTM
Cao et al. Single image super-resolution via deep learning
Ji et al. DRI-Net: a model for insulator defect detection on transmission lines in rainy backgrounds

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant