CN109359519B - Video abnormal behavior detection method based on deep learning - Google Patents

Info

Publication number: CN109359519B (other version: CN109359519A, in Chinese)
Application number: CN201811026243.7A
Authority: CN (China)
Inventors: 陈华华, 刘萍, 郭春生, 叶学义
Applicant and current assignee: Hangzhou Dianzi University
Legal status: Expired - Fee Related

Classifications

    • G06V20/46 — Scenes; scene-specific elements in video content: extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06F18/214 — Pattern recognition; analysing: generating training patterns; bootstrap methods, e.g. bagging or boosting


Abstract

The invention relates to a video abnormal behavior detection method based on deep learning. The method comprises a training phase and a testing phase. In the training phase, a training video sequence is converted into grayscale images and optical flow images, which are made into spatio-temporal blocks and fed into residual self-coding (autoencoder) models for training, establishing one model based on grayscale images and one based on optical flow images. The trained models capture the motion patterns and appearance information of normal behaviors, which facilitates the reconstruction of normal behaviors. In the testing phase, the test data are likewise made into grayscale and optical-flow spatio-temporal blocks. The grayscale spatio-temporal blocks are first input into the grayscale model to preliminarily separate normal regions from suspicious regions; the optical-flow spatio-temporal blocks of the suspicious regions are then input into the optical-flow model to separate normal regions from abnormal regions, yielding the final abnormality decision.

Description

Video abnormal behavior detection method based on deep learning
Technical Field
The invention belongs to the technical field of video processing and relates to a method for detecting abnormal behaviors in video, in particular to a video abnormal behavior detection method based on deep learning.
Background
Video abnormal behavior detection belongs to the field of intelligent video surveillance: intelligent algorithms detect abnormal behaviors in surveillance video and raise alarm signals so that the relevant departments can respond faster. The development of video abnormal behavior detection technology plays an important role in maintaining the safety of public places and in saving manpower and material resources.
The definition of an abnormal event differs across video scenes, and even within the same scene the types of abnormal events are diverse. In general, an abnormal event is one with a relatively low probability of occurrence, in contrast to normal events. Current anomaly detection methods can be roughly divided into those based on hand-crafted feature selection and those based on automatic feature selection with deep neural networks. Most methods involve two parts: in the training phase, one or more models of normal data are built by unsupervised learning of the appearance and motion characteristics of the training data; in the testing phase, video data are judged abnormal according to whether they match the model. Early researchers detected abnormal events by studying the trajectories of moving objects in videos, but such methods cannot handle crowd occlusion. Methods based on dictionary learning and sparse reconstruction have also been proposed.
In recent years, deep neural networks have shown advantages over hand-crafted features in data representation for tasks such as image classification and action recognition, with further breakthroughs in both speed and accuracy. Applying deep neural networks to abnormal behavior detection can therefore achieve better results than traditional feature extraction methods.
Disclosure of Invention
The invention aims to provide a video abnormal behavior detection method based on deep learning that improves the detection rate of abnormal behaviors.
To solve this technical problem, the technical scheme provided by the invention integrates several deep learning network structures. The method comprises a training stage and a testing stage, the details of which are as follows:
1. Training stage:
First, preprocessing is performed: a video data set to be detected is selected, comprising a training data set and a test data set, where all training data in the training data set are samples of normal behavior; a grayscale image sequence of the training data set is extracted, with the image size normalized to M×N, where M and N denote the width and height of the image; an optical flow image sequence is computed from the grayscale image sequence of the training data set, yielding a grayscale image data set and an optical flow image data set for the training data set; then the following operations are performed:
Step (1): the grayscale images of the grayscale image data set in the training data set are divided into non-overlapping spatial blocks in a grid fashion; each frame image of size M×N is divided into spatial blocks g_ri of size p×p. Blocks at the same spatial position in t consecutive frames are combined into a spatio-temporal sample g_r = [g_r1, g_r2, …, g_rt] of size t×p×p.
Likewise, the optical flow images of the optical flow image data set in the training data set are divided into non-overlapping spatial blocks in a grid fashion; each frame image of size M×N is divided into spatial blocks o_ri of size p×p. Blocks at the same spatial position in t consecutive frames are combined into a spatio-temporal sample o_r = [o_r1, o_r2, …, o_rt] of size t×p×p.
The training data set is thus divided into a grayscale-image spatio-temporal sample set G_r and an optical-flow-image spatio-temporal sample set O_r.
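As an illustration of the blocking in step (1) (the test-stage steps (4) and (6) proceed the same way), the following is a minimal NumPy sketch. The function name, the array layout, and the non-overlapping grouping of frames in time are assumptions; the patent does not specify a temporal stride.

```python
import numpy as np

def make_spatiotemporal_blocks(frames, p=20, t=10):
    """frames: array of shape (num_frames, height, width).
    Returns an array of shape (num_samples, t, p, p): non-overlapping
    p-by-p spatial blocks stacked over t consecutive frames."""
    num_frames, height, width = frames.shape
    samples = []
    for start in range(0, num_frames - t + 1, t):          # t consecutive frames per sample
        clip = frames[start:start + t]
        for i in range(0, height - p + 1, p):              # non-overlapping grid in space
            for j in range(0, width - p + 1, p):
                samples.append(clip[:, i:i + p, j:j + p])  # same block position in all t frames
    return np.stack(samples)
```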
Step (2), building a residual self-coding network, wherein the residual self-coding network structure is divided into an input layer, a middle network layer and an output layer; the middle network layer consists of a convolution layer a, a convolution long-time memory layer b and a convolution layer a, and is combined and connected in a residual error structure mode;
Step (3): the grayscale-image spatio-temporal sample set G_r of the training data set is input into the residual self-coding network of step (2) for iterative training, with the mean squared error between the output data and the input data as the training objective function; iteration stops after a maximum of 200 iterations, or when the objective function changes by less than 0.1 relative to the previous iteration; after training, a model A1 based on grayscale image data reconstruction is obtained. G_r is input into model A1 to obtain the model A1 output data, the two-norm of the difference between the input data and the model A1 output data is taken as the reconstruction error, and the maximum reconstruction error is recorded as α.
The optical-flow-image spatio-temporal sample set O_r of the training data set is input into the residual self-coding network of step (2) for iterative training in the same way, with the mean squared error between the output data and the input data as the training objective function and the same stopping criterion; after training, a model A2 based on optical flow image data reconstruction is obtained. O_r is input into model A2 to obtain the model A2 output data, the two-norm of the difference between the input data and the model A2 output data is taken as the reconstruction error, and the maximum reconstruction error is recorded as β.
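A minimal sketch of the reconstruction-error threshold in step (3), assuming `model_a1` is a trained Keras autoencoder (one possible architecture is sketched in the detailed description below) and `G_r` is a sample array as produced by the hypothetical `make_spatiotemporal_blocks` above with a trailing channel axis added; the per-sample two-norm and the max-based threshold follow the text, the rest is an assumption.

```python
import numpy as np

def reconstruction_errors(model, samples):
    """Per-sample two-norm of (input - reconstruction)."""
    recon = model.predict(samples)
    diff = (samples - recon).reshape(len(samples), -1)
    return np.linalg.norm(diff, axis=1)

# model_a1.fit(G_r, G_r, epochs=200, callbacks=[...])  # MSE objective; stop early once the
#                                                      # loss changes by less than 0.1
alpha = reconstruction_errors(model_a1, G_r).max()     # grayscale-model threshold
```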
2. Testing stage:
First, preprocessing is performed: a video data set to be detected is selected; a grayscale image sequence of the test data set is extracted, with the image size normalized to M×N, where M and N denote the width and height of the image; an optical flow image sequence is computed from the grayscale image sequence of the test data set, yielding a grayscale image data set and an optical flow image data set for the test data set; then the following operations are performed:
Step (4): the grayscale images of the grayscale image data set in the test data set are divided into non-overlapping spatial blocks in a grid fashion; each frame image of size M×N is divided into spatial blocks g_ei of size p×p. Blocks at the same spatial position in t consecutive frames are combined into a spatio-temporal sample g_e = [g_e1, g_e2, …, g_et] of size t×p×p, yielding a grayscale-image spatio-temporal test sample set G_e of the test data set.
Step (5): G_e is input into model A1; after processing by each network layer, the reconstructed output is obtained, the output data are subtracted from the input data, and the two-norm of the difference is computed as the reconstruction error E1 between input and output. Sample data with reconstruction error E1 less than or equal to α are detected as normal regions; sample data with E1 greater than α are detected as suspicious regions.
Step (6): the optical flow images of the suspicious regions are extracted and divided into non-overlapping spatial blocks in a grid fashion; each frame image of size M×N is divided into spatial blocks o_ei of size p×p. Blocks at the same spatial position in t consecutive frames are combined into a spatio-temporal sample o_e = [o_e1, o_e2, …, o_et] of size t×p×p, yielding a spatio-temporal test sample set O_e of the suspicious regions' optical flow images.
Step (7): O_e is input into model A2 to obtain the reconstructed output, the output data are subtracted from the input data, and the two-norm of the difference is computed as the reconstruction error E2 between input and output. Sample data with reconstruction error E2 less than or equal to β are detected as normal regions; sample data with E2 greater than β are detected as abnormal regions.
Thus, the detection result of the test data set is obtained, and the anomaly detection of the whole system is completed.
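The two-stage cascade of steps (5)–(7) can be summarized as below, reusing the hypothetical `reconstruction_errors` helper from the training-stage sketch and assuming, for simplicity, that an optical-flow block is available at the same index as each grayscale block (the patent instead extracts optical-flow blocks only for the suspicious regions).

```python
import numpy as np

def detect(model_a1, model_a2, gray_blocks, flow_blocks, alpha, beta):
    """Two-stage decision: the grayscale model flags suspicious blocks,
    the optical-flow model refines them into normal vs. abnormal."""
    e1 = reconstruction_errors(model_a1, gray_blocks)
    suspicious = np.flatnonzero(e1 > alpha)              # step (5): normal vs. suspicious
    abnormal = np.zeros(len(gray_blocks), dtype=bool)
    if suspicious.size:
        e2 = reconstruction_errors(model_a2, flow_blocks[suspicious])
        abnormal[suspicious] = e2 > beta                 # step (7): final decision
    return abnormal
```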
Appearance and motion are the most critical cues in the detection; accordingly, the grayscale map and the optical flow map of the video data are extracted as its appearance feature and motion feature representations, respectively, and model training is performed on both features. A residual self-coding model fusing the convolutional neural network and convolutional long short-term memory network structures is adopted: it makes full use of the convolutional network's ability to extract the spatial structure of the data, combines the ConvLSTM's advantage in processing time series, and further strengthens the model's feature-fitting capability through the residual structure, so the model achieves good results on video data.
Drawings
FIG. 1 is a flow chart of the training phase of the method of the present invention;
FIG. 2 is a flow chart of the testing phase of the method of the present invention;
FIG. 3 is a diagram of the residual self-coding network.
Detailed Description
The invention is described in detail below with reference to the figures and the examples.
A video abnormal behavior detection method based on deep learning comprises a training stage and a testing stage.
The training phase consists of three modules. The first is a preprocessing module, whose main function is to obtain the grayscale image data and optical flow image data of the training data set. The second builds the residual self-coding network, i.e. a residual self-coding structure fusing several neural network structures. The third trains the neural network: using training data containing only normal behaviors, it trains the residual self-coding network to obtain two models, one based on grayscale image reconstruction and one based on optical flow image reconstruction.
The testing phase also consists of three modules. The preprocessing module obtains the grayscale image data and optical flow image data of the test data set. The A1 model detection module inputs the test data set's grayscale data into the trained A1 model and, by computing reconstruction errors, separates normal regions from suspicious regions. The A2 model detection module inputs the optical flow data of the suspicious regions into the A2 model and, by computing reconstruction errors, separates normal regions from abnormal regions, yielding the final anomaly detection result.
As shown in FIG. 1, the training phase first performs preprocessing: a video data set to be detected is selected, comprising a training data set and a test data set, where all training data in the training data set are samples of normal behavior; a grayscale image sequence of the training data set is extracted, with the image size normalized to 260×180, where 260 and 180 are the width and height of the image; and an optical flow image sequence is computed from the grayscale image sequence. A grayscale image data set and an optical flow image data set of the training data set are thus obtained. Then the following operations are performed:
Step (1): the grayscale images of the training data set are divided into non-overlapping spatial blocks in a grid fashion; each frame image of size 260×180 is divided into spatial blocks g_ri of size 20×20, and blocks at the same spatial position in 10 consecutive frames are combined into a spatio-temporal sample g_r = [g_r1, g_r2, …, g_r10] of size 10×20×20. Similarly, the optical flow images are divided into non-overlapping spatial blocks in a grid fashion; each frame image of size 260×180 is divided into spatial blocks o_ri of size 20×20, and blocks at the same spatial position in 10 consecutive frames are combined into a spatio-temporal sample o_r = [o_r1, o_r2, …, o_r10] of size 10×20×20. The training data set is thus divided into a grayscale-image spatio-temporal sample set G_r and an optical-flow-image spatio-temporal sample set O_r.
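With these concrete values, each 260×180 frame yields a 13×9 grid of 20×20 blocks, i.e. 117 blocks per frame, and every run of 10 consecutive frames therefore contributes 117 spatio-temporal samples of size 10×20×20.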
Step (2): the residual self-coding network is built; the network structure is divided into an input layer, a middle network layer and an output layer. The middle network layer consists of 3 convolutional layers, 2 convolutional long short-term memory (ConvLSTM) layers and 3 deconvolution layers, combined and connected in a residual structure. As shown in FIG. 3, the spatial structure information of the input data is first extracted by the three convolutional layers Conv1, Conv2 and Conv3; the temporal characteristics of the data are then extracted by the two ConvLSTM layers ConvLSTM1 and ConvLSTM2; finally, the input data are reconstructed by the three deconvolution layers Deconv1, Deconv2 and Deconv3, completing the whole residual self-coding network. The concrete implementation is as follows:
In the Conv1 stage, the input data size is 10×20×20×1; the convolution kernel size is set to 3×3, the number of channels to 64, and a ReLU activation function is used, giving output data of size 10×20×20×64; the output of Conv1 serves as the input of Conv2. Conv2 and Conv3 both use 3×3 convolution kernels with 64 channels and ReLU activation, each producing output of size 10×20×20×64, with the output of Conv2 serving as the input of Conv3. The outputs of Conv1 and Conv3 are added and passed through a ReLU activation to give the input of ConvLSTM1, of size 10×20×20×64; ConvLSTM1 is set to 128 channels with a 3×3 convolution kernel, producing output of size 10×20×20×128, which serves as the input of ConvLSTM2. ConvLSTM2 has 128 channels and a 3×3 convolution kernel, with output size 10×20×20×128. The ConvLSTM1 input data are padded, added to the output of ConvLSTM2, and passed through a ReLU activation to form the input of Deconv1. In the Deconv1 stage, data of size 10×20×20×128 are input; the deconvolution kernel size is set to 3×3, the number of channels to 64, and ReLU activation is used, giving output of size 10×20×20×64; the output of Deconv1 serves as the input of Deconv2. Deconv2, with the same parameter settings as Deconv1, outputs data of size 10×20×20×64 by deconvolution. The input data of Deconv1 are aligned and added to the output data of Deconv2, and the ReLU-activated result serves as the input of Deconv3; Deconv3 is set to a 3×3 deconvolution kernel with 1 channel, and its ReLU-activated output of size 10×20×20×1 serves as the reconstruction of the input data.
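The following Keras sketch is one possible reading of this architecture. The layer names, the zero-padding used to match the 64-channel ConvLSTM1 input to the 128-channel ConvLSTM2 output, and the 1×1 projection used to align the Deconv1 input with the Deconv2 output are assumptions; the patent only says the data are "padded" and "aligned".

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_residual_autoencoder(t=10, p=20):
    inp = layers.Input(shape=(t, p, p, 1))            # 10x20x20x1 spatio-temporal block
    # Conv1-Conv3 with a residual connection from Conv1 to Conv3.
    c1 = layers.TimeDistributed(layers.Conv2D(64, 3, padding="same", activation="relu"))(inp)
    c2 = layers.TimeDistributed(layers.Conv2D(64, 3, padding="same", activation="relu"))(c1)
    c3 = layers.TimeDistributed(layers.Conv2D(64, 3, padding="same", activation="relu"))(c2)
    r1 = layers.Activation("relu")(layers.Add()([c1, c3]))

    # ConvLSTM1-ConvLSTM2 with a residual connection around them.
    l1 = layers.ConvLSTM2D(128, 3, padding="same", return_sequences=True)(r1)
    l2 = layers.ConvLSTM2D(128, 3, padding="same", return_sequences=True)(l1)
    pad = layers.Lambda(lambda x: tf.pad(x, [[0, 0], [0, 0], [0, 0], [0, 0], [0, 64]]))(r1)
    r2 = layers.Activation("relu")(layers.Add()([pad, l2]))   # 64-ch input zero-padded to 128

    # Deconv1-Deconv3 with a residual connection from the Deconv1 input to the Deconv2 output.
    d1 = layers.TimeDistributed(layers.Conv2DTranspose(64, 3, padding="same", activation="relu"))(r2)
    d2 = layers.TimeDistributed(layers.Conv2DTranspose(64, 3, padding="same", activation="relu"))(d1)
    proj = layers.TimeDistributed(layers.Conv2D(64, 1))(r2)   # assumed 1x1 "alignment" of 128 -> 64 channels
    r3 = layers.Activation("relu")(layers.Add()([proj, d2]))
    out = layers.TimeDistributed(layers.Conv2DTranspose(1, 3, padding="same", activation="relu"))(r3)

    model = Model(inp, out)
    model.compile(optimizer="adam", loss="mse")               # MSE objective, as in step (3)
    return model
```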
Step (3) a space-time sample set G of the gray level imagerAnd (3) inputting the residual error into the residual error self-coding network in the step (2) for iterative training, and taking the mean square error of the output data and the input data as a training objective function. After the training is finished, obtaining a model based on gray image data reconstruction, and recording the model as A1; space-time sample set G of gray level image of training data setrInputting the data into an a1 model to obtain output data, and taking a two-norm of a subtraction result of the input data and the output data as a reconstruction error, wherein the maximum value of the reconstruction error is denoted as α, and α is 191; similarly, a spatio-temporal sample set O of the optical flow imagerInputting the data into the residual self-coding network in the step (2) for iterative training, and taking the mean square error of the output data and the input data as a training objective function to obtain a model reconstructed based on the optical flow image data and recording the model as A2; optical flow image space-time sample set O of training data setrThe input data is input to the a2 model to obtain output data, and a two-norm of a result of subtracting the input data and the output data is used as a reconstruction error, and a maximum value of the reconstruction error is denoted by β, β being 100.
As shown in FIG. 2, the testing phase first performs preprocessing: a video data set to be detected is selected; a grayscale image sequence of the test data set is extracted, with the image size normalized to 260×180, where 260 and 180 are the width and height of the image; and the optical flow image sequence is computed from the grayscale image sequence, yielding a grayscale image data set and an optical flow image data set of the test data set. A sketch of this preprocessing is given below.
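A minimal OpenCV sketch of the preprocessing (the training stage's preprocessing is identical apart from the data set). The patent does not name an optical flow algorithm; Farneback flow and the use of the flow magnitude as the "optical flow image" are assumptions.

```python
import cv2
import numpy as np

def preprocess(video_path, size=(260, 180)):
    """Returns grayscale frames and per-frame optical-flow magnitudes at 260x180."""
    cap = cv2.VideoCapture(video_path)
    grays, flows, prev = [], [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.resize(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), size)
        if prev is not None:
            flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            flows.append(np.linalg.norm(flow, axis=2))  # 2-channel flow -> magnitude image
        grays.append(gray)
        prev = gray
    cap.release()
    return np.array(grays), np.array(flows)
```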
Step (4) dividing the gray image of the test data set into non-overlapping space blocks in a grid mode, wherein the size of each frame image is 260 multiplied by 180, and the space blocks are divided into space blocks g with the size of 20 multiplied by 20ei(ii) a Spatially combining spatially identical 10 consecutive frames into a spatio-temporal sample ge=[ge1,ge2,…,ge10]The size is 10 multiplied by 20; thus, a test set G of spatio-temporal samples of a gray image of the test data set is obtainede
Step (5), testing a space-time sample set G of the gray level imageeInputting the data into a model A1, processing the data through each layer of network to obtain the output of data reconstruction, and matching the input data with the output dataSubtracting, and calculating a two-norm of the subtraction result as a reconstruction error between the input and the output, and recording the result as E1; sample data with a reconstruction error E1 smaller than or equal to α is detected as a normal region, and sample data larger than α is detected as a suspicious region.
Step (6) extracting optical flow image of suspicious region, and dividing into non-overlapping space blocks by means of grid, the image size of each frame is 260X 180, and the image size is divided into space blocks o with size of 20X 20ei(ii) a Spatially combining spatially identical 10 consecutive frames into a spatio-temporal sample oe=[oe1,oe2,…,oe10]The size is 10 multiplied by 20; thus obtaining a space-time sample test set O of the optical flow image of the suspicious regione
Step (7) adding OeInputting the data into a model A2 to obtain data reconstruction output, subtracting the input data from the output data, and calculating a two-norm of the subtraction result as a reconstruction error between the input and the output, which is denoted as E2; sample data having a reconstruction error E2 of β or less is detected as a normal region, and sample data having a reconstruction error E2 of β or more is detected as an abnormal region.
Thus, the detection result of the test data set is obtained, and the anomaly detection of the whole system is completed.

Claims (1)

1. A video abnormal behavior detection method based on deep learning, comprising a training stage and a testing stage, characterized in that:
the training phase is as follows:
first, preprocessing is performed: a video data set to be detected is selected, comprising a training data set and a test data set, where all training data in the training data set are samples of normal behavior; a grayscale image sequence of the training data set is extracted, with the image size normalized to M×N, where M and N denote the width and height of the image; an optical flow image sequence is computed from the grayscale image sequence of the training data set, yielding a grayscale image data set and an optical flow image data set for the training data set; then the following operations are performed:
step (1): the grayscale images of the grayscale image data set in the training data set are divided into non-overlapping spatial blocks in a grid fashion; each frame image of size M×N is divided into spatial blocks g_ri of size p×p; blocks at the same spatial position in t consecutive frames are combined into a spatio-temporal sample g_r = [g_r1, g_r2, …, g_rt] of size t×p×p;
the optical flow images of the optical flow image data set in the training data set are divided into non-overlapping spatial blocks in a grid fashion; each frame image of size M×N is divided into spatial blocks o_ri of size p×p; blocks at the same spatial position in t consecutive frames are combined into a spatio-temporal sample o_r = [o_r1, o_r2, …, o_rt] of size t×p×p;
the training data set is thus divided into a grayscale-image spatio-temporal sample set G_r and an optical-flow-image spatio-temporal sample set O_r;
Step (2), building a residual self-coding network, wherein the residual self-coding network structure is divided into an input layer, a middle network layer and an output layer; the middle network layer consists of a convolution layer a, a convolution long-time memory layer b and a convolution layer a, and is combined and connected in a residual error structure mode;
step (3): the grayscale-image spatio-temporal sample set G_r of the training data set is input into the residual self-coding network of step (2) for iterative training, with the mean squared error between the output data and the input data as the training objective function; iteration stops after a maximum of 200 iterations, or when the objective function changes by less than 0.1 relative to the previous iteration; after training, a model A1 based on grayscale image data reconstruction is obtained; G_r is input into model A1 to obtain the model A1 output data, the two-norm of the difference between the input data and the model A1 output data is taken as the reconstruction error, and the maximum reconstruction error is recorded as α;
the optical-flow-image spatio-temporal sample set O_r of the training data set is input into the residual self-coding network of step (2) for iterative training, with the mean squared error between the output data and the input data as the training objective function; iteration stops after a maximum of 200 iterations, or when the objective function changes by less than 0.1 relative to the previous iteration; after training, a model A2 based on optical flow image data reconstruction is obtained; O_r is input into model A2 to obtain the model A2 output data, the two-norm of the difference between the input data and the model A2 output data is taken as the reconstruction error, and the maximum reconstruction error is recorded as β;
the test stage is as follows:
first, preprocessing is performed: a video data set to be detected is selected; a grayscale image sequence of the test data set is extracted, with the image size normalized to M×N, where M and N denote the width and height of the image; an optical flow image sequence is computed from the grayscale image sequence of the test data set, yielding a grayscale image data set and an optical flow image data set for the test data set; then the following operations are performed:
step (4): the grayscale images of the grayscale image data set in the test data set are divided into non-overlapping spatial blocks in a grid fashion; each frame image of size M×N is divided into spatial blocks g_ei of size p×p; blocks at the same spatial position in t consecutive frames are combined into a spatio-temporal sample g_e = [g_e1, g_e2, …, g_et] of size t×p×p, yielding a grayscale-image spatio-temporal test sample set G_e of the test data set;
step (5): G_e is input into model A1; after processing by each network layer, the reconstructed output is obtained, the output data are subtracted from the input data, and the two-norm of the difference is computed as the reconstruction error E1 between input and output; sample data with reconstruction error E1 less than or equal to α are detected as normal regions, and sample data with E1 greater than α are detected as suspicious regions;
step (6): the optical flow images of the suspicious regions are extracted and divided into non-overlapping spatial blocks in a grid fashion; each frame image of size M×N is divided into spatial blocks o_ei of size p×p; blocks at the same spatial position in t consecutive frames are combined into a spatio-temporal sample o_e = [o_e1, o_e2, …, o_et] of size t×p×p, yielding a spatio-temporal test sample set O_e of the suspicious regions' optical flow images;
step (7): O_e is input into model A2 to obtain the reconstructed output, the output data are subtracted from the input data, and the two-norm of the difference is computed as the reconstruction error E2 between input and output; sample data with reconstruction error E2 less than or equal to β are detected as normal regions, and sample data with E2 greater than β are detected as abnormal regions;
therefore, the detection result of the test data set is obtained, and the anomaly detection of the whole system is completed.
Priority Applications (1)

Application Number: CN201811026243.7A
Priority / Filing Date: 2018-09-04
Title: Video abnormal behavior detection method based on deep learning

Publications (2)

CN109359519A — published 2019-02-19
CN109359519B — granted 2021-12-07



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20211207