CN112926388A - Campus violent behavior video detection method based on action recognition - Google Patents

Campus violent behavior video detection method based on action recognition

Info

Publication number
CN112926388A
Authority
CN
China
Prior art keywords
data
module
multiplied
space
campus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110094939.9A
Other languages
Chinese (zh)
Inventor
吴洺
余天
姜飞
卢宏涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Research Institute Of Shanghai Jiaotong University
Original Assignee
Chongqing Research Institute Of Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Research Institute Of Shanghai Jiaotong University filed Critical Chongqing Research Institute Of Shanghai Jiaotong University
Priority to CN202110094939.9A priority Critical patent/CN112926388A/en
Publication of CN112926388A publication Critical patent/CN112926388A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a campus violent behavior video detection method based on action recognition. Built on the YOWO framework, the method decouples the backbone network in space and time so that spatial and temporal features are extracted in separate steps, and improves the data filling method and the loss calculation method, thereby recognizing and locating violent behavior in video. Compared with the prior art, the method keeps the speed required for real-time operation while remaining as accurate as possible.

Description

Campus violent behavior video detection method based on action recognition
Technical Field
The invention relates to an image recognition technology, in particular to a campus violent behavior video detection method based on action recognition.
Background
The safety of students on campus has long been a focus of public concern, and dangerous campus behaviors such as fighting and falling create safety hazards and threaten students at school. Introducing a real-time monitoring system for dangerous campus behavior makes it possible to detect such behaviors promptly and issue early warnings, avoiding serious consequences, and therefore has substantial practical significance and application value. In addition, action recognition on data such as video, which depends on both temporal and spatial information, is one of the research focuses in computer vision; it comprises the two tasks of classification and localization, helps computers recognize and understand various human behaviors, and has high scientific research value.
However, the data sets mainly used for action recognition are public data sets, and data collected from real scenes is scarce. Public data sets are generally cropped and scaled, so they differ greatly from real-world data, and a model trained on a public data set generally cannot be applied directly to a real scene. In addition, although existing algorithms can achieve detection with relatively high accuracy, they suffer from heavy computation, long running time, large numbers of parameters, and high memory consumption, which limits their use in real applications.
Existing action recognition methods can be divided into three main categories: two-stream methods, convolution-based methods, and pose/skeleton-based methods.
Two-stream methods rely on optical flow information extracted from the images to characterize motion trajectories. Optical flow is a pixel-level motion vector field and is expensive to compute, so the overall model is slow and cannot meet real-time requirements. In addition, optical flow usually has to be computed separately, which prevents an end-to-end system and makes these methods poorly suited to real-time use.
Convolution-based methods use convolution, in particular 3D convolution, to capture temporal and spatial features simultaneously for end-to-end learning and prediction. However, 3D convolution contains a large number of parameters, and when the network is deep its resource overhead is huge, which is unfavorable for wide deployment in production environments.
Pose/skeleton-based methods first apply pose estimation to obtain a model of human body joints and then perform subsequent processing to obtain the final prediction. Prediction and analysis therefore cannot be carried out end to end, and the final result of the action recognition module depends on the pose estimation result, which easily accumulates errors and degrades the final accuracy.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provide a campus violent behavior video detection method based on action recognition.
The purpose of the invention can be realized by the following technical scheme:
According to one aspect of the invention, a campus violent behavior video detection method based on action recognition is provided. Based on the YOWO framework, the method decouples the backbone network in space and time to extract spatio-temporal features step by step and improves the data filling method and the loss calculation method, thereby recognizing and locating violent behavior in video.
Preferably, the violent behavior comprises fighting.
As a preferred technical solution, the YOWO framework includes:
the data input module is used for directly acquiring data from the actual scene of the school and transmitting the data into the system;
a ResNext-101 module, which is an improved 3D backbone network and is used for extracting the decoupled spatio-temporal features of the video;
a YOLOv2 module for extracting spatial features of the segment key frames;
the channel fusion attention mechanism module, used for fusing the outputs of the ResNext-101 module and the YOLOv2 module and outputting the fused features;
and the identification and positioning module, used for predicting whether fighting behavior exists and where it occurs through a bounding box regression method.
As a preferred technical solution, the data input module fills the video clips using an improved data filling method.
As a preferred technical solution, the improved data filling method is implemented by using an adaptive average pooling layer, and the specific process is as follows:
101) converting input data of dimensions D × C × H × W into C × D × H × W, where D is the number of frames, C the number of channels, H the height, and W the width;
102) using the adaptive average pooling layer to expand the number of frames D to 16.
As a preferred solution, the filled frames are generated by the adaptive average pooling layer from the existing partial prior frames and inserted among them in order.
As a preferable technical solution, the ResNext-101 module uses an R(2+1)D-based method to decouple the spatio-temporal features and extract them separately.
As a preferred technical solution, the specific process of decoupling and separately extracting the spatio-temporal features is as follows:
201) modifying the convolution kernel of the 3D convolution branch from size 3 × 3 × 3 into two convolution kernels of size 1 × 3 × 3 and 3 × 1 × 1 respectively, where the three dimensions correspond to duration D, height H and width W, with a ReLU layer added in between to provide nonlinearity;
202) the model thus changes from extracting spatio-temporal features jointly to extracting spatial features first and temporal features second, achieving decoupling of the temporal and spatial dimensions.
As a preferred technical solution, the identifying and positioning module includes:
the classification unit is used for determining whether fighting behavior exists, using an MSE loss function;
and a positioning unit for marking the fighting behavior in the image by using a smooth L1 loss function.
As a preferred technical solution, the loss function is a weight-based loss function that gives larger weights to samples with larger loss, thereby strengthening the learning of difficult samples. The specific process is as follows:
for each batch of training data, the weight coefficient w of each sample is calculated according to the confidence coefficient loss Lc, and then the weight coefficient w is applied to the positioning loss of the corresponding sample.
Compared with the prior art, the invention has the following advantages:
1) the campus fighting behavior can be analyzed in real time, keeping the speed required for real-time operation while remaining as accurate as possible;
2) filling the data with adaptive average pooling improves prediction accuracy for key frames that lack prior frames;
3) the temporal and spatial characteristics of fighting behavior are extracted separately with a separable convolution method, improving the recognition rate of fighting behavior;
4) a weighted loss function strengthens the learning of difficult samples, further improving the accuracy of fighting-behavior detection.
Drawings
FIG. 1 is a system framework diagram of the present invention;
FIG. 2 is a schematic diagram of the original YOWO filling method;
FIG. 3 is a schematic diagram of the improved filling method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
The invention relates to a real-time campus fighting behavior detection algorithm based on action recognition. Built on the YOWO framework, the method decouples the backbone network in space and time so that spatio-temporal features are extracted step by step, and improves the data filling method and the loss calculation method, thereby recognizing and locating fighting behavior in video while keeping the speed required for real-time operation and remaining as accurate as possible.
As shown in fig. 1, the main body of the invention is built on the YOWO framework. The system collects data directly from the real school scene and feeds it in for analysis. ResNext-101 is an algorithmically improved 3D backbone network used to extract the decoupled spatio-temporal features of the video, while YOLOv2 extracts the spatial features of the segment key frame (the last frame). The two outputs are fused and sent to the Channel Fusion Attention Mechanism (CFAM) module, and finally whether fighting behavior exists and where it occurs are predicted by bounding box regression.
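This two-branch layout can be sketched in PyTorch as follows; the placeholder backbones, the identity stand-in for the CFAM module, the anchor count, and all layer sizes are illustrative assumptions rather than the configuration described in this patent.

```python
import torch
import torch.nn as nn

class YOWOStyleDetector(nn.Module):
    """Skeleton of the two-branch pipeline: a 3D branch over the whole clip,
    a 2D branch over the key frame (last frame), channel fusion, and a head
    that regresses bounding boxes plus a confidence score per anchor."""
    def __init__(self, backbone3d, backbone2d, fusion, fused_channels, num_anchors=5):
        super().__init__()
        self.backbone3d = backbone3d   # stands in for the modified ResNext-101
        self.backbone2d = backbone2d   # stands in for the YOLOv2 feature extractor
        self.fusion = fusion           # stands in for the CFAM module
        # per anchor: 4 box offsets + 1 "fighting" confidence
        self.head = nn.Conv2d(fused_channels, num_anchors * 5, kernel_size=1)

    def forward(self, clip):
        # clip: (N, C, D, H, W); the key frame is the last frame of the clip
        feat3d = self.backbone3d(clip)            # (N, C3, 1, H', W')
        feat2d = self.backbone2d(clip[:, :, -1])  # (N, C2, H', W')
        fused = torch.cat([feat3d.squeeze(2), feat2d], dim=1)
        return self.head(self.fusion(fused))      # (N, A*5, H', W')

# toy stand-in branches, only so the skeleton runs end to end
b3d = nn.Sequential(nn.Conv3d(3, 32, 3, padding=1), nn.AdaptiveAvgPool3d((1, 7, 7)))
b2d = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.AdaptiveAvgPool2d(7))
model = YOWOStyleDetector(b3d, b2d, nn.Identity(), fused_channels=64)
out = model(torch.randn(1, 3, 16, 112, 112))      # -> (1, 25, 7, 7)
```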
The video clip filling method specifically comprises the following steps:
the input to the system is a slice of successive 75 frames of RGB image data acquired at 3 second intervals. For each frame (i.e., key frame) it is necessary to use the 15 frames of data preceding it to help extract spatio-temporal features for prediction of the key frame. When the key frame sequence number is the first 15 frames, the information of partial prior frames is lost. Yoko uses a cyclic mode to take frames from the rear of a segment for supplement, however, in doing so, subsequent frame information is introduced, and the order of video information is disturbed, so that the prediction precision is influenced. Referring to fig. 2, taking the input total length of 8 frames and the key frame as the 4 th frame as an example, the frame-fighting action does not occur yet, but because 72-75 frames of data are used and the frames contain the frame-fighting action, the fourth frame is predicted as the frame-fighting and a bounding box is given.
To address this, the invention proposes a simple and efficient way to fill in the missing frames automatically, implemented with an adaptive average pooling layer. First, the input data of dimensions D × C × H × W is converted to C × D × H × W, where D is the number of frames (D < 16), C the number of channels, H the height, and W the width. The adaptive average pooling layer is then used to expand the number of frames D to 16. The filled frames are generated from the existing partial prior frames and inserted among them in order, so the inherent temporal information is not disturbed. Referring to fig. 3, the invention applies adaptive average pooling to the data of frames 1-4 and extends it to 8 frames; no fighting occurs in these frames, so the final prediction is also correct.
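A minimal PyTorch sketch of this filling step, assuming the clip is stored as a tensor of shape D × C × H × W; the function name and the fallback for clips that are already long enough are assumptions, while the permutation and the expansion to 16 frames follow the two steps described above.

```python
import torch
import torch.nn.functional as F

def fill_clip(clip: torch.Tensor, target_frames: int = 16) -> torch.Tensor:
    """Expand a short clip of shape (D, C, H, W) with D < target_frames to
    (target_frames, C, H, W) by adaptive average pooling along the frame axis,
    so that only the existing prior frames are used and their order is kept."""
    d, c, h, w = clip.shape
    if d >= target_frames:
        return clip[-target_frames:]              # assumed fallback: keep most recent frames
    # step 101): (D, C, H, W) -> (1, C, D, H, W) so pooling acts on the frame axis
    x = clip.permute(1, 0, 2, 3).unsqueeze(0)
    # step 102): expand the frame dimension to target_frames, spatial size unchanged
    x = F.adaptive_avg_pool3d(x, output_size=(target_frames, h, w))
    return x.squeeze(0).permute(1, 0, 2, 3)

# usage: a clip of only 4 frames (key frame plus its available prior frames)
clip = torch.randn(4, 3, 224, 224)                # D=4 RGB frames
filled = fill_clip(clip, 16)                      # -> (16, 3, 224, 224)
```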
The decoupled space-time feature extraction of the invention specifically comprises the following steps:
motion recognition relies on features of both timing and spatial information. Unlike the dual stream method or some methods based on LSTM, the 3D convolution extracts features of both parts simultaneously, which results in coupling of timing information and spatial information. However, the two parts of features are slightly different in practice, and the coupled method of 3D convolution cannot better obtain robust space-time features, which is not favorable for subsequent processing. In addition, the act of fighting is relatively complex, and unlike simple actions, a more robust model is required to achieve better results.
Therefore, the invention decouples the spatio-temporal features and extracts them separately using an R(2+1)D-based method. Specifically, the convolution kernel of the 3D convolution branch is modified from size 3 × 3 × 3 into two convolution kernels of size 1 × 3 × 3 and 3 × 1 × 1, where the three dimensions correspond to duration D, height H and width W, with a ReLU layer added in between to provide nonlinearity. The model thus changes from extracting spatio-temporal features jointly to extracting spatial features first and temporal features second, achieving decoupling of time and space, so the model trains better and obtains more suitable features for predicting and locating fighting behavior.
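A PyTorch sketch of the decomposed convolution described above; the channel widths and the choice of intermediate channels are illustrative assumptions, and only the 1 × 3 × 3 spatial / 3 × 1 × 1 temporal split with a ReLU in between follows the description.

```python
import torch
import torch.nn as nn

class SpatioTemporalConv(nn.Module):
    """Replaces a 3x3x3 3D convolution with a 1x3x3 spatial convolution,
    a ReLU, and a 3x1x1 temporal convolution, so spatial and temporal
    features are extracted in two decoupled steps."""
    def __init__(self, in_channels: int, out_channels: int, mid_channels: int = None):
        super().__init__()
        mid_channels = mid_channels or out_channels   # assumed intermediate width
        self.spatial = nn.Conv3d(in_channels, mid_channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid_channels, out_channels,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, D, H, W) -- spatial features first, then temporal features
        return self.temporal(self.relu(self.spatial(x)))

# usage: a 16-frame RGB clip
block = SpatioTemporalConv(3, 64)
out = block(torch.randn(1, 3, 16, 112, 112))       # -> (1, 64, 16, 112, 112)
```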
The handling of difficult samples in the invention is specifically as follows:
the invention has two main tasks for the action of fighting: classification and localization. The classification is used for searching whether a fighting behavior exists or not, and is a two-classification problem; the positioning is to mark the fighting behavior in the image, and the invention adopts a bounding box regression method to position. For the classification task, the invention uses the MSE penalty, while the localization task uses the smoothed L1 penalty. Yoko uses Focal local for the inter-class imbalance problem to help improve the accuracy of the classification task. However, for the shelving behavior of the invention, the invention observes that the loss of confidence is relatively greater during training, and thus the final result is also less than satisfactory, because the shelving behavior also has a larger intra-class imbalance problem. Such as common fighting activities including punching and kicking, there are also different types of fighting including tying, pulling, etc. The probability of occurrence of the part is relatively low, so that the accuracy of prediction of the part is not high, and the probability of prediction of common fighting behaviors can be inhibited.
For this reason, the invention proposes a weight-based loss function that gives larger weights to samples with larger loss, thereby strengthening the learning of these difficult samples. Specifically, for each batch of training data the weight coefficient w of each sample is calculated from its confidence loss Lc, and w is then applied to the localization loss of the corresponding sample. The localization loss thereby becomes a weighted average over the batch samples, which changes its scale relative to the classification loss, so it is multiplied by the batch size to restore the original ratio. See formulas (1) and (2) for details; note that Lc and w are vectors while Lw and Ll are scalars.
ω=Softmax(Lc) (1)
(2) [formula for the weighted localization loss Lw, rendered only as an image in the original publication]
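A hedged PyTorch sketch of this loss weighting; since equation (2) appears only as an image above, the batch-size rescaling and the detached weights below are assumptions reconstructed from the surrounding text, not the exact patented formula.

```python
import torch
import torch.nn.functional as F

def detection_loss(pred_conf, target_conf, pred_boxes, target_boxes):
    """Per-batch loss sketch: MSE for the fighting confidence, smooth L1 for the
    bounding boxes, with harder samples (larger confidence loss Lc) given larger
    weights w via a softmax, as in equation (1)."""
    n = pred_conf.shape[0]
    # per-sample confidence (classification) loss Lc
    lc = F.mse_loss(pred_conf, target_conf, reduction="none").view(n, -1).mean(dim=1)
    # per-sample localization loss (smooth L1 over the box coordinates)
    ll = F.smooth_l1_loss(pred_boxes, target_boxes, reduction="none").view(n, -1).mean(dim=1)
    w = torch.softmax(lc.detach(), dim=0)          # equation (1): harder samples weigh more
    loc_loss = n * torch.sum(w * ll)               # assumed form of equation (2)
    return lc.mean() + loc_loss

# usage with dummy predictions for a batch of 4 samples
conf_p, conf_t = torch.rand(4, 1), torch.randint(0, 2, (4, 1)).float()
box_p, box_t = torch.rand(4, 4), torch.rand(4, 4)
loss = detection_loss(conf_p, conf_t, box_p, box_t)
```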
Two data sets were used to evaluate the method, with YOWO as the reference baseline. Referring to Table 1, on the public data set UCF101-24, with 16-frame inputs for feature extraction, the method achieves a 0.7% improvement while the inference speed remains almost unchanged, which is sufficient for the real-time requirement. In addition, a large amount of real-scene data was collected from primary and secondary schools of Min' in Shanghai to build a real data set of campus fighting behavior; on this real data set YOWO achieves 58.7%, and the improvements of the method raise the accuracy by 2.6%. Ablation experiments were also carried out to verify the effectiveness of each improvement separately.
TABLE 1 [rendered as an image in the original publication]
Considering that frames lacking some of the prior frames account for only a small proportion of the data, the filling method was also evaluated separately by extracting and comparing the prediction results for the first 16 frames of each video segment in the data sets; as shown in Table 2, the results of the method are clearly improved on both data sets.
TABLE 2 [rendered as an image in the original publication]
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A campus violent behavior video detection method based on action recognition, characterized in that the method is based on the YOWO framework, performs space-time decoupling of the backbone network to extract spatio-temporal features step by step, and improves the data filling method and the loss calculation method, thereby recognizing and locating violent behavior in video.
2. The method of claim 1, wherein the violent behavior comprises fighting.
3. The method of claim 1, wherein the YOWO framework comprises:
the data input module is used for directly acquiring data from the actual scene of the school and transmitting the data into the system;
a ResNext-101 module, which is an improved 3D backbone network and is used for extracting the decoupled spatio-temporal features of the video;
a YOLOv2 module for extracting spatial features of the segment key frames;
the channel fusion attention mechanism module is used for fusing the outputs of the ResNext-101 module and the YOLOv2 module and outputting the fused features;
and the identification and positioning module is used for predicting whether fighting behavior exists and where it occurs through a bounding box regression method.
4. The method as claimed in claim 3, wherein the data input module fills the video segments by using an improved data filling method.
5. The method as claimed in claim 4, wherein the improved data filling method is implemented by using an adaptive average pooling layer, and comprises the following specific steps:
101) converting input data of dimensions D × C × H × W into C × D × H × W, where D is the number of frames, C the number of channels, H the height, and W the width;
102) using the adaptive average pooling layer to expand the number of frames D to 16.
6. The method of claim 5, wherein the filled frames are generated by the adaptive average pooling layer from the existing partial prior frames and inserted among them in order.
7. The campus violent behavior video detection method based on action recognition as claimed in claim 5, wherein the ResNext-101 module uses an R(2+1)D-based method to decouple the spatio-temporal features and extract them separately.
8. The campus violent behavior video detection method based on action recognition as claimed in claim 7, wherein the specific process of decoupling and separately extracting the spatio-temporal features is as follows:
201) modifying the convolution kernel of the 3D convolution branch from size 3 × 3 × 3 into two convolution kernels of size 1 × 3 × 3 and 3 × 1 × 1 respectively, where the three dimensions correspond to duration D, height H and width W, with a ReLU layer added in between to provide nonlinearity;
202) the model thus changes from extracting spatio-temporal features jointly to extracting spatial features first and temporal features second, achieving decoupling of the temporal and spatial dimensions.
9. The method of claim 3, wherein the identification and location module comprises:
the classification unit is used for determining whether fighting behavior exists, using an MSE loss function;
and a positioning unit for marking the fighting behavior in the image by using a smooth L1 loss function.
10. The method as claimed in claim 9, wherein the loss function is a weight-based loss function that gives larger weights to data with larger loss, thereby strengthening the learning of difficult samples, the specific process being as follows:
for each batch of training data, the weight coefficient w of each sample is calculated from its confidence loss Lc, and w is then applied to the positioning loss of the corresponding sample.
CN202110094939.9A 2021-01-25 2021-01-25 Campus violent behavior video detection method based on action recognition Pending CN112926388A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110094939.9A CN112926388A (en) 2021-01-25 2021-01-25 Campus violent behavior video detection method based on action recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110094939.9A CN112926388A (en) 2021-01-25 2021-01-25 Campus violent behavior video detection method based on action recognition

Publications (1)

Publication Number Publication Date
CN112926388A true CN112926388A (en) 2021-06-08

Family

ID=76166055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110094939.9A Pending CN112926388A (en) 2021-01-25 2021-01-25 Campus violent behavior video detection method based on action recognition

Country Status (1)

Country Link
CN (1) CN112926388A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023025051A1 (en) * 2021-08-23 2023-03-02 港大科桥有限公司 Video action detection method based on end-to-end framework, and electronic device


Similar Documents

Publication Publication Date Title
CN109886090B (en) Video pedestrian re-identification method based on multi-time scale convolutional neural network
CN108846365B (en) Detection method and device for fighting behavior in video, storage medium and processor
CN108230291B (en) Object recognition system training method, object recognition method, device and electronic equipment
CN109389086B (en) Method and system for detecting unmanned aerial vehicle image target
US20160078287A1 (en) Method and system of temporal segmentation for gesture analysis
CN110070029B (en) Gait recognition method and device
CN112991656A (en) Human body abnormal behavior recognition alarm system and method under panoramic monitoring based on attitude estimation
CN114582030B (en) Behavior recognition method based on service robot
CN110796051A (en) Real-time access behavior detection method and system based on container scene
CN110390308B (en) Video behavior identification method based on space-time confrontation generation network
CN109583334B (en) Action recognition method and system based on space-time correlation neural network
CN110688927A (en) Video action detection method based on time sequence convolution modeling
CN112801019B (en) Method and system for eliminating re-identification deviation of unsupervised vehicle based on synthetic data
CN111723687A (en) Human body action recognition method and device based on neural network
CN114241379A (en) Passenger abnormal behavior identification method, device and equipment and passenger monitoring system
Wu et al. Pose-Guided Inflated 3D ConvNet for action recognition in videos
Liu et al. ACDnet: An action detection network for real-time edge computing based on flow-guided feature approximation and memory aggregation
CN113191216A (en) Multi-person real-time action recognition method and system based on gesture recognition and C3D network
Sunney et al. A real-time machine learning framework for smart home-based Yoga Teaching System
CN112926388A (en) Campus violent behavior video detection method based on action recognition
CN112381774A (en) Cow body condition scoring method and system based on multi-angle depth information fusion
CN111898458A (en) Violent video identification method based on attention mechanism for bimodal task learning
CN115331152B (en) Fire fighting identification method and system
CN115798055A (en) Violent behavior detection method based on corersort tracking algorithm
CN115100559A (en) Motion prediction method and system based on lattice point optical flow

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination