CN112926388A - Campus violent behavior video detection method based on action recognition - Google Patents

Campus violent behavior video detection method based on action recognition

Info

Publication number
CN112926388A
Authority
CN
China
Prior art keywords
data
module
multiplied
space
campus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110094939.9A
Other languages
Chinese (zh)
Inventor
吴洺
余天
姜飞
卢宏涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Research Institute Of Shanghai Jiaotong University
Original Assignee
Chongqing Research Institute Of Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Research Institute Of Shanghai Jiaotong University filed Critical Chongqing Research Institute Of Shanghai Jiaotong University
Priority to CN202110094939.9A priority Critical patent/CN112926388A/en
Publication of CN112926388A publication Critical patent/CN112926388A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a campus violent behavior video detection method based on action recognition. Built on the YOWO framework, the method decouples the backbone network in space and time so that spatial and temporal features are extracted in separate steps, and improves the data filling method and the loss calculation method, thereby recognizing and locating violent behavior in video. Compared with the prior art, the method keeps the speed required for real-time operation while remaining as accurate as possible.

Description

Campus violent behavior video detection method based on action recognition
Technical Field
The invention relates to an image recognition technology, in particular to a campus violent behavior video detection method based on action recognition.
Background
The safety of students on campus has long been a focus of public concern, and dangerous campus behaviors such as fighting and falling create safety hazards and threaten students at school. Introducing a real-time monitoring system for dangerous campus behavior makes it possible to detect such behaviors promptly and issue early warnings, avoiding serious consequences, and therefore has substantial practical significance and application value. In addition, action recognition on data such as video, which depends on both temporal and spatial information, is one of the research focuses in computer vision; it comprises the two tasks of classification and localization, helps computers recognize and understand various human behaviors, and has high scientific research value.
However, the data sets mainly used for action recognition are public data sets, and data collected from real scenes is scarce. Public data sets are generally cropped and scaled, so they differ greatly from real-world data, and a model trained on a public data set generally cannot be applied directly to a real scene. In addition, although existing algorithms can achieve detection with relatively high accuracy, they suffer from heavy computation, long running time, large numbers of parameters, and high memory consumption, which limits their use in real applications.
Existing action recognition methods can be divided into three main categories: two-stream methods, convolution-based methods, and pose/skeleton-based methods.
Two-stream methods rely on optical flow information extracted from the images to characterize motion trajectories. Optical flow is a pixel-level motion vector field and is expensive to compute, so the overall model is slow and cannot meet real-time requirements. In addition, optical flow usually has to be computed separately, which prevents an end-to-end system and makes these methods poorly suited to real-time use.
Convolution-based methods use convolution, in particular 3D convolution, to capture temporal and spatial features simultaneously for end-to-end learning and prediction. However, 3D convolution contains a large number of parameters, and when the network is deep its resource overhead is huge, which is unfavorable for wide deployment in production environments.
Pose/skeleton-based methods first apply pose estimation to obtain a model of human body joints and then perform subsequent processing to obtain the final prediction. Prediction and analysis therefore cannot be carried out end to end, and the final result of the action recognition module depends on the pose estimation result, which easily accumulates errors and degrades the final accuracy.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provide a campus violent behavior video detection method based on action recognition.
The purpose of the invention can be realized by the following technical scheme:
According to one aspect of the invention, a campus violent behavior video detection method based on action recognition is provided. Based on the YOWO framework, the method decouples the backbone network in space and time to extract spatio-temporal features step by step and improves the data filling method and the loss calculation method, thereby recognizing and locating violent behavior in video.
Preferably, the violent behavior comprises fighting.
As a preferred technical solution, the YOWO framework includes:
the data input module is used for directly acquiring data from the actual scene of the school and transmitting the data into the system;
a ResNext-101 module, which is an improved 3D backbone network and is used for extracting the decoupled spatio-temporal features of the video;
a YOLOv2 module for extracting spatial features of the segment key frames;
the channel fusion attention mechanism module, used for fusing the outputs of the ResNext-101 module and the YOLOv2 module and outputting the fused features;
and the identification and positioning module, used for predicting whether fighting behavior exists and where it occurs through a bounding box regression method.
As a preferred technical solution, the data input module fills the video clips using an improved data filling method.
As a preferred technical solution, the improved data filling method is implemented by using an adaptive average pooling layer, and the specific process is as follows:
101) converting input data of dimensions D × C × H × W into C × D × H × W, where D is the number of frames, C the number of channels, H the height, and W the width;
102) using the adaptive average pooling layer to expand the number of frames D to 16.
As a preferred solution, the filled frames are generated by the adaptive average pooling layer from the existing partial prior frames and inserted among them in order.
As a preferable technical solution, the ResNext-101 module uses an R(2+1)D-based method to decouple the spatio-temporal features and extract them separately.
As a preferred technical solution, the specific process of decoupling and separately extracting the spatio-temporal features is as follows:
201) modifying the convolution kernel of the 3D convolution branch from size 3 × 3 × 3 into two convolution kernels of size 1 × 3 × 3 and 3 × 1 × 1 respectively, where the three dimensions correspond to duration D, height H and width W, with a ReLU layer added in between to provide nonlinearity;
202) the model thus changes from extracting spatio-temporal features jointly to extracting spatial features first and temporal features second, achieving decoupling of the temporal and spatial dimensions.
As a preferred technical solution, the identifying and positioning module includes:
the classification unit is used for determining whether fighting behavior exists, using an MSE loss function;
and a positioning unit for marking the fighting behavior in the image by using a smooth L1 loss function.
As a preferred technical solution, the loss function is a weight-based loss function that gives larger weights to samples with larger loss, thereby strengthening the learning of difficult samples. The specific process is as follows:
for each batch of training data, the weight coefficient w of each sample is calculated according to the confidence coefficient loss Lc, and then the weight coefficient w is applied to the positioning loss of the corresponding sample.
Compared with the prior art, the invention has the following advantages:
1) the campus fighting behavior can be analyzed in real time, keeping the speed required for real-time operation while remaining as accurate as possible;
2) filling the data with adaptive average pooling improves prediction accuracy for key frames that lack prior frames;
3) the temporal and spatial characteristics of fighting behavior are extracted separately with a separable convolution method, improving the recognition rate of fighting behavior;
4) a weighted loss function strengthens the learning of difficult samples, further improving the accuracy of fighting-behavior detection.
Drawings
FIG. 1 is a system framework diagram of the present invention;
FIG. 2 is a schematic diagram of the original YOWO filling method;
FIG. 3 is a schematic diagram of the improved filling method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
The invention relates to a real-time campus fighting behavior detection algorithm based on action recognition. Built on the YOWO framework, the method decouples the backbone network in space and time so that spatio-temporal features are extracted step by step, and improves the data filling method and the loss calculation method, thereby recognizing and locating fighting behavior in video while keeping the speed required for real-time operation and remaining as accurate as possible.
As shown in fig. 1, the main body of the invention is built on the YOWO framework. The system collects data directly from the real school scene and feeds it in for analysis. ResNext-101 is an algorithmically improved 3D backbone network used to extract the decoupled spatio-temporal features of the video, while YOLOv2 extracts the spatial features of the segment key frame (the last frame). The two outputs are fused and sent to the Channel Fusion Attention Mechanism (CFAM) module, and finally whether fighting behavior exists and where it occurs are predicted by bounding box regression.
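This two-branch layout can be sketched in PyTorch as follows; the placeholder backbones, the identity stand-in for the CFAM module, the anchor count, and all layer sizes are illustrative assumptions rather than the configuration described in this patent.

```python
import torch
import torch.nn as nn

class YOWOStyleDetector(nn.Module):
    """Skeleton of the two-branch pipeline: a 3D branch over the whole clip,
    a 2D branch over the key frame (last frame), channel fusion, and a head
    that regresses bounding boxes plus a confidence score per anchor."""
    def __init__(self, backbone3d, backbone2d, fusion, fused_channels, num_anchors=5):
        super().__init__()
        self.backbone3d = backbone3d   # stands in for the modified ResNext-101
        self.backbone2d = backbone2d   # stands in for the YOLOv2 feature extractor
        self.fusion = fusion           # stands in for the CFAM module
        # per anchor: 4 box offsets + 1 "fighting" confidence
        self.head = nn.Conv2d(fused_channels, num_anchors * 5, kernel_size=1)

    def forward(self, clip):
        # clip: (N, C, D, H, W); the key frame is the last frame of the clip
        feat3d = self.backbone3d(clip)            # (N, C3, 1, H', W')
        feat2d = self.backbone2d(clip[:, :, -1])  # (N, C2, H', W')
        fused = torch.cat([feat3d.squeeze(2), feat2d], dim=1)
        return self.head(self.fusion(fused))      # (N, A*5, H', W')

# toy stand-in branches, only so the skeleton runs end to end
b3d = nn.Sequential(nn.Conv3d(3, 32, 3, padding=1), nn.AdaptiveAvgPool3d((1, 7, 7)))
b2d = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.AdaptiveAvgPool2d(7))
model = YOWOStyleDetector(b3d, b2d, nn.Identity(), fused_channels=64)
out = model(torch.randn(1, 3, 16, 112, 112))      # -> (1, 25, 7, 7)
```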
The video clip filling method specifically comprises the following steps:
the input to the system is a slice of successive 75 frames of RGB image data acquired at 3 second intervals. For each frame (i.e., key frame) it is necessary to use the 15 frames of data preceding it to help extract spatio-temporal features for prediction of the key frame. When the key frame sequence number is the first 15 frames, the information of partial prior frames is lost. Yoko uses a cyclic mode to take frames from the rear of a segment for supplement, however, in doing so, subsequent frame information is introduced, and the order of video information is disturbed, so that the prediction precision is influenced. Referring to fig. 2, taking the input total length of 8 frames and the key frame as the 4 th frame as an example, the frame-fighting action does not occur yet, but because 72-75 frames of data are used and the frames contain the frame-fighting action, the fourth frame is predicted as the frame-fighting and a bounding box is given.
To address this, the invention proposes a simple and efficient way to fill in the missing frames automatically, implemented with an adaptive average pooling layer. First, the input data of dimensions D × C × H × W is converted to C × D × H × W, where D is the number of frames (D < 16), C the number of channels, H the height, and W the width. The adaptive average pooling layer is then used to expand the number of frames D to 16. The filled frames are generated from the existing partial prior frames and inserted among them in order, so the inherent temporal information is not disturbed. Referring to fig. 3, the invention applies adaptive average pooling to the data of frames 1-4 and extends it to 8 frames; no fighting occurs in these frames, so the final prediction is also correct.
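A minimal PyTorch sketch of this filling step, assuming the clip is stored as a tensor of shape D × C × H × W; the function name and the fallback for clips that are already long enough are assumptions, while the permutation and the expansion to 16 frames follow the two steps described above.

```python
import torch
import torch.nn.functional as F

def fill_clip(clip: torch.Tensor, target_frames: int = 16) -> torch.Tensor:
    """Expand a short clip of shape (D, C, H, W) with D < target_frames to
    (target_frames, C, H, W) by adaptive average pooling along the frame axis,
    so that only the existing prior frames are used and their order is kept."""
    d, c, h, w = clip.shape
    if d >= target_frames:
        return clip[-target_frames:]              # assumed fallback: keep most recent frames
    # step 101): (D, C, H, W) -> (1, C, D, H, W) so pooling acts on the frame axis
    x = clip.permute(1, 0, 2, 3).unsqueeze(0)
    # step 102): expand the frame dimension to target_frames, spatial size unchanged
    x = F.adaptive_avg_pool3d(x, output_size=(target_frames, h, w))
    return x.squeeze(0).permute(1, 0, 2, 3)

# usage: a clip of only 4 frames (key frame plus its available prior frames)
clip = torch.randn(4, 3, 224, 224)                # D=4 RGB frames
filled = fill_clip(clip, 16)                      # -> (16, 3, 224, 224)
```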
The decoupled space-time feature extraction of the invention specifically comprises the following steps:
motion recognition relies on features of both timing and spatial information. Unlike the dual stream method or some methods based on LSTM, the 3D convolution extracts features of both parts simultaneously, which results in coupling of timing information and spatial information. However, the two parts of features are slightly different in practice, and the coupled method of 3D convolution cannot better obtain robust space-time features, which is not favorable for subsequent processing. In addition, the act of fighting is relatively complex, and unlike simple actions, a more robust model is required to achieve better results.
Therefore, the invention decouples the spatio-temporal features and extracts them separately using an R(2+1)D-based method. Specifically, the convolution kernel of the 3D convolution branch is modified from size 3 × 3 × 3 into two convolution kernels of size 1 × 3 × 3 and 3 × 1 × 1, where the three dimensions correspond to duration D, height H and width W, with a ReLU layer added in between to provide nonlinearity. The model thus changes from extracting spatio-temporal features jointly to extracting spatial features first and temporal features second, achieving decoupling of time and space, so the model trains better and obtains more suitable features for predicting and locating fighting behavior.
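A PyTorch sketch of the decomposed convolution described above; the channel widths and the choice of intermediate channels are illustrative assumptions, and only the 1 × 3 × 3 spatial / 3 × 1 × 1 temporal split with a ReLU in between follows the description.

```python
import torch
import torch.nn as nn

class SpatioTemporalConv(nn.Module):
    """Replaces a 3x3x3 3D convolution with a 1x3x3 spatial convolution,
    a ReLU, and a 3x1x1 temporal convolution, so spatial and temporal
    features are extracted in two decoupled steps."""
    def __init__(self, in_channels: int, out_channels: int, mid_channels: int = None):
        super().__init__()
        mid_channels = mid_channels or out_channels   # assumed intermediate width
        self.spatial = nn.Conv3d(in_channels, mid_channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid_channels, out_channels,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, D, H, W) -- spatial features first, then temporal features
        return self.temporal(self.relu(self.spatial(x)))

# usage: a 16-frame RGB clip
block = SpatioTemporalConv(3, 64)
out = block(torch.randn(1, 3, 16, 112, 112))       # -> (1, 64, 16, 112, 112)
```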
The handling of difficult samples in the invention is specifically as follows:
the invention has two main tasks for the action of fighting: classification and localization. The classification is used for searching whether a fighting behavior exists or not, and is a two-classification problem; the positioning is to mark the fighting behavior in the image, and the invention adopts a bounding box regression method to position. For the classification task, the invention uses the MSE penalty, while the localization task uses the smoothed L1 penalty. Yoko uses Focal local for the inter-class imbalance problem to help improve the accuracy of the classification task. However, for the shelving behavior of the invention, the invention observes that the loss of confidence is relatively greater during training, and thus the final result is also less than satisfactory, because the shelving behavior also has a larger intra-class imbalance problem. Such as common fighting activities including punching and kicking, there are also different types of fighting including tying, pulling, etc. The probability of occurrence of the part is relatively low, so that the accuracy of prediction of the part is not high, and the probability of prediction of common fighting behaviors can be inhibited.
For this reason, the invention proposes a weight-based loss function that gives larger weights to samples with larger loss, thereby strengthening the learning of these difficult samples. Specifically, for each batch of training data the weight coefficient w of each sample is calculated from its confidence loss Lc, and w is then applied to the localization loss of the corresponding sample. The localization loss thereby becomes a weighted average over the batch samples, which changes its scale relative to the classification loss, so it is multiplied by the batch size to restore the original ratio. See formulas (1) and (2) for details; note that Lc and w are vectors while Lw and Ll are scalars.
ω=Softmax(Lc) (1)
(2) [formula for the weighted localization loss Lw, rendered only as an image in the original publication]
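A hedged PyTorch sketch of this loss weighting; since equation (2) appears only as an image above, the batch-size rescaling and the detached weights below are assumptions reconstructed from the surrounding text, not the exact patented formula.

```python
import torch
import torch.nn.functional as F

def detection_loss(pred_conf, target_conf, pred_boxes, target_boxes):
    """Per-batch loss sketch: MSE for the fighting confidence, smooth L1 for the
    bounding boxes, with harder samples (larger confidence loss Lc) given larger
    weights w via a softmax, as in equation (1)."""
    n = pred_conf.shape[0]
    # per-sample confidence (classification) loss Lc
    lc = F.mse_loss(pred_conf, target_conf, reduction="none").view(n, -1).mean(dim=1)
    # per-sample localization loss (smooth L1 over the box coordinates)
    ll = F.smooth_l1_loss(pred_boxes, target_boxes, reduction="none").view(n, -1).mean(dim=1)
    w = torch.softmax(lc.detach(), dim=0)          # equation (1): harder samples weigh more
    loc_loss = n * torch.sum(w * ll)               # assumed form of equation (2)
    return lc.mean() + loc_loss

# usage with dummy predictions for a batch of 4 samples
conf_p, conf_t = torch.rand(4, 1), torch.randint(0, 2, (4, 1)).float()
box_p, box_t = torch.rand(4, 4), torch.rand(4, 4)
loss = detection_loss(conf_p, conf_t, box_p, box_t)
```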
Two data sets were used to evaluate the method, with YOWO as the reference baseline. Referring to Table 1, on the public data set UCF101-24, with 16-frame inputs for feature extraction, the method achieves a 0.7% improvement while the inference speed remains almost unchanged, which is sufficient for the real-time requirement. In addition, a large amount of real-scene data was collected from primary and secondary schools of Min' in Shanghai to build a real data set of campus fighting behavior; on this real data set YOWO achieves 58.7%, and the improvements of the method raise the accuracy by 2.6%. Ablation experiments were also carried out to verify the effectiveness of each improvement separately.
TABLE 1 [rendered as an image in the original publication]
Considering that frames lacking some of the prior frames account for only a small proportion of the data, the filling method was also evaluated separately by extracting and comparing the prediction results for the first 16 frames of each video segment in the data sets; as shown in Table 2, the results of the method are clearly improved on both data sets.
TABLE 2 [rendered as an image in the original publication]
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A campus violent behavior video detection method based on action recognition, characterized in that the method is based on the YOWO framework, performs space-time decoupling of the backbone network to extract spatio-temporal features step by step, and improves the data filling method and the loss calculation method, thereby recognizing and locating violent behavior in video.
2. The method of claim 1, wherein the violent behavior comprises fighting.
3. The method of claim 1, wherein the YOWO framework comprises:
the data input module is used for directly acquiring data from the actual scene of the school and transmitting the data into the system;
a ResNext-101 module, which is an improved 3D backbone network and is used for extracting the decoupled spatio-temporal features of the video;
a YOLOv2 module for extracting spatial features of the segment key frames;
the channel fusion attention mechanism module is used for fusing the outputs of the ResNext-101 module and the YOLOv2 module and outputting the fused features;
and the identification and positioning module is used for predicting whether fighting behavior exists and where it occurs through a bounding box regression method.
4. The method as claimed in claim 3, wherein the data input module fills the video segments by using an improved data filling method.
5. The method as claimed in claim 4, wherein the improved data filling method is implemented by using an adaptive average pooling layer, and comprises the following specific steps:
101) converting input data of dimensions D × C × H × W into C × D × H × W, where D is the number of frames, C the number of channels, H the height, and W the width;
102) using the adaptive average pooling layer to expand the number of frames D to 16.
6. The method of claim 5, wherein the filled frames are generated by the adaptive average pooling layer from the existing partial prior frames and inserted among them in order.
7. The campus violent behavior video detection method based on action recognition as claimed in claim 5, wherein the ResNext-101 module uses an R(2+1)D-based method to decouple the spatio-temporal features and extract them separately.
8. The campus violent behavior video detection method based on action recognition as claimed in claim 7, wherein the specific process of decoupling and separately extracting the spatio-temporal features is as follows:
201) modifying the convolution kernel of the 3D convolution branch from size 3 × 3 × 3 into two convolution kernels of size 1 × 3 × 3 and 3 × 1 × 1 respectively, where the three dimensions correspond to duration D, height H and width W, with a ReLU layer added in between to provide nonlinearity;
202) the model thus changes from extracting spatio-temporal features jointly to extracting spatial features first and temporal features second, achieving decoupling of the temporal and spatial dimensions.
9. The method of claim 3, wherein the identification and location module comprises:
the classification unit is used for determining whether fighting behavior exists, using an MSE loss function;
and a positioning unit for marking the fighting behavior in the image by using a smooth L1 loss function.
10. The method as claimed in claim 9, wherein the loss function is a weight-based loss function that gives larger weights to data with larger loss, thereby strengthening the learning of difficult samples, the specific process being as follows:
for each batch of training data, the weight coefficient w of each sample is calculated from its confidence loss Lc, and w is then applied to the positioning loss of the corresponding sample.
CN202110094939.9A 2021-01-25 2021-01-25 Campus violent behavior video detection method based on action recognition Pending CN112926388A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110094939.9A CN112926388A (en) 2021-01-25 2021-01-25 Campus violent behavior video detection method based on action recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110094939.9A CN112926388A (en) 2021-01-25 2021-01-25 Campus violent behavior video detection method based on action recognition

Publications (1)

Publication Number Publication Date
CN112926388A true CN112926388A (en) 2021-06-08

Family

ID=76166055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110094939.9A Pending CN112926388A (en) 2021-01-25 2021-01-25 Campus violent behavior video detection method based on action recognition

Country Status (1)

Country Link
CN (1) CN112926388A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023025051A1 (en) * 2021-08-23 2023-03-02 港大科桥有限公司 Video action detection method based on end-to-end framework, and electronic device


Similar Documents

Publication Publication Date Title
CN109886090B (en) Video pedestrian re-identification method based on multi-time scale convolutional neural network
CN108846365B (en) Detection method and device for fighting behavior in video, storage medium and processor
CN108230291B (en) Object recognition system training method, object recognition method, device and electronic equipment
CN109389086B (en) Method and system for detecting unmanned aerial vehicle image target
US20160078287A1 (en) Method and system of temporal segmentation for gesture analysis
CN110070029B (en) Gait recognition method and device
CN112991656A (en) Human body abnormal behavior recognition alarm system and method under panoramic monitoring based on attitude estimation
CN114582030B (en) Behavior recognition method based on service robot
CN110796051A (en) Real-time access behavior detection method and system based on container scene
CN110390308B (en) Video behavior identification method based on space-time confrontation generation network
CN109583334B (en) Action recognition method and system based on space-time correlation neural network
CN110688927A (en) Video action detection method based on time sequence convolution modeling
CN112801019B (en) Method and system for eliminating re-identification deviation of unsupervised vehicle based on synthetic data
CN111723687A (en) Human body action recognition method and device based on neural network
CN114241379A (en) Passenger abnormal behavior identification method, device and equipment and passenger monitoring system
Wu et al. Pose-Guided Inflated 3D ConvNet for action recognition in videos
Liu et al. ACDnet: An action detection network for real-time edge computing based on flow-guided feature approximation and memory aggregation
CN113191216A (en) Multi-person real-time action recognition method and system based on gesture recognition and C3D network
Sunney et al. A real-time machine learning framework for smart home-based Yoga Teaching System
CN112926388A (en) Campus violent behavior video detection method based on action recognition
CN112381774A (en) Cow body condition scoring method and system based on multi-angle depth information fusion
CN111898458A (en) Violent video identification method based on attention mechanism for bimodal task learning
CN115331152B (en) Fire fighting identification method and system
CN115798055A (en) Violent behavior detection method based on corersort tracking algorithm
CN115100559A (en) Motion prediction method and system based on lattice point optical flow

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination