CN112668495A - Violent video detection algorithm based on full space-time convolution module - Google Patents

Violent video detection algorithm based on full space-time convolution module

Info

Publication number
CN112668495A
Authority
CN
China
Prior art keywords
time
full
space
convolution module
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011619964.6A
Other languages
Chinese (zh)
Other versions
CN112668495B (en)
Inventor
谭振华
王鹏飞
夏祯彻
毛克明
张斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN202011619964.6A
Publication of CN112668495A
Application granted
Publication of CN112668495B
Active legal status
Anticipated expiration

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a violent video detection algorithm based on a full space-time convolution module, which realizes the fusion of temporal features over local space and the full time sequence, can effectively extract the local spatial and full temporal features of violent video, and effectively improves the detection accuracy and generalization ability of the model. The full space-time convolution module can also be applied to other network architectures, making full use of its spatio-temporal feature fusion capability to achieve better video behavior classification.

Description

Violent video detection algorithm based on full space-time convolution module
Technical Field
The invention relates to the technical field of deep learning and violent video detection, in particular to a violent video detection algorithm based on a full space-time convolution module.
Background
With the widespread deployment of surveillance in public places, violent incidents in such places have been continuously reduced. However, traditional approaches rely on manual review to identify and classify large volumes of surveillance footage, which is time-consuming and labor-intensive; the handling of violent incidents in public places therefore often lags behind the events and cannot respond in time. Monitoring and early warning of violent behavior are thus important for public safety, especially in areas with dense pedestrian flow such as airports, railway stations and intersections.
Existing techniques for violent video detection fall mainly into two categories. In the first category, the complete global spatial features of each video frame are extracted with 2D convolution kernels and then fused along the time dimension by a long short-term memory (LSTM) network, so as to judge whether the video content is violent. Fusing the global spatial features along the time dimension in this way is not always reasonable, because in some cases the violent behavior in a video is spatially local along the time dimension; for example, a punch thrown by the aggressor is only a local spatial change of the arm region over time. This is ignored by this class of methods. In the second category, the 2D convolution kernels and 2D pooling layers are extended to 3D convolution kernels and 3D pooling layers. This approach also has shortcomings for violent video detection: although the 3D convolution kernel fuses local features along the time dimension, its size in the time dimension is 3, far smaller than the number of input video frames (generally between 10 and 64), so the full temporal information of the input frames is not exploited. This is clearly unreasonable for violent actions in which human limbs deform rapidly within a short time.
To address these problems, the invention provides a violent video detection algorithm based on a full space-time convolution module.
Disclosure of Invention
The invention aims to provide a violent video detection algorithm based on a full space-time convolution module, so as to solve the problems identified in the background art.
In order to achieve the purpose, the invention provides the following technical scheme: a violent video detection algorithm based on a full space-time convolution module is characterized by comprising the following steps:
S01: selecting a random starting point in a target video;
S02: using an ImageNet-pretrained Xception sub-network as an extractor of high-level semantics of image spatial features;
S03: using a full space-time convolution module to fuse the information of the feature map over local space and the entire time dimension;
S04: repeatedly cascading the full space-time convolution module of S03 with 3D pooling layers in the deep neural network architecture, fusing full spatio-temporal features of the timing feature maps at different scales;
S05: inputting the feature map carrying full spatio-temporal feature information into a classifier whose final output has two categories, indicating whether violent behavior is present in the input consecutive video frames.
Further, in step S01, after the random starting point is selected, one video frame is kept out of every two frames, 15 frames in total, and these 15 consecutive frames are used as input.
Further, in step S02, the Xception sub-network is a feature extraction network whose truncation layer is "add_3"; the original video frames propagate forward through the Xception sub-network, and a feature map is generated upon reaching the truncation layer.
Further, in step S03, the full space-time convolution module has two paths: the left path extracts the spatial information of the consecutive video frames, the right path extracts their full temporal features, and finally the local spatial and full temporal features are superimposed to generate a fused feature with both local space and the full time sequence.
Further, the processing flow of the full space-time convolution module comprises the following steps:
S1: in the left path, a 3D convolution kernel of size (3, 3, 1) (i.e., K_{3×3×1}), whose three dimensions denote spatial width, spatial height and temporal length respectively, extracts the spatial features of the input timing feature map.
S2: in the right path, the input timing feature map is split along the channel dimension, so that X can be expressed as X = [X_1, X_2, ..., X_c], with X_i ∈ R^{W×H×T}.
S3: each X_i, i ∈ [1, c], is regarded as a 2D feature map of spatial size W × H whose T time steps act as channels, and a 2D convolution kernel K_{3×3} whose number of output channels equals the temporal length T is used to fuse the full temporal features of X_i.
S4: the spatial features of the timing feature map extracted in S1 and the full temporal features extracted in S3 are superimposed to generate fused features with both local space and the full time sequence.
Further, the flow of the full space-time convolution module may be represented by the following formulas:
X_in = [X_1, X_2, ..., X_c]
X_in-c = BN(X_in * W_{3×3×1}, filters)
X_full-Ti = Mish(X_i * W_{3×3}, T), i ∈ [1, c]
X_concat = Concat(X_full-Ti | i ∈ [1, c])
X_full-c = BN(X_concat * W_{1×1×1}, filters)
X_add = Add(X_full-c, X_in-c)
Y_out = Mish(X_add)
Compared with the prior art, the invention has the following beneficial effects:
In the invention, a full space-time convolution module is designed to address the inability of 2D-CNN + LSTM algorithms and 3D-CNN-based algorithms to capture the local spatial and full temporal features of consecutive video frames in the field of violent video detection. The module effectively improves detection accuracy and generalizes better to sample data such as surveillance video, which is low in resolution, high in noise, crowded with people and difficult to describe at the level of individual behavior; it can effectively extract the local spatial and full temporal features of consecutive video frames and judge violent behavior in video more accurately. With the rapid growth of short-video users in China in recent years, the content review of user-uploaded videos has become more difficult, and violent videos harm the healthy development of the online world.
Drawings
FIG. 1 is a flow chart of the sampling of successive video frames of the present invention;
FIG. 2 is a network architecture diagram of the spatial high-level semantic extractor of the present invention;
FIG. 3 is an architecture diagram of the full spatiotemporal convolution module of the present invention;
FIG. 4 is an architecture diagram of the full temporal feature fusion within the full space-time convolution module of the present invention;
FIG. 5 is a flow chart of the violent video detection algorithm of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-5, an embodiment of the present invention provides a violent video detection algorithm based on a full space-time convolution module, which includes the following steps:
Step one: 30 consecutive frames are selected from the target video; within these 30 frames, one video frame is kept out of every two, giving 15 frames in total, and each frame is resized to 224 × 224 pixels, so the input data is defined as [224, 224, 15, 3], where the last dimension indicates that the input images are RGB images with 3 channels;
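For illustration only, step one could be implemented roughly as in the following Python/OpenCV sketch; the function name sample_clip and all variable names are hypothetical and not part of the patent:

```python
import random
import cv2
import numpy as np

def sample_clip(video_path, window=30, size=(224, 224)):
    """Pick a random 30-frame window and keep every other frame (15 frames),
    each resized to 224x224 RGB."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    start = random.randint(0, max(0, total - window))       # random starting point
    cap.set(cv2.CAP_PROP_POS_FRAMES, start)

    frames = []
    for offset in range(window):
        ok, frame = cap.read()
        if not ok:
            break
        if offset % 2 == 0:                                  # keep 1 frame, skip 1 frame
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(cv2.resize(frame, size))
    cap.release()

    clip = np.stack(frames, axis=0)                          # (15, 224, 224, 3)
    # The description writes the input layout as [224, 224, 15, 3];
    # clip.transpose(1, 2, 0, 3) converts to that ordering if required.
    return clip
```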
Step two: an ImageNet-pretrained Xception sub-network is used as the extractor of high-level semantics of image spatial features; the truncation layer of the sub-network is "add_3", and the original video frames propagate forward through the Xception sub-network and generate feature maps upon reaching the truncation layer;
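A rough sketch of step two in tf.keras is given below. It assumes the tf.keras Xception ImageNet weights are used; the truncation-layer name "add_3" is taken from the description, and whether a pretrained graph exposes a layer with exactly that name depends on the framework version, so it may need adjusting:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_frame_extractor(trunc_layer="add_3"):
    """ImageNet-pretrained Xception, truncated at the named residual-add layer."""
    base = tf.keras.applications.Xception(weights="imagenet",
                                          include_top=False,
                                          input_shape=(224, 224, 3))
    return Model(inputs=base.input, outputs=base.get_layer(trunc_layer).output)

# Apply the 2D extractor to each of the 15 frames independently.
clip_in = layers.Input(shape=(15, 224, 224, 3))             # (T, H, W, C) per sample
per_frame = layers.TimeDistributed(build_frame_extractor())(clip_in)
# Rearrange to the (H, W, T, C) layout assumed by the full space-time convolution module.
feature_maps = layers.Permute((2, 3, 1, 4))(per_frame)
```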
Step three: a full space-time convolution module is used to fuse the information of the feature map over local space and the entire time dimension;
Step four: the full space-time convolution module of step three is repeatedly cascaded with 3D pooling layers in the deep neural network architecture, fusing full spatio-temporal features of the timing feature maps at different scales;
Step five: the feature map carrying full spatio-temporal feature information is input into a classifier whose final output has two categories, indicating whether violent behavior is present in the input consecutive video frames.
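Steps three to five could be assembled as in the sketch below, continuing from the extractor sketch above. The number of module/pooling repetitions, the filter widths, and the choice to pool only the spatial axes are illustrative assumptions, not statements of the patent; full_spatiotemporal_module is a hypothetical name for the module sketched after the formula description below:

```python
from tensorflow.keras import layers, Model

x = feature_maps                                             # (H', W', 15, C') from the extractor sketch above
for filters in (256, 512, 512):                              # repetition count and widths are assumptions
    x = full_spatiotemporal_module(x, filters)               # fuse local-space / full-temporal features
    x = layers.MaxPooling3D(pool_size=(2, 2, 1))(x)          # 3D pooling; only the spatial axes are reduced here

x = layers.GlobalAveragePooling3D()(x)
output = layers.Dense(2, activation="softmax")(x)            # two categories: violent / non-violent
model = Model(inputs=clip_in, outputs=output)
```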
Furthermore, the full space-time convolution module has two paths: the left path extracts the spatial information of the consecutive video frames, the right path extracts their full temporal features, and finally the local spatial and full temporal features are superimposed to generate a fused feature with both local space and the full time sequence. X ∈ R^{W×H×T×C} is defined as the continuous feature map input to the module, where W, H, T and C denote the width, height, temporal length and number of channels of the feature map respectively. The module comprises the following specific steps:
S1: in the left path, a 3D convolution kernel of size (3, 3, 1) (i.e., K_{3×3×1}), whose three dimensions denote spatial width, spatial height and temporal length respectively, extracts the spatial features of the input timing feature map;
S2: in the right path, the input timing feature map is split along the channel dimension, so that X can be expressed as X = [X_1, X_2, ..., X_c], with X_i ∈ R^{W×H×T};
S3: each X_i, i ∈ [1, c], is regarded as a 2D feature map of spatial size W × H whose T time steps act as channels, and a 2D convolution kernel K_{3×3} whose number of output channels equals the temporal length T is used to fuse the full temporal features of X_i;
S4: the spatial features of the timing feature map extracted in S1 and the full temporal features extracted in S3 are superimposed to generate fused features with both local space and the full time sequence.
The flow of the full space-time convolution module can be expressed by the following notations:
X_in = [X_1, X_2, ..., X_c]
X_in-c = BN(X_in * W_{3×3×1}, filters)
X_full-Ti = Mish(X_i * W_{3×3}, T), i ∈ [1, c]
X_concat = Concat(X_full-Ti | i ∈ [1, c])
X_full-c = BN(X_concat * W_{1×1×1}, filters)
X_add = Add(X_full-c, X_in-c)
Y_out = Mish(X_add)
where X_in denotes the continuous feature map of the input; X_in-c denotes the output obtained after the left path extracts the spatial features of the continuous feature map and applies BatchNorm; X_full-Ti denotes the output obtained after the features split in S2 are fused over the full time sequence and passed through the Mish activation function; X_concat denotes the result of splicing the X_full-Ti along the channel dimension, after which a 1 × 1 × 1 convolution kernel increases the number of channels of the feature map to obtain the final output feature of the right path, X_full-c; X_add is the superposition of the spatial features extracted by the left path and the full temporal features extracted by the right path, and finally the output Y_out of the full space-time convolution module is obtained through the Mish activation function.
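For illustration, the module could be sketched in tf.keras as below. Several points are assumptions rather than statements of the patent: the input layout is taken as (batch, H, W, T, C); the 2D kernel K_{3×3} is assumed to be shared across the c channel slices and given T output channels (one reading of X_full-Ti = Mish(X_i * W_{3×3}, T)); and the slice outputs are stacked back into a channel axis before the 1 × 1 × 1 convolution so that the two paths can be added:

```python
import tensorflow as tf
from tensorflow.keras import layers

def mish(x):
    # Mish activation: x * tanh(softplus(x))
    return x * tf.math.tanh(tf.math.softplus(x))

def full_spatiotemporal_module(x, filters):
    """x: timing feature map of shape (batch, H, W, T, C)."""
    # Left path: 3x3x1 convolution extracts local spatial features without mixing time.
    left = layers.Conv3D(filters, kernel_size=(3, 3, 1), padding="same")(x)
    left = layers.BatchNormalization()(left)                  # X_in-c

    # Right path: split along the channel axis; each slice X_i is an (H, W) map
    # whose "channels" are its T time steps.
    t, c = x.shape[3], x.shape[4]
    conv2d_t = layers.Conv2D(t, kernel_size=3, padding="same")    # K_3x3 with T output channels (assumed shared)
    slices = []
    for i in range(c):
        xi = x[..., i]                                        # (batch, H, W, T)
        slices.append(mish(conv2d_t(xi)))                     # X_full-Ti: each output mixes all T time steps
    right = tf.stack(slices, axis=-1)                         # Concat -> (batch, H, W, T, C)
    right = layers.Conv3D(filters, kernel_size=(1, 1, 1), padding="same")(right)
    right = layers.BatchNormalization()(right)                # X_full-c

    return mish(layers.Add()([left, right]))                  # Y_out = Mish(Add(X_full-c, X_in-c))
```

Looping over every channel slice as above is faithful to the split described in S2 but would be slow for wide feature maps; a grouped or depthwise formulation would be a natural optimization, although the patent does not specify one.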
In summary, the invention provides a violent video detection algorithm based on a full space-time convolution module, which realizes the fusion of temporal features over local space and the full time sequence. Compared with conventional 2D-CNN + LSTM and 3D-CNN-based algorithms, it has clear advantages in the field of violent video detection. The full space-time convolution module can also be applied to other network architectures, making full use of its spatio-temporal feature fusion capability to achieve better video behavior classification.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (6)

1. A violent video detection algorithm based on a full space-time convolution module is characterized by comprising the following steps:
S01: selecting a random starting point in a target video;
S02: using an ImageNet-pretrained Xception sub-network as an extractor of high-level semantics of image spatial features;
S03: using a full space-time convolution module to fuse the information of the feature map over local space and the entire time dimension;
S04: repeatedly cascading the full space-time convolution module of S03 with 3D pooling layers in the deep neural network architecture, fusing full spatio-temporal features of the timing feature maps at different scales;
S05: inputting the feature map carrying full spatio-temporal feature information into a classifier whose final output has two categories, indicating whether violent behavior is present in the input consecutive video frames.
2. The violent video detection algorithm based on the full space-time convolution module as claimed in claim 1, characterized in that: in step S01, after the random starting point is selected, one video frame is kept out of every two frames, 15 frames in total, and these 15 consecutive frames are used as input.
3. The violent video detection algorithm based on the full space-time convolution module as claimed in claim 1, characterized in that: in step S02, the Xception sub-network is a feature extraction network whose truncation layer is "add_3", and the original video frames propagate forward through the Xception sub-network and generate feature maps after reaching the truncation layer.
4. The violent video detection algorithm based on the full space-time convolution module as claimed in claim 1, characterized in that: in step S03, the full space-time convolution module has two paths, where the left path extracts the spatial information of the consecutive video frames, the right path extracts their full temporal features, and finally the local spatial and full temporal features are superimposed to generate a fused feature with both local space and the full time sequence.
5. The violent video detection algorithm based on the full space-time convolution module as claimed in claim 4, characterized in that: the processing flow of the full space-time convolution module comprises the following steps:
S1: in the left path, a 3D convolution kernel of size (3, 3, 1) (i.e., K_{3×3×1}), whose three dimensions denote spatial width, spatial height and temporal length respectively, extracts the spatial features of the input timing feature map;
S2: in the right path, the input timing feature map is split along the channel dimension, so that X can be expressed as X = [X_1, X_2, ..., X_c], with X_i ∈ R^{W×H×T};
S3: each X_i, i ∈ [1, c], is regarded as a 2D feature map of spatial size W × H whose T time steps act as channels, and a 2D convolution kernel K_{3×3} whose number of output channels equals the temporal length T is used to fuse the full temporal features of X_i;
S4: the spatial features of the timing feature map extracted in S1 and the full temporal features extracted in S3 are superimposed to generate fused features with both local space and the full time sequence.
6. The violent video detection algorithm based on the full space-time convolution module as claimed in claim 5, characterized in that: the flow of the full space-time convolution module can be represented by the following notations:
X_in = [X_1, X_2, ..., X_c]
X_in-c = BN(X_in * W_{3×3×1}, filters)
X_full-Ti = Mish(X_i * W_{3×3}, T), i ∈ [1, c]
X_concat = Concat(X_full-Ti | i ∈ [1, c])
X_full-c = BN(X_concat * W_{1×1×1}, filters)
X_add = Add(X_full-c, X_in-c)
Y_out = Mish(X_add)
CN202011619964.6A 2020-12-30 2020-12-30 Full-time space convolution module-based violent video detection algorithm Active CN112668495B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011619964.6A CN112668495B (en) 2020-12-30 2020-12-30 Full-time space convolution module-based violent video detection algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011619964.6A CN112668495B (en) 2020-12-30 2020-12-30 Full-time space convolution module-based violent video detection algorithm

Publications (2)

Publication Number Publication Date
CN112668495A true CN112668495A (en) 2021-04-16
CN112668495B CN112668495B (en) 2024-02-02

Family

ID=75412098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011619964.6A Active CN112668495B (en) 2020-12-30 2020-12-30 Full-time space convolution module-based violent video detection algorithm

Country Status (1)

Country Link
CN (1) CN112668495B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463949A (en) * 2017-07-14 2017-12-12 北京协同创新研究院 A kind of processing method and processing device of video actions classification
CN107609460A (en) * 2017-05-24 2018-01-19 南京邮电大学 A kind of Human bodys' response method for merging space-time dual-network stream and attention mechanism
CN109446923A (en) * 2018-10-10 2019-03-08 北京理工大学 Depth based on training characteristics fusion supervises convolutional neural networks Activity recognition method
CN110175580A (en) * 2019-05-29 2019-08-27 复旦大学 A kind of video behavior recognition methods based on timing cause and effect convolutional network
CN111353395A (en) * 2020-02-19 2020-06-30 南京信息工程大学 Face changing video detection method based on long-term and short-term memory network


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YUTONG CAI et al.: "Multi-scale spatiotemporal information fusion network for video action recognition", 2018 IEEE Visual Communications and Image Processing, pages 1-4 *
ZHENHUA T et al.: "FTCF: Full temporal cross fusion network for violence detection in videos", Applied Intelligence, vol. 53, pages 4218-4230 *
夏清沛: "Human behavior recognition based on deep learning" (基于深度学习的人体行为识别), CNKI Outstanding Master's Theses Full-text Database (Information Science and Technology Series), no. 4, pages 138-886 *
谭等泰 et al.: "A behavior recognition model with multi-feature fusion" (多特征融合的行为识别模型), Journal of Image and Graphics (中国图象图形学报), vol. 25, no. 12, pages 2541-2552 *

Also Published As

Publication number Publication date
CN112668495B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
CN112132156B (en) Image saliency target detection method and system based on multi-depth feature fusion
EP3291558B1 (en) Video coding and decoding methods and apparatus
Hoang Ngan Le et al. Robust hand detection and classification in vehicles and in the wild
CN108875608B (en) Motor vehicle traffic signal identification method based on deep learning
CN109636795B (en) Real-time non-tracking monitoring video remnant detection method
CN112150450B (en) Image tampering detection method and device based on dual-channel U-Net model
CN111461039B (en) Landmark identification method based on multi-scale feature fusion
US11544510B2 (en) System and method for multi-modal image classification
CN110020658B (en) Salient object detection method based on multitask deep learning
CN111931859B (en) Multi-label image recognition method and device
CN111814817A (en) Video classification method and device, storage medium and electronic equipment
CN112287983B (en) Remote sensing image target extraction system and method based on deep learning
Mlích et al. Fire segmentation in still images
Santana et al. A novel siamese-based approach for scene change detection with applications to obstructed routes in hazardous environments
CN111401368B (en) News video title extraction method based on deep learning
CN115131797A (en) Scene text detection method based on feature enhancement pyramid network
CN116129291A (en) Unmanned aerial vehicle animal husbandry-oriented image target recognition method and device
CN117197763A (en) Road crack detection method and system based on cross attention guide feature alignment network
CN115984537A (en) Image processing method and device and related equipment
Ghali et al. CT-Fire: a CNN-Transformer for wildfire classification on ground and aerial images
CN112668495A (en) Violent video detection algorithm based on full space-time convolution module
CN115393901A (en) Cross-modal pedestrian re-identification method and computer readable storage medium
CN113255646B (en) Real-time scene text detection method
CN114359789A (en) Target detection method, device, equipment and medium for video image
CN110866487B (en) Large-scale pedestrian detection and re-identification sample set construction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant