CN112668495A - Violent video detection algorithm based on full space-time convolution module - Google Patents
Violent video detection algorithm based on full space-time convolution module
- Publication number: CN112668495A
- Application number: CN202011619964.6A
- Authority: CN (China)
- Prior art keywords: time, full, space, convolution module, video
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention discloses a violent video detection algorithm based on a full space-time convolution module. The module fuses time-sequence features over local space and the full time sequence, can effectively extract the local spatial and full temporal characteristics of a violent video, and improves the detection accuracy and generalization capability of the model. The full space-time convolution module can also be applied to other network architectures, making full use of its spatio-temporal feature fusion capability to achieve better video behavior classification.
Description
Technical Field
The invention relates to the technical field of deep learning and violent video detection, and in particular to a violent video detection algorithm based on a full space-time convolution module.
Background
With the spread of surveillance in public places, violent incidents there have steadily declined. However, traditional practice relies on manual review to identify and classify a large volume of surveillance footage, which is time-consuming and labor-intensive; responses to violent incidents in public places therefore often lag and cannot be made in time. Monitoring and early warning of violent behavior are thus important for public safety, especially in crowded areas such as airports, railway stations, and intersections.
Existing technical methods for detecting violent videos fall into two main categories. In the first category, complete global spatial features of each video frame are extracted with 2D convolution kernels and then fused along the time dimension with a Long Short-Term Memory (LSTM) network to decide whether the video content is violent. Fusing global spatial features in the time dimension in this way is not always reasonable, because in some cases violent behavior in a video is only locally ordered in time; for example, a punch thrown by an aggressor is merely a local spatial change of the arm region over time, which this kind of approach ignores. In the second category, the 2D convolution kernels and 2D pooling layers are extended to 3D convolution kernels and 3D pooling layers. This also has shortcomings for violent video detection: although a 3D convolution kernel fuses local features in the time dimension, its temporal size is typically 3, far smaller than the number of input frames (generally 10 to 64). The full temporal extent of the input video frames is therefore not exploited, which is clearly unreasonable for violent actions in which human limbs deform rapidly within a short time.
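To make this scale mismatch concrete, the following sketch (illustrative only; the layer counts are assumptions, not taken from the patent) computes the temporal receptive field of a stack of 3D convolutions with temporal kernel size 3 and temporal stride 1, against a typical clip length:

```python
def temporal_receptive_field(num_layers: int, kernel_t: int = 3) -> int:
    """Temporal receptive field of num_layers stacked 3D convolutions with
    temporal kernel size kernel_t and temporal stride 1:
    rf = 1 + num_layers * (kernel_t - 1)."""
    return 1 + num_layers * (kernel_t - 1)

clip_len = 15  # frames in a typical input clip
for layers in (1, 2, 4):
    rf = temporal_receptive_field(layers)
    print(f"{layers} conv layer(s): sees {rf} of {clip_len} frames")
```

Even four stacked layers see only 9 of 15 frames at each output position, which is the under-utilization of the full time dimension described above.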
To address these problems, the invention provides a violent video detection algorithm based on a full space-time convolution module.
Disclosure of Invention
The invention aims to provide a violent video detection algorithm based on a full space-time convolution module, thereby solving the problems identified in the background art.
In order to achieve the purpose, the invention provides the following technical scheme: a violent video detection algorithm based on a full space-time convolution module is characterized by comprising the following steps:
s01: selecting a random starting point in a target video;
s02: using an Xception sub-network pre-trained on ImageNet as an extractor of high-level semantics of image spatial features;
s03: fusing information of the feature map over local space and all time dimensions using a full space-time convolution module;
s04: connecting the full space-time convolution module of S03 with 3D pooling layers repeatedly in the deep neural network architecture, fusing full space-time features of time-sequence feature maps at different scales;
s05: inputting the feature map carrying full space-time feature information into a classifier, whose final output has two categories indicating whether violent behavior is present in the input continuous video frames.
Further, in step S01, after the random starting point is selected, one video frame is kept out of every two (i.e., every other frame), 15 frames in total, and these 15 sampled frames are used as the input.
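As a sketch of this sampling scheme (the function name is illustrative, not from the patent), the selected frame indices can be generated as follows:

```python
import random

def sample_clip_indices(num_frames, clip_len=15, step=2, rng=None):
    """Pick a random starting frame, then keep every other frame (step=2)
    until clip_len frames are collected. The clip spans
    (clip_len - 1) * step + 1 source frames."""
    rng = rng or random.Random()
    span = (clip_len - 1) * step + 1  # 29 source frames spanned
    start = rng.randrange(num_frames - span + 1)
    return [start + i * step for i in range(clip_len)]

# Example: sample a 15-frame clip from a 300-frame video
indices = sample_clip_indices(300, rng=random.Random(0))
```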
Further, in step S02, the Xception sub-network is a feature extraction network whose truncation layer is "add_3"; the original video frames propagate forward through the Xception sub-network, and a feature map is generated at the truncation layer.
Further, in step S03, the full space-time convolution module has two paths: the left path extracts the spatial information of the continuous video frames, the right path extracts their full temporal features, and finally the local-space and full-temporal features are superimposed to generate a fused feature with both local space and full time sequence.
Further, the full space-time convolution module proceeds as follows:
s1: in the left path, a 3D convolution kernel of size (3, 3, 1) (i.e., K_{3×3×1}, the dimensions denoting spatial width, spatial height, and temporal length) extracts the spatial features of the input time-sequence feature map.
s2: in the right path, the input time-sequence feature map is split along the channel dimension, so that X can be expressed as X = [X_1, X_2, …, X_c], with X_i ∈ R^{W×H×T}.
s3: each X_i, i ∈ [1, c], is treated as a 2D feature map of spatial size W×H whose T frames act as input channels; a 2D convolution kernel K_{3×3} with T output channels then fuses the full temporal features of X_i.
s4: the spatial features of the time-sequence feature map extracted in S1 and the full temporal features extracted in S3 are superimposed to generate fused features with both local space and full time sequence.
Further, the flow of the full space-time convolution module can be represented by the following notation:
X_in = [X_1, X_2, …, X_c]
X_in-c = BN(X_in * W_{3×3×1}, filters)
X_full-Ti = Mish(X_i * W_{3×3,T}), i ∈ [1, c]
X_concat = Concat(X_full-Ti | i ∈ [1, c])
X_full-c = BN(X_concat * W_{1×1×1}, filters)
X_add = Add(X_full-c, X_in-c)
Y_out = Mish(X_add).
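The flow of the module can be traced at the level of tensor shapes. The sketch below assumes 'same' padding and that the 1×1×1 convolution restores the input channel count (both are assumptions; the patent does not state padding or filter counts):

```python
def full_spacetime_module_shapes(W, H, T, C):
    """Trace tensor shapes through the full space-time convolution module,
    assuming 'same' padding so spatial and temporal sizes are preserved."""
    shapes = {"X_in": (W, H, T, C)}
    # Left path: 3D conv W_{3x3x1} with C filters + BatchNorm
    shapes["X_in-c"] = (W, H, T, C)
    # Right path: split into C slices X_i of shape (W, H, T); a 2D conv
    # K_{3x3} treats the T frames as input channels and emits T channels,
    # so every output frame mixes information from all T input frames
    shapes["X_full-Ti"] = (W, H, T)
    # Concat the C fused slices back along the channel dimension
    shapes["X_concat"] = (W, H, T, C)
    # 1x1x1 conv + BatchNorm adjusts channels to match the left path
    shapes["X_full-c"] = (W, H, T, C)
    assert shapes["X_full-c"] == shapes["X_in-c"], "paths must match to Add"
    # Element-wise Add of the two paths, then Mish, leaves the shape intact
    shapes["Y_out"] = (W, H, T, C)
    return shapes

shapes = full_spacetime_module_shapes(W=28, H=28, T=15, C=128)
```

The shape identity at the Add step is what lets the module be stacked with 3D pooling layers, as step S04 describes.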
Compared with the prior art, the invention has the following beneficial effects:
Aiming at the inability of 2D-CNN+LSTM and 3D-CNN-based algorithms to capture both the local spatial and the full temporal features of continuous video frames in violent video detection, the invention designs a full space-time convolution module that effectively improves detection accuracy. The method generalizes well to sample data that, like surveillance video, is low in resolution, high in noise, crowded with people, and hard to describe at the level of individual behavior; it can effectively extract the local spatial and full temporal features of continuous video frames and judge violent behavior in video more accurately. Moreover, with the rapid growth of short-video users in China in recent years, content review of user-uploaded videos has become more difficult, and violent videos harm the healthy development of the online world.
Drawings
FIG. 1 is a flow chart of the sampling of successive video frames of the present invention;
FIG. 2 is a network architecture diagram of the spatial high-level semantic extractor of the present invention;
FIG. 3 is an architecture diagram of the full spatiotemporal convolution module of the present invention;
FIG. 4 is an architecture diagram of the full-temporal feature fusion in the full space-time convolution module of the present invention;
FIG. 5 is a flow chart of the violent video detection algorithm of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-5, an embodiment of the present invention provides a violent video detection algorithm based on a full space-time convolution module, which includes the following steps:
The method comprises the following steps. Step one: selecting 30 consecutive frames in the target video, keeping one frame out of every two (15 frames in total), and resizing each frame to 224×224 pixels, so that the input data has shape [224, 224, 15, 3], where the last dimension indicates that the input images are RGB with 3 channels;
step two: using an Xception sub-network pre-trained on ImageNet as the extractor of high-level semantics of image spatial features; the truncation layer of the sub-network is "add_3", and the original video frames propagate forward through the Xception sub-network, generating a feature map at the truncation layer;
step three: fusing information of the feature map over local space and all time dimensions using the full space-time convolution module;
step four: connecting the full space-time convolution module of step three with 3D pooling layers repeatedly in the deep neural network architecture, fusing full space-time features of time-sequence feature maps at different scales;
step five: inputting the feature map carrying full space-time feature information into a classifier, whose final output has two categories indicating whether violent behavior is present in the input continuous video frames.
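Steps one through five can be summarized by the tensor they hand to the network. The helper below (naming is illustrative, not from the patent) derives the [224, 224, 15, 3] input specification from the sampling in step one:

```python
def build_input_shape(src_frames=30, step=2, height=224, width=224, channels=3):
    """Input shape after sampling: keep one of every `step` frames from
    `src_frames` consecutive frames, resized to width x height RGB."""
    clip_len = src_frames // step  # 30 frames, every other one kept -> 15
    return (height, width, clip_len, channels)

assert build_input_shape() == (224, 224, 15, 3)
```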
Furthermore, the full space-time convolution module has two paths: the left path extracts the spatial information of the continuous video frames, the right path extracts their full temporal features, and finally the local-space and full-temporal features are superimposed to generate a fused feature with both. Define X ∈ R^{W×H×T×C} as the continuous feature map input to the module, where W, H, T, and C denote the width, height, temporal length, and number of channels of the feature map, respectively. The module comprises the following specific steps:
s1: in the left path, a 3D convolution kernel of size (3, 3, 1) (i.e., K_{3×3×1}, the dimensions denoting spatial width, spatial height, and temporal length) extracts the spatial features of the input time-sequence feature map;
s2: in the right path, the input time-sequence feature map is split along the channel dimension, so that X can be expressed as X = [X_1, X_2, …, X_c], with X_i ∈ R^{W×H×T};
s3: each X_i, i ∈ [1, c], is treated as a 2D feature map of spatial size W×H whose T frames act as input channels; a 2D convolution kernel K_{3×3} with T output channels then fuses the full temporal features of X_i;
s4: the spatial features of the time-sequence feature map extracted in S1 and the full temporal features extracted in S3 are superimposed to generate fused features with both local space and full time sequence.
The flow of the full space-time convolution module can be expressed by the following notation:
X_in = [X_1, X_2, …, X_c]
X_in-c = BN(X_in * W_{3×3×1}, filters)
X_full-Ti = Mish(X_i * W_{3×3,T}), i ∈ [1, c]
X_concat = Concat(X_full-Ti | i ∈ [1, c])
X_full-c = BN(X_concat * W_{1×1×1}, filters)
X_add = Add(X_full-c, X_in-c)
Y_out = Mish(X_add)
where X_in denotes the input continuous feature map; X_in-c is the left-path output after spatial feature extraction and BatchNorm processing; X_full-Ti is the result of fusing the full temporal features of X_i, processed by the Mish activation function; X_concat splices the X_full-Ti along the channel dimension, after which a 1×1×1 convolution kernel adjusts the number of channels of the feature map, giving the final right-path output X_full-c; X_add superimposes the spatial features extracted by the left path and the full-temporal features extracted by the right path; and the output Y_out of the full space-time convolution module is finally obtained through the Mish activation function.
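The module applies the Mish activation twice. The patent does not restate its definition; the standard formula is mish(x) = x · tanh(softplus(x)), sketched here with a guard against overflow for large inputs:

```python
import math

def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x)), softplus(x) = ln(1 + e^x)."""
    # For large x, softplus(x) ~= x; computing exp(x) directly would overflow
    softplus = x if x > 20 else math.log1p(math.exp(x))
    return x * math.tanh(softplus)

# Mish is smooth, non-monotonic, and unbounded above, bounded below
values = [mish(v) for v in (-2.0, 0.0, 1.0, 5.0)]
```

Unlike ReLU, Mish keeps a small negative response for negative inputs, which is often cited as helping gradient flow.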
In summary: the invention provides a violent video detection algorithm based on a full space-time convolution module, realizing feature fusion of time-sequence features over local space and the full time sequence. Compared with conventional 2D-CNN+LSTM and 3D-CNN-based algorithms, it has clear advantages in the field of violent video detection. The full space-time convolution module can also be applied to other network architectures, making full use of its spatio-temporal feature fusion capability to achieve better video behavior classification.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (6)
1. A violent video detection algorithm based on a full space-time convolution module is characterized by comprising the following steps:
s01: selecting a random starting point in a target video;
s02: using an Xception sub-network pre-trained on ImageNet as an extractor of high-level semantics of image spatial features;
s03: fusing information of the feature map over local space and all time dimensions using a full space-time convolution module;
s04: connecting the full space-time convolution module of S03 with 3D pooling layers repeatedly in the deep neural network architecture, fusing full space-time features of time-sequence feature maps at different scales;
s05: inputting the feature map carrying full space-time feature information into a classifier, whose final output has two categories indicating whether violent behavior is present in the input continuous video frames.
2. The violent video detection algorithm based on the full space-time convolution module as claimed in claim 1, characterized in that: in step S01, after the random starting point is selected, one video frame is kept out of every two (i.e., every other frame), 15 frames in total, and these 15 sampled frames are used as the input.
3. The violent video detection algorithm based on the full space-time convolution module as claimed in claim 1, characterized in that: in step S02, the Xception sub-network is a feature extraction network whose truncation layer is "add_3"; the original video frames propagate forward through the Xception sub-network and generate a feature map at the truncation layer.
4. The violent video detection algorithm based on the full space-time convolution module as claimed in claim 1, characterized in that: in step S03, the full space-time convolution module has two paths: the left path extracts spatial information of the continuous video frames, the right path extracts their full temporal features, and finally the local-space and full-temporal features are superimposed to generate a fused feature with both local space and full time sequence.
5. The violent video detection algorithm based on the full space-time convolution module as claimed in claim 4, characterized in that the full space-time convolution module proceeds as follows:
s1: in the left path, a 3D convolution kernel of size (3, 3, 1) (i.e., K_{3×3×1}, the dimensions denoting spatial width, spatial height, and temporal length) extracts the spatial features of the input time-sequence feature map;
s2: in the right path, the input time-sequence feature map is split along the channel dimension, so that X can be expressed as X = [X_1, X_2, …, X_c], with X_i ∈ R^{W×H×T};
s3: each X_i, i ∈ [1, c], is treated as a 2D feature map of spatial size W×H whose T frames act as input channels, and a 2D convolution kernel K_{3×3} with T output channels fuses the full temporal features of X_i;
s4: the spatial features of the time-sequence feature map extracted in S1 and the full temporal features extracted in S3 are superimposed to generate fused features with both local space and full time sequence.
6. The violent video detection algorithm based on the full space-time convolution module as claimed in claim 5, characterized in that the flow of the full space-time convolution module can be represented by the following notation:
X_in = [X_1, X_2, …, X_c]
X_in-c = BN(X_in * W_{3×3×1}, filters)
X_full-Ti = Mish(X_i * W_{3×3,T}), i ∈ [1, c]
X_concat = Concat(X_full-Ti | i ∈ [1, c])
X_full-c = BN(X_concat * W_{1×1×1}, filters)
X_add = Add(X_full-c, X_in-c)
Y_out = Mish(X_add).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011619964.6A CN112668495B (en) | 2020-12-30 | 2020-12-30 | Full-time space convolution module-based violent video detection algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112668495A (publication) | 2021-04-16
CN112668495B (grant) | 2024-02-02
Family
ID=75412098
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011619964.6A Active CN112668495B (en) | 2020-12-30 | 2020-12-30 | Full-time space convolution module-based violent video detection algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112668495B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107463949A (en) * | 2017-07-14 | 2017-12-12 | 北京协同创新研究院 | A kind of processing method and processing device of video actions classification |
CN107609460A (en) * | 2017-05-24 | 2018-01-19 | 南京邮电大学 | A kind of Human bodys' response method for merging space-time dual-network stream and attention mechanism |
CN109446923A (en) * | 2018-10-10 | 2019-03-08 | 北京理工大学 | Depth based on training characteristics fusion supervises convolutional neural networks Activity recognition method |
CN110175580A (en) * | 2019-05-29 | 2019-08-27 | 复旦大学 | A kind of video behavior recognition methods based on timing cause and effect convolutional network |
CN111353395A (en) * | 2020-02-19 | 2020-06-30 | 南京信息工程大学 | Face changing video detection method based on long-term and short-term memory network |
- 2020-12-30: application CN202011619964.6A filed in China; granted as patent CN112668495B (status: Active)
Non-Patent Citations (4)
- YUTONG CAI et al., "Multi-scale spatiotemporal information fusion network for video action recognition", 2018 IEEE Visual Communications and Image Processing, pp. 1-4.
- ZHENHUA T. et al., "FTCF: Full temporal cross fusion network for violence detection in videos", Applied Intelligence, vol. 53, pp. 4218-4230.
- XIA Qingpei, "Human body behavior recognition based on deep learning", CNKI Outstanding Master's Theses Full-text Database (Information Science and Technology), no. 4, pp. 138-886.
- TAN Dengtai et al., "A behavior recognition model with multi-feature fusion", Journal of Image and Graphics, vol. 25, no. 12, pp. 2541-2552.
Also Published As
Publication number | Publication date |
---|---|
CN112668495B (en) | 2024-02-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |