CN112668495A - Violent video detection algorithm based on full space-time convolution module - Google Patents
Violent video detection algorithm based on full space-time convolution module
- Publication number: CN112668495A
- Application number: CN202011619964.6A
- Authority: CN (China)
- Prior art keywords: time, full, space, convolution module, video
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention discloses a violent video detection algorithm based on a full space-time convolution module. The module fuses time-sequence features over local space and the full time sequence, can effectively extract the local spatial and full temporal characteristics of a violent video, and improves the detection accuracy and generalization capability of the model. The full space-time convolution module can also be applied to other network architectures, making full use of its spatio-temporal feature fusion capability to achieve better video behavior classification.
Description
Technical Field
The invention relates to the technical field of deep learning and violent video detection, and in particular to a violent video detection algorithm based on a full space-time convolution module.
Background
With the spread of surveillance in public places, violent incidents there have steadily declined. However, traditional practice relies on manual review to identify and classify a large volume of surveillance footage, which is time-consuming and labor-intensive; responses to violent incidents in public places therefore often lag and cannot be made in time. Monitoring and early warning of violent behavior are thus important for public safety, especially in crowded areas such as airports, railway stations, and intersections.
Existing technical methods for detecting violent videos fall into two main categories. In the first category, complete global spatial features of each video frame are extracted with 2D convolution kernels and then fused along the time dimension with a Long Short-Term Memory (LSTM) network to decide whether the video content is violent. Fusing global spatial features in the time dimension in this way is not always reasonable, because in some cases violent behavior in a video is only locally ordered in time; for example, a punch thrown by an aggressor is merely a local spatial change of the arm region over time, which this kind of approach ignores. In the second category, the 2D convolution kernels and 2D pooling layers are extended to 3D convolution kernels and 3D pooling layers. This also has shortcomings for violent video detection: although a 3D convolution kernel fuses local features in the time dimension, its temporal size is typically 3, far smaller than the number of input frames (generally 10 to 64). The full temporal extent of the input video frames is therefore not exploited, which is clearly unreasonable for violent actions in which human limbs deform rapidly within a short time.
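To make this scale mismatch concrete, the following sketch (illustrative only; the layer counts are assumptions, not taken from the patent) computes the temporal receptive field of a stack of 3D convolutions with temporal kernel size 3 and temporal stride 1, against a typical clip length:

```python
def temporal_receptive_field(num_layers: int, kernel_t: int = 3) -> int:
    """Temporal receptive field of num_layers stacked 3D convolutions with
    temporal kernel size kernel_t and temporal stride 1:
    rf = 1 + num_layers * (kernel_t - 1)."""
    return 1 + num_layers * (kernel_t - 1)

clip_len = 15  # frames in a typical input clip
for layers in (1, 2, 4):
    rf = temporal_receptive_field(layers)
    print(f"{layers} conv layer(s): sees {rf} of {clip_len} frames")
```

Even four stacked layers see only 9 of 15 frames at each output position, which is the under-utilization of the full time dimension described above.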
To address these problems, the invention provides a violent video detection algorithm based on a full space-time convolution module.
Disclosure of Invention
The invention aims to provide a violent video detection algorithm based on a full space-time convolution module, thereby solving the problems identified in the background art.
In order to achieve the purpose, the invention provides the following technical scheme: a violent video detection algorithm based on a full space-time convolution module is characterized by comprising the following steps:
s01: selecting a random starting point in a target video;
s02: using an Xception sub-network pre-trained on ImageNet as an extractor of high-level semantics of image spatial features;
s03: fusing information of the feature map over local space and all time dimensions using a full space-time convolution module;
s04: connecting the full space-time convolution module of S03 with 3D pooling layers repeatedly in the deep neural network architecture, fusing full space-time features of time-sequence feature maps at different scales;
s05: inputting the feature map carrying full space-time feature information into a classifier, whose final output has two categories indicating whether violent behavior is present in the input continuous video frames.
Further, in step S01, after the random starting point is selected, one video frame is kept out of every two (i.e., every other frame), 15 frames in total, and these 15 sampled frames are used as the input.
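As a sketch of this sampling scheme (the function name is illustrative, not from the patent), the selected frame indices can be generated as follows:

```python
import random

def sample_clip_indices(num_frames, clip_len=15, step=2, rng=None):
    """Pick a random starting frame, then keep every other frame (step=2)
    until clip_len frames are collected. The clip spans
    (clip_len - 1) * step + 1 source frames."""
    rng = rng or random.Random()
    span = (clip_len - 1) * step + 1  # 29 source frames spanned
    start = rng.randrange(num_frames - span + 1)
    return [start + i * step for i in range(clip_len)]

# Example: sample a 15-frame clip from a 300-frame video
indices = sample_clip_indices(300, rng=random.Random(0))
```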
Further, in step S02, the Xception sub-network is a feature extraction network whose truncation layer is "add_3"; the original video frames propagate forward through the Xception sub-network, and a feature map is generated at the truncation layer.
Further, in step S03, the full space-time convolution module has two paths: the left path extracts the spatial information of the continuous video frames, the right path extracts their full temporal features, and finally the local-space and full-temporal features are superimposed to generate a fused feature with both local space and full time sequence.
Further, the full space-time convolution module proceeds as follows:
s1: in the left path, a 3D convolution kernel of size (3, 3, 1) (i.e., K_{3×3×1}, the dimensions denoting spatial width, spatial height, and temporal length) extracts the spatial features of the input time-sequence feature map.
s2: in the right path, the input time-sequence feature map is split along the channel dimension, so that X can be expressed as X = [X_1, X_2, …, X_c], with X_i ∈ R^{W×H×T}.
s3: each X_i, i ∈ [1, c], is treated as a 2D feature map of spatial size W×H whose T frames act as input channels; a 2D convolution kernel K_{3×3} with T output channels then fuses the full temporal features of X_i.
s4: the spatial features of the time-sequence feature map extracted in S1 and the full temporal features extracted in S3 are superimposed to generate fused features with both local space and full time sequence.
Further, the flow of the full space-time convolution module can be represented by the following notation:
X_in = [X_1, X_2, …, X_c]
X_in-c = BN(X_in * W_{3×3×1}, filters)
X_full-Ti = Mish(X_i * W_{3×3,T}), i ∈ [1, c]
X_concat = Concat(X_full-Ti | i ∈ [1, c])
X_full-c = BN(X_concat * W_{1×1×1}, filters)
X_add = Add(X_full-c, X_in-c)
Y_out = Mish(X_add).
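The flow of the module can be traced at the level of tensor shapes. The sketch below assumes 'same' padding and that the 1×1×1 convolution restores the input channel count (both are assumptions; the patent does not state padding or filter counts):

```python
def full_spacetime_module_shapes(W, H, T, C):
    """Trace tensor shapes through the full space-time convolution module,
    assuming 'same' padding so spatial and temporal sizes are preserved."""
    shapes = {"X_in": (W, H, T, C)}
    # Left path: 3D conv W_{3x3x1} with C filters + BatchNorm
    shapes["X_in-c"] = (W, H, T, C)
    # Right path: split into C slices X_i of shape (W, H, T); a 2D conv
    # K_{3x3} treats the T frames as input channels and emits T channels,
    # so every output frame mixes information from all T input frames
    shapes["X_full-Ti"] = (W, H, T)
    # Concat the C fused slices back along the channel dimension
    shapes["X_concat"] = (W, H, T, C)
    # 1x1x1 conv + BatchNorm adjusts channels to match the left path
    shapes["X_full-c"] = (W, H, T, C)
    assert shapes["X_full-c"] == shapes["X_in-c"], "paths must match to Add"
    # Element-wise Add of the two paths, then Mish, leaves the shape intact
    shapes["Y_out"] = (W, H, T, C)
    return shapes

shapes = full_spacetime_module_shapes(W=28, H=28, T=15, C=128)
```

The shape identity at the Add step is what lets the module be stacked with 3D pooling layers, as step S04 describes.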
Compared with the prior art, the invention has the following beneficial effects:
Aiming at the inability of 2D-CNN+LSTM and 3D-CNN-based algorithms to capture both the local spatial and the full temporal features of continuous video frames in violent video detection, the invention designs a full space-time convolution module that effectively improves detection accuracy. The method generalizes well to sample data that, like surveillance video, is low in resolution, high in noise, crowded with people, and hard to describe at the level of individual behavior; it can effectively extract the local spatial and full temporal features of continuous video frames and judge violent behavior in video more accurately. Moreover, with the rapid growth of short-video users in China in recent years, content review of user-uploaded videos has become more difficult, and violent videos harm the healthy development of the online world.
Drawings
FIG. 1 is a flow chart of the sampling of successive video frames of the present invention;
FIG. 2 is a network architecture diagram of the spatial high-level semantic extractor of the present invention;
FIG. 3 is an architecture diagram of the full spatiotemporal convolution module of the present invention;
FIG. 4 is an architecture diagram of the full-temporal feature fusion in the full space-time convolution module of the present invention;
FIG. 5 is a flow chart of the violent video detection algorithm of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-5, an embodiment of the present invention provides a violent video detection algorithm based on a full space-time convolution module, which includes the following steps:
The method comprises the following steps. Step one: selecting 30 consecutive frames in the target video, keeping one frame out of every two (15 frames in total), and resizing each frame to 224×224 pixels, so that the input data has shape [224, 224, 15, 3], where the last dimension indicates that the input images are RGB with 3 channels;
step two: using an Xception sub-network pre-trained on ImageNet as the extractor of high-level semantics of image spatial features; the truncation layer of the sub-network is "add_3", and the original video frames propagate forward through the Xception sub-network, generating a feature map at the truncation layer;
step three: fusing information of the feature map over local space and all time dimensions using the full space-time convolution module;
step four: connecting the full space-time convolution module of step three with 3D pooling layers repeatedly in the deep neural network architecture, fusing full space-time features of time-sequence feature maps at different scales;
step five: inputting the feature map carrying full space-time feature information into a classifier, whose final output has two categories indicating whether violent behavior is present in the input continuous video frames.
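Steps one through five can be summarized by the tensor they hand to the network. The helper below (naming is illustrative, not from the patent) derives the [224, 224, 15, 3] input specification from the sampling in step one:

```python
def build_input_shape(src_frames=30, step=2, height=224, width=224, channels=3):
    """Input shape after sampling: keep one of every `step` frames from
    `src_frames` consecutive frames, resized to width x height RGB."""
    clip_len = src_frames // step  # 30 frames, every other one kept -> 15
    return (height, width, clip_len, channels)

assert build_input_shape() == (224, 224, 15, 3)
```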
Furthermore, the full space-time convolution module has two paths: the left path extracts the spatial information of the continuous video frames, the right path extracts their full temporal features, and finally the local-space and full-temporal features are superimposed to generate a fused feature with both. Define X ∈ R^{W×H×T×C} as the continuous feature map input to the module, where W, H, T, and C denote the width, height, temporal length, and number of channels of the feature map, respectively. The module comprises the following specific steps:
s1: in the left path, a 3D convolution kernel of size (3, 3, 1) (i.e., K_{3×3×1}, the dimensions denoting spatial width, spatial height, and temporal length) extracts the spatial features of the input time-sequence feature map;
s2: in the right path, the input time-sequence feature map is split along the channel dimension, so that X can be expressed as X = [X_1, X_2, …, X_c], with X_i ∈ R^{W×H×T};
s3: each X_i, i ∈ [1, c], is treated as a 2D feature map of spatial size W×H whose T frames act as input channels; a 2D convolution kernel K_{3×3} with T output channels then fuses the full temporal features of X_i;
s4: the spatial features of the time-sequence feature map extracted in S1 and the full temporal features extracted in S3 are superimposed to generate fused features with both local space and full time sequence.
The flow of the full space-time convolution module can be expressed by the following notation:
X_in = [X_1, X_2, …, X_c]
X_in-c = BN(X_in * W_{3×3×1}, filters)
X_full-Ti = Mish(X_i * W_{3×3,T}), i ∈ [1, c]
X_concat = Concat(X_full-Ti | i ∈ [1, c])
X_full-c = BN(X_concat * W_{1×1×1}, filters)
X_add = Add(X_full-c, X_in-c)
Y_out = Mish(X_add)
where X_in denotes the input continuous feature map; X_in-c is the left-path output after spatial feature extraction and BatchNorm processing; X_full-Ti is the result of fusing the full temporal features of X_i, processed by the Mish activation function; X_concat splices the X_full-Ti along the channel dimension, after which a 1×1×1 convolution kernel adjusts the number of channels of the feature map, giving the final right-path output X_full-c; X_add superimposes the spatial features extracted by the left path and the full-temporal features extracted by the right path; and the output Y_out of the full space-time convolution module is finally obtained through the Mish activation function.
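The module applies the Mish activation twice. The patent does not restate its definition; the standard formula is mish(x) = x · tanh(softplus(x)), sketched here with a guard against overflow for large inputs:

```python
import math

def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x)), softplus(x) = ln(1 + e^x)."""
    # For large x, softplus(x) ~= x; computing exp(x) directly would overflow
    softplus = x if x > 20 else math.log1p(math.exp(x))
    return x * math.tanh(softplus)

# Mish is smooth, non-monotonic, and unbounded above, bounded below
values = [mish(v) for v in (-2.0, 0.0, 1.0, 5.0)]
```

Unlike ReLU, Mish keeps a small negative response for negative inputs, which is often cited as helping gradient flow.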
In summary: the invention provides a violent video detection algorithm based on a full space-time convolution module, realizing feature fusion of time-sequence features over local space and the full time sequence. Compared with conventional 2D-CNN+LSTM and 3D-CNN-based algorithms, it has clear advantages in the field of violent video detection. The full space-time convolution module can also be applied to other network architectures, making full use of its spatio-temporal feature fusion capability to achieve better video behavior classification.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (6)
1. A violent video detection algorithm based on a full space-time convolution module is characterized by comprising the following steps:
s01: selecting a random starting point in a target video;
s02: using an Xception sub-network pre-trained on ImageNet as an extractor of high-level semantics of image spatial features;
s03: fusing information of the feature map over local space and all time dimensions using a full space-time convolution module;
s04: connecting the full space-time convolution module of S03 with 3D pooling layers repeatedly in the deep neural network architecture, fusing full space-time features of time-sequence feature maps at different scales;
s05: inputting the feature map carrying full space-time feature information into a classifier, whose final output has two categories indicating whether violent behavior is present in the input continuous video frames.
2. The violent video detection algorithm based on the full space-time convolution module as claimed in claim 1, characterized in that: in step S01, after the random starting point is selected, one video frame is kept out of every two (i.e., every other frame), 15 frames in total, and these 15 sampled frames are used as the input.
3. The violent video detection algorithm based on the full space-time convolution module as claimed in claim 1, characterized in that: in step S02, the Xception sub-network is a feature extraction network whose truncation layer is "add_3"; the original video frames propagate forward through the Xception sub-network and generate a feature map at the truncation layer.
4. The violent video detection algorithm based on the full space-time convolution module as claimed in claim 1, characterized in that: in step S03, the full space-time convolution module has two paths: the left path extracts spatial information of the continuous video frames, the right path extracts their full temporal features, and finally the local-space and full-temporal features are superimposed to generate a fused feature with both local space and full time sequence.
5. The violent video detection algorithm based on the full space-time convolution module as claimed in claim 4, characterized in that the full space-time convolution module proceeds as follows:
s1: in the left path, a 3D convolution kernel of size (3, 3, 1) (i.e., K_{3×3×1}, the dimensions denoting spatial width, spatial height, and temporal length) extracts the spatial features of the input time-sequence feature map;
s2: in the right path, the input time-sequence feature map is split along the channel dimension, so that X can be expressed as X = [X_1, X_2, …, X_c], with X_i ∈ R^{W×H×T};
s3: each X_i, i ∈ [1, c], is treated as a 2D feature map of spatial size W×H whose T frames act as input channels, and a 2D convolution kernel K_{3×3} with T output channels fuses the full temporal features of X_i;
s4: the spatial features of the time-sequence feature map extracted in S1 and the full temporal features extracted in S3 are superimposed to generate fused features with both local space and full time sequence.
6. The violent video detection algorithm based on the full space-time convolution module as claimed in claim 5, characterized in that the flow of the full space-time convolution module can be represented by the following notation:
X_in = [X_1, X_2, …, X_c]
X_in-c = BN(X_in * W_{3×3×1}, filters)
X_full-Ti = Mish(X_i * W_{3×3,T}), i ∈ [1, c]
X_concat = Concat(X_full-Ti | i ∈ [1, c])
X_full-c = BN(X_concat * W_{1×1×1}, filters)
X_add = Add(X_full-c, X_in-c)
Y_out = Mish(X_add).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011619964.6A CN112668495B (en) | 2020-12-30 | 2020-12-30 | Full-time space convolution module-based violent video detection algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112668495A (publication) | 2021-04-16
CN112668495B (grant) | 2024-02-02
Family
ID=75412098
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011619964.6A Active CN112668495B (en) | 2020-12-30 | 2020-12-30 | Full-time space convolution module-based violent video detection algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112668495B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107463949A (en) * | 2017-07-14 | 2017-12-12 | 北京协同创新研究院 | A kind of processing method and processing device of video actions classification |
CN107609460A (en) * | 2017-05-24 | 2018-01-19 | 南京邮电大学 | A kind of Human bodys' response method for merging space-time dual-network stream and attention mechanism |
CN109446923A (en) * | 2018-10-10 | 2019-03-08 | 北京理工大学 | Depth based on training characteristics fusion supervises convolutional neural networks Activity recognition method |
CN110175580A (en) * | 2019-05-29 | 2019-08-27 | 复旦大学 | A kind of video behavior recognition methods based on timing cause and effect convolutional network |
CN111353395A (en) * | 2020-02-19 | 2020-06-30 | 南京信息工程大学 | Face changing video detection method based on long-term and short-term memory network |
- 2020-12-30: application CN202011619964.6A filed in China; granted as patent CN112668495B (status: Active)
Non-Patent Citations (4)
- YUTONG CAI et al., "Multi-scale spatiotemporal information fusion network for video action recognition", 2018 IEEE Visual Communications and Image Processing, pp. 1-4.
- ZHENHUA T. et al., "FTCF: Full temporal cross fusion network for violence detection in videos", Applied Intelligence, vol. 53, pp. 4218-4230.
- XIA Qingpei, "Human body behavior recognition based on deep learning", CNKI Outstanding Master's Theses Full-text Database (Information Science and Technology), no. 4, pp. 138-886.
- TAN Dengtai et al., "A behavior recognition model with multi-feature fusion", Journal of Image and Graphics, vol. 25, no. 12, pp. 2541-2552.
Also Published As
Publication number | Publication date |
---|---|
CN112668495B (en) | 2024-02-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |