CN112149504A - Motion video recognition method combining a mixed convolution residual network and attention - Google Patents

Motion video recognition method combining a mixed convolution residual network and attention Download PDF

Info

Publication number
CN112149504A
CN112149504A
Authority
CN
China
Prior art keywords
convolution
layer
attention
feature map
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010849991.6A
Other languages
Chinese (zh)
Other versions
CN112149504B (en)
Inventor
杨慧敏
田秋红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202010849991.6A priority Critical patent/CN112149504B/en
Publication of CN112149504A publication Critical patent/CN112149504A/en
Application granted granted Critical
Publication of CN112149504B publication Critical patent/CN112149504B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Psychiatry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a motion video recognition method combining a mixed convolution residual network and attention. The method comprises the following steps: 1) reading the motion of a person in a motion video and converting the motion video into original video frame images; 2) performing data enhancement on the video frames of the motion video using temporal sampling, random cropping and brightness adjustment to obtain video frame images; 3) constructing an attention module, building mixed convolution blocks with the attention module, cascading the mixed convolution blocks into a mixed convolution residual network model based on the combination of a mixed convolution residual network and attention, and performing spatio-temporal feature learning on the video frame images with this model to obtain key feature maps; 4) classifying the key feature maps with a Softmax classification layer. The invention can preserve the feature information of video frames while increasing network depth, fully fuse spatio-temporal features, improve the correlation of important channel features, and effectively improve the prediction performance of action recognition.

Description

Motion video recognition method combining a mixed convolution residual network and attention
Technical Field
The invention belongs to the technical field of intelligent video analysis, and particularly relates to a motion video recognition method based on the combination of a mixed convolution residual network and an attention mechanism.
Background
Motion recognition has application value in video processing, pattern recognition, virtual reality and other areas, and is one of the important research topics in the field of computer vision. Motion recognition in video is a key issue in video understanding tasks. It requires not only capturing features in the spatial dimension, but also encoding the temporal relationships between many consecutive frames. Therefore, effectively extracting high-resolution spatio-temporal features from motion video is of great significance for improving the accuracy of motion recognition. However, a video is a continuous frame sequence with temporal relationships, each pixel is highly similar to its neighboring pixels, and the spatio-temporal correlation is very strong. The traditional convolutional neural network has excellent feature extraction performance on single images, but cannot extract spatio-temporal features from video.
When the video input is a sequence of images, there are currently three main approaches: (1) 2D CNNs combined with RNN/LSTM, (2) two-stream CNNs, and (3) 3D CNNs. Two-stream CNNs use two independent networks to capture spatial appearance and temporal motion information. Although this method works well, it cannot effectively mix appearance and motion information because the two networks are trained separately. RNN/LSTM is better at processing sequence information and is therefore often combined with a CNN to handle action recognition. However, this type of approach only preserves the top-level features and ignores dependencies in the bottom-level features. Using 3D CNNs to obtain spatio-temporal information is an effective method. However, the 3D CNN model has a huge number of parameters and contains a large amount of redundant spatial data, and training 3D CNNs is a very challenging task. In recent years, many studies have attempted to introduce attention mechanisms from different perspectives to enhance the robustness of behavior recognition. However, stacking attention mechanisms in deep networks can result in repeated dot products, thereby reducing the value of the features.
Disclosure of Invention
In order to solve the problems in the background art, the invention aims to provide a motion recognition method for motion video based on the combination of a mixed convolution residual network and an attention mechanism. An MC-RAN module is designed which, based on a mixed convolution residual network, decouples the 3D convolution into a 2D convolution and a 1D convolution and fuses them with an adaptive spatial attention module M_SS and a channel attention module M_CS respectively, improving the correlation of important channel features and increasing the global correlation of the feature map, thereby improving the performance of action recognition.
The technical scheme adopted by the invention is as follows:
the invention comprises the following steps:
1) reading the motion of a person in the motion video, and converting the motion video into an original video frame image;
2) performing data enhancement on the video frames of the motion video using temporal sampling, random cropping and brightness adjustment respectively, to obtain video frame images;
Step 2) is specifically as follows:
Temporal sampling: for each motion video, 16 consecutive frames are randomly sampled for training; if a video has fewer than 16 consecutive frames, it is played in a loop until 16 frames are obtained;
Random cropping: the original video frame images are resized to 128 × 171 pixels and then randomly cropped to 112 × 112 pixels;
Brightness adjustment: the brightness of the original video frame images is adjusted randomly.
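A minimal sketch of these three augmentations, assuming each clip is supplied as a list of H×W×3 uint8 frames; the function names, the 0.8 to 1.2 brightness range, and the use of OpenCV/NumPy are illustrative choices rather than details given in the patent.

```python
import random
import numpy as np
import cv2

CLIP_LEN, RESIZE_WH, CROP = 16, (171, 128), 112   # cv2.resize takes (width, height) -> 128 x 171 frames

def temporal_sample(frames):
    """Loop the clip until it has at least 16 frames, then pick a random 16-frame window."""
    while len(frames) < CLIP_LEN:
        frames = frames + frames                   # circular playback until 16 frames are available
    start = random.randint(0, len(frames) - CLIP_LEN)
    return frames[start:start + CLIP_LEN]

def random_crop_and_brightness(frames):
    """Resize to 128 x 171, take one random 112 x 112 crop, and apply one brightness gain to the clip."""
    frames = [cv2.resize(f, RESIZE_WH) for f in frames]
    h, w = frames[0].shape[:2]
    top = random.randint(0, h - CROP)
    left = random.randint(0, w - CROP)
    gain = random.uniform(0.8, 1.2)                # illustrative brightness-adjustment range
    out = []
    for f in frames:
        patch = f[top:top + CROP, left:left + CROP]
        out.append(np.clip(patch.astype(np.float32) * gain, 0, 255).astype(np.uint8))
    return out

# usage: clip = random_crop_and_brightness(temporal_sample(list_of_frames))
```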
3) Constructing an attention module, building a mixed convolution block with the attention module, cascading the mixed convolution blocks into a mixed convolution residual network model based on the combination of a mixed convolution residual network and attention, and performing spatio-temporal feature learning on the video frame images with the mixed convolution residual network model to obtain key feature maps;
The mixed convolution block is expressed as:
X_{t+1} = X_t + W(X_t)
where X_t and X_{t+1} represent the input and output of the t-th MC-RAN module; X_t and X_{t+1} have the same feature dimensions, and W represents the mixed convolution residual function with the attention mechanism added;
Step 3) is specifically as follows: the 3D ResNet network structure is selected as the basic network structure, and the original 3D convolution modules in the 3D ResNet structure are replaced by a first convolution layer and four mixed convolution blocks, each mixed convolution block comprising an MC-RAN module and an addition layer. The MC-RAN module comprises a (2+1)D convolution layer, a first batch normalization layer, a first ReLU activation layer, a 3D convolution layer and a second batch normalization layer, where the (2+1)D convolution layer is a 2D convolution layer with the attention module added. The input X_t of the mixed convolution block is fed into the MC-RAN module, the output feature map of the MC-RAN module is added to the input X_t through the addition layer, and the sum, processed by the second ReLU activation layer, becomes the output X_{t+1} of the mixed convolution block. After each mixed convolution block, a 3D max pooling layer is cascaded for down-sampling.
The i-th 3D convolution layer of size N_{i-1} × t × d × d consists of M_i second 2D convolution layers of size N_{i-1} × 1 × d × d and N_i temporal convolution layers of size M_i × t × 1 × 1, where M_i is calculated by the following formula:
[Formula for M_i, given as an image in the original document]
where d represents the width and height dimension parameter of the 3D convolution layer's output feature map, t represents the temporal extent, and ⌊ ⌋ denotes rounding down.
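The M_i formula itself appears only as an image in the source. Assuming it follows the standard (2+1)D parameter-matching rule, where M_i is chosen so that the decomposed 2D-plus-1D pair has roughly the same parameter count as the original N_{i-1} × t × d × d 3D kernel bank, a small helper might look like the sketch below; the exact expression is an assumption, not a transcription of the patent's formula.

```python
import math

def mixed_conv_mid_channels(n_in: int, n_out: int, t: int, d: int) -> int:
    """Intermediate channel count M_i for decomposing an N_{i-1} x t x d x d 3D convolution
    into M_i spatial (1 x d x d) filters followed by N_i temporal (t x 1 x 1) filters,
    keeping the parameter count roughly equal to the original 3D convolution.
    Assumed rule: M_i = floor(t * d^2 * N_{i-1} * N_i / (d^2 * N_{i-1} + t * N_i))."""
    return math.floor(t * d * d * n_in * n_out / (d * d * n_in + t * n_out))

# example: a 64 -> 64 layer with t = 3, d = 3 gives floor(3*9*64*64 / (9*64 + 3*64)) = 144
```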
The (2+1)D convolution layer mainly comprises a first 2D convolution layer, a spatial attention module M_SS, a temporal convolution layer and a channel attention module M_CS connected in cascade; the spatial attention module M_SS and the channel attention module M_CS together form the attention module.
The spatial attention module M_SS obtains the spatial weight map W_SS of the input feature map in the spatial dimension through a third 2D convolution layer; the channel attention module M_CS obtains the channel weight map W_CS of the input feature map in the channel dimension through an added multi-layer perceptron.
The spatial attention module M_SS is constructed as follows: the input feature map F has size C × H × W, where C represents the number of channels of each frame image in the input feature map, and H and W represent the width and height of each frame image. First, global average pooling is used to compress the channel dimension of each frame image in the input feature map, generating a 2D spatial descriptor Z of size 1 × H × W. Then, a third 2D convolution layer convolves the 2D spatial descriptor Z to obtain the target region of interest in the input feature map. Finally, a third batch normalization layer is added after the third 2D convolution layer to perform a dimension transformation on the target region of interest, giving the spatial attention weight map W_SS.
The spatial attention weight map W_SS can be expressed as:
W_SS(F) = BN(σ(f^{7×7}(Avgpool(F))))
where BN() denotes batch normalization, σ() denotes the sigmoid activation function, f^{7×7}() denotes a convolution operation with a 7 × 7 kernel, Avgpool() denotes global average pooling, and F denotes the input feature map;
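A sketch of the spatial attention module M_SS for clip-shaped features (N, C, T, H, W): channel-wise average pooling to a 1 × H × W descriptor per frame, a 7 × 7 2D convolution, a sigmoid and batch normalization, following the W_SS formula above. Treating each frame independently and multiplying the result back onto the feature map are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """W_SS(F) = BN(sigmoid(conv7x7(AvgPool_channel(F)))), computed per frame."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=7, padding=3, bias=False)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, f):                                           # f: (N, C, T, H, W)
        n, c, t, h, w = f.shape
        z = f.mean(dim=1, keepdim=True)                             # channel compression -> (N, 1, T, H, W)
        z = z.permute(0, 2, 1, 3, 4).reshape(n * t, 1, h, w)        # fold frames into the batch dimension
        wss = self.bn(torch.sigmoid(self.conv(z)))                  # (N*T, 1, H, W) spatial weight map
        return wss.reshape(n, t, 1, h, w).permute(0, 2, 1, 3, 4)    # back to (N, 1, T, H, W)

# assumed usage: weighted = f * SpatialAttention()(f)
```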
the channel attention module MCSThe construction method specifically comprises the following steps: when the size of the input feature map Q is C multiplied by H multiplied by W, and C represents the number of channels of each frame image in the input feature map, firstly, carrying out global average pooling operation on the input feature map Q to generate a channel vector Q' with the size of 1 multiplied by C; subsequently, the channel vector Q 'is processed using a multi-layer perceptron to learn weights of the channel vector Q';
the channel vector Q' can be calculated by the following formula:
Figure BDA0002644396770000031
wherein F (i, j) represents a feature map at coordinates (i, j), i represents a pixel point at dimension H, and j represents a pixel point at dimension W;
finally, adding a fourth batch normalization layer behind the multilayer perceptron to perform dimension conversion to obtain a channel attention weight graph WCS
Channel attention weight graph WCSCan be expressed as:
WCS(F)=BN(MLP(Avgpool(F)))=BN(σ(W1((W0Avgpool(F)+b0)+b1)))
wherein MLP () represents a multilayer perceptron with hidden layers, W0And W1Is the weight of MLP () with the size of C/r × C and C × C/r, r is the compression ratio, () is the linear correction unit, b0And b1And bias terms representing MLP () with the sizes of C/r and C, respectively.
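A sketch of the channel attention module M_CS for clip-shaped features: global average pooling to a channel vector, a two-layer perceptron with hidden size C/r, a sigmoid and batch normalization, following the W_CS formula above. The squeeze-and-excitation-style layout (ReLU between the two linear layers) and the broadcastable output shape are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """W_CS(F) = BN(sigmoid(W1 * ReLU(W0 * AvgPool(F) + b0) + b1)), computed per clip."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),   # W0: (C/r) x C
            nn.ReLU(inplace=True),                        # delta(): linear rectification unit
            nn.Linear(channels // reduction, channels),   # W1: C x (C/r)
        )
        self.bn = nn.BatchNorm1d(channels)

    def forward(self, f):                                  # f: (N, C, T, H, W)
        q = f.mean(dim=(2, 3, 4))                          # global average pooling -> (N, C) channel vector
        wcs = self.bn(torch.sigmoid(self.mlp(q)))          # channel weight map, (N, C)
        return wcs.view(f.size(0), f.size(1), 1, 1, 1)     # broadcastable over T, H, W

# assumed usage: weighted = f * ChannelAttention(f.size(1))(f)
```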
4) The key feature maps are classified using a Softmax classification layer.
Step 4) is specifically as follows: after the video frame images pass through the four MC-RAN modules, the spatio-temporal features in the video frame images are fused, the mixed convolution residual network model obtains the key features, and the key feature maps are input into a Softmax layer for classification.
The input feature map of the first MC-RAN module is the output feature map obtained after the video frame images from step 2) pass through the first convolution layer, and the input feature map of each subsequent MC-RAN module is the output feature map of the previous MC-RAN module after passing through the 3D max pooling layer.
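Putting the pieces together, a sketch of the overall model: a first convolution layer, four mixed convolution blocks each followed by 3D max pooling, global pooling, and a Softmax classification layer. MCRANBlock, SpatialAttention and ChannelAttention refer to the sketches above; the channel widths, kernel sizes and the 1 × 1 × 1 projection convolutions are illustrative placeholders, since the patent's exact layer table is provided only as an image.

```python
import torch
import torch.nn as nn

class MCRANNet(nn.Module):
    """Stem conv -> four mixed convolution blocks (each followed by 3D max pooling)
    -> global average pooling -> Softmax classifier. Widths are illustrative."""
    def __init__(self, num_classes: int = 101, widths=(64, 64, 128, 256, 512)):
        super().__init__()
        self.stem = nn.Conv3d(3, widths[0], kernel_size=(3, 7, 7),
                              stride=(1, 2, 2), padding=(1, 3, 3))         # spatial down-sampling at conv1
        layers, in_ch = [], widths[0]
        for out_ch in widths[1:]:
            layers += [
                nn.Conv3d(in_ch, out_ch, kernel_size=1),                   # channel projection (placeholder)
                MCRANBlock(out_ch, out_ch, SpatialAttention(), ChannelAttention(out_ch)),
                nn.MaxPool3d(kernel_size=2, stride=2),                     # 3D max pooling after each block
            ]
            in_ch = out_ch
        self.blocks = nn.Sequential(*layers)
        self.head = nn.Linear(in_ch, num_classes)

    def forward(self, x):                        # x: (N, 3, 16, 112, 112)
        x = self.stem(x)
        x = self.blocks(x)
        x = x.mean(dim=(2, 3, 4))                # global spatio-temporal average pooling
        return self.head(x)                      # class scores; Softmax is applied at classification time

    def classify(self, x):
        """Softmax classification layer applied to the key features."""
        return torch.softmax(self.forward(x), dim=1)
```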
The invention has the beneficial effects that:
1) The MC-RAN module is designed. Based on a mixed convolution residual network, the 2D and 1D convolutions obtained by decoupling the 3D convolution are fused with the adaptive spatial attention module and the channel attention module respectively, so that spatio-temporal features are fully fused, the correlation of important channel features is improved, and the global correlation of the feature map is increased, thereby improving the performance of behavior recognition.
2) The mixed convolution residual network model provided by the invention can preserve feature information while increasing the network depth. Comparative experiments were carried out on the common datasets UCF101 and HMDB51; after pre-training on the Kinetics dataset, the Top-1 accuracies on the UCF101 and HMDB51 test sets reach 96.8% and 74.8% respectively.
Drawings
FIG. 1 is an example of a partial data set according to an embodiment of the present invention;
FIG. 2 is a block diagram of an embodiment of the present invention;
FIG. 3 is a block diagram of a spatial attention module according to an embodiment of the present invention;
FIG. 4 is a block diagram of a channel attention module according to an embodiment of the present invention;
FIG. 5 is a block cascade diagram of a hybrid convolution according to an embodiment of the present invention;
FIG. 6 shows feature map visualizations of an embodiment of the present invention; (a), (b), (c) and (d) are original video frames; (e), (f), (g) and (h) are the corresponding feature maps.
Detailed Description
The invention is further illustrated by the following figures and examples.
The invention provides a motion video recognition method combining a mixed convolution residual network and attention, using the open-source UCF101 dataset as the experimental dataset; part of the dataset is shown in FIG. 1. The figure shows video frame images converted from one motion video; the images are saved in jpg format with a final picture size of 320 × 240.
The embodiment of the invention is as follows:
step 1: the motion video is read by using a VideoCapture function in Opencv, and the read motion video is converted into a video frame image of the motion video, where a part of the video frame image of the motion video is shown in fig. 1.
Step 2: The data are first preprocessed for motion recognition, and the model is then pre-trained on the Kinetics dataset rather than trained from scratch, in order to improve its accuracy.
2.1) The data preprocessing for motion recognition is as follows:
Data enhancement is performed on the video frames of the motion video using temporal sampling, random cropping and brightness adjustment respectively, to obtain video frame images;
Temporal sampling: for each motion video, 16 consecutive frames are randomly sampled for training; if a video has fewer than 16 consecutive frames, it is played in a loop until 16 frames are obtained;
Random cropping: the original video frame images are resized to 128 × 171 pixels and then randomly cropped to 112 × 112 pixels;
Brightness adjustment: the brightness of the original video frame images is adjusted randomly.
2.2) The model pre-training process for motion recognition is as follows:
The preprocessed video frame images are input into the mixed convolution residual network model for feature extraction in the spatial and channel dimensions; each input clip has size 16 × 112 × 112 × 3, and the output is a category label. The loss value is optimized with stochastic gradient descent (SGD), with an initial learning rate of 0.01 that is divided by 10 when the validation loss saturates. The momentum coefficient is 0.9, the dropout coefficient is 0.5, and the weight decay is 10^-3. Training is accelerated with batch normalization and runs on a server with 8 Tesla V100 GPUs, with a batch size of 8 per GPU and a total batch size of 64.
Step 3: The attention module is constructed. The attention mechanism in the attention module is used to focus on the positions suggested by prior knowledge, remove the interference of background and noise on motion recognition, and automatically assign different attention to different positions of the input feature map according to that prior knowledge.
A mixed convolution block is built with the attention module, the mixed convolution blocks are cascaded to construct a mixed convolution residual network model based on the combination of a mixed convolution residual network and attention, and spatio-temporal feature learning is performed on the video frame images with this model to obtain key feature maps.
The mixed convolution block is expressed as:
X_{t+1} = X_t + W(X_t)
where X_t and X_{t+1} represent the input and output of the t-th MC-RAN module; X_t and X_{t+1} have the same feature dimensions, and W represents the mixed convolution residual function with the attention mechanism added.
Step 3) is specifically as follows: the 3D ResNet network structure is selected as the basic network structure, and the original 3D convolution modules in the 3D ResNet structure are replaced by a first convolution layer and four mixed convolution blocks, each comprising an MC-RAN module and an addition layer. The MC-RAN module comprises a (2+1)D convolution layer, a first batch normalization layer, a first ReLU activation layer, a 3D convolution layer and a second batch normalization layer connected in sequence. The input X_t of the mixed convolution block is fed into the MC-RAN module, the output feature map of the MC-RAN module is added to the input X_t through the addition layer, and the sum, processed by the second ReLU activation layer, becomes the output X_{t+1} of the mixed convolution block; each mixed convolution block is followed by a cascaded 3D max pooling layer for down-sampling.
a. The i-th 3D convolution layer of size N_{i-1} × t × d × d consists of M_i second 2D convolution layers of size N_{i-1} × 1 × d × d and N_i temporal convolution layers of size M_i × t × 1 × 1, where M_i is calculated by the following formula:
[Formula for M_i, given as an image in the original document]
where d represents the width and height dimension parameter of the 3D convolution layer's output feature map, t represents the temporal extent, and ⌊ ⌋ denotes rounding down;
b. Spatial down-sampling is performed at the first convolution layer conv1 with a stride of 1 × 2 × 2. In the third, fourth and fifth mixed convolution blocks conv3_1, conv4_1 and conv5_1, the first 2D convolution layer and the temporal convolution layer of the (2+1)D convolution perform spatio-temporal down-sampling with strides of 1 × 2 × 2 and 2 × 1 × 1 respectively. Table 1 shows the network structure of the first convolution layer and the mixed convolution blocks.
Table 1 shows the network layer structure of the first convolution layer and the hybrid convolution block.
[Table 1 is provided as an image in the original document.]
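As an illustration of the stride scheme just described (the full layer table exists only as an image), a hedged sketch of the down-sampling (2+1)D pair used in conv3_1, conv4_1 and conv5_1 and of the first convolution layer; channel counts and kernel sizes are placeholders, not values taken from Table 1.

```python
import torch.nn as nn

def downsampling_2plus1d(in_ch: int, mid_ch: int, out_ch: int) -> nn.Sequential:
    """(2+1)D pair for conv3_1 / conv4_1 / conv5_1: the 2D spatial convolution
    down-samples with stride 1x2x2 and the 1D temporal convolution with stride 2x1x1."""
    return nn.Sequential(
        nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3),
                  stride=(1, 2, 2), padding=(0, 1, 1)),    # spatial down-sampling
        nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1),
                  stride=(2, 1, 1), padding=(1, 0, 0)),    # temporal down-sampling
    )

# first convolution layer conv1: spatial down-sampling only, stride 1x2x2 (kernel size illustrative)
conv1 = nn.Conv3d(3, 64, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3))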
c. The mixed convolution block cascade is shown in FIG. 5. The (2+1)D convolution layer is a 2D convolution layer with the attention module added; it mainly comprises a first 2D convolution layer, a spatial attention module M_SS, a temporal convolution layer and a channel attention module M_CS connected in cascade. The attention module applies attention to the spatial and channel dimensions of the input feature map respectively, and the spatial attention module M_SS and the channel attention module M_CS together constitute the attention module.
The spatial attention module M_SS obtains the spatial weight map W_SS of the input feature map in the spatial dimension through a third 2D convolution kernel; the channel attention module M_CS obtains the channel weight map W_CS of the input feature map in the channel dimension through an added multi-layer perceptron.
The spatial attention module M_SS is constructed as follows: the input feature map F has size C × H × W, where C represents the number of channels of each frame image in the input feature map, and H and W represent the width and height of each frame image. First, global average pooling compresses the channel dimension of each frame image, generating a 2D spatial descriptor Z of size 1 × H × W, whose element at coordinates (i, j) is calculated as:
Z(i, j) = (1 / C) · Σ_{k=1}^{C} F_{i,j}(k)
where F_{i,j}(k) denotes the feature value of the k-th channel at coordinates (i, j), i indexes pixel positions along the H dimension and j indexes pixel positions along the W dimension. Then, a third 2D convolution layer with a 7 × 7 kernel convolves the 2D spatial descriptor to obtain the target region of interest in the input feature map. Finally, a third batch normalization layer is added after the third 2D convolution layer to perform a dimension transformation on the target region of interest, giving the spatial attention weight map W_SS.
The spatial attention weight map W_SS can be expressed as:
W_SS(F) = BN(σ(f^{7×7}(Avgpool(F))))
where BN() denotes batch normalization, σ() denotes the sigmoid activation function, f^{7×7}() denotes a convolution operation with a 7 × 7 kernel, Avgpool() denotes global average pooling, and F denotes the input feature map.
The channel attention module M_CS is constructed as follows: the input feature map Q has size C × H × W, where C represents the number of channels of each frame image in the input feature map. First, a global average pooling operation is performed on the input feature map Q to generate a channel vector Q' of size 1 × C. The channel vector Q' is then processed by a multi-layer perceptron FC with a hidden layer to learn its weights, which serve as the channel correlations. To limit the complexity of the channel attention module and save parameters, the size of the hidden activation layer is set to 1 × 1 × C/r, where r is the compression ratio, set to 16.
The channel vector Q' can be calculated by the following formula:
Q' = (1 / (H × W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} F(i, j)
where F(i, j) denotes the feature map at coordinates (i, j), i indexes pixel positions along the H dimension and j indexes pixel positions along the W dimension.
Finally, a fourth batch normalization layer is added after the multi-layer perceptron to perform a dimension transformation, giving the channel attention weight map W_CS.
The channel attention weight map W_CS can be expressed as:
W_CS(F) = BN(MLP(Avgpool(F))) = BN(σ(W_1 · δ(W_0 · Avgpool(F) + b_0) + b_1))
where MLP() denotes a multi-layer perceptron with a hidden layer, W_0 and W_1 are the MLP weights with sizes C/r × C and C × C/r respectively, σ() is the sigmoid activation function, δ() is the linear rectification unit, and b_0 and b_1 are the MLP bias terms with sizes C/r and C respectively.
Step 4: After the video frame images pass through the first convolution layer and the four mixed convolution blocks, their spatio-temporal features are fused and the mixed convolution residual network model obtains the key features; the feature map visualization after adding the attention module is shown in FIG. 6. The key feature maps are input into a Softmax layer for classification. Each video in the validation set is evaluated with the trained network and a corresponding category label is obtained. After training, the proposed mixed convolution residual network model is compared with different network models; the experimental results are shown in Table 2 and show that the recognition accuracy of both Top-1 and Top-5 increases without increasing the number of parameters of the mixed convolution residual network model.
Table 2 shows the recognition results of the hybrid convolution residual network model compared to other models.
Network model | Parameters | Top-1 recognition rate (%) | Top-5 recognition rate (%) | Average recognition rate (%)
ResNet [39] | 63.72M | 60.1 | 81.9 | 71.0
(2+1)D-ResNet [12] | 63.88M | 66.8 | 88.1 | 77.45
MC-ResNet [28] | 63.88M | 67.3 | 89.2 | 78.25
RAN [26] | 63.97M | 61.7 | 83.2 | 72.45
(2+1)D-RAN | 63.98M | 67.8 | 89.3 | 78.55
MC-RAN | 63.98M | 68.8 | 89.9 | 79.35
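A sketch of the evaluation step described in Step 4: a preprocessed 16-frame clip from a validation video is passed through the trained network and the most probable class is taken as the predicted category label. The preprocessing and the model are assumed to follow the earlier sketches.

```python
import numpy as np
import torch

@torch.no_grad()
def predict_label(model, clip_frames):
    """clip_frames: 16 preprocessed 112x112x3 uint8 frames from one validation video."""
    model.eval()
    clip = np.stack(clip_frames).astype(np.float32) / 255.0         # (16, 112, 112, 3)
    clip = torch.from_numpy(clip).permute(3, 0, 1, 2).unsqueeze(0)  # (1, 3, 16, 112, 112)
    probs = torch.softmax(model(clip), dim=1)                        # Softmax over class scores
    return int(probs.argmax(dim=1))                                  # predicted category label
```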
The foregoing detailed description is intended to illustrate and not to limit the invention; any changes and modifications that fall within the spirit and scope of the invention are intended to be covered by the appended claims.

Claims (6)

1. A motion video recognition method combining a mixed convolution residual network and attention, characterized in that the method comprises the following steps:
1) reading the motion of a person in the motion video, and converting the motion video into an original video frame image;
2) performing data enhancement on the video frames of the motion video using temporal sampling, random cropping and brightness adjustment respectively, to obtain video frame images;
3) constructing an attention module, building a mixed convolution block with the attention module, cascading the mixed convolution blocks into a mixed convolution residual network model based on the combination of a mixed convolution residual network and attention, and performing spatio-temporal feature learning on the video frame images with the mixed convolution residual network model to obtain key feature maps;
the mixed convolution block is expressed as:
X_{t+1} = X_t + W(X_t)
where X_t and X_{t+1} represent the input and output of the t-th MC-RAN module; X_t and X_{t+1} have the same feature dimensions, and W represents the mixed convolution residual function with the attention mechanism added;
4) the key feature maps are classified using a Softmax classification layer.
2. The motion video recognition method combining a mixed convolution residual network and attention according to claim 1, characterized in that step 2) is specifically as follows:
temporal sampling: for each motion video, 16 consecutive frames are randomly sampled for training; if a video has fewer than 16 consecutive frames, it is played in a loop until 16 frames are obtained;
random cropping: the original video frame images are resized to 128 × 171 pixels and then randomly cropped to 112 × 112 pixels;
brightness adjustment: the brightness of the original video frame images is adjusted randomly.
3. The motion video recognition method combining a mixed convolution residual network and attention according to claim 1, characterized in that:
step 3) is specifically as follows: the 3D ResNet network structure is selected as the basic network structure, and the original 3D convolution modules in the 3D ResNet structure are replaced by a first convolution layer and four mixed convolution blocks, each mixed convolution block comprising an MC-RAN module and an addition layer; the MC-RAN module comprises a (2+1)D convolution layer, a first batch normalization layer, a first ReLU activation layer, a 3D convolution layer and a second batch normalization layer, where the (2+1)D convolution layer is a 2D convolution layer with the attention module added; the input X_t of the mixed convolution block is fed into the MC-RAN module, the output feature map of the MC-RAN module is added to the input X_t through the addition layer, and the sum, processed by the second ReLU activation layer, becomes the output X_{t+1} of the mixed convolution block; after each mixed convolution block, a 3D max pooling layer is cascaded for down-sampling;
the i-th 3D convolution layer of size N_{i-1} × t × d × d consists of M_i second 2D convolution layers of size N_{i-1} × 1 × d × d and N_i temporal convolution layers of size M_i × t × 1 × 1, where M_i is calculated by the following formula:
[Formula for M_i, given as an image in the original document]
where d represents the width and height dimension parameter of the 3D convolution layer's output feature map, t represents the temporal extent, and ⌊ ⌋ denotes rounding down.
4. The motion video recognition method combining a mixed convolution residual network and attention according to claim 3, characterized in that:
the (2+1)D convolution layer mainly comprises a first 2D convolution layer, a spatial attention module M_SS, a temporal convolution layer and a channel attention module M_CS connected in cascade; the spatial attention module M_SS and the channel attention module M_CS together form the attention module;
the spatial attention module M_SS obtains the spatial weight map W_SS of the input feature map in the spatial dimension through a third 2D convolution layer; the channel attention module M_CS obtains the channel weight map W_CS of the input feature map in the channel dimension through an added multi-layer perceptron;
the spatial attention module M_SS is constructed as follows: the input feature map F has size C × H × W, where C represents the number of channels of each frame image in the input feature map, and H and W represent the width and height of each frame image; first, global average pooling is used to compress the channel dimension of each frame image in the input feature map, generating a 2D spatial descriptor Z of size 1 × H × W; then, a third 2D convolution layer convolves the 2D spatial descriptor Z to obtain the target region of interest in the input feature map; finally, a third batch normalization layer is added after the third 2D convolution layer to perform a dimension transformation on the target region of interest, giving the spatial attention weight map W_SS;
the spatial attention weight map W_SS can be expressed as:
W_SS(F) = BN(σ(f^{7×7}(Avgpool(F))))
where BN() denotes batch normalization, σ() denotes the sigmoid activation function, f^{7×7}() denotes a convolution operation with a 7 × 7 kernel, Avgpool() denotes global average pooling, and F denotes the input feature map;
the channel attention module M_CS is constructed as follows: the input feature map Q has size C × H × W, where C represents the number of channels of each frame image in the input feature map; first, a global average pooling operation is performed on the input feature map Q to generate a channel vector Q' of size 1 × C; the channel vector Q' is then processed by a multi-layer perceptron to learn its weights;
the channel vector Q' can be calculated by the following formula:
Q' = (1 / (H × W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} F(i, j)
where F(i, j) denotes the feature map at coordinates (i, j), i indexes pixel positions along the H dimension and j indexes pixel positions along the W dimension;
finally, a fourth batch normalization layer is added after the multi-layer perceptron to perform a dimension transformation, giving the channel attention weight map W_CS;
the channel attention weight map W_CS can be expressed as:
W_CS(F) = BN(MLP(Avgpool(F))) = BN(σ(W_1 · δ(W_0 · Avgpool(F) + b_0) + b_1))
where MLP() denotes a multi-layer perceptron with a hidden layer, W_0 and W_1 are the MLP weights with sizes C/r × C and C × C/r, r is the compression ratio, δ() is the linear rectification unit, and b_0 and b_1 are the MLP bias terms with sizes C/r and C respectively.
5. The motion video recognition method combining a mixed convolution residual network and attention according to claim 1, characterized in that step 4) is specifically as follows: after the video frame images pass through the four MC-RAN modules, the spatio-temporal features in the video frame images are fused, the mixed convolution residual network model obtains the key features, and the key feature maps are input into a Softmax layer for classification.
6. The motion video recognition method combining a mixed convolution residual network and attention according to claim 1, characterized in that the input feature map of the first MC-RAN module is the output feature map obtained after the video frame images from step 2) pass through the first convolution layer, and the input feature map of each subsequent MC-RAN module is the output feature map of the previous MC-RAN module after passing through the 3D max pooling layer.
CN202010849991.6A 2020-08-21 2020-08-21 Motion video identification method combining mixed convolution residual network and attention Active CN112149504B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010849991.6A CN112149504B (en) 2020-08-21 2020-08-21 Motion video identification method combining mixed convolution residual network and attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010849991.6A CN112149504B (en) 2020-08-21 2020-08-21 Motion video identification method combining mixed convolution residual network and attention

Publications (2)

Publication Number Publication Date
CN112149504A true CN112149504A (en) 2020-12-29
CN112149504B CN112149504B (en) 2024-03-26

Family

ID=73889023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010849991.6A Active CN112149504B (en) 2020-08-21 2020-08-21 Motion video identification method combining mixed convolution residual network and attention

Country Status (1)

Country Link
CN (1) CN112149504B (en)

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766172A (en) * 2021-01-21 2021-05-07 北京师范大学 Face continuous expression recognition method based on time sequence attention mechanism
CN112800957A (en) * 2021-01-28 2021-05-14 内蒙古科技大学 Video pedestrian re-identification method and device, electronic equipment and storage medium
CN112818843A (en) * 2021-01-29 2021-05-18 山东大学 Video behavior identification method and system based on channel attention guide time modeling
CN112883264A (en) * 2021-02-09 2021-06-01 联想(北京)有限公司 Recommendation method and device
CN113128395A (en) * 2021-04-16 2021-07-16 重庆邮电大学 Video motion recognition method and system based on hybrid convolution and multi-level feature fusion model
CN113139530A (en) * 2021-06-21 2021-07-20 城云科技(中国)有限公司 Method and device for detecting sleep post behavior and electronic equipment thereof
CN113160117A (en) * 2021-02-04 2021-07-23 成都信息工程大学 Three-dimensional point cloud target detection method under automatic driving scene
CN113283338A (en) * 2021-05-25 2021-08-20 湖南大学 Method, device and equipment for identifying driving behavior of driver and readable storage medium
CN113288162A (en) * 2021-06-03 2021-08-24 北京航空航天大学 Short-term electrocardiosignal atrial fibrillation automatic detection system based on self-adaptive attention mechanism
CN113343760A (en) * 2021-04-29 2021-09-03 暖屋信息科技(苏州)有限公司 Human behavior recognition method based on multi-scale characteristic neural network
CN113468531A (en) * 2021-07-15 2021-10-01 杭州电子科技大学 Malicious code classification method based on deep residual error network and mixed attention mechanism
CN113673559A (en) * 2021-07-14 2021-11-19 南京邮电大学 Video character space-time feature extraction method based on residual error network
CN113837263A (en) * 2021-09-18 2021-12-24 浙江理工大学 Gesture image classification method based on feature fusion attention module and feature selection
CN113850135A (en) * 2021-08-24 2021-12-28 中国船舶重工集团公司第七0九研究所 Dynamic gesture recognition method and system based on time shift frame
CN113850182A (en) * 2021-09-23 2021-12-28 浙江理工大学 Action identification method based on DAMR-3 DNet
CN114037930A (en) * 2021-10-18 2022-02-11 苏州大学 Video action recognition method based on space-time enhanced network
CN114140654A (en) * 2022-01-27 2022-03-04 苏州浪潮智能科技有限公司 Image action recognition method and device and electronic equipment
CN114783053A (en) * 2022-03-24 2022-07-22 武汉工程大学 Behavior identification method and system based on space attention and grouping convolution
CN114842542A (en) * 2022-05-31 2022-08-02 中国矿业大学 Facial action unit identification method and device based on self-adaptive attention and space-time correlation
CN115035605A (en) * 2022-08-10 2022-09-09 广东履安实业有限公司 Action recognition method, device and equipment based on deep learning and storage medium
CN115049969A (en) * 2022-08-15 2022-09-13 山东百盟信息技术有限公司 Poor video detection method for improving YOLOv3 and BiConvLSTM
CN116304984A (en) * 2023-03-14 2023-06-23 烟台大学 Multi-modal intention recognition method and system based on contrast learning
CN116416479A (en) * 2023-06-06 2023-07-11 江西理工大学南昌校区 Mineral classification method based on deep convolution fusion of multi-scale image features

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886225A (en) * 2019-02-27 2019-06-14 浙江理工大学 A kind of image gesture motion on-line checking and recognition methods based on deep learning
CN109886090A (en) * 2019-01-07 2019-06-14 北京大学 A kind of video pedestrian recognition methods again based on Multiple Time Scales convolutional neural networks
CN110110646A (en) * 2019-04-30 2019-08-09 浙江理工大学 A kind of images of gestures extraction method of key frame based on deep learning
CN110245593A (en) * 2019-06-03 2019-09-17 浙江理工大学 A kind of images of gestures extraction method of key frame based on image similarity
CN110457524A (en) * 2019-07-12 2019-11-15 北京奇艺世纪科技有限公司 Model generating method, video classification methods and device
CN110807808A (en) * 2019-10-14 2020-02-18 浙江理工大学 Commodity identification method based on physical engine and deep full convolution network
CN111091045A (en) * 2019-10-25 2020-05-01 重庆邮电大学 Sign language identification method based on space-time attention mechanism

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886090A (en) * 2019-01-07 2019-06-14 北京大学 A kind of video pedestrian recognition methods again based on Multiple Time Scales convolutional neural networks
CN109886225A (en) * 2019-02-27 2019-06-14 浙江理工大学 A kind of image gesture motion on-line checking and recognition methods based on deep learning
CN110110646A (en) * 2019-04-30 2019-08-09 浙江理工大学 A kind of images of gestures extraction method of key frame based on deep learning
CN110245593A (en) * 2019-06-03 2019-09-17 浙江理工大学 A kind of images of gestures extraction method of key frame based on image similarity
CN110457524A (en) * 2019-07-12 2019-11-15 北京奇艺世纪科技有限公司 Model generating method, video classification methods and device
CN110807808A (en) * 2019-10-14 2020-02-18 浙江理工大学 Commodity identification method based on physical engine and deep full convolution network
CN111091045A (en) * 2019-10-25 2020-05-01 重庆邮电大学 Sign language identification method based on space-time attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
包嘉欣; 田秋红; 杨慧敏; 陈影柔: "Sign language recognition based on skin color segmentation and an improved VGG network" (基于肤色分割与改进VGG网络的手语识别), 计算机***应用, no. 06
王晨浩: "Research on multi-granularity lip reading recognition technology" (多粒度唇语识别技术研究), CNKI
解怀奇; 乐红兵: "Video human action recognition based on a channel attention mechanism" (基于通道注意力机制的视频人体行为识别), 电子技术与软件工程, no. 04

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766172A (en) * 2021-01-21 2021-05-07 北京师范大学 Face continuous expression recognition method based on time sequence attention mechanism
CN112766172B (en) * 2021-01-21 2024-02-02 北京师范大学 Facial continuous expression recognition method based on time sequence attention mechanism
CN112800957A (en) * 2021-01-28 2021-05-14 内蒙古科技大学 Video pedestrian re-identification method and device, electronic equipment and storage medium
CN112818843A (en) * 2021-01-29 2021-05-18 山东大学 Video behavior identification method and system based on channel attention guide time modeling
CN113160117A (en) * 2021-02-04 2021-07-23 成都信息工程大学 Three-dimensional point cloud target detection method under automatic driving scene
CN112883264A (en) * 2021-02-09 2021-06-01 联想(北京)有限公司 Recommendation method and device
CN113128395A (en) * 2021-04-16 2021-07-16 重庆邮电大学 Video motion recognition method and system based on hybrid convolution and multi-level feature fusion model
CN113128395B (en) * 2021-04-16 2022-05-20 重庆邮电大学 Video action recognition method and system based on hybrid convolution multistage feature fusion model
CN113343760A (en) * 2021-04-29 2021-09-03 暖屋信息科技(苏州)有限公司 Human behavior recognition method based on multi-scale characteristic neural network
CN113283338A (en) * 2021-05-25 2021-08-20 湖南大学 Method, device and equipment for identifying driving behavior of driver and readable storage medium
CN113288162B (en) * 2021-06-03 2022-06-28 北京航空航天大学 Short-term electrocardiosignal atrial fibrillation automatic detection system based on self-adaptive attention mechanism
CN113288162A (en) * 2021-06-03 2021-08-24 北京航空航天大学 Short-term electrocardiosignal atrial fibrillation automatic detection system based on self-adaptive attention mechanism
CN113139530A (en) * 2021-06-21 2021-07-20 城云科技(中国)有限公司 Method and device for detecting sleep post behavior and electronic equipment thereof
CN113139530B (en) * 2021-06-21 2021-09-03 城云科技(中国)有限公司 Method and device for detecting sleep post behavior and electronic equipment thereof
CN113673559A (en) * 2021-07-14 2021-11-19 南京邮电大学 Video character space-time feature extraction method based on residual error network
CN113673559B (en) * 2021-07-14 2023-08-25 南京邮电大学 Video character space-time characteristic extraction method based on residual error network
CN113468531A (en) * 2021-07-15 2021-10-01 杭州电子科技大学 Malicious code classification method based on deep residual error network and mixed attention mechanism
CN113850135A (en) * 2021-08-24 2021-12-28 中国船舶重工集团公司第七0九研究所 Dynamic gesture recognition method and system based on time shift frame
CN113837263A (en) * 2021-09-18 2021-12-24 浙江理工大学 Gesture image classification method based on feature fusion attention module and feature selection
CN113837263B (en) * 2021-09-18 2023-09-26 浙江理工大学 Gesture image classification method based on feature fusion attention module and feature selection
CN113850182A (en) * 2021-09-23 2021-12-28 浙江理工大学 Action identification method based on DAMR-3 DNet
CN114037930A (en) * 2021-10-18 2022-02-11 苏州大学 Video action recognition method based on space-time enhanced network
CN114037930B (en) * 2021-10-18 2022-07-12 苏州大学 Video action recognition method based on space-time enhanced network
CN114140654A (en) * 2022-01-27 2022-03-04 苏州浪潮智能科技有限公司 Image action recognition method and device and electronic equipment
CN114783053A (en) * 2022-03-24 2022-07-22 武汉工程大学 Behavior identification method and system based on space attention and grouping convolution
CN114842542A (en) * 2022-05-31 2022-08-02 中国矿业大学 Facial action unit identification method and device based on self-adaptive attention and space-time correlation
CN114842542B (en) * 2022-05-31 2023-06-13 中国矿业大学 Facial action unit identification method and device based on self-adaptive attention and space-time correlation
CN115035605A (en) * 2022-08-10 2022-09-09 广东履安实业有限公司 Action recognition method, device and equipment based on deep learning and storage medium
CN115035605B (en) * 2022-08-10 2023-04-07 广东履安实业有限公司 Action recognition method, device and equipment based on deep learning and storage medium
CN115049969A (en) * 2022-08-15 2022-09-13 山东百盟信息技术有限公司 Poor video detection method for improving YOLOv3 and BiConvLSTM
CN116304984A (en) * 2023-03-14 2023-06-23 烟台大学 Multi-modal intention recognition method and system based on contrast learning
CN116416479A (en) * 2023-06-06 2023-07-11 江西理工大学南昌校区 Mineral classification method based on deep convolution fusion of multi-scale image features
CN116416479B (en) * 2023-06-06 2023-08-29 江西理工大学南昌校区 Mineral classification method based on deep convolution fusion of multi-scale image features

Also Published As

Publication number Publication date
CN112149504B (en) 2024-03-26

Similar Documents

Publication Publication Date Title
CN112149504B (en) Motion video identification method combining mixed convolution residual network and attention
Kim et al. Fully deep blind image quality predictor
CN108229338B (en) Video behavior identification method based on deep convolution characteristics
Hara et al. Learning spatio-temporal features with 3d residual networks for action recognition
CN111639692A (en) Shadow detection method based on attention mechanism
CN112257572B (en) Behavior identification method based on self-attention mechanism
CN114596520A (en) First visual angle video action identification method and device
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
CN113920581A (en) Method for recognizing motion in video by using space-time convolution attention network
CN112507920A (en) Examination abnormal behavior identification method based on time displacement and attention mechanism
CN115147456A (en) Target tracking method based on time sequence adaptive convolution and attention mechanism
Hongmeng et al. A detection method for deepfake hard compressed videos based on super-resolution reconstruction using CNN
CN117011342A (en) Attention-enhanced space-time transducer vision single-target tracking method
CN112149662A (en) Multi-mode fusion significance detection method based on expansion volume block
CN113344110B (en) Fuzzy image classification method based on super-resolution reconstruction
CN113850182A (en) Action identification method based on DAMR-3 DNet
CN111539434B (en) Infrared weak and small target detection method based on similarity
CN112818840A (en) Unmanned aerial vehicle online detection system and method
Zhao et al. Multi-layer fusion neural network for deepfake detection
CN116797640A (en) Depth and 3D key point estimation method for intelligent companion line inspection device
CN115424051B (en) Panoramic stitching image quality evaluation method
CN114612305B (en) Event-driven video super-resolution method based on stereogram modeling
CN115527253A (en) Attention mechanism-based lightweight facial expression recognition method and system
CN110188706B (en) Neural network training method and detection method based on character expression in video for generating confrontation network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant