CN112149504A - Motion video identification method combining residual error network and attention of mixed convolution - Google Patents
- Publication number
- CN112149504A (application number CN202010849991.6A)
- Authority
- CN
- China
- Prior art keywords
- convolution
- layer
- attention
- feature map
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
Abstract
The invention discloses an action video recognition method that combines a hybrid-convolution residual network with attention. The method comprises the following steps: 1) read the motion of a person in the action video and convert the action video into original video frame images; 2) apply data enhancement to the video frames of the action video using temporal sampling, random cropping, and brightness adjustment to obtain the video frame images; 3) construct an attention module, use it to build a hybrid convolution block, cascade the hybrid convolution blocks into a residual network model based on the combination of a hybrid convolution residual network and attention, and use this model to perform spatio-temporal feature learning on the video frame images to obtain key feature maps; 4) classify the key feature maps with a Softmax classification layer. The invention can preserve the feature information of the video frames while increasing network depth, fully fuse spatio-temporal features, strengthen the correlation of important channel features, and effectively improve the prediction performance of action recognition.
Description
Technical Field
The invention belongs to the technical field of intelligent video analysis, and in particular relates to an action video recognition method based on the combination of a hybrid-convolution residual network and an attention mechanism.
Background
Action recognition has application value in video processing, pattern recognition, virtual reality, and related areas, and is one of the important research topics in computer vision. Action recognition in video is a key problem in video understanding tasks: it requires not only capturing features in the spatial dimension but also encoding the temporal relationship between many consecutive frames. Effectively extracting high-resolution spatio-temporal features from action video is therefore of great significance for improving recognition accuracy. However, a video is a continuous frame sequence with temporal structure; each pixel is highly similar to its neighboring pixels, and the spatio-temporal correlation is very strong. Conventional convolutional neural networks extract features from single images very well, but cannot extract spatio-temporal features from videos.
When the input is a sequence of frames, there are currently three main approaches: (1) 2D CNNs combined with RNN/LSTM, (2) two-stream CNNs, and (3) 3D CNNs. Two-stream CNNs use two independent networks to capture spatial appearance and temporal motion information. Although this method works well, it cannot effectively mix appearance and motion information because the two networks are trained separately. RNN/LSTM models handle sequence information well and are therefore often combined with CNNs for action recognition; however, this type of approach only preserves the top-level features and ignores dependencies in the bottom-level features. Using 3D CNNs to obtain spatio-temporal information is effective, but a 3D CNN model has a huge number of parameters, contains a large amount of redundant spatial data, and is very challenging to train. In recent years, many studies have tried to introduce attention mechanisms from different perspectives to enhance the robustness of action recognition. However, stacking attention in deep networks can cause repeated dot products, which degrades the value of the features.
Disclosure of Invention
To solve the problems in the background art, the invention aims to provide an action recognition method for action video based on the combination of a hybrid-convolution residual network and an attention mechanism. An MC-RAN module is designed: based on a hybrid-convolution residual network, the 3D convolution is decoupled into a 2D convolution and a 1D convolution, which are fused with an adaptive spatial attention module M_SS and a channel attention module M_CS respectively. This strengthens the correlation of important channel features and increases the global correlation of the feature map, thereby improving action recognition performance.
The technical scheme adopted by the invention is as follows:
the invention comprises the following steps:
1) reading the motion of a person in the action video and converting the action video into original video frame images;
2) applying data enhancement to the video frames of the action video using temporal sampling, random cropping, and brightness adjustment to obtain the video frame images;
the step 2) is specifically as follows:
Temporal sampling: for each action video, randomly sample 16 consecutive frames for training; if the video has fewer than 16 frames, replay it cyclically until 16 frames are obtained;
Random cropping: resize the original video frame image to 128 × 171 pixels, then randomly crop it to 112 × 112 pixels;
Brightness adjustment: randomly adjust the brightness of the original video frame image.
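The sampling and cropping rules above can be sketched in plain Python. This is an illustrative sketch only; the function names, the use of the `random` module, and the return conventions are ours, not part of the patent:

```python
import random

CLIP_LEN = 16           # consecutive frames per training clip
RESIZE_HW = (128, 171)  # resize target (height, width), per the patent
CROP_HW = (112, 112)    # random-crop size

def sample_clip(num_frames, clip_len=CLIP_LEN, rng=random):
    """Pick clip_len consecutive frame indices; loop the video if too short."""
    if num_frames >= clip_len:
        start = rng.randrange(num_frames - clip_len + 1)
        return [start + i for i in range(clip_len)]
    # video shorter than clip_len: replay it cyclically until 16 frames
    return [i % num_frames for i in range(clip_len)]

def random_crop_origin(height, width, crop_hw=CROP_HW, rng=random):
    """Top-left corner of a random crop inside a resized frame."""
    ch, cw = crop_hw
    return rng.randrange(height - ch + 1), rng.randrange(width - cw + 1)

idx = sample_clip(40)                  # 16 consecutive indices from a 40-frame video
idx_short = sample_clip(10)            # 10-frame video: wraps around after frame 9
y, x = random_crop_origin(*RESIZE_HW)  # crop origin inside a 128 x 171 frame
print(len(idx), idx_short[8:12], (y, x))
```

Brightness adjustment is omitted here because the patent does not specify its range; any per-frame multiplicative or additive jitter would fit the description.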
3) Constructing an attention module, constructing a mixed convolution block by using the attention module, constructing a mixed convolution residual network model based on the combination of a mixed convolution residual network and attention by cascading the mixed convolution block, and performing space-time feature learning on a video frame image by using the mixed convolution residual network model to obtain a key feature map;
The hybrid convolution block is expressed as:
X_{t+1} = X_t + W(X_t)
where X_t and X_{t+1} represent the input and output of the t-th MC-RAN module; X_t and X_{t+1} have the same feature dimensions, and W represents the hybrid convolution residual function with the attention mechanism added;
Step 3) is specifically as follows: the 3D ResNet network structure is selected as the basic network structure, and the original 3D convolution modules in the 3D ResNet structure are replaced by a first convolution layer and four hybrid convolution blocks, where each hybrid convolution block comprises an MC-RAN module and an addition layer. The MC-RAN module comprises a (2+1)D convolution layer, a first batch normalization layer, a first ReLU activation layer, a 3D convolution layer, and a second batch normalization layer, where the (2+1)D convolution layer is formed by adding the attention module to the 2D convolution layer. The input X_t of the hybrid convolution block is fed into the MC-RAN module, the output feature map of the MC-RAN module is added to the input X_t through the addition layer, and the sum, processed by a second ReLU activation layer, is the output X_{t+1} of the hybrid convolution block. After each hybrid convolution block, a 3D max-pooling layer is cascaded for down-sampling;
The i-th 3D convolution layer of size N_{i-1} × t × d × d is replaced by a second 2D convolution layer with M_i filters of size N_{i-1} × 1 × d × d followed by a temporal convolution layer with N_i filters of size M_i × t × 1 × 1, where M_i is calculated by the following formula:
M_i = ⌊ (t · d² · N_{i-1} · N_i) / (d² · N_{i-1} + t · N_i) ⌋
where d represents the width and height size parameter, t represents the temporal extent, and ⌊·⌋ denotes rounding down.
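The M_i rule can be checked numerically: it is chosen so that the decoupled 2D + 1D pair has roughly the same parameter count as the original 3D layer. A minimal sketch (the function name `mid_channels` is ours):

```python
import math

def mid_channels(n_prev, n_i, t=3, d=3):
    """M_i intermediate channels when an N_{i-1} x t x d x d 3D convolution
    is decoupled into a spatial (2D) and a temporal (1D) convolution while
    approximately matching the parameter count of the original 3D layer."""
    return math.floor(t * d * d * n_prev * n_i / (d * d * n_prev + t * n_i))

# 3x3x3 kernel, 64 -> 64 channels
m = mid_channels(64, 64)
print(m)                                        # 144
# parameter counts: decoupled (2D + 1D) vs original 3D convolution
print(9 * 64 * m + 3 * m * 64, 27 * 64 * 64)    # both 110592
```

For a 3 × 3 × 3 kernel with 64 input and 64 output channels this gives M_i = 144, and the decoupled layer pair then has exactly the parameter budget of the full 3D convolution.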
The (2+1)D convolution layer mainly consists of a first 2D convolution layer, a spatial attention module M_SS, a temporal convolution layer, and a channel attention module M_CS in cascade; the spatial attention module M_SS and the channel attention module M_CS together form the attention module.
The spatial attention module M_SS obtains a spatial weight map W_SS of the input feature map in the spatial dimension through a third 2D convolution layer; the channel attention module M_CS obtains a channel weight map W_CS of the input feature map in the channel dimension through an added multi-layer perceptron.
The spatial attention module M_SS is constructed as follows: let the input feature map F have size C × H × W, where C is the number of channels of each frame image and H and W are the height and width size parameters of each frame image. First, the channels of each frame image are compressed by global average pooling, producing a 2D spatial descriptor Z of size 1 × H × W. Then, a third 2D convolution layer convolves Z to obtain the target region of interest in the input feature map. Finally, a third batch normalization layer is added after the third 2D convolution layer to transform the dimensions of the target region, yielding the spatial attention weight map W_SS.
The spatial attention weight map W_SS can be expressed as:
W_SS(F) = BN(σ(f^{7×7}(Avgpool(F))))
where BN(·) denotes batch normalization, σ(·) the sigmoid activation function, f^{7×7}(·) a convolution with kernel size 7 × 7, Avgpool(·) global average pooling, and F the input feature map;
The channel attention module M_CS is constructed as follows: let the input feature map Q have size C × H × W, where C is the number of channels of each frame image. First, a global average pooling operation is applied to Q to produce a channel vector Q' of size 1 × C; then a multi-layer perceptron processes Q' to learn its weights.
The channel vector Q' can be calculated by the following formula:
Q'(c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} F_c(i, j)
where F_c(i, j) is the value of channel c of the feature map at coordinates (i, j), i indexes a pixel in the H dimension, and j indexes a pixel in the W dimension;
Finally, a fourth batch normalization layer is added after the multi-layer perceptron for dimension conversion, yielding the channel attention weight map W_CS.
The channel attention weight map W_CS can be expressed as:
W_CS(F) = BN(MLP(Avgpool(F))) = BN(σ(W_1 δ(W_0 Avgpool(F) + b_0) + b_1))
where MLP(·) denotes a multi-layer perceptron with a hidden layer; W_0 and W_1 are the MLP weights, of sizes C/r × C and C × C/r respectively, with compression ratio r; σ(·) is the sigmoid activation function; δ(·) is the linear rectification unit (ReLU); and b_0 and b_1 are the MLP bias terms, of sizes C/r and C respectively.
4) The key feature maps are classified using a Softmax classification layer.
Step 4) is specifically as follows: after the video frame images pass through the four MC-RAN modules, the spatio-temporal features in them are fused, the hybrid-convolution residual network model obtains the key features, and the key feature maps are input into the Softmax layer for classification.
The input feature map of the first MC-RAN module is the output of the first convolution layer applied to the video frame images from step 2); the input feature map of each subsequent MC-RAN module is the output feature map of the previous MC-RAN module after its 3D max-pooling layer.
The invention has the beneficial effects that:
1) An MC-RAN module is designed. Based on a hybrid-convolution residual network, the 2D and 1D convolutions decoupled from the 3D convolution are fused with the adaptive spatial attention module and the channel attention module respectively, so that spatio-temporal features are fully fused, the correlation of important channel features is strengthened, the global correlation of the feature map is increased, and action recognition performance is improved.
2) The proposed hybrid-convolution residual network model can increase network depth while preserving feature information. Comparative experiments were carried out on the common datasets UCF101 and HMDB51; after pre-training on the Kinetics dataset, the Top-1 accuracy on the UCF101 and HMDB51 test sets reaches 96.8% and 74.8% respectively.
Drawings
FIG. 1 is an example of a partial data set according to an embodiment of the present invention;
FIG. 2 is a block diagram of an embodiment of the present invention;
FIG. 3 is a block diagram of a spatial attention module according to an embodiment of the present invention;
FIG. 4 is a block diagram of a channel attention module according to an embodiment of the present invention;
FIG. 5 is a block cascade diagram of a hybrid convolution according to an embodiment of the present invention;
FIG. 6 is a feature-map visualization of an embodiment of the present invention: (a), (b), (c), and (d) are original video frames; (e), (f), (g), and (h) are the corresponding feature maps.
Detailed Description
The invention is further illustrated by the following figures and examples.
The invention provides an action video recognition method combining a hybrid-convolution residual network and attention, using the open-source dataset UCF101 as the experimental dataset; sample data are shown in FIG. 1. The figure shows video frame images converted from one action video; the images are saved in JPG format with a final size of 320 × 240.
The embodiment of the invention is as follows:
step 1: the motion video is read by using a VideoCapture function in Opencv, and the read motion video is converted into a video frame image of the motion video, where a part of the video frame image of the motion video is shown in fig. 1.
Step 2: the method firstly carries out data preprocessing on the motion recognition model, and then carries out pre-training on the Kinetics data set instead of training our model from the beginning so as to improve the accuracy of our model.
2.1) data preprocessing of motion recognition is as follows:
Apply data enhancement to the video frames of the action video using temporal sampling, random cropping, and brightness adjustment to obtain the video frame images;
Temporal sampling: for each action video, randomly sample 16 consecutive frames for training; if the video has fewer than 16 frames, replay it cyclically until 16 frames are obtained;
Random cropping: resize the original video frame image to 128 × 171 pixels, then randomly crop it to 112 × 112 pixels;
Brightness adjustment: randomly adjust the brightness of the original video frame image.
2.2) the model pre-training process of motion recognition is as follows:
The preprocessed video frame images are input into the hybrid-convolution residual network model for feature extraction in the spatial and channel dimensions. Each input sample of the model is a clip of size 16 × 112 × 112 × 3, and the output is a category label. The loss is optimized with stochastic gradient descent (SGD); the initial learning rate is set to 0.01 and divided by 10 when the validation loss saturates. The momentum coefficient is 0.9, the dropout rate 0.5, and the weight decay 1e-3. Training uses batch normalization for acceleration and runs on a server with 8 Tesla V100 GPUs, with a batch size of 8 per GPU and a total batch size of 64.
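The "divide by 10 when the validation loss saturates" rule can be illustrated with a minimal scheduler sketch. The patience value below is our assumption (the patent does not state one); in a PyTorch training loop the same behaviour is usually obtained from `ReduceLROnPlateau`:

```python
class PlateauLR:
    """Minimal sketch of the stated schedule: start at 0.01 and divide the
    learning rate by 10 whenever the validation loss stops improving for
    `patience` consecutive epochs (patience is an assumed value)."""
    def __init__(self, lr=0.01, patience=3, factor=0.1):
        self.lr, self.patience, self.factor = lr, patience, factor
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor          # divide the learning rate by 10
                self.bad_epochs = 0
        return self.lr

sched = PlateauLR()
for loss in [1.0, 0.8, 0.7, 0.7, 0.7, 0.7]:     # loss saturates after epoch 3
    lr = sched.step(loss)
print(round(lr, 4))                             # 0.001 after three flat epochs
```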
Step 3: construct the attention module. The attention mechanism focuses on the positions suggested by prior knowledge, removes the interference of background and noise on action recognition, and automatically assigns different attention to different positions of the input feature map according to that prior knowledge;
Construct a hybrid convolution block using the attention module, cascade the hybrid convolution blocks to construct a residual network model based on the combination of a hybrid convolution residual network and attention, and use the model to perform spatio-temporal feature learning on the video frame images to obtain the key feature maps;
The hybrid convolution block is expressed as:
X_{t+1} = X_t + W(X_t)
where X_t and X_{t+1} represent the input and output of the t-th MC-RAN module; X_t and X_{t+1} have the same feature dimensions, and W represents the hybrid convolution residual function with the attention mechanism added.
Step 3) is specifically as follows: the 3D ResNet network structure is selected as the basic network structure, and the original 3D convolution modules in the 3D ResNet structure are replaced by a first convolution layer and four hybrid convolution blocks, each comprising an MC-RAN module and an addition layer. The MC-RAN module comprises a (2+1)D convolution layer, a first batch normalization layer, a first ReLU activation layer, a 3D convolution layer, and a second batch normalization layer, connected in sequence. The input X_t of the hybrid convolution block is fed into the MC-RAN module, the output feature map of the MC-RAN module is added to the input X_t through the addition layer, and the sum, processed by a second ReLU activation layer, is the output X_{t+1} of the hybrid convolution block. Each hybrid convolution block is followed by a cascaded 3D max-pooling layer for down-sampling.
a. The i-th 3D convolution layer of size N_{i-1} × t × d × d is replaced by a second 2D convolution layer with M_i filters of size N_{i-1} × 1 × d × d followed by a temporal convolution layer with N_i filters of size M_i × t × 1 × 1, where M_i is calculated by the following formula:
M_i = ⌊ (t · d² · N_{i-1} · N_i) / (d² · N_{i-1} + t · N_i) ⌋
where d represents the width and height size parameter, t represents the temporal extent, and ⌊·⌋ denotes rounding down;
b. Spatial down-sampling is performed in the first convolution layer conv1 with a stride of 1 × 2 × 2. In the third, fourth, and fifth hybrid convolution blocks conv3_1, conv4_1, and conv5_1, the first 2D convolution layer and the temporal convolution layer of the (2+1)D convolution down-sample space and time with strides of 1 × 2 × 2 and 2 × 1 × 1 respectively. Table 1 gives the network structure of the first convolution layer and the hybrid convolution blocks.
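Following only the strides stated in this paragraph (and ignoring the additional 3D max-pooling layers mentioned elsewhere in the description), the feature-map sizes for a 16 × 112 × 112 input clip can be traced as follows. This is an illustrative sketch under that simplifying assumption, not a reconstruction of Table 1:

```python
def apply_stride(shape, stride):
    """Ceil-divide a (T, H, W) shape by a (sT, sH, sW) stride, as a
    'same'-padded strided convolution does."""
    return tuple(-(-s // st) for s, st in zip(shape, stride))

shape = (16, 112, 112)                          # input clip: T x H x W
shape = apply_stride(shape, (1, 2, 2))          # conv1: spatial stride only
for _ in ("conv3_1", "conv4_1", "conv5_1"):
    shape = apply_stride(shape, (1, 2, 2))      # 2D (spatial) part of (2+1)D
    shape = apply_stride(shape, (2, 1, 1))      # 1D (temporal) part of (2+1)D
print(shape)                                    # (2, 7, 7)
```

Under this assumption the final feature map is 2 × 7 × 7, which a global pooling layer would reduce before the Softmax classifier.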
Table 1 shows the network layer structure of the first convolution layer and the hybrid convolution block.
c. The hybrid convolution block cascade is shown in FIG. 5. The (2+1)D convolution layer is formed by adding the attention module to the 2D convolution layer; it mainly consists of a first 2D convolution layer, a spatial attention module M_SS, a temporal convolution layer, and a channel attention module M_CS in cascade. The attention module applies attention to the space and the channels of the input feature map respectively, the spatial attention module M_SS and the channel attention module M_CS together forming the attention module.
The spatial attention module M_SS obtains a spatial weight map W_SS of the input feature map in the spatial dimension through a third 2D convolution kernel; the channel attention module M_CS obtains a channel weight map W_CS of the input feature map in the channel dimension through an added multi-layer perceptron;
The spatial attention module M_SS is constructed as follows: let the input feature map F have size C × H × W, where C is the number of channels of each frame image and H and W are the height and width size parameters of each frame image. First, the channels of each frame image are compressed by global average pooling, producing a 2D spatial descriptor Z of size 1 × H × W, whose element at coordinates (i, j) is calculated as:
Z(i, j) = (1 / C) Σ_{k=1}^{C} F_{i,j}(k)
where F_{i,j}(k) is the value of the k-th channel of the feature map at coordinates (i, j), i indexes a pixel in the H dimension, and j indexes a pixel in the W dimension. Then, a third 2D convolution layer with kernel size 7 × 7 convolves the 2D spatial descriptor to obtain the target region of interest in the input feature map. Finally, a third batch normalization layer is added after the third 2D convolution layer to transform the dimensions of the target region, yielding the spatial attention weight map W_SS.
The spatial attention weight map W_SS can be expressed as:
W_SS(F) = BN(σ(f^{7×7}(Avgpool(F))))
where BN(·) denotes batch normalization, σ(·) the sigmoid activation function, f^{7×7}(·) a convolution with kernel size 7 × 7, Avgpool(·) global average pooling, and F the input feature map.
The channel attention module M_CS is constructed as follows: let the input feature map Q have size C × H × W, where C is the number of channels of each frame image. First, a global average pooling operation is applied to Q to produce a channel vector Q' of size 1 × C; then a multi-layer perceptron FC with a hidden layer processes Q' to learn its weights, which serve as the channel correlations. To limit the complexity of the channel attention module and save parameter cost, the size of the hidden activation layer is set to 1 × 1 × C/r, where the compression ratio r is set to 16.
The channel vector Q' can be calculated by the following formula:
Q'(c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} F_c(i, j)
where F_c(i, j) is the value of channel c of the feature map at coordinates (i, j), i indexes a pixel in the H dimension, and j indexes a pixel in the W dimension;
Finally, a fourth batch normalization layer is added after the multi-layer perceptron for dimension conversion, yielding the channel attention weight map W_CS.
The channel attention weight map W_CS can be expressed as:
W_CS(F) = BN(MLP(Avgpool(F))) = BN(σ(W_1 δ(W_0 Avgpool(F) + b_0) + b_1))
where MLP(·) denotes a multi-layer perceptron with a hidden layer; W_0 and W_1 are the MLP weights, of sizes C/r × C and C × C/r respectively; σ(·) is the sigmoid activation function; δ(·) is the linear rectification unit (ReLU); and b_0 and b_1 are the MLP bias terms, of sizes C/r and C respectively.
Step 4: after the video frame images pass through the first convolution layer and the four hybrid convolution blocks, their spatio-temporal features are fused and the hybrid-convolution residual network model obtains the key features; the feature-map visualization after adding the attention module is shown in FIG. 6. The key feature maps are input into the Softmax layer for classification. Each video in the validation set is evaluated with the trained network to obtain its category label. After training, the proposed hybrid-convolution residual network model is compared with different network models; the experimental results are shown in Table 2, which shows that both the Top-1 and Top-5 recognition accuracies increase without increasing the parameter count of the model.
Table 2 shows the recognition results of the hybrid convolution residual network model compared to other models.
Network model | Number of parameters | Top-1 recognition rate (%) | Top-5 recognition rate (%) | Average recognition rate (%) |
ResNet[39] | 63.72M | 60.1 | 81.9 | 71.0 |
(2+1)D-ResNet[12] | 63.88M | 66.8 | 88.1 | 77.45 |
MC-ResNet[28] | 63.88M | 67.3 | 89.2 | 78.25 |
RAN[26] | 63.97M | 61.7 | 83.2 | 72.45 |
(2+1)D-RAN | 63.98M | 67.8 | 89.3 | 78.55 |
MC-RAN | 63.98M | 68.8 | 89.9 | 79.35 |
The foregoing detailed description is intended to illustrate rather than limit the invention; any changes and modifications that fall within the true spirit and scope of the invention are intended to be covered by the following claims.
Claims (6)
1. An action video recognition method combining a hybrid-convolution residual network and attention, characterized in that the method comprises the following steps:
1) reading the motion of a person in the action video and converting the action video into original video frame images;
2) applying data enhancement to the video frames of the action video using temporal sampling, random cropping, and brightness adjustment to obtain the video frame images;
3) constructing an attention module, using it to build a hybrid convolution block, cascading the hybrid convolution blocks to construct a residual network model based on the combination of a hybrid convolution residual network and attention, and using the model to perform spatio-temporal feature learning on the video frame images to obtain key feature maps;
the hybrid convolution block is expressed as:
X_{t+1} = X_t + W(X_t)
wherein X_t and X_{t+1} represent the input and output of the t-th MC-RAN module; X_t and X_{t+1} have the same feature dimensions, and W represents the hybrid convolution residual function with the attention mechanism added;
4) the key feature maps are classified using a Softmax classification layer.
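As an illustration (not part of the claimed method), the residual form X_{t+1} = X_t + W(X_t) of the hybrid convolution block can be sketched numerically; the real W is the attention-augmented (2+1)D/3D convolution stack of the MC-RAN module, replaced here by a hypothetical stand-in of matching shape:

```python
import numpy as np

rng = np.random.default_rng(0)

def w_residual(x):
    # Stand-in for the hybrid convolution residual function W (an assumption
    # of this sketch; the patent's W is a learned convolutional branch).
    return 0.1 * np.tanh(x)

def mc_ran_block(x):
    out = x + w_residual(x)        # identity shortcut plus residual branch
    return np.maximum(out, 0.0)    # ReLU applied after the addition layer

x_t = rng.standard_normal((3, 16, 112, 112))   # C x T x H x W input clip
x_t1 = mc_ran_block(x_t)
assert x_t1.shape == x_t.shape                 # same feature dimensions
```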
2. The motion video recognition method combining a hybrid convolution residual network and attention according to claim 1, characterized in that step 2) is specifically as follows:
time sampling: for each action video, randomly sampling 16 consecutive frames for training; if fewer than 16 consecutive frames are available, the action video is played in a loop until 16 frames are reached;
random cropping: resizing the original video frame images to 128 × 171 pixels, and then randomly cropping them to 112 × 112 pixels;
brightness adjustment: randomly adjusting the brightness of the original video frame images.
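A minimal sketch of the three enhancement steps, under stated assumptions: frames are numpy arrays of shape (T, H, W, C) with values in [0, 1], resizing to 128 × 171 is assumed done upstream, and the brightness jitter range is a hypothetical choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def loop_to_16(frames):
    # Time sampling: loop the clip until at least 16 consecutive frames exist,
    # then take a random 16-frame window.
    while frames.shape[0] < 16:
        frames = np.concatenate([frames, frames], axis=0)
    start = rng.integers(0, frames.shape[0] - 16 + 1)
    return frames[start:start + 16]

def random_crop_112(frames):
    # Random cropping: 128 x 171 -> 112 x 112 at a random offset.
    t, h, w, c = frames.shape
    y = rng.integers(0, h - 112 + 1)
    x = rng.integers(0, w - 112 + 1)
    return frames[:, y:y + 112, x:x + 112, :]

def jitter_brightness(frames, max_delta=0.2):
    # Brightness adjustment: random additive shift, clipped back to [0, 1].
    delta = rng.uniform(-max_delta, max_delta)
    return np.clip(frames + delta, 0.0, 1.0)

clip = rng.random((10, 128, 171, 3))        # a 10-frame clip, shorter than 16
clip = jitter_brightness(random_crop_112(loop_to_16(clip)))
assert clip.shape == (16, 112, 112, 3)
```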
3. The motion video recognition method combining a hybrid convolution residual network and attention according to claim 1, characterized in that:
step 3) is specifically as follows: a 3D ResNet network structure is selected as the basic network structure, and the original 3D convolution modules in the 3D ResNet network structure are replaced by a first convolutional layer and four hybrid convolution blocks, each hybrid convolution block comprising an MC-RAN module and an addition layer; the MC-RAN module comprises a (2+1)D convolutional layer, a first batch normalization layer, a first ReLU activation layer, a 3D convolutional layer and a second batch normalization layer, wherein the (2+1)D convolutional layer is formed by adding the attention module into the 2D convolutional layer; the input X_t of the hybrid convolution block is input into the MC-RAN module, the output feature map of the MC-RAN module and the input X_t are added through the addition layer, and the added feature map, after being processed by a second ReLU activation layer, is taken as the output X_{t+1} of the hybrid convolution block; after each hybrid convolution block, a 3D max-pooling layer is cascaded for down-sampling;
the i-th 3D convolutional layer of size N_{i-1} × t × d × d is composed of M_i second 2D convolutional layers of size N_{i-1} × 1 × d × d and N_i temporal convolutional layers of size M_i × t × 1 × 1, where M_i is calculated by the following formula:
M_i = ⌊ t·d²·N_{i-1}·N_i / (d²·N_{i-1} + t·N_i) ⌋
wherein d represents the width and height dimension parameter of the 3D convolutional layer, t represents the temporal dimension, and ⌊·⌋ represents rounding down.
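As a hedged illustration, the formula for M_i can be evaluated directly; the example channel counts and kernel sizes below are assumptions, not values fixed by the claim:

```python
import math

def m_i(n_prev, n_i, t, d):
    # Number of intermediate channels in the (2+1)D decomposition, chosen so
    # the factorized layer roughly matches the parameter count of the original
    # N_{i-1} x t x d x d 3D convolution with N_i filters.
    return math.floor(t * d * d * n_prev * n_i / (d * d * n_prev + t * n_i))

# Hypothetical layer: 64 -> 64 channels, 3x3x3 kernel.
print(m_i(64, 64, 3, 3))   # 144 intermediate channels
```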
4. The motion video recognition method combining a hybrid convolution residual network and attention according to claim 3, characterized in that:
the (2+1)D convolutional layer is mainly formed by cascading a first 2D convolutional layer, a spatial attention module M_SS, a temporal convolutional layer and a channel attention module M_CS, wherein the spatial attention module M_SS and the channel attention module M_CS form the attention module;
the spatial attention module M_SS obtains a spatial weight map W_SS of the input feature map in the spatial dimension through a third 2D convolutional layer; the channel attention module M_CS obtains a channel weight map W_CS of the input feature map in the channel dimension through a multi-layer perceptron;
the construction of the spatial attention module M_SS is specifically as follows: the input feature map F has size C × H × W, where C represents the number of channels of each frame image in the input feature map, and H and W represent the height and width dimension parameters of each frame image in the input feature map; first, the channels of each frame image in the input feature map are compressed using global average pooling to generate a 2D spatial descriptor Z of size 1 × H × W; then, the third 2D convolutional layer convolves the 2D spatial descriptor Z to obtain the target region of interest in the input feature map; finally, a third batch normalization layer is added after the third 2D convolutional layer to perform dimension transformation on the target region of interest, obtaining the spatial attention weight map W_SS;
the spatial attention weight map W_SS can be expressed as:
W_SS(F) = BN(σ(f^{7×7}(AvgPool(F))))
wherein BN(·) represents batch normalization, σ(·) represents the sigmoid activation function, f^{7×7}(·) represents a convolution operation with a 7 × 7 convolution kernel, AvgPool(·) represents global average pooling, and F represents the input feature map;
the construction of the channel attention module M_CS is specifically as follows: the input feature map Q has size C × H × W, where C represents the number of channels of each frame image in the input feature map; first, a global average pooling operation is performed on the input feature map Q to generate a channel vector Q' of size 1 × C; then, the channel vector Q' is processed by a multi-layer perceptron to learn the weights of the channel vector Q';
the channel vector Q' can be calculated by the following formula:
Q' = (1 / (H × W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} F(i, j)
wherein F(i, j) represents the feature value at coordinates (i, j), i indexes a pixel position along the H dimension, and j indexes a pixel position along the W dimension;
finally, a fourth batch normalization layer is added after the multi-layer perceptron to perform dimension conversion, obtaining the channel attention weight map W_CS;
the channel attention weight map W_CS can be expressed as:
W_CS(F) = BN(MLP(AvgPool(F))) = BN(W_1·σ(W_0·AvgPool(F) + b_0) + b_1)
wherein MLP(·) represents a multi-layer perceptron with a hidden layer, W_0 and W_1 are the weights of the MLP with sizes C/r × C and C × C/r respectively, r is the compression ratio, σ(·) is the rectified linear unit, and b_0 and b_1 represent the bias terms of the MLP with sizes C/r and C respectively.
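As an illustrative sketch (not the claimed implementation) of the two attention maps defined above, W_SS(F) = BN(σ(f^{7×7}(AvgPool(F)))) and W_CS(F) = BN(W_1·σ(W_0·AvgPool(F) + b_0) + b_1): the 7 × 7 kernel and MLP weights are random stand-ins, and batch normalization is reduced to a simple standardization, both assumptions of this example:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, r = 64, 14, 14, 16    # assumed feature sizes and compression ratio

def spatial_attention(feat, kernel):
    z = feat.mean(axis=0)                      # channel squeeze -> H x W descriptor Z
    ks = kernel.shape[0]
    zp = np.pad(z, ks // 2)                    # "same" padding for the 7x7 convolution
    conv = np.empty_like(z)
    for i in range(z.shape[0]):
        for j in range(z.shape[1]):
            conv[i, j] = np.sum(zp[i:i + ks, j:j + ks] * kernel)
    return 1.0 / (1.0 + np.exp(-conv))         # sigmoid -> spatial weights in (0, 1)

W0, b0 = rng.standard_normal((C // r, C)), np.zeros(C // r)   # C -> C/r
W1, b1 = rng.standard_normal((C, C // r)), np.zeros(C)        # C/r -> C

def channel_attention(feat):
    q = feat.reshape(feat.shape[0], -1).mean(axis=1)   # channel vector Q' (GAP)
    hidden = np.maximum(W0 @ q + b0, 0.0)              # ReLU bottleneck
    out = W1 @ hidden + b1
    return (out - out.mean()) / (out.std() + 1e-5)     # stand-in for BN

F = rng.standard_normal((C, H, W))
w_ss = spatial_attention(F, rng.standard_normal((7, 7)))
w_cs = channel_attention(F)
assert w_ss.shape == (H, W) and w_cs.shape == (C,)
```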
5. The motion video recognition method combining a hybrid convolution residual network and attention according to claim 1, characterized in that step 4) is specifically as follows: after the video frame images pass through the four MC-RAN modules, the spatio-temporal features in the video frame images are fused, the hybrid convolution residual network model obtains the key features, and the key feature map is input into the Softmax layer for classification.
6. The motion video recognition method combining a hybrid convolution residual network and attention according to claim 1, characterized in that the input feature map of the first MC-RAN module is the output feature map obtained after the video frame images of step 2) pass through the first convolutional layer, and the input feature map of each subsequent MC-RAN module is the output feature map of the previous MC-RAN module after passing through the 3D max-pooling layer.
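As a hedged sketch, the 3D max pooling that links consecutive MC-RAN modules can be illustrated with a 2 × 2 × 2 window and stride 2, halving the temporal and spatial sizes (the window size is an assumption, since the claim does not fix it):

```python
import numpy as np

def max_pool_3d(x):
    # x: C x T x H x W; non-overlapping 2x2x2 max pooling via reshape.
    c, t, h, w = x.shape
    x = x[:, :t - t % 2, :h - h % 2, :w - w % 2]   # drop odd remainders
    x = x.reshape(c, t // 2, 2, h // 2, 2, w // 2, 2)
    return x.max(axis=(2, 4, 6))

x = np.random.default_rng(0).random((64, 16, 56, 56))
assert max_pool_3d(x).shape == (64, 8, 28, 28)
```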
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010849991.6A CN112149504B (en) | 2020-08-21 | 2020-08-21 | Motion video identification method combining mixed convolution residual network and attention |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112149504A true CN112149504A (en) | 2020-12-29 |
CN112149504B CN112149504B (en) | 2024-03-26 |
Family
ID=73889023
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112766172A (en) * | 2021-01-21 | 2021-05-07 | 北京师范大学 | Face continuous expression recognition method based on time sequence attention mechanism |
CN112800957A (en) * | 2021-01-28 | 2021-05-14 | 内蒙古科技大学 | Video pedestrian re-identification method and device, electronic equipment and storage medium |
CN112818843A (en) * | 2021-01-29 | 2021-05-18 | 山东大学 | Video behavior identification method and system based on channel attention guide time modeling |
CN112883264A (en) * | 2021-02-09 | 2021-06-01 | 联想(北京)有限公司 | Recommendation method and device |
CN113128395A (en) * | 2021-04-16 | 2021-07-16 | 重庆邮电大学 | Video motion recognition method and system based on hybrid convolution and multi-level feature fusion model |
CN113139530A (en) * | 2021-06-21 | 2021-07-20 | 城云科技(中国)有限公司 | Method and device for detecting sleep post behavior and electronic equipment thereof |
CN113160117A (en) * | 2021-02-04 | 2021-07-23 | 成都信息工程大学 | Three-dimensional point cloud target detection method under automatic driving scene |
CN113283338A (en) * | 2021-05-25 | 2021-08-20 | 湖南大学 | Method, device and equipment for identifying driving behavior of driver and readable storage medium |
CN113288162A (en) * | 2021-06-03 | 2021-08-24 | 北京航空航天大学 | Short-term electrocardiosignal atrial fibrillation automatic detection system based on self-adaptive attention mechanism |
CN113343760A (en) * | 2021-04-29 | 2021-09-03 | 暖屋信息科技(苏州)有限公司 | Human behavior recognition method based on multi-scale characteristic neural network |
CN113468531A (en) * | 2021-07-15 | 2021-10-01 | 杭州电子科技大学 | Malicious code classification method based on deep residual error network and mixed attention mechanism |
CN113673559A (en) * | 2021-07-14 | 2021-11-19 | 南京邮电大学 | Video character space-time feature extraction method based on residual error network |
CN113837263A (en) * | 2021-09-18 | 2021-12-24 | 浙江理工大学 | Gesture image classification method based on feature fusion attention module and feature selection |
CN113850135A (en) * | 2021-08-24 | 2021-12-28 | 中国船舶重工集团公司第七0九研究所 | Dynamic gesture recognition method and system based on time shift frame |
CN113850182A (en) * | 2021-09-23 | 2021-12-28 | 浙江理工大学 | Action identification method based on DAMR-3 DNet |
CN114037930A (en) * | 2021-10-18 | 2022-02-11 | 苏州大学 | Video action recognition method based on space-time enhanced network |
CN114140654A (en) * | 2022-01-27 | 2022-03-04 | 苏州浪潮智能科技有限公司 | Image action recognition method and device and electronic equipment |
CN114783053A (en) * | 2022-03-24 | 2022-07-22 | 武汉工程大学 | Behavior identification method and system based on space attention and grouping convolution |
CN114842542A (en) * | 2022-05-31 | 2022-08-02 | 中国矿业大学 | Facial action unit identification method and device based on self-adaptive attention and space-time correlation |
CN115035605A (en) * | 2022-08-10 | 2022-09-09 | 广东履安实业有限公司 | Action recognition method, device and equipment based on deep learning and storage medium |
CN115049969A (en) * | 2022-08-15 | 2022-09-13 | 山东百盟信息技术有限公司 | Poor video detection method for improving YOLOv3 and BiConvLSTM |
CN116304984A (en) * | 2023-03-14 | 2023-06-23 | 烟台大学 | Multi-modal intention recognition method and system based on contrast learning |
CN116416479A (en) * | 2023-06-06 | 2023-07-11 | 江西理工大学南昌校区 | Mineral classification method based on deep convolution fusion of multi-scale image features |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109886225A (en) * | 2019-02-27 | 2019-06-14 | Zhejiang Sci-Tech University | Online detection and recognition method for image gesture actions based on deep learning |
CN109886090A (en) * | 2019-01-07 | 2019-06-14 | Peking University | Video pedestrian re-identification method based on multi-time-scale convolutional neural networks |
CN110110646A (en) * | 2019-04-30 | 2019-08-09 | Zhejiang Sci-Tech University | Gesture image key frame extraction method based on deep learning |
CN110245593A (en) * | 2019-06-03 | 2019-09-17 | Zhejiang Sci-Tech University | Gesture image key frame extraction method based on image similarity |
CN110457524A (en) * | 2019-07-12 | 2019-11-15 | Beijing QIYI Century Science & Technology Co., Ltd. | Model generation method, video classification method and device |
CN110807808A (en) * | 2019-10-14 | 2020-02-18 | Zhejiang Sci-Tech University | Commodity recognition method based on a physics engine and a deep fully convolutional network |
CN111091045A (en) * | 2019-10-25 | 2020-05-01 | Chongqing University of Posts and Telecommunications | Sign language recognition method based on a spatio-temporal attention mechanism |
Non-Patent Citations (3)
Title |
---|
Bao Jiaxin; Tian Qiuhong; Yang Huimin; Chen Yingrou: "Sign language recognition based on skin color segmentation and an improved VGG network", Computer Systems & Applications, no. 06 *
Wang Chenhao: "Research on multi-granularity lip-reading recognition technology", CNKI *
Xie Huaiqi; Le Hongbing: "Video human action recognition based on a channel attention mechanism", Electronic Technology & Software Engineering, no. 04 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||