CN112149504A - Motion video recognition method combining a mixed convolution residual network and attention - Google Patents

Motion video recognition method combining a mixed convolution residual network and attention Download PDF

Info

Publication number
CN112149504A
CN112149504A
Authority
CN
China
Prior art keywords
convolution
layer
attention
feature map
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010849991.6A
Other languages
Chinese (zh)
Other versions
CN112149504B (en)
Inventor
杨慧敏
田秋红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202010849991.6A priority Critical patent/CN112149504B/en
Publication of CN112149504A publication Critical patent/CN112149504A/en
Application granted granted Critical
Publication of CN112149504B publication Critical patent/CN112149504B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Psychiatry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a motion video recognition method combining a mixed convolution residual network and attention. The method comprises the following steps: 1) reading the motion of a person in a motion video and converting the motion video into original video frame images; 2) performing data enhancement on the video frames of the motion video using temporal sampling, random cropping and brightness adjustment to obtain video frame images; 3) constructing an attention module, building mixed convolution blocks with the attention module, cascading the mixed convolution blocks into a mixed convolution residual network model based on the combination of a mixed convolution residual network and attention, and performing spatio-temporal feature learning on the video frame images with this model to obtain key feature maps; 4) classifying the key feature maps with a Softmax classification layer. The invention can preserve the feature information of video frames while increasing network depth, fully fuse spatio-temporal features, improve the correlation of important channel features, and effectively improve the prediction performance of action recognition.

Description

Motion video recognition method combining a mixed convolution residual network and attention
Technical Field
The invention belongs to the technical field of intelligent video analysis, and particularly relates to a motion video recognition method based on the combination of a mixed convolution residual network and an attention mechanism.
Background
Motion recognition has application value in video processing, pattern recognition, virtual reality and other areas, and is one of the important research topics in the field of computer vision. Motion recognition in video is a key issue in video understanding tasks. It requires not only capturing features in the spatial dimension, but also encoding the temporal relationships between many consecutive frames. Therefore, effectively extracting high-resolution spatio-temporal features from motion video is of great significance for improving the accuracy of motion recognition. However, a video is a continuous frame sequence with temporal relationships, each pixel is highly similar to its neighboring pixels, and the spatio-temporal correlation is very strong. The traditional convolutional neural network has excellent feature extraction performance on single images, but cannot extract spatio-temporal features from video.
When the video input is a sequence of images, there are currently three main approaches: (1) 2D CNNs combined with RNN/LSTM, (2) two-stream CNNs, and (3) 3D CNNs. Two-stream CNNs use two independent networks to capture spatial appearance and temporal motion information. Although this method works well, it cannot effectively mix appearance and motion information because the two networks are trained separately. RNN/LSTM is better at processing sequence information and is therefore often combined with a CNN to handle action recognition. However, this type of approach only preserves the top-level features and ignores dependencies in the bottom-level features. Using 3D CNNs to obtain spatio-temporal information is an effective method. However, the 3D CNN model has a huge number of parameters and contains a large amount of redundant spatial data, and training 3D CNNs is a very challenging task. In recent years, many studies have attempted to introduce attention mechanisms from different perspectives to enhance the robustness of behavior recognition. However, stacking attention mechanisms in deep networks can result in repeated dot products, thereby reducing the value of the features.
Disclosure of Invention
In order to solve the problems in the background art, the invention aims to provide a motion recognition method for motion video based on the combination of a mixed convolution residual network and an attention mechanism. An MC-RAN module is designed which, based on a mixed convolution residual network, decouples the 3D convolution into a 2D convolution and a 1D convolution and fuses them with an adaptive spatial attention module M_SS and a channel attention module M_CS respectively, improving the correlation of important channel features and increasing the global correlation of the feature map, thereby improving the performance of action recognition.
The technical scheme adopted by the invention is as follows:
the invention comprises the following steps:
1) reading the motion of a person in the motion video, and converting the motion video into an original video frame image;
2) performing data enhancement on the video frames of the motion video using temporal sampling, random cropping and brightness adjustment respectively, to obtain video frame images;
Step 2) is specifically as follows:
Temporal sampling: for each motion video, 16 consecutive frames are randomly sampled for training; if a video has fewer than 16 consecutive frames, it is played in a loop until 16 frames are obtained;
Random cropping: the original video frame images are resized to 128 × 171 pixels and then randomly cropped to 112 × 112 pixels;
Brightness adjustment: the brightness of the original video frame images is adjusted randomly.
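A minimal sketch of these three augmentations, assuming each clip is supplied as a list of H×W×3 uint8 frames; the function names, the 0.8 to 1.2 brightness range, and the use of OpenCV/NumPy are illustrative choices rather than details given in the patent.

```python
import random
import numpy as np
import cv2

CLIP_LEN, RESIZE_WH, CROP = 16, (171, 128), 112   # cv2.resize takes (width, height) -> 128 x 171 frames

def temporal_sample(frames):
    """Loop the clip until it has at least 16 frames, then pick a random 16-frame window."""
    while len(frames) < CLIP_LEN:
        frames = frames + frames                   # circular playback until 16 frames are available
    start = random.randint(0, len(frames) - CLIP_LEN)
    return frames[start:start + CLIP_LEN]

def random_crop_and_brightness(frames):
    """Resize to 128 x 171, take one random 112 x 112 crop, and apply one brightness gain to the clip."""
    frames = [cv2.resize(f, RESIZE_WH) for f in frames]
    h, w = frames[0].shape[:2]
    top = random.randint(0, h - CROP)
    left = random.randint(0, w - CROP)
    gain = random.uniform(0.8, 1.2)                # illustrative brightness-adjustment range
    out = []
    for f in frames:
        patch = f[top:top + CROP, left:left + CROP]
        out.append(np.clip(patch.astype(np.float32) * gain, 0, 255).astype(np.uint8))
    return out

# usage: clip = random_crop_and_brightness(temporal_sample(list_of_frames))
```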
3) Constructing an attention module, building a mixed convolution block with the attention module, cascading the mixed convolution blocks into a mixed convolution residual network model based on the combination of a mixed convolution residual network and attention, and performing spatio-temporal feature learning on the video frame images with the mixed convolution residual network model to obtain key feature maps;
The mixed convolution block is expressed as:
X_{t+1} = X_t + W(X_t)
where X_t and X_{t+1} represent the input and output of the t-th MC-RAN module; X_t and X_{t+1} have the same feature dimensions, and W represents the mixed convolution residual function with the attention mechanism added;
Step 3) is specifically as follows: the 3D ResNet network structure is selected as the basic network structure, and the original 3D convolution modules in the 3D ResNet structure are replaced by a first convolution layer and four mixed convolution blocks, each mixed convolution block comprising an MC-RAN module and an addition layer. The MC-RAN module comprises a (2+1)D convolution layer, a first batch normalization layer, a first ReLU activation layer, a 3D convolution layer and a second batch normalization layer, where the (2+1)D convolution layer is a 2D convolution layer with the attention module added. The input X_t of the mixed convolution block is fed into the MC-RAN module, the output feature map of the MC-RAN module is added to the input X_t through the addition layer, and the sum, processed by the second ReLU activation layer, becomes the output X_{t+1} of the mixed convolution block. After each mixed convolution block, a 3D max pooling layer is cascaded for down-sampling.
The i-th 3D convolution layer of size N_{i-1} × t × d × d consists of M_i second 2D convolution layers of size N_{i-1} × 1 × d × d and N_i temporal convolution layers of size M_i × t × 1 × 1, where M_i is calculated by the following formula:
[Formula for M_i, given as an image in the original document]
where d represents the width and height dimension parameter of the 3D convolution layer's output feature map, t represents the temporal extent, and ⌊ ⌋ denotes rounding down.
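The M_i formula itself appears only as an image in the source. Assuming it follows the standard (2+1)D parameter-matching rule, where M_i is chosen so that the decomposed 2D-plus-1D pair has roughly the same parameter count as the original N_{i-1} × t × d × d 3D kernel bank, a small helper might look like the sketch below; the exact expression is an assumption, not a transcription of the patent's formula.

```python
import math

def mixed_conv_mid_channels(n_in: int, n_out: int, t: int, d: int) -> int:
    """Intermediate channel count M_i for decomposing an N_{i-1} x t x d x d 3D convolution
    into M_i spatial (1 x d x d) filters followed by N_i temporal (t x 1 x 1) filters,
    keeping the parameter count roughly equal to the original 3D convolution.
    Assumed rule: M_i = floor(t * d^2 * N_{i-1} * N_i / (d^2 * N_{i-1} + t * N_i))."""
    return math.floor(t * d * d * n_in * n_out / (d * d * n_in + t * n_out))

# example: a 64 -> 64 layer with t = 3, d = 3 gives floor(3*9*64*64 / (9*64 + 3*64)) = 144
```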
The (2+1)D convolution layer mainly comprises a first 2D convolution layer, a spatial attention module M_SS, a temporal convolution layer and a channel attention module M_CS connected in cascade; the spatial attention module M_SS and the channel attention module M_CS together form the attention module.
The spatial attention module M_SS obtains the spatial weight map W_SS of the input feature map in the spatial dimension through a third 2D convolution layer; the channel attention module M_CS obtains the channel weight map W_CS of the input feature map in the channel dimension through an added multi-layer perceptron.
The spatial attention module M_SS is constructed as follows: the input feature map F has size C × H × W, where C represents the number of channels of each frame image in the input feature map, and H and W represent the width and height of each frame image. First, global average pooling is used to compress the channel dimension of each frame image in the input feature map, generating a 2D spatial descriptor Z of size 1 × H × W. Then, a third 2D convolution layer convolves the 2D spatial descriptor Z to obtain the target region of interest in the input feature map. Finally, a third batch normalization layer is added after the third 2D convolution layer to perform a dimension transformation on the target region of interest, giving the spatial attention weight map W_SS.
The spatial attention weight map W_SS can be expressed as:
W_SS(F) = BN(σ(f^{7×7}(Avgpool(F))))
where BN() denotes batch normalization, σ() denotes the sigmoid activation function, f^{7×7}() denotes a convolution operation with a 7 × 7 kernel, Avgpool() denotes global average pooling, and F denotes the input feature map;
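A sketch of the spatial attention module M_SS for clip-shaped features (N, C, T, H, W): channel-wise average pooling to a 1 × H × W descriptor per frame, a 7 × 7 2D convolution, a sigmoid and batch normalization, following the W_SS formula above. Treating each frame independently and multiplying the result back onto the feature map are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """W_SS(F) = BN(sigmoid(conv7x7(AvgPool_channel(F)))), computed per frame."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=7, padding=3, bias=False)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, f):                                           # f: (N, C, T, H, W)
        n, c, t, h, w = f.shape
        z = f.mean(dim=1, keepdim=True)                             # channel compression -> (N, 1, T, H, W)
        z = z.permute(0, 2, 1, 3, 4).reshape(n * t, 1, h, w)        # fold frames into the batch dimension
        wss = self.bn(torch.sigmoid(self.conv(z)))                  # (N*T, 1, H, W) spatial weight map
        return wss.reshape(n, t, 1, h, w).permute(0, 2, 1, 3, 4)    # back to (N, 1, T, H, W)

# assumed usage: weighted = f * SpatialAttention()(f)
```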
the channel attention module MCSThe construction method specifically comprises the following steps: when the size of the input feature map Q is C multiplied by H multiplied by W, and C represents the number of channels of each frame image in the input feature map, firstly, carrying out global average pooling operation on the input feature map Q to generate a channel vector Q' with the size of 1 multiplied by C; subsequently, the channel vector Q 'is processed using a multi-layer perceptron to learn weights of the channel vector Q';
the channel vector Q' can be calculated by the following formula:
Figure BDA0002644396770000031
wherein F (i, j) represents a feature map at coordinates (i, j), i represents a pixel point at dimension H, and j represents a pixel point at dimension W;
finally, adding a fourth batch normalization layer behind the multilayer perceptron to perform dimension conversion to obtain a channel attention weight graph WCS
Channel attention weight graph WCSCan be expressed as:
WCS(F)=BN(MLP(Avgpool(F)))=BN(σ(W1((W0Avgpool(F)+b0)+b1)))
wherein MLP () represents a multilayer perceptron with hidden layers, W0And W1Is the weight of MLP () with the size of C/r × C and C × C/r, r is the compression ratio, () is the linear correction unit, b0And b1And bias terms representing MLP () with the sizes of C/r and C, respectively.
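A sketch of the channel attention module M_CS for clip-shaped features: global average pooling to a channel vector, a two-layer perceptron with hidden size C/r, a sigmoid and batch normalization, following the W_CS formula above. The squeeze-and-excitation-style layout (ReLU between the two linear layers) and the broadcastable output shape are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """W_CS(F) = BN(sigmoid(W1 * ReLU(W0 * AvgPool(F) + b0) + b1)), computed per clip."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),   # W0: (C/r) x C
            nn.ReLU(inplace=True),                        # delta(): linear rectification unit
            nn.Linear(channels // reduction, channels),   # W1: C x (C/r)
        )
        self.bn = nn.BatchNorm1d(channels)

    def forward(self, f):                                  # f: (N, C, T, H, W)
        q = f.mean(dim=(2, 3, 4))                          # global average pooling -> (N, C) channel vector
        wcs = self.bn(torch.sigmoid(self.mlp(q)))          # channel weight map, (N, C)
        return wcs.view(f.size(0), f.size(1), 1, 1, 1)     # broadcastable over T, H, W

# assumed usage: weighted = f * ChannelAttention(f.size(1))(f)
```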
4) The key feature maps are classified using a Softmax classification layer.
Step 4) is specifically as follows: after the video frame images pass through the four MC-RAN modules, the spatio-temporal features in the video frame images are fused, the mixed convolution residual network model obtains the key features, and the key feature maps are input into a Softmax layer for classification.
The input feature map of the first MC-RAN module is the output feature map obtained after the video frame images from step 2) pass through the first convolution layer, and the input feature map of each subsequent MC-RAN module is the output feature map of the previous MC-RAN module after passing through the 3D max pooling layer.
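Putting the pieces together, a sketch of the overall model: a first convolution layer, four mixed convolution blocks each followed by 3D max pooling, global pooling, and a Softmax classification layer. MCRANBlock, SpatialAttention and ChannelAttention refer to the sketches above; the channel widths, kernel sizes and the 1 × 1 × 1 projection convolutions are illustrative placeholders, since the patent's exact layer table is provided only as an image.

```python
import torch
import torch.nn as nn

class MCRANNet(nn.Module):
    """Stem conv -> four mixed convolution blocks (each followed by 3D max pooling)
    -> global average pooling -> Softmax classifier. Widths are illustrative."""
    def __init__(self, num_classes: int = 101, widths=(64, 64, 128, 256, 512)):
        super().__init__()
        self.stem = nn.Conv3d(3, widths[0], kernel_size=(3, 7, 7),
                              stride=(1, 2, 2), padding=(1, 3, 3))         # spatial down-sampling at conv1
        layers, in_ch = [], widths[0]
        for out_ch in widths[1:]:
            layers += [
                nn.Conv3d(in_ch, out_ch, kernel_size=1),                   # channel projection (placeholder)
                MCRANBlock(out_ch, out_ch, SpatialAttention(), ChannelAttention(out_ch)),
                nn.MaxPool3d(kernel_size=2, stride=2),                     # 3D max pooling after each block
            ]
            in_ch = out_ch
        self.blocks = nn.Sequential(*layers)
        self.head = nn.Linear(in_ch, num_classes)

    def forward(self, x):                        # x: (N, 3, 16, 112, 112)
        x = self.stem(x)
        x = self.blocks(x)
        x = x.mean(dim=(2, 3, 4))                # global spatio-temporal average pooling
        return self.head(x)                      # class scores; Softmax is applied at classification time

    def classify(self, x):
        """Softmax classification layer applied to the key features."""
        return torch.softmax(self.forward(x), dim=1)
```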
The invention has the beneficial effects that:
1) The MC-RAN module is designed. Based on a mixed convolution residual network, the 2D and 1D convolutions obtained by decoupling the 3D convolution are fused with the adaptive spatial attention module and the channel attention module respectively, so that spatio-temporal features are fully fused, the correlation of important channel features is improved, and the global correlation of the feature map is increased, thereby improving the performance of behavior recognition.
2) The mixed convolution residual network model provided by the invention can preserve feature information while increasing the network depth. Comparative experiments were carried out on the common datasets UCF101 and HMDB51; after pre-training on the Kinetics dataset, the Top-1 accuracies on the UCF101 and HMDB51 test sets reach 96.8% and 74.8% respectively.
Drawings
FIG. 1 is an example of a partial data set according to an embodiment of the present invention;
FIG. 2 is a block diagram of an embodiment of the present invention;
FIG. 3 is a block diagram of a spatial attention module according to an embodiment of the present invention;
FIG. 4 is a block diagram of a channel attention module according to an embodiment of the present invention;
FIG. 5 is a block cascade diagram of a hybrid convolution according to an embodiment of the present invention;
FIG. 6 shows feature map visualizations of an embodiment of the present invention; (a), (b), (c) and (d) are original video frames; (e), (f), (g) and (h) are the corresponding feature maps.
Detailed Description
The invention is further illustrated by the following figures and examples.
The invention provides a motion video recognition method combining a mixed convolution residual network and attention, using the open-source UCF101 dataset as the experimental dataset; part of the dataset is shown in FIG. 1. The figure shows video frame images converted from one motion video; the images are saved in jpg format with a final picture size of 320 × 240.
The embodiment of the invention is as follows:
step 1: the motion video is read by using a VideoCapture function in Opencv, and the read motion video is converted into a video frame image of the motion video, where a part of the video frame image of the motion video is shown in fig. 1.
Step 2: The data are first preprocessed for motion recognition, and the model is then pre-trained on the Kinetics dataset rather than trained from scratch, in order to improve its accuracy.
2.1) The data preprocessing for motion recognition is as follows:
Data enhancement is performed on the video frames of the motion video using temporal sampling, random cropping and brightness adjustment respectively, to obtain video frame images;
Temporal sampling: for each motion video, 16 consecutive frames are randomly sampled for training; if a video has fewer than 16 consecutive frames, it is played in a loop until 16 frames are obtained;
Random cropping: the original video frame images are resized to 128 × 171 pixels and then randomly cropped to 112 × 112 pixels;
Brightness adjustment: the brightness of the original video frame images is adjusted randomly.
2.2) The model pre-training process for motion recognition is as follows:
The preprocessed video frame images are input into the mixed convolution residual network model for feature extraction in the spatial and channel dimensions; each input clip has size 16 × 112 × 112 × 3, and the output is a category label. The loss value is optimized with stochastic gradient descent (SGD), with an initial learning rate of 0.01 that is divided by 10 when the validation loss saturates. The momentum coefficient is 0.9, the dropout coefficient is 0.5, and the weight decay is 10^-3. Training is accelerated with batch normalization and runs on a server with 8 Tesla V100 GPUs, with a batch size of 8 per GPU and a total batch size of 64.
Step 3: The attention module is constructed. The attention mechanism in the attention module is used to focus on the positions suggested by prior knowledge, remove the interference of background and noise on motion recognition, and automatically assign different attention to different positions of the input feature map according to that prior knowledge.
A mixed convolution block is built with the attention module, the mixed convolution blocks are cascaded to construct a mixed convolution residual network model based on the combination of a mixed convolution residual network and attention, and spatio-temporal feature learning is performed on the video frame images with this model to obtain key feature maps.
The mixed convolution block is expressed as:
X_{t+1} = X_t + W(X_t)
where X_t and X_{t+1} represent the input and output of the t-th MC-RAN module; X_t and X_{t+1} have the same feature dimensions, and W represents the mixed convolution residual function with the attention mechanism added.
Step 3) is specifically as follows: the 3D ResNet network structure is selected as the basic network structure, and the original 3D convolution modules in the 3D ResNet structure are replaced by a first convolution layer and four mixed convolution blocks, each comprising an MC-RAN module and an addition layer. The MC-RAN module comprises a (2+1)D convolution layer, a first batch normalization layer, a first ReLU activation layer, a 3D convolution layer and a second batch normalization layer connected in sequence. The input X_t of the mixed convolution block is fed into the MC-RAN module, the output feature map of the MC-RAN module is added to the input X_t through the addition layer, and the sum, processed by the second ReLU activation layer, becomes the output X_{t+1} of the mixed convolution block; each mixed convolution block is followed by a cascaded 3D max pooling layer for down-sampling.
a. The i-th 3D convolution layer of size N_{i-1} × t × d × d consists of M_i second 2D convolution layers of size N_{i-1} × 1 × d × d and N_i temporal convolution layers of size M_i × t × 1 × 1, where M_i is calculated by the following formula:
[Formula for M_i, given as an image in the original document]
where d represents the width and height dimension parameter of the 3D convolution layer's output feature map, t represents the temporal extent, and ⌊ ⌋ denotes rounding down;
b. Spatial down-sampling is performed at the first convolution layer conv1 with a stride of 1 × 2 × 2. In the third, fourth and fifth mixed convolution blocks conv3_1, conv4_1 and conv5_1, the first 2D convolution layer and the temporal convolution layer of the (2+1)D convolution perform spatio-temporal down-sampling with strides of 1 × 2 × 2 and 2 × 1 × 1 respectively. Table 1 shows the network structure of the first convolution layer and the mixed convolution blocks.
Table 1 shows the network layer structure of the first convolution layer and the hybrid convolution block.
[Table 1 is provided as an image in the original document.]
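As an illustration of the stride scheme just described (the full layer table exists only as an image), a hedged sketch of the down-sampling (2+1)D pair used in conv3_1, conv4_1 and conv5_1 and of the first convolution layer; channel counts and kernel sizes are placeholders, not values taken from Table 1.

```python
import torch.nn as nn

def downsampling_2plus1d(in_ch: int, mid_ch: int, out_ch: int) -> nn.Sequential:
    """(2+1)D pair for conv3_1 / conv4_1 / conv5_1: the 2D spatial convolution
    down-samples with stride 1x2x2 and the 1D temporal convolution with stride 2x1x1."""
    return nn.Sequential(
        nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3),
                  stride=(1, 2, 2), padding=(0, 1, 1)),    # spatial down-sampling
        nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1),
                  stride=(2, 1, 1), padding=(1, 0, 0)),    # temporal down-sampling
    )

# first convolution layer conv1: spatial down-sampling only, stride 1x2x2 (kernel size illustrative)
conv1 = nn.Conv3d(3, 64, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3))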
c. The mixed convolution block cascade is shown in FIG. 5. The (2+1)D convolution layer is a 2D convolution layer with the attention module added; it mainly comprises a first 2D convolution layer, a spatial attention module M_SS, a temporal convolution layer and a channel attention module M_CS connected in cascade. The attention module applies attention to the spatial and channel dimensions of the input feature map respectively, and the spatial attention module M_SS and the channel attention module M_CS together constitute the attention module.
The spatial attention module M_SS obtains the spatial weight map W_SS of the input feature map in the spatial dimension through a third 2D convolution kernel; the channel attention module M_CS obtains the channel weight map W_CS of the input feature map in the channel dimension through an added multi-layer perceptron.
The spatial attention module M_SS is constructed as follows: the input feature map F has size C × H × W, where C represents the number of channels of each frame image in the input feature map, and H and W represent the width and height of each frame image. First, global average pooling compresses the channel dimension of each frame image, generating a 2D spatial descriptor Z of size 1 × H × W, whose element at coordinates (i, j) is calculated as:
Z(i, j) = (1 / C) · Σ_{k=1}^{C} F_{i,j}(k)
where F_{i,j}(k) denotes the feature value of the k-th channel at coordinates (i, j), i indexes pixel positions along the H dimension and j indexes pixel positions along the W dimension. Then, a third 2D convolution layer with a 7 × 7 kernel convolves the 2D spatial descriptor to obtain the target region of interest in the input feature map. Finally, a third batch normalization layer is added after the third 2D convolution layer to perform a dimension transformation on the target region of interest, giving the spatial attention weight map W_SS.
The spatial attention weight map W_SS can be expressed as:
W_SS(F) = BN(σ(f^{7×7}(Avgpool(F))))
where BN() denotes batch normalization, σ() denotes the sigmoid activation function, f^{7×7}() denotes a convolution operation with a 7 × 7 kernel, Avgpool() denotes global average pooling, and F denotes the input feature map.
The channel attention module M_CS is constructed as follows: the input feature map Q has size C × H × W, where C represents the number of channels of each frame image in the input feature map. First, a global average pooling operation is performed on the input feature map Q to generate a channel vector Q' of size 1 × C. The channel vector Q' is then processed by a multi-layer perceptron FC with a hidden layer to learn its weights, which serve as the channel correlations. To limit the complexity of the channel attention module and save parameters, the size of the hidden activation layer is set to 1 × 1 × C/r, where r is the compression ratio, set to 16.
The channel vector Q' can be calculated by the following formula:
Q' = (1 / (H × W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} F(i, j)
where F(i, j) denotes the feature map at coordinates (i, j), i indexes pixel positions along the H dimension and j indexes pixel positions along the W dimension.
Finally, a fourth batch normalization layer is added after the multi-layer perceptron to perform a dimension transformation, giving the channel attention weight map W_CS.
The channel attention weight map W_CS can be expressed as:
W_CS(F) = BN(MLP(Avgpool(F))) = BN(σ(W_1 · δ(W_0 · Avgpool(F) + b_0) + b_1))
where MLP() denotes a multi-layer perceptron with a hidden layer, W_0 and W_1 are the MLP weights with sizes C/r × C and C × C/r respectively, σ() is the sigmoid activation function, δ() is the linear rectification unit, and b_0 and b_1 are the MLP bias terms with sizes C/r and C respectively.
Step 4: After the video frame images pass through the first convolution layer and the four mixed convolution blocks, their spatio-temporal features are fused and the mixed convolution residual network model obtains the key features; the feature map visualization after adding the attention module is shown in FIG. 6. The key feature maps are input into a Softmax layer for classification. Each video in the validation set is evaluated with the trained network and a corresponding category label is obtained. After training, the proposed mixed convolution residual network model is compared with different network models; the experimental results are shown in Table 2 and show that the recognition accuracy of both Top-1 and Top-5 increases without increasing the number of parameters of the mixed convolution residual network model.
Table 2 shows the recognition results of the hybrid convolution residual network model compared to other models.
Network model | Parameters | Top-1 recognition rate (%) | Top-5 recognition rate (%) | Average recognition rate (%)
ResNet [39] | 63.72M | 60.1 | 81.9 | 71.0
(2+1)D-ResNet [12] | 63.88M | 66.8 | 88.1 | 77.45
MC-ResNet [28] | 63.88M | 67.3 | 89.2 | 78.25
RAN [26] | 63.97M | 61.7 | 83.2 | 72.45
(2+1)D-RAN | 63.98M | 67.8 | 89.3 | 78.55
MC-RAN | 63.98M | 68.8 | 89.9 | 79.35
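A sketch of the evaluation step described in Step 4: a preprocessed 16-frame clip from a validation video is passed through the trained network and the most probable class is taken as the predicted category label. The preprocessing and the model are assumed to follow the earlier sketches.

```python
import numpy as np
import torch

@torch.no_grad()
def predict_label(model, clip_frames):
    """clip_frames: 16 preprocessed 112x112x3 uint8 frames from one validation video."""
    model.eval()
    clip = np.stack(clip_frames).astype(np.float32) / 255.0         # (16, 112, 112, 3)
    clip = torch.from_numpy(clip).permute(3, 0, 1, 2).unsqueeze(0)  # (1, 3, 16, 112, 112)
    probs = torch.softmax(model(clip), dim=1)                        # Softmax over class scores
    return int(probs.argmax(dim=1))                                  # predicted category label
```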
The foregoing detailed description is intended to illustrate and not to limit the invention; any changes and modifications that fall within the spirit and scope of the invention are intended to be covered by the appended claims.

Claims (6)

1. A motion video recognition method combining a mixed convolution residual network and attention, characterized in that the method comprises the following steps:
1) reading the motion of a person in the motion video, and converting the motion video into an original video frame image;
2) performing data enhancement on the video frames of the motion video using temporal sampling, random cropping and brightness adjustment respectively, to obtain video frame images;
3) constructing an attention module, building a mixed convolution block with the attention module, cascading the mixed convolution blocks into a mixed convolution residual network model based on the combination of a mixed convolution residual network and attention, and performing spatio-temporal feature learning on the video frame images with the mixed convolution residual network model to obtain key feature maps;
the mixed convolution block is expressed as:
X_{t+1} = X_t + W(X_t)
where X_t and X_{t+1} represent the input and output of the t-th MC-RAN module; X_t and X_{t+1} have the same feature dimensions, and W represents the mixed convolution residual function with the attention mechanism added;
4) the key feature maps are classified using a Softmax classification layer.
2. The motion video recognition method combining a mixed convolution residual network and attention according to claim 1, characterized in that step 2) is specifically as follows:
temporal sampling: for each motion video, 16 consecutive frames are randomly sampled for training; if a video has fewer than 16 consecutive frames, it is played in a loop until 16 frames are obtained;
random cropping: the original video frame images are resized to 128 × 171 pixels and then randomly cropped to 112 × 112 pixels;
brightness adjustment: the brightness of the original video frame images is adjusted randomly.
3. The motion video recognition method combining a mixed convolution residual network and attention according to claim 1, characterized in that:
step 3) is specifically as follows: the 3D ResNet network structure is selected as the basic network structure, and the original 3D convolution modules in the 3D ResNet structure are replaced by a first convolution layer and four mixed convolution blocks, each mixed convolution block comprising an MC-RAN module and an addition layer; the MC-RAN module comprises a (2+1)D convolution layer, a first batch normalization layer, a first ReLU activation layer, a 3D convolution layer and a second batch normalization layer, where the (2+1)D convolution layer is a 2D convolution layer with the attention module added; the input X_t of the mixed convolution block is fed into the MC-RAN module, the output feature map of the MC-RAN module is added to the input X_t through the addition layer, and the sum, processed by the second ReLU activation layer, becomes the output X_{t+1} of the mixed convolution block; after each mixed convolution block, a 3D max pooling layer is cascaded for down-sampling;
the i-th 3D convolution layer of size N_{i-1} × t × d × d consists of M_i second 2D convolution layers of size N_{i-1} × 1 × d × d and N_i temporal convolution layers of size M_i × t × 1 × 1, where M_i is calculated by the following formula:
[Formula for M_i, given as an image in the original document]
where d represents the width and height dimension parameter of the 3D convolution layer's output feature map, t represents the temporal extent, and ⌊ ⌋ denotes rounding down.
4. The motion video recognition method combining a mixed convolution residual network and attention according to claim 3, characterized in that:
the (2+1)D convolution layer mainly comprises a first 2D convolution layer, a spatial attention module M_SS, a temporal convolution layer and a channel attention module M_CS connected in cascade; the spatial attention module M_SS and the channel attention module M_CS together form the attention module;
the spatial attention module M_SS obtains the spatial weight map W_SS of the input feature map in the spatial dimension through a third 2D convolution layer; the channel attention module M_CS obtains the channel weight map W_CS of the input feature map in the channel dimension through an added multi-layer perceptron;
the spatial attention module M_SS is constructed as follows: the input feature map F has size C × H × W, where C represents the number of channels of each frame image in the input feature map, and H and W represent the width and height of each frame image; first, global average pooling is used to compress the channel dimension of each frame image in the input feature map, generating a 2D spatial descriptor Z of size 1 × H × W; then, a third 2D convolution layer convolves the 2D spatial descriptor Z to obtain the target region of interest in the input feature map; finally, a third batch normalization layer is added after the third 2D convolution layer to perform a dimension transformation on the target region of interest, giving the spatial attention weight map W_SS;
the spatial attention weight map W_SS can be expressed as:
W_SS(F) = BN(σ(f^{7×7}(Avgpool(F))))
where BN() denotes batch normalization, σ() denotes the sigmoid activation function, f^{7×7}() denotes a convolution operation with a 7 × 7 kernel, Avgpool() denotes global average pooling, and F denotes the input feature map;
the channel attention module M_CS is constructed as follows: the input feature map Q has size C × H × W, where C represents the number of channels of each frame image in the input feature map; first, a global average pooling operation is performed on the input feature map Q to generate a channel vector Q' of size 1 × C; the channel vector Q' is then processed by a multi-layer perceptron to learn its weights;
the channel vector Q' can be calculated by the following formula:
Q' = (1 / (H × W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} F(i, j)
where F(i, j) denotes the feature map at coordinates (i, j), i indexes pixel positions along the H dimension and j indexes pixel positions along the W dimension;
finally, a fourth batch normalization layer is added after the multi-layer perceptron to perform a dimension transformation, giving the channel attention weight map W_CS;
the channel attention weight map W_CS can be expressed as:
W_CS(F) = BN(MLP(Avgpool(F))) = BN(σ(W_1 · δ(W_0 · Avgpool(F) + b_0) + b_1))
where MLP() denotes a multi-layer perceptron with a hidden layer, W_0 and W_1 are the MLP weights with sizes C/r × C and C × C/r, r is the compression ratio, δ() is the linear rectification unit, and b_0 and b_1 are the MLP bias terms with sizes C/r and C respectively.
5. The motion video recognition method combining a mixed convolution residual network and attention according to claim 1, characterized in that step 4) is specifically as follows: after the video frame images pass through the four MC-RAN modules, the spatio-temporal features in the video frame images are fused, the mixed convolution residual network model obtains the key features, and the key feature maps are input into a Softmax layer for classification.
6. The motion video recognition method combining a mixed convolution residual network and attention according to claim 1, characterized in that the input feature map of the first MC-RAN module is the output feature map obtained after the video frame images from step 2) pass through the first convolution layer, and the input feature map of each subsequent MC-RAN module is the output feature map of the previous MC-RAN module after passing through the 3D max pooling layer.
CN202010849991.6A 2020-08-21 2020-08-21 Motion video identification method combining mixed convolution residual network and attention Active CN112149504B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010849991.6A CN112149504B (en) 2020-08-21 2020-08-21 Motion video identification method combining mixed convolution residual network and attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010849991.6A CN112149504B (en) 2020-08-21 2020-08-21 Motion video identification method combining mixed convolution residual network and attention

Publications (2)

Publication Number Publication Date
CN112149504A true CN112149504A (en) 2020-12-29
CN112149504B CN112149504B (en) 2024-03-26

Family

ID=73889023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010849991.6A Active CN112149504B (en) 2020-08-21 2020-08-21 Motion video identification method combining mixed convolution residual network and attention

Country Status (1)

Country Link
CN (1) CN112149504B (en)

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766172A (en) * 2021-01-21 2021-05-07 北京师范大学 Face continuous expression recognition method based on time sequence attention mechanism
CN112800957A (en) * 2021-01-28 2021-05-14 内蒙古科技大学 Video pedestrian re-identification method and device, electronic equipment and storage medium
CN112818843A (en) * 2021-01-29 2021-05-18 山东大学 Video behavior identification method and system based on channel attention guide time modeling
CN112883264A (en) * 2021-02-09 2021-06-01 联想(北京)有限公司 Recommendation method and device
CN113128395A (en) * 2021-04-16 2021-07-16 重庆邮电大学 Video motion recognition method and system based on hybrid convolution and multi-level feature fusion model
CN113139530A (en) * 2021-06-21 2021-07-20 城云科技(中国)有限公司 Method and device for detecting sleep post behavior and electronic equipment thereof
CN113160117A (en) * 2021-02-04 2021-07-23 成都信息工程大学 Three-dimensional point cloud target detection method under automatic driving scene
CN113283338A (en) * 2021-05-25 2021-08-20 湖南大学 Method, device and equipment for identifying driving behavior of driver and readable storage medium
CN113288162A (en) * 2021-06-03 2021-08-24 北京航空航天大学 Short-term electrocardiosignal atrial fibrillation automatic detection system based on self-adaptive attention mechanism
CN113343760A (en) * 2021-04-29 2021-09-03 暖屋信息科技(苏州)有限公司 Human behavior recognition method based on multi-scale characteristic neural network
CN113468531A (en) * 2021-07-15 2021-10-01 杭州电子科技大学 Malicious code classification method based on deep residual error network and mixed attention mechanism
CN113673559A (en) * 2021-07-14 2021-11-19 南京邮电大学 Video character space-time feature extraction method based on residual error network
CN113837263A (en) * 2021-09-18 2021-12-24 浙江理工大学 Gesture image classification method based on feature fusion attention module and feature selection
CN113850135A (en) * 2021-08-24 2021-12-28 中国船舶重工集团公司第七0九研究所 Dynamic gesture recognition method and system based on time shift frame
CN113850182A (en) * 2021-09-23 2021-12-28 浙江理工大学 Action identification method based on DAMR-3 DNet
CN114037930A (en) * 2021-10-18 2022-02-11 苏州大学 Video action recognition method based on space-time enhanced network
CN114140654A (en) * 2022-01-27 2022-03-04 苏州浪潮智能科技有限公司 Image action recognition method and device and electronic equipment
CN114783053A (en) * 2022-03-24 2022-07-22 武汉工程大学 Behavior identification method and system based on space attention and grouping convolution
CN114842542A (en) * 2022-05-31 2022-08-02 中国矿业大学 Facial action unit identification method and device based on self-adaptive attention and space-time correlation
CN115035605A (en) * 2022-08-10 2022-09-09 广东履安实业有限公司 Action recognition method, device and equipment based on deep learning and storage medium
CN115049969A (en) * 2022-08-15 2022-09-13 山东百盟信息技术有限公司 Poor video detection method for improving YOLOv3 and BiConvLSTM
CN116304984A (en) * 2023-03-14 2023-06-23 烟台大学 Multi-modal intention recognition method and system based on contrast learning
CN116416479A (en) * 2023-06-06 2023-07-11 江西理工大学南昌校区 Mineral classification method based on deep convolution fusion of multi-scale image features

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886225A (en) * 2019-02-27 2019-06-14 浙江理工大学 A kind of image gesture motion on-line checking and recognition methods based on deep learning
CN109886090A (en) * 2019-01-07 2019-06-14 北京大学 A kind of video pedestrian recognition methods again based on Multiple Time Scales convolutional neural networks
CN110110646A (en) * 2019-04-30 2019-08-09 浙江理工大学 A kind of images of gestures extraction method of key frame based on deep learning
CN110245593A (en) * 2019-06-03 2019-09-17 浙江理工大学 A kind of images of gestures extraction method of key frame based on image similarity
CN110457524A (en) * 2019-07-12 2019-11-15 北京奇艺世纪科技有限公司 Model generating method, video classification methods and device
CN110807808A (en) * 2019-10-14 2020-02-18 浙江理工大学 Commodity identification method based on physical engine and deep full convolution network
CN111091045A (en) * 2019-10-25 2020-05-01 重庆邮电大学 Sign language identification method based on space-time attention mechanism

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886090A (en) * 2019-01-07 2019-06-14 北京大学 A kind of video pedestrian recognition methods again based on Multiple Time Scales convolutional neural networks
CN109886225A (en) * 2019-02-27 2019-06-14 浙江理工大学 A kind of image gesture motion on-line checking and recognition methods based on deep learning
CN110110646A (en) * 2019-04-30 2019-08-09 浙江理工大学 A kind of images of gestures extraction method of key frame based on deep learning
CN110245593A (en) * 2019-06-03 2019-09-17 浙江理工大学 A kind of images of gestures extraction method of key frame based on image similarity
CN110457524A (en) * 2019-07-12 2019-11-15 北京奇艺世纪科技有限公司 Model generating method, video classification methods and device
CN110807808A (en) * 2019-10-14 2020-02-18 浙江理工大学 Commodity identification method based on physical engine and deep full convolution network
CN111091045A (en) * 2019-10-25 2020-05-01 重庆邮电大学 Sign language identification method based on space-time attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
包嘉欣; 田秋红; 杨慧敏; 陈影柔: "Sign language recognition based on skin color segmentation and an improved VGG network" (基于肤色分割与改进VGG网络的手语识别), 计算机***应用, no. 06
王晨浩: "Research on multi-granularity lip reading recognition technology" (多粒度唇语识别技术研究), CNKI
解怀奇; 乐红兵: "Video human action recognition based on a channel attention mechanism" (基于通道注意力机制的视频人体行为识别), 电子技术与软件工程, no. 04

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766172A (en) * 2021-01-21 2021-05-07 北京师范大学 Face continuous expression recognition method based on time sequence attention mechanism
CN112766172B (en) * 2021-01-21 2024-02-02 北京师范大学 Facial continuous expression recognition method based on time sequence attention mechanism
CN112800957A (en) * 2021-01-28 2021-05-14 内蒙古科技大学 Video pedestrian re-identification method and device, electronic equipment and storage medium
CN112818843A (en) * 2021-01-29 2021-05-18 山东大学 Video behavior identification method and system based on channel attention guide time modeling
CN113160117A (en) * 2021-02-04 2021-07-23 成都信息工程大学 Three-dimensional point cloud target detection method under automatic driving scene
CN112883264A (en) * 2021-02-09 2021-06-01 联想(北京)有限公司 Recommendation method and device
CN113128395A (en) * 2021-04-16 2021-07-16 重庆邮电大学 Video motion recognition method and system based on hybrid convolution and multi-level feature fusion model
CN113128395B (en) * 2021-04-16 2022-05-20 重庆邮电大学 Video action recognition method and system based on hybrid convolution multistage feature fusion model
CN113343760A (en) * 2021-04-29 2021-09-03 暖屋信息科技(苏州)有限公司 Human behavior recognition method based on multi-scale characteristic neural network
CN113283338A (en) * 2021-05-25 2021-08-20 湖南大学 Method, device and equipment for identifying driving behavior of driver and readable storage medium
CN113288162B (en) * 2021-06-03 2022-06-28 北京航空航天大学 Short-term electrocardiosignal atrial fibrillation automatic detection system based on self-adaptive attention mechanism
CN113288162A (en) * 2021-06-03 2021-08-24 北京航空航天大学 Short-term electrocardiosignal atrial fibrillation automatic detection system based on self-adaptive attention mechanism
CN113139530A (en) * 2021-06-21 2021-07-20 城云科技(中国)有限公司 Method and device for detecting sleep post behavior and electronic equipment thereof
CN113139530B (en) * 2021-06-21 2021-09-03 城云科技(中国)有限公司 Method and device for detecting sleep post behavior and electronic equipment thereof
CN113673559A (en) * 2021-07-14 2021-11-19 南京邮电大学 Video character space-time feature extraction method based on residual error network
CN113673559B (en) * 2021-07-14 2023-08-25 南京邮电大学 Video character space-time characteristic extraction method based on residual error network
CN113468531A (en) * 2021-07-15 2021-10-01 杭州电子科技大学 Malicious code classification method based on deep residual error network and mixed attention mechanism
CN113850135A (en) * 2021-08-24 2021-12-28 中国船舶重工集团公司第七0九研究所 Dynamic gesture recognition method and system based on time shift frame
CN113837263A (en) * 2021-09-18 2021-12-24 浙江理工大学 Gesture image classification method based on feature fusion attention module and feature selection
CN113837263B (en) * 2021-09-18 2023-09-26 浙江理工大学 Gesture image classification method based on feature fusion attention module and feature selection
CN113850182A (en) * 2021-09-23 2021-12-28 浙江理工大学 Action identification method based on DAMR-3 DNet
CN114037930A (en) * 2021-10-18 2022-02-11 苏州大学 Video action recognition method based on space-time enhanced network
CN114037930B (en) * 2021-10-18 2022-07-12 苏州大学 Video action recognition method based on space-time enhanced network
CN114140654A (en) * 2022-01-27 2022-03-04 苏州浪潮智能科技有限公司 Image action recognition method and device and electronic equipment
CN114783053A (en) * 2022-03-24 2022-07-22 武汉工程大学 Behavior identification method and system based on space attention and grouping convolution
CN114842542A (en) * 2022-05-31 2022-08-02 中国矿业大学 Facial action unit identification method and device based on self-adaptive attention and space-time correlation
CN114842542B (en) * 2022-05-31 2023-06-13 中国矿业大学 Facial action unit identification method and device based on self-adaptive attention and space-time correlation
CN115035605A (en) * 2022-08-10 2022-09-09 广东履安实业有限公司 Action recognition method, device and equipment based on deep learning and storage medium
CN115035605B (en) * 2022-08-10 2023-04-07 广东履安实业有限公司 Action recognition method, device and equipment based on deep learning and storage medium
CN115049969A (en) * 2022-08-15 2022-09-13 山东百盟信息技术有限公司 Poor video detection method for improving YOLOv3 and BiConvLSTM
CN116304984A (en) * 2023-03-14 2023-06-23 烟台大学 Multi-modal intention recognition method and system based on contrast learning
CN116416479A (en) * 2023-06-06 2023-07-11 江西理工大学南昌校区 Mineral classification method based on deep convolution fusion of multi-scale image features
CN116416479B (en) * 2023-06-06 2023-08-29 江西理工大学南昌校区 Mineral classification method based on deep convolution fusion of multi-scale image features

Also Published As

Publication number Publication date
CN112149504B (en) 2024-03-26

Similar Documents

Publication Publication Date Title
CN112149504B (en) Motion video identification method combining mixed convolution residual network and attention
Kim et al. Fully deep blind image quality predictor
CN108229338B (en) Video behavior identification method based on deep convolution characteristics
Hara et al. Learning spatio-temporal features with 3d residual networks for action recognition
CN111639692A (en) Shadow detection method based on attention mechanism
CN112257572B (en) Behavior identification method based on self-attention mechanism
CN114596520A (en) First visual angle video action identification method and device
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
CN113920581A (en) Method for recognizing motion in video by using space-time convolution attention network
CN112507920A (en) Examination abnormal behavior identification method based on time displacement and attention mechanism
CN115147456A (en) Target tracking method based on time sequence adaptive convolution and attention mechanism
Hongmeng et al. A detection method for deepfake hard compressed videos based on super-resolution reconstruction using CNN
CN117011342A (en) Attention-enhanced space-time transducer vision single-target tracking method
CN112149662A (en) Multi-mode fusion significance detection method based on expansion volume block
CN113344110B (en) Fuzzy image classification method based on super-resolution reconstruction
CN113850182A (en) Action identification method based on DAMR-3 DNet
CN111539434B (en) Infrared weak and small target detection method based on similarity
CN112818840A (en) Unmanned aerial vehicle online detection system and method
Zhao et al. Multi-layer fusion neural network for deepfake detection
CN116797640A (en) Depth and 3D key point estimation method for intelligent companion line inspection device
CN115424051B (en) Panoramic stitching image quality evaluation method
CN114612305B (en) Event-driven video super-resolution method based on stereogram modeling
CN115527253A (en) Attention mechanism-based lightweight facial expression recognition method and system
CN110188706B (en) Neural network training method and detection method based on character expression in video for generating confrontation network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant