CN111709306B - Double-flow network behavior identification method based on multilevel space-time feature fusion enhancement - Google Patents

Double-flow network behavior identification method based on multilevel space-time feature fusion enhancement

Info

Publication number
CN111709306B
CN111709306B
Authority
CN
China
Prior art keywords
network
space
features
time
module
Prior art date
Legal status
Active
Application number
CN202010441559.3A
Other languages
Chinese (zh)
Other versions
CN111709306A (en)
Inventor
孔军
王圣全
蒋敏
Current Assignee
Jiangnan University
Original Assignee
Jiangnan University
Priority date
Filing date
Publication date
Application filed by Jiangnan University filed Critical Jiangnan University
Priority to CN202010441559.3A priority Critical patent/CN111709306B/en
Publication of CN111709306A publication Critical patent/CN111709306A/en
Application granted
Publication of CN111709306B publication Critical patent/CN111709306B/en


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/254 - Fusion techniques of classification results, e.g. of results related to same input data
    • G06F 18/256 - Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks


Abstract

A two-stream network behavior recognition method based on multi-level spatio-temporal feature fusion enhancement. The method adopts a network architecture built on a spatio-temporal two-stream network, called the multi-level spatio-temporal feature fusion enhancement network. Because the traditional two-stream network fuses only the class probability distributions of the two streams at its final layer, the contribution of shallow features is ignored and the complementary characteristics of the two streams cannot be fully exploited; to address this, the invention provides a multi-level spatio-temporal feature fusion module that captures mixed features at multiple depth levels through spatio-temporal feature fusion modules placed at different depth levels of the two streams, so as to make full use of the two-stream network. Furthermore, treating all features equally in the network weakens the effect of the features that contribute significantly to classification, so the invention introduces a grouping enhanced attention module that automatically enhances the saliency of the effective regions and channels of the features. Finally, the invention further improves the robustness of the behavior recognition model by aggregating the classification results of the two-stream network and of the feature fusion stream.

Description

Double-flow network behavior identification method based on multilevel space-time feature fusion enhancement
Technical Field
The invention belongs to the field of machine vision, and particularly relates to a double-flow network behavior recognition method based on multi-level space-time feature fusion enhancement.
Background
Action recognition has become an active field in computer vision and is widely applied in video surveillance, violence detection, human-computer interaction and other areas. Video action recognition aims to mine the key features that express the target action represented by a video. Compared with static images, videos contain rich motion information; however, the diversity of action scenes still makes the extraction of effective features challenging. Therefore, taking video as the research object, the invention addresses the problems faced in extracting spatial and temporal features from video and proposes a distinctive feature fusion method and attention method to effectively extract discriminative features for behavior recognition.
Currently, video-oriented behavior recognition mainly uses two-stream networks, and this line of work is developing rapidly. In a two-stream architecture, appearance information and motion information are captured by training separate convolutional networks on the appearance frames and on stacks of optical flow, and the classification results of the two convolutional networks are finally fused at the score level. However, conventional two-stream networks still face the following problems: (1) how can the information captured separately by the two streams be used effectively? (2) how can the captured features be refined effectively, given that treating every region and channel of the features equally weakens the effect of the regions and channels that are useful for classification? (3) how can the acquired spatial information and temporal information be fused effectively?
Based on the above considerations, the invention provides a two-stream network behavior recognition method based on multi-level spatio-temporal feature fusion enhancement. First, the proposed spatio-temporal feature fusion module fuses the features of sub-modules at different depth levels of the two-stream network to extract mixed features at multiple depth levels. Second, the extracted mixed features are further refined by the proposed grouping enhanced attention module, so that the network automatically focuses on the regions and channels of the features that matter for classification.
Disclosure of Invention
The main purpose of the invention is to provide a multi-level spatio-temporal feature fusion enhanced two-stream network (MDFFEN) behavior recognition method, which better captures the effective features of a video and the discriminative information within those features, so as to perform efficient behavior recognition.
In order to achieve the above object, the present invention provides the following technical solutions:
a double-flow network behavior recognition method based on multi-level space-time feature fusion enhancement comprises the following steps:
step one, obtaining RGB frames: performing frame taking processing on each video in the data set to obtain RGB original frames
f_rgb = {f_1, f_2, ..., f_N}, where N is the number of frames;
Step two, computing the optical flow maps: the TV-L1 algorithm [Coloma Ballester, Lluis Garrido, Vanel Lazcano, and Vicent Caselles. A TV-L1 optical flow method with occlusion detection. In DAGM, 2013.] is applied to consecutive pairs of the RGB original frames f_rgb to obtain the optical flow maps f_opt;
Step three, segmenting all the extracted RGB frames and optical flow maps: all RGB frames and optical flow maps acquired in step one and step two are divided into three segments, s_rgb = {s_rgb^1, s_rgb^2, s_rgb^3} and s_opt = {s_opt^1, s_opt^2, s_opt^3}; each segment is continuous in time, and any two segments do not overlap;
Step four, randomly sampling RGB frames from each segment of s_rgb to construct the input of the spatial network: RGB = {RGB^1, RGB^2, RGB^3}, where RGB^i is sampled from the i-th segment s_rgb^i;
Step five, randomly sampling several optical flow maps from each segment of s_opt to construct the input of the temporal network: OPT = {OPT^1, OPT^2, OPT^3}, where OPT^i is sampled from the i-th segment s_opt^i;
Step six, calculating the spatial class probability distribution O_S based on the spatial network N_s: the spatial network inputs RGB^i constructed in step four are fed into the spatial network N_s for feature extraction; N_s is built on InceptionV3, and the spatial class probability distribution O_S = {O_S^1, O_S^2, O_S^3} is obtained through a global average pooling operation and a fully connected operation, where O_S^i is the spatial class probability distribution corresponding to the i-th RGB frame segment RGB^i of step three;
Step seven, calculating the temporal class probability distribution O_T based on the temporal network N_t: the temporal network inputs OPT^i constructed in step five are fed into the temporal network N_t for feature extraction; N_t is built on InceptionV3 [Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the Inception Architecture for Computer Vision. In Computer Vision & Pattern Recognition, 2016.], and the temporal class probability distribution O_T = {O_T^1, O_T^2, O_T^3} is obtained through a global average pooling operation and a fully connected operation, where O_T^i is the temporal class probability distribution corresponding to the i-th optical flow segment OPT^i of step three;
Step eight, calculating the feature fusion class probability distribution O_F based on the two-stream fusion network N_TSFF: the multi-level spatio-temporal feature fusion module embeds spatio-temporal feature fusion modules STFF into several InceptionV3 sub-modules of the spatial network N_s and the temporal network N_t to fuse and extract mixed features at multiple depth levels; the extracted features are then further refined by the grouping enhanced attention module, and the feature fusion class probability distribution O_F = {O_F^1, O_F^2, O_F^3} is finally obtained through a global average pooling operation and a fully connected operation, where O_F^i is the feature fusion class probability distribution corresponding to the i-th RGB frame segment RGB^i and the i-th optical flow segment OPT^i of step three;
Step nine, calculating the multi-segment fused class probability distributions: from the per-segment class probability distributions O_S^i, O_T^i and O_F^i obtained in step six, step seven and step eight, the multi-segment fused class probability distributions δ_s, δ_t and δ_f are obtained by averaging over the three segments;
Step ten, calculating the class probability distribution δ of the three-stream weighted fusion: on the basis of the two-stream network, the multi-segment fused spatial class probability distribution δ_s, the multi-segment fused temporal class probability distribution δ_t and the multi-segment fused feature fusion class probability distribution δ_f obtained in step nine are combined; the invention uses a weighted average fusion method;
Step eleven, calculating the final classification result P: P = argmax(δ), where argmax(δ) returns the index of the maximum value in the vector δ, i.e. the behavior category with the highest class probability among all behavior categories.
Compared with the prior art, the invention has the following beneficial effects:
1. The two-stream feature fusion network constructed in step eight performs feature fusion at different depth levels of the two streams to obtain spatio-temporal mixed features at multiple depth levels, making full use of the shallow features and of the complementary characteristics of the two streams.
2. The two-stream feature fusion network constructed in step eight introduces a grouping enhanced attention module to further refine the local and global information of the extracted mixed features, effectively improving behavior recognition accuracy.
Drawings
FIG. 1 is a flow chart of an algorithm of the present invention;
FIG. 2 is a diagram of an algorithm model of the present invention;
FIG. 3 is a diagram of the two-stream feature fusion network N_TSFF;
FIG. 4 is a diagram of the spatio-temporal feature fusion module;
FIG. 5 is a diagram of the grouping enhanced attention module.
Detailed Description
FIG. 2 is an overall model diagram of the present invention;
FIG. 2 shows the algorithm model of the invention. The algorithm takes multi-segment RGB images and optical flow maps as input, and the model comprises a spatial network, a temporal network, a feature fusion network, multi-segment class probability distribution fusion and multi-stream class probability distribution fusion. The spatial network and the temporal network are both built on InceptionV3, and the feature fusion network is constructed from the spatial and temporal networks. In brief, the proposed multi-level spatio-temporal feature fusion module fuses spatio-temporal mixed features at different depth levels, where each spatio-temporal mixed feature is obtained by fusing, with the proposed spatio-temporal feature fusion module, the features extracted from the spatial network and the temporal network respectively; the proposed grouping enhanced attention module then further refines the multi-depth-level mixed features, and the feature fusion class probability distribution is obtained through global average pooling and fully connected operations, as in the spatial and temporal networks. The class probability distributions extracted from the three segment inputs of each stream are then fused to obtain the multi-segment fused class probability distribution of that stream, and finally the multi-segment fused class probability distributions of the three streams are fused by a weighted average method.
For a better explanation of the invention, the public behavior data set UCF101 is used as an example below.
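Before turning to the sampling details of steps four and five, the following sketch illustrates one way to implement steps one and two with OpenCV: frames are read from a video file and TV-L1 optical flow is computed between consecutive frames. This is a minimal sketch under stated assumptions, not the patented implementation; it assumes the opencv-contrib-python package is installed, and the function and path names are illustrative.

```python
import cv2
import numpy as np

def extract_rgb_frames(video_path):
    """Step one: read every frame of the video as an RGB image (f_rgb)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, bgr = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames  # f_rgb = {f_1, ..., f_N}

def compute_tvl1_flow(frames):
    """Step two: pairwise TV-L1 optical flow between consecutive frames (f_opt)."""
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()  # assumes opencv-contrib-python
    flows = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_RGB2GRAY)
    for frame in frames[1:]:
        curr = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        flows.append(tvl1.calc(prev, curr, None).astype(np.float32))  # H x W x 2 displacement
        prev = curr
    return flows  # N-1 flow maps
```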
In the technical scheme, the specific method in step four for randomly sampling RGB frames from each segment of s_rgb is as follows:
starting from a random position in the i-th RGB frame segment s_rgb^i obtained in step three, L_s consecutive RGB frames are taken to obtain RGB^i, where L_s is 1 in this example.
In the technical scheme, the specific method in step five for randomly sampling several optical flow maps from each segment of s_opt is as follows:
starting from a random position in the i-th optical flow segment s_opt^i obtained in step three, L_t consecutive optical flow maps are taken to obtain OPT^i, where L_t is 5 in this example.
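A minimal sketch of the segmentation and sampling just described (steps three to five), assuming frame and flow-map lists such as those produced by the previous sketch; the helper names and the equal-length split are illustrative choices, not mandated by the text.

```python
import random

def split_into_segments(items, num_segments=3):
    """Step three: three temporally ordered, non-overlapping segments."""
    seg_len = len(items) // num_segments
    return [items[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]

def sample_consecutive(segment, length):
    """Steps four/five: take `length` consecutive items starting at a random position."""
    start = random.randint(0, max(0, len(segment) - length))
    return segment[start:start + length]

def build_inputs(rgb_frames, flow_maps, L_s=1, L_t=5):
    s_rgb = split_into_segments(rgb_frames)
    s_opt = split_into_segments(flow_maps)
    rgb_inputs = [sample_consecutive(seg, L_s) for seg in s_rgb]  # RGB^1 ... RGB^3
    opt_inputs = [sample_consecutive(seg, L_t) for seg in s_opt]  # OPT^1 ... OPT^3
    return rgb_inputs, opt_inputs
```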
The two-stream feature fusion method in step eight of the technical scheme is specifically as follows:
Conventional two-stream network behavior recognition methods typically fuse the class probability distributions only at the final layer. Since such fusion acts on the deepest features of the final layer, the effect of shallow features on classification is often ignored. The invention therefore proposes a multi-level spatio-temporal feature fusion module, whose implementation is shown in FIG. 3. Unlike traditional methods, the multi-level spatio-temporal feature fusion module proposed in the invention also considers the shallow features of the deep network in order to capture mixed features with multiple depth levels. In addition, the invention proposes a grouping enhanced attention module to further refine the mixed features extracted by the multi-level spatio-temporal feature fusion module. Finally, the class probability distribution is generated by applying the fully connected layer FC to the feature vector, which is obtained by summarizing the feature maps through a global average pooling operation. The overall process of two-stream feature fusion is formally written as follows:

O_F^i = FC(GAP(M_GSCE(M_MDFF(RGB^i, OPT^i))))   (1)

where M_MDFF(·,·) denotes the multi-level spatio-temporal feature fusion module, M_GSCE(·) denotes the output feature of the grouping enhanced attention module, FC denotes a fully connected operation, and GAP denotes a global average pooling operation.
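As an illustration of formula (1), the following minimal PyTorch sketch wires an assumed multi-level fusion module and grouping enhanced attention module (both treated as black boxes supplied by the caller) to the global average pooling and fully connected head; the 2048-channel width and the 101-class output (UCF101) are assumptions for illustration only.

```python
import torch.nn as nn

class FusionStreamHead(nn.Module):
    def __init__(self, fusion_module, attention_module, channels=2048, num_classes=101):
        super().__init__()
        self.m_mdff = fusion_module      # M_MDFF(RGB^i, OPT^i) -> C x H x W feature map
        self.m_gsce = attention_module   # M_GSCE(.) -> refined C x H x W feature map
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, rgb_i, opt_i):
        feat = self.m_gsce(self.m_mdff(rgb_i, opt_i))
        vec = self.gap(feat).flatten(1)   # global average pooling -> feature vector
        return self.fc(vec)               # O_F^i (class logits for segment i)
```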
The multi-level spatio-temporal feature fusion method applied in step eight of the technical scheme is as follows:
InceptionV3 consists of 11 sub-modules connected in series, namely Inc.1 to Inc.11, from which features at different depth levels can be extracted. To further enhance the classification ability of the InceptionV3 network, the invention embeds a spatio-temporal feature fusion module STFF into sub-modules of the spatial network and the temporal network to capture new features with different depth levels. In this embodiment the last four sub-modules, i.e. Inc.8 to Inc.11, are selected; the choice of sub-modules can be adjusted in a specific application according to the actual requirements. All the mixed spatio-temporal features generated at the selected depths of the network are cascaded to obtain abstract convolutional mixed spatio-temporal features with multiple depth levels. The scheme of the multi-level spatio-temporal feature fusion module M_MDFF(·,·) is as follows:

M_MDFF(RGB^i, OPT^i) = Conv(Concat(M_STFF(S_Inc.8^i, T_Inc.8^i), ..., M_STFF(S_Inc.11^i, T_Inc.11^i)))   (2)

where M_STFF(·,·) denotes the spatio-temporal feature fusion module; S_Inc.j^i and T_Inc.j^i denote the features extracted from the Inc.j sub-module when RGB^i and OPT^i are fed into the spatial network and the temporal network, respectively; Concat(·) denotes the cascade of the mixed features generated from Inc.8 to Inc.11; and Conv(·) denotes a convolution operation. In this example, 2048 convolution filters of kernel size 3×3 are used to further extract abstract features from the mixed features of different depth levels, so the number of channels of the resulting feature map becomes 2048.
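The following PyTorch sketch shows one way to realize M_MDFF of formula (2) once the STFF outputs of the selected sub-modules are available. Because different InceptionV3 sub-modules produce feature maps of different spatial sizes, the sketch resizes them to a common size before the cascade; that resizing (and the bilinear mode) is an implementation assumption not spelled out in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelFusion(nn.Module):
    def __init__(self, in_channels, out_channels=2048):
        super().__init__()
        # 3x3 convolution with 2048 filters, as described for this example;
        # in_channels must equal the total channel count of the cascaded STFF outputs.
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, stff_features):
        # stff_features: list of M_STFF outputs from Inc.8 ... Inc.11
        target = stff_features[-1].shape[-2:]  # assumed common spatial size
        resized = [F.interpolate(f, size=target, mode='bilinear', align_corners=False)
                   for f in stff_features]
        cascade = torch.cat(resized, dim=1)    # channel-wise cascade (Concat)
        return self.conv(cascade)              # 2048-channel mixed feature
```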
The specific construction of the spatio-temporal feature fusion module STFF in step eight of the technical scheme is as follows:
The output feature of the spatio-temporal feature fusion module is formed by fusing three types of features, namely the preliminary mixed spatio-temporal feature, the spatial feature and the temporal feature.
FIG. 4 shows the spatio-temporal feature fusion module. The label on each box gives the name and size of the corresponding feature map; ⊕ denotes the element-wise summation operation, and N_Filter is the number of convolution filters.
As detailed in FIG. 4, the spatial feature extracted from a sub-module of the spatial network and the temporal feature extracted from the corresponding sub-module of the temporal network are first fused through element-wise summation and convolution operations to obtain the preliminary mixed abstract feature. Ignoring the superscript i and the subscript Inc.j of equation (2), S_Inc.j^i and T_Inc.j^i are written as S ∈ R^(C×H×W) and T ∈ R^(C×H×W) for ease of expression, where C, H and W denote the number of channels, the height and the width of the feature map, respectively. The preliminary mixed abstract feature F is then formally expressed as:

F = Φ(S, T) = Ψ_k,n(S ⊕ T)   (3)

where Ψ_k,n denotes a sequence of ReLU(BN(Conv(·))) operations with convolution kernel size k and filter number n, in which ReLU and BN denote the ReLU activation function and the batch normalization operation, respectively, and Conv(·) denotes a convolution operation. In addition, to further suppress invalid information and extract valid information, the invention proposes a feature extractor M_FE(·). M_FE(·) consists of two Ψ_3×3,n operations with different filter numbers n, where the first filter number is half of the input channel number C and the second equals the input channel number. Through the feature extractor M_FE(·), each of the three types of features (the spatial feature S, the temporal feature T and the preliminary spatio-temporal mixed feature F) is independently refined into a nonlinear abstract feature. The detailed procedure of the feature extractor M_FE(·) is expressed as:

M_FE(Z) = Z_FE2 = Ψ_3×3,C(Z_FE1)   (4)
Z_FE1 = Ψ_3×3,C/2(Z)   (5)

where Z ∈ {S, T, F} denotes the input feature of M_FE(·), and S, T and F denote the spatial feature, the temporal feature and the preliminary spatio-temporal mixed feature, respectively.
The spatial feature S_FE2 and the temporal feature T_FE2 refined by the feature extractor M_FE(·) are then each fused with the refined mixed feature F_FE2 to obtain the deeper fusion features F_S and F_T, as follows:

F_S = Φ(S_FE2, F_FE2)   (6)
F_T = Φ(T_FE2, F_FE2)   (7)

where Φ(·,·) is the same as in formula (3).
Finally, F_S and F_T are fused by Φ(·,·) to obtain the final mixed spatio-temporal feature of the spatio-temporal feature fusion module STFF:

M_STFF(S, T) = Φ(F_S, F_T)   (8)
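A minimal PyTorch sketch of the STFF module defined by formulas (3) to (8). Where the text leaves Ψ_k,n generic, the fusion blocks Φ below use a 3×3 kernel with C filters, and each Φ instance and each M_FE branch has its own parameters; these choices are assumptions for illustration.

```python
import torch.nn as nn

def psi(in_ch, out_ch, k=3):
    """Psi_{k,n}: Conv -> BN -> ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class STFF(nn.Module):
    def __init__(self, channels):
        super().__init__()
        c = channels
        self.phi_st = psi(c, c)    # formula (3): fuse S and T
        self.phi_sf = psi(c, c)    # formula (6)
        self.phi_tf = psi(c, c)    # formula (7)
        self.phi_out = psi(c, c)   # formula (8)
        # feature extractor M_FE: Psi_{3x3,C/2} followed by Psi_{3x3,C}, one branch per input type
        self.fe_s = nn.Sequential(psi(c, c // 2), psi(c // 2, c))
        self.fe_t = nn.Sequential(psi(c, c // 2), psi(c // 2, c))
        self.fe_f = nn.Sequential(psi(c, c // 2), psi(c // 2, c))

    def forward(self, S, T):
        F0 = self.phi_st(S + T)                                   # formula (3)
        S2, T2, F2 = self.fe_s(S), self.fe_t(T), self.fe_f(F0)    # formulas (4)-(5)
        F_S = self.phi_sf(S2 + F2)                                # formula (6)
        F_T = self.phi_tf(T2 + F2)                                # formula (7)
        return self.phi_out(F_S + F_T)                            # formula (8)
```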
in the above technical scheme, the grouping attention enhancement module in the step eight is specifically as follows:
in order to obtain more efficient spatiotemporal features from global and local information, the present invention constructs a grouping enhanced attention module to further refine the hybrid features. Fig. 5 shows a detailed structure of the module. The connection of two of the attention modules is parallel, which allows the module to extract both spatial and temporal information.
Fig. 5 is a grouping enhanced attention module. The group-level spatial attention module is used to mine each local region of interest, while the channel attention module is used to capture global responses in the channel dimension. They are then connected to enhance spatial saliency and channel saliency by element-wise multiplication with the original input feature map. Finally, residual connection is utilized to reduce the likelihood of gradient extinction. GAP and GMP in the figure represent global average pooling operations and global maximum pooling operations. They both operate in the spatial dimension in the spatial attention module and in the temporal dimension in the channel attention module, respectively.
Similar to SGE [Xiang Li, Xiaolin Hu, and Jian Yang], the invention aims to capture the responses between spatial features and channel features, i.e. the similarities between the global features and the local features within each group. The invention therefore introduces a grouping strategy into the spatial attention (SA) module, producing a group-level spatial attention (GSA) module that captures local information to complement the global information extracted by the channel attention (CA) module. The SA and CA modules mentioned here are described in detail in CBAM [Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional Block Attention Module. 2018.]. Formally, the input feature map is defined as Q ∈ R^(C×H×W); the invention obtains the spatial attention M_GS(Q) ∈ R^(C×H×W) through the GSA module and the channel response M_C(Q) ∈ R^(C×1×1) through the CA module, and then refines the original input feature Q by assigning the fused weights M_C(Q) ⊗ M_GS(Q). In addition, to reduce the possibility of vanishing gradients and speed up training, the invention introduces an attention residual, i.e. a direct connection between Q and the final refined feature. Finally, the saliency-enhanced feature Q̃ output by the grouping enhanced attention module is generated as shown in formula (9):

Q̃ = Q ⊕ (Q ⊗ M_C(Q) ⊗ M_GS(Q))   (9)

where ⊕ denotes element-wise summation and ⊗ denotes element-wise multiplication; the ⊗ between M_C(Q) and M_GS(Q) includes a broadcast operation that automatically expands the C×1×1 size of M_C(Q) to the C×H×W size of M_GS(Q).
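A sketch of the combination in formula (9), assuming the channel-attention and group-level spatial-attention modules are supplied externally and return C×1×1 and C×H×W maps respectively; PyTorch broadcasting handles the size difference noted above. The module interfaces are assumptions, not a fixed API from the patent.

```python
import torch.nn as nn

class GroupingEnhancedAttention(nn.Module):
    def __init__(self, channel_attention, group_spatial_attention):
        super().__init__()
        self.m_c = channel_attention         # M_C(Q):  B x C x 1 x 1
        self.m_gs = group_spatial_attention  # M_GS(Q): B x C x H x W

    def forward(self, Q):
        weights = self.m_c(Q) * self.m_gs(Q)  # fused attention weights (broadcast)
        refined = Q * weights                 # refine the input feature element-wise
        return Q + refined                    # attention residual, formula (9)
```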
In the above technical scheme, the construction method of the group-level spatial attention GSA module in step eight is as follows:
The complete feature input to a general attention module consists of sub-features distributed in groups across the channels of the feature. These sub-features are usually treated in the same way and are therefore likely to be affected by background noise, which easily leads to erroneous recognition and localization results. In view of this, the invention proposes a group-level spatial attention GSA module for generating a local spatial response in each individual group divided from the original feature map. That is, the input feature map Q is divided by a grouping strategy into groups Q = {Q_1, Q_2, ..., Q_G}, where Q_l ∈ R^((C/G)×H×W) is the feature map group with group index l and G is the total number of groups, 16 in this example. This effectively captures information from the sub-features through targeted learning and noise suppression. The local spatial response M_S(Q_l) of group l is then obtained with the SA module, which is described in detail in CBAM [Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional Block Attention Module. 2018.]. Finally, the output response M_GS(Q) of the group-level spatial attention module is generated as shown in the following formula:

M_GS(Q) = Concat(Expand(M_S(Q_1)), Expand(M_S(Q_2)), ..., Expand(M_S(Q_G)))   (10)

where the Expand(·) operation repeats a feature C/G times along the channel dimension and Concat(·) concatenates the groups along the channel dimension.
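A sketch of the group-level spatial attention of formula (10): the input is split into G = 16 groups along the channel dimension, a CBAM-style spatial attention map is computed for each group, repeated C/G times along the channel dimension (the Expand operation), and the groups are concatenated. The per-group spatial attention below follows the CBAM recipe (channel-wise average and max pooling followed by a 7×7 convolution) and shares its parameters across groups; these details are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                      # x: B x c x H x W
        avg = x.mean(dim=1, keepdim=True)      # average over channels
        mx, _ = x.max(dim=1, keepdim=True)     # max over channels
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # B x 1 x H x W

class GroupSpatialAttention(nn.Module):
    def __init__(self, groups=16):
        super().__init__()
        self.groups = groups
        self.sa = SpatialAttention()           # shared across groups (assumption)

    def forward(self, Q):                      # Q: B x C x H x W, C divisible by G
        per_group = Q.shape[1] // self.groups
        outputs = []
        for q_l in torch.split(Q, per_group, dim=1):        # Q_1 ... Q_G
            m_l = self.sa(q_l)                               # local spatial response
            outputs.append(m_l.expand(-1, per_group, -1, -1))  # Expand: repeat C/G times
        return torch.cat(outputs, dim=1)                     # M_GS(Q): B x C x H x W
```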
In the technical scheme, the method in step ten for fusing the spatial class probability distribution, the temporal class probability distribution and the feature fusion class probability distribution is as follows:
The invention uses a weighted average fusion method, i.e. δ = δ_s*w_s + δ_t*w_t + δ_f*w_f, where w_s, w_t and w_f denote the weights of the spatial stream, the temporal stream and the feature fusion stream, respectively. The default fusion weights of the three streams are 0.4, 2.7 and 2.4, respectively, and they can be adjusted according to the requirements of the actual application.
To verify the accuracy and robustness of the invention, experiments were conducted on the public UCF101 and HMDB51 data sets.
UCF101 is a typical and challenging human action recognition data set that contains 13320 videos collected from the YouTube video website at a resolution of 320×240. It contains 101 action categories in total, with the videos of each category divided into 25 groups. The UCF101 data set has great diversity in action capture, including camera motion, appearance changes, pose changes, object scale changes, background changes, illumination changes, and so on. The 101 actions can be broadly divided into five types: human-object interaction, body motion only, human-human interaction, playing musical instruments, and sports.
The HMDB51 data set contains 6849 video samples at 320×240 resolution, divided into 51 categories, where each category contains at least 101 samples. Most of the videos come from movies, and some come from public data sets or online video libraries such as YouTube. The action categories can be divided into five types: general facial actions, facial actions with object manipulation, general body movements, body movements with object interaction, and body movements for human interaction. Background clutter and changes in lighting conditions make it very challenging to recognize the target action represented by a video.
Table 1 lists the parameter settings used for the two data sets in the experiments:
Table 1. Database experimental parameter settings (the table is provided as an image in the original publication)
Table 2 shows the results of the proposed MDFFEN method on the UCF101 and HMDB51 data sets; the invention achieves a high recognition rate on both. Although the two data sets present difficulties such as occlusion, deformation, background clutter and low resolution, the proposed method is robust to these difficulties and therefore performs well.
Table 2. Recognition rates on UCF101 and HMDB51

Data set   UCF101   HMDB51
MDFFEN     95.3%    71.6%
The invention mainly proposes two mechanisms, namely multi-level spatio-temporal feature fusion and grouping enhanced attention. As shown in Table 3, on the UCF101 data set the accuracy of the two-stream network alone reaches 93.61%. Adding multi-level spatio-temporal feature fusion to the base network raises the accuracy to 94.63%, and adding grouping enhanced attention on top of that further raises it to 95.31%. The experimental results show that the multi-level spatio-temporal feature fusion effectively extracts mixed features at multiple depth levels, the grouping enhanced attention further refines the discriminative features within the mixed features, and both mechanisms benefit behavior recognition performance and effectively improve recognition accuracy.
Table 3. Effect of the two mechanisms on the UCF101 data set

Configuration                                            Accuracy
Two-stream network (baseline)                            93.61%
+ multi-level spatio-temporal feature fusion             94.63%
+ multi-level fusion and grouping enhanced attention     95.31%
While the invention has been described in detail with reference to the drawings, the invention is not limited to the above embodiment, and various changes may be made within the knowledge of those skilled in the art without departing from the spirit of the invention.

Claims (7)

1. A double-flow network behavior recognition method based on multi-level space-time feature fusion enhancement is characterized by comprising the following steps:
step one, obtaining RGB frames: performing frame taking processing on each video in the data set to obtain RGB original frames
f_rgb = {f_1, f_2, ..., f_N}, where N is the number of frames;
step two, computing the optical flow maps: the TV-L1 algorithm is applied to consecutive pairs of the RGB original frames f_rgb to obtain the optical flow maps f_opt;
step three, segmenting all the extracted RGB frames and optical flow maps: all RGB frames and optical flow maps acquired in step one and step two are divided into three segments, s_rgb = {s_rgb^1, s_rgb^2, s_rgb^3} and s_opt = {s_opt^1, s_opt^2, s_opt^3}; each segment is continuous in time, and any two segments do not overlap;
step four, randomly sampling RGB frames from each segment of s_rgb to construct the input of the spatial network: RGB = {RGB^1, RGB^2, RGB^3}, where RGB^i is sampled from the i-th segment s_rgb^i;
step five, randomly sampling several optical flow maps from each segment of s_opt to construct the input of the temporal network: OPT = {OPT^1, OPT^2, OPT^3}, where OPT^i is sampled from the i-th segment s_opt^i;
step six, calculating the spatial class probability distribution O_S based on the spatial network N_s: the spatial network inputs RGB^i constructed in step four are fed into the spatial network N_s for feature extraction; N_s is built on the InceptionV3 network, and the spatial class probability distribution O_S = {O_S^1, O_S^2, O_S^3} is obtained through a global average pooling operation and a fully connected operation, where O_S^i is the spatial class probability distribution corresponding to the i-th RGB frame segment RGB^i of step three;
step seven, calculating the temporal class probability distribution O_T based on the temporal network N_t: the temporal network inputs OPT^i constructed in step five are fed into the temporal network N_t for feature extraction; N_t is built on the InceptionV3 network, and the temporal class probability distribution O_T = {O_T^1, O_T^2, O_T^3} is obtained through a global average pooling operation and a fully connected operation, where O_T^i is the temporal class probability distribution corresponding to the i-th optical flow segment OPT^i of step three;
step eight, calculating the feature fusion class probability distribution O_F based on the two-stream fusion network N_TSFF: the multi-level spatio-temporal feature fusion module embeds spatio-temporal feature fusion modules STFF into several InceptionV3 sub-modules of the spatial network N_s and the temporal network N_t to fuse and extract mixed features at multiple depth levels; the extracted features are then refined by the grouping enhanced attention module, and the feature fusion class probability distribution O_F = {O_F^1, O_F^2, O_F^3} is finally obtained through a global average pooling operation and a fully connected operation, where O_F^i is the feature fusion class probability distribution corresponding to the i-th RGB frame segment RGB^i and the i-th optical flow segment OPT^i of step three;
step nine, calculating the multi-segment fused class probability distributions: from the per-segment class probability distributions O_S^i, O_T^i and O_F^i obtained in step six, step seven and step eight, the multi-segment fused class probability distributions δ_s, δ_t and δ_f are obtained by averaging over the three segments;
step ten, calculating the class probability distribution δ of the three-stream weighted fusion: on the basis of the two-stream network, the multi-segment fused spatial class probability distribution δ_s, the multi-segment fused temporal class probability distribution δ_t and the multi-segment fused feature fusion class probability distribution δ_f obtained in step nine are fused, and the class probability distribution δ is calculated using a weighted average fusion method;
step eleven, calculating the final classification result P: P = argmax(δ), where argmax(δ) returns the index of the maximum value in the vector δ, i.e. the behavior category with the highest class probability among all behavior categories.
2. The dual-flow network behavior recognition method based on multi-level space-time feature fusion enhancement according to claim 1, wherein the model for completing the dual-flow network behavior recognition method comprises a spatial network, a time network, a feature fusion network, multi-section class probability distribution fusion and multi-flow class probability distribution fusion; the space network and the time network are both constructed based on the InceptionV3, and the feature fusion network is constructed through the space network and the time network; using a multi-level space-time feature fusion module to fuse space-time mixed features with different depth levels, wherein the space-time mixed features are features extracted from a space network and a time network respectively by using the space-time feature fusion module, then extracting the mixed features with the multiple depth levels by using a grouping enhanced attention module, and obtaining feature fusion category probability distribution by using global average pooling and full connection operation like the space network and the time network; and then fusing the three segmentation input extracted corresponding class probability distributions of each flow to obtain multi-segment fused class probability distributions of the corresponding flows, and finally fusing the multi-segment fused class probability distributions corresponding to the three flows by adopting a weighted average method.
3. The dual-flow network behavior recognition method based on multi-level space-time feature fusion enhancement according to claim 1, wherein the whole process of the step eight is formally written as the following formula:
O_F^i = FC(GAP(M_GSCE(M_MDFF(RGB^i, OPT^i))))   (1)
where M_MDFF(·,·) denotes the multi-level spatio-temporal feature fusion module, M_GSCE(·) denotes the output feature of the grouping enhanced attention module, FC denotes a fully connected operation, and GAP denotes a global average pooling operation.
4. The double-flow network behavior recognition method based on multi-level space-time feature fusion enhancement according to claim 3, wherein the multi-level spatio-temporal feature fusion method applied in step eight is as follows: InceptionV3 consists of j sub-modules connected in series, namely Inc.1 to Inc.j, from which features at different depth levels can be extracted; a spatio-temporal feature fusion module STFF is embedded into each sub-module of the spatial network and the temporal network to capture new features with different depth levels; all the mixed spatio-temporal features generated by the sub-modules at multiple depths of the network are cascaded to obtain abstract convolutional mixed spatio-temporal features with multiple depth levels; the scheme of the multi-level spatio-temporal feature fusion module M_MDFF(·,·) is as follows:

M_MDFF(RGB^i, OPT^i) = Conv(Concat(M_STFF(S_Inc.l1^i, T_Inc.l1^i), ..., M_STFF(S_Inc.l2^i, T_Inc.l2^i)))   (2)

where M_STFF(·,·) denotes the spatio-temporal feature fusion module; S_Inc.j^i and T_Inc.j^i denote the features extracted from the Inc.j sub-module when RGB^i and OPT^i are fed into the spatial network and the temporal network, respectively; Concat(·) denotes the cascade of the mixed features generated from Inc.l1 to Inc.l2; and Conv(·) denotes a convolution operation.
5. The double-flow network behavior recognition method based on multi-level space-time feature fusion enhancement according to claim 4, wherein the output feature of the spatio-temporal feature fusion module is formed by fusing three types of features, namely the preliminary mixed spatio-temporal feature, the spatial feature and the temporal feature; the specific process of the spatio-temporal feature fusion module is as follows: first, the spatial feature extracted from a sub-module of the spatial network and the temporal feature extracted from the corresponding sub-module of the temporal network are fused through element-wise summation and convolution operations to obtain the preliminary mixed abstract feature; ignoring the superscript i and the subscript Inc.j of equation (2), S_Inc.j^i and T_Inc.j^i are written as S ∈ R^(C×H×W) and T ∈ R^(C×H×W) for ease of expression, where C, H and W denote the number of channels, the height and the width of the feature map, respectively; the preliminary mixed abstract feature F is then formally expressed as:

F = Φ(S, T) = Ψ_k,n(S ⊕ T)   (3)

where Ψ_k,n denotes a sequence of ReLU(BN(Conv(·))) operations with convolution kernel size k and filter number n, in which ReLU and BN denote the ReLU activation function and the batch normalization operation respectively, Conv(·) denotes a convolution operation, and ⊕ denotes an element-wise summation operation;
in order to suppress invalid information and extract valid information, a feature extractor M_FE(·) is employed; M_FE(·) consists of two Ψ_3×3,n operations with different filter numbers n, where the first filter number is half of the input channel number C and the second equals the input channel number; through the feature extractor M_FE(·), the spatial feature S, the temporal feature T and the preliminary spatio-temporal mixed feature F are each independently refined into nonlinear abstract features; the detailed procedure of the feature extractor M_FE(·) is expressed as:

M_FE(Z) = Z_FE2 = Ψ_3×3,C(Z_FE1)   (4)
Z_FE1 = Ψ_3×3,C/2(Z)   (5)

where Z ∈ {S, T, F} denotes the input feature of M_FE(·), and S, T and F denote the spatial feature, the temporal feature and the preliminary spatio-temporal mixed feature, respectively;
then the spatial feature S_FE2 and the temporal feature T_FE2 refined by the feature extractor M_FE(·) are each fused with the refined mixed feature F_FE2 to obtain the deeper fusion features F_S and F_T, as follows:

F_S = Φ(S_FE2, F_FE2)   (6)
F_T = Φ(T_FE2, F_FE2)   (7)

where Φ(·,·) is the same as in formula (3);
finally, F_S and F_T are fused by Φ(·,·) to obtain the final mixed spatio-temporal feature of the spatio-temporal feature fusion module STFF:

M_STFF(S, T) = Φ(F_S, F_T)   (8).
6. the method for identifying double-flow network behavior based on multi-level space-time feature fusion enhancement according to claim 1, wherein the group enhanced attention module in the eighth step comprises a group-level space attention module and a channel attention module, and the connection of the two attention modules is parallel; the group-level spatial attention module is used for mining each local area of interest, and the channel attention module is used for capturing global response in the channel dimension; connecting the two attention modules, and enhancing the spatial significance and the channel significance by multiplying the two attention modules with the original input characteristic diagram element by element; finally, residual connection is utilized to reduce the possibility of gradient extinction; the global average pooling operation GAP and the global maximum pooling operation GMP operate on a space dimension in the space attention module and a time dimension in the channel attention module respectively; the method comprises the following steps:
introducing a grouping strategy into the spatial attention SA module so as to generate a group-level spatial attention GSA module which is used for capturing local information so as to supplement global information extracted by the channel attention CA module; the SA and CA modules formally define the input feature maps as
Figure FDA0004085207360000051
Acquisition of spatial awareness by GSA Module and CA Module>
Figure FDA0004085207360000052
And channel response->
Figure FDA0004085207360000053
By passing through
Figure FDA0004085207360000054
Operation assignment of fused weights->
Figure FDA0004085207360000055
To refine the original input feature Q; attention deficit is introduced by->
Figure FDA0004085207360000056
The operation directly establishes a connection between the Q and the final refined feature; finally, the saliency enhancement feature of the group enhanced attention module output +.>
Figure FDA0004085207360000057
The generation process of (2) is shown in the following formula (9);
Figure FDA0004085207360000058
wherein ,
Figure FDA0004085207360000059
representing element-by-element multiplication, where M C(Q) and MGS (Q) between->
Figure FDA00040852073600000510
The operations include broadcast operations that automatically multiply M by element C The size C of (Q) 1*1 is converted to M GS The sizes of (Q) c×h×w are identical.
7. The double-flow network behavior recognition method based on multi-level space-time feature fusion enhancement according to claim 6, wherein the group-level spatial attention GSA module is constructed as follows: the group-level spatial attention GSA module generates a local spatial response in each individual group divided from the original feature map; the input feature map Q is divided by a grouping strategy into groups Q = {Q_1, Q_2, ..., Q_G}, where Q_l ∈ R^((C/G)×H×W) is the feature map group with group index l and G is the total number of groups; this effectively captures information from the sub-features through targeted learning and noise suppression; the local spatial response M_S(Q_l) of group l is then obtained with the SA module; finally, the output response M_GS(Q) of the group-level spatial attention module is generated as shown in the following formula:

M_GS(Q) = Concat(Expand(M_S(Q_1)), Expand(M_S(Q_2)), ..., Expand(M_S(Q_G)))   (10)

where the Expand(·) operation repeats a feature C/G times along the channel dimension and Concat(·) concatenates the groups along the channel dimension.
CN202010441559.3A 2020-05-22 2020-05-22 Double-flow network behavior identification method based on multilevel space-time feature fusion enhancement Active CN111709306B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010441559.3A CN111709306B (en) 2020-05-22 2020-05-22 Double-flow network behavior identification method based on multilevel space-time feature fusion enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010441559.3A CN111709306B (en) 2020-05-22 2020-05-22 Double-flow network behavior identification method based on multilevel space-time feature fusion enhancement

Publications (2)

Publication Number Publication Date
CN111709306A CN111709306A (en) 2020-09-25
CN111709306B true CN111709306B (en) 2023-06-09

Family

ID=72537459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010441559.3A Active CN111709306B (en) 2020-05-22 2020-05-22 Double-flow network behavior identification method based on multilevel space-time feature fusion enhancement

Country Status (1)

Country Link
CN (1) CN111709306B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112489092B (en) * 2020-12-09 2023-10-31 浙江中控技术股份有限公司 Fine-grained industrial motion modality classification method, storage medium, device and apparatus
CN112712124B (en) * 2020-12-31 2021-12-10 山东奥邦交通设施工程有限公司 Multi-module cooperative object recognition system and method based on deep learning
CN112381072B (en) * 2021-01-11 2021-05-25 西南交通大学 Human body abnormal behavior detection method based on time-space information and human-object interaction
CN113066022B (en) * 2021-03-17 2022-08-16 天津大学 Video bit enhancement method based on efficient space-time information fusion
CN113111822B (en) * 2021-04-22 2024-02-09 深圳集智数字科技有限公司 Video processing method and device for congestion identification and electronic equipment
CN113393521B (en) * 2021-05-19 2023-05-05 中国科学院声学研究所南海研究站 High-precision flame positioning method and system based on dual semantic attention mechanism
CN114677704B (en) * 2022-02-23 2024-03-26 西北大学 Behavior recognition method based on three-dimensional convolution and space-time feature multi-level fusion
CN115348215B (en) * 2022-07-25 2023-11-24 南京信息工程大学 Encryption network traffic classification method based on space-time attention mechanism

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188239B (en) * 2018-12-26 2021-06-22 北京大学 Double-current video classification method and device based on cross-mode attention mechanism
CN109993077A (en) * 2019-03-18 2019-07-09 南京信息工程大学 A kind of Activity recognition method based on binary-flow network
CN110119703B (en) * 2019-05-07 2022-10-04 福州大学 Human body action recognition method fusing attention mechanism and spatio-temporal graph convolutional neural network in security scene
CN110569773B (en) * 2019-08-30 2020-12-15 江南大学 Double-flow network behavior identification method based on space-time significance behavior attention

Also Published As

Publication number Publication date
CN111709306A (en) 2020-09-25

Similar Documents

Publication Publication Date Title
CN111709306B (en) Double-flow network behavior identification method based on multilevel space-time feature fusion enhancement
Zhuge et al. Salient object detection via integrity learning
Liu et al. SwinNet: Swin transformer drives edge-aware RGB-D and RGB-T salient object detection
Shao et al. Temporal interlacing network
Qian et al. Thinking in frequency: Face forgery detection by mining frequency-aware clues
Nguyen et al. A neural network based on SPD manifold learning for skeleton-based hand gesture recognition
Li et al. Selective kernel networks
Gao et al. MSCFNet: A lightweight network with multi-scale context fusion for real-time semantic segmentation
Kim et al. Fully deep blind image quality predictor
Wang et al. NAS-guided lightweight multiscale attention fusion network for hyperspectral image classification
Li et al. Micro-expression action unit detection with spatial and channel attention
Fang et al. Deep3DSaliency: Deep stereoscopic video saliency detection model by 3D convolutional networks
CN111242181B (en) RGB-D saliency object detector based on image semantics and detail
CN113343950B (en) Video behavior identification method based on multi-feature fusion
Pan et al. No-reference image quality assessment via multibranch convolutional neural networks
Jia et al. Stacked denoising tensor auto-encoder for action recognition with spatiotemporal corruptions
Li et al. ConvTransNet: A CNN–transformer network for change detection with multiscale global–local representations
Liu et al. APSNet: Toward adaptive point sampling for efficient 3D action recognition
Zhao et al. Alignment-guided temporal attention for video action recognition
Yin et al. Dynamic difference learning with spatio-temporal correlation for deepfake video detection
Ning et al. Enhancement, integration, expansion: Activating representation of detailed features for occluded person re-identification
Shi et al. A pooling-based feature pyramid network for salient object detection
Huang et al. Region-based non-local operation for video classification
ABAWATEW et al. Attention augmented residual network for tomato disease detection and classification
Cheng et al. Bidirectional collaborative mentoring network for marine organism detection and beyond

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant