CN111709306B - Double-flow network behavior identification method based on multilevel space-time feature fusion enhancement - Google Patents
- Publication number: CN111709306B (application CN202010441559.3A)
- Authority: CN (China)
- Prior art keywords: network, space, features, time, module
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/20: Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
- G06F18/2415: Pattern recognition; classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/253: Pattern recognition; fusion techniques of extracted features
- G06F18/256: Pattern recognition; fusion techniques of classification results relating to different input data, e.g. multimodal recognition
- G06N3/045: Computing arrangements based on biological models; neural networks; combinations of networks
Abstract
A dual-stream network behavior recognition method based on multi-level spatio-temporal feature fusion enhancement. The method adopts a network architecture built on the spatio-temporal dual-stream network, called the multi-level spatio-temporal feature fusion enhancement network. Because the traditional dual-stream network fuses only the class probability distributions of the two streams at the last layer, the effect of shallow features is ignored and the complementary characteristics of the two streams cannot be fully exploited. The invention therefore proposes a multi-level spatio-temporal feature fusion module, which captures mixed features at multiple depth levels through spatio-temporal feature fusion modules placed at different depth levels of the two streams, so as to make full use of the dual-stream network. Furthermore, treating all features equally in the network weakens the effect of those features that contribute significantly to classification; the invention therefore introduces a grouping enhanced attention module that automatically enhances the saliency of effective regions and channels of the features. Finally, the invention further improves the robustness of the behavior recognition model by aggregating the classification results of the dual-stream network and of the feature fusion.
Description
Technical Field
The invention belongs to the field of machine vision, and particularly relates to a double-flow network behavior recognition method based on multi-level space-time feature fusion enhancement.
Background
Action recognition has become an active field in computer vision and is widely applied in areas such as video surveillance, violence detection and human-computer interaction. Video action recognition mines the key features that express the target motion represented by a video; compared with static images, videos contain rich motion information, yet the diversity of motion scenes still makes extracting effective features challenging. Taking video as its research object, the invention therefore addresses the problems faced when extracting spatial and temporal features from video, and proposes a distinctive feature fusion method and attention method to effectively extract discriminative features for behavior recognition.
Currently, video-oriented behavior recognition mainly uses dual-stream networks, and this line of work is developing rapidly. In a dual-stream architecture, appearance information and motion information are captured by training separate convolutional networks on the appearance frames and on stacks of optical flow, and the classification results of the two convolutional networks are finally fused at the score level. However, conventional dual-stream networks still face the following problems: (1) how can the information captured separately by the two streams be used effectively? (2) how can the captured features be refined, given that treating every region and channel of the features equally weakens the effect of those regions and channels that are useful for classification? (3) how can the acquired spatial information and temporal information be fused effectively?
Based on the above considerations, the invention provides a dual-stream network behavior recognition method based on multi-level spatio-temporal feature fusion enhancement. First, the proposed spatio-temporal feature fusion module fuses the features of sub-modules at different depth layers of the dual-stream network to extract mixed features at multiple depth levels. Second, the extracted mixed features are further refined by the proposed grouping enhanced attention module, so that the network automatically focuses on the regions and channels of the features that matter for classification.
Disclosure of Invention
The main aim of the invention is to provide a multi-level spatio-temporal feature fusion enhanced dual-stream network (MDFFEN) behavior recognition method, which better captures effective video features and the discriminative information within them, so as to perform efficient behavior recognition.
In order to achieve the above object, the present invention provides the following technical solutions:
a double-flow network behavior recognition method based on multi-level space-time feature fusion enhancement comprises the following steps:
step one, obtaining RGB frames: performing frame extraction on each video in the data set to obtain the RGB original frames f^rgb = {f_1^rgb, f_2^rgb, …, f_N^rgb}, where N is the number of frames;
step two, calculating optical flow maps: the TVL1 algorithm [Coloma Ballester, Lluis Garrido, Vanel Lazcano, and Vicent Caselles. A TV-L1 optical flow method with occlusion detection. In DAGM, 2013.] is applied to consecutive pairs of the RGB original frames f^rgb to obtain the optical flow maps f^opt;
step three, segmenting all the extracted RGB frames and optical flow maps: all RGB frames and optical flow maps acquired in steps one and two are divided into three segments s^rgb = {s_1^rgb, s_2^rgb, s_3^rgb} and s^opt = {s_1^opt, s_2^opt, s_3^opt}; each segment is temporally contiguous, and no two segments overlap;
step four, from each segment of s^rgb, RGB frames are randomly sampled to construct the spatial network input I^rgb = {RGB_1, RGB_2, RGB_3}, where RGB_i is sampled from segment s_i^rgb;
step five, from each segment of s^opt, several optical flow maps are randomly sampled to construct the temporal network input I^opt = {OPT_1, OPT_2, OPT_3}, where OPT_i is a stack of optical flow maps sampled from s_i^opt;
Step six, based on the space network N s Calculating a spatial class probability distribution O S Input of the space network constructed in the fourth stepRespectively send into the space network N s Extracting features, spatial network N s Based on InceptionV3 [2] Constructing a network, and obtaining spatial category probability distribution ∈10 through global average pooling operation and full connection operation> wherein The ith RGB frame segment RGB representing step three i Corresponding spatial class probability distribution;
step seven, based on time network N t Calculating a temporal class probability distribution O T Input of the time network constructed in the step fiveRespectively send into time network N t Extracting features, time network N t Based on InceptionV3[ Christian Szegedy, vincent Vanhoucke, sergey Ioffe, jonathon sylens, and Zbigniew Wojna.rethinking the inception architecture for Computer version.in Computer Vision&Pattern Recognition,2016.]Constructing a network, and obtaining time category probability distribution by global average pooling operation and full connection operation> wherein />Representing the ith optical flow map segment OPT in step three i Corresponding time class probabilities;
step eight, based on double-flow fusion network N TSFF Computing feature fusion class probability distribution O F : embedding space-time feature fusion modules STFF into a spatial network N respectively by using multi-level space-time feature fusion modules s And time network N t In the multiple submodules of InceptionV3 to fusion extractThe multi-depth-level mixed features are then further extracted by the grouping enhanced attention module, and finally feature fusion category probability distribution is obtained by global average pooling operation and full connection operation wherein />The ith RGB frame segment RGB representing step three i And ith optical flow map segment OPT i Corresponding feature fusion class probability distribution;
step nine, calculating multi-section fused class probability distribution, namely obtaining multi-section class probability distribution according to the step six, the step seven and the step eightAnd->Obtaining a class probability distribution of multi-segment fusion by three-segment average value>
step ten, calculating the class probability distribution δ of the weighted three-stream fusion: on the basis of the dual-stream network, the multi-segment fused spatial class probability distribution δ_s, temporal class probability distribution δ_t and feature fusion class probability distribution δ_f obtained in step nine are fused; the invention uses a weighted average fusion method.
step eleven, calculating the final classification result P: P = argmax(δ), where argmax(δ) returns the index of the maximum value in the vector δ, i.e. the behavior category with the highest fused class probability among all behavior categories.
Compared with the prior art, the invention has the following beneficial effects:
1. The dual-stream feature fusion network constructed in step eight performs feature fusion at different depth layers of the two streams to obtain spatio-temporal mixed features at multiple depth levels, making full use of shallow features and of the complementary characteristics of the two streams.
2. The dual-stream feature fusion network constructed in step eight further proposes a grouping enhanced attention module to refine the local and global information of the extracted mixed features, which effectively improves behavior recognition accuracy.
Drawings
FIG. 1 is a flow chart of an algorithm of the present invention;
FIG. 2 is a diagram of an algorithm model of the present invention;
FIG. 3 is a diagram of the dual-stream feature fusion network N_TSFF;
FIG. 4 is a spatio-temporal feature fusion graph;
fig. 5 is a grouping enhanced attention module.
Detailed Description
Fig. 2 shows the overall algorithm model of the invention. The algorithm takes multi-segment RGB images and optical flow maps as inputs, and the model comprises a spatial network, a temporal network, a feature fusion network, multi-segment class probability distribution fusion and multi-stream class probability distribution fusion. The spatial and temporal networks are both built on InceptionV3, and the feature fusion network is constructed from them: the proposed multi-level spatio-temporal feature fusion module aggregates spatio-temporal mixed features at different depth levels, where each mixed feature is produced by the proposed spatio-temporal feature fusion module from the features extracted by the spatial and temporal networks; the proposed grouping enhanced attention module then further refines the multi-depth-level mixed features, and, as in the spatial and temporal networks, the feature fusion class probability distribution is obtained through global average pooling and fully connected operations. The class probability distributions extracted for the three segment inputs of each stream are then fused to obtain the multi-segment fused class probability distribution of that stream, and finally the multi-segment fused distributions of the three streams are fused by weighted averaging.
For a better explanation of the present invention, the disclosed behavior data set UCF101 is described below as an example.
In the above technical scheme, the method in step four for randomly sampling RGB frames from each segment of s^rgb is as follows:
from the ith RGB frame segment s_i^rgb obtained in step three, L_s consecutive RGB frames are taken starting at a random position, where L_s is 1 in this example.
In the above technical scheme, the method in step five for randomly sampling several optical flow maps from each segment of s^opt is as follows:
from the ith optical flow map segment s_i^opt obtained in step three, L_t consecutive optical flow maps are taken starting at a random position, where L_t is 5 in this example.
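As an illustrative sketch (not part of the claimed method), the segmentation and random sampling of steps three to five can be written as follows, with L_s = 1 and L_t = 5 as in this embodiment; the helper name segment_and_sample, the index-based representation of frames, and the fixed seed are assumptions of the example:

```python
import random

def segment_and_sample(num_frames, num_segments=3, l_s=1, l_t=5, seed=0):
    """Split frame indices 0..num_frames-1 into equal, temporally ordered,
    non-overlapping segments, then sample a clip from a random position in
    each: l_s consecutive RGB frames and l_t consecutive optical flow maps."""
    rng = random.Random(seed)
    seg_len = num_frames // num_segments
    rgb_clips, flow_clips = [], []
    for i in range(num_segments):
        start, end = i * seg_len, (i + 1) * seg_len
        s_rgb = rng.randint(start, end - l_s)    # clip stays inside the segment
        s_flow = rng.randint(start, end - l_t)
        rgb_clips.append(list(range(s_rgb, s_rgb + l_s)))
        flow_clips.append(list(range(s_flow, s_flow + l_t)))
    return rgb_clips, flow_clips
```

For a 90-frame video this yields one RGB frame and a stack of five flow maps per segment, matching steps four and five.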
The double-flow characteristic fusion method in the step eight in the technical scheme specifically comprises the following steps:
Conventional dual-stream behavior recognition methods typically fuse class probability distributions only at the final layer. Because such fusion acts on the deepest features, the effect of shallow features on classification is ignored. The invention therefore proposes a multi-level spatio-temporal feature fusion module, whose implementation is shown in Fig. 3. Unlike traditional methods, it also draws on the shallow features of the deep network to capture mixed features at multiple depth levels. In addition, the invention proposes a grouping enhanced attention module to further refine the mixed features extracted by the multi-level spatio-temporal feature fusion module. Finally, the class probability distribution is produced by the fully connected layer FC applied to the feature vector obtained by summarizing the feature maps with global average pooling. The overall dual-stream feature fusion process is formally written as

O_F = FC(GAP(M_GSCE(M_MDFF(S, T))))    (1)

where M_MDFF(·,·) denotes the multi-level spatio-temporal feature fusion module, M_GSCE(·) the output of the grouping enhanced attention module, FC a fully connected operation, and GAP a global average pooling operation.
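A minimal numpy sketch of the tail of this pipeline: global average pooling followed by a fully connected layer with softmax over the 101 UCF101 classes. The random feature map and weights are placeholders, and the fusion and attention modules are assumed to have already produced the refined mixed feature:

```python
import numpy as np

def gap(x):
    """Global average pooling: reduce a (C, H, W) feature map to (C,)."""
    return x.mean(axis=(1, 2))

def fc_softmax(v, W, b):
    """Fully connected layer followed by softmax -> class probabilities."""
    z = W @ v + b
    e = np.exp(z - z.max())          # numerically stabilized softmax
    return e / e.sum()

rng = np.random.default_rng(0)
mixed = rng.standard_normal((2048, 7, 7))    # stand-in for the refined mixed feature map
W = 0.01 * rng.standard_normal((101, 2048))  # 101 behavior classes (UCF101)
b = np.zeros(101)
o_f = fc_softmax(gap(mixed), W, b)           # one segment's class probability distribution
```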
The multi-level space-time feature fusion method applied in the step eight in the technical scheme comprises the following steps:
InceptionV3 consists of 11 serially connected sub-modules, Inc.1 to Inc.11, from which features at different depth levels can be extracted. To further enhance the classification capability of the InceptionV3 network, the invention embeds a spatio-temporal feature fusion module STFF into sub-modules of the spatial and temporal networks to capture new features at different depth levels. This embodiment selects the last four sub-modules, Inc.8 to Inc.11; the selection of sub-modules can be adjusted according to the actual application. All the mixed spatio-temporal features generated at the selected sub-modules are concatenated to obtain abstract convolutional mixed spatio-temporal features with multiple depth levels. The multi-level spatio-temporal feature fusion module M_MDFF(·,·) is defined as

M_MDFF(S, T) = Conv(Concat(M_STFF(S_Inc.8, T_Inc.8), …, M_STFF(S_Inc.11, T_Inc.11)))    (2)

where M_STFF(·,·) denotes the spatio-temporal feature fusion module, and S_Inc.j and T_Inc.j denote the features of the inputs RGB_i and OPT_i extracted from the Inc.j sub-modules of the spatial and temporal networks, respectively. Conv(·) denotes a convolution operation; this example uses 2048 convolution filters of kernel size 3×3 to further extract abstract features from the mixed features of different depth levels, so the number of channels of the resulting feature map becomes 2048.
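A shape-level numpy sketch of the cascade, assuming the per-module mixed features have already been brought to a common spatial size, and using a channel-mixing projection as a stand-in for the 3×3 convolution with 2048 filters; multilevel_fusion and the 256-channel toy features are assumptions of the example:

```python
import numpy as np

def multilevel_fusion(stff_feats, W_proj):
    """Concatenate the STFF mixed features from several depth levels
    along the channel axis, then apply a channel-mixing projection
    (stand-in for the 3x3 convolution with 2048 filters)."""
    cascade = np.concatenate(stff_feats, axis=0)      # (sum of C_j, H, W)
    C, H, W = cascade.shape
    out = W_proj @ cascade.reshape(C, H * W)          # mix channels at every position
    return out.reshape(-1, H, W)                      # (2048, H, W)

rng = np.random.default_rng(1)
# mixed features from Inc.8..Inc.11, assumed resized to a common 8x8 grid
feats = [rng.standard_normal((256, 8, 8)) for _ in range(4)]
W_proj = 0.01 * rng.standard_normal((2048, 4 * 256))
fused = multilevel_fusion(feats, W_proj)
```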
The specific construction method of the space-time feature fusion module STFF in the eighth step in the technical scheme comprises the following steps:
The output feature of the spatio-temporal feature fusion module is formed by fusing three types of features, namely the primary mixed spatio-temporal feature, the spatial feature and the temporal feature.
Fig. 4 shows the spatio-temporal feature fusion module. The label on each box gives the name and size of the feature map; ⊕ denotes the element-wise summation operation, and N_Filter is the number of convolution filters.
As detailed in fig. 4, the spatial features extracted from the sub-modules in the spatial network are first fused with the temporal features extracted from the sub-modules in the temporal network by element-wise summation and convolution operations to obtain primary hybrid abstract features. By ignoring the superscript i and the subscript inc.j in equation (2), one can write and />Write as +.> and />For ease of expression, where C, H and W represent the number of channels, height and width, respectively, of the feature map. The preliminary hybrid abstract feature F is then formally expressed as the following formula:
where Ψ_{k,n} denotes a ReLU(BN(Conv(·))) operation sequence with convolution kernel size k and filter number n, in which ReLU and BN denote the ReLU activation function and batch normalization, and Conv(·) denotes a convolution operation. In addition, in order to further suppress invalid information and extract valid information, the invention proposes a feature extractor M_FE(·). M_FE(·) consists of two Ψ_{3×3,n} operations with different filter numbers n: the first uses half of the input channel number C, and the second uses the same number as the input channels. Through the feature extractor M_FE(·), all three types of features (the spatial feature S, the temporal feature T and the primary spatio-temporal mixed feature F) are further refined independently into nonlinear abstract features. The detailed procedure of M_FE(·) is expressed as

Z_FE1 = Ψ_{3×3,C/2}(Z),  M_FE(Z) = Z_FE2 = Ψ_{3×3,C}(Z_FE1)    (4)

where Z ∈ {S, T, F} denotes the input feature of M_FE(·), and S, T, F denote the spatial feature, the temporal feature and the primary spatio-temporal mixed feature, respectively.
Then, it will pass through the feature extractor M FE (. Cndot.) refined spatial features S FE2 And time feature T FE2 Respectively with refined mixed features F FE2 Fusion to obtain a more deep fusion feature F S and FT The following is shown:
F S =Φ(S FE2 ,F FE2 ) (6)
F T =Φ(T FE2 ,F FE2 ) (7)
here, Φ (. Cndot. ) is the same as the formula (3).
Finally, F is calculated by phi (& gt ) S and FT And (3) fusing to obtain the final mixed space-time characteristics of the space-time characteristic fusion module STFF:
M STFF (S,T)=Φ(F S ,F T ) (8)
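The chain of formulas (3) to (8) can be sketched in numpy under simplifying assumptions: the k×k convolution is replaced by a channel-mixing projection, and batch normalization by per-channel standardization. The function names psi, phi, feature_extractor and stff, and the toy weights, are assumptions of the example:

```python
import numpy as np

def psi(x, W):
    """Simplified Psi_{k,n}: channel projection W (stand-in for the k x k
    convolution), per-channel standardization (stand-in for BN), then ReLU."""
    C, H, Wd = x.shape
    z = (W @ x.reshape(C, H * Wd)).reshape(-1, H, Wd)
    mu = z.mean(axis=(1, 2), keepdims=True)
    sd = z.std(axis=(1, 2), keepdims=True) + 1e-5
    return np.maximum((z - mu) / sd, 0.0)

def phi(a, b, W):
    """Phi: element-wise summation followed by Psi, as in formula (3)."""
    return psi(a + b, W)

def feature_extractor(z, W_half, W_full):
    """M_FE: two Psi stages with C/2 then C filters, as in formula (4)."""
    return psi(psi(z, W_half), W_full)

def stff(S, T, Ws):
    """Spatio-temporal feature fusion module, formulas (3)-(8)."""
    F = phi(S, T, Ws['fuse'])                            # primary mixed feature, (3)
    S2 = feature_extractor(S, Ws['half'], Ws['full'])    # refined spatial feature
    T2 = feature_extractor(T, Ws['half'], Ws['full'])    # refined temporal feature
    F2 = feature_extractor(F, Ws['half'], Ws['full'])    # refined mixed feature
    FS = phi(S2, F2, Ws['fuse'])                         # (6)
    FT = phi(T2, F2, Ws['fuse'])                         # (7)
    return phi(FS, FT, Ws['fuse'])                       # (8)

rng = np.random.default_rng(2)
C = 64
Ws = {'fuse': 0.1 * rng.standard_normal((C, C)),
      'half': 0.1 * rng.standard_normal((C // 2, C)),
      'full': 0.1 * rng.standard_normal((C, C // 2))}
S = rng.standard_normal((C, 8, 8))
T = rng.standard_normal((C, 8, 8))
fused_st = stff(S, T, Ws)
```

Sharing the extractor weights across S, T and F is a simplification of the example; the patent leaves each module its own parameters.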
in the above technical scheme, the grouping attention enhancement module in the step eight is specifically as follows:
In order to obtain more effective spatio-temporal features from global and local information, the invention constructs a grouping enhanced attention module to further refine the mixed features. Fig. 5 shows the detailed structure of the module. The two attention branches are connected in parallel, which allows the module to extract spatial and channel information simultaneously.
Fig. 5 is the grouping enhanced attention module. The group-level spatial attention module is used to mine each local region of interest, while the channel attention module is used to capture global responses in the channel dimension. Their outputs are combined and applied to the original input feature map by element-wise multiplication to enhance spatial saliency and channel saliency. Finally, a residual connection is used to reduce the likelihood of vanishing gradients. GAP and GMP in the figure denote the global average pooling and global max pooling operations; they operate over the spatial dimensions in the channel attention branch and over the channel dimension in the spatial attention branch.
Like SGE [Xiang Li, Xiaolin Hu, and Jian Yang. Spatial group-wise enhance: Improving semantic feature learning in convolutional networks. 2019.], the invention aims to capture the response between spatial features and channel features, i.e. the similarity between global features and local features within each group. The invention therefore introduces a grouping strategy into the spatial attention (SA) module, producing a group-level spatial attention (GSA) module that can capture local information to complement the global information extracted by the channel attention (CA) module. The SA and CA modules mentioned here are described in detail in CBAM [Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. 2018.]. Formally, define the input feature map as Q ∈ R^{C×H×W}. The invention obtains the spatial attention M_GS(Q) ∈ R^{C×H×W} through the GSA module and the channel response M_C(Q) ∈ R^{C×1×1} through the CA module.
Further pass throughOperation assignment of fused weights->To refine the original input feature Q. In addition, in order to reduce the possibility of gradient disappearance and speed training progress, the invention also introduces attention residual, namely byThe operation directly establishes a connection between Q and the final refined feature. Finally, the saliency enhancement feature of the group enhanced attention module output +.>The production process of (2) is shown in the following formula (9).
Representing element-by-element multiplication, where M C(Q) and MGS (Q) between->The operations include broadcast operations that automatically multiply M by element C The size C of (Q) 1*1 is converted to M GS The sizes of (Q) c×h×w are identical.
In the above technical scheme, the construction method of the group-level spatial attention GSA module in the step eight is as follows:
The complete feature input to a general attention module consists of sub-features distributed in groups across the multiple channels of the feature. When these sub-features are all treated in the same way, they are liable to be affected by background noise, which easily leads to erroneous recognition and localization results. In view of this, the invention proposes a group-level spatial attention GSA module for generating a local spatial response in each individual group divided from the original feature map. That is, the input feature map Q is divided into groups Q = {Q^1, …, Q^G} by the grouping strategy, where Q^l is the feature map group with group index l and G is the total number of groups, 16 in this example. This effectively captures information from the sub-features through targeted learning and noise suppression. The local spatial response M_SA(Q^l) of group l is then obtained using the SA module, which is described in detail in CBAM [Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. 2018.]. Finally, the output response of the group-level spatial attention module is generated as

M_GS(Q) = Concat(M_SA(Q^1), …, M_SA(Q^G))
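A simplified numpy sketch of the grouping enhanced attention, with the CBAM-style spatial and channel attention reduced to parameter-free average/max pooling plus a sigmoid; the function names and the 64-channel toy input are assumptions of the example:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(g):
    """CBAM-style spatial attention, simplified to parameter-free pooling:
    average- and max-pool over channels, then squash to a (1, H, W) map."""
    return sigmoid(g.mean(axis=0, keepdims=True) + g.max(axis=0, keepdims=True))

def channel_attention(q):
    """CBAM-style channel attention, simplified: GAP and GMP over the
    spatial dimensions, then squash to a (C, 1, 1) response."""
    return sigmoid(q.mean(axis=(1, 2), keepdims=True) + q.max(axis=(1, 2), keepdims=True))

def group_enhanced_attention(q, groups=16):
    """Grouping enhanced attention: per-group spatial responses (GSA) are
    concatenated along the channels, combined with the channel response (CA)
    by broadcast multiplication, applied to the input, and added back
    through a residual connection, as in formula (9)."""
    C, H, W = q.shape
    per_group = C // groups
    m_gs = np.concatenate(
        [np.repeat(spatial_attention(g), per_group, axis=0)
         for g in np.split(q, groups, axis=0)], axis=0)   # (C, H, W)
    m_c = channel_attention(q)                            # (C, 1, 1), broadcast below
    return q + q * (m_c * m_gs)                           # attention residual

rng = np.random.default_rng(3)
q = rng.standard_normal((64, 8, 8))
refined = group_enhanced_attention(q)
```

Because both attention maps lie in (0, 1), every element of the refined feature is scaled by a factor between 1 and 2, so saliency is enhanced without changing sign.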
In the technical scheme, the method for fusing the spatial category probability distribution, the temporal category probability distribution and the feature fusion category probability distribution in the step ten comprises the following steps:
The invention uses a weighted average fusion method, i.e. δ = δ_s·w_s + δ_t·w_t + δ_f·w_f, where w_s, w_t and w_f denote the weights of the spatial stream, the temporal stream and the feature fusion stream, respectively. The default fusion weights of the three streams are 0.4, 2.7 and 2.4, respectively, and can be adjusted according to the requirements of the actual application.
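The weighted three-stream fusion of step ten and the argmax decision of step eleven can be sketched as follows, using the default weights 0.4, 2.7 and 2.4; fuse_streams and the toy 3-class distributions are hypothetical:

```python
import numpy as np

def fuse_streams(d_s, d_t, d_f, w_s=0.4, w_t=2.7, w_f=2.4):
    """Weighted average fusion of the spatial, temporal and feature-fusion
    class probability distributions (step ten), followed by the argmax
    decision of step eleven."""
    delta = d_s * w_s + d_t * w_t + d_f * w_f
    return delta, int(np.argmax(delta))

# toy 3-class example: the temporal and fusion streams agree on class 1
delta, pred = fuse_streams(np.array([0.7, 0.2, 0.1]),
                           np.array([0.1, 0.8, 0.1]),
                           np.array([0.2, 0.6, 0.2]))
```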
To verify the accuracy and robustness of the present invention, the present invention conducted experiments on the disclosed UCF101 and HMDB51 datasets.
UCF101 is a typical and challenging human action recognition dataset containing 13320 videos collected from the YouTube video website at a resolution of 320×240. It contains a total of 101 action categories, each performed by 25 subjects. The UCF101 dataset has great diversity in motion capture, including camera motion, appearance changes, pose changes, object scale changes, background changes, lighting changes, and so on. The 101 actions can be broadly divided into five types: human-object interaction, body motion only, human-human interaction, playing musical instruments, and sports.
The HMDB51 dataset contains 6849 video samples at 320×240 resolution and consists of 51 categories, each containing at least 101 samples. Most videos come from movies, and some from public datasets or online video libraries such as YouTube. The action categories can be divided into five types: general facial actions, facial actions with object manipulation, general body movements, body movements with object interaction, and body movements for human interaction. Background clutter and changing lighting conditions make it very challenging to recognize the target actions represented by a video.
Table 1 is the individual parameter settings of the two data sets in the experiment:
table 1 database experimental parameter settings
Table 2 shows the test results of the method MDFFEN according to the present invention on UCF101 and HMDB51 data sets, and the present invention achieves a higher recognition rate on both data sets. Although these two data sets present difficulties of occlusion, deformation, background clutter, low resolution, etc., the method proposed by the present invention is robust to these difficulties and therefore performs relatively well.
TABLE 2 identification rates on UCF101 and HMDB51
Data set | UCF101 | HMDB51 |
MDFFEN | 95.3% | 71.6% |
The invention mainly proposes two mechanisms, namely multi-level spatio-temporal feature fusion and grouping enhanced attention. As can be seen from Table 3, on the UCF101 dataset the accuracy of the dual-stream network alone reaches 93.61%. Adding multi-level spatio-temporal feature fusion to the base network improves the accuracy to 94.63%. Adding the grouping enhanced attention on top of this further improves the accuracy to 95.31%. The experimental results show that the multi-level spatio-temporal feature fusion effectively extracts mixed features at multiple depth levels, that the grouping enhanced attention further sharpens the discriminative features within the mixed features, and that both mechanisms benefit behavior recognition performance and effectively improve recognition accuracy.
TABLE 3 influence of two mechanisms on UCF101 dataset
While the present invention has been described in detail with reference to the drawings, the invention is not limited to the above embodiments, and various changes may be made within the knowledge of those skilled in the art without departing from the spirit of the invention.
Claims (7)
1. A double-flow network behavior recognition method based on multi-level space-time feature fusion enhancement is characterized by comprising the following steps:
step one, obtaining RGB frames: performing frame extraction on each video in the dataset to obtain the RGB original frames f_rgb, where N is the number of frames;
step two, calculating the optical flow maps: applying the TV-L1 algorithm to adjacent pairs of the RGB original frames f_rgb to obtain the optical flow maps;
Step three, segmenting all the extracted RGB frames and optical flow maps: the RGB frames and optical flow maps acquired in steps one and two are each divided into three segments s_rgb and s_opt; each segment is temporally continuous, and no two segments overlap;
step four, constructing the spatial network input: RGB frames are randomly sampled from each of the three segments of s_rgb to construct the spatial network inputs RGB_i (i = 1, 2, 3);
Step five, constructing the temporal network input: several consecutive optical flow maps are randomly sampled from each of the three segments of s_opt to construct the temporal network inputs OPT_i (i = 1, 2, 3);
Step six, calculating the spatial class probability distribution O_S based on the spatial network N_s: the spatial network inputs constructed in step four are respectively fed into the spatial network N_s to extract features; the spatial network N_s is constructed based on the InceptionV3 network, and the spatial class probability distribution O_S = {O_S^1, O_S^2, O_S^3} is obtained through global average pooling and fully connected operations, where O_S^i represents the spatial class probability distribution corresponding to the ith RGB frame segment RGB_i of step three;
step seven, calculating the temporal class probability distribution O_T based on the temporal network N_t: the temporal network inputs constructed in step five are respectively fed into the temporal network N_t to extract features; the temporal network N_t is constructed based on the InceptionV3 network, and the temporal class probability distribution O_T = {O_T^1, O_T^2, O_T^3} is obtained through global average pooling and fully connected operations, where O_T^i represents the temporal class probability distribution corresponding to the ith optical flow map segment OPT_i of step three;
step eight, calculating the feature fusion class probability distribution O_F based on the two-stream fusion network N_TSFF: the multi-level space-time feature fusion module embeds space-time feature fusion modules STFF into several sub-modules of InceptionV3 in the spatial network N_s and the temporal network N_t to fuse and extract multi-depth-level mixed features; the extracted features are refined by the group-enhanced attention module, and finally the feature fusion class probability distribution O_F = {O_F^1, O_F^2, O_F^3} is obtained through global average pooling and fully connected operations, where O_F^i represents the feature fusion class probability distribution corresponding to the ith RGB frame segment RGB_i and ith optical flow map segment OPT_i of step three;
step nine, calculating the multi-segment fused class probability distributions: according to the multi-segment class probability distributions O_S, O_T and O_F obtained in steps six, seven and eight, the multi-segment fused class probability distributions delta_s, delta_t and delta_f are obtained by averaging over the three segments;
Step ten, calculating the class probability distribution delta of the three-stream weighted fusion: on the basis of the two-stream network, the multi-segment fused spatial class probability distribution delta_s, the multi-segment fused temporal class probability distribution delta_t and the multi-segment fused feature fusion class probability distribution delta_f obtained in step nine are fused, and the class probability distribution delta is calculated by the weighted average fusion method;
step eleven, calculating the final classification result P: P = argmax(delta), where argmax(delta) computes the index of the maximum value in the vector delta, i.e., the behavior category with the highest class probability among all behavior categories.
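The segment sampling and fusion pipeline of claim 1 (step three and steps six to eleven) can be sketched in NumPy as follows; the InceptionV3 backbone is replaced by a random feature map, and the feature sizes, fusion weights and function names are illustrative assumptions, not values fixed by the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

def split_into_segments(items, num_segments=3):
    """Step three: divide a frame/flow sequence into equal, temporally
    ordered, non-overlapping segments."""
    bounds = np.linspace(0, len(items), num_segments + 1, dtype=int)
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_segments)]

def class_probabilities(feature_map, weights, bias):
    """Steps six to eight: global average pooling over the spatial
    dimensions of a (C, H, W) feature map, then a fully connected layer
    with softmax, yielding one class probability distribution."""
    pooled = feature_map.mean(axis=(1, 2))
    logits = weights @ pooled + bias
    e = np.exp(logits - logits.max())
    return e / e.sum()

def fuse_streams(seg_probs_s, seg_probs_t, seg_probs_f, w=(1.0, 1.5, 1.0)):
    """Steps nine to eleven: average each stream's three segment-level
    distributions, weighted-average the three streams, then take argmax.
    The weights w are an illustrative choice, not fixed by the claim."""
    delta_s, delta_t, delta_f = (np.mean(p, axis=0)
                                 for p in (seg_probs_s, seg_probs_t, seg_probs_f))
    delta = (w[0] * delta_s + w[1] * delta_t + w[2] * delta_f) / sum(w)
    return delta, int(np.argmax(delta))

# Toy walk-through: 30 "frames", a random 16-channel backbone feature
# per segment, and 3 behavior classes.
segments = split_into_segments(list(range(30)))
W, b = rng.standard_normal((3, 16)) * 0.1, np.zeros(3)
seg_probs = np.stack([class_probabilities(rng.standard_normal((16, 4, 4)), W, b)
                      for _ in segments])
delta, pred = fuse_streams(seg_probs, seg_probs, seg_probs)
```

Each fused distribution still sums to one, so the weighted average of the three streams remains a valid class probability distribution before the argmax is taken.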
2. The dual-flow network behavior recognition method based on multi-level space-time feature fusion enhancement according to claim 1, wherein the model for completing the method comprises a spatial network, a temporal network, a feature fusion network, multi-segment class probability distribution fusion and multi-stream class probability distribution fusion; the spatial network and the temporal network are both constructed based on InceptionV3, and the feature fusion network is constructed from the spatial network and the temporal network; the multi-level space-time feature fusion module fuses space-time mixed features at different depth levels, where the space-time mixed features are obtained by the space-time feature fusion module from features extracted from the spatial network and the temporal network respectively; the multi-depth-level mixed features are then refined by the group-enhanced attention module, and the feature fusion class probability distribution is obtained through global average pooling and fully connected operations, as in the spatial and temporal networks; the class probability distributions extracted from the three segment inputs of each stream are then fused to obtain the multi-segment fused class probability distribution of the corresponding stream, and finally the multi-segment fused class probability distributions of the three streams are fused by the weighted average method.
3. The dual-flow network behavior recognition method based on multi-level space-time feature fusion enhancement according to claim 1, wherein the whole process of step eight is formally written as the following formula:
O_F^i = FC(GAP(M_GSCE(M_MDFF(RGB_i, OPT_i))))
where M_MDFF(·,·) represents the multi-level space-time feature fusion module, and M_GSCE(·) represents the output features of the group-enhanced attention module; FC represents the fully connected operation, and GAP represents the global average pooling operation.
4. The dual-flow network behavior recognition method based on multi-level space-time feature fusion enhancement according to claim 3, wherein the multi-level space-time feature fusion method applied in step eight is as follows: InceptionV3 consists of j sub-modules connected in series, namely Inc.1 to Inc.j, from which features of different depth levels can be extracted; a space-time feature fusion module STFF is embedded into each sub-module of the spatial network and the temporal network to capture mixed features at different depth levels; the mixed space-time features generated by the sub-modules at multiple depths of the network are cascaded to obtain abstract convolutional mixed space-time features with multiple depth levels; the scheme of the multi-level space-time feature fusion module M_MDFF(·,·) is shown below:
M_MDFF(RGB_i, OPT_i) = Conv([M_STFF(S^i_Inc.l, T^i_Inc.l)]_Inc.l1^Inc.l2) (2)
where M_STFF(·,·) represents the space-time feature fusion module; S^i_Inc.j and T^i_Inc.j respectively represent the features extracted from the Inc.j sub-module after RGB_i and OPT_i are fed into the spatial network and the temporal network; [·]_Inc.l1^Inc.l2 represents the cascade of the mixed features generated from Inc.l1 to Inc.l2; Conv(·) represents a convolution operation.
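A shape-level sketch of the cascade idea in claim 4, in NumPy; the STFF stand-in, the identity-slice "convolution" weights, and the assumption that all sub-module features share one spatial size are illustrative simplifications, not the patent's implementation:

```python
import numpy as np

def stff_stub(s_feat, t_feat):
    """Placeholder for the STFF module (defined in claim 5): here just an
    element-wise sum, so the cascade structure can be seen."""
    return s_feat + t_feat

def m_mdff(spatial_feats, temporal_feats, l1, l2):
    """Shape-level sketch of M_MDFF: fuse per-sub-module spatial/temporal
    features from Inc.l1..Inc.l2 with STFF, cascade them along the channel
    axis, then mix channels with 1x1-convolution-like weights (here an
    identity slice, purely illustrative)."""
    mixed = [stff_stub(s, t)
             for s, t in zip(spatial_feats[l1:l2 + 1], temporal_feats[l1:l2 + 1])]
    cascade = np.concatenate(mixed, axis=0)            # (sum of C_j, H, W)
    w = np.eye(cascade.shape[0])[: cascade.shape[0] // 2]
    return np.tensordot(w, cascade, axes=1)

rng = np.random.default_rng(0)
sp = [rng.standard_normal((4, 5, 5)) for _ in range(3)]  # Inc.1..Inc.3 spatial features
tp = [rng.standard_normal((4, 5, 5)) for _ in range(3)]  # temporal counterparts
out = m_mdff(sp, tp, 0, 2)
```

In the real network the Inc.j feature maps have different spatial resolutions, so the cascade would require resizing first; the sketch only shows the channel-wise concatenation and mixing structure.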
5. The dual-flow network behavior recognition method based on multi-level space-time feature fusion enhancement according to claim 4, wherein the output features of the space-time feature fusion module are formed by fusing three types of features: the preliminary mixed space-time features, the spatial features and the temporal features; the specific process of the space-time feature fusion module is as follows: first, the spatial features extracted from the sub-modules of the spatial network and the temporal features extracted from the sub-modules of the temporal network are fused through element-wise summation and convolution operations to obtain the preliminary mixed abstract features; by ignoring the superscript i and the subscript Inc.j in equation (2), S^i_Inc.j and T^i_Inc.j are written as S ∈ R^(C×H×W) and T ∈ R^(C×H×W) for ease of expression, where C, H and W respectively represent the number of channels, height and width of the feature map; the preliminary mixed abstract feature F is then formally expressed as the following formula:
F = Φ(S, T) = Ψ_{3×3,C}(S ⊕ T) (3)
where Ψ_{k,n} represents a sequence of ReLU(BN(Conv(·))) operations with convolution kernel size k and filter number n, in which ReLU and BN respectively represent the ReLU activation function and the batch normalization operation, Conv(·) represents a convolution operation, and ⊕ represents the element-wise summation operation;
in order to suppress invalid information and extract valid information, a feature extractor M_FE(·) is employed; M_FE(·) consists of two Ψ_{3×3,n} operation components with different filter numbers n, where the first filter number is half the number of input channels C and the other equals the number of input channels; all the spatial features S, temporal features T and preliminary space-time mixed features F are then independently refined through the feature extractor M_FE(·) to obtain nonlinear abstract features; the detailed procedure of the feature extractor M_FE(·) is expressed as the following formulas:
M_FE(Z) = Z_FE2 = Ψ_{3×3,C}(Z_FE1) (4)
Z_FE1 = Ψ_{3×3,C/2}(Z) (5)
where Z ∈ {S, T, F} represents the input features of M_FE(·), and S, T, F respectively represent the spatial features, temporal features and preliminary space-time mixed features;
then, the spatial features S_FE2 and temporal features T_FE2 refined by the feature extractor M_FE(·) are respectively fused with the refined mixed features F_FE2 to obtain the deeper fusion features F_S and F_T, as follows:
F_S = Φ(S_FE2, F_FE2) (6)
F_T = Φ(T_FE2, F_FE2) (7)
where Φ(·,·) is the same as in formula (3);
finally, F_S and F_T are fused by Φ(·,·) to obtain the final mixed space-time features of the space-time feature fusion module STFF:
M_STFF(S, T) = Φ(F_S, F_T) (8).
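The STFF data flow of claim 5 can be sketched as follows; the Ψ stand-in uses a fixed random 1×1 channel projection plus ReLU instead of a learned Conv-BN-ReLU, so only the shape behaviour and fusion order are faithful, not the learned transformation:

```python
import numpy as np

rng = np.random.default_rng(0)

def psi(x, out_channels):
    """Stand-in for Ψ_{3x3,n}: a fixed random 1x1 channel projection plus
    ReLU, replacing Conv-BN-ReLU (shape behaviour only, no learning)."""
    w = rng.standard_normal((out_channels, x.shape[0])) * 0.1
    return np.maximum(np.tensordot(w, x, axes=1), 0.0)

def phi(a, b):
    """Φ(·,·): element-wise summation followed by Ψ with C filters."""
    return psi(a + b, a.shape[0])

def feature_extractor(z):
    """M_FE: Ψ with C/2 filters, then Ψ with C filters (formula (4))."""
    return psi(psi(z, z.shape[0] // 2), z.shape[0])

def stff(S, T):
    """Sketch of the STFF module of claim 5: preliminary mix F = Φ(S, T),
    refine S, T and F with M_FE, fuse into F_S and F_T, then fuse those."""
    F = phi(S, T)
    S2, T2, F2 = (feature_extractor(x) for x in (S, T, F))
    return phi(phi(S2, F2), phi(T2, F2))

S = rng.standard_normal((8, 4, 4))   # spatial features, C = 8
T = rng.standard_normal((8, 4, 4))   # temporal features
mixed = stff(S, T)
```

The output keeps the input channel count C, matching the claim's description that each Ψ-based fusion restores the number of input channels.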
6. The dual-flow network behavior recognition method based on multi-level space-time feature fusion enhancement according to claim 1, wherein the group-enhanced attention module in step eight comprises a group-level spatial attention module and a channel attention module, and the two attention modules are connected in parallel; the group-level spatial attention module is used to mine each local region of interest, and the channel attention module is used to capture the global response along the channel dimension; the outputs of the two attention modules are combined and multiplied element-wise with the original input feature map to enhance spatial and channel saliency; finally, a residual connection is used to reduce the possibility of gradient vanishing; the global average pooling operation GAP and the global max pooling operation GMP act on the channel dimension in the spatial attention module and on the spatial dimension in the channel attention module, respectively; the method is as follows:
a grouping strategy is introduced into the spatial attention SA module to generate the group-level spatial attention GSA module, which captures local information to supplement the global information extracted by the channel attention CA module; the SA and CA modules formally define the input feature map as Q ∈ R^(C×H×W); the spatial attention map and the channel response are obtained through the GSA module and the CA module respectively;
the fused attention weights are assigned through an element-wise multiplication operation to refine the original input feature Q; a residual connection is introduced through an element-wise summation operation that directly links Q to the final refined feature; finally, the saliency-enhanced feature output by the group-enhanced attention module is generated as shown in formula (9);
7. The dual-flow network behavior recognition method based on multi-level space-time feature fusion enhancement according to claim 6, wherein the construction method of the group-level spatial attention GSA module is as follows: the group-level spatial attention GSA module generates a local spatial response in each individual group divided from the original feature map; the input feature map Q is divided into groups {Q^1, …, Q^G} by the grouping strategy, where Q^l is the feature map group with group number l and G represents the total number of divided groups, which effectively captures information from the sub-features through targeted learning and noise suppression; the local spatial response of group l is then obtained using the SA module; finally, the output response of the group-level spatial attention module is generated as shown in the following formula:
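The grouping strategy of claim 7 can be sketched as follows; the internal SA module here is a minimal channel-pooling-plus-sigmoid stand-in, since the patent does not fully specify its layers, and all sizes are illustrative:

```python
import numpy as np

def spatial_attention(q):
    """Minimal SA stand-in: GAP and GMP along the channel dimension,
    summed and squashed to a (1, H, W) weight map with a sigmoid.
    (An assumption; the patent's SA layers are not detailed here.)"""
    m = q.mean(axis=0) + q.max(axis=0)
    s = 1.0 / (1.0 + np.exp(-m))
    return s[None, ...]

def group_spatial_attention(q, groups):
    """Sketch of the GSA module of claim 7: split the channels of Q into
    G groups, apply spatial attention within each group, and reassemble."""
    chunks = np.array_split(q, groups, axis=0)
    refined = [c * spatial_attention(c) for c in chunks]
    return np.concatenate(refined, axis=0)

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 4))   # illustrative input feature map Q
out = group_spatial_attention(q, groups=4)
```

Because each group computes its own attention map, local regions that are salient in only a few channels are not averaged away by the other channels, which is the motivation the claim gives for grouping.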
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010441559.3A CN111709306B (en) | 2020-05-22 | 2020-05-22 | Double-flow network behavior identification method based on multilevel space-time feature fusion enhancement |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111709306A CN111709306A (en) | 2020-09-25 |
CN111709306B true CN111709306B (en) | 2023-06-09 |