CN111709306B - Double-flow network behavior identification method based on multilevel space-time feature fusion enhancement - Google Patents

Double-flow network behavior identification method based on multilevel space-time feature fusion enhancement

Info

Publication number
CN111709306B
CN111709306B
Authority
CN
China
Prior art keywords
network
space
features
time
module
Prior art date
Legal status
Active
Application number
CN202010441559.3A
Other languages
Chinese (zh)
Other versions
CN111709306A (en)
Inventor
孔军
王圣全
蒋敏
Current Assignee
Jiangnan University
Original Assignee
Jiangnan University
Priority date
Filing date
Publication date
Application filed by Jiangnan University filed Critical Jiangnan University
Priority to CN202010441559.3A priority Critical patent/CN111709306B/en
Publication of CN111709306A publication Critical patent/CN111709306A/en
Application granted
Publication of CN111709306B publication Critical patent/CN111709306B/en


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/254 - Fusion techniques of classification results, e.g. of results related to same input data
    • G06F 18/256 - Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks


Abstract

A two-stream network behavior recognition method based on multi-level spatio-temporal feature fusion enhancement. The method adopts a network architecture built on a spatio-temporal two-stream network, called the multi-level spatio-temporal feature fusion enhancement network. Because the traditional two-stream network fuses only the class probability distributions of the two streams at its final layer, the contribution of shallow features is ignored and the complementary characteristics of the two streams cannot be fully exploited; to address this, the invention provides a multi-level spatio-temporal feature fusion module that captures mixed features at multiple depth levels through spatio-temporal feature fusion modules placed at different depth levels of the two streams, so as to make full use of the two-stream network. Furthermore, treating all features equally in the network weakens the effect of the features that contribute significantly to classification, so the invention introduces a grouping enhanced attention module that automatically enhances the saliency of the effective regions and channels of the features. Finally, the invention further improves the robustness of the behavior recognition model by aggregating the classification results of the two-stream network and of the feature fusion stream.

Description

Double-flow network behavior identification method based on multilevel space-time feature fusion enhancement
Technical Field
The invention belongs to the field of machine vision, and particularly relates to a double-flow network behavior recognition method based on multi-level space-time feature fusion enhancement.
Background
Action recognition has become an active field in computer vision and is widely applied in video surveillance, violence detection, human-computer interaction and other areas. Video action recognition aims to mine the key features that express the target action represented by a video. Compared with static images, videos contain rich motion information; however, the diversity of action scenes still makes the extraction of effective features challenging. Therefore, taking video as the research object, the invention addresses the problems faced in extracting spatial and temporal features from video and proposes a distinctive feature fusion method and attention method to effectively extract discriminative features for behavior recognition.
Currently, video-oriented behavior recognition mainly uses two-stream networks, and this line of work is developing rapidly. In a two-stream architecture, appearance information and motion information are captured by training separate convolutional networks on the appearance frames and on stacks of optical flow, and the classification results of the two convolutional networks are finally fused at the score level. However, conventional two-stream networks still face the following problems: (1) how can the information captured separately by the two streams be used effectively? (2) how can the captured features be refined effectively, given that treating every region and channel of the features equally weakens the effect of the regions and channels that are useful for classification? (3) how can the acquired spatial information and temporal information be fused effectively?
Based on the above considerations, the invention provides a two-stream network behavior recognition method based on multi-level spatio-temporal feature fusion enhancement. First, the proposed spatio-temporal feature fusion module fuses the features of sub-modules at different depth levels of the two-stream network to extract mixed features at multiple depth levels. Second, the extracted mixed features are further refined by the proposed grouping enhanced attention module, so that the network automatically focuses on the regions and channels of the features that matter for classification.
Disclosure of Invention
The main purpose of the invention is to provide a multi-level spatio-temporal feature fusion enhanced two-stream network (MDFFEN) behavior recognition method, which better captures the effective features of a video and the discriminative information within those features, so as to perform efficient behavior recognition.
In order to achieve the above object, the present invention provides the following technical solutions:
a double-flow network behavior recognition method based on multi-level space-time feature fusion enhancement comprises the following steps:
step one, obtaining RGB frames: performing frame taking processing on each video in the data set to obtain RGB original frames
f_rgb = {f_1, f_2, ..., f_N}, where N is the number of frames;
Step two, computing the optical flow maps: the TV-L1 algorithm [Coloma Ballester, Lluis Garrido, Vanel Lazcano, and Vicent Caselles. A TV-L1 optical flow method with occlusion detection. In DAGM, 2013.] is applied to consecutive pairs of the RGB original frames f_rgb to obtain the optical flow maps f_opt;
Step three, segmenting all the extracted RGB frames and optical flow maps: all RGB frames and optical flow maps acquired in step one and step two are divided into three segments, s_rgb = {s_rgb^1, s_rgb^2, s_rgb^3} and s_opt = {s_opt^1, s_opt^2, s_opt^3}; each segment is continuous in time, and any two segments do not overlap;
Step four, randomly sampling RGB frames from each segment of s_rgb to construct the input of the spatial network: RGB = {RGB^1, RGB^2, RGB^3}, where RGB^i is sampled from the i-th segment s_rgb^i;
Step five, randomly sampling several optical flow maps from each segment of s_opt to construct the input of the temporal network: OPT = {OPT^1, OPT^2, OPT^3}, where OPT^i is sampled from the i-th segment s_opt^i;
Step six, calculating the spatial class probability distribution O_S based on the spatial network N_s: the spatial network inputs RGB^i constructed in step four are fed into the spatial network N_s for feature extraction; N_s is built on InceptionV3, and the spatial class probability distribution O_S = {O_S^1, O_S^2, O_S^3} is obtained through a global average pooling operation and a fully connected operation, where O_S^i is the spatial class probability distribution corresponding to the i-th RGB frame segment RGB^i of step three;
Step seven, calculating the temporal class probability distribution O_T based on the temporal network N_t: the temporal network inputs OPT^i constructed in step five are fed into the temporal network N_t for feature extraction; N_t is built on InceptionV3 [Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the Inception Architecture for Computer Vision. In Computer Vision & Pattern Recognition, 2016.], and the temporal class probability distribution O_T = {O_T^1, O_T^2, O_T^3} is obtained through a global average pooling operation and a fully connected operation, where O_T^i is the temporal class probability distribution corresponding to the i-th optical flow segment OPT^i of step three;
Step eight, calculating the feature fusion class probability distribution O_F based on the two-stream fusion network N_TSFF: the multi-level spatio-temporal feature fusion module embeds spatio-temporal feature fusion modules STFF into several InceptionV3 sub-modules of the spatial network N_s and the temporal network N_t to fuse and extract mixed features at multiple depth levels; the extracted features are then further refined by the grouping enhanced attention module, and the feature fusion class probability distribution O_F = {O_F^1, O_F^2, O_F^3} is finally obtained through a global average pooling operation and a fully connected operation, where O_F^i is the feature fusion class probability distribution corresponding to the i-th RGB frame segment RGB^i and the i-th optical flow segment OPT^i of step three;
Step nine, calculating the multi-segment fused class probability distributions: from the per-segment class probability distributions O_S^i, O_T^i and O_F^i obtained in step six, step seven and step eight, the multi-segment fused class probability distributions δ_s, δ_t and δ_f are obtained by averaging over the three segments;
Step ten, calculating the class probability distribution δ of the three-stream weighted fusion: on the basis of the two-stream network, the multi-segment fused spatial class probability distribution δ_s, the multi-segment fused temporal class probability distribution δ_t and the multi-segment fused feature fusion class probability distribution δ_f obtained in step nine are combined; the invention uses a weighted average fusion method;
Step eleven, calculating the final classification result P: P = argmax(δ), where argmax(δ) returns the index of the maximum value in the vector δ, i.e. the behavior category with the highest class probability among all behavior categories.
Compared with the prior art, the invention has the following beneficial effects:
1. The two-stream feature fusion network constructed in step eight performs feature fusion at different depth levels of the two streams to obtain spatio-temporal mixed features at multiple depth levels, making full use of the shallow features and of the complementary characteristics of the two streams.
2. The two-stream feature fusion network constructed in step eight introduces a grouping enhanced attention module to further refine the local and global information of the extracted mixed features, effectively improving behavior recognition accuracy.
Drawings
FIG. 1 is a flow chart of an algorithm of the present invention;
FIG. 2 is a diagram of an algorithm model of the present invention;
FIG. 3 is a diagram of the two-stream feature fusion network N_TSFF;
FIG. 4 is a diagram of the spatio-temporal feature fusion module;
FIG. 5 is a diagram of the grouping enhanced attention module.
Detailed Description
FIG. 2 is an overall model diagram of the present invention;
FIG. 2 shows the algorithm model of the invention. The algorithm takes multi-segment RGB images and optical flow maps as input, and the model comprises a spatial network, a temporal network, a feature fusion network, multi-segment class probability distribution fusion and multi-stream class probability distribution fusion. The spatial network and the temporal network are both built on InceptionV3, and the feature fusion network is constructed from the spatial and temporal networks. In brief, the proposed multi-level spatio-temporal feature fusion module fuses spatio-temporal mixed features at different depth levels, where each spatio-temporal mixed feature is obtained by fusing, with the proposed spatio-temporal feature fusion module, the features extracted from the spatial network and the temporal network respectively; the proposed grouping enhanced attention module then further refines the multi-depth-level mixed features, and the feature fusion class probability distribution is obtained through global average pooling and fully connected operations, as in the spatial and temporal networks. The class probability distributions extracted from the three segment inputs of each stream are then fused to obtain the multi-segment fused class probability distribution of that stream, and finally the multi-segment fused class probability distributions of the three streams are fused by a weighted average method.
For a better explanation of the invention, the public behavior data set UCF101 is used as an example below.
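Before turning to the sampling details of steps four and five, the following sketch illustrates one way to implement steps one and two with OpenCV: frames are read from a video file and TV-L1 optical flow is computed between consecutive frames. This is a minimal sketch under stated assumptions, not the patented implementation; it assumes the opencv-contrib-python package is installed, and the function and path names are illustrative.

```python
import cv2
import numpy as np

def extract_rgb_frames(video_path):
    """Step one: read every frame of the video as an RGB image (f_rgb)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, bgr = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames  # f_rgb = {f_1, ..., f_N}

def compute_tvl1_flow(frames):
    """Step two: pairwise TV-L1 optical flow between consecutive frames (f_opt)."""
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()  # assumes opencv-contrib-python
    flows = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_RGB2GRAY)
    for frame in frames[1:]:
        curr = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        flows.append(tvl1.calc(prev, curr, None).astype(np.float32))  # H x W x 2 displacement
        prev = curr
    return flows  # N-1 flow maps
```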
In the technical scheme, the specific method in step four for randomly sampling RGB frames from each segment of s_rgb is as follows:
starting from a random position in the i-th RGB frame segment s_rgb^i obtained in step three, L_s consecutive RGB frames are taken to obtain RGB^i, where L_s is 1 in this example.
In the technical scheme, the specific method in step five for randomly sampling several optical flow maps from each segment of s_opt is as follows:
starting from a random position in the i-th optical flow segment s_opt^i obtained in step three, L_t consecutive optical flow maps are taken to obtain OPT^i, where L_t is 5 in this example.
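A minimal sketch of the segmentation and sampling just described (steps three to five), assuming frame and flow-map lists such as those produced by the previous sketch; the helper names and the equal-length split are illustrative choices, not mandated by the text.

```python
import random

def split_into_segments(items, num_segments=3):
    """Step three: three temporally ordered, non-overlapping segments."""
    seg_len = len(items) // num_segments
    return [items[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]

def sample_consecutive(segment, length):
    """Steps four/five: take `length` consecutive items starting at a random position."""
    start = random.randint(0, max(0, len(segment) - length))
    return segment[start:start + length]

def build_inputs(rgb_frames, flow_maps, L_s=1, L_t=5):
    s_rgb = split_into_segments(rgb_frames)
    s_opt = split_into_segments(flow_maps)
    rgb_inputs = [sample_consecutive(seg, L_s) for seg in s_rgb]  # RGB^1 ... RGB^3
    opt_inputs = [sample_consecutive(seg, L_t) for seg in s_opt]  # OPT^1 ... OPT^3
    return rgb_inputs, opt_inputs
```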
The two-stream feature fusion method in step eight of the technical scheme is specifically as follows:
Conventional two-stream network behavior recognition methods typically fuse the class probability distributions only at the final layer. Since such fusion acts on the deepest features of the final layer, the effect of shallow features on classification is often ignored. The invention therefore proposes a multi-level spatio-temporal feature fusion module, whose implementation is shown in FIG. 3. Unlike traditional methods, the multi-level spatio-temporal feature fusion module proposed in the invention also considers the shallow features of the deep network in order to capture mixed features with multiple depth levels. In addition, the invention proposes a grouping enhanced attention module to further refine the mixed features extracted by the multi-level spatio-temporal feature fusion module. Finally, the class probability distribution is generated by applying the fully connected layer FC to the feature vector, which is obtained by summarizing the feature maps through a global average pooling operation. The overall process of two-stream feature fusion is formally written as follows:

O_F^i = FC(GAP(M_GSCE(M_MDFF(RGB^i, OPT^i))))   (1)

where M_MDFF(·,·) denotes the multi-level spatio-temporal feature fusion module, M_GSCE(·) denotes the output feature of the grouping enhanced attention module, FC denotes a fully connected operation, and GAP denotes a global average pooling operation.
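As an illustration of formula (1), the following minimal PyTorch sketch wires an assumed multi-level fusion module and grouping enhanced attention module (both treated as black boxes supplied by the caller) to the global average pooling and fully connected head; the 2048-channel width and the 101-class output (UCF101) are assumptions for illustration only.

```python
import torch.nn as nn

class FusionStreamHead(nn.Module):
    def __init__(self, fusion_module, attention_module, channels=2048, num_classes=101):
        super().__init__()
        self.m_mdff = fusion_module      # M_MDFF(RGB^i, OPT^i) -> C x H x W feature map
        self.m_gsce = attention_module   # M_GSCE(.) -> refined C x H x W feature map
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, rgb_i, opt_i):
        feat = self.m_gsce(self.m_mdff(rgb_i, opt_i))
        vec = self.gap(feat).flatten(1)   # global average pooling -> feature vector
        return self.fc(vec)               # O_F^i (class logits for segment i)
```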
The multi-level spatio-temporal feature fusion method applied in step eight of the technical scheme is as follows:
InceptionV3 consists of 11 sub-modules connected in series, namely Inc.1 to Inc.11, from which features at different depth levels can be extracted. To further enhance the classification ability of the InceptionV3 network, the invention embeds a spatio-temporal feature fusion module STFF into sub-modules of the spatial network and the temporal network to capture new features with different depth levels. In this embodiment the last four sub-modules, i.e. Inc.8 to Inc.11, are selected; the choice of sub-modules can be adjusted in a specific application according to the actual requirements. All the mixed spatio-temporal features generated at the selected depths of the network are cascaded to obtain abstract convolutional mixed spatio-temporal features with multiple depth levels. The scheme of the multi-level spatio-temporal feature fusion module M_MDFF(·,·) is as follows:

M_MDFF(RGB^i, OPT^i) = Conv(Concat(M_STFF(S_Inc.8^i, T_Inc.8^i), ..., M_STFF(S_Inc.11^i, T_Inc.11^i)))   (2)

where M_STFF(·,·) denotes the spatio-temporal feature fusion module; S_Inc.j^i and T_Inc.j^i denote the features extracted from the Inc.j sub-module when RGB^i and OPT^i are fed into the spatial network and the temporal network, respectively; Concat(·) denotes the cascade of the mixed features generated from Inc.8 to Inc.11; and Conv(·) denotes a convolution operation. In this example, 2048 convolution filters of kernel size 3×3 are used to further extract abstract features from the mixed features of different depth levels, so the number of channels of the resulting feature map becomes 2048.
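The following PyTorch sketch shows one way to realize M_MDFF of formula (2) once the STFF outputs of the selected sub-modules are available. Because different InceptionV3 sub-modules produce feature maps of different spatial sizes, the sketch resizes them to a common size before the cascade; that resizing (and the bilinear mode) is an implementation assumption not spelled out in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelFusion(nn.Module):
    def __init__(self, in_channels, out_channels=2048):
        super().__init__()
        # 3x3 convolution with 2048 filters, as described for this example;
        # in_channels must equal the total channel count of the cascaded STFF outputs.
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, stff_features):
        # stff_features: list of M_STFF outputs from Inc.8 ... Inc.11
        target = stff_features[-1].shape[-2:]  # assumed common spatial size
        resized = [F.interpolate(f, size=target, mode='bilinear', align_corners=False)
                   for f in stff_features]
        cascade = torch.cat(resized, dim=1)    # channel-wise cascade (Concat)
        return self.conv(cascade)              # 2048-channel mixed feature
```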
The specific construction of the spatio-temporal feature fusion module STFF in step eight of the technical scheme is as follows:
The output feature of the spatio-temporal feature fusion module is formed by fusing three types of features, namely the preliminary mixed spatio-temporal feature, the spatial feature and the temporal feature.
FIG. 4 shows the spatio-temporal feature fusion module. The label on each box gives the name and size of the corresponding feature map; ⊕ denotes the element-wise summation operation, and N_Filter is the number of convolution filters.
As detailed in FIG. 4, the spatial feature extracted from a sub-module of the spatial network and the temporal feature extracted from the corresponding sub-module of the temporal network are first fused through element-wise summation and convolution operations to obtain the preliminary mixed abstract feature. Ignoring the superscript i and the subscript Inc.j of equation (2), S_Inc.j^i and T_Inc.j^i are written as S ∈ R^(C×H×W) and T ∈ R^(C×H×W) for ease of expression, where C, H and W denote the number of channels, the height and the width of the feature map, respectively. The preliminary mixed abstract feature F is then formally expressed as:

F = Φ(S, T) = Ψ_k,n(S ⊕ T)   (3)

where Ψ_k,n denotes a sequence of ReLU(BN(Conv(·))) operations with convolution kernel size k and filter number n, in which ReLU and BN denote the ReLU activation function and the batch normalization operation, respectively, and Conv(·) denotes a convolution operation. In addition, to further suppress invalid information and extract valid information, the invention proposes a feature extractor M_FE(·). M_FE(·) consists of two Ψ_3×3,n operations with different filter numbers n, where the first filter number is half of the input channel number C and the second equals the input channel number. Through the feature extractor M_FE(·), each of the three types of features (the spatial feature S, the temporal feature T and the preliminary spatio-temporal mixed feature F) is independently refined into a nonlinear abstract feature. The detailed procedure of the feature extractor M_FE(·) is expressed as:

M_FE(Z) = Z_FE2 = Ψ_3×3,C(Z_FE1)   (4)
Z_FE1 = Ψ_3×3,C/2(Z)   (5)

where Z ∈ {S, T, F} denotes the input feature of M_FE(·), and S, T and F denote the spatial feature, the temporal feature and the preliminary spatio-temporal mixed feature, respectively.
The spatial feature S_FE2 and the temporal feature T_FE2 refined by the feature extractor M_FE(·) are then each fused with the refined mixed feature F_FE2 to obtain the deeper fusion features F_S and F_T, as follows:

F_S = Φ(S_FE2, F_FE2)   (6)
F_T = Φ(T_FE2, F_FE2)   (7)

where Φ(·,·) is the same as in formula (3).
Finally, F_S and F_T are fused by Φ(·,·) to obtain the final mixed spatio-temporal feature of the spatio-temporal feature fusion module STFF:

M_STFF(S, T) = Φ(F_S, F_T)   (8)
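A minimal PyTorch sketch of the STFF module defined by formulas (3) to (8). Where the text leaves Ψ_k,n generic, the fusion blocks Φ below use a 3×3 kernel with C filters, and each Φ instance and each M_FE branch has its own parameters; these choices are assumptions for illustration.

```python
import torch.nn as nn

def psi(in_ch, out_ch, k=3):
    """Psi_{k,n}: Conv -> BN -> ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class STFF(nn.Module):
    def __init__(self, channels):
        super().__init__()
        c = channels
        self.phi_st = psi(c, c)    # formula (3): fuse S and T
        self.phi_sf = psi(c, c)    # formula (6)
        self.phi_tf = psi(c, c)    # formula (7)
        self.phi_out = psi(c, c)   # formula (8)
        # feature extractor M_FE: Psi_{3x3,C/2} followed by Psi_{3x3,C}, one branch per input type
        self.fe_s = nn.Sequential(psi(c, c // 2), psi(c // 2, c))
        self.fe_t = nn.Sequential(psi(c, c // 2), psi(c // 2, c))
        self.fe_f = nn.Sequential(psi(c, c // 2), psi(c // 2, c))

    def forward(self, S, T):
        F0 = self.phi_st(S + T)                                   # formula (3)
        S2, T2, F2 = self.fe_s(S), self.fe_t(T), self.fe_f(F0)    # formulas (4)-(5)
        F_S = self.phi_sf(S2 + F2)                                # formula (6)
        F_T = self.phi_tf(T2 + F2)                                # formula (7)
        return self.phi_out(F_S + F_T)                            # formula (8)
```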
in the above technical scheme, the grouping attention enhancement module in the step eight is specifically as follows:
in order to obtain more efficient spatiotemporal features from global and local information, the present invention constructs a grouping enhanced attention module to further refine the hybrid features. Fig. 5 shows a detailed structure of the module. The connection of two of the attention modules is parallel, which allows the module to extract both spatial and temporal information.
Fig. 5 is a grouping enhanced attention module. The group-level spatial attention module is used to mine each local region of interest, while the channel attention module is used to capture global responses in the channel dimension. They are then connected to enhance spatial saliency and channel saliency by element-wise multiplication with the original input feature map. Finally, residual connection is utilized to reduce the likelihood of gradient extinction. GAP and GMP in the figure represent global average pooling operations and global maximum pooling operations. They both operate in the spatial dimension in the spatial attention module and in the temporal dimension in the channel attention module, respectively.
Similar to SGE [Xiang Li, Xiaolin Hu, and Jian Yang], the invention aims to capture the responses between spatial features and channel features, i.e. the similarities between the global features and the local features within each group. The invention therefore introduces a grouping strategy into the spatial attention (SA) module, producing a group-level spatial attention (GSA) module that captures local information to complement the global information extracted by the channel attention (CA) module. The SA and CA modules mentioned here are described in detail in CBAM [Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional Block Attention Module. 2018.]. Formally, the input feature map is defined as Q ∈ R^(C×H×W); the invention obtains the spatial attention M_GS(Q) ∈ R^(C×H×W) through the GSA module and the channel response M_C(Q) ∈ R^(C×1×1) through the CA module, and then refines the original input feature Q by assigning the fused weights M_C(Q) ⊗ M_GS(Q). In addition, to reduce the possibility of vanishing gradients and speed up training, the invention introduces an attention residual, i.e. a direct connection between Q and the final refined feature. Finally, the saliency-enhanced feature Q̃ output by the grouping enhanced attention module is generated as shown in formula (9):

Q̃ = Q ⊕ (Q ⊗ M_C(Q) ⊗ M_GS(Q))   (9)

where ⊕ denotes element-wise summation and ⊗ denotes element-wise multiplication; the ⊗ between M_C(Q) and M_GS(Q) includes a broadcast operation that automatically expands the C×1×1 size of M_C(Q) to the C×H×W size of M_GS(Q).
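A sketch of the combination in formula (9), assuming the channel-attention and group-level spatial-attention modules are supplied externally and return C×1×1 and C×H×W maps respectively; PyTorch broadcasting handles the size difference noted above. The module interfaces are assumptions, not a fixed API from the patent.

```python
import torch.nn as nn

class GroupingEnhancedAttention(nn.Module):
    def __init__(self, channel_attention, group_spatial_attention):
        super().__init__()
        self.m_c = channel_attention         # M_C(Q):  B x C x 1 x 1
        self.m_gs = group_spatial_attention  # M_GS(Q): B x C x H x W

    def forward(self, Q):
        weights = self.m_c(Q) * self.m_gs(Q)  # fused attention weights (broadcast)
        refined = Q * weights                 # refine the input feature element-wise
        return Q + refined                    # attention residual, formula (9)
```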
In the above technical scheme, the construction method of the group-level spatial attention GSA module in step eight is as follows:
The complete feature input to a general attention module consists of sub-features distributed in groups across the channels of the feature. These sub-features are usually treated in the same way and are therefore likely to be affected by background noise, which easily leads to erroneous recognition and localization results. In view of this, the invention proposes a group-level spatial attention GSA module for generating a local spatial response in each individual group divided from the original feature map. That is, the input feature map Q is divided by a grouping strategy into groups Q = {Q_1, Q_2, ..., Q_G}, where Q_l ∈ R^((C/G)×H×W) is the feature map group with group index l and G is the total number of groups, 16 in this example. This effectively captures information from the sub-features through targeted learning and noise suppression. The local spatial response M_S(Q_l) of group l is then obtained with the SA module, which is described in detail in CBAM [Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional Block Attention Module. 2018.]. Finally, the output response M_GS(Q) of the group-level spatial attention module is generated as shown in the following formula:

M_GS(Q) = Concat(Expand(M_S(Q_1)), Expand(M_S(Q_2)), ..., Expand(M_S(Q_G)))   (10)

where the Expand(·) operation repeats a feature C/G times along the channel dimension and Concat(·) concatenates the groups along the channel dimension.
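A sketch of the group-level spatial attention of formula (10): the input is split into G = 16 groups along the channel dimension, a CBAM-style spatial attention map is computed for each group, repeated C/G times along the channel dimension (the Expand operation), and the groups are concatenated. The per-group spatial attention below follows the CBAM recipe (channel-wise average and max pooling followed by a 7×7 convolution) and shares its parameters across groups; these details are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                      # x: B x c x H x W
        avg = x.mean(dim=1, keepdim=True)      # average over channels
        mx, _ = x.max(dim=1, keepdim=True)     # max over channels
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # B x 1 x H x W

class GroupSpatialAttention(nn.Module):
    def __init__(self, groups=16):
        super().__init__()
        self.groups = groups
        self.sa = SpatialAttention()           # shared across groups (assumption)

    def forward(self, Q):                      # Q: B x C x H x W, C divisible by G
        per_group = Q.shape[1] // self.groups
        outputs = []
        for q_l in torch.split(Q, per_group, dim=1):        # Q_1 ... Q_G
            m_l = self.sa(q_l)                               # local spatial response
            outputs.append(m_l.expand(-1, per_group, -1, -1))  # Expand: repeat C/G times
        return torch.cat(outputs, dim=1)                     # M_GS(Q): B x C x H x W
```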
In the technical scheme, the method in step ten for fusing the spatial class probability distribution, the temporal class probability distribution and the feature fusion class probability distribution is as follows:
The invention uses a weighted average fusion method, i.e. δ = δ_s*w_s + δ_t*w_t + δ_f*w_f, where w_s, w_t and w_f denote the weights of the spatial stream, the temporal stream and the feature fusion stream, respectively. The default fusion weights of the three streams are 0.4, 2.7 and 2.4, respectively, and they can be adjusted according to the requirements of the actual application.
To verify the accuracy and robustness of the invention, experiments were conducted on the public UCF101 and HMDB51 data sets.
UCF101 is a typical and challenging human action recognition data set that contains 13320 videos collected from the YouTube video website at a resolution of 320×240. It contains 101 action categories in total, with the videos of each category divided into 25 groups. The UCF101 data set has great diversity in action capture, including camera motion, appearance changes, pose changes, object scale changes, background changes, illumination changes, and so on. The 101 actions can be broadly divided into five types: human-object interaction, body motion only, human-human interaction, playing musical instruments, and sports.
The HMDB51 data set contains 6849 video samples at 320×240 resolution, divided into 51 categories, where each category contains at least 101 samples. Most of the videos come from movies, and some come from public data sets or online video libraries such as YouTube. The action categories can be divided into five types: general facial actions, facial actions with object manipulation, general body movements, body movements with object interaction, and body movements for human interaction. Background clutter and changes in lighting conditions make it very challenging to recognize the target action represented by a video.
Table 1 lists the parameter settings used for the two data sets in the experiments:
Table 1. Database experimental parameter settings (the table is provided as an image in the original publication)
Table 2 shows the results of the proposed MDFFEN method on the UCF101 and HMDB51 data sets; the invention achieves a high recognition rate on both. Although the two data sets present difficulties such as occlusion, deformation, background clutter and low resolution, the proposed method is robust to these difficulties and therefore performs well.
Table 2. Recognition rates on UCF101 and HMDB51

Data set   UCF101   HMDB51
MDFFEN     95.3%    71.6%
The invention mainly proposes two mechanisms, namely multi-level spatio-temporal feature fusion and grouping enhanced attention. As shown in Table 3, on the UCF101 data set the accuracy of the two-stream network alone reaches 93.61%. Adding multi-level spatio-temporal feature fusion to the base network raises the accuracy to 94.63%, and adding grouping enhanced attention on top of that further raises it to 95.31%. The experimental results show that the multi-level spatio-temporal feature fusion effectively extracts mixed features at multiple depth levels, the grouping enhanced attention further refines the discriminative features within the mixed features, and both mechanisms benefit behavior recognition performance and effectively improve recognition accuracy.
Table 3. Effect of the two mechanisms on the UCF101 data set

Configuration                                            Accuracy
Two-stream network (baseline)                            93.61%
+ multi-level spatio-temporal feature fusion             94.63%
+ multi-level fusion and grouping enhanced attention     95.31%
While the invention has been described in detail with reference to the drawings, the invention is not limited to the above embodiment, and various changes may be made within the knowledge of those skilled in the art without departing from the spirit of the invention.

Claims (7)

1. A double-flow network behavior recognition method based on multi-level space-time feature fusion enhancement is characterized by comprising the following steps:
step one, obtaining RGB frames: performing frame taking processing on each video in the data set to obtain RGB original frames
f_rgb = {f_1, f_2, ..., f_N}, where N is the number of frames;
step two, computing the optical flow maps: the TV-L1 algorithm is applied to consecutive pairs of the RGB original frames f_rgb to obtain the optical flow maps f_opt;
step three, segmenting all the extracted RGB frames and optical flow maps: all RGB frames and optical flow maps acquired in step one and step two are divided into three segments, s_rgb = {s_rgb^1, s_rgb^2, s_rgb^3} and s_opt = {s_opt^1, s_opt^2, s_opt^3}; each segment is continuous in time, and any two segments do not overlap;
step four, randomly sampling RGB frames from each segment of s_rgb to construct the input of the spatial network: RGB = {RGB^1, RGB^2, RGB^3}, where RGB^i is sampled from the i-th segment s_rgb^i;
step five, randomly sampling several optical flow maps from each segment of s_opt to construct the input of the temporal network: OPT = {OPT^1, OPT^2, OPT^3}, where OPT^i is sampled from the i-th segment s_opt^i;
step six, calculating the spatial class probability distribution O_S based on the spatial network N_s: the spatial network inputs RGB^i constructed in step four are fed into the spatial network N_s for feature extraction; N_s is built on the InceptionV3 network, and the spatial class probability distribution O_S = {O_S^1, O_S^2, O_S^3} is obtained through a global average pooling operation and a fully connected operation, where O_S^i is the spatial class probability distribution corresponding to the i-th RGB frame segment RGB^i of step three;
step seven, calculating the temporal class probability distribution O_T based on the temporal network N_t: the temporal network inputs OPT^i constructed in step five are fed into the temporal network N_t for feature extraction; N_t is built on the InceptionV3 network, and the temporal class probability distribution O_T = {O_T^1, O_T^2, O_T^3} is obtained through a global average pooling operation and a fully connected operation, where O_T^i is the temporal class probability distribution corresponding to the i-th optical flow segment OPT^i of step three;
step eight, calculating the feature fusion class probability distribution O_F based on the two-stream fusion network N_TSFF: the multi-level spatio-temporal feature fusion module embeds spatio-temporal feature fusion modules STFF into several InceptionV3 sub-modules of the spatial network N_s and the temporal network N_t to fuse and extract mixed features at multiple depth levels; the extracted features are then refined by the grouping enhanced attention module, and the feature fusion class probability distribution O_F = {O_F^1, O_F^2, O_F^3} is finally obtained through a global average pooling operation and a fully connected operation, where O_F^i is the feature fusion class probability distribution corresponding to the i-th RGB frame segment RGB^i and the i-th optical flow segment OPT^i of step three;
step nine, calculating the multi-segment fused class probability distributions: from the per-segment class probability distributions O_S^i, O_T^i and O_F^i obtained in step six, step seven and step eight, the multi-segment fused class probability distributions δ_s, δ_t and δ_f are obtained by averaging over the three segments;
step ten, calculating the class probability distribution δ of the three-stream weighted fusion: on the basis of the two-stream network, the multi-segment fused spatial class probability distribution δ_s, the multi-segment fused temporal class probability distribution δ_t and the multi-segment fused feature fusion class probability distribution δ_f obtained in step nine are fused, and the class probability distribution δ is calculated using a weighted average fusion method;
step eleven, calculating the final classification result P: P = argmax(δ), where argmax(δ) returns the index of the maximum value in the vector δ, i.e. the behavior category with the highest class probability among all behavior categories.
2. The dual-flow network behavior recognition method based on multi-level space-time feature fusion enhancement according to claim 1, wherein the model for completing the dual-flow network behavior recognition method comprises a spatial network, a time network, a feature fusion network, multi-section class probability distribution fusion and multi-flow class probability distribution fusion; the space network and the time network are both constructed based on the InceptionV3, and the feature fusion network is constructed through the space network and the time network; using a multi-level space-time feature fusion module to fuse space-time mixed features with different depth levels, wherein the space-time mixed features are features extracted from a space network and a time network respectively by using the space-time feature fusion module, then extracting the mixed features with the multiple depth levels by using a grouping enhanced attention module, and obtaining feature fusion category probability distribution by using global average pooling and full connection operation like the space network and the time network; and then fusing the three segmentation input extracted corresponding class probability distributions of each flow to obtain multi-segment fused class probability distributions of the corresponding flows, and finally fusing the multi-segment fused class probability distributions corresponding to the three flows by adopting a weighted average method.
3. The dual-flow network behavior recognition method based on multi-level space-time feature fusion enhancement according to claim 1, wherein the whole process of the step eight is formally written as the following formula:
O_F^i = FC(GAP(M_GSCE(M_MDFF(RGB^i, OPT^i))))   (1)
where M_MDFF(·,·) denotes the multi-level spatio-temporal feature fusion module, M_GSCE(·) denotes the output feature of the grouping enhanced attention module, FC denotes a fully connected operation, and GAP denotes a global average pooling operation.
4. The double-flow network behavior recognition method based on multi-level space-time feature fusion enhancement according to claim 3, wherein the multi-level spatio-temporal feature fusion method applied in step eight is as follows: InceptionV3 consists of j sub-modules connected in series, namely Inc.1 to Inc.j, from which features at different depth levels can be extracted; a spatio-temporal feature fusion module STFF is embedded into each sub-module of the spatial network and the temporal network to capture new features with different depth levels; all the mixed spatio-temporal features generated by the sub-modules at multiple depths of the network are cascaded to obtain abstract convolutional mixed spatio-temporal features with multiple depth levels; the scheme of the multi-level spatio-temporal feature fusion module M_MDFF(·,·) is as follows:

M_MDFF(RGB^i, OPT^i) = Conv(Concat(M_STFF(S_Inc.l1^i, T_Inc.l1^i), ..., M_STFF(S_Inc.l2^i, T_Inc.l2^i)))   (2)

where M_STFF(·,·) denotes the spatio-temporal feature fusion module; S_Inc.j^i and T_Inc.j^i denote the features extracted from the Inc.j sub-module when RGB^i and OPT^i are fed into the spatial network and the temporal network, respectively; Concat(·) denotes the cascade of the mixed features generated from Inc.l1 to Inc.l2; and Conv(·) denotes a convolution operation.
5. The double-flow network behavior recognition method based on multi-level space-time feature fusion enhancement according to claim 4, wherein the output feature of the spatio-temporal feature fusion module is formed by fusing three types of features, namely the preliminary mixed spatio-temporal feature, the spatial feature and the temporal feature; the specific process of the spatio-temporal feature fusion module is as follows: first, the spatial feature extracted from a sub-module of the spatial network and the temporal feature extracted from the corresponding sub-module of the temporal network are fused through element-wise summation and convolution operations to obtain the preliminary mixed abstract feature; ignoring the superscript i and the subscript Inc.j of equation (2), S_Inc.j^i and T_Inc.j^i are written as S ∈ R^(C×H×W) and T ∈ R^(C×H×W) for ease of expression, where C, H and W denote the number of channels, the height and the width of the feature map, respectively; the preliminary mixed abstract feature F is then formally expressed as:

F = Φ(S, T) = Ψ_k,n(S ⊕ T)   (3)

where Ψ_k,n denotes a sequence of ReLU(BN(Conv(·))) operations with convolution kernel size k and filter number n, in which ReLU and BN denote the ReLU activation function and the batch normalization operation respectively, Conv(·) denotes a convolution operation, and ⊕ denotes an element-wise summation operation;
in order to suppress invalid information and extract valid information, a feature extractor M_FE(·) is employed; M_FE(·) consists of two Ψ_3×3,n operations with different filter numbers n, where the first filter number is half of the input channel number C and the second equals the input channel number; through the feature extractor M_FE(·), the spatial feature S, the temporal feature T and the preliminary spatio-temporal mixed feature F are each independently refined into nonlinear abstract features; the detailed procedure of the feature extractor M_FE(·) is expressed as:

M_FE(Z) = Z_FE2 = Ψ_3×3,C(Z_FE1)   (4)
Z_FE1 = Ψ_3×3,C/2(Z)   (5)

where Z ∈ {S, T, F} denotes the input feature of M_FE(·), and S, T and F denote the spatial feature, the temporal feature and the preliminary spatio-temporal mixed feature, respectively;
then the spatial feature S_FE2 and the temporal feature T_FE2 refined by the feature extractor M_FE(·) are each fused with the refined mixed feature F_FE2 to obtain the deeper fusion features F_S and F_T, as follows:

F_S = Φ(S_FE2, F_FE2)   (6)
F_T = Φ(T_FE2, F_FE2)   (7)

where Φ(·,·) is the same as in formula (3);
finally, F_S and F_T are fused by Φ(·,·) to obtain the final mixed spatio-temporal feature of the spatio-temporal feature fusion module STFF:

M_STFF(S, T) = Φ(F_S, F_T)   (8).
6. the method for identifying double-flow network behavior based on multi-level space-time feature fusion enhancement according to claim 1, wherein the group enhanced attention module in the eighth step comprises a group-level space attention module and a channel attention module, and the connection of the two attention modules is parallel; the group-level spatial attention module is used for mining each local area of interest, and the channel attention module is used for capturing global response in the channel dimension; connecting the two attention modules, and enhancing the spatial significance and the channel significance by multiplying the two attention modules with the original input characteristic diagram element by element; finally, residual connection is utilized to reduce the possibility of gradient extinction; the global average pooling operation GAP and the global maximum pooling operation GMP operate on a space dimension in the space attention module and a time dimension in the channel attention module respectively; the method comprises the following steps:
introducing a grouping strategy into the spatial attention SA module so as to generate a group-level spatial attention GSA module which is used for capturing local information so as to supplement global information extracted by the channel attention CA module; the SA and CA modules formally define the input feature maps as
Figure FDA0004085207360000051
Acquisition of spatial awareness by GSA Module and CA Module>
Figure FDA0004085207360000052
And channel response->
Figure FDA0004085207360000053
By passing through
Figure FDA0004085207360000054
Operation assignment of fused weights->
Figure FDA0004085207360000055
To refine the original input feature Q; attention deficit is introduced by->
Figure FDA0004085207360000056
The operation directly establishes a connection between the Q and the final refined feature; finally, the saliency enhancement feature of the group enhanced attention module output +.>
Figure FDA0004085207360000057
The generation process of (2) is shown in the following formula (9);
Figure FDA0004085207360000058
wherein ,
Figure FDA0004085207360000059
representing element-by-element multiplication, where M C(Q) and MGS (Q) between->
Figure FDA00040852073600000510
The operations include broadcast operations that automatically multiply M by element C The size C of (Q) 1*1 is converted to M GS The sizes of (Q) c×h×w are identical.
7. The double-flow network behavior recognition method based on multi-level space-time feature fusion enhancement according to claim 6, wherein the group-level spatial attention GSA module is constructed as follows: the group-level spatial attention GSA module generates a local spatial response in each individual group divided from the original feature map; the input feature map Q is divided by a grouping strategy into groups Q = {Q_1, Q_2, ..., Q_G}, where Q_l ∈ R^((C/G)×H×W) is the feature map group with group index l and G is the total number of groups; this effectively captures information from the sub-features through targeted learning and noise suppression; the local spatial response M_S(Q_l) of group l is then obtained with the SA module; finally, the output response M_GS(Q) of the group-level spatial attention module is generated as shown in the following formula:

M_GS(Q) = Concat(Expand(M_S(Q_1)), Expand(M_S(Q_2)), ..., Expand(M_S(Q_G)))   (10)

where the Expand(·) operation repeats a feature C/G times along the channel dimension and Concat(·) concatenates the groups along the channel dimension.
CN202010441559.3A 2020-05-22 2020-05-22 Double-flow network behavior identification method based on multilevel space-time feature fusion enhancement Active CN111709306B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010441559.3A CN111709306B (en) 2020-05-22 2020-05-22 Double-flow network behavior identification method based on multilevel space-time feature fusion enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010441559.3A CN111709306B (en) 2020-05-22 2020-05-22 Double-flow network behavior identification method based on multilevel space-time feature fusion enhancement

Publications (2)

Publication Number Publication Date
CN111709306A CN111709306A (en) 2020-09-25
CN111709306B true CN111709306B (en) 2023-06-09

Family

ID=72537459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010441559.3A Active CN111709306B (en) 2020-05-22 2020-05-22 Double-flow network behavior identification method based on multilevel space-time feature fusion enhancement

Country Status (1)

Country Link
CN (1) CN111709306B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112489092B (en) * 2020-12-09 2023-10-31 浙江中控技术股份有限公司 Fine-grained industrial motion modality classification method, storage medium, device and apparatus
CN112712124B (en) * 2020-12-31 2021-12-10 山东奥邦交通设施工程有限公司 Multi-module cooperative object recognition system and method based on deep learning
CN112381072B (en) * 2021-01-11 2021-05-25 西南交通大学 Human body abnormal behavior detection method based on time-space information and human-object interaction
CN113066022B (en) * 2021-03-17 2022-08-16 天津大学 Video bit enhancement method based on efficient space-time information fusion
CN113111822B (en) * 2021-04-22 2024-02-09 深圳集智数字科技有限公司 Video processing method and device for congestion identification and electronic equipment
CN113393521B (en) * 2021-05-19 2023-05-05 中国科学院声学研究所南海研究站 High-precision flame positioning method and system based on dual semantic attention mechanism
CN114677704B (en) * 2022-02-23 2024-03-26 西北大学 Behavior recognition method based on three-dimensional convolution and space-time feature multi-level fusion
CN115348215B (en) * 2022-07-25 2023-11-24 南京信息工程大学 Encryption network traffic classification method based on space-time attention mechanism

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188239B (en) * 2018-12-26 2021-06-22 北京大学 Double-current video classification method and device based on cross-mode attention mechanism
CN109993077A (en) * 2019-03-18 2019-07-09 南京信息工程大学 A kind of Activity recognition method based on binary-flow network
CN110119703B (en) * 2019-05-07 2022-10-04 福州大学 Human body action recognition method fusing attention mechanism and spatio-temporal graph convolutional neural network in security scene
CN110569773B (en) * 2019-08-30 2020-12-15 江南大学 Double-flow network behavior identification method based on space-time significance behavior attention

Also Published As

Publication number Publication date
CN111709306A (en) 2020-09-25

Similar Documents

Publication Publication Date Title
CN111709306B (en) Double-flow network behavior identification method based on multilevel space-time feature fusion enhancement
Zhuge et al. Salient object detection via integrity learning
Liu et al. SwinNet: Swin transformer drives edge-aware RGB-D and RGB-T salient object detection
Shao et al. Temporal interlacing network
Qian et al. Thinking in frequency: Face forgery detection by mining frequency-aware clues
Nguyen et al. A neural network based on SPD manifold learning for skeleton-based hand gesture recognition
Li et al. Selective kernel networks
Gao et al. MSCFNet: A lightweight network with multi-scale context fusion for real-time semantic segmentation
Kim et al. Fully deep blind image quality predictor
Wang et al. NAS-guided lightweight multiscale attention fusion network for hyperspectral image classification
Li et al. Micro-expression action unit detection with spatial and channel attention
Fang et al. Deep3DSaliency: Deep stereoscopic video saliency detection model by 3D convolutional networks
CN111242181B (en) RGB-D saliency object detector based on image semantics and detail
CN113343950B (en) Video behavior identification method based on multi-feature fusion
Pan et al. No-reference image quality assessment via multibranch convolutional neural networks
Jia et al. Stacked denoising tensor auto-encoder for action recognition with spatiotemporal corruptions
Li et al. ConvTransNet: A CNN–transformer network for change detection with multiscale global–local representations
Liu et al. APSNet: Toward adaptive point sampling for efficient 3D action recognition
Zhao et al. Alignment-guided temporal attention for video action recognition
Yin et al. Dynamic difference learning with spatio-temporal correlation for deepfake video detection
Ning et al. Enhancement, integration, expansion: Activating representation of detailed features for occluded person re-identification
Shi et al. A pooling-based feature pyramid network for salient object detection
Huang et al. Region-based non-local operation for video classification
ABAWATEW et al. Attention augmented residual network for tomato disease detection and classification
Cheng et al. Bidirectional collaborative mentoring network for marine organism detection and beyond

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant