CN111325145B - Behavior recognition method based on combined time domain channel correlation block - Google Patents

Behavior recognition method based on combined time domain channel correlation block

Info

Publication number
CN111325145B
CN111325145B (application CN202010102863.5A)
Authority
CN
China
Prior art keywords
channel
time domain
attention module
domain channel
method based
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010102863.5A
Other languages
Chinese (zh)
Other versions
CN111325145A (en)
Inventor
胡建国
蔡佳辉
王金鹏
陈嘉敏
林佳玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Development Research Institute Of Guangzhou Smart City
Sun Yat Sen University
Original Assignee
Development Research Institute Of Guangzhou Smart City
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Development Research Institute Of Guangzhou Smart City and Sun Yat Sen University
Priority to CN202010102863.5A
Publication of CN111325145A
Application granted
Publication of CN111325145B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of computer vision and discloses a behavior recognition method based on a combined time domain channel correlation block. The method compresses an input initial feature map through a spatial global average pooling operation to obtain a time domain channel description operator; inputs the time domain channel description operator into an attention module to obtain the global nonlinear dependence between time domain and channels; and, taking the tensor output by the attention module as the per-channel importance weight after feature selection, multiplies the input initial feature map channel by channel with that tensor through a residual connection to obtain a channel-weighted feature map. The invention effectively captures the correlation information between time domain and channels through a network layer to obtain a channel-by-channel description operator, weights it onto the preceding features through multiplication, completes the re-weighting of the original features in the channel dimension, and concentrates more of the network's computing resources on the feature channels that are important to the output result.

Description

Behavior recognition method based on combined time domain channel correlation block
Technical Field
The invention relates to the field of computer vision, in particular to a behavior recognition method based on a combined time domain channel correlation block.
Background
Video accounts for roughly 70% of internet traffic, and that share continues to rise. Most cell phone cameras now capture not only images but also high-resolution video. Many real-world data sources are video-based, ranging from warehouse inventory systems to self-driving cars and drones. Video can be said to be the next frontier of computer vision, because it captures a large amount of information that still images cannot convey. Video behavior recognition has therefore long been a hot research problem in computer vision and related fields.
Human motion in a video sequence is a three-dimensional (3D) spatio-temporal signal containing both spatial and temporal features. The spatial features mainly describe the appearance of the objects involved in the motion and the configuration of the scene within each frame of the video. Spatial feature learning is similar to still-image recognition and therefore readily benefits from recent advances in deep Convolutional Neural Networks (CNNs). Temporal features capture the motion cues embedded in frames evolving over time and contain valuable motion information that must be incorporated into video recognition tasks. Video behavior recognition thus needs to solve two main problems: how to learn the temporal features, and how to properly fuse the spatial and temporal features.
Researchers initially modeled temporal motion information and spatial information explicitly in parallel, feeding the original frames and the optical flow between adjacent frames as two input streams to a deep neural network. On the other hand, as a generalization of the two-dimensional convolution (2D Conv) used for still-image recognition, three-dimensional convolution (3D Conv) has been proposed to process 3D volumetric video data. In a three-dimensional convolutional network, spatial and temporal features are tightly entangled and learned jointly: rather than learning spatial and temporal features separately and fusing them at the top of the network, joint spatio-temporal features are learned by three-dimensional convolutions distributed across the network. Given the excellent representation-learning capability of CNNs, an ideal three-dimensional convolution should succeed in video understanding just as two-dimensional convolution has in image recognition. However, the large number of model parameters and low computational efficiency limit the effectiveness and practicality of three-dimensional convolution.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a behavior recognition method based on a combined time domain channel correlation block.
A behavior recognition method based on a combined time domain channel correlation block, comprising the steps of:
S1, compressing an input initial three-dimensional spatio-temporal signal feature map through a spatial global average pooling operation to obtain a time domain channel description operator;
S2, inputting the time domain channel description operator into an attention module to obtain the time domain channel global nonlinear dependence;
S3, taking the tensor output by the attention module as the weight of the importance of each channel after feature selection, and multiplying the initial three-dimensional spatio-temporal signal feature map input in step S1 channel by channel with the tensor output by the attention module in step S2 through a residual connection to obtain a channel-weighted feature map.
Preferably, in the above behavior recognition method based on the combined time domain channel correlation block, the three-dimensional spatio-temporal signal of the initial feature map input in step S1 is expressed as:
$X \in \mathbb{R}^{T \times H \times W \times C}$
where T, H, W and C denote the time domain length, the spatial height and width, and the number of channels of the input signal, respectively.
Preferably, in the above behavior recognition method based on the combined time domain channel correlation block, the time domain channel description operator obtained in step S1 is expressed as:
$z_{t,c} = F_{sq}(x_{t,c}) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_{t,c}(i,j)$
wherein $z \in \mathbb{R}^{T \times C}$.
Preferably, in the above behavior recognition method based on the combined time domain channel correlation block, the attention module consists of two fully connected layers, wherein the first fully connected layer reduces the feature dimension to $C/r$, while the second fully connected layer increases the feature dimension back to $C$.
Preferably, in the above behavior recognition method based on the combined time domain channel correlation block, in step S3, the process by which the attention module fuses time domain-channel information and extracts channel-by-channel information is expressed as:
$Z = \sigma(\mathrm{MLP}(z)) = \sigma(W_1(\delta(W_0 z)))$
wherein $W_0 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_1 \in \mathbb{R}^{C \times \frac{C}{r}}$; δ and σ denote the ReLU and Sigmoid activation functions, respectively, and r is a hyper-parameter used to reduce the number of parameters of the attention module.
Preferably, in the above behavior recognition method based on the combined time domain channel correlation block, in step S3, the tensor output by the attention module is assigned as $Z \in \mathbb{R}^{T \times C}$.
Preferably, in the above behavior recognition method based on the combined time domain channel correlation block, in step S3, the channel-weighted feature map obtained by multiplying, through a residual connection, the initial three-dimensional spatio-temporal signal feature map input in step S1 channel by channel with the tensor output by the attention module in step S2 is denoted as $X_c$, where $X_c = F_{scale}(X, Z) = X \cdot Z$; here $X = [x_1, x_2, \ldots, x_C]$, and $F_{scale}(X, Z)$ denotes the channel-by-channel multiplication of the feature map $X \in \mathbb{R}^{T \times H \times W \times C}$ and $Z \in \mathbb{R}^{T \times C}$.
Preferably, in the above behavior recognition method based on the combined time domain channel correlation block, the hyper-parameter r takes a value in {2, 4, 8, 16, 32, …}.
Preferably, in the above behavior recognition method based on the combined time domain channel correlation block, the hyper-parameter r takes the value 16.
The invention has the following beneficial effects: the network layer effectively captures the correlation information between time domain and channels, so time domain-channel correlation feature learning can be performed effectively on any network; a channel-by-channel description operator is obtained and weighted onto the preceding features through multiplication, completing the re-weighting of the original features in the channel dimension. By concentrating more of the network's computing resources on the feature channels that are important to the output result, the computing resources of the network are optimized and the behavior recognition accuracy is improved.
Detailed Description
The following description of the embodiments of the present invention will clearly and fully describe the technical solutions of the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention provides a behavior recognition method based on a combined time domain channel correlation block, which comprises the following steps:
S1, compressing the input initial three-dimensional spatio-temporal signal feature map through a spatial global average pooling operation to obtain a time domain channel description operator. The three-dimensional spatio-temporal signal of the input initial feature map is expressed as:
$X \in \mathbb{R}^{T \times H \times W \times C}$
where T, H, W and C denote the time domain length, the spatial height and width, and the number of channels of the input signal, respectively. The resulting time domain channel description operator is expressed as:
$z_{t,c} = F_{sq}(x_{t,c}) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_{t,c}(i,j)$
wherein $z \in \mathbb{R}^{T \times C}$.
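As an illustration only, the spatial squeeze of step S1 can be written in a few lines of PyTorch. This is a minimal sketch under stated assumptions: the (N, C, T, H, W) tensor layout and the name spatial_squeeze are choices made here, not taken from the patent.

```python
import torch

def spatial_squeeze(x: torch.Tensor) -> torch.Tensor:
    """Spatial global average pooling: compress H and W, keep time and channel.

    x: feature map of shape (N, C, T, H, W), the usual PyTorch 3D-conv layout.
    Returns the time domain channel descriptor of shape (N, C, T).
    """
    return x.mean(dim=(3, 4))  # average over the spatial dimensions H, W
```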
S2, inputting the time domain channel description operator into the attention module to obtain the time domain channel global nonlinear dependence. To achieve this goal, the attention module must meet two conditions: first, it must be flexible; in particular, it must be able to learn the nonlinear interactions between time domain channels. Second, it must learn a non-mutually-exclusive relationship, since the aim is to ensure that multiple channels are allowed to be emphasized, rather than enforcing a one-hot activation.
Specifically, the attention module consists of two fully connected layers: the first fully connected layer reduces the feature dimension to $C/r$, while the second increases it back to $C$. A global receptive field over the spatial dimensions is obtained by the spatial global average pooling.
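Such a two-layer bottleneck might be sketched as follows (a sketch, not the patent's implementation: the class name and the application of the MLP independently at each time step are assumptions; only the C → C/r → C structure with ReLU and Sigmoid comes from the description above).

```python
import torch
import torch.nn as nn

class TimeChannelAttention(nn.Module):
    """Two fully connected layers: C -> C/r (ReLU), then C/r -> C (Sigmoid)."""

    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),  # first FC: reduce to C/r
            nn.ReLU(inplace=True),               # delta
            nn.Linear(channels // r, channels),  # second FC: restore to C
            nn.Sigmoid(),                        # sigma: non-exclusive weights
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: time domain channel descriptor of shape (N, T, C);
        # the MLP acts on the last (channel) axis at every time step.
        return self.fc(z)
```

Note that the Sigmoid, rather than a softmax, is what makes the learned relationship non-exclusive: several channels can receive weights close to 1 simultaneously.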
S3, taking the tensor output by the attention module as the weight of the importance of each channel after feature selection, and multiplying the initial three-dimensional spatio-temporal signal feature map input in step S1 channel by channel with the tensor output by the attention module in step S2 through a residual connection to obtain a channel-weighted feature map.
Specifically, in a preferred embodiment of the present invention, the process by which the attention module in step S3 fuses time domain-channel information and extracts channel-by-channel information is expressed as:
$Z = \sigma(\mathrm{MLP}(z)) = \sigma(W_1(\delta(W_0 z)))$
wherein $W_0 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_1 \in \mathbb{R}^{C \times \frac{C}{r}}$; δ and σ denote the ReLU and Sigmoid activation functions, respectively, and r is a hyper-parameter used to reduce the number of parameters of the attention module.
The tensor output by the attention module is assigned as $Z \in \mathbb{R}^{T \times C}$. The channel-weighted feature map obtained by multiplying, through a residual connection, the initial three-dimensional spatio-temporal signal feature map input in step S1 channel by channel with the tensor output by the attention module in step S2 is denoted as $X_c$, where $X_c = F_{scale}(X, Z) = X \cdot Z$; here $X = [x_1, x_2, \ldots, x_C]$, and $F_{scale}(X, Z)$ denotes the channel-by-channel multiplication of the feature map $X \in \mathbb{R}^{T \times H \times W \times C}$ and $Z \in \mathbb{R}^{T \times C}$. The hyper-parameter r takes a value in {2, 4, 8, 16, 32, …}, i.e. $2^n$ with n a positive integer. In practice, experiments show that r = 16 gives the best results.
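Putting steps S1 to S3 together, one possible sketch of the complete combined time domain channel correlation (CTC) block is given below. All identifiers and the (N, C, T, H, W) layout are assumptions made for illustration; the squeeze, the C → C/r → C bottleneck, and the channel-by-channel multiplication follow the equations above.

```python
import torch
import torch.nn as nn

class CTCBlock(nn.Module):
    """Sketch of a combined time domain channel correlation block.

    S1: spatial global average pooling  -> descriptor z, shape (N, T, C)
    S2: two-FC attention module         -> weights Z, shape (N, T, C)
    S3: channel-by-channel re-weighting -> X_c = X * Z (broadcast over H, W)
    """

    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, h, w = x.shape
        z = x.mean(dim=(3, 4)).transpose(1, 2)            # S1: (N, T, C)
        weights = self.attention(z)                       # S2: (N, T, C)
        weights = weights.transpose(1, 2).reshape(n, c, t, 1, 1)
        return x * weights                                # S3: broadcast multiply
```

A quick shape check: `CTCBlock(64)(torch.randn(2, 64, 8, 56, 56)).shape` should return `torch.Size([2, 64, 8, 56, 56])`, i.e. the block re-weights the input without changing its dimensions.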
Specifically, the overall network architecture is shown in the following table:
[Table: layer-by-layer configurations of the baseline 3D-ResNet101 and the proposed 3D CTC-ResNet101; the original table image is not reproduced here.]
In the above table, 3D-ResNet101 denotes the basic 101-layer residual network, while 3D CTC-ResNet101 denotes the architecture after adding the combined time domain channel correlation (CTC) blocks: a CTC module is added to each block of the residual network to construct the 3D CTC-ResNet101 network. Both architectures in the table employ three-dimensional convolution kernels and three-dimensional pooling, and each convolution layer shown in the table corresponds to a composite BN-ReLU-Conv sequence of operations. Tests on the behavior recognition datasets UCF-101 and HMDB-51 show that, compared with the baseline network, the CTC module improves the recognition rate to a certain extent.
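To mirror the 3D CTC-ResNet101 construction, a CTC block could be appended to each residual block roughly as sketched below. This is a hedged illustration only: the wrapper class and the placement of the CTC block after the wrapped block are assumptions; the patent specifies only that a CTC module is added to each block of the residual network.

```python
import torch.nn as nn

class CTCResidualBlock(nn.Module):
    """Wraps an existing 3D residual block and re-weights its output channels."""

    def __init__(self, block: nn.Module, out_channels: int, r: int = 16):
        super().__init__()
        self.block = block                    # e.g. one bottleneck of 3D-ResNet101
        self.ctc = CTCBlock(out_channels, r)  # CTCBlock as sketched above

    def forward(self, x):
        return self.ctc(self.block(x))        # features, then channel re-weighting
```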
In summary, the invention effectively captures the correlation information between time domain and channels through a network layer, can effectively perform time domain-channel correlation feature learning on any network, obtains a channel-by-channel description operator, weights it channel by channel onto the preceding features through multiplication, and completes the re-weighting of the original features in the channel dimension. By concentrating more of the network's computing resources on the feature channels that are important to the output result, the computing resources of the network are optimized and the behavior recognition accuracy is improved.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the foregoing embodiments, but rather, the foregoing embodiments and description illustrate the principles of the invention, and that various changes and modifications may be effected therein without departing from the spirit and scope of the invention as defined by the appended claims and their equivalents.

Claims (6)

1. The behavior recognition method based on the combined time domain channel correlation block is characterized by comprising the following steps of:
S1, compressing an input initial three-dimensional spatio-temporal signal feature map through a spatial global average pooling operation to obtain a time domain channel description operator;
S2, inputting the time domain channel description operator into an attention module to obtain the time domain channel global nonlinear dependence;
S3, taking the tensor output by the attention module as the weight of the importance of each channel after feature selection, and multiplying the initial three-dimensional spatio-temporal signal feature map input in step S1 channel by channel with the tensor output by the attention module in step S2 through a residual connection to obtain a channel-weighted feature map;
in step S3, the process by which the attention module fuses time domain-channel information and extracts channel-by-channel information is expressed as:
$Z = \sigma(\mathrm{MLP}(z)) = \sigma(W_1(\delta(W_0 z)))$
wherein $W_0 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_1 \in \mathbb{R}^{C \times \frac{C}{r}}$; δ and σ denote the ReLU and Sigmoid activation functions, respectively, and r is a hyper-parameter used to reduce the number of parameters of the attention module;
the tensor output by the attention module is assigned as $Z \in \mathbb{R}^{T \times C}$;
the channel-weighted feature map obtained by multiplying, through a residual connection, the initial three-dimensional spatio-temporal signal feature map input in step S1 channel by channel with the tensor output by the attention module in step S2 is denoted as $X_c$, where $X_c = F_{scale}(X, Z) = X \cdot Z$; here $X = [x_1, x_2, \ldots, x_C]$, and $F_{scale}(X, Z)$ denotes the channel-by-channel multiplication of the feature map $X \in \mathbb{R}^{T \times H \times W \times C}$ and $Z \in \mathbb{R}^{T \times C}$.
2. The method of claim 1, wherein the three-dimensional spatio-temporal signal of the initial feature map input in said step S1 is represented as:
$X \in \mathbb{R}^{T \times H \times W \times C}$
where T, H, W and C denote the time domain length, the spatial height and width, and the number of channels of the input signal, respectively.
3. The behavior recognition method based on the combined time domain channel correlation block according to claim 1 or 2, wherein the expression of the time domain channel description operator obtained in the step S1 is:
$z_{t,c} = F_{sq}(x_{t,c}) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_{t,c}(i,j)$
wherein $z \in \mathbb{R}^{T \times C}$.
4. The behavior recognition method based on the combined time domain channel correlation block according to claim 1, characterized in that the attention module consists of two fully connected layers, wherein the first fully connected layer reduces the feature dimension to $C/r$, while the second fully connected layer increases the feature dimension back to $C$.
5. The behavior recognition method based on the combined time domain channel correlation block according to claim 1, wherein the hyper-parameter r takes a value in {2, 4, 8, 16, 32, …}.
6. The behavior recognition method based on the combined time domain channel correlation block according to claim 5, wherein the hyper-parameter r takes the value 16.
CN202010102863.5A 2020-02-19 2020-02-19 Behavior recognition method based on combined time domain channel correlation block Active CN111325145B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010102863.5A CN111325145B (en) 2020-02-19 2020-02-19 Behavior recognition method based on combined time domain channel correlation block

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010102863.5A CN111325145B (en) 2020-02-19 2020-02-19 Behavior recognition method based on combined time domain channel correlation block

Publications (2)

Publication Number Publication Date
CN111325145A CN111325145A (en) 2020-06-23
CN111325145B true CN111325145B (en) 2023-04-25

Family

ID=71172703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010102863.5A Active CN111325145B (en) 2020-02-19 2020-02-19 Behavior recognition method based on combined time domain channel correlation block

Country Status (1)

Country Link
CN (1) CN111325145B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989940B (en) * 2021-11-17 2024-03-29 中国科学技术大学 Method, system, device and storage medium for identifying actions in video data


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109726659A (en) * 2018-12-21 2019-05-07 北京达佳互联信息技术有限公司 Detection method, device, electronic equipment and the readable medium of skeleton key point
CN109871777A (en) * 2019-01-23 2019-06-11 广州智慧城市发展研究院 A kind of Activity recognition system based on attention mechanism
CN110084180A (en) * 2019-04-24 2019-08-02 北京达佳互联信息技术有限公司 Critical point detection method, apparatus, electronic equipment and readable storage medium storing program for executing
CN110070073A (en) * 2019-05-07 2019-07-30 国家广播电视总局广播电视科学研究院 Pedestrian's recognition methods again of global characteristics and local feature based on attention mechanism
CN110610129A (en) * 2019-08-05 2019-12-24 华中科技大学 Deep learning face recognition system and method based on self-attention mechanism

Also Published As

Publication number Publication date
CN111325145A (en) 2020-06-23

Similar Documents

Publication Publication Date Title
Zhang et al. Hierarchical feature fusion with mixed convolution attention for single image dehazing
CN112597883B (en) Human skeleton action recognition method based on generalized graph convolution and reinforcement learning
CN113378600B (en) Behavior recognition method and system
Zou et al. Crowd counting via hierarchical scale recalibration network
CN113554599B (en) Video quality evaluation method based on human visual effect
CN113255464A (en) Airplane action recognition method and system
WO2021057091A1 (en) Viewpoint image processing method and related device
CN114708665A (en) Skeleton map human behavior identification method and system based on multi-stream fusion
CN116563355A (en) Target tracking method based on space-time interaction attention mechanism
CN111325145B (en) Behavior recognition method based on combined time domain channel correlation block
Jiang et al. Cross-level reinforced attention network for person re-identification
Zhang et al. Accurate and efficient event-based semantic segmentation using adaptive spiking encoder-decoder network
CN113393435A (en) Video significance detection method based on dynamic context-aware filter network
Yadav et al. Video object detection from compressed formats for modern lightweight consumer electronics
Yuan et al. Multi-filter dynamic graph convolutional networks for skeleton-based action recognition
TWI826160B (en) Image encoding and decoding method and apparatus
CN116597144A (en) Image semantic segmentation method based on event camera
CN116863241A (en) End-to-end semantic aerial view generation method, model and equipment based on computer vision under road scene
CN116453025A (en) Volleyball match group behavior identification method integrating space-time information in frame-missing environment
CN111325149A (en) Video action identification method based on voting time sequence correlation model
CN114648722B (en) Motion recognition method based on video multipath space-time characteristic network
CN114022371B (en) Defogging device and defogging method based on space and channel attention residual error network
CN115511858A (en) Video quality evaluation method based on novel time sequence characteristic relation mapping
CN114639166A (en) Examination room abnormal behavior recognition method based on motion recognition
CN113850158A (en) Video feature extraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant