CN111325145B - Behavior recognition method based on combined time domain channel correlation block - Google Patents

Behavior recognition method based on combined time domain channel correlation block

Info

Publication number
CN111325145B
CN111325145B (application CN202010102863.5A)
Authority
CN
China
Prior art keywords
channel
time domain
attention module
domain channel
method based
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010102863.5A
Other languages
Chinese (zh)
Other versions
CN111325145A (en)
Inventor
胡建国
蔡佳辉
王金鹏
陈嘉敏
林佳玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Development Research Institute Of Guangzhou Smart City
Sun Yat Sen University
Original Assignee
Development Research Institute Of Guangzhou Smart City
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Development Research Institute Of Guangzhou Smart City and Sun Yat Sen University
Priority to CN202010102863.5A
Publication of CN111325145A
Application granted
Publication of CN111325145B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of computer vision and discloses a behavior recognition method based on a combined time domain channel correlation block. The method compresses an input initial feature map through a spatial global average pooling operation to obtain a time domain channel description operator; inputs the time domain channel description operator into an attention module to obtain the global nonlinear dependence between time domain and channels; and, taking the tensor output by the attention module as the per-channel importance weight after feature selection, multiplies the input initial feature map channel by channel with that tensor through a residual connection to obtain a channel-weighted feature map. The invention effectively captures the correlation information between time domain and channels through a network layer to obtain a channel-by-channel description operator, weights it onto the preceding features through multiplication, completes the re-weighting of the original features in the channel dimension, and concentrates more of the network's computing resources on the feature channels that are important to the output result.

Description

Behavior recognition method based on combined time domain channel correlation block
Technical Field
The invention relates to the field of computer vision, in particular to a behavior recognition method based on a combined time domain channel correlation block.
Background
Video accounts for roughly 70% of internet traffic, and that share continues to rise. Most cell phone cameras now capture not only images but also high-resolution video. Many real-world data sources are video-based, ranging from warehouse inventory systems to self-driving cars and drones. Video can be said to be the next frontier of computer vision, because it captures a large amount of information that still images cannot convey. Video behavior recognition has therefore long been a hot research problem in computer vision and related fields.
Human motion in a video sequence is a three-dimensional (3D) spatio-temporal signal containing both spatial and temporal features. The spatial features mainly describe the appearance of the objects involved in the motion and the configuration of the scene within each frame of the video. Spatial feature learning is similar to still-image recognition and therefore readily benefits from recent advances in deep Convolutional Neural Networks (CNNs). Temporal features capture the motion cues embedded in frames evolving over time and contain valuable motion information that must be incorporated into video recognition tasks. Video behavior recognition thus needs to solve two main problems: how to learn the temporal features, and how to properly fuse the spatial and temporal features.
Researchers initially modeled temporal motion information and spatial information explicitly in parallel, feeding the original frames and the optical flow between adjacent frames as two input streams to a deep neural network. On the other hand, as a generalization of the two-dimensional convolution (2D Conv) used for still-image recognition, three-dimensional convolution (3D Conv) has been proposed to process 3D volumetric video data. In a three-dimensional convolutional network, spatial and temporal features are tightly entangled and learned jointly: rather than learning spatial and temporal features separately and fusing them at the top of the network, joint spatio-temporal features are learned by three-dimensional convolutions distributed across the network. Given the excellent representation-learning capability of CNNs, an ideal three-dimensional convolution should succeed in video understanding just as two-dimensional convolution has in image recognition. However, the large number of model parameters and low computational efficiency limit the effectiveness and practicality of three-dimensional convolution.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a behavior recognition method based on a combined time domain channel correlation block.
A behavior recognition method based on a combined time domain channel correlation block, comprising the steps of:
S1, compressing an input initial three-dimensional spatio-temporal signal feature map through a spatial global average pooling operation to obtain a time domain channel description operator;
S2, inputting the time domain channel description operator into an attention module to obtain the time domain channel global nonlinear dependence;
S3, taking the tensor output by the attention module as the weight of the importance of each channel after feature selection, and multiplying the initial three-dimensional spatio-temporal signal feature map input in step S1 channel by channel with the tensor output by the attention module in step S2 through a residual connection to obtain a channel-weighted feature map.
Preferably, in the above behavior recognition method based on the combined time domain channel correlation block, the three-dimensional spatio-temporal signal of the initial feature map input in step S1 is expressed as:
$X \in \mathbb{R}^{T \times H \times W \times C}$
where T, H, W and C denote the time domain length, the spatial height and width, and the number of channels of the input signal, respectively.
Preferably, in the above behavior recognition method based on the combined time domain channel correlation block, the time domain channel description operator obtained in step S1 is expressed as:
$z_{t,c} = F_{sq}(x_{t,c}) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_{t,c}(i,j)$
wherein $z \in \mathbb{R}^{T \times C}$.
Preferably, in the above behavior recognition method based on the combined time domain channel correlation block, the attention module consists of two fully connected layers, wherein the first fully connected layer reduces the feature dimension to $C/r$, while the second fully connected layer increases the feature dimension back to $C$.
Preferably, in the above behavior recognition method based on the combined time domain channel correlation block, in step S3, the process by which the attention module fuses time domain-channel information and extracts channel-by-channel information is expressed as:
$Z = \sigma(\mathrm{MLP}(z)) = \sigma(W_1(\delta(W_0 z)))$
wherein $W_0 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_1 \in \mathbb{R}^{C \times \frac{C}{r}}$; δ and σ denote the ReLU and Sigmoid activation functions, respectively, and r is a hyper-parameter used to reduce the number of parameters of the attention module.
Preferably, in the above behavior recognition method based on the combined time domain channel correlation block, in step S3, the tensor output by the attention module is assigned as $Z \in \mathbb{R}^{T \times C}$.
Preferably, in the above behavior recognition method based on the combined time domain channel correlation block, in step S3, the channel-weighted feature map obtained by multiplying, through a residual connection, the initial three-dimensional spatio-temporal signal feature map input in step S1 channel by channel with the tensor output by the attention module in step S2 is denoted as $X_c$, where $X_c = F_{scale}(X, Z) = X \cdot Z$; here $X = [x_1, x_2, \ldots, x_C]$, and $F_{scale}(X, Z)$ denotes the channel-by-channel multiplication of the feature map $X \in \mathbb{R}^{T \times H \times W \times C}$ and $Z \in \mathbb{R}^{T \times C}$.
Preferably, in the above behavior recognition method based on the combined time domain channel correlation block, the hyper-parameter r takes a value in {2, 4, 8, 16, 32, …}.
Preferably, in the above behavior recognition method based on the combined time domain channel correlation block, the hyper-parameter r takes the value 16.
The invention has the following beneficial effects: the network layer effectively captures the correlation information between time domain and channels, so time domain-channel correlation feature learning can be performed effectively on any network; a channel-by-channel description operator is obtained and weighted onto the preceding features through multiplication, completing the re-weighting of the original features in the channel dimension. By concentrating more of the network's computing resources on the feature channels that are important to the output result, the computing resources of the network are optimized and the behavior recognition accuracy is improved.
Detailed Description
The following description of the embodiments of the present invention will clearly and fully describe the technical solutions of the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention provides a behavior recognition method based on a combined time domain channel correlation block, which comprises the following steps:
S1, compressing the input initial three-dimensional spatio-temporal signal feature map through a spatial global average pooling operation to obtain a time domain channel description operator. The three-dimensional spatio-temporal signal of the input initial feature map is expressed as:
$X \in \mathbb{R}^{T \times H \times W \times C}$
where T, H, W and C denote the time domain length, the spatial height and width, and the number of channels of the input signal, respectively. The resulting time domain channel description operator is expressed as:
$z_{t,c} = F_{sq}(x_{t,c}) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_{t,c}(i,j)$
wherein $z \in \mathbb{R}^{T \times C}$.
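As an illustration only, the spatial squeeze of step S1 can be written in a few lines of PyTorch. This is a minimal sketch under stated assumptions: the (N, C, T, H, W) tensor layout and the name spatial_squeeze are choices made here, not taken from the patent.

```python
import torch

def spatial_squeeze(x: torch.Tensor) -> torch.Tensor:
    """Spatial global average pooling: compress H and W, keep time and channel.

    x: feature map of shape (N, C, T, H, W), the usual PyTorch 3D-conv layout.
    Returns the time domain channel descriptor of shape (N, C, T).
    """
    return x.mean(dim=(3, 4))  # average over the spatial dimensions H, W
```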
S2, inputting the time domain channel description operator into the attention module to obtain the time domain channel global nonlinear dependence. To achieve this goal, the attention module must meet two conditions: first, it must be flexible; in particular, it must be able to learn the nonlinear interactions between time domain channels. Second, it must learn a non-mutually-exclusive relationship, since the aim is to ensure that multiple channels are allowed to be emphasized, rather than enforcing a one-hot activation.
Specifically, the attention module consists of two fully connected layers: the first fully connected layer reduces the feature dimension to $C/r$, while the second increases it back to $C$. A global receptive field over the spatial dimensions is obtained by the spatial global average pooling.
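Such a two-layer bottleneck might be sketched as follows (a sketch, not the patent's implementation: the class name and the application of the MLP independently at each time step are assumptions; only the C → C/r → C structure with ReLU and Sigmoid comes from the description above).

```python
import torch
import torch.nn as nn

class TimeChannelAttention(nn.Module):
    """Two fully connected layers: C -> C/r (ReLU), then C/r -> C (Sigmoid)."""

    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),  # first FC: reduce to C/r
            nn.ReLU(inplace=True),               # delta
            nn.Linear(channels // r, channels),  # second FC: restore to C
            nn.Sigmoid(),                        # sigma: non-exclusive weights
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: time domain channel descriptor of shape (N, T, C);
        # the MLP acts on the last (channel) axis at every time step.
        return self.fc(z)
```

Note that the Sigmoid, rather than a softmax, is what makes the learned relationship non-exclusive: several channels can receive weights close to 1 simultaneously.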
S3, taking the tensor output by the attention module as the weight of the importance of each channel after feature selection, and multiplying the initial three-dimensional spatio-temporal signal feature map input in step S1 channel by channel with the tensor output by the attention module in step S2 through a residual connection to obtain a channel-weighted feature map.
Specifically, in a preferred embodiment of the present invention, the process by which the attention module in step S3 fuses time domain-channel information and extracts channel-by-channel information is expressed as:
$Z = \sigma(\mathrm{MLP}(z)) = \sigma(W_1(\delta(W_0 z)))$
wherein $W_0 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_1 \in \mathbb{R}^{C \times \frac{C}{r}}$; δ and σ denote the ReLU and Sigmoid activation functions, respectively, and r is a hyper-parameter used to reduce the number of parameters of the attention module.
The tensor output by the attention module is assigned as $Z \in \mathbb{R}^{T \times C}$. The channel-weighted feature map obtained by multiplying, through a residual connection, the initial three-dimensional spatio-temporal signal feature map input in step S1 channel by channel with the tensor output by the attention module in step S2 is denoted as $X_c$, where $X_c = F_{scale}(X, Z) = X \cdot Z$; here $X = [x_1, x_2, \ldots, x_C]$, and $F_{scale}(X, Z)$ denotes the channel-by-channel multiplication of the feature map $X \in \mathbb{R}^{T \times H \times W \times C}$ and $Z \in \mathbb{R}^{T \times C}$. The hyper-parameter r takes a value in {2, 4, 8, 16, 32, …}, i.e. $2^n$ with n a positive integer. In practice, experiments show that r = 16 gives the best results.
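Putting steps S1 to S3 together, one possible sketch of the complete combined time domain channel correlation (CTC) block is given below. All identifiers and the (N, C, T, H, W) layout are assumptions made for illustration; the squeeze, the C → C/r → C bottleneck, and the channel-by-channel multiplication follow the equations above.

```python
import torch
import torch.nn as nn

class CTCBlock(nn.Module):
    """Sketch of a combined time domain channel correlation block.

    S1: spatial global average pooling  -> descriptor z, shape (N, T, C)
    S2: two-FC attention module         -> weights Z, shape (N, T, C)
    S3: channel-by-channel re-weighting -> X_c = X * Z (broadcast over H, W)
    """

    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, h, w = x.shape
        z = x.mean(dim=(3, 4)).transpose(1, 2)            # S1: (N, T, C)
        weights = self.attention(z)                       # S2: (N, T, C)
        weights = weights.transpose(1, 2).reshape(n, c, t, 1, 1)
        return x * weights                                # S3: broadcast multiply
```

A quick shape check: `CTCBlock(64)(torch.randn(2, 64, 8, 56, 56)).shape` should return `torch.Size([2, 64, 8, 56, 56])`, i.e. the block re-weights the input without changing its dimensions.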
Specifically, the overall network architecture is shown in the following table:
[Table: layer-by-layer configurations of the baseline 3D-ResNet101 and the proposed 3D CTC-ResNet101; the original table image is not reproduced here.]
In the above table, 3D-ResNet101 denotes the basic 101-layer residual network, while 3D CTC-ResNet101 denotes the architecture after adding the combined time domain channel correlation (CTC) blocks: a CTC module is added to each block of the residual network to construct the 3D CTC-ResNet101 network. Both architectures in the table employ three-dimensional convolution kernels and three-dimensional pooling, and each convolution layer shown in the table corresponds to a composite BN-ReLU-Conv sequence of operations. Tests on the behavior recognition datasets UCF-101 and HMDB-51 show that, compared with the baseline network, the CTC module improves the recognition rate to a certain extent.
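To mirror the 3D CTC-ResNet101 construction, a CTC block could be appended to each residual block roughly as sketched below. This is a hedged illustration only: the wrapper class and the placement of the CTC block after the wrapped block are assumptions; the patent specifies only that a CTC module is added to each block of the residual network.

```python
import torch.nn as nn

class CTCResidualBlock(nn.Module):
    """Wraps an existing 3D residual block and re-weights its output channels."""

    def __init__(self, block: nn.Module, out_channels: int, r: int = 16):
        super().__init__()
        self.block = block                    # e.g. one bottleneck of 3D-ResNet101
        self.ctc = CTCBlock(out_channels, r)  # CTCBlock as sketched above

    def forward(self, x):
        return self.ctc(self.block(x))        # features, then channel re-weighting
```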
In summary, the invention effectively captures the correlation information between time domain and channels through a network layer, can effectively perform time domain-channel correlation feature learning on any network, obtains a channel-by-channel description operator, weights it channel by channel onto the preceding features through multiplication, and completes the re-weighting of the original features in the channel dimension. By concentrating more of the network's computing resources on the feature channels that are important to the output result, the computing resources of the network are optimized and the behavior recognition accuracy is improved.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the foregoing embodiments, but rather, the foregoing embodiments and description illustrate the principles of the invention, and that various changes and modifications may be effected therein without departing from the spirit and scope of the invention as defined by the appended claims and their equivalents.

Claims (6)

1. The behavior recognition method based on the combined time domain channel correlation block is characterized by comprising the following steps of:
S1, compressing an input initial three-dimensional spatio-temporal signal feature map through a spatial global average pooling operation to obtain a time domain channel description operator;
S2, inputting the time domain channel description operator into an attention module to obtain the time domain channel global nonlinear dependence;
S3, taking the tensor output by the attention module as the weight of the importance of each channel after feature selection, and multiplying the initial three-dimensional spatio-temporal signal feature map input in step S1 channel by channel with the tensor output by the attention module in step S2 through a residual connection to obtain a channel-weighted feature map;
in step S3, the process by which the attention module fuses time domain-channel information and extracts channel-by-channel information is expressed as:
$Z = \sigma(\mathrm{MLP}(z)) = \sigma(W_1(\delta(W_0 z)))$
wherein $W_0 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_1 \in \mathbb{R}^{C \times \frac{C}{r}}$; δ and σ denote the ReLU and Sigmoid activation functions, respectively, and r is a hyper-parameter used to reduce the number of parameters of the attention module;
the tensor output by the attention module is assigned as $Z \in \mathbb{R}^{T \times C}$;
the channel-weighted feature map obtained by multiplying, through a residual connection, the initial three-dimensional spatio-temporal signal feature map input in step S1 channel by channel with the tensor output by the attention module in step S2 is denoted as $X_c$, where $X_c = F_{scale}(X, Z) = X \cdot Z$; here $X = [x_1, x_2, \ldots, x_C]$, and $F_{scale}(X, Z)$ denotes the channel-by-channel multiplication of the feature map $X \in \mathbb{R}^{T \times H \times W \times C}$ and $Z \in \mathbb{R}^{T \times C}$.
2. The method of claim 1, wherein the three-dimensional spatio-temporal signal of the initial feature map input in said step S1 is represented as:
$X \in \mathbb{R}^{T \times H \times W \times C}$
where T, H, W and C denote the time domain length, the spatial height and width, and the number of channels of the input signal, respectively.
3. The behavior recognition method based on the combined time domain channel correlation block according to claim 1 or 2, wherein the expression of the time domain channel description operator obtained in the step S1 is:
$z_{t,c} = F_{sq}(x_{t,c}) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_{t,c}(i,j)$
wherein $z \in \mathbb{R}^{T \times C}$.
4. The behavior recognition method based on the combined time domain channel correlation block according to claim 1, characterized in that the attention module consists of two fully connected layers, wherein the first fully connected layer reduces the feature dimension to $C/r$, while the second fully connected layer increases the feature dimension back to $C$.
5. The behavior recognition method based on the combined time domain channel correlation block according to claim 1, wherein the hyper-parameter r takes a value in {2, 4, 8, 16, 32, …}.
6. The behavior recognition method based on the combined time domain channel correlation block according to claim 5, wherein the hyper-parameter r takes the value 16.
CN202010102863.5A 2020-02-19 2020-02-19 Behavior recognition method based on combined time domain channel correlation block Active CN111325145B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010102863.5A CN111325145B (en) 2020-02-19 2020-02-19 Behavior recognition method based on combined time domain channel correlation block

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010102863.5A CN111325145B (en) 2020-02-19 2020-02-19 Behavior recognition method based on combined time domain channel correlation block

Publications (2)

Publication Number Publication Date
CN111325145A CN111325145A (en) 2020-06-23
CN111325145B true CN111325145B (en) 2023-04-25

Family

ID=71172703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010102863.5A Active CN111325145B (en) 2020-02-19 2020-02-19 Behavior recognition method based on combined time domain channel correlation block

Country Status (1)

Country Link
CN (1) CN111325145B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989940B (en) * 2021-11-17 2024-03-29 中国科学技术大学 Method, system, device and storage medium for identifying actions in video data


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109726659A (en) * 2018-12-21 2019-05-07 北京达佳互联信息技术有限公司 Detection method, device, electronic equipment and the readable medium of skeleton key point
CN109871777A (en) * 2019-01-23 2019-06-11 广州智慧城市发展研究院 A kind of Activity recognition system based on attention mechanism
CN110084180A (en) * 2019-04-24 2019-08-02 北京达佳互联信息技术有限公司 Critical point detection method, apparatus, electronic equipment and readable storage medium storing program for executing
CN110070073A (en) * 2019-05-07 2019-07-30 国家广播电视总局广播电视科学研究院 Pedestrian's recognition methods again of global characteristics and local feature based on attention mechanism
CN110610129A (en) * 2019-08-05 2019-12-24 华中科技大学 Deep learning face recognition system and method based on self-attention mechanism

Also Published As

Publication number Publication date
CN111325145A (en) 2020-06-23

Similar Documents

Publication Publication Date Title
Zhang et al. Hierarchical feature fusion with mixed convolution attention for single image dehazing
CN112597883B (en) Human skeleton action recognition method based on generalized graph convolution and reinforcement learning
CN113378600B (en) Behavior recognition method and system
Zou et al. Crowd counting via hierarchical scale recalibration network
CN113554599B (en) Video quality evaluation method based on human visual effect
CN113255464A (en) Airplane action recognition method and system
WO2021057091A1 (en) Viewpoint image processing method and related device
CN114708665A (en) Skeleton map human behavior identification method and system based on multi-stream fusion
CN116563355A (en) Target tracking method based on space-time interaction attention mechanism
CN111325145B (en) Behavior recognition method based on combined time domain channel correlation block
Jiang et al. Cross-level reinforced attention network for person re-identification
Zhang et al. Accurate and efficient event-based semantic segmentation using adaptive spiking encoder-decoder network
CN113393435A (en) Video significance detection method based on dynamic context-aware filter network
Yadav et al. Video object detection from compressed formats for modern lightweight consumer electronics
Yuan et al. Multi-filter dynamic graph convolutional networks for skeleton-based action recognition
TWI826160B (en) Image encoding and decoding method and apparatus
CN116597144A (en) Image semantic segmentation method based on event camera
CN116863241A (en) End-to-end semantic aerial view generation method, model and equipment based on computer vision under road scene
CN116453025A (en) Volleyball match group behavior identification method integrating space-time information in frame-missing environment
CN111325149A (en) Video action identification method based on voting time sequence correlation model
CN114648722B (en) Motion recognition method based on video multipath space-time characteristic network
CN114022371B (en) Defogging device and defogging method based on space and channel attention residual error network
CN115511858A (en) Video quality evaluation method based on novel time sequence characteristic relation mapping
CN114639166A (en) Examination room abnormal behavior recognition method based on motion recognition
CN113850158A (en) Video feature extraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant