CN114049582A - Weak supervision behavior detection method and device based on network structure search and background-action enhancement - Google Patents

Weak supervision behavior detection method and device based on network structure search and background-action enhancement Download PDF

Info

Publication number
CN114049582A
CN114049582A (application CN202111135223.5A)
Authority
CN
China
Prior art keywords
video
self-attention
features
behavior detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111135223.5A
Other languages
Chinese (zh)
Inventor
张晓宇
张亚如
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN202111135223.5A priority Critical patent/CN114049582A/en
Publication of CN114049582A publication Critical patent/CN114049582A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a weakly supervised behavior detection method and device based on network structure search and background-action enhancement. The method comprises: extracting video features from a target video; performing network structure search over a predefined self-attention module to construct an optimized self-attention module, feeding the video features into the optimized module, and calculating a self-attention weight vector based on local-global information; weighting and fusing the video features with the self-attention weight vector to obtain a video feature vector, and classifying this vector to obtain a video classification result; and performing temporal class activation mapping on the self-attention weight vector and the video classification result to obtain the behavior detection result. The invention integrates rich action knowledge with constructive background information, realizes fine-grained background modeling, learns long-video features more effectively, and reduces both computational complexity and temporal annotation effort.

Description

Weak supervision behavior detection method and device based on network structure search and background-action enhancement
Technical Field
The invention belongs to the field of video understanding, relates to a video behavior identification and detection technology, and particularly relates to a weak supervision behavior detection method and device based on network structure search and background-action enhancement.
Background
Video understanding refers to interpreting the events or behaviors occurring in a video using video modeling methods such as computer vision analysis. With the development of information and storage technologies, video has gradually become one of the largest information carriers in modern society, and a wide variety of video understanding needs arise in real life. Behavior recognition is a fundamental technology in the field of video understanding and is generally aimed at classifying the actions in manually trimmed video segments. However, most real-world videos are untrimmed, with rich semantic information and large data volume, so manually trimming videos for video understanding is increasingly impractical. Therefore, academia and industry have begun to focus on the task of temporal behavior detection, i.e., locating the boundaries of action instances in long videos while determining the action category of each action segment. This research helps people quickly locate the key content in a video and can be applied to fields such as abnormal behavior detection, intelligent video surveillance, and video retrieval.
An untrimmed video sequence is typically composed of a number of action segments and meaningless background segments. The temporal behavior detection task bears some similarity to the object detection task: both need to obtain the boundary of a foreground subject (a behavior or an object) and classify it. Methodologically, temporal behavior detection borrows heavily from object detection frameworks and can be divided into two-stage and single-stage methods. Two-stage methods first generate candidate behavior segments using sliding windows or behavior scores, and then classify the behavior instances in the candidate regions. However, these frameworks require the functional modules to be optimized separately, making the models complex and slow. Single-stage methods aim to generate behavior categories and temporal boundaries directly from the raw video in one pass, mainly as temporal extensions of object detectors such as SSD and Faster R-CNN. However, all of these methods are fully supervised and require massive manually annotated video data, so weakly supervised learning is introduced into video content recognition and detection to reduce the high cost of temporal annotation.
Disclosure of Invention
The invention aims to provide a weakly supervised behavior detection method and device based on network structure search and background-action enhancement. The method uses network structure search to construct a local-global attention feature representation of a long video, performs background-action enhancement through an action-aware background modeling technique, and uses additional background knowledge to provide rich cues for action recognition and detection. During network training, only video-level class labels of the long video are used, without frame-level temporal labels, so that weakly supervised learning improves the action recognition and detection capability on long videos.
The method first extracts RGB and optical-flow features of the long video with a pre-trained I3D network, then forms a searchable self-attention module by constructing a predefined search space and introducing structural parameter factors, and obtains the optimal structural parameters through a differentiable search algorithm to reconstruct the self-attention model structure. Each video feature is fed into the optimized self-attention module to obtain two-stream self-attention feature representations for an action branch and a background branch, respectively. Next, the two feature representations are input into a background-action-enhanced classifier, which outputs class-wise action and background classification scores. Finally, an integrated T-CAM score is obtained from the classification results and combined with the class-agnostic self-attention weight vector for model training, on which weakly supervised behavior detection is then performed.
The technical scheme adopted by the invention is as follows:
a weak supervision behavior detection method based on network structure search and background-action enhancement comprises the following steps:
1) extracting video features of a target video;
2) performing network structure search on a predefined self-attention module to construct an optimized self-attention module, inputting the video features into the optimized self-attention module, and calculating a self-attention weight vector based on local-global information;
3) performing weighted fusion on the video features using the self-attention weight vector to obtain a video feature vector, and classifying based on the video feature vector to obtain a video classification result;
4) performing temporal class activation mapping calculation according to the self-attention weight vector and the video classification result to obtain a behavior detection result.
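Steps 2)-4) can be sketched end to end in numpy. The sketch below is illustrative only: the attention weights and classifier weights are random stand-ins for the searched self-attention module and the trained classifier, and the dimensions (`T`, `D`, `C`) are assumed values, not fixed by the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, C = 100, 2048, 20                 # snippets, feature dim, action classes (assumed)
features = rng.normal(size=(T, D))      # stand-in for extracted snippet features

# Step 2 (placeholder): a real system would run the searched self-attention
# module here; arbitrary sigmoid scores stand in for its output.
attention = 1.0 / (1.0 + np.exp(-rng.normal(size=T)))      # values in (0, 1)

# Step 3: attention-weighted fusion into one video-level feature vector,
# then a randomly initialised single-layer linear classifier.
video_feature = (attention[:, None] * features).sum(axis=0) / attention.sum()
W, b = rng.normal(size=(D, C)) * 0.01, np.zeros(C)
logits = video_feature @ W + b
video_scores = np.exp(logits - logits.max())
video_scores /= video_scores.sum()      # softmax over action classes

# Step 4: temporal class activation map (T-CAM): per-snippet class scores
# modulated by the class-agnostic attention weights.
tcam = attention[:, None] * (features @ W + b)             # shape (T, C)
```

The key structural point is that the same attention vector serves twice: for video-level feature fusion (classification) and for snippet-level T-CAM modulation (localization).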
Further, the video features include: RGB features and optical flow features.
Further, before extracting the RGB features and the optical flow features, the video is preprocessed, wherein the preprocessing comprises: performing a uniform cropping operation on the frames of the video.
Further, the uniform cropping operation comprises: center crop operation.
Further, the method for extracting the RGB features and the optical flow features includes: using an I3D network, using a C3D network, using a TSN network, or using a TSP network.
Further, an optimized self-attention module is constructed by:
1) predefining a search space;
2) introducing a structural parameter factor;
3) relaxing the search space to be continuous;
4) constructing the optimized self-attention module using one of a differentiable search algorithm, a reinforcement learning algorithm, or an evolutionary algorithm.
Further, the temporal operations of the predefined search space comprise: standard convolution, dilated (hole) convolution, separable convolution, convolution-normalization-activation operation, skip operation, spatial pooling, and channel pooling.
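As an illustration of the seven operation types, a toy numpy version of such a 1-D temporal search space might look as follows. The kernel sizes, the fixed averaging kernel, and the `OPS` names are hypothetical stand-ins, not the patent's actual operators:

```python
import numpy as np

def conv1d(x, k, dilation=1):
    """'Same'-padded 1-D convolution along time; x: (T, C), k: odd kernel size."""
    T, C = x.shape
    pad = dilation * (k // 2)
    xp = np.pad(x, ((pad, pad), (0, 0)))
    w = np.full(k, 1.0 / k)               # fixed averaging kernel for the sketch
    out = np.zeros_like(x)
    for i in range(k):
        out += w[i] * xp[i * dilation : i * dilation + T]
    return out

# Hypothetical candidate set mirroring the seven operation classes.
OPS = {
    "std_conv":     lambda x: conv1d(x, 3),
    "hole_conv":    lambda x: conv1d(x, 3, dilation=2),    # a.k.a. dilated conv
    "sep_conv":     lambda x: conv1d(conv1d(x, 3), 1),     # depthwise-then-pointwise stand-in
    "conv_bn_act":  lambda x: np.maximum(conv1d(x, 3), 0), # conv + (omitted) norm + ReLU
    "skip":         lambda x: x,                           # identity / skip connection
    "spatial_pool": lambda x: x.mean(0, keepdims=True) * np.ones_like(x),
    "channel_pool": lambda x: x.mean(1, keepdims=True) * np.ones((1, x.shape[1])),
}

x = np.random.default_rng(0).normal(size=(8, 4))           # (time, channels)
outs = {name: op(x) for name, op in OPS.items()}
```

Every candidate maps a (T, C) feature map to another (T, C) map, which is what allows the search to swap operations freely on an edge.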
Further, the method for classifying includes: a single-layer Linear classifier is used.
Further, the behavior detection result is obtained by the following steps:
1) computing the result of the integrated T-CAM through the temporal class activation mapping calculation;
2) comparing the integrated T-CAM score with a set threshold, and taking the retained action occurrence time periods as the behavior detection result.
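Step 2) can be sketched as thresholding a per-class T-CAM curve and keeping contiguous above-threshold runs as action time periods. The threshold value below is a hypothetical tuning choice, and snippet indices stand in for time:

```python
import numpy as np

def segments_above_threshold(tcam_one_class, thresh):
    """Return [start, end) snippet-index pairs of contiguous runs whose
    T-CAM score exceeds thresh (the threshold is a tuning parameter)."""
    mask = (tcam_one_class > thresh).astype(np.int8)
    # Pad with zeros so rising/falling edges at the borders are detected too.
    edges = np.flatnonzero(np.diff(np.concatenate(([0], mask, [0]))))
    return [(int(s), int(e)) for s, e in zip(edges[::2], edges[1::2])]

scores = np.array([0.1, 0.8, 0.9, 0.2, 0.7, 0.75, 0.1])
print(segments_above_threshold(scores, 0.5))   # [(1, 3), (4, 6)]
```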
Further, for the obtained behavior localization prediction results, a non-maximum suppression method is adopted for de-duplication.
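The non-maximum suppression mentioned here can be sketched for 1-D temporal intervals as follows; the IoU threshold of 0.5 is an assumed value:

```python
def temporal_iou(a, b):
    """Temporal IoU of two (start, end, ...) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(proposals, iou_thresh=0.5):
    """Greedy NMS over (start, end, score) tuples: keep the highest-scoring
    interval, drop later intervals whose IoU with a kept one exceeds the
    threshold."""
    kept = []
    for p in sorted(proposals, key=lambda p: p[2], reverse=True):
        if all(temporal_iou(p, q) <= iou_thresh for q in kept):
            kept.append(p)
    return kept

dets = [(10, 20, 0.9), (11, 21, 0.8), (40, 50, 0.7)]
print(temporal_nms(dets))   # [(10, 20, 0.9), (40, 50, 0.7)]
```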
A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the above method when executed.
An electronic device comprising a memory and a processor, wherein the memory stores a program that performs the above described method.
The method of the invention can classify the actions in a long video and localize the time intervals in which they occur. Compared with the prior art, it has the following advantages:
1. The invention provides a self-attention model optimization method based on network structure search, which automatically captures local-global dependencies within and between frames of a long video, modeling the long video completely while saving the time and cost of manually designing a complex self-attention mechanism;
2. The invention uses an action-aware background-action enhancement technique that integrates the rich action knowledge and constructive background information in the long video, realizing fine-grained background modeling and learning long-video features better;
3. The method trains the model under a weakly supervised learning mechanism, using only video-level labels and no temporal labels, which greatly reduces the computational complexity and the time spent on temporal annotation.
Drawings
Fig. 1 is a flow chart of video behavior identification and detection using the method of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to specific examples and the accompanying drawings.
The invention provides a weakly supervised behavior localization method based on network structure search and background-action enhancement, suitable for behavior recognition and detection on long videos; the flow of the method is shown in Fig. 1 and mainly comprises the following steps. The long video is first preprocessed to extract its RGB frames and optical flow, and features are then extracted from the RGB frames and optical flow, respectively, with an I3D network. Next, a predefined search space is constructed, structural parameter factors are introduced, the search space is relaxed to be continuous to form a searchable self-attention module, and the optimal structural parameters obtained by a differentiable search algorithm are used to reconstruct the self-attention model structure. Each video feature is fed into the optimized self-attention module to obtain two-stream self-attention feature representations for an action branch and a background branch, respectively. The two feature representations are then input into a background-action-enhanced classifier, which outputs class-wise action and background classification scores. Finally, an integrated T-CAM score is obtained from the classification results and combined with the class-agnostic self-attention weight vector for model training, on which weakly supervised behavior detection is then performed.
The method comprises three stages: searching, training, and testing. In the searching stage, the self-attention model structure is optimized with a differentiable search algorithm to obtain the optimal structural parameters. In the training stage, the self-attention model is reconstructed from the network structure corresponding to the optimal structural parameters, and the self-attention module and the classifier are trained with video-level action labels. In the testing stage, the trained self-attention module and classifier produce the video's classification result and self-attention weights, from which video action localization is performed.
Example 1 Weak supervision behavior detection method and apparatus based on network structure search and background-action enhancement
Take the THUMOS14 dataset as an example:
1) First, extract the RGB and optical-flow features of the long-video dataset using I3D, C3D, TSN, or TSP;
before extracting the RGB and optical-flow features, the frames are preprocessed by a uniform cropping operation (e.g., a center crop to 224 × 224);
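The center-crop preprocessing can be sketched in a few lines of numpy; the 256 × 340 input size is only an assumed example of a resized frame, not a value fixed by the patent:

```python
import numpy as np

def center_crop(frame, size=224):
    """Crop an (H, W, C) frame to (size, size, C) about its centre;
    assumes H >= size and W >= size (frames are usually resized first)."""
    h, w = frame.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return frame[top:top + size, left:left + size]

frame = np.zeros((256, 340, 3), dtype=np.uint8)   # assumed resized RGB frame
print(center_crop(frame).shape)                   # (224, 224, 3)
```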
2) Perform network structure search on the predefined self-attention unit using the network structure search technique to obtain local-global self-attention feature representations within and between video-segment frames;
the network structure search technique comprises 4 steps: predefining a search space, introducing structural parameter factors, relaxing the search space to be continuous, and running a differentiable search algorithm, reinforcement learning algorithm, or evolutionary algorithm;
the predefined search space comprises 7 types of 1-D temporal operations: standard convolution, dilated (hole) convolution, separable convolution, convolution-normalization-activation, skip connection, spatial pooling, and channel pooling;
the continuous relaxation of the search space adopts a Softmax function;
the differentiable search algorithm adopts the standard DARTS algorithm;
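The Softmax-based continuous relaxation used by DARTS can be illustrated on a single edge of the search space. The candidate operations and the alpha values below are toy stand-ins for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy candidate operations on one edge of the search space.
ops = [
    lambda x: x,                           # skip connection
    lambda x: np.maximum(x, 0.0),          # conv-norm-activation stand-in
    lambda x: x.mean() * np.ones_like(x),  # pooling stand-in
]
alpha = np.array([0.1, 2.0, -1.0])         # structural parameter factors (learned in DARTS)

def mixed_op(x, alpha):
    """Continuous relaxation: the edge output is the softmax(alpha)-weighted
    sum of all candidate operations, which makes alpha differentiable."""
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, ops))

x = np.array([-1.0, 0.5, 2.0])
y = mixed_op(x, alpha)
best_op_index = int(np.argmax(alpha))      # discretisation after search
```

After the search converges, each edge is discretised by keeping only the operation with the largest alpha, reconstructing a plain (non-mixed) module.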
3) Input the two-stream video features obtained in step 1) into the self-attention module optimized in step 2), and calculate a self-attention weight vector based on local-global information;
4) Weight and aggregate the long-video features from step 1) with the self-attention weight vector obtained in step 3) to obtain a self-attention video feature vector;
further, the video-level classifier is a single-layer linear classifier;
5) Input the video feature vectors obtained in step 4) into the classifier for classification to obtain a video classification result;
6) Calculate the temporal class activation mapping (T-CAM) score from the self-attention weight vector obtained in step 3) and the video classification result obtained in step 5), then calculate the integrated T-CAM scores of action and background; computing the integrated T-CAM separately for RGB and optical flow yields the two-stream integrated T-CAM result;
7) Based on the two-stream integrated T-CAM result obtained in step 6), actions and background in the long video can be distinguished: positions exceeding a threshold are retained, and a non-maximum suppression method finally removes highly overlapping predictions, yielding the time periods in which actions occur in the video;
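The two-stream integration in steps 6)-7) can be sketched as follows. The patent does not spell out the exact integration rule, so the action × (1 − background) combination and the equal-weight stream average below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
T, C = 50, 20   # snippets, action classes (assumed sizes)

# Hypothetical per-stream action and background T-CAMs in [0, 1).
tcam_rgb_act,  tcam_rgb_bg  = rng.random((T, C)), rng.random((T, C))
tcam_flow_act, tcam_flow_bg = rng.random((T, C)), rng.random((T, C))

def integrate(action_tcam, background_tcam):
    # Assumed rule: boost snippets that score high as action and low as background.
    return action_tcam * (1.0 - background_tcam)

# Equal-weight fusion of the RGB and optical-flow streams (assumed weighting).
fused_tcam = 0.5 * (integrate(tcam_rgb_act, tcam_rgb_bg)
                    + integrate(tcam_flow_act, tcam_flow_bg))
```

The fused map is then thresholded per class and de-duplicated with temporal NMS, as in step 7).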
comparing the results of the method of the present invention with those of other methods, the obtained behavior recognition accuracy is shown in table 1, and the obtained average accuracy of behavior localization is shown in table 2:
TABLE 1 Behavior recognition accuracy on the THUMOS14 dataset

Dataset     Method of the invention   W-TALC   AdapNet
THUMOS14    0.897                     0.856    0.879
In Table 1, W-TALC and AdapNet are comparison methods; both are weakly supervised methods, so they can be compared fairly with the method of the invention.
TABLE 2 Average precision of behavior localization (mAP) at IoU = 0.5 on the THUMOS14 dataset

Dataset     Method of the invention   W-TALC   AdapNet
THUMOS14    28.79                     16.9     23.65
The results in Table 1 and Table 2 show that the method of the invention significantly improves video behavior recognition and localization, respectively.
Example 2 Weak supervision behavior detection method and apparatus based on network structure search and background-action enhancement
The device of this embodiment comprises:
a feature extraction unit for performing two-stream feature extraction on the long-video dataset;
a network structure search unit for performing structure search on the predefined self-attention unit to obtain local-global self-attention feature representations within and between video-segment frames;
a self-attention unit for extracting self-attention features from the features obtained by the feature extraction unit to obtain a more compact feature representation;
an action-background recognition unit for classifying the action and background features weighted by the self-attention weights to obtain the probability that the long video belongs to a certain class of action and background;
further, the device also comprises a behavior localization unit for calculating the integrated T-CAM value from the background-action-enhanced action and background recognition scores, optimizing the self-attention weights, distinguishing action from background in the video, and performing post-processing with non-maximum suppression to obtain the time intervals of action instances in the video, thereby improving the average precision of behavior localization.
The above embodiments are only intended to illustrate the technical solution of the invention and not to limit it; a person skilled in the art may modify the technical solution of the invention or substitute equivalents without departing from its spirit and scope, and the protection scope of the invention should be determined by the claims.

Claims (10)

1. A weak supervision behavior detection method based on network structure search and background-action enhancement comprises the following steps:
1) extracting video characteristics of a target video;
2) performing network structure search on a predefined self-attention module to construct an optimized self-attention module, inputting the video features into the optimized self-attention module, and calculating a self-attention weight vector based on local-global information;
3) performing weighted fusion on the video features using the self-attention weight vector to obtain a video feature vector, and classifying based on the video feature vector to obtain a video classification result;
4) performing temporal class activation mapping calculation according to the self-attention weight vector and the video classification result to obtain a behavior detection result.
2. The method of claim 1, wherein the video features comprise: RGB features and optical flow features.
3. The method of claim 1, wherein before extracting the video features, the video is preprocessed, the preprocessing comprising: performing a uniform cropping operation on the frames of the video; the uniform cropping operation comprises a center crop operation.
4. The method of claim 1, wherein extracting RGB features and optical flow features comprises: using an I3D network, using a C3D network, using a TSN network, or using a TSP network.
5. The method of claim 1, wherein the optimized self-attention module is constructed by:
1) predefining a search space, wherein the temporal operations of the predefined search space comprise: standard convolution, dilated (hole) convolution, separable convolution, convolution-normalization-activation operation, skip operation, spatial pooling, and channel pooling;
2) introducing a structural parameter factor;
3) relaxing the search space to be continuous;
4) constructing the optimized self-attention module using one of a differentiable search algorithm, a reinforcement learning algorithm, or an evolutionary algorithm.
6. The method of claim 1, wherein the method of classifying comprises: a single-layer Linear classifier is used.
7. The method of claim 1, wherein the behavior detection result is obtained by:
1) computing the result of the integrated T-CAM through the temporal class activation mapping calculation;
2) comparing the integrated T-CAM score with a set threshold, and taking the retained action occurrence time periods as the behavior detection result.
8. The method of claim 7, wherein de-duplication is performed on the obtained behavior localization prediction results using a non-maximum suppression method.
9. A storage medium having a computer program stored thereon, wherein the computer program is arranged to, when run, perform the method of any of claims 1-8.
10. An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method according to any of claims 1-8.
CN202111135223.5A 2021-09-27 2021-09-27 Weak supervision behavior detection method and device based on network structure search and background-action enhancement Pending CN114049582A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111135223.5A CN114049582A (en) 2021-09-27 2021-09-27 Weak supervision behavior detection method and device based on network structure search and background-action enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111135223.5A CN114049582A (en) 2021-09-27 2021-09-27 Weak supervision behavior detection method and device based on network structure search and background-action enhancement

Publications (1)

Publication Number    Publication Date
CN114049582A          2022-02-15

Family

ID=80204859

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111135223.5A Pending CN114049582A (en) 2021-09-27 2021-09-27 Weak supervision behavior detection method and device based on network structure search and background-action enhancement

Country Status (1)

Country Link
CN (1) CN114049582A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114758285A (en) * 2022-06-14 2022-07-15 山东省人工智能研究院 Video interaction action detection method based on anchor freedom and long-term attention perception



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination