CN114049582A - Weak supervision behavior detection method and device based on network structure search and background-action enhancement - Google Patents

Weak supervision behavior detection method and device based on network structure search and background-action enhancement Download PDF

Info

Publication number
CN114049582A
CN114049582A (application CN202111135223.5A)
Authority
CN
China
Prior art keywords
video
self-attention
features
behavior detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111135223.5A
Other languages
Chinese (zh)
Inventor
张晓宇
张亚如
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN202111135223.5A priority Critical patent/CN114049582A/en
Publication of CN114049582A publication Critical patent/CN114049582A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a weakly supervised behavior detection method and device based on network structure search and background-action enhancement. The method comprises: extracting video features from a target video; performing network structure search over a predefined self-attention module to construct an optimized self-attention module, feeding the video features into the optimized module, and calculating a self-attention weight vector based on local-global information; weighting and fusing the video features with the self-attention weight vector to obtain a video feature vector, and classifying this vector to obtain a video classification result; and performing temporal class activation mapping on the self-attention weight vector and the video classification result to obtain the behavior detection result. The invention integrates rich action knowledge with constructive background information, realizes fine-grained background modeling, learns long-video features more effectively, and reduces both computational complexity and temporal annotation effort.

Description

Weak supervision behavior detection method and device based on network structure search and background-action enhancement
Technical Field
The invention belongs to the field of video understanding, relates to a video behavior identification and detection technology, and particularly relates to a weak supervision behavior detection method and device based on network structure search and background-action enhancement.
Background
Video understanding refers to interpreting the events or behaviors occurring in a video using video modeling methods such as computer vision analysis. With the development of information and storage technologies, video has gradually become one of the largest information carriers in modern society, and a wide variety of video understanding needs arise in real life. Behavior recognition is a fundamental technology in the field of video understanding and is generally aimed at classifying the actions in manually trimmed video segments. However, most real-world videos are untrimmed, with rich semantic information and large data volume, so manually trimming videos for video understanding is increasingly impractical. Therefore, academia and industry have begun to focus on the task of temporal behavior detection, i.e., locating the boundaries of action instances in long videos while determining the action category of each action segment. This research helps people quickly locate the key content in a video and can be applied to fields such as abnormal behavior detection, intelligent video surveillance, and video retrieval.
An untrimmed video sequence is typically composed of a number of action segments and meaningless background segments. The temporal behavior detection task bears some similarity to the object detection task: both need to obtain the boundary of a foreground subject (a behavior or an object) and classify it. Methodologically, temporal behavior detection borrows heavily from object detection frameworks and can be divided into two-stage and single-stage methods. Two-stage methods first generate candidate behavior segments using sliding windows or behavior scores, and then classify the behavior instances in the candidate regions. However, these frameworks require the functional modules to be optimized separately, making the models complex and slow. Single-stage methods aim to generate behavior categories and temporal boundaries directly from the raw video in one pass, mainly as temporal extensions of object detectors such as SSD and Faster R-CNN. However, all of these methods are fully supervised and require massive manually annotated video data, so weakly supervised learning is introduced into video content recognition and detection to reduce the high cost of temporal annotation.
Disclosure of Invention
The invention aims to provide a weakly supervised behavior detection method and device based on network structure search and background-action enhancement. The method uses network structure search to construct a local-global attention feature representation of a long video, performs background-action enhancement through an action-aware background modeling technique, and uses additional background knowledge to provide rich cues for action recognition and detection. During network training, only video-level class labels of the long video are used, without frame-level temporal labels, so that weakly supervised learning improves the action recognition and detection capability on long videos.
The method first extracts RGB and optical-flow features of the long video with a pre-trained I3D network, then forms a searchable self-attention module by constructing a predefined search space and introducing structural parameter factors, and obtains the optimal structural parameters through a differentiable search algorithm to reconstruct the self-attention model structure. Each video feature is fed into the optimized self-attention module to obtain two-stream self-attention feature representations for an action branch and a background branch, respectively. Next, the two feature representations are input into a background-action-enhanced classifier, which outputs class-wise action and background classification scores. Finally, an integrated T-CAM score is obtained from the classification results and combined with the class-agnostic self-attention weight vector for model training, on which weakly supervised behavior detection is then performed.
The technical scheme adopted by the invention is as follows:
a weak supervision behavior detection method based on network structure search and background-action enhancement comprises the following steps:
1) extracting video features of a target video;
2) performing network structure search on a predefined self-attention module to construct an optimized self-attention module, inputting the video features into the optimized self-attention module, and calculating a self-attention weight vector based on local-global information;
3) performing weighted fusion on the video features using the self-attention weight vector to obtain a video feature vector, and classifying based on the video feature vector to obtain a video classification result;
4) performing temporal class activation mapping calculation according to the self-attention weight vector and the video classification result to obtain a behavior detection result.
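Steps 2)-4) can be sketched end to end in numpy. The sketch below is illustrative only: the attention weights and classifier weights are random stand-ins for the searched self-attention module and the trained classifier, and the dimensions (`T`, `D`, `C`) are assumed values, not fixed by the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, C = 100, 2048, 20                 # snippets, feature dim, action classes (assumed)
features = rng.normal(size=(T, D))      # stand-in for extracted snippet features

# Step 2 (placeholder): a real system would run the searched self-attention
# module here; arbitrary sigmoid scores stand in for its output.
attention = 1.0 / (1.0 + np.exp(-rng.normal(size=T)))      # values in (0, 1)

# Step 3: attention-weighted fusion into one video-level feature vector,
# then a randomly initialised single-layer linear classifier.
video_feature = (attention[:, None] * features).sum(axis=0) / attention.sum()
W, b = rng.normal(size=(D, C)) * 0.01, np.zeros(C)
logits = video_feature @ W + b
video_scores = np.exp(logits - logits.max())
video_scores /= video_scores.sum()      # softmax over action classes

# Step 4: temporal class activation map (T-CAM): per-snippet class scores
# modulated by the class-agnostic attention weights.
tcam = attention[:, None] * (features @ W + b)             # shape (T, C)
```

The key structural point is that the same attention vector serves twice: for video-level feature fusion (classification) and for snippet-level T-CAM modulation (localization).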
Further, the video features include: RGB features and optical flow features.
Further, before extracting the RGB features and the optical flow features, the video is preprocessed, wherein the preprocessing comprises: performing a uniform cropping operation on the frames of the video.
Further, the uniform cropping operation comprises: center crop operation.
Further, the method for extracting the RGB features and the optical flow features includes: using an I3D network, using a C3D network, using a TSN network, or using a TSP network.
Further, an optimized self-attention module is constructed by:
1) predefining a search space;
2) introducing a structural parameter factor;
3) relaxing the search space to be continuous;
4) constructing the optimized self-attention module using one of a differentiable search algorithm, a reinforcement learning algorithm, or an evolutionary algorithm.
Further, the temporal operations of the predefined search space comprise: standard convolution, dilated (hole) convolution, separable convolution, convolution-normalization-activation operation, skip operation, spatial pooling, and channel pooling.
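As an illustration of the seven operation types, a toy numpy version of such a 1-D temporal search space might look as follows. The kernel sizes, the fixed averaging kernel, and the `OPS` names are hypothetical stand-ins, not the patent's actual operators:

```python
import numpy as np

def conv1d(x, k, dilation=1):
    """'Same'-padded 1-D convolution along time; x: (T, C), k: odd kernel size."""
    T, C = x.shape
    pad = dilation * (k // 2)
    xp = np.pad(x, ((pad, pad), (0, 0)))
    w = np.full(k, 1.0 / k)               # fixed averaging kernel for the sketch
    out = np.zeros_like(x)
    for i in range(k):
        out += w[i] * xp[i * dilation : i * dilation + T]
    return out

# Hypothetical candidate set mirroring the seven operation classes.
OPS = {
    "std_conv":     lambda x: conv1d(x, 3),
    "hole_conv":    lambda x: conv1d(x, 3, dilation=2),    # a.k.a. dilated conv
    "sep_conv":     lambda x: conv1d(conv1d(x, 3), 1),     # depthwise-then-pointwise stand-in
    "conv_bn_act":  lambda x: np.maximum(conv1d(x, 3), 0), # conv + (omitted) norm + ReLU
    "skip":         lambda x: x,                           # identity / skip connection
    "spatial_pool": lambda x: x.mean(0, keepdims=True) * np.ones_like(x),
    "channel_pool": lambda x: x.mean(1, keepdims=True) * np.ones((1, x.shape[1])),
}

x = np.random.default_rng(0).normal(size=(8, 4))           # (time, channels)
outs = {name: op(x) for name, op in OPS.items()}
```

Every candidate maps a (T, C) feature map to another (T, C) map, which is what allows the search to swap operations freely on an edge.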
Further, the method for classifying includes: a single-layer Linear classifier is used.
Further, the behavior detection result is obtained by the following steps:
1) computing the result of the integrated T-CAM through the temporal class activation mapping calculation;
2) comparing the integrated T-CAM score with a set threshold, and taking the retained action occurrence time periods as the behavior detection result.
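Step 2) can be sketched as thresholding a per-class T-CAM curve and keeping contiguous above-threshold runs as action time periods. The threshold value below is a hypothetical tuning choice, and snippet indices stand in for time:

```python
import numpy as np

def segments_above_threshold(tcam_one_class, thresh):
    """Return [start, end) snippet-index pairs of contiguous runs whose
    T-CAM score exceeds thresh (the threshold is a tuning parameter)."""
    mask = (tcam_one_class > thresh).astype(np.int8)
    # Pad with zeros so rising/falling edges at the borders are detected too.
    edges = np.flatnonzero(np.diff(np.concatenate(([0], mask, [0]))))
    return [(int(s), int(e)) for s, e in zip(edges[::2], edges[1::2])]

scores = np.array([0.1, 0.8, 0.9, 0.2, 0.7, 0.75, 0.1])
print(segments_above_threshold(scores, 0.5))   # [(1, 3), (4, 6)]
```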
Further, for the obtained behavior localization prediction results, a non-maximum suppression method is adopted for de-duplication.
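The non-maximum suppression mentioned here can be sketched for 1-D temporal intervals as follows; the IoU threshold of 0.5 is an assumed value:

```python
def temporal_iou(a, b):
    """Temporal IoU of two (start, end, ...) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(proposals, iou_thresh=0.5):
    """Greedy NMS over (start, end, score) tuples: keep the highest-scoring
    interval, drop later intervals whose IoU with a kept one exceeds the
    threshold."""
    kept = []
    for p in sorted(proposals, key=lambda p: p[2], reverse=True):
        if all(temporal_iou(p, q) <= iou_thresh for q in kept):
            kept.append(p)
    return kept

dets = [(10, 20, 0.9), (11, 21, 0.8), (40, 50, 0.7)]
print(temporal_nms(dets))   # [(10, 20, 0.9), (40, 50, 0.7)]
```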
A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the above method when executed.
An electronic device comprising a memory and a processor, wherein the memory stores a program that performs the above described method.
The method of the invention can classify the actions in a long video and localize the time intervals in which they occur. Compared with the prior art, it has the following advantages:
1. The invention provides a self-attention model optimization method based on network structure search, which automatically captures local-global dependencies within and between frames of a long video, modeling the long video completely while saving the time and cost of manually designing a complex self-attention mechanism;
2. The invention uses an action-aware background-action enhancement technique that integrates the rich action knowledge and constructive background information in the long video, realizing fine-grained background modeling and learning long-video features better;
3. The method trains the model under a weakly supervised learning mechanism, using only video-level labels and no temporal labels, which greatly reduces the computational complexity and the time spent on temporal annotation.
Drawings
Fig. 1 is a flow chart of video behavior identification and detection using the method of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to specific examples and the accompanying drawings.
The invention provides a weakly supervised behavior localization method based on network structure search and background-action enhancement, suitable for behavior recognition and detection on long videos; the flow of the method is shown in Fig. 1 and mainly comprises the following steps. The long video is first preprocessed to extract its RGB frames and optical flow, and features are then extracted from the RGB frames and optical flow, respectively, with an I3D network. Next, a predefined search space is constructed, structural parameter factors are introduced, the search space is relaxed to be continuous to form a searchable self-attention module, and the optimal structural parameters obtained by a differentiable search algorithm are used to reconstruct the self-attention model structure. Each video feature is fed into the optimized self-attention module to obtain two-stream self-attention feature representations for an action branch and a background branch, respectively. The two feature representations are then input into a background-action-enhanced classifier, which outputs class-wise action and background classification scores. Finally, an integrated T-CAM score is obtained from the classification results and combined with the class-agnostic self-attention weight vector for model training, on which weakly supervised behavior detection is then performed.
The method comprises three stages: searching, training, and testing. In the searching stage, the self-attention model structure is optimized with a differentiable search algorithm to obtain the optimal structural parameters. In the training stage, the self-attention model is reconstructed from the network structure corresponding to the optimal structural parameters, and the self-attention module and the classifier are trained with video-level action labels. In the testing stage, the trained self-attention module and classifier produce the video's classification result and self-attention weights, from which video action localization is performed.
Example 1 Weak supervision behavior detection method and apparatus based on network structure search and background-action enhancement
Take the THUMOS14 dataset as an example:
1) First, extract the RGB and optical-flow features of the long-video dataset using I3D, C3D, TSN, or TSP;
before extracting the RGB and optical-flow features, the frames are preprocessed by a uniform cropping operation (e.g., a center crop to 224 × 224);
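The center-crop preprocessing can be sketched in a few lines of numpy; the 256 × 340 input size is only an assumed example of a resized frame, not a value fixed by the patent:

```python
import numpy as np

def center_crop(frame, size=224):
    """Crop an (H, W, C) frame to (size, size, C) about its centre;
    assumes H >= size and W >= size (frames are usually resized first)."""
    h, w = frame.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return frame[top:top + size, left:left + size]

frame = np.zeros((256, 340, 3), dtype=np.uint8)   # assumed resized RGB frame
print(center_crop(frame).shape)                   # (224, 224, 3)
```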
2) Perform network structure search on the predefined self-attention unit using the network structure search technique to obtain local-global self-attention feature representations within and between video-segment frames;
the network structure search technique comprises 4 steps: predefining a search space, introducing structural parameter factors, relaxing the search space to be continuous, and running a differentiable search algorithm, reinforcement learning algorithm, or evolutionary algorithm;
the predefined search space comprises 7 types of 1-D temporal operations: standard convolution, dilated (hole) convolution, separable convolution, convolution-normalization-activation, skip connection, spatial pooling, and channel pooling;
the continuous relaxation of the search space adopts a Softmax function;
the differentiable search algorithm adopts the standard DARTS algorithm;
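The Softmax-based continuous relaxation used by DARTS can be illustrated on a single edge of the search space. The candidate operations and the alpha values below are toy stand-ins for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy candidate operations on one edge of the search space.
ops = [
    lambda x: x,                           # skip connection
    lambda x: np.maximum(x, 0.0),          # conv-norm-activation stand-in
    lambda x: x.mean() * np.ones_like(x),  # pooling stand-in
]
alpha = np.array([0.1, 2.0, -1.0])         # structural parameter factors (learned in DARTS)

def mixed_op(x, alpha):
    """Continuous relaxation: the edge output is the softmax(alpha)-weighted
    sum of all candidate operations, which makes alpha differentiable."""
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, ops))

x = np.array([-1.0, 0.5, 2.0])
y = mixed_op(x, alpha)
best_op_index = int(np.argmax(alpha))      # discretisation after search
```

After the search converges, each edge is discretised by keeping only the operation with the largest alpha, reconstructing a plain (non-mixed) module.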
3) Input the two-stream video features obtained in step 1) into the self-attention module optimized in step 2), and calculate a self-attention weight vector based on local-global information;
4) Weight and aggregate the long-video features from step 1) with the self-attention weight vector obtained in step 3) to obtain a self-attention video feature vector;
further, the video-level classifier is a single-layer linear classifier;
5) Input the video feature vectors obtained in step 4) into the classifier for classification to obtain a video classification result;
6) Calculate the temporal class activation mapping (T-CAM) score from the self-attention weight vector obtained in step 3) and the video classification result obtained in step 5), then calculate the integrated T-CAM scores of action and background; computing the integrated T-CAM separately for RGB and optical flow yields the two-stream integrated T-CAM result;
7) Based on the two-stream integrated T-CAM result obtained in step 6), actions and background in the long video can be distinguished: positions exceeding a threshold are retained, and a non-maximum suppression method finally removes highly overlapping predictions, yielding the time periods in which actions occur in the video;
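The two-stream integration in steps 6)-7) can be sketched as follows. The patent does not spell out the exact integration rule, so the action × (1 − background) combination and the equal-weight stream average below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
T, C = 50, 20   # snippets, action classes (assumed sizes)

# Hypothetical per-stream action and background T-CAMs in [0, 1).
tcam_rgb_act,  tcam_rgb_bg  = rng.random((T, C)), rng.random((T, C))
tcam_flow_act, tcam_flow_bg = rng.random((T, C)), rng.random((T, C))

def integrate(action_tcam, background_tcam):
    # Assumed rule: boost snippets that score high as action and low as background.
    return action_tcam * (1.0 - background_tcam)

# Equal-weight fusion of the RGB and optical-flow streams (assumed weighting).
fused_tcam = 0.5 * (integrate(tcam_rgb_act, tcam_rgb_bg)
                    + integrate(tcam_flow_act, tcam_flow_bg))
```

The fused map is then thresholded per class and de-duplicated with temporal NMS, as in step 7).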
comparing the results of the method of the present invention with those of other methods, the obtained behavior recognition accuracy is shown in table 1, and the obtained average accuracy of behavior localization is shown in table 2:
TABLE 1 Behavior recognition accuracy on the THUMOS14 dataset

Dataset     Method of the invention   W-TALC   AdapNet
THUMOS14    0.897                     0.856    0.879
In Table 1, W-TALC and AdapNet are comparison methods; both are weakly supervised methods, so they can be compared fairly with the method of the invention.
TABLE 2 Average precision of behavior localization (mAP) at IoU = 0.5 on the THUMOS14 dataset

Dataset     Method of the invention   W-TALC   AdapNet
THUMOS14    28.79                     16.9     23.65
The results in Table 1 and Table 2 show that the method of the invention significantly improves video behavior recognition and localization, respectively.
Example 2 Weak supervision behavior detection method and apparatus based on network structure search and background-action enhancement
The device of this embodiment comprises:
a feature extraction unit for performing two-stream feature extraction on the long-video dataset;
a network structure search unit for performing structure search on the predefined self-attention unit to obtain local-global self-attention feature representations within and between video-segment frames;
a self-attention unit for extracting self-attention features from the features obtained by the feature extraction unit to obtain a more compact feature representation;
an action-background recognition unit for classifying the action and background features weighted by the self-attention weights to obtain the probability that the long video belongs to a certain class of action and background;
further, the device also comprises a behavior localization unit for calculating the integrated T-CAM value from the background-action-enhanced action and background recognition scores, optimizing the self-attention weights, distinguishing action from background in the video, and performing post-processing with non-maximum suppression to obtain the time intervals of action instances in the video, thereby improving the average precision of behavior localization.
The above embodiments are only intended to illustrate the technical solution of the invention and not to limit it; a person skilled in the art may modify the technical solution of the invention or substitute equivalents without departing from its spirit and scope, and the protection scope of the invention should be determined by the claims.

Claims (10)

1. A weak supervision behavior detection method based on network structure search and background-action enhancement comprises the following steps:
1) extracting video characteristics of a target video;
2) performing network structure search on a predefined self-attention module to construct an optimized self-attention module, inputting the video features into the optimized self-attention module, and calculating a self-attention weight vector based on local-global information;
3) performing weighted fusion on the video features using the self-attention weight vector to obtain a video feature vector, and classifying based on the video feature vector to obtain a video classification result;
4) performing temporal class activation mapping calculation according to the self-attention weight vector and the video classification result to obtain a behavior detection result.
2. The method of claim 1, wherein the video features comprise: RGB features and optical flow features.
3. The method of claim 1, wherein before extracting the video features, the video is preprocessed, the preprocessing comprising: performing a uniform cropping operation on the frames of the video; the uniform cropping operation comprises a center crop operation.
4. The method of claim 1, wherein extracting RGB features and optical flow features comprises: using an I3D network, using a C3D network, using a TSN network, or using a TSP network.
5. The method of claim 1, wherein the optimized self-attention module is constructed by:
1) predefining a search space, wherein the temporal operations of the predefined search space comprise: standard convolution, dilated (hole) convolution, separable convolution, convolution-normalization-activation operation, skip operation, spatial pooling, and channel pooling;
2) introducing a structural parameter factor;
3) relaxing the search space to be continuous;
4) constructing the optimized self-attention module using one of a differentiable search algorithm, a reinforcement learning algorithm, or an evolutionary algorithm.
6. The method of claim 1, wherein the method of classifying comprises: a single-layer Linear classifier is used.
7. The method of claim 1, wherein the behavior detection result is obtained by:
1) computing the result of the integrated T-CAM through the temporal class activation mapping calculation;
2) comparing the integrated T-CAM score with a set threshold, and taking the retained action occurrence time periods as the behavior detection result.
8. The method of claim 7, wherein de-duplication is performed on the obtained behavior localization prediction results using a non-maximum suppression method.
9. A storage medium having a computer program stored thereon, wherein the computer program is arranged to, when run, perform the method of any of claims 1-8.
10. An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method according to any of claims 1-8.
CN202111135223.5A 2021-09-27 2021-09-27 Weak supervision behavior detection method and device based on network structure search and background-action enhancement Pending CN114049582A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111135223.5A CN114049582A (en) 2021-09-27 2021-09-27 Weak supervision behavior detection method and device based on network structure search and background-action enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111135223.5A CN114049582A (en) 2021-09-27 2021-09-27 Weak supervision behavior detection method and device based on network structure search and background-action enhancement

Publications (1)

Publication Number    Publication Date
CN114049582A          2022-02-15

Family

ID=80204859

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111135223.5A Pending CN114049582A (en) 2021-09-27 2021-09-27 Weak supervision behavior detection method and device based on network structure search and background-action enhancement

Country Status (1)

Country Link
CN (1) CN114049582A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114758285A (en) * 2022-06-14 2022-07-15 山东省人工智能研究院 Video interaction action detection method based on anchor freedom and long-term attention perception



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination