CN116612420A - Weakly supervised video temporal action detection method, system, device and storage medium

Info

Publication number: CN116612420A (application CN202310891912.1A; granted as CN116612420B)
Original language: Chinese (zh)
Inventors: 王子磊, 李志林
Applicant and current assignee: University of Science and Technology of China (USTC)
Legal status: Active (granted)

Classifications

    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes
    • G06V20/46 Extracting features or characteristics from the video content
    • G06V10/44 Local feature extraction by analysis of parts of the pattern
    • G06V10/764 Image or video recognition using classification, e.g. of video objects
    • G06V10/774 Generating sets of training patterns
    • G06V10/82 Image or video recognition using neural networks
    • G06N3/048 Activation functions

Abstract

The invention discloses a weakly supervised video temporal action detection method, together with a corresponding system, device and storage medium. The method designs a self-training branch that is decoupled from the classification task and can generate a comprehensive action sequence free from interference by action context information; it explicitly models the probability that a segment in the prediction result is a false positive and suppresses high-probability segments, greatly reducing the number of false positives; in addition, a foreground enhancement branch is designed to strengthen the model's ability to recognize foreground segments. Overall, the invention effectively suppresses false positive segments and improves the detection performance of the model.

Description

Weakly supervised video temporal action detection method, system, device and storage medium
Technical Field
The invention relates to the technical field of video analysis and understanding, and in particular to a weakly supervised video temporal action detection method, system, device and storage medium.
Background
In recent years, video data has grown explosively and the demand for video understanding applications has increased accordingly. Temporal action detection, a popular downstream task of video understanding, has attracted the attention of many researchers at home and abroad because of its wide practical applications such as security surveillance, video retrieval, sports highlight editing and video auditing. Researchers need to design suitable temporal action detection schemes for the requirements of different application scenarios and provide accurate action localization and action classification results.
At present, temporal action detection is studied mainly under two learning paradigms: (1) a fully supervised paradigm that provides detailed frame-level annotations; (2) a weakly supervised paradigm that provides only video-level action class labels. Fully supervised temporal action detection requires every sample video to be annotated manually frame by frame; this annotation consumes enormous labor cost and its precision is generally low. Weakly supervised temporal action detection therefore emerged to address the difficulty and errors of annotation: it allows the large number of untrimmed videos on the Internet, carrying only video-level tags, to be used directly as training data for the model.
Although weakly supervised temporal action detection has many benefits, its detection performance is far below that of fully supervised methods because of the lack of fine-grained labels. Most current weakly supervised temporal action detection methods use a multi-instance learning framework to classify all snippets in a video and generate the corresponding class activation sequence; the k highest-scoring snippets of each class are then aggregated into a video-level classification score, and if the score of a class exceeds a preset class-score threshold, the action is considered present in the video; the activation sequence of each detected class is looked up in the class activation sequence and thresholded with a preselected action-score threshold to generate candidate action proposals; finally, non-maximum suppression is applied to all proposals to obtain the final prediction result.
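For illustration, a minimal sketch of this conventional pipeline is given below; the tensor shapes, thresholds and helper names are assumptions chosen for the example and are not taken from any particular prior method.

```python
import torch

def video_level_scores(cas: torch.Tensor, k: int) -> torch.Tensor:
    """cas: (T, C) class activation sequence; returns (C,) video-level class scores."""
    topk, _ = torch.topk(cas, k=min(k, cas.shape[0]), dim=0)  # k best snippets per class
    return topk.mean(dim=0)

def candidate_proposals(cas: torch.Tensor, cls_idx: int, act_thresh: float):
    """Group consecutive snippets whose activation exceeds act_thresh into (start, end) proposals."""
    mask = (cas[:, cls_idx] > act_thresh).tolist()
    proposals, start = [], None
    for t, on in enumerate(mask + [False]):  # sentinel closes a trailing run
        if on and start is None:
            start = t
        elif not on and start is not None:
            proposals.append((start, t))     # [start, t) in snippet indices
            start = None
    return proposals

# usage: keep classes whose video-level probability exceeds a preset threshold,
# then turn their activation rows into proposals (non-maximum suppression would follow).
cas = torch.rand(750, 20)                    # e.g. 750 snippets, 20 action classes
probs = torch.softmax(video_level_scores(cas, k=750 // 8), dim=0)
for c in (probs > 0.1).nonzero().flatten().tolist():
    _ = candidate_proposals(cas, c, act_thresh=0.5)
```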
The above describes localization driven by classification: the model can only update its network parameters by optimizing the video-level classification result. For video-level classification, however, the model easily learns to classify actions from action context information (the scene information of different actions usually differs greatly and is easy to separate), so many of the proposals obtained by such conventional localization inevitably capture context and do not necessarily correspond to action segments. As a result, many false positive segments appear in the model's predictions, and the localization performance is generally poor.
The Chinese patent application with publication number CN110832499A, "Weakly supervised action localization by sparse temporal pooling network", performs action recognition with a sparse key-frame attention mechanism. The Chinese patent application CN115439790A obtains an original class activation sequence from temporal features, expands it with a seed-growing strategy, performs adversarial erasing, and fuses the original and erased class activation sequences to obtain a more reliable class activation sequence and improve detection accuracy. The Chinese patent application CN115272941A, "Weakly supervised video temporal action detection and classification method and system", adopts a collaborative distillation strategy so that the advantages of single-modal and cross-modal frameworks complement each other, achieving more complete and accurate temporal action detection and classification. The Chinese patent application CN114898259A, "Weakly supervised video temporal action localization method based on action-associated attention", establishes relations between action segments in a video with an action-associated attention model, builds weakly supervised pre-training with a query mechanism, feeds the output of the query mechanism into a decoder of a Transformer architecture to realize temporal localization of the query set, and uses an encoder of the Transformer architecture to determine the relations between video segment features, thereby realizing localization and classification of action segments.
However, the methods disclosed in the above patent applications all achieve action localization by optimizing a classification task. With such localization, the results contain a large number of false positive segments caused by action context information, leading to poor temporal action detection performance.
Disclosure of Invention
The invention aims to provide a weakly supervised video temporal action detection method, system, device and storage medium that can effectively suppress false positive segments and improve temporal action detection performance.
The aim of the invention is achieved by the following technical scheme:
A weakly supervised video temporal action detection method comprises the following steps:
constructing a weakly supervised video temporal action detection model, which comprises: a base framework, a self-training action branch, a false positive suppression module and a foreground enhancement branch;
inputting training video data and the corresponding optical flow data into the weakly supervised video temporal action detection model, extracting the feature X through the base framework, encoding the feature X into the embedded feature E, obtaining the class activation sequence A through classification, and computing the base loss with the given video-level labels; the self-training action branch obtains action sequences of two modalities from the feature X, fuses them into a comprehensive action sequence and a non-action sequence, and computes the self-training action loss based on the action sequences of the two modalities together with the comprehensive action sequence and the non-action sequence; the false positive suppression module obtains a false positive sequence from the class activation sequence A and the non-action sequence and computes the false positive suppression loss with a preset uniform label; the foreground enhancement branch uses an attention mechanism on the embedded feature E to generate segment-level foreground weights, applies them to the class activation sequence A to obtain a foreground-enhanced class activation sequence, combines it with the action sequences of the two modalities into a comprehensive class activation sequence, and computes the foreground enhancement loss with the given video-level labels; the weakly supervised video temporal action detection model is trained with all the losses combined;
inputting the video data to be detected and the corresponding optical flow data into the trained weakly supervised video temporal action detection model, and performing temporal action detection with the foreground-enhanced class activation sequence and the comprehensive class activation sequence obtained by the foreground enhancement branch.
A weakly supervised video temporal action detection system, comprising:
a model construction unit for constructing a weakly supervised video temporal action detection model, which comprises: a base framework, a self-training action branch, a false positive suppression module and a foreground enhancement branch;
a training unit for inputting training video data and the corresponding optical flow data into the weakly supervised video temporal action detection model, extracting the feature X through the base framework, encoding the feature X into the embedded feature E, obtaining the class activation sequence A through classification, and computing the base loss with the given video-level labels; the self-training action branch obtains action sequences of two modalities from the feature X, fuses them into a comprehensive action sequence and a non-action sequence, and computes the self-training action loss based on the action sequences of the two modalities together with the comprehensive action sequence and the non-action sequence; the false positive suppression module obtains a false positive sequence from the class activation sequence A and the non-action sequence and computes the false positive suppression loss with a preset uniform label; the foreground enhancement branch uses an attention mechanism on the embedded feature E to generate segment-level foreground weights, applies them to the class activation sequence A to obtain a foreground-enhanced class activation sequence, combines it with the action sequences of the two modalities into a comprehensive class activation sequence, and computes the foreground enhancement loss with the given video-level labels; the weakly supervised video temporal action detection model is trained with all the losses combined;
a detection unit for inputting the video data to be detected and the corresponding optical flow data into the trained weakly supervised video temporal action detection model and performing temporal action detection with the foreground-enhanced class activation sequence and the comprehensive class activation sequence obtained by the foreground enhancement branch.
A processing device, comprising: one or more processors; and a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned method.
A readable storage medium stores a computer program which, when executed by a processor, implements the aforementioned method.
According to the technical scheme provided by the invention, a self-training branch decoupled from the classification task is designed, which can generate a comprehensive action sequence without interference from action context information; the false positive segments in the prediction result are modeled in a targeted manner, and high-probability segments are suppressed, greatly reducing their number; in addition, a foreground enhancement branch is designed to strengthen the model's ability to recognize foreground segments. Overall, the invention effectively suppresses false positive segments and improves the detection performance of the model.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings described below are only some embodiments of the present invention, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a flowchart of a weakly supervised video temporal action detection method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a weakly supervised video temporal action detection model according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the self-training action branch structure according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a weakly supervised video temporal action detection system according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a processing device according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are only some, not all, embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
The terms that may be used herein are first explained as follows:
The terms "comprises," "comprising," "includes," "including," "has," "having" or similar expressions are to be construed as covering a non-exclusive inclusion. For example, the inclusion of a particular feature (e.g., a starting material, component, ingredient, carrier, formulation, material, dimension, part, means, mechanism, apparatus, step, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product or article of manufacture, etc.) should be construed as including not only that particular feature but also other features known in the art that are not explicitly recited.
The weakly supervised video temporal action detection method, system, device and storage medium provided by the invention are described in detail below. Details not described in the embodiments of the present invention belong to the prior art known to those skilled in the art. Where specific conditions are not noted in the examples, they follow the conventional conditions in the art or the conditions suggested by the manufacturer.
Example 1
The embodiment of the invention provides a weakly supervised video temporal action detection method which, as shown in Fig. 1, mainly comprises the following steps:
Step 1: construct a weakly supervised video temporal action detection model.
In the embodiment of the invention, a targeted design is made for the problem of excessive false positive segments, and a weakly supervised video temporal action detection model is constructed, which mainly comprises: a base framework, a self-training action branch, a false positive suppression module and a foreground enhancement branch.
Step 2: train the weakly supervised video temporal action detection model.
In the embodiment of the invention, training video data and the corresponding optical flow data are input into the weakly supervised video temporal action detection model; the base framework extracts the feature X, a feature encoder produces the embedded feature E, and a classifier finally produces the class activation sequence A, from which the base loss is computed with the given video-level labels; the self-training action branch obtains action sequences of two modalities from the feature X, fuses them into a comprehensive action sequence and a non-action sequence, and computes the self-training action loss based on the action sequences of the two modalities together with the comprehensive action sequence and the non-action sequence; the false positive suppression module obtains a false positive sequence from the class activation sequence A and the non-action sequence and computes the false positive suppression loss with a preset uniform label; the foreground enhancement branch uses an attention mechanism on the embedded feature E to generate segment-level foreground weights, applies them to the class activation sequence A to obtain a foreground-enhanced class activation sequence, combines it with the action sequences of the two modalities into a comprehensive class activation sequence, and computes the foreground enhancement loss with the given video-level labels; the weakly supervised video temporal action detection model is trained with all the losses combined.
In the embodiment of the invention, the processing performed by each part of the weakly supervised video temporal action detection model and the calculation of the associated loss functions are as follows:
(1) The base framework comprises: a pretrained feature extraction network, a one-dimensional convolution layer and a classifier. The pretrained feature extraction network comprises an RGB feature extraction network and an optical flow feature extraction network, where RGB refers to the red, green and blue channels. RGB features are extracted from the training video data by the RGB feature extraction network, optical flow features are extracted from the corresponding optical flow data by the optical flow feature extraction network, and the two are concatenated along the channel dimension to obtain the feature X. The feature X is processed by the one-dimensional convolution layer (equivalent to the feature encoder described above) to obtain the embedded feature E, and the embedded feature E is classified by the classifier to obtain the class activation sequence A. The k highest-scoring snippets of each class in the class activation sequence A are aggregated into a video-level class score, and a softmax over the class dimension yields the class probabilities, where each snippet contains a set number of frames and k is a preset positive integer. The base loss is then computed with the given video-level labels.
(2) The self-training action branch processes the RGB features and the optical flow features separately, each through a convolution, a ReLU activation function and a Sigmoid activation function in sequence, to obtain the corresponding RGB action sequence and optical flow action sequence; the two are fused into a comprehensive action sequence, from which a non-action sequence is derived. The action sequence of each modality is taken as a soft label for the action sequence of the other modality, and a consistency loss is computed. The $k_{act}$ highest-scoring action snippets in the comprehensive action sequence are found and their action scores averaged to obtain the video-level action score; the non-action scores of the remaining snippets, obtained from the non-action sequence, are averaged to obtain the video-level non-action score; an action loss is then computed. Finally the self-training action loss is computed from the action loss and the consistency loss; here $k_{act}$ is a preset positive integer.
(3) The false positive suppression module obtains the false positive sequence $\hat{A}$ from the class activation sequence A and the non-action sequence. The $k_{fp}$ highest-scoring snippets of each class in $\hat{A}$ are aggregated into video-level false positive scores, and a softmax over the class dimension yields the false positive probabilities, where each snippet contains a set number of frames and $k_{fp}$ is a preset positive integer. The false positive suppression loss is then computed with the preset uniform label.
(4) The foreground enhancement branch uses an attention mechanism on the embedded feature E of the feature X to generate segment-level foreground weights and applies them to the class activation sequence A to obtain the foreground-enhanced class activation sequence $A^{fg}$; $A^{fg}$ is averaged with the action sequences of the two modalities to obtain the comprehensive class activation sequence $A^{fuse}$. The indices of the k highest-scoring snippets of each class in $A^{fuse}$ are recorded as index; the corresponding snippets of each class in the foreground-enhanced class activation sequence are located according to index and aggregated into a score, and a softmax over the class dimension yields the video-level class probability. The foreground enhancement loss is then computed with the given video-level labels. A consolidated sketch of how these four parts fit together is given below.
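The following is a hedged sketch, in PyTorch, of how the four parts could be wired together; the module names, channel sizes and the fusion weight lam are assumptions, and the sketch only mirrors the textual description above rather than the patent's actual implementation.

```python
import torch
import torch.nn as nn

class WTALModel(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=2048, num_classes=20):
        super().__init__()
        # base framework: one-dimensional convolution (feature encoder) and classifier
        self.embed = nn.Sequential(nn.Conv1d(feat_dim, embed_dim, 3, padding=1), nn.ReLU())
        self.classifier = nn.Conv1d(embed_dim, num_classes, 1)
        # self-training sub-branches, one per modality (conv -> ReLU -> conv -> Sigmoid)
        self.act_rgb = nn.Sequential(nn.Conv1d(feat_dim // 2, 512, 3, padding=1), nn.ReLU(),
                                     nn.Conv1d(512, 1, 1), nn.Sigmoid())
        self.act_flow = nn.Sequential(nn.Conv1d(feat_dim // 2, 512, 3, padding=1), nn.ReLU(),
                                      nn.Conv1d(512, 1, 1), nn.Sigmoid())
        # foreground-enhancement attention over the embedded feature E
        self.fg_attn = nn.Sequential(nn.Conv1d(embed_dim, 512, 3, padding=1), nn.ReLU(),
                                     nn.Conv1d(512, 1, 1), nn.Sigmoid())

    def forward(self, x_rgb, x_flow, lam=0.5):
        # x_rgb, x_flow: (B, feat_dim // 2, T) snippet features from the pretrained extractor
        x = torch.cat([x_rgb, x_flow], dim=1)        # feature X
        e = self.embed(x)                            # embedded feature E
        cas = self.classifier(e)                     # class activation sequence A: (B, C, T)
        s_rgb, s_flow = self.act_rgb(x_rgb), self.act_flow(x_flow)  # modality action sequences
        s = lam * s_rgb + (1.0 - lam) * s_flow       # comprehensive action sequence
        s_non = 1.0 - s                              # non-action sequence
        fp_seq = cas * s_non                         # false positive sequence (Hadamard product)
        w_fg = self.fg_attn(e)                       # segment-level foreground weights
        cas_fg = cas * w_fg                          # foreground-enhanced CAS
        cas_fuse = (cas_fg + s_rgb + s_flow) / 3.0   # comprehensive class activation sequence
        return cas, s_rgb, s_flow, s, s_non, fp_seq, cas_fg, cas_fuse
```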
Step 3: input the video data to be detected and the corresponding optical flow data into the trained weakly supervised video temporal action detection model, and perform temporal action detection with the foreground-enhanced class activation sequence and the comprehensive class activation sequence obtained by the foreground enhancement branch.
In the embodiment of the invention, the false positive suppression module can be removed after training; the base framework, the self-training action branch and the foreground enhancement branch then work together, in the same way as during training, to produce the foreground-enhanced class activation sequence and the comprehensive class activation sequence. Action class prediction is performed with the foreground-enhanced class activation sequence and action localization prediction with the comprehensive class activation sequence, thereby realizing temporal action detection.
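A minimal sketch of this test-time class prediction, under the assumed (C, T) layout of the foreground-enhanced class activation sequence and an illustrative class threshold, might look as follows:

```python
import torch

def predict_classes(cas_fg: torch.Tensor, k: int, cls_thresh: float = 0.2):
    """cas_fg: (C, T) foreground-enhanced class activation sequence of one video."""
    topk, _ = torch.topk(cas_fg, k=min(k, cas_fg.shape[1]), dim=1)
    probs = torch.softmax(topk.mean(dim=1), dim=0)     # video-level class probabilities
    pred = (probs > cls_thresh).nonzero().flatten().tolist()
    return pred, probs                                 # predicted classes feed the localization step
```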
The scheme provided by the embodiment of the invention designs a self-training branch decoupled from the classification task, which can generate a comprehensive action sequence without interference from action context information; it models the probability that a segment in the prediction result is a false positive and suppresses high-probability segments, greatly reducing the number of false positives; in addition, a foreground enhancement branch is designed to strengthen the model's ability to recognize foreground segments. Overall, the scheme effectively suppresses false positive segments and improves the detection performance of the model.
In order to demonstrate the technical scheme and the technical effects provided by the invention more clearly, the method provided by the embodiment of the invention is described in detail below through specific embodiments.
1. Summary of the principles.
In the embodiment of the invention, to solve the problem of excessive false positive segments in the prediction results of existing weakly supervised video temporal action detection frameworks, a weakly supervised video temporal action detection scheme based on false positive suppression is proposed, summarized as follows: (1) considering that the existing frameworks prevent the model from learning class-agnostic action attributes free from interference by action class information, the invention designs a self-training strategy that is independent of the video classification task, called the self-training action branch; (2) the false positive segments are modeled with the action sequence generated by the self-training action branch and finally suppressed with a class-uniform label; (3) to strengthen the model's ability to recognize the foreground, an attention mechanism is used to enhance the features of the foreground segments in the video.
2. Model framework and training scheme.
The overall structure of the model is shown in Fig. 2, which mainly comprises the base framework, the self-training action branch, the false positive suppression module and the foreground enhancement branch; the symbols in Fig. 2 denote, respectively, the Hadamard product, snippet aggregation and the Sigmoid activation function. Each part is described in detail below.
1. The base framework.
In the embodiment of the invention, a conventional temporal action detection base framework is constructed, which mainly comprises: a pretrained feature extraction network, a one-dimensional convolution layer and a classifier; the pretrained feature extraction network comprises an RGB feature extraction network and an optical flow feature extraction network.
By way of example, the feature extraction network may be an I3D network (an inflated 3D convolutional network) pretrained on the Kinetics400 dataset.
In the embodiment of the invention, consecutive video snippets can be constructed from the video and its corresponding optical flow video in units of a set number of (RGB) frames; the snippets are input into the pretrained feature extraction network, RGB features are extracted from the training video data by the RGB feature extraction network, optical flow features are extracted from the corresponding optical flow data by the optical flow feature extraction network, and the RGB features and the optical flow features are concatenated along the channel dimension to obtain the feature X. The feature X is processed by the one-dimensional convolution layer to obtain the embedded feature E, and the embedded feature E is classified by the classifier to obtain the class activation sequence A.
The above process can be expressed as:

$$E = f_{embed}(X), \qquad A = f_{cls}(E)$$

where $f_{embed}$ denotes the one-dimensional convolution layer and $f_{cls}$ denotes the classifier.
By way of example, consecutive video snippets may be constructed in units of 16 frames.
On the basis of the base framework, consistent with conventional methods, a multi-instance learning paradigm and a cross entropy loss function are used. The k highest-scoring snippets of each class in the class activation sequence A are aggregated into a video-level class score, a softmax over the class dimension gives the class probabilities, and the base loss is computed with the given video-level labels and the cross entropy loss function, expressed as:

$$p_{i} = \mathrm{softmax}(a_{i}), \qquad L_{base} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i;c}\,\log p_{i;c}$$

where the softmax function is the normalized exponential function, $a_{i}$ is the video-level classification score of the i-th video, $p_{i}$ is the class probability of the i-th video, $p_{i;c}$ is the probability of the c-th action class of the i-th video, $y_{i;c}$ is the value for the c-th action class in the video-level label of the i-th video, N is the number of videos, C is the number of action classes, and $L_{base}$ is the base loss.
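Under the assumption that the class activation sequence is stored as a (B, C, T) tensor and that multi-hot video-level labels are normalized before the cross entropy, a minimal sketch of this base loss is:

```python
import torch

def base_loss(cas: torch.Tensor, labels: torch.Tensor, k: int) -> torch.Tensor:
    """cas: (B, C, T) class activation sequence; labels: (B, C) multi-hot video-level labels."""
    topk, _ = torch.topk(cas, k=min(k, cas.shape[-1]), dim=-1)
    video_scores = topk.mean(dim=-1)                     # video-level class scores a_i
    log_probs = torch.log_softmax(video_scores, dim=-1)  # log p_i over the class dimension
    # cross entropy against the video-level label y_i (normalized here; the exact convention may differ)
    y = labels / labels.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    return -(y * log_probs).sum(dim=-1).mean()
```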
2. The self-training action branch.
Learning high-quality action attributes is critical for the action localization task. However, the action attributes in the class activation sequence generated by the base branch are prone to a serious scene-dependency problem. To solve this problem, the invention introduces a self-training action branch that, independently of the classification task, learns to distinguish action snippets from non-action snippets through self-training.
Typically, an action snippet has a larger feature magnitude than a non-action snippet. Therefore, in the self-training action branch, the snippets with larger feature magnitudes are gathered into a positive multi-instance bag and the remaining snippets into a negative multi-instance bag, so that self-training is performed in a multi-instance learning manner. Across a large number of video snippets of different action classes, the common semantics lie in actionness and non-actionness, so the self-training action branch can learn high-quality action attributes without interference from class information.
In addition, to fully exploit the complementarity of the RGB and optical flow features, the invention processes the RGB features and the optical flow features separately, each through a convolution, a ReLU activation function and a Sigmoid activation function in sequence, obtaining the corresponding RGB action sequence and optical flow action sequence:

$$S^{RGB} = g_{RGB}(X^{RGB}), \qquad S^{flow} = g_{flow}(X^{flow})$$

where $X^{RGB}$ denotes the RGB features, $X^{flow}$ denotes the optical flow features, $g_{RGB}$ and $g_{flow}$ both denote modules that sequentially apply a convolution, a ReLU activation function and a Sigmoid activation function (ReLU is the rectified linear unit and Sigmoid is the S-shaped logistic function), $S^{RGB}$ is the RGB action sequence and $S^{flow}$ is the optical flow action sequence.
Fig. 3 shows the detailed structure of the self-training action branch, which contains two sub-branches; the one-dimensional convolution on the left of each sub-branch is the convolution mentioned above, and its output is followed by a ReLU activation function (not shown in Fig. 3).
To facilitate mutual learning between the two modalities, the action sequence of each modality is taken as a soft label for the action sequence of the other modality, and the consistency loss is computed, expressed as:

$$L_{con} = d\big(S^{RGB},\, S^{flow}\big)$$

where $L_{con}$ is the consistency loss, $d$ is a similarity metric function (a mean squared error is used in the example flow below), $S^{RGB}$ is the RGB action sequence and $S^{flow}$ is the optical flow action sequence, i.e. the action sequences of the two modalities.
The RGB action sequence and the optical flow action sequence are fused into a comprehensive action sequence, from which a non-action sequence is obtained, expressed as:

$$S = \alpha\, S^{RGB} + (1 - \alpha)\, S^{flow}, \qquad \bar{S} = 1 - S$$

where S is the comprehensive action sequence, $\alpha$ is a hyperparameter controlling the fusion ratio of the action sequences of the two modalities, and $\bar{S}$ is the non-action sequence.
Then, the $k_{act}$ highest-scoring action snippets in the comprehensive action sequence are found and their action scores averaged to obtain the video-level action score $p^{act}$; the non-action scores of the remaining snippets (obtained from the non-action sequence) are averaged to obtain the video-level non-action score $p^{nact}$; the action loss is then computed:

$$L_{act} = -\frac{1}{N}\sum_{i=1}^{N}\big(\log p^{act}_{i} + \log p^{nact}_{i}\big)$$

where $L_{act}$ is the action loss, $p^{act}_{i}$ is the action score of the i-th video, $p^{nact}_{i}$ is the non-action score of the i-th video, and N is the number of videos.
Combining the action loss $L_{act}$ and the consistency loss $L_{con}$, the self-training action loss is computed, expressed as:

$$L_{self} = L_{act} + L_{con}$$

where $L_{self}$ is the self-training action loss.
Those skilled in the art will appreciate that the video-level action score $p^{act}$ and non-action score $p^{nact}$ come from sequences produced by a Sigmoid function, i.e. values compressed between 0 and 1, so these scores can also be regarded as probabilities.
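A hedged sketch of the self-training action loss is given below; the detach-based soft labelling, the equal weighting of the two consistency terms and the tensor layout are assumptions consistent with, but not dictated by, the description above:

```python
import torch
import torch.nn.functional as F

def self_training_loss(s_rgb, s_flow, k_act: int, lam: float = 0.5, eps: float = 1e-8):
    """s_rgb, s_flow: (B, 1, T) modality action sequences with values in [0, 1]."""
    # consistency loss: each modality serves as a (detached) soft label for the other
    l_con = 0.5 * (F.mse_loss(s_rgb, s_flow.detach()) + F.mse_loss(s_flow, s_rgb.detach()))
    s = lam * s_rgb + (1.0 - lam) * s_flow            # comprehensive action sequence S
    s_non = 1.0 - s                                   # non-action sequence
    scores = s.squeeze(1)                             # (B, T) actionness per snippet
    k = min(k_act, scores.shape[1])
    top_val, top_idx = torch.topk(scores, k=k, dim=1)
    p_act = top_val.mean(dim=1)                       # video-level action score
    mask = torch.ones_like(scores)
    mask.scatter_(1, top_idx, 0.0)                    # keep only the remaining snippets
    p_nact = (s_non.squeeze(1) * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    l_act = -(torch.log(p_act + eps) + torch.log(p_nact + eps)).mean()
    return l_act + l_con                              # self-training action loss
```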
3. False positive suppression module.
Because detailed label information is lacking, if the localization task is performed directly with the class activation sequence predicted by the base framework, the resulting proposals suffer a serious scene-dependency problem: most proposals contain strong action context information but lack the specific actions required by the localization task.
To solve this problem, the invention models a false positive sequence and suppresses false positive segments in a targeted manner, greatly reducing their number. Specifically, the Hadamard product of the non-action sequence generated by the self-training action branch and the class activation sequence generated by the base framework is computed, and the resulting new sequence is named the false positive sequence. When the score of a snippet in the non-action sequence is high, the snippet probably contains no action information; if the score of the same snippet in the class activation sequence is also high, the snippet is very likely a false positive caused by scene information.
In the embodiment of the invention, the false positive sequence is expressed as:

$$\hat{A} = A \odot \bar{S}$$

where $\odot$ denotes the Hadamard product and $\hat{A}$ is the false positive sequence.
The $k_{fp}$ highest-scoring snippets of each class in the false positive sequence $\hat{A}$ are aggregated into video-level false positive scores, and a softmax over the class dimension generates the false positive probability $\hat{p}$. The false positive suppression loss is then computed with the preset uniform label; specifically, using the class-uniform label $u$ to maximize the entropy of the false positive probability yields the false positive suppression loss, expressed as:

$$L_{fp} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} u_{c}\,\log \hat{p}_{i;c}$$

where the softmax function is the normalized exponential function, $\hat{p}_{i;c}$ is the false positive probability of the c-th action class of the i-th video, $u_{c} = 1/C$ is the value for the c-th action class in the uniform label, N is the number of videos, C is the number of action classes, and $L_{fp}$ is the false positive suppression loss.
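A minimal sketch of this false positive suppression loss, with assumed (B, C, T) layouts and top-k mean aggregation, is:

```python
import torch

def fp_suppression_loss(cas: torch.Tensor, s_non: torch.Tensor, k_fp: int) -> torch.Tensor:
    """cas: (B, C, T) class activation sequence; s_non: (B, 1, T) non-action sequence."""
    fp_seq = cas * s_non                               # false positive sequence (Hadamard product)
    topk, _ = torch.topk(fp_seq, k=min(k_fp, fp_seq.shape[-1]), dim=-1)
    fp_scores = topk.mean(dim=-1)                      # (B, C) video-level false positive scores
    log_probs = torch.log_softmax(fp_scores, dim=-1)   # false positive probabilities (log)
    uniform = torch.full_like(log_probs, 1.0 / log_probs.shape[-1])  # class-uniform label u
    return -(uniform * log_probs).sum(dim=-1).mean()   # maximizes the entropy of the distribution
```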
4. The foreground enhancement branch.
In order to improve the model's ability to recognize the foreground and to reduce the suppression that the false positive suppression module may exert on true foreground segments, the invention designs a foreground enhancement branch. The foreground enhancement branch uses the embedded feature E and an attention mechanism to generate segment-level foreground weights $w$, applies the foreground weights to the class activation sequence A to obtain the foreground-enhanced class activation sequence $A^{fg}$, and then averages $A^{fg}$ with the action sequences of the two modalities to obtain the comprehensive class activation sequence $A^{fuse}$, expressed as:

$$A^{fg} = w \odot A, \qquad A^{fuse} = \tfrac{1}{3}\big(A^{fg} + S^{RGB} + S^{flow}\big)$$
The indices of the k highest-scoring snippets of each class in the comprehensive class activation sequence $A^{fuse}$ are recorded as index; the corresponding snippets of each class in the foreground-enhanced class activation sequence are then located according to index and aggregated into a score $a^{fg}$, and a softmax over the class dimension gives the video-level class probability $p^{fg}$, expressed as:

$$a^{fg}_{i;c} = \frac{1}{k}\sum_{t\in \mathrm{index}_{i;c}} A^{fg}_{i;c,t}, \qquad p^{fg}_{i} = \mathrm{softmax}\big(a^{fg}_{i}\big)$$

where the softmax function is the normalized exponential function, $a^{fg}_{i;c}$ is the score of the c-th action class of the i-th video aggregated from the foreground-enhanced class activation sequence $A^{fg}$, and $p^{fg}_{i;c}$ is the resulting probability of the c-th action class of the i-th video.
Thereafter, the foreground enhancement loss is computed with the given video-level labels, expressed as:

$$L_{fg} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i;c}\,\log p^{fg}_{i;c}$$

where $y_{i;c}$ is the value for the c-th action class in the video-level label of the i-th video, N is the number of videos, C is the number of action classes, and $L_{fg}$ is the foreground enhancement loss.
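A hedged sketch of the foreground enhancement loss follows; gathering the foreground-enhanced activations at the top-k indices of the comprehensive class activation sequence mirrors the aggregation described above, while the label normalization is an assumption:

```python
import torch

def fg_enhance_loss(cas_fg, cas_fuse, labels, k: int, eps: float = 1e-8):
    """cas_fg, cas_fuse: (B, C, T); labels: (B, C) multi-hot video-level labels."""
    k = min(k, cas_fuse.shape[-1])
    _, index = torch.topk(cas_fuse, k=k, dim=-1)          # top-k snippet indices per class (the "index")
    gathered = torch.gather(cas_fg, dim=-1, index=index)  # matching foreground-enhanced snippets
    scores = gathered.mean(dim=-1)                        # (B, C) aggregated scores
    log_probs = torch.log_softmax(scores, dim=-1)         # video-level class probabilities (log)
    y = labels / labels.sum(dim=-1, keepdim=True).clamp(min=eps)  # normalized multi-hot label (assumed)
    return -(y * log_probs).sum(dim=-1).mean()
```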
5. Total loss function.
Finally, the four losses are combined into the total loss function L, expressed as:

$$L = L_{base} + \lambda_{1} L_{self} + \lambda_{2} L_{fp} + \lambda_{3} L_{fg}$$

where $\lambda_{1}$, $\lambda_{2}$ and $\lambda_{3}$ are hyperparameters; exemplary values can be chosen empirically.
The parameters of the model can then be optimized with the total loss function; since this part can be realized with conventional techniques, it is not described further here.
In the embodiment of the invention, k, $k_{act}$ and $k_{fp}$ are all preset positive integers. For example, on the THUMOS14 dataset all videos are sampled to 750 snippets, and the setting k = 750 // 8 is used, with $k_{act}$ and $k_{fp}$ set correspondingly, where the symbol // denotes integer division; similarly, if all videos are sampled to 500 snippets, the corresponding adjustment is made, i.e. the 750 above is replaced by 500. Of course, this is merely illustrative; in practical applications the user may set specific values according to the actual situation or experience.
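For illustration, the combination of the four losses and the top-k settings can be sketched as follows; the loss weights and the exact values of $k_{act}$ and $k_{fp}$ are assumptions, since only k = 750 // 8 is stated explicitly:

```python
num_segments = 750                        # e.g. THUMOS14 videos sampled to 750 snippets
k = num_segments // 8                     # snippets aggregated per class in the CAS (given)
k_act = num_segments // 8                 # assumed value for the self-training branch
k_fp = num_segments // 8                  # assumed value for the false positive sequence

def total_loss(l_base, l_self, l_fp, l_fg, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum of the four training losses; the weights are illustrative placeholders."""
    l1, l2, l3 = lambdas
    return l_base + l1 * l_self + l2 * l_fp + l3 * l_fg
```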
3. Model testing.
In the embodiment of the invention, after training is finished, the false positive suppression module can be removed. The base framework performs feature extraction, one-dimensional convolution and classification to obtain the class activation sequence A; the self-training action branch processes the features extracted by the base framework to obtain the RGB action sequence and the optical flow action sequence; the foreground enhancement branch uses the embedded feature obtained from the one-dimensional convolution of the base framework to generate segment-level foreground weights, applies them to the class activation sequence A to obtain the foreground-enhanced class activation sequence, and combines it with the RGB and optical flow action sequences from the self-training action branch to obtain the comprehensive class activation sequence. Action class prediction is then performed with the foreground-enhanced class activation sequence, and action localization prediction with the comprehensive class activation sequence.
To facilitate an understanding of the foregoing aspects of the invention, a specific example flow is provided below.
Step S1: prepare a video dataset for training and a video dataset for testing. For the training video dataset, a video-level action class annotation must be made for each video, i.e. which actions are present in the video. The training and test videos are then grouped into one snippet every 16 frames, the downsampled video data are input into an I3D model pretrained on the Kinetics400 dataset, and the RGB features and optical flow features of the video data are extracted.
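A minimal, hedged sketch of this clip construction and feature extraction is shown below; the feature extractor is represented by placeholder callables rather than a specific library API, and the 1024-dimensional per-modality features are an assumption:

```python
import numpy as np

def make_snippets(frames: np.ndarray, clip_len: int = 16) -> np.ndarray:
    """frames: (T, H, W, 3) decoded frames; returns (T // clip_len, clip_len, H, W, 3)."""
    n = (len(frames) // clip_len) * clip_len
    return frames[:n].reshape(-1, clip_len, *frames.shape[1:])

def extract_features(rgb_snippets, flow_snippets, i3d_rgb, i3d_flow):
    """i3d_rgb / i3d_flow are placeholder callables standing in for the pretrained extractor."""
    rgb_feat = np.stack([i3d_rgb(s) for s in rgb_snippets])     # (N, 1024) per-snippet RGB features (assumed size)
    flow_feat = np.stack([i3d_flow(s) for s in flow_snippets])  # (N, 1024) per-snippet flow features (assumed size)
    return np.concatenate([rgb_feat, flow_feat], axis=1)        # feature X: (N, 2048)
```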
Step S2: based on the PyTorch deep learning framework (an open-source Python machine learning library), build the base framework from convolutional networks, and form the weakly supervised video temporal action detection model together with the self-training action branch, the false positive suppression module and the foreground enhancement branch.
Step S3: for the base framework, input the RGB features and optical flow features of the video data, aggregate the k highest-scoring snippets of each class in the class activation sequence into video-level class scores, and compute the base loss against the given video-level labels with a cross entropy function.
Step S4: for the self-training action branch, compute the consistency loss between the generated optical flow action sequence and RGB action sequence with a mean squared error loss function, fuse the two action sequences into a comprehensive action sequence and a non-action sequence, aggregate the higher-scoring action snippets and the non-action snippets respectively into a video-level action score and a video-level non-action score, compute the action loss with a binary cross entropy, and then add the consistency loss and the action loss to obtain the self-training action loss.
Step S5: for the false positive suppression module, first construct the false positive sequence from the non-action sequence generated in step S4 and the class activation sequence output in step S3, then aggregate the high-scoring snippets of each class in the false positive sequence into video-level false positive scores; compute the false positive suppression loss with the class-uniform label and a cross entropy loss function.
Step S6: for the foreground enhancement branch, weight the class activation sequence output by the base framework with the attention scores to generate the foreground-enhanced class activation sequence, then apply the same aggregation as in step S3, and finally compute the foreground enhancement loss.
Step S7: combine the losses computed in steps S3-S6 into the final optimization objective (the total loss function), minimize the total loss function through a back-propagation algorithm and a gradient descent strategy, update the parameters of the model, and finally save the trained model parameters.
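A hedged sketch of this optimization step is given below; the optimizer choice, learning rate, epoch count and checkpoint name are assumptions:

```python
import torch

def train(model, loader, compute_losses, epochs: int = 100, lr: float = 1e-4):
    """compute_losses combines the base, self-training, false positive and foreground losses."""
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x_rgb, x_flow, labels in loader:
            outputs = model(x_rgb, x_flow)
            loss = compute_losses(outputs, labels)   # total loss from steps S3-S6
            optim.zero_grad()
            loss.backward()
            optim.step()
    torch.save(model.state_dict(), "wtal_model.pth")
```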
Step S8: input the RGB features and optical flow features of the test dataset obtained in step S1 into the trained model, perform action class prediction and action localization prediction, and evaluate performance with the two kinds of prediction results.
(1) Action class prediction.
Video-level action classification probabilities are generated from the foreground-enhanced class activation sequence output by the foreground enhancement branch, and the action class prediction is generated by thresholding. When generating the action class prediction with the threshold method, a threshold (for example, between 0.1 and 0.25) can be set according to the actual situation, and the classes exceeding the threshold are marked as the predicted classes.
(2) Action localization prediction.
The foreground-enhanced class activation sequence is combined with the RGB action scores and the optical flow action scores (by averaging, as above) to generate the comprehensive class activation sequence. According to the action class prediction result, the sequence of each predicted class is taken from the comprehensive class activation sequence; multiple thresholds are set (for example, from 0.1 to 0.9 with a step of 0.1) and each threshold is used in turn to screen the sequence, where consecutive snippets above the threshold are regarded as a prediction box and the confidence of the prediction box is computed. After screening, a large number of prediction boxes are obtained; they are screened again by non-maximum suppression to remove prediction boxes with excessive overlap, and the remaining prediction boxes constitute the action localization prediction result.
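A minimal sketch of the multi-threshold proposal generation and temporal non-maximum suppression described above is shown below; taking the mean activation inside a proposal as its confidence is an assumption, since the exact scoring formula is not specified here:

```python
import numpy as np

def proposals_from_sequence(act: np.ndarray, thresholds=np.arange(0.1, 1.0, 0.1)):
    """act: (T,) activation of one predicted class; returns [(start, end, confidence), ...]."""
    props = []
    for th in thresholds:
        mask = act > th
        t = 0
        while t < len(mask):
            if mask[t]:
                s = t
                while t < len(mask) and mask[t]:
                    t += 1
                props.append((s, t, float(act[s:t].mean())))  # confidence = mean activation (assumed)
            else:
                t += 1
    return props

def nms(props, iou_thresh: float = 0.5):
    """Greedy temporal non-maximum suppression over (start, end, confidence) proposals."""
    props = sorted(props, key=lambda p: p[2], reverse=True)
    kept = []
    for s, e, c in props:
        ok = True
        for ks, ke, _ in kept:
            inter = max(0, min(e, ke) - max(s, ks))
            union = (e - s) + (ke - ks) - inter
            if union > 0 and inter / union > iou_thresh:
                ok = False
                break
        if ok:
            kept.append((s, e, c))
    return kept
```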
(3) Performance evaluation.
Finally, the detection performance of the weakly supervised video temporal action detection model is evaluated from the action class prediction results and the action localization prediction results.
From the description of the above embodiments, it will be apparent to those skilled in the art that the above embodiments can be implemented in software, or in software plus a necessary general hardware platform. With this understanding, the technical solutions of the foregoing embodiments can be embodied in a software product, which can be stored in a non-volatile storage medium (a CD-ROM, a USB flash drive, a removable hard disk, etc.) and includes several instructions for causing a computer device (a personal computer, a server, a network device, etc.) to perform the methods of the embodiments of the present invention.
Example 2
The invention also provides a weakly supervised video temporal action detection system, mainly used to implement the method provided by the previous embodiment; as shown in Fig. 4, it mainly comprises:
a model construction unit for constructing a weakly supervised video temporal action detection model, which comprises: a base framework, a self-training action branch, a false positive suppression module and a foreground enhancement branch;
a training unit for inputting training video data and the corresponding optical flow data into the weakly supervised video temporal action detection model, extracting the feature X through the base framework, encoding the feature X into the embedded feature E, obtaining the class activation sequence A through classification, and computing the base loss with the given video-level labels; the self-training action branch obtains action sequences of two modalities from the feature X, fuses them into a comprehensive action sequence and a non-action sequence, and computes the self-training action loss based on the action sequences of the two modalities together with the comprehensive action sequence and the non-action sequence; the false positive suppression module obtains a false positive sequence from the class activation sequence A and the non-action sequence and computes the false positive suppression loss with a preset uniform label; the foreground enhancement branch uses an attention mechanism on the embedded feature E to generate segment-level foreground weights, applies them to the class activation sequence A to obtain a foreground-enhanced class activation sequence, combines it with the action sequences of the two modalities into a comprehensive class activation sequence, and computes the foreground enhancement loss with the given video-level labels; the weakly supervised video temporal action detection model is trained with all the losses combined;
a detection unit for inputting the video data to be detected and the corresponding optical flow data into the trained weakly supervised video temporal action detection model and performing temporal action detection with the foreground-enhanced class activation sequence and the comprehensive class activation sequence obtained by the foreground enhancement branch.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional modules is illustrated; in practical applications, the above functions may be allocated to different functional modules as needed, i.e. the internal structure of the system is divided into different functional modules to perform all or part of the functions described above.
Example 3
The invention also provides a processing device, as shown in Fig. 5, which mainly comprises: one or more processors; and a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method provided by the foregoing embodiments.
Further, the processing device also comprises at least one input device and at least one output device; within the processing device, the processor, the memory, the input device and the output device are connected through a bus.
In the embodiment of the invention, the specific types of the memory, the input device and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical key or a mouse and the like;
the output device may be a display terminal;
the memory may be random access memory (Random Access Memory, RAM) or non-volatile memory (non-volatile memory), such as disk memory.
Example 4
The invention also provides a readable storage medium storing a computer program which, when executed by a processor, implements the method provided by the foregoing embodiments.
The readable storage medium according to the embodiment of the present invention may be provided as a computer-readable storage medium in the aforementioned processing device, for example as the memory in the processing device. The readable storage medium may be any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk or an optical disk.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (10)

1. The weak supervision video time sequence action detection method is characterized by comprising the following steps of:
constructing a weak supervision video timing sequence action detection model, wherein the weak supervision video timing sequence action detection model comprises: the system comprises a basic framework, a self-training action branch, a false positive suppression module and a prospect enhancement branch;
inputting training video data and corresponding optical flow data into a weak supervision video time sequence action detection model, extracting and obtaining a feature X through a basic framework, encoding the feature X into an embedded feature E, obtaining a class activation sequence A through classification, and calculating basic loss by combining a given video level label; the self-training action branch obtains action sequences of two modes by utilizing the characteristic X, obtains a comprehensive action sequence and a non-action sequence after fusion, and calculates self-training action loss based on the action sequences of the two modes and the comprehensive action sequence and the non-action sequence; the false positive suppression module obtains a false positive sequence by using the class activation sequence A and the non-action sequence, and calculates false positive suppression loss by combining a set uniform label; the foreground enhancement branch generates a segment level front Jing Quanchong by using an attention mechanism based on the embedded feature E, acts on the class activation sequence A to obtain a foreground enhancement class activation sequence, combines action sequences of the two modes to obtain a comprehensive class activation sequence, and combines a given video level label to calculate a foreground enhancement loss; training the weak supervision video time sequence action detection model by combining all losses;
inputting the video data to be detected and the corresponding optical flow data into the trained weak supervision video time sequence action detection model, and realizing time sequence action detection by utilizing the foreground-enhanced class activation sequence and the comprehensive class activation sequence obtained by the foreground enhancement branch.
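To make the final step of claim 1 concrete, the following is a minimal PyTorch sketch of training the model by combining all losses. The weighted-sum form and the weights w_self, w_fp and w_fg are illustrative assumptions; the claim only states that the losses are combined.

```python
# Hedged sketch: combining the four losses of claim 1 into one training objective.
# The weighted sum and the default weights are assumptions, not the patent's values.
import torch

def combined_loss(l_base: torch.Tensor, l_self: torch.Tensor,
                  l_fp: torch.Tensor, l_fg: torch.Tensor,
                  w_self: float = 1.0, w_fp: float = 1.0, w_fg: float = 1.0) -> torch.Tensor:
    # base classification loss plus the three auxiliary losses
    return l_base + w_self * l_self + w_fp * l_fp + w_fg * l_fg

# Typical use inside a training loop (model, optimizer and the individual
# losses are assumed to be computed elsewhere):
#   loss = combined_loss(l_base, l_self, l_fp, l_fg)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```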
2. The weak supervision video time sequence action detection method according to claim 1, wherein extracting the feature X through the basic framework, encoding the feature X into the embedded feature E, and obtaining the class activation sequence A through classification comprises:
the basic framework comprises: a pre-trained feature extraction network, a one-dimensional convolution layer and a classifier; wherein the pre-trained feature extraction network comprises an RGB feature extraction network and an optical flow feature extraction network, RGB referring to the three channels of red, green and blue;
extracting RGB features from the training video data through the RGB feature extraction network, extracting optical flow features from the corresponding optical flow data through the optical flow feature extraction network, and concatenating the RGB features and the optical flow features in the channel dimension to obtain the feature X; the feature X is processed by the one-dimensional convolution layer to obtain the embedded feature E, and the embedded feature E is classified by the classifier to obtain the class activation sequence A.
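A minimal PyTorch sketch of the basic framework in claim 2, assuming pre-extracted RGB and optical-flow segment features of 1024 channels each (e.g. from I3D-style networks); the layer widths, kernel size and class count are illustrative assumptions rather than the patent's implementation.

```python
import torch
import torch.nn as nn

class BaseFramework(nn.Module):
    """Concatenate RGB and optical-flow features, encode with a 1-D convolution,
    and classify each segment to obtain the class activation sequence A."""
    def __init__(self, feat_dim: int = 2048, embed_dim: int = 2048, num_classes: int = 20):
        super().__init__()
        self.embed = nn.Sequential(                      # one-dimensional convolution layer
            nn.Conv1d(feat_dim, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.classifier = nn.Conv1d(embed_dim, num_classes, kernel_size=1)

    def forward(self, x_rgb: torch.Tensor, x_flow: torch.Tensor):
        # x_rgb, x_flow: (batch, T, 1024) segment features from the two networks
        x = torch.cat([x_rgb, x_flow], dim=-1)           # feature X: (batch, T, 2048)
        e = self.embed(x.transpose(1, 2))                # embedded feature E: (batch, embed_dim, T)
        cas = self.classifier(e).transpose(1, 2)         # class activation sequence A: (batch, T, C)
        return x, e.transpose(1, 2), cas

# Example: x, e, cas = BaseFramework()(torch.randn(2, 100, 1024), torch.randn(2, 100, 1024))
```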
3. The method according to claim 2, wherein calculating the base loss in combination with the given video-level label comprises:
aggregating the k segments with the highest scores on each category in the class activation sequence A to obtain video-level class scores, and generating class probabilities in the category dimension by using a softmax function, wherein each segment contains a set number of frame images and k is a set positive integer; thereafter, the base loss is calculated in combination with the given video-level label, expressed as:
$p_i = \mathrm{softmax}(a_i), \qquad \mathcal{L}_{base} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log p_{i,c}$

wherein the softmax function is the normalized exponential function, $a_i$ denotes the classification score of the i-th video, $p_i$ denotes the class probability of the i-th video, $p_{i,c}$ denotes the probability value of the c-th action category of the i-th video, $y_{i,c}$ denotes the label value of the c-th action category in the video-level label of the i-th video, N denotes the number of videos, C denotes the number of action categories, and $\mathcal{L}_{base}$ denotes the base loss.
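A minimal PyTorch sketch of the top-k aggregation and base loss of claim 3. The value of k, the mean aggregation of the top-k scores, and the normalization of multi-hot video-level labels are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def base_loss(cas: torch.Tensor, video_labels: torch.Tensor, k: int = 8) -> torch.Tensor:
    """cas: (B, T, C) class activation sequence A; video_labels: (B, C) multi-hot labels."""
    # aggregate the k highest-scoring segments per class into a video-level class score
    topk_scores, _ = torch.topk(cas, k=min(k, cas.shape[1]), dim=1)
    video_scores = topk_scores.mean(dim=1)                       # (B, C)
    log_probs = F.log_softmax(video_scores, dim=-1)              # class probabilities p_i
    # cross-entropy against the (normalized) video-level label y_i
    y = video_labels / video_labels.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    return -(y * log_probs).sum(dim=-1).mean()
```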
4. The method according to claim 1, wherein the self-training action branch obtaining the action sequences of the two modalities by utilizing the feature X and fusing them to obtain the comprehensive action sequence and the non-action sequence comprises:
the feature X is obtained by concatenating RGB features and optical flow features in the channel dimension, RGB referring to the three channels of red, green and blue;
the corresponding RGB action sequence and optical flow action sequence are obtained by respectively processing the RGB features and the optical flow features sequentially through a convolution, a ReLU activation function and a Sigmoid activation function, expressed as:
$S^{R} = f_{R}(X^{R}), \qquad S^{O} = f_{O}(X^{O})$

wherein $X^{R}$ denotes the RGB features, $X^{O}$ denotes the optical flow features, $f_{R}$ and $f_{O}$ both denote modules that sequentially apply a convolution, a ReLU activation function and a Sigmoid activation function, ReLU denotes the rectified linear unit, Sigmoid is an S-shaped growth curve, $S^{R}$ denotes the RGB action sequence, and $S^{O}$ denotes the optical flow action sequence;
the RGB action sequence and the optical flow action sequence are fused into the comprehensive action sequence, and the non-action sequence is obtained from the comprehensive action sequence, expressed as:
$S = \lambda S^{R} + (1-\lambda)\,S^{O}, \qquad \bar{S} = 1 - S$

wherein $S$ denotes the comprehensive action sequence, $\lambda$ is a hyper-parameter controlling the fusion ratio of the action sequences of the two modalities, and $\bar{S}$ denotes the non-action sequence.
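A minimal PyTorch sketch of the self-training action branch of claim 4. The per-modality head follows the claim's literal convolution → ReLU → Sigmoid ordering, and the λ-weighted fusion together with the non-action sequence 1 − S is one assumed concrete form of the fusion described above; the kernel size and channel split are also assumptions.

```python
import torch
import torch.nn as nn

class ActionBranch(nn.Module):
    """Per-modality action heads, fusion into the comprehensive action sequence S,
    and the non-action sequence 1 - S."""
    def __init__(self, modality_dim: int = 1024, lam: float = 0.5):
        super().__init__()
        self.lam = lam
        def head() -> nn.Sequential:   # convolution -> ReLU -> Sigmoid, as listed in the claim
            return nn.Sequential(
                nn.Conv1d(modality_dim, 1, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Sigmoid(),
            )
        self.head_rgb, self.head_flow = head(), head()

    def forward(self, x: torch.Tensor):
        # x: (B, T, 2 * modality_dim); first half RGB features, second half flow features
        d = x.shape[-1] // 2
        x_rgb, x_flow = x[..., :d].transpose(1, 2), x[..., d:].transpose(1, 2)
        s_rgb = self.head_rgb(x_rgb).transpose(1, 2)      # RGB action sequence: (B, T, 1)
        s_flow = self.head_flow(x_flow).transpose(1, 2)   # optical-flow action sequence
        s = self.lam * s_rgb + (1.0 - self.lam) * s_flow  # comprehensive action sequence S
        return s_rgb, s_flow, s, 1.0 - s                  # last output: non-action sequence
```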
5. The weak supervision video time sequence action detection method according to claim 1 or 4, wherein calculating the self-training action loss based on the action sequences of the two modalities together with the comprehensive action sequence and the non-action sequence comprises:
taking the action sequence of each modality as a soft label for the action sequence of the other modality, and calculating a consistency loss, expressed as:
$\mathcal{L}_{con} = d\!\left(S^{R}, S^{O}\right) + d\!\left(S^{O}, S^{R}\right)$

wherein $\mathcal{L}_{con}$ denotes the consistency loss, $d(\cdot,\cdot)$ denotes a similarity metric function, and $S^{R}$ and $S^{O}$ denote the RGB action sequence and the optical flow action sequence, i.e. the action sequences of the two modalities, RGB referring to the three channels of red, green and blue;
summing and averaging the action scores of the $k'$ segments with the highest scores in the comprehensive action sequence to obtain the video-level action score $s^{act}_{i}$, and obtaining the non-action scores of the remaining segments from the non-action sequence and summing and averaging them to obtain the video-level non-action score $s^{bkg}_{i}$, wherein $k'$ is a set positive integer; the action loss is then calculated:
$\mathcal{L}_{act} = -\frac{1}{N}\sum_{i=1}^{N}\left(\log s^{act}_{i} + \log s^{bkg}_{i}\right)$

wherein $\mathcal{L}_{act}$ denotes the action loss, $s^{act}_{i}$ denotes the action score of the i-th video, $s^{bkg}_{i}$ denotes the non-action score of the i-th video, and N denotes the number of videos;
the action loss $\mathcal{L}_{act}$ and the consistency loss $\mathcal{L}_{con}$ are combined to calculate the self-training action loss, expressed as:
$\mathcal{L}_{self} = \mathcal{L}_{act} + \mathcal{L}_{con}$

wherein $\mathcal{L}_{self}$ denotes the self-training action loss.
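A minimal PyTorch sketch of the self-training action loss of claim 5. The similarity metric d is taken here as an L1 distance with the soft label detached from the gradient, and the logarithmic form of the action loss is one assumed concrete instance of the description; both choices, as well as the value of k', are assumptions.

```python
import torch
import torch.nn.functional as F

def self_training_loss(s_rgb: torch.Tensor, s_flow: torch.Tensor,
                       s: torch.Tensor, k_prime: int = 8) -> torch.Tensor:
    """s_rgb, s_flow, s: (B, T, 1) action sequences; s is the comprehensive sequence.
    Assumes T > k_prime."""
    # consistency loss: each modality's sequence acts as a soft label for the other
    l_con = F.l1_loss(s_rgb, s_flow.detach()) + F.l1_loss(s_flow, s_rgb.detach())
    # action loss: the top-k' segments of S give the video-level action score,
    # the remaining segments of (1 - S) give the video-level non-action score
    s = s.squeeze(-1)                                       # (B, T)
    k = min(k_prime, s.shape[1] - 1)
    topk, idx = torch.topk(s, k=k, dim=1)
    act_score = topk.mean(dim=1)                            # s_i^act
    mask = torch.ones_like(s).scatter(1, idx, 0.0).bool()   # True on the remaining segments
    non_act_score = (1.0 - s)[mask].view(s.shape[0], -1).mean(dim=1)   # s_i^bkg
    l_act = -(torch.log(act_score + 1e-8) + torch.log(non_act_score + 1e-8)).mean()
    return l_act + l_con
```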
6. The method according to claim 1, wherein the false positive suppression module obtaining the false positive sequence by using the class activation sequence A and the non-action sequence, and calculating the false positive suppression loss in combination with the set uniform label comprises:
obtaining the false positive sequence by using the class activation sequence A and the non-action sequence, expressed as:
$A^{fp} = A \odot \bar{S}$

wherein $\odot$ denotes the Hadamard (element-wise) product and $A^{fp}$ denotes the false positive sequence;
aggregating the $k_{fp}$ segments with the highest scores on each category in the false positive sequence $A^{fp}$ to obtain video-level false positive scores, and generating false positive probabilities $p^{fp}_{i}$ in the category dimension by using a softmax function, wherein each segment contains a set number of frame images and $k_{fp}$ is a set positive integer; then, the false positive suppression loss is calculated in combination with the set uniform label, expressed as:
$\mathcal{L}_{fp} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} u_{c}\,\log p^{fp}_{i,c}$

wherein the softmax function is the normalized exponential function, $p^{fp}_{i,c}$ denotes the false positive probability value of the c-th action category of the i-th video, $u_{c}$ denotes the label value of the c-th action category in the set uniform label, N denotes the number of videos, C denotes the number of action categories, and $\mathcal{L}_{fp}$ denotes the false positive suppression loss.
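A minimal PyTorch sketch of the false positive suppression loss of claim 6. The Hadamard product broadcasts the (B, T, 1) non-action sequence over the (B, T, C) class activation sequence; the mean top-k_fp aggregation and the uniform label value 1/C are assumptions read from the claim's wording.

```python
import torch
import torch.nn.functional as F

def false_positive_loss(cas: torch.Tensor, non_action: torch.Tensor, k_fp: int = 8) -> torch.Tensor:
    """cas: (B, T, C) class activation sequence A; non_action: (B, T, 1) non-action sequence."""
    fp_seq = cas * non_action                                  # false positive sequence A ⊙ S̄
    topk, _ = torch.topk(fp_seq, k=min(k_fp, fp_seq.shape[1]), dim=1)
    fp_video = topk.mean(dim=1)                                # video-level false positive scores
    log_probs = F.log_softmax(fp_video, dim=-1)                # false positive probabilities
    uniform = torch.full_like(fp_video, 1.0 / cas.shape[-1])   # set uniform label (1/C per class)
    return -(uniform * log_probs).sum(dim=-1).mean()
```

Pushing these scores toward a uniform class distribution discourages confident class responses on segments that the non-action sequence marks as background, which matches the suppressive purpose of the module.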
7. The method according to claim 1, wherein obtaining the comprehensive class activation sequence and calculating the foreground enhancement loss in combination with the given video-level label comprises:
the foreground-enhanced class activation sequence $A^{fg}$ is combined with the average of the action sequences of the two modalities to obtain the comprehensive class activation sequence $\hat{A}$;
the indices of the k segments with the highest score on each category in the comprehensive class activation sequence $\hat{A}$ are recorded as index; according to index, the corresponding segments in each category of the foreground-enhanced class activation sequence are found and aggregated to obtain the score $a^{fg}_{i,c}$, and a softmax function is used in the class dimension to generate the video-level class probability $p^{fg}_{i}$, expressed as:
$p^{fg}_{i} = \mathrm{softmax}(a^{fg}_{i})$

wherein the softmax function is the normalized exponential function, $a^{fg}_{i,c}$ denotes the score of the c-th action category of the i-th video obtained by aggregation from the foreground-enhanced class activation sequence $A^{fg}$, and $p^{fg}_{i,c}$ denotes the resulting probability value of the c-th action category of the i-th video;
thereafter, the foreground enhancement loss is calculated in combination with the given video-level label, expressed as:
$\mathcal{L}_{fg} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log p^{fg}_{i,c}$

wherein $y_{i,c}$ denotes the label value of the c-th action category in the video-level label of the i-th video, N denotes the number of videos, C denotes the number of action categories, and $\mathcal{L}_{fg}$ denotes the foreground enhancement loss.
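A minimal PyTorch sketch of the foreground enhancement loss of claim 7. The element-wise application of the segment-level foreground weights to A and the multiplicative fusion of A_fg with the averaged action sequences are assumptions consistent with the claim's wording; the value of k and the label normalization are also illustrative.

```python
import torch
import torch.nn.functional as F

def foreground_enhancement_loss(cas: torch.Tensor, fg_weight: torch.Tensor,
                                s_rgb: torch.Tensor, s_flow: torch.Tensor,
                                video_labels: torch.Tensor, k: int = 8) -> torch.Tensor:
    """cas: (B, T, C); fg_weight, s_rgb, s_flow: (B, T, 1); video_labels: (B, C) multi-hot."""
    cas_fg = cas * fg_weight                            # foreground-enhanced class activation sequence
    cas_comp = cas_fg * 0.5 * (s_rgb + s_flow)          # comprehensive class activation sequence (assumed fusion)
    # indices of the k highest-scoring segments per class in the comprehensive sequence
    _, index = torch.topk(cas_comp, k=min(k, cas.shape[1]), dim=1)
    # aggregate the corresponding segments of the foreground-enhanced sequence at those indices
    video_scores = torch.gather(cas_fg, 1, index).mean(dim=1)          # (B, C)
    log_probs = F.log_softmax(video_scores, dim=-1)                    # video-level class probabilities
    y = video_labels / video_labels.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    return -(y * log_probs).sum(dim=-1).mean()
```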
8. A weak supervision video time sequence action detection system, characterized by comprising:
a model construction unit, configured to construct a weak supervision video time sequence action detection model, wherein the weak supervision video time sequence action detection model comprises: a basic framework, a self-training action branch, a false positive suppression module and a foreground enhancement branch;
a training unit, configured to input training video data and corresponding optical flow data into the weak supervision video time sequence action detection model, extract a feature X through the basic framework, encode the feature X into an embedded feature E, obtain a class activation sequence A through classification, and calculate a base loss in combination with a given video-level label; the self-training action branch obtains action sequences of two modalities by utilizing the feature X, fuses them to obtain a comprehensive action sequence and a non-action sequence, and calculates a self-training action loss based on the action sequences of the two modalities together with the comprehensive action sequence and the non-action sequence; the false positive suppression module obtains a false positive sequence by using the class activation sequence A and the non-action sequence, and calculates a false positive suppression loss in combination with a set uniform label; the foreground enhancement branch generates segment-level foreground weights by using an attention mechanism based on the embedded feature E, applies them to the class activation sequence A to obtain a foreground-enhanced class activation sequence, combines the action sequences of the two modalities to obtain a comprehensive class activation sequence, and calculates a foreground enhancement loss in combination with the given video-level label; the weak supervision video time sequence action detection model is trained by combining all the losses;
a detection unit, configured to input the video data to be detected and the corresponding optical flow data into the trained weak supervision video time sequence action detection model, and realize time sequence action detection by utilizing the foreground-enhanced class activation sequence and the comprehensive class activation sequence obtained by the foreground enhancement branch.
9. A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.
10. A readable storage medium storing a computer program, which when executed by a processor implements the method of any one of claims 1-7.
CN202310891912.1A 2023-07-20 2023-07-20 Weak supervision video time sequence action detection method, system, equipment and storage medium Active CN116612420B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310891912.1A CN116612420B (en) 2023-07-20 2023-07-20 Weak supervision video time sequence action detection method, system, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310891912.1A CN116612420B (en) 2023-07-20 2023-07-20 Weak supervision video time sequence action detection method, system, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116612420A true CN116612420A (en) 2023-08-18
CN116612420B CN116612420B (en) 2023-11-28

Family

ID=87683995

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310891912.1A Active CN116612420B (en) 2023-07-20 2023-07-20 Weak supervision video time sequence action detection method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116612420B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180307935A1 (en) * 2015-03-24 2018-10-25 Hrl Laboratories, Llc System for detecting salient objects in images
WO2019099226A1 (en) * 2017-11-14 2019-05-23 Google Llc Weakly-supervised action localization by sparse temporal pooling network
KR102201353B1 (en) * 2019-11-22 2021-01-08 연세대학교 산학협력단 Method and Apparatus for Detecting Action Frame Based on Weakly-supervised Learning through Background Frame Suppression
CN111914644A (en) * 2020-06-30 2020-11-10 西安交通大学 Dual-mode cooperation based weak supervision time sequence action positioning method and system
WO2022007193A1 (en) * 2020-07-07 2022-01-13 南京理工大学 Weak supervision video behavior detection method and system based on iterative learning
CN113762178A (en) * 2021-09-13 2021-12-07 合肥工业大学 Weak supervision abnormal event time positioning method for background suppression sampling
CN114882340A (en) * 2022-04-15 2022-08-09 西安电子科技大学 Weak supervision target detection method based on bounding box regression
CN115641529A (en) * 2022-09-30 2023-01-24 青岛科技大学 Weak supervision time sequence behavior detection method based on context modeling and background suppression
CN116030538A (en) * 2023-03-30 2023-04-28 中国科学技术大学 Weak supervision action detection method, system, equipment and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ZHILIN LI, ET AL.: "Actionness Inconsistency-Guided Contrastive Learning for Weakly-Supervised Temporal Action Localization", THE THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-23), pages 1-9 *
LI SHENGWU; ZHANG XUANDE: "Visual Tracking with a Multi-Domain Convolutional Neural Network Based on the Self-Attention Mechanism", COMPUTER APPLICATIONS (计算机应用), no. 08, pages 55-60 *
HU ZHENGPING; ZHANG LE; YIN YANHUA: "Sparse Representation Video Anomaly Detection Algorithm Based on AP Clustering of Spatio-Temporal Deep Features", JOURNAL OF SIGNAL PROCESSING (信号处理), no. 03, pages 74-83 *
HE CHUJING ET AL.: "Video Action Detection with Modeling of Interaction Relations and Category Dependencies", JOURNAL OF IMAGE AND GRAPHICS (中国图象图形学报), vol. 28, no. 05, pages 1499-1512 *

Also Published As

Publication number Publication date
CN116612420B (en) 2023-11-28

Similar Documents

Publication Publication Date Title
Zhou et al. Improving video saliency detection via localized estimation and spatiotemporal refinement
Wu et al. Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning
Zhou et al. Salient region detection via integrating diffusion-based compactness and local contrast
Yuan et al. Robust visual tracking with correlation filters and metric learning
CN110832499A (en) Weak supervision action localization over sparse time pooling networks
CN111523422B (en) Key point detection model training method, key point detection method and device
CN116030538B (en) Weak supervision action detection method, system, equipment and storage medium
CN110378911B (en) Weak supervision image semantic segmentation method based on candidate region and neighborhood classifier
CN111914778A (en) Video behavior positioning method based on weak supervised learning
CN112927266B (en) Weak supervision time domain action positioning method and system based on uncertainty guide training
CN109753884A (en) A kind of video behavior recognition methods based on key-frame extraction
CN113111716B (en) Remote sensing image semiautomatic labeling method and device based on deep learning
Cheng et al. Joint-modal label denoising for weakly-supervised audio-visual video parsing
CN112052818A (en) Unsupervised domain adaptive pedestrian detection method, unsupervised domain adaptive pedestrian detection system and storage medium
Idan et al. Fast shot boundary detection based on separable moments and support vector machine
CN113591529A (en) Action segmentation model processing method and device, computer equipment and storage medium
Raychaudhuri et al. Exploiting temporal coherence for self-supervised one-shot video re-identification
Baraka et al. Weakly-supervised temporal action localization: a survey
Srilakshmi et al. Sports video retrieval and classification using focus u-net based squeeze excitation and residual mapping deep learning model
CN116612420B (en) Weak supervision video time sequence action detection method, system, equipment and storage medium
CN116434010A (en) Multi-view pedestrian attribute identification method
CN115861625A (en) Self-label modifying method for processing noise label
Dhar et al. Detecting deepfake images using deep convolutional neural network
CN114168780A (en) Multimodal data processing method, electronic device, and storage medium
Nishath et al. An Adaptive Classifier Based Approach for Crowd Anomaly Detection.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant