CN112417206B - Weak supervision video time interval retrieval method and system based on two-branch proposed network - Google Patents
- Publication number
- Publication number: CN112417206B (application CN202011332463.XA)
- Authority
- CN
- China
- Prior art keywords
- proposal
- video
- branch
- text
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F16/735—Filtering based on additional data, e.g. user or group profiles
- G06F16/783—Retrieval characterised by using metadata automatically derived from the content
- G06F16/7844—Retrieval using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
- G06F16/7867—Retrieval using metadata, using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
Abstract
The invention discloses a weakly supervised video time interval retrieval method and system based on a two-branch proposal network, belonging to the field of video retrieval. The method mainly comprises the following steps: 1) for a training set of videos and description texts, a cross-modal language perception filter learns a joint representation of the video and text information and generates an enhanced video stream and a suppressed video stream carrying the text information; 2) from the output of the cross-modal language perception filter, a parameter-sharing regularized two-branch proposal network outputs a time interval answer for the joint video-text representation. The invention adopts a language perception filter that uses a scene-based cross-modal estimation method to generate the enhanced and suppressed video streams, adopts a novel two-branch proposal network that simultaneously considers inter-sample and intra-sample confrontation, and adopts a proposal regularization strategy to stabilize the training process, effectively improving model performance.
Description
Technical Field
The invention relates to the field of video time interval retrieval, and in particular to a weakly supervised video time interval retrieval method and system based on a two-branch proposal network.
Background
Video time interval retrieval is an important problem in the field of video retrieval; it aims to automatically locate a target time interval in an untrimmed video according to a given description text.
Video time interval retrieval is an interdisciplinary field between computer vision and natural language processing. A retrieval model must understand not only the visual and textual content but also the correlation between them. Most existing methods are trained in a fully supervised setting on aligned, annotated video-text pairs, which is time-consuming and expensive, especially for ambiguous descriptions. Recently, researchers have begun exploring weakly supervised time interval retrieval using only video-level sentence annotations.
Most existing weakly supervised time interval retrieval methods are based on multi-instance learning (MIL): a matched video-text pair is regarded as a positive sample and an unmatched pair as a negative sample. These methods mainly exploit inter-sample confrontation to judge whether a video matches a given textual description, but ignore intra-sample confrontation, i.e., determining which time interval within the video best matches the description. Given a matched video-text pair, the video typically contains continuous content including negative time intervals that are highly relevant to the textual description yet not a perfect match; these are difficult to distinguish from the target time interval. It is therefore necessary to develop sufficient intra-sample confrontation between time intervals with similar content in a video.
In summary, the prior art cannot effectively exploit intra-sample confrontation between adjacent time intervals of a video, so its performance is limited in scenes with similar content and it cannot accurately locate time interval boundaries.
Disclosure of Invention
The invention aims to solve the above problems in the prior art and provides a weakly supervised video time interval retrieval method and system based on a two-branch proposal network.
In order to achieve the purpose, the invention specifically adopts the following technical scheme:
A weakly supervised video time interval retrieval method based on a two-branch proposal network comprises the following steps:
1) establishing a network model consisting of a cross-modal language perception filter and a regularized two-branch proposal network, wherein the regularized two-branch proposal network comprises an enhanced branch proposal network and a suppressed branch proposal network;
2) acquiring a video and a description text as a training data set, and extracting frame characteristics of the video and text characteristics of the description text;
3) taking the frame characteristics and the text characteristics as the input of a cross-modal language perception filter, and generating an enhanced video stream and a suppressed video stream with the text characteristics;
4) taking the generated enhanced video stream and text features as the input of the enhanced branch proposal network, outputting proposal results and scores, and screening to obtain a positive proposal set; taking the generated suppressed video stream and text features as the input of the suppressed branch proposal network, outputting proposal results and scores, and screening to obtain a negative proposal set;
5) introducing proposal regularization into the enhanced branch proposal network, calculating a multitask loss function through intra-sample confrontation, inter-sample confrontation and proposal regularization, and updating parameters of a cross-modal language perception filter and a regularized two-branch proposal network to obtain a trained network model;
6) for the video and the query sentence to be detected, the frame characteristics of the video and the text characteristics of the query sentence are respectively extracted, the frame characteristics and the text characteristics are used as the input of a trained network model, and a positive proposal with the highest predicted score is obtained as a retrieval result.
Another objective of the present invention is to provide a weakly supervised video time interval retrieval system based on a two-branch proposal network, for implementing the above retrieval method.
The weakly supervised video period retrieval system comprises:
The data acquisition module is used for acquiring videos and description texts as the training data set when the system is in the training stage, and for acquiring the video to be detected and the query sentences when the system is in the detection stage.
The feature extraction module is used for extracting frame features from the videos and extracting text features from the description texts and the query sentences.
And the cross-modal language perception filtering module is used for receiving the frame characteristics and the text characteristics as input and outputting the enhanced video stream and the suppressed video stream containing the text characteristics.
Regularized two-branch proposal network module: composed of an enhanced branch submodule and a suppressed branch submodule, used for taking the generated enhanced video stream and text features as input to the enhanced branch proposal network, outputting proposal results and scores, and screening to obtain a positive proposal set; and for taking the generated suppressed video stream and text features as input to the suppressed branch proposal network, outputting proposal results and scores, and screening to obtain a negative proposal set.
Compared with conventional methods, the invention effectively improves the performance of video time interval retrieval, specifically embodied as follows:
(1) Aiming at the problem that conventional methods ignore intra-sample confrontation, the invention designs a novel regularized two-branch proposal network. Each branch comprises: a cross-modal interaction unit that integrates textual cues into the visual features to generate language-aware frame features; a two-dimensional time interval feature map that is generated from the language-aware frame features and processed by convolution to explore the relationships between adjacent time intervals; and a proposal screening module that selects the best proposals. By receiving the enhanced video stream, the suppressed video stream, and the text features, the network generates a series of matched positive and negative proposals, along with a score and boundary for each proposal, for weakly supervised video time interval retrieval; using a center-based proposal screening technique, a superior set of positive proposals and a reasonable set of negative proposals are screened out.
In addition, the method considers inter-sample confrontation and intra-sample confrontation at the same time. Intra-sample confrontation is encouraged through the intra-sample loss, distinguishing the target time interval from similar interfering negative time intervals in the same data pair; inter-sample confrontation is encouraged through the inter-sample loss, so that matching positive samples score higher than non-matching negative samples. The model thus judges not only whether a video matches a given textual description but also which time interval best matches it, so that negative time intervals that are highly correlated with the description yet not a complete match can be distinguished from the target time interval, improving the accuracy of the retrieval result.
(2) Aiming at the problem that intra-sample confrontation can generate overly simple, ineffective negative samples, the invention designs a language perception filter that uses a scene-based cross-modal estimation method: it projects the text features onto cluster centers with the NetVLAD technique to generate a scene-based language feature sequence; it then computes cross-modal matching scores between the frame feature sequence and the scene-based language feature sequence, obtains a score for each frame, and normalizes the scores; finally, two branch gates generate an enhanced video stream and a suppressed video stream from the normalized score distribution and the frame feature sequence. The enhanced video stream highlights language-related key frame features and weakens irrelevant ones, and the suppressed video stream does the opposite.
(3) To exploit prior knowledge that benefits model training, a proposal regularization strategy is designed into the regularized two-branch proposal network; because the two branches have a consistent structure and shared parameters, the strategy need only be applied to the enhanced branch. In particular, considering that most time intervals are unselected, i.e., not matched with the text description, the invention uses a global loss function term to reduce the average proposal score so that the scores of unselected time intervals are close to 0. Considering that the most accurate time interval proposal must be selected from a series of proposals as the final result, the invention applies the softmax function to all positive proposals and introduces a gap loss function term to encourage expansion of the score gap between them. In summary, the designed proposal regularization strategy reduces the average score of all proposals to weaken the influence of irrelevant proposals, and enlarges the differences between proposal scores to help select the optimal proposal, stabilizing model training and improving model performance.
Drawings
Fig. 1 is a schematic diagram of a network model used in the present invention.
FIG. 2 is a schematic diagram of a cross-modal language-aware filter used in the present invention.
FIG. 3 is a schematic diagram of the structure of an enhanced branch in a regularized two-branch proposed network used by the present invention.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in fig. 1, the weakly supervised video time interval retrieval method based on a two-branch proposal network provided by the present invention includes the following steps:
firstly, extracting frame characteristics of a video and text characteristics of a description text for an input video and the input text; and then generating an enhanced video stream with text features and a suppressed video stream through a cross-modal language perception filter.
And step two, generating a series of regularized positive and negative proposals and scores and boundaries thereof for the generated enhanced video stream with text characteristics and the generated suppressed video stream through a regularized two-branch proposal network, and calculating a multitask loss function through intra-sample confrontation, inter-sample confrontation and proposal regularization so as to update model parameters of the cross-modal language perception filter and the regularized two-branch proposal network.
Step three, for the video and text whose answer is to be predicted, the trained cross-modal language perception filter and regularized two-branch proposal network output the positive proposal time interval with the highest predicted score as the retrieval result.
In one embodiment of the present invention, the first step is performed as follows:
1.1) acquiring a video and a description text as a training data set, and extracting frame characteristics and text characteristics;
the frame feature extraction method specifically comprises the following steps: extracting visual features of the video by using a pre-trained video feature extractor, and reducing the length of a visual feature sequence by using time sequence average pooling to obtain a frame feature sequence of the videoWherein n isvIs a characteristic number, viIs the frame characteristic of the ith frame in the video; the pre-trained video feature extractor is different from data set to data set, C3D features are extracted from a Charads-STA data set and an ActivityCaption data set, and optical flow features are extracted from a DiDeMo data set.
The text feature extraction method is specifically as follows: extract word features with pre-trained GloVe word embeddings, then feed the word features into a Bi-GRU network to learn word semantic representations with context information, obtaining the text feature sequence Q = {q_1, …, q_{n_q}}, where n_q is the number of words and q_i is the semantic feature of the i-th word.
1.2) taking the frame characteristics and the text characteristics as the input of a cross-modal language perception filter, and generating an enhanced video stream with the text characteristics and a suppressed video stream.
The structure of the cross-modal language perception filter is shown in fig. 2, and a scene-based cross-modal estimation method is used, and specifically comprises the following steps:
First, the present invention projects the text features Q onto cluster centers using the NetVLAD technique. Specifically, given a set of trainable center vectors C = {c_1, …, c_{n_c}}, where n_c is the number of centers and c_j is the j-th center, NetVLAD accumulates the residuals between the text features and the center vectors by soft assignment:

α_i = softmax(W_c q_i + b_c),  u_j = Σ_{i=1}^{n_q} α_ij (q_i − c_j)

where W_c and b_c are a projection matrix and an offset vector, α_i is the soft-assignment vector over the n_c cluster centers, α_ij is the soft-assignment coefficient between the i-th text feature and the j-th cluster center, and u_j is the feature accumulated at the j-th center. Each center can be regarded as a language scene and u_j as a scene-based language feature, finally yielding the scene-based language feature sequence U = {u_1, …, u_{n_c}}.
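The NetVLAD soft-assignment and residual accumulation described above can be sketched as follows, with the trainable parameters W_c, b_c and the centers C passed in as plain arrays. This is an illustrative numpy version, not the patent's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def netvlad(Q, C, W_c, b_c):
    """Q: (n_q, d) text features; C: (n_c, d) trainable centers.
    alpha: (n_q, n_c) soft-assignment coefficients; u_j accumulates
    the residuals q_i - c_j weighted by alpha_ij."""
    alpha = softmax(Q @ W_c + b_c, axis=1)           # (n_q, n_c)
    resid = Q[:, None, :] - C[None, :, :]            # (n_q, n_c, d)
    return (alpha[:, :, None] * resid).sum(axis=0)   # (n_c, d) scene features
```

Each row of the output is one scene-based language feature u_j.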
Next, the invention computes the cross-modal matching score β_ij between the frame feature sequence V = {v_i} and the scene-based language feature sequence U = {u_j}:

β_ij = σ(w_a (W_1 v_i + W_2 u_j + b_a))

where W_1 and W_2 are projection matrices, b_a is an offset vector, w_a is a row vector, σ is the sigmoid activation function, and β_ij ∈ (0, 1) represents the matching score between the i-th frame feature and the j-th scene-based language feature. Through this formula, an intermediate semantic space is introduced for the text and the video.
Considering the definition of an important frame, namely a frame to which some language scene is closely related, the invention evaluates each frame with an overall score; specifically, the overall score of the i-th frame is β_i = max_j β_ij, where the index j ranges over the scene dimension. Meanwhile, to prevent a score distribution with too little separation, for example all frame scores close to 0 or 1, the distribution is adjusted by max-min normalization:

β̂_i = (β_i − min_k β_k) / (max_k β_k − min_k β_k)

This yields a normalized score distribution β̂ = {β̂_1, …, β̂_{n_v}} over the frames, where β̂_i represents the correlation between the i-th frame and the description text.
Finally, the present invention uses a two-branch gate to generate the enhanced video stream V_en and the suppressed video stream V_sp:

v_en_i = β̂_i · v_i,  v_sp_i = (1 − β̂_i) · v_i

where v_en_i is the i-th enhanced frame feature of the enhanced video stream V_en and v_sp_i is the i-th suppressed frame feature of the suppressed video stream V_sp. Based on the normalized scores, the enhanced video stream highlights key frames and attenuates the effect of non-key frames, while the suppressed video stream does the opposite.
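Putting the max-min normalization and the two branch gates together, a minimal numpy sketch follows (the function name is an assumption; raw frame scores are passed in):

```python
import numpy as np

def language_aware_filter(V, beta):
    """V: (n_v, d) frame features; beta: (n_v,) raw frame scores.
    Min-max normalise the scores, then gate the stream twice."""
    b = (beta - beta.min()) / (beta.max() - beta.min() + 1e-8)
    V_en = b[:, None] * V          # enhanced: keeps language-related frames
    V_sp = (1.0 - b)[:, None] * V  # suppressed: keeps the complement
    return V_en, V_sp, b
```

Note that by construction the two streams are complementary: their sum recovers the original frame features.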
In one embodiment of the present invention, the implementation of step two is as follows:
Taking the generated enhanced video stream, suppressed video stream, and text features as the input of the regularized two-branch proposal network, the final regularized time interval proposal results and scores are output; these results are used to compute the multitask loss and update the parameters of the cross-modal language perception filter and the regularized two-branch proposal network, yielding the final network model.
2.1) The enhanced branch and the suppressed branch of the regularized two-branch proposal network have a consistent structure and share parameters; the proposal-selection flow of the enhanced branch is shown in fig. 3, specifically:
Given the enhanced video stream V_en and the text features Q, a cross-modal interaction unit integrates the textual cues into the visual features. Specifically, a frame-to-word attention structure summarizes the text features for each frame:

δ_ij = w_m (W_3 v_en_i + W_4 q_j + b_m),  δ̂_ij = exp(δ_ij) / Σ_k exp(δ_ik),  h_i = Σ_{j=1}^{n_q} δ̂_ij q_j

where h_i is the integrated text representation of the i-th frame, W_3 and W_4 are projection matrices, b_m is an offset vector, w_m is a row vector, δ_ij is the matching score between the i-th enhanced frame feature and the semantic feature of the j-th word in the description text, and δ̂_ij is the matching score after softmax processing.
2.2) A cross gate then lets the enhanced frame features and the integrated text features interact with each other:

g_v_i = σ(W_v h_i + b_v),  g_t_i = σ(W_t v_en_i + b_t),  f_i = [v_en_i ⊙ g_v_i ; h_i ⊙ g_t_i]

where g_v_i is the visual gate, g_t_i is the text gate, ⊙ denotes element-wise multiplication, W_v and W_t are projection matrices, and b_v and b_t are offset vectors; concatenating the gated visual feature and the gated text feature yields the language-aware frame feature f_i.
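The cross-gate interaction of step 2.2 can be sketched for a single frame as follows; the parameter names mirror the text (W_v, b_v, W_t, b_t), but the function itself is an illustrative assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_gate(v, h, W_v, b_v, W_t, b_t):
    """v: (d,) enhanced frame feature; h: (d,) integrated text feature.
    Each modality gates the other, then the gated results are concatenated."""
    g_v = sigmoid(h @ W_v + b_v)   # visual gate computed from the text side
    g_t = sigmoid(v @ W_t + b_t)   # text gate computed from the visual side
    return np.concatenate([v * g_v, h * g_t])  # language-aware frame feature
```

With zero weights both gates evaluate to 0.5, so the output is simply half of each input side concatenated.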
2.3) From the language-aware frame features F = {f_1, …, f_{n_v}}, a two-dimensional temporal network constructs a two-dimensional time interval feature map. The constructed map has three dimensions: the first two index the start frame and end frame of a time interval, and the third is the feature dimension. The feature of time interval [a, b] is computed by pooling the language-aware frame features f_a, …, f_b; cells with a > b are treated as invalid and filled with 0. In addition, when n_v is too large, the model applies sparse sampling to save computation.
According to the constructed two-dimensional time interval feature map, two two-dimensional convolutions are applied to explore the relationships between adjacent time intervals, yielding the cross-modal features {m_1, …, m_{M_en}}, where m_i is the cross-modal feature corresponding to the i-th time interval in the map and M_en is the number of time intervals in the map. A proposal score is computed for every time interval, each corresponding to one enhanced proposal; the i-th enhanced proposal score is:

c_en_i = σ(W_p m_i + b_p)

where W_p and b_p are a projection matrix and an offset vector.
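The two-dimensional time interval feature map of step 2.3 can be sketched as below. Max pooling over the span [a, b] is an assumption in the style of 2D temporal maps, since the patent's exact pooling formula is not reproduced here; cells with a > b stay zero, matching the invalid-cell rule.

```python
import numpy as np

def build_2d_map(F):
    """F: (n_v, d) language-aware frame features.
    M[a, b] = max-pooled feature over frames a..b; a > b cells remain 0."""
    n_v, d = F.shape
    M = np.zeros((n_v, n_v, d))
    for a in range(n_v):
        run = F[a].copy()
        M[a, a] = run
        for b in range(a + 1, n_v):
            run = np.maximum(run, F[b])  # extend the interval by one frame
            M[a, b] = run
    return M
```

The upper triangle (including the diagonal) holds one feature per candidate interval; a 2D convolution over this map then relates adjacent intervals.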
2.4) A center-based proposal screening method filters proposals according to the computed scores. Specifically, the proposal with the highest score is taken as the center proposal, the remaining proposals are ranked by their time interval overlap with it, and the T−1 proposals with the highest overlap are taken; these T proposals form the positive proposal set P_en, with corresponding scores C_en and boundaries L_en.
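The center-based screening of step 2.4 can be sketched as follows; `t_iou` measures temporal overlap (intersection over union of two intervals), and the helper names are illustrative, not from the patent.

```python
import numpy as np

def t_iou(p, q):
    """Temporal IoU of two (start, end) intervals."""
    inter = max(0.0, min(p[1], q[1]) - max(p[0], q[0]))
    union = max(p[1], q[1]) - min(p[0], q[0])
    return inter / union if union > 0 else 0.0

def center_based_filter(boundaries, scores, T):
    """Pick the highest-scoring proposal as the centre, then keep the
    T-1 remaining proposals with the largest temporal IoU to it."""
    center = int(np.argmax(scores))
    rest = [i for i in range(len(scores)) if i != center]
    rest.sort(key=lambda i: t_iou(boundaries[i], boundaries[center]),
              reverse=True)
    return [center] + rest[:T - 1]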
Through this screening, a series of related positive proposals is selected; similarly, the suppressed branch effectively generates reasonable negative proposals, recorded as the negative proposal set P_sp with corresponding scores C_sp and boundaries L_sp.
2.5) Based on the generated positive and negative proposals and their scores, an enhancement score K_en and a suppression score K_sp are computed (e.g., by aggregating the scores of the respective proposal sets), and the intra-sample loss function term is

L_intra = max(0, Δ_intra − K_en + K_sp)

where Δ_intra is a margin value (taken as 0.4 in the present invention). L_intra encourages intra-sample confrontation by increasing the positive proposal score and decreasing the negative proposal score, distinguishing the target time interval from similar interfering negative time intervals within the same data pair.
The enhanced branch score K_en is computed from the generated positive proposals. An unmatched video and an unmatched text are randomly sampled from data that does not match the current pair, and the corresponding enhanced proposal scores K_en_v′ and K_en_q′ are computed as above; these are the matching scores between the current data and the randomly sampled unmatched video or text. The inter-sample loss function term is

L_inter = max(0, Δ_inter − K_en + K_en_v′) + max(0, Δ_inter − K_en + K_en_q′)

where Δ_inter is a margin value (taken as 0.6 in the present invention). L_inter encourages inter-sample confrontation by increasing the matching sample score and decreasing the non-matching sample scores, so that a matching positive sample scores higher than a non-matching negative sample.
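The two hinge-style confrontation losses can be sketched as below, with the margins Δ_intra = 0.4 and Δ_inter = 0.6 stated in the text; how the scalar scores K are aggregated from the proposal sets is left to the caller, and the function names are illustrative.

```python
def hinge(margin, pos, neg):
    """Standard margin hinge: zero once pos exceeds neg by the margin."""
    return max(0.0, margin - pos + neg)

def confrontation_losses(K_en, K_sp, K_v_neg, K_q_neg,
                         d_intra=0.4, d_inter=0.6):
    """K_en/K_sp: aggregated scores of the enhanced (positive) and
    suppressed (negative) proposal sets of one matched pair;
    K_v_neg/K_q_neg: enhanced scores obtained with a randomly sampled
    unmatched video / unmatched text."""
    l_intra = hinge(d_intra, K_en, K_sp)
    l_inter = hinge(d_inter, K_en, K_v_neg) + hinge(d_inter, K_en, K_q_neg)
    return l_intra, l_inter
```

Once the positive set scores clear both margins, both losses vanish and training is driven by the regularization terms.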
2.6) For the generated proposal results, regularization is adopted to introduce prior knowledge and stabilize the model training process, specifically:
considering that most of the time period is not selected, i.e. not matched with the text description, the invention uses a global loss function termTo reduce the average score of the proposal, where MenIs the number of all the time segments in the two-dimensional time segment graph;the term makes the unselected period score close to 0 whileAndthe term makes the proposal have a higher score.
Considering that the most accurate time interval proposal must be selected from a series of positive proposals as the final result, the invention applies the softmax function to all positive proposal scores and introduces a gap loss function term L_gap over the resulting distribution to encourage expansion of the score gap between the positive proposals.
Since the two branches have a consistent structure and shared parameters, the regularization strategy is applied only to the enhanced branch.
The final multitask loss function comprises the above four loss terms with corresponding hyper-parameters:

L = λ_1 L_inter + λ_2 L_intra + λ_3 L_global + λ_4 L_gap

where L is the multitask loss, L_inter is the inter-sample confrontation loss, L_intra is the intra-sample confrontation loss, L_global and L_gap are the two proposal regularization losses, and λ_1, λ_2, λ_3, λ_4 are hyper-parameters.
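A sketch of the regularization terms and the final weighted sum follows. The global term is the mean score over all time intervals, as described; the entropy form of the gap loss is an assumption, since the patent names the term but its exact formula is not reproduced here (minimizing the entropy of the softmax over positive scores widens the gap between them); the λ weights default to 1 purely for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def regularization_terms(all_scores, pos_scores):
    """L_global: mean score of every time interval in the 2D map (pushes
    unselected intervals toward 0). L_gap: entropy of the softmax over the
    positive proposal scores (assumed form; shrinking it expands the gap)."""
    l_global = float(np.mean(all_scores))
    p = softmax(np.asarray(pos_scores, dtype=float))
    l_gap = float(-(p * np.log(p + 1e-12)).sum())
    return l_global, l_gap

def multitask_loss(l_inter, l_intra, l_global, l_gap,
                   lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum L = λ1 L_inter + λ2 L_intra + λ3 L_global + λ4 L_gap."""
    l1, l2, l3, l4 = lambdas
    return l1 * l_inter + l2 * l_intra + l3 * l_global + l4 * l_gap
```

Equal positive scores give the maximal (uniform) entropy, so the gap term is largest exactly when no proposal stands out.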
In an embodiment of the present invention, a weakly supervised video time interval retrieval system based on a two-branch proposal network is also provided, comprising:
The data acquisition module is used for acquiring videos and description texts as the training data set when the system is in the training stage, and for acquiring the video to be detected and the query sentences when the system is in the detection stage.
The feature extraction module is used for extracting frame features from the videos and extracting text features from the description texts and the query sentences. Specifically, text features can be extracted with pre-trained GloVe word embeddings, and video features with a pre-trained video feature extractor.
The cross-modal language-aware filtering module receives the frame features V and the text features Q as input and outputs the enhanced video stream and the suppressed video stream containing text features:

V_en, V_sp = Filter(V, Q)

where V_en denotes the enhanced video stream and V_sp the suppressed video stream; the enhanced video stream highlights language-related key frame features and weakens irrelevant frame features, and the suppressed video stream does the opposite.
Specifically, the cross-modal language-aware filtering module comprises: a NetVLAD submodule that maps the text feature sequence Q to the cluster centers and generates the scene-based language feature sequence; a cross-modal estimation submodule that computes the cross-modal matching scores between the frame feature sequence V and the scene-based language feature sequence; and a two-branch gate module that generates the enhanced video stream and the suppressed video stream from the normalized score distribution and the frame feature sequence.
Regularized two-branch proposal network module: composed of an enhanced branch submodule and a suppressed branch submodule. It takes the generated enhanced video stream and the text features as input to the enhanced branch proposal network, outputs proposal results and scores, and screens them to obtain the positive proposal set P_en with corresponding scores C_en and boundaries L_en; it takes the generated suppressed video stream and the text features as input to the suppressed branch proposal network, outputs proposal results and scores, and screens them to obtain the negative proposal set P_sp with corresponding scores C_sp and boundaries L_sp. This can be expressed as:

P_en, L_en, C_en = EnhancedBranch_Θ(V_en, Q)
P_sp, L_sp, C_sp = SuppressedBranch_Θ(V_sp, Q)

where the shared subscript Θ indicates that the two branches share parameters.
The enhanced branch submodule comprises:
A cross-modal interaction unit, for summarizing the text features for each frame of the video to obtain the integrated text features;
A cross-gate submodule, for making the integrated text features and the enhanced video stream interact with each other to obtain the language-aware frame features;
A two-dimensional time interval feature map submodule, which constructs a two-dimensional time interval feature map with three dimensions: the first two dimensions index the start frame and end frame of a time interval, and the third is the feature dimension. It explores the relationships between adjacent time intervals, obtains the cross-modal features of all time intervals in the map from the language-aware frame features, and computes each proposal's score and boundary from those cross-modal features;
A proposal screening submodule, for screening proposals using the center-based proposal screening method and outputting the screened positive proposal set.
The suppressing branch submodule has the same structure as the enhancing branch submodule and shares its parameters, and finally generates the negative proposal set.
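The relation between the two streams fed into the shared-parameter branches can be sketched with complementary gates over the frame features; the s / (1 - s) form is an illustrative assumption consistent with the stated behavior (the enhanced stream highlights language-relevant frames, the suppressed stream keeps the rest):

```python
import numpy as np

def branch_gates(frame_feats, frame_scores):
    """Split one video into an enhanced and a suppressed stream from the
    normalized per-frame language-matching scores. The s and (1 - s)
    gates are an assumed form, not the patent's exact formula."""
    s = frame_scores[:, None]              # (n_v, 1), broadcast over features
    enhanced = s * frame_feats             # emphasizes language-relevant frames
    suppressed = (1.0 - s) * frame_feats   # complementary stream
    return enhanced, suppressed

feats = np.ones((4, 3))
scores = np.array([1.0, 0.5, 0.5, 0.0])
v_en, v_sp = branch_gates(feats, scores)
```

Because the two branches share parameters, the same proposal network is simply applied twice, once per stream.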
Wherein a proposal regularization strategy is introduced into the enhanced branch submodule and comprises two loss terms, specifically:
the first term excludes the interference of non-matching proposals by reducing the mean proposal score, so that the scores of unselected proposals are pushed close to 0 while the other losses keep the scores of selected proposals high;
the second term expands the score gap between positive proposals, so that the single most accurate proposal can be selected from the set of positive proposals as the final result.
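The exact loss formulas are not legible in this text (they appear as images in the original document), so the two terms can only be sketched; the forms below are illustrative assumptions matching the stated behavior:

```python
import numpy as np

def proposal_regularization(all_scores, pos_scores):
    """Sketch of the two regularization terms (assumed forms).
    term_mean: minimizing the mean of all proposal scores drives the many
        unselected proposals toward 0, while the main losses keep the
        selected proposals' scores high.
    term_gap: minimizing 1 - max(softmax(pos_scores)) widens the score
        gap among positive proposals so one clear winner emerges."""
    term_mean = float(np.mean(all_scores))
    p = np.exp(pos_scores - np.max(pos_scores))
    p /= p.sum()                    # softmax over positive proposal scores
    term_gap = 1.0 - float(p.max())
    return term_mean, term_gap

all_scores = np.array([0.9, 0.8, 0.1, 0.0, 0.0])
t_mean, gap_peaked = proposal_regularization(all_scores, np.array([5.0, 0.1, 0.1]))
_, gap_flat = proposal_regularization(all_scores, np.array([1.0, 1.0, 1.0]))
```

A peaked positive-score distribution yields a smaller gap loss than a flat one, which is the intended selection pressure.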
Meanwhile, the regularized two-branch proposal network adopts both inter-sample confrontation and intra-sample confrontation: intra-sample confrontation is encouraged by raising the positive proposal scores and lowering the negative proposal scores, and inter-sample confrontation is encouraged by raising the scores of matching samples and lowering the scores of non-matching samples.
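The two confrontations can be sketched as margin (hinge) losses over the aggregate scores K_en and K_sp named in claim 4; the hinge form and the margin values below are assumptions, since the text states only the direction of each term:

```python
def adversarial_losses(k_en, k_sp, k_mis, delta_intra=0.4, delta_inter=0.8):
    """Margin sketch of the two confrontations (assumed forms).
    intra: positive proposals (enhanced branch, k_en) should outscore the
        negative proposals (suppressed branch, k_sp) by delta_intra.
    inter: a matching video-text pair should outscore a randomly sampled
        non-matching pair (k_mis) by delta_inter."""
    l_intra = max(0.0, delta_intra - (k_en - k_sp))
    l_inter = max(0.0, delta_inter - (k_en - k_mis))
    return l_intra, l_inter

well_separated = adversarial_losses(k_en=2.0, k_sp=0.5, k_mis=0.2)  # margins met
too_close = adversarial_losses(k_en=1.0, k_sp=0.9, k_mis=0.9)       # hinges active
```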
In the specific embodiments provided in the present application, it should be understood that the above-described system embodiments are merely illustrative; for example, the division into a regularized two-branch proposal network module is merely a logical functional division, and other divisions are possible in actual implementation: multiple modules may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the connections between the modules shown or discussed may be communication connections via interfaces, which may be electrical or in other forms.
The method is applied in the following embodiments to demonstrate the technical effects of the present invention; the detailed steps in the embodiments are not described again.
Examples
The invention is experimentally verified on three datasets, namely Charades-STA, ActivityCaption and DiDeMo; the specific conditions of the three datasets are as follows:
The Charades-STA dataset contains 9848 indoor activity videos with an average video duration of 29.8 seconds; the dataset provides 12408 sentence-period pairs for training and 3720 for testing.
The ActivityCaption dataset contains 19209 videos of diverse content, with an average video duration of about 2 minutes; the dataset provides 37417, 17505 and 17031 sentence-period pairs for training, validation and testing, respectively.
The DiDeMo dataset contains 10464 videos, each 25-30 seconds long; the dataset provides 33005, 4180 and 4021 sentence-period pairs for training, validation and testing, respectively. In particular, each video in DiDeMo is divided into six five-second clips, and the target period always consists of one or more consecutive clips, so each DiDeMo video has only 21 candidate periods, far fewer than in Charades-STA and ActivityCaption.
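The 21-candidate figure follows directly from counting contiguous spans of six clips (6 + 5 + 4 + 3 + 2 + 1 = 21), which a quick enumeration confirms:

```python
def contiguous_spans(n_clips):
    """All periods made of one or more consecutive clips, as (first, last)."""
    return [(a, b) for a in range(n_clips) for b in range(a, n_clips)]

spans = contiguous_spans(6)  # DiDeMo: six five-second clips per video
```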
In terms of evaluation criteria, the present invention follows the widely used practice of adopting R@n, IoU=m as the criteria for Charades-STA and ActivityCaption, and Rank@1, Rank@5 and mIoU as the criteria for DiDeMo. More specifically, the invention first computes the IoU value between a predicted period and the ground truth; then R@n, IoU=m computes the percentage of samples for which at least one of the top n predicted periods has an IoU greater than m, mIoU computes the mean of the top-1 IoU values over all test samples, and Rank@1 and Rank@5 compute the percentage of samples whose ground-truth period is ranked first, or within the top five, respectively.
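The evaluation criteria can be stated precisely in code; this is the standard formulation of R@n, IoU=m and mIoU for this task, not taken verbatim from the patent:

```python
def temporal_iou(pred, gt):
    """IoU between a predicted period and the ground truth, as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def r_at_n_iou(ranked_preds, gts, n, m):
    """R@n, IoU=m: fraction of samples whose top-n predictions contain at
    least one period with IoU greater than m."""
    hits = sum(any(temporal_iou(p, gt) > m for p in preds[:n])
               for preds, gt in zip(ranked_preds, gts))
    return hits / len(gts)

def mean_iou(ranked_preds, gts):
    """mIoU: mean IoU of each sample's top-1 prediction."""
    return sum(temporal_iou(preds[0], gt)
               for preds, gt in zip(ranked_preds, gts)) / len(gts)

ranked_preds = [[(0, 10), (5, 15)], [(20, 30), (0, 5)]]  # per-sample ranked periods
gts = [(0, 10), (0, 5)]
```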
Tables 1 to 3 show the experimental results of the present invention (abbreviated RTBPN) on the three datasets Charades-STA, ActivityCaption and DiDeMo.
TABLE 1 Experimental results on Charades-STA data set
TABLE 2 Experimental results on ActivityCaption data set
Table 3 experimental results on the DiDeMo data set
Because weakly supervised algorithms have only coarse-grained sentence-level labels, their training often involves much invalid or even harmful learning, so under the same framework their results are much worse than those of fully supervised algorithms trained with period labels for each sentence.
However, as can be seen from Tables 1-3, the present invention considers both inter-sample and intra-sample confrontation: intra-sample confrontation is encouraged by the intra-sample loss, distinguishing the target period from similar interfering negative periods within the same data pair, and inter-sample confrontation is encouraged by the inter-sample loss, so that matching positive samples score higher than non-matching negative samples. Specifically, the enhanced video stream generated by the language-aware filter via the scene-based cross-modal estimation method highlights language-related key-frame features and weakens irrelevant ones, while the suppressed video stream does the opposite; together with the novel regularized two-branch proposal network and the center-based proposal screening technique, an excellent positive proposal set and a reasonable negative proposal set can be screened out. As a result, the video period retrieval performance of the invention exceeds that of the early fully supervised algorithms VSA-RNN, VSA-STV and CTRL, and is comparable to TGN, QSPN and MCN, which demonstrates the superiority of the invention.
Compared with other existing weakly supervised algorithms, the method falls only slightly behind in the single case of R@1, IoU=0.3 on the ActivityCaption dataset, and improves upon them in all other cases, which shows that the method surpasses existing weakly supervised algorithms in video period retrieval performance.
The foregoing lists merely illustrate specific embodiments of the invention. It is obvious that the invention is not limited to the above embodiments, but that many variations are possible. All modifications which can be derived or suggested by a person skilled in the art from the disclosure of the present invention are to be considered within the scope of the invention.
Claims (7)
1. A weakly supervised video time interval retrieval method based on two-branch proposed network is characterized by comprising the following steps:
1) establishing a network model consisting of a cross-modal language perception filter and a regularized two-branch proposal network, wherein the regularized two-branch proposal network comprises an enhanced branch proposal network and a suppressed branch proposal network;
2) acquiring a video and a description text as a training data set, and extracting frame characteristics of the video and text characteristics of the description text;
3) taking the frame characteristics and the text characteristics as the input of a cross-modal language perception filter, and generating an enhanced video stream and a suppressed video stream with the text characteristics; the cross-modal language perception filter generates an enhanced video stream and a suppressed video stream with text features by using a scene-based cross-modal estimation method, and specifically comprises the following steps:
3.1) projecting the text feature sequence {q_i}_{i=1}^{n_q} onto clustering centers using the NetVLAD technique to obtain a scene-based language feature sequence {u_j}_{j=1}^{n_c}, wherein n_q is the number of words, q_i is the semantic feature of the ith word, n_c is the number of centers, and u_j is the language feature of the jth scene;
3.2) computing the cross-modal matching scores between the frame feature sequence {v_i}_{i=1}^{n_v} and the scene-based language feature sequence {u_j}_{j=1}^{n_c}, wherein n_v is the number of frame features and v_i is the frame feature of the ith frame in the video;
3.3) calculating the score of each frame in the video according to the cross-modal matching score and carrying out normalization processing to obtain normalized score distribution;
3.4) adopting two branch gates to process the normalized score distribution and the frame feature sequence {v_i}_{i=1}^{n_v} to generate an enhanced video stream and a suppressed video stream;
4) taking the generated enhanced video stream and text characteristics as the input of an enhanced branch proposal network, outputting a proposal result and a score, and screening to obtain a positive proposal set; the generated inhibition video stream and text characteristics are used as the input of an inhibition branch proposal network, the proposal result and the score are output, and a negative proposal set is obtained by screening;
the enhanced branch proposal network is specifically as follows:
4.1) taking the generated enhanced video stream and text features as the input of the enhanced branch proposal network, and adopting a frame-to-word attention structure to summarize the text features for each frame to obtain integrated text features;
4.2) interacting the integrated text features and the enhanced video stream with each other through a cross gate to obtain language-aware frame features;
4.3) constructing a two-dimensional time period feature map through a two-dimensional temporal network according to the frame features of language perception; the constructed two-dimensional time interval feature map comprises three dimensions: the first two dimensions are used for indexing a start frame and an end frame of a period, and the third dimension is a characteristic dimension;
performing two-dimensional convolution twice according to the constructed two-dimensional time interval characteristic diagram to obtain cross-modal characteristics of all time intervals in the two-dimensional diagram; according to the cross-modal characteristics of all time intervals in the two-dimensional graph, each time interval corresponds to one proposed result, and the score of each proposed result is calculated;
4.4) screening the proposals using center-based proposal screening: taking the proposal result with the highest score as the center proposal, sorting the remaining proposals by their period overlap with the center proposal, and taking the T-1 proposals with the highest overlap; the center proposal and these T-1 proposals constitute the screened positive proposals, forming the positive proposal set; the positive proposal set is a set consisting of the positive proposal results, scores and boundaries;
the suppressing branch proposal network and the enhancing branch proposal network have consistent structures and shared parameters;
5) introducing proposal regularization into the enhanced branch proposal network, calculating a multitask loss function through intra-sample confrontation, inter-sample confrontation and proposal regularization, and updating parameters of a cross-modal language perception filter and a regularized two-branch proposal network to obtain a trained network model;
6) for the video and the query sentence to be detected, the frame characteristics of the video and the text characteristics of the query sentence are respectively extracted, the frame characteristics and the text characteristics are used as the input of a trained network model, and a positive proposal with the highest predicted score is obtained as a retrieval result.
2. The method for retrieving the video time interval under the weak supervision based on the two-branch proposed network as claimed in claim 1, wherein the step 2) is specifically as follows:
2.1) acquiring a video and a description text as a training data set;
2.2) extracting word features from the text using the pre-trained GloVe word2vec embedding method, then taking the word features as the input of a Bi-GRU network and learning word semantic representations with context information as the text feature sequence {q_i}_{i=1}^{n_q}, wherein n_q is the number of words in the description text and q_i is the semantic feature of the ith word;
2.3) extracting visual features of the video using a pre-trained video feature extractor, and reducing the length of the visual feature sequence by temporal average pooling to obtain the frame feature sequence {v_i}_{i=1}^{n_v} of the video, wherein n_v is the number of frame features and v_i is the frame feature of the ith frame in the video; the pre-trained video feature extractor differs with the training dataset: C3D features are extracted for the Charades-STA and ActivityCaption datasets, and optical-flow features are extracted for the DiDeMo dataset.
3. The method according to claim 1, wherein proposal regularization is introduced into the enhanced branch proposal network and fed back to the suppressed branch proposal network in the form of shared parameters;
the proposed regularization loss function formula is specifically:
wherein the two terms are the two losses of proposal regularization, M_en is the number of all periods in the two-dimensional period feature map, the per-proposal quantities are the score of the ith positive proposal and that score processed by the softmax function, and T denotes the number of positive proposals in the positive proposal set.
4. The method according to claim 1, wherein the inter-sample countermeasure and intra-sample countermeasure loss function formulas are as follows:
wherein the first loss is the inter-sample adversarial loss and the second is the intra-sample adversarial loss, K_en is the sum of all positive proposal scores, K_sp is the sum of all negative proposal scores, Δ_intra is the score margin for the positive proposals, Δ_inter is the score margin for non-matching (negative) samples, and the remaining quantities are the matching scores between the current data and a randomly sampled non-matching video or text.
5. The method according to claim 3 or 4, wherein the multitask loss function is specifically:
6. A weakly supervised video time interval retrieval system based on two-branch proposed network, for implementing the weakly supervised video time interval retrieval method of claim 1, the weakly supervised video time interval retrieval system comprising:
the data acquisition module, for acquiring videos and description texts as the training dataset when the system is in the training stage, and for acquiring the video to be detected and the query sentence when the system is in the detection stage;
the feature extraction module, for extracting frame features from the video and extracting text features from the description text and the query sentence;
a cross-modal language-aware filtering module, for receiving the frame features {v_i}_{i=1}^{n_v} and the text features {q_i}_{i=1}^{n_q} as inputs and outputting an enhanced video stream containing text features and a suppressed video stream, wherein n_v is the number of frame features, v_i is the frame feature of the ith frame in the video, n_q is the number of words, and q_i is the semantic feature of the ith word; the module comprises a NetVLAD submodule for generating the clustering centers and the scene-based language feature sequence {u_j}_{j=1}^{n_c} from the text feature sequence, a cross-modal estimation submodule for computing the cross-modal matching scores between the frame feature sequence and the scene-based language feature sequence, and two branch-gate submodules for generating the enhanced video stream and the suppressed video stream from the normalized score distribution and the frame feature sequence; wherein n_c is the number of centers and u_j is the language feature of the jth scene;
a regularized two-branch proposal network module: composed of an enhancement branch submodule and a suppression branch submodule, wherein the enhancement branch submodule takes the generated enhanced video stream and text features as the input of the enhanced branch proposal network, outputs proposal results and scores, and screens them to obtain a positive proposal set; and the suppression branch submodule takes the generated suppressed video stream and text features as the input of the suppressed branch proposal network, outputs proposal results and scores, and screens them to obtain a negative proposal set.
7. The system of claim 6, wherein the enhanced branch sub-module comprises:
the cross-mode interaction unit is used for summarizing the text features of each frame of the video to obtain integrated text features;
a Bi-GRU sub-module for causing the integrated text features and the enhanced video stream to interact with each other to obtain language-aware frame features;
the two-dimensional time interval feature map submodule, which constructs a two-dimensional time interval feature map with three dimensions: the first two dimensions index the start frame and end frame of a time interval, and the third is the feature dimension; the submodule is used for modeling the relations between adjacent time intervals, obtaining the cross-modal features of all time intervals in the two-dimensional map from the language-aware frame features, and outputting proposal results and scores from these cross-modal features;
a proposal screening submodule that screens proposals using a center-based proposal screening method to screen a positive proposal set from the proposal results.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011332463.XA CN112417206B (en) | 2020-11-24 | 2020-11-24 | Weak supervision video time interval retrieval method and system based on two-branch proposed network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112417206A CN112417206A (en) | 2021-02-26 |
CN112417206B true CN112417206B (en) | 2021-09-24 |
Family
ID=74778792
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011332463.XA Active CN112417206B (en) | 2020-11-24 | 2020-11-24 | Weak supervision video time interval retrieval method and system based on two-branch proposed network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112417206B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112685597B (en) * | 2021-03-12 | 2021-07-13 | 杭州一知智能科技有限公司 | Weak supervision video clip retrieval method and system based on erasure mechanism |
CN113792594B (en) * | 2021-08-10 | 2024-04-12 | 南京大学 | Method and device for locating language fragments in video based on contrast learning |
CN113836901B (en) * | 2021-09-14 | 2023-11-14 | 灵犀量子(北京)医疗科技有限公司 | Method and system for cleaning Chinese and English medical synonym data |
CN113806482B (en) * | 2021-09-17 | 2023-12-12 | ***数智科技有限公司 | Cross-modal retrieval method, device, storage medium and equipment for video text |
CN113869272A (en) * | 2021-10-13 | 2021-12-31 | 北京达佳互联信息技术有限公司 | Processing method and device based on feature extraction model, electronic equipment and medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107679580A (en) * | 2017-10-21 | 2018-02-09 | 桂林电子科技大学 | A kind of isomery shift image feeling polarities analysis method based on the potential association of multi-modal depth |
EP3620976A1 (en) * | 2018-09-07 | 2020-03-11 | Volvo Car Corporation | Methods and systems for providing fast semantic proposals for image and video annotation |
CN111582170A (en) * | 2020-05-08 | 2020-08-25 | 浙江大学 | Method and positioning system for completing specified object positioning task in video by using object-aware multi-branch relation network |
CN111931571A (en) * | 2020-07-07 | 2020-11-13 | 华中科技大学 | Video character target tracking method based on online enhanced detection and electronic equipment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11195057B2 (en) * | 2014-03-18 | 2021-12-07 | Z Advanced Computing, Inc. | System and method for extremely efficient image and pattern recognition and artificial intelligence platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||