CN112417206B - Weak supervision video time interval retrieval method and system based on two-branch proposed network - Google Patents
- Publication number
- Publication number: CN112417206B (application CN202011332463.XA)
- Authority
- CN
- China
- Prior art keywords
- proposal
- video
- branch
- text
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F16/735—Filtering based on additional data, e.g. user or group profiles
- G06F16/783—Retrieval characterised by using metadata automatically derived from the content
- G06F16/7844—Retrieval using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
- G06F16/7867—Retrieval using metadata, using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
Abstract
The invention discloses a weakly supervised video time interval retrieval method and system based on a two-branch proposal network, belonging to the field of video retrieval. The method mainly comprises the following steps: 1) for a training set of videos and description texts, a cross-modal language perception filter learns a joint representation of the video and text information and generates an enhanced video stream and a suppressed video stream carrying the text information; 2) from the output of the cross-modal language perception filter, a parameter-sharing regularized two-branch proposal network outputs a time interval answer for the joint video-text representation. The invention adopts a language perception filter that uses a scene-based cross-modal estimation method to generate the enhanced and suppressed video streams, adopts a novel two-branch proposal network that simultaneously considers inter-sample and intra-sample confrontation, and adopts a proposal regularization strategy to stabilize the training process, effectively improving model performance.
Description
Technical Field
The invention relates to the field of video time interval retrieval, and in particular to a weakly supervised video time interval retrieval method and system based on a two-branch proposal network.
Background
Video time interval retrieval is an important problem in the field of video retrieval; it aims to automatically locate a target time interval in an untrimmed video according to a given description text.
Video time interval retrieval is an interdisciplinary field between computer vision and natural language processing. A retrieval model must understand not only the visual and textual content but also the correlation between them. Most existing methods are trained in a fully supervised setting on aligned, annotated video-text pairs, which is time-consuming and expensive, especially for ambiguous descriptions. Recently, researchers have begun exploring weakly supervised time interval retrieval using only video-level sentence annotations.
Most existing weakly supervised time interval retrieval methods are based on multi-instance learning (MIL): a matched video-text pair is regarded as a positive sample and an unmatched pair as a negative sample. These methods mainly exploit inter-sample confrontation to judge whether a video matches a given textual description, but ignore intra-sample confrontation, i.e., determining which time interval within the video best matches the description. Given a matched video-text pair, the video typically contains continuous content including negative time intervals that are highly relevant to the textual description yet not a perfect match; these are difficult to distinguish from the target time interval. It is therefore necessary to develop sufficient intra-sample confrontation between time intervals with similar content in a video.
In summary, the prior art cannot effectively exploit intra-sample confrontation between adjacent time intervals of a video, so its performance is limited in scenes with similar content and it cannot accurately locate time interval boundaries.
Disclosure of Invention
The invention aims to solve the above problems in the prior art and provides a weakly supervised video time interval retrieval method and system based on a two-branch proposal network.
In order to achieve the purpose, the invention specifically adopts the following technical scheme:
A weakly supervised video time interval retrieval method based on a two-branch proposal network comprises the following steps:
1) establishing a network model consisting of a cross-modal language perception filter and a regularized two-branch proposal network, wherein the regularized two-branch proposal network comprises an enhanced branch proposal network and a suppressed branch proposal network;
2) acquiring a video and a description text as a training data set, and extracting frame characteristics of the video and text characteristics of the description text;
3) taking the frame characteristics and the text characteristics as the input of a cross-modal language perception filter, and generating an enhanced video stream and a suppressed video stream with the text characteristics;
4) taking the generated enhanced video stream and text features as the input of the enhanced branch proposal network, outputting proposal results and scores, and screening to obtain a positive proposal set; taking the generated suppressed video stream and text features as the input of the suppressed branch proposal network, outputting proposal results and scores, and screening to obtain a negative proposal set;
5) introducing proposal regularization into the enhanced branch proposal network, calculating a multitask loss function through intra-sample confrontation, inter-sample confrontation and proposal regularization, and updating parameters of a cross-modal language perception filter and a regularized two-branch proposal network to obtain a trained network model;
6) for the video and the query sentence to be detected, the frame characteristics of the video and the text characteristics of the query sentence are respectively extracted, the frame characteristics and the text characteristics are used as the input of a trained network model, and a positive proposal with the highest predicted score is obtained as a retrieval result.
Another objective of the present invention is to provide a weakly supervised video time interval retrieval system based on a two-branch proposal network, for implementing the above retrieval method.
The weakly supervised video period retrieval system comprises:
The data acquisition module is used for acquiring videos and description texts as the training data set when the system is in the training stage, and for acquiring the video to be detected and the query sentences when the system is in the detection stage.
The feature extraction module is used for extracting frame features from the videos and extracting text features from the description texts and the query sentences.
And the cross-modal language perception filtering module is used for receiving the frame characteristics and the text characteristics as input and outputting the enhanced video stream and the suppressed video stream containing the text characteristics.
Regularized two-branch proposal network module: composed of an enhanced branch submodule and a suppressed branch submodule, used for taking the generated enhanced video stream and text features as input to the enhanced branch proposal network, outputting proposal results and scores, and screening to obtain a positive proposal set; and for taking the generated suppressed video stream and text features as input to the suppressed branch proposal network, outputting proposal results and scores, and screening to obtain a negative proposal set.
Compared with conventional methods, the invention effectively improves the performance of video time interval retrieval, specifically embodied as follows:
(1) Aiming at the problem that conventional methods ignore intra-sample confrontation, the invention designs a novel regularized two-branch proposal network. Each branch comprises: a cross-modal interaction unit that integrates textual cues into the visual features to generate language-aware frame features; a two-dimensional time interval feature map that is generated from the language-aware frame features and processed by convolution to explore the relationships between adjacent time intervals; and a proposal screening module that selects the best proposals. By receiving the enhanced video stream, the suppressed video stream, and the text features, the network generates a series of matched positive and negative proposals, along with a score and boundary for each proposal, for weakly supervised video time interval retrieval; using a center-based proposal screening technique, a superior set of positive proposals and a reasonable set of negative proposals are screened out.
In addition, the method considers inter-sample confrontation and intra-sample confrontation at the same time. Intra-sample confrontation is encouraged through the intra-sample loss, distinguishing the target time interval from similar interfering negative time intervals in the same data pair; inter-sample confrontation is encouraged through the inter-sample loss, so that matching positive samples score higher than non-matching negative samples. The model thus judges not only whether a video matches a given textual description but also which time interval best matches it, so that negative time intervals that are highly correlated with the description yet not a complete match can be distinguished from the target time interval, improving the accuracy of the retrieval result.
(2) Aiming at the problem that intra-sample confrontation can generate overly simple, ineffective negative samples, the invention designs a language perception filter that uses a scene-based cross-modal estimation method: it projects the text features onto cluster centers with the NetVLAD technique to generate a scene-based language feature sequence; it then computes cross-modal matching scores between the frame feature sequence and the scene-based language feature sequence, obtains a score for each frame, and normalizes the scores; finally, two branch gates generate an enhanced video stream and a suppressed video stream from the normalized score distribution and the frame feature sequence. The enhanced video stream highlights language-related key frame features and weakens irrelevant ones, and the suppressed video stream does the opposite.
(3) To exploit prior knowledge that benefits model training, a proposal regularization strategy is designed into the regularized two-branch proposal network; because the two branches have a consistent structure and shared parameters, the strategy need only be applied to the enhanced branch. In particular, considering that most time intervals are unselected, i.e., not matched with the text description, the invention uses a global loss function term to reduce the average proposal score so that the scores of unselected time intervals are close to 0. Considering that the most accurate time interval proposal must be selected from a series of proposals as the final result, the invention applies the softmax function to all positive proposals and introduces a gap loss function term to encourage expansion of the score gap between them. In summary, the designed proposal regularization strategy reduces the average score of all proposals to weaken the influence of irrelevant proposals, and enlarges the differences between proposal scores to help select the optimal proposal, stabilizing model training and improving model performance.
Drawings
Fig. 1 is a schematic diagram of a network model used in the present invention.
FIG. 2 is a schematic diagram of a cross-modal language-aware filter used in the present invention.
FIG. 3 is a schematic diagram of the structure of an enhanced branch in a regularized two-branch proposed network used by the present invention.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in fig. 1, the weakly supervised video time interval retrieval method based on a two-branch proposal network provided by the present invention includes the following steps:
firstly, extracting frame characteristics of a video and text characteristics of a description text for an input video and the input text; and then generating an enhanced video stream with text features and a suppressed video stream through a cross-modal language perception filter.
And step two, generating a series of regularized positive and negative proposals and scores and boundaries thereof for the generated enhanced video stream with text characteristics and the generated suppressed video stream through a regularized two-branch proposal network, and calculating a multitask loss function through intra-sample confrontation, inter-sample confrontation and proposal regularization so as to update model parameters of the cross-modal language perception filter and the regularized two-branch proposal network.
Step three, for the video and text whose answer is to be predicted, the trained cross-modal language perception filter and regularized two-branch proposal network output the positive proposal time interval with the highest predicted score as the retrieval result.
In one embodiment of the present invention, the first step is performed as follows:
1.1) acquiring a video and a description text as a training data set, and extracting frame characteristics and text characteristics;
the frame feature extraction method specifically comprises the following steps: extracting visual features of the video by using a pre-trained video feature extractor, and reducing the length of a visual feature sequence by using time sequence average pooling to obtain a frame feature sequence of the videoWherein n isvIs a characteristic number, viIs the frame characteristic of the ith frame in the video; the pre-trained video feature extractor is different from data set to data set, C3D features are extracted from a Charads-STA data set and an ActivityCaption data set, and optical flow features are extracted from a DiDeMo data set.
The text feature extraction method is specifically as follows: extract word features with pre-trained GloVe word embeddings, then feed the word features into a Bi-GRU network to learn word semantic representations with context information, obtaining the text feature sequence Q = {q_1, …, q_{n_q}}, where n_q is the number of words and q_i is the semantic feature of the i-th word.
1.2) taking the frame characteristics and the text characteristics as the input of a cross-modal language perception filter, and generating an enhanced video stream with the text characteristics and a suppressed video stream.
The structure of the cross-modal language perception filter is shown in fig. 2, and a scene-based cross-modal estimation method is used, and specifically comprises the following steps:
First, the present invention projects the text features Q onto cluster centers using the NetVLAD technique. Specifically, given a set of trainable center vectors C = {c_1, …, c_{n_c}}, where n_c is the number of centers and c_j is the j-th center, NetVLAD accumulates the residuals between the text features and the center vectors by soft assignment:

α_i = softmax(W_c q_i + b_c),  u_j = Σ_{i=1}^{n_q} α_ij (q_i − c_j)

where W_c and b_c are a projection matrix and an offset vector, α_i is the soft-assignment vector over the n_c cluster centers, α_ij is the soft-assignment coefficient between the i-th text feature and the j-th cluster center, and u_j is the feature accumulated at the j-th center. Each center can be regarded as a language scene and u_j as a scene-based language feature, finally yielding the scene-based language feature sequence U = {u_1, …, u_{n_c}}.
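The NetVLAD soft-assignment and residual accumulation described above can be sketched as follows, with the trainable parameters W_c, b_c and the centers C passed in as plain arrays. This is an illustrative numpy version, not the patent's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def netvlad(Q, C, W_c, b_c):
    """Q: (n_q, d) text features; C: (n_c, d) trainable centers.
    alpha: (n_q, n_c) soft-assignment coefficients; u_j accumulates
    the residuals q_i - c_j weighted by alpha_ij."""
    alpha = softmax(Q @ W_c + b_c, axis=1)           # (n_q, n_c)
    resid = Q[:, None, :] - C[None, :, :]            # (n_q, n_c, d)
    return (alpha[:, :, None] * resid).sum(axis=0)   # (n_c, d) scene features
```

Each row of the output is one scene-based language feature u_j.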
Next, the invention computes the cross-modal matching score β_ij between the frame feature sequence V = {v_i} and the scene-based language feature sequence U = {u_j}:

β_ij = σ(w_a (W_1 v_i + W_2 u_j + b_a))

where W_1 and W_2 are projection matrices, b_a is an offset vector, w_a is a row vector, σ is the sigmoid activation function, and β_ij ∈ (0, 1) represents the matching score between the i-th frame feature and the j-th scene-based language feature. Through this formula, an intermediate semantic space is introduced for the text and the video.
Considering the definition of an important frame, namely a frame to which some language scene is closely related, the invention evaluates each frame with an overall score; specifically, the overall score of the i-th frame is β_i = max_j β_ij, where the index j ranges over the scene dimension. Meanwhile, to prevent a score distribution with too little separation, for example all frame scores close to 0 or 1, the distribution is adjusted by max-min normalization:

β̂_i = (β_i − min_k β_k) / (max_k β_k − min_k β_k)

This yields a normalized score distribution β̂ = {β̂_1, …, β̂_{n_v}} over the frames, where β̂_i represents the correlation between the i-th frame and the description text.
Finally, the present invention uses a two-branch gate to generate the enhanced video stream V_en and the suppressed video stream V_sp:

v_en_i = β̂_i · v_i,  v_sp_i = (1 − β̂_i) · v_i

where v_en_i is the i-th enhanced frame feature of the enhanced video stream V_en and v_sp_i is the i-th suppressed frame feature of the suppressed video stream V_sp. Based on the normalized scores, the enhanced video stream highlights key frames and attenuates the effect of non-key frames, while the suppressed video stream does the opposite.
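Putting the max-min normalization and the two branch gates together, a minimal numpy sketch follows (the function name is an assumption; raw frame scores are passed in):

```python
import numpy as np

def language_aware_filter(V, beta):
    """V: (n_v, d) frame features; beta: (n_v,) raw frame scores.
    Min-max normalise the scores, then gate the stream twice."""
    b = (beta - beta.min()) / (beta.max() - beta.min() + 1e-8)
    V_en = b[:, None] * V          # enhanced: keeps language-related frames
    V_sp = (1.0 - b)[:, None] * V  # suppressed: keeps the complement
    return V_en, V_sp, b
```

Note that by construction the two streams are complementary: their sum recovers the original frame features.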
In one embodiment of the present invention, the implementation of step two is as follows:
Taking the generated enhanced video stream, suppressed video stream, and text features as the input of the regularized two-branch proposal network, the final regularized time interval proposal results and scores are output; these results are used to compute the multitask loss and update the parameters of the cross-modal language perception filter and the regularized two-branch proposal network, yielding the final network model.
2.1) The enhanced branch and the suppressed branch of the regularized two-branch proposal network have a consistent structure and share parameters; the proposal-selection flow of the enhanced branch is shown in fig. 3, specifically:
Given the enhanced video stream V_en and the text features Q, a cross-modal interaction unit integrates the textual cues into the visual features. Specifically, a frame-to-word attention structure summarizes the text features for each frame:

δ_ij = w_m (W_3 v_en_i + W_4 q_j + b_m),  δ̂_ij = exp(δ_ij) / Σ_k exp(δ_ik),  h_i = Σ_{j=1}^{n_q} δ̂_ij q_j

where h_i is the integrated text representation of the i-th frame, W_3 and W_4 are projection matrices, b_m is an offset vector, w_m is a row vector, δ_ij is the matching score between the i-th enhanced frame feature and the semantic feature of the j-th word in the description text, and δ̂_ij is the matching score after softmax processing.
2.2) A cross gate then lets the enhanced frame features and the integrated text features interact with each other:

g_v_i = σ(W_v h_i + b_v),  g_t_i = σ(W_t v_en_i + b_t),  f_i = [v_en_i ⊙ g_v_i ; h_i ⊙ g_t_i]

where g_v_i is the visual gate, g_t_i is the text gate, ⊙ denotes element-wise multiplication, W_v and W_t are projection matrices, and b_v and b_t are offset vectors; concatenating the gated visual feature and the gated text feature yields the language-aware frame feature f_i.
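The cross-gate interaction of step 2.2 can be sketched for a single frame as follows; the parameter names mirror the text (W_v, b_v, W_t, b_t), but the function itself is an illustrative assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_gate(v, h, W_v, b_v, W_t, b_t):
    """v: (d,) enhanced frame feature; h: (d,) integrated text feature.
    Each modality gates the other, then the gated results are concatenated."""
    g_v = sigmoid(h @ W_v + b_v)   # visual gate computed from the text side
    g_t = sigmoid(v @ W_t + b_t)   # text gate computed from the visual side
    return np.concatenate([v * g_v, h * g_t])  # language-aware frame feature
```

With zero weights both gates evaluate to 0.5, so the output is simply half of each input side concatenated.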
2.3) From the language-aware frame features F = {f_1, …, f_{n_v}}, a two-dimensional temporal network constructs a two-dimensional time interval feature map. The constructed map has three dimensions: the first two index the start frame and end frame of a time interval, and the third is the feature dimension. The feature of time interval [a, b] is computed by pooling the language-aware frame features f_a, …, f_b; cells with a > b are treated as invalid and filled with 0. In addition, when n_v is too large, the model applies sparse sampling to save computation.
According to the constructed two-dimensional time interval feature map, two two-dimensional convolutions are applied to explore the relationships between adjacent time intervals, yielding the cross-modal features {m_1, …, m_{M_en}}, where m_i is the cross-modal feature corresponding to the i-th time interval in the map and M_en is the number of time intervals in the map. A proposal score is computed for every time interval, each corresponding to one enhanced proposal; the i-th enhanced proposal score is:

c_en_i = σ(W_p m_i + b_p)

where W_p and b_p are a projection matrix and an offset vector.
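The two-dimensional time interval feature map of step 2.3 can be sketched as below. Max pooling over the span [a, b] is an assumption in the style of 2D temporal maps, since the patent's exact pooling formula is not reproduced here; cells with a > b stay zero, matching the invalid-cell rule.

```python
import numpy as np

def build_2d_map(F):
    """F: (n_v, d) language-aware frame features.
    M[a, b] = max-pooled feature over frames a..b; a > b cells remain 0."""
    n_v, d = F.shape
    M = np.zeros((n_v, n_v, d))
    for a in range(n_v):
        run = F[a].copy()
        M[a, a] = run
        for b in range(a + 1, n_v):
            run = np.maximum(run, F[b])  # extend the interval by one frame
            M[a, b] = run
    return M
```

The upper triangle (including the diagonal) holds one feature per candidate interval; a 2D convolution over this map then relates adjacent intervals.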
2.4) A center-based proposal screening method filters proposals according to the computed scores. Specifically, the proposal with the highest score is taken as the center proposal, the remaining proposals are ranked by their time interval overlap with it, and the T−1 proposals with the highest overlap are taken; these T proposals form the positive proposal set P_en, with corresponding scores C_en and boundaries L_en.
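The center-based screening of step 2.4 can be sketched as follows; `t_iou` measures temporal overlap (intersection over union of two intervals), and the helper names are illustrative, not from the patent.

```python
import numpy as np

def t_iou(p, q):
    """Temporal IoU of two (start, end) intervals."""
    inter = max(0.0, min(p[1], q[1]) - max(p[0], q[0]))
    union = max(p[1], q[1]) - min(p[0], q[0])
    return inter / union if union > 0 else 0.0

def center_based_filter(boundaries, scores, T):
    """Pick the highest-scoring proposal as the centre, then keep the
    T-1 remaining proposals with the largest temporal IoU to it."""
    center = int(np.argmax(scores))
    rest = [i for i in range(len(scores)) if i != center]
    rest.sort(key=lambda i: t_iou(boundaries[i], boundaries[center]),
              reverse=True)
    return [center] + rest[:T - 1]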
Through this screening, a series of related positive proposals is selected; similarly, the suppressed branch effectively generates reasonable negative proposals, recorded as the negative proposal set P_sp with corresponding scores C_sp and boundaries L_sp.
2.5) Based on the generated positive and negative proposals and their scores, an enhancement score K_en and a suppression score K_sp are computed (e.g., by aggregating the scores of the respective proposal sets), and the intra-sample loss function term is

L_intra = max(0, Δ_intra − K_en + K_sp)

where Δ_intra is a margin value (taken as 0.4 in the present invention). L_intra encourages intra-sample confrontation by increasing the positive proposal score and decreasing the negative proposal score, distinguishing the target time interval from similar interfering negative time intervals within the same data pair.
The enhanced branch score K_en is computed from the generated positive proposals. An unmatched video and an unmatched text are randomly sampled from data that does not match the current pair, and the corresponding enhanced proposal scores K_en_v′ and K_en_q′ are computed as above; these are the matching scores between the current data and the randomly sampled unmatched video or text. The inter-sample loss function term is

L_inter = max(0, Δ_inter − K_en + K_en_v′) + max(0, Δ_inter − K_en + K_en_q′)

where Δ_inter is a margin value (taken as 0.6 in the present invention). L_inter encourages inter-sample confrontation by increasing the matching sample score and decreasing the non-matching sample scores, so that a matching positive sample scores higher than a non-matching negative sample.
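The two hinge-style confrontation losses can be sketched as below, with the margins Δ_intra = 0.4 and Δ_inter = 0.6 stated in the text; how the scalar scores K are aggregated from the proposal sets is left to the caller, and the function names are illustrative.

```python
def hinge(margin, pos, neg):
    """Standard margin hinge: zero once pos exceeds neg by the margin."""
    return max(0.0, margin - pos + neg)

def confrontation_losses(K_en, K_sp, K_v_neg, K_q_neg,
                         d_intra=0.4, d_inter=0.6):
    """K_en/K_sp: aggregated scores of the enhanced (positive) and
    suppressed (negative) proposal sets of one matched pair;
    K_v_neg/K_q_neg: enhanced scores obtained with a randomly sampled
    unmatched video / unmatched text."""
    l_intra = hinge(d_intra, K_en, K_sp)
    l_inter = hinge(d_inter, K_en, K_v_neg) + hinge(d_inter, K_en, K_q_neg)
    return l_intra, l_inter
```

Once the positive set scores clear both margins, both losses vanish and training is driven by the regularization terms.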
2.6) For the generated proposal results, regularization is adopted to introduce prior knowledge and stabilize the model training process, specifically:
considering that most of the time period is not selected, i.e. not matched with the text description, the invention uses a global loss function termTo reduce the average score of the proposal, where MenIs the number of all the time segments in the two-dimensional time segment graph;the term makes the unselected period score close to 0 whileAndthe term makes the proposal have a higher score.
Considering that the most accurate time interval proposal must be selected from a series of positive proposals as the final result, the invention applies the softmax function to all positive proposal scores and introduces a gap loss function term L_gap over the resulting distribution to encourage expansion of the score gap between the positive proposals.
Since the two branches have a consistent structure and shared parameters, the regularization strategy is applied only to the enhanced branch.
The final multitask loss function comprises the above four loss terms with corresponding hyper-parameters:

L = λ_1 L_inter + λ_2 L_intra + λ_3 L_global + λ_4 L_gap

where L is the multitask loss, L_inter is the inter-sample confrontation loss, L_intra is the intra-sample confrontation loss, L_global and L_gap are the two proposal regularization losses, and λ_1, λ_2, λ_3, λ_4 are hyper-parameters.
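A sketch of the regularization terms and the final weighted sum follows. The global term is the mean score over all time intervals, as described; the entropy form of the gap loss is an assumption, since the patent names the term but its exact formula is not reproduced here (minimizing the entropy of the softmax over positive scores widens the gap between them); the λ weights default to 1 purely for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def regularization_terms(all_scores, pos_scores):
    """L_global: mean score of every time interval in the 2D map (pushes
    unselected intervals toward 0). L_gap: entropy of the softmax over the
    positive proposal scores (assumed form; shrinking it expands the gap)."""
    l_global = float(np.mean(all_scores))
    p = softmax(np.asarray(pos_scores, dtype=float))
    l_gap = float(-(p * np.log(p + 1e-12)).sum())
    return l_global, l_gap

def multitask_loss(l_inter, l_intra, l_global, l_gap,
                   lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum L = λ1 L_inter + λ2 L_intra + λ3 L_global + λ4 L_gap."""
    l1, l2, l3, l4 = lambdas
    return l1 * l_inter + l2 * l_intra + l3 * l_global + l4 * l_gap
```

Equal positive scores give the maximal (uniform) entropy, so the gap term is largest exactly when no proposal stands out.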
In an embodiment of the present invention, a weakly supervised video time interval retrieval system based on a two-branch proposal network is also provided, comprising:
The data acquisition module is used for acquiring videos and description texts as the training data set when the system is in the training stage, and for acquiring the video to be detected and the query sentences when the system is in the detection stage.
The feature extraction module is used for extracting frame features from the videos and extracting text features from the description texts and the query sentences. Specifically, text features can be extracted with pre-trained GloVe word embeddings, and video features with a pre-trained video feature extractor.
The cross-modal language-aware filtering module receives the frame features V and the text features Q as input and outputs the enhanced video stream and the suppressed video stream containing text features:

V_en, V_sp = Filter(V, Q)

where V_en denotes the enhanced video stream and V_sp the suppressed video stream; the enhanced video stream highlights language-related key frame features and weakens irrelevant frame features, and the suppressed video stream does the opposite.
Specifically, the cross-modal language-aware filtering module comprises: a NetVLAD submodule that maps the text feature sequence Q to the cluster centers and generates the scene-based language feature sequence; a cross-modal estimation submodule that computes the cross-modal matching scores between the frame feature sequence V and the scene-based language feature sequence; and a two-branch gate module that generates the enhanced video stream and the suppressed video stream from the normalized score distribution and the frame feature sequence.
Regularized two-branch proposal network module: composed of an enhanced branch submodule and a suppressed branch submodule. It takes the generated enhanced video stream and the text features as input to the enhanced branch proposal network, outputs proposal results and scores, and screens them to obtain the positive proposal set P_en with corresponding scores C_en and boundaries L_en; it takes the generated suppressed video stream and the text features as input to the suppressed branch proposal network, outputs proposal results and scores, and screens them to obtain the negative proposal set P_sp with corresponding scores C_sp and boundaries L_sp. This can be expressed as:

P_en, L_en, C_en = EnhancedBranch_Θ(V_en, Q)
P_sp, L_sp, C_sp = SuppressedBranch_Θ(V_sp, Q)

where the shared subscript Θ indicates that the two branches share parameters.
The enhanced branch submodule comprises:
A cross-modal interaction unit, for summarizing the text features for each frame of the video to obtain the integrated text features;
A cross-gate submodule, for making the integrated text features and the enhanced video stream interact with each other to obtain the language-aware frame features;
A two-dimensional time interval feature map submodule, which constructs a two-dimensional time interval feature map with three dimensions: the first two dimensions index the start frame and end frame of a time interval, and the third is the feature dimension. It explores the relationships between adjacent time intervals, obtains the cross-modal features of all time intervals in the map from the language-aware frame features, and computes each proposal's score and boundary from those cross-modal features;
A proposal screening submodule, for screening proposals using the center-based proposal screening method and outputting the screened positive proposal set.
The suppressing branch submodule has the same structure as the enhancing branch submodule and shares its parameters, and finally generates the negative proposal set.
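The relation between the two streams fed into the shared-parameter branches can be sketched with complementary gates over the frame features; the s / (1 - s) form is an illustrative assumption consistent with the stated behavior (the enhanced stream highlights language-relevant frames, the suppressed stream keeps the rest):

```python
import numpy as np

def branch_gates(frame_feats, frame_scores):
    """Split one video into an enhanced and a suppressed stream from the
    normalized per-frame language-matching scores. The s and (1 - s)
    gates are an assumed form, not the patent's exact formula."""
    s = frame_scores[:, None]              # (n_v, 1), broadcast over features
    enhanced = s * frame_feats             # emphasizes language-relevant frames
    suppressed = (1.0 - s) * frame_feats   # complementary stream
    return enhanced, suppressed

feats = np.ones((4, 3))
scores = np.array([1.0, 0.5, 0.5, 0.0])
v_en, v_sp = branch_gates(feats, scores)
```

Because the two branches share parameters, the same proposal network is simply applied twice, once per stream.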
Wherein a proposal regularization strategy is introduced into the enhanced branch submodule and comprises two loss terms, specifically:
the first term excludes the interference of non-matching proposals by reducing the mean proposal score, so that the scores of unselected proposals are pushed close to 0 while the other losses keep the scores of selected proposals high;
the second term expands the score gap between positive proposals, so that the single most accurate proposal can be selected from the set of positive proposals as the final result.
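The exact loss formulas are not legible in this text (they appear as images in the original document), so the two terms can only be sketched; the forms below are illustrative assumptions matching the stated behavior:

```python
import numpy as np

def proposal_regularization(all_scores, pos_scores):
    """Sketch of the two regularization terms (assumed forms).
    term_mean: minimizing the mean of all proposal scores drives the many
        unselected proposals toward 0, while the main losses keep the
        selected proposals' scores high.
    term_gap: minimizing 1 - max(softmax(pos_scores)) widens the score
        gap among positive proposals so one clear winner emerges."""
    term_mean = float(np.mean(all_scores))
    p = np.exp(pos_scores - np.max(pos_scores))
    p /= p.sum()                    # softmax over positive proposal scores
    term_gap = 1.0 - float(p.max())
    return term_mean, term_gap

all_scores = np.array([0.9, 0.8, 0.1, 0.0, 0.0])
t_mean, gap_peaked = proposal_regularization(all_scores, np.array([5.0, 0.1, 0.1]))
_, gap_flat = proposal_regularization(all_scores, np.array([1.0, 1.0, 1.0]))
```

A peaked positive-score distribution yields a smaller gap loss than a flat one, which is the intended selection pressure.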
Meanwhile, the regularized two-branch proposal network adopts both inter-sample confrontation and intra-sample confrontation: intra-sample confrontation is encouraged by raising the positive proposal scores and lowering the negative proposal scores, and inter-sample confrontation is encouraged by raising the scores of matching samples and lowering the scores of non-matching samples.
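The two confrontations can be sketched as margin (hinge) losses over the aggregate scores K_en and K_sp named in claim 4; the hinge form and the margin values below are assumptions, since the text states only the direction of each term:

```python
def adversarial_losses(k_en, k_sp, k_mis, delta_intra=0.4, delta_inter=0.8):
    """Margin sketch of the two confrontations (assumed forms).
    intra: positive proposals (enhanced branch, k_en) should outscore the
        negative proposals (suppressed branch, k_sp) by delta_intra.
    inter: a matching video-text pair should outscore a randomly sampled
        non-matching pair (k_mis) by delta_inter."""
    l_intra = max(0.0, delta_intra - (k_en - k_sp))
    l_inter = max(0.0, delta_inter - (k_en - k_mis))
    return l_intra, l_inter

well_separated = adversarial_losses(k_en=2.0, k_sp=0.5, k_mis=0.2)  # margins met
too_close = adversarial_losses(k_en=1.0, k_sp=0.9, k_mis=0.9)       # hinges active
```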
In the specific embodiments provided in the present application, it should be understood that the above-described system embodiments are merely illustrative; for example, the division into a regularized two-branch proposal network module is merely a logical functional division, and other divisions are possible in actual implementation: multiple modules may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the connections between the modules shown or discussed may be communication connections via interfaces, which may be electrical or in other forms.
The method is applied in the following embodiments to demonstrate the technical effects of the present invention; the detailed steps in the embodiments are not described again.
Examples
The invention is experimentally verified on three datasets, namely Charades-STA, ActivityCaption and DiDeMo; the specific conditions of the three datasets are as follows:
The Charades-STA dataset contains 9848 indoor activity videos with an average video duration of 29.8 seconds; the dataset provides 12408 sentence-period pairs for training and 3720 for testing.
The ActivityCaption dataset contains 19209 videos of diverse content, with an average video duration of about 2 minutes; the dataset provides 37417, 17505 and 17031 sentence-period pairs for training, validation and testing, respectively.
The DiDeMo dataset contains 10464 videos, each 25-30 seconds long; the dataset provides 33005, 4180 and 4021 sentence-period pairs for training, validation and testing, respectively. In particular, each video in DiDeMo is divided into six five-second clips, and the target period always consists of one or more consecutive clips, so each DiDeMo video has only 21 candidate periods, far fewer than in Charades-STA and ActivityCaption.
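The 21-candidate figure follows directly from counting contiguous spans of six clips (6 + 5 + 4 + 3 + 2 + 1 = 21), which a quick enumeration confirms:

```python
def contiguous_spans(n_clips):
    """All periods made of one or more consecutive clips, as (first, last)."""
    return [(a, b) for a in range(n_clips) for b in range(a, n_clips)]

spans = contiguous_spans(6)  # DiDeMo: six five-second clips per video
```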
In terms of evaluation criteria, the present invention follows the widely used practice of adopting R@n, IoU=m as the criteria for Charades-STA and ActivityCaption, and Rank@1, Rank@5 and mIoU as the criteria for DiDeMo. More specifically, the invention first computes the IoU value between a predicted period and the ground truth; then R@n, IoU=m computes the percentage of samples for which at least one of the top n predicted periods has an IoU greater than m, mIoU computes the mean of the top-1 IoU values over all test samples, and Rank@1 and Rank@5 compute the percentage of samples whose ground-truth period is ranked first, or within the top five, respectively.
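The evaluation criteria can be stated precisely in code; this is the standard formulation of R@n, IoU=m and mIoU for this task, not taken verbatim from the patent:

```python
def temporal_iou(pred, gt):
    """IoU between a predicted period and the ground truth, as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def r_at_n_iou(ranked_preds, gts, n, m):
    """R@n, IoU=m: fraction of samples whose top-n predictions contain at
    least one period with IoU greater than m."""
    hits = sum(any(temporal_iou(p, gt) > m for p in preds[:n])
               for preds, gt in zip(ranked_preds, gts))
    return hits / len(gts)

def mean_iou(ranked_preds, gts):
    """mIoU: mean IoU of each sample's top-1 prediction."""
    return sum(temporal_iou(preds[0], gt)
               for preds, gt in zip(ranked_preds, gts)) / len(gts)

ranked_preds = [[(0, 10), (5, 15)], [(20, 30), (0, 5)]]  # per-sample ranked periods
gts = [(0, 10), (0, 5)]
```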
Tables 1 to 3 show the experimental results of the present invention (abbreviated RTBPN) on the three datasets Charades-STA, ActivityCaption and DiDeMo.
TABLE 1 Experimental results on Charades-STA data set
TABLE 2 Experimental results on ActivityCaption data set
Table 3 experimental results on the DiDeMo data set
Because weakly supervised algorithms have only coarse-grained sentence-level labels, their training often involves much invalid or even harmful learning, so under the same framework their results are much worse than those of fully supervised algorithms trained with period labels for each sentence.
However, as can be seen from Tables 1-3, the present invention considers both inter-sample and intra-sample confrontation: intra-sample confrontation is encouraged by the intra-sample loss, distinguishing the target period from similar interfering negative periods within the same data pair, and inter-sample confrontation is encouraged by the inter-sample loss, so that matching positive samples score higher than non-matching negative samples. Specifically, the enhanced video stream generated by the language-aware filter via the scene-based cross-modal estimation method highlights language-related key-frame features and weakens irrelevant ones, while the suppressed video stream does the opposite; together with the novel regularized two-branch proposal network and the center-based proposal screening technique, an excellent positive proposal set and a reasonable negative proposal set can be screened out. As a result, the video period retrieval performance of the invention exceeds that of the early fully supervised algorithms VSA-RNN, VSA-STV and CTRL, and is comparable to TGN, QSPN and MCN, which demonstrates the superiority of the invention.
Compared with other existing weakly supervised algorithms, the method falls only slightly behind in the single case of R@1, IoU=0.3 on the ActivityCaption dataset, and improves upon them in all other cases, which shows that the method surpasses existing weakly supervised algorithms in video period retrieval performance.
The foregoing lists merely illustrate specific embodiments of the invention. It is obvious that the invention is not limited to the above embodiments, but that many variations are possible. All modifications which can be derived or suggested by a person skilled in the art from the disclosure of the present invention are to be considered within the scope of the invention.
Claims (7)
1. A weakly supervised video time interval retrieval method based on two-branch proposed network is characterized by comprising the following steps:
1) establishing a network model consisting of a cross-modal language perception filter and a regularized two-branch proposal network, wherein the regularized two-branch proposal network comprises an enhanced branch proposal network and a suppressed branch proposal network;
2) acquiring a video and a description text as a training data set, and extracting frame characteristics of the video and text characteristics of the description text;
3) taking the frame characteristics and the text characteristics as the input of a cross-modal language perception filter, and generating an enhanced video stream and a suppressed video stream with the text characteristics; the cross-modal language perception filter generates an enhanced video stream and a suppressed video stream with text features by using a scene-based cross-modal estimation method, and specifically comprises the following steps:
3.1) projecting the text feature sequence {q_i}_{i=1}^{n_q} onto clustering centers using the NetVLAD technique to obtain a scene-based language feature sequence {u_j}_{j=1}^{n_c}, wherein n_q is the number of words, q_i is the semantic feature of the ith word, n_c is the number of centers, and u_j is the language feature of the jth scene;
3.2) computing the cross-modal matching scores between the frame feature sequence {v_i}_{i=1}^{n_v} and the scene-based language feature sequence {u_j}_{j=1}^{n_c}, wherein n_v is the number of frame features and v_i is the frame feature of the ith frame in the video;
3.3) calculating the score of each frame in the video according to the cross-modal matching score and carrying out normalization processing to obtain normalized score distribution;
3.4) adopting two branch gates to process the normalized score distribution and the frame feature sequence {v_i}_{i=1}^{n_v} to generate an enhanced video stream and a suppressed video stream;
4) taking the generated enhanced video stream and text characteristics as the input of an enhanced branch proposal network, outputting a proposal result and a score, and screening to obtain a positive proposal set; the generated inhibition video stream and text characteristics are used as the input of an inhibition branch proposal network, the proposal result and the score are output, and a negative proposal set is obtained by screening;
the enhanced branch proposal network is specifically as follows:
4.1) taking the generated enhanced video stream and text features as the input of the enhanced branch proposal network, and adopting a frame-to-word attention structure to summarize the text features for each frame to obtain integrated text features;
4.2) interacting the integrated text features and the enhanced video stream with each other through a cross gate to obtain language-aware frame features;
4.3) constructing a two-dimensional time period feature map through a two-dimensional temporal network according to the frame features of language perception; the constructed two-dimensional time interval feature map comprises three dimensions: the first two dimensions are used for indexing a start frame and an end frame of a period, and the third dimension is a characteristic dimension;
performing two-dimensional convolution twice according to the constructed two-dimensional time interval characteristic diagram to obtain cross-modal characteristics of all time intervals in the two-dimensional diagram; according to the cross-modal characteristics of all time intervals in the two-dimensional graph, each time interval corresponds to one proposed result, and the score of each proposed result is calculated;
4.4) screening the proposals using center-based proposal screening: taking the proposal result with the highest score as the center proposal, sorting the remaining proposals by their period overlap with the center proposal, and taking the T-1 proposals with the highest overlap; the center proposal and these T-1 proposals constitute the screened positive proposals, forming the positive proposal set; the positive proposal set is a set consisting of the positive proposal results, scores and boundaries;
the suppressing branch proposal network and the enhancing branch proposal network have consistent structures and shared parameters;
5) introducing proposal regularization into the enhanced branch proposal network, calculating a multitask loss function through intra-sample confrontation, inter-sample confrontation and proposal regularization, and updating parameters of a cross-modal language perception filter and a regularized two-branch proposal network to obtain a trained network model;
6) for the video and the query sentence to be detected, the frame characteristics of the video and the text characteristics of the query sentence are respectively extracted, the frame characteristics and the text characteristics are used as the input of a trained network model, and a positive proposal with the highest predicted score is obtained as a retrieval result.
2. The method for retrieving the video time interval under the weak supervision based on the two-branch proposed network as claimed in claim 1, wherein the step 2) is specifically as follows:
2.1) acquiring a video and a description text as a training data set;
2.2) extracting word features from the text using the pre-trained GloVe word2vec embedding method, then taking the word features as the input of a Bi-GRU network and learning word semantic representations with context information as the text feature sequence {q_i}_{i=1}^{n_q}, wherein n_q is the number of words in the description text and q_i is the semantic feature of the ith word;
2.3) extracting visual features of the video using a pre-trained video feature extractor, and reducing the length of the visual feature sequence by temporal average pooling to obtain the frame feature sequence {v_i}_{i=1}^{n_v} of the video, wherein n_v is the number of frame features and v_i is the frame feature of the ith frame in the video; the pre-trained video feature extractor differs with the training dataset: C3D features are extracted for the Charades-STA and ActivityCaption datasets, and optical-flow features are extracted for the DiDeMo dataset.
3. The method according to claim 1, wherein proposal regularization is introduced into the enhanced branch proposal network and fed back to the suppressed branch proposal network in the form of shared parameters;
the proposed regularization loss function formula is specifically:
wherein the two terms are the two losses of proposal regularization, M_en is the number of all periods in the two-dimensional period feature map, the per-proposal quantities are the score of the ith positive proposal and that score processed by the softmax function, and T denotes the number of positive proposals in the positive proposal set.
4. The method according to claim 1, wherein the inter-sample countermeasure and intra-sample countermeasure loss function formulas are as follows:
wherein the first loss is the inter-sample adversarial loss and the second is the intra-sample adversarial loss, K_en is the sum of all positive proposal scores, K_sp is the sum of all negative proposal scores, Δ_intra is the score margin for the positive proposals, Δ_inter is the score margin for non-matching (negative) samples, and the remaining quantities are the matching scores between the current data and a randomly sampled non-matching video or text.
5. The method according to claim 3 or 4, wherein the multitask loss function is specifically:
6. A weakly supervised video time interval retrieval system based on two-branch proposed network, for implementing the weakly supervised video time interval retrieval method of claim 1, the weakly supervised video time interval retrieval system comprising:
the data acquisition module, for acquiring videos and description texts as the training dataset when the system is in the training stage, and for acquiring the video to be detected and the query sentence when the system is in the detection stage;
the feature extraction module, for extracting frame features from the video and extracting text features from the description text and the query sentence;
a cross-modal language-aware filtering module, for receiving the frame features {v_i}_{i=1}^{n_v} and the text features {q_i}_{i=1}^{n_q} as inputs and outputting an enhanced video stream containing text features and a suppressed video stream, wherein n_v is the number of frame features, v_i is the frame feature of the ith frame in the video, n_q is the number of words, and q_i is the semantic feature of the ith word; the module comprises a NetVLAD submodule for generating the clustering centers and the scene-based language feature sequence {u_j}_{j=1}^{n_c} from the text feature sequence, a cross-modal estimation submodule for computing the cross-modal matching scores between the frame feature sequence and the scene-based language feature sequence, and two branch-gate submodules for generating the enhanced video stream and the suppressed video stream from the normalized score distribution and the frame feature sequence; wherein n_c is the number of centers and u_j is the language feature of the jth scene;
a regularized two-branch proposal network module: composed of an enhancement branch submodule and a suppression branch submodule, wherein the enhancement branch submodule takes the generated enhanced video stream and text features as the input of the enhanced branch proposal network, outputs proposal results and scores, and screens them to obtain a positive proposal set; and the suppression branch submodule takes the generated suppressed video stream and text features as the input of the suppressed branch proposal network, outputs proposal results and scores, and screens them to obtain a negative proposal set.
7. The system of claim 6, wherein the enhanced branch sub-module comprises:
the cross-mode interaction unit is used for summarizing the text features of each frame of the video to obtain integrated text features;
a Bi-GRU sub-module for causing the integrated text features and the enhanced video stream to interact with each other to obtain language-aware frame features;
the two-dimensional time interval feature map submodule, which constructs a two-dimensional time interval feature map with three dimensions: the first two dimensions index the start frame and end frame of a time interval, and the third is the feature dimension; the submodule is used for modeling the relations between adjacent time intervals, obtaining the cross-modal features of all time intervals in the two-dimensional map from the language-aware frame features, and outputting proposal results and scores from these cross-modal features;
a proposal screening submodule that screens proposals using a center-based proposal screening method to screen a positive proposal set from the proposal results.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011332463.XA CN112417206B (en) | 2020-11-24 | 2020-11-24 | Weak supervision video time interval retrieval method and system based on two-branch proposed network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112417206A CN112417206A (en) | 2021-02-26 |
CN112417206B true CN112417206B (en) | 2021-09-24 |
Family
ID=74778792
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011332463.XA Active CN112417206B (en) | 2020-11-24 | 2020-11-24 | Weak supervision video time interval retrieval method and system based on two-branch proposed network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112417206B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112685597B (en) * | 2021-03-12 | 2021-07-13 | 杭州一知智能科技有限公司 | Weak supervision video clip retrieval method and system based on erasure mechanism |
CN113792594B (en) * | 2021-08-10 | 2024-04-12 | 南京大学 | Method and device for locating language fragments in video based on contrast learning |
CN113836901B (en) * | 2021-09-14 | 2023-11-14 | 灵犀量子(北京)医疗科技有限公司 | Method and system for cleaning Chinese and English medical synonym data |
CN113806482B (en) * | 2021-09-17 | 2023-12-12 | ***数智科技有限公司 | Cross-modal retrieval method, device, storage medium and equipment for video text |
CN113869272A (en) * | 2021-10-13 | 2021-12-31 | 北京达佳互联信息技术有限公司 | Processing method and device based on feature extraction model, electronic equipment and medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107679580A (en) * | 2017-10-21 | 2018-02-09 | 桂林电子科技大学 | A kind of isomery shift image feeling polarities analysis method based on the potential association of multi-modal depth |
EP3620976A1 (en) * | 2018-09-07 | 2020-03-11 | Volvo Car Corporation | Methods and systems for providing fast semantic proposals for image and video annotation |
CN111582170A (en) * | 2020-05-08 | 2020-08-25 | 浙江大学 | Method and positioning system for completing specified object positioning task in video by using object-aware multi-branch relation network |
CN111931571A (en) * | 2020-07-07 | 2020-11-13 | 华中科技大学 | Video character target tracking method based on online enhanced detection and electronic equipment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11195057B2 (en) * | 2014-03-18 | 2021-12-07 | Z Advanced Computing, Inc. | System and method for extremely efficient image and pattern recognition and artificial intelligence platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||