CN112417206B - Weak supervision video time interval retrieval method and system based on two-branch proposed network


Info

Publication number: CN112417206B
Application number: CN202011332463.XA
Authority: CN (China)
Prior art keywords: proposal; video; branch; text; frame
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN112417206A
Inventor: 童鑫远
Current Assignee: Hangzhou Yizhi Intelligent Technology Co ltd
Original Assignee: Hangzhou Yizhi Intelligent Technology Co ltd
Application filed 2020-11-24 by Hangzhou Yizhi Intelligent Technology Co ltd; priority to CN202011332463.XA
Publication of CN112417206A: 2021-02-26
Publication of CN112417206B (application granted): 2021-09-24

Classifications

All under G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING; G06F16/00 Information retrieval, database structures therefor, file system structures therefor; G06F16/70 Information retrieval of video data:

    • G06F16/735: Querying (G06F16/73); filtering based on additional data, e.g. user or group profiles
    • G06F16/783: Retrieval characterised by using metadata (G06F16/78), using metadata automatically derived from the content
    • G06F16/7844: Retrieval using metadata automatically derived from original textual content or text extracted from visual content or transcript of audio data
    • G06F16/7867: Retrieval using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings


Abstract

The invention discloses a weakly supervised video period retrieval method and system based on a two-branch proposal network, belonging to the field of video retrieval. The method mainly comprises the following steps: 1) for a training set of videos and description texts, learning a joint representation of the video and text information with a cross-modal language-aware filter, generating an enhanced video stream and a suppressed video stream carrying the text information; 2) from the output of the cross-modal language-aware filter, outputting a period answer for the joint video-text representation according to a parameter-shared regularized two-branch proposal network. The invention adopts a language-aware filter that uses a scene-based cross-modal estimation method to generate the enhanced and suppressed video streams, and a novel two-branch proposal network that considers inter-sample and intra-sample confrontation simultaneously; a proposal regularization strategy stabilizes the training process and effectively improves model performance.

Description

Weak supervision video time interval retrieval method and system based on two-branch proposed network
Technical Field
The invention relates to the field of video period retrieval, and in particular to a weakly supervised video period retrieval method and system based on a two-branch proposal network.
Background
Video period retrieval is an important problem in the field of video retrieval; it aims to automatically locate a target period in an untrimmed video according to a given description text.
Video period retrieval is an interdisciplinary field between computer vision and natural language processing. A video period retrieval model must understand not only the visual and textual content but also the correlation between them. Most existing methods are trained in a fully supervised setting on aligned, annotated video-text pairs, which is time-consuming and expensive, especially for ambiguous descriptions. Recently, researchers have begun exploring weakly supervised period retrieval using only video-level sentence annotations.
Most existing weakly supervised period retrieval methods are based on multi-instance learning (MIL): a matched video-text pair is regarded as a positive sample and an unmatched pair as a negative sample. Such methods focus mainly on confrontation between samples, i.e., judging whether a video matches a given textual description, and ignore confrontation within samples, i.e., determining which period best matches the given description. Given a matched video-text pair, a video typically contains continuous content with no shortage of negative periods that are highly relevant to the textual description but not a perfect match, and these are difficult to distinguish from the target period. Therefore, sufficient intra-sample confrontation needs to be developed between periods of similar content within a video.
In summary, the prior art cannot effectively develop intra-sample confrontation using the adjacent periods of a video, so its performance is limited in applications with similar scenes and it cannot accurately locate period boundaries.
Disclosure of Invention
The invention aims to solve the above problems in the prior art and provides a weakly supervised video period retrieval method and system based on a two-branch proposal network.
To achieve this purpose, the invention specifically adopts the following technical scheme:
A weakly supervised video period retrieval method based on a two-branch proposal network comprises the following steps:
1) establishing a network model consisting of a cross-modal language perception filter and a regularized two-branch proposal network, wherein the regularized two-branch proposal network comprises an enhanced branch proposal network and a suppressed branch proposal network;
2) acquiring a video and a description text as a training data set, and extracting frame characteristics of the video and text characteristics of the description text;
3) taking the frame characteristics and the text characteristics as the input of a cross-modal language perception filter, and generating an enhanced video stream and a suppressed video stream with the text characteristics;
4) taking the generated enhanced video stream and the text features as input to the enhanced branch proposal network, outputting proposal results and scores, and screening to obtain a positive proposal set; taking the generated suppressed video stream and the text features as input to the suppressed branch proposal network, outputting proposal results and scores, and screening to obtain a negative proposal set;
5) introducing proposal regularization into the enhanced branch proposal network, calculating a multitask loss function through intra-sample confrontation, inter-sample confrontation and proposal regularization, and updating parameters of a cross-modal language perception filter and a regularized two-branch proposal network to obtain a trained network model;
6) for the video to be detected and the query sentence, extracting the frame features of the video and the text features of the query sentence respectively, taking them as input to the trained network model, and obtaining the positive proposal with the highest predicted score as the retrieval result.
Another objective of the present invention is to provide a weakly supervised video period retrieval system based on a two-branch proposal network, for implementing the above retrieval method.
The weakly supervised video period retrieval system comprises:
A data acquisition module, used for acquiring videos and description texts as the training data set when the system is in the training stage, and for acquiring the video to be detected and the query sentence when the system is in the detection stage.
A feature extraction module, used for extracting frame features from the videos and text features from the description texts and query sentences.
A cross-modal language-aware filtering module, used for receiving the frame features and text features as input and outputting an enhanced video stream and a suppressed video stream containing the text features.
A regularized two-branch proposal network module, consisting of an enhanced branch submodule and a suppressed branch submodule, used for taking the generated enhanced video stream and text features as input to the enhanced branch proposal network, outputting proposal results and scores, and screening to obtain a positive proposal set; and for taking the generated suppressed video stream and text features as input to the suppressed branch proposal network, outputting proposal results and scores, and screening to obtain a negative proposal set.
Compared with traditional methods, the invention effectively improves the performance of video period retrieval, which is specifically embodied as follows:
(1) Aiming at the problem that conventional methods ignore intra-sample confrontation, the invention designs a novel regularized two-branch proposal network. Each branch comprises a cross-modal interaction unit that integrates text clues into the visual features to generate language-aware frame features, a two-dimensional period feature map that is built from the language-aware frame features and processed by convolution to explore the relations between adjacent periods, and a proposal screening module that selects the best proposals. By receiving the enhanced video stream, the suppressed video stream, and the text features, the network generates a series of matching positive and negative proposals, along with a score and boundary for each proposal, for weakly supervised video period retrieval; using a center-based proposal screening technique, a superior positive proposal set and a reasonable negative proposal set are screened out.
In addition, the method considers inter-sample and intra-sample confrontation at the same time. Intra-sample confrontation is encouraged through the intra-sample loss, distinguishing the target period from similar interfering negative periods in the same data pair; inter-sample confrontation is encouraged through the inter-sample loss, so that matching positive samples have higher scores than non-matching negative samples. The model thus judges not only whether a video matches a given text description but also which period best matches it, so that negative periods that are highly relevant to the text description but not a perfect match can be distinguished from the target period, improving the accuracy of the retrieval result.
(2) Aiming at the problem that intra-sample confrontation can generate overly simple, invalid negative samples, the invention designs a language-aware filter that uses a scene-based cross-modal estimation method: text features are projected onto cluster centers using the NetVLAD technique to generate a scene-based language feature sequence; cross-modal matching scores between the frame feature sequence and the scene-based language feature sequence are then computed, giving a score for each frame, which is normalized; finally, a two-branch gate generates the enhanced video stream and the suppressed video stream from the normalized score distribution and the frame feature sequence. The enhanced video stream highlights language-related key frame features and weakens irrelevant frame features, and the suppressed video stream does the opposite.
(3) To exploit prior knowledge that benefits model training, a proposal regularization strategy is designed in the regularized two-branch proposal network; because the two branches have a consistent structure and shared parameters, the strategy only needs to be applied to the enhanced branch. Specifically, considering that most periods are unselected, i.e., do not match the text description, the invention uses a global loss function term to reduce the average proposal score so that the scores of unselected periods are pushed toward 0; considering that the most accurate period proposal is selected from a series of proposals as the final result, the invention applies the softmax function to all positive proposals and introduces a gap loss function term to encourage widening the score gap between proposals. In summary, the designed proposal regularization strategy reduces the average score of all proposals to diminish the influence of irrelevant proposals, and widens the differences between proposal scores to help select the optimal proposal, stabilizing model training and improving model performance.
Drawings
Fig. 1 is a schematic diagram of a network model used in the present invention.
FIG. 2 is a schematic diagram of a cross-modal language-aware filter used in the present invention.
FIG. 3 is a schematic diagram of the structure of an enhanced branch in a regularized two-branch proposed network used by the present invention.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in fig. 1, the weakly supervised video period retrieval method based on a two-branch proposal network provided by the present invention includes the following steps:
Step one, for an input video and input text, extracting the frame features of the video and the text features of the description text; then generating an enhanced video stream and a suppressed video stream carrying the text features through the cross-modal language-aware filter.
Step two, for the generated enhanced and suppressed video streams with text features, generating a series of regularized positive and negative proposals with their scores and boundaries through the regularized two-branch proposal network, and computing the multitask loss function from intra-sample confrontation, inter-sample confrontation, and proposal regularization to update the model parameters of the cross-modal language-aware filter and the regularized two-branch proposal network.
Step three, for a video and text whose answer is to be predicted, obtaining the positive proposal period with the highest predicted score as the retrieval result, according to the trained cross-modal language-aware filter and regularized two-branch proposal network.
In one embodiment of the present invention, the first step is performed as follows:
1.1) acquiring a video and a description text as a training data set, and extracting frame characteristics and text characteristics;
the frame feature extraction method specifically comprises the following steps: extracting visual features of the video by using a pre-trained video feature extractor, and reducing the length of a visual feature sequence by using time sequence average pooling to obtain a frame feature sequence of the video
Figure BDA0002796210530000041
Wherein n isvIs a characteristic number, viIs the frame characteristic of the ith frame in the video; the pre-trained video feature extractor is different from data set to data set, C3D features are extracted from a Charads-STA data set and an ActivityCaption data set, and optical flow features are extracted from a DiDeMo data set.
The text feature extraction method is specifically as follows: word features are extracted using the pre-trained GloVe word2vec embedding method and then fed into a Bi-GRU network, which learns word semantic representations with context information as the text feature sequence $Q = \{q_i\}_{i=1}^{n_q}$, where $n_q$ is the number of words and $q_i$ is the semantic feature of the $i$-th word.
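For illustration, the text encoding step above can be sketched in PyTorch as follows; the embedding and hidden dimensions, the class name TextEncoder, and the random input are assumptions, since the text only specifies GloVe vectors followed by a Bi-GRU:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of the Bi-GRU text encoder: GloVe vectors in, contextual word features out."""
    def __init__(self, glove_dim=300, hidden_dim=256):
        super().__init__()
        # Bidirectional GRU; forward and backward states are concatenated per word.
        self.bigru = nn.GRU(glove_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, glove_vectors):          # (batch, n_q, glove_dim)
        q, _ = self.bigru(glove_vectors)       # (batch, n_q, 2 * hidden_dim)
        return q                               # text feature sequence Q = {q_i}

# Usage: 8 sentences of 12 words, each word a 300-d GloVe vector (random stand-in).
encoder = TextEncoder()
Q = encoder(torch.randn(8, 12, 300))
print(Q.shape)  # torch.Size([8, 12, 512])
```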
1.2) taking the frame characteristics and the text characteristics as the input of a cross-modal language perception filter, and generating an enhanced video stream with the text characteristics and a suppressed video stream.
The structure of the cross-modal language-aware filter is shown in fig. 2; it uses a scene-based cross-modal estimation method, specifically as follows:
First, the text features $Q = \{q_i\}_{i=1}^{n_q}$ are projected onto a set of cluster centers using the NetVLAD technique. Specifically, given a set of trainable center vectors $C = \{c_j\}_{j=1}^{n_c}$, where $n_c$ is the number of centers and $c_j$ is the $j$-th center, NetVLAD accumulates the residuals between the text features and the center vectors by soft assignment:

$$\alpha_i = \mathrm{softmax}(W_c q_i + b_c), \qquad u_j = \sum_{i=1}^{n_q} \alpha_{ij}\,(q_i - c_j)$$

where $W_c$ and $b_c$ are a projection matrix and bias vector, $\alpha_i \in \mathbb{R}^{n_c}$ is the soft-assignment coefficient vector over the $n_c$ cluster centers, $\alpha_{ij}$ is the soft-assignment coefficient between the $i$-th text feature and the $j$-th cluster center, and $u_j$ is the feature accumulated at the $j$-th center. Each center can be regarded as a language scene, and $u_j$ is the corresponding scene-based language feature, finally yielding the scene-based language feature sequence $U = \{u_j\}_{j=1}^{n_c}$.
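A minimal sketch of this NetVLAD soft-assignment step, assuming PyTorch and illustrative dimensions (class name and sizes are not from the patent):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLADText(nn.Module):
    """Sketch of NetVLAD: project word features onto n_c trainable centers."""
    def __init__(self, dim=512, n_centers=8):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_centers, dim))  # C = {c_j}
        self.assign = nn.Linear(dim, n_centers)                   # W_c, b_c

    def forward(self, Q):                                         # (batch, n_q, dim)
        alpha = F.softmax(self.assign(Q), dim=-1)                 # soft assignment α_ij
        # residuals between every word feature and every center: (batch, n_q, n_c, dim)
        residual = Q.unsqueeze(2) - self.centers.unsqueeze(0).unsqueeze(0)
        # accumulate residuals per center: u_j = Σ_i α_ij (q_i − c_j)
        U = (alpha.unsqueeze(-1) * residual).sum(dim=1)           # (batch, n_c, dim)
        return U                                                  # scene-based language features

U = NetVLADText()(torch.randn(8, 12, 512))
print(U.shape)  # torch.Size([8, 8, 512])
```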
Next, the cross-modal matching score $\beta_{ij}$ between the frame feature sequence $V = \{v_i\}_{i=1}^{n_v}$ and the scene-based language feature sequence $U = \{u_j\}_{j=1}^{n_c}$ is computed as:

$$\beta_{ij} = \sigma\left(w_a^{\top}\tanh(W_a^{v} v_i + W_a^{u} u_j + b_a)\right)$$

where $W_a^{v}$ and $W_a^{u}$ are projection matrices, $b_a$ is a bias vector, $w_a^{\top}$ is a row vector, $\sigma$ is the sigmoid activation function, and $\beta_{ij} \in (0,1)$ represents the matching score between the $i$-th frame feature and the $j$-th scene-based language feature. Through this formula, an intermediate semantic space is introduced for the text and the video.
Considering the definition of an important frame, i.e., a frame to which some language scene is closely related, each frame is evaluated with an overall score; specifically, the overall score of the $i$-th frame is:

$$s_i = \max_{j} \beta_{ij}$$

where the index $j$ ranges over the scene dimension. Meanwhile, to prevent a score distribution with too little separation, for example all frame scores being close to 0 or 1, the distribution is adjusted by max-min normalization:

$$\hat{s}_i = \frac{s_i - \min_k s_k}{\max_k s_k - \min_k s_k}$$

thereby obtaining the normalized score distribution $\hat{S} = \{\hat{s}_i\}_{i=1}^{n_v}$ over the frames, where $\hat{s}_i$ represents the correlation between the $i$-th frame and the description text.
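The scene-based cross-modal estimation and max-min normalization above can be sketched as follows; the dimensions and the module name are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CrossModalScore(nn.Module):
    """Sketch of scene-based cross-modal estimation: per-frame scores in [0, 1]."""
    def __init__(self, v_dim=500, u_dim=512, hidden=256):
        super().__init__()
        self.proj_v = nn.Linear(v_dim, hidden, bias=False)   # W_a^v
        self.proj_u = nn.Linear(u_dim, hidden)               # W_a^u, b_a
        self.row = nn.Linear(hidden, 1, bias=False)          # w_a^T

    def forward(self, V, U):                   # V: (n_v, v_dim), U: (n_c, u_dim)
        # β_ij = σ(w_a^T tanh(W_a^v v_i + W_a^u u_j + b_a)), broadcast over all (i, j)
        beta = torch.sigmoid(self.row(torch.tanh(
            self.proj_v(V).unsqueeze(1) + self.proj_u(U).unsqueeze(0)))).squeeze(-1)
        s = beta.max(dim=1).values             # s_i = max_j β_ij
        # max-min normalization to spread out the score distribution
        s_hat = (s - s.min()) / (s.max() - s.min() + 1e-8)
        return s_hat                           # normalized per-frame scores

s_hat = CrossModalScore()(torch.randn(200, 500), torch.randn(8, 512))
print(s_hat.shape, float(s_hat.min()), float(s_hat.max()))
```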
Finally, a two-branch gate is used to generate the enhanced video stream $V^{en} = \{v_i^{en}\}_{i=1}^{n_v}$ and the suppressed video stream $V^{sp} = \{v_i^{sp}\}_{i=1}^{n_v}$:

$$v_i^{en} = \hat{s}_i \cdot v_i, \qquad v_i^{sp} = (1 - \hat{s}_i) \cdot v_i$$

where $v_i^{en}$ is the $i$-th enhanced frame feature of the enhanced video stream $V^{en}$ and $v_i^{sp}$ is the $i$-th suppressed frame feature of the suppressed video stream $V^{sp}$. Based on the normalized scores, the enhanced video stream highlights key frames and attenuates the influence of non-key frames, and the suppressed video stream does the opposite.
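A sketch of the two-branch gate; the exact gating form, multiplying by $\hat{s}_i$ and $1-\hat{s}_i$ respectively, is an assumption consistent with the description that the suppressed stream is the opposite of the enhanced one:

```python
import torch

def two_branch_gate(V, s_hat):
    """Sketch of the two-branch gate. The (1 - s) form for the suppressed stream
    is an assumption, not stated explicitly in the text."""
    V_en = s_hat.unsqueeze(-1) * V          # enhanced stream: key frames kept
    V_sp = (1.0 - s_hat).unsqueeze(-1) * V  # suppressed stream: key frames damped
    return V_en, V_sp

V = torch.randn(200, 500)
V_en, V_sp = two_branch_gate(V, torch.rand(200))
print(V_en.shape, V_sp.shape)
```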
In one embodiment of the present invention, the implementation of step two is as follows:
taking the generated enhanced video stream, the generated suppressed video stream and the text characteristics as the input of a regularized two-branch proposal network, and outputting the final regularized time period proposal result and score; comparing the result with the true value and updating parameters of the cross-modal language perception filter and the regularized two-branch proposed network to obtain a final network model;
2.1) the structure of the enhanced branch and the suppressed branch of the regularized two-branch proposed network is consistent and the parameters are shared, and the flow of the enhanced branch selection proposal is shown in fig. 3, which specifically comprises:
Given the enhanced video stream $V^{en} = \{v_i^{en}\}_{i=1}^{n_v}$ and the text features $Q = \{q_i\}_{i=1}^{n_q}$, a cross-modal interaction unit is constructed to integrate the textual clues into the visual features. Specifically, a frame-to-word attention structure is used to summarize the text features for each frame:

$$\delta_{ij} = w_m^{\top}\tanh(W_m^{v} v_i^{en} + W_m^{q} q_j + b_m)$$
$$\hat{\delta}_{ij} = \frac{\exp(\delta_{ij})}{\sum_{k=1}^{n_q}\exp(\delta_{ik})}$$
$$h_i = \sum_{j=1}^{n_q}\hat{\delta}_{ij}\, q_j$$

where $h_i$ is the integrated text representation of the $i$-th frame, $W_m^{v}$ and $W_m^{q}$ are projection matrices, $b_m$ is a bias vector, $w_m^{\top}$ is a row vector, $\delta_{ij}$ is the matching score between the $i$-th enhanced frame feature in the enhanced video stream and the semantic feature of the $j$-th word in the description text, and $\hat{\delta}_{ij}$ is the matching score after softmax processing.
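The frame-to-word attention above, sketched under the same illustrative assumptions about dimensions and naming:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameToWordAttention(nn.Module):
    """Sketch of frame-to-word attention that summarizes text per frame."""
    def __init__(self, v_dim=500, q_dim=512, hidden=256):
        super().__init__()
        self.proj_v = nn.Linear(v_dim, hidden, bias=False)   # W_m^v
        self.proj_q = nn.Linear(q_dim, hidden)               # W_m^q, b_m
        self.row = nn.Linear(hidden, 1, bias=False)          # w_m^T

    def forward(self, V_en, Q):                # V_en: (n_v, v_dim), Q: (n_q, q_dim)
        delta = self.row(torch.tanh(
            self.proj_v(V_en).unsqueeze(1) + self.proj_q(Q).unsqueeze(0))).squeeze(-1)
        delta_hat = F.softmax(delta, dim=-1)   # normalize over words for each frame
        H = delta_hat @ Q                      # h_i = Σ_j δ̂_ij q_j
        return H                               # (n_v, q_dim) integrated text per frame

H = FrameToWordAttention()(torch.randn(200, 500), torch.randn(12, 512))
print(H.shape)
```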
2.2) The enhanced frame features and the integrated text features are then passed through a cross gate so that they interact with each other:

$$g_i^{v} = \sigma(W_v h_i + b_v), \qquad \tilde{v}_i^{en} = v_i^{en} \odot g_i^{v}$$
$$g_i^{t} = \sigma(W_t v_i^{en} + b_t), \qquad \tilde{h}_i = h_i \odot g_i^{t}$$

where $g_i^{v}$ is the visual gate, $g_i^{t}$ is the text gate, $\odot$ denotes element-wise multiplication, $W_v$ and $W_t$ are projection matrices, and $b_v$ and $b_t$ are bias vectors. The language-aware frame feature $f_i$ is obtained by concatenating $\tilde{v}_i^{en}$ and $\tilde{h}_i$, yielding the language-aware frame feature sequence $F = \{f_i\}_{i=1}^{n_v}$.
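A sketch of the cross gate and the concatenation producing language-aware frame features; dimensions are illustrative:

```python
import torch
import torch.nn as nn

class CrossGate(nn.Module):
    """Sketch of the cross gate: each modality gates the other, then concatenate."""
    def __init__(self, v_dim=500, h_dim=512):
        super().__init__()
        self.gate_v = nn.Linear(h_dim, v_dim)  # W_v, b_v: text -> visual gate
        self.gate_t = nn.Linear(v_dim, h_dim)  # W_t, b_t: visual -> text gate

    def forward(self, V_en, H):
        g_v = torch.sigmoid(self.gate_v(H))    # visual gate from integrated text
        g_t = torch.sigmoid(self.gate_t(V_en)) # text gate from enhanced frames
        V_tilde = V_en * g_v                   # gated frame features
        H_tilde = H * g_t                      # gated text features
        return torch.cat([V_tilde, H_tilde], dim=-1)  # language-aware frame features

F_feat = CrossGate()(torch.randn(200, 500), torch.randn(200, 512))
print(F_feat.shape)  # torch.Size([200, 1012])
```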
2.3) According to the language-aware frame features $F = \{f_i\}_{i=1}^{n_v}$, a two-dimensional period feature map is constructed through a two-dimensional temporal network. The constructed map $M \in \mathbb{R}^{n_v \times n_v \times d}$ has three dimensions: the first two index the start frame and end frame of a period, and the third is the feature dimension. The feature of period $[a, b]$ is computed by pooling the frame features $f_a, \ldots, f_b$; the part of the map with $a > b$ is treated as invalid and filled with 0. In addition, when $n_v$ is too large, sparse sampling is applied to save computational cost.
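The construction of the two-dimensional period feature map can be sketched as follows; max pooling over the frames of a period is an assumption (a common choice in two-dimensional temporal networks), as the text does not state the pooling operator:

```python
import torch

def build_2d_period_map(F_feat):
    """Sketch of the 2D period feature map; max pooling over frames [a, b] is an
    assumption. Invalid cells (start > end) are left as zeros."""
    n_v, d = F_feat.shape
    M = torch.zeros(n_v, n_v, d)
    for a in range(n_v):
        # cumulative max over f_a..f_b covers every period starting at a in one pass
        M[a, a:] = torch.cummax(F_feat[a:], dim=0).values
    return M  # M[a, b] is the feature of period [a, b]

M = build_2d_period_map(torch.randn(64, 32))
print(M.shape)  # torch.Size([64, 64, 32])
```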
According to the constructed two-dimensional period feature map, two two-dimensional convolutions are applied to explore the relations between adjacent periods, giving the cross-modal features $\{m_i^{en}\}_{i=1}^{M^{en}}$, where $m_i^{en}$ is the cross-modal feature corresponding to the $i$-th period in the two-dimensional period feature map and $M^{en}$ is the number of periods in the map. A proposal score is then computed for every period, each period corresponding to one enhanced proposal; the $i$-th enhanced proposal score $k_i^{en}$ is computed as:

$$k_i^{en} = \sigma(W_p\, m_i^{en} + b_p)$$

where $W_p$ and $b_p$ are a projection matrix and bias vector.
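A sketch of the two 2D convolutions and per-period scoring; kernel sizes and channel counts are assumptions:

```python
import torch
import torch.nn as nn

class ProposalScorer(nn.Module):
    """Sketch of two 2D convolutions over the period map plus per-period scores."""
    def __init__(self, d=32, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(               # two 2D convolutions relate
            nn.Conv2d(d, hidden, 3, padding=1),  # neighboring periods
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.ReLU(),
        )
        self.score = nn.Conv2d(hidden, 1, 1)     # 1x1 conv plays the role of W_p, b_p

    def forward(self, M):                        # M: (n_v, n_v, d)
        x = self.conv(M.permute(2, 0, 1).unsqueeze(0))   # (1, hidden, n_v, n_v)
        k = torch.sigmoid(self.score(x)).squeeze()       # (n_v, n_v) scores
        return k                                 # k[a, b]: score of period [a, b]

k = ProposalScorer()(torch.randn(64, 64, 32))
print(k.shape)
```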
2.4) A center-based proposal screening method is adopted to screen the proposals according to the computed proposal scores. Specifically, the proposal with the highest score is taken as the center proposal, the remaining proposals are sorted by their period overlap with it, and the $T-1$ proposals with the highest overlap are taken; these $T$ proposals together form the positive proposal set, recorded as $P^{en}$ with corresponding scores $C^{en}$ and boundaries $L^{en}$. In this way a series of related positive proposals can be selected; similarly, the suppressed branch can effectively generate reasonable negative proposals, with the negative proposal set recorded as $P^{sp}$ and its corresponding scores $C^{sp}$ and boundaries $L^{sp}$.
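The center-based proposal screening can be sketched in plain Python over the 2D score map; the helper names are illustrative:

```python
import torch

def temporal_iou(p, q):
    """IoU between periods p = (a1, b1) and q = (a2, b2), in frame indices."""
    inter = max(0, min(p[1], q[1]) - max(p[0], q[0]) + 1)
    union = (p[1] - p[0] + 1) + (q[1] - q[0] + 1) - inter
    return inter / union

def center_based_screening(k, T=8):
    """Sketch of center-based screening on the 2D score map k (n_v x n_v)."""
    n_v = k.shape[0]
    valid = [(a, b) for a in range(n_v) for b in range(a, n_v)]
    center = max(valid, key=lambda p: k[p])            # highest-scoring proposal
    rest = sorted((p for p in valid if p != center),
                  key=lambda p: temporal_iou(p, center), reverse=True)
    chosen = [center] + rest[:T - 1]                   # center + T-1 most overlapping
    return [(p, float(k[p])) for p in chosen]          # (boundary, score) pairs

proposals = center_based_screening(torch.rand(64, 64))
print(proposals[0])
```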
2.5) Based on the generated positive and negative proposals and their scores, the enhancement score $K^{en}$ and the suppression score $K^{sp}$ are calculated, along with the intra-sample loss function term:

$$\mathcal{L}_{intra} = \max(0,\; \Delta_{intra} - K^{en} + K^{sp})$$

where $\Delta_{intra}$ is a margin value (taken as 0.4 in the present invention). $\mathcal{L}_{intra}$ encourages intra-sample confrontation by increasing the positive proposal scores and decreasing the negative proposal scores, distinguishing the target period from similar interfering negative periods in the same data pair.
Based on the generated positive proposals, the enhanced branch score $K^{en}$ is calculated; an unmatched video and an unmatched text are then randomly sampled for the current data, and the corresponding enhanced proposal score sums $K^{en}_{v'}$ and $K^{en}_{q'}$ are calculated in the same way. The inter-sample loss function term is:

$$\mathcal{L}_{inter} = \max(0,\; \Delta_{inter} - K^{en} + K^{en}_{v'}) + \max(0,\; \Delta_{inter} - K^{en} + K^{en}_{q'})$$

where $\Delta_{inter}$ is a margin value (taken as 0.6 in the present invention), and $K^{en}_{v'}$ and $K^{en}_{q'}$ are the matching scores between the current data and the randomly sampled unmatched video or text. $\mathcal{L}_{inter}$ encourages inter-sample confrontation by increasing the matching-sample score and decreasing the non-matching-sample scores, so that a matching positive sample has a higher score than non-matching negative samples.
2.6) For the generated proposal results, regularization is adopted to introduce prior knowledge and stabilize the model training process. Specifically:
Considering that most periods are unselected, i.e., do not match the text description, a global loss function term is used to reduce the average proposal score:

$$\mathcal{L}_{global} = \frac{1}{M^{en}} \sum_{i=1}^{M^{en}} k_i^{en}$$

where $M^{en}$ is the number of all periods in the two-dimensional period feature map. The $\mathcal{L}_{global}$ term pushes the scores of unselected periods toward 0, while the $\mathcal{L}_{intra}$ and $\mathcal{L}_{inter}$ terms keep the positive proposal scores high.
Considering that the most accurate period proposal is selected from a series of positive proposals as the final result, the softmax function is applied to all positive proposal scores to obtain $\hat{c}_i^{en}$, and a gap loss function term is introduced:

$$\mathcal{L}_{gap} = -\sum_{i=1}^{T} \hat{c}_i^{en} \log \hat{c}_i^{en}$$

to encourage widening the score gap between the positive proposals.
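A sketch of the two regularization terms; the entropy form of the gap loss is an assumption, chosen because minimizing the entropy of the softmax distribution sharpens it and so widens the score gap between positive proposals as described:

```python
import torch
import torch.nn.functional as F

def global_loss(k_all):
    """Mean score over all periods in the 2D map; minimizing it pushes
    unselected periods toward 0."""
    return k_all.mean()

def gap_loss(c_pos):
    """Gap loss sketched as the entropy of the softmax over positive-proposal
    scores; the entropy form is an assumption, not stated in the text."""
    p = F.softmax(c_pos, dim=0)
    return -(p * torch.log(p + 1e-8)).sum()

print(global_loss(torch.rand(64, 64)))
print(gap_loss(torch.tensor([0.9, 0.7, 0.6, 0.3])))
```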
Considering that the two branches are consistent in structure and shared in parameters, the regularization strategy is only applied to the enhanced branch.
The final multitask loss function comprises the above four loss terms with corresponding hyper-parameters:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{inter} + \lambda_2 \mathcal{L}_{intra} + \lambda_3 \mathcal{L}_{global} + \lambda_4 \mathcal{L}_{gap}$$

where $\mathcal{L}$ is the multitask loss, $\mathcal{L}_{inter}$ is the inter-sample confrontation loss, $\mathcal{L}_{intra}$ is the intra-sample confrontation loss, $\mathcal{L}_{global}$ and $\mathcal{L}_{gap}$ are the two proposal regularization losses, and $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ are hyper-parameters.
In an embodiment of the present invention, a system for retrieving a weakly supervised video period based on a two-branch proposed network is further provided, including:
the data acquisition module is used for acquiring videos and description texts as training data sets when the system is in a training stage; when the system is in the detection stage, the system is used for acquiring the video to be detected and the question sentences.
And the characteristic extraction module is used for extracting frame characteristics from the video and extracting text characteristics from the description text and the question sentences. Specifically, text features can be extracted by adopting pre-trained glov word2vec, and video features can be extracted by adopting a pre-trained video feature extractor.
A cross-modal language-aware filtering module, which receives the frame features $V = \{v_i\}_{i=1}^{n_v}$ and the text features $Q = \{q_i\}_{i=1}^{n_q}$ as input and outputs an enhanced video stream and a suppressed video stream containing the text features:

$$V^{en}, V^{sp} = \mathrm{Filter}(V, Q)$$

where $V^{en} = \{v_i^{en}\}_{i=1}^{n_v}$ denotes the enhanced video stream and $V^{sp} = \{v_i^{sp}\}_{i=1}^{n_v}$ denotes the suppressed video stream. The enhanced video stream highlights language-related key frame features and weakens irrelevant frame features, and the suppressed video stream does the opposite.
Specifically, the cross-modal language-aware filtering module comprises: a NetVLAD submodule for generating, from the text feature sequence $Q = \{q_i\}_{i=1}^{n_q}$, the cluster centers and the scene-based language feature sequence $U = \{u_j\}_{j=1}^{n_c}$; a cross-modal estimation submodule for computing the cross-modal matching scores between the frame feature sequence $V = \{v_i\}_{i=1}^{n_v}$ and the scene-based language feature sequence $U$; and a two-branch gate submodule for generating the enhanced video stream and the suppressed video stream from the normalized score distribution and the frame feature sequence $V$.
A regularized two-branch proposal network module: it consists of an enhanced branch submodule and a suppressed branch submodule, and is used for taking the generated enhanced video stream and text features as input to the enhanced branch proposal network, outputting proposal results and scores, and screening to obtain the positive proposal set $P^{en}$ with corresponding scores $C^{en}$ and boundaries $L^{en}$; and for taking the generated suppressed video stream and text features as input to the suppressed branch proposal network, outputting proposal results and scores, and screening to obtain the negative proposal set $P^{sp}$ with corresponding scores $C^{sp}$ and boundaries $L^{sp}$. This can be expressed as:

$$P^{en}, L^{en}, C^{en} = \mathrm{EnhancedBranch}_{\Theta}(V^{en}, Q)$$
$$P^{sp}, L^{sp}, C^{sp} = \mathrm{SuppressedBranch}_{\Theta}(V^{sp}, Q)$$
The enhanced branch submodule comprises:
a cross-modal interaction unit, for summarizing the text features of each frame of the video to obtain integrated text features;
a Bi-GRU submodule, for making the integrated text features and the enhanced video stream interact with each other to obtain language-aware frame features;
a two-dimensional period feature map submodule, in which a two-dimensional period feature map is constructed; the map comprises three dimensions: the first two index the start frame and end frame of a period, and the third is the feature dimension. The submodule is used for exploring the relations between adjacent periods, obtaining the cross-modal features of all periods in the two-dimensional map from the language-aware frame features, and computing the score and boundary of each proposal from those cross-modal features;
a proposal screening submodule, for screening proposals using the center-based proposal screening method and outputting the screened positive proposal set.
The suppressed branch submodule has the same structure as the enhanced branch submodule and shares its parameters, and finally generates the negative proposal set.
A proposal regularization strategy is introduced into the enhanced branch submodule, comprising the $\mathcal{L}_{global}$ loss term and the $\mathcal{L}_{gap}$ loss term. Specifically, the $\mathcal{L}_{global}$ term excludes the interference of non-matching proposals by reducing the mean proposal score so that the scores of unselected proposals are close to 0, while the $\mathcal{L}_{intra}$ and $\mathcal{L}_{inter}$ terms keep the positive proposal scores high; the $\mathcal{L}_{gap}$ term widens the score gap between positive proposals, so that the most accurate proposal can be selected from the series of positive proposals as the final result.
Meanwhile, the regularized two-branch proposal network adopts inter-sample and intra-sample confrontation: intra-sample confrontation is encouraged by increasing the positive proposal scores and decreasing the negative proposal scores; inter-sample confrontation is encouraged by increasing the matching-sample scores and decreasing the non-matching-sample scores.
In the specific embodiments provided in the present application, it should be understood that the above-described system embodiment is merely illustrative. For example, the division of the regularized two-branch proposal network module is a logical functional division; other divisions are possible in actual implementation, for example multiple modules may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the connections between the modules shown or discussed may be communication connections via interfaces, electrical or otherwise.
The method is applied to the following embodiments to achieve the technical effects of the present invention, and detailed steps in the embodiments are not described again.
Examples
The invention was experimentally verified on three data sets: Charades-STA, ActivityCaption and DiDeMo. The specifics of the three data sets are as follows:
The Charades-STA data set contains 9848 indoor activity videos with an average duration of 29.8 seconds; it provides 12408 sentence-period pairs for training and 3720 for testing.
The ActivityCaption data set contains 19209 videos with varied content and an average duration of about 2 minutes; it provides 37417, 17505 and 17031 sentence-period pairs for training, validation and testing, respectively.
The DiDeMo data set contains 10464 videos, each 25-30 seconds long; it provides 33005, 4180 and 4021 sentence-period pairs for training, validation and testing, respectively. In particular, each video in DiDeMo is divided into six five-second clips, and the target period often comprises one or more consecutive clips, so DiDeMo has only 21 candidate periods, compared with Charades-STA and ActivityCaption.
For evaluation, the invention follows widely used criteria, using R@n, IoU=m for Charades-STA and ActivityCaption, and Rank@1, Rank@5 and mIoU for DiDeMo. More specifically, the IoU values between the predicted periods and the ground truth are first computed; R@n, IoU=m then calculates the percentage of cases where the IoU value is greater than m for at least one of the top n periods; mIoU calculates the mean of the top-1 period IoU values over all test samples; Rank@1 and Rank@5 calculate the percentage of cases where the ground truth ranks first or within the top five.
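For illustration, the R@n, IoU=m criterion can be computed as follows (plain Python; the helper names are hypothetical):

```python
def iou_1d(pred, gt):
    """Temporal IoU between two periods given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def r_at_n_iou_m(ranked_preds, gts, n=1, m=0.5):
    """R@n, IoU=m: fraction of samples where one of the top-n predictions
    reaches IoU > m with the ground truth."""
    hits = sum(any(iou_1d(p, gt) > m for p in preds[:n])
               for preds, gt in zip(ranked_preds, gts))
    return hits / len(gts)

# Two test samples, each with two ranked predicted periods.
preds = [[(2.0, 7.0), (0.0, 3.0)], [(10.0, 20.0), (12.0, 18.0)]]
gts = [(2.5, 7.5), (0.0, 5.0)]
print(r_at_n_iou_m(preds, gts, n=1, m=0.5))  # 0.5: only the first sample hits
```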
Tables 1 to 3 show the experimental results of the invention (abbreviated RTBPN) on the Charades-STA, ActivityCaption and DiDeMo data sets.
TABLE 1 Experimental results on the Charades-STA data set
[table reproduced as an image in the original publication]
TABLE 2 Experimental results on the ActivityCaption data set
[table reproduced as an image in the original publication]
TABLE 3 Experimental results on the DiDeMo data set
[table reproduced as an image in the original publication]
Because the training data of a weakly supervised algorithm consists only of coarse-grained sentence-level labels, its training process often includes much invalid or even counterproductive learning, so its results under the same framework are much worse than those of a fully supervised algorithm with period-level labels for each sentence.
However, as can be seen from tables 1-3, since the invention considers both inter-sample and intra-sample confrontation, intra-sample confrontation is encouraged by the intra-sample loss, distinguishing the target period from similar interfering negative periods in the same data pair, and inter-sample confrontation is encouraged by the inter-sample loss, so that matching positive samples score higher than non-matching negative samples. Specifically, the enhanced video stream generated by the language-aware filter with its scene-based cross-modal estimation method highlights language-related key frame features and weakens irrelevant ones, while the suppressed video stream does the opposite; the novel regularized two-branch proposal network with its center-based proposal screening technique screens out a superior positive proposal set and a reasonable negative proposal set. As a result, the video period retrieval performance of the invention exceeds that of the early fully supervised algorithms VSA-RNN, VSA-STV and CTRL, and is on par with TGN, QSPN and MCN, which demonstrates the superiority of the invention's performance.
Compared with the other existing weakly supervised algorithms, the invention only slightly trails in the single case of R@1, IoU=0.3 on the ActivityCaption data set, and improves in all other cases, which suffices to show that the invention exceeds the other existing weakly supervised algorithms in video period retrieval performance.
The foregoing merely illustrates specific embodiments of the invention. The invention is obviously not limited to the above embodiments, and many variations are possible. All modifications that a person skilled in the art can derive or conceive from the disclosure of the present invention are to be considered within the scope of the invention.

Claims (7)

1. A weakly supervised video time interval retrieval method based on a two-branch proposal network, characterized by comprising the following steps:
1) establishing a network model consisting of a cross-modal language perception filter and a regularized two-branch proposal network, wherein the regularized two-branch proposal network comprises an enhanced branch proposal network and a suppressed branch proposal network;
2) acquiring a video and a description text as a training data set, and extracting frame characteristics of the video and text characteristics of the description text;
3) taking the frame features and the text features as input to the cross-modal language-aware filter, and generating an enhanced video stream carrying the text features and a suppressed video stream; the cross-modal language-aware filter generates the enhanced and suppressed video streams using a scene-based cross-modal estimation method, specifically comprising:
3.1) projecting the text features $Q = \{q_i\}_{i=1}^{n_q}$ onto cluster centers using the NetVLAD technique to obtain the scene-based language feature sequence $U = \{u_j\}_{j=1}^{n_c}$, where $n_q$ is the number of words, $q_i$ is the semantic feature of the $i$-th word, $n_c$ is the number of centers, and $u_j$ is the language feature of the $j$-th scene;
3.2) computing the cross-modal matching scores between the frame feature sequence $V = \{v_i\}_{i=1}^{n_v}$ and the scene-based language feature sequence $U = \{u_j\}_{j=1}^{n_c}$, where $n_v$ is the number of frame features and $v_i$ is the frame feature of the $i$-th frame in the video;
3.3) computing the score of each frame in the video from the cross-modal matching scores and normalizing them to obtain the normalized score distribution;
3.4) processing the normalized score distribution and the frame feature sequence $V = \{v_i\}_{i=1}^{n_v}$ with a two-branch gate to generate the enhanced video stream and the suppressed video stream;
4) taking the generated enhanced video stream and the text features as input to the enhanced branch proposal network, outputting proposal results and scores, and screening to obtain a positive proposal set; taking the generated suppressed video stream and the text features as input to the suppressed branch proposal network, outputting proposal results and scores, and screening to obtain a negative proposal set;
the enhanced branch proposal network operates specifically as follows:
4.1) taking the generated enhanced video stream and the text features as input to the enhanced branch proposal network, and summarizing the text features for each frame with a frame-to-word attention structure to obtain integrated text features;
4.2) making the integrated text features and the enhanced video stream interact with each other through a cross gate to obtain language-aware frame features;
4.3) constructing a two-dimensional period feature map through a two-dimensional temporal network according to the language-aware frame features; the constructed two-dimensional period feature map comprises three dimensions: the first two index the start frame and end frame of a period, and the third is the feature dimension;
performing two two-dimensional convolutions on the constructed two-dimensional period feature map to obtain the cross-modal features of all periods in the two-dimensional map; according to the cross-modal features of all periods in the two-dimensional map, each period corresponds to one proposal result, and the score of each proposal result is calculated;
4.4) screening the proposals using center-based proposal screening: taking the proposal result with the highest score as the center proposal, sorting the remaining proposals by period overlap, taking the T-1 proposals with the highest overlap, and taking the center proposal together with those T-1 proposals as the screened positive proposals, which form the positive proposal set; the positive proposal set is a set consisting of the positive proposal results, scores and boundaries;
the suppressed branch proposal network has the same structure as the enhanced branch proposal network, with shared parameters;
5) introducing proposal regularization into the enhanced branch proposal network, calculating a multitask loss function from intra-sample confrontation, inter-sample confrontation and proposal regularization, and updating the parameters of the cross-modal language-aware filter and the regularized two-branch proposal network to obtain the trained network model;
6) for the video to be detected and the query sentence, extracting the frame features of the video and the text features of the query sentence respectively, taking them as input to the trained network model, and obtaining the positive proposal with the highest predicted score as the retrieval result.
2. The method for retrieving the video time interval under the weak supervision based on the two-branch proposed network as claimed in claim 1, wherein the step 2) is specifically as follows:
2.1) acquiring a video and a description text as a training data set;
2.2) extracting the word features in the text using the pre-trained GloVe word2vec embedding method, then taking the word features as input to a Bi-GRU network and learning word semantic representations with context information as the text feature sequence $Q = \{q_i\}_{i=1}^{n_q}$, where $n_q$ is the number of words in the description text and $q_i$ is the semantic feature of the $i$-th word;
2.3) extracting the visual features of the video using a pre-trained video feature extractor, and reducing the length of the visual feature sequence by temporal average pooling to obtain the frame feature sequence of the video $V = \{v_i\}_{i=1}^{n_v}$, where $n_v$ is the number of frame features and $v_i$ is the frame feature of the $i$-th frame in the video; the pre-trained video feature extractor differs with the training data set: C3D features are extracted for the Charades-STA and ActivityCaption data sets, and optical-flow features are extracted for the DiDeMo data set.
3. The method according to claim 1, wherein proposal regularization is introduced into the enhanced branch proposal network and fed back to the suppressed branch proposal network in the form of shared parameters;
the proposal regularization loss functions are specifically:

$$\mathcal{L}_{global} = \frac{1}{M^{en}} \sum_{i=1}^{M^{en}} k_i^{en}, \qquad \mathcal{L}_{gap} = -\sum_{i=1}^{T} \hat{c}_i^{en} \log \hat{c}_i^{en}$$

where $\mathcal{L}_{global}$ and $\mathcal{L}_{gap}$ are the two proposal regularization losses, $M^{en}$ is the number of all periods in the two-dimensional period feature map, $k_i^{en}$ is the score of the $i$-th period, $c_i^{en}$ is the score of the $i$-th positive proposal, $\hat{c}_i^{en}$ is the $i$-th positive proposal score processed by the softmax function, and $T$ denotes the number of positive proposals in the positive proposal set.
4. The method according to claim 1, wherein the inter-sample confrontation and intra-sample confrontation loss functions are as follows:

$$\mathcal{L}_{intra} = \max(0,\; \Delta_{intra} - K^{en} + K^{sp})$$
$$\mathcal{L}_{inter} = \max(0,\; \Delta_{inter} - K^{en} + K^{en}_{v'}) + \max(0,\; \Delta_{inter} - K^{en} + K^{en}_{q'})$$

where $\mathcal{L}_{inter}$ is the inter-sample confrontation loss, $\mathcal{L}_{intra}$ is the intra-sample confrontation loss, $K^{en}$ is the sum of all positive proposal scores, $K^{sp}$ is the sum of all negative proposal scores, $\Delta_{intra}$ is the positive proposal score margin, $\Delta_{inter}$ is the negative proposal score margin, and $K^{en}_{v'}$ and $K^{en}_{q'}$ are the matching scores of the current data with a randomly sampled unmatched video or text.
5. The method according to claim 3 or 4, wherein the multitask loss function is specifically:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{inter} + \lambda_2 \mathcal{L}_{intra} + \lambda_3 \mathcal{L}_{global} + \lambda_4 \mathcal{L}_{gap}$$

where $\mathcal{L}$ is the multitask loss, $\mathcal{L}_{inter}$ is the inter-sample confrontation loss, $\mathcal{L}_{intra}$ is the intra-sample confrontation loss, $\mathcal{L}_{global}$ and $\mathcal{L}_{gap}$ are the two proposal regularization losses, and $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ are hyper-parameters.
6. A weakly supervised video time interval retrieval system based on a two-branch proposal network, for implementing the weakly supervised video time interval retrieval method of claim 1, the system comprising:
a data acquisition module, for acquiring videos and description texts as the training data set when the system is in the training stage, and for acquiring the video to be detected and the query sentence when the system is in the detection stage;
a feature extraction module, for extracting frame features from the video and text features from the description text and the query sentence;
a cross-modal language-aware filtering module, for receiving the frame features $V = \{v_i\}_{i=1}^{n_v}$ and the text features $Q = \{q_i\}_{i=1}^{n_q}$ as input and outputting an enhanced video stream containing the text features and a suppressed video stream, where $n_v$ is the number of frame features, $v_i$ is the frame feature of the $i$-th frame in the video, $n_q$ is the number of words, and $q_i$ is the semantic feature of the $i$-th word; the module comprises a NetVLAD submodule for generating, from the text feature sequence $Q = \{q_i\}_{i=1}^{n_q}$, the cluster centers and the scene-based language feature sequence $U = \{u_j\}_{j=1}^{n_c}$; a cross-modal estimation submodule for computing the cross-modal matching scores between the frame feature sequence $V = \{v_i\}_{i=1}^{n_v}$ and the scene-based language feature sequence $U = \{u_j\}_{j=1}^{n_c}$; and a two-branch gate submodule for generating the enhanced video stream and the suppressed video stream from the normalized score distribution and the frame feature sequence $V = \{v_i\}_{i=1}^{n_v}$, where $n_c$ is the number of centers and $u_j$ is the language feature of the $j$-th scene;
a regularized two-branch proposal network module, consisting of an enhanced branch submodule and a suppressed branch submodule, for taking the generated enhanced video stream and text features as input to the enhanced branch proposal network, outputting proposal results and scores, and screening to obtain a positive proposal set; and for taking the generated suppressed video stream and text features as input to the suppressed branch proposal network, outputting proposal results and scores, and screening to obtain a negative proposal set.
7. The system of claim 6, wherein the enhanced branch submodule comprises:
a cross-modal interaction unit, for summarizing the text features of each frame of the video to obtain integrated text features;
a Bi-GRU submodule, for making the integrated text features and the enhanced video stream interact with each other to obtain language-aware frame features;
a two-dimensional period feature map submodule, in which a two-dimensional period feature map is constructed; the map comprises three dimensions: the first two index the start frame and end frame of a period, and the third is the feature dimension; the submodule is used for exploring the relations between adjacent periods, obtaining the cross-modal features of all periods in the two-dimensional map from the language-aware frame features, and outputting the proposal results and scores from those cross-modal features;
a proposal screening submodule, which screens proposals using the center-based proposal screening method to screen the positive proposal set from the proposal results.
CN202011332463.XA 2020-11-24 2020-11-24 Weak supervision video time interval retrieval method and system based on two-branch proposed network Active CN112417206B (en)

Priority Applications (1)

Application Number: CN202011332463.XA; Priority Date: 2020-11-24; Filing Date: 2020-11-24; Title: Weak supervision video time interval retrieval method and system based on two-branch proposed network

Applications Claiming Priority (1)

Application Number: CN202011332463.XA; Priority Date: 2020-11-24; Filing Date: 2020-11-24; Title: Weak supervision video time interval retrieval method and system based on two-branch proposed network

Publications (2)

Publication Number Publication Date
CN112417206A CN112417206A (en) 2021-02-26
CN112417206B true CN112417206B (en) 2021-09-24

Family

ID=74778792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011332463.XA Active CN112417206B (en) 2020-11-24 2020-11-24 Weak supervision video time interval retrieval method and system based on two-branch proposed network

Country Status (1)

Country Link
CN (1) CN112417206B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112685597B (en) * 2021-03-12 2021-07-13 杭州一知智能科技有限公司 Weak supervision video clip retrieval method and system based on erasure mechanism
CN113792594B (en) * 2021-08-10 2024-04-12 南京大学 Method and device for locating language fragments in video based on contrast learning
CN113836901B (en) * 2021-09-14 2023-11-14 灵犀量子(北京)医疗科技有限公司 Method and system for cleaning Chinese and English medical synonym data
CN113806482B (en) * 2021-09-17 2023-12-12 ***数智科技有限公司 Cross-modal retrieval method, device, storage medium and equipment for video text
CN113869272A (en) * 2021-10-13 2021-12-31 北京达佳互联信息技术有限公司 Processing method and device based on feature extraction model, electronic equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679580A (en) * 2017-10-21 2018-02-09 桂林电子科技大学 A kind of isomery shift image feeling polarities analysis method based on the potential association of multi-modal depth
EP3620976A1 (en) * 2018-09-07 2020-03-11 Volvo Car Corporation Methods and systems for providing fast semantic proposals for image and video annotation
CN111582170A (en) * 2020-05-08 2020-08-25 浙江大学 Method and positioning system for completing specified object positioning task in video by using object-aware multi-branch relation network
CN111931571A (en) * 2020-07-07 2020-11-13 华中科技大学 Video character target tracking method based on online enhanced detection and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11195057B2 (en) * 2014-03-18 2021-12-07 Z Advanced Computing, Inc. System and method for extremely efficient image and pattern recognition and artificial intelligence platform

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679580A (en) * 2017-10-21 2018-02-09 桂林电子科技大学 A kind of isomery shift image feeling polarities analysis method based on the potential association of multi-modal depth
EP3620976A1 (en) * 2018-09-07 2020-03-11 Volvo Car Corporation Methods and systems for providing fast semantic proposals for image and video annotation
CN111582170A (en) * 2020-05-08 2020-08-25 浙江大学 Method and positioning system for completing specified object positioning task in video by using object-aware multi-branch relation network
CN111931571A (en) * 2020-07-07 2020-11-13 华中科技大学 Video character target tracking method based on online enhanced detection and electronic equipment

Also Published As

Publication number Publication date
CN112417206A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
CN112417206B (en) Weak supervision video time interval retrieval method and system based on two-branch proposed network
CN108197109B (en) Multi-language analysis method and device based on natural language processing
Gong et al. Natural language inference over interaction space
Meng et al. Leveraging concept association network for multimedia rare concept mining and retrieval
CN111368088A (en) Text emotion classification method based on deep learning
CN109597995A (en) A kind of document representation method based on BM25 weighted combination term vector
CN111563373B (en) Attribute-level emotion classification method for focused attribute-related text
Zhang et al. Video-aided unsupervised grammar induction
CN113705218A (en) Event element gridding extraction method based on character embedding, storage medium and electronic device
CN110297986A (en) A kind of Sentiment orientation analysis method of hot microblog topic
Hassani et al. LVTIA: A new method for keyphrase extraction from scientific video lectures
Saleem et al. Stateful human-centered visual captioning system to aid video surveillance
CN112966507A (en) Method, device, equipment and storage medium for constructing recognition model and identifying attack
Guo [Retracted] Intelligent Sports Video Classification Based on Deep Neural Network (DNN) Algorithm and Transfer Learning
Matheven et al. Fake news detection using deep learning and natural language processing
CN115774782A (en) Multilingual text classification method, device, equipment and medium
CN115017356A (en) Image text pair judgment method and device
CN115048504A (en) Information pushing method and device, computer equipment and computer readable storage medium
CN114595370A (en) Model training and sorting method and device, electronic equipment and storage medium
Ksibi et al. Flickr-based semantic context to refine automatic photo annotation
Ruan et al. Chinese news text classification method based on attention mechanism
CN113204670A (en) Attention model-based video abstract description generation method and device
Zeng et al. Correcting the Bias: Mitigating Multimodal Inconsistency Contrastive Learning for Multimodal Fake News Detection
Ho et al. Uit at vbs 2022: An unified and interactive video retrieval system with temporal search
Wang et al. Video description with integrated visual and textual information

Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant