CN115035455A - Cross-category video temporal localization method, system and storage medium based on adversarial multi-modal domain adaptation - Google Patents

Cross-category video temporal localization method, system and storage medium based on adversarial multi-modal domain adaptation

Info

Publication number
CN115035455A
CN115035455A CN202210707517.9A CN202210707517A CN115035455A CN 115035455 A CN115035455 A CN 115035455A CN 202210707517 A CN202210707517 A CN 202210707517A CN 115035455 A CN115035455 A CN 115035455A
Authority
CN
China
Prior art keywords
video
features
text
category
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210707517.9A
Other languages
Chinese (zh)
Inventor
佘清顺
黄海烽
赵洲
陈哲乾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yizhi Intelligent Technology Co ltd
Zhejiang University ZJU
Original Assignee
Hangzhou Yizhi Intelligent Technology Co ltd
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yizhi Intelligent Technology Co ltd, Zhejiang University ZJU filed Critical Hangzhou Yizhi Intelligent Technology Co ltd
Priority to CN202210707517.9A priority Critical patent/CN115035455A/en
Priority to LU502690A priority patent/LU502690B1/en
Publication of CN115035455A publication Critical patent/CN115035455A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/096 Transfer learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-category video temporal localization method, system and storage medium based on adversarial multi-modal domain adaptation, belonging to the field of computer vision. Videos of different categories and their corresponding query texts are acquired, and visual features and text features are extracted; semantic information calibration is performed on the visual features and text features of the target-category videos through a cross-modal feature calibrator; the visual features of the target-category videos are randomly masked and reconstructed through a video feature reconstructor; the video features and text features are fused through a cross-modal feature fuser; single-modal domain-invariant feature representation learning is performed on the video features and text features, and cross-modal domain-invariant feature representation learning is performed on the initial fusion features, through a domain discriminator; and the final fusion features of the source-category videos are predicted through a biaffine predictor. The method realizes temporal localization for cross-category videos and improves the generalization ability of the model to unknown target videos.

Description

Cross-category video temporal localization method, system and storage medium based on adversarial multi-modal domain adaptation
Technical Field
The invention relates to the field of computer vision, and in particular to a cross-category video temporal localization method, system and storage medium based on adversarial multi-modal domain adaptation.
Background
The video temporal localization task aims to determine, in an untrimmed video, the temporal boundary of the video segment corresponding to a query text. Traditional methods include fully supervised and weakly supervised learning methods. Fully supervised learning takes a long time in the training stage and requires a large amount of manually labeled data, which is time-consuming and labor-intensive. Weakly supervised learning does not require a large amount of manually labeled data, but due to the lack of sufficient labels its performance differs greatly from that of fully supervised models. Moreover, both kinds of methods are developed under the premise that the training data and the test data are identically distributed, without considering the domain shift between different categories of scenes in the real world. Models trained with these two methods therefore generalize poorly to data of unknown categories and cannot well meet the requirements of real-world scenarios.
Disclosure of Invention
In view of the above problems, the invention provides a cross-category video temporal localization method based on adversarial multi-modal domain adaptation, so as to improve the generalization ability of the model when dealing with unknown target data.
To this end, the technical solution adopted by the invention is as follows:
In a first aspect, the invention provides a cross-category video temporal localization method based on adversarial multi-modal domain adaptation, comprising the following steps:
S1: acquiring source-category videos, target-category videos and the query text corresponding to each video, extracting the initial visual features of the videos and the initial text features of the query texts, and encoding the initial visual features and the initial text features as the final visual features and text features;
S2: performing semantic information calibration on the visual features and text features of the target-category videos obtained in step S1 through a cross-modal feature calibrator;
S3: randomly masking the visual features of the target-category videos obtained in step S1 through a video feature reconstructor and performing visual feature reconstruction to obtain reconstructed visual features;
S4: fusing the video features and text features obtained in step S1 through a cross-modal feature fuser to obtain the initial fusion features and final fusion features of the source-category videos and the initial fusion features and final fusion features of the target-category videos;
S5: performing single-modal domain-invariant feature representation learning on the video features and text features obtained in step S1 and cross-modal domain-invariant feature representation learning on the initial fusion features obtained in step S4 through a domain discriminator;
S6: predicting the final fusion features of the source-category videos obtained in step S4 through a biaffine predictor to obtain the prediction probability of every possible result corresponding to each query text, the result with the maximum prediction probability being the final prediction result.
In a second aspect, the invention provides a cross-category video temporal localization system based on adversarial multi-modal domain adaptation, which is used to implement the above cross-category video temporal localization method based on adversarial multi-modal domain adaptation.
In a third aspect, the invention provides a computer-readable storage medium, on which a program is stored, wherein the program, when executed by a processor, implements the above cross-category video temporal localization method based on adversarial multi-modal domain adaptation.
Compared with the prior art, the invention has the advantages that:
the invention provides a basic training module which is mainly used for supervised learning of source type data with labels; a confrontation field identification module is provided for learning the field invariance characteristics; a cross-module feature calibration module is provided for reducing semantic gaps between different modal features in the target category data; a video reconstruction module is provided for learning temporal semantic relationships and discriminable feature expressions. The invention firstly provides and realizes the video time positioning task aiming at cross-category video data, and the experimental result shows that the model provided by the invention has good generalization capability.
Drawings
FIG. 1 is a block diagram illustrating the overall architecture of a cross-category video temporal localization method based on adversarial multi-modal domain adaptation according to an exemplary embodiment;
FIG. 2 is a schematic diagram illustrating an adversarial domain discrimination module according to an exemplary embodiment;
FIG. 3 is a schematic diagram illustrating a cross-modal feature calibration module according to an exemplary embodiment;
FIG. 4 is a schematic diagram illustrating a video feature reconstruction module according to an exemplary embodiment;
FIG. 5 is a schematic diagram illustrating a terminal of a device with data processing capability according to an exemplary embodiment.
Detailed Description
The invention is further illustrated with reference to the following figures and examples. The figures are only schematic illustrations of the invention, and some of the block diagrams shown in the figures are functional entities, which do not necessarily have to correspond to physically or logically separate entities, and which may be implemented in software, or in one or more hardware modules or integrated circuits, or in different network and/or processor systems and/or microcontroller systems.
As shown in fig. 1, the cross-category video temporal localization method based on adversarial multi-modal domain adaptation proposed by the present invention comprises the following steps:
S1: First, the semantic information of the source-category videos and the target-category videos is extracted through a 3-dimensional convolutional neural network to obtain the initial visual features V̂^s and V̂^t, which are then encoded through a visual encoder to obtain the encoded visual features V^s and V^t; then, the semantic information of the source-category and target-category query texts is extracted through a GloVe model to obtain the initial text features Q̂^s and Q̂^t, which are encoded through a text encoder to obtain the encoded text features Q^s and Q^t.
The invention realizes the temporal localization task for cross-category videos for the first time, where the input data comprise labeled source-category data and unlabeled target-category data, so as to improve the generalization ability of the model to videos of unknown categories. The input data are defined as follows:
{V^s, Q^s, T^s} = {(V_i^s, Q_i^s, τ_i^s)}_{i=1}^{B}
{V^t, Q^t, T^t} = {(V_i^t, Q_i^t)}_{i=1}^{B}, T^t = ∅
where V^s is the set of source-category videos, Q^s is the set of query texts of the source-category videos, and T^s is the label set of the source-category data; (V_i^s, Q_i^s) denotes the ith source video in the source-category videos and its corresponding query text, and τ_i^s denotes the ground-truth temporal boundary matching the ith query text in the source-category videos to the ith source video; V^t is the set of target-category videos, Q^t is the set of query texts of the target-category videos, and T^t is the label set of the target-category data, which is empty in this embodiment; (V_i^t, Q_i^t) denotes the ith target video in the target-category videos and its corresponding query text; B denotes the batch size, i.e., the number of source or target videos and their corresponding query texts input to the model each time.
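As an illustration only, a labeled source batch and an unlabeled target batch of this form might be organized as in the following sketch; the field names and tensor shapes are assumptions made for exposition and are not part of the invention:

```python
# Illustrative sketch only: field names and shapes are assumptions for exposition.
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class Batch:
    videos: torch.Tensor        # (B, n, D_v) raw clip features, n frames per video
    queries: torch.Tensor       # (B, m, D_q) token embeddings, m words per query
    boundaries: Optional[torch.Tensor] = None  # (B, 2) ground-truth (fs, fe); None for target data

# A labeled source batch and an unlabeled target batch (T^t is empty).
B, n, m, D_v, D_q = 64, 128, 20, 1024, 300
source = Batch(torch.randn(B, n, D_v), torch.randn(B, m, D_q), torch.randint(0, n, (B, 2)))
target = Batch(torch.randn(B, n, D_v), torch.randn(B, m, D_q), boundaries=None)
```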
In this step, the same visual feature extractor and the same visual encoder are used to obtain and encode the visual features of the videos of the different categories, and the same text feature extractor and the same text encoder are used to obtain and encode the text features of the query texts corresponding to the videos of the different categories.
Specifically, step S1 is implemented as follows:
S1-1: The semantic information of the input source-category videos V^s and target-category videos V^t is extracted through a 3-dimensional convolutional neural network to obtain the initial visual features V̂^s and V̂^t; the semantic information of the query texts Q^s of the source-category videos and the query texts Q^t of the target-category videos is extracted through a GloVe model to obtain the initial text features Q̂^s and Q̂^t.
S1-2: The initial visual features V̂^s, V̂^t and the initial text features Q̂^s, Q̂^t obtained in step S1-1 are respectively projected to the same hidden dimension through a convolutional layer and a linear projection layer, and encoded through a multi-head attention layer; the encoded visual features and text features are taken as the final features for subsequent computation. The encoded visual features and text features are expressed as:
V^s = {V_i^s}_{i=1}^{B} ∈ R^{B×n×d}, V_i^s = {V_{i,j}^s}_{j=1}^{n}
V^t = {V_i^t}_{i=1}^{B} ∈ R^{B×n×d}, V_i^t = {V_{i,j}^t}_{j=1}^{n}
Q^s = {Q_i^s}_{i=1}^{B} ∈ R^{B×m×d}, Q_i^s = {Q_{i,j}^s}_{j=1}^{m}
Q^t = {Q_i^t}_{i=1}^{B} ∈ R^{B×m×d}, Q_i^t = {Q_{i,j}^t}_{j=1}^{m}
where n is the number of frames in each video, m is the query text length, and d is the hidden dimension; V^s are the visual features of the source-category videos, V_i^s the visual features of the ith source-category video, and V_{i,j}^s the visual features of the jth frame of the ith source-category video; V^t, V_i^t and V_{i,j}^t are defined analogously for the target-category videos; Q^s are the text features of the query texts of the source-category videos, Q_i^s the text features of the query text of the ith source-category video, and Q_{i,j}^s the text features of the jth word in the query text of the ith source-category video; Q^t, Q_i^t and Q_{i,j}^t are defined analogously for the target-category videos.
In this embodiment, the convolutional layer and linear projection layer used to encode the initial visual features V̂^s, V̂^t are denoted as the visual encoder, and the convolutional layer and linear projection layer used to encode the initial text features Q̂^s, Q̂^t are denoted as the text encoder.
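A minimal sketch of such a visual/text encoder (projection to the hidden dimension d followed by a multi-head attention layer) could look as follows; the layer sizes, kernel size and module names are assumptions, not a reference implementation:

```python
# Sketch under assumptions: layer sizes and ordering follow the description loosely.
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Projects initial features to hidden dim d and encodes them with multi-head attention."""
    def __init__(self, in_dim: int, d: int = 256, heads: int = 8):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, d, kernel_size=3, padding=1)  # convolutional layer
        self.proj = nn.Linear(d, d)                                 # linear projection layer
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:             # x: (B, L, in_dim)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)             # (B, L, d)
        h = self.proj(h)
        out, _ = self.attn(h, h, h)                                  # self-attention encoding
        return out                                                   # (B, L, d)

visual_encoder = FeatureEncoder(in_dim=1024, d=256)   # encoder for the 3D-CNN visual features
text_encoder = FeatureEncoder(in_dim=300, d=256)      # encoder for the GloVe text features
```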
S2: the visual characteristics obtained in the step S1 are calibrated by the cross-modal characteristic calibrator
Figure BDA00037059490500000424
And text features
Figure BDA00037059490500000425
Semantic information alignment is performed as shown in fig. 3.
In this step, the cross-modal feature calibrator applies a loss function
Figure BDA00037059490500000426
The visual features and the text features in the positive sample are made to be closer in the semantic space, and the similarity of the visual features and the text features in the negative sample in the semantic space is further reduced.
Loss function
Figure BDA0003705949050000051
The definition is as follows:
Figure BDA0003705949050000052
wherein, l (,) is used for calculating cosine similarity between vectors;
Figure BDA0003705949050000053
is the visual characteristics obtained according to step S1
Figure BDA0003705949050000054
And text features
Figure BDA0003705949050000055
Positive samples taken along the time axis average;
Figure BDA0003705949050000056
and
Figure BDA0003705949050000057
is directed to
Figure BDA0003705949050000058
And
Figure BDA0003705949050000059
selected corresponding negative examples, Z V Is directed to
Figure BDA00037059490500000510
Set of negative examples of (2), Z Q Is directed to
Figure BDA00037059490500000511
A set of negative examples of; Δ is a boundary.
The visual features and the text features of the target category after being encoded in the defining step S1 are respectively:
Figure BDA00037059490500000512
computing
Figure BDA00037059490500000513
And
Figure BDA00037059490500000514
average along the time axis, resulting in positive samples:
Figure BDA00037059490500000515
definition of
Figure BDA00037059490500000516
Negative sample of
Figure BDA00037059490500000517
Figure BDA00037059490500000518
Negative sample of
Figure BDA00037059490500000519
Figure BDA00037059490500000520
Loss function across modal feature calibrators during training
Figure BDA00037059490500000521
The calculation process is as follows:
Figure BDA00037059490500000522
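One possible way to realize this margin-based calibration objective is sketched below; treating the other samples in the batch as the negative sets Z_V and Z_Q is an assumption made for illustration:

```python
# Sketch: cosine-similarity triplet calibration with in-batch negatives (an assumption).
import torch
import torch.nn.functional as F

def calibration_loss(v_t: torch.Tensor, q_t: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """v_t: (B, n, d) target visual features; q_t: (B, m, d) target text features."""
    v_bar = F.normalize(v_t.mean(dim=1), dim=-1)   # (B, d) average along the time axis
    q_bar = F.normalize(q_t.mean(dim=1), dim=-1)   # (B, d) average over query words
    sim = v_bar @ q_bar.t()                        # sim[i, j] = cosine(v_bar_i, q_bar_j)
    pos = sim.diag().unsqueeze(1)                  # similarity of each matched (positive) pair
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # Anchor = visual average, negatives = non-matching text averages (set Z_Q).
    text_neg = F.relu(margin - pos + sim)[off_diag]
    # Anchor = text average, negatives = non-matching visual averages (set Z_V).
    vis_neg = F.relu(margin - pos + sim.t())[off_diag]
    return text_neg.mean() + vis_neg.mean()
```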
S3: The initial visual features V̂^t of the target-category videos obtained in step S1 are randomly masked and reconstructed by the video feature reconstructor to obtain the reconstructed visual features V_recon, which further learns the temporal semantic relations and discriminable features between the visual and text features, as shown in fig. 4.
S3-1: The initial visual features V̂^t of the target-category videos obtained in step S1 are randomly masked with probability β and encoded through the visual encoder to obtain the encoded masked visual features V_m^t.
S3-2: The masked visual features V_m^t obtained in step S3-1 and the text features Q^t of the query texts of the target-category videos obtained in step S1 are fused by the cross-modal feature fuser to obtain the initial fusion features F_m of the masked videos.
S3-3: Video feature reconstruction is performed on the masked visual features V_m^t from step S3-1 and the initial fusion features F_m from step S3-2 to obtain the reconstructed visual features V_recon: V_m^t and F_m are combined by element-wise addition (⊕) and passed through 1-dimensional convolutional layers (Conv1D) with a ReLU activation function.
In this embodiment, the training loss of the video feature reconstructor in step S3 is the mean square error loss (MSELoss) between the reconstructed visual features V_recon and the original, unmasked visual features of the target-category videos.
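A brief sketch of the masking and reconstruction branch is given below; masking by zeroing frame features, the two-layer Conv1D head and regressing against the unmasked encoded features are assumptions made for illustration:

```python
# Sketch: random frame masking and a Conv1D/ReLU reconstruction head (design details assumed).
import torch
import torch.nn as nn

class VideoFeatureReconstructor(nn.Module):
    def __init__(self, d: int = 256, beta: float = 0.2):
        super().__init__()
        self.beta = beta
        self.head = nn.Sequential(
            nn.Conv1d(d, d, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d, d, kernel_size=3, padding=1),
        )

    def mask(self, v_hat: torch.Tensor) -> torch.Tensor:
        """Zero out each frame's feature vector with probability beta."""
        keep = (torch.rand(v_hat.shape[:2], device=v_hat.device) > self.beta).float()
        return v_hat * keep.unsqueeze(-1)

    def forward(self, v_mask: torch.Tensor, f_m: torch.Tensor) -> torch.Tensor:
        """v_mask, f_m: (B, n, d); element-wise addition then Conv1D/ReLU reconstruction."""
        x = (v_mask + f_m).transpose(1, 2)            # (B, d, n) for Conv1d
        return self.head(x).transpose(1, 2)           # reconstructed features V_recon

# Training signal: mean squared error against the unmasked visual features.
mse = nn.MSELoss()
```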
S4: respectively carrying out cross-modal feature fusion on the video features and the text features obtained in the step S1
Figure BDA0003705949050000065
Figure BDA0003705949050000066
And
Figure BDA0003705949050000067
performing feature fusion to obtain a fusion feature F s
Figure BDA0003705949050000068
F t
Figure BDA0003705949050000069
S4-1: calculating the visual characteristics of the source type video obtained in step S1
Figure BDA00037059490500000610
And text features
Figure BDA00037059490500000611
Cross-mode similarity matrix S between them belongs to R n×m (ii) a In this embodiment, a cross-modal similarity matrix is calculated by using cosine similarity;
s4-2: respectively normalizing the cross-modal similarity matrix S obtained in the step S4-1 along rows and columns by utilizing a SoftMax function to obtain a similarity density matrix S r And S c
S4-3: the similarity density matrix S obtained according to the step S4-2 r And S c And calculating to obtain a video-query text attention matrix A v ∈R n×d And query the text-to-video attention matrix A q ∈R m×d The calculation process is as follows:
Figure BDA00037059490500000612
Figure BDA00037059490500000613
s4-4: the video-query text attention matrix A calculated according to the step S4-3 v And query the text-to-video attention matrix A q Calculating to obtain initial fusion characteristics F of source type videos s The calculation process is as follows:
Figure BDA00037059490500000614
wherein, W f ∈R 4d×d ,b f ∈R d Both are learnable parameters;
Figure BDA00037059490500000615
Figure BDA00037059490500000616
Figure BDA00037059490500000617
representing the initial fusion characteristics of the ith source class video,
Figure BDA00037059490500000618
and representing the initial fusion characteristics of the jth frame of the ith source category video.
S4-5: the initial fusion feature F obtained according to step S4-4 s Obtaining final fusion characteristics of source type videos through a multi-head attention mechanism
Figure BDA0003705949050000071
Wherein,
Figure BDA0003705949050000072
representing the final fusion features of the ith source class video,
Figure BDA0003705949050000073
representing the final fusion characteristics of the jth frame of the ith source category video;
similarly, according to the method of steps S4-1 to S4-5, the
Figure BDA0003705949050000074
By replacement with
Figure BDA0003705949050000075
Will be provided with
Figure BDA0003705949050000076
By replacement with
Figure BDA0003705949050000077
Obtaining the initial fusion characteristics F of the target category video t And final blend feature
Figure BDA0003705949050000078
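The cross-modal feature fuser can be sketched as the attention-based module below; the exact concatenation order of the 4d-dimensional representation and the use of a single projection layer are assumptions consistent with the shapes stated above:

```python
# Sketch: cosine-similarity cross-modal attention fusion (concatenation order is assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFuser(nn.Module):
    def __init__(self, d: int = 256, heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(4 * d, d)                       # W_f in R^{4d x d}, b_f in R^d
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, v: torch.Tensor, q: torch.Tensor):
        """v: (B, n, d) visual features; q: (B, m, d) text features."""
        s = F.normalize(v, dim=-1) @ F.normalize(q, dim=-1).transpose(1, 2)  # (B, n, m) cosine sims
        s_r = F.softmax(s, dim=-1)                            # row-normalized
        s_c = F.softmax(s, dim=1)                             # column-normalized
        a_v = s_r @ q                                         # (B, n, d) video-to-query attention
        a_q = s_c.transpose(1, 2) @ v                         # (B, m, d) query-to-video attention
        fused = torch.cat([v, a_v, v * a_v, v * (s_r @ a_q)], dim=-1)
        f_init = self.proj(fused)                             # initial fusion features F
        f_final, _ = self.attn(f_init, f_init, f_init)        # final fusion via multi-head attention
        return f_init, f_final
```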
S5: respectively comparing the video features obtained in step S1 with the domain identifier
Figure BDA0003705949050000079
And text features
Figure BDA00037059490500000710
Performing unimodal domain invariance feature expression learning and performing initial fusion feature F obtained in step S4 s 、F t Performing cross-modal domain invariance feature expression learning so as to reduce the domain gaps of different modal features;
in this embodiment, as shown in FIG. 2, the domain identifier includes a visual feature identifier D v Text feature identifier D q And a fusion feature discriminator D f
For video features
Figure BDA00037059490500000711
And text features
Figure BDA00037059490500000712
Video feature discriminator D v And a text feature discriminator D q The calculation process is as follows:
Figure BDA00037059490500000713
Figure BDA00037059490500000714
wherein, MLP v And MLP p The method is a multi-layer perceptron for visual features and text features, GRL is a gradient inversion layer, and sigma is a sigmod function.
For the initial fusion feature F s /F t Merging features identifier D f The calculation process is as follows:
D f (F k )=σ(MLP f (GRL(F k ))),k∈{s,t}
wherein, MLP f Is a multi-layer perceptron.
Video feature discriminator D v Text feature identifier D q And a fusion feature discriminator D f Output [0,1 ]]The scalar in between to represent the probability that the input feature is from the target class.
In the training process, firstly, the domain label is defined
Figure BDA00037059490500000715
Figure BDA00037059490500000716
That is, the feature label from the source category is 0, and the feature label from the target category is 1.
The calculation process of the confrontation loss of the domain discriminator in the training process is as follows:
Figure BDA0003705949050000081
wherein BCELoss is a binary cross entropy loss.
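The adversarial branch can be sketched as follows; the gradient reversal layer follows the standard formulation, while mean-pooling the feature sequences before the MLPs is an assumption, since the patent does not specify it:

```python
# Sketch: gradient reversal layer and a feature-level domain discriminator (pooling assumed).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam: float = 1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # reverse gradients flowing back into the encoders

class DomainDiscriminator(nn.Module):
    """Outputs the probability in [0, 1] that a feature sequence comes from the target category."""
    def __init__(self, d: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:   # feats: (B, L, d)
        h = GradReverse.apply(feats.mean(dim=1))               # GRL on time-pooled features
        return torch.sigmoid(self.mlp(h)).squeeze(-1)          # (B,)

d_v, d_q, d_f = DomainDiscriminator(), DomainDiscriminator(), DomainDiscriminator()
bce = nn.BCELoss()  # source features labeled 0, target features labeled 1
```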
S6: final fusion characteristics of the source category video obtained in step S4 through double affine predictor
Figure BDA0003705949050000087
And predicting to obtain the prediction probability of all possible results corresponding to each query text, wherein the maximum prediction probability of all possible results corresponding to the query text is the final prediction result.
S6-1: defining the candidate segment P (fs, fe) of query text match of each video in the source category video, wherein fs is the start frame number, fe is the end frame number, fs, fe are e [1, n ∈]. The final fusion feature of each video obtained from step S4
Figure BDA0003705949050000082
Extracting the corresponding fusion characteristics of the starting frame and the ending frame of the candidate segment
Figure BDA0003705949050000083
Generation of start frames and knots through linear layersHidden feature r of bundle frame fs 、r fe ∈R d The calculation process is as follows:
Figure BDA0003705949050000084
wherein, W 1 /W 2 ∈R 4d×d 、b 1 /b 2 ∈R d Parameters may be learned for the linear layer.
Since each video in the source category video consists of n video frames, all candidate segments can be represented as P ═ fs, fs ∈ [1, n ], fe ∈ [1, n ], fs ≦ fe.
S6-2: introducing the hidden feature r obtained in the step S6-1 fs 、r fe Obtaining a matrix containing the prediction probabilities of all candidate video segments P through a double affine mechanism
Figure BDA0003705949050000085
And combining the frames corresponding to the maximum value of the matrix M into a final prediction result. M is a group of P The calculation process is as follows:
Figure BDA0003705949050000086
wherein, U m ∈R d×d ,W m ∈R d ,b m E.g. R is a learnable parameter, M P Representing the probability that the video clip P corresponds to the query text.
In this embodiment, the dual affine predictor in step S6 uses a scaling intersection ratio (IoU) function as a supervision signal to complete the training process in the training process. The specific contents are as follows:
IoU Score of video clip P P The calculation process is as follows:
Figure BDA0003705949050000091
wherein τ is (τ) fsfe ) To search forThe true time boundary for text matching is queried. (Max score) representing IoU score matrix
Figure BDA0003705949050000092
Maximum value of (2).
The loss calculation process of the double affine predictor in the training process is as follows:
Figure BDA0003705949050000093
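A sketch of scoring all candidate (start, end) frame pairs with a biaffine function is shown below; producing r_fs and r_fe with two linear heads over the final fusion features and scoring all n×n pairs at once are illustrative assumptions:

```python
# Sketch: biaffine scoring of all candidate (start, end) frame pairs (details assumed).
import torch
import torch.nn as nn

class BiaffinePredictor(nn.Module):
    def __init__(self, d: int = 256):
        super().__init__()
        self.start = nn.Sequential(nn.Linear(d, d), nn.ReLU())   # produces r_fs
        self.end = nn.Sequential(nn.Linear(d, d), nn.ReLU())     # produces r_fe
        self.U = nn.Parameter(torch.randn(d, d) * 0.02)          # bilinear term U_m
        self.w = nn.Linear(2 * d, 1)                             # linear term over spliced features

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        """fused: (B, n, d) final fusion features; returns (B, n, n) probabilities M."""
        r_s, r_e = self.start(fused), self.end(fused)             # (B, n, d) each
        bilinear = torch.einsum("bid,de,bje->bij", r_s, self.U, r_e)
        n = fused.size(1)
        pair = torch.cat([r_s.unsqueeze(2).expand(-1, -1, n, -1),
                          r_e.unsqueeze(1).expand(-1, n, -1, -1)], dim=-1)
        score = bilinear + self.w(pair).squeeze(-1)               # (B, n, n)
        return torch.sigmoid(score)                               # M_P for every candidate P

# The argmax over the upper triangle (fs <= fe) gives the predicted segment; training
# compares M against IoU scores scaled by their per-video maximum.
```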
Preferably, the final loss function L of the training phase combines the above losses: the biaffine prediction loss L_pred together with the adversarial domain-discrimination loss L_adv, the cross-modal calibration loss L_cal and the video reconstruction loss L_recon weighted by the hyper-parameters λ_1, λ_2 and λ_3.
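Putting the pieces together, the joint objective could be assembled as in the sketch below; the assignment of λ_1, λ_2 and λ_3 to the individual auxiliary losses is an assumption, since the text only states that they are hyper-parameters:

```python
# Sketch: joint training objective (the mapping of the weights to the losses is assumed).
def total_loss(loss_pred, loss_adv, loss_cal, loss_recon,
               lambda1: float = 0.5, lambda2: float = 0.2, lambda3: float = 0.5):
    """Supervised prediction loss on source data plus weighted auxiliary losses."""
    return loss_pred + lambda1 * loss_adv + lambda2 * loss_cal + lambda3 * loss_recon
```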
The present embodiment also provides a cross-category video temporal localization system based on adversarial multi-modal domain adaptation, which is used to implement the above embodiment. The terms "module", "unit" and the like used below may be a combination of software and/or hardware that implements a predetermined function. Although the system described in the following embodiments is preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible.
The system comprises:
a data acquisition and feature extraction module, configured to acquire the source-category videos, the target-category videos and the query text corresponding to each video, extract the initial visual features of the videos and the initial text features of the query texts, and encode the initial visual features and the initial text features as the final visual features and text features;
a cross-modal feature calibration module, configured to perform semantic information calibration on the obtained visual features and text features of the target-category videos through the cross-modal feature calibrator;
a video feature reconstruction module, configured to randomly mask the visual features of the target-category videos through the video feature reconstructor and reconstruct them to obtain the reconstructed visual features;
a cross-modal feature fusion module, configured to fuse the obtained video features and text features through the cross-modal feature fuser to obtain the initial fusion features and final fusion features of the source-category videos and the initial fusion features and final fusion features of the target-category videos;
a domain discrimination module, configured to perform single-modal domain-invariant feature representation learning on the obtained video features and text features and cross-modal domain-invariant feature representation learning on the obtained initial fusion features through the domain discriminator;
and a biaffine prediction module, configured to predict the obtained final fusion features of the source-category videos through the biaffine predictor to obtain the prediction probability of every possible result corresponding to each query text, the result with the maximum prediction probability being the final prediction result.
The data acquisition and feature extraction module, the cross-modal feature fusion module and the biaffine prediction module form the basic training module.
The implementation of the functions and effects of each module in the system is described in detail in the implementation of the corresponding steps of the method; for example, one specific workflow of the system may be:
(1) acquiring the source-category videos, the target-category videos and the query text corresponding to each video, extracting the initial visual features of the videos and the initial text features of the query texts, and encoding them as the final visual features and text features;
(2) performing semantic information calibration on the visual features and text features of the target-category videos obtained in step (1) through the cross-modal feature calibrator;
(3) randomly masking the visual features of the target-category videos obtained in step (1) through the video feature reconstructor and reconstructing them to obtain the reconstructed visual features;
(4) fusing the video features and text features obtained in step (1) through the cross-modal feature fuser to obtain the initial and final fusion features of the source-category videos and of the target-category videos;
(5) performing single-modal domain-invariant feature representation learning on the video features and text features obtained in step (1) and cross-modal domain-invariant feature representation learning on the initial fusion features obtained in step (4) through the domain discriminator;
(6) predicting the final fusion features of the source-category videos obtained in step (4) through the biaffine predictor to obtain the prediction probability of every possible result corresponding to each query text, the result with the maximum prediction probability being the final prediction result.
For the system embodiment, since it basically corresponds to the method embodiment, the specific implementation manner of each step may refer to the description of the method portion, and is not described herein again. The above described system embodiments are merely illustrative, wherein the modules illustrated as separate components may or may not be physically separate, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
Embodiments of the system of the present invention may be applied to any data processing capable device, which may be a device or apparatus such as a computer. The system embodiments may be implemented by software, or by hardware, or by a combination of hardware and software. The software implementation is taken as an example, and as a logical device, the device is formed by reading corresponding computer program instructions in the nonvolatile memory into the memory for running through the processor of any device with data processing capability. In terms of hardware, as shown in fig. 5, in addition to the processor, the memory, the network interface, and the nonvolatile memory shown in fig. 5, any device with data processing capability where the system is located in the embodiment may also include other hardware according to the actual function of the any device with data processing capability, which is not described in detail herein.
The embodiment of the present invention further provides a computer-readable storage medium, on which a program is stored, where the program, when executed by a processor, implements the above cross-category video time positioning method based on the adaptation against the multi-modal domain.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing capability device described in any of the foregoing embodiments. The computer readable storage medium may also be any external storage device of a device with data processing capabilities, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both internal storage units and external storage devices of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing capable device, and may also be used for temporarily storing data that has been output or is to be output.
The technical effects of the present invention are verified by experiments below.
The invention uses the ActivityNet Captions dataset and the Charades-STA dataset for model training and validation. The ActivityNet Captions dataset contains 5 categories: sports, socializing, eating, household and personal care; the Charades-STA dataset contains 5 categories: kitchen, bedroom, living room, bathroom and hallway.
The experimental parameters of the invention are set as follows: the training process is optimized with an AdamW optimizer with a decay rate of 1e-6; the learning rate is 1e-3; the batch size is 64; the dropout is 0.4 for the ActivityNet Captions dataset and 0.3 for the Charades-STA dataset; λ_1, λ_2 and λ_3 are 0.5, 0.2 and 0.5, respectively; and the random masking probability β is 0.2.
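As a sketch, this training setup might be configured as follows; the stand-in model object is a placeholder, and reading the stated decay rate of 1e-6 as the AdamW weight decay is an assumption:

```python
# Sketch: optimizer and hyper-parameter setup per the experimental description (names assumed).
import torch
import torch.nn as nn

model = nn.Linear(256, 256)   # stand-in for the full AMDA model assembled from the modules above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-6)
batch_size = 64
dropout = 0.4                 # 0.4 for ActivityNet Captions, 0.3 for Charades-STA
lambda1, lambda2, lambda3 = 0.5, 0.2, 0.5
mask_beta = 0.2               # random masking probability beta
```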
First, the performance of the basic training module (AMDA-base) is verified; the statistical results are shown in Tables 1 and 2. Although AMDA-base adopts a simpler architecture, its performance on the video temporal localization task is nearly on par with other state-of-the-art models, and it also provides a strong base for the proposed model (AMDA) on the cross-category video temporal localization task.
TABLE 1 Performance of AMDA-base on the ActivityNet Captions dataset
TABLE 2 Performance of AMDA-base on the Charades-STA dataset
Next, the domain deviation between data of different categories is verified based on AMDA-base; the statistical results are shown in Tables 3 and 4. For the ActivityNet Captions dataset, sports is taken as the source-category data and the other 4 categories as the target-category data. For the Charades-STA dataset, kitchen is taken as the source-category data and the other 4 categories as the target-category data. "Source only" means the input data contain only the labeled source category, "target only" means the input data contain only the labeled target category, and for joint training the input data comprise both the labeled source category and the labeled target category. The experimental results show that obvious domain gaps exist between data of different categories, and joint training achieves the best results because it has labels for both the source-category and the target-category data.
TABLE 3 Performance with different data categories on the ActivityNet Captions dataset
TABLE 4 Performance with different data categories on the Charades-STA dataset
Finally, the performance of the AMDA model is verified sub-module by sub-module. The input data of the AMDA model are the labeled source-category data and the unlabeled target-category data; for the ActivityNet Captions dataset, sports is taken as the source-category data and the other 4 categories as the target-category data, and for the Charades-STA dataset, kitchen is taken as the source-category data and the other 4 categories as the target-category data. The statistical results are shown in Tables 5 and 6. The results show that the full AMDA model performs best, indicating that the proposed AMDA model has better generalization ability in the cross-category video temporal localization task.
TABLE 5 Performance of the AMDA model on the ActivityNet Captions dataset
TABLE 6 Performance of the AMDA model on the Charades-STA dataset
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the patent. It should be noted that several variations and modifications can be made by a person of ordinary skill in the art without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A cross-category video temporal localization method based on adversarial multi-modal domain adaptation, characterized by comprising the following steps:
S1: acquiring source-category videos, target-category videos and the query text corresponding to each video, extracting the initial visual features of the videos and the initial text features of the query texts, and encoding the initial visual features and the initial text features as the final visual features and text features;
S2: performing semantic information calibration on the visual features and text features of the target-category videos obtained in step S1 through a cross-modal feature calibrator;
S3: randomly masking the visual features of the target-category videos obtained in step S1 through a video feature reconstructor and performing visual feature reconstruction to obtain reconstructed visual features;
S4: fusing the video features and text features obtained in step S1 through a cross-modal feature fuser to obtain the initial fusion features and final fusion features of the source-category videos and the initial fusion features and final fusion features of the target-category videos;
S5: performing single-modal domain-invariant feature representation learning on the video features and text features obtained in step S1 and cross-modal domain-invariant feature representation learning on the initial fusion features obtained in step S4 through a domain discriminator;
S6: predicting the final fusion features of the source-category videos obtained in step S4 through a biaffine predictor to obtain the prediction probability of every possible result corresponding to each query text, the result with the maximum prediction probability being the final prediction result.
2. The cross-category video temporal localization method based on adversarial multi-modal domain adaptation according to claim 1, wherein the step S1 comprises:
S1-1: defining the input data:
{V^s, Q^s, T^s} = {(V_i^s, Q_i^s, τ_i^s)}_{i=1}^{B}
{V^t, Q^t, T^t} = {(V_i^t, Q_i^t)}_{i=1}^{B}
T^t = ∅
wherein V^s is the set of source-category videos, Q^s is the set of query texts of the source-category videos, T^s is the label set of the source-category data, (V_i^s, Q_i^s) denotes the ith source video in the source-category videos and its corresponding query text, and τ_i^s denotes the ground-truth temporal boundary matching the ith query text in the source-category videos to the ith source video; V^t is the set of target-category videos, Q^t is the set of query texts of the target-category videos, T^t is the label set of the target-category data, which is empty; (V_i^t, Q_i^t) denotes the ith target video in the target-category videos and its corresponding query text; B denotes the batch size;
S1-2: extracting the semantic information of the source-category videos V^s and the target-category videos V^t to obtain the initial visual features V̂^s and V̂^t, and extracting the semantic information of the query texts Q^s of the source-category videos and the query texts Q^t of the target-category videos to obtain the initial text features Q̂^s and Q̂^t;
S1-3: encoding the initial visual features V̂^s, V̂^t and the initial text features Q̂^s, Q̂^t obtained in step S1-2, respectively, to obtain the encoded visual features and text features as the final features; the encoded visual features and text features are expressed as:
V^s = {V_i^s}_{i=1}^{B} ∈ R^{B×n×d}, V_i^s = {V_{i,j}^s}_{j=1}^{n}
V^t = {V_i^t}_{i=1}^{B} ∈ R^{B×n×d}, V_i^t = {V_{i,j}^t}_{j=1}^{n}
Q^s = {Q_i^s}_{i=1}^{B} ∈ R^{B×m×d}, Q_i^s = {Q_{i,j}^s}_{j=1}^{m}
Q^t = {Q_i^t}_{i=1}^{B} ∈ R^{B×m×d}, Q_i^t = {Q_{i,j}^t}_{j=1}^{m}
wherein n is the number of frames in each video, m is the query text length, and d is the hidden dimension; V^s are the visual features of the source-category videos, V_i^s are the visual features of the ith source-category video, and V_{i,j}^s are the visual features of the jth frame of the ith source-category video; V^t are the visual features of the target-category videos, V_i^t are the visual features of the ith target-category video, and V_{i,j}^t are the visual features of the jth frame of the ith target-category video; Q^s are the text features of the query texts of the source-category videos, Q_i^s are the text features of the query text of the ith source-category video, and Q_{i,j}^s are the text features of the jth word in the query text of the ith source-category video; Q^t are the text features of the query texts of the target-category videos, Q_i^t are the text features of the query text of the ith target-category video, and Q_{i,j}^t are the text features of the jth word in the query text of the ith target-category video.
3. The cross-category video temporal localization method based on adversarial multi-modal domain adaptation according to claim 2, wherein in step S3, the random masking and visual feature reconstruction are performed on the initial visual features V̂^t of the target-category videos obtained in step S1.
4. The cross-category video temporal localization method based on adversarial multi-modal domain adaptation according to claim 3, wherein the step S3 comprises:
S3-1: randomly masking the initial visual features V̂^t of the target-category videos obtained in step S1 with probability β, and encoding them through the visual encoder to obtain the encoded masked visual features V_m^t;
S3-2: fusing, through the cross-modal feature fuser, the masked visual features V_m^t obtained in step S3-1 and the text features Q^t of the query texts of the target-category videos obtained in step S1 to obtain the initial fusion features F_m of the masked videos;
S3-3: performing video feature reconstruction on the masked visual features V_m^t obtained in step S3-1 and the initial fusion features F_m obtained in step S3-2 to obtain the reconstructed visual features V_recon, wherein V_m^t and F_m are combined by element-wise addition (⊕) and passed through 1-dimensional convolutional layers (Conv1D) with a ReLU activation function;
the training loss of the video feature reconstructor adopts the mean square error loss.
5. The cross-category video temporal localization method based on adversarial multi-modal domain adaptation according to claim 1, wherein the step S2 comprises:
S2-1: computing the averages of the visual features and text features of the target-category videos along the time axis:
V̄_i^t = (1/n) Σ_{j=1}^{n} V_{i,j}^t
Q̄_i^t = (1/m) Σ_{j=1}^{m} Q_{i,j}^t
wherein V̄_i^t denotes the average of the visual features V_{i,j}^t of all frames of the ith target-category video, V̄^t = {V̄_i^t}_{i=1}^{B} denotes the set of these averages, Q̄_i^t denotes the average of the text features of all words in the query text of the ith target-category video, and Q̄^t = {Q̄_i^t}_{i=1}^{B} denotes the set of these averages;
S2-2: constructing positive and negative samples from V̄^t and Q̄^t: the matched pair (V̄_i^t, Q̄_i^t) is taken as a positive sample, and the corresponding negative samples V̄_z and Q̄_z are selected for V̄_i^t and Q̄_i^t from the non-matching features, forming the negative sample set Z_V of visual features and the negative sample set Z_Q of text features;
S2-3: training the cross-modal feature calibrator with the positive and negative samples to calibrate the semantic information of the visual features and text features of the target-category videos; the loss function of the cross-modal feature calibrator is:
L_cal = (1/B) Σ_{i=1}^{B} [ Σ_{V̄_z ∈ Z_V} L_tri(Q̄_i^t, V̄_i^t, V̄_z) + Σ_{Q̄_z ∈ Z_Q} L_tri(V̄_i^t, Q̄_i^t, Q̄_z) ]
L_tri(x, x⁺, x⁻) = max(0, Δ − l(x, x⁺) + l(x, x⁻))
wherein L_cal denotes the cross-modal feature calibrator loss, L_tri denotes the triplet loss, B denotes the batch size, Δ is a margin, l(·,·) computes the cosine similarity between vectors, Z_V denotes the negative sample set of visual features, and Z_Q denotes the negative sample set of text features.
6. The cross-category video temporal localization method based on adversarial multi-modal domain adaptation according to claim 1, wherein the step S4 comprises:
S4-1: computing the cross-modal similarity matrix S between the visual features V^k and the text features Q^k obtained in step S1, where k ∈ {s, t}; when k = s, V^s are the visual features of the source-category videos and Q^s are the text features of the query texts of the source-category videos; when k = t, V^t are the visual features of the target-category videos and Q^t are the text features of the query texts of the target-category videos;
S4-2: normalizing the cross-modal similarity matrix S obtained in step S4-1 along its rows and along its columns to obtain the similarity density matrices S_r and S_c, respectively;
S4-3: computing, from the similarity density matrices S_r and S_c obtained in step S4-2, the video-to-query-text attention matrix A_v and the query-text-to-video attention matrix A_q as:
A_v = S_r Q^k
A_q = S_c^T V^k
S4-4: computing, from the attention matrices A_v and A_q obtained in step S4-3, the initial fusion features F_k as:
F_k = [ V^k ; A_v ; V^k ⊙ A_v ; V^k ⊙ (S_r A_q) ] W_f + b_f, k ∈ {s, t}
wherein W_f and b_f are learnable parameters, [ · ; · ] denotes concatenation and ⊙ denotes element-wise multiplication; when k = s, F_s denotes the initial fusion features of the source-category videos, F_{s,i} denotes the initial fusion features of the ith source-category video, and F_{s,i,j} denotes the initial fusion features of the jth frame of the ith source-category video; when k = t, F_t denotes the initial fusion features of the target-category videos, F_{t,i} denotes the initial fusion features of the ith target-category video, and F_{t,i,j} denotes the initial fusion features of the jth frame of the ith target-category video;
S4-5: obtaining, from the initial fusion features F_k obtained in step S4-4, the final fusion features F̃_k through a multi-head attention mechanism; when k = s, F̃_{s,i} denotes the final fusion features of the ith source-category video and F̃_{s,i,j} denotes the final fusion features of the jth frame of the ith source-category video; when k = t, F̃_{t,i} denotes the final fusion features of the ith target-category video and F̃_{t,i,j} denotes the final fusion features of the jth frame of the ith target-category video.
7. The cross-category video temporal localization method based on adversarial multi-modal domain adaptation according to claim 1, wherein the domain discriminator comprises a visual feature discriminator D_v, a text feature discriminator D_q and a fusion feature discriminator D_f; the step S5 comprises:
the visual feature discriminator D_v discriminates the visual features V^s of the source-category videos and the visual features V^t of the target-category videos obtained in step S1, computed as:
D_v(V^k) = σ(MLP_v(GRL(V^k))), k ∈ {s, t}
the text feature discriminator D_q discriminates the text features Q^s of the query texts of the source-category videos and the text features Q^t of the query texts of the target-category videos obtained in step S1, computed as:
D_q(Q^k) = σ(MLP_p(GRL(Q^k))), k ∈ {s, t}
the fusion feature discriminator D_f discriminates the initial fusion features F_s of the source-category videos and the initial fusion features F_t of the target-category videos obtained in step S4, computed as:
D_f(F_k) = σ(MLP_f(GRL(F_k))), k ∈ {s, t}
wherein MLP_v, MLP_p and MLP_f are multi-layer perceptrons for the visual features, the text features and the initial fusion features, respectively, GRL is a gradient reversal layer, and σ is the sigmoid function;
the visual feature discriminator D_v, the text feature discriminator D_q and the fusion feature discriminator D_f each output a scalar in [0, 1] representing the probability that the input feature comes from the target category.
8. The cross-category video temporal localization method based on adversarial multi-modal domain adaptation according to claim 6, wherein the step S6 comprises:
S6-1: defining the candidate segments P = (fs, fe) matching the query text of each video in the source-category videos, wherein fs is the start frame index, fe is the end frame index, fs, fe ∈ [1, n], and n is the number of frames in the video;
extracting, from the final fusion features F̃_s of each video obtained in step S4, the fusion features F̃_{s,fs} and F̃_{s,fe} corresponding to the start frame and the end frame of the candidate segment, and generating the hidden features r_fs and r_fe of the start and end frames from them through linear layers, wherein W_1, W_2, b_1 and b_2 are the learnable parameters of the linear layers;
since each video in the source-category videos consists of n video frames, all candidate segments can be represented as P = (fs, fe), fs ∈ [1, n], fe ∈ [1, n], fs ≤ fe;
S6-2: combining the hidden features r_fs and r_fe obtained in step S6-1 through a biaffine mechanism to obtain the matrix M containing the prediction probabilities of all candidate video segments P, and taking the frame pair corresponding to the maximum value of the matrix M as the final prediction result;
each entry M_P is obtained by a biaffine function of r_fs and r_fe that combines a bilinear term r_fs^T U_m r_fe with a linear term over the spliced hidden features weighted by W_m and a bias b_m, wherein U_m, W_m and b_m are learnable parameters, the superscript T denotes transposition, and M_P denotes the probability that the video segment P corresponds to the query text.
9. A cross-category video temporal localization system based on adversarial multi-modal domain adaptation, for implementing the cross-category video temporal localization method based on adversarial multi-modal domain adaptation according to any one of claims 1 to 8, the system comprising:
a data acquisition and feature extraction module, configured to acquire the source-category videos, the target-category videos and the query text corresponding to each video, extract the initial visual features of the videos and the initial text features of the query texts, and encode the initial visual features and the initial text features as the final visual features and text features;
a cross-modal feature calibration module, configured to perform semantic information calibration on the obtained visual features and text features of the target-category videos through the cross-modal feature calibrator;
a video feature reconstruction module, configured to randomly mask the visual features of the target-category videos through the video feature reconstructor and reconstruct them to obtain the reconstructed visual features;
a cross-modal feature fusion module, configured to fuse the obtained video features and text features through the cross-modal feature fuser to obtain the initial fusion features and final fusion features of the source-category videos and the initial fusion features and final fusion features of the target-category videos;
a domain discrimination module, configured to perform single-modal domain-invariant feature representation learning on the obtained video features and text features and cross-modal domain-invariant feature representation learning on the obtained initial fusion features through the domain discriminator;
and a biaffine prediction module, configured to predict the obtained final fusion features of the source-category videos through the biaffine predictor to obtain the prediction probability of every possible result corresponding to each query text, the result with the maximum prediction probability being the final prediction result.
10. A computer-readable storage medium having a program stored thereon, wherein the program, when executed by a processor, implements the cross-category video temporal localization method based on adversarial multi-modal domain adaptation according to any one of claims 1 to 8.
CN202210707517.9A 2022-06-21 2022-06-21 Cross-category video temporal localization method, system and storage medium based on adversarial multi-modal domain adaptation Pending CN115035455A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210707517.9A CN115035455A (en) 2022-06-21 2022-06-21 Cross-category video temporal localization method, system and storage medium based on adversarial multi-modal domain adaptation
LU502690A LU502690B1 (en) 2022-06-21 2022-08-22 Cross-category video time locating method, system and storage medium based on adversarial multi-modal domain adaptation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210707517.9A CN115035455A (en) 2022-06-21 2022-06-21 Cross-category video temporal localization method, system and storage medium based on adversarial multi-modal domain adaptation

Publications (1)

Publication Number Publication Date
CN115035455A true CN115035455A (en) 2022-09-09

Family

ID=83124190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210707517.9A Pending CN115035455A (en) 2022-06-21 2022-06-21 Cross-category video temporal localization method, system and storage medium based on adversarial multi-modal domain adaptation

Country Status (2)

Country Link
CN (1) CN115035455A (en)
LU (1) LU502690B1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116052040A (en) * 2022-12-16 2023-05-02 广东工业大学 Multi-modal query vector and confidence coefficient-based reference video segmentation method


Also Published As

Publication number Publication date
LU502690B1 (en) 2023-02-23

Similar Documents

Publication Publication Date Title
Kim et al. Domain adaptation without source data
CN111680217B (en) Content recommendation method, device, equipment and storage medium
CN113792113A (en) Visual language model obtaining and task processing method, device, equipment and medium
CN110309321B (en) Knowledge representation learning method based on graph representation learning
Cascianelli et al. Full-GRU natural language video description for service robotics applications
CN110888980A (en) Implicit discourse relation identification method based on knowledge-enhanced attention neural network
CN112507039A (en) Text understanding method based on external knowledge embedding
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
Chou et al. Learning to Recognize Transient Sound Events using Attentional Supervision.
CN109933682B (en) Image hash retrieval method and system based on combination of semantics and content information
CN115695950B (en) Video abstract generation method based on content perception
CN115831377A (en) Intra-hospital death risk prediction method based on ICU (intensive care unit) medical record data
CN117558270B (en) Voice recognition method and device and keyword detection model training method and device
Alshehri et al. Generative adversarial zero-shot learning for cold-start news recommendation
CN115146068A (en) Method, device and equipment for extracting relation triples and storage medium
CN111259264A (en) Time sequence scoring prediction method based on generation countermeasure network
CN115035455A (en) Cross-category video time positioning method, system and storage medium based on multi-modal domain resisting self-adaptation
Ma et al. Distilling knowledge from self-supervised teacher by embedding graph alignment
Jin et al. Sequencepar: Understanding pedestrian attributes via a sequence generation paradigm
CN115861902B (en) Unsupervised action migration and discovery method, system, device and medium
CN115964497A (en) Event extraction method integrating attention mechanism and convolutional neural network
CN113936333A (en) Action recognition algorithm based on human body skeleton sequence
Wu et al. Deep Hybrid Neural Network With Attention Mechanism for Video Hash Retrieval Method
CN111552827A (en) Labeling method and device, and behavior willingness prediction model training method and device
Duan et al. Deep Hashing Based Fusing Index Method for Large‐Scale Image Retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination