CN114627402A - Cross-modal video time positioning method and system based on space-time diagram - Google Patents

Cross-modal video time positioning method and system based on space-time diagram

Info

Publication number
CN114627402A
CN114627402A
Authority
CN
China
Prior art keywords
video
space
time
features
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111644165.9A
Other languages
Chinese (zh)
Inventor
李肯立
平申
田泽安
张忠阳
潘佳铭
姜骁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202111644165.9A priority Critical patent/CN114627402A/en
Publication of CN114627402A publication Critical patent/CN114627402A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-modal video time positioning method and system based on a space-time diagram. The method comprises the following steps: S1, inputting an un-clipped video and a query text, and intercepting a video segment candidate set from the un-clipped video by using a multi-scale sliding window; S2, extracting text features and video segment features, and generating a space-time graph representation of the video segments by using a pre-trained scene graph generation model; S3, splicing the space-time graph features, obtained by passing the space-time graph of the video through a multi-layer graph convolutional neural network, with the video segment features to obtain video features rich in space-time semantic information; S4, projecting the video features containing the spatio-temporal information and the text features to the same feature space through a full connection layer, and splicing them to obtain video-text modal fusion features; and S5, inputting the video-text modal fusion features into a multi-layer perceptron network to obtain a text-video matching score and a position offset vector. The invention can understand video semantic information at a fine granularity and return a more accurate video positioning boundary.

Description

Cross-modal video time positioning method and system based on space-time diagram
Technical Field
The present invention relates to the field of video positioning technologies, and in particular, to a cross-modal video time positioning method and system based on a space-time diagram.
Background
Given an un-clipped video (modality one) and a query text (modality two), the prior art has proposed some methods for video time positioning, whose goal is mainly to locate, within the given video, the start and end times that are semantically related to the query sentence. In existing methods, candidate sets of different scales are cut out of the whole video through a multi-scale sliding window, a video segment is represented by aggregating frame-level features, and the video features and text features are mapped to the same feature space for matching. A query text typically contains nouns and verbs (for example, "a woman is talking into a microphone then shaking hands with a man" contains the nouns woman, man and microphone, and the verbs talking into, shaking hands) corresponding to objects in the video (woman, man and microphone) and interactions between the objects (<woman, talking into, microphone>, <woman, shaking hands, man>). Accordingly, video time positioning requires not only a fine-grained understanding of video semantic concepts (such as objects) and capturing the interactions between objects, but also an understanding of the temporal dependency of those interactions (in time, the handshake occurs after the woman talks into the microphone). However, the above video characterization methods cannot model the spatial and temporal interaction information of the objects in the video; in particular, when the objects are the same but the interactions differ, these methods cannot achieve accurate positioning.
Therefore, there is a need to provide a more accurate spatio-temporal-graph-based cross-modal video time positioning method, which can not only understand the semantic concepts of the video (e.g., objects), but also capture the interactions between the objects in space and time.
Disclosure of Invention
The invention aims to provide a cross-modal video time positioning method based on a space-time diagram, so as to overcome the defects in the prior art.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
a cross-modal video time positioning method based on a space-time diagram comprises the following steps:
s1, inputting an un-clipped video and a query text, and intercepting a video segment candidate set for the un-clipped video by adopting a multi-scale sliding window;
s2, extracting text features eq and video segment features ec, and generating a space-time graph representation for the video segments by using a pre-trained scene graph generation model;
s3, splicing the space-time diagram features, obtained by passing the space-time diagram of the video through a multi-layer graph convolutional neural network, with the video segment features to obtain video features vst rich in space-time semantic information;
s4, projecting the video features vst containing the spatio-temporal information and the text features eq to the same feature space through a full connection layer, and splicing them to obtain a video-text modal fusion feature fcq;
s5, inputting the video-text modal fusion feature fcq into a multi-layer perceptron network to obtain a text-video matching score and a position offset vector.
Further, the step S2 includes the steps of:
s20, extracting text features eq from the query text by using a text encoder;
s21, extracting video segment features ec in the video segment candidate set by utilizing a pre-trained convolutional neural network;
s22, extracting a space map for describing interaction between objects in each frame by utilizing a pre-trained scene map generation model for the frame of each candidate segment in the candidate set;
and S23, constructing a time graph according to the similarity between the object features of the adjacent frames, and modeling the object dependence on a time domain.
Further, the step S22 specifically includes:
the scene graph generation model judges whether a relationship exists between object i and object j in each frame of the candidate segment, and if so, the directed edge Aspa(i, j) from object i to object j is set to the set value, thereby obtaining a spatial graph adjacency matrix Aspa;
a directed graph representing the spatial relationships of the objects is constructed for each frame in the candidate segments, and the object features detected by the scene graph generation model are used as node features X ∈ R^(N×d), where N represents the total number of objects in the video segment and d represents the object feature dimension;
after the adjacency matrix Aspa is obtained, each row of the adjacency matrix is normalized to ensure that the sum of the edges connected to each object equals the set value.
Further, the normalization formula is:
Aspa(i, j) = Aspa(i, j) / ∑_{j'=1..N} Aspa(i, j')
where N represents the total number of objects in the video segment and Aspa denotes the adjacency matrix of the spatial graph.
Further, the step S23 specifically includes: calculating the cosine similarity between an object i in frame t and an object j in frame t+1; if the similarity between object i and object j is greater than a given threshold, the two are determined to be the same object, and the directed edge from object i to object j is set to the set value, thereby obtaining a time graph adjacency matrix Atem.
Further, the step S3 specifically includes the following steps:
S30, inputting the space graph and the time graph into a multi-layer graph convolutional neural network; for each graph convolution layer, the graph convolution outputs of the space graph and the time graph are added directly: Z = RELU(Aspa·X·Wspa + Atem·X·Wtem), where Wspa and Wtem are weight matrices, X is the node feature matrix, Z is the output of a single graph convolution layer, and Aspa, Atem are the space graph and time graph adjacency matrices respectively; after k graph convolution layers, the space-time graph feature gst = avg_pool(max_pool(Z1, Z2, ..., Zk)) is obtained, where max_pool and avg_pool denote the max pooling and average pooling operations respectively;
and S31, splicing the space-time diagram features and the video fragment features to obtain the video features with space-time semantic information.
Further, the projection formula in step S4 is:
vst_p = RELU(Wv·vst + bv), eq_p = RELU(Ws·eq + bs)
where Wv and Ws are weight matrices, and bv and bs are bias vectors.
Further, in step S5, the formula for inputting the video-text modal fusion feature fcq into the multi-layer perceptron network is:
ol = RELU(Wl·o(l-1) + bl), with o0 = fcq
where Wl, bl and ol are respectively the weight matrix, bias vector and output vector of the l-th fully connected layer, and oL = [scq, δs, δe], where scq denotes the matching score and δs, δe denote the positioning offsets;
further, in the step S5, a loss function L is calculated by the following formula to train the network model, a candidate segment with the highest matching score is selected in the test stage, and the regression offset is added to the time boundary of the candidate segment to obtain a video time positioning boundary;
L = Lalign + λLreg;
Lalign = ∑_(c,q)∈P λ1·log(1 + exp(-scq)) + ∑_(c,q)∈N λ2·log(1 + exp(scq));
Lreg = ∑_(c,q)∈P ( |δs - δs*| + |δe - δe*| );
where λ1 and λ2 are weight coefficients, P is the set of matched (positive) text-video samples, N is the negative sample set, Lalign is the text-video alignment loss function, Lreg is the position offset regression loss function, and δs*, δe* are the true offsets.
The invention also provides a cross-modal video time positioning system based on a space-time diagram, which comprises:
the multi-scale sliding window intercepting module is used for intercepting a video segment candidate set for an un-clipped video by adopting a multi-scale sliding window after the un-clipped video and the query text are input;
the extraction training module is used for extracting text characteristics eq and video segment characteristics ec, and generating a space-time graph representation for the video segments by using a pre-trained scene graph generation model;
the multilayer graph convolution neural network module is used for splicing the acquired space-time graph characteristics and the video segment characteristics of the space-time graph of the video through the multilayer graph convolution neural network to acquire video characteristics vst rich in space-time semantic information;
the projection module is used for projecting the video features vst and the text features eq containing the spatio-temporal information to the same feature space through a full connection layer, and obtaining a video text mode fusion feature f after splicingcq
A multi-layer perceptron network module for fusing the video text mode with the feature fcqInputting the multi-layer perceptron network to obtain the matching score and the position offset vector of the text video.
Compared with the prior art, the invention has the following advantage: the interaction relationships among the objects within a video frame are modeled through the space graph, and the temporal dependencies of the object interactions are modeled through the time graph, so that video semantic information can be understood at a fine granularity and a more accurate video positioning boundary can be returned.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a model diagram of the cross-modal video time positioning method based on a space-time diagram according to the present invention.
Fig. 2 is a flowchart of the cross-modal video time positioning method based on the space-time diagram according to the present invention.
FIG. 3 is a flow chart of constructing a spatial map in the present invention.
Fig. 4 is a flow chart for constructing a time graph in the present invention.
Fig. 5 is a schematic diagram of a cross-modal video time positioning system based on space-time diagrams according to the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings, so that the advantages and features of the present invention can be more easily understood by those skilled in the art and the protection scope of the present invention can be more clearly defined.
Referring to fig. 1 and fig. 2, the present embodiment discloses a cross-modal video time positioning method based on a space-time diagram, which includes the following steps:
and step S1, inputting an un-clipped video and a query text, and intercepting a video segment candidate set for the un-clipped video by adopting a multi-scale sliding window.
Specifically, an un-clipped video V = {v1, v2, ..., vn} is input, where vi (i = 1, 2, ..., n) is the i-th image frame, together with a query text q; the goal is to identify the video boundary that matches the query sentence, i.e., l = [ls, le]. When creating the candidate set, the video is intercepted with a certain overlap rate using multi-scale time windows; for example, frames are intercepted at time scales of [64, 128, 256, 512] with an overlap rate of 80%, yielding a candidate set C = {c1, c2, ..., cM}, where each candidate segment ci is marked with a corresponding start and end time [ts, te]. When the IoU between the start and end time [ts, te] of ci and the true video boundary [ls, le] corresponding to the query sentence q is greater than a given threshold α, ci is regarded as a positive sample; otherwise it is regarded as a negative sample. The IoU formula is:
IoU([ts, te], [ls, le]) = |[ts, te] ∩ [ls, le]| / |[ts, te] ∪ [ls, le]|
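As an illustrative, non-limiting sketch, the candidate generation and IoU-based labeling of step S1 may be written in Python as follows; the scale list and overlap rate mirror the example above, while the threshold value α = 0.5 and all function names are merely assumed for illustration.

def generate_candidates(num_frames, scales=(64, 128, 256, 512), overlap=0.8):
    """Slide windows of several temporal scales over the video with a fixed overlap rate."""
    candidates = []
    for scale in scales:
        stride = max(1, int(scale * (1.0 - overlap)))  # 80% overlap -> stride of 0.2 * scale
        for start in range(0, max(1, num_frames - scale + 1), stride):
            candidates.append((start, start + scale))  # candidate [ts, te] in frame indices
    return candidates

def temporal_iou(candidate, ground_truth):
    """IoU between a candidate segment [ts, te] and the true boundary [ls, le]."""
    (ts, te), (ls, le) = candidate, ground_truth
    inter = max(0.0, min(te, le) - max(ts, ls))
    union = max(te, le) - min(ts, ls)
    return inter / union if union > 0 else 0.0

def label_candidates(candidates, ground_truth, alpha=0.5):
    """Mark a candidate as a positive sample when its IoU with the true boundary exceeds alpha."""
    return [(c, temporal_iou(c, ground_truth) >= alpha) for c in candidates]
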
And S2, extracting text features eq and video segment features ec, and generating a space-time graph representation for the video segments by using a pre-trained scene graph generation model.
Specifically, step S2 may include the following specific steps:
step S20, extracting text features eq from the query text by using a text encoder (such as LSTM).
And step S21, extracting video segment features ec in the video segment candidate set by using a pre-trained convolutional neural network (such as C3D).
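For concreteness, steps S20 and S21 may be sketched with PyTorch as below; the LSTM text encoder and the placeholder 3D convolutional backbone (standing in for a pre-trained network such as C3D) are illustrative assumptions, as are the vocabulary size and feature dimensions.

import torch.nn as nn

class TextEncoder(nn.Module):
    """Encode a query sentence into a single feature vector eq (step S20)."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):                 # (batch, seq_len) word indices
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return h_n[-1]                            # eq: (batch, hidden_dim)

class ClipEncoder(nn.Module):
    """Step S21: map a stack of frames to a clip feature ec; in practice a pre-trained
    3D CNN such as C3D would be used, here a small placeholder backbone stands in for it."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1))
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, frames):                    # (batch, 3, T, H, W)
        x = self.backbone(frames).flatten(1)
        return self.fc(x)                         # ec: (batch, feat_dim)
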
Step S22, extracting a spatial map describing the interaction between objects in each frame by using a pre-trained scene graph generation model (e.g., ReIDN) for the frame of each candidate segment in the candidate set.
Specifically, an existing trained scene graph generation model, such as the ReIDN model or the Neural Motif model, can be used. For a given picture, the scene graph generation model detects the objects in the picture and the relationships between them (<subject, predicate, object>). The object regions in the picture correspond to nodes (subjects or objects) in the scene graph, and the relationships between objects correspond to edges (predicates) in the scene graph. The process of constructing the space graph is shown in FIG. 3: for an object i and an object j in a video frame t, if a relationship <i, p, j> exists between the two, the directed edge from object i to object j is set to 1, denoted as Aspa(i, j) = 1. Through scene graph analysis, a directed graph representing the spatial relationships of the objects in a video frame can be constructed, and the object features detected by the scene graph generation model are used as node features X ∈ R^(N×d) (N represents the total number of objects in the video segment and d represents the object feature dimension). After the adjacency matrix Aspa is obtained, each row of the adjacency matrix is normalized to ensure that the sum of the edges connected to each object equals 1. The normalization formula is:
Aspa(i, j) = Aspa(i, j) / ∑_{j'=1..N} Aspa(i, j')
where N represents the total number of objects in the video segment and Aspa denotes the adjacency matrix of the space graph.
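The construction of the space graph described above can be illustrated with the following NumPy sketch; the triple format (i, predicate, j) assumed for the scene graph generation model's output is an illustrative interface, not the model's actual output format.

import numpy as np

def row_normalize(adjacency):
    """Normalize each row so that the edges leaving every object sum to 1."""
    row_sum = adjacency.sum(axis=1, keepdims=True)
    row_sum[row_sum == 0] = 1.0  # leave objects without outgoing edges unchanged
    return adjacency / row_sum

def build_spatial_adjacency(num_objects, relations):
    """relations: iterable of (i, predicate, j) triples detected in the candidate segment.
    The directed edge i -> j is set to 1 whenever a relation <i, p, j> exists."""
    a_spa = np.zeros((num_objects, num_objects), dtype=np.float32)
    for i, _, j in relations:
        a_spa[i, j] = 1.0
    return row_normalize(a_spa)  # Aspa with each row summing to 1
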
And step S23, constructing a time graph according to the similarity between the object features of the adjacent frames, and modeling the object dependence on a time domain.
Specifically, as shown in fig. 4, for an object i in frame t, the cosine similarity between object i and each object j in frame t+1 is calculated; if the similarity between object i and object j is greater than a given threshold β, the two are considered to be the same object, and the directed edge from object i to object j is set to 1, denoted as Atem(i, j) = 1. The time graph adjacency matrix Atem is obtained by applying the same row normalization to the adjacency matrix of the time graph.
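Similarly, the time graph construction of step S23 can be sketched as follows, assuming the objects of frame t and frame t+1 are given as feature matrices of shape (number of objects, d); the threshold β = 0.8 is only an example value.

import numpy as np

def build_temporal_adjacency(objects_t, objects_t1, beta=0.8):
    """Connect object i in frame t to object j in frame t+1 when their cosine
    similarity exceeds beta, i.e. they are regarded as the same object."""
    a = objects_t / (np.linalg.norm(objects_t, axis=1, keepdims=True) + 1e-8)
    b = objects_t1 / (np.linalg.norm(objects_t1, axis=1, keepdims=True) + 1e-8)
    similarity = a @ b.T                              # pairwise cosine similarities
    a_tem = (similarity > beta).astype(np.float32)    # directed edges from frame t to frame t+1
    row_sum = a_tem.sum(axis=1, keepdims=True)
    row_sum[row_sum == 0] = 1.0
    return a_tem / row_sum                            # same row normalization as the space graph
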
And step S3, splicing the space-time graph features, obtained by passing the space-time graph of the video through a multi-layer graph convolutional neural network (GCN), with the video segment features to obtain the video features vst rich in space-time semantic information.
Specifically, the method comprises the following specific steps:
step S30, inputting the space map and the time map into the multi-layer map convolutional neural network, wherein the space-time map has an adjacent matrix AspaAnd AtemTherefore, for each layer of graph convolution neural network, the graph convolution output results of the spatial graph and the time graph are directly added: Z-RELU (A)spaXWspa+AtemXWtem) Wherein W isspaAnd WtemIs a weight matrix, X is a node feature matrix, Z is a single-layer graph convolution network output result, Aspa、AtemAfter convolution of k-layer maps, space-time map features gst-avg _ pool (max _ pool (Z)) are obtained1,Z2,,,Zk) Max _ pool, avg _ pool denote maximum pooling and average pooling operations, respectively;
and step S31, splicing the spatio-temporal image characteristics gst and the video fragment characteristics ec to obtain the video characteristics vst with spatio-temporal semantic information.
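An illustrative PyTorch sketch of steps S30 and S31 follows; the per-layer weight layout, the number of layers, and the interpretation of max_pool over nodes followed by avg_pool over layers are assumptions made for the sake of a runnable example.

import torch
import torch.nn as nn

class SpatioTemporalGCNLayer(nn.Module):
    """One layer computing Z = RELU(Aspa X Wspa + Atem X Wtem)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_spa = nn.Linear(in_dim, out_dim, bias=False)  # Wspa
        self.w_tem = nn.Linear(in_dim, out_dim, bias=False)  # Wtem

    def forward(self, x, a_spa, a_tem):                      # x: (N, d) node features
        return torch.relu(a_spa @ self.w_spa(x) + a_tem @ self.w_tem(x))

class SpatioTemporalGCN(nn.Module):
    """Stack of k layers; gst = avg_pool(max_pool(Z1, ..., Zk)), then vst = [gst ; ec]."""
    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(
            [SpatioTemporalGCNLayer(dim, dim) for _ in range(num_layers)])

    def forward(self, x, a_spa, a_tem, ec):
        pooled = []
        for layer in self.layers:
            x = layer(x, a_spa, a_tem)
            pooled.append(x.max(dim=0).values)   # max pooling over the N object nodes
        gst = torch.stack(pooled).mean(dim=0)    # average pooling over the k layer outputs
        return torch.cat([gst, ec], dim=-1)      # splice with the clip feature to obtain vst
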
Step S4, projecting the video features vst containing the spatio-temporal information and the text features eq to the same feature space through a full connection layer, and splicing them to obtain a video-text modal fusion feature fcq.
Specifically, the video features vst and the text features eq are each projected to the same feature space through a full connection layer; the projection formula is vst_p = RELU(Wv·vst + bv), eq_p = RELU(Ws·eq + bs), where Wv and Ws are weight matrices and bv and bs are bias vectors. The projected video features and text features in the same feature space are spliced to obtain the fused modal feature fcq.
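A minimal PyTorch sketch of the projection and fusion of step S4 is given below; the common feature dimension of 512 is an assumed value.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Project vst and eq into the same space and splice them into fcq."""
    def __init__(self, video_dim, text_dim, common_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, common_dim)  # Wv, bv
        self.text_proj = nn.Linear(text_dim, common_dim)    # Ws, bs

    def forward(self, vst, eq):
        vst_p = torch.relu(self.video_proj(vst))   # vst_p = RELU(Wv·vst + bv)
        eq_p = torch.relu(self.text_proj(eq))      # eq_p  = RELU(Ws·eq + bs)
        return torch.cat([vst_p, eq_p], dim=-1)    # fcq
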
Step S5, inputting the video-text modal fusion feature fcq into the multi-layer perceptron network to obtain a text-video matching score and a position offset vector.
Specifically, fcq is input into a multi-layer fully connected neural network, which outputs the matching score and positioning offsets between the text q and the video segment ci, i.e., out = [scq, δs, δe], where scq denotes the matching score and δs, δe denote the positioning offsets. Passing fcq through the multi-layer fully connected neural network is formulated as:
ol = RELU(Wl·o(l-1) + bl), with o0 = fcq
where Wl, bl and ol are respectively the weight matrix, bias vector and output vector of the l-th fully connected layer.
In particular, oL = [scq, δs, δe], where scq denotes the matching score and δs, δe denote the positioning offsets.
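The multi-layer perceptron of step S5 may be sketched as follows; the number of hidden layers and the hidden width are assumed hyper-parameters, and only the final three-dimensional output [scq, δs, δe] is prescribed by the method.

import torch.nn as nn

class MatchRegressHead(nn.Module):
    """Multi-layer perceptron whose last layer outputs oL = [scq, delta_s, delta_e]."""
    def __init__(self, fuse_dim, hidden_dim=256, num_hidden=2):
        super().__init__()
        layers, in_dim = [], fuse_dim
        for _ in range(num_hidden):                      # ol = RELU(Wl·o(l-1) + bl)
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
            in_dim = hidden_dim
        layers.append(nn.Linear(in_dim, 3))              # final layer: [scq, delta_s, delta_e]
        self.mlp = nn.Sequential(*layers)

    def forward(self, fcq):
        out = self.mlp(fcq)
        s_cq, delta_s, delta_e = out[..., 0], out[..., 1], out[..., 2]
        return s_cq, delta_s, delta_e
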
Specifically, the loss function L is calculated by the formulas:
L = Lalign + λLreg;
Lalign = ∑_(c,q)∈P λ1·log(1 + exp(-scq)) + ∑_(c,q)∈N λ2·log(1 + exp(scq));
Lreg = ∑_(c,q)∈P ( |δs - δs*| + |δe - δe*| );
where λ, λ1 and λ2 are weight coefficients, P is the positive sample set of matched text-video pairs, N is the negative sample set, Lalign is the text-video alignment loss function, Lreg is the position offset regression loss function, and δs*, δe* are the true offsets.
The invention trains the network model by minimizing the loss function L. In the testing stage, the candidate segment with the highest matching score is selected, and the regression offset is added to the time boundary of that candidate segment to obtain the final, accurate video time positioning boundary.
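The training objective above may be sketched as follows; the absolute-difference penalty used for Lreg and the summation over samples are assumptions, since the text only specifies that Lreg regresses the predicted offsets toward the true offsets over the positive pairs.

import torch

def alignment_loss(pos_scores, neg_scores, lambda1=1.0, lambda2=1.0):
    """Lalign = sum over P of lambda1*log(1+exp(-scq)) + sum over N of lambda2*log(1+exp(scq))."""
    pos_term = lambda1 * torch.log1p(torch.exp(-pos_scores)).sum()
    neg_term = lambda2 * torch.log1p(torch.exp(neg_scores)).sum()
    return pos_term + neg_term

def regression_loss(pred_offsets, true_offsets):
    """Offset regression over positive pairs; an L1 penalty is assumed here."""
    return torch.abs(pred_offsets - true_offsets).sum()

def total_loss(pos_scores, neg_scores, pred_offsets, true_offsets, lam=1.0):
    """L = Lalign + lambda * Lreg."""
    return alignment_loss(pos_scores, neg_scores) + lam * regression_loss(pred_offsets, true_offsets)
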
Referring to fig. 5, the present invention further provides a cross-modal video time positioning system based on a space-time diagram, comprising: a multi-scale sliding window intercepting module 1, configured to intercept a video segment candidate set from an un-clipped video using a multi-scale sliding window after the un-clipped video and a query text are input; an extraction training module 2, configured to extract text features eq and video segment features ec, and to generate a space-time graph representation for the video segments using a pre-trained scene graph generation model; a multi-layer graph convolutional neural network module 3, configured to splice the space-time graph features, obtained by passing the space-time graph of the video through a multi-layer graph convolutional neural network (GCN), with the video segment features to obtain video features vst rich in space-time semantic information; a projection module 4, configured to project the video features vst containing the spatio-temporal information and the text features eq to the same feature space through a full connection layer, and to splice them to obtain a video-text modal fusion feature fcq; and a multi-layer perceptron network module 5, configured to input the video-text modal fusion feature fcq into the multi-layer perceptron network to obtain the text-video matching score and position offset vector.
According to the invention, the interaction relationships among the objects within a video frame are modeled through the space graph, and the temporal dependencies of the object interactions are modeled through the time graph, so that video semantic information can be understood at a fine granularity and a more accurate video positioning boundary can be returned.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, the patentee may make various changes or modifications within the scope of the appended claims, and such changes and modifications shall fall within the protection scope of the present invention as long as they do not exceed the scope of the invention described in the claims.

Claims (10)

1. A cross-modal video time positioning method based on a space-time diagram is characterized by comprising the following steps:
s1, inputting an un-clipped video and a query text, and intercepting a video segment candidate set for the un-clipped video by adopting a multi-scale sliding window;
s2, extracting text features eq and video segment features ec, and generating a space-time graph representation for the video segments by using a pre-trained scene graph generation model;
s3, splicing the space-time diagram features, obtained by passing the space-time diagram of the video through a multi-layer graph convolutional neural network, with the video segment features to obtain video features vst rich in space-time semantic information;
s4, projecting the video features vst containing the spatio-temporal information and the text features eq to the same feature space through a full connection layer, and splicing them to obtain a video-text modal fusion feature fcq;
s5, inputting the video-text modal fusion feature fcq into a multi-layer perceptron network to obtain a text-video matching score and a position offset vector.
2. The spatio-temporal graph-based cross-modal video temporal positioning method according to claim 1, wherein the step S2 comprises the steps of:
s20, extracting text features eq from the query text by using a text encoder;
s21, extracting video segment features ec in the video segment candidate set by utilizing a pre-trained convolutional neural network;
s22, extracting a space map for describing interaction between objects in each frame by utilizing a pre-trained scene map generation model for the frame of each candidate segment in the candidate set;
and S23, constructing a time graph according to the similarity between the object features of the adjacent frames, and modeling the object dependence on a time domain.
3. The method according to claim 2, wherein the step S22 specifically includes:
the scene graph generation model judges whether a relationship exists between object i and object j in each frame of the candidate segment, and if so, the directed edge Aspa(i, j) from object i to object j is set to the set value, thereby obtaining a spatial graph adjacency matrix Aspa;
a directed graph representing the spatial relationships of the objects is constructed for each frame in the candidate segments, and the object features detected by the scene graph generation model are used as node features X ∈ R^(N×d), where N represents the total number of objects in the video segment and d represents the object feature dimension;
after the adjacency matrix Aspa is obtained, each row of the adjacency matrix is normalized to ensure that the sum of the edges connected to each object equals the set value.
4. The spatio-temporal-graph-based cross-modal video time positioning method according to claim 3, wherein the normalization formula is:
Aspa(i, j) = Aspa(i, j) / ∑_{j'=1..N} Aspa(i, j')
where N denotes the total number of objects in the video segment and Aspa denotes the adjacency matrix of the spatial graph.
5. The method according to claim 2, wherein the step S23 specifically includes: calculating the cosine similarity between an object i in frame t and an object j in frame t+1; if the similarity between object i and object j is greater than a given threshold, determining that they are the same object, and setting the directed edge from object i to object j to the set value, thereby obtaining a time graph adjacency matrix Atem.
6. The method according to claim 2, wherein the step S3 specifically comprises the following steps:
s30, inputting the space graph and the time graph into a multi-layer graph convolutional neural network; for each graph convolution layer, the graph convolution outputs of the space graph and the time graph are added directly: Z = RELU(Aspa·X·Wspa + Atem·X·Wtem), where Wspa and Wtem are weight matrices, X is the node feature matrix, Z is the output of a single graph convolution layer, and Aspa, Atem are the space graph and time graph adjacency matrices respectively; after k graph convolution layers, the space-time graph feature gst = avg_pool(max_pool(Z1, Z2, ..., Zk)) is obtained, where max_pool and avg_pool denote the max pooling and average pooling operations respectively;
and S31, splicing the space-time diagram features and the video fragment features to obtain the video features with space-time semantic information.
7. The spatio-temporal graph-based cross-modal video temporal positioning method according to claim 1, wherein the projection formula in the step S4 is:
vst_p = RELU(Wv·vst + bv), eq_p = RELU(Ws·eq + bs)
where Wv and Ws are weight matrices, and bv and bs are bias vectors.
8. The cross-modal video time positioning method based on the space-time diagram according to claim 1, wherein the formula for inputting the video-text modal fusion feature fcq into the multi-layer perceptron network in step S5 is:
ol = RELU(Wl·o(l-1) + bl), with o0 = fcq
where Wl, bl and ol are respectively the weight matrix, bias vector and output vector of the l-th fully connected layer, and oL = [scq, δs, δe], where scq denotes the matching score and δs, δe denote the positioning offsets.
9. The cross-modal video time positioning method based on the space-time diagram according to claim 1, wherein in step S5 a loss function L is calculated by the following formulas to train the network model; in a test stage, the candidate segment with the highest matching score is selected, and the regression offset is added to the time boundary of that candidate segment to obtain a video time positioning boundary;
L = Lalign + λLreg;
Lalign = ∑_(c,q)∈P λ1·log(1 + exp(-scq)) + ∑_(c,q)∈N λ2·log(1 + exp(scq));
Lreg = ∑_(c,q)∈P ( |δs - δs*| + |δe - δe*| );
where λ1 and λ2 are weight coefficients, P is the set of matched (positive) text-video samples, N is the negative sample set, Lalign is the text-video alignment loss function, Lreg is the position offset regression loss function, and δs*, δe* are the true offsets.
10. A cross-modal video time positioning system based on a space-time diagram for implementing the method according to any one of claims 1-9, comprising:
the multi-scale sliding window intercepting module is used for intercepting a video segment candidate set for an un-clipped video by adopting a multi-scale sliding window after the un-clipped video and the query text are input;
the extraction training module is used for extracting text characteristics eq and video segment characteristics ec, and generating a space-time graph representation for the video segments by using a pre-trained scene graph generation model;
the multi-layer graph convolutional neural network module is used for splicing the space-time graph features, obtained by passing the space-time graph of the video through the multi-layer graph convolutional neural network, with the video segment features to obtain video features vst rich in space-time semantic information;
the projection module is used for projecting the video features vst containing the spatio-temporal information and the text features eq to the same feature space through a full connection layer, and splicing them to obtain a video-text modal fusion feature fcq;
the multi-layer perceptron network module is used for inputting the video-text modal fusion feature fcq into the multi-layer perceptron network to obtain the text-video matching score and the position offset vector.
CN202111644165.9A 2021-12-30 2021-12-30 Cross-modal video time positioning method and system based on space-time diagram Pending CN114627402A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111644165.9A CN114627402A (en) 2021-12-30 2021-12-30 Cross-modal video time positioning method and system based on space-time diagram

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111644165.9A CN114627402A (en) 2021-12-30 2021-12-30 Cross-modal video time positioning method and system based on space-time diagram

Publications (1)

Publication Number Publication Date
CN114627402A true CN114627402A (en) 2022-06-14

Family

ID=81897998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111644165.9A Pending CN114627402A (en) 2021-12-30 2021-12-30 Cross-modal video time positioning method and system based on space-time diagram

Country Status (1)

Country Link
CN (1) CN114627402A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117058601A (en) * 2023-10-13 2023-11-14 华中科技大学 Video space-time positioning network and method of cross-modal network based on Gaussian kernel
CN117612072A (en) * 2024-01-23 2024-02-27 中国科学技术大学 Video understanding method based on dynamic space-time diagram
CN118230226A (en) * 2024-05-23 2024-06-21 上海蜜度科技股份有限公司 Video target positioning method, system, medium and electronic equipment
WO2024139091A1 (en) * 2022-12-27 2024-07-04 苏州元脑智能科技有限公司 Video behavior positioning method and apparatus, electronic device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10307921A (en) * 1997-05-02 1998-11-17 Nippon Telegr & Teleph Corp <Ntt> Movement measuring method and device for time serial image
CN111429571A (en) * 2020-04-15 2020-07-17 四川大学 Rapid stereo matching method based on spatio-temporal image information joint correlation
US20210349940A1 (en) * 2019-06-17 2021-11-11 Tencent Technology (Shenzhen) Company Limited Video clip positioning method and apparatus, computer device, and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10307921A (en) * 1997-05-02 1998-11-17 Nippon Telegr & Teleph Corp <Ntt> Movement measuring method and device for time serial image
US20210349940A1 (en) * 2019-06-17 2021-11-11 Tencent Technology (Shenzhen) Company Limited Video clip positioning method and apparatus, computer device, and storage medium
CN111429571A (en) * 2020-04-15 2020-07-17 四川大学 Rapid stereo matching method based on spatio-temporal image information joint correlation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杜圣东 (Du Shengdong): "Research on Deep Learning-Based Urban Spatio-Temporal Sequence Prediction Models and Their Applications", China Doctoral Dissertations Full-Text Database, Engineering Science & Technology I, 15 June 2021 (2021-06-15), pages 027-90 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024139091A1 (en) * 2022-12-27 2024-07-04 苏州元脑智能科技有限公司 Video behavior positioning method and apparatus, electronic device and storage medium
CN117058601A (en) * 2023-10-13 2023-11-14 华中科技大学 Video space-time positioning network and method of cross-modal network based on Gaussian kernel
CN117612072A (en) * 2024-01-23 2024-02-27 中国科学技术大学 Video understanding method based on dynamic space-time diagram
CN117612072B (en) * 2024-01-23 2024-04-19 中国科学技术大学 Video understanding method based on dynamic space-time diagram
CN118230226A (en) * 2024-05-23 2024-06-21 上海蜜度科技股份有限公司 Video target positioning method, system, medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN114627402A (en) Cross-modal video time positioning method and system based on space-time diagram
KR102114564B1 (en) Learning system, learning device, learning method, learning program, teacher data creation device, teacher data creation method, teacher data creation program, terminal device and threshold change device
CN113255755B (en) Multi-modal emotion classification method based on heterogeneous fusion network
Fisher et al. Speaker association with signal-level audiovisual fusion
US20210319897A1 (en) Multimodal analysis combining monitoring modalities to elicit cognitive states and perform screening for mental disorders
CN109495766A (en) A kind of method, apparatus, equipment and the storage medium of video audit
JP5644772B2 (en) Audio data analysis apparatus, audio data analysis method, and audio data analysis program
CN113095346A (en) Data labeling method and data labeling device
CN112861945B (en) Multi-mode fusion lie detection method
CN113656660B (en) Cross-modal data matching method, device, equipment and medium
CN110569703A (en) computer-implemented method and device for identifying damage from picture
CN114254208A (en) Identification method of weak knowledge points and planning method and device of learning path
CN113298015A (en) Video character social relationship graph generation method based on graph convolution network
CN114529552A (en) Remote sensing image building segmentation method based on geometric contour vertex prediction
CN113326868B (en) Decision layer fusion method for multi-modal emotion classification
CN116522212B (en) Lie detection method, device, equipment and medium based on image text fusion
CN112861474A (en) Information labeling method, device, equipment and computer readable storage medium
CN116310975B (en) Audiovisual event positioning method based on consistent fragment selection
CN116541507A (en) Visual question-answering method and system based on dynamic semantic graph neural network
CN116704609A (en) Online hand hygiene assessment method and system based on time sequence attention
CN117197708A (en) Multi-mode video behavior recognition method based on language-vision contrast learning
Nakamura et al. LSTM‐based japanese speaker identification using an omnidirectional camera and voice information
WO2023238722A1 (en) Information creation method, information creation device, and moving picture file
CN117648980B (en) Novel entity relationship joint extraction method based on contradiction dispute data
WO2023238721A1 (en) Information creation method and information creation device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination