CN114627402A - Cross-modal video time positioning method and system based on space-time diagram - Google Patents

Cross-modal video time positioning method and system based on space-time diagram

Info

Publication number
CN114627402A
CN114627402A
Authority
CN
China
Prior art keywords
video
space
time
features
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111644165.9A
Other languages
Chinese (zh)
Inventor
李肯立
平申
田泽安
张忠阳
潘佳铭
姜骁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202111644165.9A priority Critical patent/CN114627402A/en
Publication of CN114627402A publication Critical patent/CN114627402A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-modal video time positioning method and system based on a space-time diagram. The method comprises the following steps: S1, inputting an un-clipped video and a query text, and intercepting a video segment candidate set from the un-clipped video by using a multi-scale sliding window; S2, extracting text features and video segment features, and generating a space-time graph representation of the video segments by using a pre-trained scene graph generation model; S3, splicing the space-time graph features, obtained by passing the space-time graph of the video through a multi-layer graph convolutional neural network, with the video segment features to obtain video features rich in space-time semantic information; S4, projecting the video features containing the spatio-temporal information and the text features to the same feature space through a full connection layer, and splicing them to obtain video-text modal fusion features; and S5, inputting the video-text modal fusion features into a multi-layer perceptron network to obtain a text-video matching score and a position offset vector. The invention can understand video semantic information at a fine granularity and return a more accurate video positioning boundary.

Description

Cross-modal video time positioning method and system based on space-time diagram
Technical Field
The present invention relates to the field of video positioning technologies, and in particular, to a cross-modal video time positioning method and system based on a space-time diagram.
Background
Given an un-clipped video (modality one) and a query text (modality two), the prior art has proposed some methods for video time positioning, whose goal is mainly to locate, within the given video, the start and end times that are semantically related to the query sentence. In existing methods, candidate sets of different scales are cut out of the whole video through a multi-scale sliding window, a video segment is represented by aggregating frame-level features, and the video features and text features are mapped to the same feature space for matching. A query text typically contains nouns and verbs (for example, "a woman is talking into a microphone then shaking hands with a man" contains the nouns woman, man and microphone, and the verbs talking into, shaking hands) corresponding to objects in the video (woman, man and microphone) and interactions between the objects (<woman, talking into, microphone>, <woman, shaking hands, man>). Accordingly, video time positioning requires not only a fine-grained understanding of video semantic concepts (such as objects) and capturing the interactions between objects, but also an understanding of the temporal dependency of those interactions (in time, the handshake occurs after the woman talks into the microphone). However, the above video characterization methods cannot model the spatial and temporal interaction information of the objects in the video; in particular, when the objects are the same but the interactions differ, these methods cannot achieve accurate positioning.
Therefore, there is a need to provide a more accurate spatio-temporal-graph-based cross-modal video time positioning method, which can not only understand the semantic concepts of the video (e.g., objects), but also capture the interactions between the objects in space and time.
Disclosure of Invention
The invention aims to provide a cross-modal video time positioning method based on a space-time diagram, so as to overcome the defects in the prior art.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
a cross-modal video time positioning method based on a space-time diagram comprises the following steps:
s1, inputting an un-clipped video and a query text, and intercepting a video segment candidate set for the un-clipped video by adopting a multi-scale sliding window;
s2, extracting text features eq and video segment features ec, and generating a space-time graph representation for the video segments by using a pre-trained scene graph generation model;
s3, splicing the space-time diagram features, obtained by passing the space-time diagram of the video through a multi-layer graph convolutional neural network, with the video segment features to obtain video features vst rich in space-time semantic information;
s4, projecting the video features vst containing the spatio-temporal information and the text features eq to the same feature space through a full connection layer, and splicing them to obtain a video-text modal fusion feature fcq;
s5, inputting the video-text modal fusion feature fcq into a multi-layer perceptron network to obtain a text-video matching score and a position offset vector.
Further, the step S2 includes the steps of:
s20, extracting text features eq from the query text by using a text encoder;
s21, extracting video segment features ec in the video segment candidate set by utilizing a pre-trained convolutional neural network;
s22, extracting a space map for describing interaction between objects in each frame by utilizing a pre-trained scene map generation model for the frame of each candidate segment in the candidate set;
and S23, constructing a time graph according to the similarity between the object features of the adjacent frames, and modeling the object dependence on a time domain.
Further, the step S22 specifically includes:
the scene graph generation model judges whether a relationship exists between object i and object j in each frame of the candidate segment, and if so, the directed edge Aspa(i, j) from object i to object j is set to the set value, thereby obtaining a spatial graph adjacency matrix Aspa;
a directed graph representing the spatial relationships of the objects is constructed for each frame in the candidate segments, and the object features detected by the scene graph generation model are used as node features X ∈ R^(N×d), where N represents the total number of objects in the video segment and d represents the object feature dimension;
after the adjacency matrix Aspa is obtained, each row of the adjacency matrix is normalized to ensure that the sum of the edges connected to each object equals the set value.
Further, the normalization formula is:
Aspa(i, j) = Aspa(i, j) / ∑_{j'=1..N} Aspa(i, j')
where N represents the total number of objects in the video segment and Aspa denotes the adjacency matrix of the spatial graph.
Further, the step S23 specifically includes: calculating the cosine similarity between an object i in frame t and an object j in frame t+1; if the similarity between object i and object j is greater than a given threshold, the two are determined to be the same object, and the directed edge from object i to object j is set to the set value, thereby obtaining a time graph adjacency matrix Atem.
Further, the step S3 specifically includes the following steps:
S30, inputting the space graph and the time graph into a multi-layer graph convolutional neural network; for each graph convolution layer, the graph convolution outputs of the space graph and the time graph are added directly: Z = RELU(Aspa·X·Wspa + Atem·X·Wtem), where Wspa and Wtem are weight matrices, X is the node feature matrix, Z is the output of a single graph convolution layer, and Aspa, Atem are the space graph and time graph adjacency matrices respectively; after k graph convolution layers, the space-time graph feature gst = avg_pool(max_pool(Z1, Z2, ..., Zk)) is obtained, where max_pool and avg_pool denote the max pooling and average pooling operations respectively;
and S31, splicing the space-time diagram features and the video fragment features to obtain the video features with space-time semantic information.
Further, the projection formula in step S4 is:
vst_p = RELU(Wv·vst + bv), eq_p = RELU(Ws·eq + bs)
where Wv and Ws are weight matrices, and bv and bs are bias vectors.
Further, in step S5, the formula for inputting the video-text modal fusion feature fcq into the multi-layer perceptron network is:
ol = RELU(Wl·o(l-1) + bl), with o0 = fcq
where Wl, bl and ol are respectively the weight matrix, bias vector and output vector of the l-th fully connected layer, and oL = [scq, δs, δe], where scq denotes the matching score and δs, δe denote the positioning offsets;
further, in the step S5, a loss function L is calculated by the following formula to train the network model, a candidate segment with the highest matching score is selected in the test stage, and the regression offset is added to the time boundary of the candidate segment to obtain a video time positioning boundary;
L = Lalign + λLreg;
Lalign = ∑_(c,q)∈P λ1·log(1 + exp(-scq)) + ∑_(c,q)∈N λ2·log(1 + exp(scq));
Lreg = ∑_(c,q)∈P ( |δs - δs*| + |δe - δe*| );
where λ1 and λ2 are weight coefficients, P is the set of matched (positive) text-video samples, N is the negative sample set, Lalign is the text-video alignment loss function, Lreg is the position offset regression loss function, and δs*, δe* are the true offsets.
The invention also provides a cross-modal video time positioning system based on a space-time diagram, which comprises:
the multi-scale sliding window intercepting module is used for intercepting a video segment candidate set for an un-clipped video by adopting a multi-scale sliding window after the un-clipped video and the query text are input;
the extraction training module is used for extracting text characteristics eq and video segment characteristics ec, and generating a space-time graph representation for the video segments by using a pre-trained scene graph generation model;
the multilayer graph convolution neural network module is used for splicing the acquired space-time graph characteristics and the video segment characteristics of the space-time graph of the video through the multilayer graph convolution neural network to acquire video characteristics vst rich in space-time semantic information;
the projection module is used for projecting the video features vst and the text features eq containing the spatio-temporal information to the same feature space through a full connection layer, and obtaining a video text mode fusion feature f after splicingcq
A multi-layer perceptron network module for fusing the video text mode with the feature fcqInputting the multi-layer perceptron network to obtain the matching score and the position offset vector of the text video.
Compared with the prior art, the invention has the following advantage: the interaction relationships among the objects within a video frame are modeled through the space graph, and the temporal dependencies of the object interactions are modeled through the time graph, so that video semantic information can be understood at a fine granularity and a more accurate video positioning boundary can be returned.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a model diagram of the cross-modal video time positioning method based on a space-time diagram according to the present invention.
Fig. 2 is a flowchart of the cross-modal video time positioning method based on the space-time diagram according to the present invention.
FIG. 3 is a flow chart of constructing a spatial map in the present invention.
Fig. 4 is a flow chart for constructing a time graph in the present invention.
Fig. 5 is a schematic diagram of a cross-modal video time positioning system based on space-time diagrams according to the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings, so that the advantages and features of the present invention can be more easily understood by those skilled in the art and the protection scope of the present invention can be more clearly defined.
Referring to fig. 1 and fig. 2, the present embodiment discloses a cross-modal video time positioning method based on a space-time diagram, which includes the following steps:
and step S1, inputting an un-clipped video and a query text, and intercepting a video segment candidate set for the un-clipped video by adopting a multi-scale sliding window.
Specifically, an un-clipped video V = {v1, v2, ..., vn} is input, where vi (i = 1, 2, ..., n) is the i-th image frame, together with a query text q; the goal is to identify the video boundary that matches the query sentence, i.e., l = [ls, le]. When creating the candidate set, the video is intercepted with a certain overlap rate using multi-scale time windows; for example, frames are intercepted at time scales of [64, 128, 256, 512] with an overlap rate of 80%, yielding a candidate set C = {c1, c2, ..., cM}, where each candidate segment ci is marked with a corresponding start and end time [ts, te]. When the IoU between the start and end time [ts, te] of ci and the true video boundary [ls, le] corresponding to the query sentence q is greater than a given threshold α, ci is regarded as a positive sample; otherwise it is regarded as a negative sample. The IoU formula is:
IoU([ts, te], [ls, le]) = |[ts, te] ∩ [ls, le]| / |[ts, te] ∪ [ls, le]|
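As an illustrative, non-limiting sketch, the candidate generation and IoU-based labeling of step S1 may be written in Python as follows; the scale list and overlap rate mirror the example above, while the threshold value α = 0.5 and all function names are merely assumed for illustration.

def generate_candidates(num_frames, scales=(64, 128, 256, 512), overlap=0.8):
    """Slide windows of several temporal scales over the video with a fixed overlap rate."""
    candidates = []
    for scale in scales:
        stride = max(1, int(scale * (1.0 - overlap)))  # 80% overlap -> stride of 0.2 * scale
        for start in range(0, max(1, num_frames - scale + 1), stride):
            candidates.append((start, start + scale))  # candidate [ts, te] in frame indices
    return candidates

def temporal_iou(candidate, ground_truth):
    """IoU between a candidate segment [ts, te] and the true boundary [ls, le]."""
    (ts, te), (ls, le) = candidate, ground_truth
    inter = max(0.0, min(te, le) - max(ts, ls))
    union = max(te, le) - min(ts, ls)
    return inter / union if union > 0 else 0.0

def label_candidates(candidates, ground_truth, alpha=0.5):
    """Mark a candidate as a positive sample when its IoU with the true boundary exceeds alpha."""
    return [(c, temporal_iou(c, ground_truth) >= alpha) for c in candidates]
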
And S2, extracting text features eq and video segment features ec, and generating a space-time graph representation for the video segments by using a pre-trained scene graph generation model.
Specifically, step S2 may include the following specific steps:
step S20, extracting text features eq from the query text by using a text encoder (such as LSTM).
And step S21, extracting video segment features ec in the video segment candidate set by using a pre-trained convolutional neural network (such as C3D).
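For concreteness, steps S20 and S21 may be sketched with PyTorch as below; the LSTM text encoder and the placeholder 3D convolutional backbone (standing in for a pre-trained network such as C3D) are illustrative assumptions, as are the vocabulary size and feature dimensions.

import torch.nn as nn

class TextEncoder(nn.Module):
    """Encode a query sentence into a single feature vector eq (step S20)."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):                 # (batch, seq_len) word indices
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return h_n[-1]                            # eq: (batch, hidden_dim)

class ClipEncoder(nn.Module):
    """Step S21: map a stack of frames to a clip feature ec; in practice a pre-trained
    3D CNN such as C3D would be used, here a small placeholder backbone stands in for it."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1))
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, frames):                    # (batch, 3, T, H, W)
        x = self.backbone(frames).flatten(1)
        return self.fc(x)                         # ec: (batch, feat_dim)
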
Step S22, extracting a spatial map describing the interaction between objects in each frame by using a pre-trained scene graph generation model (e.g., ReIDN) for the frame of each candidate segment in the candidate set.
Specifically, an existing trained scene graph generation model, such as the ReIDN model or the Neural Motif model, can be used. For a given picture, the scene graph generation model detects the objects in the picture and the relationships between them (<subject, predicate, object>). The object regions in the picture correspond to nodes (subjects or objects) in the scene graph, and the relationships between objects correspond to edges (predicates) in the scene graph. The process of constructing the space graph is shown in FIG. 3: for an object i and an object j in a video frame t, if a relationship <i, p, j> exists between the two, the directed edge from object i to object j is set to 1, denoted as Aspa(i, j) = 1. Through scene graph analysis, a directed graph representing the spatial relationships of the objects in a video frame can be constructed, and the object features detected by the scene graph generation model are used as node features X ∈ R^(N×d) (N represents the total number of objects in the video segment and d represents the object feature dimension). After the adjacency matrix Aspa is obtained, each row of the adjacency matrix is normalized to ensure that the sum of the edges connected to each object equals 1. The normalization formula is:
Aspa(i, j) = Aspa(i, j) / ∑_{j'=1..N} Aspa(i, j')
where N represents the total number of objects in the video segment and Aspa denotes the adjacency matrix of the space graph.
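The construction of the space graph described above can be illustrated with the following NumPy sketch; the triple format (i, predicate, j) assumed for the scene graph generation model's output is an illustrative interface, not the model's actual output format.

import numpy as np

def row_normalize(adjacency):
    """Normalize each row so that the edges leaving every object sum to 1."""
    row_sum = adjacency.sum(axis=1, keepdims=True)
    row_sum[row_sum == 0] = 1.0  # leave objects without outgoing edges unchanged
    return adjacency / row_sum

def build_spatial_adjacency(num_objects, relations):
    """relations: iterable of (i, predicate, j) triples detected in the candidate segment.
    The directed edge i -> j is set to 1 whenever a relation <i, p, j> exists."""
    a_spa = np.zeros((num_objects, num_objects), dtype=np.float32)
    for i, _, j in relations:
        a_spa[i, j] = 1.0
    return row_normalize(a_spa)  # Aspa with each row summing to 1
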
And step S23, constructing a time graph according to the similarity between the object features of the adjacent frames, and modeling the object dependence on a time domain.
Specifically, as shown in fig. 4, for an object i in frame t, the cosine similarity between object i and each object j in frame t+1 is calculated; if the similarity between object i and object j is greater than a given threshold β, the two are considered to be the same object, and the directed edge from object i to object j is set to 1, denoted as Atem(i, j) = 1. The time graph adjacency matrix Atem is obtained by applying the same row normalization to the adjacency matrix of the time graph.
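Similarly, the time graph construction of step S23 can be sketched as follows, assuming the objects of frame t and frame t+1 are given as feature matrices of shape (number of objects, d); the threshold β = 0.8 is only an example value.

import numpy as np

def build_temporal_adjacency(objects_t, objects_t1, beta=0.8):
    """Connect object i in frame t to object j in frame t+1 when their cosine
    similarity exceeds beta, i.e. they are regarded as the same object."""
    a = objects_t / (np.linalg.norm(objects_t, axis=1, keepdims=True) + 1e-8)
    b = objects_t1 / (np.linalg.norm(objects_t1, axis=1, keepdims=True) + 1e-8)
    similarity = a @ b.T                              # pairwise cosine similarities
    a_tem = (similarity > beta).astype(np.float32)    # directed edges from frame t to frame t+1
    row_sum = a_tem.sum(axis=1, keepdims=True)
    row_sum[row_sum == 0] = 1.0
    return a_tem / row_sum                            # same row normalization as the space graph
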
And step S3, splicing the space-time graph features, obtained by passing the space-time graph of the video through a multi-layer graph convolutional neural network (GCN), with the video segment features to obtain the video features vst rich in space-time semantic information.
Specifically, the method comprises the following specific steps:
step S30, inputting the space map and the time map into the multi-layer map convolutional neural network, wherein the space-time map has an adjacent matrix AspaAnd AtemTherefore, for each layer of graph convolution neural network, the graph convolution output results of the spatial graph and the time graph are directly added: Z-RELU (A)spaXWspa+AtemXWtem) Wherein W isspaAnd WtemIs a weight matrix, X is a node feature matrix, Z is a single-layer graph convolution network output result, Aspa、AtemAfter convolution of k-layer maps, space-time map features gst-avg _ pool (max _ pool (Z)) are obtained1,Z2,,,Zk) Max _ pool, avg _ pool denote maximum pooling and average pooling operations, respectively;
and step S31, splicing the spatio-temporal image characteristics gst and the video fragment characteristics ec to obtain the video characteristics vst with spatio-temporal semantic information.
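An illustrative PyTorch sketch of steps S30 and S31 follows; the per-layer weight layout, the number of layers, and the interpretation of max_pool over nodes followed by avg_pool over layers are assumptions made for the sake of a runnable example.

import torch
import torch.nn as nn

class SpatioTemporalGCNLayer(nn.Module):
    """One layer computing Z = RELU(Aspa X Wspa + Atem X Wtem)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_spa = nn.Linear(in_dim, out_dim, bias=False)  # Wspa
        self.w_tem = nn.Linear(in_dim, out_dim, bias=False)  # Wtem

    def forward(self, x, a_spa, a_tem):                      # x: (N, d) node features
        return torch.relu(a_spa @ self.w_spa(x) + a_tem @ self.w_tem(x))

class SpatioTemporalGCN(nn.Module):
    """Stack of k layers; gst = avg_pool(max_pool(Z1, ..., Zk)), then vst = [gst ; ec]."""
    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(
            [SpatioTemporalGCNLayer(dim, dim) for _ in range(num_layers)])

    def forward(self, x, a_spa, a_tem, ec):
        pooled = []
        for layer in self.layers:
            x = layer(x, a_spa, a_tem)
            pooled.append(x.max(dim=0).values)   # max pooling over the N object nodes
        gst = torch.stack(pooled).mean(dim=0)    # average pooling over the k layer outputs
        return torch.cat([gst, ec], dim=-1)      # splice with the clip feature to obtain vst
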
Step S4, projecting the video features vst containing the spatio-temporal information and the text features eq to the same feature space through a full connection layer, and splicing them to obtain a video-text modal fusion feature fcq.
Specifically, the video features vst and the text features eq are each projected to the same feature space through a full connection layer; the projection formula is vst_p = RELU(Wv·vst + bv), eq_p = RELU(Ws·eq + bs), where Wv and Ws are weight matrices and bv and bs are bias vectors. The projected video features and text features in the same feature space are spliced to obtain the fused modal feature fcq.
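A minimal PyTorch sketch of the projection and fusion of step S4 is given below; the common feature dimension of 512 is an assumed value.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Project vst and eq into the same space and splice them into fcq."""
    def __init__(self, video_dim, text_dim, common_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, common_dim)  # Wv, bv
        self.text_proj = nn.Linear(text_dim, common_dim)    # Ws, bs

    def forward(self, vst, eq):
        vst_p = torch.relu(self.video_proj(vst))   # vst_p = RELU(Wv·vst + bv)
        eq_p = torch.relu(self.text_proj(eq))      # eq_p  = RELU(Ws·eq + bs)
        return torch.cat([vst_p, eq_p], dim=-1)    # fcq
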
Step S5, inputting the video-text modal fusion feature fcq into the multi-layer perceptron network to obtain a text-video matching score and a position offset vector.
Specifically, fcq is input into a multi-layer fully connected neural network, which outputs the matching score and positioning offsets between the text q and the video segment ci, i.e., out = [scq, δs, δe], where scq denotes the matching score and δs, δe denote the positioning offsets. Passing fcq through the multi-layer fully connected neural network is formulated as:
ol = RELU(Wl·o(l-1) + bl), with o0 = fcq
where Wl, bl and ol are respectively the weight matrix, bias vector and output vector of the l-th fully connected layer.
In particular, oL = [scq, δs, δe], where scq denotes the matching score and δs, δe denote the positioning offsets.
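The multi-layer perceptron of step S5 may be sketched as follows; the number of hidden layers and the hidden width are assumed hyper-parameters, and only the final three-dimensional output [scq, δs, δe] is prescribed by the method.

import torch.nn as nn

class MatchRegressHead(nn.Module):
    """Multi-layer perceptron whose last layer outputs oL = [scq, delta_s, delta_e]."""
    def __init__(self, fuse_dim, hidden_dim=256, num_hidden=2):
        super().__init__()
        layers, in_dim = [], fuse_dim
        for _ in range(num_hidden):                      # ol = RELU(Wl·o(l-1) + bl)
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
            in_dim = hidden_dim
        layers.append(nn.Linear(in_dim, 3))              # final layer: [scq, delta_s, delta_e]
        self.mlp = nn.Sequential(*layers)

    def forward(self, fcq):
        out = self.mlp(fcq)
        s_cq, delta_s, delta_e = out[..., 0], out[..., 1], out[..., 2]
        return s_cq, delta_s, delta_e
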
Specifically, the loss function L is calculated by the formulas:
L = Lalign + λLreg;
Lalign = ∑_(c,q)∈P λ1·log(1 + exp(-scq)) + ∑_(c,q)∈N λ2·log(1 + exp(scq));
Lreg = ∑_(c,q)∈P ( |δs - δs*| + |δe - δe*| );
where λ, λ1 and λ2 are weight coefficients, P is the positive sample set of matched text-video pairs, N is the negative sample set, Lalign is the text-video alignment loss function, Lreg is the position offset regression loss function, and δs*, δe* are the true offsets.
The invention trains the network model by minimizing the loss function L. In the testing stage, the candidate segment with the highest matching score is selected, and the regression offset is added to the time boundary of that candidate segment to obtain the final, accurate video time positioning boundary.
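The training objective above may be sketched as follows; the absolute-difference penalty used for Lreg and the summation over samples are assumptions, since the text only specifies that Lreg regresses the predicted offsets toward the true offsets over the positive pairs.

import torch

def alignment_loss(pos_scores, neg_scores, lambda1=1.0, lambda2=1.0):
    """Lalign = sum over P of lambda1*log(1+exp(-scq)) + sum over N of lambda2*log(1+exp(scq))."""
    pos_term = lambda1 * torch.log1p(torch.exp(-pos_scores)).sum()
    neg_term = lambda2 * torch.log1p(torch.exp(neg_scores)).sum()
    return pos_term + neg_term

def regression_loss(pred_offsets, true_offsets):
    """Offset regression over positive pairs; an L1 penalty is assumed here."""
    return torch.abs(pred_offsets - true_offsets).sum()

def total_loss(pos_scores, neg_scores, pred_offsets, true_offsets, lam=1.0):
    """L = Lalign + lambda * Lreg."""
    return alignment_loss(pos_scores, neg_scores) + lam * regression_loss(pred_offsets, true_offsets)
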
Referring to fig. 5, the present invention further provides a cross-modal video time positioning system based on a space-time diagram, comprising: a multi-scale sliding window intercepting module 1, configured to intercept a video segment candidate set from an un-clipped video using a multi-scale sliding window after the un-clipped video and a query text are input; an extraction training module 2, configured to extract text features eq and video segment features ec, and to generate a space-time graph representation for the video segments using a pre-trained scene graph generation model; a multi-layer graph convolutional neural network module 3, configured to splice the space-time graph features, obtained by passing the space-time graph of the video through a multi-layer graph convolutional neural network (GCN), with the video segment features to obtain video features vst rich in space-time semantic information; a projection module 4, configured to project the video features vst containing the spatio-temporal information and the text features eq to the same feature space through a full connection layer, and to splice them to obtain a video-text modal fusion feature fcq; and a multi-layer perceptron network module 5, configured to input the video-text modal fusion feature fcq into the multi-layer perceptron network to obtain the text-video matching score and position offset vector.
According to the invention, the interaction relationships among the objects within a video frame are modeled through the space graph, and the temporal dependencies of the object interactions are modeled through the time graph, so that video semantic information can be understood at a fine granularity and a more accurate video positioning boundary can be returned.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, the patentee may make various changes or modifications within the scope of the appended claims, and such changes and modifications shall fall within the protection scope of the present invention as long as they do not exceed the scope of the invention described in the claims.

Claims (10)

1. A cross-modal video time positioning method based on a space-time diagram is characterized by comprising the following steps:
s1, inputting an un-clipped video and a query text, and intercepting a video segment candidate set for the un-clipped video by adopting a multi-scale sliding window;
s2, extracting text features eq and video segment features ec, and generating a space-time graph representation for the video segments by using a pre-trained scene graph generation model;
s3, splicing the space-time diagram features, obtained by passing the space-time diagram of the video through a multi-layer graph convolutional neural network, with the video segment features to obtain video features vst rich in space-time semantic information;
s4, projecting the video features vst containing the spatio-temporal information and the text features eq to the same feature space through a full connection layer, and splicing them to obtain a video-text modal fusion feature fcq;
s5, inputting the video-text modal fusion feature fcq into a multi-layer perceptron network to obtain a text-video matching score and a position offset vector.
2. The spatio-temporal graph-based cross-modal video temporal positioning method according to claim 1, wherein the step S2 comprises the steps of:
s20, extracting text features eq from the query text by using a text encoder;
s21, extracting video segment features ec in the video segment candidate set by utilizing a pre-trained convolutional neural network;
s22, extracting a space map for describing interaction between objects in each frame by utilizing a pre-trained scene map generation model for the frame of each candidate segment in the candidate set;
and S23, constructing a time graph according to the similarity between the object features of the adjacent frames, and modeling the object dependence on a time domain.
3. The method according to claim 2, wherein the step S22 specifically includes:
the scene graph generation model judges whether a relationship exists between object i and object j in each frame of the candidate segment, and if so, the directed edge Aspa(i, j) from object i to object j is set to the set value, thereby obtaining a spatial graph adjacency matrix Aspa;
a directed graph representing the spatial relationships of the objects is constructed for each frame in the candidate segments, and the object features detected by the scene graph generation model are used as node features X ∈ R^(N×d), where N represents the total number of objects in the video segment and d represents the object feature dimension;
after the adjacency matrix Aspa is obtained, each row of the adjacency matrix is normalized to ensure that the sum of the edges connected to each object equals the set value.
4. The spatio-temporal-graph-based cross-modal video time positioning method according to claim 3, wherein the normalization formula is:
Aspa(i, j) = Aspa(i, j) / ∑_{j'=1..N} Aspa(i, j')
where N denotes the total number of objects in the video segment and Aspa denotes the adjacency matrix of the spatial graph.
5. The method according to claim 2, wherein the step S23 specifically includes: calculating the cosine similarity between an object i in frame t and an object j in frame t+1; if the similarity between object i and object j is greater than a given threshold, determining that they are the same object, and setting the directed edge from object i to object j to the set value, thereby obtaining a time graph adjacency matrix Atem.
6. The method according to claim 2, wherein the step S3 specifically comprises the following steps:
s30, inputting the space graph and the time graph into a multi-layer graph convolutional neural network; for each graph convolution layer, the graph convolution outputs of the space graph and the time graph are added directly: Z = RELU(Aspa·X·Wspa + Atem·X·Wtem), where Wspa and Wtem are weight matrices, X is the node feature matrix, Z is the output of a single graph convolution layer, and Aspa, Atem are the space graph and time graph adjacency matrices respectively; after k graph convolution layers, the space-time graph feature gst = avg_pool(max_pool(Z1, Z2, ..., Zk)) is obtained, where max_pool and avg_pool denote the max pooling and average pooling operations respectively;
and S31, splicing the space-time diagram features and the video fragment features to obtain the video features with space-time semantic information.
7. The spatio-temporal graph-based cross-modal video temporal positioning method according to claim 1, wherein the projection formula in the step S4 is:
vst_p = RELU(Wv·vst + bv), eq_p = RELU(Ws·eq + bs)
where Wv and Ws are weight matrices, and bv and bs are bias vectors.
8. The cross-modal video time positioning method based on the space-time diagram according to claim 1, wherein the formula for inputting the video-text modal fusion feature fcq into the multi-layer perceptron network in step S5 is:
ol = RELU(Wl·o(l-1) + bl), with o0 = fcq
where Wl, bl and ol are respectively the weight matrix, bias vector and output vector of the l-th fully connected layer, and oL = [scq, δs, δe], where scq denotes the matching score and δs, δe denote the positioning offsets.
9. The cross-modal video time positioning method based on the space-time diagram according to claim 1, wherein in step S5 a loss function L is calculated by the following formulas to train the network model; in a test stage, the candidate segment with the highest matching score is selected, and the regression offset is added to the time boundary of that candidate segment to obtain a video time positioning boundary;
L = Lalign + λLreg;
Lalign = ∑_(c,q)∈P λ1·log(1 + exp(-scq)) + ∑_(c,q)∈N λ2·log(1 + exp(scq));
Lreg = ∑_(c,q)∈P ( |δs - δs*| + |δe - δe*| );
where λ1 and λ2 are weight coefficients, P is the set of matched (positive) text-video samples, N is the negative sample set, Lalign is the text-video alignment loss function, Lreg is the position offset regression loss function, and δs*, δe* are the true offsets.
10. A cross-modal video time positioning system based on a space-time diagram for implementing the method according to any one of claims 1-9, comprising:
the multi-scale sliding window intercepting module is used for intercepting a video segment candidate set for an un-clipped video by adopting a multi-scale sliding window after the un-clipped video and the query text are input;
the extraction training module is used for extracting text characteristics eq and video segment characteristics ec, and generating a space-time graph representation for the video segments by using a pre-trained scene graph generation model;
the multi-layer graph convolutional neural network module is used for splicing the space-time graph features, obtained by passing the space-time graph of the video through the multi-layer graph convolutional neural network, with the video segment features to obtain video features vst rich in space-time semantic information;
the projection module is used for projecting the video features vst containing the spatio-temporal information and the text features eq to the same feature space through a full connection layer, and splicing them to obtain a video-text modal fusion feature fcq;
the multi-layer perceptron network module is used for inputting the video-text modal fusion feature fcq into the multi-layer perceptron network to obtain the text-video matching score and the position offset vector.
CN202111644165.9A 2021-12-30 2021-12-30 Cross-modal video time positioning method and system based on space-time diagram Pending CN114627402A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111644165.9A CN114627402A (en) 2021-12-30 2021-12-30 Cross-modal video time positioning method and system based on space-time diagram

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111644165.9A CN114627402A (en) 2021-12-30 2021-12-30 Cross-modal video time positioning method and system based on space-time diagram

Publications (1)

Publication Number Publication Date
CN114627402A true CN114627402A (en) 2022-06-14

Family

ID=81897998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111644165.9A Pending CN114627402A (en) 2021-12-30 2021-12-30 Cross-modal video time positioning method and system based on space-time diagram

Country Status (1)

Country Link
CN (1) CN114627402A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117058601A (en) * 2023-10-13 2023-11-14 华中科技大学 Video space-time positioning network and method of cross-modal network based on Gaussian kernel
CN117612072A (en) * 2024-01-23 2024-02-27 中国科学技术大学 Video understanding method based on dynamic space-time diagram
CN118230226A (en) * 2024-05-23 2024-06-21 上海蜜度科技股份有限公司 Video target positioning method, system, medium and electronic equipment
WO2024139091A1 (en) * 2022-12-27 2024-07-04 苏州元脑智能科技有限公司 Video behavior positioning method and apparatus, electronic device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10307921A (en) * 1997-05-02 1998-11-17 Nippon Telegr & Teleph Corp <Ntt> Movement measuring method and device for time serial image
CN111429571A (en) * 2020-04-15 2020-07-17 四川大学 Rapid stereo matching method based on spatio-temporal image information joint correlation
US20210349940A1 (en) * 2019-06-17 2021-11-11 Tencent Technology (Shenzhen) Company Limited Video clip positioning method and apparatus, computer device, and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10307921A (en) * 1997-05-02 1998-11-17 Nippon Telegr & Teleph Corp <Ntt> Movement measuring method and device for time serial image
US20210349940A1 (en) * 2019-06-17 2021-11-11 Tencent Technology (Shenzhen) Company Limited Video clip positioning method and apparatus, computer device, and storage medium
CN111429571A (en) * 2020-04-15 2020-07-17 四川大学 Rapid stereo matching method based on spatio-temporal image information joint correlation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杜圣东 (Du Shengdong): "Research on Deep Learning-Based Urban Spatio-Temporal Sequence Prediction Models and Their Applications", China Doctoral Dissertations Full-Text Database, Engineering Science & Technology I, 15 June 2021 (2021-06-15), pages 027-90 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024139091A1 (en) * 2022-12-27 2024-07-04 苏州元脑智能科技有限公司 Video behavior positioning method and apparatus, electronic device and storage medium
CN117058601A (en) * 2023-10-13 2023-11-14 华中科技大学 Video space-time positioning network and method of cross-modal network based on Gaussian kernel
CN117612072A (en) * 2024-01-23 2024-02-27 中国科学技术大学 Video understanding method based on dynamic space-time diagram
CN117612072B (en) * 2024-01-23 2024-04-19 中国科学技术大学 Video understanding method based on dynamic space-time diagram
CN118230226A (en) * 2024-05-23 2024-06-21 上海蜜度科技股份有限公司 Video target positioning method, system, medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN114627402A (en) Cross-modal video time positioning method and system based on space-time diagram
KR102114564B1 (en) Learning system, learning device, learning method, learning program, teacher data creation device, teacher data creation method, teacher data creation program, terminal device and threshold change device
CN113255755B (en) Multi-modal emotion classification method based on heterogeneous fusion network
Fisher et al. Speaker association with signal-level audiovisual fusion
US20210319897A1 (en) Multimodal analysis combining monitoring modalities to elicit cognitive states and perform screening for mental disorders
CN109495766A (en) A kind of method, apparatus, equipment and the storage medium of video audit
JP5644772B2 (en) Audio data analysis apparatus, audio data analysis method, and audio data analysis program
CN113095346A (en) Data labeling method and data labeling device
CN112861945B (en) Multi-mode fusion lie detection method
CN113656660B (en) Cross-modal data matching method, device, equipment and medium
CN110569703A (en) computer-implemented method and device for identifying damage from picture
CN114254208A (en) Identification method of weak knowledge points and planning method and device of learning path
CN113298015A (en) Video character social relationship graph generation method based on graph convolution network
CN114529552A (en) Remote sensing image building segmentation method based on geometric contour vertex prediction
CN113326868B (en) Decision layer fusion method for multi-modal emotion classification
CN116522212B (en) Lie detection method, device, equipment and medium based on image text fusion
CN112861474A (en) Information labeling method, device, equipment and computer readable storage medium
CN116310975B (en) Audiovisual event positioning method based on consistent fragment selection
CN116541507A (en) Visual question-answering method and system based on dynamic semantic graph neural network
CN116704609A (en) Online hand hygiene assessment method and system based on time sequence attention
CN117197708A (en) Multi-mode video behavior recognition method based on language-vision contrast learning
Nakamura et al. LSTM‐based japanese speaker identification using an omnidirectional camera and voice information
WO2023238722A1 (en) Information creation method, information creation device, and moving picture file
CN117648980B (en) Novel entity relationship joint extraction method based on contradiction dispute data
WO2023238721A1 (en) Information creation method and information creation device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination