CN111582170A - Method and positioning system for completing specified object positioning task in video by using object-aware multi-branch relation network


Info

Publication number
CN111582170A
CN111582170A (application CN202010382647.0A)
Authority
CN
China
Prior art keywords: branch, video, region, representing, video frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010382647.0A
Other languages
Chinese (zh)
Other versions
CN111582170B (en)
Inventor
赵洲
路伊琳
张竹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority application: CN202010382647.0A (granted as CN111582170B)
Publication of CN111582170A
Application granted
Publication of CN111582170B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a positioning system for completing a specified-object positioning task in a video by using an object-aware multi-branch relation network. The method comprises the following steps: given a video, region features are extracted from different frames, and dynamic information is derived from these region features; given a query sentence, the features of each object in its sentence context are learned with a Bi-GRU and the NLTK library; an object-aware multi-branch relation network is constructed, in which object-aware modulation emphasizes the region features related to each object and weakens unrelated region features, and object-region cross-modal matching is then performed; multi-branch relation reasoning captures the relationships of the key objects between the main branch and the auxiliary branches; a diversity loss calculation method is provided to ensure that different branches focus on the regions related to their corresponding objects. Finally, a sampling method yields multiple video segments, the segment with the highest temporal confidence score is selected, and within it the regions with the highest spatial scores are selected to generate the target pipeline.

Description

Method and positioning system for completing specified object positioning task in video by using object-aware multi-branch relation network
Technical Field
The invention relates to the field of positioning of specified objects in videos, in particular to a method and a positioning system for completing a task of positioning the specified objects in the videos by using an object-aware multi-branch relation network.
Background
Localizing a specified object in video (spatio-temporal video grounding) is a task connecting computer vision (CV) and natural language processing (NLP): given a sentence that describes an object, the spatio-temporal pipeline of that object is retrieved in the video, i.e., a sequence of bounding boxes is generated. Much work has been done in this area in recent years. However, most existing localization methods are limited to well-aligned sentence-video segment pairs.
Localizing a specified object in video is an emerging task in the area of cross-modal understanding. Most existing methods are limited to well-aligned sentence-video segment pairs, i.e., video segments that have already been cut out of the complete video and are temporally aligned with the sentences. Recently, researchers have begun to explore the problem of locating specified objects in video based on unaligned data and multi-form sentences. Specifically, a sentence may be declarative or interrogative, and it may describe the relationship between the query object and other auxiliary objects over a period of time; for example, "a child kicks a ball" describes the motion relationship between the main object (the child) and an auxiliary object (the ball) over a period of time. The key to this task is therefore to capture the key relationships between objects in the video and to generate the bounding boxes on that basis.
Among the existing methods, some achieve excellent performance on aligned sentence-video segments but cannot solve localization on unaligned sentence-video segments and cannot identify relationships between objects. Among the methods exploring unaligned data, Zhang et al. (Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, and Lianli Gao. Where does it exist: Spatio-temporal video grounding for multi-form sentences. arXiv preprint) integrate textual cues into region features and use spatio-temporal graph reasoning to retrieve spatio-temporal pipelines. Although this approach can capture object relationships through cross-modal region interaction, it fails to filter out unnecessary objects and retains all regions in a coarse relation model, which prevents the establishment of an effective relation model.
Therefore, the existing method for specifying object location in video has at least the following technical problems:
(1) a series of candidate spatio-temporal pipelines (i.e., candidate boxes) must be extracted first, and the most relevant pipeline is then selected according to the sentence; suitable candidate pipelines are difficult to extract on unaligned sentence-video segments;
(2) each object is modeled separately, or, even when a relation model between objects is established, the model is coarse and unnecessary objects are introduced.
These problems cause some prior-art positioning methods to perform poorly on unaligned sentence-video segments, to neglect the relationships between objects, and to incorporate unnecessary objects into the object relation model, resulting in an inaccurate final positioning range.
Disclosure of Invention
In order to solve the problems in the prior art that localization performance on unaligned sentence-video segments is poor and that the relationships between objects cannot be accurately captured, the invention provides a method and a positioning system for completing the task of positioning a specified object in a video by using an object-aware multi-branch relation network (OMRN). The method first extracts dynamic region features from the video and learns object representations corresponding to the nouns in the sentence. An object-aware multi-branch relation network is then established to determine the video regions containing the objects. The network comprises a plurality of branches, each corresponding to a noun object in the query sentence: the main branch corresponds to the queried object, i.e., the main object, and the auxiliary branches correspond to the other objects mentioned in the sentence. Specifically, an object-aware modulation layer is used in each branch to enhance the features of the regions related to the object and weaken the unnecessary region features; object-region cross-modal matching is then performed in each branch, and the key object relationships between the main branch and the auxiliary branches are captured by a multi-branch relation reasoning module. In addition, considering that each branch should focus only on its corresponding object, the invention provides a diversity loss calculation method (diversity loss function) so that different branches focus on different regions. Finally, a temporal boundary is determined and the target pipeline is retrieved by a spatio-temporal locator. The invention pays more attention to the key objects in the sentence and establishes sufficiently strong cross-modal relation reasoning between them, thereby achieving accurate localization.
In order to achieve the purpose, the invention adopts the following specific technical scheme:
the method for completing the positioning task of the specified object in the video by utilizing the object-aware multi-branch relation network comprises the following steps:
s1: extracting regional characteristics of different frames from a video aiming at a section of video, and calculating association scores between any regional characteristic in the video frame and all regional characteristics in the video frame in an adjacent interval; extracting the regional characteristics with the highest matching scores in each video frame in the adjacent interval as matching regional characteristics, and performing average pooling on any regional characteristic in the video frames and the matching regional characteristics to obtain the dynamic regional characteristics of the video frames;
s2: aiming at a query sentence, firstly, obtaining semantic feature sets of all words in the query sentence by adopting a Bi-GRU network, and extracting semantic features of nouns from the semantic feature sets; then adopting an attention method to further obtain object characteristics in the query statement;
s3: constructing an object-aware multi-branch relation network, wherein the object-aware multi-branch relation network is composed of a main branch, T-1 auxiliary branches and a multi-branch relation reasoning module, and each branch comprises an object-aware modulation layer, a cross-mode matching layer and a softmax function layer;
firstly, the dynamic region features of the video frames obtained in step S1 and the t-th object feature in the query sentence obtained in step S2 are used as the input of the object-aware modulation layer of the t-th branch to obtain the object-aware region features of the regions in the video, wherein when t = 1 the branch is the main branch, and when t ∈ {2, 3, …, T} the branch is an auxiliary branch;
then calculating a matching score between the object perception region characteristics of the region in the video and the object characteristics in the query statement through a cross-modal matching layer, and processing the matching score through a softmax function layer;
finally, the object perception regional characteristics of the region in the video output by the main branch and the T-1 auxiliary branches and the matching scores processed by the softmax function are used as the input of a multi-branch relation reasoning module to obtain the object perception multi-branch characteristics of the region;
s4: establishing a space-time locator which comprises a space locator and a time locator;
s5: designing a multitask loss function as:
Figure BDA0002482601270000031
wherein ,λ1,λ2,λ3,λ4The over-parameters of the balance between the four losses are regulated and controlled,
Figure BDA0002482601270000032
a loss function representing the spatial locator is shown,
Figure BDA0002482601270000033
represents an alignment loss function of the time locator,
Figure BDA0002482601270000034
represents the regression loss function of the time locator,
Figure BDA0002482601270000035
a diversity loss function representing an object-aware multi-branch relational network; training an object perception multi-branch relation network and a space-time locator in an end-to-end mode according to a multi-task loss function;
s6: for a section of video and query sentences to be processed, preprocessing is performed through steps S1 and S2, then the dynamic region features of the obtained video frames and the object features in the query sentences are used as the input of a trained object perception multi-branch relation network, the output of the trained object perception multi-branch relation network is used as the input of a trained space-time locator, and the region corresponding to the minimum value of the multitask loss function is used as the final result to be output.
Further, the step S3 is specifically:
3.1) constructing an object-aware multi-branch relation network, wherein the object-aware multi-branch relation network is composed of a main branch, T-1 auxiliary branches and a multi-branch relation reasoning module, and each branch comprises an object-aware modulation layer, a cross-modal matching layer and a softmax function layer;
3.2) the dynamic region features r̂_{n,k} obtained in step S1 and the object feature o_t in the query sentence are used as the input of the t-th branch, and the object-aware region features are calculated by the object-aware modulation layer of each branch as follows:

γ_t = tanh(W_γ·o_t + b_γ)
ε_t = tanh(W_ε·o_t + b_ε)
r̃^t_{n,k} = γ_t ⊙ r̂_{n,k} + ε_t

wherein W_γ, W_ε, b_γ, b_ε are parameter matrices and bias vectors, γ_t denotes the modulation gate corresponding to the t-th object in the query sentence, ε_t denotes the modulation bias vector of the t-th object, ⊙ denotes element-wise multiplication, and r̃^t_{n,k} denotes the object-aware region feature of the k-th region of the n-th video frame in the t-th branch;
3.3) the matching score between the object-aware region features of the regions in the video and the object features in the query sentence is calculated through the cross-modal matching layer:

m^t_{n,k} = w_c^T·tanh(W_c·[r̃^t_{n,k}; o_t] + b_c)

wherein w_c denotes a row vector, W_c denotes a parameter matrix, b_c denotes a parameter vector, [·;·] denotes concatenation, and m^t_{n,k} denotes the matching score between the k-th region of the n-th video frame and the t-th object in the query sentence; m^t_{n,k} is then processed by the softmax function layer,

m̄^t_{n,k} = exp(m^t_{n,k}) / Σ_{k'=1}^{K} exp(m^t_{n,k'})

and the normalized scores form the set m̄^t_n = {m̄^t_{n,k}};
3.4) the softmax-normalized matching scores m̄^t_{n,k} and the object-aware region features r̃^t_{n,k} output by the main branch and the T-1 auxiliary branches are used as the input of the multi-branch relation reasoning module to obtain the object-aware multi-branch features v_{n,k} of the regions, specifically:

3.4.1) the attention weight between any region in a video frame of the main branch and any region in the video frames of the T-1 auxiliary branches is calculated: the attention weight a^t_{n,k,i} is computed from the object-aware region feature r̃^1_{n,k} of the k-th region of the n-th video frame in the main branch, the object-aware region feature r̃^t_{n,i} of the i-th region of the n-th video frame in the t-th branch, and the relative position vector p^t_{n,k,i} of the two regions; a^t_{n,k,i} is then processed by a softmax function layer to obtain the normalized attention weight ā^t_{n,k,i};

3.4.2) the integrated feature contributed by all regions of the t-th branch to any region of the main branch is obtained from the regions of the auxiliary branch related to the t-th object of the query sentence:

u^t_{n,k} = Σ_{i=1}^{K} ā^t_{n,k,i}·m̄^t_{n,i}·r̃^t_{n,i}

wherein u^t_{n,k} denotes the integrated feature of all regions of the t-th branch (t ≥ 2) for the k-th region of the n-th video frame in the main branch;

based on the integrated features u^t_{n,k} of each region, the object-aware multi-branch feature set V = {v_{n,k}} of all regions is further obtained by combining the main-branch feature r̃^1_{n,k} of each region with the integrated features u^2_{n,k}, …, u^T_{n,k} through a linear layer followed by the ReLU(·) activation function (linear rectification function), wherein v_{n,k} denotes the object-aware multi-branch feature of the k-th region of the n-th video frame.
The diversity loss function is:

L_div = (1/Z)·Σ_{n∈S_gt} Σ_{t≠t'} (m̄^t_n)^T·m̄^{t'}_n

wherein S_gt denotes the set of frames in the ground-truth segment, Z denotes a normalization factor, and m̄^t_n and m̄^{t'}_n denote the softmax-normalized matching score sets of any two different objects in the query sentence over all regions of the n-th video frame.
Another objective of the present invention is to provide a positioning system for completing a task of positioning a specified object in a video by using an object-aware multi-branch relationship network, for implementing the method for completing the task of positioning the specified object in the video, including:
the video preprocessing module: used to extract the region features of different frames from the video and to calculate the association scores between any region feature in a video frame and all region features in the video frames of its neighboring interval; to extract, in each video frame of the neighboring interval, the region feature with the highest association score as a matched region feature; and to perform average pooling on the region feature and its matched region features to obtain the dynamic region features of the video frames;
the query statement preprocessing module: the method comprises the steps of acquiring a semantic feature set of all words in a query sentence, extracting semantic features of nouns from the semantic feature set, and further obtaining object features in the query sentence by adopting an attention method;
a video clip positioning module: the video segment positioning module comprises a modeling submodule and a training submodule, wherein the modeling submodule is configured with an object perception multi-branch relation model, a space positioning model and a time positioning model, the object perception multi-branch relation model is used for extracting object perception multi-branch characteristics of an area, the space positioning model is used for realizing the positioning of a space pipeline, the time positioning model is used for realizing the positioning of a time pipeline, and the training submodule is configured with a multi-task loss function;
an output module: for outputting the positioning result.
The invention has the following beneficial effects:
(1) Traditional positioning methods must first extract a series of candidate spatio-temporal pipelines (i.e., candidate boxes), which means selecting candidate region boxes on the premise that the time is already localized (i.e., aligned), and then select the most relevant pipeline according to the sentence; suitable candidate pipelines are difficult to extract on unaligned sentence-video segments, so prior-art positioning methods are not suitable for unaligned sentence-video segment pairs. The invention first screens out the region features belonging to the same region in different frames through association scores and obtains the dynamic feature of each region by average pooling; it then samples a group of candidate segments by establishing a temporal locator, calculates the temporal confidence score of each frame in each candidate segment, and computes a temporal loss function to complete temporal localization, which improves the accuracy of localization on unaligned sentence-video segments.
(2) For the problem that the relationships between objects cannot be accurately captured in the prior art, the constructed object-aware multi-branch relation network comprises a main branch, auxiliary branches and a multi-branch relation reasoning module. Each branch corresponds to a noun object in the query sentence: the first noun of the query sentence is taken as the main object, i.e., the query object, and corresponds to the main branch; the other objects mentioned in the query sentence are auxiliary objects and correspond to the auxiliary branches. Each branch comprises an object-aware modulation layer, a cross-modal matching layer and a softmax function layer. The object-aware modulation layer is used in each branch to enhance the features of the regions related to the object and weaken the unnecessary region features; object-region cross-modal matching is then performed in each branch, and the key object relationships between the main branch and the auxiliary branches are captured by the multi-branch relation reasoning module. In particular, in the multi-branch relation reasoning module, the integrated features of all regions of each auxiliary branch for each region of the main branch are calculated, and the object-aware multi-branch features are obtained through a ReLU activation function; these features accurately reflect the relationship between the auxiliary branches and the main branch, so the captured object relationships are more accurate.
(3) For the problem that object relation models in the prior art often introduce unnecessary objects and therefore remain coarse, the invention designs a diversity loss function, which is calculated from the region-object matching scores (i.e., the matching score of each branch for each region in a video frame) and ensures that each branch attends only to its corresponding object; unnecessary object relationships can thus be effectively filtered out of the obtained object relation model, further improving the localization performance.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention;
fig. 2 is a schematic diagram of the overall architecture of an object-aware multi-branch relational network.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, the present invention uses an object-aware multi-branch relation network to complete the task of locating a specified object in a video: it first extracts region features from the video and object features from the sentence. An object-aware multi-branch relation network is then established to discover the regions associated with the objects, and reasoning is performed to capture the relationships between the key objects. Finally, the final localization of the specified object in the video clip is completed by the designed spatio-temporal locator.
The specific implementation steps are as follows:
the method comprises the following steps: video preprocessing: and extracting the dynamic region characteristics of the video frame aiming at a section of video.
Step two: preprocessing of the query statement: and extracting object features in the query statement aiming at the query statement.
Step three: constructing an object-aware multi-branch relation network, wherein the object-aware multi-branch relation network is composed of a main branch, T-1 auxiliary branches and a multi-branch relation reasoning module, and each branch comprises an object-aware modulation layer, a cross-modal matching layer and a softmax function layer.
Step four: a space-time locator is established, including a space locator and a time locator.
Step five: designing a multi-task loss function, and training an object perception multi-branch relation network and a space-time locator in an end-to-end mode; for a section of video and query sentences to be processed, preprocessing is performed through steps S1 and S2, then the dynamic region features of the obtained video frames and the object features in the query sentences are used as the input of a trained object perception multi-branch relation network, the output of the trained object perception multi-branch relation network is used as the input of a trained space-time locator, and the region corresponding to the minimum value of the multitask loss function is used as the final result to be output.
In one embodiment of the present invention, a method for pre-processing video is presented.
Region features of different frames are extracted from the video, and the association scores between any region feature in a video frame and all region features in the video frames of its neighboring interval are calculated; in each video frame of the neighboring interval, the region feature with the highest association score is extracted as a matched region feature, and average pooling is performed on the region feature and its matched region features to obtain the dynamic region features of the video frames.
Specifically, for a video, the region features {r_{n,k}} are extracted by a pre-trained Faster R-CNN model, wherein N denotes the total number of video frames, K denotes the number of regions in each frame, and r_{n,k} denotes the feature of the k-th region in the n-th frame of the video; the spatial position of a region is described by its bounding box b_{n,k} = (x_{n,k}, y_{n,k}, w_{n,k}, h_{n,k}), wherein (x_{n,k}, y_{n,k}) denotes the center coordinates of the k-th region of the n-th frame and (w_{n,k}, h_{n,k}) denotes its width and height;
dynamic information is acquired by a temporal region aggregation method: if two regions in adjacent frames have similar semantic features and spatial positions, they are regarded as the same object at different times. For any region feature r_{n,k} in a video frame, the preceding L frames and the following L frames are taken as a video-frame set, and the association score between r_{n,k} and any region feature r_{l,j} in the set is calculated as the semantic similarity of the two region features plus α times the IoU score of their bounding boxes, wherein r_{l,j} and b_{l,j} respectively denote the j-th region feature and spatial position of the l-th video frame, l ∈ [n-L, n+L], IoU(·) denotes the IoU score of two region boxes, and α denotes a balance coefficient;

in each video frame of the set formed by the preceding and following L frames, the region feature with the highest association score is taken as the matched region feature of r_{n,k}, and average pooling is performed over r_{n,k} and the 2L extracted matched region features to obtain the dynamic region feature r̂_{n,k}.
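By way of illustration, the following Python sketch (using PyTorch) implements this temporal region aggregation for a clip of region features: each region is paired with its best-associated region in the preceding and following L frames, and the matches are average-pooled. The cosine-similarity term and the helper names are assumptions for readability; the description above only states that the score combines semantic similarity with α times the box IoU.

```python
import torch
import torch.nn.functional as F

def box_iou(b1, b2):
    """IoU of two (cx, cy, w, h) boxes given as 1-D tensors."""
    x11, y11 = b1[0] - b1[2] / 2, b1[1] - b1[3] / 2
    x12, y12 = b1[0] + b1[2] / 2, b1[1] + b1[3] / 2
    x21, y21 = b2[0] - b2[2] / 2, b2[1] - b2[3] / 2
    x22, y22 = b2[0] + b2[2] / 2, b2[1] + b2[3] / 2
    iw = (torch.min(x12, x22) - torch.max(x11, x21)).clamp(min=0)
    ih = (torch.min(y12, y22) - torch.max(y11, y21)).clamp(min=0)
    inter = iw * ih
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union.clamp(min=1e-6)

def temporal_region_aggregation(feats, boxes, L=5, alpha=0.6):
    """feats: (N, K, D) region features; boxes: (N, K, 4) boxes (cx, cy, w, h).
    Returns (N, K, D) dynamic region features."""
    N, K, D = feats.shape
    dyn = torch.empty_like(feats)
    for n in range(N):
        for k in range(K):
            matched = [feats[n, k]]
            for l in range(max(0, n - L), min(N, n + L + 1)):
                if l == n:
                    continue
                # association score = semantic similarity + alpha * IoU (assumed form)
                sem = F.cosine_similarity(feats[n, k].unsqueeze(0), feats[l], dim=1)
                iou = torch.stack([box_iou(boxes[n, k], boxes[l, j]) for j in range(K)])
                j_best = int(torch.argmax(sem + alpha * iou))
                matched.append(feats[l, j_best])
            # average pooling over the region and its matches (fewer than 2L at clip borders)
            dyn[n, k] = torch.stack(matched).mean(dim=0)
    return dyn
```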
In one implementation of the invention, a method of preprocessing a query statement is provided.
Firstly, obtaining semantic feature sets of all words in a query sentence by adopting a Bi-GRU network, and extracting semantic features of nouns from the semantic feature sets; and then adopting an attention method to further obtain the object characteristics in the query statement.
Specifically, a Bi-GRU network is first adopted to obtain the semantic feature set S = {s_m} of the words in the query sentence, wherein s_m denotes the semantic feature of the m-th word and M denotes the number of words in the query sentence;

all nouns in the query sentence are marked as objects by the NLTK tool, and the semantic features of the objects are extracted from the word semantic feature set S; the context of each object in the query sentence is then aggregated by an attention method to obtain the object features:

e_{t,m} = w_s^T·tanh(W_s·[s^o_t; s_m] + b_s)
β_{t,m} = exp(e_{t,m}) / Σ_{m'} exp(e_{t,m'})
o_t = Σ_m β_{t,m}·s_m

wherein W_s denotes a projection matrix, b_s denotes a bias vector, w_s denotes a row vector, s^o_t denotes the semantic feature of the t-th noun (object), β_{t,m} denotes the attention weight of the m-th word with respect to the t-th object, and o_t denotes the feature of the t-th object; the object features form the set O = {o_t}, wherein T denotes the number of objects in the query sentence. In practice, the first noun of the sentence is taken by default as the object to be queried; if the sentence is interrogative, this noun is "who", "what", etc. Thus o_1 denotes the feature of the main object (i.e., the query object), and {o_2, …, o_T} denote the auxiliary object features.
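The object-feature extraction described above can be sketched as follows: a Bi-GRU encodes the words, NLTK's part-of-speech tagger marks the nouns, and additive attention over the word features yields one feature per object. The class name, layer sizes and the exact attention parameterization are illustrative assumptions consistent with the reconstruction given above.

```python
import nltk          # requires nltk.download('averaged_perceptron_tagger')
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bigru = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)
        d = 2 * hidden                           # word feature size
        self.W_s = nn.Linear(2 * d, d)           # projection matrix W_s and bias b_s
        self.w_s = nn.Linear(d, 1, bias=False)   # row vector w_s

    def forward(self, word_ids, words):
        # word semantic features s_m
        s, _ = self.bigru(self.embed(word_ids))   # (1, M, d)
        s = s.squeeze(0)                          # (M, d)
        # mark nouns as objects with the NLTK POS tagger (assumes at least one noun)
        tags = nltk.pos_tag(words)
        obj_idx = [m for m, (_, tag) in enumerate(tags) if tag.startswith('NN')]
        objects = []
        for t in obj_idx:
            pair = torch.cat([s[t].expand_as(s), s], dim=-1)      # [s^o_t ; s_m]
            e = self.w_s(torch.tanh(self.W_s(pair))).squeeze(-1)  # attention logits
            beta = torch.softmax(e, dim=0)                        # beta_{t,m}
            objects.append((beta.unsqueeze(-1) * s).sum(dim=0))   # o_t
        return torch.stack(objects)   # (T, d): first row = main object, rest = auxiliary
```

For the example sentence "child kicks a ball", the first returned feature would correspond to "child" (the query object) and the remaining object features to auxiliary objects such as "ball".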
In one implementation of the invention, a specific operation process of the object-aware multi-branch relation network is given.
As shown in fig. 2, the object-aware multi-branch relation network is composed of a main branch, T-1 auxiliary branches and a multi-branch relation reasoning module, and each branch includes an object-aware modulation layer, a cross-modal matching layer and a softmax function layer.
Firstly, the dynamic region features of the video frames obtained in step one and the t-th object feature in the query sentence obtained in step two are used as the input of the object-aware modulation layer of the t-th branch to obtain the object-aware region features of the regions in the video, wherein when t = 1 the branch is the main branch, and when t ∈ {2, 3, …, T} the branch is an auxiliary branch;
then calculating a matching score between the object perception region characteristics of the region in the video and the object characteristics in the query statement through a cross-modal matching layer, and processing the matching score through a softmax function layer;
and finally, taking the matching scores output by the main branch and the T-1 auxiliary branches and processed by the softmax function as the input of a multi-branch relation reasoning module to obtain the object perception multi-branch characteristics of the region.
Specifically, the object-aware modulation layer is used to calculate the object-aware region features, i.e., to enhance the region features related to the target object and weaken the unrelated region features. A modulation gate and a modulation bias vector are first calculated:

γ_t = tanh(W_γ·o_t + b_γ)
ε_t = tanh(W_ε·o_t + b_ε)

wherein W_γ, W_ε, b_γ, b_ε are parameter matrices and bias vectors, γ_t denotes the modulation gate corresponding to the t-th object in the query sentence, ε_t denotes the modulation bias vector of the t-th object, ⊙ denotes element-wise multiplication, r̂_{n,k} denotes a dynamic region feature, and o_t denotes the t-th object feature in the query sentence;

all region features are then modulated using the following formula:

r̃^t_{n,k} = γ_t ⊙ r̂_{n,k} + ε_t

wherein r̃^t_{n,k} denotes the object-aware region feature of the k-th region of the n-th video frame in the t-th branch.
The cross-modal matching layer is used to calculate the matching scores between the object-aware region features of the regions in the video and the object features in the query sentence:

m^t_{n,k} = w_c^T·tanh(W_c·[r̃^t_{n,k}; o_t] + b_c)

wherein w_c denotes a row vector, W_c denotes a parameter matrix, b_c denotes a parameter vector, and m^t_{n,k} denotes the matching score between the k-th region of the n-th video frame and the t-th object in the query sentence.

The softmax function layer is used to process m^t_{n,k}:

m̄^t_{n,k} = exp(m^t_{n,k}) / Σ_{k'=1}^{K} exp(m^t_{n,k'})

and the normalized scores form the set m̄^t_n = {m̄^t_{n,k}}; this result participates in the subsequent multi-branch relation reasoning and diversity loss calculations.
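A single branch (object-aware modulation followed by cross-modal matching and softmax normalization) can be sketched as below; the concatenation inside the matching layer and the default dimension are assumptions consistent with the reconstruction above, not a verbatim transcription of the claimed implementation.

```python
import torch
import torch.nn as nn

class ObjectAwareBranch(nn.Module):
    """One branch of the OMRN: modulation, cross-modal matching, softmax."""
    def __init__(self, d=256):
        super().__init__()
        self.W_gamma = nn.Linear(d, d)          # gamma_t = tanh(W_gamma o_t + b_gamma)
        self.W_eps = nn.Linear(d, d)            # eps_t   = tanh(W_eps   o_t + b_eps)
        self.W_c = nn.Linear(2 * d, d)          # matching layer parameters W_c, b_c
        self.w_c = nn.Linear(d, 1, bias=False)  # row vector w_c

    def forward(self, r_hat, o_t):
        # r_hat: (N, K, d) dynamic region features, o_t: (d,) object feature
        gamma = torch.tanh(self.W_gamma(o_t))          # modulation gate
        eps = torch.tanh(self.W_eps(o_t))              # modulation bias vector
        r_tilde = gamma * r_hat + eps                  # object-aware region features
        o_exp = o_t.expand(*r_tilde.shape[:-1], -1)    # broadcast o_t to every region
        m = self.w_c(torch.tanh(self.W_c(torch.cat([r_tilde, o_exp], dim=-1))))
        m_bar = torch.softmax(m.squeeze(-1), dim=-1)   # normalize over the K regions
        return r_tilde, m_bar                          # (N, K, d), (N, K)
```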
The multi-branch relation reasoning module is used to reason about the relationships between the main object (the query object) and the auxiliary objects and to obtain the object-aware multi-branch features of the regions; it comprises the following steps:

1) in order to extract cues of the relevant regions from the auxiliary branches and integrate them into the main branch, the attention weight between any region in a video frame of the main branch and any region in the video frames of the T-1 auxiliary branches is first calculated: the attention weight a^t_{n,k,i} is computed from the object-aware region feature r̃^1_{n,k} of the k-th region of the n-th video frame in the main branch, the object-aware region feature r̃^t_{n,i} of the i-th region of the n-th video frame in the t-th branch, and the relative position vector p^t_{n,k,i} of the two regions, which is derived from their bounding boxes; a^t_{n,k,i} is then processed by a softmax function layer to obtain the normalized attention weight ā^t_{n,k,i};

2) the integrated feature contributed by all regions of the t-th branch to any region of the main branch is obtained from the regions of the auxiliary branch related to the t-th object of the query sentence:

u^t_{n,k} = Σ_{i=1}^{K} ā^t_{n,k,i}·m̄^t_{n,i}·r̃^t_{n,i}

wherein u^t_{n,k} denotes the integrated feature of all regions of the t-th branch (t ≥ 2) for the k-th region of the n-th video frame in the main branch, and m̄^t_{n,i} is the region-object matching score calculated in the cross-modal matching layer. By introducing the matching score into the calculation, irrelevant regions can be filtered out of the relation estimation between the main and auxiliary branches, while the relation model is strengthened for the regions of the auxiliary branch with higher matching scores.

Based on the integrated features u^t_{n,k} of each region, the object-aware multi-branch feature set V = {v_{n,k}} of all regions is further obtained by combining the main-branch feature r̃^1_{n,k} of each region with the integrated features u^2_{n,k}, …, u^T_{n,k} through a linear layer followed by the ReLU(·) activation function (linear rectification function), wherein v_{n,k} denotes the object-aware multi-branch feature of the k-th region of the n-th video frame.
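The reasoning step can be sketched as follows. The additive attention over region pairs is an assumed form (the description above only states that the attention weight depends on the two region features and their relative position vector), and num_objects is assumed to equal the number of branches T; the weighting of auxiliary regions by their matching scores and the final ReLU combination follow the description above.

```python
import torch
import torch.nn as nn

class MultiBranchReasoning(nn.Module):
    """Integrate auxiliary-branch cues into every main-branch region."""
    def __init__(self, d=256, pos_dim=4, num_objects=2):
        super().__init__()
        self.att = nn.Sequential(nn.Linear(2 * d + pos_dim, d), nn.Tanh(),
                                 nn.Linear(d, 1, bias=False))   # attention scorer (assumed form)
        self.out = nn.Linear(num_objects * d, d)                # combine main + integrated features

    def forward(self, r_tilde, m_bar, rel_pos):
        # r_tilde: (T, N, K, d) per-branch object-aware region features (branch 0 = main)
        # m_bar:   (T, N, K)    softmax-normalized region-object matching scores
        # rel_pos: (T, N, K, K, pos_dim) relative position vectors p^t_{n,k,i}
        T, N, K, d = r_tilde.shape
        main = r_tilde[0]                                        # (N, K, d)
        parts = [main]
        for t in range(1, T):
            aux = r_tilde[t]                                     # (N, K, d)
            pair = torch.cat([main.unsqueeze(2).expand(N, K, K, d),
                              aux.unsqueeze(1).expand(N, K, K, d),
                              rel_pos[t]], dim=-1)
            a = self.att(pair).squeeze(-1)                       # (N, K, K) attention logits
            a_bar = torch.softmax(a, dim=-1)                     # normalize over auxiliary regions i
            w = a_bar * m_bar[t].unsqueeze(1)                    # filter irrelevant auxiliary regions
            parts.append(torch.einsum('nki,nid->nkd', w, aux))   # integrated features u^t_{n,k}
        v = torch.relu(self.out(torch.cat(parts, dim=-1)))       # object-aware multi-branch features
        return v                                                 # (N, K, d)
```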
In order to make each branch focus only on the object associated with it, a diversity loss function of the object-aware multi-branch relation network is designed:

L_div = (1/Z)·Σ_{n∈S_gt} Σ_{t≠t'} (m̄^t_n)^T·m̄^{t'}_n

wherein S_gt denotes the set of frames in the ground-truth segment, Z denotes a normalization factor, and m̄^t_n and m̄^{t'}_n denote the softmax-normalized matching score sets of any two different objects in the query sentence over all regions of the n-th video frame. Minimizing the diversity loss causes each branch to allocate more attention to the regions that match its corresponding object.
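Under the pairwise-overlap form of the diversity loss given above (the exact normalization factor Z is an assumption), its computation can be sketched as:

```python
import torch

def diversity_loss(m_bar, gt_frame_mask):
    """m_bar: (T, N, K) softmax-normalized matching scores of the T objects.
    gt_frame_mask: (N,) boolean mask of frames inside the ground-truth segment."""
    m = m_bar[:, gt_frame_mask]                    # restrict to frames of the truth segment
    T, N_gt, K = m.shape
    overlap = torch.einsum('tnk,snk->tsn', m, m)   # pairwise dot products per frame
    off_diag = overlap.sum() - torch.einsum('tnk,tnk->', m, m)  # drop the t == t' terms
    Z = max(T * (T - 1) * N_gt, 1)                 # normalization factor (assumed)
    return off_diag / Z
```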
A specific application of the spatio-temporal locator is given in one implementation of the present invention.
A spatial locator is built that uses the object-aware multi-branch feature v_{n,k} of each region and the main-object feature o_1 to calculate the confidence score of any region in a video frame:

ρ_{n,k} = σ( (W_r·v_{n,k})^T·(W_o·o_1) )

wherein σ(·) is the sigmoid function, ρ_{n,k} denotes the spatial confidence score of the k-th region of the n-th video frame, and W_r and W_o denote parameter matrices.

The loss function L_s of the spatial locator is computed over the frames of the ground-truth segment S_gt from the spatial confidence scores ρ_{n,k} and the IoU scores IoU^s_{n,k} between the k-th region of the n-th video frame and its corresponding ground-truth region.
In the temporal locator, the frame-level object-aware features of the video are obtained from the object-aware multi-branch features of the regions, aggregated by spatial attention together with the main-object feature o_1:

ω_{n,k} = softmax_k( w_f^T·tanh(W_f·[v_{n,k}; o_1] + b_f) )
f_n = Σ_{k=1}^{K} ω_{n,k}·v_{n,k}

wherein f_n denotes the object-aware feature of the n-th video frame, w_f denotes a row vector, and W_f and b_f denote a parameter matrix and a bias; another Bi-GRU is then used to learn the object-aware context features h_n of all frames.

A set of candidate segments is defined at each video frame, wherein w_h denotes the width of the h-th candidate segment at each video frame and H denotes the number of candidate segments; i.e., each frame is taken as a sampling center, and the (w_h - 1)/2 frames before and after it compose a candidate segment together with it, so that H candidate segments are obtained according to the H different widths w_h. All candidate segments are estimated through a linear layer with a sigmoid function, and the boundary offsets are generated at the same time:

c_n = σ(W_c·h_n + b_c)
l_n = W_l·h_n + b_l

wherein c_n denotes the temporal confidence scores of the H candidate segments at the n-th video frame, l_n denotes the offsets of the H candidate segments, W_c and W_l are parameter matrices, b_c and b_l are biases, and σ(·) is the sigmoid function;
the temporal locator has two losses: the alignment loss of candidate-segment selection and the regression loss of boundary adjustment. The alignment loss L_align compares the temporal confidence score c^h_n of each candidate segment, i.e., the h-th element of c_n, with the temporal IoU score IoU^t_{n,h} between the h-th candidate segment at the n-th video frame and the ground-truth segment;

the segment with the largest temporal confidence score is then selected; its temporal boundary is (s, e) and its offsets are (l_s, l_e). The ground-truth offsets (l̂_s, l̂_e) of this candidate segment are first calculated from the ground-truth boundary (ŝ, ê), and the regression loss is:

L_reg = R(l_s - l̂_s) + R(l_e - l̂_e)

wherein R denotes the smooth L1 function.
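The temporal locator can be sketched as below. The Bi-GRU over frame features and the sigmoid scoring follow the description; writing the alignment loss as a binary cross-entropy against the temporal IoU scores is an assumption, since the description only states that the confidence scores are compared with the temporal IoU scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

WIDTHS = [3, 9, 17, 33, 65, 97, 129, 165, 197]   # candidate-segment widths from the embodiment

class TemporalLocator(nn.Module):
    def __init__(self, d=256, hidden=128, num_widths=len(WIDTHS)):
        super().__init__()
        self.bigru = nn.GRU(d, hidden, bidirectional=True, batch_first=True)
        self.score = nn.Linear(2 * hidden, num_widths)        # temporal confidence scores c_n
        self.offset = nn.Linear(2 * hidden, 2 * num_widths)   # boundary offsets l_n

    def forward(self, f):                  # f: (1, N, d) frame-level object-aware features
        h, _ = self.bigru(f)               # object-aware context features h_n
        c = torch.sigmoid(self.score(h))   # (1, N, H)
        l = self.offset(h).view(1, f.size(1), -1, 2)   # (1, N, H, 2) start/end offsets
        return c, l

def temporal_losses(c, l, t_iou, gt_offsets):
    """c: (1, N, H) scores, t_iou: (1, N, H) temporal IoU with the truth segment,
    l: (1, N, H, 2) predicted offsets, gt_offsets: (2,) truth offsets of the best candidate."""
    align = F.binary_cross_entropy(c, t_iou)           # alignment loss (assumed BCE form)
    n, h = divmod(int(torch.argmax(c)), c.size(-1))    # candidate with the highest confidence
    reg = F.smooth_l1_loss(l[0, n, h], gt_offsets)     # regression loss with smooth L1
    return align, reg
```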
In one embodiment of the present invention, a specific form of the multitasking loss function is presented.
L = λ1·L_s + λ2·L_align + λ3·L_reg + λ4·L_div

wherein L denotes the multi-task loss function, λ1, λ2, λ3, λ4 are hyper-parameters that balance the four losses, L_s denotes the loss function of the spatial locator, L_align denotes the alignment loss function of the temporal locator, L_reg denotes the regression loss function of the temporal locator, and L_div denotes the diversity loss function of the object-aware multi-branch relation network.
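The combination itself is a plain weighted sum; a minimal sketch follows (the default weights are placeholders, since the embodiment's printed list of λ values is ambiguous):

```python
def multitask_loss(l_s, l_align, l_reg, l_div,
                   lambdas=(1.0, 1.0, 0.001, 1.0)):   # placeholder weights
    """Weighted sum of the spatial, alignment, regression and diversity losses."""
    return (lambdas[0] * l_s + lambdas[1] * l_align
            + lambdas[2] * l_reg + lambdas[3] * l_div)
```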
In another embodiment of the present invention, a positioning system for completing a task of positioning a specified object in a video by using an object-aware multi-branch relationship network is provided, where the positioning system is configured to implement the above method for completing the task of positioning the specified object in the video, and the method includes:
the video preprocessing module: used to extract the region features of different frames from the video and to calculate the association scores between any region feature in a video frame and all region features in the video frames of its neighboring interval; to extract, in each video frame of the neighboring interval, the region feature with the highest association score as a matched region feature; and to perform average pooling on the region feature and its matched region features to obtain the dynamic region features of the video frames;
the query statement preprocessing module: the method comprises the steps of acquiring a semantic feature set of all words in a query sentence, extracting semantic features of nouns from the semantic feature set, and further obtaining object features in the query sentence by adopting an attention method;
a video clip positioning module: the video segment positioning module comprises a modeling submodule and a training submodule, wherein the modeling submodule is configured with an object perception multi-branch relation model, a space positioning model and a time positioning model, the object perception multi-branch relation model is used for extracting object perception multi-branch characteristics of an area, the space positioning model is used for realizing the positioning of a space pipeline, the time positioning model is used for realizing the positioning of a time pipeline, and the training submodule is configured with a multi-task loss function;
an output module: for outputting the positioning result.
Wherein, the video preprocessing module comprises:
the fast R-CNN sub-module: configuring a trained Faster R-CNN model, and taking a section of video as input to obtain the regional characteristics of the video, wherein the representation method of the regional characteristics is not repeated;
the temporal region aggregation submodule: used to aggregate the preceding L frames and the following L frames of the frame of any region into a video-frame set, and to calculate the association score between the region feature r_{n,k} and any region feature r_{l,j} in the video-frame set; the calculation formula is not repeated here;
the pooling submodule: used to perform average pooling over the region feature to be matched and the region features with the highest association scores in each video frame of the video-frame set, and to output the dynamic region features.
The query statement preprocessing module comprises:
Bi-GRU submodule: configuring a trained Bi-GRU network for acquiring semantic features of words in query sentences;
the labeling submodule: used to mark all nouns in the query sentence as objects in the query sentence;
the feature extraction submodule: used to extract the semantic features of the objects in the query sentence from the semantic features of the words in the query sentence and to calculate the object features in the query sentence; the calculation formula is not repeated here.
The object-aware multi-branch relation model is composed of a main branch, T-1 auxiliary branches and a multi-branch relation reasoning layer, and each branch comprises an object-aware modulation layer, a cross-modal matching layer and a softmax function layer; the calculation formulas of the object-aware modulation layer, the cross-modal matching layer, the softmax function layer and the multi-branch relation reasoning layer are not repeated here;
the multi-task loss function in the training submodule is as follows:

L = λ1·L_s + λ2·L_align + λ3·L_reg + λ4·L_div

wherein λ1, λ2, λ3, λ4 are hyper-parameters that balance the four losses; L_s denotes the loss function of the spatial locator, computed over the frames of the ground-truth segment S_gt from the spatial confidence scores and the IoU scores IoU^s_{n,k} between the k-th region of the n-th video frame and its corresponding ground-truth region; L_align denotes the alignment loss function of the temporal locator, computed from the temporal confidence score c^h_n (the h-th element of c_n) of the h-th candidate segment at the n-th video frame and the temporal IoU score IoU^t_{n,h} between that candidate segment and the ground-truth segment; L_reg denotes the regression loss function of the temporal locator,

L_reg = R(l_s - l̂_s) + R(l_e - l̂_e)

wherein R denotes the smooth L1 function; and L_div denotes the diversity loss function of the object-aware multi-branch relation network,

L_div = (1/Z)·Σ_{n∈S_gt} Σ_{t≠t'} (m̄^t_n)^T·m̄^{t'}_n

wherein S_gt denotes the set of frames in the ground-truth segment, Z denotes a normalization factor, and m̄^t_n and m̄^{t'}_n denote the softmax-normalized matching score sets of any two different objects in the query sentence over all regions of the n-th video frame.
Examples
In order to show the experimental effect of the present invention, this embodiment provides a comparative experiment, and the implementation method is the same as the process described above, and only specific implementation details are given here, and the repeated process is not described again.
This embodiment uses the VidSTG dataset proposed by Zhang et al. (Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, and Lianli Gao. Where does it exist: Spatio-temporal video grounding for multi-form sentences. arXiv preprint). The dataset is constructed by annotating natural language descriptions on top of the video object relation dataset VidOR (Xindi Shang, Donglin Di, Junbin Xiao, Yu Cao, Xun Yang, and Tat-Seng Chua. Annotating objects and relations in user-generated videos. In ICMR, pages 279-287. ACM). VidSTG is the only existing dataset based on unaligned sentence-video segment pairs. It contains 5563 videos in the training set, 618 videos in the validation set and 743 videos in the test set, with a total of 99943 sentences (44808 declarative sentences and 55135 interrogative sentences) and 80 query objects. The average duration of the videos is 28.01 seconds, and the declarative and interrogative sentences contain on average 11.12 and 8.98 words, respectively.
The implementation details are as follows:
In the video preprocessing, 1024-d region features are extracted by the pre-trained Faster R-CNN, 5 frames per second are sampled, and K = 20 regions are kept per frame. In the query sentence preprocessing, word embedding is performed with pre-trained GloVe vectors, giving 300-d vectors that express the word features, and the nouns in the sentences are identified with NLTK. In the modeling, α is set to 0.6, L is set to 5, and λ1, λ2, λ3, λ4 are set to 1.0, 0.001 and 1.0, respectively. At each step, 9 candidate segments are defined, whose temporal widths are [3, 9, 17, 33, 65, 97, 129, 165, 197]. The dimensions of all projection matrices and biases are set to 256, and the hidden state of each direction in the Bi-GRU is set to 128. The model is trained in this embodiment with an Adam optimizer with an initial learning rate of 0.0005.
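For reference, the settings listed above can be collected in one configuration object; the field names are illustrative, and the loss-weight list is reproduced as printed (four weights are named but only three values are listed).

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class OMRNConfig:
    region_feat_dim: int = 1024        # Faster R-CNN region features
    fps_sampled: int = 5               # frames sampled per second
    regions_per_frame: int = 20        # K
    word_emb_dim: int = 300            # pre-trained GloVe vectors
    alpha: float = 0.6                 # balance coefficient in the association score
    neighbor_frames: int = 5           # L
    candidate_widths: Tuple[int, ...] = (3, 9, 17, 33, 65, 97, 129, 165, 197)
    proj_dim: int = 256                # projection matrices and biases
    gru_hidden_per_dir: int = 128      # Bi-GRU hidden size per direction
    optimizer: str = "Adam"
    learning_rate: float = 5e-4
    # loss weights as printed in the embodiment (four weights, three listed values)
    loss_weights: Tuple[float, ...] = (1.0, 0.001, 1.0)
```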
The performance evaluation method comprises the following steps:
The temporal localization performance is evaluated with the standard m_tIoU, and the spatio-temporal accuracy is evaluated with m_vIoU and vIoU@R. m_tIoU denotes the mean temporal IoU between the selected segments and the ground truth, and vIoU denotes the spatio-temporal IoU between the predicted spatio-temporal tube and the ground truth, calculated as:

vIoU = (1 / |S_p ∪ S_gt|)·Σ_{n ∈ S_p ∩ S_gt} IoU(r_n, r̂_n)

wherein S_p denotes the set of frames in the predicted segment, S_gt denotes the set of frames in the ground-truth segment, and r_n and r̂_n respectively denote the predicted and ground-truth regions in the n-th frame. m_vIoU is the average vIoU over all test samples, and vIoU@R is the proportion of test samples whose vIoU is greater than R.
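A small sketch of the vIoU and temporal IoU computations as defined above, with segments represented as sets of frame indices and a box-IoU function passed in (for example the box_iou helper sketched earlier):

```python
def viou(pred_frames, gt_frames, pred_boxes, gt_boxes, box_iou):
    """pred_frames, gt_frames: sets of frame indices of the predicted / truth segments.
    pred_boxes, gt_boxes: dicts frame index -> box. box_iou: IoU function for two boxes."""
    inter = pred_frames & gt_frames
    union = pred_frames | gt_frames
    if not union:
        return 0.0
    return sum(box_iou(pred_boxes[n], gt_boxes[n]) for n in inter) / len(union)

def tiou(pred_frames, gt_frames):
    """Temporal IoU between the predicted and truth segments."""
    inter = len(pred_frames & gt_frames)
    union = len(pred_frames | gt_frames)
    return inter / union if union else 0.0
```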
And (3) comparing the performances:
this implementation is compared to the present method using four existing methods.
Since localization (STVG) is an emerging task in Spatio-temporal video, the only methods that can solve this problem based on non-aligned data are STGRN methods (document 1: Zhu Zhuang, Zhou Zhuao, Yang Zhuao, Qi Wang, Huashengliu, and Lianli Gao. where does exist: space-temporal video grouping for multi-form transmitters. arxiv. print.). In addition, the temporal localization of a given description object in unaligned sentence-video segment pairs can be accomplished due to TALL (document 2: Jiyang Gao, ChenSun, Zhenheng Yang, and Ram New tia. TALL: temporal activity localization visual angle query. in ICCV, pages 5277 and 5285.) and L-Net (document 3: Jingyuan Chen, Lin Ma, Xinpen Chen, Zequn Jie, and Jiebo Luo.Localizing natural language in video. InAAI.). These two methods are therefore combined with the STVG method based on alignment data as a baseline. Where TALL and L-Net are used to determine temporal boundaries, the STVG method based on alignment data is used to further retrieve spatio-temporal pipelines. TALL uses the proposal-selection framework for time-positioning, and L-Net uses verbatim interaction for overall segment selection.
Grounder (document 4: Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevorgarell, and Bernt Schile. group-ing of textual phenols in images byrconstruction. in ECCV, pages 817-834.) can extract the target region in each frame of a given predicted segment. STPR (document 5: Masataka Yamaguchi, Kuniaki Saito, Yoshitaka Ushiku, and TatsuyaHarada. spread-temporal person statistical visual field natural visual requirements. in ICCV, pages 1453 and 1462.) and WSSTG (document 6: Zhenfang Chen, Lin Ma, Wenhan Luo, and KWan-Yee KWong. Weak-superimposed spread-temporal grouping physical sensory input. the channels were pre-generated using channels in the fragments and evaluated by cross-modal ranking. The initial STPR is applicable to the localization of people in multiple videos, which is extended in this embodiment to the localization of multiple objects in a single video. The initial WSSTG uses weakly supervised ranking penalties, which are replaced in this example by supervised triple penalties (document 7: Sibei Yang, Guinbin Li, and Yizhou Yu. Cross-modal relationship for group expression. in CVPR, pages 4145 and 4154). In this example, Grounder, STPR, WSSTG were evaluated for binding to TALL or L-Net, respectively. The evaluation results are given in tables 1 to 3, wherein OMRN represents the method of the present invention.
Table 1: performance evaluation results based on VidSTG dataset
Figure BDA0002482601270000181
Table 2: evaluation results based on time alignment truth (tem. gt represents time alignment truth)
Figure BDA0002482601270000182
Table 3: ablation results based on VidSTG dataset
Figure BDA0002482601270000183
And (4) evaluation results:
according to table 1, the temporal positioning and the spatiotemporal positioning performance of the interrogative sentences in all models are lower than those of the declarative sentences, namely, the m _ tIoU score obtained according to the positioning of the interrogative sentences is lower than that obtained according to the positioning of the declarative sentences, which indicates that the positioning of the unknown objects lacking the explicit characteristics is more difficult.
For time alignment accuracy, according to table 1, the m _ tlou scores for STGRN and OMRN (method of the invention) were 48.47%, 50.73%, respectively, whereas the m _ tlou score for grouder, STPR, WSSTG combined with TLL was 34.63%, and the m _ tlou score for L-Net combined was 40.86%. This is because the time-localization method based on region modeling adopted by the invention and STGRN is superior to the time-localization method based on video frame modeling by TALL and L-Net, which shows that fine-grained region modeling is helpful for determining the accurate time boundary of the target pipeline. Compared with the STGRN method of the same type, the method has higher time positioning accuracy.
For the accuracy of space-time positioning, according to table 1, in all models, the m _ vIoU, the vIoU @0.3, and the vIoU @0.5 scores of the invention all obtain the highest score. The invention can accurately capture the dynamic information of different objects among frames through time region aggregation, and can perform fine-grained object perception through object perception modulation and cross-modal matching. While the group r + {. cndot. } method ignores the temporal dynamics of the object, the performance is worst, which indicates that for high quality spatio-temporal localization, capturing the object dynamics between frames is very important.
As can be seen from table 1, in all standards, the method of the present invention achieves significant performance improvement compared to other methods, which indicates that the method of the present invention can effectively focus on the key region through object-aware multi-branch region modeling with diversity loss, and capture the key object relationship through multi-branch reasoning.
In addition, to compare the performance of the present invention and other methods in aligned sentence-video segment pairs. This embodiment evaluates the comparison of the spatial localization performance of the present invention with other methods given a time truth (i.e. where temporal localization has been given). As shown in Table 2, the m _ vIoU, vIoU @0.3 and vIoU @0.5 scores of the invention all obtained the highest scores. This indicates that the present invention still has higher performance on aligned sentence-video segment pairs.
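The metrics discussed above can be made concrete with a short sketch. It computes the temporal IoU of predicted and ground-truth segments, the vIoU of a predicted spatio-temporal pipeline (spatial IoU summed over the frames shared by prediction and ground truth and averaged over the union of their frames), and the aggregates m_tIoU, m_vIoU, and vIoU@R. The function names and the data layout are illustrative assumptions rather than part of the patent.

```python
# Illustrative evaluation helpers (assumed data layout, not from the patent).

def temporal_iou(pred, gt):
    """Temporal IoU between two segments given as (start, end) frame indices."""
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """Spatial IoU between two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def viou(pred_tube, gt_tube):
    """pred_tube / gt_tube: dicts mapping frame index -> bounding box of the pipeline."""
    union_frames = set(pred_tube) | set(gt_tube)
    inter_frames = set(pred_tube) & set(gt_tube)
    if not union_frames:
        return 0.0
    return sum(box_iou(pred_tube[n], gt_tube[n]) for n in inter_frames) / len(union_frames)

def summarize(samples, thresholds=(0.3, 0.5)):
    """samples: list of dicts with keys pred_seg, gt_seg, pred_tube, gt_tube."""
    m_tiou = sum(temporal_iou(s["pred_seg"], s["gt_seg"]) for s in samples) / len(samples)
    vious = [viou(s["pred_tube"], s["gt_tube"]) for s in samples]
    m_viou = sum(vious) / len(vious)
    viou_at = {r: sum(v >= r for v in vious) / len(vious) for r in thresholds}
    return m_tiou, m_viou, viou_at
```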
Finally, this embodiment performs ablation studies. The following variants are evaluated: ① removing the object-aware modulation module from all branches (w/o. OM); ② removing the diversity loss from the multitask loss function (w/o. DL); ③ removing the cross-modal matching blocks from all branches together with the corresponding weighting terms in the multi-branch relational inference (w/o. CM), in which case the diversity loss is also disabled because the matching scores provided by the cross-modal matching blocks are no longer available; ④ removing the temporal region aggregation module from region modeling (w/o. TA); ⑤ removing the context awareness module from object extraction (w/o. CA).
The performance of each ablated model is shown in table 3. The performance of all ablation models is lower than that of the complete model, indicating that each component contributes to localization accuracy. The first three variants degrade the most, which shows that object-aware multi-branch relational reasoning plays a crucial role in spatio-temporal localization accuracy. The w/o. CM result indicates that cross-modal matching with diversity regularization is important for merging the region features relevant to the linguistic description from the auxiliary branches into the main branch.

Claims (10)

1. The method for completing the positioning task of the specified object in the video by utilizing the object-aware multi-branch relation network is characterized by comprising the following steps of:
s1: extracting regional characteristics of different frames from a video aiming at a section of video, and calculating association scores between any regional characteristic in the video frame and all regional characteristics in the video frame in an adjacent interval; extracting the regional characteristics with the highest matching scores in each video frame in the adjacent interval as matching regional characteristics, and performing average pooling on any regional characteristic in the video frames and the matching regional characteristics to obtain the dynamic regional characteristics of the video frames;
s2: aiming at a query sentence, firstly, obtaining semantic feature sets of all words in the query sentence by adopting a Bi-GRU network, and extracting semantic features of nouns from the semantic feature sets; then adopting an attention method to further obtain object characteristics in the query statement;
s3: constructing an object-aware multi-branch relation network, wherein the object-aware multi-branch relation network is composed of a main branch, T-1 auxiliary branches and a multi-branch relation reasoning module, and each branch comprises an object-aware modulation layer, a cross-mode matching layer and a softmax function layer;
First, the dynamic region features of the video frames obtained in step S1 and the t-th object feature in the query sentence obtained in step S2 are used as the input of the object-aware modulation layer of the t-th branch to obtain the object-aware region features of the regions in the video, where the branch with t = 1 is the main branch and the branches with t ∈ {2, 3, …, T} are the auxiliary branches; then the matching score between the object-aware region features of the regions in the video and the object features in the query sentence is calculated by the cross-modal matching layer, and the matching score is processed by the softmax function layer; finally, the object-aware region features of the regions output by the main branch and the T-1 auxiliary branches, together with the softmax-processed matching scores, are used as the input of the multi-branch relational reasoning module to obtain the object-aware multi-branch features of the regions;
s4: establishing a space-time locator which comprises a space locator and a time locator;
s5: designing a multitask loss function as:
$$\mathcal{L} = \lambda_1\mathcal{L}_s + \lambda_2\mathcal{L}_t^{align} + \lambda_3\mathcal{L}_t^{reg} + \lambda_4\mathcal{L}_{div}$$
where $\lambda_1,\lambda_2,\lambda_3,\lambda_4$ are hyperparameters regulating the balance among the four losses, $\mathcal{L}_s$ denotes the loss function of the spatial locator, $\mathcal{L}_t^{align}$ denotes the alignment loss function of the temporal locator, $\mathcal{L}_t^{reg}$ denotes the regression loss function of the temporal locator, and $\mathcal{L}_{div}$ denotes the diversity loss function of the object-aware multi-branch relation network; the object-aware multi-branch relation network and the spatio-temporal locator are trained in an end-to-end manner according to the multitask loss function;
s6: for a section of video and query sentences to be processed, preprocessing is performed through steps S1 and S2, then the dynamic region features of the obtained video frames and the object features in the query sentences are used as the input of a trained object perception multi-branch relation network, the output of the trained object perception multi-branch relation network is used as the input of a trained space-time locator, and the region corresponding to the minimum value of the multitask loss function is used as the final result to be output.
2. The method for completing the task of locating a specific object in a video according to claim 1, wherein the step S1 specifically comprises:
for a video, extracting regional characteristics through a pre-training Faster R-CNN model
Figure FDA0002482601260000021
Wherein N represents the total number of video frames, K represents the number of regions in each frame of video,
Figure FDA0002482601260000022
representing the feature value of the kth region in the nth frame of the video, using a bounding box corresponding to the spatial position of the region
Figure FDA0002482601260000023
Is shown in which
Figure FDA0002482601260000024
Representing the coordinates of the center of the kth region of the nth frame of video,
Figure FDA0002482601260000025
the width and the height of a k area of an nth frame of the video are represented;
a temporal region aggregation method is adopted: for any region feature $r_n^k$ in a video frame, the preceding L frames and the following L frames are taken as a video frame set, and the association score between any region feature $r_l^j$ in the video frame set and $r_n^k$ is calculated by the formula:
$$e_{n,l}^{k,j} = \cos(r_n^k, r_l^j) + \alpha\,\mathrm{IoU}(b_n^k, b_l^j)$$
where $r_l^j$ and $b_l^j$ denote the j-th region feature and spatial position of the l-th video frame respectively, $l \in [n-L, n+L]$, IoU(·) denotes the IoU score of the two region bounding boxes, and α denotes a balance coefficient;
in each video frame of the video frame set, the region feature with the highest association score with $r_n^k$ is taken as the matching region feature of $r_n^k$; $r_n^k$ is then average-pooled with the extracted 2L matching region features to obtain the dynamic region feature $\bar{r}_n^k$.
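A sketch of the temporal region aggregation of claim 2 follows. The exact association-score formula appears only as an image in the original publication, so the score used below (cosine similarity between region features plus α times the IoU of their bounding boxes) is an assumption; the best-scoring region in each of the 2L neighbouring frames is then average-pooled with the region itself.

```python
import torch
import torch.nn.functional as F

def box_iou(b1, b2):
    """b1: (K, 4), b2: (K2, 4) boxes as (cx, cy, w, h); returns a (K, K2) IoU matrix."""
    def to_xyxy(b):
        return torch.stack([b[:, 0] - b[:, 2] / 2, b[:, 1] - b[:, 3] / 2,
                            b[:, 0] + b[:, 2] / 2, b[:, 1] + b[:, 3] / 2], dim=-1)
    a, c = to_xyxy(b1), to_xyxy(b2)
    lt = torch.max(a[:, None, :2], c[None, :, :2])
    rb = torch.min(a[:, None, 2:], c[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_c = (c[:, 2] - c[:, 0]) * (c[:, 3] - c[:, 1])
    return inter / (area_a[:, None] + area_c[None, :] - inter + 1e-8)

def temporal_region_aggregation(feats, boxes, L=2, alpha=0.5):
    """feats: (N, K, D) region features; boxes: (N, K, 4) as (cx, cy, w, h).
    Returns dynamic region features (N, K, D) by average-pooling each region
    with its best-matching region in each of the 2L neighbouring frames."""
    N, K, D = feats.shape
    dyn = torch.empty_like(feats)
    for n in range(N):
        matched = [feats[n]]                      # include the region itself
        for l in range(max(0, n - L), min(N, n + L + 1)):
            if l == n:
                continue
            sim = F.cosine_similarity(feats[n][:, None, :], feats[l][None, :, :], dim=-1)
            score = sim + alpha * box_iou(boxes[n], boxes[l])   # (K, K) association scores
            best = score.argmax(dim=1)                          # best match per region
            matched.append(feats[l][best])
        dyn[n] = torch.stack(matched, dim=0).mean(dim=0)
    return dyn
```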
3. The method for completing the task of locating a specific object in a video according to claim 1, wherein the step S2 specifically comprises:
For a query sentence, a Bi-GRU network is first adopted to obtain the semantic feature set $S=\{s_m\}_{m=1}^{M}$ of the words in the query sentence, where $s_m$ denotes the semantic feature of the m-th word and M denotes the number of words in the query sentence;
all nouns in the query sentence are marked as objects in the query sentence with the NLTK toolkit, and the semantic features $\{s_t^{o}\}_{t=1}^{T}$ of the objects are extracted from the semantic feature set of the words;
the context of each object in the query sentence is aggregated by an attention method to obtain the object features in the query sentence, with the calculation formulas:
$$u_{t,m} = w_s^{\top}\tanh\!\big(W_s[s_t^{o}; s_m] + b_s\big)$$
$$\beta_{t,m} = \frac{\exp(u_{t,m})}{\sum_{m'=1}^{M}\exp(u_{t,m'})}$$
$$o_t = \sum_{m=1}^{M}\beta_{t,m}\, s_m$$
where $W_s$ denotes a projection matrix, $b_s$ denotes a bias vector, $w_s^{\top}$ denotes a row vector, $\beta_{t,m}$ denotes the attention weight of the m-th word in the query sentence with respect to the t-th object, and $o_t$ denotes the t-th object feature in the query sentence; the object features form the object feature set $O=\{o_t\}_{t=1}^{T}$, where T denotes the number of objects in the query sentence, $o_1$ denotes the main object feature, and $\{o_2, \ldots, o_T\}$ denote the auxiliary object features.
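A sketch of the object-feature extraction of claim 3: Bi-GRU word features, nouns marked as objects with NLTK, and per-object attention over the whole sentence. The concrete attention score (a tanh projection of the object-word pair followed by a softmax) is an assumption, since the formulas are given as images in the original publication; the word embeddings are assumed to be supplied externally (e.g. pre-trained vectors).

```python
import torch
import torch.nn as nn
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are installed

class ObjectFeatureExtractor(nn.Module):
    def __init__(self, emb_dim=300, hid=256):
        super().__init__()
        self.bigru = nn.GRU(emb_dim, hid, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(4 * hid, hid)        # [object ; word] pair -> hidden
        self.score = nn.Linear(hid, 1, bias=False)

    def forward(self, word_embs, noun_positions):
        """word_embs: (1, M, emb_dim) word embeddings; noun_positions: indices of nouns."""
        s, _ = self.bigru(word_embs)               # (1, M, 2*hid) word semantic features
        s = s.squeeze(0)                           # (M, 2*hid)
        objects = []
        for t in noun_positions:
            pair = torch.cat([s[t].expand_as(s), s], dim=-1)              # (M, 4*hid)
            beta = torch.softmax(self.score(torch.tanh(self.proj(pair))), dim=0)
            objects.append((beta * s).sum(dim=0))  # context-aggregated object feature o_t
        return torch.stack(objects)                # (T, 2*hid), main object first

def find_noun_positions(sentence):
    """Mark nouns as objects using NLTK part-of-speech tags."""
    tokens = nltk.word_tokenize(sentence)
    tags = nltk.pos_tag(tokens)
    return [i for i, (_, tag) in enumerate(tags) if tag.startswith("NN")], tokens
```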
4. The method for completing the task of locating a specific object in a video according to claim 1, wherein the step S3 specifically comprises:
3.1) constructing an object-aware multi-branch relation network, wherein the object-aware multi-branch relation network is composed of a main branch, T-1 auxiliary branches and a multi-branch relation reasoning module, and each branch comprises an object-aware modulation layer, a cross-mode matching layer and a softmax function layer;
3.2) The dynamic region features $\bar{r}_n^k$ obtained in step S1 and the object feature $o_t$ of the query sentence are jointly used as the input of the t-th branch, and the object-aware region features are calculated by the object-aware modulation layer of each branch with the formulas:
$$\gamma_t = \tanh(W_\gamma o_t + b_\gamma)$$
$$\epsilon_t = \tanh(W_\epsilon o_t + b_\epsilon)$$
$$\tilde{r}_n^{t,k} = \gamma_t \odot \bar{r}_n^k + \epsilon_t$$
where $W_\gamma, W_\epsilon, b_\gamma, b_\epsilon$ are parameter matrices and bias vectors, $\gamma_t$ denotes the modulation gate corresponding to the t-th object in the query sentence, $\epsilon_t$ denotes the modulation bias corresponding to the t-th object in the query sentence, ⊙ denotes element-wise multiplication, and $\tilde{r}_n^{t,k}$ denotes the object-aware region feature of the k-th region of the n-th video frame in the t-th branch;
3.3) The matching score between the object-aware region features of the regions in the video and the object features in the query sentence is calculated by the cross-modal matching layer with the formula:
$$m_n^{t,k} = w_c^{\top}\tanh\!\big(W_c[\tilde{r}_n^{t,k}; o_t] + b_c\big)$$
where $w_c^{\top}$ denotes a row vector, $W_c$ denotes a parameter matrix, $b_c$ denotes a parameter vector, and $m_n^{t,k}$ denotes the matching score between the k-th region of the n-th video frame and the t-th object in the query sentence; the scores $m_n^{t,k}$ are then processed by the softmax function layer to obtain $\hat{m}_n^{t,k}$, which form the set $\hat{M}^t=\{\hat{m}_n^{t,k}\}_{n=1,k=1}^{N,K}$;
3.4) The softmax-processed matching scores $\hat{m}_n^{t,k}$ and the object-aware region features $\tilde{r}_n^{t,k}$ output by the main branch and the T-1 auxiliary branches are used as the input of the multi-branch relational reasoning module to obtain the object-aware multi-branch features $u_n^k$ of the regions.
The method specifically comprises the following steps:
3.4.1) calculating the attention weight between any region in the video frame of the main branch and any region in the video frames of the T-1 auxiliary branches, wherein the calculation formula is as follows:
$$a_n^{t,k,l} = w_a^{\top}\tanh\!\big(W_a[\tilde{r}_n^{1,k};\, \tilde{r}_n^{t,l};\, p_n^{k,l}] + b_a\big)$$
where $w_a^{\top}$, $W_a$, and $b_a$ are a row vector, a parameter matrix, and a bias vector, $\tilde{r}_n^{1,k}$ denotes the object-aware region feature of the k-th region of the n-th video frame in the main branch, $\tilde{r}_n^{t,l}$ denotes the object-aware region feature of the l-th region of the n-th video frame in the t-th branch, and $p_n^{k,l}$ and $a_n^{t,k,l}$ denote, respectively, the relative position vector and the attention weight between the k-th region of the n-th video frame in the main branch and the l-th region of the n-th video frame in the t-th branch; the weights $a_n^{t,k,l}$ are then processed by the softmax function layer to obtain $\hat{a}_n^{t,k,l}$;
3.4.2) The integrated feature contributed to any region of the main branch by all regions of the t-th branch, i.e. by the regions relevant to the t-th object of the query sentence in the auxiliary branch, is obtained with the calculation formula:
$$c_n^{t,k} = \sum_{l=1}^{K} \hat{a}_n^{t,k,l}\, \hat{m}_n^{t,l}\, \tilde{r}_n^{t,l}$$
where $c_n^{t,k}$ denotes the integrated feature of all regions of the t-th branch with respect to the k-th region of the n-th video frame in the main branch, and t ≥ 2;
based on the integrated features $c_n^{t,k}$ of the regions in the video frames, the object-aware multi-branch feature set $U=\{u_n^k\}_{n=1,k=1}^{N,K}$ of all regions is further obtained with the calculation formula:
$$u_n^k = \mathrm{ReLU}\!\big(W_u\big[\tilde{r}_n^{1,k};\, c_n^{2,k};\, \ldots;\, c_n^{T,k}\big] + b_u\big)$$
where ReLU(·) denotes the linear rectification function used as the activation function, $W_u$ and $b_u$ are a parameter matrix and a bias vector, and $u_n^k$ denotes the object-aware multi-branch feature of the k-th region of the n-th video frame.
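A sketch of one object-aware branch and the multi-branch fusion of claim 4. The modulation gate and bias follow the tanh formulas of step 3.2; the matching score, the relation attention (the relative-position term is omitted here), and the final fusion are only shown as images in the original publication, so their concrete forms below are assumptions.

```python
import torch
import torch.nn as nn

class ObjectAwareBranch(nn.Module):
    """One branch: object-aware modulation, cross-modal matching, softmax over regions."""
    def __init__(self, region_dim, obj_dim, hid=256):
        super().__init__()
        self.gate = nn.Linear(obj_dim, region_dim)     # produces gamma_t
        self.bias = nn.Linear(obj_dim, region_dim)     # produces epsilon_t
        self.match = nn.Sequential(
            nn.Linear(region_dim + obj_dim, hid), nn.Tanh(), nn.Linear(hid, 1))

    def forward(self, regions, obj):
        """regions: (K, region_dim) dynamic region features of one frame; obj: (obj_dim,)."""
        gamma = torch.tanh(self.gate(obj))
        eps = torch.tanh(self.bias(obj))
        mod = gamma * regions + eps                    # object-aware region features
        score = self.match(torch.cat([mod, obj.expand(len(regions), -1)], dim=-1))
        return mod, torch.softmax(score.squeeze(-1), dim=0)   # (K, D), (K,)

class MultiBranchFusion(nn.Module):
    """Fuses the main branch with match-score-weighted pooling from auxiliary branches.
    num_objects must equal 1 + number of auxiliary branches."""
    def __init__(self, region_dim, num_objects):
        super().__init__()
        self.out = nn.Linear(num_objects * region_dim, region_dim)

    def forward(self, main_mod, aux_mods, aux_scores):
        """main_mod: (K, D); aux_mods: list of (K, D); aux_scores: list of (K,)."""
        parts = [main_mod]
        for mod, w in zip(aux_mods, aux_scores):
            attn = torch.softmax(main_mod @ mod.t(), dim=-1)   # (K, K) relation weights
            parts.append(attn @ (w.unsqueeze(-1) * mod))       # score-weighted integration
        return torch.relu(self.out(torch.cat(parts, dim=-1)))  # (K, D) multi-branch features
```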
5. The method according to claim 4, wherein said object-aware multi-branch relationship network employs a diversity loss function, and the formula is:
$$\mathcal{L}_{div} = \frac{1}{Z}\sum_{n\in S_{gt}}\sum_{t\neq t'}\sum_{k=1}^{K}\hat{m}_n^{t,k}\,\hat{m}_n^{t',k}$$
where $S_{gt}$ denotes the set of frames in the truth segment, $Z$ is a normalization factor, and $\hat{m}_n^{t,k}$ and $\hat{m}_n^{t',k}$ denote the softmax-processed matching scores of any two objects t and t' in the query sentence with respect to all regions of the n-th video frame.
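A sketch of a diversity loss over the softmax-processed matching scores of claim 5. Only the ingredients (truth-segment frames, a normalization factor, and pairs of per-object matching distributions) are stated in the claim, so the concrete overlap penalty below, which discourages two objects from concentrating on the same regions, is an assumption.

```python
import torch

def diversity_loss(match_scores, gt_frames):
    """match_scores: (T, N, K) softmax-over-regions matching scores for T objects;
    gt_frames: list of frame indices inside the truth segment."""
    m = match_scores[:, gt_frames, :]                 # (T, |S_gt|, K)
    num_obj = m.shape[0]
    overlap = torch.einsum('tnk,snk->ts', m, m)       # pairwise region-overlap matrix
    off_diag = overlap - torch.diag_embed(torch.diagonal(overlap))
    denom = max(num_obj * (num_obj - 1), 1) * max(m.shape[1], 1)
    return off_diag.sum() / denom                     # average off-diagonal overlap
```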
6. The method for completing the task of locating a specific object in a video according to claim 1, wherein the step S4 specifically comprises:
4.1) A spatial locator is established: using the object-aware multi-branch features of the regions and the main object feature $o_1$, the confidence score of any region in a video frame is calculated with the formula:
$$\hat{s}_n^k = \sigma\!\big((W_r u_n^k)^{\top}(W_o o_1)\big)$$
where σ is the sigmoid function, $\hat{s}_n^k$ denotes the spatial confidence score of the k-th region of the n-th video frame, and $W_r$ and $W_o$ denote parameter matrices;
the loss function of the spatial locator is:
$$\mathcal{L}_s = -\frac{1}{|S_{gt}|\,K}\sum_{n\in S_{gt}}\sum_{k=1}^{K}\Big(g_n^k\log\hat{s}_n^k + (1-g_n^k)\log(1-\hat{s}_n^k)\Big)$$
where $g_n^k$ denotes the IoU score between the k-th region of the n-th video frame and its corresponding truth region, and $S_{gt}$ denotes the set of frames in the truth segment;
4.2) A temporal locator is established: the object-aware multi-branch features of the regions are aggregated by spatial attention with the main object feature $o_1$ to obtain frame-level object-aware features of the video, with the calculation formulas:
$$\rho_n^k = \mathrm{softmax}_k\!\Big(w_f^{\top}\tanh\!\big(W_1^f u_n^k + W_2^f o_1 + b_f\big)\Big)$$
$$f_n = \sum_{k=1}^{K}\rho_n^k\, u_n^k$$
where $f_n$ denotes the object-aware feature of the n-th video frame, $w_f^{\top}$ denotes a row vector, and $W_1^f$, $W_2^f$, $b_f$ denote parameter matrices and a bias; another Bi-GRU is then used to learn the object-aware context features $\{\hat{f}_n\}_{n=1}^{N}$ of all frames;
a set of candidate segments $\{(n - w_h/2,\, n + w_h/2)\}_{h=1}^{H}$ is defined at each video frame, where $w_h$ denotes the width of the h-th candidate segment at each video frame and H denotes the number of candidate segments; all candidate segments are scored through a linear layer with a sigmoid function, and the offsets of their boundaries are generated at the same time, with the calculation formulas:
$$c_n = \sigma(W_c\hat{f}_n + b_c)$$
$$\delta_n = W_l\hat{f}_n + b_l$$
where $c_n$ denotes the temporal confidence scores of the H candidate segments at the n-th video frame, $\delta_n$ denotes the boundary offsets of the H candidate segments, $W_c$ and $W_l$ are parameter matrices, $b_c$ and $b_l$ are biases, and σ(·) is the sigmoid function;
the temporal locator has two losses: an alignment loss for candidate segment selection and a regression loss for boundary adjustment; the alignment loss formula is:
$$\mathcal{L}_t^{align} = -\frac{1}{NH}\sum_{n=1}^{N}\sum_{h=1}^{H}\Big(t_n^h\log c_n^h + (1-t_n^h)\log(1-c_n^h)\Big)$$
where $t_n^h$ denotes the temporal IoU score between the h-th candidate segment at the n-th video frame and the truth segment, and $c_n^h$ denotes the h-th element of $c_n$, i.e. the confidence score of the h-th candidate segment at the n-th video frame;
the segment with the largest temporal confidence score is selected; its temporal boundary is (s, e) and its offset is $(l_s, l_e)$; the true offset $(\hat{l}_s, \hat{l}_e)$ is first calculated from the truth boundary $(\hat{s}, \hat{e})$, and the regression loss formula is:
$$\mathcal{L}_t^{reg} = R(l_s - \hat{l}_s) + R(l_e - \hat{l}_e)$$
where R denotes the smooth L1 function.
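A sketch of the spatio-temporal locator of claim 6. The spatial head scores every region against the main object feature with a sigmoid, and the temporal head aggregates regions by attention, runs a Bi-GRU over frames, and predicts H candidate-segment confidences and boundary offsets per frame. The concrete projections are assumptions, since the formulas appear as images in the original publication.

```python
import torch
import torch.nn as nn

class SpatialLocator(nn.Module):
    def __init__(self, region_dim, obj_dim, hid=256):
        super().__init__()
        self.w_r = nn.Linear(region_dim, hid, bias=False)
        self.w_o = nn.Linear(obj_dim, hid, bias=False)

    def forward(self, region_feats, main_obj):
        """region_feats: (N, K, D); main_obj: (obj_dim,). Returns (N, K) confidence scores."""
        return torch.sigmoid((self.w_r(region_feats) * self.w_o(main_obj)).sum(-1))

class TemporalLocator(nn.Module):
    def __init__(self, region_dim, obj_dim, hid=256, num_candidates=4):
        super().__init__()
        self.attn = nn.Linear(region_dim + obj_dim, 1)
        self.context = nn.GRU(region_dim, hid, batch_first=True, bidirectional=True)
        self.conf = nn.Linear(2 * hid, num_candidates)         # candidate confidences
        self.offset = nn.Linear(2 * hid, 2 * num_candidates)   # (start, end) offsets

    def forward(self, region_feats, main_obj):
        """region_feats: (N, K, D); main_obj: (obj_dim,)."""
        N, K, D = region_feats.shape
        obj = main_obj.expand(N, K, -1)
        alpha = torch.softmax(self.attn(torch.cat([region_feats, obj], -1)), dim=1)
        frame_feats = (alpha * region_feats).sum(dim=1)         # (N, D) frame-level features
        ctx, _ = self.context(frame_feats.unsqueeze(0))         # (1, N, 2*hid)
        ctx = ctx.squeeze(0)
        return torch.sigmoid(self.conf(ctx)), self.offset(ctx)  # (N, H) scores, (N, 2H) offsets
```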
7. A positioning system for completing the task of positioning a specified object in a video by using an object-aware multi-branch relation network, wherein the positioning system is used for implementing the method for completing the specified-object positioning task in a video according to claim 1, the system comprising:
the video preprocessing module: used for extracting region features of different frames from the video and calculating association scores between any region feature in a video frame and all region features in the video frames within an adjacent interval; taking the region feature with the highest association score in each video frame of the adjacent interval as a matching region feature, and average-pooling any region feature in the video frame with its matching region features to obtain the dynamic region features of the video frames;
the query statement preprocessing module: used for acquiring the semantic feature set of all words in the query statement, extracting the semantic features of nouns from the semantic feature set, and further obtaining the object features in the query statement by an attention method;
a video clip positioning module: the video fragment positioning module comprises a modeling submodule and a training submodule, wherein the modeling submodule is provided with an object perception multi-branch relation model, a space positioning model and a time positioning model, and the training submodule is provided with a multi-task loss function;
an output module: for outputting the positioning result.
8. The system of claim 7, wherein the video pre-processing module comprises:
the Faster R-CNN sub-module: configured with a trained Faster R-CNN model, taking a video as input to obtain the region features of the video;
the temporal region aggregation sub-module: used for aggregating the preceding L frames and the following L frames of the frame where any region is located into a video frame set, and calculating the association score between any region feature in the video frame set and the region feature to be matched in the video frame;
the pooling sub-module: used for average-pooling the region feature with the highest association score in each video frame of the video frame set with the region feature to be matched, and outputting the dynamic region feature.
9. The system of claim 7, wherein the query statement preprocessing module comprises:
Bi-GRU submodule: configuring a trained Bi-GRU network for acquiring semantic features of words in query sentences;
the labeling sub-module: used for marking all nouns in the query statement as objects in the query statement;
the feature extraction sub-module: used for extracting the semantic features of the objects in the query statement from the semantic features of the words in the query statement and calculating the object features in the query statement.
10. The system of claim 7, wherein the modeling sub-module is configured with:
object-aware multi-branch relationship model: the object perception multi-branch relation model is composed of a main branch, T-1 auxiliary branches and a multi-branch relation reasoning layer, and each branch comprises an object perception modulation layer, a cross-mode matching layer and a softmax function layer;
a space locator: the positioning device is used for realizing the positioning of the space pipeline;
a time locator: for achieving the positioning of the time pipe.
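As an illustration of how the modules of claims 7-10 could be composed, the sketch below wires the preprocessing modules, the multi-branch relation network, and the two locators into one system object; every name and the call signatures are assumptions, since the claims only specify the modules functionally.

```python
class VideoGroundingSystem:
    """Illustrative composition of the modules in claims 7-10 (assumed API)."""

    def __init__(self, video_preprocessor, query_preprocessor, branches, fusion,
                 temporal_locator, spatial_locator):
        self.video_preprocessor = video_preprocessor    # video preprocessing module
        self.query_preprocessor = query_preprocessor    # query statement preprocessing module
        self.branches = branches                        # main branch + T-1 auxiliary branches
        self.fusion = fusion                            # multi-branch relation reasoning
        self.temporal_locator = temporal_locator        # temporal pipeline localization
        self.spatial_locator = spatial_locator          # spatial pipeline localization

    def localize(self, video, sentence):
        regions = self.video_preprocessor(video)        # dynamic region features
        objects = self.query_preprocessor(sentence)     # object features, main object first
        mods, scores = zip(*(b(regions, o) for b, o in zip(self.branches, objects)))
        features = self.fusion(mods[0], list(mods[1:]), list(scores[1:]))
        segment = self.temporal_locator(features, objects[0])
        boxes = self.spatial_locator(features, objects[0])
        return segment, boxes                           # consumed by the output module
```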
CN202010382647.0A 2020-05-08 2020-05-08 Method and system for positioning specified object in video based on multi-branch relation network Active CN111582170B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010382647.0A CN111582170B (en) 2020-05-08 2020-05-08 Method and system for positioning specified object in video based on multi-branch relation network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010382647.0A CN111582170B (en) 2020-05-08 2020-05-08 Method and system for positioning specified object in video based on multi-branch relation network

Publications (2)

Publication Number Publication Date
CN111582170A true CN111582170A (en) 2020-08-25
CN111582170B CN111582170B (en) 2023-05-23

Family

ID=72112195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010382647.0A Active CN111582170B (en) 2020-05-08 2020-05-08 Method and system for positioning specified object in video based on multi-branch relation network

Country Status (1)

Country Link
CN (1) CN111582170B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380394A (en) * 2020-10-27 2021-02-19 浙江工商大学 Progressive positioning method for positioning from text to video clip
CN112417206A (en) * 2020-11-24 2021-02-26 杭州一知智能科技有限公司 Weak supervision video time interval retrieval method and system based on two-branch proposed network
CN116580054A (en) * 2022-01-29 2023-08-11 腾讯科技(深圳)有限公司 Video data processing method, device, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5748789A (en) * 1996-10-31 1998-05-05 Microsoft Corporation Transparent block skipping in object-based video coding systems
CN107229894A (en) * 2016-03-24 2017-10-03 上海宝信软件股份有限公司 Intelligent video monitoring method and system based on computer vision analysis technology
CN110225368A (en) * 2019-06-27 2019-09-10 腾讯科技(深圳)有限公司 A kind of video locating method, device and electronic equipment
CN110807122A (en) * 2019-10-18 2020-02-18 浙江大学 Image-text cross-modal feature disentanglement method based on depth mutual information constraint

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5748789A (en) * 1996-10-31 1998-05-05 Microsoft Corporation Transparent block skipping in object-based video coding systems
CN107229894A (en) * 2016-03-24 2017-10-03 上海宝信软件股份有限公司 Intelligent video monitoring method and system based on computer vision analysis technology
CN110225368A (en) * 2019-06-27 2019-09-10 腾讯科技(深圳)有限公司 A kind of video locating method, device and electronic equipment
CN110807122A (en) * 2019-10-18 2020-02-18 浙江大学 Image-text cross-modal feature disentanglement method based on depth mutual information constraint

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Xi; Zha Yufei; Zhang Tianzhu; Cui Zhen; Zuo Wangmeng; Hou Zhiqiang; Lu Huchuan; Wang Hanzi: "A survey of deep-learning-based object tracking algorithms" *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380394A (en) * 2020-10-27 2021-02-19 浙江工商大学 Progressive positioning method for positioning from text to video clip
CN112380394B (en) * 2020-10-27 2022-05-10 浙江工商大学 Progressive positioning method for positioning from text to video clip
US11941872B2 (en) 2020-10-27 2024-03-26 Zhejiang Gongshang University Progressive localization method for text-to-video clip localization
CN112417206A (en) * 2020-11-24 2021-02-26 杭州一知智能科技有限公司 Weak supervision video time interval retrieval method and system based on two-branch proposed network
CN112417206B (en) * 2020-11-24 2021-09-24 杭州一知智能科技有限公司 Weak supervision video time interval retrieval method and system based on two-branch proposed network
CN116580054A (en) * 2022-01-29 2023-08-11 腾讯科技(深圳)有限公司 Video data processing method, device, equipment and medium

Also Published As

Publication number Publication date
CN111582170B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
Liao et al. Deep facial spatiotemporal network for engagement prediction in online learning
Kamal et al. Automatic traffic sign detection and recognition using SegU-Net and a modified Tversky loss function with L1-constraint
CN108229338B (en) Video behavior identification method based on deep convolution characteristics
CN111582170A (en) Method and positioning system for completing specified object positioning task in video by using object-aware multi-branch relation network
CN107506712A (en) Method for distinguishing is known in a kind of human behavior based on 3D depth convolutional networks
CN111931795B (en) Multi-modal emotion recognition method and system based on subspace sparse feature fusion
CN112036276B (en) Artificial intelligent video question-answering method
KR20200010672A (en) Smart merchandise searching method and system using deep learning
CN114549470B (en) Hand bone critical area acquisition method based on convolutional neural network and multi-granularity attention
CN113111968A (en) Image recognition model training method and device, electronic equipment and readable storage medium
Chang et al. Event-centric multi-modal fusion method for dense video captioning
CN117149944B (en) Multi-mode situation emotion recognition method and system based on wide time range
CN115375732A (en) Unsupervised target tracking method and system based on module migration
Han et al. Modeling long-term video semantic distribution for temporal action proposal generation
CN117892175A (en) SNN multi-mode target identification method, system, equipment and medium
Hou et al. Confidence-guided self refinement for action prediction in untrimmed videos
CN116310975B (en) Audiovisual event positioning method based on consistent fragment selection
CN116958740A (en) Zero sample target detection method based on semantic perception and self-adaptive contrast learning
Wang et al. Hidden Markov Model‐Based Video Recognition for Sports
Wang et al. Curiosity-driven salient object detection with fragment attention
Jin et al. C2F: An effective coarse-to-fine network for video summarization
Rehman et al. A Real-Time Approach for Finger Spelling Interpretation Based on American Sign Language Using Neural Networks
Huo et al. Modality-convolutions: Multi-modal gesture recognition based on convolutional neural network
Ge et al. A visual tracking algorithm combining parallel network and dual attention-aware mechanism
Rawat et al. Indian sign language recognition system for interrogative words using deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant