CN111582170A - Method and positioning system for completing specified object positioning task in video by using object-aware multi-branch relation network


Info

Publication number
CN111582170A
CN111582170A (application CN202010382647.0A)
Authority
CN
China
Prior art keywords: branch, video, region, representing, video frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010382647.0A
Other languages
Chinese (zh)
Other versions
CN111582170B (en)
Inventor
赵洲
路伊琳
张竹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority application: CN202010382647.0A (granted as CN111582170B)
Publication of CN111582170A
Application granted
Publication of CN111582170B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a positioning system for completing a specified-object positioning task in a video by using an object-aware multi-branch relation network. The method comprises the following steps: given a video, region features are extracted from different frames, and dynamic information is derived from these region features; given a query sentence, the features of each object in its sentence context are learned with a Bi-GRU and the NLTK library; an object-aware multi-branch relation network is constructed, in which object-aware modulation emphasizes the region features related to each object and weakens unrelated region features, and object-region cross-modal matching is then performed; multi-branch relation reasoning captures the relationships of the key objects between the main branch and the auxiliary branches; a diversity loss calculation method is provided to ensure that different branches focus on the regions related to their corresponding objects. Finally, a sampling method yields multiple video segments, the segment with the highest temporal confidence score is selected, and within it the regions with the highest spatial scores are selected to generate the target pipeline.

Description

Method and positioning system for completing specified object positioning task in video by using object-aware multi-branch relation network
Technical Field
The invention relates to the field of positioning of specified objects in videos, in particular to a method and a positioning system for completing a task of positioning the specified objects in the videos by using an object-aware multi-branch relation network.
Background
Localizing a specified object in video (spatio-temporal video grounding) is a task connecting computer vision (CV) and natural language processing (NLP): given a sentence that describes an object, the spatio-temporal pipeline of that object is retrieved in the video, i.e., a sequence of bounding boxes is generated. Much work has been done in this area in recent years. However, most existing localization methods are limited to well-aligned sentence-video segment pairs.
Localizing a specified object in video is an emerging task in the area of cross-modal understanding. Most existing methods are limited to well-aligned sentence-video segment pairs, i.e., video segments that have already been cut out of the complete video and are temporally aligned with the sentences. Recently, researchers have begun to explore the problem of locating specified objects in video based on unaligned data and multi-form sentences. Specifically, a sentence may be declarative or interrogative, and it may describe the relationship between the query object and other auxiliary objects over a period of time; for example, "a child kicks a ball" describes the motion relationship between the main object (the child) and an auxiliary object (the ball) over a period of time. The key to this task is therefore to capture the key relationships between objects in the video and to generate the bounding boxes on that basis.
Among the existing methods, some achieve excellent performance on aligned sentence-video segments but cannot solve localization on unaligned sentence-video segments and cannot identify relationships between objects. Among the methods exploring unaligned data, Zhang et al. (Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, and Lianli Gao. Where does it exist: Spatio-temporal video grounding for multi-form sentences. arXiv preprint) integrate textual cues into region features and use spatio-temporal graph reasoning to retrieve spatio-temporal pipelines. Although this approach can capture object relationships through cross-modal region interaction, it fails to filter out unnecessary objects and retains all regions in a coarse relation model, which prevents the establishment of an effective relation model.
Therefore, the existing method for specifying object location in video has at least the following technical problems:
(1) a series of candidate spatio-temporal pipelines (i.e., candidate boxes) must be extracted first, and the most relevant pipeline is then selected according to the sentence; suitable candidate pipelines are difficult to extract on unaligned sentence-video segments;
(2) each object is modeled separately, or, even when a relation model between objects is established, the model is coarse and unnecessary objects are introduced.
These problems cause some prior-art positioning methods to perform poorly on unaligned sentence-video segments, to neglect the relationships between objects, and to incorporate unnecessary objects into the object relation model, resulting in an inaccurate final positioning range.
Disclosure of Invention
In order to solve the problems in the prior art that localization performance on unaligned sentence-video segments is poor and that the relationships between objects cannot be accurately captured, the invention provides a method and a positioning system for completing the task of positioning a specified object in a video by using an object-aware multi-branch relation network (OMRN). The method first extracts dynamic region features from the video and learns object representations corresponding to the nouns in the sentence. An object-aware multi-branch relation network is then established to determine the video regions containing the objects. The network comprises a plurality of branches, each corresponding to a noun object in the query sentence: the main branch corresponds to the queried object, i.e., the main object, and the auxiliary branches correspond to the other objects mentioned in the sentence. Specifically, an object-aware modulation layer is used in each branch to enhance the features of the regions related to the object and weaken the unnecessary region features; object-region cross-modal matching is then performed in each branch, and the key object relationships between the main branch and the auxiliary branches are captured by a multi-branch relation reasoning module. In addition, considering that each branch should focus only on its corresponding object, the invention provides a diversity loss calculation method (diversity loss function) so that different branches focus on different regions. Finally, a temporal boundary is determined and the target pipeline is retrieved by a spatio-temporal locator. The invention pays more attention to the key objects in the sentence and establishes sufficiently strong cross-modal relation reasoning between them, thereby achieving accurate localization.
In order to achieve the purpose, the invention adopts the following specific technical scheme:
the method for completing the positioning task of the specified object in the video by utilizing the object-aware multi-branch relation network comprises the following steps:
s1: extracting regional characteristics of different frames from a video aiming at a section of video, and calculating association scores between any regional characteristic in the video frame and all regional characteristics in the video frame in an adjacent interval; extracting the regional characteristics with the highest matching scores in each video frame in the adjacent interval as matching regional characteristics, and performing average pooling on any regional characteristic in the video frames and the matching regional characteristics to obtain the dynamic regional characteristics of the video frames;
s2: aiming at a query sentence, firstly, obtaining semantic feature sets of all words in the query sentence by adopting a Bi-GRU network, and extracting semantic features of nouns from the semantic feature sets; then adopting an attention method to further obtain object characteristics in the query statement;
s3: constructing an object-aware multi-branch relation network, wherein the object-aware multi-branch relation network is composed of a main branch, T-1 auxiliary branches and a multi-branch relation reasoning module, and each branch comprises an object-aware modulation layer, a cross-mode matching layer and a softmax function layer;
firstly, the dynamic region features of the video frames obtained in step S1 and the t-th object feature in the query sentence obtained in step S2 are used as the input of the object-aware modulation layer of the t-th branch to obtain the object-aware region features of the regions in the video, wherein when t = 1 the branch is the main branch, and when t ∈ {2, 3, …, T} the branch is an auxiliary branch;
then calculating a matching score between the object perception region characteristics of the region in the video and the object characteristics in the query statement through a cross-modal matching layer, and processing the matching score through a softmax function layer;
finally, the object perception regional characteristics of the region in the video output by the main branch and the T-1 auxiliary branches and the matching scores processed by the softmax function are used as the input of a multi-branch relation reasoning module to obtain the object perception multi-branch characteristics of the region;
s4: establishing a space-time locator which comprises a space locator and a time locator;
s5: designing a multitask loss function as:
Figure BDA0002482601270000031
wherein ,λ1,λ2,λ3,λ4The over-parameters of the balance between the four losses are regulated and controlled,
Figure BDA0002482601270000032
a loss function representing the spatial locator is shown,
Figure BDA0002482601270000033
represents an alignment loss function of the time locator,
Figure BDA0002482601270000034
represents the regression loss function of the time locator,
Figure BDA0002482601270000035
a diversity loss function representing an object-aware multi-branch relational network; training an object perception multi-branch relation network and a space-time locator in an end-to-end mode according to a multi-task loss function;
s6: for a section of video and query sentences to be processed, preprocessing is performed through steps S1 and S2, then the dynamic region features of the obtained video frames and the object features in the query sentences are used as the input of a trained object perception multi-branch relation network, the output of the trained object perception multi-branch relation network is used as the input of a trained space-time locator, and the region corresponding to the minimum value of the multitask loss function is used as the final result to be output.
Further, the step S3 is specifically:
3.1) constructing an object-aware multi-branch relation network, wherein the object-aware multi-branch relation network is composed of a main branch, T-1 auxiliary branches and a multi-branch relation reasoning module, and each branch comprises an object-aware modulation layer, a cross-modal matching layer and a softmax function layer;
3.2) the dynamic region features r̂_{n,k} obtained in step S1 and the object feature o_t in the query sentence are used as the input of the t-th branch, and the object-aware region features are calculated by the object-aware modulation layer of each branch as follows:

γ_t = tanh(W_γ·o_t + b_γ)
ε_t = tanh(W_ε·o_t + b_ε)
r̃^t_{n,k} = γ_t ⊙ r̂_{n,k} + ε_t

wherein W_γ, W_ε, b_γ, b_ε are parameter matrices and bias vectors, γ_t denotes the modulation gate corresponding to the t-th object in the query sentence, ε_t denotes the modulation bias vector of the t-th object, ⊙ denotes element-wise multiplication, and r̃^t_{n,k} denotes the object-aware region feature of the k-th region of the n-th video frame in the t-th branch;
3.3) the matching score between the object-aware region features of the regions in the video and the object features in the query sentence is calculated through the cross-modal matching layer:

m^t_{n,k} = w_c^T·tanh(W_c·[r̃^t_{n,k}; o_t] + b_c)

wherein w_c denotes a row vector, W_c denotes a parameter matrix, b_c denotes a parameter vector, [·;·] denotes concatenation, and m^t_{n,k} denotes the matching score between the k-th region of the n-th video frame and the t-th object in the query sentence; m^t_{n,k} is then processed by the softmax function layer,

m̄^t_{n,k} = exp(m^t_{n,k}) / Σ_{k'=1}^{K} exp(m^t_{n,k'})

and the normalized scores form the set m̄^t_n = {m̄^t_{n,k}};
3.4) the softmax-normalized matching scores m̄^t_{n,k} and the object-aware region features r̃^t_{n,k} output by the main branch and the T-1 auxiliary branches are used as the input of the multi-branch relation reasoning module to obtain the object-aware multi-branch features v_{n,k} of the regions, specifically:

3.4.1) the attention weight between any region in a video frame of the main branch and any region in the video frames of the T-1 auxiliary branches is calculated: the attention weight a^t_{n,k,i} is computed from the object-aware region feature r̃^1_{n,k} of the k-th region of the n-th video frame in the main branch, the object-aware region feature r̃^t_{n,i} of the i-th region of the n-th video frame in the t-th branch, and the relative position vector p^t_{n,k,i} of the two regions; a^t_{n,k,i} is then processed by a softmax function layer to obtain the normalized attention weight ā^t_{n,k,i};

3.4.2) the integrated feature contributed by all regions of the t-th branch to any region of the main branch is obtained from the regions of the auxiliary branch related to the t-th object of the query sentence:

u^t_{n,k} = Σ_{i=1}^{K} ā^t_{n,k,i}·m̄^t_{n,i}·r̃^t_{n,i}

wherein u^t_{n,k} denotes the integrated feature of all regions of the t-th branch (t ≥ 2) for the k-th region of the n-th video frame in the main branch;

based on the integrated features u^t_{n,k} of each region, the object-aware multi-branch feature set V = {v_{n,k}} of all regions is further obtained by combining the main-branch feature r̃^1_{n,k} of each region with the integrated features u^2_{n,k}, …, u^T_{n,k} through a linear layer followed by the ReLU(·) activation function (linear rectification function), wherein v_{n,k} denotes the object-aware multi-branch feature of the k-th region of the n-th video frame.
The diversity loss function is:

L_div = (1/Z)·Σ_{n∈S_gt} Σ_{t≠t'} (m̄^t_n)^T·m̄^{t'}_n

wherein S_gt denotes the set of frames in the ground-truth segment, Z denotes a normalization factor, and m̄^t_n and m̄^{t'}_n denote the softmax-normalized matching score sets of any two different objects in the query sentence over all regions of the n-th video frame.
Another objective of the present invention is to provide a positioning system for completing a task of positioning a specified object in a video by using an object-aware multi-branch relationship network, for implementing the method for completing the task of positioning the specified object in the video, including:
the video preprocessing module: used to extract the region features of different frames from the video and to calculate the association scores between any region feature in a video frame and all region features in the video frames of its neighboring interval; to extract, in each video frame of the neighboring interval, the region feature with the highest association score as a matched region feature; and to perform average pooling on the region feature and its matched region features to obtain the dynamic region features of the video frames;
the query statement preprocessing module: the method comprises the steps of acquiring a semantic feature set of all words in a query sentence, extracting semantic features of nouns from the semantic feature set, and further obtaining object features in the query sentence by adopting an attention method;
a video clip positioning module: the video segment positioning module comprises a modeling submodule and a training submodule, wherein the modeling submodule is configured with an object perception multi-branch relation model, a space positioning model and a time positioning model, the object perception multi-branch relation model is used for extracting object perception multi-branch characteristics of an area, the space positioning model is used for realizing the positioning of a space pipeline, the time positioning model is used for realizing the positioning of a time pipeline, and the training submodule is configured with a multi-task loss function;
an output module: for outputting the positioning result.
The invention has the following beneficial effects:
(1) Traditional positioning methods must first extract a series of candidate spatio-temporal pipelines (i.e., candidate boxes), which means selecting candidate region boxes on the premise that the time is already localized (i.e., aligned), and then select the most relevant pipeline according to the sentence; suitable candidate pipelines are difficult to extract on unaligned sentence-video segments, so prior-art positioning methods are not suitable for unaligned sentence-video segment pairs. The invention first screens out the region features belonging to the same region in different frames through association scores and obtains the dynamic feature of each region by average pooling; it then samples a group of candidate segments by establishing a temporal locator, calculates the temporal confidence score of each frame in each candidate segment, and computes a temporal loss function to complete temporal localization, which improves the accuracy of localization on unaligned sentence-video segments.
(2) For the problem that the relationships between objects cannot be accurately captured in the prior art, the constructed object-aware multi-branch relation network comprises a main branch, auxiliary branches and a multi-branch relation reasoning module. Each branch corresponds to a noun object in the query sentence: the first noun of the query sentence is taken as the main object, i.e., the query object, and corresponds to the main branch; the other objects mentioned in the query sentence are auxiliary objects and correspond to the auxiliary branches. Each branch comprises an object-aware modulation layer, a cross-modal matching layer and a softmax function layer. The object-aware modulation layer is used in each branch to enhance the features of the regions related to the object and weaken the unnecessary region features; object-region cross-modal matching is then performed in each branch, and the key object relationships between the main branch and the auxiliary branches are captured by the multi-branch relation reasoning module. In particular, in the multi-branch relation reasoning module, the integrated features of all regions of each auxiliary branch for each region of the main branch are calculated, and the object-aware multi-branch features are obtained through a ReLU activation function; these features accurately reflect the relationship between the auxiliary branches and the main branch, so the captured object relationships are more accurate.
(3) For the problem that object relation models in the prior art often introduce unnecessary objects and therefore remain coarse, the invention designs a diversity loss function, which is calculated from the region-object matching scores (i.e., the matching score of each branch for each region in a video frame) and ensures that each branch attends only to its corresponding object; unnecessary object relationships can thus be effectively filtered out of the obtained object relation model, further improving the localization performance.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention;
fig. 2 is a schematic diagram of the overall architecture of an object-aware multi-branch relational network.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, the present invention uses an object-aware multi-branch relation network to complete the task of locating a specified object in a video: it first extracts region features from the video and object features from the sentence. An object-aware multi-branch relation network is then established to discover the regions associated with the objects, and reasoning is performed to capture the relationships between the key objects. Finally, the final localization of the specified object in the video clip is completed by the designed spatio-temporal locator.
The specific implementation steps are as follows:
the method comprises the following steps: video preprocessing: and extracting the dynamic region characteristics of the video frame aiming at a section of video.
Step two: preprocessing of the query statement: and extracting object features in the query statement aiming at the query statement.
Step three: constructing an object-aware multi-branch relation network, wherein the object-aware multi-branch relation network is composed of a main branch, T-1 auxiliary branches and a multi-branch relation reasoning module, and each branch comprises an object-aware modulation layer, a cross-modal matching layer and a softmax function layer.
Step four: a space-time locator is established, including a space locator and a time locator.
Step five: designing a multi-task loss function, and training an object perception multi-branch relation network and a space-time locator in an end-to-end mode; for a section of video and query sentences to be processed, preprocessing is performed through steps S1 and S2, then the dynamic region features of the obtained video frames and the object features in the query sentences are used as the input of a trained object perception multi-branch relation network, the output of the trained object perception multi-branch relation network is used as the input of a trained space-time locator, and the region corresponding to the minimum value of the multitask loss function is used as the final result to be output.
In one embodiment of the present invention, a method for pre-processing video is presented.
Region features of different frames are extracted from the video, and the association scores between any region feature in a video frame and all region features in the video frames of its neighboring interval are calculated; in each video frame of the neighboring interval, the region feature with the highest association score is extracted as a matched region feature, and average pooling is performed on the region feature and its matched region features to obtain the dynamic region features of the video frames.
Specifically, for a video, the region features {r_{n,k}} are extracted by a pre-trained Faster R-CNN model, wherein N denotes the total number of video frames, K denotes the number of regions in each frame, and r_{n,k} denotes the feature of the k-th region in the n-th frame of the video; the spatial position of a region is described by its bounding box b_{n,k} = (x_{n,k}, y_{n,k}, w_{n,k}, h_{n,k}), wherein (x_{n,k}, y_{n,k}) denotes the center coordinates of the k-th region of the n-th frame and (w_{n,k}, h_{n,k}) denotes its width and height;
dynamic information is acquired by a temporal region aggregation method: if two regions in adjacent frames have similar semantic features and spatial positions, they are regarded as the same object at different times. For any region feature r_{n,k} in a video frame, the preceding L frames and the following L frames are taken as a video-frame set, and the association score between r_{n,k} and any region feature r_{l,j} in the set is calculated as the semantic similarity of the two region features plus α times the IoU score of their bounding boxes, wherein r_{l,j} and b_{l,j} respectively denote the j-th region feature and spatial position of the l-th video frame, l ∈ [n-L, n+L], IoU(·) denotes the IoU score of two region boxes, and α denotes a balance coefficient;

in each video frame of the set formed by the preceding and following L frames, the region feature with the highest association score is taken as the matched region feature of r_{n,k}, and average pooling is performed over r_{n,k} and the 2L extracted matched region features to obtain the dynamic region feature r̂_{n,k}.
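By way of illustration, the following Python sketch (using PyTorch) implements this temporal region aggregation for a clip of region features: each region is paired with its best-associated region in the preceding and following L frames, and the matches are average-pooled. The cosine-similarity term and the helper names are assumptions for readability; the description above only states that the score combines semantic similarity with α times the box IoU.

```python
import torch
import torch.nn.functional as F

def box_iou(b1, b2):
    """IoU of two (cx, cy, w, h) boxes given as 1-D tensors."""
    x11, y11 = b1[0] - b1[2] / 2, b1[1] - b1[3] / 2
    x12, y12 = b1[0] + b1[2] / 2, b1[1] + b1[3] / 2
    x21, y21 = b2[0] - b2[2] / 2, b2[1] - b2[3] / 2
    x22, y22 = b2[0] + b2[2] / 2, b2[1] + b2[3] / 2
    iw = (torch.min(x12, x22) - torch.max(x11, x21)).clamp(min=0)
    ih = (torch.min(y12, y22) - torch.max(y11, y21)).clamp(min=0)
    inter = iw * ih
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union.clamp(min=1e-6)

def temporal_region_aggregation(feats, boxes, L=5, alpha=0.6):
    """feats: (N, K, D) region features; boxes: (N, K, 4) boxes (cx, cy, w, h).
    Returns (N, K, D) dynamic region features."""
    N, K, D = feats.shape
    dyn = torch.empty_like(feats)
    for n in range(N):
        for k in range(K):
            matched = [feats[n, k]]
            for l in range(max(0, n - L), min(N, n + L + 1)):
                if l == n:
                    continue
                # association score = semantic similarity + alpha * IoU (assumed form)
                sem = F.cosine_similarity(feats[n, k].unsqueeze(0), feats[l], dim=1)
                iou = torch.stack([box_iou(boxes[n, k], boxes[l, j]) for j in range(K)])
                j_best = int(torch.argmax(sem + alpha * iou))
                matched.append(feats[l, j_best])
            # average pooling over the region and its matches (fewer than 2L at clip borders)
            dyn[n, k] = torch.stack(matched).mean(dim=0)
    return dyn
```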
In one implementation of the invention, a method of preprocessing a query statement is provided.
Firstly, obtaining semantic feature sets of all words in a query sentence by adopting a Bi-GRU network, and extracting semantic features of nouns from the semantic feature sets; and then adopting an attention method to further obtain the object characteristics in the query statement.
Specifically, a Bi-GRU network is first adopted to obtain the semantic feature set S = {s_m} of the words in the query sentence, wherein s_m denotes the semantic feature of the m-th word and M denotes the number of words in the query sentence;

all nouns in the query sentence are marked as objects by the NLTK tool, and the semantic features of the objects are extracted from the word semantic feature set S; the context of each object in the query sentence is then aggregated by an attention method to obtain the object features:

e_{t,m} = w_s^T·tanh(W_s·[s^o_t; s_m] + b_s)
β_{t,m} = exp(e_{t,m}) / Σ_{m'} exp(e_{t,m'})
o_t = Σ_m β_{t,m}·s_m

wherein W_s denotes a projection matrix, b_s denotes a bias vector, w_s denotes a row vector, s^o_t denotes the semantic feature of the t-th noun (object), β_{t,m} denotes the attention weight of the m-th word with respect to the t-th object, and o_t denotes the feature of the t-th object; the object features form the set O = {o_t}, wherein T denotes the number of objects in the query sentence. In practice, the first noun of the sentence is taken by default as the object to be queried; if the sentence is interrogative, this noun is "who", "what", etc. Thus o_1 denotes the feature of the main object (i.e., the query object), and {o_2, …, o_T} denote the auxiliary object features.
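The object-feature extraction described above can be sketched as follows: a Bi-GRU encodes the words, NLTK's part-of-speech tagger marks the nouns, and additive attention over the word features yields one feature per object. The class name, layer sizes and the exact attention parameterization are illustrative assumptions consistent with the reconstruction given above.

```python
import nltk          # requires nltk.download('averaged_perceptron_tagger')
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bigru = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)
        d = 2 * hidden                           # word feature size
        self.W_s = nn.Linear(2 * d, d)           # projection matrix W_s and bias b_s
        self.w_s = nn.Linear(d, 1, bias=False)   # row vector w_s

    def forward(self, word_ids, words):
        # word semantic features s_m
        s, _ = self.bigru(self.embed(word_ids))   # (1, M, d)
        s = s.squeeze(0)                          # (M, d)
        # mark nouns as objects with the NLTK POS tagger (assumes at least one noun)
        tags = nltk.pos_tag(words)
        obj_idx = [m for m, (_, tag) in enumerate(tags) if tag.startswith('NN')]
        objects = []
        for t in obj_idx:
            pair = torch.cat([s[t].expand_as(s), s], dim=-1)      # [s^o_t ; s_m]
            e = self.w_s(torch.tanh(self.W_s(pair))).squeeze(-1)  # attention logits
            beta = torch.softmax(e, dim=0)                        # beta_{t,m}
            objects.append((beta.unsqueeze(-1) * s).sum(dim=0))   # o_t
        return torch.stack(objects)   # (T, d): first row = main object, rest = auxiliary
```

For the example sentence "child kicks a ball", the first returned feature would correspond to "child" (the query object) and the remaining object features to auxiliary objects such as "ball".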
In one implementation of the invention, a specific operation process of the object-aware multi-branch relation network is given.
As shown in fig. 2, the object-aware multi-branch relation network is composed of a main branch, T-1 auxiliary branches and a multi-branch relation reasoning module, and each branch includes an object-aware modulation layer, a cross-modal matching layer and a softmax function layer.
Firstly, the dynamic region features of the video frames obtained in step one and the t-th object feature in the query sentence obtained in step two are used as the input of the object-aware modulation layer of the t-th branch to obtain the object-aware region features of the regions in the video, wherein when t = 1 the branch is the main branch, and when t ∈ {2, 3, …, T} the branch is an auxiliary branch;
then calculating a matching score between the object perception region characteristics of the region in the video and the object characteristics in the query statement through a cross-modal matching layer, and processing the matching score through a softmax function layer;
and finally, taking the matching scores output by the main branch and the T-1 auxiliary branches and processed by the softmax function as the input of a multi-branch relation reasoning module to obtain the object perception multi-branch characteristics of the region.
Specifically, the object-aware modulation layer is used to calculate the object-aware region features, i.e., to enhance the region features related to the target object and weaken the unrelated region features. A modulation gate and a modulation bias vector are first calculated:

γ_t = tanh(W_γ·o_t + b_γ)
ε_t = tanh(W_ε·o_t + b_ε)

wherein W_γ, W_ε, b_γ, b_ε are parameter matrices and bias vectors, γ_t denotes the modulation gate corresponding to the t-th object in the query sentence, ε_t denotes the modulation bias vector of the t-th object, ⊙ denotes element-wise multiplication, r̂_{n,k} denotes a dynamic region feature, and o_t denotes the t-th object feature in the query sentence;

all region features are then modulated using the following formula:

r̃^t_{n,k} = γ_t ⊙ r̂_{n,k} + ε_t

wherein r̃^t_{n,k} denotes the object-aware region feature of the k-th region of the n-th video frame in the t-th branch.
The cross-modal matching layer is used to calculate the matching scores between the object-aware region features of the regions in the video and the object features in the query sentence:

m^t_{n,k} = w_c^T·tanh(W_c·[r̃^t_{n,k}; o_t] + b_c)

wherein w_c denotes a row vector, W_c denotes a parameter matrix, b_c denotes a parameter vector, and m^t_{n,k} denotes the matching score between the k-th region of the n-th video frame and the t-th object in the query sentence.

The softmax function layer is used to process m^t_{n,k}:

m̄^t_{n,k} = exp(m^t_{n,k}) / Σ_{k'=1}^{K} exp(m^t_{n,k'})

and the normalized scores form the set m̄^t_n = {m̄^t_{n,k}}; this result participates in the subsequent multi-branch relation reasoning and diversity loss calculations.
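A single branch (object-aware modulation followed by cross-modal matching and softmax normalization) can be sketched as below; the concatenation inside the matching layer and the default dimension are assumptions consistent with the reconstruction above, not a verbatim transcription of the claimed implementation.

```python
import torch
import torch.nn as nn

class ObjectAwareBranch(nn.Module):
    """One branch of the OMRN: modulation, cross-modal matching, softmax."""
    def __init__(self, d=256):
        super().__init__()
        self.W_gamma = nn.Linear(d, d)          # gamma_t = tanh(W_gamma o_t + b_gamma)
        self.W_eps = nn.Linear(d, d)            # eps_t   = tanh(W_eps   o_t + b_eps)
        self.W_c = nn.Linear(2 * d, d)          # matching layer parameters W_c, b_c
        self.w_c = nn.Linear(d, 1, bias=False)  # row vector w_c

    def forward(self, r_hat, o_t):
        # r_hat: (N, K, d) dynamic region features, o_t: (d,) object feature
        gamma = torch.tanh(self.W_gamma(o_t))          # modulation gate
        eps = torch.tanh(self.W_eps(o_t))              # modulation bias vector
        r_tilde = gamma * r_hat + eps                  # object-aware region features
        o_exp = o_t.expand(*r_tilde.shape[:-1], -1)    # broadcast o_t to every region
        m = self.w_c(torch.tanh(self.W_c(torch.cat([r_tilde, o_exp], dim=-1))))
        m_bar = torch.softmax(m.squeeze(-1), dim=-1)   # normalize over the K regions
        return r_tilde, m_bar                          # (N, K, d), (N, K)
```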
The multi-branch relation reasoning module is used to reason about the relationships between the main object (the query object) and the auxiliary objects and to obtain the object-aware multi-branch features of the regions; it comprises the following steps:

1) in order to extract cues of the relevant regions from the auxiliary branches and integrate them into the main branch, the attention weight between any region in a video frame of the main branch and any region in the video frames of the T-1 auxiliary branches is first calculated: the attention weight a^t_{n,k,i} is computed from the object-aware region feature r̃^1_{n,k} of the k-th region of the n-th video frame in the main branch, the object-aware region feature r̃^t_{n,i} of the i-th region of the n-th video frame in the t-th branch, and the relative position vector p^t_{n,k,i} of the two regions, which is derived from their bounding boxes; a^t_{n,k,i} is then processed by a softmax function layer to obtain the normalized attention weight ā^t_{n,k,i};

2) the integrated feature contributed by all regions of the t-th branch to any region of the main branch is obtained from the regions of the auxiliary branch related to the t-th object of the query sentence:

u^t_{n,k} = Σ_{i=1}^{K} ā^t_{n,k,i}·m̄^t_{n,i}·r̃^t_{n,i}

wherein u^t_{n,k} denotes the integrated feature of all regions of the t-th branch (t ≥ 2) for the k-th region of the n-th video frame in the main branch, and m̄^t_{n,i} is the region-object matching score calculated in the cross-modal matching layer. By introducing the matching score into the calculation, irrelevant regions can be filtered out of the relation estimation between the main and auxiliary branches, while the relation model is strengthened for the regions of the auxiliary branch with higher matching scores.

Based on the integrated features u^t_{n,k} of each region, the object-aware multi-branch feature set V = {v_{n,k}} of all regions is further obtained by combining the main-branch feature r̃^1_{n,k} of each region with the integrated features u^2_{n,k}, …, u^T_{n,k} through a linear layer followed by the ReLU(·) activation function (linear rectification function), wherein v_{n,k} denotes the object-aware multi-branch feature of the k-th region of the n-th video frame.
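The reasoning step can be sketched as follows. The additive attention over region pairs is an assumed form (the description above only states that the attention weight depends on the two region features and their relative position vector), and num_objects is assumed to equal the number of branches T; the weighting of auxiliary regions by their matching scores and the final ReLU combination follow the description above.

```python
import torch
import torch.nn as nn

class MultiBranchReasoning(nn.Module):
    """Integrate auxiliary-branch cues into every main-branch region."""
    def __init__(self, d=256, pos_dim=4, num_objects=2):
        super().__init__()
        self.att = nn.Sequential(nn.Linear(2 * d + pos_dim, d), nn.Tanh(),
                                 nn.Linear(d, 1, bias=False))   # attention scorer (assumed form)
        self.out = nn.Linear(num_objects * d, d)                # combine main + integrated features

    def forward(self, r_tilde, m_bar, rel_pos):
        # r_tilde: (T, N, K, d) per-branch object-aware region features (branch 0 = main)
        # m_bar:   (T, N, K)    softmax-normalized region-object matching scores
        # rel_pos: (T, N, K, K, pos_dim) relative position vectors p^t_{n,k,i}
        T, N, K, d = r_tilde.shape
        main = r_tilde[0]                                        # (N, K, d)
        parts = [main]
        for t in range(1, T):
            aux = r_tilde[t]                                     # (N, K, d)
            pair = torch.cat([main.unsqueeze(2).expand(N, K, K, d),
                              aux.unsqueeze(1).expand(N, K, K, d),
                              rel_pos[t]], dim=-1)
            a = self.att(pair).squeeze(-1)                       # (N, K, K) attention logits
            a_bar = torch.softmax(a, dim=-1)                     # normalize over auxiliary regions i
            w = a_bar * m_bar[t].unsqueeze(1)                    # filter irrelevant auxiliary regions
            parts.append(torch.einsum('nki,nid->nkd', w, aux))   # integrated features u^t_{n,k}
        v = torch.relu(self.out(torch.cat(parts, dim=-1)))       # object-aware multi-branch features
        return v                                                 # (N, K, d)
```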
In order to make each branch focus only on the object associated with it, a diversity loss function of the object-aware multi-branch relation network is designed:

L_div = (1/Z)·Σ_{n∈S_gt} Σ_{t≠t'} (m̄^t_n)^T·m̄^{t'}_n

wherein S_gt denotes the set of frames in the ground-truth segment, Z denotes a normalization factor, and m̄^t_n and m̄^{t'}_n denote the softmax-normalized matching score sets of any two different objects in the query sentence over all regions of the n-th video frame. Minimizing the diversity loss causes each branch to allocate more attention to the regions that match its corresponding object.
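Under the pairwise-overlap form of the diversity loss given above (the exact normalization factor Z is an assumption), its computation can be sketched as:

```python
import torch

def diversity_loss(m_bar, gt_frame_mask):
    """m_bar: (T, N, K) softmax-normalized matching scores of the T objects.
    gt_frame_mask: (N,) boolean mask of frames inside the ground-truth segment."""
    m = m_bar[:, gt_frame_mask]                    # restrict to frames of the truth segment
    T, N_gt, K = m.shape
    overlap = torch.einsum('tnk,snk->tsn', m, m)   # pairwise dot products per frame
    off_diag = overlap.sum() - torch.einsum('tnk,tnk->', m, m)  # drop the t == t' terms
    Z = max(T * (T - 1) * N_gt, 1)                 # normalization factor (assumed)
    return off_diag / Z
```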
A specific application of the spatio-temporal locator is given in one implementation of the present invention.
A spatial locator is built that uses the object-aware multi-branch feature v_{n,k} of each region and the main-object feature o_1 to calculate the confidence score of any region in a video frame:

ρ_{n,k} = σ( (W_r·v_{n,k})^T·(W_o·o_1) )

wherein σ(·) is the sigmoid function, ρ_{n,k} denotes the spatial confidence score of the k-th region of the n-th video frame, and W_r and W_o denote parameter matrices.

The loss function L_s of the spatial locator is computed over the frames of the ground-truth segment S_gt from the spatial confidence scores ρ_{n,k} and the IoU scores IoU^s_{n,k} between the k-th region of the n-th video frame and its corresponding ground-truth region.
In the temporal locator, the frame-level object-aware features of the video are obtained from the object-aware multi-branch features of the regions, aggregated by spatial attention together with the main-object feature o_1:

ω_{n,k} = softmax_k( w_f^T·tanh(W_f·[v_{n,k}; o_1] + b_f) )
f_n = Σ_{k=1}^{K} ω_{n,k}·v_{n,k}

wherein f_n denotes the object-aware feature of the n-th video frame, w_f denotes a row vector, and W_f and b_f denote a parameter matrix and a bias; another Bi-GRU is then used to learn the object-aware context features h_n of all frames.

A set of candidate segments is defined at each video frame, wherein w_h denotes the width of the h-th candidate segment at each video frame and H denotes the number of candidate segments; i.e., each frame is taken as a sampling center, and the (w_h - 1)/2 frames before and after it compose a candidate segment together with it, so that H candidate segments are obtained according to the H different widths w_h. All candidate segments are estimated through a linear layer with a sigmoid function, and the boundary offsets are generated at the same time:

c_n = σ(W_c·h_n + b_c)
l_n = W_l·h_n + b_l

wherein c_n denotes the temporal confidence scores of the H candidate segments at the n-th video frame, l_n denotes the offsets of the H candidate segments, W_c and W_l are parameter matrices, b_c and b_l are biases, and σ(·) is the sigmoid function;
the temporal locator has two losses: the alignment loss of candidate-segment selection and the regression loss of boundary adjustment. The alignment loss L_align compares the temporal confidence score c^h_n of each candidate segment, i.e., the h-th element of c_n, with the temporal IoU score IoU^t_{n,h} between the h-th candidate segment at the n-th video frame and the ground-truth segment;

the segment with the largest temporal confidence score is then selected; its temporal boundary is (s, e) and its offsets are (l_s, l_e). The ground-truth offsets (l̂_s, l̂_e) of this candidate segment are first calculated from the ground-truth boundary (ŝ, ê), and the regression loss is:

L_reg = R(l_s - l̂_s) + R(l_e - l̂_e)

wherein R denotes the smooth L1 function.
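The temporal locator can be sketched as below. The Bi-GRU over frame features and the sigmoid scoring follow the description; writing the alignment loss as a binary cross-entropy against the temporal IoU scores is an assumption, since the description only states that the confidence scores are compared with the temporal IoU scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

WIDTHS = [3, 9, 17, 33, 65, 97, 129, 165, 197]   # candidate-segment widths from the embodiment

class TemporalLocator(nn.Module):
    def __init__(self, d=256, hidden=128, num_widths=len(WIDTHS)):
        super().__init__()
        self.bigru = nn.GRU(d, hidden, bidirectional=True, batch_first=True)
        self.score = nn.Linear(2 * hidden, num_widths)        # temporal confidence scores c_n
        self.offset = nn.Linear(2 * hidden, 2 * num_widths)   # boundary offsets l_n

    def forward(self, f):                  # f: (1, N, d) frame-level object-aware features
        h, _ = self.bigru(f)               # object-aware context features h_n
        c = torch.sigmoid(self.score(h))   # (1, N, H)
        l = self.offset(h).view(1, f.size(1), -1, 2)   # (1, N, H, 2) start/end offsets
        return c, l

def temporal_losses(c, l, t_iou, gt_offsets):
    """c: (1, N, H) scores, t_iou: (1, N, H) temporal IoU with the truth segment,
    l: (1, N, H, 2) predicted offsets, gt_offsets: (2,) truth offsets of the best candidate."""
    align = F.binary_cross_entropy(c, t_iou)           # alignment loss (assumed BCE form)
    n, h = divmod(int(torch.argmax(c)), c.size(-1))    # candidate with the highest confidence
    reg = F.smooth_l1_loss(l[0, n, h], gt_offsets)     # regression loss with smooth L1
    return align, reg
```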
In one embodiment of the present invention, a specific form of the multitasking loss function is presented.
L = λ1·L_s + λ2·L_align + λ3·L_reg + λ4·L_div

wherein L denotes the multi-task loss function, λ1, λ2, λ3, λ4 are hyper-parameters that balance the four losses, L_s denotes the loss function of the spatial locator, L_align denotes the alignment loss function of the temporal locator, L_reg denotes the regression loss function of the temporal locator, and L_div denotes the diversity loss function of the object-aware multi-branch relation network.
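The combination itself is a plain weighted sum; a minimal sketch follows (the default weights are placeholders, since the embodiment's printed list of λ values is ambiguous):

```python
def multitask_loss(l_s, l_align, l_reg, l_div,
                   lambdas=(1.0, 1.0, 0.001, 1.0)):   # placeholder weights
    """Weighted sum of the spatial, alignment, regression and diversity losses."""
    return (lambdas[0] * l_s + lambdas[1] * l_align
            + lambdas[2] * l_reg + lambdas[3] * l_div)
```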
In another embodiment of the present invention, a positioning system for completing a task of positioning a specified object in a video by using an object-aware multi-branch relationship network is provided, where the positioning system is configured to implement the above method for completing the task of positioning the specified object in the video, and the method includes:
the video preprocessing module: used to extract the region features of different frames from the video and to calculate the association scores between any region feature in a video frame and all region features in the video frames of its neighboring interval; to extract, in each video frame of the neighboring interval, the region feature with the highest association score as a matched region feature; and to perform average pooling on the region feature and its matched region features to obtain the dynamic region features of the video frames;
the query statement preprocessing module: the method comprises the steps of acquiring a semantic feature set of all words in a query sentence, extracting semantic features of nouns from the semantic feature set, and further obtaining object features in the query sentence by adopting an attention method;
a video clip positioning module: the video segment positioning module comprises a modeling submodule and a training submodule, wherein the modeling submodule is configured with an object perception multi-branch relation model, a space positioning model and a time positioning model, the object perception multi-branch relation model is used for extracting object perception multi-branch characteristics of an area, the space positioning model is used for realizing the positioning of a space pipeline, the time positioning model is used for realizing the positioning of a time pipeline, and the training submodule is configured with a multi-task loss function;
an output module: for outputting the positioning result.
Wherein, the video preprocessing module comprises:
the fast R-CNN sub-module: configuring a trained Faster R-CNN model, and taking a section of video as input to obtain the regional characteristics of the video, wherein the representation method of the regional characteristics is not repeated;
the temporal region aggregation submodule: used to aggregate the preceding L frames and the following L frames of the frame of any region into a video-frame set, and to calculate the association score between the region feature r_{n,k} and any region feature r_{l,j} in the video-frame set; the calculation formula is not repeated here;
the pooling submodule: used to perform average pooling over the region feature to be matched and the region features with the highest association scores in each video frame of the video-frame set, and to output the dynamic region features.
The query statement preprocessing module comprises:
Bi-GRU submodule: configuring a trained Bi-GRU network for acquiring semantic features of words in query sentences;
the labeling submodule: used to mark all nouns in the query sentence as objects in the query sentence;
the feature extraction submodule: used to extract the semantic features of the objects in the query sentence from the semantic features of the words in the query sentence and to calculate the object features in the query sentence; the calculation formula is not repeated here.
The object-aware multi-branch relation model is composed of a main branch, T-1 auxiliary branches and a multi-branch relation reasoning layer, and each branch comprises an object-aware modulation layer, a cross-modal matching layer and a softmax function layer; the calculation formulas of the object-aware modulation layer, the cross-modal matching layer, the softmax function layer and the multi-branch relation reasoning layer are not repeated here;
the multi-task loss function in the training submodule is as follows:

L = λ1·L_s + λ2·L_align + λ3·L_reg + λ4·L_div

wherein λ1, λ2, λ3, λ4 are hyper-parameters that balance the four losses; L_s denotes the loss function of the spatial locator, computed over the frames of the ground-truth segment S_gt from the spatial confidence scores and the IoU scores IoU^s_{n,k} between the k-th region of the n-th video frame and its corresponding ground-truth region; L_align denotes the alignment loss function of the temporal locator, computed from the temporal confidence score c^h_n (the h-th element of c_n) of the h-th candidate segment at the n-th video frame and the temporal IoU score IoU^t_{n,h} between that candidate segment and the ground-truth segment; L_reg denotes the regression loss function of the temporal locator,

L_reg = R(l_s - l̂_s) + R(l_e - l̂_e)

wherein R denotes the smooth L1 function; and L_div denotes the diversity loss function of the object-aware multi-branch relation network,

L_div = (1/Z)·Σ_{n∈S_gt} Σ_{t≠t'} (m̄^t_n)^T·m̄^{t'}_n

wherein S_gt denotes the set of frames in the ground-truth segment, Z denotes a normalization factor, and m̄^t_n and m̄^{t'}_n denote the softmax-normalized matching score sets of any two different objects in the query sentence over all regions of the n-th video frame.
Examples
In order to show the experimental effect of the present invention, this embodiment provides a comparative experiment, and the implementation method is the same as the process described above, and only specific implementation details are given here, and the repeated process is not described again.
This embodiment uses the VidSTG dataset proposed by Zhang et al. (Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, and Lianli Gao. Where does it exist: Spatio-temporal video grounding for multi-form sentences. arXiv preprint). The dataset is constructed by annotating natural language descriptions on top of the video object relation dataset VidOR (Xindi Shang, Donglin Di, Junbin Xiao, Yu Cao, Xun Yang, and Tat-Seng Chua. Annotating objects and relations in user-generated videos. In ICMR, pages 279-287. ACM). VidSTG is the only existing dataset based on unaligned sentence-video segment pairs. It contains 5563 videos in the training set, 618 videos in the validation set and 743 videos in the test set, with a total of 99943 sentences (44808 declarative sentences and 55135 interrogative sentences) and 80 query objects. The average duration of the videos is 28.01 seconds, and the declarative and interrogative sentences contain on average 11.12 and 8.98 words, respectively.
The implementation details are as follows:
In the video preprocessing, 1024-d region features are extracted by the pre-trained Faster R-CNN, 5 frames per second are sampled, and K = 20 regions are kept per frame. In the query sentence preprocessing, word embedding is performed with pre-trained GloVe vectors, giving 300-d vectors that express the word features, and the nouns in the sentences are identified with NLTK. In the modeling, α is set to 0.6, L is set to 5, and λ1, λ2, λ3, λ4 are set to 1.0, 0.001 and 1.0, respectively. At each step, 9 candidate segments are defined, whose temporal widths are [3, 9, 17, 33, 65, 97, 129, 165, 197]. The dimensions of all projection matrices and biases are set to 256, and the hidden state of each direction in the Bi-GRU is set to 128. The model is trained in this embodiment with an Adam optimizer with an initial learning rate of 0.0005.
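For reference, the settings listed above can be collected in one configuration object; the field names are illustrative, and the loss-weight list is reproduced as printed (four weights are named but only three values are listed).

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class OMRNConfig:
    region_feat_dim: int = 1024        # Faster R-CNN region features
    fps_sampled: int = 5               # frames sampled per second
    regions_per_frame: int = 20        # K
    word_emb_dim: int = 300            # pre-trained GloVe vectors
    alpha: float = 0.6                 # balance coefficient in the association score
    neighbor_frames: int = 5           # L
    candidate_widths: Tuple[int, ...] = (3, 9, 17, 33, 65, 97, 129, 165, 197)
    proj_dim: int = 256                # projection matrices and biases
    gru_hidden_per_dir: int = 128      # Bi-GRU hidden size per direction
    optimizer: str = "Adam"
    learning_rate: float = 5e-4
    # loss weights as printed in the embodiment (four weights, three listed values)
    loss_weights: Tuple[float, ...] = (1.0, 0.001, 1.0)
```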
The performance evaluation method comprises the following steps:
The temporal localization performance is evaluated with the standard m_tIoU, and the spatio-temporal accuracy is evaluated with m_vIoU and vIoU@R. m_tIoU denotes the mean temporal IoU between the selected segments and the ground truth, and vIoU denotes the spatio-temporal IoU between the predicted spatio-temporal tube and the ground truth, calculated as:

vIoU = (1 / |S_p ∪ S_gt|)·Σ_{n ∈ S_p ∩ S_gt} IoU(r_n, r̂_n)

wherein S_p denotes the set of frames in the predicted segment, S_gt denotes the set of frames in the ground-truth segment, and r_n and r̂_n respectively denote the predicted and ground-truth regions in the n-th frame. m_vIoU is the average vIoU over all test samples, and vIoU@R is the proportion of test samples whose vIoU is greater than R.
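A small sketch of the vIoU and temporal IoU computations as defined above, with segments represented as sets of frame indices and a box-IoU function passed in (for example the box_iou helper sketched earlier):

```python
def viou(pred_frames, gt_frames, pred_boxes, gt_boxes, box_iou):
    """pred_frames, gt_frames: sets of frame indices of the predicted / truth segments.
    pred_boxes, gt_boxes: dicts frame index -> box. box_iou: IoU function for two boxes."""
    inter = pred_frames & gt_frames
    union = pred_frames | gt_frames
    if not union:
        return 0.0
    return sum(box_iou(pred_boxes[n], gt_boxes[n]) for n in inter) / len(union)

def tiou(pred_frames, gt_frames):
    """Temporal IoU between the predicted and truth segments."""
    inter = len(pred_frames & gt_frames)
    union = len(pred_frames | gt_frames)
    return inter / union if union else 0.0
```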
And (3) comparing the performances:
this implementation is compared to the present method using four existing methods.
Since localization (STVG) is an emerging task in Spatio-temporal video, the only methods that can solve this problem based on non-aligned data are STGRN methods (document 1: Zhu Zhuang, Zhou Zhuao, Yang Zhuao, Qi Wang, Huashengliu, and Lianli Gao. where does exist: space-temporal video grouping for multi-form transmitters. arxiv. print.). In addition, the temporal localization of a given description object in unaligned sentence-video segment pairs can be accomplished due to TALL (document 2: Jiyang Gao, ChenSun, Zhenheng Yang, and Ram New tia. TALL: temporal activity localization visual angle query. in ICCV, pages 5277 and 5285.) and L-Net (document 3: Jingyuan Chen, Lin Ma, Xinpen Chen, Zequn Jie, and Jiebo Luo.Localizing natural language in video. InAAI.). These two methods are therefore combined with the STVG method based on alignment data as a baseline. Where TALL and L-Net are used to determine temporal boundaries, the STVG method based on alignment data is used to further retrieve spatio-temporal pipelines. TALL uses the proposal-selection framework for time-positioning, and L-Net uses verbatim interaction for overall segment selection.
Grounder (document 4: Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevorgarell, and Bernt Schile. group-ing of textual phenols in images byrconstruction. in ECCV, pages 817-834.) can extract the target region in each frame of a given predicted segment. STPR (document 5: Masataka Yamaguchi, Kuniaki Saito, Yoshitaka Ushiku, and TatsuyaHarada. spread-temporal person statistical visual field natural visual requirements. in ICCV, pages 1453 and 1462.) and WSSTG (document 6: Zhenfang Chen, Lin Ma, Wenhan Luo, and KWan-Yee KWong. Weak-superimposed spread-temporal grouping physical sensory input. the channels were pre-generated using channels in the fragments and evaluated by cross-modal ranking. The initial STPR is applicable to the localization of people in multiple videos, which is extended in this embodiment to the localization of multiple objects in a single video. The initial WSSTG uses weakly supervised ranking penalties, which are replaced in this example by supervised triple penalties (document 7: Sibei Yang, Guinbin Li, and Yizhou Yu. Cross-modal relationship for group expression. in CVPR, pages 4145 and 4154). In this example, Grounder, STPR, WSSTG were evaluated for binding to TALL or L-Net, respectively. The evaluation results are given in tables 1 to 3, wherein OMRN represents the method of the present invention.
Table 1: performance evaluation results based on VidSTG dataset
Figure BDA0002482601270000181
Table 2: evaluation results based on time alignment truth (tem. gt represents time alignment truth)
Figure BDA0002482601270000182
Table 3: ablation results based on VidSTG dataset
Figure BDA0002482601270000183
And (4) evaluation results:
according to table 1, the temporal positioning and the spatiotemporal positioning performance of the interrogative sentences in all models are lower than those of the declarative sentences, namely, the m _ tIoU score obtained according to the positioning of the interrogative sentences is lower than that obtained according to the positioning of the declarative sentences, which indicates that the positioning of the unknown objects lacking the explicit characteristics is more difficult.
For time alignment accuracy, according to table 1, the m _ tlou scores for STGRN and OMRN (method of the invention) were 48.47%, 50.73%, respectively, whereas the m _ tlou score for grouder, STPR, WSSTG combined with TLL was 34.63%, and the m _ tlou score for L-Net combined was 40.86%. This is because the time-localization method based on region modeling adopted by the invention and STGRN is superior to the time-localization method based on video frame modeling by TALL and L-Net, which shows that fine-grained region modeling is helpful for determining the accurate time boundary of the target pipeline. Compared with the STGRN method of the same type, the method has higher time positioning accuracy.
For the accuracy of space-time positioning, according to table 1, in all models, the m _ vIoU, the vIoU @0.3, and the vIoU @0.5 scores of the invention all obtain the highest score. The invention can accurately capture the dynamic information of different objects among frames through time region aggregation, and can perform fine-grained object perception through object perception modulation and cross-modal matching. While the group r + {. cndot. } method ignores the temporal dynamics of the object, the performance is worst, which indicates that for high quality spatio-temporal localization, capturing the object dynamics between frames is very important.
As can be seen from table 1, in all standards, the method of the present invention achieves significant performance improvement compared to other methods, which indicates that the method of the present invention can effectively focus on the key region through object-aware multi-branch region modeling with diversity loss, and capture the key object relationship through multi-branch reasoning.
In addition, to compare the performance of the present invention and other methods in aligned sentence-video segment pairs. This embodiment evaluates the comparison of the spatial localization performance of the present invention with other methods given a time truth (i.e. where temporal localization has been given). As shown in Table 2, the m _ vIoU, vIoU @0.3 and vIoU @0.5 scores of the invention all obtained the highest scores. This indicates that the present invention still has higher performance on aligned sentence-video segment pairs.
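The metrics discussed above can be made concrete with a short sketch. It computes the temporal IoU of predicted and ground-truth segments, the vIoU of a predicted spatio-temporal pipeline (spatial IoU summed over the frames shared by prediction and ground truth and averaged over the union of their frames), and the aggregates m_tIoU, m_vIoU, and vIoU@R. The function names and the data layout are illustrative assumptions rather than part of the patent.

```python
# Illustrative evaluation helpers (assumed data layout, not from the patent).

def temporal_iou(pred, gt):
    """Temporal IoU between two segments given as (start, end) frame indices."""
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """Spatial IoU between two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def viou(pred_tube, gt_tube):
    """pred_tube / gt_tube: dicts mapping frame index -> bounding box of the pipeline."""
    union_frames = set(pred_tube) | set(gt_tube)
    inter_frames = set(pred_tube) & set(gt_tube)
    if not union_frames:
        return 0.0
    return sum(box_iou(pred_tube[n], gt_tube[n]) for n in inter_frames) / len(union_frames)

def summarize(samples, thresholds=(0.3, 0.5)):
    """samples: list of dicts with keys pred_seg, gt_seg, pred_tube, gt_tube."""
    m_tiou = sum(temporal_iou(s["pred_seg"], s["gt_seg"]) for s in samples) / len(samples)
    vious = [viou(s["pred_tube"], s["gt_tube"]) for s in samples]
    m_viou = sum(vious) / len(vious)
    viou_at = {r: sum(v >= r for v in vious) / len(vious) for r in thresholds}
    return m_tiou, m_viou, viou_at
```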
Finally, this embodiment performs ablation studies. The following variants are evaluated: ① removing the object-aware modulation module from all branches (w/o. OM); ② removing the diversity loss from the multitask loss function (w/o. DL); ③ removing the cross-modal matching blocks from all branches together with the corresponding weighting terms in the multi-branch relational inference (w/o. CM), in which case the diversity loss is also disabled because the matching scores provided by the cross-modal matching blocks are no longer available; ④ removing the temporal region aggregation module from region modeling (w/o. TA); ⑤ removing the context awareness module from object extraction (w/o. CA).
The performance of each ablated model is shown in table 3. The performance of all ablation models is lower than that of the complete model, indicating that each component contributes to localization accuracy. The first three variants degrade the most, which shows that object-aware multi-branch relational reasoning plays a crucial role in spatio-temporal localization accuracy. The w/o. CM result indicates that cross-modal matching with diversity regularization is important for merging the region features relevant to the linguistic description from the auxiliary branches into the main branch.

Claims (10)

1. The method for completing the positioning task of the specified object in the video by utilizing the object-aware multi-branch relation network is characterized by comprising the following steps of:
s1: extracting regional characteristics of different frames from a video aiming at a section of video, and calculating association scores between any regional characteristic in the video frame and all regional characteristics in the video frame in an adjacent interval; extracting the regional characteristics with the highest matching scores in each video frame in the adjacent interval as matching regional characteristics, and performing average pooling on any regional characteristic in the video frames and the matching regional characteristics to obtain the dynamic regional characteristics of the video frames;
s2: aiming at a query sentence, firstly, obtaining semantic feature sets of all words in the query sentence by adopting a Bi-GRU network, and extracting semantic features of nouns from the semantic feature sets; then adopting an attention method to further obtain object characteristics in the query statement;
s3: constructing an object-aware multi-branch relation network, wherein the object-aware multi-branch relation network is composed of a main branch, T-1 auxiliary branches and a multi-branch relation reasoning module, and each branch comprises an object-aware modulation layer, a cross-mode matching layer and a softmax function layer;
First, the dynamic region features of the video frames obtained in step S1 and the t-th object feature in the query sentence obtained in step S2 are used as the input of the object-aware modulation layer of the t-th branch to obtain the object-aware region features of the regions in the video, where the branch with t = 1 is the main branch and the branches with t ∈ {2, 3, …, T} are the auxiliary branches; then the matching score between the object-aware region features of the regions in the video and the object features in the query sentence is calculated by the cross-modal matching layer, and the matching score is processed by the softmax function layer; finally, the object-aware region features of the regions output by the main branch and the T-1 auxiliary branches, together with the softmax-processed matching scores, are used as the input of the multi-branch relational reasoning module to obtain the object-aware multi-branch features of the regions;
s4: establishing a space-time locator which comprises a space locator and a time locator;
s5: designing a multitask loss function as:
$$\mathcal{L} = \lambda_1\mathcal{L}_s + \lambda_2\mathcal{L}_t^{align} + \lambda_3\mathcal{L}_t^{reg} + \lambda_4\mathcal{L}_{div}$$
where $\lambda_1,\lambda_2,\lambda_3,\lambda_4$ are hyperparameters regulating the balance among the four losses, $\mathcal{L}_s$ denotes the loss function of the spatial locator, $\mathcal{L}_t^{align}$ denotes the alignment loss function of the temporal locator, $\mathcal{L}_t^{reg}$ denotes the regression loss function of the temporal locator, and $\mathcal{L}_{div}$ denotes the diversity loss function of the object-aware multi-branch relation network; the object-aware multi-branch relation network and the spatio-temporal locator are trained in an end-to-end manner according to the multitask loss function;
s6: for a section of video and query sentences to be processed, preprocessing is performed through steps S1 and S2, then the dynamic region features of the obtained video frames and the object features in the query sentences are used as the input of a trained object perception multi-branch relation network, the output of the trained object perception multi-branch relation network is used as the input of a trained space-time locator, and the region corresponding to the minimum value of the multitask loss function is used as the final result to be output.
2. The method for completing the task of locating a specific object in a video according to claim 1, wherein the step S1 specifically comprises:
for a video, extracting regional characteristics through a pre-training Faster R-CNN model
Figure FDA0002482601260000021
Wherein N represents the total number of video frames, K represents the number of regions in each frame of video,
Figure FDA0002482601260000022
representing the feature value of the kth region in the nth frame of the video, using a bounding box corresponding to the spatial position of the region
Figure FDA0002482601260000023
Is shown in which
Figure FDA0002482601260000024
Representing the coordinates of the center of the kth region of the nth frame of video,
Figure FDA0002482601260000025
the width and the height of a k area of an nth frame of the video are represented;
a temporal region aggregation method is adopted: for any region feature $r_n^k$ in a video frame, the preceding L frames and the following L frames are taken as a video frame set, and the association score between any region feature $r_l^j$ in the video frame set and $r_n^k$ is calculated by the formula:
$$e_{n,l}^{k,j} = \cos(r_n^k, r_l^j) + \alpha\,\mathrm{IoU}(b_n^k, b_l^j)$$
where $r_l^j$ and $b_l^j$ denote the j-th region feature and spatial position of the l-th video frame respectively, $l \in [n-L, n+L]$, IoU(·) denotes the IoU score of the two region bounding boxes, and α denotes a balance coefficient;
in each video frame of the video frame set, the region feature with the highest association score with $r_n^k$ is taken as the matching region feature of $r_n^k$; $r_n^k$ is then average-pooled with the extracted 2L matching region features to obtain the dynamic region feature $\bar{r}_n^k$.
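A sketch of the temporal region aggregation of claim 2 follows. The exact association-score formula appears only as an image in the original publication, so the score used below (cosine similarity between region features plus α times the IoU of their bounding boxes) is an assumption; the best-scoring region in each of the 2L neighbouring frames is then average-pooled with the region itself.

```python
import torch
import torch.nn.functional as F

def box_iou(b1, b2):
    """b1: (K, 4), b2: (K2, 4) boxes as (cx, cy, w, h); returns a (K, K2) IoU matrix."""
    def to_xyxy(b):
        return torch.stack([b[:, 0] - b[:, 2] / 2, b[:, 1] - b[:, 3] / 2,
                            b[:, 0] + b[:, 2] / 2, b[:, 1] + b[:, 3] / 2], dim=-1)
    a, c = to_xyxy(b1), to_xyxy(b2)
    lt = torch.max(a[:, None, :2], c[None, :, :2])
    rb = torch.min(a[:, None, 2:], c[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_c = (c[:, 2] - c[:, 0]) * (c[:, 3] - c[:, 1])
    return inter / (area_a[:, None] + area_c[None, :] - inter + 1e-8)

def temporal_region_aggregation(feats, boxes, L=2, alpha=0.5):
    """feats: (N, K, D) region features; boxes: (N, K, 4) as (cx, cy, w, h).
    Returns dynamic region features (N, K, D) by average-pooling each region
    with its best-matching region in each of the 2L neighbouring frames."""
    N, K, D = feats.shape
    dyn = torch.empty_like(feats)
    for n in range(N):
        matched = [feats[n]]                      # include the region itself
        for l in range(max(0, n - L), min(N, n + L + 1)):
            if l == n:
                continue
            sim = F.cosine_similarity(feats[n][:, None, :], feats[l][None, :, :], dim=-1)
            score = sim + alpha * box_iou(boxes[n], boxes[l])   # (K, K) association scores
            best = score.argmax(dim=1)                          # best match per region
            matched.append(feats[l][best])
        dyn[n] = torch.stack(matched, dim=0).mean(dim=0)
    return dyn
```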
3. The method for completing the task of locating a specific object in a video according to claim 1, wherein the step S2 specifically comprises:
For a query sentence, a Bi-GRU network is first adopted to obtain the semantic feature set $S=\{s_m\}_{m=1}^{M}$ of the words in the query sentence, where $s_m$ denotes the semantic feature of the m-th word and M denotes the number of words in the query sentence;
all nouns in the query sentence are marked as objects in the query sentence with the NLTK toolkit, and the semantic features $\{s_t^{o}\}_{t=1}^{T}$ of the objects are extracted from the semantic feature set of the words;
the context of each object in the query sentence is aggregated by an attention method to obtain the object features in the query sentence, with the calculation formulas:
$$u_{t,m} = w_s^{\top}\tanh\!\big(W_s[s_t^{o}; s_m] + b_s\big)$$
$$\beta_{t,m} = \frac{\exp(u_{t,m})}{\sum_{m'=1}^{M}\exp(u_{t,m'})}$$
$$o_t = \sum_{m=1}^{M}\beta_{t,m}\, s_m$$
where $W_s$ denotes a projection matrix, $b_s$ denotes a bias vector, $w_s^{\top}$ denotes a row vector, $\beta_{t,m}$ denotes the attention weight of the m-th word in the query sentence with respect to the t-th object, and $o_t$ denotes the t-th object feature in the query sentence; the object features form the object feature set $O=\{o_t\}_{t=1}^{T}$, where T denotes the number of objects in the query sentence, $o_1$ denotes the main object feature, and $\{o_2, \ldots, o_T\}$ denote the auxiliary object features.
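A sketch of the object-feature extraction of claim 3: Bi-GRU word features, nouns marked as objects with NLTK, and per-object attention over the whole sentence. The concrete attention score (a tanh projection of the object-word pair followed by a softmax) is an assumption, since the formulas are given as images in the original publication; the word embeddings are assumed to be supplied externally (e.g. pre-trained vectors).

```python
import torch
import torch.nn as nn
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are installed

class ObjectFeatureExtractor(nn.Module):
    def __init__(self, emb_dim=300, hid=256):
        super().__init__()
        self.bigru = nn.GRU(emb_dim, hid, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(4 * hid, hid)        # [object ; word] pair -> hidden
        self.score = nn.Linear(hid, 1, bias=False)

    def forward(self, word_embs, noun_positions):
        """word_embs: (1, M, emb_dim) word embeddings; noun_positions: indices of nouns."""
        s, _ = self.bigru(word_embs)               # (1, M, 2*hid) word semantic features
        s = s.squeeze(0)                           # (M, 2*hid)
        objects = []
        for t in noun_positions:
            pair = torch.cat([s[t].expand_as(s), s], dim=-1)              # (M, 4*hid)
            beta = torch.softmax(self.score(torch.tanh(self.proj(pair))), dim=0)
            objects.append((beta * s).sum(dim=0))  # context-aggregated object feature o_t
        return torch.stack(objects)                # (T, 2*hid), main object first

def find_noun_positions(sentence):
    """Mark nouns as objects using NLTK part-of-speech tags."""
    tokens = nltk.word_tokenize(sentence)
    tags = nltk.pos_tag(tokens)
    return [i for i, (_, tag) in enumerate(tags) if tag.startswith("NN")], tokens
```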
4. The method for completing the task of locating a specific object in a video according to claim 1, wherein the step S3 specifically comprises:
3.1) constructing an object-aware multi-branch relation network, wherein the object-aware multi-branch relation network is composed of a main branch, T-1 auxiliary branches and a multi-branch relation reasoning module, and each branch comprises an object-aware modulation layer, a cross-mode matching layer and a softmax function layer;
3.2) The dynamic region features $\bar{r}_n^k$ obtained in step S1 and the object feature $o_t$ of the query sentence are jointly used as the input of the t-th branch, and the object-aware region features are calculated by the object-aware modulation layer of each branch with the formulas:
$$\gamma_t = \tanh(W_\gamma o_t + b_\gamma)$$
$$\epsilon_t = \tanh(W_\epsilon o_t + b_\epsilon)$$
$$\tilde{r}_n^{t,k} = \gamma_t \odot \bar{r}_n^k + \epsilon_t$$
where $W_\gamma, W_\epsilon, b_\gamma, b_\epsilon$ are parameter matrices and bias vectors, $\gamma_t$ denotes the modulation gate corresponding to the t-th object in the query sentence, $\epsilon_t$ denotes the modulation bias corresponding to the t-th object in the query sentence, ⊙ denotes element-wise multiplication, and $\tilde{r}_n^{t,k}$ denotes the object-aware region feature of the k-th region of the n-th video frame in the t-th branch;
3.3) The matching score between the object-aware region features of the regions in the video and the object features in the query sentence is calculated by the cross-modal matching layer with the formula:
$$m_n^{t,k} = w_c^{\top}\tanh\!\big(W_c[\tilde{r}_n^{t,k}; o_t] + b_c\big)$$
where $w_c^{\top}$ denotes a row vector, $W_c$ denotes a parameter matrix, $b_c$ denotes a parameter vector, and $m_n^{t,k}$ denotes the matching score between the k-th region of the n-th video frame and the t-th object in the query sentence; the scores $m_n^{t,k}$ are then processed by the softmax function layer to obtain $\hat{m}_n^{t,k}$, which form the set $\hat{M}^t=\{\hat{m}_n^{t,k}\}_{n=1,k=1}^{N,K}$;
3.4) The softmax-processed matching scores $\hat{m}_n^{t,k}$ and the object-aware region features $\tilde{r}_n^{t,k}$ output by the main branch and the T-1 auxiliary branches are used as the input of the multi-branch relational reasoning module to obtain the object-aware multi-branch features $u_n^k$ of the regions.
The method specifically comprises the following steps:
3.4.1) calculating the attention weight between any region in the video frame of the main branch and any region in the video frames of the T-1 auxiliary branches, wherein the calculation formula is as follows:
$$a_n^{t,k,l} = w_a^{\top}\tanh\!\big(W_a[\tilde{r}_n^{1,k};\, \tilde{r}_n^{t,l};\, p_n^{k,l}] + b_a\big)$$
where $w_a^{\top}$, $W_a$, and $b_a$ are a row vector, a parameter matrix, and a bias vector, $\tilde{r}_n^{1,k}$ denotes the object-aware region feature of the k-th region of the n-th video frame in the main branch, $\tilde{r}_n^{t,l}$ denotes the object-aware region feature of the l-th region of the n-th video frame in the t-th branch, and $p_n^{k,l}$ and $a_n^{t,k,l}$ denote, respectively, the relative position vector and the attention weight between the k-th region of the n-th video frame in the main branch and the l-th region of the n-th video frame in the t-th branch; the weights $a_n^{t,k,l}$ are then processed by the softmax function layer to obtain $\hat{a}_n^{t,k,l}$;
3.4.2) The integrated feature contributed to any region of the main branch by all regions of the t-th branch, i.e. by the regions relevant to the t-th object of the query sentence in the auxiliary branch, is obtained with the calculation formula:
$$c_n^{t,k} = \sum_{l=1}^{K} \hat{a}_n^{t,k,l}\, \hat{m}_n^{t,l}\, \tilde{r}_n^{t,l}$$
where $c_n^{t,k}$ denotes the integrated feature of all regions of the t-th branch with respect to the k-th region of the n-th video frame in the main branch, and t ≥ 2;
based on the integrated features $c_n^{t,k}$ of the regions in the video frames, the object-aware multi-branch feature set $U=\{u_n^k\}_{n=1,k=1}^{N,K}$ of all regions is further obtained with the calculation formula:
$$u_n^k = \mathrm{ReLU}\!\big(W_u\big[\tilde{r}_n^{1,k};\, c_n^{2,k};\, \ldots;\, c_n^{T,k}\big] + b_u\big)$$
where ReLU(·) denotes the linear rectification function used as the activation function, $W_u$ and $b_u$ are a parameter matrix and a bias vector, and $u_n^k$ denotes the object-aware multi-branch feature of the k-th region of the n-th video frame.
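A sketch of one object-aware branch and the multi-branch fusion of claim 4. The modulation gate and bias follow the tanh formulas of step 3.2; the matching score, the relation attention (the relative-position term is omitted here), and the final fusion are only shown as images in the original publication, so their concrete forms below are assumptions.

```python
import torch
import torch.nn as nn

class ObjectAwareBranch(nn.Module):
    """One branch: object-aware modulation, cross-modal matching, softmax over regions."""
    def __init__(self, region_dim, obj_dim, hid=256):
        super().__init__()
        self.gate = nn.Linear(obj_dim, region_dim)     # produces gamma_t
        self.bias = nn.Linear(obj_dim, region_dim)     # produces epsilon_t
        self.match = nn.Sequential(
            nn.Linear(region_dim + obj_dim, hid), nn.Tanh(), nn.Linear(hid, 1))

    def forward(self, regions, obj):
        """regions: (K, region_dim) dynamic region features of one frame; obj: (obj_dim,)."""
        gamma = torch.tanh(self.gate(obj))
        eps = torch.tanh(self.bias(obj))
        mod = gamma * regions + eps                    # object-aware region features
        score = self.match(torch.cat([mod, obj.expand(len(regions), -1)], dim=-1))
        return mod, torch.softmax(score.squeeze(-1), dim=0)   # (K, D), (K,)

class MultiBranchFusion(nn.Module):
    """Fuses the main branch with match-score-weighted pooling from auxiliary branches.
    num_objects must equal 1 + number of auxiliary branches."""
    def __init__(self, region_dim, num_objects):
        super().__init__()
        self.out = nn.Linear(num_objects * region_dim, region_dim)

    def forward(self, main_mod, aux_mods, aux_scores):
        """main_mod: (K, D); aux_mods: list of (K, D); aux_scores: list of (K,)."""
        parts = [main_mod]
        for mod, w in zip(aux_mods, aux_scores):
            attn = torch.softmax(main_mod @ mod.t(), dim=-1)   # (K, K) relation weights
            parts.append(attn @ (w.unsqueeze(-1) * mod))       # score-weighted integration
        return torch.relu(self.out(torch.cat(parts, dim=-1)))  # (K, D) multi-branch features
```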
5. The method according to claim 4, wherein said object-aware multi-branch relationship network employs a diversity loss function, and the formula is:
$$\mathcal{L}_{div} = \frac{1}{Z}\sum_{n\in S_{gt}}\sum_{t\neq t'}\sum_{k=1}^{K}\hat{m}_n^{t,k}\,\hat{m}_n^{t',k}$$
where $S_{gt}$ denotes the set of frames in the truth segment, $Z$ is a normalization factor, and $\hat{m}_n^{t,k}$ and $\hat{m}_n^{t',k}$ denote the softmax-processed matching scores of any two objects t and t' in the query sentence with respect to all regions of the n-th video frame.
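A sketch of a diversity loss over the softmax-processed matching scores of claim 5. Only the ingredients (truth-segment frames, a normalization factor, and pairs of per-object matching distributions) are stated in the claim, so the concrete overlap penalty below, which discourages two objects from concentrating on the same regions, is an assumption.

```python
import torch

def diversity_loss(match_scores, gt_frames):
    """match_scores: (T, N, K) softmax-over-regions matching scores for T objects;
    gt_frames: list of frame indices inside the truth segment."""
    m = match_scores[:, gt_frames, :]                 # (T, |S_gt|, K)
    num_obj = m.shape[0]
    overlap = torch.einsum('tnk,snk->ts', m, m)       # pairwise region-overlap matrix
    off_diag = overlap - torch.diag_embed(torch.diagonal(overlap))
    denom = max(num_obj * (num_obj - 1), 1) * max(m.shape[1], 1)
    return off_diag.sum() / denom                     # average off-diagonal overlap
```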
6. The method for completing the task of locating a specific object in a video according to claim 1, wherein the step S4 specifically comprises:
4.1) A spatial locator is established: using the object-aware multi-branch features of the regions and the main object feature $o_1$, the confidence score of any region in a video frame is calculated with the formula:
$$\hat{s}_n^k = \sigma\!\big((W_r u_n^k)^{\top}(W_o o_1)\big)$$
where σ is the sigmoid function, $\hat{s}_n^k$ denotes the spatial confidence score of the k-th region of the n-th video frame, and $W_r$ and $W_o$ denote parameter matrices;
the loss function of the spatial locator is:
$$\mathcal{L}_s = -\frac{1}{|S_{gt}|\,K}\sum_{n\in S_{gt}}\sum_{k=1}^{K}\Big(g_n^k\log\hat{s}_n^k + (1-g_n^k)\log(1-\hat{s}_n^k)\Big)$$
where $g_n^k$ denotes the IoU score between the k-th region of the n-th video frame and its corresponding truth region, and $S_{gt}$ denotes the set of frames in the truth segment;
4.2) A temporal locator is established: the object-aware multi-branch features of the regions are aggregated by spatial attention with the main object feature $o_1$ to obtain frame-level object-aware features of the video, with the calculation formulas:
$$\rho_n^k = \mathrm{softmax}_k\!\Big(w_f^{\top}\tanh\!\big(W_1^f u_n^k + W_2^f o_1 + b_f\big)\Big)$$
$$f_n = \sum_{k=1}^{K}\rho_n^k\, u_n^k$$
where $f_n$ denotes the object-aware feature of the n-th video frame, $w_f^{\top}$ denotes a row vector, and $W_1^f$, $W_2^f$, $b_f$ denote parameter matrices and a bias; another Bi-GRU is then used to learn the object-aware context features $\{\hat{f}_n\}_{n=1}^{N}$ of all frames;
a set of candidate segments $\{(n - w_h/2,\, n + w_h/2)\}_{h=1}^{H}$ is defined at each video frame, where $w_h$ denotes the width of the h-th candidate segment at each video frame and H denotes the number of candidate segments; all candidate segments are scored through a linear layer with a sigmoid function, and the offsets of their boundaries are generated at the same time, with the calculation formulas:
$$c_n = \sigma(W_c\hat{f}_n + b_c)$$
$$\delta_n = W_l\hat{f}_n + b_l$$
where $c_n$ denotes the temporal confidence scores of the H candidate segments at the n-th video frame, $\delta_n$ denotes the boundary offsets of the H candidate segments, $W_c$ and $W_l$ are parameter matrices, $b_c$ and $b_l$ are biases, and σ(·) is the sigmoid function;
the temporal locator has two losses: an alignment loss for candidate segment selection and a regression loss for boundary adjustment; the alignment loss formula is:
$$\mathcal{L}_t^{align} = -\frac{1}{NH}\sum_{n=1}^{N}\sum_{h=1}^{H}\Big(t_n^h\log c_n^h + (1-t_n^h)\log(1-c_n^h)\Big)$$
where $t_n^h$ denotes the temporal IoU score between the h-th candidate segment at the n-th video frame and the truth segment, and $c_n^h$ denotes the h-th element of $c_n$, i.e. the confidence score of the h-th candidate segment at the n-th video frame;
the segment with the largest temporal confidence score is selected; its temporal boundary is (s, e) and its offset is $(l_s, l_e)$; the true offset $(\hat{l}_s, \hat{l}_e)$ is first calculated from the truth boundary $(\hat{s}, \hat{e})$, and the regression loss formula is:
$$\mathcal{L}_t^{reg} = R(l_s - \hat{l}_s) + R(l_e - \hat{l}_e)$$
where R denotes the smooth L1 function.
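A sketch of the spatio-temporal locator of claim 6. The spatial head scores every region against the main object feature with a sigmoid, and the temporal head aggregates regions by attention, runs a Bi-GRU over frames, and predicts H candidate-segment confidences and boundary offsets per frame. The concrete projections are assumptions, since the formulas appear as images in the original publication.

```python
import torch
import torch.nn as nn

class SpatialLocator(nn.Module):
    def __init__(self, region_dim, obj_dim, hid=256):
        super().__init__()
        self.w_r = nn.Linear(region_dim, hid, bias=False)
        self.w_o = nn.Linear(obj_dim, hid, bias=False)

    def forward(self, region_feats, main_obj):
        """region_feats: (N, K, D); main_obj: (obj_dim,). Returns (N, K) confidence scores."""
        return torch.sigmoid((self.w_r(region_feats) * self.w_o(main_obj)).sum(-1))

class TemporalLocator(nn.Module):
    def __init__(self, region_dim, obj_dim, hid=256, num_candidates=4):
        super().__init__()
        self.attn = nn.Linear(region_dim + obj_dim, 1)
        self.context = nn.GRU(region_dim, hid, batch_first=True, bidirectional=True)
        self.conf = nn.Linear(2 * hid, num_candidates)         # candidate confidences
        self.offset = nn.Linear(2 * hid, 2 * num_candidates)   # (start, end) offsets

    def forward(self, region_feats, main_obj):
        """region_feats: (N, K, D); main_obj: (obj_dim,)."""
        N, K, D = region_feats.shape
        obj = main_obj.expand(N, K, -1)
        alpha = torch.softmax(self.attn(torch.cat([region_feats, obj], -1)), dim=1)
        frame_feats = (alpha * region_feats).sum(dim=1)         # (N, D) frame-level features
        ctx, _ = self.context(frame_feats.unsqueeze(0))         # (1, N, 2*hid)
        ctx = ctx.squeeze(0)
        return torch.sigmoid(self.conf(ctx)), self.offset(ctx)  # (N, H) scores, (N, 2H) offsets
```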
7. A positioning system for completing the task of positioning a specified object in a video by using an object-aware multi-branch relation network, wherein the positioning system is used for implementing the method for completing the specified-object positioning task in a video according to claim 1, the system comprising:
the video preprocessing module: used for extracting region features of different frames from the video and calculating association scores between any region feature in a video frame and all region features in the video frames within an adjacent interval; taking the region feature with the highest association score in each video frame of the adjacent interval as a matching region feature, and average-pooling any region feature in the video frame with its matching region features to obtain the dynamic region features of the video frames;
the query statement preprocessing module: used for acquiring the semantic feature set of all words in the query statement, extracting the semantic features of nouns from the semantic feature set, and further obtaining the object features in the query statement by an attention method;
a video clip positioning module: the video fragment positioning module comprises a modeling submodule and a training submodule, wherein the modeling submodule is provided with an object perception multi-branch relation model, a space positioning model and a time positioning model, and the training submodule is provided with a multi-task loss function;
an output module: for outputting the positioning result.
8. The system of claim 7, wherein the video pre-processing module comprises:
the Faster R-CNN sub-module: configured with a trained Faster R-CNN model, taking a video as input to obtain the region features of the video;
the temporal region aggregation sub-module: used for aggregating the preceding L frames and the following L frames of the frame where any region is located into a video frame set, and calculating the association score between any region feature in the video frame set and the region feature to be matched in the video frame;
the pooling sub-module: used for average-pooling the region feature with the highest association score in each video frame of the video frame set with the region feature to be matched, and outputting the dynamic region feature.
9. The system of claim 7, wherein the query statement preprocessing module comprises:
Bi-GRU submodule: configuring a trained Bi-GRU network for acquiring semantic features of words in query sentences;
the labeling sub-module: used for marking all nouns in the query statement as objects in the query statement;
the feature extraction sub-module: used for extracting the semantic features of the objects in the query statement from the semantic features of the words in the query statement and calculating the object features in the query statement.
10. The system of claim 7, wherein the modeling sub-module is configured with:
object-aware multi-branch relationship model: the object perception multi-branch relation model is composed of a main branch, T-1 auxiliary branches and a multi-branch relation reasoning layer, and each branch comprises an object perception modulation layer, a cross-mode matching layer and a softmax function layer;
a space locator: the positioning device is used for realizing the positioning of the space pipeline;
a time locator: for achieving the positioning of the time pipe.
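As an illustration of how the modules of claims 7-10 could be composed, the sketch below wires the preprocessing modules, the multi-branch relation network, and the two locators into one system object; every name and the call signatures are assumptions, since the claims only specify the modules functionally.

```python
class VideoGroundingSystem:
    """Illustrative composition of the modules in claims 7-10 (assumed API)."""

    def __init__(self, video_preprocessor, query_preprocessor, branches, fusion,
                 temporal_locator, spatial_locator):
        self.video_preprocessor = video_preprocessor    # video preprocessing module
        self.query_preprocessor = query_preprocessor    # query statement preprocessing module
        self.branches = branches                        # main branch + T-1 auxiliary branches
        self.fusion = fusion                            # multi-branch relation reasoning
        self.temporal_locator = temporal_locator        # temporal pipeline localization
        self.spatial_locator = spatial_locator          # spatial pipeline localization

    def localize(self, video, sentence):
        regions = self.video_preprocessor(video)        # dynamic region features
        objects = self.query_preprocessor(sentence)     # object features, main object first
        mods, scores = zip(*(b(regions, o) for b, o in zip(self.branches, objects)))
        features = self.fusion(mods[0], list(mods[1:]), list(scores[1:]))
        segment = self.temporal_locator(features, objects[0])
        boxes = self.spatial_locator(features, objects[0])
        return segment, boxes                           # consumed by the output module
```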
CN202010382647.0A 2020-05-08 2020-05-08 Method and system for positioning specified object in video based on multi-branch relation network Active CN111582170B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010382647.0A CN111582170B (en) 2020-05-08 2020-05-08 Method and system for positioning specified object in video based on multi-branch relation network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010382647.0A CN111582170B (en) 2020-05-08 2020-05-08 Method and system for positioning specified object in video based on multi-branch relation network

Publications (2)

Publication Number Publication Date
CN111582170A true CN111582170A (en) 2020-08-25
CN111582170B CN111582170B (en) 2023-05-23

Family

ID=72112195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010382647.0A Active CN111582170B (en) 2020-05-08 2020-05-08 Method and system for positioning specified object in video based on multi-branch relation network

Country Status (1)

Country Link
CN (1) CN111582170B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380394A (en) * 2020-10-27 2021-02-19 浙江工商大学 Progressive positioning method for positioning from text to video clip
CN112417206A (en) * 2020-11-24 2021-02-26 杭州一知智能科技有限公司 Weak supervision video time interval retrieval method and system based on two-branch proposed network
CN116580054A (en) * 2022-01-29 2023-08-11 腾讯科技(深圳)有限公司 Video data processing method, device, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5748789A (en) * 1996-10-31 1998-05-05 Microsoft Corporation Transparent block skipping in object-based video coding systems
CN107229894A (en) * 2016-03-24 2017-10-03 上海宝信软件股份有限公司 Intelligent video monitoring method and system based on computer vision analysis technology
CN110225368A (en) * 2019-06-27 2019-09-10 腾讯科技(深圳)有限公司 A kind of video locating method, device and electronic equipment
CN110807122A (en) * 2019-10-18 2020-02-18 浙江大学 Image-text cross-modal feature disentanglement method based on depth mutual information constraint

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5748789A (en) * 1996-10-31 1998-05-05 Microsoft Corporation Transparent block skipping in object-based video coding systems
CN107229894A (en) * 2016-03-24 2017-10-03 上海宝信软件股份有限公司 Intelligent video monitoring method and system based on computer vision analysis technology
CN110225368A (en) * 2019-06-27 2019-09-10 腾讯科技(深圳)有限公司 A kind of video locating method, device and electronic equipment
CN110807122A (en) * 2019-10-18 2020-02-18 浙江大学 Image-text cross-modal feature disentanglement method based on depth mutual information constraint

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Xi; Zha Yufei; Zhang Tianzhu; Cui Zhen; Zuo Wangmeng; Hou Zhiqiang; Lu Huchuan; Wang Hanzi: "A survey of deep-learning-based object tracking algorithms" *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380394A (en) * 2020-10-27 2021-02-19 浙江工商大学 Progressive positioning method for positioning from text to video clip
CN112380394B (en) * 2020-10-27 2022-05-10 浙江工商大学 Progressive positioning method for positioning from text to video clip
US11941872B2 (en) 2020-10-27 2024-03-26 Zhejiang Gongshang University Progressive localization method for text-to-video clip localization
CN112417206A (en) * 2020-11-24 2021-02-26 杭州一知智能科技有限公司 Weak supervision video time interval retrieval method and system based on two-branch proposed network
CN112417206B (en) * 2020-11-24 2021-09-24 杭州一知智能科技有限公司 Weak supervision video time interval retrieval method and system based on two-branch proposed network
CN116580054A (en) * 2022-01-29 2023-08-11 腾讯科技(深圳)有限公司 Video data processing method, device, equipment and medium

Also Published As

Publication number Publication date
CN111582170B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
Liao et al. Deep facial spatiotemporal network for engagement prediction in online learning
Kamal et al. Automatic traffic sign detection and recognition using SegU-Net and a modified Tversky loss function with L1-constraint
CN108229338B (en) Video behavior identification method based on deep convolution characteristics
CN111582170A (en) Method and positioning system for completing specified object positioning task in video by using object-aware multi-branch relation network
CN107506712A (en) Method for distinguishing is known in a kind of human behavior based on 3D depth convolutional networks
CN111931795B (en) Multi-modal emotion recognition method and system based on subspace sparse feature fusion
CN112036276B (en) Artificial intelligent video question-answering method
KR20200010672A (en) Smart merchandise searching method and system using deep learning
CN114549470B (en) Hand bone critical area acquisition method based on convolutional neural network and multi-granularity attention
CN113111968A (en) Image recognition model training method and device, electronic equipment and readable storage medium
Chang et al. Event-centric multi-modal fusion method for dense video captioning
CN117149944B (en) Multi-mode situation emotion recognition method and system based on wide time range
CN115375732A (en) Unsupervised target tracking method and system based on module migration
Han et al. Modeling long-term video semantic distribution for temporal action proposal generation
CN117892175A (en) SNN multi-mode target identification method, system, equipment and medium
Hou et al. Confidence-guided self refinement for action prediction in untrimmed videos
CN116310975B (en) Audiovisual event positioning method based on consistent fragment selection
CN116958740A (en) Zero sample target detection method based on semantic perception and self-adaptive contrast learning
Wang et al. Hidden Markov Model‐Based Video Recognition for Sports
Wang et al. Curiosity-driven salient object detection with fragment attention
Jin et al. C2F: An effective coarse-to-fine network for video summarization
Rehman et al. A Real-Time Approach for Finger Spelling Interpretation Based on American Sign Language Using Neural Networks
Huo et al. Modality-convolutions: Multi-modal gesture recognition based on convolutional neural network
Ge et al. A visual tracking algorithm combining parallel network and dual attention-aware mechanism
Rawat et al. Indian sign language recognition system for interrogative words using deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant