CN112084319B - Relational network video question-answering system and method based on actions - Google Patents

Relational network video question-answering system and method based on actions

Info

Publication number
CN112084319B
CN112084319B CN202011049187.6A CN202011049187A
Authority
CN
China
Prior art keywords
video
network
representing
feature
question
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011049187.6A
Other languages
Chinese (zh)
Other versions
CN112084319A (en)
Inventor
邵杰
张骥鹏
高联丽
徐行
申恒涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Huakun Zhenyu Intelligent Technology Co ltd
Original Assignee
Sichuan Artificial Intelligence Research Institute Yibin
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Artificial Intelligence Research Institute Yibin filed Critical Sichuan Artificial Intelligence Research Institute Yibin
Priority to CN202011049187.6A priority Critical patent/CN112084319B/en
Publication of CN112084319A publication Critical patent/CN112084319A/en
Application granted granted Critical
Publication of CN112084319B publication Critical patent/CN112084319B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an action-based relational network video question-answering system and method, belonging to the fields of computational linguistics and computer vision. The invention uses the results of a temporal action detection network to assist the encoding of video features and to emphasize the action content of the video. Instead of relying on the detected action intervals directly, it uses the action probability distributions obtained from the detection results, which avoids the error accumulation caused by incorrect action detection. These distributions are input, together with the initial video features, into a neural-network encoder to learn video features that contain action information. Finally, the resulting video features and the question features are input into a multi-head relation converter network, which outputs the final result.

Description

Relational network video question-answering system and method based on actions
Technical Field
The invention belongs to the field of computational linguistics and computer vision, and particularly relates to a relational network video question-answering system and method based on actions.
Background
The video question-answering system automatically answers relevant questions about a given video clip. It has attracted the attention of researchers in recent years and is an important multi-modal understanding task. In a typical video question-answering setting, a question description and the corresponding video clip are given; earlier studies attempted to solve the task through cross-modal retrieval and action recognition.
In recent years, question-answering systems based on deep learning have appeared. These deep-learning methods can learn feature representations automatically and achieve high performance on large-scale, complex datasets. Many of them explore multi-modal information fusion and attention mechanisms, and much subsequent research has been devoted to improving deep-learning-based question-answering systems. A representative improvement is to model the associations among various kinds of information with hierarchical, multi-level attention mechanisms and graph neural networks, aiming to improve the representation and feature extraction capability of the model. On the other hand, improving the way video representations are acquired is another potential route to better solutions. In particular, existing video question-answering systems cannot effectively capture the motion information in a video and cannot make good use of the related information, so the acquired features do not accurately express the key information in the video, and the generated answers are ultimately inaccurate.
Disclosure of Invention
Aiming at the above defects in the prior art, the action-based relational network video question-answering system and method provided by the invention solve the problem of low answer accuracy in existing deep learning models.
In order to achieve the above purpose, the invention adopts the following technical scheme:
The scheme provides an action-based relational network video question-answering system, which comprises an encoding module, a question feature module, an action detection module, a relation conversion network module and a decoding module;
the encoding module is used for representing all frames of the video as a group of fixed-dimension real-valued vectors VE through a three-dimensional convolution network and an optical flow network;
the question feature module is used for representing the words in the question text as the question feature Q_o by utilizing a co-occurrence-based word embedding method;
the action detection module is used for acquiring multiple action probability distributions in the video by utilizing a temporal action detection network, and fusing these action probability distributions with the real-valued vectors VE to obtain the intermediate video feature V;
the relation conversion network module is used for obtaining, from the intermediate video feature V and the question feature Q_o, the relation feature R_z between video actions by using a relation conversion network, and aggregating the video feature V and the relation feature R_z into the relational video feature r_att through an attention mechanism;
the decoding module is used for fusing the intermediate video feature V, the question feature Q_o and the relational video feature r_att, and inputting the fusion result into a decoder of the video question to generate a question answer of the corresponding type, completing the action-based relational network video question answering.
The invention has the beneficial effects that: the invention uses the results of the temporal action detection network to assist the encoding of the video features and emphasizes the action content of the video. Because precise action interval labels are lacking, the invention does not use the detected action intervals directly; instead, it uses the action probability distributions obtained from the detection results, which avoids the error accumulation caused by incorrect action detection. The action probability distributions obtained by the temporal action detection network and the initial video features are input together into an encoder based on a recurrent neural network to learn video features, so that the final video features contain action information. Finally, the output video features and the question features are input into a multi-head relation converter network, which outputs the final result. The invention improves task performance by enhancing action features, assisted by the relation converter network, thereby obtaining a better question-answering effect.
Based on the above system, the invention also provides an action-based relational network video question-answering method, which comprises the following steps:
S1, representing all video frames as a group of fixed-dimension real-valued vectors VE through a three-dimensional convolution network and an optical flow network;
S2, representing the words in the question text as the question feature Q_o by using a co-occurrence-based word embedding method;
S3, acquiring multiple action probability distributions in the video by using a temporal action detection network, and fusing these action probability distributions with the real-valued vectors VE to obtain the intermediate video feature V;
S4, obtaining, from the intermediate video feature V and the question feature Q_o, the relation feature R_z between video actions by using a relation conversion network, and aggregating the video feature V and the relation feature R_z into the relational video feature r_att through an attention mechanism;
S5, fusing the intermediate video feature V, the question feature Q_o and the relational video feature r_att, and inputting the fusion result into a decoder of the video question to generate a question answer of the corresponding type, completing the action-based relational network video question answering.
The invention has the beneficial effects that: the invention uses the results of the temporal action detection network to assist the encoding of the video features and emphasizes the action content of the video. Because precise action interval labels are lacking, the invention does not use the detected action intervals directly; instead, it uses the action probability distributions obtained from the detection results, which avoids the error accumulation caused by incorrect action detection. The action probability distributions obtained by the temporal action detection network and the initial video features are input together into an encoder based on a recurrent neural network to learn video features, so that the final video features contain action information. Finally, the output video features and the question features are input into a multi-head relation converter network, which outputs the final result. The invention improves task performance by enhancing action features, assisted by the relation converter network, thereby obtaining a better question-answering effect.
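Before the individual steps are detailed below, the following minimal Python sketch illustrates step S2 (question encoding), which is independent of the video branch: the question words are embedded and passed through a recurrent network whose final hidden state serves as the question feature Q_o. The GRU cell, the vocabulary size and all dimensions are illustrative assumptions rather than values taken from the patent.

import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    # Sketch of step S2: word embedding followed by a recurrent network -> question feature Q_o.
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)             # word embedding, Q = {q_1, ..., q_N}
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)   # recurrent encoder (assumed GRU)

    def forward(self, word_ids):
        # word_ids: (batch, N) integer indices of the question words
        q = self.embed(word_ids)        # (batch, N, embed_dim)
        _, h = self.rnn(q)              # final hidden state of the recurrent network
        return h.squeeze(0)             # (batch, hidden_dim), used as Q_o

q_o = QuestionEncoder()(torch.randint(0, 10000, (1, 12)))
print(q_o.shape)  # torch.Size([1, 512])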
Further, the step S1 comprises the following steps:
S101, extracting T frames of images from the video according to the number of frames transmitted per second of the video file;
S102, obtaining a hidden-state representation VF = {f_1, f_2, ..., f_r} of the static feature set of the frames by using a residual network on the extracted T frames, and taking the hidden-state representation VF of the static feature set as the static-feature real-valued vector corresponding to the video, where f_r represents the residual feature corresponding to each frame of the video;
S103, obtaining a hidden-state representation VS = {s_1, s_2, ..., s_r} of the dynamic feature set of the frames by using an optical-flow convolution network on the extracted T frames, and taking the hidden-state representation VS of the dynamic feature set as the dynamic-feature real-valued vector corresponding to the video, where s_r represents the optical-flow feature corresponding to each frame of the video;
S104, fusing the static-feature real-valued vector and the dynamic-feature real-valued vector to obtain the fixed-dimension real-valued vectors VE.
The beneficial effects of the further scheme are as follows: the residual network and the optical-flow convolution network emphasize the static and dynamic characteristics of the video, respectively, which helps the model understand the video more comprehensively.
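A minimal Python sketch of steps S101 to S104, assuming the per-frame static features (from the residual network) and dynamic features (from the optical-flow network) have already been extracted; the linear projections, the dimensions and the name FrameEncoder are illustrative stand-ins, not the actual backbones used by the invention.

import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    # Sketch of step S1: fuse static (residual-network) and dynamic (optical-flow) frame
    # features into a group of fixed-dimension real-valued vectors VE.
    def __init__(self, static_dim=2048, flow_dim=1024, fuse_dim=512):
        super().__init__()
        self.static_proj = nn.Linear(static_dim, fuse_dim)  # stands in for the static features VF
        self.flow_proj = nn.Linear(flow_dim, fuse_dim)      # stands in for the dynamic features VS
        self.fuse = nn.Linear(2 * fuse_dim, fuse_dim)       # S104: fuse static and dynamic vectors

    def forward(self, vf, vs):
        # vf: (T, static_dim) residual features, vs: (T, flow_dim) optical-flow features
        f = torch.relu(self.static_proj(vf))
        s = torch.relu(self.flow_proj(vs))
        return self.fuse(torch.cat([f, s], dim=-1))         # (T, fuse_dim) real-valued vectors VE

ve = FrameEncoder()(torch.randn(32, 2048), torch.randn(32, 1024))  # T = 32 frames
print(ve.shape)  # torch.Size([32, 512])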
Still further, the step S2 comprises the following steps:
S201, processing the input question as a word sequence according to the question text;
S202, converting the word sequence into a set of fixed-dimension real-valued vectors Q = {q_1, q_2, ..., q_N} by using a word embedding method, where q_N represents the feature vector corresponding to the last word and N represents the length of the question sequence;
S203, inputting the real-valued vector set Q into a recurrent neural network to obtain the question feature Q_o.
Still further, the step S3 comprises the following steps:
S301, processing the video sequence by utilizing the temporal action detection network to obtain multiple action probability distributions {(tf_{s1}, tf_{e1}), ..., (tf_{sM}, tf_{eM})} in the video, where tf_{sM} represents the start time frame of a detected action, tf_{eM} represents the end time frame of a detected action, and M represents the first M action probability distributions;
S302, converting the multiple action probability distributions into the corresponding mask matrices, and fusing the mask matrices with the real-valued vectors VE to obtain the intermediate video feature V.
The beneficial effects of the further scheme are as follows: the invention is the first to use the information provided by the action detection network to assist the encoding of the input video features, effectively embedding the temporal attributes of the video into the video features; the newly generated video features contain the detected action-centered information, and this action information is important for answering questions correctly.
Still further, the step S302 comprises the following steps:
S3021, converting the multiple action probability distributions into corresponding initial mask matrices to obtain a subset VE_1 of the real-valued vectors VE;
S3022, defining a zero matrix Mask_1 with the same size as the real-valued vectors VE, and assigning 1 to the columns of the zero matrix Mask_1 corresponding to the subset VE_1 to obtain the final mask matrix;
S3023, fusing the final mask matrix with the real-valued vectors VE through bit-wise multiplication, and calculating the mask matrices corresponding to the multiple action intervals in the same way;
S3024, adding the fused matrices BSN_fj corresponding to the multiple action intervals to obtain the video feature BSN_f;
S3025, calculating the intermediate video feature V from the video feature BSN_f and the real-valued vectors VE; the expression of the intermediate video feature V is as follows:
V = VE + BSN_f
BSN_f = Σ_{j=1}^{M} BSN_fj
BSN_fj = VE ⊙ Mask_j
where VE represents the real-valued vectors, BSN_f represents the action-encoded video feature, BSN_fj represents the masked feature corresponding to the j-th action interval, ⊙ represents bit-wise multiplication, M represents the first M action probability distributions, j represents the j-th detected action, and 1 ≤ j ≤ M.
The beneficial effects of the further scheme are as follows: by fusing the multiple action probability distributions with the real-valued vectors VE, the invention can embed the action information into the video representation without changing the shape of the features.
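Read as code, the fusion above amounts to the following short Python sketch, which assumes the detected action intervals are already given as (start_frame, end_frame) pairs; the function name action_encode and all shapes are illustrative.

import torch

def action_encode(ve, intervals):
    # V = VE + sum over j of (VE ⊙ Mask_j), taken over the top-M detected action intervals.
    # ve: (T, D) real-valued frame features VE; intervals: list of (start_frame, end_frame) pairs.
    bsn_f = torch.zeros_like(ve)
    for start, end in intervals:            # j = 1..M
        mask = torch.zeros_like(ve)         # zero matrix Mask_j with the same size as VE
        mask[start:end + 1] = 1.0           # frames inside the j-th action interval are set to 1
        bsn_f = bsn_f + ve * mask           # accumulate BSN_fj = VE ⊙ Mask_j
    return ve + bsn_f                       # intermediate video feature V = VE + BSN_f

v = action_encode(torch.randn(32, 512), [(3, 9), (15, 20)])
print(v.shape)  # torch.Size([32, 512])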
Still further, the step S4 comprises the following steps:
S401, converting the intermediate video feature V into the video feature VP by using a fully connected network;
S402, processing the video feature VP and the question feature Q_o by utilizing a relation network to obtain the relation features r_i; the expression of the relation feature r_i is as follows:
r_i = W_r([vp_i, vp_{i+1}, ..., vp_{i+F}, Q_o]) + b_r
where W_r represents the parameter matrix to be trained, vp_{i+F} represents the feature corresponding to the frame F frames after the i-th frame, and b_r represents the bias parameter to be trained;
S403, fusing the video feature VP and the question feature Q_o by utilizing the relation conversion network, and calculating the relation feature R_z from the fusion result and the relation features r_i; the expression of the relation feature R_z is as follows:
R_z = R_1 || R_2 || ... || R_K
R_k = Relation-Module_k(VP, Q_o)
where Relation-Module_k(·) represents the calculation of the k-th relation sub-network, R_k represents the output of the k-th relation sub-network, K represents the total number of relation sub-networks, and || represents splicing the outputs of the K relation sub-networks;
S404, calculating the relation feature R'_z from the relation feature R_z by using a feed-forward network and layer normalization;
S405, aggregating the relation feature R'_z into the relational video feature r_att by using an attention mechanism; the expression of the relational video feature r_att is as follows:
r_att = Attention_r(R'_z, Q_o)
where Attention_r(·) represents the attention mechanism.
The beneficial effects of the further scheme are as follows: the invention proposes, for the first time, a video model based on a relation conversion network, aiming to make better use of the temporal attributes distributed across video frames and their interactions. On top of the inter-frame relation features mined by the relation network, a multi-head structure and several characteristics of the converter network are added, giving the system a stronger video feature extraction capability.
Still further, the expression of the video feature VP in step S401 is as follows:
VP = {vp_1, vp_2, ..., vp_T}
vp_i = W_p × v_i + b_p
1 ≤ i ≤ T
where vp_T represents the video feature corresponding to the last frame image, W_p represents the parameter matrix to be trained, v_i represents the intermediate video feature of the i-th frame, b_p represents the bias parameter to be trained, T represents the total number of video frames, i indexes the i-th frame of the video, and vp_i represents the feature corresponding to the i-th frame of the video.
Still further, the expression of the relation feature R'_z in step S404 is as follows:
R'_z = LayerNorm(R_z + FFN(R_z))
FFN(R_z) = W_f2(W_f1 · R_z + b_f1) + b_f2
where R'_z represents the relation feature after layer normalization, LayerNorm represents layer normalization, FFN(·) represents the computation of the feed-forward network, FFN(R_z) represents applying the feed-forward network to R_z, b_f1 represents the bias parameter of the first layer of the feed-forward network, W_f1 represents the parameter matrix of the first layer of the feed-forward network, W_f2 represents the parameter matrix of the second layer of the feed-forward network, and b_f2 represents the bias parameter of the second layer of the feed-forward network.
The beneficial effects of the further scheme are as follows: by combining multiple relation networks, the invention can fully and efficiently model the complex associations between frames.
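The relation conversion network of steps S401 to S405 can be sketched as follows in Python. This is a hedged sketch under several assumptions: a simple additive attention stands in for Attention_r, a plain two-layer feed-forward network is used, and K, F and all dimensions are illustrative; the class name RelationTransformer is not from the patent.

import torch
import torch.nn as nn

class RelationTransformer(nn.Module):
    # Sketch of S401-S405: K relation sub-networks over frame features VP and the question
    # feature Q_o, spliced, passed through a feed-forward network with LayerNorm, and pooled
    # into the relational video feature r_att by a question-guided attention.
    def __init__(self, v_dim=512, q_dim=512, rel_dim=256, K=4, F=1):
        super().__init__()
        self.F = F
        self.proj = nn.Linear(v_dim, v_dim)                    # S401: vp_i = W_p * v_i + b_p
        self.heads = nn.ModuleList(                            # S402: r_i = W_r([vp_i..vp_{i+F}, Q_o]) + b_r
            [nn.Linear((F + 1) * v_dim + q_dim, rel_dim) for _ in range(K)])
        self.ffn = nn.Sequential(nn.Linear(K * rel_dim, K * rel_dim), nn.ReLU(),
                                 nn.Linear(K * rel_dim, K * rel_dim))
        self.norm = nn.LayerNorm(K * rel_dim)
        self.att = nn.Linear(K * rel_dim + q_dim, 1)           # scoring layer for the attention pooling

    def forward(self, v, q_o):
        # v: (T, v_dim) intermediate video features, q_o: (q_dim,) question feature
        vp = self.proj(v)
        T = vp.size(0)
        windows = [torch.cat([vp[i:i + self.F + 1].reshape(-1), q_o]) for i in range(T - self.F)]
        x = torch.stack(windows)                               # one row per group of F+1 frames plus Q_o
        r_z = torch.cat([head(x) for head in self.heads], dim=-1)   # S403: splice the K heads
        r_z2 = self.norm(r_z + self.ffn(r_z))                  # S404: feed-forward + LayerNorm
        scores = torch.softmax(
            self.att(torch.cat([r_z2, q_o.expand(r_z2.size(0), -1)], dim=-1)), dim=0)
        return (scores * r_z2).sum(dim=0)                      # S405: relational video feature r_att

r_att = RelationTransformer()(torch.randn(32, 512), torch.randn(512))
print(r_att.shape)  # torch.Size([1024])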
Still further, the step S5 comprises the following steps:
S501, aggregating the intermediate video features V into a comprehensive video representation v_att by using an attention mechanism:
v_att = Attention_v(V, Q_o)
where Attention_v(·) represents the attention mechanism and Q_o represents the question feature;
S502, fusing the comprehensive video representation v_att, the relational video feature r_att and the question feature Q_o by bit-wise addition to obtain the final representation J, inputting the final representation J into a decoder of the video question to generate a question answer of the corresponding type, and completing the action-based relational network video question answering.
The beneficial effects of the further scheme are as follows: fusing the comprehensive video representation v_att, the relational video feature r_att and the question feature Q_o adds the relation information to the fused feature, which helps the system answer the question better.
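A minimal sketch of steps S501 and S502, assuming v_att, r_att and Q_o share a common dimension so that bit-wise (element-wise) addition is possible. The additive attention used for Attention_v, the linear classifier standing in for the answer decoder, and all dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class FusionDecoder(nn.Module):
    # Sketch of S5: question-guided attention over V (Attention_v), bit-wise fusion of
    # v_att, r_att and Q_o into the final representation J, and a stand-in answer decoder.
    def __init__(self, dim=512, num_answers=1000):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(), nn.Linear(dim, 1))
        self.decoder = nn.Linear(dim, num_answers)             # stand-in decoder of the video question

    def forward(self, v, r_att, q_o):
        # v: (T, dim) intermediate video features; r_att, q_o: (dim,)
        q = q_o.expand(v.size(0), -1)
        w = torch.softmax(self.score(torch.cat([v, q], dim=-1)), dim=0)
        v_att = (w * v).sum(dim=0)                             # S501: v_att = Attention_v(V, Q_o)
        j = v_att + r_att + q_o                                # S502: bit-wise addition -> J
        return self.decoder(j)                                 # scores over candidate answers

logits = FusionDecoder()(torch.randn(32, 512), torch.randn(512), torch.randn(512))
print(logits.argmax().item())  # index of the predicted answer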
Drawings
FIG. 1 is a schematic diagram of the system of the present invention.
FIG. 2 is a schematic flow chart of the method of the present invention.
Fig. 3 is a schematic flow chart in this embodiment.
Detailed Description
The following description of the embodiments of the invention is provided to facilitate understanding by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of these embodiments; for those of ordinary skill in the art, various changes are apparent and fall within the spirit and scope of the invention as defined and determined by the appended claims, and all inventions and creations made using the inventive concept are protected.
Example 1
As shown in fig. 1, an action-based relational network video question-answering system comprises an encoding module, a question feature module, an action detection module, a relation conversion network module and a decoding module. The encoding module is used for representing all frames of the video as a group of fixed-dimension real-valued vectors VE through a three-dimensional convolution network and an optical flow network. The question feature module is used for representing the words in the question text as the question feature Q_o using a co-occurrence-based word embedding method. The action detection module is used for acquiring multiple action probability distributions in the video using a temporal action detection network, and fusing these probability distributions with the real-valued vectors VE to obtain the intermediate video feature V. The relation conversion network module is used for obtaining, from the intermediate video feature V and the question feature Q_o, the relation feature R_z between video actions using a relation conversion network, and aggregating the video feature V and the relation feature R_z into the relational video feature r_att through an attention mechanism. The decoding module is used for fusing the real-valued vectors VE, the question feature Q_o and the relational video feature r_att, and inputting the fusion result into a decoder of the video question to generate a question answer of the corresponding type, completing the action-based relational network video question answering.
In this embodiment, the invention first represents all frames of the video as a group of fixed-dimension real-valued vectors VE through a three-dimensional convolution network and an optical flow network, and then represents the words in the question text as the question feature Q_o using a co-occurrence-based word embedding method. The invention then uses the results of the temporal action detection network to assist the encoding of the video features and emphasizes the action content of the video. Because precise action interval labels are lacking, the invention does not use the detected action intervals directly; instead, it uses the action probability distributions obtained from the detection results, which avoids the error accumulation caused by incorrect action detection. The action probability distributions obtained by the temporal action detection network and the initial video features are input together into an encoder based on a recurrent neural network to learn video features, so that the final video features contain action information. Finally, the output video features and the question features are input into a multi-head relation converter network, which outputs the final result. The invention improves task performance by enhancing action features, assisted by the relation converter network, thereby obtaining a better question-answering effect.
Example 2
As shown in fig. 2 and fig. 3, the invention further provides an action-based relational network video question-answering method, which is implemented as follows:
S1, representing all video frames as a group of fixed-dimension real-valued vectors VE through a three-dimensional convolution network and an optical flow network, implemented as follows:
S101, extracting T frames of images from the video according to the number of frames transmitted per second of the video file;
S102, obtaining a hidden-state representation VF = {f_1, f_2, ..., f_r} of the static feature set of the frames by using a residual network on the extracted T frames, and taking the hidden-state representation VF of the static feature set as the static-feature real-valued vector corresponding to the video, where f_r represents the residual feature corresponding to each frame of the video;
S103, obtaining a hidden-state representation VS = {s_1, s_2, ..., s_r} of the dynamic feature set of the frames by using an optical-flow convolution network on the extracted T frames, and taking the hidden-state representation VS of the dynamic feature set as the dynamic-feature real-valued vector corresponding to the video, where s_r represents the optical-flow feature corresponding to each frame of the video;
S104, fusing the static-feature real-valued vector and the dynamic-feature real-valued vector to obtain the fixed-dimension real-valued vectors VE;
S2, representing the words in the question text as the question feature Q_o by using a co-occurrence-based word embedding method, implemented as follows:
S201, processing the input question as a word sequence according to the question text;
S202, converting the word sequence into a set of fixed-dimension real-valued vectors Q = {q_1, q_2, ..., q_N} by using a word embedding method, where q_N represents the feature vector corresponding to the last word and N represents the length of the question sequence;
S203, inputting the real-valued vector set Q into a recurrent neural network to obtain the question feature Q_o;
S3, acquiring multiple action probability distributions in the video by using a temporal action detection network, and fusing these action probability distributions with the real-valued vectors VE to obtain the intermediate video feature V, implemented as follows:
S301, processing the video sequence by utilizing the temporal action detection network to obtain multiple action probability distributions {(tf_{s1}, tf_{e1}), ..., (tf_{sM}, tf_{eM})} in the video, where tf_{sM} represents the start time frame of a detected action, tf_{eM} represents the end time frame of a detected action, and M represents the first M action probability distributions.
In this embodiment, the video sequence is processed by the action detection network to obtain the distribution of action probabilities over the video. More specifically, since the action detection network generates probability distributions corresponding to many actions, the invention selects the top M distributions, i.e., the action intervals with the top M confidences. At the same time, the times in the detection results are converted into the corresponding frames, yielding {(tf_{s1}, tf_{e1}), ..., (tf_{sM}, tf_{eM})}, which indicates which frames are more likely to contain an action.
S302, converting the multiple action probability distributions into the corresponding mask matrices, and fusing the mask matrices with the real-valued vectors VE to obtain the intermediate video feature V, implemented as follows:
S3021, converting the multiple action probability distributions into corresponding initial mask matrices to obtain a subset VE_1 of the real-valued vectors VE;
S3022, defining a zero matrix Mask_1 with the same size as the real-valued vectors VE, and assigning 1 to the columns of the zero matrix Mask_1 corresponding to the subset VE_1 to obtain the final mask matrix;
S3023, fusing the final mask matrix with the real-valued vectors VE through bit-wise multiplication, and calculating the mask matrices corresponding to the multiple action intervals in the same way;
S3024, adding the fused matrices BSN_fj corresponding to the multiple action intervals to obtain the video feature BSN_f;
S3025, calculating the intermediate video feature V from the video feature BSN_f and the real-valued vectors VE; the expression of the intermediate video feature V is as follows:
V = VE + BSN_f
BSN_f = Σ_{j=1}^{M} BSN_fj
BSN_fj = VE ⊙ Mask_j
where VE represents the real-valued vectors, BSN_f represents the action-encoded video feature, BSN_fj represents the masked feature corresponding to the j-th action interval, ⊙ represents bit-wise multiplication (element-wise multiplication between matrices, yielding a new matrix), M represents the first M action probability distributions, j represents the j-th detected action, and 1 ≤ j ≤ M.
In this embodiment, based on the obtained action intervals {(tf_{s1}, tf_{e1}), ..., (tf_{sM}, tf_{eM})}, the system converts them into the corresponding mask matrices and then fuses the masks with the original video features VE to complete the action encoding. The action intervals are first converted into mask matrices. Taking (tf_{s1}, tf_{e1}) as an example, with (tf_{s1}, tf_{e1}) as the boundary, where tf_{s1} denotes the start time of the first action and tf_{e1} denotes the end time of the first action, the system obtains a subset VE_1 of the video feature set VE that contains only the features of the frames falling inside the corresponding detected action interval. A zero matrix Mask_1 with the same size as VE is then defined, and the columns of Mask_1 corresponding to VE_1 are all assigned 1, which formally yields Mask_1. The mask matrix and the video features VE are then fused by bit-wise multiplication, the mask matrices corresponding to the multiple action intervals are calculated in the same way, and the results are finally added together to obtain the action-encoded video feature BSN_f.
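The selection of the top-M intervals and the conversion from detection times to frame indices described above can be sketched as follows in Python; the helper name intervals_to_frames, the fps value and the example detections are illustrative assumptions, not values from the patent.

def intervals_to_frames(detections, fps, num_frames, top_m=3):
    # Keep the top-M detected action intervals by confidence and convert their
    # (start_sec, end_sec) times into (tf_s, tf_e) frame indices.
    # detections: list of (start_sec, end_sec, confidence) from a temporal action detector.
    top = sorted(detections, key=lambda d: d[2], reverse=True)[:top_m]
    frames = []
    for start_sec, end_sec, _ in top:
        start = max(0, int(start_sec * fps))
        end = min(num_frames - 1, int(end_sec * fps))
        frames.append((start, end))
    return frames

print(intervals_to_frames([(1.2, 3.4, 0.9), (0.5, 0.8, 0.4), (5.0, 6.2, 0.7)], fps=8, num_frames=64))
# [(9, 27), (40, 49), (4, 6)]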
S4, obtaining, from the intermediate video feature V and the question feature Q_o, the relation feature R_z between video actions by using the relation conversion network, and aggregating the video feature V and the relation feature R_z into the relational video feature r_att through an attention mechanism, implemented as follows:
S401, converting the intermediate video feature V into the video feature VP by using a fully connected network; the expression of the video feature VP is as follows:
VP = {vp_1, vp_2, ..., vp_T}
vp_i = W_p × v_i + b_p
1 ≤ i ≤ T
where vp_T represents the video feature corresponding to the last frame image, W_p represents the parameter matrix to be trained, v_i represents the intermediate video feature of the i-th frame, b_p represents the bias parameter to be trained, T represents the total number of video frames, i indexes the i-th frame of the video, and vp_i represents the feature corresponding to the i-th frame of the video;
S402, processing the video feature VP and the question feature Q_o by utilizing a relation network to obtain the relation features r_i; the expression of the relation feature r_i is as follows:
r_i = W_r([vp_i, vp_{i+1}, ..., vp_{i+F}, Q_o]) + b_r
where W_r represents the parameter matrix to be trained, vp_{i+F} represents the feature corresponding to the frame F frames after the i-th frame, and b_r represents the bias parameter to be trained; F is set to 1 in the present invention.
In this embodiment, the video features VP are processed by the relation network module. Given the frame-level video features VP = {vp_1, vp_2, ..., vp_T} and the question feature Q_o, where r_i denotes the i-th relation feature, the system obtains the T−(F−1) corresponding relation features R = {r_1, r_2, ..., r_{T−(F−1)}} for a set containing T frames of video features. F, the number of frames the relation network module needs to consider, is set to 1 herein. The processing procedure of this step is hereinafter referred to as Relation-Module_k.
S403, fusing the video feature VP and the question feature Q_o by utilizing the relation conversion network, and calculating the relation feature R_z from the fusion result and the relation features r_i; the expression of the relation feature R_z is as follows:
R_z = R_1 || R_2 || ... || R_K
R_k = Relation-Module_k(VP, Q_o)
where Relation-Module_k(·) represents the calculation of the k-th relation sub-network, computed in the same manner as in S402, R_k represents the output of the k-th relation sub-network, K represents the total number of relation sub-networks, and || represents splicing the outputs of the K relation sub-networks;
S404, calculating the relation feature R'_z from the relation feature R_z by using a feed-forward network and layer normalization; the expression of the relation feature R'_z is as follows:
R'_z = LayerNorm(R_z + FFN(R_z))
FFN(R_z) = W_f2(W_f1 · R_z + b_f1) + b_f2
where R'_z represents the relation feature after layer normalization, LayerNorm represents layer normalization, FFN(·) represents the computation of the feed-forward network, FFN(R_z) represents applying the feed-forward network to R_z, b_f1 represents the bias parameter of the first layer of the feed-forward network, W_f1 represents the parameter matrix of the first layer of the feed-forward network, W_f2 represents the parameter matrix of the second layer of the feed-forward network, and b_f2 represents the bias parameter of the second layer of the feed-forward network.
In this embodiment, the system provides a new relation conversion network, which also takes the frame-level video features VP = {vp_1, vp_2, ..., vp_T} and the question feature Q_o as inputs. The video features first pass through a multi-head relation network: a K-head structure is introduced, each head adopts a relation network, and the calculation is the same as in S402. The system executes each relation network module Relation-Module_k in parallel, splices the results together, and then obtains the final relation feature R'_z through the feed-forward network FFN and layer normalization LayerNorm.
S405, aggregating the relation feature R'_z into the relational video feature r_att by using an attention mechanism; the expression of the relational video feature r_att is as follows:
r_att = Attention_r(R'_z, Q_o)
where Attention_r(·) represents the attention mechanism;
S5, fusing the intermediate video feature V, the question feature Q_o and the relational video feature r_att, and inputting the fusion result into a decoder of the video question to generate a question answer of the corresponding type, completing the action-based relational network video question answering, implemented as follows:
S501, aggregating the intermediate video features V into a comprehensive video representation v_att by using an attention mechanism:
v_att = Attention_v(V, Q_o)
where Attention_v(·) represents the attention mechanism and Q_o represents the question feature;
S502, fusing the comprehensive video representation v_att, the relational video feature r_att and the question feature Q_o by bit-wise addition to obtain the final representation J, inputting the final representation J into a decoder of the video question to generate a question answer of the corresponding type, and completing the action-based relational network video question answering.
The present invention will be further described below.
In this embodiment, two common datasets are used: TGIF-QA (a video question-answering dataset for spatio-temporal joint reasoning), which has 165,165 questions and 71,741 video clips, and ActivityNet-QA (a video question-answering dataset for actions), which has 58,000 questions and 5,800 corresponding video clips. The TGIF-QA data are divided into four sub-tasks: Action, Transition, Frame and Count (action, state transition, single-frame static question, counting). Action, Transition and Frame are evaluated directly by accuracy, while Count, whose result is a numerical value, is evaluated by mean squared error (MSE). The ActivityNet-QA dataset is evaluated by accuracy and by the similarity to the standard correct answer (WUPS). Table 1 compares the effect of this method with existing methods; the values under Action, Transition and Frame are the accuracy of the model on the test set (larger is better), and the value under Count is the mean squared error between the results generated by the model on the test set and the standard answers (smaller is better). As can be seen from Table 1, the method achieves better results than the existing ST-TP method (temporal reasoning model), Co-memory method (joint memory network), PSAC method (positional self-attention with co-attention network) and HGA method (heterogeneous graph alignment model).
TABLE 1
Method         Action    Transition    Frame    Count
ST-TP          62.9      69.4          49.5     4.32
Co-memory      68.2      74.3          51.5     4.10
PSAC           70.4      76.9          55.7     4.27
HGA            75.4      81.0          55.1     4.09
This method    75.81     81.61         57.68    4.08
Table 2 reports the results on ActivityNet-QA; the values under Acc are the accuracy of the model on the test set (larger is better), and WUPS measures the similarity between the results generated by the model on the test set and the standard answers (larger is better). As can be seen from Table 2, the method achieves better results than the existing E-VQA (static question-answering model), E-MN (memory network), E-SA (soft attention network), VQA-HMAL (conditional adversarial network) and CAN (combined attention network) methods.
TABLE 2
(The entries of Table 2 are provided as an image in the original publication and are not reproduced here.)
In summary, the invention introduces two novel mechanisms, action-based encoding and a relation converter, to help improve the video question-answering system. Besides using frame-level features for the static part, the invention also attends to the action attributes in the time dimension and embeds them into the video features. In addition, the invention does not use a recurrent neural network to extract the video representation vector, but uses a new relation conversion network to capture the video features. Experiments were carried out on two large-scale video question-answering datasets, TGIF-QA and ActivityNet-QA, and the results show that the invention achieves a significant improvement over the original methods.

Claims (8)

1. An action-based relational network video question-answering system, characterized by comprising an encoding module, a question feature module, an action detection module, a relation conversion network module and a decoding module;
the encoding module is used for representing all frames of the video as a group of fixed-dimension real-valued vectors VE through a three-dimensional convolution network and an optical flow network;
the question feature module is used for representing the words in the question text as the question feature Q_o by utilizing a co-occurrence-based word embedding method;
the action detection module is used for acquiring multiple action probability distributions in the video by utilizing a temporal action detection network, and fusing these action probability distributions with the real-valued vectors VE to obtain the intermediate video feature V, implemented as follows:
A1, processing the video sequence by utilizing the temporal action detection network to obtain multiple action probability distributions {(tf_{s1}, tf_{e1}), ..., (tf_{sM}, tf_{eM})} in the video, where tf_{sM} represents the start time frame of a detected action, tf_{eM} represents the end time frame of a detected action, and M represents the first M action probability distributions;
A2, converting the multiple action probability distributions into the corresponding mask matrices, and fusing the mask matrices with the real-valued vectors VE to obtain the intermediate video feature V;
the step A2 comprises the following steps:
A21, converting the multiple action probability distributions into corresponding initial mask matrices to obtain a subset VE_1 of the real-valued vectors VE;
A22, defining a zero matrix Mask_1 with the same size as the real-valued vectors VE, and assigning 1 to the columns of the zero matrix Mask_1 corresponding to the subset VE_1 to obtain the final mask matrix;
A23, fusing the final mask matrix with the real-valued vectors VE through bit-wise multiplication, and calculating the mask matrices corresponding to the multiple action intervals in the same way;
A24, adding the fused matrices BSN_fj corresponding to the multiple action intervals to obtain the video feature BSN_f;
A25, calculating the intermediate video feature V from the video feature BSN_f and the real-valued vectors VE; the expression of the intermediate video feature V is as follows:
V = VE + BSN_f
BSN_f = Σ_{j=1}^{M} BSN_fj
BSN_fj = VE ⊙ Mask_j
where VE represents the real-valued vectors, BSN_f represents the action-encoded video feature, BSN_fj represents the masked feature corresponding to the j-th action interval, ⊙ represents bit-wise multiplication, M represents the first M action probability distributions, j represents the j-th detected action, and 1 ≤ j ≤ M;
the relation conversion network module is used for obtaining, from the intermediate video feature V and the question feature Q_o, the relation feature R_z between video actions by using a relation conversion network, and aggregating the video feature V and the relation feature R_z into the relational video feature r_att through an attention mechanism;
the decoding module is used for fusing the intermediate video feature V, the question feature Q_o and the relational video feature r_att, and inputting the fusion result into a decoder of the video question to generate a question answer of the corresponding type, completing the action-based relational network video question answering.
2. An action-based relational network video question-answering method, characterized by comprising the following steps:
S1, representing all video frames as a group of fixed-dimension real-valued vectors VE through a three-dimensional convolution network and an optical flow network;
S2, representing the words in the question text as the question feature Q_o by utilizing a co-occurrence-based word embedding method;
S3, acquiring multiple action probability distributions in the video by using a temporal action detection network, and fusing these action probability distributions with the real-valued vectors VE to obtain the intermediate video feature V;
the step S3 comprises the following steps:
S301, processing the video sequence by utilizing the temporal action detection network to obtain multiple action probability distributions {(tf_{s1}, tf_{e1}), ..., (tf_{sM}, tf_{eM})} in the video, where tf_{sM} represents the start time frame of a detected action, tf_{eM} represents the end time frame of a detected action, and M represents the first M action probability distributions;
S302, converting the multiple action probability distributions into the corresponding mask matrices, and fusing the mask matrices with the real-valued vectors VE to obtain the intermediate video feature V;
the step S302 comprises the following steps:
S3021, converting the multiple action probability distributions into corresponding initial mask matrices to obtain a subset VE_1 of the real-valued vectors VE;
S3022, defining a zero matrix Mask_1 with the same size as the real-valued vectors VE, and assigning 1 to the columns of the zero matrix Mask_1 corresponding to the subset VE_1 to obtain the final mask matrix;
S3023, fusing the final mask matrix with the real-valued vectors VE through bit-wise multiplication, and calculating the mask matrices corresponding to the multiple action intervals in the same way;
S3024, adding the fused matrices BSN_fj corresponding to the multiple action intervals to obtain the video feature BSN_f;
S3025, calculating the intermediate video feature V from the video feature BSN_f and the real-valued vectors VE; the expression of the intermediate video feature V is as follows:
V = VE + BSN_f
BSN_f = Σ_{j=1}^{M} BSN_fj
BSN_fj = VE ⊙ Mask_j
where VE represents the real-valued vectors, BSN_f represents the action-encoded video feature, BSN_fj represents the masked feature corresponding to the j-th action interval, ⊙ represents bit-wise multiplication, M represents the first M action probability distributions, j represents the j-th detected action, and 1 ≤ j ≤ M;
S4, obtaining, from the intermediate video feature V and the question feature Q_o, the relation feature R_z between video actions by using a relation conversion network, and aggregating the video feature V and the relation feature R_z into the relational video feature r_att through an attention mechanism;
S5, fusing the intermediate video feature V, the question feature Q_o and the relational video feature r_att, and inputting the fusion result into a decoder of the video question to generate a question answer of the corresponding type, completing the action-based relational network video question answering.
3. The action-based relational network video question-answering method according to claim 2, wherein the step S1 comprises the following steps:
S101, extracting T frames of images from the video according to the number of frames transmitted per second of the video file;
S102, obtaining a hidden-state representation VF = {f_1, f_2, ..., f_r} of the static feature set of the frames by using a residual network on the extracted T frames, and taking the hidden-state representation VF of the static feature set as the static-feature real-valued vector corresponding to the video, where f_r represents the residual feature corresponding to each frame of the video;
S103, obtaining a hidden-state representation VS = {s_1, s_2, ..., s_r} of the dynamic feature set of the frames by using an optical-flow convolution network on the extracted T frames, and taking the hidden-state representation VS of the dynamic feature set as the dynamic-feature real-valued vector corresponding to the video, where s_r represents the optical-flow feature corresponding to each frame of the video;
S104, fusing the static-feature real-valued vector and the dynamic-feature real-valued vector to obtain the fixed-dimension real-valued vectors VE.
4. The action-based relational network video question-answering method according to claim 2, wherein the step S2 comprises the following steps:
S201, processing the input question as a word sequence according to the question text;
S202, converting the word sequence into a set of fixed-dimension real-valued vectors Q = {q_1, q_2, ..., q_N} by using a word embedding method, where q_N represents the feature vector corresponding to the last word and N represents the length of the question sequence;
S203, inputting the real-valued vector set Q into a recurrent neural network to obtain the question feature Q_o.
5. The action-based relational network video question-answering method according to claim 2, wherein the step S4 comprises the following steps:
S401, converting the intermediate video feature V into the video feature VP by using a fully connected network;
S402, processing the video feature VP and the question feature Q_o by utilizing a relation network to obtain the relation features r_i; the expression of the relation feature r_i is as follows:
r_i = W_r([vp_i, vp_{i+1}, ..., vp_{i+F}, Q_o]) + b_r
where W_r represents the parameter matrix to be trained, vp_{i+F} represents the feature corresponding to the frame F frames after the i-th frame, and b_r represents the bias parameter to be trained;
S403, fusing the video feature VP and the question feature Q_o by utilizing the relation conversion network, and calculating the relation feature R_z from the fusion result and the relation features r_i; the expression of the relation feature R_z is as follows:
R_z = R_1 || R_2 || ... || R_K
R_k = Relation-Module_k(VP, Q_o)
where Relation-Module_k(·) represents the calculation of the k-th relation sub-network, R_k represents the output of the k-th relation sub-network, K represents the total number of relation sub-networks, and || represents splicing the outputs of the K relation sub-networks;
S404, calculating the relation feature R'_z from the relation feature R_z by using a feed-forward network and layer normalization;
S405, aggregating the relation feature R'_z into the relational video feature r_att by using an attention mechanism; the expression of the relational video feature r_att is as follows:
r_att = Attention_r(R'_z, Q_o)
where Attention_r(·) represents the attention mechanism.
6. The action-based relational network video question-answering method according to claim 5, wherein the expression of the video feature VP in step S401 is as follows:
VP = {vp_1, vp_2, ..., vp_T}
vp_i = W_p × v_i + b_p
1 ≤ i ≤ T
where vp_T represents the video feature corresponding to the last frame image, W_p represents the parameter matrix to be trained, v_i represents the intermediate video feature of the i-th frame, b_p represents the bias parameter to be trained, T represents the total number of video frames, i indexes the i-th frame of the video, and vp_i represents the feature corresponding to the i-th frame of the video.
7. The action-based relational network video question-answering method according to claim 5, wherein the expression of the relation feature R'_z in step S404 is as follows:
R'_z = LayerNorm(R_z + FFN(R_z))
FFN(R_z) = W_f2(W_f1 · R_z + b_f1) + b_f2
where R'_z represents the relation feature after layer normalization, LayerNorm represents layer normalization, FFN(·) represents the computation of the feed-forward network, FFN(R_z) represents applying the feed-forward network to R_z, b_f1 represents the bias parameter of the first layer of the feed-forward network, W_f1 represents the parameter matrix of the first layer of the feed-forward network, W_f2 represents the parameter matrix of the second layer of the feed-forward network, and b_f2 represents the bias parameter of the second layer of the feed-forward network.
8. The action-based relational network video question-answering method according to claim 2, wherein the step S5 comprises the following steps:
S501, aggregating the intermediate video features V into a comprehensive video representation v_att by using an attention mechanism:
v_att = Attention_v(V, Q_o)
where Attention_v(·) represents the attention mechanism and Q_o represents the question feature;
S502, fusing the comprehensive video representation v_att, the relational video feature r_att and the question feature Q_o by bit-wise addition to obtain the final representation J, inputting the final representation J into a decoder of the video question to generate a question answer of the corresponding type, and completing the action-based relational network video question answering.
CN202011049187.6A 2020-09-29 2020-09-29 Relational network video question-answering system and method based on actions Active CN112084319B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011049187.6A CN112084319B (en) 2020-09-29 2020-09-29 Relational network video question-answering system and method based on actions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011049187.6A CN112084319B (en) 2020-09-29 2020-09-29 Relational network video question-answering system and method based on actions

Publications (2)

Publication Number Publication Date
CN112084319A CN112084319A (en) 2020-12-15
CN112084319B true CN112084319B (en) 2021-03-16

Family

ID=73729964

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011049187.6A Active CN112084319B (en) 2020-09-29 2020-09-29 Relational network video question-answering system and method based on actions

Country Status (1)

Country Link
CN (1) CN112084319B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113536952B (en) * 2021-06-22 2023-04-21 电子科技大学 Video question-answering method based on attention network of motion capture
CN115312044B (en) * 2022-08-05 2024-06-14 清华大学 Hierarchical audio-visual feature fusion method and product for audio-video question and answer

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101149756A (en) * 2007-11-09 2008-03-26 清华大学 Individual relation finding method based on path grade at large scale community network
CN110097000A (en) * 2019-04-29 2019-08-06 东南大学 Video behavior recognition methods based on local feature Aggregation Descriptor and sequential relationship network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9123254B2 (en) * 2012-06-07 2015-09-01 Xerox Corporation Method and system for managing surveys
CN109582798A (en) * 2017-09-29 2019-04-05 阿里巴巴集团控股有限公司 Automatic question-answering method, system and equipment
CN109614613B (en) * 2018-11-30 2020-07-31 北京市商汤科技开发有限公司 Image description statement positioning method and device, electronic equipment and storage medium
CN111079532B (en) * 2019-11-13 2021-07-13 杭州电子科技大学 Video content description method based on text self-encoder

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101149756A (en) * 2007-11-09 2008-03-26 清华大学 Individual relation finding method based on path grade at large scale community network
CN110097000A (en) * 2019-04-29 2019-08-06 东南大学 Video behavior recognition methods based on local feature Aggregation Descriptor and sequential relationship network

Also Published As

Publication number Publication date
CN112084319A (en) 2020-12-15

Similar Documents

Publication Publication Date Title
CN114064918B (en) Multi-modal event knowledge graph construction method
CN111753024B (en) Multi-source heterogeneous data entity alignment method oriented to public safety field
CN109815336B (en) Text aggregation method and system
CN111538848A (en) Knowledge representation learning method fusing multi-source information
CN110765277B (en) Knowledge-graph-based mobile terminal online equipment fault diagnosis method
Liu et al. Cross-attentional spatio-temporal semantic graph networks for video question answering
CN112084319B (en) Relational network video question-answering system and method based on actions
CN112905795A (en) Text intention classification method, device and readable medium
CN111125520B (en) Event line extraction method based on deep clustering model for news text
CN113221571B (en) Entity relation joint extraction method based on entity correlation attention mechanism
CN111061951A (en) Recommendation model based on double-layer self-attention comment modeling
CN116108351A (en) Cross-language knowledge graph-oriented weak supervision entity alignment optimization method and system
CN112035689A (en) Zero sample image hash retrieval method based on vision-to-semantic network
CN117058266A (en) Handwriting word generation method based on skeleton and outline
CN115563314A (en) Knowledge graph representation learning method for multi-source information fusion enhancement
CN115687638A (en) Entity relation combined extraction method and system based on triple forest
CN115062109A (en) Entity-to-attention mechanism-based entity relationship joint extraction method
CN111428518B (en) Low-frequency word translation method and device
CN112926323A (en) Chinese named entity identification method based on multi-stage residual convolution and attention mechanism
CN117648984A (en) Intelligent question-answering method and system based on domain knowledge graph
CN117407532A (en) Method for enhancing data by using large model and collaborative training
CN115599954B (en) Video question-answering method based on scene graph reasoning
Perdana et al. Instance-based deep transfer learning on cross-domain image captioning
CN116522165A (en) Public opinion text matching system and method based on twin structure
CN115964468A (en) Rural information intelligent question-answering method and device based on multilevel template matching

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20210525

Address after: 610000 China (Sichuan) pilot Free Trade Zone, Chengdu, Sichuan

Patentee after: Sichuan Huakun Zhenyu Intelligent Technology Co.,Ltd.

Address before: No. 430, Section 2, west section of North Changjiang Road, Lingang Economic and Technological Development Zone, Yibin, Sichuan, 644000

Patentee before: Sichuan Artificial Intelligence Research Institute (Yibin)