CN112084319B - Relational network video question-answering system and method based on actions - Google Patents

Relational network video question-answering system and method based on actions

Info

Publication number
CN112084319B
CN112084319B CN202011049187.6A CN202011049187A
Authority
CN
China
Prior art keywords
video
network
representing
feature
question
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011049187.6A
Other languages
Chinese (zh)
Other versions
CN112084319A (en)
Inventor
邵杰
张骥鹏
高联丽
徐行
申恒涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Huakun Zhenyu Intelligent Technology Co ltd
Original Assignee
Sichuan Artificial Intelligence Research Institute Yibin
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Artificial Intelligence Research Institute Yibin filed Critical Sichuan Artificial Intelligence Research Institute Yibin
Priority to CN202011049187.6A priority Critical patent/CN112084319B/en
Publication of CN112084319A publication Critical patent/CN112084319A/en
Application granted granted Critical
Publication of CN112084319B publication Critical patent/CN112084319B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an action-based relational network video question-answering system and method, belonging to the fields of computational linguistics and computer vision. The invention uses the results of a temporal action detection network to assist the encoding of video features and to emphasize the action content of the video. Instead of relying on the detected action intervals directly, it uses the action probability distributions obtained from the detection results, which avoids the error accumulation caused by incorrect action detection. These distributions are input, together with the initial video features, into a neural-network encoder to learn video features that contain action information. Finally, the resulting video features and the question features are input into a multi-head relation converter network, which outputs the final result.

Description

Relational network video question-answering system and method based on actions
Technical Field
The invention belongs to the field of computational linguistics and computer vision, and particularly relates to a relational network video question-answering system and method based on actions.
Background
The video question-answering system automatically answers relevant questions about a given video clip. It has attracted the attention of researchers in recent years and is an important multi-modal understanding task. In a typical video question-answering setting, a question description and the corresponding video clip are given; earlier studies attempted to solve the task through cross-modal retrieval and action recognition.
In recent years, question-answering systems based on deep learning have appeared. These deep-learning methods can learn feature representations automatically and achieve high performance on large-scale, complex datasets. Many of them explore multi-modal information fusion and attention mechanisms, and much subsequent research has been devoted to improving deep-learning-based question-answering systems. A representative improvement is to model the associations among various kinds of information with hierarchical, multi-level attention mechanisms and graph neural networks, aiming to improve the representation and feature extraction capability of the model. On the other hand, improving the way video representations are acquired is another potential route to better solutions. In particular, existing video question-answering systems cannot effectively capture the motion information in a video and cannot make good use of the related information, so the acquired features do not accurately express the key information in the video, and the generated answers are ultimately inaccurate.
Disclosure of Invention
Aiming at the above defects in the prior art, the action-based relational network video question-answering system and method provided by the invention solve the problem of low answer accuracy in existing deep learning models.
In order to achieve the above purpose, the invention adopts the following technical scheme:
The scheme provides an action-based relational network video question-answering system, which comprises an encoding module, a question feature module, an action detection module, a relation conversion network module and a decoding module;
the encoding module is used for representing all frames of the video as a group of fixed-dimension real-valued vectors VE through a three-dimensional convolution network and an optical flow network;
the question feature module is used for representing the words in the question text as the question feature Q_o by utilizing a co-occurrence-based word embedding method;
the action detection module is used for acquiring multiple action probability distributions in the video by utilizing a temporal action detection network, and fusing these action probability distributions with the real-valued vectors VE to obtain the intermediate video feature V;
the relation conversion network module is used for obtaining, from the intermediate video feature V and the question feature Q_o, the relation feature R_z between video actions by using a relation conversion network, and aggregating the video feature V and the relation feature R_z into the relational video feature r_att through an attention mechanism;
the decoding module is used for fusing the intermediate video feature V, the question feature Q_o and the relational video feature r_att, and inputting the fusion result into a decoder of the video question to generate a question answer of the corresponding type, completing the action-based relational network video question answering.
The invention has the beneficial effects that: the invention uses the results of the temporal action detection network to assist the encoding of the video features and emphasizes the action content of the video. Because precise action interval labels are lacking, the invention does not use the detected action intervals directly; instead, it uses the action probability distributions obtained from the detection results, which avoids the error accumulation caused by incorrect action detection. The action probability distributions obtained by the temporal action detection network and the initial video features are input together into an encoder based on a recurrent neural network to learn video features, so that the final video features contain action information. Finally, the output video features and the question features are input into a multi-head relation converter network, which outputs the final result. The invention improves task performance by enhancing action features, assisted by the relation converter network, thereby obtaining a better question-answering effect.
Based on the above system, the invention also provides an action-based relational network video question-answering method, which comprises the following steps:
S1, representing all video frames as a group of fixed-dimension real-valued vectors VE through a three-dimensional convolution network and an optical flow network;
S2, representing the words in the question text as the question feature Q_o by using a co-occurrence-based word embedding method;
S3, acquiring multiple action probability distributions in the video by using a temporal action detection network, and fusing these action probability distributions with the real-valued vectors VE to obtain the intermediate video feature V;
S4, obtaining, from the intermediate video feature V and the question feature Q_o, the relation feature R_z between video actions by using a relation conversion network, and aggregating the video feature V and the relation feature R_z into the relational video feature r_att through an attention mechanism;
S5, fusing the intermediate video feature V, the question feature Q_o and the relational video feature r_att, and inputting the fusion result into a decoder of the video question to generate a question answer of the corresponding type, completing the action-based relational network video question answering.
The invention has the beneficial effects that: the invention uses the results of the temporal action detection network to assist the encoding of the video features and emphasizes the action content of the video. Because precise action interval labels are lacking, the invention does not use the detected action intervals directly; instead, it uses the action probability distributions obtained from the detection results, which avoids the error accumulation caused by incorrect action detection. The action probability distributions obtained by the temporal action detection network and the initial video features are input together into an encoder based on a recurrent neural network to learn video features, so that the final video features contain action information. Finally, the output video features and the question features are input into a multi-head relation converter network, which outputs the final result. The invention improves task performance by enhancing action features, assisted by the relation converter network, thereby obtaining a better question-answering effect.
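Before the individual steps are detailed below, the following minimal Python sketch illustrates step S2 (question encoding), which is independent of the video branch: the question words are embedded and passed through a recurrent network whose final hidden state serves as the question feature Q_o. The GRU cell, the vocabulary size and all dimensions are illustrative assumptions rather than values taken from the patent.

import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    # Sketch of step S2: word embedding followed by a recurrent network -> question feature Q_o.
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)             # word embedding, Q = {q_1, ..., q_N}
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)   # recurrent encoder (assumed GRU)

    def forward(self, word_ids):
        # word_ids: (batch, N) integer indices of the question words
        q = self.embed(word_ids)        # (batch, N, embed_dim)
        _, h = self.rnn(q)              # final hidden state of the recurrent network
        return h.squeeze(0)             # (batch, hidden_dim), used as Q_o

q_o = QuestionEncoder()(torch.randint(0, 10000, (1, 12)))
print(q_o.shape)  # torch.Size([1, 512])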
Further, the step S1 comprises the following steps:
S101, extracting T frames of images from the video according to the number of frames transmitted per second of the video file;
S102, obtaining a hidden-state representation VF = {f_1, f_2, ..., f_r} of the static feature set of the frames by using a residual network on the extracted T frames, and taking the hidden-state representation VF of the static feature set as the static-feature real-valued vector corresponding to the video, where f_r represents the residual feature corresponding to each frame of the video;
S103, obtaining a hidden-state representation VS = {s_1, s_2, ..., s_r} of the dynamic feature set of the frames by using an optical-flow convolution network on the extracted T frames, and taking the hidden-state representation VS of the dynamic feature set as the dynamic-feature real-valued vector corresponding to the video, where s_r represents the optical-flow feature corresponding to each frame of the video;
S104, fusing the static-feature real-valued vector and the dynamic-feature real-valued vector to obtain the fixed-dimension real-valued vectors VE.
The beneficial effects of the further scheme are as follows: the residual network and the optical-flow convolution network emphasize the static and dynamic characteristics of the video, respectively, which helps the model understand the video more comprehensively.
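A minimal Python sketch of steps S101 to S104, assuming the per-frame static features (from the residual network) and dynamic features (from the optical-flow network) have already been extracted; the linear projections, the dimensions and the name FrameEncoder are illustrative stand-ins, not the actual backbones used by the invention.

import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    # Sketch of step S1: fuse static (residual-network) and dynamic (optical-flow) frame
    # features into a group of fixed-dimension real-valued vectors VE.
    def __init__(self, static_dim=2048, flow_dim=1024, fuse_dim=512):
        super().__init__()
        self.static_proj = nn.Linear(static_dim, fuse_dim)  # stands in for the static features VF
        self.flow_proj = nn.Linear(flow_dim, fuse_dim)      # stands in for the dynamic features VS
        self.fuse = nn.Linear(2 * fuse_dim, fuse_dim)       # S104: fuse static and dynamic vectors

    def forward(self, vf, vs):
        # vf: (T, static_dim) residual features, vs: (T, flow_dim) optical-flow features
        f = torch.relu(self.static_proj(vf))
        s = torch.relu(self.flow_proj(vs))
        return self.fuse(torch.cat([f, s], dim=-1))         # (T, fuse_dim) real-valued vectors VE

ve = FrameEncoder()(torch.randn(32, 2048), torch.randn(32, 1024))  # T = 32 frames
print(ve.shape)  # torch.Size([32, 512])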
Still further, the step S2 comprises the following steps:
S201, processing the input question as a word sequence according to the question text;
S202, converting the word sequence into a set of fixed-dimension real-valued vectors Q = {q_1, q_2, ..., q_N} by using a word embedding method, where q_N represents the feature vector corresponding to the last word and N represents the length of the question sequence;
S203, inputting the real-valued vector set Q into a recurrent neural network to obtain the question feature Q_o.
Still further, the step S3 comprises the following steps:
S301, processing the video sequence by utilizing the temporal action detection network to obtain multiple action probability distributions {(tf_{s1}, tf_{e1}), ..., (tf_{sM}, tf_{eM})} in the video, where tf_{sM} represents the start time frame of a detected action, tf_{eM} represents the end time frame of a detected action, and M represents the first M action probability distributions;
S302, converting the multiple action probability distributions into the corresponding mask matrices, and fusing the mask matrices with the real-valued vectors VE to obtain the intermediate video feature V.
The beneficial effects of the further scheme are as follows: the invention is the first to use the information provided by the action detection network to assist the encoding of the input video features, effectively embedding the temporal attributes of the video into the video features; the newly generated video features contain the detected action-centered information, and this action information is important for answering questions correctly.
Still further, the step S302 comprises the following steps:
S3021, converting the multiple action probability distributions into corresponding initial mask matrices to obtain a subset VE_1 of the real-valued vectors VE;
S3022, defining a zero matrix Mask_1 with the same size as the real-valued vectors VE, and assigning 1 to the columns of the zero matrix Mask_1 corresponding to the subset VE_1 to obtain the final mask matrix;
S3023, fusing the final mask matrix with the real-valued vectors VE through bit-wise multiplication, and calculating the mask matrices corresponding to the multiple action intervals in the same way;
S3024, adding the fused matrices BSN_fj corresponding to the multiple action intervals to obtain the video feature BSN_f;
S3025, calculating the intermediate video feature V from the video feature BSN_f and the real-valued vectors VE; the expression of the intermediate video feature V is as follows:
V = VE + BSN_f
BSN_f = Σ_{j=1}^{M} BSN_fj
BSN_fj = VE ⊙ Mask_j
where VE represents the real-valued vectors, BSN_f represents the action-encoded video feature, BSN_fj represents the masked feature corresponding to the j-th action interval, ⊙ represents bit-wise multiplication, M represents the first M action probability distributions, j represents the j-th detected action, and 1 ≤ j ≤ M.
The beneficial effects of the further scheme are as follows: by fusing the multiple action probability distributions with the real-valued vectors VE, the invention can embed the action information into the video representation without changing the shape of the features.
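Read as code, the fusion above amounts to the following short Python sketch, which assumes the detected action intervals are already given as (start_frame, end_frame) pairs; the function name action_encode and all shapes are illustrative.

import torch

def action_encode(ve, intervals):
    # V = VE + sum over j of (VE ⊙ Mask_j), taken over the top-M detected action intervals.
    # ve: (T, D) real-valued frame features VE; intervals: list of (start_frame, end_frame) pairs.
    bsn_f = torch.zeros_like(ve)
    for start, end in intervals:            # j = 1..M
        mask = torch.zeros_like(ve)         # zero matrix Mask_j with the same size as VE
        mask[start:end + 1] = 1.0           # frames inside the j-th action interval are set to 1
        bsn_f = bsn_f + ve * mask           # accumulate BSN_fj = VE ⊙ Mask_j
    return ve + bsn_f                       # intermediate video feature V = VE + BSN_f

v = action_encode(torch.randn(32, 512), [(3, 9), (15, 20)])
print(v.shape)  # torch.Size([32, 512])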
Still further, the step S4 comprises the following steps:
S401, converting the intermediate video feature V into the video feature VP by using a fully connected network;
S402, processing the video feature VP and the question feature Q_o by utilizing a relation network to obtain the relation features r_i; the expression of the relation feature r_i is as follows:
r_i = W_r([vp_i, vp_{i+1}, ..., vp_{i+F}, Q_o]) + b_r
where W_r represents the parameter matrix to be trained, vp_{i+F} represents the feature corresponding to the frame F frames after the i-th frame, and b_r represents the bias parameter to be trained;
S403, fusing the video feature VP and the question feature Q_o by utilizing the relation conversion network, and calculating the relation feature R_z from the fusion result and the relation features r_i; the expression of the relation feature R_z is as follows:
R_z = R_1 || R_2 || ... || R_K
R_k = Relation-Module_k(VP, Q_o)
where Relation-Module_k(·) represents the calculation of the k-th relation sub-network, R_k represents the output of the k-th relation sub-network, K represents the total number of relation sub-networks, and || represents splicing the outputs of the K relation sub-networks;
S404, calculating the relation feature R'_z from the relation feature R_z by using a feed-forward network and layer normalization;
S405, aggregating the relation feature R'_z into the relational video feature r_att by using an attention mechanism; the expression of the relational video feature r_att is as follows:
r_att = Attention_r(R'_z, Q_o)
where Attention_r(·) represents the attention mechanism.
The beneficial effects of the further scheme are as follows: the invention proposes, for the first time, a video model based on a relation conversion network, aiming to make better use of the temporal attributes distributed across video frames and their interactions. On top of the inter-frame relation features mined by the relation network, a multi-head structure and several characteristics of the converter network are added, giving the system a stronger video feature extraction capability.
Still further, the expression of the video feature VP in step S401 is as follows:
VP = {vp_1, vp_2, ..., vp_T}
vp_i = W_p × v_i + b_p
1 ≤ i ≤ T
where vp_T represents the video feature corresponding to the last frame image, W_p represents the parameter matrix to be trained, v_i represents the intermediate video feature of the i-th frame, b_p represents the bias parameter to be trained, T represents the total number of video frames, i indexes the i-th frame of the video, and vp_i represents the feature corresponding to the i-th frame of the video.
Still further, the expression of the relation feature R'_z in step S404 is as follows:
R'_z = LayerNorm(R_z + FFN(R_z))
FFN(R_z) = W_f2(W_f1 · R_z + b_f1) + b_f2
where R'_z represents the relation feature after layer normalization, LayerNorm represents layer normalization, FFN(·) represents the computation of the feed-forward network, FFN(R_z) represents applying the feed-forward network to R_z, b_f1 represents the bias parameter of the first layer of the feed-forward network, W_f1 represents the parameter matrix of the first layer of the feed-forward network, W_f2 represents the parameter matrix of the second layer of the feed-forward network, and b_f2 represents the bias parameter of the second layer of the feed-forward network.
The beneficial effects of the further scheme are as follows: by combining multiple relation networks, the invention can fully and efficiently model the complex associations between frames.
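The relation conversion network of steps S401 to S405 can be sketched as follows in Python. This is a hedged sketch under several assumptions: a simple additive attention stands in for Attention_r, a plain two-layer feed-forward network is used, and K, F and all dimensions are illustrative; the class name RelationTransformer is not from the patent.

import torch
import torch.nn as nn

class RelationTransformer(nn.Module):
    # Sketch of S401-S405: K relation sub-networks over frame features VP and the question
    # feature Q_o, spliced, passed through a feed-forward network with LayerNorm, and pooled
    # into the relational video feature r_att by a question-guided attention.
    def __init__(self, v_dim=512, q_dim=512, rel_dim=256, K=4, F=1):
        super().__init__()
        self.F = F
        self.proj = nn.Linear(v_dim, v_dim)                    # S401: vp_i = W_p * v_i + b_p
        self.heads = nn.ModuleList(                            # S402: r_i = W_r([vp_i..vp_{i+F}, Q_o]) + b_r
            [nn.Linear((F + 1) * v_dim + q_dim, rel_dim) for _ in range(K)])
        self.ffn = nn.Sequential(nn.Linear(K * rel_dim, K * rel_dim), nn.ReLU(),
                                 nn.Linear(K * rel_dim, K * rel_dim))
        self.norm = nn.LayerNorm(K * rel_dim)
        self.att = nn.Linear(K * rel_dim + q_dim, 1)           # scoring layer for the attention pooling

    def forward(self, v, q_o):
        # v: (T, v_dim) intermediate video features, q_o: (q_dim,) question feature
        vp = self.proj(v)
        T = vp.size(0)
        windows = [torch.cat([vp[i:i + self.F + 1].reshape(-1), q_o]) for i in range(T - self.F)]
        x = torch.stack(windows)                               # one row per group of F+1 frames plus Q_o
        r_z = torch.cat([head(x) for head in self.heads], dim=-1)   # S403: splice the K heads
        r_z2 = self.norm(r_z + self.ffn(r_z))                  # S404: feed-forward + LayerNorm
        scores = torch.softmax(
            self.att(torch.cat([r_z2, q_o.expand(r_z2.size(0), -1)], dim=-1)), dim=0)
        return (scores * r_z2).sum(dim=0)                      # S405: relational video feature r_att

r_att = RelationTransformer()(torch.randn(32, 512), torch.randn(512))
print(r_att.shape)  # torch.Size([1024])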
Still further, the step S5 comprises the following steps:
S501, aggregating the intermediate video features V into a comprehensive video representation v_att by using an attention mechanism:
v_att = Attention_v(V, Q_o)
where Attention_v(·) represents the attention mechanism and Q_o represents the question feature;
S502, fusing the comprehensive video representation v_att, the relational video feature r_att and the question feature Q_o by bit-wise addition to obtain the final representation J, inputting the final representation J into a decoder of the video question to generate a question answer of the corresponding type, and completing the action-based relational network video question answering.
The beneficial effects of the further scheme are as follows: fusing the comprehensive video representation v_att, the relational video feature r_att and the question feature Q_o adds the relation information to the fused feature, which helps the system answer the question better.
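A minimal sketch of steps S501 and S502, assuming v_att, r_att and Q_o share a common dimension so that bit-wise (element-wise) addition is possible. The additive attention used for Attention_v, the linear classifier standing in for the answer decoder, and all dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class FusionDecoder(nn.Module):
    # Sketch of S5: question-guided attention over V (Attention_v), bit-wise fusion of
    # v_att, r_att and Q_o into the final representation J, and a stand-in answer decoder.
    def __init__(self, dim=512, num_answers=1000):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(), nn.Linear(dim, 1))
        self.decoder = nn.Linear(dim, num_answers)             # stand-in decoder of the video question

    def forward(self, v, r_att, q_o):
        # v: (T, dim) intermediate video features; r_att, q_o: (dim,)
        q = q_o.expand(v.size(0), -1)
        w = torch.softmax(self.score(torch.cat([v, q], dim=-1)), dim=0)
        v_att = (w * v).sum(dim=0)                             # S501: v_att = Attention_v(V, Q_o)
        j = v_att + r_att + q_o                                # S502: bit-wise addition -> J
        return self.decoder(j)                                 # scores over candidate answers

logits = FusionDecoder()(torch.randn(32, 512), torch.randn(512), torch.randn(512))
print(logits.argmax().item())  # index of the predicted answer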
Drawings
FIG. 1 is a schematic diagram of the system of the present invention.
FIG. 2 is a schematic flow chart of the method of the present invention.
Fig. 3 is a schematic flow chart in this embodiment.
Detailed Description
The following description of the embodiments of the invention is provided to facilitate understanding by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of these embodiments; for those of ordinary skill in the art, various changes are apparent and fall within the spirit and scope of the invention as defined and determined by the appended claims, and all inventions and creations made using the inventive concept are protected.
Example 1
As shown in fig. 1, an action-based relational network video question-answering system comprises an encoding module, a question feature module, an action detection module, a relation conversion network module and a decoding module. The encoding module is used for representing all frames of the video as a group of fixed-dimension real-valued vectors VE through a three-dimensional convolution network and an optical flow network. The question feature module is used for representing the words in the question text as the question feature Q_o using a co-occurrence-based word embedding method. The action detection module is used for acquiring multiple action probability distributions in the video using a temporal action detection network, and fusing these probability distributions with the real-valued vectors VE to obtain the intermediate video feature V. The relation conversion network module is used for obtaining, from the intermediate video feature V and the question feature Q_o, the relation feature R_z between video actions using a relation conversion network, and aggregating the video feature V and the relation feature R_z into the relational video feature r_att through an attention mechanism. The decoding module is used for fusing the real-valued vectors VE, the question feature Q_o and the relational video feature r_att, and inputting the fusion result into a decoder of the video question to generate a question answer of the corresponding type, completing the action-based relational network video question answering.
In this embodiment, the invention first represents all frames of the video as a group of fixed-dimension real-valued vectors VE through a three-dimensional convolution network and an optical flow network, and then represents the words in the question text as the question feature Q_o using a co-occurrence-based word embedding method. The invention then uses the results of the temporal action detection network to assist the encoding of the video features and emphasizes the action content of the video. Because precise action interval labels are lacking, the invention does not use the detected action intervals directly; instead, it uses the action probability distributions obtained from the detection results, which avoids the error accumulation caused by incorrect action detection. The action probability distributions obtained by the temporal action detection network and the initial video features are input together into an encoder based on a recurrent neural network to learn video features, so that the final video features contain action information. Finally, the output video features and the question features are input into a multi-head relation converter network, which outputs the final result. The invention improves task performance by enhancing action features, assisted by the relation converter network, thereby obtaining a better question-answering effect.
Example 2
As shown in fig. 2 and fig. 3, the invention further provides an action-based relational network video question-answering method, which is implemented as follows:
S1, representing all video frames as a group of fixed-dimension real-valued vectors VE through a three-dimensional convolution network and an optical flow network, implemented as follows:
S101, extracting T frames of images from the video according to the number of frames transmitted per second of the video file;
S102, obtaining a hidden-state representation VF = {f_1, f_2, ..., f_r} of the static feature set of the frames by using a residual network on the extracted T frames, and taking the hidden-state representation VF of the static feature set as the static-feature real-valued vector corresponding to the video, where f_r represents the residual feature corresponding to each frame of the video;
S103, obtaining a hidden-state representation VS = {s_1, s_2, ..., s_r} of the dynamic feature set of the frames by using an optical-flow convolution network on the extracted T frames, and taking the hidden-state representation VS of the dynamic feature set as the dynamic-feature real-valued vector corresponding to the video, where s_r represents the optical-flow feature corresponding to each frame of the video;
S104, fusing the static-feature real-valued vector and the dynamic-feature real-valued vector to obtain the fixed-dimension real-valued vectors VE;
S2, representing the words in the question text as the question feature Q_o by using a co-occurrence-based word embedding method, implemented as follows:
S201, processing the input question as a word sequence according to the question text;
S202, converting the word sequence into a set of fixed-dimension real-valued vectors Q = {q_1, q_2, ..., q_N} by using a word embedding method, where q_N represents the feature vector corresponding to the last word and N represents the length of the question sequence;
S203, inputting the real-valued vector set Q into a recurrent neural network to obtain the question feature Q_o;
S3, acquiring multiple action probability distributions in the video by using a temporal action detection network, and fusing these action probability distributions with the real-valued vectors VE to obtain the intermediate video feature V, implemented as follows:
S301, processing the video sequence by utilizing the temporal action detection network to obtain multiple action probability distributions {(tf_{s1}, tf_{e1}), ..., (tf_{sM}, tf_{eM})} in the video, where tf_{sM} represents the start time frame of a detected action, tf_{eM} represents the end time frame of a detected action, and M represents the first M action probability distributions.
In this embodiment, the video sequence is processed by the action detection network to obtain the distribution of action probabilities over the video. More specifically, since the action detection network generates probability distributions corresponding to many actions, the invention selects the top M distributions, i.e., the action intervals with the top M confidences. At the same time, the times in the detection results are converted into the corresponding frames, yielding {(tf_{s1}, tf_{e1}), ..., (tf_{sM}, tf_{eM})}, which indicates which frames are more likely to contain an action.
S302, converting the multiple action probability distributions into the corresponding mask matrices, and fusing the mask matrices with the real-valued vectors VE to obtain the intermediate video feature V, implemented as follows:
S3021, converting the multiple action probability distributions into corresponding initial mask matrices to obtain a subset VE_1 of the real-valued vectors VE;
S3022, defining a zero matrix Mask_1 with the same size as the real-valued vectors VE, and assigning 1 to the columns of the zero matrix Mask_1 corresponding to the subset VE_1 to obtain the final mask matrix;
S3023, fusing the final mask matrix with the real-valued vectors VE through bit-wise multiplication, and calculating the mask matrices corresponding to the multiple action intervals in the same way;
S3024, adding the fused matrices BSN_fj corresponding to the multiple action intervals to obtain the video feature BSN_f;
S3025, calculating the intermediate video feature V from the video feature BSN_f and the real-valued vectors VE; the expression of the intermediate video feature V is as follows:
V = VE + BSN_f
BSN_f = Σ_{j=1}^{M} BSN_fj
BSN_fj = VE ⊙ Mask_j
where VE represents the real-valued vectors, BSN_f represents the action-encoded video feature, BSN_fj represents the masked feature corresponding to the j-th action interval, ⊙ represents bit-wise multiplication (element-wise multiplication between matrices, yielding a new matrix), M represents the first M action probability distributions, j represents the j-th detected action, and 1 ≤ j ≤ M.
In this embodiment, based on the obtained action intervals {(tf_{s1}, tf_{e1}), ..., (tf_{sM}, tf_{eM})}, the system converts them into the corresponding mask matrices and then fuses the masks with the original video features VE to complete the action encoding. The action intervals are first converted into mask matrices. Taking (tf_{s1}, tf_{e1}) as an example, with (tf_{s1}, tf_{e1}) as the boundary, where tf_{s1} denotes the start time of the first action and tf_{e1} denotes the end time of the first action, the system obtains a subset VE_1 of the video feature set VE that contains only the features of the frames falling inside the corresponding detected action interval. A zero matrix Mask_1 with the same size as VE is then defined, and the columns of Mask_1 corresponding to VE_1 are all assigned 1, which formally yields Mask_1. The mask matrix and the video features VE are then fused by bit-wise multiplication, the mask matrices corresponding to the multiple action intervals are calculated in the same way, and the results are finally added together to obtain the action-encoded video feature BSN_f.
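The selection of the top-M intervals and the conversion from detection times to frame indices described above can be sketched as follows in Python; the helper name intervals_to_frames, the fps value and the example detections are illustrative assumptions, not values from the patent.

def intervals_to_frames(detections, fps, num_frames, top_m=3):
    # Keep the top-M detected action intervals by confidence and convert their
    # (start_sec, end_sec) times into (tf_s, tf_e) frame indices.
    # detections: list of (start_sec, end_sec, confidence) from a temporal action detector.
    top = sorted(detections, key=lambda d: d[2], reverse=True)[:top_m]
    frames = []
    for start_sec, end_sec, _ in top:
        start = max(0, int(start_sec * fps))
        end = min(num_frames - 1, int(end_sec * fps))
        frames.append((start, end))
    return frames

print(intervals_to_frames([(1.2, 3.4, 0.9), (0.5, 0.8, 0.4), (5.0, 6.2, 0.7)], fps=8, num_frames=64))
# [(9, 27), (40, 49), (4, 6)]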
S4, obtaining, from the intermediate video feature V and the question feature Q_o, the relation feature R_z between video actions by using the relation conversion network, and aggregating the video feature V and the relation feature R_z into the relational video feature r_att through an attention mechanism, implemented as follows:
S401, converting the intermediate video feature V into the video feature VP by using a fully connected network; the expression of the video feature VP is as follows:
VP = {vp_1, vp_2, ..., vp_T}
vp_i = W_p × v_i + b_p
1 ≤ i ≤ T
where vp_T represents the video feature corresponding to the last frame image, W_p represents the parameter matrix to be trained, v_i represents the intermediate video feature of the i-th frame, b_p represents the bias parameter to be trained, T represents the total number of video frames, i indexes the i-th frame of the video, and vp_i represents the feature corresponding to the i-th frame of the video;
S402, processing the video feature VP and the question feature Q_o by utilizing a relation network to obtain the relation features r_i; the expression of the relation feature r_i is as follows:
r_i = W_r([vp_i, vp_{i+1}, ..., vp_{i+F}, Q_o]) + b_r
where W_r represents the parameter matrix to be trained, vp_{i+F} represents the feature corresponding to the frame F frames after the i-th frame, and b_r represents the bias parameter to be trained; F is set to 1 in the present invention.
In this embodiment, the video features VP are processed by the relation network module. Given the frame-level video features VP = {vp_1, vp_2, ..., vp_T} and the question feature Q_o, where r_i denotes the i-th relation feature, the system obtains the T−(F−1) corresponding relation features R = {r_1, r_2, ..., r_{T−(F−1)}} for a set containing T frames of video features. F, the number of frames the relation network module needs to consider, is set to 1 herein. The processing procedure of this step is hereinafter referred to as Relation-Module_k.
S403, fusing the video feature VP and the question feature Q_o by utilizing the relation conversion network, and calculating the relation feature R_z from the fusion result and the relation features r_i; the expression of the relation feature R_z is as follows:
R_z = R_1 || R_2 || ... || R_K
R_k = Relation-Module_k(VP, Q_o)
where Relation-Module_k(·) represents the calculation of the k-th relation sub-network, computed in the same manner as in S402, R_k represents the output of the k-th relation sub-network, K represents the total number of relation sub-networks, and || represents splicing the outputs of the K relation sub-networks;
S404, calculating the relation feature R'_z from the relation feature R_z by using a feed-forward network and layer normalization; the expression of the relation feature R'_z is as follows:
R'_z = LayerNorm(R_z + FFN(R_z))
FFN(R_z) = W_f2(W_f1 · R_z + b_f1) + b_f2
where R'_z represents the relation feature after layer normalization, LayerNorm represents layer normalization, FFN(·) represents the computation of the feed-forward network, FFN(R_z) represents applying the feed-forward network to R_z, b_f1 represents the bias parameter of the first layer of the feed-forward network, W_f1 represents the parameter matrix of the first layer of the feed-forward network, W_f2 represents the parameter matrix of the second layer of the feed-forward network, and b_f2 represents the bias parameter of the second layer of the feed-forward network.
In this embodiment, the system provides a new relation conversion network, which also takes the frame-level video features VP = {vp_1, vp_2, ..., vp_T} and the question feature Q_o as inputs. The video features first pass through a multi-head relation network: a K-head structure is introduced, each head adopts a relation network, and the calculation is the same as in S402. The system executes each relation network module Relation-Module_k in parallel, splices the results together, and then obtains the final relation feature R'_z through the feed-forward network FFN and layer normalization LayerNorm.
S405, aggregating the relation feature R'_z into the relational video feature r_att by using an attention mechanism; the expression of the relational video feature r_att is as follows:
r_att = Attention_r(R'_z, Q_o)
where Attention_r(·) represents the attention mechanism;
S5, fusing the intermediate video feature V, the question feature Q_o and the relational video feature r_att, and inputting the fusion result into a decoder of the video question to generate a question answer of the corresponding type, completing the action-based relational network video question answering, implemented as follows:
S501, aggregating the intermediate video features V into a comprehensive video representation v_att by using an attention mechanism:
v_att = Attention_v(V, Q_o)
where Attention_v(·) represents the attention mechanism and Q_o represents the question feature;
S502, fusing the comprehensive video representation v_att, the relational video feature r_att and the question feature Q_o by bit-wise addition to obtain the final representation J, inputting the final representation J into a decoder of the video question to generate a question answer of the corresponding type, and completing the action-based relational network video question answering.
The present invention will be further described below.
In this embodiment, two common datasets are used: TGIF-QA (a video question-answering dataset for spatio-temporal joint reasoning), which has 165,165 questions and 71,741 video clips, and ActivityNet-QA (a video question-answering dataset for actions), which has 58,000 questions and 5,800 corresponding video clips. The TGIF-QA data are divided into four sub-tasks: Action, Transition, Frame and Count (action, state transition, single-frame static question, counting). Action, Transition and Frame are evaluated directly by accuracy, while Count, whose result is a numerical value, is evaluated by mean squared error (MSE). The ActivityNet-QA dataset is evaluated by accuracy and by the similarity to the standard correct answer (WUPS). Table 1 compares the effect of this method with existing methods; the values under Action, Transition and Frame are the accuracy of the model on the test set (larger is better), and the value under Count is the mean squared error between the results generated by the model on the test set and the standard answers (smaller is better). As can be seen from Table 1, the method achieves better results than the existing ST-TP method (temporal reasoning model), Co-memory method (joint memory network), PSAC method (positional self-attention with co-attention network) and HGA method (heterogeneous graph alignment model).
TABLE 1
Method         Action    Transition    Frame    Count
ST-TP          62.9      69.4          49.5     4.32
Co-memory      68.2      74.3          51.5     4.10
PSAC           70.4      76.9          55.7     4.27
HGA            75.4      81.0          55.1     4.09
This method    75.81     81.61         57.68    4.08
Table 2 reports the results on ActivityNet-QA; the values under Acc are the accuracy of the model on the test set (larger is better), and WUPS measures the similarity between the results generated by the model on the test set and the standard answers (larger is better). As can be seen from Table 2, the method achieves better results than the existing E-VQA (static question-answering model), E-MN (memory network), E-SA (soft attention network), VQA-HMAL (conditional adversarial network) and CAN (combined attention network) methods.
TABLE 2
(The entries of Table 2 are provided as an image in the original publication and are not reproduced here.)
In summary, the invention introduces two novel mechanisms, action-based encoding and a relation converter, to help improve the video question-answering system. Besides using frame-level features for the static part, the invention also attends to the action attributes in the time dimension and embeds them into the video features. In addition, the invention does not use a recurrent neural network to extract the video representation vector, but uses a new relation conversion network to capture the video features. Experiments were carried out on two large-scale video question-answering datasets, TGIF-QA and ActivityNet-QA, and the results show that the invention achieves a significant improvement over the original methods.

Claims (8)

1. An action-based relational network video question-answering system, characterized by comprising an encoding module, a question feature module, an action detection module, a relation conversion network module and a decoding module;
the encoding module is used for representing all frames of the video as a group of fixed-dimension real-valued vectors VE through a three-dimensional convolution network and an optical flow network;
the question feature module is used for representing the words in the question text as the question feature Q_o by utilizing a co-occurrence-based word embedding method;
the action detection module is used for acquiring multiple action probability distributions in the video by utilizing a temporal action detection network, and fusing these action probability distributions with the real-valued vectors VE to obtain the intermediate video feature V, implemented as follows:
A1, processing the video sequence by utilizing the temporal action detection network to obtain multiple action probability distributions {(tf_{s1}, tf_{e1}), ..., (tf_{sM}, tf_{eM})} in the video, where tf_{sM} represents the start time frame of a detected action, tf_{eM} represents the end time frame of a detected action, and M represents the first M action probability distributions;
A2, converting the multiple action probability distributions into the corresponding mask matrices, and fusing the mask matrices with the real-valued vectors VE to obtain the intermediate video feature V;
the step A2 comprises the following steps:
A21, converting the multiple action probability distributions into corresponding initial mask matrices to obtain a subset VE_1 of the real-valued vectors VE;
A22, defining a zero matrix Mask_1 with the same size as the real-valued vectors VE, and assigning 1 to the columns of the zero matrix Mask_1 corresponding to the subset VE_1 to obtain the final mask matrix;
A23, fusing the final mask matrix with the real-valued vectors VE through bit-wise multiplication, and calculating the mask matrices corresponding to the multiple action intervals in the same way;
A24, adding the fused matrices BSN_fj corresponding to the multiple action intervals to obtain the video feature BSN_f;
A25, calculating the intermediate video feature V from the video feature BSN_f and the real-valued vectors VE; the expression of the intermediate video feature V is as follows:
V = VE + BSN_f
BSN_f = Σ_{j=1}^{M} BSN_fj
BSN_fj = VE ⊙ Mask_j
where VE represents the real-valued vectors, BSN_f represents the action-encoded video feature, BSN_fj represents the masked feature corresponding to the j-th action interval, ⊙ represents bit-wise multiplication, M represents the first M action probability distributions, j represents the j-th detected action, and 1 ≤ j ≤ M;
the relation conversion network module is used for obtaining, from the intermediate video feature V and the question feature Q_o, the relation feature R_z between video actions by using a relation conversion network, and aggregating the video feature V and the relation feature R_z into the relational video feature r_att through an attention mechanism;
the decoding module is used for fusing the intermediate video feature V, the question feature Q_o and the relational video feature r_att, and inputting the fusion result into a decoder of the video question to generate a question answer of the corresponding type, completing the action-based relational network video question answering.
2. An action-based relational network video question-answering method, characterized by comprising the following steps:
S1, representing all video frames as a group of fixed-dimension real-valued vectors VE through a three-dimensional convolution network and an optical flow network;
S2, representing the words in the question text as the question feature Q_o by utilizing a co-occurrence-based word embedding method;
S3, acquiring multiple action probability distributions in the video by using a temporal action detection network, and fusing these action probability distributions with the real-valued vectors VE to obtain the intermediate video feature V;
the step S3 comprises the following steps:
S301, processing the video sequence by utilizing the temporal action detection network to obtain multiple action probability distributions {(tf_{s1}, tf_{e1}), ..., (tf_{sM}, tf_{eM})} in the video, where tf_{sM} represents the start time frame of a detected action, tf_{eM} represents the end time frame of a detected action, and M represents the first M action probability distributions;
S302, converting the multiple action probability distributions into the corresponding mask matrices, and fusing the mask matrices with the real-valued vectors VE to obtain the intermediate video feature V;
the step S302 comprises the following steps:
S3021, converting the multiple action probability distributions into corresponding initial mask matrices to obtain a subset VE_1 of the real-valued vectors VE;
S3022, defining a zero matrix Mask_1 with the same size as the real-valued vectors VE, and assigning 1 to the columns of the zero matrix Mask_1 corresponding to the subset VE_1 to obtain the final mask matrix;
S3023, fusing the final mask matrix with the real-valued vectors VE through bit-wise multiplication, and calculating the mask matrices corresponding to the multiple action intervals in the same way;
S3024, adding the fused matrices BSN_fj corresponding to the multiple action intervals to obtain the video feature BSN_f;
S3025, calculating the intermediate video feature V from the video feature BSN_f and the real-valued vectors VE; the expression of the intermediate video feature V is as follows:
V = VE + BSN_f
BSN_f = Σ_{j=1}^{M} BSN_fj
BSN_fj = VE ⊙ Mask_j
where VE represents the real-valued vectors, BSN_f represents the action-encoded video feature, BSN_fj represents the masked feature corresponding to the j-th action interval, ⊙ represents bit-wise multiplication, M represents the first M action probability distributions, j represents the j-th detected action, and 1 ≤ j ≤ M;
S4, obtaining, from the intermediate video feature V and the question feature Q_o, the relation feature R_z between video actions by using a relation conversion network, and aggregating the video feature V and the relation feature R_z into the relational video feature r_att through an attention mechanism;
S5, fusing the intermediate video feature V, the question feature Q_o and the relational video feature r_att, and inputting the fusion result into a decoder of the video question to generate a question answer of the corresponding type, completing the action-based relational network video question answering.
3. The action-based relational network video question-answering method according to claim 2, wherein the step S1 comprises the following steps:
S101, extracting T frames of images from the video according to the number of frames transmitted per second of the video file;
S102, obtaining a hidden-state representation VF = {f_1, f_2, ..., f_r} of the static feature set of the frames by using a residual network on the extracted T frames, and taking the hidden-state representation VF of the static feature set as the static-feature real-valued vector corresponding to the video, where f_r represents the residual feature corresponding to each frame of the video;
S103, obtaining a hidden-state representation VS = {s_1, s_2, ..., s_r} of the dynamic feature set of the frames by using an optical-flow convolution network on the extracted T frames, and taking the hidden-state representation VS of the dynamic feature set as the dynamic-feature real-valued vector corresponding to the video, where s_r represents the optical-flow feature corresponding to each frame of the video;
S104, fusing the static-feature real-valued vector and the dynamic-feature real-valued vector to obtain the fixed-dimension real-valued vectors VE.
4. The action-based relational network video question-answering method according to claim 2, wherein the step S2 comprises the following steps:
S201, processing the input question as a word sequence according to the question text;
S202, converting the word sequence into a set of fixed-dimension real-valued vectors Q = {q_1, q_2, ..., q_N} by using a word embedding method, where q_N represents the feature vector corresponding to the last word and N represents the length of the question sequence;
S203, inputting the real-valued vector set Q into a recurrent neural network to obtain the question feature Q_o.
5. The action-based relational network video question-answering method according to claim 2, wherein the step S4 comprises the following steps:
S401, converting the intermediate video feature V into the video feature VP by using a fully connected network;
S402, processing the video feature VP and the question feature Q_o by utilizing a relation network to obtain the relation features r_i; the expression of the relation feature r_i is as follows:
r_i = W_r([vp_i, vp_{i+1}, ..., vp_{i+F}, Q_o]) + b_r
where W_r represents the parameter matrix to be trained, vp_{i+F} represents the feature corresponding to the frame F frames after the i-th frame, and b_r represents the bias parameter to be trained;
S403, fusing the video feature VP and the question feature Q_o by utilizing the relation conversion network, and calculating the relation feature R_z from the fusion result and the relation features r_i; the expression of the relation feature R_z is as follows:
R_z = R_1 || R_2 || ... || R_K
R_k = Relation-Module_k(VP, Q_o)
where Relation-Module_k(·) represents the calculation of the k-th relation sub-network, R_k represents the output of the k-th relation sub-network, K represents the total number of relation sub-networks, and || represents splicing the outputs of the K relation sub-networks;
S404, calculating the relation feature R'_z from the relation feature R_z by using a feed-forward network and layer normalization;
S405, aggregating the relation feature R'_z into the relational video feature r_att by using an attention mechanism; the expression of the relational video feature r_att is as follows:
r_att = Attention_r(R'_z, Q_o)
where Attention_r(·) represents the attention mechanism.
6. The action-based relational network video question-answering method according to claim 5, wherein the expression of the video feature VP in step S401 is as follows:
VP = {vp_1, vp_2, ..., vp_T}
vp_i = W_p × v_i + b_p
1 ≤ i ≤ T
where vp_T represents the video feature corresponding to the last frame image, W_p represents the parameter matrix to be trained, v_i represents the intermediate video feature of the i-th frame, b_p represents the bias parameter to be trained, T represents the total number of video frames, i indexes the i-th frame of the video, and vp_i represents the feature corresponding to the i-th frame of the video.
7. The action-based relational network video question-answering method according to claim 5, wherein the expression of the relation feature R'_z in step S404 is as follows:
R'_z = LayerNorm(R_z + FFN(R_z))
FFN(R_z) = W_f2(W_f1 · R_z + b_f1) + b_f2
where R'_z represents the relation feature after layer normalization, LayerNorm represents layer normalization, FFN(·) represents the computation of the feed-forward network, FFN(R_z) represents applying the feed-forward network to R_z, b_f1 represents the bias parameter of the first layer of the feed-forward network, W_f1 represents the parameter matrix of the first layer of the feed-forward network, W_f2 represents the parameter matrix of the second layer of the feed-forward network, and b_f2 represents the bias parameter of the second layer of the feed-forward network.
8. The action-based relational network video question-answering method according to claim 2, wherein the step S5 comprises the following steps:
S501, aggregating the intermediate video features V into a comprehensive video representation v_att by using an attention mechanism:
v_att = Attention_v(V, Q_o)
where Attention_v(·) represents the attention mechanism and Q_o represents the question feature;
S502, fusing the comprehensive video representation v_att, the relational video feature r_att and the question feature Q_o by bit-wise addition to obtain the final representation J, inputting the final representation J into a decoder of the video question to generate a question answer of the corresponding type, and completing the action-based relational network video question answering.
CN202011049187.6A 2020-09-29 2020-09-29 Relational network video question-answering system and method based on actions Active CN112084319B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011049187.6A CN112084319B (en) 2020-09-29 2020-09-29 Relational network video question-answering system and method based on actions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011049187.6A CN112084319B (en) 2020-09-29 2020-09-29 Relational network video question-answering system and method based on actions

Publications (2)

Publication Number Publication Date
CN112084319A CN112084319A (en) 2020-12-15
CN112084319B true CN112084319B (en) 2021-03-16

Family

ID=73729964

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011049187.6A Active CN112084319B (en) 2020-09-29 2020-09-29 Relational network video question-answering system and method based on actions

Country Status (1)

Country Link
CN (1) CN112084319B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113536952B (en) * 2021-06-22 2023-04-21 电子科技大学 Video question-answering method based on attention network of motion capture
CN115312044B (en) * 2022-08-05 2024-06-14 清华大学 Hierarchical audio-visual feature fusion method and product for audio-video question and answer

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101149756A (en) * 2007-11-09 2008-03-26 清华大学 Individual relation finding method based on path grade at large scale community network
CN110097000A (en) * 2019-04-29 2019-08-06 东南大学 Video behavior recognition methods based on local feature Aggregation Descriptor and sequential relationship network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9123254B2 (en) * 2012-06-07 2015-09-01 Xerox Corporation Method and system for managing surveys
CN109582798A (en) * 2017-09-29 2019-04-05 阿里巴巴集团控股有限公司 Automatic question-answering method, system and equipment
CN109614613B (en) * 2018-11-30 2020-07-31 北京市商汤科技开发有限公司 Image description statement positioning method and device, electronic equipment and storage medium
CN111079532B (en) * 2019-11-13 2021-07-13 杭州电子科技大学 Video content description method based on text self-encoder

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101149756A (en) * 2007-11-09 2008-03-26 清华大学 Individual relation finding method based on path grade at large scale community network
CN110097000A (en) * 2019-04-29 2019-08-06 东南大学 Video behavior recognition methods based on local feature Aggregation Descriptor and sequential relationship network

Also Published As

Publication number Publication date
CN112084319A (en) 2020-12-15

Similar Documents

Publication Publication Date Title
CN114064918B (en) Multi-modal event knowledge graph construction method
CN111753024B (en) Multi-source heterogeneous data entity alignment method oriented to public safety field
CN109815336B (en) Text aggregation method and system
CN111538848A (en) Knowledge representation learning method fusing multi-source information
CN110765277B (en) Knowledge-graph-based mobile terminal online equipment fault diagnosis method
Liu et al. Cross-attentional spatio-temporal semantic graph networks for video question answering
CN112084319B (en) Relational network video question-answering system and method based on actions
CN112905795A (en) Text intention classification method, device and readable medium
CN111125520B (en) Event line extraction method based on deep clustering model for news text
CN113221571B (en) Entity relation joint extraction method based on entity correlation attention mechanism
CN111061951A (en) Recommendation model based on double-layer self-attention comment modeling
CN116108351A (en) Cross-language knowledge graph-oriented weak supervision entity alignment optimization method and system
CN112035689A (en) Zero sample image hash retrieval method based on vision-to-semantic network
CN117058266A (en) Handwriting word generation method based on skeleton and outline
CN115563314A (en) Knowledge graph representation learning method for multi-source information fusion enhancement
CN115687638A (en) Entity relation combined extraction method and system based on triple forest
CN115062109A (en) Entity-to-attention mechanism-based entity relationship joint extraction method
CN111428518B (en) Low-frequency word translation method and device
CN112926323A (en) Chinese named entity identification method based on multi-stage residual convolution and attention mechanism
CN117648984A (en) Intelligent question-answering method and system based on domain knowledge graph
CN117407532A (en) Method for enhancing data by using large model and collaborative training
CN115599954B (en) Video question-answering method based on scene graph reasoning
Perdana et al. Instance-based deep transfer learning on cross-domain image captioning
CN116522165A (en) Public opinion text matching system and method based on twin structure
CN115964468A (en) Rural information intelligent question-answering method and device based on multilevel template matching

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20210525

Address after: 610000 China (Sichuan) pilot Free Trade Zone, Chengdu, Sichuan

Patentee after: Sichuan Huakun Zhenyu Intelligent Technology Co.,Ltd.

Address before: No. 430, Section 2, west section of North Changjiang Road, Lingang Economic and Technological Development Zone, Yibin, Sichuan, 644000

Patentee before: Sichuan Artificial Intelligence Research Institute (Yibin)