CN111160038A - Method for generating video conversation answers and questions based on self-attention mechanism

Info

Publication number
CN111160038A
CN111160038A
Authority
CN
China
Prior art keywords
video
input
output
apu
self
Prior art date
Legal status
Withdrawn
Application number
CN201911299062.6A
Other languages
Chinese (zh)
Inventor
赵洲 (Zhou Zhao)
许津铭 (Jinming Xu)
金韦克 (Weike Jin)
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University (ZJU)
Priority to CN201911299062.6A
Publication of CN111160038A
Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for generating video conversation answers and questions based on a self-attention mechanism. The method mainly comprises the following steps: 1) constructing, for a set of video information and dialogue history data, a dialogue network capable of reasoning over videos, questions and answers; 2) over the constructed network, gradually updating the query information with a cross-transformer module, judging with an inference gate whether the updating should stop, and decoding the final output to obtain an answer. Based on the self-attention mechanism, the invention uses the cross-transformer module and the inference gate to comprehensively exploit the correlation between the dialogue history and the video content and to generate better-matching answers. Compared with traditional methods, the invention achieves better results in video question answering.

Description

Method for generating video conversation answers and questions based on self-attention mechanism
Technical Field
The invention relates to video question-answer generation, in particular to a method for generating video conversation answers and questions based on a self-attention mechanism.
Background
Video question answering is an important problem in the field of video information retrieval; its goal is to automatically generate an answer for a given video and a corresponding question.
Existing visual dialogue methods use recurrent neural networks (e.g., LSTM) to encode the dialogue history into a single vector representation, and some more advanced methods use hierarchical, attention and memory mechanisms to refine the dialogue history representation, but they lack an explicit reasoning process. Recently, methods based on neural module network architectures have been proposed, but they consider only static visual features. In video dialogue, however, there are also dynamic features, such as actions and state transitions.
Therefore, the present method incorporates multi-stream video information into the model and, to make better use of the dialogue history and the video information, provides a cross-transformer module and an inference gate based on the self-attention mechanism.
Disclosure of Invention
The invention aims to solve the problems in the prior art. To overcome the problem that existing methods attend only to the semantic relatedness between questions and answers while ignoring the correlations within the video dialogue history, the invention provides a novel method for generating video dialogue answers and questions based on a self-attention mechanism, in which the query information is gradually updated according to the dialogue history and the video content until the agent considers the information sufficient and unambiguous. To solve the multimodal fusion problem, the invention proposes a cross-transformer module that can learn finer-grained and more comprehensive interactions within and between modalities. The method first uses the existing video dialogue history to construct a dialogue network capable of reasoning, and then encodes the initial query information. Thereafter, the query information is updated step by step from each round of dialogue. Because of the continuity of the dialogue history, the order of inference runs from the latest round back to the first round. In each round of updating, the cross-transformer module is used to fuse the dialogue information into the updated query. Finally, the query information is updated once more with the inference gate, the final result is output, and inference stops.
In order to achieve the purpose, the invention adopts the following specific technical scheme:
a method for generating video conversation answers and questions based on a self-attention mechanism comprises the following steps:
1. acquiring video characteristics of a piece of video, wherein the video characteristics comprise appearance characteristics v at a frame levelfAnd action characteristics v at segment levels
2. Aiming at a query statement X, a first type of self-attention processing unit APU-1 is adopted for coding, and the output is recorded as q; v obtained in step 1)fAnd vsRespectively with q-input cross-conversion module, generating query and video information combined features { Ofq,OsqIn which O isfqFor looking up appearance characteristics combined with video information, OsqQuerying the action characteristics combined with the video information; the cross conversion module comprises two input channels, each channel comprises a second type self-attention processing unit APU-2, a cascade layer and a linear layer, and the output of the two channels is connected with a multi-head attention layer;
3. historical dialog information for a set of videos, and { O) from step 2)fq,OsqAnd adopting a second type of self-attention processing unit APU-2 for coding to obtain each round of coded historical dialogue information
Figure BDA0002320397910000021
Wherein
Figure BDA0002320397910000022
For historical dialogue information and OfqThe output results after the combination and encoding are carried out,
Figure BDA0002320397910000023
for historical dialogue information and OsqCombining and encoding the output results;
4. to is directed at
Figure BDA0002320397910000024
And { Ofq,OsqJudging whether updating needs to be stopped or not through an inference gate, and if so, judging whether updating needs to be stoppedIf yes, the reasoning is finished and the updated { O } is finally obtainedfq,OsqOutputting as a result of the encoder and using a decoder to obtain a final answer or question; if not, taking the output of the inference gate as q in the step 2), and repeating the step 2) to the step 4), and enabling the output of the inference gate to be the q
Figure BDA0002320397910000025
And { Ofq,OsqAnd inputting the data to an inference gate respectively for updating until the inference is stopped.
In the present invention, the second-type self-attention processing unit APU-2 is similar to the Transformer decoder block, but the middle multi-head attention layer is different. Let the external input be I_o and let the output of the multi-head self-attention layer be O_a. The normal operation in a generic Transformer decoder is

Atten = softmax((O_a · K^T) / √d) · V,

where K and V are derived from the external input I_o and d is the dimension of the vectors. In the present invention the roles of the inputs are swapped:

Atten = softmax((I_o · K^T) / √d) · V,

where K and V are now derived from O_a, so that the external input information guides the internal attention operation. In the case of a video dialogue, the query information is used to filter relevant and useful information from the dialogue history.
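To make the swapped attention order concrete, the following is a minimal single-head sketch in Python; the class name GuidedAttention, the plain linear projections and the tensor shapes are illustrative assumptions rather than the patent's reference implementation.

```python
import torch
import torch.nn.functional as F


class GuidedAttention(torch.nn.Module):
    """Sketch of the APU-2 middle attention layer: the external input I_o
    supplies the queries, while K and V are projected from the unit's own
    internal input, so the external information guides the internal attention."""

    def __init__(self, d: int):
        super().__init__()
        self.w_q = torch.nn.Linear(d, d, bias=False)
        self.w_k = torch.nn.Linear(d, d, bias=False)
        self.w_v = torch.nn.Linear(d, d, bias=False)

    def forward(self, internal: torch.Tensor, external: torch.Tensor) -> torch.Tensor:
        # internal: (n_int, d) output O_a of the multi-head self-attention layer
        # external: (n_ext, d) external input I_o (e.g. the encoded query)
        q = self.w_q(external)                                # queries come from I_o
        k = self.w_k(internal)                                # keys come from O_a
        v = self.w_v(internal)                                # values come from O_a
        scores = q @ k.transpose(0, 1) / q.size(-1) ** 0.5    # I_o K^T / sqrt(d)
        return F.softmax(scores, dim=-1) @ v
```

A generic Transformer decoder would instead project the queries from the internal stream and the keys and values from the external one.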
The invention has the following beneficial effects:
(1) unlike the prior art, the present invention proposes a novel video dialogue inference mechanism that gradually updates query information according to dialogue history and video content;
(2) the invention designs a cross-transformer module to solve the multimodal fusion problem; the module can learn finer-grained and more comprehensive interactions within and between the visual and textual information;
(3) in addition to generating answers, the invention can also realize the generation of questions under the same framework to construct a complete video conversation system.
Drawings
FIG. 1 is a schematic overall flow diagram of the inference mechanism utilized by the present invention;
FIG. 2 is a schematic diagram of the cross-transformer module (CT module for short) of the present invention.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in FIG. 1, the method of the present invention for generating video dialogue answers and questions based on a self-attention mechanism comprises the following steps:
the method comprises the steps of acquiring video characteristics of a section of video aiming at a given video, wherein the video characteristics comprise the appearance characteristics of the video frame level acquired by using a pre-trained VGG network
Figure BDA0002320397910000031
Wherein
Figure BDA0002320397910000032
Representing the appearance of the ith frame in a video, T1Representing the number of frames sampled in the video; capturing motion features at video clip level using a pre-trained C3D network
Figure BDA0002320397910000033
Wherein
Figure BDA0002320397910000034
Representing the motion characteristics of the ith segment in the video, T2Representing the number of segments of the video sample; finally obtaining the appearance characteristic v of the video frame levelfAnd action characteristics v at segment levels
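A minimal sketch of this step is given below; the helper callables vgg_frame_features and c3d_clip_features are hypothetical stand-ins for the pre-trained VGG and C3D networks, and the sampling counts are assumptions.

```python
import numpy as np

T1, T2 = 60, 10   # assumed numbers of sampled frames and video segments


def extract_video_features(frames, clips, vgg_frame_features, c3d_clip_features):
    """Return frame-level appearance features v_f and segment-level motion
    features v_s for one video (step one of the method).

    vgg_frame_features / c3d_clip_features are hypothetical callables that wrap
    the pre-trained VGG and C3D networks and map one frame / clip to a vector.
    """
    v_f = np.stack([vgg_frame_features(f) for f in frames[:T1]])  # (T1, d_app)
    v_s = np.stack([c3d_clip_features(c) for c in clips[:T2]])    # (T2, d_mot)
    return v_f, v_s
```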
Step two, for a query sentence X, a first-type self-attention processing unit APU-1 is used for encoding, and its output is denoted q. APU-1 consists of a multi-head self-attention layer and a transition layer, where the transition layer is a fully connected feed-forward network composed of two linear transformations and a rectified linear activation function. The encoded X, denoted q, is used as the input of the cross-transformer module, CT module for short. The CT module comprises two input channels, each channel comprising a second-type self-attention processing unit APU-2, a concatenation layer and a linear layer, and the outputs of the two channels are connected to a multi-head attention layer.

q and v_f are used as the inputs of the two channels of the CT module. After q and v_f enter their APU-2 units, the APU-2 whose input is v_f uses q as the external input of its middle multi-head attention layer, and its output is denoted A_v; at the same time, the APU-2 whose input is q uses v_f as the external input of its middle multi-head attention layer, and its output is denoted A_q. The output of each APU-2 is then fused with its own input through the concatenation and linear layers:

F_v = Linear([A_v; v_f]),
F_q = Linear([A_q; q]).

Before output, the two streams are switched again and serve as the original inputs of the next layer; the fusion effect is strengthened by stacking M such layers, and finally another multi-head attention layer fuses F_v and F_q, with the output denoted O_fq.

Similarly, with q and v_s as the inputs of the two channels of the CT module, the output is O_sq. This finally yields the features {O_fq, O_sq} combining the query with the video information, where O_fq is the query combined with the appearance features of the video and O_sq is the query combined with the motion features of the video.
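A compact sketch of the two-channel cross-transformer fusion follows. It is illustrative only: the single-head guided_attention helper stands in for APU-2, both input sequences are assumed to have the same length so that the concatenation and stream switching line up, and the class name CrossTransformerSketch is not from the patent.

```python
import torch
import torch.nn.functional as F


def guided_attention(external: torch.Tensor, internal: torch.Tensor) -> torch.Tensor:
    """Stand-in for APU-2: the external input attends over the internal input."""
    d = external.size(-1)
    scores = external @ internal.transpose(0, 1) / d ** 0.5
    return F.softmax(scores, dim=-1) @ internal


class CrossTransformerSketch(torch.nn.Module):
    """Two channels, each fusing its APU-2 output with its own input through
    concatenation + a linear layer; the streams are switched between the M
    stacked layers, and a final attention step fuses F_v and F_q."""

    def __init__(self, d: int, m_layers: int = 2):
        super().__init__()
        self.lin_v = torch.nn.ModuleList([torch.nn.Linear(2 * d, d) for _ in range(m_layers)])
        self.lin_q = torch.nn.ModuleList([torch.nn.Linear(2 * d, d) for _ in range(m_layers)])

    def forward(self, q: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        f_v, f_q = v, q
        for lin_v, lin_q in zip(self.lin_v, self.lin_q):
            a_v = guided_attention(q, v)                 # channel input v, external q
            a_q = guided_attention(v, q)                 # channel input q, external v
            f_v = lin_v(torch.cat([a_v, v], dim=-1))     # F_v = Linear([A_v; v])
            f_q = lin_q(torch.cat([a_q, q], dim=-1))     # F_q = Linear([A_q; q])
            q, v = f_v, f_q                              # switch streams for the next layer
        return guided_attention(f_q, f_v)                # O_fq (or O_sq for motion features)
```

Under these assumptions, O_fq would be obtained as CrossTransformerSketch(d)(q, v_f), and O_sq from the same call with v_s in place of v_f.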
Step three, the historical dialogue information of a set of videos is c = (c_1, c_2, ..., c_N), where c_i denotes the i-th historical round of dialogue and the dialogue history of the set of videos contains N rounds. Each round of dialogue consists of a question q_i, whose j-th word is denoted w_{i,j}^q, and an answer a_i, whose j-th word is denoted w_{i,j}^a; l denotes the length of the question sentence and l' denotes the length of the answer sentence.

The question q_i and the answer a_i are concatenated end to end and denoted c_i = [q_i; a_i].

Then each word vector w_j in c_i is added to its corresponding position code PF_j to obtain the encoded information C_i of the i-th historical round of dialogue. The position code is computed with the sinusoidal position encoding of the Transformer:

PF_(pos, 2k) = sin(pos / 10000^(2k/d)), PF_(pos, 2k+1) = cos(pos / 10000^(2k/d)),

where pos is the word position and d is the dimension of the word vector.

The encoded dialogue information C_i is input into the second-type self-attention processing unit APU-2, with the query-video features {O_fq, O_sq} obtained in step 2) serving as the external input of the middle multi-head attention layer; the APU-2 operation is repeated T times. When the external input of APU-2 is O_fq, the output is denoted C_i^f; when the external input is O_sq, the output is denoted C_i^s. This finally yields the encoded historical dialogue information {C_i^f, C_i^s} of every round, where C_i^f is the output of combining the dialogue history with O_fq and encoding, and C_i^s is the output of combining the dialogue history with O_sq and encoding.
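The following sketch illustrates how one round of dialogue history could be turned into C_i before the APU-2 encoding; the sinusoidal position-code formula and the helper names are assumptions (the patent shows the formula only as an image), and the model dimension d is assumed to be even.

```python
import numpy as np


def position_codes(length: int, d: int) -> np.ndarray:
    """Transformer-style sinusoidal position codes PF_j, one row per position."""
    pos = np.arange(length)[:, None]          # (length, 1)
    k = np.arange(0, d, 2)[None, :]           # even feature indices
    angles = pos / np.power(10000.0, k / d)
    pf = np.zeros((length, d))
    pf[:, 0::2] = np.sin(angles)
    pf[:, 1::2] = np.cos(angles)              # assumes d is even
    return pf


def encode_round(question_vecs: np.ndarray, answer_vecs: np.ndarray) -> np.ndarray:
    """Concatenate q_i and a_i end to end and add position codes to obtain C_i."""
    c_i = np.concatenate([question_vecs, answer_vecs], axis=0)   # (l + l', d)
    return c_i + position_codes(c_i.shape[0], c_i.shape[1])
```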
Step four, {C_i^f, C_i^s} and {O_fq, O_sq} are respectively input into the inference gate for updating, starting from the final output {C_N^f, C_N^s} of the latest round of historical dialogue information. The inference gate determines whether inference should stop; if the condition for stopping is not met, a new round of inference is started with the final output {C_{N-1}^f, C_{N-1}^s} of the previous round of historical dialogue information, and so on.

When the input is C_i^f and O_fq, the inference gate computes a score vector S' from C_i^f and O_fq using the coefficient matrices W_1, W_2, the biases b_1, b_2, the concatenation operation [;] and the element-wise multiplication ⊙, updates O_fq accordingly, and computes the gate score

μ_1 = σ(W_3(O_fq · W_4 + b_3) + b_4),

where σ is the activation function, [;] denotes concatenation, ⊙ denotes element-wise multiplication of vectors, W_1, W_2, W_3, W_4 are coefficient matrices, b_1, b_2, b_3, b_4 are biases, S' is the score vector of the inference gate, and μ_1 is a scalar representing the information score. Following the same inference steps, when the input is C_i^s and O_sq, the output is μ_2.

To judge whether the updating should stop, μ is computed as

μ = (μ_1 + μ_2) / 2.

If μ is smaller than a given threshold τ, the inference gate outputs the updated {O_fq, O_sq}, which is taken as q in step 2); steps 2) to 4) are repeated, and {C_i^f, C_i^s} and {O_fq, O_sq} are again input into the inference gate for updating. Inference stops when μ is larger than the given threshold τ or when inference has reached the beginning of the dialogue history; the {O_fq, O_sq} updated in the last round is output as the final result of the encoder, and the decoder is used to obtain the final result. When generating a question, the input query sentence X is the latest round of dialogue information, and the final result generated by the decoder is a question corresponding to the video context; when generating an answer, the input query sentence X is the question, and the final result generated by the decoder is the answer to the question.
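The progressive inference loop of step four can be summarized as in the sketch below. The gate internals are abstracted into a single callable inference_gate, the re-encoding of steps 2) and 3) at each iteration is omitted for brevity, and the function and argument names are assumptions made for illustration.

```python
def progressive_inference(history_f, history_s, o_fq, o_sq, inference_gate, tau):
    """Run the inference gate from the latest dialogue round back to the first.

    history_f, history_s: lists [C_1^f, ..., C_N^f] and [C_1^s, ..., C_N^s]
    o_fq, o_sq:           current query features fused with appearance / motion
    inference_gate:       callable (C, O) -> (O_updated, mu) for one stream
    tau:                  stopping threshold for the averaged gate score
    """
    for c_f, c_s in zip(reversed(history_f), reversed(history_s)):
        o_fq, mu_1 = inference_gate(c_f, o_fq)
        o_sq, mu_2 = inference_gate(c_s, o_sq)
        if (mu_1 + mu_2) / 2 > tau:   # information judged sufficient: stop early
            break
    return o_fq, o_sq                 # handed to the decoder as the encoder result
```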
FIG. 2 is a schematic diagram of the cross-transformer module, CT module for short, of the present invention. The CT module comprises two input channels, each channel comprising a second-type self-attention processing unit APU-2, a concatenation layer and a linear layer; the outputs of the two channels are connected to a multi-head attention layer.

The second-type self-attention processing unit APU-2 consists of a multi-head self-attention layer, a multi-head attention layer and a transition layer. The input vector (APU-2)_input of APU-2 first passes through the multi-head self-attention layer to obtain the V and K vectors, which together constitute O_a; then O_a and the external input I_o serve as the inputs of the middle multi-head attention layer, and the output T of the multi-head attention layer is input into the transition layer for linear transformation and activation.

The operation is given by:

K = W_K · (APU-2)_input
V = W_V · (APU-2)_input
Atten = softmax((I_o · K^T) / √d) · V
(APU-2)_out = Transition(Atten)

where W_K and W_V are parameter matrices of the model, (APU-2)_input is the input vector, d is the dimension of the input vector, I_o is the external input of the multi-head attention layer, and (APU-2)_out is the output of APU-2.

The output of each APU-2 is fused with its own input through the concatenation and linear layers; after fusion, the two streams are switched again and serve as the original inputs of the next layer. The fusion effect is strengthened by stacking M such layers, and finally the outputs of the two linear layers are fused with another multi-head attention layer to obtain the output of the CT module.
The method is applied to the following embodiments to achieve the technical effects of the present invention, and detailed steps in the embodiments are not described again.
Examples
The experiments were carried out on the YouTube Clips dataset and the TACoS-MultiLevel dataset, which contain 1987 and 1303 videos, respectively. Each video in YouTube Clips has 60 frames, and each video in TACoS-MultiLevel has 80 frames. Each video has five different conversations. There are 6515 video dialogues in the YouTube Clips dataset and 9935 in the TACoS-MultiLevel dataset, with 66806 and 37228 question-answer pairs, respectively. Statistically, most video dialogues in the TACoS-MultiLevel dataset have five rounds of conversation, while most dialogues in the YouTube Clips dataset have between three and twelve rounds.
In order to objectively evaluate the performance of the algorithm of the invention, its effect is assessed on the selected test set with three evaluation criteria: BLEU-N (N = 1, 2), ROUGE-L and METEOR. For both datasets, the proportions of training, validation and test data are 90%, 5% and 5% of the constructed video dialogues, respectively. Following the steps described in the detailed description, Table 1 shows the experimental results for answer generation on the TACoS-MultiLevel and YouTube Clips datasets, and Table 2 shows the results for question generation on the same datasets. The method of the invention is denoted RICT:
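For reference, BLEU-1 and BLEU-2 scores of the kind reported below can be computed with NLTK as in this sketch; the example sentences are made up, and ROUGE-L and METEOR require separate tooling.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the man is slicing a cucumber".split()   # made-up ground-truth answer
hypothesis = "the man slices a cucumber".split()      # made-up generated answer

smooth = SmoothingFunction().method1
bleu1 = sentence_bleu([reference], hypothesis, weights=(1.0,), smoothing_function=smooth)
bleu2 = sentence_bleu([reference], hypothesis, weights=(0.5, 0.5), smoothing_function=smooth)
print(f"BLEU-1 = {bleu1:.3f}, BLEU-2 = {bleu2:.3f}")
```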
Table 1: experimental results of the invention for answer generation on the TACoS-MultiLevel and YouTube Clips datasets under the BLEU-N (N = 1, 2), ROUGE-L and METEOR criteria
Table 2: experimental results of the invention for question generation on the TACoS-MultiLevel and YouTube Clips datasets under the BLEU-N (N = 1, 2), ROUGE-L and METEOR criteria

Claims (6)

1. A method for generating video conversation answers and questions based on a self-attention mechanism, characterized by comprising the following steps:
1) acquiring video features of a piece of video, wherein the video features comprise frame-level appearance features v_f and segment-level motion features v_s;
2) for a query sentence X, encoding with a first-type self-attention processing unit APU-1, the output being denoted q; inputting v_f and v_s obtained in step 1), each together with q, into a cross-transformer module to generate the features {O_fq, O_sq} that combine the query with the video information, wherein O_fq is the query combined with the appearance features of the video and O_sq is the query combined with the motion features of the video; the cross-transformer module comprises two input channels, each channel comprising a second-type self-attention processing unit APU-2, a concatenation layer and a linear layer, and the outputs of the two channels are connected to a multi-head attention layer;
3) encoding the historical dialogue information of a set of videos, together with {O_fq, O_sq} obtained in step 2), with the second-type self-attention processing unit APU-2 to obtain the encoded historical dialogue information {C_i^f, C_i^s} of every round, wherein C_i^f is the output of combining the dialogue history with O_fq and encoding, and C_i^s is the output of combining the dialogue history with O_sq and encoding;
4) for {C_i^f, C_i^s} and {O_fq, O_sq}, judging through an inference gate whether the updating should stop; if yes, ending inference, outputting the finally updated {O_fq, O_sq} as the result of the encoder, and using a decoder to obtain the final answer or question; if not, taking the output of the inference gate as q in step 2), repeating steps 2) to 4), and inputting {C_i^f, C_i^s} and {O_fq, O_sq} into the inference gate respectively for updating, until inference stops.
2. The method for generating video conversation answers and questions based on a self-attention mechanism as claimed in claim 1, wherein the step 1) is specifically as follows:
obtaining, for a given video, the frame-level appearance features v_f = (v_1^f, v_2^f, ..., v_{T_1}^f) with a pre-trained VGG network, wherein v_i^f denotes the appearance feature of the i-th frame of the video and T_1 denotes the number of frames sampled from the video; and capturing the segment-level motion features v_s = (v_1^s, v_2^s, ..., v_{T_2}^s) with a pre-trained C3D network, wherein v_i^s denotes the motion feature of the i-th segment of the video and T_2 denotes the number of segments sampled from the video.
3. The method for generating video conversation answers and questions based on a self-attention mechanism as claimed in claim 1, wherein the second-type self-attention processing unit APU-2 of step 2) and step 3) consists of a multi-head self-attention layer, a multi-head attention layer and a transition layer; the input vector (APU-2)_input of APU-2 first passes through the multi-head self-attention layer to obtain the V and K vectors, which together constitute O_a; then O_a and the external input I_o serve as the inputs of the middle multi-head attention layer, and the output T of the multi-head attention layer is input into the transition layer for linear transformation and activation;
the operation is given by:
K = W_K · (APU-2)_input
V = W_V · (APU-2)_input
Atten = softmax((I_o · K^T) / √d) · V
(APU-2)_out = Transition(Atten)
wherein W_K and W_V are parameter matrices of the model, (APU-2)_input is the input vector, d is the dimension of the input vector, I_o is the external input of the multi-head attention layer, and (APU-2)_out is the output of APU-2.
4. The method for generating video conversation answers and questions based on a self-attention mechanism as claimed in claim 3, wherein the step 2) is specifically:
for a query sentence X, inputting X after position encoding into a first-type self-attention processing unit APU-1 for encoding, wherein APU-1 consists of a multi-head self-attention layer and a transition layer; the encoded X is denoted q and serves as the input of the cross-transformer module, CT module for short; the CT module comprises two input channels, each channel comprising a second-type self-attention processing unit APU-2, a concatenation layer and a linear layer, and the outputs of the two channels are connected to a multi-head attention layer;
q and v_f are used as the inputs of the two channels of the CT module; after q and v_f enter their APU-2 units, the APU-2 whose input is v_f uses q as the external input of its middle multi-head attention layer and its output is denoted A_v, while the APU-2 whose input is q uses v_f as the external input of its middle multi-head attention layer and its output is denoted A_q; the output of each APU-2 is then fused with its own input through the concatenation and linear layers:
F_v = Linear([A_v; v_f]),
F_q = Linear([A_q; q]);
before output, the two streams are switched again and serve as the original inputs of the next layer; the fusion effect is strengthened by stacking M such layers, and finally another multi-head attention layer fuses F_v and F_q, the output being denoted O_fq;
similarly, with q and v_s as the inputs of the two channels of the CT module, the output is O_sq; this finally yields the features {O_fq, O_sq} combining the query with the video information.
5. The method for generating video conversation answers and questions based on a self-attention mechanism as claimed in claim 3, wherein the step 3) is specifically:
the historical dialogue information of a set of videos is c = (c_1, c_2, ..., c_N), wherein c_i denotes the i-th historical round of dialogue and the dialogue history of the set of videos contains N rounds; each round of dialogue consists of a question q_i, whose j-th word is denoted w_{i,j}^q, and an answer a_i, whose j-th word is denoted w_{i,j}^a, wherein l denotes the length of the question sentence and l' denotes the length of the answer sentence;
the question q_i and the answer a_i are concatenated end to end and denoted c_i = [q_i; a_i];
then each word vector w_j in c_i is added to its corresponding position code PF_j to obtain the encoded information C_i of the i-th historical round of dialogue, the position code being computed with the sinusoidal position encoding of the Transformer:
PF_(pos, 2k) = sin(pos / 10000^(2k/d)), PF_(pos, 2k+1) = cos(pos / 10000^(2k/d));
the encoded dialogue information C_i is input into the second-type self-attention processing unit APU-2, with the query-video features {O_fq, O_sq} obtained in step 2) serving as the external input of the middle multi-head attention layer; the APU-2 operation is repeated T times to obtain the final output, the encoded historical dialogue information {C_i^f, C_i^s} of every round, wherein, during the operation of APU-2, when the external input is O_fq the output is denoted C_i^f, and when the external input is O_sq the output is denoted C_i^s.
6. The method for generating video conversation answers and questions based on a self-attention mechanism as claimed in claim 1, wherein the step 4) is specifically:
{C_i^f, C_i^s} and {O_fq, O_sq} are respectively input into the inference gate for updating, starting from the final output {C_N^f, C_N^s} of the latest round of historical dialogue information; the inference gate determines whether inference should stop, and if the condition for stopping is not met, a new round of inference is started with the final output {C_{N-1}^f, C_{N-1}^s} of the previous round of historical dialogue information, and so on;
when the input is C_i^f and O_fq, the inference gate computes a score vector S' from C_i^f and O_fq using the coefficient matrices W_1, W_2, the biases b_1, b_2, the concatenation operation [;] and the element-wise multiplication ⊙, updates O_fq accordingly, and computes the gate score
μ_1 = σ(W_3(O_fq · W_4 + b_3) + b_4),
wherein σ is the activation function, W_1, W_2, W_3, W_4 are coefficient matrices, b_1, b_2, b_3, b_4 are biases, S' is the score vector of the inference gate, and μ_1 is a scalar representing the information score; following the same inference steps, when the input is C_i^s and O_sq, the output is μ_2;
to judge whether the updating should stop, μ is computed as
μ = (μ_1 + μ_2) / 2;
if μ is smaller than a given threshold τ, the inference gate outputs the updated {O_fq, O_sq}, which is taken as q in step 2); steps 2) to 4) are repeated, and {C_i^f, C_i^s} and {O_fq, O_sq} are again input into the inference gate for updating; inference stops when μ is larger than the given threshold τ or when inference has reached the beginning of the dialogue history, the {O_fq, O_sq} updated in the last round is output as the final result of the encoder, and the decoder is used to obtain the final result; when generating a question, the input query sentence X is the latest round of dialogue information and the final result generated by the decoder is a question corresponding to the video context; when generating an answer, the input query sentence X is the question and the final result generated by the decoder is the answer to the question.
CN201911299062.6A 2019-12-16 2019-12-16 Method for generating video conversation answers and questions based on self-attention mechanism Withdrawn CN111160038A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911299062.6A CN111160038A (en) 2019-12-16 2019-12-16 Method for generating video conversation answers and questions based on self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911299062.6A CN111160038A (en) 2019-12-16 2019-12-16 Method for generating video conversation answers and questions based on self-attention mechanism

Publications (1)

Publication Number Publication Date
CN111160038A true CN111160038A (en) 2020-05-15

Family

ID=70557326

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911299062.6A Withdrawn CN111160038A (en) 2019-12-16 2019-12-16 Method for generating video conversation answers and questions based on self-attention mechanism

Country Status (1)

Country Link
CN (1) CN111160038A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112559698A (en) * 2020-11-02 2021-03-26 山东师范大学 Method and system for improving video question-answering precision based on multi-mode fusion model
CN113704437A (en) * 2021-09-03 2021-11-26 重庆邮电大学 Knowledge base question-answering method integrating multi-head attention mechanism and relative position coding
CN117172732A (en) * 2023-07-31 2023-12-05 北京五八赶集信息技术有限公司 Recruitment service system, method, equipment and storage medium based on AI
CN117808443A (en) * 2023-07-31 2024-04-02 北京五八赶集信息技术有限公司 Recruitment service system, method, equipment and storage medium based on AI

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110377711A (en) * 2019-07-01 2019-10-25 浙江大学 A method of open long video question-answering task is solved from attention network using layering convolution
CN110502627A (en) * 2019-08-28 2019-11-26 上海海事大学 A kind of answer generation method based on multilayer Transformer polymerization encoder
CN111699498A (en) * 2018-02-09 2020-09-22 易享信息技术有限公司 Multitask learning as question and answer

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111699498A (en) * 2018-02-09 2020-09-22 易享信息技术有限公司 Multitask learning as question and answer
CN110377711A (en) * 2019-07-01 2019-10-25 浙江大学 A method of open long video question-answering task is solved from attention network using layering convolution
CN110502627A (en) * 2019-08-28 2019-11-26 上海海事大学 A kind of answer generation method based on multilayer Transformer polymerization encoder

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ASHISH VASWANI ET AL.: "Attention Is All You Need", 《31ST CONFERENCE ON NEURAL INFORMATION PROCESSING SYSTEMS》 *
WEIKE JIN ET AL.: "Video Dialog via Progressive Inference and Cross-Transformer", 《PROCEEDINGS OF THE 2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112559698A (en) * 2020-11-02 2021-03-26 山东师范大学 Method and system for improving video question-answering precision based on multi-mode fusion model
CN113704437A (en) * 2021-09-03 2021-11-26 重庆邮电大学 Knowledge base question-answering method integrating multi-head attention mechanism and relative position coding
CN113704437B (en) * 2021-09-03 2023-08-11 重庆邮电大学 Knowledge base question-answering method integrating multi-head attention mechanism and relative position coding
CN117172732A (en) * 2023-07-31 2023-12-05 北京五八赶集信息技术有限公司 Recruitment service system, method, equipment and storage medium based on AI
CN117808443A (en) * 2023-07-31 2024-04-02 北京五八赶集信息技术有限公司 Recruitment service system, method, equipment and storage medium based on AI

Similar Documents

Publication Publication Date Title
CN111160038A (en) Method for generating video conversation answers and questions based on self-attention mechanism
CN109785824B (en) Training method and device of voice translation model
CN109992657B (en) Dialogue type problem generation method based on enhanced dynamic reasoning
CN112417134B (en) Automatic abstract generation system and method based on voice text deep fusion features
CN113591902A (en) Cross-modal understanding and generating method and device based on multi-modal pre-training model
CN111625660A (en) Dialog generation method, video comment method, device, equipment and storage medium
CN112287675B (en) Intelligent customer service intention understanding method based on text and voice information fusion
CN115964467A (en) Visual situation fused rich semantic dialogue generation method
CN109902164B (en) Method for solving question-answering of open long format video by using convolution bidirectional self-attention network
CN112966083B (en) Multi-turn dialogue generation method and device based on dialogue history modeling
US12046249B2 (en) Bandwidth extension of incoming data using neural networks
CN111754992A (en) Noise robust audio/video bimodal speech recognition method and system
CN115293132B (en) Dialog of virtual scenes a treatment method device, electronic apparatus, and storage medium
Li et al. A deep reinforcement learning framework for Identifying funny scenes in movies
CN111382257A (en) Method and system for generating dialog context
CN115712709A (en) Multi-modal dialog question-answer generation method based on multi-relationship graph model
CN115269836A (en) Intention identification method and device
CN111625629A (en) Task-based conversational robot response method, device, robot and storage medium
CN113656542A (en) Dialect recommendation method based on information retrieval and sorting
CN116208772A (en) Data processing method, device, electronic equipment and computer readable storage medium
Lin et al. A hierarchical structured multi-head attention network for multi-turn response generation
CN113515617B (en) Method, device and equipment for generating model through dialogue
CN111783434B (en) Method and system for improving noise immunity of reply generation model
Kumar et al. Towards robust speech recognition model using Deep Learning
Meng et al. Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20200515