CN111160038A - Method for generating video conversation answers and questions based on self-attention mechanism

Info

Publication number
CN111160038A
CN111160038A
Authority
CN
China
Prior art keywords
video
input
output
apu
self
Prior art date
Legal status
Withdrawn
Application number
CN201911299062.6A
Other languages
Chinese (zh)
Inventor
赵洲 (Zhou Zhao)
许津铭 (Jinming Xu)
金韦克 (Weike Jin)
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University (ZJU)
Priority to CN201911299062.6A
Publication of CN111160038A
Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for generating video conversation answers and questions based on a self-attention mechanism. The method mainly comprises the following steps: 1) constructing, for a set of video information and dialogue history data, a dialogue network capable of reasoning over videos, questions and answers; 2) over the constructed network, gradually updating the query information with a cross-transformer module, judging with an inference gate whether the updating should stop, and decoding the final output to obtain an answer. Based on the self-attention mechanism, the invention uses the cross-transformer module and the inference gate to comprehensively exploit the correlation between the dialogue history and the video content and to generate better-matching answers. Compared with traditional methods, the invention achieves better results in video question answering.

Description

Method for generating video conversation answers and questions based on self-attention mechanism
Technical Field
The invention relates to video question-answer generation, in particular to a method for generating video conversation answers and questions based on a self-attention mechanism.
Background
Video question answering is an important problem in the field of video information retrieval; its goal is to automatically generate an answer for a given video and a corresponding question.
Existing visual dialogue methods use recurrent neural networks (e.g., LSTM) to encode the dialogue history into a single vector representation, and some more advanced methods use hierarchical, attention and memory mechanisms to refine the dialogue history representation, but they lack an explicit reasoning process. Recently, methods based on neural module network architectures have been proposed, but they consider only static visual features. In video dialogue, however, there are also dynamic features, such as actions and state transitions.
Therefore, the present method incorporates multi-stream video information into the model and, to make better use of the dialogue history and the video information, provides a cross-transformer module and an inference gate based on the self-attention mechanism.
Disclosure of Invention
The invention aims to solve the problems in the prior art. To overcome the problem that existing methods attend only to the semantic relatedness between questions and answers while ignoring the correlations within the video dialogue history, the invention provides a novel method for generating video dialogue answers and questions based on a self-attention mechanism, in which the query information is gradually updated according to the dialogue history and the video content until the agent considers the information sufficient and unambiguous. To solve the multimodal fusion problem, the invention proposes a cross-transformer module that can learn finer-grained and more comprehensive interactions within and between modalities. The method first uses the existing video dialogue history to construct a dialogue network capable of reasoning, and then encodes the initial query information. Thereafter, the query information is updated step by step from each round of dialogue. Because of the continuity of the dialogue history, the order of inference runs from the latest round back to the first round. In each round of updating, the cross-transformer module is used to fuse the dialogue information into the updated query. Finally, the query information is updated once more with the inference gate, the final result is output, and inference stops.
In order to achieve the purpose, the invention adopts the following specific technical scheme:
a method for generating video conversation answers and questions based on a self-attention mechanism comprises the following steps:
1. acquiring video characteristics of a piece of video, wherein the video characteristics comprise appearance characteristics v at a frame levelfAnd action characteristics v at segment levels
2. Aiming at a query statement X, a first type of self-attention processing unit APU-1 is adopted for coding, and the output is recorded as q; v obtained in step 1)fAnd vsRespectively with q-input cross-conversion module, generating query and video information combined features { Ofq,OsqIn which O isfqFor looking up appearance characteristics combined with video information, OsqQuerying the action characteristics combined with the video information; the cross conversion module comprises two input channels, each channel comprises a second type self-attention processing unit APU-2, a cascade layer and a linear layer, and the output of the two channels is connected with a multi-head attention layer;
3. historical dialog information for a set of videos, and { O) from step 2)fq,OsqAnd adopting a second type of self-attention processing unit APU-2 for coding to obtain each round of coded historical dialogue information
Figure BDA0002320397910000021
Wherein
Figure BDA0002320397910000022
For historical dialogue information and OfqThe output results after the combination and encoding are carried out,
Figure BDA0002320397910000023
for historical dialogue information and OsqCombining and encoding the output results;
4. to is directed at
Figure BDA0002320397910000024
And { Ofq,OsqJudging whether updating needs to be stopped or not through an inference gate, and if so, judging whether updating needs to be stoppedIf yes, the reasoning is finished and the updated { O } is finally obtainedfq,OsqOutputting as a result of the encoder and using a decoder to obtain a final answer or question; if not, taking the output of the inference gate as q in the step 2), and repeating the step 2) to the step 4), and enabling the output of the inference gate to be the q
Figure BDA0002320397910000025
And { Ofq,OsqAnd inputting the data to an inference gate respectively for updating until the inference is stopped.
In the present invention, the second-type self-attention processing unit APU-2 is similar to the Transformer decoder block, but the middle multi-head attention layer is different. Let the external input be I_o and let the output of the multi-head self-attention layer be O_a. The normal operation in a generic Transformer decoder is

Atten = softmax((O_a · K^T) / √d) · V,

where K and V are derived from the external input I_o and d is the dimension of the vectors. In the present invention the roles of the inputs are swapped:

Atten = softmax((I_o · K^T) / √d) · V,

where K and V are now derived from O_a, so that the external input information guides the internal attention operation. In the case of a video dialogue, the query information is used to filter relevant and useful information from the dialogue history.
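To make the swapped attention order concrete, the following is a minimal single-head sketch in Python; the class name GuidedAttention, the plain linear projections and the tensor shapes are illustrative assumptions rather than the patent's reference implementation.

```python
import torch
import torch.nn.functional as F


class GuidedAttention(torch.nn.Module):
    """Sketch of the APU-2 middle attention layer: the external input I_o
    supplies the queries, while K and V are projected from the unit's own
    internal input, so the external information guides the internal attention."""

    def __init__(self, d: int):
        super().__init__()
        self.w_q = torch.nn.Linear(d, d, bias=False)
        self.w_k = torch.nn.Linear(d, d, bias=False)
        self.w_v = torch.nn.Linear(d, d, bias=False)

    def forward(self, internal: torch.Tensor, external: torch.Tensor) -> torch.Tensor:
        # internal: (n_int, d) output O_a of the multi-head self-attention layer
        # external: (n_ext, d) external input I_o (e.g. the encoded query)
        q = self.w_q(external)                                # queries come from I_o
        k = self.w_k(internal)                                # keys come from O_a
        v = self.w_v(internal)                                # values come from O_a
        scores = q @ k.transpose(0, 1) / q.size(-1) ** 0.5    # I_o K^T / sqrt(d)
        return F.softmax(scores, dim=-1) @ v
```

A generic Transformer decoder would instead project the queries from the internal stream and the keys and values from the external one.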
The invention has the following beneficial effects:
(1) unlike the prior art, the present invention proposes a novel video dialogue inference mechanism that gradually updates query information according to dialogue history and video content;
(2) the invention designs a cross-transformer module to solve the multimodal fusion problem; the module can learn finer-grained and more comprehensive interactions within and between the visual and textual information;
(3) in addition to generating answers, the invention can also realize the generation of questions under the same framework to construct a complete video conversation system.
Drawings
FIG. 1 is a schematic overall flow diagram of the inference mechanism utilized by the present invention;
FIG. 2 is a schematic diagram of the cross-transformer module (CT module for short) of the present invention.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in FIG. 1, the method of the present invention for generating video dialogue answers and questions based on a self-attention mechanism comprises the following steps:
the method comprises the steps of acquiring video characteristics of a section of video aiming at a given video, wherein the video characteristics comprise the appearance characteristics of the video frame level acquired by using a pre-trained VGG network
Figure BDA0002320397910000031
Wherein
Figure BDA0002320397910000032
Representing the appearance of the ith frame in a video, T1Representing the number of frames sampled in the video; capturing motion features at video clip level using a pre-trained C3D network
Figure BDA0002320397910000033
Wherein
Figure BDA0002320397910000034
Representing the motion characteristics of the ith segment in the video, T2Representing the number of segments of the video sample; finally obtaining the appearance characteristic v of the video frame levelfAnd action characteristics v at segment levels
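A minimal sketch of this step is given below; the helper callables vgg_frame_features and c3d_clip_features are hypothetical stand-ins for the pre-trained VGG and C3D networks, and the sampling counts are assumptions.

```python
import numpy as np

T1, T2 = 60, 10   # assumed numbers of sampled frames and video segments


def extract_video_features(frames, clips, vgg_frame_features, c3d_clip_features):
    """Return frame-level appearance features v_f and segment-level motion
    features v_s for one video (step one of the method).

    vgg_frame_features / c3d_clip_features are hypothetical callables that wrap
    the pre-trained VGG and C3D networks and map one frame / clip to a vector.
    """
    v_f = np.stack([vgg_frame_features(f) for f in frames[:T1]])  # (T1, d_app)
    v_s = np.stack([c3d_clip_features(c) for c in clips[:T2]])    # (T2, d_mot)
    return v_f, v_s
```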
Step two, for a query sentence X, a first-type self-attention processing unit APU-1 is used for encoding, and its output is denoted q. APU-1 consists of a multi-head self-attention layer and a transition layer, where the transition layer is a fully connected feed-forward network composed of two linear transformations and a rectified linear activation function. The encoded X, denoted q, is used as the input of the cross-transformer module, CT module for short. The CT module comprises two input channels, each channel comprising a second-type self-attention processing unit APU-2, a concatenation layer and a linear layer, and the outputs of the two channels are connected to a multi-head attention layer.

q and v_f are used as the inputs of the two channels of the CT module. After q and v_f enter their APU-2 units, the APU-2 whose input is v_f uses q as the external input of its middle multi-head attention layer, and its output is denoted A_v; at the same time, the APU-2 whose input is q uses v_f as the external input of its middle multi-head attention layer, and its output is denoted A_q. The output of each APU-2 is then fused with its own input through the concatenation and linear layers:

F_v = Linear([A_v; v_f]),
F_q = Linear([A_q; q]).

Before output, the two streams are switched again and serve as the original inputs of the next layer; the fusion effect is strengthened by stacking M such layers, and finally another multi-head attention layer fuses F_v and F_q, with the output denoted O_fq.

Similarly, with q and v_s as the inputs of the two channels of the CT module, the output is O_sq. This finally yields the features {O_fq, O_sq} combining the query with the video information, where O_fq is the query combined with the appearance features of the video and O_sq is the query combined with the motion features of the video.
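A compact sketch of the two-channel cross-transformer fusion follows. It is illustrative only: the single-head guided_attention helper stands in for APU-2, both input sequences are assumed to have the same length so that the concatenation and stream switching line up, and the class name CrossTransformerSketch is not from the patent.

```python
import torch
import torch.nn.functional as F


def guided_attention(external: torch.Tensor, internal: torch.Tensor) -> torch.Tensor:
    """Stand-in for APU-2: the external input attends over the internal input."""
    d = external.size(-1)
    scores = external @ internal.transpose(0, 1) / d ** 0.5
    return F.softmax(scores, dim=-1) @ internal


class CrossTransformerSketch(torch.nn.Module):
    """Two channels, each fusing its APU-2 output with its own input through
    concatenation + a linear layer; the streams are switched between the M
    stacked layers, and a final attention step fuses F_v and F_q."""

    def __init__(self, d: int, m_layers: int = 2):
        super().__init__()
        self.lin_v = torch.nn.ModuleList([torch.nn.Linear(2 * d, d) for _ in range(m_layers)])
        self.lin_q = torch.nn.ModuleList([torch.nn.Linear(2 * d, d) for _ in range(m_layers)])

    def forward(self, q: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        f_v, f_q = v, q
        for lin_v, lin_q in zip(self.lin_v, self.lin_q):
            a_v = guided_attention(q, v)                 # channel input v, external q
            a_q = guided_attention(v, q)                 # channel input q, external v
            f_v = lin_v(torch.cat([a_v, v], dim=-1))     # F_v = Linear([A_v; v])
            f_q = lin_q(torch.cat([a_q, q], dim=-1))     # F_q = Linear([A_q; q])
            q, v = f_v, f_q                              # switch streams for the next layer
        return guided_attention(f_q, f_v)                # O_fq (or O_sq for motion features)
```

Under these assumptions, O_fq would be obtained as CrossTransformerSketch(d)(q, v_f), and O_sq from the same call with v_s in place of v_f.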
Step three, the historical dialogue information of a set of videos is c = (c_1, c_2, ..., c_N), where c_i denotes the i-th historical round of dialogue and the dialogue history of the set of videos contains N rounds. Each round of dialogue consists of a question q_i, whose j-th word is denoted w_{i,j}^q, and an answer a_i, whose j-th word is denoted w_{i,j}^a; l denotes the length of the question sentence and l' denotes the length of the answer sentence.

The question q_i and the answer a_i are concatenated end to end and denoted c_i = [q_i; a_i].

Then each word vector w_j in c_i is added to its corresponding position code PF_j to obtain the encoded information C_i of the i-th historical round of dialogue. The position code is computed with the sinusoidal position encoding of the Transformer:

PF_(pos, 2k) = sin(pos / 10000^(2k/d)), PF_(pos, 2k+1) = cos(pos / 10000^(2k/d)),

where pos is the word position and d is the dimension of the word vector.

The encoded dialogue information C_i is input into the second-type self-attention processing unit APU-2, with the query-video features {O_fq, O_sq} obtained in step 2) serving as the external input of the middle multi-head attention layer; the APU-2 operation is repeated T times. When the external input of APU-2 is O_fq, the output is denoted C_i^f; when the external input is O_sq, the output is denoted C_i^s. This finally yields the encoded historical dialogue information {C_i^f, C_i^s} of every round, where C_i^f is the output of combining the dialogue history with O_fq and encoding, and C_i^s is the output of combining the dialogue history with O_sq and encoding.
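The following sketch illustrates how one round of dialogue history could be turned into C_i before the APU-2 encoding; the sinusoidal position-code formula and the helper names are assumptions (the patent shows the formula only as an image), and the model dimension d is assumed to be even.

```python
import numpy as np


def position_codes(length: int, d: int) -> np.ndarray:
    """Transformer-style sinusoidal position codes PF_j, one row per position."""
    pos = np.arange(length)[:, None]          # (length, 1)
    k = np.arange(0, d, 2)[None, :]           # even feature indices
    angles = pos / np.power(10000.0, k / d)
    pf = np.zeros((length, d))
    pf[:, 0::2] = np.sin(angles)
    pf[:, 1::2] = np.cos(angles)              # assumes d is even
    return pf


def encode_round(question_vecs: np.ndarray, answer_vecs: np.ndarray) -> np.ndarray:
    """Concatenate q_i and a_i end to end and add position codes to obtain C_i."""
    c_i = np.concatenate([question_vecs, answer_vecs], axis=0)   # (l + l', d)
    return c_i + position_codes(c_i.shape[0], c_i.shape[1])
```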
Step four, {C_i^f, C_i^s} and {O_fq, O_sq} are respectively input into the inference gate for updating, starting from the final output {C_N^f, C_N^s} of the latest round of historical dialogue information. The inference gate determines whether inference should stop; if the condition for stopping is not met, a new round of inference is started with the final output {C_{N-1}^f, C_{N-1}^s} of the previous round of historical dialogue information, and so on.

When the input is C_i^f and O_fq, the inference gate computes a score vector S' from C_i^f and O_fq using the coefficient matrices W_1, W_2, the biases b_1, b_2, the concatenation operation [;] and the element-wise multiplication ⊙, updates O_fq accordingly, and computes the gate score

μ_1 = σ(W_3(O_fq · W_4 + b_3) + b_4),

where σ is the activation function, [;] denotes concatenation, ⊙ denotes element-wise multiplication of vectors, W_1, W_2, W_3, W_4 are coefficient matrices, b_1, b_2, b_3, b_4 are biases, S' is the score vector of the inference gate, and μ_1 is a scalar representing the information score. Following the same inference steps, when the input is C_i^s and O_sq, the output is μ_2.

To judge whether the updating should stop, μ is computed as

μ = (μ_1 + μ_2) / 2.

If μ is smaller than a given threshold τ, the inference gate outputs the updated {O_fq, O_sq}, which is taken as q in step 2); steps 2) to 4) are repeated, and {C_i^f, C_i^s} and {O_fq, O_sq} are again input into the inference gate for updating. Inference stops when μ is larger than the given threshold τ or when inference has reached the beginning of the dialogue history; the {O_fq, O_sq} updated in the last round is output as the final result of the encoder, and the decoder is used to obtain the final result. When generating a question, the input query sentence X is the latest round of dialogue information, and the final result generated by the decoder is a question corresponding to the video context; when generating an answer, the input query sentence X is the question, and the final result generated by the decoder is the answer to the question.
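The progressive inference loop of step four can be summarized as in the sketch below. The gate internals are abstracted into a single callable inference_gate, the re-encoding of steps 2) and 3) at each iteration is omitted for brevity, and the function and argument names are assumptions made for illustration.

```python
def progressive_inference(history_f, history_s, o_fq, o_sq, inference_gate, tau):
    """Run the inference gate from the latest dialogue round back to the first.

    history_f, history_s: lists [C_1^f, ..., C_N^f] and [C_1^s, ..., C_N^s]
    o_fq, o_sq:           current query features fused with appearance / motion
    inference_gate:       callable (C, O) -> (O_updated, mu) for one stream
    tau:                  stopping threshold for the averaged gate score
    """
    for c_f, c_s in zip(reversed(history_f), reversed(history_s)):
        o_fq, mu_1 = inference_gate(c_f, o_fq)
        o_sq, mu_2 = inference_gate(c_s, o_sq)
        if (mu_1 + mu_2) / 2 > tau:   # information judged sufficient: stop early
            break
    return o_fq, o_sq                 # handed to the decoder as the encoder result
```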
FIG. 2 is a schematic diagram of the cross-transformer module, CT module for short, of the present invention. The CT module comprises two input channels, each channel comprising a second-type self-attention processing unit APU-2, a concatenation layer and a linear layer; the outputs of the two channels are connected to a multi-head attention layer.

The second-type self-attention processing unit APU-2 consists of a multi-head self-attention layer, a multi-head attention layer and a transition layer. The input vector (APU-2)_input of APU-2 first passes through the multi-head self-attention layer to obtain the V and K vectors, which together constitute O_a; then O_a and the external input I_o serve as the inputs of the middle multi-head attention layer, and the output T of the multi-head attention layer is input into the transition layer for linear transformation and activation.

The operation is given by:

K = W_K · (APU-2)_input
V = W_V · (APU-2)_input
Atten = softmax((I_o · K^T) / √d) · V
(APU-2)_out = Transition(Atten)

where W_K and W_V are parameter matrices of the model, (APU-2)_input is the input vector, d is the dimension of the input vector, I_o is the external input of the multi-head attention layer, and (APU-2)_out is the output of APU-2.

The output of each APU-2 is fused with its own input through the concatenation and linear layers; after fusion, the two streams are switched again and serve as the original inputs of the next layer. The fusion effect is strengthened by stacking M such layers, and finally the outputs of the two linear layers are fused with another multi-head attention layer to obtain the output of the CT module.
The method is applied to the following embodiments to achieve the technical effects of the present invention, and detailed steps in the embodiments are not described again.
Examples
The experiments were carried out on the YouTube Clips dataset and the TACoS-MultiLevel dataset, which contain 1987 and 1303 videos, respectively. Each video in YouTube Clips has 60 frames, and each video in TACoS-MultiLevel has 80 frames. Each video has five different conversations. There are 6515 video dialogues in the YouTube Clips dataset and 9935 in the TACoS-MultiLevel dataset, with 66806 and 37228 question-answer pairs, respectively. Statistically, most video dialogues in the TACoS-MultiLevel dataset have five rounds of conversation, while most dialogues in the YouTube Clips dataset have between three and twelve rounds.
In order to objectively evaluate the performance of the algorithm of the invention, its effect is assessed on the selected test set with three evaluation criteria: BLEU-N (N = 1, 2), ROUGE-L and METEOR. For both datasets, the proportions of training, validation and test data are 90%, 5% and 5% of the constructed video dialogues, respectively. Following the steps described in the detailed description, Table 1 shows the experimental results for answer generation on the TACoS-MultiLevel and YouTube Clips datasets, and Table 2 shows the results for question generation on the same datasets. The method of the invention is denoted RICT:
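For reference, BLEU-1 and BLEU-2 scores of the kind reported below can be computed with NLTK as in this sketch; the example sentences are made up, and ROUGE-L and METEOR require separate tooling.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the man is slicing a cucumber".split()   # made-up ground-truth answer
hypothesis = "the man slices a cucumber".split()      # made-up generated answer

smooth = SmoothingFunction().method1
bleu1 = sentence_bleu([reference], hypothesis, weights=(1.0,), smoothing_function=smooth)
bleu2 = sentence_bleu([reference], hypothesis, weights=(0.5, 0.5), smoothing_function=smooth)
print(f"BLEU-1 = {bleu1:.3f}, BLEU-2 = {bleu2:.3f}")
```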
Table 1: experimental results of the invention for answer generation on the TACoS-MultiLevel and YouTube Clips datasets under the BLEU-N (N = 1, 2), ROUGE-L and METEOR criteria
Table 2: experimental results of the invention for question generation on the TACoS-MultiLevel and YouTube Clips datasets under the BLEU-N (N = 1, 2), ROUGE-L and METEOR criteria

Claims (6)

1. A method for generating video conversation answers and questions based on a self-attention mechanism, characterized by comprising the following steps:
1) acquiring video features of a piece of video, wherein the video features comprise frame-level appearance features v_f and segment-level motion features v_s;
2) for a query sentence X, encoding with a first-type self-attention processing unit APU-1, the output being denoted q; inputting v_f and v_s obtained in step 1), each together with q, into a cross-transformer module to generate the features {O_fq, O_sq} that combine the query with the video information, wherein O_fq is the query combined with the appearance features of the video and O_sq is the query combined with the motion features of the video; the cross-transformer module comprises two input channels, each channel comprising a second-type self-attention processing unit APU-2, a concatenation layer and a linear layer, and the outputs of the two channels are connected to a multi-head attention layer;
3) encoding the historical dialogue information of a set of videos, together with {O_fq, O_sq} obtained in step 2), with the second-type self-attention processing unit APU-2 to obtain the encoded historical dialogue information {C_i^f, C_i^s} of every round, wherein C_i^f is the output of combining the dialogue history with O_fq and encoding, and C_i^s is the output of combining the dialogue history with O_sq and encoding;
4) for {C_i^f, C_i^s} and {O_fq, O_sq}, judging through an inference gate whether the updating should stop; if yes, ending inference, outputting the finally updated {O_fq, O_sq} as the result of the encoder, and using a decoder to obtain the final answer or question; if not, taking the output of the inference gate as q in step 2), repeating steps 2) to 4), and inputting {C_i^f, C_i^s} and {O_fq, O_sq} into the inference gate respectively for updating, until inference stops.
2. The method for generating video conversation answers and questions based on a self-attention mechanism as claimed in claim 1, wherein the step 1) is specifically as follows:
obtaining, for a given video, the frame-level appearance features v_f = (v_1^f, v_2^f, ..., v_{T_1}^f) with a pre-trained VGG network, wherein v_i^f denotes the appearance feature of the i-th frame of the video and T_1 denotes the number of frames sampled from the video; and capturing the segment-level motion features v_s = (v_1^s, v_2^s, ..., v_{T_2}^s) with a pre-trained C3D network, wherein v_i^s denotes the motion feature of the i-th segment of the video and T_2 denotes the number of segments sampled from the video.
3. The method for generating video conversation answers and questions based on a self-attention mechanism as claimed in claim 1, wherein the second-type self-attention processing unit APU-2 of step 2) and step 3) consists of a multi-head self-attention layer, a multi-head attention layer and a transition layer; the input vector (APU-2)_input of APU-2 first passes through the multi-head self-attention layer to obtain the V and K vectors, which together constitute O_a; then O_a and the external input I_o serve as the inputs of the middle multi-head attention layer, and the output T of the multi-head attention layer is input into the transition layer for linear transformation and activation;
the operation is given by:
K = W_K · (APU-2)_input
V = W_V · (APU-2)_input
Atten = softmax((I_o · K^T) / √d) · V
(APU-2)_out = Transition(Atten)
wherein W_K and W_V are parameter matrices of the model, (APU-2)_input is the input vector, d is the dimension of the input vector, I_o is the external input of the multi-head attention layer, and (APU-2)_out is the output of APU-2.
4. The method for generating video conversation answers and questions based on a self-attention mechanism as claimed in claim 3, wherein the step 2) is specifically:
for a query sentence X, inputting X after position encoding into a first-type self-attention processing unit APU-1 for encoding, wherein APU-1 consists of a multi-head self-attention layer and a transition layer; the encoded X is denoted q and serves as the input of the cross-transformer module, CT module for short; the CT module comprises two input channels, each channel comprising a second-type self-attention processing unit APU-2, a concatenation layer and a linear layer, and the outputs of the two channels are connected to a multi-head attention layer;
q and v_f are used as the inputs of the two channels of the CT module; after q and v_f enter their APU-2 units, the APU-2 whose input is v_f uses q as the external input of its middle multi-head attention layer and its output is denoted A_v, while the APU-2 whose input is q uses v_f as the external input of its middle multi-head attention layer and its output is denoted A_q; the output of each APU-2 is then fused with its own input through the concatenation and linear layers:
F_v = Linear([A_v; v_f]),
F_q = Linear([A_q; q]);
before output, the two streams are switched again and serve as the original inputs of the next layer; the fusion effect is strengthened by stacking M such layers, and finally another multi-head attention layer fuses F_v and F_q, the output being denoted O_fq;
similarly, with q and v_s as the inputs of the two channels of the CT module, the output is O_sq; this finally yields the features {O_fq, O_sq} combining the query with the video information.
5. The method for generating video conversation answers and questions based on a self-attention mechanism as claimed in claim 3, wherein the step 3) is specifically:
the historical dialogue information of a set of videos is c = (c_1, c_2, ..., c_N), wherein c_i denotes the i-th historical round of dialogue and the dialogue history of the set of videos contains N rounds; each round of dialogue consists of a question q_i, whose j-th word is denoted w_{i,j}^q, and an answer a_i, whose j-th word is denoted w_{i,j}^a, wherein l denotes the length of the question sentence and l' denotes the length of the answer sentence;
the question q_i and the answer a_i are concatenated end to end and denoted c_i = [q_i; a_i];
then each word vector w_j in c_i is added to its corresponding position code PF_j to obtain the encoded information C_i of the i-th historical round of dialogue, the position code being computed with the sinusoidal position encoding of the Transformer:
PF_(pos, 2k) = sin(pos / 10000^(2k/d)), PF_(pos, 2k+1) = cos(pos / 10000^(2k/d));
the encoded dialogue information C_i is input into the second-type self-attention processing unit APU-2, with the query-video features {O_fq, O_sq} obtained in step 2) serving as the external input of the middle multi-head attention layer; the APU-2 operation is repeated T times to obtain the final output, the encoded historical dialogue information {C_i^f, C_i^s} of every round, wherein, during the operation of APU-2, when the external input is O_fq the output is denoted C_i^f, and when the external input is O_sq the output is denoted C_i^s.
6. The method for generating video conversation answers and questions based on a self-attention mechanism as claimed in claim 1, wherein the step 4) is specifically:
{C_i^f, C_i^s} and {O_fq, O_sq} are respectively input into the inference gate for updating, starting from the final output {C_N^f, C_N^s} of the latest round of historical dialogue information; the inference gate determines whether inference should stop, and if the condition for stopping is not met, a new round of inference is started with the final output {C_{N-1}^f, C_{N-1}^s} of the previous round of historical dialogue information, and so on;
when the input is C_i^f and O_fq, the inference gate computes a score vector S' from C_i^f and O_fq using the coefficient matrices W_1, W_2, the biases b_1, b_2, the concatenation operation [;] and the element-wise multiplication ⊙, updates O_fq accordingly, and computes the gate score
μ_1 = σ(W_3(O_fq · W_4 + b_3) + b_4),
wherein σ is the activation function, W_1, W_2, W_3, W_4 are coefficient matrices, b_1, b_2, b_3, b_4 are biases, S' is the score vector of the inference gate, and μ_1 is a scalar representing the information score; following the same inference steps, when the input is C_i^s and O_sq, the output is μ_2;
to judge whether the updating should stop, μ is computed as
μ = (μ_1 + μ_2) / 2;
if μ is smaller than a given threshold τ, the inference gate outputs the updated {O_fq, O_sq}, which is taken as q in step 2); steps 2) to 4) are repeated, and {C_i^f, C_i^s} and {O_fq, O_sq} are again input into the inference gate for updating; inference stops when μ is larger than the given threshold τ or when inference has reached the beginning of the dialogue history, the {O_fq, O_sq} updated in the last round is output as the final result of the encoder, and the decoder is used to obtain the final result; when generating a question, the input query sentence X is the latest round of dialogue information and the final result generated by the decoder is a question corresponding to the video context; when generating an answer, the input query sentence X is the question and the final result generated by the decoder is the answer to the question.
CN201911299062.6A 2019-12-16 2019-12-16 Method for generating video conversation answers and questions based on self-attention mechanism Withdrawn CN111160038A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911299062.6A CN111160038A (en) 2019-12-16 2019-12-16 Method for generating video conversation answers and questions based on self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911299062.6A CN111160038A (en) 2019-12-16 2019-12-16 Method for generating video conversation answers and questions based on self-attention mechanism

Publications (1)

Publication Number Publication Date
CN111160038A true CN111160038A (en) 2020-05-15

Family

ID=70557326

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911299062.6A Withdrawn CN111160038A (en) 2019-12-16 2019-12-16 Method for generating video conversation answers and questions based on self-attention mechanism

Country Status (1)

Country Link
CN (1) CN111160038A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112559698A (en) * 2020-11-02 2021-03-26 山东师范大学 Method and system for improving video question-answering precision based on multi-mode fusion model
CN113704437A (en) * 2021-09-03 2021-11-26 重庆邮电大学 Knowledge base question-answering method integrating multi-head attention mechanism and relative position coding
CN117172732A (en) * 2023-07-31 2023-12-05 北京五八赶集信息技术有限公司 Recruitment service system, method, equipment and storage medium based on AI
CN117808443A (en) * 2023-07-31 2024-04-02 北京五八赶集信息技术有限公司 Recruitment service system, method, equipment and storage medium based on AI

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110377711A (en) * 2019-07-01 2019-10-25 浙江大学 A method of open long video question-answering task is solved from attention network using layering convolution
CN110502627A (en) * 2019-08-28 2019-11-26 上海海事大学 A kind of answer generation method based on multilayer Transformer polymerization encoder
CN111699498A (en) * 2018-02-09 2020-09-22 易享信息技术有限公司 Multitask learning as question and answer

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111699498A (en) * 2018-02-09 2020-09-22 易享信息技术有限公司 Multitask learning as question and answer
CN110377711A (en) * 2019-07-01 2019-10-25 浙江大学 A method of open long video question-answering task is solved from attention network using layering convolution
CN110502627A (en) * 2019-08-28 2019-11-26 上海海事大学 A kind of answer generation method based on multilayer Transformer polymerization encoder

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ASHISH VASWANI ET AL.: "Attention Is All You Need", 《31ST CONFERENCE ON NEURAL INFORMATION PROCESSING SYSTEMS》 *
WEIKE JIN ET AL.: "Video Dialog via Progressive Inference and Cross-Transformer", 《PROCEEDINGS OF THE 2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112559698A (en) * 2020-11-02 2021-03-26 山东师范大学 Method and system for improving video question-answering precision based on multi-mode fusion model
CN113704437A (en) * 2021-09-03 2021-11-26 重庆邮电大学 Knowledge base question-answering method integrating multi-head attention mechanism and relative position coding
CN113704437B (en) * 2021-09-03 2023-08-11 重庆邮电大学 Knowledge base question-answering method integrating multi-head attention mechanism and relative position coding
CN117172732A (en) * 2023-07-31 2023-12-05 北京五八赶集信息技术有限公司 Recruitment service system, method, equipment and storage medium based on AI
CN117808443A (en) * 2023-07-31 2024-04-02 北京五八赶集信息技术有限公司 Recruitment service system, method, equipment and storage medium based on AI

Similar Documents

Publication Publication Date Title
CN111160038A (en) Method for generating video conversation answers and questions based on self-attention mechanism
CN109785824B (en) Training method and device of voice translation model
CN109992657B (en) Dialogue type problem generation method based on enhanced dynamic reasoning
CN112417134B (en) Automatic abstract generation system and method based on voice text deep fusion features
CN113591902A (en) Cross-modal understanding and generating method and device based on multi-modal pre-training model
CN111625660A (en) Dialog generation method, video comment method, device, equipment and storage medium
CN112287675B (en) Intelligent customer service intention understanding method based on text and voice information fusion
CN115964467A (en) Visual situation fused rich semantic dialogue generation method
CN109902164B (en) Method for solving question-answering of open long format video by using convolution bidirectional self-attention network
CN112966083B (en) Multi-turn dialogue generation method and device based on dialogue history modeling
US12046249B2 (en) Bandwidth extension of incoming data using neural networks
CN111754992A (en) Noise robust audio/video bimodal speech recognition method and system
CN115293132B (en) Dialog of virtual scenes a treatment method device, electronic apparatus, and storage medium
Li et al. A deep reinforcement learning framework for Identifying funny scenes in movies
CN111382257A (en) Method and system for generating dialog context
CN115712709A (en) Multi-modal dialog question-answer generation method based on multi-relationship graph model
CN115269836A (en) Intention identification method and device
CN111625629A (en) Task-based conversational robot response method, device, robot and storage medium
CN113656542A (en) Dialect recommendation method based on information retrieval and sorting
CN116208772A (en) Data processing method, device, electronic equipment and computer readable storage medium
Lin et al. A hierarchical structured multi-head attention network for multi-turn response generation
CN113515617B (en) Method, device and equipment for generating model through dialogue
CN111783434B (en) Method and system for improving noise immunity of reply generation model
Kumar et al. Towards robust speech recognition model using Deep Learning
Meng et al. Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20200515