CN111160038A - Method for generating video conversation answers and questions based on self-attention mechanism - Google Patents
- Publication number
- CN111160038A (application CN201911299062.6A)
- Authority
- CN
- China
- Prior art date: 2019-12-16
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
Abstract
The invention discloses a method for generating video conversation answers and questions based on a self-attention mechanism. The method mainly comprises the following steps: 1) for a set of video information and a dialogue-history data set, constructing a dialogue network capable of reasoning among videos, questions and answers; 2) for the constructed network, gradually updating the query information with a cross-transform module, judging with an inference gate whether the updating should end, and decoding the final output to obtain an answer. Based on the self-attention mechanism, the invention uses the cross-transform module and the inference gate to comprehensively exploit the correlation between the conversation history and the video content and to generate better-fitting answers. Compared with traditional methods, the invention performs better on video question answering.
Description
Technical Field
The invention relates to video question-answer generation, in particular to a method for generating video conversation answers and questions based on a self-attention mechanism.
Background
Video question answering is an important problem in the field of video information retrieval; its goal is to automatically generate an answer for a given video and a corresponding question.
Existing visual dialogue methods use recurrent neural networks (e.g., LSTM) to encode the dialogue history into a single vector representation, and some more advanced methods use hierarchical, attention and memory mechanisms to refine the dialogue-history representation, but they lack an explicit reasoning process. Recently, methods using a neural module network architecture have appeared, but they consider only static visual features. In video dialogues, however, there are also dynamic features such as actions and state transitions.
Therefore, the present method adopts multi-stream video information in the model, and proposes a cross-transform module and an inference gate based on the self-attention mechanism to better exploit the conversation history and the video information.
Disclosure of Invention
The invention aims to solve the problems in the prior art. To overcome the problem that only the semantic association between questions and answers is considered while the correlations within the video conversation history are ignored, the invention provides a novel method for generating video conversation answers and questions based on a self-attention mechanism, which gradually updates the query information according to the conversation history and the video content until the agent considers the information sufficient and definite. To solve the multimodal fusion problem, the invention proposes a cross-transform module that can learn finer-grained and more comprehensive interactions within and between modalities. The method first constructs a dialogue network capable of reasoning from the existing video conversation history, and then encodes the initial query information. Thereafter, the query information is updated step by step from each dialogue round. Owing to the continuity of the dialogue history, the order of inference runs from the latest round back to the first. In each round of updating, the cross-transform module merges the round into the updated query. Finally, the inference gate updates the query information once more, the final result is output, and inference stops.
In order to achieve the purpose, the invention adopts the following specific technical scheme:
a method for generating video conversation answers and questions based on a self-attention mechanism comprises the following steps:
1. Acquiring the video features of a piece of video, the video features comprising frame-level appearance features v_f and segment-level action features v_s;
2. For a query statement X, encoding with a first-type self-attention processing unit APU-1 and recording the output as q; inputting v_f and v_s obtained in step 1), each together with q, into the cross-transform module to generate features {O_fq, O_sq} combining the query and the video information, where O_fq is the query appearance feature combined with video information and O_sq is the query action feature combined with video information; the cross-transform module comprises two input channels, each channel comprising a second-type self-attention processing unit APU-2, a concatenation layer and a linear layer, and the outputs of the two channels are connected to a multi-head attention layer;
3. Encoding the historical dialogue information of a set of videos together with {O_fq, O_sq} from step 2) using the second-type self-attention processing unit APU-2, obtaining the encoded historical dialogue information of each round, denoted {H_fq^i, H_sq^i} here, where H_fq^i is the output after combining the history with O_fq and encoding, and H_sq^i is the output after combining the history with O_sq and encoding;
4. For the encoded history features {H_fq^i, H_sq^i} and {O_fq, O_sq}, judging through an inference gate whether updating needs to stop; if yes, inference ends, the finally updated {O_fq, O_sq} is output as the result of the encoder, and a decoder is used to obtain the final answer or question; if not, taking the output of the inference gate as q in step 2) and repeating steps 2) to 4), inputting the encoded history features and {O_fq, O_sq} into the inference gate for updating until inference stops.
In the present invention, the second-type self-attention processing unit APU-2 is similar to the Transformer decoder module, but its middle multi-head attention layer differs. Let the external input be I_o and the output of the multi-head self-attention layer be O_a. The normal operation in a generic Transformer decoder is:

Atten = softmax(O_a · K^T / √d) · V, with K and V projected from I_o

where d is the dimension of the vectors. In the present invention, the roles of the inputs are swapped:

Atten = softmax(I_o · K^T / √d) · V, with K and V projected from O_a
thereby enabling the external input information to guide the internal attention operation. In the case of a video conversation, the query information is used to filter relevant and useful information out of the conversation history.
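As an illustration of the swapped attention order, the following minimal NumPy sketch (an assumption-laden simplification: single head, identity key/value projections, random data) shows the external input I_o acting as the query over the self-attention output O_a:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def guided_attention(I_o, O_a):
    # The external input I_o supplies the queries; keys and values come
    # from the self-attention output O_a (identity projections here for
    # brevity), so I_o decides what is read out of O_a.
    d = O_a.shape[-1]
    K, V = O_a, O_a
    return softmax(I_o @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(0)
I_o = rng.standard_normal((5, 8))   # 5 external query positions, dim 8
O_a = rng.standard_normal((7, 8))   # 7 internal key/value positions
out = guided_attention(I_o, O_a)    # one attended vector per query position
```

The output is aligned with the external input (5 positions), not with O_a, which is what lets the query stream steer what is extracted from the other modality.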
The invention has the following beneficial effects:
(1) Unlike the prior art, the invention proposes a novel video dialogue inference mechanism that gradually updates the query information according to the dialogue history and the video content;
(2) the invention designs a cross-transform module to solve the multimodal fusion problem; the module can learn finer-grained and more comprehensive interactions within and between the visual and textual information;
(3) besides generating answers, the invention can also generate questions under the same framework, so as to build a complete video conversation system.
Drawings
FIG. 1 is a schematic overall flow diagram of the inference mechanism utilized by the present invention;
fig. 2 is a schematic diagram of a Cross-transform module (CT module for short) according to the present invention.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in fig. 1, a method for generating answers and questions of a video conversation based on a self-attention mechanism of the present invention includes the following steps:
Step one, acquiring the video features of a piece of video for a given video. The video features comprise frame-level appearance features v_f = {v_1^f, v_2^f, ..., v_{T1}^f} obtained with a pre-trained VGG network, where v_i^f represents the appearance feature of the i-th frame in the video and T_1 represents the number of frames sampled from the video; and clip-level motion features v_s = {v_1^s, v_2^s, ..., v_{T2}^s} captured with a pre-trained C3D network, where v_i^s represents the motion feature of the i-th segment in the video and T_2 represents the number of sampled segments. This finally yields the frame-level appearance features v_f and segment-level action features v_s;
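The following sketch illustrates only the sampling step implied above: uniformly choosing T_1 frame indices for per-frame appearance features and T_2 segment boundaries for per-clip motion features. The VGG and C3D extractors themselves are out of scope here, and the function name `sample_indices` is a hypothetical stand-in:

```python
import numpy as np

def sample_indices(num_frames, T1, T2):
    """Uniformly pick T1 frame indices (for VGG-style per-frame
    appearance features) and T2 segment boundaries (for C3D-style
    per-segment motion features) from a video of num_frames frames."""
    frame_idx = np.linspace(0, num_frames - 1, T1).astype(int)
    seg_bounds = np.linspace(0, num_frames, T2 + 1).astype(int)
    return frame_idx, seg_bounds

# A 60-frame video as in the YouTube Clips setting; T1 and T2 are
# illustrative choices, not values stated in the patent.
frame_idx, seg_bounds = sample_indices(60, T1=12, T2=6)
# v_f would then be a (T1, d) matrix of per-frame features and
# v_s a (T2, d) matrix of per-segment features.
```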
Step two, for the query statement X, encoding with the first-type self-attention processing unit APU-1 and recording the output as q. The APU-1 consists of a multi-head self-attention layer and a transition layer, where the transition layer is a fully connected feed-forward network composed of two linear transformations with a rectified linear activation function; the encoded X, recorded as q, serves as the input of the cross-transform module, CT module for short. The CT module comprises two input channels, each channel containing a second-type self-attention processing unit APU-2, a concatenation layer and a linear layer, and the outputs of the two channels are connected to a multi-head attention layer;
q and v_f serve as the inputs of the two channels of the CT module. After q and v_f enter their APU-2 units, the APU-2 whose input is v_f uses q as the external input of its middle multi-head attention layer, with output recorded here as O_v; at the same time, the APU-2 whose input is q uses v_f as the external input of its middle multi-head attention layer, with output recorded here as O_q. Then the output of each APU-2 is fused with its input through concatenation and a linear layer, giving F_v and F_q respectively.
Before output, the two output streams are switched again as the original inputs for the next layer of processing; the fusion effect is enhanced by stacking M such layers, and finally another multi-head attention layer fuses F_v and F_q, with the output recorded as O_fq.
Similarly, with q and v_s as the inputs of the two CT-module channels, the output is O_sq. This finally yields the features {O_fq, O_sq} combining the query and the video information, where O_fq is the query appearance feature combined with video information and O_sq is the query action feature combined with video information.
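The two-channel stacking and stream switching described above can be sketched as follows. This is a single-head NumPy approximation with random stand-in weights; the names `attend` and `ct_module`, and the exact fusion pairing, are illustrative assumptions rather than the patent's learned layers:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, kv):
    d = kv.shape[-1]
    return softmax(query @ kv.T / np.sqrt(d)) @ kv

def ct_module(q, v, M=2, seed=0):
    """Sketch of the CT module: in each of M layers the two streams
    guide attention over each other (the APU-2 step), each attended
    result is concatenated with its guiding stream and passed through
    a linear layer (random stand-in weights), and the two streams are
    then switched before the next layer. A final attention fuses them."""
    d = q.shape[-1]
    rng = np.random.default_rng(seed)
    Wv = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
    Wq = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
    for _ in range(M):
        O_v = attend(q, v)                            # q reads from the video stream
        O_q = attend(v, q)                            # video reads from the query stream
        F_v = np.concatenate([O_v, q], axis=-1) @ Wv  # cascade + linear
        F_q = np.concatenate([O_q, v], axis=-1) @ Wq
        q, v = F_v, F_q                               # switch the output streams
    return attend(q, v)                               # final fusing attention

q = np.random.default_rng(1).standard_normal((4, 8))     # encoded query
v_f = np.random.default_rng(2).standard_normal((10, 8))  # frame-level features
O_fq = ct_module(q, v_f)
```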
Step three, the historical dialogue information of a set of videos is c = (c_1, c_2, ..., c_N), where c_i denotes the historical i-th round of dialogue information and the history comprises N rounds of dialogue; each round consists of a question (w_1^i, w_2^i, ..., w_l^i) and an answer (w_1^i, w_2^i, ..., w_{l'}^i), where w_j^i denotes the j-th word in the question (or answer) of the historical i-th round, l denotes the length of the question sentence and l' the length of the answer sentence.
Then each word vector w_j in c_i is added to its corresponding position code PF_j to obtain the encoded historical i-th round dialogue information C_i. The position code is computed as in the standard Transformer:

PF(j, 2k) = sin(j / 10000^{2k/d}), PF(j, 2k+1) = cos(j / 10000^{2k/d})

where d is the dimension of the word embedding.
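The sinusoidal position code can be computed as in the standard Transformer; this small NumPy helper is a sketch under that assumption (even embedding dimension assumed):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal position codes PF_j:
    PF(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PF(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)  # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)  # odd dimensions: cosine
    return pe

pe = positional_encoding(50, 16)  # one row of codes per word position
```

These codes are simply added to the word embeddings, so the self-attention layers can distinguish word order.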
The encoded dialogue information C_i is input into the second-type self-attention processing unit APU-2, with the features {O_fq, O_sq} combining the query and the video information obtained in step two serving as the external input of the middle multi-head attention layer, and the APU-2 operation is repeated T times. When the external input is O_fq, the output is recorded here as H_fq^i; when the external input is O_sq, the output is recorded as H_sq^i.
This finally yields the encoded historical dialogue information of each round {H_fq^i, H_sq^i}, where H_fq^i is the output after combining the history with O_fq and encoding, and H_sq^i is the output after combining the history with O_sq and encoding.
Step four, input the encoded history features {H_fq^i, H_sq^i} and {O_fq, O_sq} into the inference gate for updating, starting from the final output of the latest round of historical dialogue information; determine whether inference needs to stop, and if the condition for stopping is not met, start a new round of inference with the final output of the previous round of historical dialogue information, and so on. Taking the appearance stream as an example, the gate first computes a score vector S' from the concatenation of the history feature and O_fq (using the coefficient matrices W_1, W_2 and biases b_1, b_2) and updates O_fq by element-wise multiplication with S'; the updated feature is then scored:

μ_1 = σ(W_3(O_fq W_4 + b_3) + b_4)

where σ is the activation function, [;] is the concatenation operation, ⊙ is element-wise vector multiplication, W_1, W_2, W_3, W_4 are coefficient matrices, b_1, b_2, b_3, b_4 are biases, and S' is the score vector of the inference gate; μ_1 is a scalar representing the score of the information. Following the same inference steps, when the input is H_sq^i and O_sq, the output is μ_2.
Whether updating needs to stop is judged by computing μ according to the following formula:

μ = (μ_1 + μ_2)/2

If μ is less than a given threshold τ, the inference gate outputs the updated {O_fq, O_sq}, which is taken as q in step two; steps two to four are repeated, with the encoded history features and {O_fq, O_sq} input into the inference gate for updating, until μ is larger than the given threshold τ or the inference reaches the beginning of the dialogue history, at which point inference stops and the {O_fq, O_sq} updated in the last round is output as the final result of the encoder; the decoder is then used to obtain the final result. When generating a question, the input query statement X is the latest round of dialogue information, and the final result produced by the decoder is the question corresponding to the video context; when generating an answer, the input query statement X is the question, and the final result produced by the decoder is the answer to the question.
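The overall stopping loop, from the latest round back toward the first, can be sketched as follows. Here `toy_gate` is a hypothetical stand-in for the learned inference gate, and the averaging update and threshold value are illustrative only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def progressive_inference(history, O, gate, tau=0.7):
    """Walk the dialogue history from the latest round back to the
    first, updating the query features O each round, and stop once the
    gate's confidence score mu exceeds the threshold tau (or the
    history is exhausted). `gate(h, O)` returns (updated O, mu)."""
    for h in reversed(history):   # latest round first
        O, mu = gate(h, O)
        if mu > tau:
            break                 # information deemed sufficient
    return O

def toy_gate(h, O):
    O_new = 0.5 * (O + h)              # toy update: average in the round
    mu = sigmoid(float(O_new.mean()))  # toy scalar confidence score
    return O_new, mu

history = [np.full(4, v) for v in (0.2, 0.5, 3.0)]  # 3 dialogue rounds
O_final = progressive_inference(history, np.zeros(4), toy_gate)
```

With the toy gate, the very informative latest round (value 3.0) already pushes the score past τ, so inference halts after one round; lowering the informativeness or raising τ makes the loop walk further back through the history.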
FIG. 2 is a schematic diagram of the cross-transform module (CT module for short) of the present invention. The CT module comprises two input channels, each of which contains a second-type self-attention processing unit APU-2, a concatenation layer and a linear layer, and the outputs of the two channels are connected to a multi-head attention layer;
The second-type self-attention processing unit APU-2 consists of a multi-head self-attention layer, a multi-head attention layer and a transition layer. The input vector of the APU-2, (APU-2)_input, first passes through the multi-head self-attention layer to obtain the K and V vectors, together taken as O_a; then O_a and the external input I_o serve as the inputs of the middle multi-head attention layer, and the output Atten of the multi-head attention layer is fed into the transition layer for linear transformation and activation.
The operation is given by:

K = W_K · (APU-2)_input
V = W_V · (APU-2)_input
Atten = softmax(I_o · K^T / √d) · V
(APU-2)_out = Transition(Atten)

where W_K and W_V are parameter matrices of the model, (APU-2)_input is the input vector, d is the dimension of the input vector, I_o is the external input of the multi-head attention layer, and (APU-2)_out is the output result of the APU-2;
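A single-head NumPy sketch of the four formulas above; the initial multi-head self-attention is folded into the K/V projections for brevity, all weights are random stand-ins, and the two-layer ReLU transition is an assumption about the transition layer's form:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def apu2(x, I_o, params):
    """Single-head sketch of APU-2: K and V are projected from the
    unit's own input, the external input I_o acts as the query, and a
    two-layer ReLU transition maps the attended result to the output."""
    W_K, W_V, W_1, W_2 = params
    d = x.shape[-1]
    K = x @ W_K                                    # K = W_K * (APU-2)_input
    V = x @ W_V                                    # V = W_V * (APU-2)_input
    atten = softmax(I_o @ K.T / np.sqrt(d)) @ V    # Atten = softmax(I_o K^T/sqrt(d)) V
    hidden = np.maximum(0.0, atten @ W_1)          # Transition: linear + ReLU
    return hidden @ W_2                            # ... + second linear

rng = np.random.default_rng(3)
d = 8
params = tuple(rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
x = rng.standard_normal((6, d))     # the unit's own input sequence
I_o = rng.standard_normal((4, d))   # external guiding input
y = apu2(x, I_o, params)            # output aligned with I_o
```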
The output of each of the two APU-2 units is fused with its input through concatenation and a linear layer; after fusion the output streams are switched again as the original inputs for the next layer of processing; the fusion effect is enhanced by stacking M layers, and finally the outputs of the two linear layers are fused with another multi-head attention layer, yielding the output of the CT module.
The method is applied to the following embodiments to achieve the technical effects of the present invention, and detailed steps in the embodiments are not described again.
Examples
The experiments are carried out on the YouTube Clips and TACoS-MultiLevel datasets, which contain 1987 and 1303 videos respectively. Each video in YouTube Clips has 60 frames, and each video in TACoS-MultiLevel has 80 frames. Each video has five different conversations. There are 6515 video sessions in the YouTube Clips dataset and 9935 in the TACoS-MultiLevel dataset, with 66806 and 37228 question-answer pairs respectively. Statistically, most video dialogues in the TACoS-MultiLevel dataset have five rounds of conversation, while most in the YouTube Clips dataset have between three and twelve rounds.
To evaluate the performance of the algorithm objectively, the invention uses three evaluation criteria, BLEU-N (N = 1, 2), ROUGE-L and METEOR, on the selected test set. For both datasets, the proportions of training, validation and test data are 90%, 5% and 5% of the constructed video sessions, respectively. Following the steps described in the detailed description, Table 1 shows the experimental results for generating answers on the TACoS-MultiLevel and YouTube Clips datasets, and Table 2 shows the results for generating questions on the same datasets. The method is denoted RICT.
Table 1: experimental results of the invention for answer generation on the TACoS-MultiLevel and YouTube Clips datasets under the BLEU-N (N = 1, 2), ROUGE-L and METEOR criteria
Table 2: experimental results of the invention for question generation on the TACoS-MultiLevel and YouTube Clips datasets under the BLEU-N (N = 1, 2), ROUGE-L and METEOR criteria
Claims (6)
1. A method for generating video conversation answers and questions based on a self-attention mechanism is characterized by comprising the following steps:
1) acquiring the video features of a piece of video, the video features comprising frame-level appearance features v_f and segment-level action features v_s;
2) for a query statement X, encoding with a first-type self-attention processing unit APU-1 and recording the output as q; inputting v_f and v_s obtained in step 1), each together with q, into a cross-transform module to generate features {O_fq, O_sq} combining the query and the video information, where O_fq is the query appearance feature combined with video information and O_sq is the query action feature combined with video information; the cross-transform module comprises two input channels, each channel comprising a second-type self-attention processing unit APU-2, a concatenation layer and a linear layer, and the outputs of the two channels are connected to a multi-head attention layer;
3) encoding the historical dialogue information of a set of videos together with {O_fq, O_sq} from step 2) using the second-type self-attention processing unit APU-2, obtaining the encoded historical dialogue information of each round, denoted {H_fq^i, H_sq^i} here, where H_fq^i is the output after combining the history with O_fq and encoding, and H_sq^i is the output after combining the history with O_sq and encoding;
4) for the encoded history features {H_fq^i, H_sq^i} and {O_fq, O_sq}, judging through an inference gate whether updating needs to stop; if yes, inference ends and the finally updated {O_fq, O_sq} is output as the result of the encoder, a decoder being used to obtain the final answer or question; if not, taking the output of the inference gate as q in step 2) and repeating steps 2) to 4), inputting the encoded history features and {O_fq, O_sq} into the inference gate for updating until inference stops.
2. The method for generating answers and questions for video conversation based on self-attention mechanism as claimed in claim 1, wherein the step 1) is specifically as follows:
obtaining frame-level appearance features v_f = {v_1^f, v_2^f, ..., v_{T1}^f} for a given video using a pre-trained VGG network, where v_i^f represents the appearance feature of the i-th frame in the video and T_1 represents the number of frames sampled from the video; and capturing clip-level motion features v_s = {v_1^s, v_2^s, ..., v_{T2}^s} using a pre-trained C3D network, where v_i^s represents the motion feature of the i-th segment in the video and T_2 represents the number of sampled segments.
3. The method for generating video conversation answers and questions based on a self-attention mechanism as claimed in claim 1, wherein the second-type self-attention processing unit APU-2 of step 2) and step 3) consists of a multi-head self-attention layer, a multi-head attention layer and a transition layer; the input vector of the APU-2, (APU-2)_input, first passes through the multi-head self-attention layer to obtain the K and V vectors, together taken as O_a; then O_a and the external input I_o serve as the inputs of the middle multi-head attention layer, and the output Atten of the multi-head attention layer is fed into the transition layer for linear transformation and activation;
the operation is given by:

K = W_K · (APU-2)_input
V = W_V · (APU-2)_input
Atten = softmax(I_o · K^T / √d) · V
(APU-2)_out = Transition(Atten)

wherein W_K and W_V are parameter matrices of the model, (APU-2)_input is the input vector, d is the dimension of the input vector, I_o is the external input of the multi-head attention layer, and (APU-2)_out is the output result of the APU-2.
4. The method for generating answers and questions for video conversation based on self-attention mechanism as claimed in claim 3, wherein said step 2) is specifically:
aiming at a query statement X, inputting the X after position coding into a first-class self-attention processing unit APU-1 for coding, wherein the APU-1 consists of a multi-head self-attention layer and a transition layer, and the coded X is marked as q and is used as the input of a cross conversion module, namely a CT module for short; the CT module comprises two input channels, each channel comprises a second type self-attention processing unit APU-2, a cascade layer and a linear layer, and the output of the two channels is connected with a multi-head attention layer;
q and v_f serve as the inputs of the two channels of the CT module; after q and v_f enter their APU-2 units, the APU-2 whose input is v_f uses q as the external input of its middle multi-head attention layer, with output recorded here as O_v, while the APU-2 whose input is q uses v_f as the external input of its middle multi-head attention layer, with output recorded here as O_q; the output of each APU-2 is then fused with its input through concatenation and a linear layer, giving F_v and F_q respectively;
before output, the output streams are switched again as the original inputs for the next layer of processing; the fusion effect is enhanced by stacking M layers, and finally another multi-head attention layer fuses F_v and F_q, the output being recorded as O_fq;
similarly, with q and v_s as the inputs of the two CT-module channels, the output is O_sq; this finally yields the features {O_fq, O_sq} combining the query and the video information.
5. The method for generating answers and questions for video conversation based on self-attention mechanism as claimed in claim 3, wherein said step 3) is specifically:
the historical dialogue information of a set of videos is c = (c_1, c_2, ..., c_N), wherein c_i represents the historical i-th round of dialogue information and the history comprises N rounds of dialogue; each round consists of a question (w_1^i, w_2^i, ..., w_l^i) and an answer (w_1^i, w_2^i, ..., w_{l'}^i), wherein w_j^i represents the j-th word in the question (or answer) of the historical i-th round, l represents the length of the question sentence and l' the length of the answer sentence;
then each word vector w_j in c_i is added to its corresponding position code PF_j to obtain the encoded historical i-th round dialogue information C_i, the position code being computed as:

PF(j, 2k) = sin(j / 10000^{2k/d}), PF(j, 2k+1) = cos(j / 10000^{2k/d})

where d is the dimension of the word embedding;
the encoded dialogue information C_i is input into the second-type self-attention processing unit APU-2, with the features {O_fq, O_sq} combining the query and the video information obtained in step 2) serving as the external input of the middle multi-head attention layer, and the APU-2 operation is repeated T times to obtain the final output of each round of encoded historical dialogue information {H_fq^i, H_sq^i}, wherein when the external input is O_fq the output is recorded as H_fq^i, and when the external input is O_sq the output is recorded as H_sq^i.
6. The method for generating answers and questions for video conversation based on self-attention mechanism as claimed in claim 1, wherein said step 4) is specifically:
inputting the encoded history features {H_fq^i, H_sq^i} and {O_fq, O_sq} into the inference gate for updating, starting from the final output of the latest round of historical dialogue information; determining whether inference needs to stop, and if the condition for stopping is not met, starting a new round of inference with the final output of the previous round of historical dialogue information, and so on; taking the appearance stream as an example, the gate first computes a score vector S' from the concatenation of the history feature and O_fq (using the coefficient matrices W_1, W_2 and biases b_1, b_2) and updates O_fq by element-wise multiplication with S'; the updated feature is then scored:

μ_1 = σ(W_3(O_fq W_4 + b_3) + b_4)

wherein σ is the activation function, [;] is the concatenation operation, ⊙ is element-wise vector multiplication, W_1, W_2, W_3, W_4 are coefficient matrices, b_1, b_2, b_3, b_4 are biases, and S' is the score vector of the inference gate; μ_1 is a scalar representing the score of the information; following the same inference steps, when the input is H_sq^i and O_sq, the output is μ_2;
whether updating needs to stop is judged by computing μ according to the following formula:

μ = (μ_1 + μ_2)/2

if μ is less than a given threshold τ, the inference gate outputs the updated {O_fq, O_sq}, which is taken as q in step 2); steps 2) to 4) are repeated, with the encoded history features and {O_fq, O_sq} input into the inference gate for updating, until μ is larger than the given threshold τ or the inference reaches the beginning of the dialogue history; inference then stops, the {O_fq, O_sq} updated in the last round is output as the final result of the encoder, and the decoder is used to obtain the final result; when generating a question, the input query statement X is the latest round of dialogue information, and the final result produced by the decoder is the question corresponding to the video context; when generating an answer, the input query statement X is the question, and the final result produced by the decoder is the answer to the question.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911299062.6A CN111160038A (en) | 2019-12-16 | 2019-12-16 | Method for generating video conversation answers and questions based on self-attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111160038A true CN111160038A (en) | 2020-05-15 |
Family
ID=70557326
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911299062.6A Withdrawn CN111160038A (en) | 2019-12-16 | 2019-12-16 | Method for generating video conversation answers and questions based on self-attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111160038A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112559698A (en) * | 2020-11-02 | 2021-03-26 | 山东师范大学 | Method and system for improving video question-answering precision based on multi-mode fusion model |
CN113704437A (en) * | 2021-09-03 | 2021-11-26 | 重庆邮电大学 | Knowledge base question-answering method integrating multi-head attention mechanism and relative position coding |
CN117172732A (en) * | 2023-07-31 | 2023-12-05 | 北京五八赶集信息技术有限公司 | Recruitment service system, method, equipment and storage medium based on AI |
CN117808443A (en) * | 2023-07-31 | 2024-04-02 | 北京五八赶集信息技术有限公司 | Recruitment service system, method, equipment and storage medium based on AI |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110377711A (en) * | 2019-07-01 | 2019-10-25 | 浙江大学 | A method of open long video question-answering task is solved from attention network using layering convolution |
CN110502627A (en) * | 2019-08-28 | 2019-11-26 | 上海海事大学 | A kind of answer generation method based on multilayer Transformer polymerization encoder |
CN111699498A (en) * | 2018-02-09 | 2020-09-22 | 易享信息技术有限公司 | Multitask learning as question and answer |
Non-Patent Citations (2)
Title |
---|
ASHISH VASWANI ET AL.: "Attention Is All You Need", 31st Conference on Neural Information Processing Systems
WEIKE JIN ET AL.: "Video Dialog via Progressive Inference and Cross-Transformer", Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112559698A (en) * | 2020-11-02 | 2021-03-26 | Shandong Normal University | Method and system for improving video question-answering precision based on a multi-modal fusion model |
CN113704437A (en) * | 2021-09-03 | 2021-11-26 | Chongqing University of Posts and Telecommunications | Knowledge base question-answering method integrating multi-head attention mechanism and relative position encoding |
CN113704437B (en) * | 2021-09-03 | 2023-08-11 | Chongqing University of Posts and Telecommunications | Knowledge base question-answering method integrating multi-head attention mechanism and relative position encoding |
CN117172732A (en) * | 2023-07-31 | 2023-12-05 | Beijing 58 Ganji Information Technology Co., Ltd. | AI-based recruitment service system, method, device and storage medium |
CN117808443A (en) * | 2023-07-31 | 2024-04-02 | Beijing 58 Ganji Information Technology Co., Ltd. | AI-based recruitment service system, method, device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111160038A (en) | Method for generating video conversation answers and questions based on self-attention mechanism | |
CN109785824B (en) | Training method and device of voice translation model | |
CN109992657B (en) | Conversational question generation method based on enhanced dynamic reasoning | |
CN112417134B (en) | Automatic abstract generation system and method based on voice text deep fusion features | |
CN113591902A (en) | Cross-modal understanding and generating method and device based on multi-modal pre-training model | |
CN111625660A (en) | Dialog generation method, video comment method, device, equipment and storage medium | |
CN112287675B (en) | Intelligent customer service intention understanding method based on text and voice information fusion | |
CN115964467A (en) | Rich-semantic dialogue generation method fused with visual context | |
CN109902164B (en) | Method for solving open-ended long-video question answering using a convolutional bidirectional self-attention network | |
CN112966083B (en) | Multi-turn dialogue generation method and device based on dialogue history modeling | |
US12046249B2 (en) | Bandwidth extension of incoming data using neural networks | |
CN111754992A (en) | Noise robust audio/video bimodal speech recognition method and system | |
CN115293132B (en) | Dialogue processing method and device for virtual scenes, electronic apparatus, and storage medium | |
Li et al. | A deep reinforcement learning framework for identifying funny scenes in movies |
CN111382257A (en) | Method and system for generating dialog context | |
CN115712709A (en) | Multi-modal dialog question-answer generation method based on multi-relationship graph model | |
CN115269836A (en) | Intention identification method and device | |
CN111625629A (en) | Task-based conversational robot response method, device, robot and storage medium | |
CN113656542A (en) | Talk-script recommendation method based on information retrieval and ranking | |
CN116208772A (en) | Data processing method, device, electronic equipment and computer readable storage medium | |
Lin et al. | A hierarchical structured multi-head attention network for multi-turn response generation | |
CN113515617B (en) | Method, device and equipment for generating model through dialogue | |
CN111783434B (en) | Method and system for improving noise immunity of reply generation model | |
Kumar et al. | Towards robust speech recognition model using Deep Learning | |
Meng et al. | Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20200515 |