CN109840506A - Method for solving video question-answering tasks using a video transformer with combined relation interaction - Google Patents
- Publication number
- CN109840506A (application CN201910112159.5A)
- Authority
- CN
- China
- Prior art keywords
- video
- question
- answering task
- output
- attention mechanism
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Image Analysis (AREA)
Abstract
The invention discloses a method for solving video question-answering tasks using a video transformer with combined relation interaction. The method mainly comprises the following steps: 1) design a video transformer model with combined relation interaction to produce answers for video question-answering tasks; 2) train the model to obtain the final video transformer, and use it to generate the answers to video question-answering tasks. Compared with general solutions to video question answering, the invention exploits interactive relation information and can therefore complete video question-answering tasks better. The invention achieves better results on video question-answering tasks than traditional methods.
Description
Technical field
The present invention relates to video question-answering tasks, and in particular to a method for solving video question-answering tasks using a video transformer with combined relation interaction.
Background technique
Video question answering is a very challenging task that has attracted wide attention. In this task, the system must provide a corresponding answer to a question about a particular video. Video question answering is still a relatively new task, and research on it remains immature. Research on video question answering can be applied in related fields such as computer vision and natural language processing.
Existing solutions to video question answering usually borrow traditional image question-answering methods: a convolutional neural network encodes the image, a recurrent neural network encodes the question, the two encodings are combined into a feature encoding that mixes image and question information, and a decoder uses this mixed encoding to produce the final image question-answering answer.
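As a minimal sketch, the conventional image question-answering pipeline described above (CNN image encoder, RNN question encoder, fused feature, answer decoder) can be mocked up with random stand-in features; all shapes and names here are illustrative, not from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the two encoders described above:
# a CNN would map the image to a d-dimensional feature vector,
# an RNN would map the question tokens to a d-dimensional vector.
d, n_answers = 64, 1000
image_feat = rng.standard_normal(d)      # stand-in for CNN(image)
question_feat = rng.standard_normal(d)   # stand-in for RNN(question)

# Fuse image and question information (element-wise product is one
# common choice) and decode to a distribution over candidate answers.
fused = image_feat * question_feat
W = rng.standard_normal((n_answers, d)) * 0.01
logits = W @ fused
probs = np.exp(logits - logits.max())
probs /= probs.sum()
answer_id = int(np.argmax(probs))
print(answer_id, probs.shape)
```

As the next paragraph notes, a pipeline of this shape sees one image at a time and so cannot exploit the temporal structure of a video.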
Because such methods lack an analysis of the temporal information contained in videos, the answers they generate for video question-answering tasks are inaccurate. To solve this problem, the present invention uses a video transformer with combined relation interaction to solve video question-answering tasks, improving the accuracy of the answers produced.
Summary of the invention
The object of the invention is to solve the problems of the prior art. To overcome the inability of the prior art to provide accurate answers for video question-answering tasks, the present invention provides a method for solving video question-answering tasks using a video transformer with combined relation interaction. The specific technical solution of the present invention is as follows:
A method for solving video question-answering tasks using a video transformer with combined relation interaction comprises the following steps:
1. Design a video-object relation acquisition method, and use it to obtain the spatio-temporal relation matrix of the video objects.
2. Design a multi-interaction attention unit; combining it with the spatio-temporal relation matrix of the video objects obtained in step 1, obtain a multi-interaction attention output containing the integrated information of the input sequences.
3. Using the multi-interaction attention units designed in step 2, design and train a video transformer containing an encoder and a decoder, and use the trained video transformer to obtain the answer to the corresponding video question-answering task.
The above steps can be implemented as follows:
For the video frames of a video question-answering task, a trained object detection network extracts the appearance features and position features of the objects in the video. Here N is the number of objects in the video, the appearance feature of object n is a high-dimensional vector produced by the trained model, and the position feature of each object is a 5-dimensional vector (x_n, y_n, w_n, h_n, t_n): the first four dimensions (x_n, y_n, w_n, h_n) describe the bounding box of object n (its center coordinates, width, and height), and the fifth dimension t_n gives the index of the frame in which object n appears.
From the position feature of object m and the position feature of object n, a 5-dimensional relative-relation vector (X_mn, Y_mn, W_mn, H_mn, T_mn) is computed. This vector is then mapped to a high-dimensional representation by a positional encoding composed of sine and cosine functions of different frequencies, and the mapped representations are concatenated to obtain the relative-relation feature ε_mn. From ε_mn, the spatio-temporal relation weight w_mn^R of objects m and n is computed, where W_r is a trainable weight vector. The spatio-temporal relation weights between all pairs of objects in the video form the spatio-temporal relation matrix W_R of the video objects.
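A hedged sketch of this relation-weight computation: the patent's exact formulas appear only as images, so the relative-position form below (log-ratio terms plus a frame offset) and the ReLU are borrowed from the object-relation literature and should be read as assumptions; the sinusoidal mapping and the trainable vector W_r follow the text:

```python
import numpy as np

def relative_position(fm, fn):
    # fm, fn: 5-d position features (x, y, w, h, t) as described above.
    # The exact relative-position formula is not reproduced in the text;
    # this scale-robust log-ratio form is an assumption.
    xm, ym, wm, hm, tm = fm
    xn, yn, wn, hn, tn = fn
    return np.array([np.log(abs(xm - xn) / wm + 1e-6),
                     np.log(abs(ym - yn) / hm + 1e-6),
                     np.log(wn / wm),
                     np.log(hn / hm),
                     tn - tm])

def sinusoidal_encoding(v, dim_per_component=8):
    # Map each scalar to sines/cosines of different frequencies and
    # concatenate, giving the relative-relation feature eps_mn.
    freqs = 1.0 / (10000 ** (np.arange(dim_per_component // 2) * 2.0
                             / dim_per_component))
    angles = np.outer(v, freqs)                      # (5, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1).ravel()

rng = np.random.default_rng(0)
f_m = np.array([10.0, 12.0, 4.0, 6.0, 3.0])
f_n = np.array([14.0, 9.0, 5.0, 5.0, 7.0])
eps_mn = sinusoidal_encoding(relative_position(f_m, f_n))
W_r = rng.standard_normal(eps_mn.shape[0])   # trainable weight vector W_r
w_mn = max(0.0, float(W_r @ eps_mn))         # ReLU (assumed) keeps weights non-negative
print(eps_mn.shape, w_mn)
```

Computing w_mn for every ordered pair (m, n) of the N objects fills the N×N spatio-temporal relation matrix W_R.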
Design a multi-interaction attention unit. For input matrices Q = (q_1, q_2, ..., q_{l_q}) and V = (v_1, v_2, ..., v_{l_v}), the column vectors K_ij of a three-dimensional tensor K are computed as
K_ij = q_i ∘ v_j
where q_i is the i-th column of Q, v_j is the j-th column of V, and ∘ denotes element-wise multiplication. All column vectors K_ij (i ∈ {1, ..., l_q}, j ∈ {1, ..., l_v}) are assembled into the three-dimensional tensor K, and K is partitioned into several s×s sub-tensors. For each sub-tensor K', a weighted-sum vector p is computed, where the w_ij are trainable weight scalars and b_1 is a trainable bias. Each vector p is replicated s*s times to form a new three-dimensional tensor M.
The last dimensions of K and M are sum-compressed to obtain the element-level weight matrix W_E and the segment-level weight matrix W_S. From W_E, W_S, and the input matrix V = (v_1, v_2, ..., v_{l_v}), the multi-interaction attention output O containing the integrated information of the input sequences is computed, where ∘ denotes element-wise multiplication and softmax(·) denotes the softmax function.
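The multi-interaction attention unit can be sketched as follows. The tensor K, the sub-tensor weighted sums, the replication into M, and the sum-compression into W_E and W_S follow the text; how W_E, W_S, and V combine into the output O is shown only as an image in the patent, so the final softmax-product step is one plausible reading, not the patent's formula:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_interaction_attention(Q, V, s=2, seed=0):
    # Q: (d, lq) with columns q_i; V: (d, lv) with columns v_j.
    d, lq = Q.shape
    _, lv = V.shape
    rng = np.random.default_rng(seed)

    # K_ij = q_i o v_j (element-wise product): a 3-D tensor (lq, lv, d).
    K = Q.T[:, None, :] * V.T[None, :, :]

    # Partition K into s x s sub-tensors; per sub-tensor a trainable
    # weighted sum over its cells gives a vector p, replicated s*s
    # times so M has the same shape as K.
    M = np.empty_like(K)
    for bi in range(0, lq, s):
        for bj in range(0, lv, s):
            sub = K[bi:bi + s, bj:bj + s]            # sub-tensor K'
            w = rng.standard_normal(sub.shape[:2])   # trainable scalars w_ij
            b1 = 0.0                                 # trainable bias b_1
            p = (w[..., None] * sub).sum(axis=(0, 1)) + b1
            M[bi:bi + s, bj:bj + s] = p              # replicate p s*s times

    # Sum-compress the last axis: element-level and segment-level
    # weight matrices, both (lq, lv).
    W_E = K.sum(axis=-1)
    W_S = M.sum(axis=-1)

    # Assumed combination: multiply the two softmax-normalised
    # attention maps and attend over the columns of V.
    A = softmax(W_E, axis=1) * softmax(W_S, axis=1)
    return A @ V.T                                   # output O: (lq, d)

rng = np.random.default_rng(1)
O = multi_interaction_attention(rng.standard_normal((8, 4)),
                                rng.standard_normal((8, 6)))
print(O.shape)
```

In a trained model the per-block scalars w_ij and bias b_1 would be learned parameters rather than freshly sampled values.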
The video transformer designed by the present invention consists of two parts, an encoder and a decoder. The encoder contains three parts: a question-text encoding part, a video-object encoding part, and a video-frame encoding part. The question-text encoding part works as follows: for the question text input to the video question-answering task, the embeddings of the words it contains serve as the input sequence, and the positional-encoding technique of the original Transformer provides the question-word position features. The question-word embeddings and position features are fed into the designed multi-interaction attention unit; its output passes through a concatenation operation and a linear mapping and is then fed into a feed-forward unit. After the feed-forward output passes through two linear mapping units with ReLU activations, the output of the question-text encoding part is obtained.
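The positional-encoding technique of the original Transformer that the text reuses for question words, video frames, and objects is the fixed sinusoidal encoding; a self-contained version:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Fixed sinusoidal encoding from the original Transformer:
    # even dimensions use sine, odd dimensions use cosine, with
    # geometrically spaced frequencies.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

pe = positional_encoding(seq_len=10, d_model=16)   # (positions, model dim)
print(pe.shape, float(pe[0, 0]))
```

The resulting (seq_len, d_model) matrix is added to (or concatenated with, depending on the design) the corresponding input embeddings.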
The video-frame encoding part works as follows: for the video frame sequence input to the video question-answering task, ResNet extracts video-frame features as the input sequence, and the positional-encoding technique of the original Transformer provides the video-frame position features. The video-frame features and position features are fed into a designed multi-interaction attention unit; its output passes through a concatenation operation and a linear mapping and, together with the output of the question-text encoding part, is fed into another multi-interaction attention unit. That unit's output passes through a concatenation operation and a linear mapping and is then fed into a feed-forward unit. After the feed-forward output passes through two linear mapping units with ReLU activations, the output of the video-frame encoding part is obtained. This output is fed back into the video-frame encoding part, and after T such iterations the final output of the video-frame encoding part is obtained.
The video-object encoding part of the encoder works as follows: the object appearance features and object position features extracted from the video serve as the input sequence and are fed into a designed multi-interaction attention unit; its output passes through a concatenation operation and a linear mapping and, together with the output of the question-text encoding part, is fed into another multi-interaction attention unit. That unit's output passes through a concatenation operation and a linear mapping and is fed into a feed-forward unit. After the feed-forward output passes through two linear mapping units with ReLU activations, the output of the video-object encoding part is obtained. This output is fed back into the video-object encoding part, and after T such iterations the final output of the video-object encoding part is obtained.
The output of the video-frame encoding part and the output of the video-object encoding part are concatenated and fed through a linear mapping unit to obtain the encoder output of the video transformer.
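A shape-level sketch of the encoder wiring described above (branch outputs fed through two ReLU linear maps, T feedback iterations, then concatenation and one linear map into F_vo); the attention units are abstracted away and all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_relu(x, W):
    return np.maximum(0.0, x @ W)

def feed_forward(x, W1, W2):
    # Two linear mapping units with ReLU activations, as in each branch.
    return linear_relu(linear_relu(x, W1), W2)

d = 32
# Stand-ins for the three branch outputs after their multi-interaction
# attention units (illustrative, not the patent's actual features):
F_q = rng.standard_normal(d)   # question-text branch
F_v = rng.standard_normal(d)   # video-frame branch
F_o = rng.standard_normal(d)   # video-object branch

T = 3
W1 = rng.standard_normal((d, d)) * 0.1
W2 = rng.standard_normal((d, d)) * 0.1
for _ in range(T):             # each branch feeds its output back T times
    F_v = feed_forward(F_v, W1, W2)
    F_o = feed_forward(F_o, W1, W2)

# Concatenate the frame and object outputs and apply one linear map
# to obtain the encoder output F_vo.
W_out = rng.standard_normal((2 * d, d)) * 0.1
F_vo = np.concatenate([F_v, F_o]) @ W_out
print(F_vo.shape)
```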
The video transformer has three kinds of decoders, for multiple-choice video question-answering tasks, open-ended numeric video question-answering tasks, and open-ended text video question-answering tasks respectively:
For multiple-choice video question-answering tasks, an evaluation score s is computed for each candidate answer, where W^T is the transpose of a trainable weight matrix and F_vo is the encoder output of the video transformer.
For open-ended numeric video question-answering tasks, the numeric answer n is computed, where W^T is the transpose of a trainable weight matrix, b_2 is a trainable bias, F_vo is the encoder output of the video transformer, and Round(·) denotes the rounding function.
For open-ended text video question-answering tasks, the answer-word probability distribution o is computed, where W^T is the transpose of a trainable weight matrix, b_3 is a trainable bias, F_vo is the encoder output of the video transformer, and softmax(·) denotes the softmax function. The word with the highest probability in the obtained distribution o is taken as the answer to the open-ended text video question-answering task.
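The three decoder heads reduce to simple linear readouts of the encoder output F_vo; since the patent's formulas are given as images, the sketch below uses the standard forms implied by the text (dot-product score, rounded linear regression, softmax over a vocabulary), with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 32, 100
F_vo = rng.standard_normal(d)   # encoder output of the video transformer

# Multiple-choice head: one evaluation score per candidate answer.
W_mc = rng.standard_normal(d)
s = float(W_mc @ F_vo)

# Open-ended numeric head: a rounded scalar regression with bias b_2.
W_num, b2 = rng.standard_normal(d), 0.5
n = int(round(float(W_num @ F_vo + b2)))

# Open-ended text head: softmax distribution o over answer words
# with bias b_3; the highest-probability word is the answer.
W_txt, b3 = rng.standard_normal((vocab, d)), np.zeros(vocab)
logits = W_txt @ F_vo + b3
o = np.exp(logits - logits.max())
o /= o.sum()
answer_word = int(np.argmax(o))
print(o.shape, answer_word)
```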
After training, applying the trained video transformer to a new video question-answering task yields the answer corresponding to that task.
Description of the drawings
Fig. 1 is an overall schematic diagram of the video transformer with combined relation interaction for solving video question-answering tasks according to an embodiment of the present invention.
Specific embodiment
The present invention is further elaborated and illustrated below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 1, the method of the present invention for solving video question-answering tasks using a video transformer with combined relation interaction comprises the following steps:
1) design a video-object relation acquisition method, and use it to obtain the spatio-temporal relation matrix of the video objects;
2) design a multi-interaction attention unit; combining it with the spatio-temporal relation matrix of the video objects obtained in step 1), obtain a multi-interaction attention output containing the integrated information of the input sequences;
3) using the multi-interaction attention units designed in step 2), design and train a video transformer containing an encoder and a decoder, and use the trained video transformer to obtain the answer to the corresponding video question-answering task.
The specific steps of step 1) are as follows:
For the video frames of a video question-answering task, a trained object detection network extracts the appearance features and position features of the objects in the video. Here N is the number of objects in the video, the appearance feature of object n is a high-dimensional vector produced by the trained model, and the position feature of each object is a 5-dimensional vector (x_n, y_n, w_n, h_n, t_n): the first four dimensions (x_n, y_n, w_n, h_n) describe the bounding box of object n (its center coordinates, width, and height), and the fifth dimension t_n gives the index of the frame in which object n appears.
From the position feature of object m and the position feature of object n, a 5-dimensional relative-relation vector (X_mn, Y_mn, W_mn, H_mn, T_mn) is computed. This vector is then mapped to a high-dimensional representation by a positional encoding composed of sine and cosine functions of different frequencies, and the mapped representations are concatenated to obtain the relative-relation feature ε_mn. From ε_mn, the spatio-temporal relation weight w_mn^R of objects m and n is computed, where W_r is a trainable weight vector.
The spatio-temporal relation weights between all pairs of objects in the video form the spatio-temporal relation matrix W_R of the video objects.
The specific steps of step 2) are as follows:
Design a multi-interaction attention unit. For input matrices Q = (q_1, q_2, ..., q_{l_q}) and V = (v_1, v_2, ..., v_{l_v}), the column vectors K_ij of a three-dimensional tensor K are computed as
K_ij = q_i ∘ v_j
where q_i is the i-th column of Q, v_j is the j-th column of V, and ∘ denotes element-wise multiplication. All column vectors K_ij (i ∈ {1, ..., l_q}, j ∈ {1, ..., l_v}) are assembled into the three-dimensional tensor K, and K is partitioned into several s×s sub-tensors. For each sub-tensor K', a weighted-sum vector p is computed, where the w_ij are trainable weight scalars and b_1 is a trainable bias. Each vector p is replicated s*s times to form a new three-dimensional tensor M.
The last dimensions of K and M are sum-compressed to obtain the element-level weight matrix W_E and the segment-level weight matrix W_S. From W_E, W_S, and the input matrix V = (v_1, v_2, ..., v_{l_v}), the multi-interaction attention output O containing the integrated information of the input sequences is computed, where ∘ denotes element-wise multiplication and softmax(·) denotes the softmax function.
The specific steps of step 3) are as follows:
The video transformer in step 3) consists of two parts, an encoder and a decoder. The encoder contains three parts: a question-text encoding part, a video-object encoding part, and a video-frame encoding part. The question-text encoding part works as follows: for the question text input to the video question-answering task, the embeddings of the words it contains serve as the input sequence, and the positional-encoding technique of the original Transformer provides the question-word position features. The question-word embeddings and position features are fed into the designed multi-interaction attention unit; its output passes through a concatenation operation and a linear mapping and is then fed into a feed-forward unit. After the feed-forward output passes through two linear mapping units with ReLU activations, the output of the question-text encoding part is obtained.
The video-frame encoding part works as follows: for the video frame sequence input to the video question-answering task, ResNet extracts video-frame features as the input sequence, and the positional-encoding technique of the original Transformer provides the video-frame position features. The video-frame features and position features are fed into a designed multi-interaction attention unit; its output passes through a concatenation operation and a linear mapping and, together with the output of the question-text encoding part, is fed into another multi-interaction attention unit. That unit's output passes through a concatenation operation and a linear mapping and is then fed into a feed-forward unit. After the feed-forward output passes through two linear mapping units with ReLU activations, the output of the video-frame encoding part is obtained. This output is fed back into the video-frame encoding part, and after T such iterations the final output of the video-frame encoding part is obtained.
The video-object encoding part of the encoder works as follows: the object appearance features and object position features extracted from the video serve as the input sequence and are fed into a designed multi-interaction attention unit; its output passes through a concatenation operation and a linear mapping and, together with the output of the question-text encoding part, is fed into another multi-interaction attention unit. That unit's output passes through a concatenation operation and a linear mapping and is fed into a feed-forward unit. After the feed-forward output passes through two linear mapping units with ReLU activations, the output of the video-object encoding part is obtained. This output is fed back into the video-object encoding part, and after T such iterations the final output of the video-object encoding part is obtained.
The output of the video-frame encoding part and the output of the video-object encoding part are concatenated and fed through a linear mapping unit to obtain the encoder output of the video transformer.
The video transformer has three kinds of decoders, for multiple-choice video question-answering tasks, open-ended numeric video question-answering tasks, and open-ended text video question-answering tasks respectively:
For multiple-choice video question-answering tasks, an evaluation score s is computed for each candidate answer, where W^T is the transpose of a trainable weight matrix and F_vo is the encoder output of the video transformer.
For open-ended numeric video question-answering tasks, the numeric answer n is computed, where W^T is the transpose of a trainable weight matrix, b_2 is a trainable bias, F_vo is the encoder output of the video transformer, and Round(·) denotes the rounding function.
For open-ended text video question-answering tasks, the answer-word probability distribution o is computed, where W^T is the transpose of a trainable weight matrix, b_3 is a trainable bias, F_vo is the encoder output of the video transformer, and softmax(·) denotes the softmax function. The word with the highest probability in the obtained distribution o is taken as the answer to the open-ended text video question-answering task.
After training, applying the trained video transformer to a new video question-answering task yields the answer corresponding to that task.
The above method is applied in the following example to demonstrate the technical effect of the invention; the specific steps of the embodiment are as described above and are not repeated.
Embodiment
The present invention is tested on the TGIF-QA dataset. TGIF-QA contains four kinds of video question-answering tasks: finding the action repeated a given number of times in the video (Action), asking about a change of action state in the video (Trans), asking about the frame in the video most relevant to the question (Frame), and counting the repetitions of a given action in the video (Count). To evaluate the performance of the algorithm objectively, on the selected test set the present invention uses the accuracy criterion (ACC) to evaluate the Action, Trans, and Frame tasks, and the mean squared error criterion (MSE) to evaluate the Count task. Following the steps described in the specific embodiment, the experimental results obtained are shown in Table 1, where this method is denoted VideoTransformer (multi):
Table 1. Test results of the present invention on the TGIF-QA dataset.
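The two evaluation criteria named above are standard and can be computed directly; a minimal sketch:

```python
import numpy as np

def accuracy(pred, gold):
    # ACC for the Action / Trans / Frame tasks:
    # fraction of predictions that exactly match the gold answers.
    return float(np.mean(np.asarray(pred) == np.asarray(gold)))

def mse(pred, gold):
    # MSE for the Count task:
    # mean squared error of the predicted repetition counts.
    p, g = np.asarray(pred, float), np.asarray(gold, float)
    return float(np.mean((p - g) ** 2))

print(accuracy([1, 2, 3, 2], [1, 2, 0, 2]))  # 0.75
print(mse([3, 5], [4, 4]))                   # 1.0
```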
Claims (4)
1. A method for solving video question-answering tasks using a video transformer with combined relation interaction, characterized by comprising the following steps:
1) designing a video-object relation acquisition method, and using it to obtain the spatio-temporal relation matrix of the video objects;
2) designing a multi-interaction attention unit and, combining it with the spatio-temporal relation matrix of the video objects obtained in step 1), obtaining a multi-interaction attention output containing the integrated information of the input sequences;
3) using the multi-interaction attention units designed in step 2), designing and training a video transformer containing an encoder and a decoder, and using the trained video transformer to obtain the answer to the corresponding video question-answering task.
2. The method for solving video question-answering tasks using a video transformer with combined relation interaction according to claim 1, characterized in that step 1) specifically comprises:
for the video frames of the video question-answering task, using a trained object detection network to obtain the appearance features and position features of the objects in the video, wherein N is the number of objects in the video, the appearance feature of object n is a high-dimensional vector produced by the trained model, and the position feature of each object is a 5-dimensional vector (x_n, y_n, w_n, h_n, t_n), the first four dimensions (x_n, y_n, w_n, h_n) describing the bounding box of object n and the fifth dimension t_n giving the index of the frame in which object n appears;
computing, from the position feature of object m and the position feature of object n, a 5-dimensional relative-relation vector (X_mn, Y_mn, W_mn, H_mn, T_mn);
mapping the obtained 5-dimensional relative-relation vector (X_mn, Y_mn, W_mn, H_mn, T_mn) to a high-dimensional representation by a positional encoding composed of sine and cosine functions of different frequencies, concatenating the mapped representations to obtain the relative-relation feature ε_mn, and computing from it the spatio-temporal relation weight w_mn^R of objects m and n, wherein W_r is a trainable weight vector;
and obtaining the spatio-temporal relation matrix W_R of the video objects from the spatio-temporal relation weights between all pairs of objects in the video.
3. The method for solving video question-answering tasks using a video transformer with combined relation interaction according to claim 2, characterized in that step 2) specifically comprises:
designing a multi-interaction attention unit in which, for input matrices Q = (q_1, q_2, ..., q_{l_q}) and V = (v_1, v_2, ..., v_{l_v}), the column vectors K_ij of a three-dimensional tensor K are computed as
K_ij = q_i ∘ v_j
wherein q_i is the i-th column of the input matrix Q, v_j is the j-th column of the input matrix V, and ∘ denotes element-wise multiplication; assembling all column vectors K_ij (i ∈ {1, ..., l_q}, j ∈ {1, ..., l_v}) into the three-dimensional tensor K; partitioning K into several s×s sub-tensors and, for each sub-tensor K', computing a weighted-sum vector p, wherein the w_ij are trainable weight scalars and b_1 is a trainable bias; replicating each vector p s*s times to form a new three-dimensional tensor M;
sum-compressing the last dimensions of the tensor K and the new tensor M to obtain the element-level weight matrix W_E and the segment-level weight matrix W_S; and computing, from W_E, W_S, and the input matrix V, the multi-interaction attention output O containing the integrated information of the input sequences, wherein ∘ denotes element-wise multiplication and softmax(·) denotes the softmax function.
4. the method that the video converter according to claim 3 using marriage relation interaction solves video question-answering task,
It is characterized in that, the step 3) specifically:
Video converter in step 3) is made of encoder and decoder two parts, there are three the encoder of video converter contains
Part: question text coded portion, object-oriented video coding part, coding video frames part;Wherein question text coding unit extension set
It is made as: the problem of being inputted for video question-answering task text, using the mapping of the word wherein contained as list entries, in conjunction with
Question text location information feature is obtained using the position encoded technology in original conversion device, problem word is mapped and questionnaire
Word location information feature is input in more interaction attention mechanism units of design, by the output of more interaction attention mechanism units
It is operated by attended operation and Linear Mapping, to supply unit before being input to later;The preceding output to supply unit is passed through two
After a Linear Mapping unit using ReLU as activation primitive, the corresponding output of question text coded portion is obtained;
Encoded video frame coded portion mechanism are as follows: for the sequence of frames of video of video question-answering task input, obtained using ResNet
Video frame feature is obtained as list entries, the position encoded technology being used in combination in original conversion device obtains video frame location information
Video frame feature is input to design with video frame location information feature more interacted in attention mechanism unit by feature, will be more
The output for interacting attention mechanism unit is operated by attended operation and Linear Mapping, corresponding in conjunction with question text coded portion
It is input in another more interaction attention mechanism units, by the output of more interaction attention mechanism units by connection behaviour
Make to operate with Linear Mapping, to supply unit before being input to;By the preceding output to supply unit by two using ReLU as sharp
After the Linear Mapping unit of function living, the corresponding output in coding video frames part is obtained;Coding video frames part is corresponding defeated
It is re-entered into above-mentioned coding video frames part out, carries out T circulation, it is corresponding defeated to obtain final coding video frames part
Out;
The video-object encoding part of the encoder works as follows: the object appearance features and object position features obtained from the video are used as the input sequence. The object appearance features and object position features are input into the designed multi-interaction attention mechanism unit; the output of the multi-interaction attention mechanism unit is passed through a concatenation operation and a linear mapping and, combined with the corresponding output of the question-text encoding part, is input into another multi-interaction attention mechanism unit. The output of that multi-interaction attention mechanism unit is passed through a concatenation operation and a linear mapping and input into a feed-forward unit; the output of the feed-forward unit is passed through two linear mapping units with ReLU as the activation function to obtain the output of the video-object encoding part. This output is fed back into the video-object encoding part described above for T iterations, yielding the final output of the video-object encoding part;
The output of the video-frame encoding part and the output of the video-object encoding part are concatenated and input into a linear mapping unit to obtain the encoder output of the video converter;
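The fusion step above is a concatenation followed by one linear mapping; a minimal sketch, where the feature sizes (8 positions, 64 channels) and the random weights are illustrative assumptions since the patent does not specify them:

```python
import numpy as np

rng = np.random.default_rng(0)
F_frame = rng.normal(size=(8, 64))      # final output of the video-frame encoding part (toy)
F_obj = rng.normal(size=(8, 64))        # final output of the video-object encoding part (toy)
W_fuse = rng.normal(0, 0.1, (128, 64))  # the single linear mapping unit after concatenation

# Encoder output of the video converter: concatenate, then linearly map.
F_vo = np.concatenate([F_frame, F_obj], axis=-1) @ W_fuse
print(F_vo.shape)  # (8, 64)
```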
The video converter has three decoders, for the multiple-choice video question-answering task, the open-ended numeric video question-answering task, and the open-ended text video question-answering task respectively:
For the multiple-choice video question-answering task, the evaluation score s of each candidate answer is computed as

s = w1^T F_vo,

where w1^T denotes the transpose of a trainable weight matrix and F_vo denotes the encoder output of the video converter obtained above; the candidate answer with the highest score is selected.
For the open-ended numeric video question-answering task, the numeric answer n is computed as

n = Round(w2^T F_vo + b2),

where w2^T denotes the transpose of a trainable weight matrix, b2 denotes a trainable bias, F_vo denotes the encoder output of the video converter obtained above, and Round(·) denotes the rounding operation.
For the open-ended text video question-answering task, the answer-word probability distribution o is computed as

o = softmax(w3^T F_vo + b3),

where w3^T denotes the transpose of a trainable weight matrix, b3 denotes a trainable bias, F_vo denotes the encoder output of the video converter obtained above, and softmax(·) denotes the softmax function. The word with the highest probability in the obtained distribution o is taken as the answer to the open-ended text video question-answering task;
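Read this way, the three decoders are three small heads on top of the encoder output. A minimal sketch, assuming F_vo is pooled to a single vector and assuming toy sizes for the candidate set and vocabulary (the patent names only the trainable weights, biases, Round(), and softmax(); everything else here is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_candidates, vocab = 64, 4, 1000
F_vo = rng.normal(size=(d,))  # encoder output, pooled to one vector (assumption)

# Multiple-choice head: one evaluation score per candidate answer.
w1 = rng.normal(0, 0.1, (d, n_candidates))
s = F_vo @ w1
best_candidate = int(np.argmax(s))

# Open numeric head: linear mapping with a trainable bias, then rounding.
w2, b2 = rng.normal(0, 0.1, d), 0.5
n = round(float(F_vo @ w2 + b2))

# Open text head: softmax over the word vocabulary; pick the most probable word.
W3, b3 = rng.normal(0, 0.1, (d, vocab)), np.zeros(vocab)
logits = F_vo @ W3 + b3
o = np.exp(logits - logits.max())
o /= o.sum()
answer_word_id = int(np.argmax(o))

print(s.shape)  # (4,)
```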
After training, the trained video converter is applied to a new video question-answering task to obtain the corresponding answer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910112159.5A CN109840506B (en) | 2019-02-13 | 2019-02-13 | Method for solving video question-answering task by utilizing video converter combined with relational interaction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109840506A true CN109840506A (en) | 2019-06-04 |
CN109840506B CN109840506B (en) | 2020-11-20 |
Family
ID=66884667
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910112159.5A Active CN109840506B (en) | 2019-02-13 | 2019-02-13 | Method for solving video question-answering task by utilizing video converter combined with relational interaction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109840506B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107463609A (en) * | 2017-06-27 | 2017-12-12 | Zhejiang University | A method for video question answering using a hierarchical spatio-temporal attention encoder-decoder network mechanism |
CN107818306A (en) * | 2017-10-31 | 2018-03-20 | Tianjin University | A video question-answering method based on an attention model |
Non-Patent Citations (3)
Title |
---|
ASHISH VASWANI et al.: "Attention Is All You Need", 31st Conference on Neural Information Processing Systems (NIPS 2017) * |
HAN HU et al.: "Relation Networks for Object Detection", The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018 * |
YANG QIFAN: "Video Question Answering Based on Spatio-Temporal Attention Networks", China Master's Theses Full-text Database, Information Science and Technology * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110348462A (en) * | 2019-07-09 | 2019-10-18 | Beijing Kingsoft Digital Entertainment Technology Co., Ltd. | Image feature determination and visual question-answering method, apparatus, device and medium |
CN110348462B (en) * | 2019-07-09 | 2022-03-04 | 北京金山数字娱乐科技有限公司 | Image feature determination and visual question and answer method, device, equipment and medium |
CN110378269A (en) * | 2019-07-10 | 2019-10-25 | Zhejiang University | Method for localizing unseen activities in videos through image queries |
CN110727824A (en) * | 2019-10-11 | 2020-01-24 | 浙江大学 | Method for solving question-answering task of object relationship in video by using multiple interaction attention mechanism |
CN110727824B (en) * | 2019-10-11 | 2022-04-01 | 浙江大学 | Method for solving question-answering task of object relationship in video by using multiple interaction attention mechanism |
Also Published As
Publication number | Publication date |
---|---|
CN109840506B (en) | 2020-11-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108804715A (en) | Multi-task collaborative recognition method and system fusing audio-visual perception | |
CN107766447A (en) | A method for video question answering using a multi-layer attention network mechanism | |
CN108664632A (en) | A text sentiment classification algorithm based on a convolutional neural network and an attention mechanism | |
CN109241424A (en) | A recommendation method | |
CN110196928B (en) | Fully parallelized end-to-end multi-turn dialogue system with domain expansibility and method | |
CN108228674B (en) | DKT-based information processing method and device | |
CN109840506A (en) | Method for solving video question-answering task by utilizing video converter combined with relational interaction | |
CN108491514A (en) | Method and device for posing questions in a dialogue system, electronic device, and computer-readable medium | |
CN111680147A (en) | Data processing method, apparatus, device and readable storage medium | |
CN110209789A (en) | A multi-modal dialogue system and method guided by user attention | |
CN109670576A (en) | A multi-scale visual attention image description method | |
CN109902164B (en) | Method for solving open-ended long-form video question answering using a convolutional bidirectional self-attention network | |
CN110852256B (en) | Method, device and equipment for generating temporal action proposals, and storage medium | |
CN110059220A (en) | A movie recommendation method based on deep learning and Bayesian probability matrix factorization | |
CN109448703B (en) | Audio scene recognition method and system combining a deep neural network and a topic model | |
CN106503659B (en) | Action recognition method based on sparse-coding tensor decomposition | |
CN115131698B (en) | Video attribute determination method, device, equipment and storage medium | |
CN109857909B (en) | Method for solving the video dialogue task with a multi-granularity convolutional self-attention context network | |
CN110046271B (en) | A remote sensing image description method based on voice guidance | |
CN113888399B (en) | Face age synthesis method based on style fusion and domain selection structure | |
Zhu et al. | Dual-decoder transformer network for answer grounding in visual question answering | |
Li et al. | Aligning open educational resources to new taxonomies: How AI technologies can help and in which scenarios | |
Boutin et al. | Diffusion models as artists: are we closing the gap between humans and machines? | |
Jeon et al. | Leveraging angular distributions for improved knowledge distillation | |
Liu et al. | Digital twins by physical education teaching practice in visual sensing training system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||