CN112527993A - Cross-media hierarchical deep video question-answer reasoning framework - Google Patents
- Publication number
- Publication number CN112527993A (application CN202011499931.2A)
- Authority
- CN
- China
- Prior art keywords
- memory
- video
- answer
- reasoning
- shallow
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Databases & Information Systems (AREA)
- Library & Information Science (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a cross-media hierarchical deep video question-answer reasoning framework. The method comprises the following steps. 1. Store the global semantic information of the video in a memory component and obtain a shallow inference engine through multiple rounds of memory-update iteration. 2. Construct a deep inference engine on top of the shallow one, embedding the multi-modal sub-components produced by deep semantic analysis of the video into memory slots of different modalities to form spatial memory and temporal memory. 3. Construct a multi-modal memory collaborative reasoning framework that performs more refined reasoning with multi-modal evidence from objects and actions. 4. Perform multi-modal dynamic memory fusion: the output of the shallow inference engine serves as a supervisory sentinel that guides the weight distribution over the lower-layer memory contents of the different modalities; a dynamic memory fusion module fuses the modality memories, and its output is fed to the answer module to predict the best answer. The reasoning framework of the invention achieves significant results on video question-answer datasets.
Description
Technical Field
The invention relates to a deep neural network for video question answering, and in particular to a hierarchical deep reasoning framework based on unified cross-media representation.
Background
Cross-media technology aims to bridge the semantic gap between different media (such as video and text) and form a unified cross-media semantic representation. Owing to the semantic complexity of multimedia data, this problem was not solved well before the emergence of deep learning. In recent years deep learning has achieved remarkable performance across research fields: the target task is modeled end to end by a complex neural network, which learns a deep unified representation of the cross-media data. Thanks to the strong semantic expressiveness of deep models, deep unified cross-media representation has become the current mainstream approach.
Several currently popular branch directions derive from the theory of deep unified cross-media representation, such as cross-media retrieval, visual description and visual question answering. Cross-media retrieval aims, given one piece of media data, to find the best-matching data of another media type in a massive database; visual description aims to summarize the content of an image effectively in one or several natural-language sentences; visual question answering takes a natural-language question and a visual data object as input and, after the algorithm has fully understood both the language and the visual content, performs deep reasoning and finally outputs a natural-language answer. Among these tasks, visual question answering is relatively the most challenging, involving fine-grained understanding of visual content and natural language as well as deep knowledge reasoning. It has therefore become a research hotspot in recent years.
Video, the mainstream visual data type, exists at large scale on social networking sites, and its volume almost exceeds that of all other media data combined. Video data is more complex than images. A video is not simply a stack of image frames: it carries information in multiple modalities, such as vision, text and speech. Visual objects in a video present different visual features from different viewpoints as time passes, and the spatial visual information at different moments is correlated. Moreover, visual question answering over video involves more complex questions: users can pose diverse, highly open-ended questions about the video content. Besides questions about static spatial information such as color, quantity and position, video question-answering tasks usually include complex questions such as action-category and action temporal-relation reasoning. In addition, for a given video, the amount of visual information a model needs in order to answer correctly varies with the question: some questions can be answered from a single frame, while others require understanding the semantics of the complete video.
In summary, the difficulty of video question answering lies in constructing an efficient cross-media question-answer reasoning framework on the basis of a correct, effective understanding of the video content and a sufficiently accurate understanding of the question intent, so as to improve the accuracy of answer prediction.
Disclosure of Invention
The invention provides a deep hierarchical reasoning framework for complex long-term video question answering, which mainly comprises: 1. a shallow inference engine that filters irrelevant information: it identifies the important visual content relevant to the question description among the long sequences of the complex long-term video and filters out irrelevant visual information, avoiding information overload and noise in the deep memory network; 2. a deep inference engine that, guided by the shallow inference engine, performs more refined reasoning with deeper semantic evidence from vision and natural language, learning finer-grained attention to improve the quality of cross-modal reasoning — for video question answering, this deep reasoning framework improves reasoning quality and outperforms conventional visual question-answering models; 3. a dynamic memory fusion module that dynamically fuses the memories of different modalities, its output serving as the input of the answer module to predict the best answer.
The technical scheme adopted by the invention for solving the technical problems is as follows:
and (1) storing the global semantic information of the video by using a memory component, and obtaining a shallow inference machine through multiple rounds of memory updating iteration under the guidance of the global visual features of the problem description, wherein the shallow inference machine is used for inferring the visual information most relevant to the global semantic features of the problem description.
And (2) constructing a deep inference machine based on the shallow inference machine, and embedding multi-modal subcomponents under deep semantic analysis of the video into memory card slots with different modalities to form spatial memory and time sequence memory.
And (3) constructing a multi-modal memory collaborative reasoning framework, and executing more fine reasoning by using multi-modal evidences from objects and actions so as to improve the quality of question answering.
And (4) performing multi-mode dynamic memory fusion, guiding the weight distribution of the memory contents of different modes at the lower layer by using the output of the shallow inference engine as a monitoring whistle, dynamically fusing the memories of different modes through a dynamic memory fusion module in the framework, and predicting the best answer by using the output of the dynamic memory fusion module as the input of a response module.
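The four steps of the technical scheme can be sketched end to end in plain Python. This is a minimal toy sketch: the 2-dimensional features, the dot-product similarity and the zero fusion weights are all illustrative assumptions, not the patent's learned parametrization.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Step 1: shallow inference -- keep the top-n frames most similar to the question.
def shallow_filter(frames, question, n):
    order = sorted(range(len(frames)), key=lambda i: dot(frames[i], question),
                   reverse=True)
    return [frames[i] for i in order[:n]]

# Steps 2-3: deep inference -- attend over the retained frames (a stand-in for
# the spatial/temporal memories) and read out a memory vector.
def attend(memories, question):
    rho = softmax([dot(m, question) for m in memories])
    dim = len(memories[0])
    return [sum(r * m[d] for r, m in zip(rho, memories)) for d in range(dim)]

# Step 4: dynamic fusion -- mix two modality memories with softmax weights.
def fuse(m_action, m_object, logits):
    a, o = softmax(logits)
    return [a * x + o * y for x, y in zip(m_action, m_object)]

frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
question = [1.0, 0.0]
kept = shallow_filter(frames, question, 2)   # drops the question-irrelevant frame
m = attend(kept, question)
fused = fuse(m, m, [0.0, 0.0])               # identical memories, equal weights
```

Here `shallow_filter`, `attend` and `fuse` correspond to steps (1), (2)–(3) and (4) respectively; in the patent each stage is a learned network rather than a fixed dot product.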
Further, step (1) stores the global semantic information of the video in a memory component and, guided by the global visual features of the question description, infers the visual information most relevant to the global semantic features of the question through multiple rounds of memory-update iteration, specifically as follows:
1-1. The video features X and the question description feature h_T are input into the memory component, and the inputs are first mapped into the intrinsic vector space of the memory network, as in formulas (1) and (2):
X_e = tanh(W_x X + b_x) (formula 1)
q_e = tanh(W_q h_T + b_q) (formula 2)
where X_e and q_e denote the converted video and question description features respectively, W_x and W_q are mapping matrices, b_x and b_q are offsets, and d_z is the intrinsic spatial dimension of the memory network.
1-2. Feature selection is performed with a hard attention mechanism. The similarity between the question description feature q_e and each video feature in X_e is computed, the video features are sorted by similarity score, and the top n features most relevant to the question are selected to update the memory units of the shallow inference engine, yielding the n nearest-neighbor key-value pairs key = {k_1, k_2, ..., k_n}, value = {v_1, v_2, ..., v_n} and an updated video feature sequence set, as in formulas (3) and (4):
y_j = f_s(q_e, x_j) (formula 3)
Γ(y_1, ..., y_n) = {j_1, ..., j_n} such that y_{j_1} ≥ y_{j_2} ≥ ... ≥ y_{j_n} (formula 4)
where f_s denotes a similarity measure and Γ is the sorting operation.
1-3. Based on the updated video feature sequence set, the shallow inference engine learns the probability distribution ρ of the question description feature over the memory units (formula 5); its output z is obtained by weighted sum (formula 6) and combined with the originally input question description feature to form the question for the next round of reasoning. G_x and G_q in formulas (5) and (6) denote two feed-forward fully connected networks:
ρ_i = softmax(G_x([v_i; q_e])) (formula 5)
z = Σ_i ρ_i v_i, q' = G_q([z; q_e]) (formula 6)
where v_i is the content stored in the i-th memory unit.
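The hard-attention selection of step 1-2 (formulas (3)–(4)) can be sketched as follows; the dot product stands in for the unspecified similarity measure f_s, and the toy features are illustrative.

```python
def topn_select(question, video_feats, n):
    # y_j = f_s(q_e, x_j): similarity of the question to each video feature
    # (dot product used here as one illustrative choice of f_s)
    y = [sum(a * b for a, b in zip(question, x)) for x in video_feats]
    # Gamma: indices ordered so that y_{j1} >= y_{j2} >= ...; keep the top n
    order = sorted(range(len(y)), key=lambda j: y[j], reverse=True)[:n]
    keys = [video_feats[j] for j in order]          # key = {k_1, ..., k_n}
    values = [list(video_feats[j]) for j in order]  # value = {v_1, ..., v_n}
    return keys, values

q = [1.0, 0.0]
feats = [[0.2, 0.8], [0.9, 0.1], [0.5, 0.5]]
keys, values = topn_select(q, feats, 2)  # keeps the two most question-relevant
```

Because the selection is hard (a discrete top-n cut rather than a soft weighting), irrelevant features are dropped entirely before the deep memory network sees them.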
Step (2) constructs a deep multi-modal memory network on top of the guided shallow inference engine, embedding the multi-modal sub-components produced by deep semantic analysis of the video into memory modules of different modalities to form spatial memory and temporal memory, specifically as follows:
The object features in the video (k objects, each of dimension d_o) are converted into the intrinsic vector features of the spatial memory module by a 1 × 1 convolutional neural network, and the action features (l actions, each of dimension d_a) are embedded into the temporal memory module by another 1 × 1 convolutional neural network.
The multi-modal sub-components comprise the object features and the action features; the memory modules of different modalities comprise the spatial memory module and the temporal memory module.
Step (3) constructs a multi-modal memory collaborative reasoning framework that performs more refined reasoning with multi-modal evidence from objects and actions, specifically as follows:
3-1. For the λ-th round of reasoning, given the action feature a_i, the question description h'_l and the action memory m_a^{λ-1} of the previous round, the similarity distribution over the action features is obtained by combining the two similarity functions P(·) and D(·), as in formula (7); the similarity distribution over the object features o_i is obtained in the same way, as in formula (8). Here P(·) and D(·) denote the element-wise product similarity and the element-wise absolute-difference similarity, respectively.
3-2. Since both object features and action features play an important role in high-quality reasoning, the multi-modal memory collaborative reasoning framework acts as an interactive reasoner: action memory and object memory interact dynamically, and when the memory of one modality is updated, the memory of the other modality provides useful clues for attention learning. Specifically, when computing the update gating signal for the memory of one modality, the influence of the other modality is considered in addition to the modality itself. Taking the action modality as an example, the update gating signal g_a^λ of the action memory is obtained from formula (10); a GRU-based attention mechanism then extracts the context feature c_a^λ and uses it to update the action memory m_a^λ of the current round, as in formula (11), where the W are mapping matrices and the b the corresponding offsets.
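The cross-modal gated update of 3-2 can be sketched as a toy: a scalar gate computed from both modalities plus the question, followed by a convex-combination update. The scalar score and the combination rule are simplifications of the learned mappings of formulas (10)–(11); all weights here are assumed.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def update_gate(action_mem, object_mem, question):
    # The gate for the action memory looks at BOTH modalities plus the
    # question (toy scalar score; the patent uses learned mapping matrices
    # W and offsets b instead of a plain sum).
    score = sum(action_mem) + sum(object_mem) + sum(question)
    return sigmoid(score)

def update_memory(old_mem, context, g):
    # GRU-style convex combination: the gate g selects the new context
    # feature c over the previous memory.
    return [g * c + (1.0 - g) * m for c, m in zip(context, old_mem)]
```

With g near 0 the memory is preserved; with g near 1 it is overwritten by the context, which is how the other modality's evidence steers how much of the current round's attention is written back.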
Step (4) performs multi-modal dynamic memory fusion: the output of the shallow inference engine serves as a supervisory sentinel guiding the weight distribution over the lower-layer memory contents of the different modalities; the dynamic memory fusion module fuses the modality memories, and its output is fed to the answer module to predict the best answer. The specific process is as follows:
4-1. Owing to the complexity and diversity of question descriptions, the contributions of the different visual sub-modalities to the question-answering model vary dynamically. The dynamic memory fusion module uses the output u' of the shallow inference engine as a supervisory sentinel and, combined with the original embedded question feature q_e, guides the weight distribution over the memory contents of the different modalities of the deep inference engine. The fused memory m_f is computed by formula (15):
m_f = α_a m_a + α_o m_o (formula 15)
where α is the weight vector obtained from u' and q_e through a learnable mapping followed by softmax, and α_a and α_o are its two elements.
4-2. The output of the dynamic memory fusion module serves as the input of the answer module to predict the best answer. Specifically, m_f is fused with the question feature q_e and then mapped by the weight matrix W_p to obtain the feature vector v, as in formula (16):
v = W_p (m_f ⊙ q_e) (formula 16)
The obtained feature vector v is input into the answer module (a multi-class classifier) to predict the answer to the question.
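The answer-prediction step of equation (16) plus the multi-class classifier can be sketched as follows; the tiny dimensions, the 3-word vocabulary and the weight matrix are all illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def predict_answer(fused_mem, question_feat, Wp, vocab):
    # v = W_p (m_f o q_e): element-wise fusion, then linear mapping by W_p
    h = [a * b for a, b in zip(fused_mem, question_feat)]
    v = [sum(w * x for w, x in zip(row, h)) for row in Wp]
    probs = softmax(v)                  # distribution over the answer dictionary
    return vocab[max(range(len(probs)), key=probs.__getitem__)]

vocab = ["yes", "no", "red"]
Wp = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy 3 x 2 weight matrix
ans = predict_answer([2.0, 0.1], [1.0, 1.0], Wp, vocab)
```

The returned word is simply the argmax of the softmax distribution; in the patent this classifier is trained end to end with a cross-entropy loss.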
The invention has the beneficial effects that:
the invention provides a novel coarse-to-fine hierarchical depth inference framework for a complicated long-term network video question-answering problem, and the method comprises the steps of firstly filtering invalid information from a long video sequence by constructing a shallow inference engine, identifying important visual contents, learning the global attention of coarse-grained videos, and then constructing a deep inference engine to perform depth optimization inference from two directions of interframes and intraframes. Through multiple rounds of reasoning iteration, the reasoning framework can simulate the human video question-answer reasoning process, the key time relevant to the question is located from the long-term video, and relevant evidences are collected to predict the answer. The reasoning framework of our invention can achieve significant effects on the video question-answer dataset.
Drawings
FIG. 1 is a general block diagram of the process of the present invention.
FIG. 2 is a guided shallow inference engine constructed in the method of the present invention.
FIG. 3 is an optimized deep inference engine constructed in the method of the present invention.
Detailed Description
The detailed parameters of the invention are described in more detail below.
As shown in FIG. 1, the invention provides a hierarchical deep question answering reasoning framework for complex video question answering.
In step (1), the memory component stores the global semantic information of the video; guided by the global visual features of the question description, a shallow inference engine is obtained through multiple rounds of memory-update iteration and infers the visual information most relevant to the global semantic features of the question description, as shown in FIG. 2, specifically as follows:
1-1. The video features X and the question description feature h_T are extracted and input into the memory component, and the inputs are converted into the intrinsic vector features of the memory network.
For the global video features, the large-scale pre-trained networks VGG and 3D-CNN extract intermediate features, which are fed into a bidirectional GRU to obtain globally aware semantic features of dimension d_x = 2048. For the question description features, each word is first encoded with the shallow word-embedding model GloVe to capture its semantics; the resulting word vectors are fed in order into a bidirectional LSTM with 256 hidden units to learn the context of the question description, and the forward and backward hidden states at each step are concatenated to represent its global semantics.
The video features and question description features are then input into the memory component and converted into intrinsic vectors of the memory network by the mapping matrices and offsets, the intrinsic spatial dimension being d_z = 256.
1-2. Feature selection is performed with a hard attention mechanism. The similarity between the question description features and the video features is computed, the video features are sorted by similarity score, and the top 20 features most relevant to the question are selected to update the storage units of the memory network, yielding the 20 nearest-neighbor key-value pairs key = {k_1, k_2, ..., k_20}, value = {v_1, v_2, ..., v_20} and forming the updated video feature sequence.
1-3. Based on the updated memory contents, the probability distribution ρ of the question description over the memory units is learned. The output z of this layer is obtained by weighted sum and combined with the original question description features to form the updated question for the next round of reasoning.
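The weighted-sum readout of 1-3 can be sketched as follows; a softmax over dot-product scores stands in here for the learned distribution ρ, and the toy memory contents are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def memory_readout(question, memory_values):
    # rho: distribution of the question over the memory units
    rho = softmax([sum(a * b for a, b in zip(question, v))
                   for v in memory_values])
    # z: weighted sum of the stored contents v_i
    dim = len(memory_values[0])
    z = [sum(r * v[d] for r, v in zip(rho, memory_values)) for d in range(dim)]
    return rho, z

rho, z = memory_readout([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Unlike the hard selection of 1-2, this readout is soft: every retained memory unit contributes to z in proportion to its weight under ρ.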
Step (2) constructs a deep inference engine on top of the shallow one, embedding the multi-modal sub-components produced by deep semantic analysis of the video into memory slots of different modalities to form spatial memory and temporal memory, specifically as follows:
the characteristics of video objects are extracted by using the existing fast-RCNN model, 36 targets are sequentially detected in 20 important video units by using the fast-RCNN model based on a shallow inference engine, and 4096-dimensional object characteristics are extracted for each target object. Then, the object features in the video are converted into the internal vectors of the space memory module through a 1 × 1 convolutional neural network to form space memory.
The 30 temporal segments most likely to contain actions are predicted by an externally pre-trained temporal-proposal network for videos, and the video features of these 30 segments are extracted as action features following step (1). The action features are embedded into the temporal memory module by another 1 × 1 convolutional neural network to form the temporal memory.
Step (3) constructs the multi-modal memory collaborative reasoning and performs more refined reasoning with multi-modal evidence from objects and actions, as shown in FIG. 3, specifically as follows:
3-1. More refined reasoning is performed with multi-modal evidence from objects and actions. For the λ-th round of reasoning, given the action features, the question features and the action memory of the previous round, the similarity distribution over the action features is obtained by combining the element-wise product similarity function and the element-wise absolute-difference similarity function. The similarity distribution over the object features is obtained in the same way.
3-2. The memory-update gating signals are computed. When computing the update gate of the action memory, the influence of the object modality is considered in addition to the action modality itself; likewise, when computing the update gate of the object memory, the action modality provides a helpful cue for the object modality. A GRU-based attention mechanism then extracts the context features and uses them to update the action memory of the current round.
Step (4): the multi-modal dynamic memory fusion module dynamically fuses the memories of the different modalities, and the output of the module serves as the input of the answer module to predict the best answer, as shown in FIG. 3, specifically as follows:
and 4-1, the dynamic memory fusion module utilizes the output of the shallow inference engine as a monitoring whistle to guide the weight distribution of the memory contents of different modes in the deep layer and dynamically fuse the memories of different modes. The module first remembers u' of the shallow inference engine in 256 dimensions and the problem feature q in 300 dimensionseSplicing, performing matrix transformation of 2 × 556, and passing through classifier of softmaxObtaining the weights of different sub-modes of the deep inference engine, and finally obtaining the fused memory by using the weighted sum operation
4-2. The output of the dynamic memory fusion module serves as the input of the answer module to predict the best answer. Specifically, the fused memory of the dynamic fusion module is combined with the question feature q_e by element-wise product to obtain a 300-dimensional feature vector, which the 1000 × 300 weight matrix W_p maps into a 1000-dimensional feature vector v; v is input into a 1000-way softmax classifier to obtain a probability distribution over the answer dictionary. The model is trained end to end, optimized with the softmax cross-entropy loss until the network converges.
Claims (5)
1. A cross-media hierarchical deep video question-answer reasoning framework is characterized by comprising the following steps:
step (1): storing the global semantic information of the video in a memory component and, guided by the global visual features of the question description, obtaining a shallow inference engine through multiple rounds of memory-update iteration, wherein the shallow inference engine infers the visual information most relevant to the global semantic features of the question description;
step (2): constructing a deep inference engine on top of the shallow inference engine, and embedding the multi-modal sub-components produced by deep semantic analysis of the video into memory slots of different modalities to form spatial memory and temporal memory;
step (3): constructing a multi-modal memory collaborative reasoning framework that performs more refined reasoning with multi-modal evidence from objects and actions, improving the quality of question answering;
step (4): performing multi-modal dynamic memory fusion, wherein the output of the shallow inference engine serves as a supervisory sentinel guiding the weight distribution over the lower-layer memory contents of the different modalities, a dynamic memory fusion module fuses the modality memories, and its output is fed to the answer module to predict the best answer.
2. The cross-media hierarchical deep video question-answer reasoning framework of claim 1, wherein the step (1) is specifically as follows:
1-1. the video features X and the question description feature h_T are input into the memory component, and the inputs are first mapped into the intrinsic vector space of the memory network, as in formulas (1) and (2):
X_e = tanh(W_x X + b_x) (formula 1)
q_e = tanh(W_q h_T + b_q) (formula 2)
wherein X_e and q_e denote the converted video and question description features respectively, W_x and W_q are mapping matrices, b_x and b_q are offsets, and d_z is the intrinsic spatial dimension of the memory network;
1-2. feature selection is performed with a hard attention mechanism: the similarity between the question description feature q_e and each video feature in X_e is computed, the video features are sorted by similarity score, and the top n features most relevant to the question are selected to update the memory units of the shallow inference engine, yielding the n nearest-neighbor key-value pairs key = {k_1, k_2, ..., k_n}, value = {v_1, v_2, ..., v_n} and an updated video feature sequence set, as in formulas (3) and (4):
y_j = f_s(q_e, x_j) (formula 3)
Γ(y_1, ..., y_n) = {j_1, ..., j_n} such that y_{j_1} ≥ y_{j_2} ≥ ... ≥ y_{j_n} (formula 4)
wherein f_s denotes a similarity measure and Γ is the sorting operation;
1-3. based on the updated video feature sequence set, the shallow inference engine learns the probability distribution ρ of the question description feature over the memory units (formula 5); its output z is obtained by weighted sum (formula 6) and combined with the originally input question description feature to form the question of the next round of reasoning, G_x and G_q in formulas (5) and (6) denoting two feed-forward fully connected networks:
ρ_i = softmax(G_x([v_i; q_e])) (formula 5)
z = Σ_i ρ_i v_i, q' = G_q([z; q_e]) (formula 6)
wherein v_i is the content stored in the i-th memory unit.
3. The cross-media hierarchical deep video question-answer reasoning framework of claim 2, wherein the step (2) is specifically as follows:
the object features in the video (k objects, each of dimension d_o) are converted into the intrinsic vector features of the spatial memory module by a 1 × 1 convolutional neural network, and the action features (l actions, each of dimension d_a) are embedded into the temporal memory module by another 1 × 1 convolutional neural network;
the multi-modal sub-components comprise the object features and the action features.
4. The cross-media hierarchical deep video question-answer reasoning framework of claim 3, wherein the step (3) is specifically as follows:
3-1. more refined reasoning is performed with multi-modal evidence from objects and actions: for the λ-th round of reasoning, given the action feature a_i, the question feature h'_l and the action memory m_a^{λ-1} of the previous round, the similarity distribution over the action features is obtained by combining the two similarity functions P(·) and D(·), as in formula (7); the similarity distribution over the object features is obtained in the same way, as in formula (8);
3-2. the multi-modal memory collaborative reasoning module acts as an interactive reasoner, allowing the action memory and the object memory to interact dynamically; when the memory of one modality is updated, the memory of the other modality provides useful clues for attention learning; when the action modality is updated, the update gating signal g_a^λ of the action memory is obtained from formula (10), a GRU-based attention mechanism then extracts the context feature c_a^λ and uses it to update the action memory m_a^λ of the current round, as in formula (11), wherein the W are mapping matrices and the b the corresponding offsets; the object memory is updated in the same way, the similarity distribution over the object features being obtained as shown in formula (14).
5. The cross-media hierarchical deep video question-answer reasoning framework of claim 4, wherein the step (4) performs multi-modal dynamic memory fusion, utilizes the output u' of the shallow inference engine as a monitoring whistle, and combines the original embedded features q of the question descriptioneThe method guides the weight distribution of the memory contents of different modes at the lower layer, dynamically fuses the memories of different modes, takes the output of the dynamic memory fusion module as the input of the answer module to predict the best answer, and comprises the following specific processes:
4-1. The dynamic memory fusion module uses the output of the upper-layer memory network as a supervisory sentinel to guide the weight distribution over the lower-layer memory contents of the different modalities and to dynamically fuse the memories of the different modalities; the fused memory can be calculated from equation (15):
wherein the weight matrix is a learnable parameter, α is a density vector, and α_a and α_o are the two elements of the density vector α;
4-2. The output of the dynamic memory fusion module is used as the input of the answer module to predict the best answer; specifically, the fused memory is combined with the question feature q_e and then mapped through the weight matrix W_p to obtain the feature vector v, as shown in equation (16);
the obtained feature vector v is input into a multi-class classifier to predict the answer to the question.
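Steps 4-1 and 4-2 together can be sketched as follows. All parameter shapes, the concatenation-based conditioning, and the `tanh` mapping are assumptions for illustration; equations (15)–(16) of the patent give the actual forms of the density vector and the feature mapping.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_and_predict(u, q_e, m_a, m_o, W_alpha, W_p, W_cls):
    """Illustrative sketch of dynamic memory fusion and answer prediction:
    the shallow-reasoner output u (the supervisory sentinel) and the question
    embedding q_e produce the density vector (alpha_a, alpha_o) that weights
    the action and object memories; the fused memory is combined with q_e,
    mapped to the feature vector v, and fed to a multi-class classifier."""
    alpha = softmax(W_alpha @ np.concatenate([u, q_e]))   # (alpha_a, alpha_o)
    fused = alpha[0] * m_a + alpha[1] * m_o               # dynamic memory fusion
    v = np.tanh(W_p @ np.concatenate([fused, q_e]))       # feature vector v
    return softmax(W_cls @ v)                             # answer distribution

rng = np.random.default_rng(2)
dim, n_answers = 8, 5
probs = fuse_and_predict(rng.normal(size=dim), rng.normal(size=dim),
                         rng.normal(size=dim), rng.normal(size=dim),
                         rng.normal(size=(2, 2 * dim)),
                         rng.normal(size=(dim, 2 * dim)),
                         rng.normal(size=(n_answers, dim)))
```

The predicted answer would then be the arg-max over the classifier's output distribution.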
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011499931.2A CN112527993B (en) | 2020-12-17 | 2020-12-17 | Cross-media hierarchical deep video question-answer reasoning framework |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112527993A true CN112527993A (en) | 2021-03-19 |
CN112527993B CN112527993B (en) | 2022-08-05 |
Family
ID=75001166
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011499931.2A Active CN112527993B (en) | 2020-12-17 | 2020-12-17 | Cross-media hierarchical deep video question-answer reasoning framework |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112527993B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1296243A2 (en) * | 2001-09-25 | 2003-03-26 | Interuniversitair Microelektronica Centrum Vzw | A method for operating a real-time multimedia terminal in a QoS manner |
US20160342895A1 (en) * | 2015-05-21 | 2016-11-24 | Baidu Usa Llc | Multilingual image question answering |
US20170124432A1 (en) * | 2015-11-03 | 2017-05-04 | Baidu Usa Llc | Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering |
US10106153B1 (en) * | 2018-03-23 | 2018-10-23 | Chongqing Jinkang New Energy Vehicle Co., Ltd. | Multi-network-based path generation for vehicle parking |
CN108846384A (en) * | 2018-07-09 | 2018-11-20 | 北京邮电大学 | Merge the multitask coordinated recognition methods and system of video-aware |
CN108920587A (en) * | 2018-06-26 | 2018-11-30 | 清华大学 | Merge the open field vision answering method and device of external knowledge |
CN109919044A (en) * | 2019-02-18 | 2019-06-21 | 清华大学 | The video semanteme dividing method and device of feature propagation are carried out based on prediction |
CN111242197A (en) * | 2020-01-07 | 2020-06-05 | 中国石油大学(华东) | Image and text matching method based on double-view-domain semantic reasoning network |
Non-Patent Citations (1)
Title |
---|
YU, Jun et al., "Research on Visual Question Answering", Journal of Computer Research and Development (《计算机研究与发展》) * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113011193A (en) * | 2021-04-09 | 2021-06-22 | 广东外语外贸大学 | Bi-LSTM algorithm-based method and system for evaluating repeatability of detection consultation statement |
CN113011193B (en) * | 2021-04-09 | 2021-11-23 | 广东外语外贸大学 | Bi-LSTM algorithm-based method and system for evaluating repeatability of detection consultation statement |
CN113536952A (en) * | 2021-06-22 | 2021-10-22 | 电子科技大学 | Video question-answering method based on attention network of motion capture |
CN113536952B (en) * | 2021-06-22 | 2023-04-21 | 电子科技大学 | Video question-answering method based on attention network of motion capture |
WO2023159979A1 (en) * | 2022-02-22 | 2023-08-31 | 中兴通讯股份有限公司 | Ai reasoning method and system, and computer readable storage medium |
CN115618061A (en) * | 2022-11-29 | 2023-01-17 | 广东工业大学 | Semantic-aligned video question-answering method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110263912B (en) | Image question-answering method based on multi-target association depth reasoning | |
CN112527993B (en) | Cross-media hierarchical deep video question-answer reasoning framework | |
Karpathy et al. | Deep visual-semantic alignments for generating image descriptions | |
Shen et al. | Question/answer matching for CQA system via combining lexical and sequential information | |
CN109885756B (en) | CNN and RNN-based serialization recommendation method | |
CN114398961A (en) | Visual question-answering method based on multi-mode depth feature fusion and model thereof | |
CN113297364B (en) | Natural language understanding method and device in dialogue-oriented system | |
CN113886626B (en) | Visual question-answering method of dynamic memory network model based on multi-attention mechanism | |
Zong et al. | Emotion recognition in the wild via sparse transductive transfer linear discriminant analysis | |
Du et al. | Full transformer network with masking future for word-level sign language recognition | |
CN112232086A (en) | Semantic recognition method and device, computer equipment and storage medium | |
Dai et al. | Hybrid deep model for human behavior understanding on industrial internet of video things | |
Zhou et al. | Plenty is plague: Fine-grained learning for visual question answering | |
CN113204675A (en) | Cross-modal video time retrieval method based on cross-modal object inference network | |
CN115270752A (en) | Template sentence evaluation method based on multilevel comparison learning | |
CN114970517A (en) | Visual question and answer oriented method based on multi-modal interaction context perception | |
CN113609326A (en) | Image description generation method based on external knowledge and target relation | |
CN112069399A (en) | Personalized search system based on interactive matching | |
CN116187349A (en) | Visual question-answering method based on scene graph relation information enhancement | |
CN115408603A (en) | Online question-answer community expert recommendation method based on multi-head self-attention mechanism | |
CN114282528A (en) | Keyword extraction method, device, equipment and storage medium | |
CN116720519B (en) | Seedling medicine named entity identification method | |
CN113239678A (en) | Multi-angle attention feature matching method and system for answer selection | |
CN116881416A (en) | Instance-level cross-modal retrieval method for relational reasoning and cross-modal independent matching network | |
CN116189047A (en) | Short video classification method based on multi-mode information aggregation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||