CN112527993A - Cross-media hierarchical deep video question-answer reasoning framework - Google Patents

Cross-media hierarchical deep video question-answer reasoning framework

Info

Publication number
CN112527993A
CN112527993A
Authority
CN
China
Prior art keywords
memory
video
answer
reasoning
shallow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011499931.2A
Other languages
Chinese (zh)
Other versions
CN112527993B (en)
Inventor
余婷
来炳
钱璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University Of Finance & Economics Dongfang College
Original Assignee
Zhejiang University Of Finance & Economics Dongfang College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University Of Finance & Economics Dongfang College filed Critical Zhejiang University Of Finance & Economics Dongfang College
Priority to CN202011499931.2A priority Critical patent/CN112527993B/en
Publication of CN112527993A publication Critical patent/CN112527993A/en
Application granted granted Critical
Publication of CN112527993B publication Critical patent/CN112527993B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-media hierarchical deep video question-answer reasoning framework. The method comprises the following steps. 1. A memory component stores the global semantic information of the video, and a shallow inference engine is obtained through multiple rounds of memory-update iteration. 2. A deep inference engine is constructed on the basis of the shallow inference engine, and the multi-modal subcomponents obtained from deep semantic parsing of the video are embedded into memory slots of different modalities to form spatial memory and temporal memory. 3. A multi-modal memory collaborative reasoning framework is constructed, and finer-grained reasoning is performed using multi-modal evidence from objects and actions. 4. Multi-modal dynamic memory fusion is performed: the output of the shallow inference engine serves as a supervisory sentinel to guide the weight allocation over the lower-layer memory contents of the different modalities, the memories of the different modalities are fused dynamically by the dynamic memory fusion module of the framework, and the output of the dynamic memory fusion module is used as the input of the answer module to predict the best answer. The reasoning framework of the invention achieves significant results on video question-answering datasets.

Description

Cross-media hierarchical deep video question-answer reasoning framework
Technical Field
The invention relates to a deep neural network for video question answering, and in particular to a hierarchical deep reasoning framework based on unified cross-media representation.
Background
Cross-media technology aims to bridge the semantic gap between different media (such as video and text) and to form a unified cross-media semantic representation. Owing to the complexity of the semantics of multimedia data, this problem was not well solved before deep learning emerged. In recent years, deep learning has achieved remarkable performance across research fields: the task to be solved is modeled end to end with a complex neural network model, which learns a deep unified representation of cross-media data. Because of the strong semantic expressiveness of deep models, deep cross-media unified representation has become the mainstream approach.
On the basis of deep cross-media unified representation, a number of currently popular research directions have emerged, such as cross-media retrieval, visual description and visual question answering. Cross-media retrieval aims to find, given one media item, the best-matching data of another media type from a massive database; visual description aims to summarize the content of an image effectively in one or several natural-language sentences; visual question answering takes a natural-language question and a visual data object as input and, after the algorithm has fully understood both the natural-language description and the visual content, performs deep reasoning and finally outputs an answer expressed in natural language. Among these tasks, visual question answering is comparatively more challenging: it involves fine-grained understanding of visual content and natural language, and it also requires deep knowledge reasoning. Visual question answering has therefore become a research hotspot in recent years.
Video, as the mainstream type of visual data, exists at enormous scale on social networking sites, and its volume almost exceeds the sum of all other media data. Video data is more complex than images: a video is not simply a stack of image frames, and it contains information in multiple modalities such as vision, text and speech. Visual objects in a video present different visual characteristics from different viewpoints as time passes, and the spatial visual information at different moments is correlated. Moreover, visual question answering over video involves more complex questions: users can pose diverse, highly open-ended questions about the video content. Besides questions about static spatial information such as color, quantity and position, questions in the video question-answering task often require reasoning about action categories and the temporal ordering of actions. In addition, for a given video, the amount of visual information the model needs in order to answer correctly differs from question to question: some questions can be answered from a single frame, while others can only be answered correctly after the semantics of the complete video are understood.
In summary, the difficulty of video question answering lies in how to construct an efficient cross-media question answering reasoning framework on the basis of correctly and effectively understanding video content and sufficiently and accurately understanding question intentions, so as to improve accuracy of answer prediction.
Disclosure of Invention
The invention provides a deep hierarchical reasoning framework for question answering over complex long-term videos, which mainly comprises: 1. constructing a shallow inference engine: it filters out irrelevant information, identifies the important visual content related to the question description within the long sequence of the complex long-term video, and avoids information overload and noise in the deep memory network; 2. constructing a deep inference engine: under the guidance of the shallow inference engine, finer reasoning is performed using deeper semantic evidence from vision and natural language, and finer-grained attention is learned to improve the quality of cross-modal reasoning. For video question answering, the deep reasoning framework of the invention improves reasoning quality and outperforms conventional visual question-answering models. 3. a dynamic memory fusion module: it dynamically fuses the memories of different modalities, and its output is used as the input of the answer module to predict the best answer.
The technical scheme adopted by the invention for solving the technical problems is as follows:
and (1) storing the global semantic information of the video by using a memory component, and obtaining a shallow inference machine through multiple rounds of memory updating iteration under the guidance of the global visual features of the problem description, wherein the shallow inference machine is used for inferring the visual information most relevant to the global semantic features of the problem description.
And (2) constructing a deep inference machine based on the shallow inference machine, and embedding multi-modal subcomponents under deep semantic analysis of the video into memory card slots with different modalities to form spatial memory and time sequence memory.
And (3) constructing a multi-modal memory collaborative reasoning framework, and executing more fine reasoning by using multi-modal evidences from objects and actions so as to improve the quality of question answering.
And (4) performing multi-mode dynamic memory fusion, guiding the weight distribution of the memory contents of different modes at the lower layer by using the output of the shallow inference engine as a monitoring whistle, dynamically fusing the memories of different modes through a dynamic memory fusion module in the framework, and predicting the best answer by using the output of the dynamic memory fusion module as the input of a response module.
Further, in step (1), the memory component stores the global semantic information of the video and, guided by the global features of the question description, infers the visual information most relevant to the question's global semantic features through multiple rounds of memory-update iteration, specifically as follows:
1-1. The video features X and the question description feature h_T are input into the memory component, and the input features are first converted into the intrinsic vector features of the memory network, as shown in Equations (1) and (2):

X_e = tanh(W_x X + b_x)    (Equation 1)
q_e = tanh(W_q h_T + b_q)    (Equation 2)

where X_e and q_e denote the video features and the question description features after conversion, W_x and W_q are mapping matrices, b_x and b_q are the corresponding biases, and d_z is the intrinsic spatial dimension of the memory network.
1-2. Feature selection is performed with a hard attention mechanism. The similarity between the question description feature q_e and the video features X_e is computed, the video features are ranked by similarity score, and the n features most relevant to the question are selected to update the memory units of the shallow inference engine. This yields the n nearest-neighbour key-value pairs key = {k_1, k_2, ..., k_n} and value = {v_1, v_2, ..., v_n}, i.e. the updated video feature sequence set (Equation 3), with the ranking operation defined by Equation (4):

Γ(y_1, ..., y_n) = {j_1, ..., j_n}, where y_{j_1} ≥ y_{j_2} ≥ ... ≥ y_{j_n}    (Equation 4)

where f_s denotes the similarity measure and Γ is the sorting operation.
1-3. Based on the updated video feature sequence set, the shallow inference engine learns the probability distribution ρ of the question description over the memory units (Equation 5), where G_x and G_q denote two feed-forward fully connected neural networks. The output z of the shallow inference engine is then obtained as the weighted sum

z = Σ_i ρ_i · v_i    (Equation 6)

where v_i is the content stored in the i-th memory unit; z is combined with the originally input question description feature to form the question for the next round of reasoning.
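By way of illustration only, a minimal PyTorch-style sketch of one round of the shallow inference engine described in 1-1 to 1-3 is given below; the class and variable names, the cosine similarity standing in for f_s, the dot-product attention standing in for Equation (5), and the way z is combined with the question are assumptions rather than the patent's reference implementation.

# Sketch of one shallow-inference-engine round: embedding (Eq. 1-2),
# hard top-n selection (Eq. 3-4), soft attention and weighted sum (Eq. 5-6).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowInferenceRound(nn.Module):
    def __init__(self, d_x, d_q, d_z=256, top_n=20):
        super().__init__()
        self.embed_x = nn.Linear(d_x, d_z)   # W_x, b_x
        self.embed_q = nn.Linear(d_q, d_z)   # W_q, b_q
        self.g_x = nn.Linear(d_z, d_z)       # feed-forward network G_x
        self.g_q = nn.Linear(d_z, d_z)       # feed-forward network G_q
        self.top_n = top_n

    def forward(self, video_feats, question_feat):
        # video_feats: (N, d_x) frame/clip features; question_feat: (d_q,)
        x_e = torch.tanh(self.embed_x(video_feats))        # Equation (1)
        q_e = torch.tanh(self.embed_q(question_feat))      # Equation (2)
        # hard attention: rank by similarity and keep the top-n question-relevant units
        sim = F.cosine_similarity(x_e, q_e.unsqueeze(0), dim=-1)     # f_s, assumed cosine
        idx = sim.topk(min(self.top_n, x_e.size(0))).indices          # ranking Γ (Eq. 3-4)
        values = x_e[idx]                                             # selected memory contents v_i
        # soft attention over the selected memory units, then weighted sum (Eq. 5-6)
        rho = F.softmax((self.g_x(values) * self.g_q(q_e)).sum(-1), dim=0)
        z = (rho.unsqueeze(-1) * values).sum(0)
        # combine the round output with the question for the next iteration (assumed additive)
        return z, q_e + z, idx

# example usage with random features
engine = ShallowInferenceRound(d_x=2048, d_q=512)
z, next_q, kept = engine(torch.randn(120, 2048), torch.randn(512))

Several such rounds can be chained by feeding the returned question back in, mirroring the multi-round memory-update iteration of step (1).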
In step (2), a deep multi-modal memory network is constructed on the basis of the guided shallow inference engine, and the multi-modal subcomponents obtained from deep semantic parsing of the video are embedded into memory modules of different modalities to form the spatial memory and the temporal memory, specifically as follows:

The object features in the video, O = {o_1, ..., o_k}, are converted by a 1×1 convolutional neural network into the intrinsic vector features of the spatial memory module, and the action features A = {a_1, ..., a_l} are embedded into the temporal memory module by another 1×1 convolutional neural network, where d_o is the dimension of the object features, k is the number of objects, d_a is the dimension of the action features, and l is the number of actions.

The multi-modal subcomponents comprise the object features and the action features.
The memory modules of different modalities comprise the spatial memory module and the temporal memory module.

In step (3), the multi-modal memory collaborative reasoning framework is constructed, and finer reasoning is performed using multi-modal evidence from objects and actions, specifically as follows:
3-1. For the λ-th round of reasoning, given the action features a_i, the question feature h'_l and the action memory m_a^{λ-1} of the previous round, the similarity distribution ρ_a^λ over the action features is obtained from the two combined similarities P(·) and D(·), as shown in Equation (7). In the same way, the object features o_i yield the similarity distribution ρ_o^λ over the object features, as shown in Equation (8), where P(·) and D(·) are the two similarity calculation functions.
3-2. Considering that both the object features and the action features play an important role in high-quality reasoning for question answering, the multi-modal memory collaborative reasoning framework acts as an interactive inference engine that allows the action memory and the object memory to interact dynamically: when the memory of one modality is updated, the memory of the other modality provides useful clues for attention learning. Specifically:

When computing the update gating signal for the memory content of one modality, the influence of the other modality must be considered in addition to the modality itself. Taking the action modality as an example, the update gating signal g_a^λ of the action memory is derived from Equation (10); a GRU-based attention mechanism is then employed to extract the contextual feature c_a^λ, which is used to update the action memory m_a^λ of the current round, as shown in Equation (11). The mapping matrices and corresponding biases appearing in these equations are learnable parameters.

The object memory m_o^λ of the current round is obtained in the same way, as shown in Equation (14), with its own set of mapping matrices and corresponding biases.
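A hedged sketch of the cross-modal memory update of 3-2 follows; because the exact forms of Equations (9)-(14) are not reproduced in the text, the sigmoid gate over the question and both memories, the attention-GRU aggregation, and the GRUCell refresh below are assumptions consistent with the description.

# Sketch of the interactive memory update: the gate of one modality also
# sees the other modality's memory; an attention GRU builds the context;
# the modality memory is then refreshed.
import torch
import torch.nn as nn

class CrossModalMemoryUpdate(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(3 * d, d), nn.Sigmoid())  # question, own memory, other memory
        self.attn_gru = nn.GRUCell(d, d)                              # GRU-based attention aggregator
        self.update = nn.GRUCell(d, d)                                # memory refresh

    def forward(self, feats, rho, question, own_mem, other_mem):
        # feats: (k, d) modality features; rho: (k,) attention from step 3-1
        g = self.gate(torch.cat([question, own_mem, other_mem], dim=-1))   # cross-modal update gate
        ctx = torch.zeros_like(own_mem)
        for f, w in zip(feats, rho):                                        # attention GRU scan
            ctx = w * self.attn_gru(f.unsqueeze(0), ctx.unsqueeze(0)).squeeze(0) + (1 - w) * ctx
        return self.update((g * ctx).unsqueeze(0), own_mem.unsqueeze(0)).squeeze(0)

updater = CrossModalMemoryUpdate(d=512)
m_a = updater(torch.randn(30, 512), torch.softmax(torch.randn(30), 0),
              torch.randn(512), torch.randn(512), torch.randn(512))  # action-memory refresh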
In step (4), multi-modal dynamic memory fusion is performed: the output of the shallow inference engine is used as a supervisory sentinel to guide the weight allocation over the lower-layer memory contents of the different modalities, the memories of the different modalities are fused dynamically by the dynamic memory fusion module of the framework, and the output of the dynamic memory fusion module is used as the input of the answer module to predict the best answer. The specific process is as follows:

4-1. Because of the complexity and diversity of question descriptions, the magnitude of the contribution of the different visual sub-modalities to the question-answering model varies dynamically. The dynamic memory fusion module uses the output u' of the shallow inference engine as a supervisory sentinel and, combined with the original embedded feature q_e of the question description, guides the weight allocation over the memory contents of the different modalities of the deep inference engine and fuses those memories dynamically. The fused memory m* can be computed from Equation (15) as the weighted sum of the two modality memories:

m* = α_a · m_a + α_o · m_o    (Equation 15)

where the weights are produced by a learnable parameter, α is the density vector, and α_a and α_o are its two elements.
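The dynamic fusion of 4-1 can be sketched as follows; the single linear layer producing the density vector and the feature dimensions (256-dimensional u' and 300-dimensional q_e, matching the embodiment, 556 = 256 + 300) are assumptions.

# Sketch of Equation (15): sentinel u' and embedded question q_e produce
# softmax weights over the action and object memories, which are then mixed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicMemoryFusion(nn.Module):
    def __init__(self, d_u, d_q, n_modalities=2):
        super().__init__()
        self.weigher = nn.Linear(d_u + d_q, n_modalities)   # e.g. the 2 x 556 matrix of the embodiment

    def forward(self, u_shallow, q_e, m_action, m_object):
        alpha = F.softmax(self.weigher(torch.cat([u_shallow, q_e], dim=-1)), dim=-1)
        return alpha[0] * m_action + alpha[1] * m_object     # fused memory m*

fusion = DynamicMemoryFusion(d_u=256, d_q=300)
m_star = fusion(torch.randn(256), torch.randn(300), torch.randn(300), torch.randn(300))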
4-2. The output of the dynamic memory fusion module is used as the input of the answer module to predict the best answer. Specifically, the fused memory m* and the question feature q_e are fused and then mapped by the weight matrix W_p to yield the feature vector v, as shown in Equation (16):

v = W_p (m* ⊙ q_e)    (Equation 16)

where ⊙ denotes the element-wise product. The obtained feature vector v is input into the answer module (a multi-class classifier) to predict the answer to the question.
The invention has the beneficial effects that:
the invention provides a novel coarse-to-fine hierarchical depth inference framework for a complicated long-term network video question-answering problem, and the method comprises the steps of firstly filtering invalid information from a long video sequence by constructing a shallow inference engine, identifying important visual contents, learning the global attention of coarse-grained videos, and then constructing a deep inference engine to perform depth optimization inference from two directions of interframes and intraframes. Through multiple rounds of reasoning iteration, the reasoning framework can simulate the human video question-answer reasoning process, the key time relevant to the question is located from the long-term video, and relevant evidences are collected to predict the answer. The reasoning framework of our invention can achieve significant effects on the video question-answer dataset.
Drawings
FIG. 1 is a general block diagram of the process of the present invention.
FIG. 2 is a guided shallow inference engine constructed in the method of the present invention.
FIG. 3 is an optimized deep inference engine constructed in the method of the present invention.
Detailed Description
The detailed parameters of the invention are described in more detail below.
As shown in FIG. 1, the invention provides a hierarchical deep question answering reasoning framework for complex video question answering.
In step (1), the memory component stores the global semantic information of the video; under the guidance of the question description's global features, a shallow inference engine is obtained through multiple rounds of memory-update iteration and infers the visual information most relevant to the question's global semantic features, as shown in FIG. 2, specifically as follows:
1-1. The video features X and the question description feature h_T are extracted and input into the memory component, and the input is converted into the intrinsic vector features of the memory network.

For the global video features, the large-scale pre-trained neural networks VGG and 3D-CNN are used to extract intermediate features, which are fed into a bidirectional GRU network to obtain globally aware semantic features X = {x_1, ..., x_N}, where d_x = 2048 is the feature dimension. For the question description features, the shallow word embedding model GloVe first encodes each word to capture its semantics; the resulting word vectors are then fed in sequence into a bidirectional LSTM network containing d_q = 256 hidden units to learn the context of the question description, and finally the forward and backward hidden states are concatenated to represent the global semantics of the question description, h_T.

The video features and the question description features are then input into the memory component and converted into the internal vectors of the memory network, where W_x and W_q are the mapping matrices, b_x and b_q the biases, and d_z = 256 is the intrinsic spatial dimension of the memory network.
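An illustrative sketch of the global encoders of 1-1 is given below; the CNN frame features (VGG / 3D-CNN) are assumed to be precomputed, and the vocabulary size and word-embedding dimension are placeholders standing in for GloVe vectors.

# Sketch of the global encoders: a bidirectional GRU over precomputed
# frame/clip features, and a bidirectional LSTM with 256 hidden units over
# word embeddings whose final states are concatenated into h_T.
import torch
import torch.nn as nn

class GlobalEncoders(nn.Module):
    def __init__(self, d_frame=2048, d_word=300, vocab=10000, d_hidden=256):
        super().__init__()
        self.video_rnn = nn.GRU(d_frame, d_frame // 2, bidirectional=True, batch_first=True)
        self.word_emb = nn.Embedding(vocab, d_word)          # stands in for GloVe embeddings
        self.question_rnn = nn.LSTM(d_word, d_hidden, bidirectional=True, batch_first=True)

    def forward(self, frame_feats, word_ids):
        # frame_feats: (1, N, d_frame); word_ids: (1, T)
        video_ctx, _ = self.video_rnn(frame_feats)            # globally aware video features X
        _, (h, _) = self.question_rnn(self.word_emb(word_ids))
        h_T = torch.cat([h[0], h[1]], dim=-1).squeeze(0)      # concatenated fwd/bwd final states
        return video_ctx.squeeze(0), h_T

enc = GlobalEncoders()
X, h_T = enc(torch.randn(1, 120, 2048), torch.randint(0, 10000, (1, 12)))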
1-2. Feature selection is performed with a hard attention mechanism. The similarity between the question description features and the video features is computed, the video features are ranked by similarity score, and the 20 features most relevant to the question are selected to update the storage units of the memory network, yielding the 20 nearest-neighbour key-value pairs key = {k_1, k_2, ..., k_n} and value = {v_1, v_2, ..., v_n} (n = 20) and forming the updated video feature sequence.
1-3. Based on the updated memory contents, the probability distribution ρ of the question description over the memory units is learned. The output z of this layer is obtained by a weighted sum and, together with the original question description features, forms the updated question for the next round of reasoning.
In step (2), a deep inference engine is constructed on the basis of the shallow inference engine, and the multi-modal subcomponents obtained from deep semantic parsing of the video are embedded into memory slots of different modalities to form the spatial memory and the temporal memory, specifically as follows:

Video object features are extracted with the existing Fast-RCNN model: guided by the shallow inference engine, 36 targets are detected in each of the 20 important video units, and a 4096-dimensional object feature is extracted for each target. The object features are then converted into the internal vectors of the spatial memory module through a 1×1 convolutional neural network to form the spatial memory.

An external pre-trained temporal-proposal generation network predicts the 30 temporal segments in the video most likely to contain actions, and the video features of these 30 segments are extracted as the action features A following step (1). The action features are embedded into the temporal memory module through another 1×1 convolutional neural network to form the temporal memory.
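For illustration, the two 1×1 convolutional embeddings of step (2) can be sketched as follows; the 512-dimensional intrinsic space and the 2048-dimensional action features are assumptions, while the 4096-dimensional object features, 36 objects per unit and 30 action segments follow the embodiment.

# Sketch of the spatial and temporal memory embeddings via 1x1 convolutions.
import torch
import torch.nn as nn

embed_objects = nn.Conv1d(4096, 512, kernel_size=1)   # spatial-memory embedding, d_o -> intrinsic dim
embed_actions = nn.Conv1d(2048, 512, kernel_size=1)   # temporal-memory embedding, d_a -> intrinsic dim

object_feats = torch.randn(1, 4096, 36)    # (batch, d_o, k) region features of one key video unit
action_feats = torch.randn(1, 2048, 30)    # (batch, d_a, l) features of the candidate action segments

spatial_memory = embed_objects(object_feats)    # (1, 512, 36)
temporal_memory = embed_actions(action_feats)   # (1, 512, 30)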
In step (3), the multi-modal memory collaborative reasoning is constructed, and finer reasoning is performed using multi-modal evidence from objects and actions, as shown in FIG. 3, specifically as follows:

3-1. For the λ-th round of reasoning, given the action features, the question features and the action memory m_a^{λ-1} of the previous round, the similarity distribution over the action features is obtained by combining an element-wise product similarity function and an element-wise absolute-difference similarity function. The similarity distribution over the object features is obtained in the same way.

3-2. The memory update gating signals are then computed. When computing the update gate g_a of the action memory, the influence of the object modality must be considered in addition to the action modality itself; likewise, when computing the update gate g_o of the object memory, the action modality provides a helpful cue in addition to the object modality. A GRU-based attention mechanism then extracts the contextual features, which are used to update the action memory m_a^λ of the current round.
In step (4), the multi-modal dynamic memory fusion module dynamically fuses the memories of the different modalities, and its output is used as the input of the answer module to predict the best answer, as shown in FIG. 3, specifically as follows:

4-1. The dynamic memory fusion module uses the output of the shallow inference engine as a supervisory sentinel to guide the weight allocation over the deep-layer memory contents of the different modalities and to fuse those memories dynamically. The module first concatenates the 256-dimensional shallow-inference-engine memory u' with the 300-dimensional question feature q_e, applies a 2 × 556 matrix transformation followed by a softmax classifier to obtain the weights of the different sub-modalities of the deep inference engine, and finally obtains the fused memory m* by a weighted sum.

4-2. The output of the dynamic memory fusion module is used as the input of the answer module to predict the best answer. Specifically, the fused memory m* and the question feature q_e are combined by an element-wise product to obtain a 300-dimensional feature vector, which is mapped by a 1000 × 300 weight matrix W_p into a 1000-dimensional feature vector v. The vector v is fed into a 1000-way softmax classifier to obtain a probability distribution over the answer dictionary. The model is trained end to end with softmax cross-entropy as the loss function until the network converges.

Claims (5)

1. A cross-media hierarchical deep video question-answer reasoning framework, characterized by comprising the following steps:
step (1): storing the global semantic information of the video with a memory component and, under the guidance of the global features of the question description, obtaining a shallow inference engine through multiple rounds of memory-update iteration, the shallow inference engine being used to infer the visual information most relevant to the global semantic features of the question description;
step (2): constructing a deep inference engine on the basis of the shallow inference engine, and embedding the multi-modal subcomponents obtained from deep semantic parsing of the video into memory slots of different modalities to form spatial memory and temporal memory;
step (3): constructing a multi-modal memory collaborative reasoning framework, and performing finer reasoning with multi-modal evidence from objects and actions to improve the quality of question answering;
step (4): performing multi-modal dynamic memory fusion, using the output of the shallow inference engine as a supervisory sentinel to guide the weight allocation over the lower-layer memory contents of the different modalities, dynamically fusing the memories of the different modalities through the dynamic memory fusion module of the framework, and using the output of the dynamic memory fusion module as the input of the answer module to predict the best answer.
2. The cross-media hierarchical deep video question-answer reasoning framework of claim 1, wherein the step (1) is specifically as follows:
1-1. the video features X and the question description feature h_T are input into the memory component, and the input features are first converted into the intrinsic vector features of the memory network, as shown in Equations (1) and (2):
X_e = tanh(W_x X + b_x)    (Equation 1)
q_e = tanh(W_q h_T + b_q)    (Equation 2)
where X_e and q_e denote the video features and the question description features after conversion, W_x and W_q are mapping matrices, b_x and b_q are the corresponding biases, and d_z is the intrinsic spatial dimension of the memory network;
1-2. feature selection is performed with a hard attention mechanism: the similarity between the question description feature q_e and the video features X_e is computed, the video features are ranked by similarity score, and the n features most relevant to the question are selected to update the memory units of the shallow inference engine, yielding the n nearest-neighbour key-value pairs key = {k_1, k_2, ..., k_n} and value = {v_1, v_2, ..., v_n}, i.e. the updated video feature sequence set (Equation 3), with the ranking operation defined by
Γ(y_1, ..., y_n) = {j_1, ..., j_n}, where y_{j_1} ≥ y_{j_2} ≥ ... ≥ y_{j_n}    (Equation 4)
where f_s denotes the similarity measure and Γ is the sorting operation;
1-3. based on the updated video feature sequence set, the shallow inference engine learns the probability distribution ρ of the question description features over the memory units (Equation 5), where G_x and G_q denote two feed-forward fully connected neural networks; the output z of the shallow inference engine is obtained by the weighted sum
z = Σ_i ρ_i · v_i    (Equation 6)
where v_i is the content stored in the i-th memory unit, and z is combined with the originally input question description feature as the question for the next round of reasoning.
3. The cross-media hierarchical deep video question-answer reasoning framework of claim 2, wherein the step (2) is specifically as follows:
the object features in the video, O = {o_1, ..., o_k}, are converted by a 1×1 convolutional neural network into the intrinsic vector features of the spatial memory module, and the action features A = {a_1, ..., a_l} are embedded into the temporal memory module by another 1×1 convolutional neural network, where d_o is the dimension of the object features, k is the number of objects, d_a is the dimension of the action features, and l is the number of actions;
the multi-modal subcomponents comprise the object features and the action features.
4. The cross-media hierarchical deep video question-answer reasoning framework of claim 3, wherein the step (3) is specifically as follows:
3-1. finer reasoning is performed with multi-modal evidence from objects and actions: for the λ-th round of reasoning, given the action features a_i, the question feature h'_l and the action memory m_a^{λ-1} of the previous round, the similarity distribution ρ_a^λ over the action features is obtained from the two combined similarities P(·) and D(·), as shown in Equation (7); the similarity distribution ρ_o^λ over the object features is obtained in the same way, as shown in Equation (8);
3-2. the multi-modal memory collaborative reasoning module acts as an interactive inference engine that allows the action memory and the object memory to interact dynamically: when the memory of one modality is updated, the memory of the other modality provides useful clues for attention learning; when the action modality is updated, the update gate g_a^λ of the action memory is derived from Equation (10), a GRU-based attention mechanism then extracts the contextual feature c_a^λ, which is used to update the action memory m_a^λ of the current round, as shown in Equation (11), the mapping matrices and corresponding biases in these equations being learnable parameters;
the object memory m_o^λ of the current round is obtained in the same way, as shown in Equation (14), with its own set of mapping matrices and corresponding biases.
5. The cross-media hierarchical deep video question-answer reasoning framework of claim 4, wherein the step (4) performs multi-modal dynamic memory fusion: the output u' of the shallow inference engine is used as a supervisory sentinel and, combined with the original embedded feature q_e of the question description, guides the weight allocation over the lower-layer memory contents of the different modalities and dynamically fuses those memories, and the output of the dynamic memory fusion module is used as the input of the answer module to predict the best answer, the specific process being as follows:
4-1. the dynamic memory fusion module uses the output of the upper-layer memory network as a supervisory sentinel to guide the weight allocation over the lower-layer memory contents of the different modalities and to fuse them dynamically; the fused memory m* can be computed from Equation (15) as the weighted sum of the two modality memories:
m* = α_a · m_a + α_o · m_o    (Equation 15)
where the weights are produced by a learnable parameter, α is the density vector, and α_a and α_o are its two elements;
4-2. the output of the dynamic memory fusion module is used as the input of the answer module to predict the best answer; specifically, m* and the question feature q_e are fused and then mapped by the weight matrix W_p to obtain the feature vector v, as shown in Equation (16):
v = W_p (m* ⊙ q_e)    (Equation 16)
the obtained feature vector v is input into a multi-class classifier to predict the answer to the question.
CN202011499931.2A 2020-12-17 2020-12-17 Cross-media hierarchical deep video question-answer reasoning framework Active CN112527993B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011499931.2A CN112527993B (en) 2020-12-17 2020-12-17 Cross-media hierarchical deep video question-answer reasoning framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011499931.2A CN112527993B (en) 2020-12-17 2020-12-17 Cross-media hierarchical deep video question-answer reasoning framework

Publications (2)

Publication Number Publication Date
CN112527993A true CN112527993A (en) 2021-03-19
CN112527993B CN112527993B (en) 2022-08-05

Family

ID=75001166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011499931.2A Active CN112527993B (en) 2020-12-17 2020-12-17 Cross-media hierarchical deep video question-answer reasoning framework

Country Status (1)

Country Link
CN (1) CN112527993B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011193A (en) * 2021-04-09 2021-06-22 广东外语外贸大学 Bi-LSTM algorithm-based method and system for evaluating repeatability of detection consultation statement
CN113536952A (en) * 2021-06-22 2021-10-22 电子科技大学 Video question-answering method based on attention network of motion capture
CN115618061A (en) * 2022-11-29 2023-01-17 广东工业大学 Semantic-aligned video question-answering method
WO2023159979A1 (en) * 2022-02-22 2023-08-31 中兴通讯股份有限公司 Ai reasoning method and system, and computer readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1296243A2 (en) * 2001-09-25 2003-03-26 Interuniversitair Microelektronica Centrum Vzw A method for operating a real-time multimedia terminal in a QoS manner
US20160342895A1 (en) * 2015-05-21 2016-11-24 Baidu Usa Llc Multilingual image question answering
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
US10106153B1 (en) * 2018-03-23 2018-10-23 Chongqing Jinkang New Energy Vehicle Co., Ltd. Multi-network-based path generation for vehicle parking
CN108846384A (en) * 2018-07-09 2018-11-20 北京邮电大学 Merge the multitask coordinated recognition methods and system of video-aware
CN108920587A (en) * 2018-06-26 2018-11-30 清华大学 Merge the open field vision answering method and device of external knowledge
CN109919044A (en) * 2019-02-18 2019-06-21 清华大学 The video semanteme dividing method and device of feature propagation are carried out based on prediction
CN111242197A (en) * 2020-01-07 2020-06-05 中国石油大学(华东) Image and text matching method based on double-view-domain semantic reasoning network

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1296243A2 (en) * 2001-09-25 2003-03-26 Interuniversitair Microelektronica Centrum Vzw A method for operating a real-time multimedia terminal in a QoS manner
US20160342895A1 (en) * 2015-05-21 2016-11-24 Baidu Usa Llc Multilingual image question answering
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
US10106153B1 (en) * 2018-03-23 2018-10-23 Chongqing Jinkang New Energy Vehicle Co., Ltd. Multi-network-based path generation for vehicle parking
CN108920587A (en) * 2018-06-26 2018-11-30 清华大学 Merge the open field vision answering method and device of external knowledge
CN108846384A (en) * 2018-07-09 2018-11-20 北京邮电大学 Merge the multitask coordinated recognition methods and system of video-aware
CN109919044A (en) * 2019-02-18 2019-06-21 清华大学 The video semanteme dividing method and device of feature propagation are carried out based on prediction
CN111242197A (en) * 2020-01-07 2020-06-05 中国石油大学(华东) Image and text matching method based on double-view-domain semantic reasoning network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YU JUN et al.: "Research on Visual Question Answering Techniques", Journal of Computer Research and Development (计算机研究与发展) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011193A (en) * 2021-04-09 2021-06-22 广东外语外贸大学 Bi-LSTM algorithm-based method and system for evaluating repeatability of detection consultation statement
CN113011193B (en) * 2021-04-09 2021-11-23 广东外语外贸大学 Bi-LSTM algorithm-based method and system for evaluating repeatability of detection consultation statement
CN113536952A (en) * 2021-06-22 2021-10-22 电子科技大学 Video question-answering method based on attention network of motion capture
CN113536952B (en) * 2021-06-22 2023-04-21 电子科技大学 Video question-answering method based on attention network of motion capture
WO2023159979A1 (en) * 2022-02-22 2023-08-31 中兴通讯股份有限公司 Ai reasoning method and system, and computer readable storage medium
CN115618061A (en) * 2022-11-29 2023-01-17 广东工业大学 Semantic-aligned video question-answering method

Also Published As

Publication number Publication date
CN112527993B (en) 2022-08-05

Similar Documents

Publication Publication Date Title
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
Karpathy et al. Deep visual-semantic alignments for generating image descriptions
Shen et al. Question/answer matching for CQA system via combining lexical and sequential information
CN109885756B (en) CNN and RNN-based serialization recommendation method
CN114398961A (en) Visual question-answering method based on multi-mode depth feature fusion and model thereof
CN113297364B (en) Natural language understanding method and device in dialogue-oriented system
CN113886626B (en) Visual question-answering method of dynamic memory network model based on multi-attention mechanism
Zong et al. Emotion recognition in the wild via sparse transductive transfer linear discriminant analysis
Du et al. Full transformer network with masking future for word-level sign language recognition
CN112232086A (en) Semantic recognition method and device, computer equipment and storage medium
Dai et al. Hybrid deep model for human behavior understanding on industrial internet of video things
Zhou et al. Plenty is plague: Fine-grained learning for visual question answering
CN113204675A (en) Cross-modal video time retrieval method based on cross-modal object inference network
CN115270752A (en) Template sentence evaluation method based on multilevel comparison learning
CN114970517A (en) Visual question and answer oriented method based on multi-modal interaction context perception
CN113609326A (en) Image description generation method based on external knowledge and target relation
CN112069399A (en) Personalized search system based on interactive matching
CN116187349A (en) Visual question-answering method based on scene graph relation information enhancement
CN115408603A (en) Online question-answer community expert recommendation method based on multi-head self-attention mechanism
CN114282528A (en) Keyword extraction method, device, equipment and storage medium
CN116720519B (en) Seedling medicine named entity identification method
CN113239678A (en) Multi-angle attention feature matching method and system for answer selection
CN116881416A (en) Instance-level cross-modal retrieval method for relational reasoning and cross-modal independent matching network
CN116189047A (en) Short video classification method based on multi-mode information aggregation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant