CN112527993A - Cross-media hierarchical deep video question-answer reasoning framework - Google Patents

Cross-media hierarchical deep video question-answer reasoning framework

Info

Publication number
CN112527993A
CN112527993A
Authority
CN
China
Prior art keywords
memory
video
answer
reasoning
shallow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011499931.2A
Other languages
Chinese (zh)
Other versions
CN112527993B (en)
Inventor
余婷
来炳
钱璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University Of Finance & Economics Dongfang College
Original Assignee
Zhejiang University Of Finance & Economics Dongfang College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University Of Finance & Economics Dongfang College filed Critical Zhejiang University Of Finance & Economics Dongfang College
Priority to CN202011499931.2A priority Critical patent/CN112527993B/en
Publication of CN112527993A publication Critical patent/CN112527993A/en
Application granted granted Critical
Publication of CN112527993B publication Critical patent/CN112527993B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-media hierarchical deep video question-answer reasoning framework. The method comprises the following steps. 1. A memory component stores the global semantic information of the video, and a shallow inference engine is obtained through multiple rounds of memory-update iteration. 2. A deep inference engine is constructed on the basis of the shallow inference engine, and the multi-modal subcomponents obtained from deep semantic parsing of the video are embedded into memory slots of different modalities to form spatial memory and temporal memory. 3. A multi-modal memory collaborative reasoning framework is constructed, and finer-grained reasoning is performed using multi-modal evidence from objects and actions. 4. Multi-modal dynamic memory fusion is performed: the output of the shallow inference engine serves as a supervisory sentinel to guide the weight allocation over the lower-layer memory contents of the different modalities, the memories of the different modalities are fused dynamically by the dynamic memory fusion module of the framework, and the output of the dynamic memory fusion module is used as the input of the answer module to predict the best answer. The reasoning framework of the invention achieves significant results on video question-answering datasets.

Description

Cross-media hierarchical deep video question-answer reasoning framework
Technical Field
The invention relates to a deep neural network for video question answering, and in particular to a hierarchical deep reasoning framework based on unified cross-media representation.
Background
Cross-media technology aims to bridge the semantic gap between different media (such as video and text) and to form a unified cross-media semantic representation. Owing to the complexity of the semantics of multimedia data, this problem was not well solved before deep learning emerged. In recent years, deep learning has achieved remarkable performance across research fields: the task to be solved is modeled end to end with a complex neural network model, which learns a deep unified representation of cross-media data. Because of the strong semantic expressiveness of deep models, deep cross-media unified representation has become the mainstream approach.
On the basis of deep cross-media unified representation, a number of currently popular research directions have emerged, such as cross-media retrieval, visual description and visual question answering. Cross-media retrieval aims to find, given one media item, the best-matching data of another media type from a massive database; visual description aims to summarize the content of an image effectively in one or several natural-language sentences; visual question answering takes a natural-language question and a visual data object as input and, after the algorithm has fully understood both the natural-language description and the visual content, performs deep reasoning and finally outputs an answer expressed in natural language. Among these tasks, visual question answering is comparatively more challenging: it involves fine-grained understanding of visual content and natural language, and it also requires deep knowledge reasoning. Visual question answering has therefore become a research hotspot in recent years.
Video, as the mainstream type of visual data, exists at enormous scale on social networking sites, and its volume almost exceeds the sum of all other media data. Video data is more complex than images: a video is not simply a stack of image frames, and it contains information in multiple modalities such as vision, text and speech. Visual objects in a video present different visual characteristics from different viewpoints as time passes, and the spatial visual information at different moments is correlated. Moreover, visual question answering over video involves more complex questions: users can pose diverse, highly open-ended questions about the video content. Besides questions about static spatial information such as color, quantity and position, questions in the video question-answering task often require reasoning about action categories and the temporal ordering of actions. In addition, for a given video, the amount of visual information the model needs in order to answer correctly differs from question to question: some questions can be answered from a single frame, while others can only be answered correctly after the semantics of the complete video are understood.
In summary, the difficulty of video question answering lies in how to construct an efficient cross-media question answering reasoning framework on the basis of correctly and effectively understanding video content and sufficiently and accurately understanding question intentions, so as to improve accuracy of answer prediction.
Disclosure of Invention
The invention provides a deep hierarchical reasoning framework for question answering over complex long-term videos, which mainly comprises: 1. constructing a shallow inference engine: it filters out irrelevant information, identifies the important visual content related to the question description within the long sequence of the complex long-term video, and avoids information overload and noise in the deep memory network; 2. constructing a deep inference engine: under the guidance of the shallow inference engine, finer reasoning is performed using deeper semantic evidence from vision and natural language, and finer-grained attention is learned to improve the quality of cross-modal reasoning. For video question answering, the deep reasoning framework of the invention improves reasoning quality and outperforms conventional visual question-answering models. 3. a dynamic memory fusion module: it dynamically fuses the memories of different modalities, and its output is used as the input of the answer module to predict the best answer.
The technical scheme adopted by the invention for solving the technical problems is as follows:
and (1) storing the global semantic information of the video by using a memory component, and obtaining a shallow inference machine through multiple rounds of memory updating iteration under the guidance of the global visual features of the problem description, wherein the shallow inference machine is used for inferring the visual information most relevant to the global semantic features of the problem description.
And (2) constructing a deep inference machine based on the shallow inference machine, and embedding multi-modal subcomponents under deep semantic analysis of the video into memory card slots with different modalities to form spatial memory and time sequence memory.
And (3) constructing a multi-modal memory collaborative reasoning framework, and executing more fine reasoning by using multi-modal evidences from objects and actions so as to improve the quality of question answering.
And (4) performing multi-mode dynamic memory fusion, guiding the weight distribution of the memory contents of different modes at the lower layer by using the output of the shallow inference engine as a monitoring whistle, dynamically fusing the memories of different modes through a dynamic memory fusion module in the framework, and predicting the best answer by using the output of the dynamic memory fusion module as the input of a response module.
Further, in step (1), the memory component stores the global semantic information of the video and, guided by the global features of the question description, infers the visual information most relevant to the question's global semantic features through multiple rounds of memory-update iteration, specifically as follows:
1-1. The video features X and the question description feature h_T are input into the memory component, and the input features are first converted into the intrinsic vector features of the memory network, as shown in Equations (1) and (2):

X_e = tanh(W_x X + b_x)    (Equation 1)
q_e = tanh(W_q h_T + b_q)    (Equation 2)

where X_e and q_e denote the video features and the question description features after conversion, W_x and W_q are mapping matrices, b_x and b_q are the corresponding biases, and d_z is the intrinsic spatial dimension of the memory network.
1-2. Feature selection is performed with a hard attention mechanism. The similarity between the question description feature q_e and the video features X_e is computed, the video features are ranked by similarity score, and the n features most relevant to the question are selected to update the memory units of the shallow inference engine. This yields the n nearest-neighbour key-value pairs key = {k_1, k_2, ..., k_n} and value = {v_1, v_2, ..., v_n}, i.e. the updated video feature sequence set (Equation 3), with the ranking operation defined by Equation (4):

Γ(y_1, ..., y_n) = {j_1, ..., j_n}, where y_{j_1} ≥ y_{j_2} ≥ ... ≥ y_{j_n}    (Equation 4)

where f_s denotes the similarity measure and Γ is the sorting operation.
1-3. Based on the updated video feature sequence set, the shallow inference engine learns the probability distribution ρ of the question description over the memory units (Equation 5), where G_x and G_q denote two feed-forward fully connected neural networks. The output z of the shallow inference engine is then obtained as the weighted sum

z = Σ_i ρ_i · v_i    (Equation 6)

where v_i is the content stored in the i-th memory unit; z is combined with the originally input question description feature to form the question for the next round of reasoning.
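By way of illustration only, a minimal PyTorch-style sketch of one round of the shallow inference engine described in 1-1 to 1-3 is given below; the class and variable names, the cosine similarity standing in for f_s, the dot-product attention standing in for Equation (5), and the way z is combined with the question are assumptions rather than the patent's reference implementation.

# Sketch of one shallow-inference-engine round: embedding (Eq. 1-2),
# hard top-n selection (Eq. 3-4), soft attention and weighted sum (Eq. 5-6).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowInferenceRound(nn.Module):
    def __init__(self, d_x, d_q, d_z=256, top_n=20):
        super().__init__()
        self.embed_x = nn.Linear(d_x, d_z)   # W_x, b_x
        self.embed_q = nn.Linear(d_q, d_z)   # W_q, b_q
        self.g_x = nn.Linear(d_z, d_z)       # feed-forward network G_x
        self.g_q = nn.Linear(d_z, d_z)       # feed-forward network G_q
        self.top_n = top_n

    def forward(self, video_feats, question_feat):
        # video_feats: (N, d_x) frame/clip features; question_feat: (d_q,)
        x_e = torch.tanh(self.embed_x(video_feats))        # Equation (1)
        q_e = torch.tanh(self.embed_q(question_feat))      # Equation (2)
        # hard attention: rank by similarity and keep the top-n question-relevant units
        sim = F.cosine_similarity(x_e, q_e.unsqueeze(0), dim=-1)     # f_s, assumed cosine
        idx = sim.topk(min(self.top_n, x_e.size(0))).indices          # ranking Γ (Eq. 3-4)
        values = x_e[idx]                                             # selected memory contents v_i
        # soft attention over the selected memory units, then weighted sum (Eq. 5-6)
        rho = F.softmax((self.g_x(values) * self.g_q(q_e)).sum(-1), dim=0)
        z = (rho.unsqueeze(-1) * values).sum(0)
        # combine the round output with the question for the next iteration (assumed additive)
        return z, q_e + z, idx

# example usage with random features
engine = ShallowInferenceRound(d_x=2048, d_q=512)
z, next_q, kept = engine(torch.randn(120, 2048), torch.randn(512))

Several such rounds can be chained by feeding the returned question back in, mirroring the multi-round memory-update iteration of step (1).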
In step (2), a deep multi-modal memory network is constructed on the basis of the guided shallow inference engine, and the multi-modal subcomponents obtained from deep semantic parsing of the video are embedded into memory modules of different modalities to form the spatial memory and the temporal memory, specifically as follows:

The object features in the video, O = {o_1, ..., o_k}, are converted by a 1×1 convolutional neural network into the intrinsic vector features of the spatial memory module, and the action features A = {a_1, ..., a_l} are embedded into the temporal memory module by another 1×1 convolutional neural network, where d_o is the dimension of the object features, k is the number of objects, d_a is the dimension of the action features, and l is the number of actions.

The multi-modal subcomponents comprise the object features and the action features.
The memory modules of different modalities comprise the spatial memory module and the temporal memory module.

In step (3), the multi-modal memory collaborative reasoning framework is constructed, and finer reasoning is performed using multi-modal evidence from objects and actions, specifically as follows:
3-1. For the λ-th round of reasoning, given the action features a_i, the question feature h'_l and the action memory m_a^{λ-1} of the previous round, the similarity distribution ρ_a^λ over the action features is obtained from the two combined similarities P(·) and D(·), as shown in Equation (7). In the same way, the object features o_i yield the similarity distribution ρ_o^λ over the object features, as shown in Equation (8), where P(·) and D(·) are the two similarity calculation functions.
3-2. Considering that both the object features and the action features play an important role in high-quality reasoning for question answering, the multi-modal memory collaborative reasoning framework acts as an interactive inference engine that allows the action memory and the object memory to interact dynamically: when the memory of one modality is updated, the memory of the other modality provides useful clues for attention learning. Specifically:

When computing the update gating signal for the memory content of one modality, the influence of the other modality must be considered in addition to the modality itself. Taking the action modality as an example, the update gating signal g_a^λ of the action memory is derived from Equation (10); a GRU-based attention mechanism is then employed to extract the contextual feature c_a^λ, which is used to update the action memory m_a^λ of the current round, as shown in Equation (11). The mapping matrices and corresponding biases appearing in these equations are learnable parameters.

The object memory m_o^λ of the current round is obtained in the same way, as shown in Equation (14), with its own set of mapping matrices and corresponding biases.
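A hedged sketch of the cross-modal memory update of 3-2 follows; because the exact forms of Equations (9)-(14) are not reproduced in the text, the sigmoid gate over the question and both memories, the attention-GRU aggregation, and the GRUCell refresh below are assumptions consistent with the description.

# Sketch of the interactive memory update: the gate of one modality also
# sees the other modality's memory; an attention GRU builds the context;
# the modality memory is then refreshed.
import torch
import torch.nn as nn

class CrossModalMemoryUpdate(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(3 * d, d), nn.Sigmoid())  # question, own memory, other memory
        self.attn_gru = nn.GRUCell(d, d)                              # GRU-based attention aggregator
        self.update = nn.GRUCell(d, d)                                # memory refresh

    def forward(self, feats, rho, question, own_mem, other_mem):
        # feats: (k, d) modality features; rho: (k,) attention from step 3-1
        g = self.gate(torch.cat([question, own_mem, other_mem], dim=-1))   # cross-modal update gate
        ctx = torch.zeros_like(own_mem)
        for f, w in zip(feats, rho):                                        # attention GRU scan
            ctx = w * self.attn_gru(f.unsqueeze(0), ctx.unsqueeze(0)).squeeze(0) + (1 - w) * ctx
        return self.update((g * ctx).unsqueeze(0), own_mem.unsqueeze(0)).squeeze(0)

updater = CrossModalMemoryUpdate(d=512)
m_a = updater(torch.randn(30, 512), torch.softmax(torch.randn(30), 0),
              torch.randn(512), torch.randn(512), torch.randn(512))  # action-memory refresh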
In step (4), multi-modal dynamic memory fusion is performed: the output of the shallow inference engine is used as a supervisory sentinel to guide the weight allocation over the lower-layer memory contents of the different modalities, the memories of the different modalities are fused dynamically by the dynamic memory fusion module of the framework, and the output of the dynamic memory fusion module is used as the input of the answer module to predict the best answer. The specific process is as follows:

4-1. Because of the complexity and diversity of question descriptions, the magnitude of the contribution of the different visual sub-modalities to the question-answering model varies dynamically. The dynamic memory fusion module uses the output u' of the shallow inference engine as a supervisory sentinel and, combined with the original embedded feature q_e of the question description, guides the weight allocation over the memory contents of the different modalities of the deep inference engine and fuses those memories dynamically. The fused memory m* can be computed from Equation (15) as the weighted sum of the two modality memories:

m* = α_a · m_a + α_o · m_o    (Equation 15)

where the weights are produced by a learnable parameter, α is the density vector, and α_a and α_o are its two elements.
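The dynamic fusion of 4-1 can be sketched as follows; the single linear layer producing the density vector and the feature dimensions (256-dimensional u' and 300-dimensional q_e, matching the embodiment, 556 = 256 + 300) are assumptions.

# Sketch of Equation (15): sentinel u' and embedded question q_e produce
# softmax weights over the action and object memories, which are then mixed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicMemoryFusion(nn.Module):
    def __init__(self, d_u, d_q, n_modalities=2):
        super().__init__()
        self.weigher = nn.Linear(d_u + d_q, n_modalities)   # e.g. the 2 x 556 matrix of the embodiment

    def forward(self, u_shallow, q_e, m_action, m_object):
        alpha = F.softmax(self.weigher(torch.cat([u_shallow, q_e], dim=-1)), dim=-1)
        return alpha[0] * m_action + alpha[1] * m_object     # fused memory m*

fusion = DynamicMemoryFusion(d_u=256, d_q=300)
m_star = fusion(torch.randn(256), torch.randn(300), torch.randn(300), torch.randn(300))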
4-2. The output of the dynamic memory fusion module is used as the input of the answer module to predict the best answer. Specifically, the fused memory m* and the question feature q_e are fused and then mapped by the weight matrix W_p to yield the feature vector v, as shown in Equation (16):

v = W_p (m* ⊙ q_e)    (Equation 16)

where ⊙ denotes the element-wise product. The obtained feature vector v is input into the answer module (a multi-class classifier) to predict the answer to the question.
The invention has the beneficial effects that:
the invention provides a novel coarse-to-fine hierarchical depth inference framework for a complicated long-term network video question-answering problem, and the method comprises the steps of firstly filtering invalid information from a long video sequence by constructing a shallow inference engine, identifying important visual contents, learning the global attention of coarse-grained videos, and then constructing a deep inference engine to perform depth optimization inference from two directions of interframes and intraframes. Through multiple rounds of reasoning iteration, the reasoning framework can simulate the human video question-answer reasoning process, the key time relevant to the question is located from the long-term video, and relevant evidences are collected to predict the answer. The reasoning framework of our invention can achieve significant effects on the video question-answer dataset.
Drawings
FIG. 1 is a general block diagram of the process of the present invention.
FIG. 2 is a guided shallow inference engine constructed in the method of the present invention.
FIG. 3 is an optimized deep inference engine constructed in the method of the present invention.
Detailed Description
The detailed parameters of the invention are described in more detail below.
As shown in FIG. 1, the invention provides a hierarchical deep question answering reasoning framework for complex video question answering.
In step (1), the memory component stores the global semantic information of the video; under the guidance of the question description's global features, a shallow inference engine is obtained through multiple rounds of memory-update iteration and infers the visual information most relevant to the question's global semantic features, as shown in FIG. 2, specifically as follows:
1-1. The video features X and the question description feature h_T are extracted and input into the memory component, and the input is converted into the intrinsic vector features of the memory network.

For the global video features, the large-scale pre-trained neural networks VGG and 3D-CNN are used to extract intermediate features, which are fed into a bidirectional GRU network to obtain globally aware semantic features X = {x_1, ..., x_N}, where d_x = 2048 is the feature dimension. For the question description features, the shallow word embedding model GloVe first encodes each word to capture its semantics; the resulting word vectors are then fed in sequence into a bidirectional LSTM network containing d_q = 256 hidden units to learn the context of the question description, and finally the forward and backward hidden states are concatenated to represent the global semantics of the question description, h_T.

The video features and the question description features are then input into the memory component and converted into the internal vectors of the memory network, where W_x and W_q are the mapping matrices, b_x and b_q the biases, and d_z = 256 is the intrinsic spatial dimension of the memory network.
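An illustrative sketch of the global encoders of 1-1 is given below; the CNN frame features (VGG / 3D-CNN) are assumed to be precomputed, and the vocabulary size and word-embedding dimension are placeholders standing in for GloVe vectors.

# Sketch of the global encoders: a bidirectional GRU over precomputed
# frame/clip features, and a bidirectional LSTM with 256 hidden units over
# word embeddings whose final states are concatenated into h_T.
import torch
import torch.nn as nn

class GlobalEncoders(nn.Module):
    def __init__(self, d_frame=2048, d_word=300, vocab=10000, d_hidden=256):
        super().__init__()
        self.video_rnn = nn.GRU(d_frame, d_frame // 2, bidirectional=True, batch_first=True)
        self.word_emb = nn.Embedding(vocab, d_word)          # stands in for GloVe embeddings
        self.question_rnn = nn.LSTM(d_word, d_hidden, bidirectional=True, batch_first=True)

    def forward(self, frame_feats, word_ids):
        # frame_feats: (1, N, d_frame); word_ids: (1, T)
        video_ctx, _ = self.video_rnn(frame_feats)            # globally aware video features X
        _, (h, _) = self.question_rnn(self.word_emb(word_ids))
        h_T = torch.cat([h[0], h[1]], dim=-1).squeeze(0)      # concatenated fwd/bwd final states
        return video_ctx.squeeze(0), h_T

enc = GlobalEncoders()
X, h_T = enc(torch.randn(1, 120, 2048), torch.randint(0, 10000, (1, 12)))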
1-2. Feature selection is performed with a hard attention mechanism. The similarity between the question description features and the video features is computed, the video features are ranked by similarity score, and the 20 features most relevant to the question are selected to update the storage units of the memory network, yielding the 20 nearest-neighbour key-value pairs key = {k_1, k_2, ..., k_n} and value = {v_1, v_2, ..., v_n} (n = 20) and forming the updated video feature sequence.
1-3. Based on the updated memory contents, the probability distribution ρ of the question description over the memory units is learned. The output z of this layer is obtained by a weighted sum and, together with the original question description features, forms the updated question for the next round of reasoning.
In step (2), a deep inference engine is constructed on the basis of the shallow inference engine, and the multi-modal subcomponents obtained from deep semantic parsing of the video are embedded into memory slots of different modalities to form the spatial memory and the temporal memory, specifically as follows:

Video object features are extracted with the existing Fast-RCNN model: guided by the shallow inference engine, 36 targets are detected in each of the 20 important video units, and a 4096-dimensional object feature is extracted for each target. The object features are then converted into the internal vectors of the spatial memory module through a 1×1 convolutional neural network to form the spatial memory.

An external pre-trained temporal-proposal generation network predicts the 30 temporal segments in the video most likely to contain actions, and the video features of these 30 segments are extracted as the action features A following step (1). The action features are embedded into the temporal memory module through another 1×1 convolutional neural network to form the temporal memory.
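For illustration, the two 1×1 convolutional embeddings of step (2) can be sketched as follows; the 512-dimensional intrinsic space and the 2048-dimensional action features are assumptions, while the 4096-dimensional object features, 36 objects per unit and 30 action segments follow the embodiment.

# Sketch of the spatial and temporal memory embeddings via 1x1 convolutions.
import torch
import torch.nn as nn

embed_objects = nn.Conv1d(4096, 512, kernel_size=1)   # spatial-memory embedding, d_o -> intrinsic dim
embed_actions = nn.Conv1d(2048, 512, kernel_size=1)   # temporal-memory embedding, d_a -> intrinsic dim

object_feats = torch.randn(1, 4096, 36)    # (batch, d_o, k) region features of one key video unit
action_feats = torch.randn(1, 2048, 30)    # (batch, d_a, l) features of the candidate action segments

spatial_memory = embed_objects(object_feats)    # (1, 512, 36)
temporal_memory = embed_actions(action_feats)   # (1, 512, 30)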
In step (3), the multi-modal memory collaborative reasoning is constructed, and finer reasoning is performed using multi-modal evidence from objects and actions, as shown in FIG. 3, specifically as follows:

3-1. For the λ-th round of reasoning, given the action features, the question features and the action memory m_a^{λ-1} of the previous round, the similarity distribution over the action features is obtained by combining an element-wise product similarity function and an element-wise absolute-difference similarity function. The similarity distribution over the object features is obtained in the same way.

3-2. The memory update gating signals are then computed. When computing the update gate g_a of the action memory, the influence of the object modality must be considered in addition to the action modality itself; likewise, when computing the update gate g_o of the object memory, the action modality provides a helpful cue in addition to the object modality. A GRU-based attention mechanism then extracts the contextual features, which are used to update the action memory m_a^λ of the current round.
In step (4), the multi-modal dynamic memory fusion module dynamically fuses the memories of the different modalities, and its output is used as the input of the answer module to predict the best answer, as shown in FIG. 3, specifically as follows:

4-1. The dynamic memory fusion module uses the output of the shallow inference engine as a supervisory sentinel to guide the weight allocation over the deep-layer memory contents of the different modalities and to fuse those memories dynamically. The module first concatenates the 256-dimensional shallow-inference-engine memory u' with the 300-dimensional question feature q_e, applies a 2 × 556 matrix transformation followed by a softmax classifier to obtain the weights of the different sub-modalities of the deep inference engine, and finally obtains the fused memory m* by a weighted sum.

4-2. The output of the dynamic memory fusion module is used as the input of the answer module to predict the best answer. Specifically, the fused memory m* and the question feature q_e are combined by an element-wise product to obtain a 300-dimensional feature vector, which is mapped by a 1000 × 300 weight matrix W_p into a 1000-dimensional feature vector v. The vector v is fed into a 1000-way softmax classifier to obtain a probability distribution over the answer dictionary. The model is trained end to end with softmax cross-entropy as the loss function until the network converges.

Claims (5)

1. A cross-media hierarchical deep video question-answer reasoning framework, characterized by comprising the following steps:
step (1): storing the global semantic information of the video with a memory component and, under the guidance of the global features of the question description, obtaining a shallow inference engine through multiple rounds of memory-update iteration, the shallow inference engine being used to infer the visual information most relevant to the global semantic features of the question description;
step (2): constructing a deep inference engine on the basis of the shallow inference engine, and embedding the multi-modal subcomponents obtained from deep semantic parsing of the video into memory slots of different modalities to form spatial memory and temporal memory;
step (3): constructing a multi-modal memory collaborative reasoning framework, and performing finer reasoning with multi-modal evidence from objects and actions to improve the quality of question answering;
step (4): performing multi-modal dynamic memory fusion, using the output of the shallow inference engine as a supervisory sentinel to guide the weight allocation over the lower-layer memory contents of the different modalities, dynamically fusing the memories of the different modalities through the dynamic memory fusion module of the framework, and using the output of the dynamic memory fusion module as the input of the answer module to predict the best answer.
2. The cross-media hierarchical deep video question-answer reasoning framework of claim 1, wherein the step (1) is specifically as follows:
1-1. the video features X and the question description feature h_T are input into the memory component, and the input features are first converted into the intrinsic vector features of the memory network, as shown in Equations (1) and (2):
X_e = tanh(W_x X + b_x)    (Equation 1)
q_e = tanh(W_q h_T + b_q)    (Equation 2)
where X_e and q_e denote the video features and the question description features after conversion, W_x and W_q are mapping matrices, b_x and b_q are the corresponding biases, and d_z is the intrinsic spatial dimension of the memory network;
1-2. feature selection is performed with a hard attention mechanism: the similarity between the question description feature q_e and the video features X_e is computed, the video features are ranked by similarity score, and the n features most relevant to the question are selected to update the memory units of the shallow inference engine, yielding the n nearest-neighbour key-value pairs key = {k_1, k_2, ..., k_n} and value = {v_1, v_2, ..., v_n}, i.e. the updated video feature sequence set (Equation 3), with the ranking operation defined by
Γ(y_1, ..., y_n) = {j_1, ..., j_n}, where y_{j_1} ≥ y_{j_2} ≥ ... ≥ y_{j_n}    (Equation 4)
where f_s denotes the similarity measure and Γ is the sorting operation;
1-3. based on the updated video feature sequence set, the shallow inference engine learns the probability distribution ρ of the question description features over the memory units (Equation 5), where G_x and G_q denote two feed-forward fully connected neural networks; the output z of the shallow inference engine is obtained by the weighted sum
z = Σ_i ρ_i · v_i    (Equation 6)
where v_i is the content stored in the i-th memory unit, and z is combined with the originally input question description feature as the question for the next round of reasoning.
3. The cross-media hierarchical deep video question-answer reasoning framework of claim 2, wherein the step (2) is specifically as follows:
the object features in the video, O = {o_1, ..., o_k}, are converted by a 1×1 convolutional neural network into the intrinsic vector features of the spatial memory module, and the action features A = {a_1, ..., a_l} are embedded into the temporal memory module by another 1×1 convolutional neural network, where d_o is the dimension of the object features, k is the number of objects, d_a is the dimension of the action features, and l is the number of actions;
the multi-modal subcomponents comprise the object features and the action features.
4. The cross-media hierarchical deep video question-answer reasoning framework of claim 3, wherein the step (3) is specifically as follows:
3-1. finer reasoning is performed with multi-modal evidence from objects and actions: for the λ-th round of reasoning, given the action features a_i, the question feature h'_l and the action memory m_a^{λ-1} of the previous round, the similarity distribution ρ_a^λ over the action features is obtained from the two combined similarities P(·) and D(·), as shown in Equation (7); the similarity distribution ρ_o^λ over the object features is obtained in the same way, as shown in Equation (8);
3-2. the multi-modal memory collaborative reasoning module acts as an interactive inference engine that allows the action memory and the object memory to interact dynamically: when the memory of one modality is updated, the memory of the other modality provides useful clues for attention learning; when the action modality is updated, the update gate g_a^λ of the action memory is derived from Equation (10), a GRU-based attention mechanism then extracts the contextual feature c_a^λ, which is used to update the action memory m_a^λ of the current round, as shown in Equation (11), the mapping matrices and corresponding biases in these equations being learnable parameters;
the object memory m_o^λ of the current round is obtained in the same way, as shown in Equation (14), with its own set of mapping matrices and corresponding biases.
5. The cross-media hierarchical deep video question-answer reasoning framework of claim 4, wherein the step (4) performs multi-modal dynamic memory fusion: the output u' of the shallow inference engine is used as a supervisory sentinel and, combined with the original embedded feature q_e of the question description, guides the weight allocation over the lower-layer memory contents of the different modalities and dynamically fuses those memories, and the output of the dynamic memory fusion module is used as the input of the answer module to predict the best answer, the specific process being as follows:
4-1. the dynamic memory fusion module uses the output of the upper-layer memory network as a supervisory sentinel to guide the weight allocation over the lower-layer memory contents of the different modalities and to fuse them dynamically; the fused memory m* can be computed from Equation (15) as the weighted sum of the two modality memories:
m* = α_a · m_a + α_o · m_o    (Equation 15)
where the weights are produced by a learnable parameter, α is the density vector, and α_a and α_o are its two elements;
4-2. the output of the dynamic memory fusion module is used as the input of the answer module to predict the best answer; specifically, m* and the question feature q_e are fused and then mapped by the weight matrix W_p to obtain the feature vector v, as shown in Equation (16):
v = W_p (m* ⊙ q_e)    (Equation 16)
the obtained feature vector v is input into a multi-class classifier to predict the answer to the question.
CN202011499931.2A 2020-12-17 2020-12-17 Cross-media hierarchical deep video question-answer reasoning framework Active CN112527993B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011499931.2A CN112527993B (en) 2020-12-17 2020-12-17 Cross-media hierarchical deep video question-answer reasoning framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011499931.2A CN112527993B (en) 2020-12-17 2020-12-17 Cross-media hierarchical deep video question-answer reasoning framework

Publications (2)

Publication Number Publication Date
CN112527993A true CN112527993A (en) 2021-03-19
CN112527993B CN112527993B (en) 2022-08-05

Family

ID=75001166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011499931.2A Active CN112527993B (en) 2020-12-17 2020-12-17 Cross-media hierarchical deep video question-answer reasoning framework

Country Status (1)

Country Link
CN (1) CN112527993B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011193A (en) * 2021-04-09 2021-06-22 广东外语外贸大学 Bi-LSTM algorithm-based method and system for evaluating repeatability of detection consultation statement
CN113536952A (en) * 2021-06-22 2021-10-22 电子科技大学 Video question-answering method based on attention network of motion capture
CN115618061A (en) * 2022-11-29 2023-01-17 广东工业大学 Semantic-aligned video question-answering method
WO2023159979A1 (en) * 2022-02-22 2023-08-31 中兴通讯股份有限公司 Ai reasoning method and system, and computer readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1296243A2 (en) * 2001-09-25 2003-03-26 Interuniversitair Microelektronica Centrum Vzw A method for operating a real-time multimedia terminal in a QoS manner
US20160342895A1 (en) * 2015-05-21 2016-11-24 Baidu Usa Llc Multilingual image question answering
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
US10106153B1 (en) * 2018-03-23 2018-10-23 Chongqing Jinkang New Energy Vehicle Co., Ltd. Multi-network-based path generation for vehicle parking
CN108846384A (en) * 2018-07-09 2018-11-20 北京邮电大学 Merge the multitask coordinated recognition methods and system of video-aware
CN108920587A (en) * 2018-06-26 2018-11-30 清华大学 Merge the open field vision answering method and device of external knowledge
CN109919044A (en) * 2019-02-18 2019-06-21 清华大学 The video semanteme dividing method and device of feature propagation are carried out based on prediction
CN111242197A (en) * 2020-01-07 2020-06-05 中国石油大学(华东) Image and text matching method based on double-view-domain semantic reasoning network

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1296243A2 (en) * 2001-09-25 2003-03-26 Interuniversitair Microelektronica Centrum Vzw A method for operating a real-time multimedia terminal in a QoS manner
US20160342895A1 (en) * 2015-05-21 2016-11-24 Baidu Usa Llc Multilingual image question answering
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
US10106153B1 (en) * 2018-03-23 2018-10-23 Chongqing Jinkang New Energy Vehicle Co., Ltd. Multi-network-based path generation for vehicle parking
CN108920587A (en) * 2018-06-26 2018-11-30 清华大学 Merge the open field vision answering method and device of external knowledge
CN108846384A (en) * 2018-07-09 2018-11-20 北京邮电大学 Merge the multitask coordinated recognition methods and system of video-aware
CN109919044A (en) * 2019-02-18 2019-06-21 清华大学 The video semanteme dividing method and device of feature propagation are carried out based on prediction
CN111242197A (en) * 2020-01-07 2020-06-05 中国石油大学(华东) Image and text matching method based on double-view-domain semantic reasoning network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YU JUN et al.: "Research on Visual Question Answering Techniques", Journal of Computer Research and Development (计算机研究与发展) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011193A (en) * 2021-04-09 2021-06-22 广东外语外贸大学 Bi-LSTM algorithm-based method and system for evaluating repeatability of detection consultation statement
CN113011193B (en) * 2021-04-09 2021-11-23 广东外语外贸大学 Bi-LSTM algorithm-based method and system for evaluating repeatability of detection consultation statement
CN113536952A (en) * 2021-06-22 2021-10-22 电子科技大学 Video question-answering method based on attention network of motion capture
CN113536952B (en) * 2021-06-22 2023-04-21 电子科技大学 Video question-answering method based on attention network of motion capture
WO2023159979A1 (en) * 2022-02-22 2023-08-31 中兴通讯股份有限公司 Ai reasoning method and system, and computer readable storage medium
CN115618061A (en) * 2022-11-29 2023-01-17 广东工业大学 Semantic-aligned video question-answering method

Also Published As

Publication number Publication date
CN112527993B (en) 2022-08-05

Similar Documents

Publication Publication Date Title
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
Karpathy et al. Deep visual-semantic alignments for generating image descriptions
Shen et al. Question/answer matching for CQA system via combining lexical and sequential information
CN109885756B (en) CNN and RNN-based serialization recommendation method
CN114398961A (en) Visual question-answering method based on multi-mode depth feature fusion and model thereof
CN113297364B (en) Natural language understanding method and device in dialogue-oriented system
CN113886626B (en) Visual question-answering method of dynamic memory network model based on multi-attention mechanism
Zong et al. Emotion recognition in the wild via sparse transductive transfer linear discriminant analysis
Du et al. Full transformer network with masking future for word-level sign language recognition
CN112232086A (en) Semantic recognition method and device, computer equipment and storage medium
Dai et al. Hybrid deep model for human behavior understanding on industrial internet of video things
Zhou et al. Plenty is plague: Fine-grained learning for visual question answering
CN113204675A (en) Cross-modal video time retrieval method based on cross-modal object inference network
CN115270752A (en) Template sentence evaluation method based on multilevel comparison learning
CN114970517A (en) Visual question and answer oriented method based on multi-modal interaction context perception
CN113609326A (en) Image description generation method based on external knowledge and target relation
CN112069399A (en) Personalized search system based on interactive matching
CN116187349A (en) Visual question-answering method based on scene graph relation information enhancement
CN115408603A (en) Online question-answer community expert recommendation method based on multi-head self-attention mechanism
CN114282528A (en) Keyword extraction method, device, equipment and storage medium
CN116720519B (en) Seedling medicine named entity identification method
CN113239678A (en) Multi-angle attention feature matching method and system for answer selection
CN116881416A (en) Instance-level cross-modal retrieval method for relational reasoning and cross-modal independent matching network
CN116189047A (en) Short video classification method based on multi-mode information aggregation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant