CN113886626A - Visual question-answering method of dynamic memory network model based on multiple attention mechanism - Google Patents

Visual question-answering method of dynamic memory network model based on multiple attention mechanism Download PDF

Info

Publication number
CN113886626A
Authority
CN
China
Prior art keywords
question
model
features
input
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111083704.6A
Other languages
Chinese (zh)
Other versions
CN113886626B (en)
Inventor
缪亚林
童萌
程文芳
李臻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202111083704.6A priority Critical patent/CN113886626B/en
Publication of CN113886626A publication Critical patent/CN113886626A/en
Application granted granted Critical
Publication of CN113886626B publication Critical patent/CN113886626B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50: Information retrieval of still image data
    • G06F16/53: Querying
    • G06F16/532: Query formulation, e.g. graphical querying
    • G06F16/55: Clustering; Classification
    • G06F16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583: Retrieval using metadata automatically derived from the content
    • G06F16/5846: Retrieval using metadata automatically derived from the content, using extracted text
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G06N3/045: Combinations of networks
    • G06N3/048: Activation functions
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a visual question-answering method of a dynamic memory network model based on a multiple attention mechanism, which comprises the following steps: step 1, preprocessing the input image and text; step 2, extracting features from the question input in step 1, splitting it into individual words at punctuation marks and spaces; step 3, feeding the image input in step 1 into a feature extraction network to obtain region-level target features consisting of the features of the K regions with the highest confidence; step 4, using the multiple attention mechanism to iteratively update a memory over the question features and image features obtained in steps 2 and 3, generating the context vector needed to answer the question; and step 5, feeding the question features from step 2 and the new image features generated in step 4 into a feature fusion module to jointly infer the answer, the answer being the candidate answer given the highest probability by the classifier. The method improves the accuracy of the visual question-answering model.

Description

Visual question-answering method of dynamic memory network model based on multiple attention mechanism
Technical Field
The invention belongs to the technical field of cross-modal tasks combining computer vision and natural language processing, and particularly relates to a visual question-answering method of a dynamic memory network model based on a multiple attention mechanism.
Background
Attention mechanisms are widely used in tasks such as visual question answering, image captioning and machine translation; a visual question-answering attention model generates an attention distribution over image features conditioned on the question features so that questions can be answered accurately. At present, visual question-answering attention mechanisms generally apply weighted pooling only to the last convolutional layer of the image, that is, different spatial regions receive different weights while different channels share the same weight, which inevitably loses spatial information from the feature map and conflicts with the fact that a convolutional feature map has both spatial and channel structure. Worse still, when attention is applied only to the last convolutional layer, the receptive fields are already very large and differ little from one another, so spatial attention becomes less discriminative. Researchers have therefore proposed combining channel attention and spatial attention as the "left and right arms" of the neural network.
Some questions in visual question answering involve multi-hop relationships between objects, for example "What is in the basket of the bicycle?" The model must first find the bicycle in the image, then locate the basket from the bicycle, and finally identify the object contained in the basket. Visual question answering therefore requires stepwise matching of the image regions best suited to answering the question. Consequently, besides using an attention mechanism to extract the key information needed for answering, a visual question-answering model should also have a certain memory capability, searching, reasoning over and storing relevant information for different questions. Neural networks with memory, such as RNNs, LSTMs and GRUs, have a short memory span and cannot meet the visual question-answering task's need for long-term memory and storage of useful information. To mitigate the loss of useful information, a dynamic memory network is used here to iteratively search for visual information related to the question.
Disclosure of Invention
The invention aims to provide a visual question-answering method of a dynamic memory network model based on a multiple attention mechanism, which solves the complex questions in visual question answering that require multi-step reasoning and improves the accuracy of the visual question-answering model.
The technical solution adopted by the invention is a visual question-answering method of a dynamic memory network model based on a multiple attention mechanism, comprising the following steps:
step 1, preprocessing the input image and text and feeding them into the input module of the model to extract image and text features, obtaining target-level features;
step 2, to obtain the question features, extracting features from the question input in step 1, splitting it into individual words at punctuation marks and spaces; then representing the words as vectors with a pre-trained word model, feeding the word-vector representation into a recurrent neural network, and taking the hidden state of the last time step as the question features;
step 3, to obtain the image features, feeding the image input in step 1 into a feature extraction network to obtain region-level target features consisting of the features of the K regions with the highest confidence;
step 4, using the multiple attention mechanism to iteratively update a memory over the question features and image features obtained in steps 2 and 3, generating the context vector needed to answer the question;
and step 5, feeding the question features from step 2 and the new image features generated in step 4 into a feature fusion module to jointly infer the answer, the answer being the candidate answer given the highest probability by the classifier.
The present invention is also characterized in that,
the specific implementation of step 2 is as follows:
step 2.1: first, the input question text is processed into a form that the model can accept; the input question q is then represented as:
q = [q_1, q_2, ..., q_N]
wherein: N is the sentence length and q_i is a word;
step 2.2: secondly, a word-vector model maps the words into the same vector space to obtain the word embedding representation of the words; the word-vector representation h of the question is:
h = [h_1, h_2, ..., h_N]
wherein: h_i is the trained word vector of the word q_i; the processed word vectors are input into the GRU network, a process expressed by the following equation:
S = ReLU(GRU(h_i)), h_i ∈ R^P
wherein: S is the sentence feature of the input text and h_i is a word vector of the input text, of dimension P;
step 2.3: finally, the word vectors are input into a recurrent neural network to extract the features of the sentence, namely the question features.
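For illustration, a minimal sketch of steps 2.1 to 2.3 in PyTorch is given below; the tokenizer, the vocabulary handling and the 300/2048 dimensions (taken from the experimental settings reported later) are assumptions for illustration, not part of the claimed method.

```python
import re
import torch
import torch.nn as nn

def tokenize(question: str):
    # step 2.1: split the question into individual words at punctuation marks and spaces
    return [w for w in re.split(r"[\s.,?!;:]+", question.lower()) if w]

class QuestionEncoder(nn.Module):
    def __init__(self, glove_weights: torch.Tensor, hidden_dim: int = 2048):
        super().__init__()
        # glove_weights: (vocab_size, 300) matrix of pre-trained GloVe vectors (step 2.2)
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.gru = nn.GRU(glove_weights.size(1), hidden_dim, batch_first=True)

    def forward(self, token_ids: torch.Tensor):
        # token_ids: (batch, N) indices of the words q_1 .. q_N
        h = self.embed(token_ids)      # word vectors h_1 .. h_N, shape (batch, N, 300)
        _, last = self.gru(h)          # step 2.3: recurrent encoding of the sentence
        return last.squeeze(0)         # hidden state of the last time step = question feature
```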
The question features in step 2 are obtained by using a GloVe word-vector model pre-trained on a corpus to obtain the word-vector representation of each word.
Step 3 is specifically implemented according to the following steps:
after the input picture is received: since not all elements in the picture are related to the question, in order to lock onto the target more accurately, an attention mechanism needs to be applied to the picture representation to find the regions that are key to answering the question; a top-down attention model is used, and the object detection network Faster R-CNN, which provides high-level semantics, extracts the picture features; first, a VGG or ResNet backbone network extracts an image feature map, then a region proposal network and region-of-interest pooling yield proposal feature maps of fixed size, which are classified and regressed to obtain accurate image features; finally, the first K candidate regions with the highest confidence are taken as the image features, and the extraction process is:
V = [v_1, v_2, ..., v_K], v_k ∈ R^D
wherein: v_k represents any one of the candidate objects, V represents the set of candidates selected by confidence, and each candidate object feature has dimension D.
Step 4 is specifically implemented according to the following steps:
step 4.1: firstly, the question features and picture features obtained in steps 2 and 3 are fused;
step 4.2: secondly, channel attention is applied to the object feature map to obtain a channel feature map closely related to the question; a spatial attention mechanism is then used on the channel-attended feature map to obtain the object spatial regions closely related to the question; the model memory is updated on this basis, and the process is iterated to obtain the key contextual information for answering the question; the updated model memory m_t is:
m_t = ReLU(W_t[m_{t-1}; c_t; q] + b)
wherein: [·; ·] denotes the feature concatenation operation, W_t denotes the parameter update matrix, b denotes the bias, c_t denotes the new image feature, t denotes a time step, m_{t-1} denotes the contextual memory, and q denotes the question vector.
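A minimal PyTorch rendering of this update equation is given below, assuming all vectors share the dimension used in the experiments; the class name is illustrative.

```python
import torch
import torch.nn as nn

class MemoryUpdate(nn.Module):
    """One episodic update: m_t = ReLU(W_t [m_{t-1}; c_t; q] + b)."""
    def __init__(self, dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(3 * dim, dim)   # W_t and bias b

    def forward(self, m_prev, c_t, q):
        # m_prev: previous memory, c_t: attended image feature, q: question vector, all (batch, dim)
        return torch.relu(self.proj(torch.cat([m_prev, c_t, q], dim=-1)))
```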
Step 5 is specifically implemented according to the following steps:
first, the updated model memory m_t and the question vector Q are fused using the BLOCK multi-modal fusion method to obtain the fused feature J; after the joint feature representation J is obtained, two fully connected layers perform the classification; answer prediction is then performed with a Sigmoid function; the DMN-MA model allows each question to have multiple correct answers, and each candidate answer receives a score in the range (0, 1); finally, the candidate answer with the highest probability is selected as the final answer of the model:
y = Sigmoid(W_j J + b_j)
wherein: W_j represents the parameter of the fully connected layer, b_j represents the bias term, y represents the final answer, and cross entropy is used as the loss function during training.
The invention has the beneficial effects that:
1. The invention is based on a dynamic memory network model with a multiple attention mechanism. Unlike previous attention models, the model uses not only a spatial attention mechanism but also a channel attention mechanism, so that the visual question-answering model applies different weights to different channel feature maps; channel attention thus effectively complements spatial attention. In addition, the input module and the episodic memory module of the dynamic memory network model are studied in depth: Faster R-CNN is used in the input module to obtain target-level object features; in the episodic memory module, the multiple attention mechanism continuously updates and stores the memory according to the question, iterative reasoning yields the visual vectors most relevant to answering the question, and contextual information is used effectively for answer inference. Finally, the network's final memory and the question representation are fused to infer the correct answer.
2. The method is scientific and reasonable in design, can continuously perform memory updating and storage according to the problems by using a multiple memory mechanism, obtains the most relevant visual vector of the answers to the problems through iterative reasoning, and effectively utilizes the context information to perform answer reasoning. And the memory network further improves the accuracy of the visual question-answering model.
3. The method of the invention provides a dynamic memory network model (DMN-MA) based on a multiple attention mechanism on the basis of a dynamic memory network. Different from the prior model, the method applies a multiple attention mechanism based on problem guidance when reading the input image features, focuses on not only the spatial region of the image, but also different convolution channels of the image, and better conforms to the three-dimensional characteristics of both characteristic map channeling and spatiality. The DMN-MA model iteratively inquires visual information related to the question when finding image features, continuously updates memory contents, and obtains key memory for answering the question, thereby solving the complex problem that multiple reasoning is needed in visual question answering.
Drawings
FIG. 1 is a schematic diagram of the episodic memory module iterating twice in the method of the present invention;
FIG. 2 is an overall framework diagram of the dynamic memory network model based on the multiple attention mechanism in the method of the present invention;
FIG. 3 is a schematic diagram before the memory visualization process in the simulation experiment of the present invention;
FIG. 4 is a schematic diagram after the memory visualization process in the simulation experiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention relates to a visual question-answering method of a dynamic memory network model based on a multiple attention mechanism, which comprises the following steps:
step 1, preprocessing the input image and text and feeding them into the input module of the model to extract image and text features, obtaining target-level features;
step 2, to obtain the question features, extracting features from the question input in step 1, splitting it into individual words at punctuation marks and spaces; then representing the words as vectors with a pre-trained word model, feeding the word-vector representation into a recurrent neural network, and taking the hidden state of the last time step as the question features;
the specific implementation of step 2 is as follows:
step 2.1: firstly, the input question text is processed into a form that the model can accept, namely all words in the question text are split into individual words at punctuation marks and spaces; the input question q is then represented as:
q = [q_1, q_2, ..., q_N]
wherein: N is the sentence length and q_i is a word;
step 2.2: secondly, a word-vector model maps the words into the same vector space to obtain the word embedding representation of the words; word embedding is a method for converting the words of a text into real-valued vectors, which makes computation convenient. The word-vector representation h of the question is:
h = [h_1, h_2, ..., h_N]
wherein: h_i is the trained word vector of the word q_i; the word-vector representation of each word is obtained here with a pre-trained GloVe word-vector model, and since a question in the visual question-answering datasets used here typically does not exceed 20 words, the processed word vectors are input into the GRU network, a process expressed by the following equation:
S = ReLU(GRU(h_i)), h_i ∈ R^P
wherein: S is the sentence feature of the input text and h_i is a word vector of the input text, of dimension P.
Step 2.3: finally, the word vectors are input into a recurrent neural network to extract the features of the sentence, namely the question features.
The question features in step 2 are obtained by using a GloVe word-vector model pre-trained on a large corpus to obtain the word-vector representation of each word.
Step 3, to obtain the picture features, the picture input in step 1 is fed into a feature extraction network to obtain region-level target features consisting of the features of the K regions with the highest confidence; the feature extraction network used here is Faster R-CNN.
Step 3 is specifically implemented according to the following steps:
after the input picture is received: since not all elements in the picture are related to the question, in order to lock onto the target more accurately, an attention mechanism needs to be applied to the picture representation to find the regions that are key to answering the question. A top-down attention model is used, and the object detection network Faster R-CNN, which provides high-level semantics, extracts the picture features; first, a VGG or ResNet backbone network extracts an image feature map, then a region proposal network and region-of-interest pooling yield proposal feature maps of fixed size, which are classified and regressed to obtain accurate image features; finally, the first K candidate regions with the highest confidence are taken as the image features, and the extraction process is:
V = [v_1, v_2, ..., v_K], v_k ∈ R^D
wherein: v_k represents any one of the candidate objects, V represents the set of candidates selected by confidence, and each candidate object feature has dimension D.
Step 4, the multiple attention mechanism is used to iteratively update a memory over the question features and picture features obtained in steps 2 and 3, generating the context vector needed to answer the question; the memory for answering the question is updated pass by pass by combining channel attention and spatial attention;
step 4 is specifically implemented according to the following steps:
step 4.1: firstly, the question features and picture features obtained in steps 2 and 3 are fused;
step 4.2: secondly, as shown by the image channel feature map in FIG. 1, channel attention is first applied to the object feature map to obtain a channel feature map closely related to the question; a spatial attention mechanism is then used on the channel-attended feature map to obtain the object spatial regions closely related to the question; the model memory is updated on this basis, and the process is iterated to obtain the key contextual information for answering the question; the updated model memory m_t is:
m_t = ReLU(W_t[m_{t-1}; c_t; q] + b)
wherein: [·; ·] denotes the feature concatenation operation, W_t denotes the parameter update matrix, b denotes the bias, c_t denotes the new image feature, t denotes a time step, m_{t-1} denotes the contextual memory, and q denotes the question vector. Channel attention focuses mainly on the objects themselves, and a correlation calculation then yields the channel attention vector. Spatial attention locates, guided by the question, the object regions best suited to answering it, giving different object regions different weights instead of treating every region equally. After each pass through the channel attention module and the spatial attention module, the new image feature is used to update the episodic memory and generate a vector. Following previous work on visual question answering, the memory is updated with the ReLU activation function.
And step 5, the question features from step 2 and the new image features generated in step 4 are fed into a feature fusion module to jointly infer the answer, the answer being the candidate answer given the highest probability by the classifier.
Step 5 is specifically implemented according to the following steps:
first, the updated model memory m_t and the question vector Q are fused using the BLOCK multi-modal fusion method to obtain the fused feature J. After the joint feature representation J is obtained, two fully connected layers perform the classification. Answer prediction is then performed with a Sigmoid function; the DMN-MA model allows each question to have multiple correct answers, and each candidate answer receives a score in the range (0, 1). Finally, the candidate answer with the highest probability is selected as the final answer of the model:
y = Sigmoid(W_j J + b_j)
wherein: W_j represents the parameter of the fully connected layer, b_j represents the bias term, and y represents the final answer. Cross entropy is used as the loss function during training.
The specific process of the invention is shown in FIG. 2. First, region-level target features are extracted from the input image, the image is processed into fact vectors, and the input question is encoded; a dynamic memory network model based on the multiple attention mechanism is then built, the question and image features are fed in and iterated over several passes, and the contextual memory is updated after each pass until a high-probability answer emerges. The fused features then interact with the question to obtain new image features, and finally the obtained image features and the question jointly infer the answer. Compared with traditional methods that use global image features, or with other graph-network visual question-answering methods that neglect the importance of relationships, the technical solution of the invention effectively improves the performance of the visual question-answering model.
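Read as a forward pass, the process in FIG. 2 can be sketched as follows. Module names, dimensions, the simple dot-product attention and the concatenation-based fusion used here are placeholders for the multiple attention and BLOCK fusion described above, and the three memory passes follow the iteration count chosen in the experiments below.

```python
import torch
import torch.nn as nn

class DMNMA(nn.Module):
    """Structural sketch of the pipeline in FIG. 2 (names and dimensions are illustrative)."""
    def __init__(self, vocab_size, num_answers, d=2048, hops=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)   # question words -> vectors (GloVe-initialised)
        self.q_gru = nn.GRU(300, d, batch_first=True)
        self.mem = nn.Linear(3 * d, d)               # episodic memory update
        self.fuse = nn.Linear(2 * d, d)              # placeholder for BLOCK fusion
        self.cls = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, num_answers))
        self.hops = hops

    def forward(self, question_ids, region_feats):
        # question encoding: hidden state of the last GRU time step
        _, q = self.q_gru(self.embed(question_ids))
        q = q.squeeze(0)                                               # (batch, d)
        # region_feats: top-K Faster R-CNN region features, (batch, K, d)
        m = q                                                          # initialise the memory
        for _ in range(self.hops):                                     # iterative memory updates
            w = torch.softmax((region_feats * q.unsqueeze(1)).sum(-1), dim=1)
            c = (w.unsqueeze(-1) * region_feats).sum(1)                # attended image context
            m = torch.relu(self.mem(torch.cat([m, c, q], dim=-1)))     # m_t = ReLU(W_t[m_{t-1}; c_t; q] + b)
        joint = torch.relu(self.fuse(torch.cat([m, q], dim=-1)))       # fuse memory and question
        return torch.sigmoid(self.cls(joint))                          # scores over candidate answers
```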
Simulation experiment and characterization of experimental results
1. Data set
The model was evaluated on two public visual question-answering datasets, COCO-QA and VQA 2.0. The COCO-QA images come from MS-COCO; the dataset comprises 123,587 images, of which 72,783 are used for training and 38,948 for testing, and importantly the answers to its questions are evenly distributed. The VQA 2.0 dataset contains 204,721 images from MS-COCO, with 123,287 images in the training and validation sets (more than 80,000 of them for training) and 81,434 images in the test set. The dataset has 614,163 questions, three per image, and each question has ten answers provided by ten different annotators.
2. Experimental Environment
The development framework was PyTorch 1.1.0 with the Python 3.6 development language. Specifically, K = 100 in the image input module, each object feature vector has dimension 2048, and ResNet-152 serves as the backbone for image feature extraction. The question module treats questions as fixed length, discarding the excess and padding the shortfall with 0: the question length is fixed at 20 for the COCO-QA dataset and 14 for the VQA 2.0 dataset. The word-vector dimension is 300, the GRU hidden-layer dimension is 2048, and the resulting question vector therefore also has dimension 2048. In the answer prediction stage, the COCO-QA dataset has 430 answers; for VQA 2.0, an answer is added to the candidate answer set if it occurs more than 8 times in the training set, giving 3,129 candidate answers.
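A small sketch of the two preprocessing rules just described, candidate-answer selection and fixed-length questions; the input format (a flat list of training answers, a list of token ids) is an assumption for illustration.

```python
from collections import Counter

def build_answer_vocab(train_answers, min_count: int = 8):
    """Keep answers occurring more than `min_count` times in the training set
    (yielding the 3,129 VQA 2.0 candidates reported above)."""
    counts = Counter(train_answers)
    return sorted(a for a, c in counts.items() if c > min_count)

def pad_question(token_ids, max_len: int):
    """Fix the question length (20 for COCO-QA, 14 for VQA 2.0):
    truncate the excess and pad the shortfall with 0."""
    return (token_ids + [0] * max_len)[:max_len]
```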
All activation functions in the experiments are ReLU, and dropout with p = 0.5 is applied in the input and output layers to prevent overfitting. All training samples are randomly shuffled during training, the batch size is set to 32, and 20 epochs are run. Training uses the Adam stochastic gradient descent algorithm with an initial learning rate of 0.001; after the first 5 epochs, the DMN-MA model reduces the learning rate to 1/10 of its value every 3 epochs.
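The training schedule above could be expressed as follows; `model` and `train_loader` are assumed to exist, and the milestone list is one way to realise reducing the learning rate to 1/10 every 3 epochs after the first 5 epochs.

```python
import torch

def train_dmn_ma(model, train_loader, epochs: int = 20):
    """Training schedule as reported above: Adam, lr 0.001, batch size 32,
    learning rate divided by 10 every 3 epochs after the first 5."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[5, 8, 11, 14, 17], gamma=0.1)
    criterion = torch.nn.BCELoss()
    for _ in range(epochs):
        for questions, images, targets in train_loader:   # loader shuffles, batch size 32
            optimizer.zero_grad()
            loss = criterion(model(questions, images), targets)
            loss.backward()
            optimizer.step()
        scheduler.step()
```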
3. Results and analysis of the experiments
Because the number of iterations of the DMN-MA episodic memory module is not fixed in advance, different iteration counts were first tried on the COCO-QA and VQA 2.0 datasets to find the model's best performance. The overall accuracy of the model for different iteration counts on both datasets is shown in Table 1.
TABLE 1 Accuracy comparison for different iteration counts of the episodic memory module
As Table 1 shows, accuracy rises as the number of iterations increases; with 3 iterations the model's overall accuracy on both datasets is highest, and increasing the number of iterations further causes accuracy to drop sharply. Overall, the multiple attention mechanism is most accurate with 3 iterations, so the experiments set the number of iterations to 3.
Next, to verify the effectiveness of the proposed model, Table 2 lists the results of the model and other mainstream methods on the COCO-QA test set.
TABLE 2 Comparison of overall accuracy and WUPS scores with other methods on the COCO-QA dataset
As can be seen from Table 2, the overall accuracy of the proposed DMN-MA model reaches 64.57%, an improvement of 11.26% over the traditional VIS+LSTM method. In particular, the overall accuracy is about 3% higher than the classical visual question-answering attention method SAN and 2.07% higher than the QPU model. The model also performs well on WUPS 0.9 and WUPS 0.0. Using only spatial attention for iterative reasoning is not sufficient, since question-guided channel attention is equally important in visual question-answering research.
As shown in Table 3, the overall performance of the proposed DMN-MA model is 12.96% higher than the baseline model CNN+LSTM, 4.91% higher than the MCB model and 2.54% higher than the Resonnet model; in addition, its overall accuracy is 1.51% higher than the classical top-down attention visual question-answering model. Notably, the DMN-MA model and the top-down attention visual question-answering model use the same data preprocessing, namely Faster R-CNN to extract the visual image features and GloVe+GRU to extract the question features; the difference is that the top-down attention model predicts answers with spatial attention alone, which fully demonstrates the effectiveness of the proposed model.
TABLE 3 comparison of accuracy of various question types on COCO-QA
In conclusion, the DMN-MA model was compared with several mainstream methods on the COCO-QA and VQA 2.0 datasets. By combining the advantages of the multiple attention mechanism and the memory network, it better matches the three-dimensional structure of the convolutional feature map while reducing the loss of contextual information during answer prediction, and therefore performs better.
4. Attention visualization
Several pictures and questions were chosen at random from the dataset for an attention-visualization demonstration of the proposed model, as shown in FIGS. 3-4. The question appears above the figure; FIG. 3 shows the original picture and FIG. 4 the picture after the model's attention visualization; the Ground Truth below is the dataset answer, and Prediction denotes the answer predicted by the model.

Claims (6)

1. A visual question-answering method of a dynamic memory network model based on a multiple attention mechanism, characterized by comprising the following steps:
step 1, preprocessing the input image and text and feeding them into the input module of the model to extract image and text features, obtaining target-level features;
step 2, extracting features from the question input in step 1, splitting it into individual words at punctuation marks and spaces; then representing the words as vectors with a pre-trained word model, feeding the word-vector representation into a recurrent neural network, and taking the hidden state of the last time step as the question features;
step 3, feeding the image input in step 1 into a feature extraction network to obtain region-level target features consisting of the features of the K regions with the highest confidence;
step 4, using the multiple attention mechanism to iteratively update a memory over the question features and image features obtained in steps 2 and 3, generating the context vector needed to answer the question;
and step 5, feeding the question features from step 2 and the new image features generated in step 4 into a feature fusion module to jointly infer the answer, the answer being the candidate answer given the highest probability by the classifier.
2. The visual question-answering method based on the multiple attention mechanism dynamic memory network model according to claim 1, wherein the specific implementation manner of the step 2 is as follows:
step 2.1: first, the input question text is processed into a form that the model can accept, then the input question q is represented as:
q = [q_1, q_2, ..., q_N]
wherein: N is the sentence length and q_i is a word;
step 2.2: secondly, the words are mapped to the same vector space using a word vector model to obtain the word embedding representation of the words; the word-vector representation h of the question is:
h = [h_1, h_2, ..., h_N]
wherein: h_i is the trained word vector of the word q_i; the processed word vectors are input into the GRU network, and the process is expressed by the following equation:
S = ReLU(GRU(h_i)), h_i ∈ R^P
wherein: S is the sentence feature of the input text and h_i is a word vector of the input text, of dimension P;
step 2.3: finally, the word vectors are input into a recurrent neural network to extract the features of the sentence, namely the question features.
3. The visual question-answering method based on the multi-attention mechanism dynamic memory network model according to claim 2, wherein the question features in step 2 are obtained by using a GloVe word vector model pre-trained on a corpus to obtain the word vector representation of each word.
4. The visual question-answering method based on the multiple attention mechanism dynamic memory network model according to claim 3, wherein the step 3 is implemented by the following steps:
after the input picture is received: since not all elements in the picture are related to the question, in order to lock onto the target more accurately, an attention mechanism needs to be applied to the picture representation to find the regions that are key to answering the question; a top-down attention model is used, and the object detection network Faster R-CNN with high-level semantics is adopted to extract the picture features; first, a VGG or ResNet backbone network extracts an image feature map, then a region proposal network and region-of-interest pooling yield proposal feature maps of fixed size, which are classified and regressed to obtain accurate image features; finally, the first K candidate regions with the highest confidence are obtained as the image features, and the extraction process is:
V = [v_1, v_2, ..., v_K], v_k ∈ R^D
wherein: v_k represents any one of the candidate objects, V represents the set of candidates selected by confidence, and each candidate object feature has dimension D.
5. The visual question-answering method based on the multiple attention mechanism dynamic memory network model according to claim 4, wherein the step 4 is implemented by the following steps:
step 4.1: firstly, the question features and picture features obtained in steps 2 and 3 are fused;
step 4.2: secondly, channel attention is first applied to the object feature map to obtain a channel feature map closely related to the question; a spatial attention mechanism is then used on the channel-attended feature map to obtain the object spatial regions closely related to the question; the model memory is updated on this basis, and the process is iterated to obtain the key contextual information for answering the question; the updated model memory m_t is:
m_t = ReLU(W_t[m_{t-1}; c_t; q] + b)
wherein: [·; ·] represents the feature concatenation operation, W_t represents the parameter update matrix, b represents the bias, c_t represents the new image feature, t denotes a time step, m_{t-1} represents the contextual memory, and q represents the question vector.
6. The visual question-answering method based on the multiple attention mechanism dynamic memory network model according to claim 5, wherein the step 5 is implemented by the following steps:
first, the updated model memory m_t and the question vector Q are fused using the BLOCK multi-modal fusion method to obtain the fused feature J; after the joint feature representation J is obtained, two fully connected layers perform the classification; answer prediction is then performed with a Sigmoid function; the DMN-MA model allows each question to have multiple correct answers, and each candidate answer receives a score in the range (0, 1); finally, the candidate answer with the highest probability is selected as the final answer of the model:
y = Sigmoid(W_j J + b_j)
wherein: W_j represents the parameter of the fully connected layer, b_j represents the bias term, y represents the final answer, and cross entropy is used as the loss function during training.
CN202111083704.6A 2021-09-14 2021-09-14 Visual question-answering method of dynamic memory network model based on multi-attention mechanism Active CN113886626B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111083704.6A CN113886626B (en) 2021-09-14 2021-09-14 Visual question-answering method of dynamic memory network model based on multi-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111083704.6A CN113886626B (en) 2021-09-14 2021-09-14 Visual question-answering method of dynamic memory network model based on multi-attention mechanism

Publications (2)

Publication Number Publication Date
CN113886626A true CN113886626A (en) 2022-01-04
CN113886626B CN113886626B (en) 2024-02-02

Family

ID=79009636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111083704.6A Active CN113886626B (en) 2021-09-14 2021-09-14 Visual question-answering method of dynamic memory network model based on multi-attention mechanism

Country Status (1)

Country Link
CN (1) CN113886626B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114416914A (en) * 2022-03-30 2022-04-29 中建电子商务有限责任公司 Processing method based on picture question and answer
CN114417044A (en) * 2022-01-19 2022-04-29 中国科学院空天信息创新研究院 Image question and answer method and device
US20220164588A1 (en) * 2020-11-20 2022-05-26 Fujitsu Limited Storage medium, machine learning method, and output device
CN114661874A (en) * 2022-03-07 2022-06-24 浙江理工大学 Visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163299A (en) * 2019-05-31 2019-08-23 合肥工业大学 A kind of vision answering method based on bottom-up attention mechanism and memory network
US20200302340A1 (en) * 2019-03-22 2020-09-24 Royal Bank Of Canada Systems and methods for learning user representations for open vocabulary data sets

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200302340A1 (en) * 2019-03-22 2020-09-24 Royal Bank Of Canada Systems and methods for learning user representations for open vocabulary data sets
CN110163299A (en) * 2019-05-31 2019-08-23 合肥工业大学 A kind of vision answering method based on bottom-up attention mechanism and memory network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
闫茹玉 (Yan Ruyu); 刘学亮 (Liu Xueliang): "Visual question answering model combining a bottom-up attention mechanism and a memory network" [结合自底向上注意力机制和记忆网络的视觉问答模型], 中国图象图形学报 (Journal of Image and Graphics), no. 05

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220164588A1 (en) * 2020-11-20 2022-05-26 Fujitsu Limited Storage medium, machine learning method, and output device
CN114417044A (en) * 2022-01-19 2022-04-29 中国科学院空天信息创新研究院 Image question and answer method and device
CN114661874A (en) * 2022-03-07 2022-06-24 浙江理工大学 Visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels
CN114661874B (en) * 2022-03-07 2024-04-30 浙江理工大学 Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels
CN114416914A (en) * 2022-03-30 2022-04-29 中建电子商务有限责任公司 Processing method based on picture question and answer
CN114416914B (en) * 2022-03-30 2022-07-08 中建电子商务有限责任公司 Processing method based on picture question and answer

Also Published As

Publication number Publication date
CN113886626B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN110188358B (en) Training method and device for natural language processing model
CN113886626B (en) Visual question-answering method of dynamic memory network model based on multi-attention mechanism
Seo et al. Visual reference resolution using attention memory for visual dialog
JP6351689B2 (en) Attention based configurable convolutional neural network (ABC-CNN) system and method for visual question answering
CN111984772B (en) Medical image question-answering method and system based on deep learning
CN108446404B (en) Search method and system for unconstrained visual question-answer pointing problem
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN110826609B (en) Double-current feature fusion image identification method based on reinforcement learning
CN114201592A (en) Visual question-answering method for medical image diagnosis
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN115222998B (en) Image classification method
CN113158815A (en) Unsupervised pedestrian re-identification method, system and computer readable medium
CN113095251B (en) Human body posture estimation method and system
CN112818889A (en) Dynamic attention-based method for integrating accuracy of visual question-answer answers by hyper-network
CN112966135A (en) Image-text retrieval method and system based on attention mechanism and gate control mechanism
CN115270752A (en) Template sentence evaluation method based on multilevel comparison learning
Li et al. TAM at VQA-Med 2021: A Hybrid Model with Feature Extraction and Fusion for Medical Visual Question Answering.
CN114329148A (en) Content information identification method and device, computer equipment and storage medium
CN116524513B (en) Open vocabulary scene graph generation method, system, equipment and storage medium
CN115861995B (en) Visual question-answering method and device, electronic equipment and storage medium
CN114168769B (en) Visual question-answering method based on GAT relation reasoning
CN113821610A (en) Information matching method, device, equipment and storage medium
CN113609355A (en) Video question-answering system, method, computer and storage medium based on dynamic attention and graph network reasoning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant