CN113886626A - Visual question-answering method of dynamic memory network model based on multiple attention mechanism - Google Patents

Visual question-answering method of dynamic memory network model based on multiple attention mechanism Download PDF

Info

Publication number
CN113886626A
Authority
CN
China
Prior art keywords
question
model
features
input
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111083704.6A
Other languages
Chinese (zh)
Other versions
CN113886626B (en)
Inventor
缪亚林
童萌
程文芳
李臻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202111083704.6A priority Critical patent/CN113886626B/en
Publication of CN113886626A publication Critical patent/CN113886626A/en
Application granted granted Critical
Publication of CN113886626B publication Critical patent/CN113886626B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50: Information retrieval of still image data
    • G06F16/53: Querying
    • G06F16/532: Query formulation, e.g. graphical querying
    • G06F16/55: Clustering; Classification
    • G06F16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583: Retrieval using metadata automatically derived from the content
    • G06F16/5846: Retrieval using metadata automatically derived from the content, using extracted text
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G06N3/045: Combinations of networks
    • G06N3/048: Activation functions
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a visual question-answering method of a dynamic memory network model based on a multiple attention mechanism, which comprises the following steps: step 1, preprocessing the input image and text; step 2, extracting features from the question input in step 1, splitting it into individual words at punctuation marks and spaces; step 3, feeding the image input in step 1 into a feature extraction network to obtain region-level target features consisting of the features of the K regions with the highest confidence; step 4, using the multiple attention mechanism to iteratively update a memory over the question features and image features obtained in steps 2 and 3, generating the context vector needed to answer the question; and step 5, feeding the question features from step 2 and the new image features generated in step 4 into a feature fusion module to jointly infer the answer, the answer being the candidate answer given the highest probability by the classifier. The method improves the accuracy of the visual question-answering model.

Description

Visual question-answering method of dynamic memory network model based on multiple attention mechanism
Technical Field
The invention belongs to the technical field of cross-modal tasks combining computer vision and natural language processing, and particularly relates to a visual question-answering method of a dynamic memory network model based on a multiple attention mechanism.
Background
Attention mechanisms are widely used in tasks such as visual question answering, image captioning and machine translation; a visual question-answering attention model generates an attention distribution over image features conditioned on the question features so that questions can be answered accurately. At present, visual question-answering attention mechanisms generally apply weighted pooling only to the last convolutional layer of the image, that is, different spatial regions receive different weights while different channels share the same weight, which inevitably loses spatial information from the feature map and conflicts with the fact that a convolutional feature map has both spatial and channel structure. Worse still, when attention is applied only to the last convolutional layer, the receptive fields are already very large and differ little from one another, so spatial attention becomes less discriminative. Researchers have therefore proposed combining channel attention and spatial attention as the "left and right arms" of the neural network.
Some questions in visual question answering involve multi-hop relationships between objects, for example "What is in the basket of the bicycle?" The model must first find the bicycle in the image, then locate the basket from the bicycle, and finally identify the object contained in the basket. Visual question answering therefore requires stepwise matching of the image regions best suited to answering the question. Consequently, besides using an attention mechanism to extract the key information needed for answering, a visual question-answering model should also have a certain memory capability, searching, reasoning over and storing relevant information for different questions. Neural networks with memory, such as RNNs, LSTMs and GRUs, have a short memory span and cannot meet the visual question-answering task's need for long-term memory and storage of useful information. To mitigate the loss of useful information, a dynamic memory network is used here to iteratively search for visual information related to the question.
Disclosure of Invention
The invention aims to provide a visual question-answering method of a dynamic memory network model based on a multiple attention mechanism, which solves the complex questions in visual question answering that require multi-step reasoning and improves the accuracy of the visual question-answering model.
The technical solution adopted by the invention is a visual question-answering method of a dynamic memory network model based on a multiple attention mechanism, comprising the following steps:
step 1, preprocessing the input image and text and feeding them into the input module of the model to extract image and text features, obtaining target-level features;
step 2, to obtain the question features, extracting features from the question input in step 1, splitting it into individual words at punctuation marks and spaces; then representing the words as vectors with a pre-trained word model, feeding the word-vector representation into a recurrent neural network, and taking the hidden state of the last time step as the question features;
step 3, to obtain the image features, feeding the image input in step 1 into a feature extraction network to obtain region-level target features consisting of the features of the K regions with the highest confidence;
step 4, using the multiple attention mechanism to iteratively update a memory over the question features and image features obtained in steps 2 and 3, generating the context vector needed to answer the question;
and step 5, feeding the question features from step 2 and the new image features generated in step 4 into a feature fusion module to jointly infer the answer, the answer being the candidate answer given the highest probability by the classifier.
The present invention is also characterized in that,
the specific implementation of step 2 is as follows:
step 2.1: first, the input question text is processed into a form that the model can accept; the input question q is then represented as:
q = [q_1, q_2, ..., q_N]
wherein: N is the sentence length and q_i is a word;
step 2.2: secondly, a word-vector model maps the words into the same vector space to obtain the word embedding representation of the words; the word-vector representation h of the question is:
h = [h_1, h_2, ..., h_N]
wherein: h_i is the trained word vector of the word q_i; the processed word vectors are input into the GRU network, a process expressed by the following equation:
S = ReLU(GRU(h_i)), h_i ∈ R^P
wherein: S is the sentence feature of the input text and h_i is a word vector of the input text, of dimension P;
step 2.3: finally, the word vectors are input into a recurrent neural network to extract the features of the sentence, namely the question features.
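For illustration, a minimal sketch of steps 2.1 to 2.3 in PyTorch is given below; the tokenizer, the vocabulary handling and the 300/2048 dimensions (taken from the experimental settings reported later) are assumptions for illustration, not part of the claimed method.

```python
import re
import torch
import torch.nn as nn

def tokenize(question: str):
    # step 2.1: split the question into individual words at punctuation marks and spaces
    return [w for w in re.split(r"[\s.,?!;:]+", question.lower()) if w]

class QuestionEncoder(nn.Module):
    def __init__(self, glove_weights: torch.Tensor, hidden_dim: int = 2048):
        super().__init__()
        # glove_weights: (vocab_size, 300) matrix of pre-trained GloVe vectors (step 2.2)
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.gru = nn.GRU(glove_weights.size(1), hidden_dim, batch_first=True)

    def forward(self, token_ids: torch.Tensor):
        # token_ids: (batch, N) indices of the words q_1 .. q_N
        h = self.embed(token_ids)      # word vectors h_1 .. h_N, shape (batch, N, 300)
        _, last = self.gru(h)          # step 2.3: recurrent encoding of the sentence
        return last.squeeze(0)         # hidden state of the last time step = question feature
```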
The question features in step 2 are obtained by using a GloVe word-vector model pre-trained on a corpus to obtain the word-vector representation of each word.
Step 3 is specifically implemented according to the following steps:
after the input picture is received: since not all elements in the picture are related to the question, in order to lock onto the target more accurately, an attention mechanism needs to be applied to the picture representation to find the regions that are key to answering the question; a top-down attention model is used, and the object detection network Faster R-CNN, which provides high-level semantics, extracts the picture features; first, a VGG or ResNet backbone network extracts an image feature map, then a region proposal network and region-of-interest pooling yield proposal feature maps of fixed size, which are classified and regressed to obtain accurate image features; finally, the first K candidate regions with the highest confidence are taken as the image features, and the extraction process is:
V = [v_1, v_2, ..., v_K], v_k ∈ R^D
wherein: v_k represents any one of the candidate objects, V represents the set of candidates selected by confidence, and each candidate object feature has dimension D.
Step 4 is specifically implemented according to the following steps:
step 4.1: firstly, the question features and picture features obtained in steps 2 and 3 are fused;
step 4.2: secondly, channel attention is applied to the object feature map to obtain a channel feature map closely related to the question; a spatial attention mechanism is then used on the channel-attended feature map to obtain the object spatial regions closely related to the question; the model memory is updated on this basis, and the process is iterated to obtain the key contextual information for answering the question; the updated model memory m_t is:
m_t = ReLU(W_t[m_{t-1}; c_t; q] + b)
wherein: [·; ·] denotes the feature concatenation operation, W_t denotes the parameter update matrix, b denotes the bias, c_t denotes the new image feature, t denotes a time step, m_{t-1} denotes the contextual memory, and q denotes the question vector.
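A minimal PyTorch rendering of this update equation is given below, assuming all vectors share the dimension used in the experiments; the class name is illustrative.

```python
import torch
import torch.nn as nn

class MemoryUpdate(nn.Module):
    """One episodic update: m_t = ReLU(W_t [m_{t-1}; c_t; q] + b)."""
    def __init__(self, dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(3 * dim, dim)   # W_t and bias b

    def forward(self, m_prev, c_t, q):
        # m_prev: previous memory, c_t: attended image feature, q: question vector, all (batch, dim)
        return torch.relu(self.proj(torch.cat([m_prev, c_t, q], dim=-1)))
```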
Step 5 is specifically implemented according to the following steps:
first, the updated model memory m_t and the question vector Q are fused using the BLOCK multi-modal fusion method to obtain the fused feature J; after the joint feature representation J is obtained, two fully connected layers perform the classification; answer prediction is then performed with a Sigmoid function; the DMN-MA model allows each question to have multiple correct answers, and each candidate answer receives a score in the range (0, 1); finally, the candidate answer with the highest probability is selected as the final answer of the model:
y = Sigmoid(W_j J + b_j)
wherein: W_j represents the parameter of the fully connected layer, b_j represents the bias term, y represents the final answer, and cross entropy is used as the loss function during training.
The invention has the beneficial effects that:
1. The invention is based on a dynamic memory network model with a multiple attention mechanism. Unlike previous attention models, the model uses not only a spatial attention mechanism but also a channel attention mechanism, so that the visual question-answering model applies different weights to different channel feature maps; channel attention thus effectively complements spatial attention. In addition, the input module and the episodic memory module of the dynamic memory network model are studied in depth: Faster R-CNN is used in the input module to obtain target-level object features; in the episodic memory module, the multiple attention mechanism continuously updates and stores the memory according to the question, iterative reasoning yields the visual vectors most relevant to answering the question, and contextual information is used effectively for answer inference. Finally, the network's final memory and the question representation are fused to infer the correct answer.
2. The method is scientific and reasonable in design, can continuously perform memory updating and storage according to the problems by using a multiple memory mechanism, obtains the most relevant visual vector of the answers to the problems through iterative reasoning, and effectively utilizes the context information to perform answer reasoning. And the memory network further improves the accuracy of the visual question-answering model.
3. The method of the invention provides a dynamic memory network model (DMN-MA) based on a multiple attention mechanism on the basis of a dynamic memory network. Different from the prior model, the method applies a multiple attention mechanism based on problem guidance when reading the input image features, focuses on not only the spatial region of the image, but also different convolution channels of the image, and better conforms to the three-dimensional characteristics of both characteristic map channeling and spatiality. The DMN-MA model iteratively inquires visual information related to the question when finding image features, continuously updates memory contents, and obtains key memory for answering the question, thereby solving the complex problem that multiple reasoning is needed in visual question answering.
Drawings
FIG. 1 is a schematic diagram of the episodic memory module iterating twice in the method of the present invention;
FIG. 2 is an overall framework diagram of the dynamic memory network model based on the multiple attention mechanism in the method of the present invention;
FIG. 3 is a schematic diagram before the memory visualization process in the simulation experiment of the present invention;
FIG. 4 is a schematic diagram after the memory visualization process in the simulation experiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention relates to a visual question-answering method of a dynamic memory network model based on a multiple attention mechanism, which comprises the following steps:
step 1, preprocessing the input image and text and feeding them into the input module of the model to extract image and text features, obtaining target-level features;
step 2, to obtain the question features, extracting features from the question input in step 1, splitting it into individual words at punctuation marks and spaces; then representing the words as vectors with a pre-trained word model, feeding the word-vector representation into a recurrent neural network, and taking the hidden state of the last time step as the question features;
the specific implementation of step 2 is as follows:
step 2.1: firstly, the input question text is processed into a form that the model can accept, namely all words in the question text are split into individual words at punctuation marks and spaces; the input question q is then represented as:
q = [q_1, q_2, ..., q_N]
wherein: N is the sentence length and q_i is a word;
step 2.2: secondly, a word-vector model maps the words into the same vector space to obtain the word embedding representation of the words; word embedding is a method for converting the words of a text into real-valued vectors, which makes computation convenient. The word-vector representation h of the question is:
h = [h_1, h_2, ..., h_N]
wherein: h_i is the trained word vector of the word q_i; the word-vector representation of each word is obtained here with a pre-trained GloVe word-vector model, and since a question in the visual question-answering datasets used here typically does not exceed 20 words, the processed word vectors are input into the GRU network, a process expressed by the following equation:
S = ReLU(GRU(h_i)), h_i ∈ R^P
wherein: S is the sentence feature of the input text and h_i is a word vector of the input text, of dimension P.
Step 2.3: finally, the word vectors are input into a recurrent neural network to extract the features of the sentence, namely the question features.
The question features in step 2 are obtained by using a GloVe word-vector model pre-trained on a large corpus to obtain the word-vector representation of each word.
Step 3, to obtain the picture features, the picture input in step 1 is fed into a feature extraction network to obtain region-level target features consisting of the features of the K regions with the highest confidence; the feature extraction network used here is Faster R-CNN.
Step 3 is specifically implemented according to the following steps:
after the input picture is received: since not all elements in the picture are related to the question, in order to lock onto the target more accurately, an attention mechanism needs to be applied to the picture representation to find the regions that are key to answering the question. A top-down attention model is used, and the object detection network Faster R-CNN, which provides high-level semantics, extracts the picture features; first, a VGG or ResNet backbone network extracts an image feature map, then a region proposal network and region-of-interest pooling yield proposal feature maps of fixed size, which are classified and regressed to obtain accurate image features; finally, the first K candidate regions with the highest confidence are taken as the image features, and the extraction process is:
V = [v_1, v_2, ..., v_K], v_k ∈ R^D
wherein: v_k represents any one of the candidate objects, V represents the set of candidates selected by confidence, and each candidate object feature has dimension D.
Step 4, the multiple attention mechanism is used to iteratively update a memory over the question features and picture features obtained in steps 2 and 3, generating the context vector needed to answer the question; the memory for answering the question is updated pass by pass by combining channel attention and spatial attention;
step 4 is specifically implemented according to the following steps:
step 4.1: firstly, the question features and picture features obtained in steps 2 and 3 are fused;
step 4.2: secondly, as shown by the image channel feature map in FIG. 1, channel attention is first applied to the object feature map to obtain a channel feature map closely related to the question; a spatial attention mechanism is then used on the channel-attended feature map to obtain the object spatial regions closely related to the question; the model memory is updated on this basis, and the process is iterated to obtain the key contextual information for answering the question; the updated model memory m_t is:
m_t = ReLU(W_t[m_{t-1}; c_t; q] + b)
wherein: [·; ·] denotes the feature concatenation operation, W_t denotes the parameter update matrix, b denotes the bias, c_t denotes the new image feature, t denotes a time step, m_{t-1} denotes the contextual memory, and q denotes the question vector. Channel attention focuses mainly on the objects themselves, and a correlation calculation then yields the channel attention vector. Spatial attention locates, guided by the question, the object regions best suited to answering it, giving different object regions different weights instead of treating every region equally. After each pass through the channel attention module and the spatial attention module, the new image feature is used to update the episodic memory and generate a vector. Following previous work on visual question answering, the memory is updated with the ReLU activation function.
And step 5, the question features from step 2 and the new image features generated in step 4 are fed into a feature fusion module to jointly infer the answer, the answer being the candidate answer given the highest probability by the classifier.
Step 5 is specifically implemented according to the following steps:
first, the updated model memory m_t and the question vector Q are fused using the BLOCK multi-modal fusion method to obtain the fused feature J. After the joint feature representation J is obtained, two fully connected layers perform the classification. Answer prediction is then performed with a Sigmoid function; the DMN-MA model allows each question to have multiple correct answers, and each candidate answer receives a score in the range (0, 1). Finally, the candidate answer with the highest probability is selected as the final answer of the model:
y = Sigmoid(W_j J + b_j)
wherein: W_j represents the parameter of the fully connected layer, b_j represents the bias term, and y represents the final answer. Cross entropy is used as the loss function during training.
The specific process of the invention is shown in FIG. 2. First, region-level target features are extracted from the input image, the image is processed into fact vectors, and the input question is encoded; a dynamic memory network model based on the multiple attention mechanism is then built, the question and image features are fed in and iterated over several passes, and the contextual memory is updated after each pass until a high-probability answer emerges. The fused features then interact with the question to obtain new image features, and finally the obtained image features and the question jointly infer the answer. Compared with traditional methods that use global image features, or with other graph-network visual question-answering methods that neglect the importance of relationships, the technical solution of the invention effectively improves the performance of the visual question-answering model.
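Read as a forward pass, the process in FIG. 2 can be sketched as follows. Module names, dimensions, the simple dot-product attention and the concatenation-based fusion used here are placeholders for the multiple attention and BLOCK fusion described above, and the three memory passes follow the iteration count chosen in the experiments below.

```python
import torch
import torch.nn as nn

class DMNMA(nn.Module):
    """Structural sketch of the pipeline in FIG. 2 (names and dimensions are illustrative)."""
    def __init__(self, vocab_size, num_answers, d=2048, hops=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)   # question words -> vectors (GloVe-initialised)
        self.q_gru = nn.GRU(300, d, batch_first=True)
        self.mem = nn.Linear(3 * d, d)               # episodic memory update
        self.fuse = nn.Linear(2 * d, d)              # placeholder for BLOCK fusion
        self.cls = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, num_answers))
        self.hops = hops

    def forward(self, question_ids, region_feats):
        # question encoding: hidden state of the last GRU time step
        _, q = self.q_gru(self.embed(question_ids))
        q = q.squeeze(0)                                               # (batch, d)
        # region_feats: top-K Faster R-CNN region features, (batch, K, d)
        m = q                                                          # initialise the memory
        for _ in range(self.hops):                                     # iterative memory updates
            w = torch.softmax((region_feats * q.unsqueeze(1)).sum(-1), dim=1)
            c = (w.unsqueeze(-1) * region_feats).sum(1)                # attended image context
            m = torch.relu(self.mem(torch.cat([m, c, q], dim=-1)))     # m_t = ReLU(W_t[m_{t-1}; c_t; q] + b)
        joint = torch.relu(self.fuse(torch.cat([m, q], dim=-1)))       # fuse memory and question
        return torch.sigmoid(self.cls(joint))                          # scores over candidate answers
```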
Simulation experiment and characterization of experimental results
1. Data set
The model was evaluated on two public visual question-answering datasets, COCO-QA and VQA 2.0. The COCO-QA images come from MS-COCO; the dataset comprises 123,587 images, of which 72,783 are used for training and 38,948 for testing, and importantly the answers to its questions are evenly distributed. The VQA 2.0 dataset contains 204,721 images from MS-COCO, with 123,287 images in the training and validation sets (more than 80,000 of them for training) and 81,434 images in the test set. The dataset has 614,163 questions, three per image, and each question has ten answers provided by ten different annotators.
2. Experimental Environment
The development framework was PyTorch 1.1.0 with the Python 3.6 development language. Specifically, K = 100 in the image input module, each object feature vector has dimension 2048, and ResNet-152 serves as the backbone for image feature extraction. The question module treats questions as fixed length, discarding the excess and padding the shortfall with 0: the question length is fixed at 20 for the COCO-QA dataset and 14 for the VQA 2.0 dataset. The word-vector dimension is 300, the GRU hidden-layer dimension is 2048, and the resulting question vector therefore also has dimension 2048. In the answer prediction stage, the COCO-QA dataset has 430 answers; for VQA 2.0, an answer is added to the candidate answer set if it occurs more than 8 times in the training set, giving 3,129 candidate answers.
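A small sketch of the two preprocessing rules just described, candidate-answer selection and fixed-length questions; the input format (a flat list of training answers, a list of token ids) is an assumption for illustration.

```python
from collections import Counter

def build_answer_vocab(train_answers, min_count: int = 8):
    """Keep answers occurring more than `min_count` times in the training set
    (yielding the 3,129 VQA 2.0 candidates reported above)."""
    counts = Counter(train_answers)
    return sorted(a for a, c in counts.items() if c > min_count)

def pad_question(token_ids, max_len: int):
    """Fix the question length (20 for COCO-QA, 14 for VQA 2.0):
    truncate the excess and pad the shortfall with 0."""
    return (token_ids + [0] * max_len)[:max_len]
```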
All activation functions in the experiments are ReLU, and dropout with p = 0.5 is applied in the input and output layers to prevent overfitting. All training samples are randomly shuffled during training, the batch size is set to 32, and 20 epochs are run. Training uses the Adam stochastic gradient descent algorithm with an initial learning rate of 0.001; after the first 5 epochs, the DMN-MA model reduces the learning rate to 1/10 of its value every 3 epochs.
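The training schedule above could be expressed as follows; `model` and `train_loader` are assumed to exist, and the milestone list is one way to realise reducing the learning rate to 1/10 every 3 epochs after the first 5 epochs.

```python
import torch

def train_dmn_ma(model, train_loader, epochs: int = 20):
    """Training schedule as reported above: Adam, lr 0.001, batch size 32,
    learning rate divided by 10 every 3 epochs after the first 5."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[5, 8, 11, 14, 17], gamma=0.1)
    criterion = torch.nn.BCELoss()
    for _ in range(epochs):
        for questions, images, targets in train_loader:   # loader shuffles, batch size 32
            optimizer.zero_grad()
            loss = criterion(model(questions, images), targets)
            loss.backward()
            optimizer.step()
        scheduler.step()
```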
3. Results and analysis of the experiments
Because the number of iterations of the DMN-MA episodic memory module is not fixed in advance, different iteration counts were first tried on the COCO-QA and VQA 2.0 datasets to find the model's best performance. The overall accuracy of the model for different iteration counts on both datasets is shown in Table 1.
TABLE 1 Accuracy comparison for different iteration counts of the episodic memory module
As Table 1 shows, accuracy rises as the number of iterations increases; with 3 iterations the model's overall accuracy on both datasets is highest, and increasing the number of iterations further causes accuracy to drop sharply. Overall, the multiple attention mechanism is most accurate with 3 iterations, so the experiments set the number of iterations to 3.
Next, to verify the effectiveness of the proposed model, Table 2 lists the results of the model and other mainstream methods on the COCO-QA test set.
TABLE 2 Comparison of overall accuracy and WUPS scores with other methods on the COCO-QA dataset
As can be seen from Table 2, the overall accuracy of the proposed DMN-MA model reaches 64.57%, an improvement of 11.26% over the traditional VIS+LSTM method. In particular, the overall accuracy is about 3% higher than the classical visual question-answering attention method SAN and 2.07% higher than the QPU model. The model also performs well on WUPS 0.9 and WUPS 0.0. Using only spatial attention for iterative reasoning is not sufficient, since question-guided channel attention is equally important in visual question-answering research.
As shown in Table 3, the overall performance of the proposed DMN-MA model is 12.96% higher than the baseline model CNN+LSTM, 4.91% higher than the MCB model and 2.54% higher than the Resonnet model; in addition, its overall accuracy is 1.51% higher than the classical top-down attention visual question-answering model. Notably, the DMN-MA model and the top-down attention visual question-answering model use the same data preprocessing, namely Faster R-CNN to extract the visual image features and GloVe+GRU to extract the question features; the difference is that the top-down attention model predicts answers with spatial attention alone, which fully demonstrates the effectiveness of the proposed model.
TABLE 3 comparison of accuracy of various question types on COCO-QA
In conclusion, the DMN-MA model was compared with several mainstream methods on the COCO-QA and VQA 2.0 datasets. By combining the advantages of the multiple attention mechanism and the memory network, it better matches the three-dimensional structure of the convolutional feature map while reducing the loss of contextual information during answer prediction, and therefore performs better.
4. Attention visualization
Several pictures and questions were chosen at random from the dataset for an attention-visualization demonstration of the proposed model, as shown in FIGS. 3-4. The question appears above the figure; FIG. 3 shows the original picture and FIG. 4 the picture after the model's attention visualization; the Ground Truth below is the dataset answer, and Prediction denotes the answer predicted by the model.

Claims (6)

1. A visual question-answering method of a dynamic memory network model based on a multiple attention mechanism, characterized by comprising the following steps:
step 1, preprocessing the input image and text and feeding them into the input module of the model to extract image and text features, obtaining target-level features;
step 2, extracting features from the question input in step 1, splitting it into individual words at punctuation marks and spaces; then representing the words as vectors with a pre-trained word model, feeding the word-vector representation into a recurrent neural network, and taking the hidden state of the last time step as the question features;
step 3, feeding the image input in step 1 into a feature extraction network to obtain region-level target features consisting of the features of the K regions with the highest confidence;
step 4, using the multiple attention mechanism to iteratively update a memory over the question features and image features obtained in steps 2 and 3, generating the context vector needed to answer the question;
and step 5, feeding the question features from step 2 and the new image features generated in step 4 into a feature fusion module to jointly infer the answer, the answer being the candidate answer given the highest probability by the classifier.
2. The visual question-answering method based on the multiple attention mechanism dynamic memory network model according to claim 1, wherein the specific implementation manner of the step 2 is as follows:
step 2.1: first, the input question text is processed into a form that the model can accept, then the input question q is represented as:
q = [q_1, q_2, ..., q_N]
wherein: N is the sentence length and q_i is a word;
step 2.2: secondly, the words are mapped to the same vector space using a word vector model to obtain the word embedding representation of the words; the word-vector representation h of the question is:
h = [h_1, h_2, ..., h_N]
wherein: h_i is the trained word vector of the word q_i; the processed word vectors are input into the GRU network, and the process is expressed by the following equation:
S = ReLU(GRU(h_i)), h_i ∈ R^P
wherein: S is the sentence feature of the input text and h_i is a word vector of the input text, of dimension P;
step 2.3: finally, the word vectors are input into a recurrent neural network to extract the features of the sentence, namely the question features.
3. The visual question-answering method based on the multi-attention mechanism dynamic memory network model according to claim 2, wherein the question features in step 2 are obtained by using a GloVe word vector model pre-trained on a corpus to obtain the word vector representation of each word.
4. The visual question-answering method based on the multiple attention mechanism dynamic memory network model according to claim 3, wherein the step 3 is implemented by the following steps:
after the input picture is received: since not all elements in the picture are related to the question, in order to lock onto the target more accurately, an attention mechanism needs to be applied to the picture representation to find the regions that are key to answering the question; a top-down attention model is used, and the object detection network Faster R-CNN with high-level semantics is adopted to extract the picture features; first, a VGG or ResNet backbone network extracts an image feature map, then a region proposal network and region-of-interest pooling yield proposal feature maps of fixed size, which are classified and regressed to obtain accurate image features; finally, the first K candidate regions with the highest confidence are obtained as the image features, and the extraction process is:
V = [v_1, v_2, ..., v_K], v_k ∈ R^D
wherein: v_k represents any one of the candidate objects, V represents the set of candidates selected by confidence, and each candidate object feature has dimension D.
5. The visual question-answering method based on the multiple attention mechanism dynamic memory network model according to claim 4, wherein the step 4 is implemented by the following steps:
step 4.1: firstly, the question features and picture features obtained in steps 2 and 3 are fused;
step 4.2: secondly, channel attention is first applied to the object feature map to obtain a channel feature map closely related to the question; a spatial attention mechanism is then used on the channel-attended feature map to obtain the object spatial regions closely related to the question; the model memory is updated on this basis, and the process is iterated to obtain the key contextual information for answering the question; the updated model memory m_t is:
m_t = ReLU(W_t[m_{t-1}; c_t; q] + b)
wherein: [·; ·] represents the feature concatenation operation, W_t represents the parameter update matrix, b represents the bias, c_t represents the new image feature, t denotes a time step, m_{t-1} represents the contextual memory, and q represents the question vector.
6. The visual question-answering method based on the multiple attention mechanism dynamic memory network model according to claim 5, wherein the step 5 is implemented by the following steps:
first, the updated model memory m_t and the question vector Q are fused using the BLOCK multi-modal fusion method to obtain the fused feature J; after the joint feature representation J is obtained, two fully connected layers perform the classification; answer prediction is then performed with a Sigmoid function; the DMN-MA model allows each question to have multiple correct answers, and each candidate answer receives a score in the range (0, 1); finally, the candidate answer with the highest probability is selected as the final answer of the model:
y = Sigmoid(W_j J + b_j)
wherein: W_j represents the parameter of the fully connected layer, b_j represents the bias term, y represents the final answer, and cross entropy is used as the loss function during training.
CN202111083704.6A 2021-09-14 2021-09-14 Visual question-answering method of dynamic memory network model based on multi-attention mechanism Active CN113886626B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111083704.6A CN113886626B (en) 2021-09-14 2021-09-14 Visual question-answering method of dynamic memory network model based on multi-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111083704.6A CN113886626B (en) 2021-09-14 2021-09-14 Visual question-answering method of dynamic memory network model based on multi-attention mechanism

Publications (2)

Publication Number Publication Date
CN113886626A true CN113886626A (en) 2022-01-04
CN113886626B CN113886626B (en) 2024-02-02

Family

ID=79009636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111083704.6A Active CN113886626B (en) 2021-09-14 2021-09-14 Visual question-answering method of dynamic memory network model based on multi-attention mechanism

Country Status (1)

Country Link
CN (1) CN113886626B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114416914A (en) * 2022-03-30 2022-04-29 中建电子商务有限责任公司 Processing method based on picture question and answer
CN114417044A (en) * 2022-01-19 2022-04-29 中国科学院空天信息创新研究院 Image question and answer method and device
US20220164588A1 (en) * 2020-11-20 2022-05-26 Fujitsu Limited Storage medium, machine learning method, and output device
CN114661874A (en) * 2022-03-07 2022-06-24 浙江理工大学 Visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163299A (en) * 2019-05-31 2019-08-23 合肥工业大学 A kind of vision answering method based on bottom-up attention mechanism and memory network
US20200302340A1 (en) * 2019-03-22 2020-09-24 Royal Bank Of Canada Systems and methods for learning user representations for open vocabulary data sets

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200302340A1 (en) * 2019-03-22 2020-09-24 Royal Bank Of Canada Systems and methods for learning user representations for open vocabulary data sets
CN110163299A (en) * 2019-05-31 2019-08-23 合肥工业大学 A kind of vision answering method based on bottom-up attention mechanism and memory network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
闫茹玉 (Yan Ruyu); 刘学亮 (Liu Xueliang): "Visual question answering model combining a bottom-up attention mechanism and a memory network" [结合自底向上注意力机制和记忆网络的视觉问答模型], 中国图象图形学报 (Journal of Image and Graphics), no. 05

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220164588A1 (en) * 2020-11-20 2022-05-26 Fujitsu Limited Storage medium, machine learning method, and output device
CN114417044A (en) * 2022-01-19 2022-04-29 中国科学院空天信息创新研究院 Image question and answer method and device
CN114661874A (en) * 2022-03-07 2022-06-24 浙江理工大学 Visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels
CN114661874B (en) * 2022-03-07 2024-04-30 浙江理工大学 Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels
CN114416914A (en) * 2022-03-30 2022-04-29 中建电子商务有限责任公司 Processing method based on picture question and answer
CN114416914B (en) * 2022-03-30 2022-07-08 中建电子商务有限责任公司 Processing method based on picture question and answer

Also Published As

Publication number Publication date
CN113886626B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN110188358B (en) Training method and device for natural language processing model
CN113886626B (en) Visual question-answering method of dynamic memory network model based on multi-attention mechanism
Seo et al. Visual reference resolution using attention memory for visual dialog
JP6351689B2 (en) Attention based configurable convolutional neural network (ABC-CNN) system and method for visual question answering
CN111984772B (en) Medical image question-answering method and system based on deep learning
CN108446404B (en) Search method and system for unconstrained visual question-answer pointing problem
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN110826609B (en) Double-current feature fusion image identification method based on reinforcement learning
CN114201592A (en) Visual question-answering method for medical image diagnosis
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN115222998B (en) Image classification method
CN113158815A (en) Unsupervised pedestrian re-identification method, system and computer readable medium
CN113095251B (en) Human body posture estimation method and system
CN112818889A (en) Dynamic attention-based method for integrating accuracy of visual question-answer answers by hyper-network
CN112966135A (en) Image-text retrieval method and system based on attention mechanism and gate control mechanism
CN115270752A (en) Template sentence evaluation method based on multilevel comparison learning
Li et al. TAM at VQA-Med 2021: A Hybrid Model with Feature Extraction and Fusion for Medical Visual Question Answering.
CN114329148A (en) Content information identification method and device, computer equipment and storage medium
CN116524513B (en) Open vocabulary scene graph generation method, system, equipment and storage medium
CN115861995B (en) Visual question-answering method and device, electronic equipment and storage medium
CN114168769B (en) Visual question-answering method based on GAT relation reasoning
CN113821610A (en) Information matching method, device, equipment and storage medium
CN113609355A (en) Video question-answering system, method, computer and storage medium based on dynamic attention and graph network reasoning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant