CN110889340A - Visual question-answering model based on iterative attention mechanism - Google Patents
- Publication number
- CN110889340A (application CN201911099046.2A)
- Authority
- CN
- China
- Prior art keywords
- attention
- question
- iterative
- follows
- attention mechanism
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a visual question-answering model based on an iterative attention mechanism, comprising three steps: step S1, constructing a dual attention mechanism; step S2, iterating the internal structure of the model; step S3, predicting answers. The invention uses VGGNet to extract image features and a bidirectional LSTM to encode the question. Taking the image feature vector and the question feature vector as input, an attention mechanism is first applied to each of the two vectors; after calculation, two attention feature vectors are obtained and then fused to produce new image and question feature vectors. This fusion step is performed iteratively to reduce the granularity of the attended region, yielding the final image and question feature vectors, from which the answer distribution is predicted. The beneficial effects of the invention are that attention is placed on the question as well as the image, the attended region is accurate, and the predicted answer is accurate.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a visual question-answering model based on an iterative attention mechanism.
Background
A key issue in visual question answering (VQA) is how to extract and fuse the visual and linguistic features obtained from the input image and question. The general framework of existing methods extracts visual and linguistic features separately from the image and the question in an initial step, and fuses them in a later step to compute a prediction. In early studies, researchers employed simple fusion methods such as concatenation, summation, or multiplication of the visual and linguistic features, which were then fed into fully-connected layers to predict the answer.
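A minimal sketch of these early fusion baselines may help; all array sizes, weight names, and the random initialization below are illustrative assumptions, not from the patent:

```python
import numpy as np

# Visual and linguistic feature vectors are combined by concatenation,
# summation, or elementwise multiplication, then scored by a single
# fully-connected layer over a fixed answer vocabulary.
rng = np.random.default_rng(0)
d = 8                        # shared feature dimension (assumed)
v = rng.standard_normal(d)   # image feature vector (e.g. from a CNN)
q = rng.standard_normal(d)   # question feature vector (e.g. from an LSTM)

fused_cat = np.concatenate([v, q])   # shape (2d,)
fused_sum = v + q                    # shape (d,)
fused_mul = v * q                    # shape (d,)

n_answers = 5
W = rng.standard_normal((n_answers, fused_cat.shape[0]))
logits = W @ fused_cat               # fully-connected layer
answer = int(np.argmax(logits))      # predicted answer index
print(fused_cat.shape, answer)
```

The three fusion operators differ only in how the two modalities interact; the later sections replace this single global fusion with attention over individual words and regions.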
To date, attention models in the VQA literature focus on visual attention rather than on the question. Consider the questions "How many cats are in this image?" and "How many cats can you see in this image?". They have the same meaning, and the answer to both is essentially determined by the phrase "how many cats"; this shows that a model attending to "how many cats" is more robust than a model that weights words irrelevant to the answer.
Furthermore, most recently proposed visual question-answering models are based on neural networks. One commonly used method is to extract a global image feature vector using a convolutional neural network (CNN), encode the corresponding question into a feature vector using a long short-term memory network (LSTM), then process the two and predict the answer. Although these methods have achieved good results, such models often fail to give accurate answers when the answer depends on fine-grained regions of the image.
The above disadvantages can be summarized in three points:
① the focus of existing attention models is on the visual side, not on the question;
② when the attention mechanism is used, the attended region is not precise, especially for fine-grained regions;
③ these deficiencies result in inaccurate predicted answers.
Therefore, the prior art needs a visual question-answering model based on an iterative attention mechanism that places attention on the question, attends to regions accurately, and predicts answers accurately.
Disclosure of Invention
The present invention is directed to a visual question-answering model based on an iterative attention mechanism, so as to solve the above-mentioned problems in the background art.
In order to achieve the purpose, the invention provides the following technical scheme:
a visual question-answering model based on an iterative attention mechanism comprises the following steps:
step S1: constructing a dual attention mechanism;
step S2: iterating the internal structure of the model, i.e., the method of fusing the image and question features at each step;
step S3: predicting answers.
As a further scheme of the invention: the step S1 includes:
First, image features are extracted with VGGNet as V_l, and the question is encoded with a Bi-LSTM as Q_l (the subscript l anticipates the iterative model used later); two attention maps are then created from Q_l and V_l, and the calculation formula is as follows:
Each row of A_Ql and B_Vl above contains a single attention map.
The d-dimensional feature vectors q_ln and v_lt are projected into several low-dimensional spaces; letting h be the number of low-dimensional spaces, d_h (≡ d/h) is the feature-vector dimension in each space; with learned linear projections, the projected feature matrix of the i-th space is:
An attention map is created from each projected matrix by normalizing by column and by row with the softmax function; the formula is as follows:
Because the invention uses multiplicative (dot-product) attention, the average fusion of the multiple projected features is equivalent to averaging the attention maps, as follows:
The invention then uses product attention to obtain representations of the question and of the image; the formula is as follows:
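The dual attention construction above can be sketched roughly as follows. The patent's exact formulas are elided in the source, so the dot-product affinity, the averaging over heads, and every name and shape here are assumptions:

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along one axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
d, h = 16, 4            # feature dimension and number of heads (assumed)
d_h = d // h            # low-dimensional space size, d_h = d / h
N, T = 5, 7             # question words and image regions
Q = rng.standard_normal((d, N))   # Bi-LSTM question features Q_l
V = rng.standard_normal((d, T))   # VGGNet image features V_l

# Project into h low-dimensional spaces and average the resulting
# dot-product affinities over heads (one plausible reading of "average
# fusion of multiple features equals averaging the attention maps").
A = np.zeros((T, N))
for i in range(h):
    Wq = rng.standard_normal((d_h, d))    # learned projections (random here)
    Wv = rng.standard_normal((d_h, d))
    A += (Wv @ V).T @ (Wq @ Q) / np.sqrt(d_h)
A /= h

# Normalize by column and by row to get the two attention maps.
A_Q = softmax(A, axis=0)   # attention over image regions, per word
B_V = softmax(A, axis=1)   # attention over question words, per region

# Product attention: attended image per word, attended question per region.
Q_hat = V @ A_Q            # shape (d, N)
V_hat = Q @ B_V.T          # shape (d, T)
print(Q_hat.shape, V_hat.shape)
```

Each column of `Q_hat` is an image summary conditioned on one question word, and each column of `V_hat` is a question summary conditioned on one image region, matching the two directions of the dual mechanism.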
As a further scheme of the invention: the step S2 includes:
After the feature representations have been computed, the n-th column of the attended question matrix stores a representation of the entire image associated with the n-th question word, i.e., the attention feature vector of the n-th word; the n-th column vector is then concatenated with the n-th question word vector q_ln and fused into a 2d-dimensional vector.
The concatenated vector is projected into a d-dimensional space through a single-layer network with a ReLU activation function and a residual connection; the formula is as follows:
The parameters in the formula are learned weights and bias terms; when all N words (n = 1, …, N) take part in the calculation, the full question representation of the next step is obtained.
Similarly, the representation v_lt of the t-th image area is concatenated with the representation of the entire question associated with the t-th image area and projected into the d-dimensional space; the formula is as follows:
The parameters in the formula are again learned weights and bias terms; when all T areas (t = 1, …, T) take part in the calculation, the full image representation of the next step is obtained.
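The concatenate-project-residual step can be sketched as follows for one question word; the weight shapes and names are assumptions, since the patent's own formula is elided in the source:

```python
import numpy as np

def fuse(word_vec, attended_vec, W, b):
    """Concatenate a question-word vector with its attended image
    representation, project back to d dimensions through a single-layer
    network with ReLU, and add a residual connection (a sketch)."""
    z = np.concatenate([word_vec, attended_vec])   # 2d-dimensional vector
    return word_vec + np.maximum(0.0, W @ z + b)   # ReLU + residual

rng = np.random.default_rng(2)
d = 16
q_n = rng.standard_normal(d)       # n-th question word vector q_ln
qhat_n = rng.standard_normal(d)    # attended image vector for word n
W = rng.standard_normal((d, 2 * d)) * 0.1   # learned weight (random here)
b = np.zeros(d)                             # learned bias term

q_next = fuse(q_n, qhat_n, W, b)   # n-th column of the next-step Q
print(q_next.shape)
```

The residual connection means that when the projection contributes nothing, the word vector passes through unchanged, which keeps the iterated model stable; the same function applies symmetrically to each image-area vector v_lt.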
As a further scheme of the invention: the step S3 includes:
The invention uses the last outputs Q_L and V_L of the iterative model to predict the answer distribution; since they contain representations of the N question words and T image regions, the invention first applies a self-attention mechanism to them to obtain an aggregated representation of the entire question and image; the operation for Q_L is as follows:
The scores s_qL1, …, s_qLN are computed for q_L1, …, q_LN respectively by applying an MLP with two hidden layers.
The scores of the predefined answers are then calculated with an MLP, a widely used method in recent studies; the formula is as follows:
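The self-attention aggregation and the MLP answer scoring can be sketched along these lines; all layer sizes and names are assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mlp(x, W1, b1, W2, b2):
    # Small MLP with one ReLU hidden layer.
    return W2 @ np.maximum(0.0, W1 @ x + b1) + b2

rng = np.random.default_rng(3)
d, N, n_answers = 16, 5, 10
Q_L = rng.standard_normal((d, N))   # final question representations

# Self-attention pooling: score each word with an MLP, then take the
# softmax-weighted sum to aggregate the whole question.
W1, b1 = rng.standard_normal((8, d)) * 0.1, np.zeros(8)
w2, b2 = rng.standard_normal((1, 8)) * 0.1, np.zeros(1)
scores = np.array([mlp(Q_L[:, n], W1, b1, w2, b2)[0] for n in range(N)])
alpha = softmax(scores)             # attention weights over the N words
q_agg = Q_L @ alpha                 # aggregated question vector

# Score the predefined answers with another MLP (sizes are assumptions).
Wa1, ba1 = rng.standard_normal((32, d)) * 0.1, np.zeros(32)
Wa2, ba2 = rng.standard_normal((n_answers, 32)) * 0.1, np.zeros(n_answers)
answer_probs = softmax(mlp(q_agg, Wa1, ba1, Wa2, ba2))
print(answer_probs.shape)
```

The same pooling would be applied to V_L over its T regions, and in practice the two aggregated vectors would be fused before the final answer MLP; the sketch shows only the question side for brevity.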
compared with the prior art, the invention has the beneficial effects that:
Aiming at the problems that existing visual question-answering models do not apply an attention mechanism to the question words to eliminate the interference of irrelevant words, and that the attended region is not accurate when an attention mechanism is used, the invention creatively constructs a dual attention mechanism and an iterative model, so as to apply attention to the question and reduce the granularity of the attended region. The specific idea is to generate an attention feature vector over the image regions for each question word, and an attention feature vector over the question words for each image region. The model then performs the calculation of the attention feature vectors, the concatenation of the multimodal representations, and their transformation through a single-layer network with ReLU and residual connections. This calculation is packaged into an iterative attention model: the model considers the interactions between all image regions and all question words, can be stacked iteratively into a hierarchical structure, and achieves multi-step interaction between image and question to reduce the granularity of the attended region, obtain the attended regions and words more accurately, and then predict the answer. Experiments show that the model improves the accuracy of the predicted answers.
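Putting the pieces together, the iterative hierarchy amounts to stacking the dual-attention-plus-fusion layer several times. The sketch below uses a single-head dot-product affinity for brevity; every shape, name, and the number of iterations L are assumptions, not the patent's exact formulation:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_attention_layer(Q, V, Wq, Wv, bq, bv):
    """One iteration: dual attention maps, product attention, then
    concatenation + single-layer network with ReLU and a residual
    connection, applied to both modalities."""
    d = Q.shape[0]
    A = V.T @ Q / np.sqrt(d)            # affinity between regions and words
    A_Q = softmax(A, axis=0)            # attend over image regions per word
    B_V = softmax(A, axis=1)            # attend over question words per region
    Q_hat, V_hat = V @ A_Q, Q @ B_V.T   # attended representations
    Q_next = Q + np.maximum(0.0, Wq @ np.vstack([Q, Q_hat]) + bq)
    V_next = V + np.maximum(0.0, Wv @ np.vstack([V, V_hat]) + bv)
    return Q_next, V_next

rng = np.random.default_rng(4)
d, N, T, L = 16, 5, 7, 3                # L iterations refine the granularity
Q = rng.standard_normal((d, N))
V = rng.standard_normal((d, T))
for _ in range(L):
    Wq = rng.standard_normal((d, 2 * d)) * 0.1   # per-layer learned weights
    Wv = rng.standard_normal((d, 2 * d)) * 0.1
    Q, V = dual_attention_layer(Q, V, Wq, Wv,
                                np.zeros((d, 1)), np.zeros((d, 1)))
print(Q.shape, V.shape)
```

Each pass recomputes the affinity from the previous pass's fused representations, which is what lets later iterations attend to finer-grained regions and words than the first.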
Drawings
FIG. 1 is a step diagram of a visual question-answering model based on an iterative attention mechanism according to the present invention.
FIG. 2 is a flowchart illustrating the effect of the visual question-answering model based on the iterative attention mechanism according to the present invention.
FIG. 3 is a schematic diagram of step S1 of the visual question-answering model based on the iterative attention mechanism according to the present invention.
FIG. 4 is a schematic diagram of step S2 of the visual question-answering model based on the iterative attention mechanism according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments; all other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1 to 4, in an embodiment of the present invention, a visual question-answering model based on an iterative attention mechanism includes the following steps:
step S1: constructing a dual attention mechanism;
step S2: iterating the internal structure of the model, i.e., the method of fusing the image and question features at each step;
step S3: predicting answers.
The step S1 includes:
First, image features are extracted with VGGNet as V_l, and the question is encoded with a Bi-LSTM as Q_l (the subscript l anticipates the iterative model used later); two attention maps are then created from Q_l and V_l, and the calculation formula is as follows:
Each row of A_Ql and B_Vl above contains a single attention map.
The d-dimensional feature vectors q_ln and v_lt are projected into several low-dimensional spaces; letting h be the number of low-dimensional spaces, d_h (≡ d/h) is the feature-vector dimension in each space; with learned linear projections, the projected feature matrix of the i-th space is:
An attention map is created from each projected matrix by normalizing by column and by row with the softmax function; the formula is as follows:
Because the invention uses multiplicative (dot-product) attention, the average fusion of the multiple projected features is equivalent to averaging the attention maps, as follows:
The invention then uses product attention to obtain representations of the question and of the image; the formula is as follows:
The step S2 includes:
After the feature representations have been computed, the n-th column of the attended question matrix stores a representation of the entire image associated with the n-th question word, i.e., the attention feature vector of the n-th word; the n-th column vector is then concatenated with the n-th question word vector q_ln and fused into a 2d-dimensional vector.
The concatenated vector is projected into a d-dimensional space through a single-layer network with a ReLU activation function and a residual connection; the formula is as follows:
The parameters in the formula are learned weights and bias terms; when all N words (n = 1, …, N) take part in the calculation, the full question representation of the next step is obtained.
Similarly, the representation v_lt of the t-th image area is concatenated with the representation of the entire question associated with the t-th image area and projected into the d-dimensional space; the formula is as follows:
The parameters in the formula are again learned weights and bias terms; when all T areas (t = 1, …, T) take part in the calculation, the full image representation of the next step is obtained.
The step S3 includes:
The invention uses the last outputs Q_L and V_L of the iterative model to predict the answer distribution; since they contain representations of the N question words and T image regions, the invention first applies a self-attention mechanism to them to obtain an aggregated representation of the entire question and image; the operation for Q_L is as follows:
The scores s_qL1, …, s_qLN are computed for q_L1, …, q_LN respectively by applying an MLP with two hidden layers.
The scores of the predefined answers are then calculated with an MLP, a widely used method in recent studies; the formula is as follows:
In an embodiment of the invention, the model of the invention is compared with other models on the COCO-QA dataset; experiments show that the model of the invention outperforms the other models, and the test results are as follows:
Therefore, the invention can help visually impaired people to understand visual information, and in the future visual question answering can be applied to image retrieval systems to help users retrieve the images they need.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof; the present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein; any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present description is organized by embodiment, each embodiment does not necessarily contain only a single independent technical solution; this manner of description is adopted for clarity only, the description should be taken as a whole, and the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.
Claims (4)
1. A visual question-answering model based on an iterative attention mechanism, characterized by comprising the following steps:
step S1: constructing a dual attention mechanism;
step S2: iterating the internal structure of the model, i.e., the method of fusing the image and question features at each step;
step S3: predicting answers.
2. The visual question-answering model based on the iterative attention mechanism as claimed in claim 1, wherein: the step S1 includes:
First, image features are extracted with VGGNet as V_l, and the question is encoded with a Bi-LSTM as Q_l (the subscript l anticipates the iterative model used later); two attention maps are then created from Q_l and V_l, and the calculation formula is as follows:
Each row of A_Ql and B_Vl above contains a single attention map.
The d-dimensional feature vectors q_ln and v_lt are projected into several low-dimensional spaces; letting h be the number of low-dimensional spaces, d_h (≡ d/h) is the feature-vector dimension in each space; with learned linear projections, the projected feature matrix of the i-th space is:
An attention map is created from each projected matrix by normalizing by column and by row with the softmax function; the formula is as follows:
Because the invention uses multiplicative (dot-product) attention, the average fusion of the multiple projected features is equivalent to averaging the attention maps, as follows:
The invention then uses product attention to obtain representations of the question and of the image; the formula is as follows:
3. The visual question-answering model based on the iterative attention mechanism as claimed in claim 1, wherein: the step S2 includes:
After the feature representations have been computed, the n-th column of the attended question matrix stores a representation of the entire image associated with the n-th question word, i.e., the attention feature vector of the n-th word; the n-th column vector is then concatenated with the n-th question word vector q_ln and fused into a 2d-dimensional vector.
The concatenated vector is projected into a d-dimensional space through a single-layer network with a ReLU activation function and a residual connection; the formula is as follows:
The parameters in the formula are learned weights and bias terms; when all N words (n = 1, …, N) take part in the calculation, the full question representation of the next step is obtained.
Similarly, the representation v_lt of the t-th image area is concatenated with the representation of the entire question associated with the t-th image area and projected into the d-dimensional space; the formula is as follows:
4. The visual question-answering model based on the iterative attention mechanism as claimed in claim 1, wherein: the step S3 includes:
The invention uses the last outputs Q_L and V_L of the iterative model to predict the answer distribution; since they contain representations of the N question words and T image regions, the invention first applies a self-attention mechanism to them to obtain an aggregated representation of the entire question and image; the operation for Q_L is as follows:
The scores s_qL1, …, s_qLN are computed for q_L1, …, q_LN respectively by applying an MLP with two hidden layers.
The scores of the predefined answers are then calculated with an MLP, a widely used method in recent studies; the formula is as follows:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911099046.2A CN110889340A (en) | 2019-11-12 | 2019-11-12 | Visual question-answering model based on iterative attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911099046.2A CN110889340A (en) | 2019-11-12 | 2019-11-12 | Visual question-answering model based on iterative attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110889340A true CN110889340A (en) | 2020-03-17 |
Family
ID=69747275
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911099046.2A Pending CN110889340A (en) | 2019-11-12 | 2019-11-12 | Visual question-answering model based on iterative attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110889340A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111680484A (en) * | 2020-05-29 | 2020-09-18 | 北京理工大学 | Answer model generation method and system for visual general knowledge reasoning question and answer |
CN111680484B (en) * | 2020-05-29 | 2023-04-07 | 北京理工大学 | Answer model generation method and system for visual general knowledge reasoning question and answer |
CN111858849A (en) * | 2020-06-10 | 2020-10-30 | 南京邮电大学 | VQA method based on intensive attention module |
CN112036276A (en) * | 2020-08-19 | 2020-12-04 | 北京航空航天大学 | Artificial intelligent video question-answering method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20220222920A1 (en) | Content processing method and apparatus, computer device, and storage medium | |
CN108647233A (en) | A kind of answer sort method for question answering system | |
CN108920544A (en) | A kind of personalized position recommended method of knowledge based map | |
CN113486190B (en) | Multi-mode knowledge representation method integrating entity image information and entity category information | |
CN110889340A (en) | Visual question-answering model based on iterative attention mechanism | |
CN109544306A (en) | A kind of cross-cutting recommended method and device based on user behavior sequence signature | |
CN113761153B (en) | Picture-based question-answering processing method and device, readable medium and electronic equipment | |
CN109766557A (en) | A kind of sentiment analysis method, apparatus, storage medium and terminal device | |
CN115080801B (en) | Cross-modal retrieval method and system based on federal learning and data binary representation | |
CN113177141B (en) | Multi-label video hash retrieval method and device based on semantic embedded soft similarity | |
CN112131261B (en) | Community query method and device based on community network and computer equipment | |
CN112699215B (en) | Grading prediction method and system based on capsule network and interactive attention mechanism | |
Wu et al. | Multi-stage optimization model for hesitant qualitative decision making with hesitant fuzzy linguistic preference relations | |
CN114818691A (en) | Article content evaluation method, device, equipment and medium | |
CN114780777B (en) | Cross-modal retrieval method and device based on semantic enhancement, storage medium and terminal | |
CN113239209A (en) | Knowledge graph personalized learning path recommendation method based on RankNet-transformer | |
CN112000788A (en) | Data processing method and device and computer readable storage medium | |
CN105677838A (en) | User profile creating and personalized search ranking method and system based on user requirements | |
CN110008411A (en) | It is a kind of to be registered the deep learning point of interest recommended method of sparse matrix based on user | |
CN114330704A (en) | Statement generation model updating method and device, computer equipment and storage medium | |
Fotheringham et al. | Multiscale geographically weighted regression: Theory and practice | |
CN116662497A (en) | Visual question-answer data processing method, device and computer equipment | |
CN106021346A (en) | A retrieval processing method and device | |
CN112035567A (en) | Data processing method and device and computer readable storage medium | |
Wang | Research on Online Education Resources Recommendation Based on Deep Learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200317 |