CN110889340A - Visual question-answering model based on iterative attention mechanism - Google Patents
- Publication number
- CN110889340A (application CN201911099046.2A)
- Authority
- CN
- China
- Prior art keywords
- attention
- question
- iterative
- follows
- attention mechanism
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a visual question-answering model based on an iterative attention mechanism, comprising three steps: step S1, constructing a dual attention mechanism; step S2, iterating the internal structure of the model; step S3, predicting answers. The invention uses VGGNet to extract image features and a bidirectional LSTM to encode the question. Taking the image feature vector and the question feature vector as input, an attention mechanism is first applied to each of the two vectors; after calculation, two attention feature vectors are obtained and then fused to produce new image and question feature vectors. This fusion step is performed iteratively to reduce the granularity of the attended region, yielding the final image and question feature vectors, from which the answer distribution is predicted. The beneficial effects of the invention are that attention is placed on the question as well as the image, the attended region is accurate, and the predicted answer is accurate.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a visual question-answering model based on an iterative attention mechanism.
Background
A key issue in visual question answering (VQA) is how to extract and fuse the visual and linguistic features obtained from the input image and question. The general framework of existing methods extracts visual and linguistic features separately from the image and the question in an initial step, and fuses them in a later step to compute a prediction. In early studies, researchers employed simple fusion methods such as concatenation, summation, or multiplication of the visual and linguistic features, which were then fed into fully-connected layers to predict the answer.
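A minimal sketch of these early fusion baselines may help; all array sizes, weight names, and the random initialization below are illustrative assumptions, not from the patent:

```python
import numpy as np

# Visual and linguistic feature vectors are combined by concatenation,
# summation, or elementwise multiplication, then scored by a single
# fully-connected layer over a fixed answer vocabulary.
rng = np.random.default_rng(0)
d = 8                        # shared feature dimension (assumed)
v = rng.standard_normal(d)   # image feature vector (e.g. from a CNN)
q = rng.standard_normal(d)   # question feature vector (e.g. from an LSTM)

fused_cat = np.concatenate([v, q])   # shape (2d,)
fused_sum = v + q                    # shape (d,)
fused_mul = v * q                    # shape (d,)

n_answers = 5
W = rng.standard_normal((n_answers, fused_cat.shape[0]))
logits = W @ fused_cat               # fully-connected layer
answer = int(np.argmax(logits))      # predicted answer index
print(fused_cat.shape, answer)
```

The three fusion operators differ only in how the two modalities interact; the later sections replace this single global fusion with attention over individual words and regions.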
To date, attention models in the VQA literature focus on visual attention rather than on the question. Consider the questions "How many cats are in this image?" and "How many cats can you see in this image?". They have the same meaning, and the answer to both is essentially determined by the phrase "how many cats"; this shows that a model attending to "how many cats" is more robust than a model that weights words irrelevant to the answer.
Furthermore, most recently proposed visual question-answering models are based on neural networks. One commonly used method is to extract a global image feature vector using a convolutional neural network (CNN), encode the corresponding question into a feature vector using a long short-term memory network (LSTM), then process the two and predict the answer. Although these methods have achieved good results, such models often fail to give accurate answers when the answer depends on fine-grained regions of the image.
The above disadvantages can be summarized in three points:
① the focus of existing attention models is on the visual side, not on the question;
② when the attention mechanism is used, the attended region is not precise, especially for fine-grained regions;
③ these deficiencies result in inaccurate predicted answers.
Therefore, the prior art needs a visual question-answering model based on an iterative attention mechanism that places attention on the question, attends to regions accurately, and predicts answers accurately.
Disclosure of Invention
The present invention is directed to a visual question-answering model based on an iterative attention mechanism, so as to solve the above-mentioned problems in the background art.
In order to achieve the purpose, the invention provides the following technical scheme:
a visual question-answering model based on an iterative attention mechanism comprises the following steps:
step S1: constructing a dual attention mechanism;
step S2: iterating the internal structure of the model, i.e., the method of fusing the image and question features at each step;
step S3: predicting answers.
As a further scheme of the invention: the step S1 includes:
First, image features are extracted with VGGNet as V_l, and the question is encoded with a Bi-LSTM as Q_l (the subscript l anticipates the iterative model used later); two attention maps are then created from Q_l and V_l, and the calculation formula is as follows:
Each row of A_Ql and B_Vl above contains a single attention map.
The d-dimensional feature vectors q_ln and v_lt are projected into several low-dimensional spaces; letting h be the number of low-dimensional spaces, d_h (≡ d/h) is the feature-vector dimension in each space; with learned linear projections, the projected feature matrix of the i-th space is:
An attention map is created from each projected matrix by normalizing by column and by row with the softmax function; the formula is as follows:
Because the invention uses multiplicative (dot-product) attention, the average fusion of the multiple projected features is equivalent to averaging the attention maps, as follows:
The invention then uses product attention to obtain representations of the question and of the image; the formula is as follows:
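The dual attention construction above can be sketched roughly as follows. The patent's exact formulas are elided in the source, so the dot-product affinity, the averaging over heads, and every name and shape here are assumptions:

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along one axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
d, h = 16, 4            # feature dimension and number of heads (assumed)
d_h = d // h            # low-dimensional space size, d_h = d / h
N, T = 5, 7             # question words and image regions
Q = rng.standard_normal((d, N))   # Bi-LSTM question features Q_l
V = rng.standard_normal((d, T))   # VGGNet image features V_l

# Project into h low-dimensional spaces and average the resulting
# dot-product affinities over heads (one plausible reading of "average
# fusion of multiple features equals averaging the attention maps").
A = np.zeros((T, N))
for i in range(h):
    Wq = rng.standard_normal((d_h, d))    # learned projections (random here)
    Wv = rng.standard_normal((d_h, d))
    A += (Wv @ V).T @ (Wq @ Q) / np.sqrt(d_h)
A /= h

# Normalize by column and by row to get the two attention maps.
A_Q = softmax(A, axis=0)   # attention over image regions, per word
B_V = softmax(A, axis=1)   # attention over question words, per region

# Product attention: attended image per word, attended question per region.
Q_hat = V @ A_Q            # shape (d, N)
V_hat = Q @ B_V.T          # shape (d, T)
print(Q_hat.shape, V_hat.shape)
```

Each column of `Q_hat` is an image summary conditioned on one question word, and each column of `V_hat` is a question summary conditioned on one image region, matching the two directions of the dual mechanism.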
As a further scheme of the invention: the step S2 includes:
After the feature representations have been computed, the n-th column of the attended question matrix stores a representation of the entire image associated with the n-th question word, i.e., the attention feature vector of the n-th word; the n-th column vector is then concatenated with the n-th question word vector q_ln and fused into a 2d-dimensional vector.
The concatenated vector is projected into a d-dimensional space through a single-layer network with a ReLU activation function and a residual connection; the formula is as follows:
The parameters in the formula are learned weights and bias terms; when all N words (n = 1, …, N) take part in the calculation, the full question representation of the next step is obtained.
Similarly, the representation v_lt of the t-th image area is concatenated with the representation of the entire question associated with the t-th image area and projected into the d-dimensional space; the formula is as follows:
The parameters in the formula are again learned weights and bias terms; when all T areas (t = 1, …, T) take part in the calculation, the full image representation of the next step is obtained.
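The concatenate-project-residual step can be sketched as follows for one question word; the weight shapes and names are assumptions, since the patent's own formula is elided in the source:

```python
import numpy as np

def fuse(word_vec, attended_vec, W, b):
    """Concatenate a question-word vector with its attended image
    representation, project back to d dimensions through a single-layer
    network with ReLU, and add a residual connection (a sketch)."""
    z = np.concatenate([word_vec, attended_vec])   # 2d-dimensional vector
    return word_vec + np.maximum(0.0, W @ z + b)   # ReLU + residual

rng = np.random.default_rng(2)
d = 16
q_n = rng.standard_normal(d)       # n-th question word vector q_ln
qhat_n = rng.standard_normal(d)    # attended image vector for word n
W = rng.standard_normal((d, 2 * d)) * 0.1   # learned weight (random here)
b = np.zeros(d)                             # learned bias term

q_next = fuse(q_n, qhat_n, W, b)   # n-th column of the next-step Q
print(q_next.shape)
```

The residual connection means that when the projection contributes nothing, the word vector passes through unchanged, which keeps the iterated model stable; the same function applies symmetrically to each image-area vector v_lt.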
As a further scheme of the invention: the step S3 includes:
The invention uses the last outputs Q_L and V_L of the iterative model to predict the answer distribution; since they contain representations of the N question words and T image regions, the invention first applies a self-attention mechanism to them to obtain an aggregated representation of the entire question and image; the operation for Q_L is as follows:
The scores s_qL1, …, s_qLN are computed for q_L1, …, q_LN respectively by applying an MLP with two hidden layers.
The scores of the predefined answers are then calculated with an MLP, a widely used method in recent studies; the formula is as follows:
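The self-attention aggregation and the MLP answer scoring can be sketched along these lines; all layer sizes and names are assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mlp(x, W1, b1, W2, b2):
    # Small MLP with one ReLU hidden layer.
    return W2 @ np.maximum(0.0, W1 @ x + b1) + b2

rng = np.random.default_rng(3)
d, N, n_answers = 16, 5, 10
Q_L = rng.standard_normal((d, N))   # final question representations

# Self-attention pooling: score each word with an MLP, then take the
# softmax-weighted sum to aggregate the whole question.
W1, b1 = rng.standard_normal((8, d)) * 0.1, np.zeros(8)
w2, b2 = rng.standard_normal((1, 8)) * 0.1, np.zeros(1)
scores = np.array([mlp(Q_L[:, n], W1, b1, w2, b2)[0] for n in range(N)])
alpha = softmax(scores)             # attention weights over the N words
q_agg = Q_L @ alpha                 # aggregated question vector

# Score the predefined answers with another MLP (sizes are assumptions).
Wa1, ba1 = rng.standard_normal((32, d)) * 0.1, np.zeros(32)
Wa2, ba2 = rng.standard_normal((n_answers, 32)) * 0.1, np.zeros(n_answers)
answer_probs = softmax(mlp(q_agg, Wa1, ba1, Wa2, ba2))
print(answer_probs.shape)
```

The same pooling would be applied to V_L over its T regions, and in practice the two aggregated vectors would be fused before the final answer MLP; the sketch shows only the question side for brevity.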
compared with the prior art, the invention has the beneficial effects that:
Aiming at the problems that existing visual question-answering models do not apply an attention mechanism to the question words to eliminate the interference of irrelevant words, and that the attended region is not accurate when an attention mechanism is used, the invention creatively constructs a dual attention mechanism and an iterative model, so as to apply attention to the question and reduce the granularity of the attended region. The specific idea is to generate an attention feature vector over the image regions for each question word, and an attention feature vector over the question words for each image region. The model then performs the calculation of the attention feature vectors, the concatenation of the multimodal representations, and their transformation through a single-layer network with ReLU and residual connections. This calculation is packaged into an iterative attention model: the model considers the interactions between all image regions and all question words, can be stacked iteratively into a hierarchical structure, and achieves multi-step interaction between image and question to reduce the granularity of the attended region, obtain the attended regions and words more accurately, and then predict the answer. Experiments show that the model improves the accuracy of the predicted answers.
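Putting the pieces together, the iterative hierarchy amounts to stacking the dual-attention-plus-fusion layer several times. The sketch below uses a single-head dot-product affinity for brevity; every shape, name, and the number of iterations L are assumptions, not the patent's exact formulation:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_attention_layer(Q, V, Wq, Wv, bq, bv):
    """One iteration: dual attention maps, product attention, then
    concatenation + single-layer network with ReLU and a residual
    connection, applied to both modalities."""
    d = Q.shape[0]
    A = V.T @ Q / np.sqrt(d)            # affinity between regions and words
    A_Q = softmax(A, axis=0)            # attend over image regions per word
    B_V = softmax(A, axis=1)            # attend over question words per region
    Q_hat, V_hat = V @ A_Q, Q @ B_V.T   # attended representations
    Q_next = Q + np.maximum(0.0, Wq @ np.vstack([Q, Q_hat]) + bq)
    V_next = V + np.maximum(0.0, Wv @ np.vstack([V, V_hat]) + bv)
    return Q_next, V_next

rng = np.random.default_rng(4)
d, N, T, L = 16, 5, 7, 3                # L iterations refine the granularity
Q = rng.standard_normal((d, N))
V = rng.standard_normal((d, T))
for _ in range(L):
    Wq = rng.standard_normal((d, 2 * d)) * 0.1   # per-layer learned weights
    Wv = rng.standard_normal((d, 2 * d)) * 0.1
    Q, V = dual_attention_layer(Q, V, Wq, Wv,
                                np.zeros((d, 1)), np.zeros((d, 1)))
print(Q.shape, V.shape)
```

Each pass recomputes the affinity from the previous pass's fused representations, which is what lets later iterations attend to finer-grained regions and words than the first.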
Drawings
FIG. 1 is a step diagram of a visual question-answering model based on an iterative attention mechanism according to the present invention.
FIG. 2 is a flowchart illustrating the effect of the visual question-answering model based on the iterative attention mechanism according to the present invention.
FIG. 3 is a schematic diagram of step S1 of the visual question-answering model based on the iterative attention mechanism according to the present invention.
FIG. 4 is a schematic diagram of step S2 of the visual question-answering model based on the iterative attention mechanism according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments; all other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1 to 4, in an embodiment of the present invention, a visual question-answering model based on an iterative attention mechanism includes the following steps:
step S1: constructing a dual attention mechanism;
step S2: iterating the internal structure of the model, i.e., the method of fusing the image and question features at each step;
step S3: predicting answers.
The step S1 includes:
First, image features are extracted with VGGNet as V_l, and the question is encoded with a Bi-LSTM as Q_l (the subscript l anticipates the iterative model used later); two attention maps are then created from Q_l and V_l, and the calculation formula is as follows:
Each row of A_Ql and B_Vl above contains a single attention map.
The d-dimensional feature vectors q_ln and v_lt are projected into several low-dimensional spaces; letting h be the number of low-dimensional spaces, d_h (≡ d/h) is the feature-vector dimension in each space; with learned linear projections, the projected feature matrix of the i-th space is:
An attention map is created from each projected matrix by normalizing by column and by row with the softmax function; the formula is as follows:
Because the invention uses multiplicative (dot-product) attention, the average fusion of the multiple projected features is equivalent to averaging the attention maps, as follows:
The invention then uses product attention to obtain representations of the question and of the image; the formula is as follows:
The step S2 includes:
After the feature representations have been computed, the n-th column of the attended question matrix stores a representation of the entire image associated with the n-th question word, i.e., the attention feature vector of the n-th word; the n-th column vector is then concatenated with the n-th question word vector q_ln and fused into a 2d-dimensional vector.
The concatenated vector is projected into a d-dimensional space through a single-layer network with a ReLU activation function and a residual connection; the formula is as follows:
The parameters in the formula are learned weights and bias terms; when all N words (n = 1, …, N) take part in the calculation, the full question representation of the next step is obtained.
Similarly, the representation v_lt of the t-th image area is concatenated with the representation of the entire question associated with the t-th image area and projected into the d-dimensional space; the formula is as follows:
The parameters in the formula are again learned weights and bias terms; when all T areas (t = 1, …, T) take part in the calculation, the full image representation of the next step is obtained.
The step S3 includes:
The invention uses the last outputs Q_L and V_L of the iterative model to predict the answer distribution; since they contain representations of the N question words and T image regions, the invention first applies a self-attention mechanism to them to obtain an aggregated representation of the entire question and image; the operation for Q_L is as follows:
The scores s_qL1, …, s_qLN are computed for q_L1, …, q_LN respectively by applying an MLP with two hidden layers.
The scores of the predefined answers are then calculated with an MLP, a widely used method in recent studies; the formula is as follows:
In an embodiment of the invention, the model of the invention is compared with other models on the COCO-QA dataset; experiments show that the model of the invention outperforms the other models, and the test results are as follows:
Therefore, the invention can help visually impaired people to understand visual information, and in the future visual question answering can be applied to image retrieval systems to help users retrieve the images they need.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof; the present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein; any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present description is organized by embodiment, each embodiment does not necessarily contain only a single independent technical solution; this manner of description is adopted for clarity only, the description should be taken as a whole, and the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.
Claims (4)
1. A visual question-answering model based on an iterative attention mechanism, characterized by comprising the following steps:
step S1: constructing a dual attention mechanism;
step S2: iterating the internal structure of the model, i.e., the method of fusing the image and question features at each step;
step S3: predicting answers.
2. The visual question-answering model based on the iterative attention mechanism as claimed in claim 1, wherein: the step S1 includes:
First, image features are extracted with VGGNet as V_l, and the question is encoded with a Bi-LSTM as Q_l (the subscript l anticipates the iterative model used later); two attention maps are then created from Q_l and V_l, and the calculation formula is as follows:
Each row of A_Ql and B_Vl above contains a single attention map.
The d-dimensional feature vectors q_ln and v_lt are projected into several low-dimensional spaces; letting h be the number of low-dimensional spaces, d_h (≡ d/h) is the feature-vector dimension in each space; with learned linear projections, the projected feature matrix of the i-th space is:
An attention map is created from each projected matrix by normalizing by column and by row with the softmax function; the formula is as follows:
Because the invention uses multiplicative (dot-product) attention, the average fusion of the multiple projected features is equivalent to averaging the attention maps, as follows:
The invention then uses product attention to obtain representations of the question and of the image; the formula is as follows:
3. The visual question-answering model based on the iterative attention mechanism as claimed in claim 1, wherein: the step S2 includes:
After the feature representations have been computed, the n-th column of the attended question matrix stores a representation of the entire image associated with the n-th question word, i.e., the attention feature vector of the n-th word; the n-th column vector is then concatenated with the n-th question word vector q_ln and fused into a 2d-dimensional vector.
The concatenated vector is projected into a d-dimensional space through a single-layer network with a ReLU activation function and a residual connection; the formula is as follows:
The parameters in the formula are learned weights and bias terms; when all N words (n = 1, …, N) take part in the calculation, the full question representation of the next step is obtained.
Similarly, the representation v_lt of the t-th image area is concatenated with the representation of the entire question associated with the t-th image area and projected into the d-dimensional space; the formula is as follows:
4. The visual question-answering model based on the iterative attention mechanism as claimed in claim 1, wherein: the step S3 includes:
The invention uses the last outputs Q_L and V_L of the iterative model to predict the answer distribution; since they contain representations of the N question words and T image regions, the invention first applies a self-attention mechanism to them to obtain an aggregated representation of the entire question and image; the operation for Q_L is as follows:
The scores s_qL1, …, s_qLN are computed for q_L1, …, q_LN respectively by applying an MLP with two hidden layers.
The scores of the predefined answers are then calculated with an MLP, a widely used method in recent studies; the formula is as follows:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911099046.2A CN110889340A (en) | 2019-11-12 | 2019-11-12 | Visual question-answering model based on iterative attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911099046.2A CN110889340A (en) | 2019-11-12 | 2019-11-12 | Visual question-answering model based on iterative attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110889340A true CN110889340A (en) | 2020-03-17 |
Family
ID=69747275
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911099046.2A Pending CN110889340A (en) | 2019-11-12 | 2019-11-12 | Visual question-answering model based on iterative attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110889340A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111680484A (en) * | 2020-05-29 | 2020-09-18 | 北京理工大学 | Answer model generation method and system for visual general knowledge reasoning question and answer |
CN111680484B (en) * | 2020-05-29 | 2023-04-07 | 北京理工大学 | Answer model generation method and system for visual general knowledge reasoning question and answer |
CN111858849A (en) * | 2020-06-10 | 2020-10-30 | 南京邮电大学 | VQA method based on intensive attention module |
CN112036276A (en) * | 2020-08-19 | 2020-12-04 | 北京航空航天大学 | Artificial intelligent video question-answering method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20220222920A1 (en) | Content processing method and apparatus, computer device, and storage medium | |
CN108647233A (en) | A kind of answer sort method for question answering system | |
CN108920544A (en) | A kind of personalized position recommended method of knowledge based map | |
CN113486190B (en) | Multi-mode knowledge representation method integrating entity image information and entity category information | |
CN110889340A (en) | Visual question-answering model based on iterative attention mechanism | |
CN109544306A (en) | A kind of cross-cutting recommended method and device based on user behavior sequence signature | |
CN113761153B (en) | Picture-based question-answering processing method and device, readable medium and electronic equipment | |
CN109766557A (en) | A kind of sentiment analysis method, apparatus, storage medium and terminal device | |
CN115080801B (en) | Cross-modal retrieval method and system based on federal learning and data binary representation | |
CN113177141B (en) | Multi-label video hash retrieval method and device based on semantic embedded soft similarity | |
CN112131261B (en) | Community query method and device based on community network and computer equipment | |
CN112699215B (en) | Grading prediction method and system based on capsule network and interactive attention mechanism | |
Wu et al. | Multi-stage optimization model for hesitant qualitative decision making with hesitant fuzzy linguistic preference relations | |
CN114818691A (en) | Article content evaluation method, device, equipment and medium | |
CN114780777B (en) | Cross-modal retrieval method and device based on semantic enhancement, storage medium and terminal | |
CN113239209A (en) | Knowledge graph personalized learning path recommendation method based on RankNet-transformer | |
CN112000788A (en) | Data processing method and device and computer readable storage medium | |
CN105677838A (en) | User profile creating and personalized search ranking method and system based on user requirements | |
CN110008411A (en) | It is a kind of to be registered the deep learning point of interest recommended method of sparse matrix based on user | |
CN114330704A (en) | Statement generation model updating method and device, computer equipment and storage medium | |
Fotheringham et al. | Multiscale geographically weighted regression: Theory and practice | |
CN116662497A (en) | Visual question-answer data processing method, device and computer equipment | |
CN106021346A (en) | A retrieval processing method and device | |
CN112035567A (en) | Data processing method and device and computer readable storage medium | |
Wang | Research on Online Education Resources Recommendation Based on Deep Learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200317 |