CN117422978A - Grounding visual question-answering method based on dynamic two-stage visual information fusion - Google Patents

Grounding visual question-answering method based on dynamic two-stage visual information fusion

Info

Publication number
CN117422978A
CN117422978A (application CN202311428263.8A)
Authority
CN
China
Prior art keywords
question
grounding
mask
visual
answer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311428263.8A
Other languages
Chinese (zh)
Inventor
周东生
张悦
樊万姝
车超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University
Original Assignee
Dalian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University filed Critical Dalian University
Priority to CN202311428263.8A priority Critical patent/CN117422978A/en
Publication of CN117422978A publication Critical patent/CN117422978A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/86 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using syntactic or structural representations of the image or video pattern, e.g. symbolic string recognition; using graph matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 80/00 ICT specially adapted for facilitating communication between medical practitioners or patients, e.g. for collaborative diagnosis, therapy or health monitoring

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Human Computer Interaction (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a grounding visual question-answering method based on dynamic two-stage visual information fusion. A grounding visual question-answering system is built on a two-level multi-scale network: the visual information is split into language-guided pixel-level features and language-guided region-level features, and the two branches are combined for the final text-answer and grounding-answer prediction. A question-guided dynamic region-level feature localization network is proposed that adaptively assigns masks of different sizes to the grounding answer through question-guided localization of the visual information, improving the localization and segmentation accuracy of small targets. A cross-modal aggregation module is further designed to fuse the features of the two levels, strengthening the fusion between pixel-level and region-level features and improving the segmentation quality of the grounding-answer mask edges. The grounding visual question-answering system built on this language-guided adaptive two-level feature fusion network can generate an answer grounding mask while answering the question, effectively improving the accuracy of the whole model.

Description

Grounding visual question-answering method based on dynamic two-stage visual information fusion
Technical Field
The invention belongs to the technical field of computer vision and natural language processing, and particularly relates to a grounding visual question-answering method based on dynamic two-stage visual information fusion.
Background
In recent years, visual question answering (VQA) technology has developed rapidly and found more and more practical applications, such as answering questions posed by visually impaired users, helping radiologists diagnose serious diseases early, and human-computer interaction. As these systems mature, a system that not only produces a good answer but also grounds that answer in evidence becomes increasingly important for research and applications. Exposing the reasoning of the model can, to some extent, provide interpretable support for the answer. An ideal VQA system for such purposes should therefore not only generate accurate answers but also provide a mechanism to verify them.
However, conventional VQA systems usually output only the final text answer and lack verifiable visual evidence. Recent work has tried to address this. For example, the MAC-Caps method (capsule-based weakly supervised grounded visual question answering) produces a visual attention map alongside the text answer so that the accuracy of the answer localization can be better assessed. Similar approaches such as LXMERT (a Transformer-based cross-modal encoder) and DCAMN (a dual capsule attention mask network with mutual learning for visual question answering) also output the grounded answer region in the corresponding image while generating the text answer. These methods, however, typically output question-related attention maps or bounding boxes to reveal the grounded regions. If a grounded image answer mask is provided in response to the visual question, whether the obtained answer is convincing can be verified directly, which makes the VQA system more reliable. From an application perspective, an image grounding mask also enables further uses: for a question posed by a visually impaired user, the relevant content can be segmented from the background and the background blurred to protect privacy, or the relevant visual region can be enlarged so that low-vision users can find the desired information more quickly.
The answer grounding task was therefore proposed. Unlike the conventional VQA task, it starts from the practical needs of visually impaired users and requires the system to output a mask of the visual region corresponding to the answer while producing the text answer. For this task, DAVI (answer grounding based on dual visual-language interactions) combines two large pre-trained models, BLIP (Bootstrapping Language-Image Pre-training for unified vision-language understanding and generation) and VIT (a multi-modal framework based on vision and language), with two encoders and two decoders, coupling a text-image segmentation model with a vision-to-language generation model; in effect, however, it still splits the two interrelated tasks of generating the text answer and outputting the grounding mask into two independent tasks. The recently published DDTN (grounded visual answering with a dual-decoder Transformer network) does not rely on a large-scale pre-trained model, but its segmentation quality is also much lower than that of DAVI.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a grounding visual question-answering method based on dynamic two-stage visual information fusion, which achieves a better segmentation effect without relying on a large-scale pre-trained model, outputs the two answer modalities with a single encoder and a single decoder, and better realizes the interaction between the two modalities.
In order to achieve the above purpose, the technical scheme of the invention is as follows:
a grounding visual question-answering method based on dynamic two-stage visual information fusion comprises the following steps:
step 1: as shown in FIG. 1, the present invention employs a question-guided region-level dynamic multi-scale approach to localization and segmentation of ground answers, designing a language-guided region-level feature module QGDR consisting of a cross-attention module and a spatial annotatorThe intention module is composed to finally obtain the regional mask prediction characteristic F with the resolution ratio from small to large i ∈F t ,F s ,F m ,F l The method comprises the steps of carrying out a first treatment on the surface of the Wherein F is t ,F s ,F m ,F l For four types of regional feature hierarchies, from F t To F l The spatial resolution increases by two times layer by layer;
step 2: and meanwhile, in order to reduce the calculation cost and maintain the performance, a mask with proper resolution is adaptively allocated to each positioning object by adopting a dynamic method, and budget limitation is carried out on resource consumption. The QGDR output has four different switch states corresponding to four different mask resolutions, [14×14, 28×28, 56×56, 112×112];
step 3: in order to better fuse the two levels of characteristics, a cross-mode multi-scale fusion module FPA is also designed to output the characteristics F of the language-guided pixel-level characteristic module PWAM and the language-guided regional-level characteristic module QGDR i And P i Polymerizing;
step 4: constructing information flow between each level of the language-guided pixel level feature module PWAM and each level of the language-guided regional level feature module QGDR, performing layered progressive decoding, and finally obtaining a grounding answer by an image segmentation decoder and a text answer by a text decoder; training a grounding visual question-answering model consisting of two-stage characteristic branches by adopting mask loss, edge loss, budget constraint and text loss;
step 5: and (3) loading the model in the step (4), and inputting the required image and the corresponding question into the trained grounding visual question-answering model to obtain a corresponding grounding answer and a text answer.
Based on this scheme, the method uses multi-scale information fusion, so visual information at different scales can be better understood and processed, which helps the understanding and localization of complex scenes and thus improves question-answering accuracy. Adaptive resolution mask assignment dynamically allocates a mask of suitable resolution according to the needs of each localized object, improving resource efficiency while retaining high-resolution processing for key regions. The cross-modal multi-scale fusion module aggregates the language-guided pixel-level and region-level features at multiple scales, so text and image information are combined more effectively, improving question understanding and answer generation. Hierarchical progressive decoding, from pixel-level features to region-level features and then to the final answers, better captures detail in the image and relates it to the question, further improving accuracy. Multiple loss functions, including mask loss, edge loss, a budget constraint, and text loss, jointly consider objectives of different aspects, so the model is trained better and performs better. The method can be applied to grounded visual question answering, offering an efficient and accurate way for machines to understand images and answer questions, with potential applications in fields such as autonomous driving, medical image analysis, and image retrieval.
Further, the step 1 specifically includes:
step 1.1: first, ROI-aligned region feature Z is extracted from a swin-transducer i Carrying out average pooling to obtainRe-combining problem features K extracted from BERT i . Will->And K i Input into cross-model intent, this step can be thought of as injecting the word attention in the question into different visual channels to guide visual localization, facilitating multimodal information complementation enhancement. Wherein T represents transposition operation, and the specific formula is as follows after two linear transformations:
wherein Q is i Represents an attention weight; d, d i Representation ofAnd->The length of the vector; />A vector representing the problem feature generated by linear transformation;
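For illustration only, a rough PyTorch sketch of this question-guided cross-attention is given below; the module name, feature dimensions, and the exact placement of the two linear transformations are assumptions rather than the patent's implementation.

```python
# Illustrative sketch of the question-guided cross-attention of step 1.1.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedCrossAttention(nn.Module):
    def __init__(self, vis_dim: int, txt_dim: int, dim: int):
        super().__init__()
        self.proj_vis = nn.Linear(vis_dim, dim)   # linear transform of pooled region features
        self.proj_txt = nn.Linear(txt_dim, dim)   # linear transform of BERT question features
        self.dim = dim

    def forward(self, z_bar: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
        # z_bar: (B, N_regions, vis_dim) average-pooled ROI-aligned region features
        # k:     (B, N_words, txt_dim) question token features from BERT
        z_t = self.proj_vis(z_bar)                         # (B, N_regions, dim)
        k_t = self.proj_txt(k)                             # (B, N_words, dim)
        attn = torch.matmul(z_t, k_t.transpose(1, 2))      # (B, N_regions, N_words)
        attn = F.softmax(attn / self.dim ** 0.5, dim=-1)   # word attention per region
        # inject word attention back into the visual channels
        return torch.matmul(attn, k_t)                     # (B, N_regions, dim)
```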
step 1.2: for the obtained Q i Performing global pooling operation to obtain information weightIs fed into the attention module SE-block to weight the different channels of visual information for screening. Then classifying by using a plurality of convolution and full-connected layers to obtain regional mask prediction features F with different sizes i . The specific formula is as follows:
in the method, in the process of the invention,representing the Flatehen operation, F ex Representing operations in the SE-block module, w representing weights; f (F) ex The specific operation formula is as follows:
where delta represents a sigmoid function, ρ represents a ReLU function,and->Representing the weight matrix dimension.
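A minimal sketch of the SE-style channel re-weighting used in this step follows; the reduction ratio is an assumption, and the convolution/fully-connected classification head that produces F_t to F_l is omitted for brevity.

```python
# SE-style channel re-weighting (squeeze-and-excitation) as described in step 1.2.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)         # global pooling -> information weight
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),                  # rho
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                           # delta
        )

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        # q: (B, C, H, W) question-attended visual feature Q_i
        b, c, _, _ = q.shape
        w = self.fc(self.pool(q).flatten(1)).view(b, c, 1, 1)
        return q * w                                # channel-wise screening
```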
Further, the step 2 specifically includes:
the QGDR module is effectively a lightweight classifier aimed at selecting the best mask resolution from among the k different scale candidates located, and accurately locating and segmenting the ground answer at the minimum resource cost. QGDR will F i Hierarchical structure F of area features divided into four types t ,F s ,F m ,F l From F t To F l The spatial resolution is incremented by a factor of two layer by layer. And outputs a probability vector epsilon by performing softmax operation k =[ε 1 ,…,ε k ]. Each element of the probability vector represents a probability that the corresponding candidate resolution is selected. Soft output ε of QGDR k Should be converted into a single thermal prediction, expressed as h= [ H ] 1 ,…,h k ]. This process can be done by discrete sampling, followed by gradient back-propagation with Gumbel-Softmax to update QGDR. The specific formula is as follows:
wherein τ is a parameter; gumbel-softmax approaches unity fever when τ approaches 0. g i Representing gummel distribution; epsilon k′ Representing k' discrete probability vectors.
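The dynamic resolution selection can be sketched as follows with torch.nn.functional.gumbel_softmax; the candidate resolutions follow the patent, while the lightweight classifier head and feature layout are assumptions.

```python
# Sketch of the dynamic resolution selection of step 2 using Gumbel-Softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

RESOLUTIONS = [14, 28, 56, 112]                     # candidate mask resolutions

class ResolutionSelector(nn.Module):
    def __init__(self, feat_dim: int, k: int = len(RESOLUTIONS)):
        super().__init__()
        self.head = nn.Linear(feat_dim, k)          # lightweight classifier over k scales

    def forward(self, feat: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        logits = self.head(feat)                    # (B, k)
        # hard=True gives a one-hot "switch state" in the forward pass while the
        # backward pass uses the soft Gumbel-Softmax gradients
        return F.gumbel_softmax(logits, tau=tau, hard=True)   # (B, k)
```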
Further, the step 3 specifically includes:
step 3.1: picture and problem two-modality information gets transmembrane fusion features after language-guided pixel level feature module (PWAM) and language-guided regional level feature module (QGDR) processingAnd F i ∈R C ×H×W . The outputs of the two modules are then multi-scale aggregated. F due to upsampling and ROI pooling operations of the two modules i And P i There is a spatial misalignment between the two, and in order to enhance the segmentation performance of the boundary region, a cross-modal multi-scale fusion module FPA for adaptively aggregating multi-scale features is designed. The FPA includes as shown in FIG. 1A deformable convolution and a dynamic convolution. First F i Up-sampling by deconvolution (Deconv) and then adding F i And P i In series, the series of features is passed through a 3 x 3 conv to obtain an offset map, denoted Δo. Finally, F is calculated by the learned offset o i Alignment P i Adjusting the output F of QGDR by a deformable convolution form conv1 i To better match the PWAM output P i The specific formula of alignment is as follows:
O i =Φ[conv(ρ(F i )||P i )] (5)
where ρ represents a Deconv operation, Φ represents a Deconv 1 operation, and |is a join operation.
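A possible sketch of the offset prediction and deformable alignment is shown below, using torchvision.ops.DeformConv2d; it assumes the deconvolution brings F_i exactly to the spatial size of P_i, and the channel counts and kernel size are illustrative.

```python
# Offset prediction and deformable alignment as outlined in step 3.1.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableAlign(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.upsample = nn.ConvTranspose2d(channels, channels, 2, stride=2)   # Deconv (rho)
        # offset map: 2 offsets (x, y) per kernel position
        self.offset_conv = nn.Conv2d(2 * channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=kernel_size // 2)
        self.dconv1 = DeformConv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2)                   # Phi

    def forward(self, f: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
        # f: QGDR region-level feature F_i (lower resolution), p: PWAM feature P_i
        f_up = self.upsample(f)                                   # bring F_i to P_i's size
        offset = self.offset_conv(torch.cat([f_up, p], dim=1))    # Delta o from concat
        return self.dconv1(f_up, offset)                          # aligned O_i
```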
Step 3.2: The deformably convolved O_i and P_i are added, and a 1×1 convolution restores the output channel dimension to C. Finally, a conditional convolution (CondConv), which behaves similarly to an attention mechanism and focuses more on the salient parts of the object, is applied. The cross-modal multi-scale fusion module FPA is inserted at different stages of Swin-Transformer decoding and plays a key role in improving grounding-answer mask prediction:
Y_i = ψ(conv_{1×1}(O_i + P_i)) (6)
where Y_i denotes the aggregated region feature and ψ denotes the CondConv operation.
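The fusion of formula (6) might look like the following sketch; a genuine CondConv has more machinery (per-expert biases, larger kernels), so the dynamic convolution here is a simplified stand-in and all sizes are assumptions.

```python
# Simplified FPA fusion: add, 1x1 projection, then a CondConv-style dynamic convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCondConv(nn.Module):
    """Reduced dynamic convolution: per-sample 1x1 kernels are a softmax-weighted
    mixture of expert kernels (a stand-in for the patent's psi)."""
    def __init__(self, channels: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.Parameter(torch.randn(num_experts, channels, channels, 1, 1) * 0.01)
        self.route = nn.Linear(channels, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        r = F.softmax(self.route(x.mean(dim=(2, 3))), dim=-1)          # (B, E) routing weights
        kernels = torch.einsum('be,eoihw->boihw', r, self.experts)     # per-sample kernels
        x = x.reshape(1, b * c, h, w)
        kernels = kernels.reshape(b * c, c, 1, 1)
        out = F.conv2d(x, kernels, groups=b)                           # grouped per-sample conv
        return out.reshape(b, c, h, w)

class FPAFusion(nn.Module):
    """Fusion of Eq. (6): add aligned O_i and P_i, 1x1 conv back to C channels, dynamic conv."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.cond = SimpleCondConv(channels)

    def forward(self, o: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
        return self.cond(self.proj(o + p))                             # Y_i
```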
Further, the step 4 specifically includes:
the QGDR dynamically locates the ground answers in the images by the language guide image and provides ground answer masks that assign different resolutions to different aggregation stages. The cost of computing resources is reduced while ensuring accuracy, so that three loss functions are adopted for training the dynamic multi-scale module.
Step 4.1: Mask loss. Given one VQA instance, QGDR first predicts the mask switch states of the different resolutions, H = [h_1, …, h_k], and the fusion of the FPA modules is passed to the different stages of the decoder to obtain a group of K mask predictions M̂ = [M̂_1, …, M̂_K]. The mask loss function is defined as
L_mask = (1/N) Σ_{n=1..N} Σ_{k=1..K} h_k^n · L_BCE(M̂_k^n, M^n) (7)
where N denotes the number of instances, M̂_k denotes the k-th predicted grounding-answer mask, M denotes the ground-truth grounding-answer mask, h_k indicates whether the k-th mask resolution is selected as the output resolution, and L_BCE denotes the binary cross-entropy loss.
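A hedged sketch of this mask loss is given below: binary cross-entropy on each candidate-resolution prediction, gated by the switch state h. The tensor layouts and the nearest-neighbour resizing of the ground truth are assumptions.

```python
# Mask loss of Eq. (7): per-resolution BCE gated by the QGDR switch states.
import torch
import torch.nn.functional as F

def mask_loss(pred_masks, gt_mask, h):
    """pred_masks: list of K tensors (B, 1, Hk, Wk) of mask logits at each resolution
    gt_mask: (B, 1, H, W) binary ground-truth grounding mask
    h: (B, K) one-hot (or soft) resolution switch states from QGDR"""
    total = 0.0
    for k, m_hat in enumerate(pred_masks):
        gt_k = F.interpolate(gt_mask, size=m_hat.shape[-2:], mode='nearest')
        bce = F.binary_cross_entropy_with_logits(m_hat, gt_k, reduction='none')
        bce = bce.mean(dim=(1, 2, 3))                 # per-instance loss
        total = total + (h[:, k] * bce).mean()        # gate by switch state, average over batch
    return total
```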
Step 4.2: Edge loss. For the masks dynamically selected by QGDR, the mask loss is usually taken as the measure of mask quality, but the mask losses produced at different resolutions are in fact very close, making it hard to distinguish mask quality. In contrast, the edge losses produced by masks of different resolutions differ more and reflect mask quality better, so the invention uses the edge loss to measure mask quality. Given the output H = [h_1, …, h_k] of QGDR and the edge maps Ê_k of the different resolutions, the edge loss is expressed as
L_edge = (1/N) Σ_{n=1..N} Σ_{k=1..K} h_k^n · L_BCE(Ê_k^n, E^n) (8)
where E denotes the ground-truth answer edge map, obtained by first applying the Laplacian operator to the ground-truth mask M to obtain a soft edge map and then thresholding it into a binary edge map.
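Deriving the ground-truth edge map can be sketched as follows; the Laplacian kernel and the threshold value are assumptions.

```python
# Ground-truth edge map for the edge loss: Laplacian of the binary mask, then thresholding.
import torch
import torch.nn.functional as F

LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def gt_edge_map(gt_mask: torch.Tensor, thresh: float = 0.1) -> torch.Tensor:
    """gt_mask: (B, 1, H, W) binary ground-truth mask -> binary edge map of the same shape."""
    soft_edge = F.conv2d(gt_mask.float(), LAPLACIAN.to(gt_mask.device), padding=1)
    return (soft_edge.abs() > thresh).float()
```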
Step 4.3: Budget constraint. The QGDR module is optimized by the edge loss of step 4.2, but training then tends to converge to a suboptimal solution in which every instance is segmented with the maximum-resolution mask, because that mask contains the most detail and yields the smallest prediction loss. In practice, experiments show that not all samples need the largest mask for segmentation. To avoid this, improve model efficiency, and reduce computation, the invention trains QGDR with a budget constraint. Specifically, let C denote the computational cost corresponding to the selected mask resolution; when the expected cost E(C) computed over the current batch exceeds the target budget C_t, a penalty term is added to the model.
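Since the exact penalty formula is not given in the text above, the sketch below uses one plausible hinge-style form as an assumption: the batch is penalized only when its expected cost exceeds the budget.

```python
# One plausible form of the budget penalty described in step 4.3 (an assumption).
import torch

def budget_loss(h: torch.Tensor, costs: torch.Tensor, target: float) -> torch.Tensor:
    """h: (B, K) resolution switch states; costs: (K,) per-resolution compute cost;
    target: budget C_t."""
    expected_cost = (h * costs.unsqueeze(0)).sum(dim=1).mean()   # E(C) over the batch
    return torch.clamp(expected_cost - target, min=0.0)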
Step 4.4: The overall objective function of the resulting grounding-answer branch is
L_ground = L_mask + λ_1·L_edge + λ_2·L_budget
where λ_1 and λ_2 are trade-off hyperparameters. Finally, the question features and the visual features are combined by element-wise product and classified by a Softmax function; the text-answer branch built on PWAM is trained with the text answers and a binary cross-entropy loss.
Further, the step 5 specifically includes:
and (3) loading the model best trained in the step (4), inputting the images and the corresponding questions thereof into the model, and outputting answers and corresponding evaluation indexes.
The invention has the following beneficial effects: the grounding visual question-answering method based on dynamic two-stage visual information fusion constructs a multi-level information flow from pixel-level features to region-level features, promoting the aggregation of complementary information across levels. Specifically, the invention provides a question-guided dynamic region-level module that effectively localizes region-level objects according to the question and dynamically selects masks of different resolutions, realizing multi-scale fusion of language-guided object-level features. In addition, the invention provides a cross-modal multi-scale fusion module that, under language guidance, adaptively aggregates pixel-level information and region-level content, realizing high-quality interaction and fusion of multi-modal information from different levels and effectively improving the accuracy of the whole model.
Drawings
Fig. 1 is a diagram of a grounded visual question-answering network framework based on dynamic two-stage visual information fusion.
Detailed Description
The embodiment of the invention is implemented on the premise of the technical scheme of the invention, and a detailed implementation mode and a specific operation process are provided, but the protection scope of the invention is not limited to the following embodiment.
The invention provides a grounding visual question-answering method based on dynamic two-stage visual information fusion. A grounding visual question-answering system is built on a two-level multi-scale network: the visual information is split into language-guided pixel-level features and language-guided region-level features, and the two branches are combined for the final text-answer and grounding-answer prediction. A question-guided dynamic region-level feature localization network is proposed that adaptively assigns masks of different sizes to the grounding answer through question-guided localization of the visual information, improving the localization and segmentation accuracy of small targets. In addition, a cross-modal aggregation module is designed to fuse the features of the two levels, strengthening the fusion between pixel-level and region-level features and improving the segmentation quality of the grounding-answer mask edges. The grounding visual question-answering system built on this language-guided adaptive two-level feature fusion network can generate an answer grounding mask while answering the question, effectively improving the accuracy of the whole model.
Example 1
In this embodiment, a Windows system is used as the development environment, PyCharm as the development platform, and Python as the development language; the grounding visual question-answering method based on dynamic two-stage visual information fusion is used to predict grounding answers for pictures taken by visually impaired users and the related questions.
In this embodiment, the grounding visual question-answering method based on dynamic two-stage visual information fusion comprises the following steps:
step 1: loading pre-training weights of a Swin-Transformer and a BERT encoder in a DDVT network into a grounding visual question-answering network shown in figure 1;
step 2: inputting the 'image-question-ground answer' pair in the training set into the ground visual question-answer network of the step 1 for training;
step 3: and (3) taking the required image and the corresponding problem as input, and loading the network model which is trained and stored in the step (2) to obtain the corresponding grounding answer and the corresponding evaluation index. The present invention uses the cross-over ratio, i.e., the overlap area between the model predictive cut and the label divided by the joint area between the predictive cut and the label, as an evaluation index. Its calculation mode can be represented by formula (16), in which S i And S is u Representing the predictive segmentation answer and the true label answer, respectively.
Following the above steps, the proposed method is compared with the LXMERT, MAC-Caps, UNIFIED, DDVT, MCAN, and other models. As can be seen from Table 1, the accuracy of the proposed method is substantially better than that of the other methods on both common test sets.
Table 1 Performance comparison of the models on the VizWizGroundVQA validation set and the VQS test set
The foregoing descriptions of specific exemplary embodiments of the present invention are presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain the specific principles of the invention and its practical application to thereby enable one skilled in the art to make and utilize the invention in various exemplary embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.

Claims (7)

1. A grounding visual question-answering method based on dynamic two-stage visual information fusion, characterized by comprising the following steps:
step 1: localizing and segmenting the grounding answer by a question-guided region-level dynamic multi-scale method, and designing a language-guided region-level feature module QGDR composed of a cross-attention module and a spatial attention module to obtain region-level mask prediction features F_i ∈ {F_t, F_s, F_m, F_l} ordered from low to high resolution, wherein F_t, F_s, F_m, F_l are four levels of the region-feature hierarchy and the spatial resolution doubles layer by layer from F_t to F_l;
step 2: adaptively allocating a mask of suitable resolution to each localized object by a dynamic method, and limiting resource consumption with a budget constraint; the QGDR outputs four different switch states corresponding to four different mask resolutions, namely [14×14, 28×28, 56×56, 112×112];
step 3: designing a cross-modal multi-scale fusion module FPA, and performing multi-scale aggregation on the features output by the language-guided pixel-level feature module PWAM and the language-guided region-level feature module QGDR;
step 4: constructing information flow between each level of the language-guided pixel-level feature module PWAM and each level of the language-guided region-level feature module QGDR, and performing hierarchical progressive decoding, an image segmentation decoder finally producing the grounding answer and a text decoder producing the text answer; and training the grounding visual question-answering model consisting of the two-level feature branches with mask loss, edge loss, a budget constraint, and text loss;
step 5: loading the model of step 4, and inputting the required image and its corresponding question into the trained grounding visual question-answering model to obtain the corresponding grounding answer and text answer.
2. The grounding visual question-answering method based on dynamic two-stage visual information fusion according to claim 1, wherein the question-guided region-level dynamic multi-scale method in step 1 specifically comprises:
step 1.1: first, average-pooling the ROI-aligned region features Z_i extracted by the Swin-Transformer to obtain Z̄_i, combining them with the question features K_i extracted by BERT, and feeding Z̄_i and K_i into a cross-modal attention module, wherein T denotes transposition; after two linear transformations, the attention is computed as
Q_i = softmax( Z̄_i K̃_i^T / √d_i ) (1)
wherein Q_i denotes the attention weight, d_i denotes the length of the Z̄_i and K̃_i vectors, and K̃_i denotes the question-feature vector produced by the linear transformation;
step 1.2: performing a global pooling operation on the obtained Q_i to obtain an information weight, feeding it into the attention module SE-block to re-weight the different channels of the visual information for screening, and then classifying with several convolutional and fully connected layers to obtain region-level mask prediction features F_i of different sizes; in this computation the pooled weight is flattened (the Flatten operation) and passed through the SE-block operation F_ex with weights w before the classification layers;
F_ex is computed as
F_ex(z, w) = δ( w_2 ρ( w_1 z ) ) (3)
wherein δ denotes the sigmoid function, ρ denotes the ReLU function, and w_1 and w_2 denote the weight matrices.
3. The grounding visual question-answering method based on dynamic two-stage visual information fusion according to claim 1, wherein adaptively allocating a mask of suitable resolution to each localized object by the dynamic method in step 2 specifically comprises:
QGDR is a lightweight classifier that selects the best mask resolution from the k localized candidate objects of different scales; QGDR divides F_i into a four-level region-feature hierarchy F_t, F_s, F_m, F_l, whose spatial resolution doubles layer by layer from F_t to F_l, and outputs a probability vector ε = [ε_1, …, ε_k] through a softmax operation; each element of the probability vector represents the probability that the corresponding candidate resolution is selected; the soft output ε of QGDR is converted into a one-hot prediction H = [h_1, …, h_k]; this process is done by discrete sampling, and gradient back-propagation with Gumbel-Softmax is then used to update QGDR, specifically as
h_k = exp((log ε_k + g_k)/τ) / Σ_{k'=1..K} exp((log ε_{k'} + g_{k'})/τ) (4)
wherein τ is a temperature parameter, Gumbel-Softmax approaches a one-hot vector when τ approaches 0, g_k denotes noise drawn from the Gumbel distribution, and ε_{k'} runs over the k discrete probabilities.
4. The grounding visual question-answering method based on dynamic two-stage visual information fusion according to claim 1, wherein step 3 specifically comprises:
step 3.1: processing the two modalities, picture and question, with the language-guided pixel-level feature module PWAM and the language-guided region-level feature module QGDR to obtain cross-modal fused features P_i ∈ R^{C×H×W} and F_i ∈ R^{C×H×W}, and then aggregating the outputs of the two modules at multiple scales; designing a cross-modal multi-scale fusion module FPA that adaptively aggregates multi-scale features, the FPA comprising a deformable convolution and a dynamic convolution; first up-sampling F_i by deconvolution Deconv, concatenating the up-sampled F_i with P_i, and passing the concatenated features through a 3×3 convolution to obtain an offset map Δo; finally, aligning F_i to P_i with the learned offset, the output F_i of QGDR being adjusted by the deformable convolution dconv1 so as to align with the output P_i of PWAM, specifically as
O_i = Φ[conv(ρ(F_i) ∥ P_i)] (5)
wherein ρ denotes the Deconv operation, Φ denotes the dconv1 (deformable convolution) operation, and ∥ denotes the concatenation operation;
step 3.2: adding the deformably convolved O_i and P_i, restoring the output channel dimension to C with a 1×1 convolution, and finally applying a conditional convolution CondConv, the cross-modal multi-scale fusion module FPA being inserted at different stages of Swin-Transformer decoding, specifically as
Y_i = ψ(conv_{1×1}(O_i + P_i)) (6)
wherein Y_i denotes the aggregated region feature and ψ denotes the CondConv operation.
5. The grounding visual question-answering method based on dynamic two-stage visual information fusion according to claim 4, wherein the mask loss in step 4 is specifically: given one VQA instance, QGDR first predicts the mask switch states H = [h_1, …, h_k] of its different resolutions, and the fusion of the FPA modules is passed to the different stages of the decoder to obtain a group of K mask predictions M̂ = [M̂_1, …, M̂_K]; the mask loss function is defined as
L_mask = (1/N) Σ_{n=1..N} Σ_{k=1..K} h_k^n · L_BCE(M̂_k^n, M^n) (7)
wherein N denotes the number of instances, M̂_k denotes the k-th predicted grounding-answer mask, M denotes the ground-truth grounding-answer mask, h_k indicates whether the k-th mask resolution is selected as the output resolution, and L_BCE denotes the binary cross-entropy loss.
6. The grounding visual question-answering method based on dynamic two-stage visual information fusion according to claim 5, wherein the edge loss in step 4 is specifically: the edge loss is used to measure mask quality; given the output H = [h_1, …, h_k] of QGDR and the edge maps Ê_k of the different resolutions, the edge loss is expressed as
L_edge = (1/N) Σ_{n=1..N} Σ_{k=1..K} h_k^n · L_BCE(Ê_k^n, E^n) (8)
wherein E denotes the ground-truth answer edge map, obtained by first applying the Laplacian operator to the ground-truth mask M to obtain a soft edge map and then thresholding it into a binary edge map.
7. The grounding visual question-answering method based on dynamic two-stage visual information fusion according to claim 6, wherein the budget constraint and the text loss in step 4 are specifically: training QGDR with a budget constraint, specifically, letting C denote the computational cost corresponding to the selected mask resolution, and adding a penalty term L_budget to the model when the expected cost E(C) computed over the current batch exceeds the target budget C_t;
the overall objective function of the resulting grounding-answer branch is
L_ground = L_mask + λ_1·L_edge + λ_2·L_budget
wherein λ_1 and λ_2 are trade-off hyperparameters;
finally, the question features and the visual features are combined by element-wise product, classified by a Softmax function, and the text-answer branch of PWAM is trained with the text answers and a binary cross-entropy loss.
CN202311428263.8A 2023-10-31 2023-10-31 Grounding visual question-answering method based on dynamic two-stage visual information fusion Pending CN117422978A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311428263.8A CN117422978A (en) 2023-10-31 2023-10-31 Grounding visual question-answering method based on dynamic two-stage visual information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311428263.8A CN117422978A (en) 2023-10-31 2023-10-31 Grounding visual question-answering method based on dynamic two-stage visual information fusion

Publications (1)

Publication Number Publication Date
CN117422978A true CN117422978A (en) 2024-01-19

Family

ID=89524446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311428263.8A Pending CN117422978A (en) 2023-10-31 2023-10-31 Grounding visual question-answering method based on dynamic two-stage visual information fusion

Country Status (1)

Country Link
CN (1) CN117422978A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118093837A (en) * 2024-04-23 2024-05-28 豫章师范学院 Psychological support question-answering text generation method and system based on transform double decoding structure


Similar Documents

Publication Publication Date Title
Anderson et al. Bottom-up and top-down attention for image captioning and visual question answering
AU2019200270B2 (en) Concept mask: large-scale segmentation from semantic concepts
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
Xiao et al. Heterogeneous knowledge distillation for simultaneous infrared-visible image fusion and super-resolution
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN115496928B (en) Multi-modal image feature matching method based on multi-feature matching
CN112348036A (en) Self-adaptive target detection method based on lightweight residual learning and deconvolution cascade
CN110033054B (en) Personalized handwriting migration method and system based on collaborative stroke optimization
CN112634296A (en) RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism
Yin et al. End-to-end face parsing via interlinked convolutional neural networks
CN110503052A (en) A kind of image, semantic dividing method based on improvement U-NET network
CN116309648A (en) Medical image segmentation model construction method based on multi-attention fusion
Zhang et al. Global context aware RCNN for object detection
CN116229056A (en) Semantic segmentation method, device and equipment based on double-branch feature fusion
CN116645592B (en) Crack detection method based on image processing and storage medium
CN117422978A (en) Grounding visual question-answering method based on dynamic two-stage visual information fusion
CN116758130A (en) Monocular depth prediction method based on multipath feature extraction and multi-scale feature fusion
CN112037239B (en) Text guidance image segmentation method based on multi-level explicit relation selection
CN114445620A (en) Target segmentation method for improving Mask R-CNN
Wu et al. STR transformer: a cross-domain transformer for scene text recognition
Lv et al. An inverted residual based lightweight network for object detection in sweeping robots
Zheng et al. Feature pyramid of bi-directional stepped concatenation for small object detection
Li et al. Maskformer with improved encoder-decoder module for semantic segmentation of fine-resolution remote sensing images
Yao et al. SSNet: A novel transformer and CNN hybrid network for remote sensing semantic segmentation
Liu et al. Dunhuang murals contour generation network based on convolution and self-attention fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination