CN117422978A - Grounding visual question-answering method based on dynamic two-stage visual information fusion - Google Patents

Grounding visual question-answering method based on dynamic two-stage visual information fusion

Info

Publication number
CN117422978A
CN117422978A (application CN202311428263.8A)
Authority
CN
China
Prior art keywords
question
grounding
mask
visual
answer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311428263.8A
Other languages
Chinese (zh)
Inventor
周东生
张悦
樊万姝
车超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University
Original Assignee
Dalian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University filed Critical Dalian University
Priority to CN202311428263.8A priority Critical patent/CN117422978A/en
Publication of CN117422978A publication Critical patent/CN117422978A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/86 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using syntactic or structural representations of the image or video pattern, e.g. symbolic string recognition; using graph matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 80/00 ICT specially adapted for facilitating communication between medical practitioners or patients, e.g. for collaborative diagnosis, therapy or health monitoring

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Human Computer Interaction (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a grounding visual question-answering method based on dynamic two-stage visual information fusion. A grounding visual question-answering system is built on a two-level multi-scale network: the visual information is split into language-guided pixel-level features and language-guided region-level features, and the two branches are combined for the final text-answer and grounding-answer prediction. A question-guided dynamic region-level feature localization network is proposed that adaptively assigns masks of different sizes to the grounding answer through question-guided localization of the visual information, improving the localization and segmentation accuracy of small targets. A cross-modal aggregation module is further designed to fuse the features of the two levels, strengthening the fusion between pixel-level and region-level features and improving the segmentation quality of the grounding-answer mask edges. The grounding visual question-answering system built on this language-guided adaptive two-level feature fusion network can generate an answer grounding mask while answering the question, effectively improving the accuracy of the whole model.

Description

Grounding visual question-answering method based on dynamic two-stage visual information fusion
Technical Field
The invention belongs to the technical field of computer vision and natural language processing, and particularly relates to a grounding visual question-answering method based on dynamic two-stage visual information fusion.
Background
In recent years, visual question answering (VQA) technology has developed rapidly and found more and more practical applications, such as answering questions posed by visually impaired users, helping radiologists diagnose serious diseases early, and human-computer interaction. As these systems mature, a system that not only produces a good answer but also grounds that answer in evidence becomes increasingly important for research and applications. Exposing the reasoning of the model can, to some extent, provide interpretable support for the answer. An ideal VQA system for such purposes should therefore not only generate accurate answers but also provide a mechanism to verify them.
However, conventional VQA systems usually output only the final text answer and lack verifiable visual evidence. Recent work has tried to address this. For example, the MAC-Caps method (capsule-based weakly supervised grounded visual question answering) produces a visual attention map alongside the text answer so that the accuracy of the answer localization can be better assessed. Similar approaches such as LXMERT (a Transformer-based cross-modal encoder) and DCAMN (a dual capsule attention mask network with mutual learning for visual question answering) also output the grounded answer region in the corresponding image while generating the text answer. These methods, however, typically output question-related attention maps or bounding boxes to reveal the grounded regions. If a grounded image answer mask is provided in response to the visual question, whether the obtained answer is convincing can be verified directly, which makes the VQA system more reliable. From an application perspective, an image grounding mask also enables further uses: for a question posed by a visually impaired user, the relevant content can be segmented from the background and the background blurred to protect privacy, or the relevant visual region can be enlarged so that low-vision users can find the desired information more quickly.
The answer grounding task was therefore proposed. Unlike the conventional VQA task, it starts from the practical needs of visually impaired users and requires the system to output a mask of the visual region corresponding to the answer while producing the text answer. For this task, DAVI (answer grounding based on dual visual-language interactions) combines two large pre-trained models, BLIP (Bootstrapping Language-Image Pre-training for unified vision-language understanding and generation) and VIT (a multi-modal framework based on vision and language), with two encoders and two decoders, coupling a text-image segmentation model with a vision-to-language generation model; in effect, however, it still splits the two interrelated tasks of generating the text answer and outputting the grounding mask into two independent tasks. The recently published DDTN (grounded visual answering with a dual-decoder Transformer network) does not rely on a large-scale pre-trained model, but its segmentation quality is also much lower than that of DAVI.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a grounding visual question-answering method based on dynamic two-stage visual information fusion, which achieves a better segmentation effect without relying on a large-scale pre-trained model, outputs the two answer modalities with a single encoder and a single decoder, and better realizes the interaction between the two modalities.
In order to achieve the above purpose, the technical scheme of the invention is as follows:
a grounding visual question-answering method based on dynamic two-stage visual information fusion comprises the following steps:
step 1: as shown in FIG. 1, the present invention employs a question-guided region-level dynamic multi-scale approach to localization and segmentation of ground answers, designing a language-guided region-level feature module QGDR consisting of a cross-attention module and a spatial annotatorThe intention module is composed to finally obtain the regional mask prediction characteristic F with the resolution ratio from small to large i ∈F t ,F s ,F m ,F l The method comprises the steps of carrying out a first treatment on the surface of the Wherein F is t ,F s ,F m ,F l For four types of regional feature hierarchies, from F t To F l The spatial resolution increases by two times layer by layer;
step 2: and meanwhile, in order to reduce the calculation cost and maintain the performance, a mask with proper resolution is adaptively allocated to each positioning object by adopting a dynamic method, and budget limitation is carried out on resource consumption. The QGDR output has four different switch states corresponding to four different mask resolutions, [14×14, 28×28, 56×56, 112×112];
step 3: in order to better fuse the two levels of characteristics, a cross-mode multi-scale fusion module FPA is also designed to output the characteristics F of the language-guided pixel-level characteristic module PWAM and the language-guided regional-level characteristic module QGDR i And P i Polymerizing;
step 4: constructing information flow between each level of the language-guided pixel level feature module PWAM and each level of the language-guided regional level feature module QGDR, performing layered progressive decoding, and finally obtaining a grounding answer by an image segmentation decoder and a text answer by a text decoder; training a grounding visual question-answering model consisting of two-stage characteristic branches by adopting mask loss, edge loss, budget constraint and text loss;
step 5: and (3) loading the model in the step (4), and inputting the required image and the corresponding question into the trained grounding visual question-answering model to obtain a corresponding grounding answer and a text answer.
Based on this scheme, the method uses multi-scale information fusion, so visual information at different scales can be better understood and processed, which helps the understanding and localization of complex scenes and thus improves question-answering accuracy. Adaptive resolution mask assignment dynamically allocates a mask of suitable resolution according to the needs of each localized object, improving resource efficiency while retaining high-resolution processing for key regions. The cross-modal multi-scale fusion module aggregates the language-guided pixel-level and region-level features at multiple scales, so text and image information are combined more effectively, improving question understanding and answer generation. Hierarchical progressive decoding, from pixel-level features to region-level features and then to the final answers, better captures detail in the image and relates it to the question, further improving accuracy. Multiple loss functions, including mask loss, edge loss, a budget constraint, and text loss, jointly consider objectives of different aspects, so the model is trained better and performs better. The method can be applied to grounded visual question answering, offering an efficient and accurate way for machines to understand images and answer questions, with potential applications in fields such as autonomous driving, medical image analysis, and image retrieval.
Further, the step 1 specifically includes:
step 1.1: first, ROI-aligned region feature Z is extracted from a swin-transducer i Carrying out average pooling to obtainRe-combining problem features K extracted from BERT i . Will->And K i Input into cross-model intent, this step can be thought of as injecting the word attention in the question into different visual channels to guide visual localization, facilitating multimodal information complementation enhancement. Wherein T represents transposition operation, and the specific formula is as follows after two linear transformations:
wherein Q is i Represents an attention weight; d, d i Representation ofAnd->The length of the vector; />A vector representing the problem feature generated by linear transformation;
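For illustration only, a rough PyTorch sketch of this question-guided cross-attention is given below; the module name, feature dimensions, and the exact placement of the two linear transformations are assumptions rather than the patent's implementation.

```python
# Illustrative sketch of the question-guided cross-attention of step 1.1.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedCrossAttention(nn.Module):
    def __init__(self, vis_dim: int, txt_dim: int, dim: int):
        super().__init__()
        self.proj_vis = nn.Linear(vis_dim, dim)   # linear transform of pooled region features
        self.proj_txt = nn.Linear(txt_dim, dim)   # linear transform of BERT question features
        self.dim = dim

    def forward(self, z_bar: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
        # z_bar: (B, N_regions, vis_dim) average-pooled ROI-aligned region features
        # k:     (B, N_words, txt_dim) question token features from BERT
        z_t = self.proj_vis(z_bar)                         # (B, N_regions, dim)
        k_t = self.proj_txt(k)                             # (B, N_words, dim)
        attn = torch.matmul(z_t, k_t.transpose(1, 2))      # (B, N_regions, N_words)
        attn = F.softmax(attn / self.dim ** 0.5, dim=-1)   # word attention per region
        # inject word attention back into the visual channels
        return torch.matmul(attn, k_t)                     # (B, N_regions, dim)
```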
step 1.2: for the obtained Q i Performing global pooling operation to obtain information weightIs fed into the attention module SE-block to weight the different channels of visual information for screening. Then classifying by using a plurality of convolution and full-connected layers to obtain regional mask prediction features F with different sizes i . The specific formula is as follows:
in the method, in the process of the invention,representing the Flatehen operation, F ex Representing operations in the SE-block module, w representing weights; f (F) ex The specific operation formula is as follows:
where delta represents a sigmoid function, ρ represents a ReLU function,and->Representing the weight matrix dimension.
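A minimal sketch of the SE-style channel re-weighting used in this step follows; the reduction ratio is an assumption, and the convolution/fully-connected classification head that produces F_t to F_l is omitted for brevity.

```python
# SE-style channel re-weighting (squeeze-and-excitation) as described in step 1.2.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)         # global pooling -> information weight
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),                  # rho
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                           # delta
        )

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        # q: (B, C, H, W) question-attended visual feature Q_i
        b, c, _, _ = q.shape
        w = self.fc(self.pool(q).flatten(1)).view(b, c, 1, 1)
        return q * w                                # channel-wise screening
```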
Further, the step 2 specifically includes:
the QGDR module is effectively a lightweight classifier aimed at selecting the best mask resolution from among the k different scale candidates located, and accurately locating and segmenting the ground answer at the minimum resource cost. QGDR will F i Hierarchical structure F of area features divided into four types t ,F s ,F m ,F l From F t To F l The spatial resolution is incremented by a factor of two layer by layer. And outputs a probability vector epsilon by performing softmax operation k =[ε 1 ,…,ε k ]. Each element of the probability vector represents a probability that the corresponding candidate resolution is selected. Soft output ε of QGDR k Should be converted into a single thermal prediction, expressed as h= [ H ] 1 ,…,h k ]. This process can be done by discrete sampling, followed by gradient back-propagation with Gumbel-Softmax to update QGDR. The specific formula is as follows:
wherein τ is a parameter; gumbel-softmax approaches unity fever when τ approaches 0. g i Representing gummel distribution; epsilon k′ Representing k' discrete probability vectors.
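The dynamic resolution selection can be sketched as follows with torch.nn.functional.gumbel_softmax; the candidate resolutions follow the patent, while the lightweight classifier head and feature layout are assumptions.

```python
# Sketch of the dynamic resolution selection of step 2 using Gumbel-Softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

RESOLUTIONS = [14, 28, 56, 112]                     # candidate mask resolutions

class ResolutionSelector(nn.Module):
    def __init__(self, feat_dim: int, k: int = len(RESOLUTIONS)):
        super().__init__()
        self.head = nn.Linear(feat_dim, k)          # lightweight classifier over k scales

    def forward(self, feat: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        logits = self.head(feat)                    # (B, k)
        # hard=True gives a one-hot "switch state" in the forward pass while the
        # backward pass uses the soft Gumbel-Softmax gradients
        return F.gumbel_softmax(logits, tau=tau, hard=True)   # (B, k)
```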
Further, the step 3 specifically includes:
step 3.1: picture and problem two-modality information gets transmembrane fusion features after language-guided pixel level feature module (PWAM) and language-guided regional level feature module (QGDR) processingAnd F i ∈R C ×H×W . The outputs of the two modules are then multi-scale aggregated. F due to upsampling and ROI pooling operations of the two modules i And P i There is a spatial misalignment between the two, and in order to enhance the segmentation performance of the boundary region, a cross-modal multi-scale fusion module FPA for adaptively aggregating multi-scale features is designed. The FPA includes as shown in FIG. 1A deformable convolution and a dynamic convolution. First F i Up-sampling by deconvolution (Deconv) and then adding F i And P i In series, the series of features is passed through a 3 x 3 conv to obtain an offset map, denoted Δo. Finally, F is calculated by the learned offset o i Alignment P i Adjusting the output F of QGDR by a deformable convolution form conv1 i To better match the PWAM output P i The specific formula of alignment is as follows:
O i =Φ[conv(ρ(F i )||P i )] (5)
where ρ represents a Deconv operation, Φ represents a Deconv 1 operation, and |is a join operation.
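A possible sketch of the offset prediction and deformable alignment is shown below, using torchvision.ops.DeformConv2d; it assumes the deconvolution brings F_i exactly to the spatial size of P_i, and the channel counts and kernel size are illustrative.

```python
# Offset prediction and deformable alignment as outlined in step 3.1.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableAlign(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.upsample = nn.ConvTranspose2d(channels, channels, 2, stride=2)   # Deconv (rho)
        # offset map: 2 offsets (x, y) per kernel position
        self.offset_conv = nn.Conv2d(2 * channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=kernel_size // 2)
        self.dconv1 = DeformConv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2)                   # Phi

    def forward(self, f: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
        # f: QGDR region-level feature F_i (lower resolution), p: PWAM feature P_i
        f_up = self.upsample(f)                                   # bring F_i to P_i's size
        offset = self.offset_conv(torch.cat([f_up, p], dim=1))    # Delta o from concat
        return self.dconv1(f_up, offset)                          # aligned O_i
```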
Step 3.2: The deformably convolved O_i and P_i are added, and a 1×1 convolution restores the output channel dimension to C. Finally, a conditional convolution (CondConv), which behaves similarly to an attention mechanism and focuses more on the salient parts of the object, is applied. The cross-modal multi-scale fusion module FPA is inserted at different stages of Swin-Transformer decoding and plays a key role in improving grounding-answer mask prediction:
Y_i = ψ(conv_{1×1}(O_i + P_i)) (6)
where Y_i denotes the aggregated region feature and ψ denotes the CondConv operation.
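The fusion of formula (6) might look like the following sketch; a genuine CondConv has more machinery (per-expert biases, larger kernels), so the dynamic convolution here is a simplified stand-in and all sizes are assumptions.

```python
# Simplified FPA fusion: add, 1x1 projection, then a CondConv-style dynamic convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCondConv(nn.Module):
    """Reduced dynamic convolution: per-sample 1x1 kernels are a softmax-weighted
    mixture of expert kernels (a stand-in for the patent's psi)."""
    def __init__(self, channels: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.Parameter(torch.randn(num_experts, channels, channels, 1, 1) * 0.01)
        self.route = nn.Linear(channels, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        r = F.softmax(self.route(x.mean(dim=(2, 3))), dim=-1)          # (B, E) routing weights
        kernels = torch.einsum('be,eoihw->boihw', r, self.experts)     # per-sample kernels
        x = x.reshape(1, b * c, h, w)
        kernels = kernels.reshape(b * c, c, 1, 1)
        out = F.conv2d(x, kernels, groups=b)                           # grouped per-sample conv
        return out.reshape(b, c, h, w)

class FPAFusion(nn.Module):
    """Fusion of Eq. (6): add aligned O_i and P_i, 1x1 conv back to C channels, dynamic conv."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.cond = SimpleCondConv(channels)

    def forward(self, o: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
        return self.cond(self.proj(o + p))                             # Y_i
```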
Further, the step 4 specifically includes:
the QGDR dynamically locates the ground answers in the images by the language guide image and provides ground answer masks that assign different resolutions to different aggregation stages. The cost of computing resources is reduced while ensuring accuracy, so that three loss functions are adopted for training the dynamic multi-scale module.
Step 4.1: Mask loss. Given one VQA instance, QGDR first predicts the mask switch states of the different resolutions, H = [h_1, …, h_k], and the fusion of the FPA modules is passed to the different stages of the decoder to obtain a group of K mask predictions M̂ = [M̂_1, …, M̂_K]. The mask loss function is defined as
L_mask = (1/N) Σ_{n=1..N} Σ_{k=1..K} h_k^n · L_BCE(M̂_k^n, M^n) (7)
where N denotes the number of instances, M̂_k denotes the k-th predicted grounding-answer mask, M denotes the ground-truth grounding-answer mask, h_k indicates whether the k-th mask resolution is selected as the output resolution, and L_BCE denotes the binary cross-entropy loss.
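A hedged sketch of this mask loss is given below: binary cross-entropy on each candidate-resolution prediction, gated by the switch state h. The tensor layouts and the nearest-neighbour resizing of the ground truth are assumptions.

```python
# Mask loss of Eq. (7): per-resolution BCE gated by the QGDR switch states.
import torch
import torch.nn.functional as F

def mask_loss(pred_masks, gt_mask, h):
    """pred_masks: list of K tensors (B, 1, Hk, Wk) of mask logits at each resolution
    gt_mask: (B, 1, H, W) binary ground-truth grounding mask
    h: (B, K) one-hot (or soft) resolution switch states from QGDR"""
    total = 0.0
    for k, m_hat in enumerate(pred_masks):
        gt_k = F.interpolate(gt_mask, size=m_hat.shape[-2:], mode='nearest')
        bce = F.binary_cross_entropy_with_logits(m_hat, gt_k, reduction='none')
        bce = bce.mean(dim=(1, 2, 3))                 # per-instance loss
        total = total + (h[:, k] * bce).mean()        # gate by switch state, average over batch
    return total
```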
Step 4.2: Edge loss. For the masks dynamically selected by QGDR, the mask loss is usually taken as the measure of mask quality, but the mask losses produced at different resolutions are in fact very close, making it hard to distinguish mask quality. In contrast, the edge losses produced by masks of different resolutions differ more and reflect mask quality better, so the invention uses the edge loss to measure mask quality. Given the output H = [h_1, …, h_k] of QGDR and the edge maps Ê_k of the different resolutions, the edge loss is expressed as
L_edge = (1/N) Σ_{n=1..N} Σ_{k=1..K} h_k^n · L_BCE(Ê_k^n, E^n) (8)
where E denotes the ground-truth answer edge map, obtained by first applying the Laplacian operator to the ground-truth mask M to obtain a soft edge map and then thresholding it into a binary edge map.
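Deriving the ground-truth edge map can be sketched as follows; the Laplacian kernel and the threshold value are assumptions.

```python
# Ground-truth edge map for the edge loss: Laplacian of the binary mask, then thresholding.
import torch
import torch.nn.functional as F

LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def gt_edge_map(gt_mask: torch.Tensor, thresh: float = 0.1) -> torch.Tensor:
    """gt_mask: (B, 1, H, W) binary ground-truth mask -> binary edge map of the same shape."""
    soft_edge = F.conv2d(gt_mask.float(), LAPLACIAN.to(gt_mask.device), padding=1)
    return (soft_edge.abs() > thresh).float()
```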
Step 4.3: Budget constraint. The QGDR module is optimized by the edge loss of step 4.2, but training then tends to converge to a suboptimal solution in which every instance is segmented with the maximum-resolution mask, because that mask contains the most detail and yields the smallest prediction loss. In practice, experiments show that not all samples need the largest mask for segmentation. To avoid this, improve model efficiency, and reduce computation, the invention trains QGDR with a budget constraint. Specifically, let C denote the computational cost corresponding to the selected mask resolution; when the expected cost E(C) computed over the current batch exceeds the target budget C_t, a penalty term is added to the model.
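Since the exact penalty formula is not given in the text above, the sketch below uses one plausible hinge-style form as an assumption: the batch is penalized only when its expected cost exceeds the budget.

```python
# One plausible form of the budget penalty described in step 4.3 (an assumption).
import torch

def budget_loss(h: torch.Tensor, costs: torch.Tensor, target: float) -> torch.Tensor:
    """h: (B, K) resolution switch states; costs: (K,) per-resolution compute cost;
    target: budget C_t."""
    expected_cost = (h * costs.unsqueeze(0)).sum(dim=1).mean()   # E(C) over the batch
    return torch.clamp(expected_cost - target, min=0.0)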
Step 4.4: The overall objective function of the resulting grounding-answer branch is
L_ground = L_mask + λ_1·L_edge + λ_2·L_budget
where λ_1 and λ_2 are trade-off hyperparameters. Finally, the question features and the visual features are combined by element-wise product and classified by a Softmax function; the text-answer branch built on PWAM is trained with the text answers and a binary cross-entropy loss.
Further, the step 5 specifically includes:
and (3) loading the model best trained in the step (4), inputting the images and the corresponding questions thereof into the model, and outputting answers and corresponding evaluation indexes.
The invention has the following beneficial effects: the grounding visual question-answering method based on dynamic two-stage visual information fusion constructs a multi-level information flow from pixel-level features to region-level features, promoting the aggregation of complementary information across levels. Specifically, the invention provides a question-guided dynamic region-level module that effectively localizes region-level objects according to the question and dynamically selects masks of different resolutions, realizing multi-scale fusion of language-guided object-level features. In addition, the invention provides a cross-modal multi-scale fusion module that, under language guidance, adaptively aggregates pixel-level information and region-level content, realizing high-quality interaction and fusion of multi-modal information from different levels and effectively improving the accuracy of the whole model.
Drawings
Fig. 1 is a diagram of a grounded visual question-answering network framework based on dynamic two-stage visual information fusion.
Detailed Description
The embodiment of the invention is implemented on the premise of the technical scheme of the invention, and a detailed implementation mode and a specific operation process are provided, but the protection scope of the invention is not limited to the following embodiment.
The invention provides a grounding visual question-answering method based on dynamic two-stage visual information fusion. A grounding visual question-answering system is built on a two-level multi-scale network: the visual information is split into language-guided pixel-level features and language-guided region-level features, and the two branches are combined for the final text-answer and grounding-answer prediction. A question-guided dynamic region-level feature localization network is proposed that adaptively assigns masks of different sizes to the grounding answer through question-guided localization of the visual information, improving the localization and segmentation accuracy of small targets. In addition, a cross-modal aggregation module is designed to fuse the features of the two levels, strengthening the fusion between pixel-level and region-level features and improving the segmentation quality of the grounding-answer mask edges. The grounding visual question-answering system built on this language-guided adaptive two-level feature fusion network can generate an answer grounding mask while answering the question, effectively improving the accuracy of the whole model.
Example 1
In this embodiment, a Windows system is used as the development environment, PyCharm as the development platform, and Python as the development language; the grounding visual question-answering method based on dynamic two-stage visual information fusion is used to predict grounding answers for pictures taken by visually impaired users and the related questions.
In this embodiment, the grounding visual question-answering method based on dynamic two-stage visual information fusion comprises the following steps:
step 1: loading pre-training weights of a Swin-Transformer and a BERT encoder in a DDVT network into a grounding visual question-answering network shown in figure 1;
step 2: inputting the 'image-question-ground answer' pair in the training set into the ground visual question-answer network of the step 1 for training;
step 3: and (3) taking the required image and the corresponding problem as input, and loading the network model which is trained and stored in the step (2) to obtain the corresponding grounding answer and the corresponding evaluation index. The present invention uses the cross-over ratio, i.e., the overlap area between the model predictive cut and the label divided by the joint area between the predictive cut and the label, as an evaluation index. Its calculation mode can be represented by formula (16), in which S i And S is u Representing the predictive segmentation answer and the true label answer, respectively.
Following the above steps, the proposed method is compared with the LXMERT, MAC-Caps, UNIFIED, DDVT, MCAN, and other models. As can be seen from Table 1, the accuracy of the proposed method is substantially better than that of the other methods on both common test sets.
Table 1 Performance comparison of the models on the VizWizGroundVQA validation set and the VQS test set
The foregoing descriptions of specific exemplary embodiments of the present invention are presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain the specific principles of the invention and its practical application to thereby enable one skilled in the art to make and utilize the invention in various exemplary embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.

Claims (7)

1. A grounding visual question-answering method based on dynamic two-stage visual information fusion, characterized by comprising the following steps:
step 1: localizing and segmenting the grounding answer by a question-guided region-level dynamic multi-scale method, and designing a language-guided region-level feature module QGDR composed of a cross-attention module and a spatial attention module to obtain region-level mask prediction features F_i ∈ {F_t, F_s, F_m, F_l} ordered from low to high resolution, wherein F_t, F_s, F_m, F_l are four levels of the region-feature hierarchy and the spatial resolution doubles layer by layer from F_t to F_l;
step 2: adaptively allocating a mask of suitable resolution to each localized object by a dynamic method, and limiting resource consumption with a budget constraint; the QGDR outputs four different switch states corresponding to four different mask resolutions, namely [14×14, 28×28, 56×56, 112×112];
step 3: designing a cross-modal multi-scale fusion module FPA, and performing multi-scale aggregation on the features output by the language-guided pixel-level feature module PWAM and the language-guided region-level feature module QGDR;
step 4: constructing information flow between each level of the language-guided pixel-level feature module PWAM and each level of the language-guided region-level feature module QGDR, and performing hierarchical progressive decoding, an image segmentation decoder finally producing the grounding answer and a text decoder producing the text answer; and training the grounding visual question-answering model consisting of the two-level feature branches with mask loss, edge loss, a budget constraint, and text loss;
step 5: loading the model of step 4, and inputting the required image and its corresponding question into the trained grounding visual question-answering model to obtain the corresponding grounding answer and text answer.
2. The grounding visual question-answering method based on dynamic two-stage visual information fusion according to claim 1, wherein the question-guided region-level dynamic multi-scale method in step 1 specifically comprises:
step 1.1: first, average-pooling the ROI-aligned region features Z_i extracted by the Swin-Transformer to obtain Z̄_i, combining them with the question features K_i extracted by BERT, and feeding Z̄_i and K_i into a cross-modal attention module, wherein T denotes transposition; after two linear transformations, the attention is computed as
Q_i = softmax( Z̄_i K̃_i^T / √d_i ) (1)
wherein Q_i denotes the attention weight, d_i denotes the length of the Z̄_i and K̃_i vectors, and K̃_i denotes the question-feature vector produced by the linear transformation;
step 1.2: performing a global pooling operation on the obtained Q_i to obtain an information weight, feeding it into the attention module SE-block to re-weight the different channels of the visual information for screening, and then classifying with several convolutional and fully connected layers to obtain region-level mask prediction features F_i of different sizes; in this computation the pooled weight is flattened (the Flatten operation) and passed through the SE-block operation F_ex with weights w before the classification layers;
F_ex is computed as
F_ex(z, w) = δ( w_2 ρ( w_1 z ) ) (3)
wherein δ denotes the sigmoid function, ρ denotes the ReLU function, and w_1 and w_2 denote the weight matrices.
3. The grounding visual question-answering method based on dynamic two-stage visual information fusion according to claim 1, wherein adaptively allocating a mask of suitable resolution to each localized object by the dynamic method in step 2 specifically comprises:
QGDR is a lightweight classifier that selects the best mask resolution from the k localized candidate objects of different scales; QGDR divides F_i into a four-level region-feature hierarchy F_t, F_s, F_m, F_l, whose spatial resolution doubles layer by layer from F_t to F_l, and outputs a probability vector ε = [ε_1, …, ε_k] through a softmax operation; each element of the probability vector represents the probability that the corresponding candidate resolution is selected; the soft output ε of QGDR is converted into a one-hot prediction H = [h_1, …, h_k]; this process is done by discrete sampling, and gradient back-propagation with Gumbel-Softmax is then used to update QGDR, specifically as
h_k = exp((log ε_k + g_k)/τ) / Σ_{k'=1..K} exp((log ε_{k'} + g_{k'})/τ) (4)
wherein τ is a temperature parameter, Gumbel-Softmax approaches a one-hot vector when τ approaches 0, g_k denotes noise drawn from the Gumbel distribution, and ε_{k'} runs over the k discrete probabilities.
4. The grounding visual question-answering method based on dynamic two-stage visual information fusion according to claim 1, wherein step 3 specifically comprises:
step 3.1: processing the two modalities, picture and question, with the language-guided pixel-level feature module PWAM and the language-guided region-level feature module QGDR to obtain cross-modal fused features P_i ∈ R^{C×H×W} and F_i ∈ R^{C×H×W}, and then aggregating the outputs of the two modules at multiple scales; designing a cross-modal multi-scale fusion module FPA that adaptively aggregates multi-scale features, the FPA comprising a deformable convolution and a dynamic convolution; first up-sampling F_i by deconvolution Deconv, concatenating the up-sampled F_i with P_i, and passing the concatenated features through a 3×3 convolution to obtain an offset map Δo; finally, aligning F_i to P_i with the learned offset, the output F_i of QGDR being adjusted by the deformable convolution dconv1 so as to align with the output P_i of PWAM, specifically as
O_i = Φ[conv(ρ(F_i) ∥ P_i)] (5)
wherein ρ denotes the Deconv operation, Φ denotes the dconv1 (deformable convolution) operation, and ∥ denotes the concatenation operation;
step 3.2: adding the deformably convolved O_i and P_i, restoring the output channel dimension to C with a 1×1 convolution, and finally applying a conditional convolution CondConv, the cross-modal multi-scale fusion module FPA being inserted at different stages of Swin-Transformer decoding, specifically as
Y_i = ψ(conv_{1×1}(O_i + P_i)) (6)
wherein Y_i denotes the aggregated region feature and ψ denotes the CondConv operation.
5. The grounding visual question-answering method based on dynamic two-stage visual information fusion according to claim 4, wherein the mask loss in step 4 is specifically: given one VQA instance, QGDR first predicts the mask switch states H = [h_1, …, h_k] of its different resolutions, and the fusion of the FPA modules is passed to the different stages of the decoder to obtain a group of K mask predictions M̂ = [M̂_1, …, M̂_K]; the mask loss function is defined as
L_mask = (1/N) Σ_{n=1..N} Σ_{k=1..K} h_k^n · L_BCE(M̂_k^n, M^n) (7)
wherein N denotes the number of instances, M̂_k denotes the k-th predicted grounding-answer mask, M denotes the ground-truth grounding-answer mask, h_k indicates whether the k-th mask resolution is selected as the output resolution, and L_BCE denotes the binary cross-entropy loss.
6. The grounding visual question-answering method based on dynamic two-stage visual information fusion according to claim 5, wherein the edge loss in step 4 is specifically: the edge loss is used to measure mask quality; given the output H = [h_1, …, h_k] of QGDR and the edge maps Ê_k of the different resolutions, the edge loss is expressed as
L_edge = (1/N) Σ_{n=1..N} Σ_{k=1..K} h_k^n · L_BCE(Ê_k^n, E^n) (8)
wherein E denotes the ground-truth answer edge map, obtained by first applying the Laplacian operator to the ground-truth mask M to obtain a soft edge map and then thresholding it into a binary edge map.
7. The grounding visual question-answering method based on dynamic two-stage visual information fusion according to claim 6, wherein the budget constraint and the text loss in step 4 are specifically: training QGDR with a budget constraint, specifically, letting C denote the computational cost corresponding to the selected mask resolution, and adding a penalty term L_budget to the model when the expected cost E(C) computed over the current batch exceeds the target budget C_t;
the overall objective function of the resulting grounding-answer branch is
L_ground = L_mask + λ_1·L_edge + λ_2·L_budget
wherein λ_1 and λ_2 are trade-off hyperparameters;
finally, the question features and the visual features are combined by element-wise product, classified by a Softmax function, and the text-answer branch of PWAM is trained with the text answers and a binary cross-entropy loss.
CN202311428263.8A 2023-10-31 2023-10-31 Grounding visual question-answering method based on dynamic two-stage visual information fusion Pending CN117422978A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311428263.8A CN117422978A (en) 2023-10-31 2023-10-31 Grounding visual question-answering method based on dynamic two-stage visual information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311428263.8A CN117422978A (en) 2023-10-31 2023-10-31 Grounding visual question-answering method based on dynamic two-stage visual information fusion

Publications (1)

Publication Number Publication Date
CN117422978A true CN117422978A (en) 2024-01-19

Family

ID=89524446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311428263.8A Pending CN117422978A (en) 2023-10-31 2023-10-31 Grounding visual question-answering method based on dynamic two-stage visual information fusion

Country Status (1)

Country Link
CN (1) CN117422978A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118093837A (en) * 2024-04-23 2024-05-28 豫章师范学院 Psychological support question-answering text generation method and system based on transform double decoding structure


Similar Documents

Publication Publication Date Title
Anderson et al. Bottom-up and top-down attention for image captioning and visual question answering
AU2019200270B2 (en) Concept mask: large-scale segmentation from semantic concepts
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
Xiao et al. Heterogeneous knowledge distillation for simultaneous infrared-visible image fusion and super-resolution
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN115496928B (en) Multi-modal image feature matching method based on multi-feature matching
CN112348036A (en) Self-adaptive target detection method based on lightweight residual learning and deconvolution cascade
CN110033054B (en) Personalized handwriting migration method and system based on collaborative stroke optimization
CN112634296A (en) RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism
Yin et al. End-to-end face parsing via interlinked convolutional neural networks
CN110503052A (en) A kind of image, semantic dividing method based on improvement U-NET network
CN116309648A (en) Medical image segmentation model construction method based on multi-attention fusion
Zhang et al. Global context aware RCNN for object detection
CN116229056A (en) Semantic segmentation method, device and equipment based on double-branch feature fusion
CN116645592B (en) Crack detection method based on image processing and storage medium
CN117422978A (en) Grounding visual question-answering method based on dynamic two-stage visual information fusion
CN116758130A (en) Monocular depth prediction method based on multipath feature extraction and multi-scale feature fusion
CN112037239B (en) Text guidance image segmentation method based on multi-level explicit relation selection
CN114445620A (en) Target segmentation method for improving Mask R-CNN
Wu et al. STR transformer: a cross-domain transformer for scene text recognition
Lv et al. An inverted residual based lightweight network for object detection in sweeping robots
Zheng et al. Feature pyramid of bi-directional stepped concatenation for small object detection
Li et al. Maskformer with improved encoder-decoder module for semantic segmentation of fine-resolution remote sensing images
Yao et al. SSNet: A novel transformer and CNN hybrid network for remote sensing semantic segmentation
Liu et al. Dunhuang murals contour generation network based on convolution and self-attention fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination