CN113869349B - Schematic question-answering method based on hierarchical multi-task learning - Google Patents


Info

Publication number
CN113869349B
CN113869349B CN202110892487.9A
Authority
CN
China
Prior art keywords
question
sequence
answering
module
answer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN202110892487.9A
Other languages
Chinese (zh)
Other versions
CN113869349A (en)
Inventor
袁召全
彭潇
吴晓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Jiaotong University filed Critical Southwest Jiaotong University
Priority to CN202110892487.9A priority Critical patent/CN113869349B/en
Publication of CN113869349A publication Critical patent/CN113869349A/en
Application granted granted Critical
Publication of CN113869349B publication Critical patent/CN113869349B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/35 Discourse or dialogue representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of image question answering, and in particular to a schematic question-answering method based on hierarchical multi-task learning, comprising the following steps: S1, detect the image components of the training images with a pre-trained target detector, and perform position coding and visual feature extraction; S2, take the training-set images and the region feature sequence as the input of the graph analysis module and predict the relationships between components; S3, form statement sentences from the training-set questions and candidate answers, perform tokenization, and extract language features to obtain a language sequence composed of the candidate answers and questions; S4, for the training set, splice the output of the graph analysis module with the language sequence, input the result into the question-answering module, and train the network parameters; S5, for the test-set images, encode the diagram, question, and candidate answers into a region feature sequence and a language sequence. The invention combines the graph analysis module and the question-answering module, trains them with multi-level tasks, and realizes a multi-task learning framework based on the two levels of parsing and question answering.

Description

Schematic question-answering method based on hierarchical multi-task learning
Technical Field
The invention relates to the technical field of image question answering, in particular to a schematic diagram question answering method based on hierarchical multi-task learning.
Background
In recent years, with the continuous development of deep learning and computer vision, computers' ability to understand natural images has progressed greatly; for example, current methods reach high accuracy in visual question answering. However, computers still struggle to achieve good semantic understanding of schematic diagrams, as in the schematic question-answering task. This task requires the computer to extract semantic knowledge from a schematic and, based on that knowledge, infer the correct answer from several candidate answers. Schematic question answering can be regarded as an evaluation task for intelligent reasoning over diagram semantics and is very challenging. First, the schematic diagram is an important explanatory medium, widely used in textbooks, slides, and documents, a role that natural images can hardly play. Unlike natural images, schematics carry highly structured semantic information; an arrow, for example, represents a connection. Second, visually similar structures may have very different semantics in different schematics. For these reasons, question-answering methods designed for natural images are difficult to apply to schematics.
To answer questions about schematics, existing methods generally divide question answering into two independent stages: a diagram parsing module first identifies components and then generates a structure diagram by matching and classifying components pairwise, and a question-answering module generates facts from the structure diagram, combines them with the question and candidate options, and selects the most likely option. There is no feedback between diagram generation and question reasoning: the accuracy of the parsed structure diagram depends only on the image information of the training set, while the question-answering module selects the correct answer from the generated structure diagram. During training, errors in structure parsing can cause large errors in the question-answering module, but those errors are used only to optimize the question-answering module and are not back-propagated to the parsing module. Such two-stage modules cannot give each other feedback, so a globally optimal solution cannot be guaranteed. Schematics often appear in teaching materials and documents, contain rich knowledge and structural information, and are more complex than natural images, so common visual question-answering methods are difficult to apply to schematic question answering. Moreover, similar schematics may contain entirely different semantic information, which also makes schematic question answering very challenging.
Disclosure of Invention
The invention aims to address the problems in the background art by providing a schematic question-answering method based on hierarchical multi-task learning, comprising the following steps:
S1, detect the image components of the training images with a pre-trained target detector, perform position coding and visual feature extraction, and then encode them to form a region feature sequence;
S2, take the training-set images and the region feature sequence as the input of the graph analysis module, predict the relationships between components, and train the network parameters;
S3, form statement sentences from the training-set questions and candidate answers, perform tokenization, and extract language features to obtain a language sequence composed of the candidate answers and questions;
S4, for the training set, splice the output of the graph analysis module with the language sequence, input the result into the question-answering module, predict the correct option of the question, and train the network parameters;
S5, for the test-set images, encode the diagram, question, and candidate answers into a region feature sequence and a language sequence, input them into the deep network, and predict the correct option of the question.
Preferably, S2 and S4 use a multi-task learning framework that solves the question-answering task through two learning tasks: the diagram parsing task and the question-answering task.
Preferably, in S4 the graph analysis module and the question-answering module are trained jointly, so that the question-answering training loss is back-propagated to both the graph analysis module and the question-answering module.
Preferably, the graph parsing task is performed first, and the question-answering task is then performed using its output, constructing a hierarchical multi-task framework.
Preferably, in S1, the pre-trained target detector is a YOLO v3 target detector pre-trained on the COCO dataset and the schematic question-answering image dataset; the detected schematic components comprise four categories: text, object regions, arrow heads, and arrow tails. Region feature sequence coding can be divided into the following two sub-steps:
S1.1, for a schematic I, the detected components are O = {o_1, o_2, ..., o_m}. A visual feature sequence [z_1, z_2, ..., z_m] of dimension 2048 is obtained through the deep network feature extraction module ResNet101; the whole image is also passed through the same feature extractor to obtain z_0, which is placed at the first position of the sequence as global information, forming [z_0, z_1, ..., z_m]. From the upper-left coordinate (x_min, y_min) and lower-right coordinate (x_max, y_max) of each detected component, the values (x_min, y_min, x_max, y_max), width, height, and area form a 7-dimensional position feature sequence [q_0, q_1, ..., q_m];
S1.2, the visual feature sequence [z_0, ..., z_m] and the position features pass through a visual-and-position feature fusion module to obtain the 1024-dimensional region feature sequence H_in = [h_0, h_1, ..., h_m].
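The position coding of S1.1 can be sketched as follows; normalizing the coordinates by the image size is an illustrative assumption, since the patent only lists the seven components of the position feature:

```python
def position_features(box, img_w, img_h):
    """Build the 7-dimensional position feature of S1.1:
    (x_min, y_min, x_max, y_max, width, height, area).
    Normalizing by image size is an illustrative assumption."""
    x_min, y_min, x_max, y_max = box
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return [x_min / img_w, y_min / img_h,
            x_max / img_w, y_max / img_h,
            w, h, w * h]
```

Each detected component o_i thus contributes one 7-dimensional vector q_i, and the whole image contributes q_0.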
Preferably, in S2, the graph analysis module consists of a multi-layer Transformer encoder TB_sp and one GRU layer. The region feature sequence H_in is input into the Transformer encoder to obtain the 1024-dimensional output H_out = [h_0^out, h_1^out, ..., h_m^out]. Regions are paired two by two to predict whether a relationship exists between them: the feature f_ij of the relationship candidate pair <o_i, o_j> is obtained by concatenating h_i^out, q_i, h_j^out, and q_j, where i, j = 1, 2, ..., m and i ≠ j. The sequence composed of the features f_ij is input into the GRU, which predicts whether a relationship exists for each pair <o_i, o_j>; the loss is then calculated as
L_sp = -(1/N) * Σ_{n=1}^{N} [ y_n * log(ŷ_n) + (1 - y_n) * log(1 - ŷ_n) ],
where N is the number of relationship candidate pairs, y_n is the ground-truth value of the nth pair, and ŷ_n is the model's prediction for the nth pair.
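A minimal sketch of the pair construction and the loss L_sp in plain Python; the binary cross-entropy form follows the formula in the text, while in the real module the predictions ŷ_n come from the Transformer-plus-GRU network:

```python
import math
from itertools import permutations

def relation_pairs(m):
    # All ordered candidate pairs <o_i, o_j> with i != j over m components,
    # giving N = m * (m - 1) pairs.
    return list(permutations(range(1, m + 1), 2))

def relation_loss(y_true, y_pred, eps=1e-12):
    # Binary cross-entropy averaged over the N candidate pairs (L_sp).
    n = len(y_true)
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for y, p in zip(y_true, y_pred)) / n
```

For m = 4 components this yields N = 12 candidate pairs, each scored independently for the presence of a relationship.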
Preferably, in S3, the tokenization and encoding process uses RoBERTa. For a question q about image I with K candidate answers {a_k | k = 1, ..., K}, the question q and each a_k are joined by a space into a statement sentence s_k, which is input into RoBERTa for tokenization and encoding to obtain the language feature W^k = [w_0^k, w_1^k, ..., w_n^k] of s_k, where w_i^k is the encoding of the ith token and n is the maximum number of tokens in the sentence.
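The sentence construction of S3 can be sketched as below; the whitespace tokenizer and the pad symbol are illustrative stand-ins for RoBERTa's subword tokenizer, which the patent uses in the real pipeline:

```python
def statement_sentences(question, answers):
    # S3: join the question and each candidate answer with a space.
    return [f"{question} {a}" for a in answers]

def pad_tokens(sentence, n, pad="<pad>"):
    # Truncate/pad to the fixed length n used for the language feature W^k.
    # Whitespace splitting stands in for RoBERTa subword tokenization.
    toks = sentence.split()[:n]
    return toks + [pad] * (n - len(toks))
```

Each of the K statement sentences thus yields one fixed-length token sequence, which RoBERTa encodes into W^k.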
Preferably, in S4, the question-answering module consists of a multi-layer Transformer module TB_dqa and a fully connected layer. The region feature sequence is spliced with the language sequence to form [H_out; W^k], which passes through TB_dqa to obtain U^k = [u_0^k, u_1^k, ..., u_{m+n+1}^k]. The 0th and (m+1)th vectors are multiplied element-wise and input into the fully connected layer and the softmax layer to obtain the score p_k of candidate a_k. The option with the highest score is taken as the predicted value, and the question-answering loss is calculated as
L_dqa = -Σ_{k=1}^{K} 1(k = t_correct) * log(p_k),
where t_correct is the label of the correct answer to the current question, and the indicator 1(k = t_correct) equals 1 if k and t_correct are the same and 0 otherwise.
The global loss is L = α * L_sp + β * L_dqa, where α and β are tunable hyper-parameters that balance learning between the graph analysis module and the question-answering module. Global parameters are adjusted by back-propagation to minimize the global loss until the loss value no longer decreases.
Preferably, in S5, the trained network parameters are fixed to perform inference over the schematic and the question, with the following sub-steps:
S5.1, perform component detection and feature coding as in S1, then pass the result through TB_sp as in S2 to obtain the feature output H_out;
S5.2, preprocess the question and candidate answers as in S3 to obtain the language feature W^k of the question and candidate answer a_k;
S5.3, splice H_out and the language feature W^k into the joint feature sequence [H_out; W^k], then pass it through TB_dqa and the softmax layer as in S4 to obtain the score p_k of candidate a_k, and select the option with the highest score as the output answer.
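Inference in S5.3 reduces to scoring each option and taking the argmax; a sketch with a plain-Python softmax:

```python
import math

def softmax(logits):
    # Numerically stable softmax over the K candidate logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_answer(logits):
    # S5.3: take the highest-scoring candidate as the output answer.
    probs = softmax(logits)
    return max(range(len(probs)), key=lambda k: probs[k])
```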
Compared with the prior art, the invention has the following beneficial technical effects:
the invention combines the graph analysis module and the question-answering module together, utilizes multi-level tasks for training, realizes a multi-task learning framework based on two levels of analysis and question-answering, and has better semantic understanding and reasoning capability compared with the prior method.
Drawings
FIG. 1 is a training flow diagram of the schematic question-answering method based on hierarchical multi-task learning according to the present invention;
FIG. 2 illustrates the model structure of the schematic question-answering method.
Detailed Description
The images of the dataset involved in the method proposed by the present invention are schematic diagrams; each schematic is accompanied by several questions about its content, and each question has four candidate answers.
To solve the multi-stage problem of previous work, the invention provides a schematic question-answering method based on hierarchical multi-task learning.
As shown in fig. 1-2, the schematic question-answering method based on hierarchical multitask learning provided by the present invention includes the following steps:
S1, detect the image components of the training images with the pre-trained target detector, perform position coding and visual feature extraction, and then encode them to form a region feature sequence. The pre-trained target detector is a YOLO v3 target detector pre-trained on the COCO dataset and the schematic question-answering image dataset; the detected schematic components comprise four categories: text, object regions, arrow heads, and arrow tails. Region feature sequence coding can be divided into two sub-steps. S1.1, for a schematic I, the detected components are O = {o_1, o_2, ..., o_m}. A visual feature sequence [z_1, z_2, ..., z_m] of dimension 2048 is obtained through the deep network feature extraction module ResNet101; the whole image is also passed through the same feature extractor to obtain z_0, which is placed at the first position of the sequence as global information, forming [z_0, z_1, ..., z_m]. From the upper-left coordinate (x_min, y_min) and lower-right coordinate (x_max, y_max) of each detected component, the values (x_min, y_min, x_max, y_max), width, height, and area form a 7-dimensional position feature sequence [q_0, q_1, ..., q_m]. S1.2, the visual feature sequence [z_0, ..., z_m] and the position features pass through a visual-and-position feature fusion module to obtain the 1024-dimensional region feature sequence H_in = [h_0, h_1, ..., h_m].
S2, take the training-set images and the region feature sequence as the input of the graph analysis module, predict the relationships between components, and train the network parameters. The module consists of a multi-layer Transformer encoder TB_sp and one GRU layer. The region feature sequence H_in is input into the Transformer encoder to obtain the 1024-dimensional output H_out = [h_0^out, h_1^out, ..., h_m^out]. Regions are paired two by two to predict whether a relationship exists between them: the feature f_ij of the relationship candidate pair <o_i, o_j> is obtained by concatenating h_i^out, q_i, h_j^out, and q_j, where i, j = 1, 2, ..., m and i ≠ j. The sequence composed of the features f_ij is input into the GRU, which predicts whether a relationship exists for each pair <o_i, o_j>; the loss is then calculated as
L_sp = -(1/N) * Σ_{n=1}^{N} [ y_n * log(ŷ_n) + (1 - y_n) * log(1 - ŷ_n) ],
where N is the number of relationship candidate pairs, y_n is the ground-truth value of the nth pair, and ŷ_n is the model's prediction for the nth pair.
S3, form statement sentences from the training-set questions and candidate answers, then perform tokenization and extract language features to obtain a language sequence composed of the candidate answers and questions. The tokenization and encoding process uses RoBERTa: for a question q about image I with K candidate answers {a_k | k = 1, ..., K}, the question q and each a_k are joined by a space into a statement sentence s_k, which is input into RoBERTa for tokenization and encoding to obtain the language feature W^k = [w_0^k, w_1^k, ..., w_n^k] of s_k, where w_i^k is the encoding of the ith token and n is the maximum number of tokens in the sentence.
S4, for the training set, splice the output of the graph analysis module with the language sequence, input the result into the question-answering module, predict the correct option of the question, and train the network parameters. S2 and S4 use a multi-task learning framework that solves the question-answering task through two learning tasks: the diagram parsing task and the question-answering task. In S4 the graph analysis module and the question-answering module are trained jointly, so that the question-answering training loss is back-propagated to both modules. The question-answering module consists of a multi-layer Transformer module TB_dqa and a fully connected layer. The region feature sequence is spliced with the language sequence to form [H_out; W^k], which passes through TB_dqa to obtain U^k = [u_0^k, u_1^k, ..., u_{m+n+1}^k]. The 0th and (m+1)th vectors are multiplied element-wise and input into the fully connected layer and the softmax layer to obtain the score p_k of candidate a_k. The option with the highest score is taken as the predicted value, and the question-answering loss is calculated as L_dqa = -Σ_{k=1}^{K} 1(k = t_correct) * log(p_k), where t_correct is the label of the correct answer to the current question, and the indicator equals 1 if k and t_correct are the same and 0 otherwise. The global loss is L = α * L_sp + β * L_dqa, where α and β are tunable hyper-parameters that balance learning between the graph analysis module and the question-answering module. Global parameters are adjusted by back-propagation to minimize the global loss until the loss value no longer decreases.
S5, for the test-set images, encode the diagram, question, and candidate answers into a region feature sequence and a language sequence, input them into the deep network, and predict the correct option of the question. The trained network parameters are fixed to perform inference over the schematic and the question, with the following sub-steps:
S5.1, perform component detection and feature coding as in S1, then pass the result through TB_sp as in S2 to obtain the feature output H_out;
S5.2, preprocess the question and candidate answers as in S3 to obtain the language feature W^k of the question and candidate answer a_k;
S5.3, splice H_out and the language feature W^k into the joint feature sequence [H_out; W^k], then pass it through TB_dqa and the softmax layer as in S4 to obtain the score p_k of candidate a_k, and select the option with the highest score as the output answer.
In this embodiment, the graph parsing task is performed first, and the question-answering task is then performed using its output, constructing a hierarchical multi-task framework.
In the invention, the model framework comprises a graph analysis module and a question-answering module and is trained with hierarchical multi-task learning. In this multi-task learning paradigm, the graph parsing and question-answering tasks sit at different levels of training and use different neural network modules, forming a hierarchy. First, a pre-trained target detector detects meaningful regions in the schematic, such as text, arrows, and objects, and extracts their features. Then the image feature sequence and the word sequence of the question are spliced into one input sequence. After the sequence enters the model, the model is trained simultaneously by the structure parsing task and the question-answering task of the diagram: the graph analysis module encodes the components in the diagram and their relationship information, and the question-answering module combines the structural signals with the question and candidate answers to infer the correct answer. The schematic question-answering method provided by the invention can perform structural parsing of the diagram and infer question answers at the same time, realizes a multi-task learning framework based on the two levels of parsing and question answering, and has better semantic understanding and reasoning capabilities than existing methods.
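The overall two-level flow can be summarized in a sketch; the `modules` dict of callables is a hypothetical stand-in for the patent's networks (detector plus ResNet101 encoder, TB_sp graph parser, TB_dqa question-answering head), not an interface the patent defines:

```python
def hierarchical_qa_forward(image, question, answers, modules):
    """One forward pass of the two-level framework (illustrative only).
    modules["encode"]      -> region feature sequence H_in   (S1)
    modules["graph_parse"] -> (H_out, parse loss L_sp)       (S2)
    modules["qa"]          -> per-option scores              (S3-S4)
    """
    h_in = modules["encode"](image)
    h_out, l_sp = modules["graph_parse"](h_in)
    sentences = [f"{question} {a}" for a in answers]
    scores = modules["qa"](h_out, sentences)
    pred = max(range(len(scores)), key=lambda k: scores[k])
    return pred, l_sp
```

During training, l_sp is combined with the question-answering loss into the global loss so that errors back-propagate into both levels; at test time (S5) only the predicted option is used.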
The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited thereto, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims (5)

1. A schematic question-answering method based on hierarchical multi-task learning, characterized by comprising the following steps:
S1, detecting the image components of the training images with a pre-trained target detector, performing position coding and visual feature extraction, and then encoding them to form a region feature sequence;
S2, taking the training-set images and the region feature sequence as the input of the graph analysis module, predicting the relationships between components, and training the network parameters;
S3, forming statement sentences from the training-set questions and candidate answers, performing tokenization, and extracting language features to obtain a language sequence composed of the candidate answers and questions;
S4, for the training set, splicing the output of the graph analysis module with the language sequence, inputting the result into the question-answering module, predicting the correct option of the question, and training the network parameters;
S5, for the test-set images, encoding the diagram, question, and candidate answers into a region feature sequence and a language sequence, inputting them into the deep network, and predicting the correct option of the question;
in S1, the pre-trained target detector is a YOLO v3 target detector pre-trained on the COCO dataset and the schematic question-answering image dataset; the detected schematic components comprise four categories: text, object regions, arrow heads, and arrow tails; region feature sequence coding can be divided into the following two sub-steps:
S1.1, for a schematic I, the detected components are O = {o_1, o_2, ..., o_m}; a visual feature sequence [z_1, z_2, ..., z_m] of dimension 2048 is obtained through the deep network feature extraction module ResNet101; the whole image is also passed through the same feature extractor to obtain z_0, which is placed at the first position of the sequence as global information, forming [z_0, z_1, ..., z_m]; from the upper-left coordinate (x_min, y_min) and lower-right coordinate (x_max, y_max) of each detected component, the values (x_min, y_min, x_max, y_max), width, height, and area form a 7-dimensional position feature sequence [q_0, q_1, ..., q_m];
S1.2, the visual feature sequence [z_0, ..., z_m] and the position features pass through a visual-and-position feature fusion module to obtain the 1024-dimensional region feature sequence H_in = [h_0, h_1, ..., h_m];
in S2, the graph analysis module consists of a multi-layer Transformer encoder TB_sp and one GRU layer; the region feature sequence H_in is input into the Transformer encoder to obtain the 1024-dimensional output H_out = [h_0^out, h_1^out, ..., h_m^out]; regions are paired two by two to predict whether a relationship exists between them, the feature f_ij of the relationship candidate pair <o_i, o_j> being obtained by concatenating h_i^out, q_i, h_j^out, and q_j, where i, j = 1, 2, ..., m and i ≠ j; the sequence composed of the features f_ij is input into the GRU, which predicts whether a relationship exists for each pair <o_i, o_j>; the loss is then calculated as
L_sp = -(1/N) * Σ_{n=1}^{N} [ y_n * log(ŷ_n) + (1 - y_n) * log(1 - ŷ_n) ],
where N is the number of relationship candidate pairs, y_n is the ground-truth value of the nth pair, and ŷ_n is the model's prediction for the nth pair;
in S3, the tokenization and encoding process uses RoBERTa; for a question q about image I with K candidate answers {a_k | k = 1, ..., K}, the question q and each a_k are joined by a space into a statement sentence s_k, which is input into RoBERTa for tokenization and encoding to obtain the language feature W^k = [w_0^k, w_1^k, ..., w_n^k] of s_k, where w_i^k is the encoding of the ith token and n is the maximum number of tokens in the sentence;
in S4, the question-answering module consists of a multi-layer Transformer module TB_dqa and a fully connected layer; the region feature sequence is spliced with the language sequence to form [H_out; W^k], which passes through TB_dqa to obtain U^k = [u_0^k, u_1^k, ..., u_{m+n+1}^k]; the 0th and (m+1)th vectors are multiplied element-wise and input into the fully connected layer and the softmax layer to obtain the score p_k of candidate a_k; the option with the highest score is taken as the predicted value, and the question-answering loss is calculated as
L_dqa = -Σ_{k=1}^{K} 1(k = t_correct) * log(p_k),
where t_correct is the label of the correct answer to the current question, and the indicator 1(k = t_correct) equals 1 if k and t_correct are the same and 0 otherwise;
the global loss is L = α * L_sp + β * L_dqa, where α and β are tunable hyper-parameters that balance learning between the graph analysis module and the question-answering module, and global parameters are adjusted by back-propagation to minimize the global loss until the loss value no longer decreases.
2. The schematic question-answering method based on hierarchical multi-task learning according to claim 1, characterized in that S2 and S4 use a multi-task learning framework that solves the question-answering task through two learning tasks: the diagram parsing task and the question-answering task.
3. The schematic question-answering method based on hierarchical multi-task learning according to claim 1, characterized in that in S4 the graph analysis module and the question-answering module are trained jointly, so that the question-answering training loss is back-propagated to both the graph analysis module and the question-answering module.
4. The schematic question-answering method based on hierarchical multi-task learning according to claim 1, characterized in that the graph parsing task is performed first, and the question-answering task is then performed using its output, constructing a hierarchical multi-task framework.
5. The schematic diagram question-answering method based on hierarchical multi-task learning according to claim 1, characterized in that in S5, the trained network parameters are fixed in order to reason about the schematic diagram and the question, which specifically comprises the following substeps:
S5.1: perform component detection and feature encoding on the schematic diagram according to S1, then pass the result through TB_sp according to S2 to obtain the feature output H_out;
S5.2: preprocess the question and the candidate answers according to S3 to obtain the language features of the question and each candidate answer a_k;
S5.3: concatenate H_out and the language features to form a joint feature sequence, then, following S4, pass it through TB_dqa and the softmax layer to obtain the score of each candidate a_k;
The option with the highest score is selected as the output answer.
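The scoring step of S5.3 (element-wise product of the 0th and (m+1)th output vectors, a fully-connected layer, softmax over candidates, then argmax) can be sketched as follows. The weight vector `w` and bias `b` stand in for the trained fully-connected layer; they are illustrative placeholders, not trained parameters.

```python
import math

def score_candidates(h_pairs, w, b):
    """For each candidate a_k, h_pairs holds the (0th, (m+1)th) output
    vectors of TB_dqa. Their element-wise product passes through a
    linear layer to a scalar logit; softmax over all candidates gives
    the scores."""
    logits = []
    for h0, hm1 in h_pairs:
        fused = [x * y for x, y in zip(h0, hm1)]          # element-wise product
        logits.append(sum(wi * fi for wi, fi in zip(w, fused)) + b)
    z = sum(math.exp(l) for l in logits)
    return [math.exp(l) / z for l in logits]              # softmax scores

def predict(probs):
    """The option with the highest score is the output answer."""
    return max(range(len(probs)), key=probs.__getitem__)
```

Note that at inference time only the argmax matters, so the softmax is not strictly required to pick the answer; it is kept here to match the scoring described in the claims.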
CN202110892487.9A 2021-08-04 2021-08-04 Schematic question-answering method based on hierarchical multi-task learning Expired - Fee Related CN113869349B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110892487.9A CN113869349B (en) 2021-08-04 2021-08-04 Schematic question-answering method based on hierarchical multi-task learning

Publications (2)

Publication Number Publication Date
CN113869349A CN113869349A (en) 2021-12-31
CN113869349B true CN113869349B (en) 2022-10-14

Family

ID=78990250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110892487.9A Expired - Fee Related CN113869349B (en) 2021-08-04 2021-08-04 Schematic question-answering method based on hierarchical multi-task learning

Country Status (1)

Country Link
CN (1) CN113869349B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10909329B2 (en) * 2015-05-21 2021-02-02 Baidu Usa Llc Multilingual image question answering
CN111008293A (en) * 2018-10-06 2020-04-14 上海交通大学 Visual question-answering method based on structured semantic representation
CN110348535B (en) * 2019-07-17 2022-05-31 北京金山数字娱乐科技有限公司 Visual question-answering model training method and device
CN111782839B (en) * 2020-06-30 2023-08-22 北京百度网讯科技有限公司 Image question-answering method, device, computer equipment and medium

Similar Documents

Publication Publication Date Title
Wu et al. Image-to-markup generation via paired adversarial learning
Atienza Advanced Deep Learning with Keras: Apply deep learning techniques, autoencoders, GANs, variational autoencoders, deep reinforcement learning, policy gradients, and more
Davis et al. End-to-end document recognition and understanding with dessurt
Li et al. Improving attention-based handwritten mathematical expression recognition with scale augmentation and drop attention
CN110866542B (en) Depth representation learning method based on feature controllable fusion
CN111160037A (en) Fine-grained emotion analysis method supporting cross-language migration
CN112115238A (en) Question-answering method and system based on BERT and knowledge base
Sharma et al. A survey of methods, datasets and evaluation metrics for visual question answering
CN111831789A (en) Question-answer text matching method based on multilayer semantic feature extraction structure
CN111680484B (en) Answer model generation method and system for visual general knowledge reasoning question and answer
CN110781672A (en) Question bank production method and system based on machine intelligence
Davis et al. Visual FUDGE: form understanding via dynamic graph editing
CN115953569A (en) One-stage visual positioning model construction method based on multi-step reasoning
Kovvuri et al. Pirc net: Using proposal indexing, relationships and context for phrase grounding
Palash et al. Bangla image caption generation through cnn-transformer based encoder-decoder network
Deng et al. A position-aware transformer for image captioning
CN117609536A (en) Language-guided reference expression understanding reasoning network system and reasoning method
CN113869349B (en) Schematic question-answering method based on hierarchical multi-task learning
Sharma et al. Visual question answering model based on the fusion of multimodal features by a two-way co-attention mechanism
CN115409028A (en) Knowledge and data driven multi-granularity Chinese text sentiment analysis method
CN114691848A (en) Relational triple combined extraction method and automatic question-answering system construction method
CN112364654A (en) Education-field-oriented entity and relation combined extraction method
Pei et al. Visual relational reasoning for image caption
Xie et al. Enhancing multimodal deep representation learning by fixed model reuse
Cai et al. Multimodal Visual Question Answering Model Enhanced with Image Emotional Information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20221014