CN113869349B - Schematic question-answering method based on hierarchical multi-task learning - Google Patents


Info

Publication number
CN113869349B
CN113869349B CN202110892487.9A
Authority
CN
China
Prior art keywords
question
sequence
answering
module
answer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN202110892487.9A
Other languages
Chinese (zh)
Other versions
CN113869349A (en)
Inventor
袁召全
彭潇
吴晓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Jiaotong University filed Critical Southwest Jiaotong University
Priority to CN202110892487.9A priority Critical patent/CN113869349B/en
Publication of CN113869349A publication Critical patent/CN113869349A/en
Application granted granted Critical
Publication of CN113869349B publication Critical patent/CN113869349B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/35 Discourse or dialogue representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of image question answering, and in particular to a schematic question-answering method based on hierarchical multi-task learning, comprising the following steps: S1, detect the image components of the training images with a pre-trained target detector, and perform position coding and visual feature extraction; S2, take the training-set images and the region feature sequence as the input of the graph analysis module and predict the relationships between components; S3, form statement sentences from the training-set questions and candidate answers, perform tokenization, and extract language features to obtain a language sequence composed of the candidate answers and questions; S4, for the training set, splice the output of the graph analysis module with the language sequence, input the result into the question-answering module, and train the network parameters; S5, for the test-set images, encode the diagram, question, and candidate answers into a region feature sequence and a language sequence. The invention combines the graph analysis module and the question-answering module, trains them with multi-level tasks, and realizes a multi-task learning framework based on the two levels of parsing and question answering.

Description

Schematic question-answering method based on hierarchical multi-task learning
Technical Field
The invention relates to the technical field of image question answering, in particular to a schematic diagram question answering method based on hierarchical multi-task learning.
Background
In recent years, with the continuous development of deep learning and computer vision, computers' ability to understand natural images has progressed greatly; for example, current methods reach high accuracy in visual question answering. However, computers still struggle to achieve good semantic understanding of schematic diagrams, as in the schematic question-answering task. This task requires the computer to extract semantic knowledge from a schematic and, based on that knowledge, infer the correct answer from several candidate answers. Schematic question answering can be regarded as an evaluation task for intelligent reasoning over diagram semantics and is very challenging. First, the schematic diagram is an important explanatory medium, widely used in textbooks, slides, and documents, a role that natural images can hardly play. Unlike natural images, schematics carry highly structured semantic information; an arrow, for example, represents a connection. Second, visually similar structures may have very different semantics in different schematics. For these reasons, question-answering methods designed for natural images are difficult to apply to schematics.
To answer questions about schematics, existing methods generally divide question answering into two independent stages: a diagram parsing module first identifies components and then generates a structure diagram by matching and classifying components pairwise, and a question-answering module generates facts from the structure diagram, combines them with the question and candidate options, and selects the most likely option. There is no feedback between diagram generation and question reasoning: the accuracy of the parsed structure diagram depends only on the image information of the training set, while the question-answering module selects the correct answer from the generated structure diagram. During training, errors in structure parsing can cause large errors in the question-answering module, but those errors are used only to optimize the question-answering module and are not back-propagated to the parsing module. Such two-stage modules cannot give each other feedback, so a globally optimal solution cannot be guaranteed. Schematics often appear in teaching materials and documents, contain rich knowledge and structural information, and are more complex than natural images, so common visual question-answering methods are difficult to apply to schematic question answering. Moreover, similar schematics may contain entirely different semantic information, which also makes schematic question answering very challenging.
Disclosure of Invention
The invention aims to address the problems in the background art by providing a schematic question-answering method based on hierarchical multi-task learning, comprising the following steps:
S1, detect the image components of the training images with a pre-trained target detector, perform position coding and visual feature extraction, and then encode them to form a region feature sequence;
S2, take the training-set images and the region feature sequence as the input of the graph analysis module, predict the relationships between components, and train the network parameters;
S3, form statement sentences from the training-set questions and candidate answers, perform tokenization, and extract language features to obtain a language sequence composed of the candidate answers and questions;
S4, for the training set, splice the output of the graph analysis module with the language sequence, input the result into the question-answering module, predict the correct option of the question, and train the network parameters;
S5, for the test-set images, encode the diagram, question, and candidate answers into a region feature sequence and a language sequence, input them into the deep network, and predict the correct option of the question.
Preferably, S2 and S4 use a multi-task learning framework that solves the question-answering task through two learning tasks: the diagram parsing task and the question-answering task.
Preferably, in S4 the graph analysis module and the question-answering module are trained jointly, so that the question-answering training loss is back-propagated to both the graph analysis module and the question-answering module.
Preferably, the graph parsing task is performed first, and the question-answering task is then performed using its output, constructing a hierarchical multi-task framework.
Preferably, in S1, the pre-trained target detector is a YOLO v3 target detector pre-trained on the COCO dataset and the schematic question-answering image dataset; the detected schematic components comprise four categories: text, object regions, arrow heads, and arrow tails. Region feature sequence coding can be divided into the following two sub-steps:
S1.1, for a schematic I, the detected components are O = {o_1, o_2, ..., o_m}. A visual feature sequence [z_1, z_2, ..., z_m] of dimension 2048 is obtained through the deep network feature extraction module ResNet101; the whole image is also passed through the same feature extractor to obtain z_0, which is placed at the first position of the sequence as global information, forming [z_0, z_1, ..., z_m]. From the upper-left coordinate (x_min, y_min) and lower-right coordinate (x_max, y_max) of each detected component, the values (x_min, y_min, x_max, y_max), width, height, and area form a 7-dimensional position feature sequence [q_0, q_1, ..., q_m];
S1.2, the visual feature sequence [z_0, ..., z_m] and the position features pass through a visual-and-position feature fusion module to obtain the 1024-dimensional region feature sequence H_in = [h_0, h_1, ..., h_m].
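The position coding of S1.1 can be sketched as follows; normalizing the coordinates by the image size is an illustrative assumption, since the patent only lists the seven components of the position feature:

```python
def position_features(box, img_w, img_h):
    """Build the 7-dimensional position feature of S1.1:
    (x_min, y_min, x_max, y_max, width, height, area).
    Normalizing by image size is an illustrative assumption."""
    x_min, y_min, x_max, y_max = box
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return [x_min / img_w, y_min / img_h,
            x_max / img_w, y_max / img_h,
            w, h, w * h]
```

Each detected component o_i thus contributes one 7-dimensional vector q_i, and the whole image contributes q_0.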
Preferably, in S2, the graph analysis module consists of a multi-layer Transformer encoder TB_sp and one GRU layer. The region feature sequence H_in is input into the Transformer encoder to obtain the 1024-dimensional output H_out = [h_0^out, h_1^out, ..., h_m^out]. Regions are paired two by two to predict whether a relationship exists between them: the feature f_ij of the relationship candidate pair <o_i, o_j> is obtained by concatenating h_i^out, q_i, h_j^out, and q_j, where i, j = 1, 2, ..., m and i ≠ j. The sequence composed of the features f_ij is input into the GRU, which predicts whether a relationship exists for each pair <o_i, o_j>; the loss is then calculated as
L_sp = -(1/N) * Σ_{n=1}^{N} [ y_n * log(ŷ_n) + (1 - y_n) * log(1 - ŷ_n) ],
where N is the number of relationship candidate pairs, y_n is the ground-truth value of the nth pair, and ŷ_n is the model's prediction for the nth pair.
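A minimal sketch of the pair construction and the loss L_sp in plain Python; the binary cross-entropy form follows the formula in the text, while in the real module the predictions ŷ_n come from the Transformer-plus-GRU network:

```python
import math
from itertools import permutations

def relation_pairs(m):
    # All ordered candidate pairs <o_i, o_j> with i != j over m components,
    # giving N = m * (m - 1) pairs.
    return list(permutations(range(1, m + 1), 2))

def relation_loss(y_true, y_pred, eps=1e-12):
    # Binary cross-entropy averaged over the N candidate pairs (L_sp).
    n = len(y_true)
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for y, p in zip(y_true, y_pred)) / n
```

For m = 4 components this yields N = 12 candidate pairs, each scored independently for the presence of a relationship.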
Preferably, in S3, the tokenization and encoding process uses RoBERTa. For a question q about image I with K candidate answers {a_k | k = 1, ..., K}, the question q and each a_k are joined by a space into a statement sentence s_k, which is input into RoBERTa for tokenization and encoding to obtain the language feature W^k = [w_0^k, w_1^k, ..., w_n^k] of s_k, where w_i^k is the encoding of the ith token and n is the maximum number of tokens in the sentence.
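The sentence construction of S3 can be sketched as below; the whitespace tokenizer and the pad symbol are illustrative stand-ins for RoBERTa's subword tokenizer, which the patent uses in the real pipeline:

```python
def statement_sentences(question, answers):
    # S3: join the question and each candidate answer with a space.
    return [f"{question} {a}" for a in answers]

def pad_tokens(sentence, n, pad="<pad>"):
    # Truncate/pad to the fixed length n used for the language feature W^k.
    # Whitespace splitting stands in for RoBERTa subword tokenization.
    toks = sentence.split()[:n]
    return toks + [pad] * (n - len(toks))
```

Each of the K statement sentences thus yields one fixed-length token sequence, which RoBERTa encodes into W^k.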
Preferably, in S4, the question-answering module consists of a multi-layer Transformer module TB_dqa and a fully connected layer. The region feature sequence is spliced with the language sequence to form [H_out; W^k], which passes through TB_dqa to obtain U^k = [u_0^k, u_1^k, ..., u_{m+n+1}^k]. The 0th and (m+1)th vectors are multiplied element-wise and input into the fully connected layer and the softmax layer to obtain the score p_k of candidate a_k. The option with the highest score is taken as the predicted value, and the question-answering loss is calculated as
L_dqa = -Σ_{k=1}^{K} 1(k = t_correct) * log(p_k),
where t_correct is the label of the correct answer to the current question, and the indicator 1(k = t_correct) equals 1 if k and t_correct are the same and 0 otherwise.
The global loss is L = α * L_sp + β * L_dqa, where α and β are tunable hyper-parameters that balance learning between the graph analysis module and the question-answering module. Global parameters are adjusted by back-propagation to minimize the global loss until the loss value no longer decreases.
Preferably, in S5, the trained network parameters are fixed to perform inference over the schematic and the question, with the following sub-steps:
S5.1, perform component detection and feature coding as in S1, then pass the result through TB_sp as in S2 to obtain the feature output H_out;
S5.2, preprocess the question and candidate answers as in S3 to obtain the language feature W^k of the question and candidate answer a_k;
S5.3, splice H_out and the language feature W^k into the joint feature sequence [H_out; W^k], then pass it through TB_dqa and the softmax layer as in S4 to obtain the score p_k of candidate a_k, and select the option with the highest score as the output answer.
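Inference in S5.3 reduces to scoring each option and taking the argmax; a sketch with a plain-Python softmax:

```python
import math

def softmax(logits):
    # Numerically stable softmax over the K candidate logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_answer(logits):
    # S5.3: take the highest-scoring candidate as the output answer.
    probs = softmax(logits)
    return max(range(len(probs)), key=lambda k: probs[k])
```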
Compared with the prior art, the invention has the following beneficial technical effects:
the invention combines the graph analysis module and the question-answering module together, utilizes multi-level tasks for training, realizes a multi-task learning framework based on two levels of analysis and question-answering, and has better semantic understanding and reasoning capability compared with the prior method.
Drawings
FIG. 1 is a training flow diagram of the schematic question-answering method based on hierarchical multi-task learning according to the present invention;
FIG. 2 illustrates the model structure of the schematic question-answering method.
Detailed Description
The images of the dataset involved in the method proposed by the present invention are schematic diagrams; each schematic is accompanied by several questions about its content, and each question has four candidate answers.
To solve the multi-stage problem of previous work, the invention provides a schematic question-answering method based on hierarchical multi-task learning.
As shown in fig. 1-2, the schematic question-answering method based on hierarchical multitask learning provided by the present invention includes the following steps:
S1, detect the image components of the training images with the pre-trained target detector, perform position coding and visual feature extraction, and then encode them to form a region feature sequence. The pre-trained target detector is a YOLO v3 target detector pre-trained on the COCO dataset and the schematic question-answering image dataset; the detected schematic components comprise four categories: text, object regions, arrow heads, and arrow tails. Region feature sequence coding can be divided into two sub-steps. S1.1, for a schematic I, the detected components are O = {o_1, o_2, ..., o_m}. A visual feature sequence [z_1, z_2, ..., z_m] of dimension 2048 is obtained through the deep network feature extraction module ResNet101; the whole image is also passed through the same feature extractor to obtain z_0, which is placed at the first position of the sequence as global information, forming [z_0, z_1, ..., z_m]. From the upper-left coordinate (x_min, y_min) and lower-right coordinate (x_max, y_max) of each detected component, the values (x_min, y_min, x_max, y_max), width, height, and area form a 7-dimensional position feature sequence [q_0, q_1, ..., q_m]. S1.2, the visual feature sequence [z_0, ..., z_m] and the position features pass through a visual-and-position feature fusion module to obtain the 1024-dimensional region feature sequence H_in = [h_0, h_1, ..., h_m].
S2, take the training-set images and the region feature sequence as the input of the graph analysis module, predict the relationships between components, and train the network parameters. The module consists of a multi-layer Transformer encoder TB_sp and one GRU layer. The region feature sequence H_in is input into the Transformer encoder to obtain the 1024-dimensional output H_out = [h_0^out, h_1^out, ..., h_m^out]. Regions are paired two by two to predict whether a relationship exists between them: the feature f_ij of the relationship candidate pair <o_i, o_j> is obtained by concatenating h_i^out, q_i, h_j^out, and q_j, where i, j = 1, 2, ..., m and i ≠ j. The sequence composed of the features f_ij is input into the GRU, which predicts whether a relationship exists for each pair <o_i, o_j>; the loss is then calculated as
L_sp = -(1/N) * Σ_{n=1}^{N} [ y_n * log(ŷ_n) + (1 - y_n) * log(1 - ŷ_n) ],
where N is the number of relationship candidate pairs, y_n is the ground-truth value of the nth pair, and ŷ_n is the model's prediction for the nth pair.
S3, form statement sentences from the training-set questions and candidate answers, then perform tokenization and extract language features to obtain a language sequence composed of the candidate answers and questions. The tokenization and encoding process uses RoBERTa: for a question q about image I with K candidate answers {a_k | k = 1, ..., K}, the question q and each a_k are joined by a space into a statement sentence s_k, which is input into RoBERTa for tokenization and encoding to obtain the language feature W^k = [w_0^k, w_1^k, ..., w_n^k] of s_k, where w_i^k is the encoding of the ith token and n is the maximum number of tokens in the sentence.
S4, for the training set, splice the output of the graph analysis module with the language sequence, input the result into the question-answering module, predict the correct option of the question, and train the network parameters. S2 and S4 use a multi-task learning framework that solves the question-answering task through two learning tasks: the diagram parsing task and the question-answering task. In S4 the graph analysis module and the question-answering module are trained jointly, so that the question-answering training loss is back-propagated to both modules. The question-answering module consists of a multi-layer Transformer module TB_dqa and a fully connected layer. The region feature sequence is spliced with the language sequence to form [H_out; W^k], which passes through TB_dqa to obtain U^k = [u_0^k, u_1^k, ..., u_{m+n+1}^k]. The 0th and (m+1)th vectors are multiplied element-wise and input into the fully connected layer and the softmax layer to obtain the score p_k of candidate a_k. The option with the highest score is taken as the predicted value, and the question-answering loss is calculated as L_dqa = -Σ_{k=1}^{K} 1(k = t_correct) * log(p_k), where t_correct is the label of the correct answer to the current question, and the indicator equals 1 if k and t_correct are the same and 0 otherwise. The global loss is L = α * L_sp + β * L_dqa, where α and β are tunable hyper-parameters that balance learning between the graph analysis module and the question-answering module. Global parameters are adjusted by back-propagation to minimize the global loss until the loss value no longer decreases.
S5, for the test-set images, encode the diagram, question, and candidate answers into a region feature sequence and a language sequence, input them into the deep network, and predict the correct option of the question. The trained network parameters are fixed to perform inference over the schematic and the question, with the following sub-steps:
S5.1, perform component detection and feature coding as in S1, then pass the result through TB_sp as in S2 to obtain the feature output H_out;
S5.2, preprocess the question and candidate answers as in S3 to obtain the language feature W^k of the question and candidate answer a_k;
S5.3, splice H_out and the language feature W^k into the joint feature sequence [H_out; W^k], then pass it through TB_dqa and the softmax layer as in S4 to obtain the score p_k of candidate a_k, and select the option with the highest score as the output answer.
In this embodiment, the graph parsing task is performed first, and the question-answering task is then performed using its output, constructing a hierarchical multi-task framework.
In the invention, the model framework comprises a graph analysis module and a question-answering module and is trained with hierarchical multi-task learning. In this multi-task learning paradigm, the graph parsing and question-answering tasks sit at different levels of training and use different neural network modules, forming a hierarchy. First, a pre-trained target detector detects meaningful regions in the schematic, such as text, arrows, and objects, and extracts their features. Then the image feature sequence and the word sequence of the question are spliced into one input sequence. After the sequence enters the model, the model is trained simultaneously by the structure parsing task and the question-answering task of the diagram: the graph analysis module encodes the components in the diagram and their relationship information, and the question-answering module combines the structural signals with the question and candidate answers to infer the correct answer. The schematic question-answering method provided by the invention can perform structural parsing of the diagram and infer question answers at the same time, realizes a multi-task learning framework based on the two levels of parsing and question answering, and has better semantic understanding and reasoning capabilities than existing methods.
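The overall two-level flow can be summarized in a sketch; the `modules` dict of callables is a hypothetical stand-in for the patent's networks (detector plus ResNet101 encoder, TB_sp graph parser, TB_dqa question-answering head), not an interface the patent defines:

```python
def hierarchical_qa_forward(image, question, answers, modules):
    """One forward pass of the two-level framework (illustrative only).
    modules["encode"]      -> region feature sequence H_in   (S1)
    modules["graph_parse"] -> (H_out, parse loss L_sp)       (S2)
    modules["qa"]          -> per-option scores              (S3-S4)
    """
    h_in = modules["encode"](image)
    h_out, l_sp = modules["graph_parse"](h_in)
    sentences = [f"{question} {a}" for a in answers]
    scores = modules["qa"](h_out, sentences)
    pred = max(range(len(scores)), key=lambda k: scores[k])
    return pred, l_sp
```

During training, l_sp is combined with the question-answering loss into the global loss so that errors back-propagate into both levels; at test time (S5) only the predicted option is used.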
The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited thereto, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims (5)

1. A schematic question-answering method based on hierarchical multi-task learning, characterized by comprising the following steps:
S1, detecting the image components of the training images with a pre-trained target detector, performing position coding and visual feature extraction, and then encoding them to form a region feature sequence;
S2, taking the training-set images and the region feature sequence as the input of the graph analysis module, predicting the relationships between components, and training the network parameters;
S3, forming statement sentences from the training-set questions and candidate answers, performing tokenization, and extracting language features to obtain a language sequence composed of the candidate answers and questions;
S4, for the training set, splicing the output of the graph analysis module with the language sequence, inputting the result into the question-answering module, predicting the correct option of the question, and training the network parameters;
S5, for the test-set images, encoding the diagram, question, and candidate answers into a region feature sequence and a language sequence, inputting them into the deep network, and predicting the correct option of the question;
in S1, the pre-trained target detector is a YOLO v3 target detector pre-trained on the COCO dataset and the schematic question-answering image dataset; the detected schematic components comprise four categories: text, object regions, arrow heads, and arrow tails; region feature sequence coding can be divided into the following two sub-steps:
S1.1, for a schematic I, the detected components are O = {o_1, o_2, ..., o_m}; a visual feature sequence [z_1, z_2, ..., z_m] of dimension 2048 is obtained through the deep network feature extraction module ResNet101; the whole image is also passed through the same feature extractor to obtain z_0, which is placed at the first position of the sequence as global information, forming [z_0, z_1, ..., z_m]; from the upper-left coordinate (x_min, y_min) and lower-right coordinate (x_max, y_max) of each detected component, the values (x_min, y_min, x_max, y_max), width, height, and area form a 7-dimensional position feature sequence [q_0, q_1, ..., q_m];
S1.2, the visual feature sequence [z_0, ..., z_m] and the position features pass through a visual-and-position feature fusion module to obtain the 1024-dimensional region feature sequence H_in = [h_0, h_1, ..., h_m];
in S2, the graph analysis module consists of a multi-layer Transformer encoder TB_sp and one GRU layer; the region feature sequence H_in is input into the Transformer encoder to obtain the 1024-dimensional output H_out = [h_0^out, h_1^out, ..., h_m^out]; regions are paired two by two to predict whether a relationship exists between them, the feature f_ij of the relationship candidate pair <o_i, o_j> being obtained by concatenating h_i^out, q_i, h_j^out, and q_j, where i, j = 1, 2, ..., m and i ≠ j; the sequence composed of the features f_ij is input into the GRU, which predicts whether a relationship exists for each pair <o_i, o_j>; the loss is then calculated as
L_sp = -(1/N) * Σ_{n=1}^{N} [ y_n * log(ŷ_n) + (1 - y_n) * log(1 - ŷ_n) ],
where N is the number of relationship candidate pairs, y_n is the ground-truth value of the nth pair, and ŷ_n is the model's prediction for the nth pair;
in S3, the tokenization and encoding process uses RoBERTa; for a question q about image I with K candidate answers {a_k | k = 1, ..., K}, the question q and each a_k are joined by a space into a statement sentence s_k, which is input into RoBERTa for tokenization and encoding to obtain the language feature W^k = [w_0^k, w_1^k, ..., w_n^k] of s_k, where w_i^k is the encoding of the ith token and n is the maximum number of tokens in the sentence;
in S4, the question-answering module consists of a multi-layer Transformer module TB_dqa and a fully connected layer; the region feature sequence is spliced with the language sequence to form [H_out; W^k], which passes through TB_dqa to obtain U^k = [u_0^k, u_1^k, ..., u_{m+n+1}^k]; the 0th and (m+1)th vectors are multiplied element-wise and input into the fully connected layer and the softmax layer to obtain the score p_k of candidate a_k; the option with the highest score is taken as the predicted value, and the question-answering loss is calculated as
L_dqa = -Σ_{k=1}^{K} 1(k = t_correct) * log(p_k),
where t_correct is the label of the correct answer to the current question, and the indicator 1(k = t_correct) equals 1 if k and t_correct are the same and 0 otherwise;
the global loss is L = α * L_sp + β * L_dqa, where α and β are tunable hyper-parameters that balance learning between the graph analysis module and the question-answering module, and global parameters are adjusted by back-propagation to minimize the global loss until the loss value no longer decreases.
2. The schematic question-answering method based on hierarchical multi-task learning according to claim 1, characterized in that S2 and S4 use a multi-task learning framework that solves the question-answering task through two learning tasks: the diagram parsing task and the question-answering task.
3. The schematic question-answering method based on hierarchical multi-task learning according to claim 1, characterized in that in S4 the graph analysis module and the question-answering module are trained jointly, so that the question-answering training loss is back-propagated to both the graph analysis module and the question-answering module.
4. The schematic question-answering method based on hierarchical multi-task learning according to claim 1, characterized in that the graph parsing task is performed first, and the question-answering task is then performed using its output, constructing a hierarchical multi-task framework.
5. The schematic diagram question-answering method based on hierarchical multi-task learning according to claim 1, characterized in that in S5, the trained network parameters are fixed in order to reason about the schematic diagram and the question, which specifically comprises the following substeps:
S5.1: perform component detection and feature encoding on the schematic diagram according to S1, then pass the result through TB_sp according to S2 to obtain the feature output H_out;
S5.2: preprocess the question and the candidate answers according to S3 to obtain the language features of the question and each candidate answer a_k;
S5.3: concatenate H_out and the language features to form a joint feature sequence, then, following S4, pass it through TB_dqa and the softmax layer to obtain the score of each candidate a_k;
The option with the highest score is selected as the output answer.
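The scoring step of S5.3 (element-wise product of the 0th and (m+1)th output vectors, a fully-connected layer, softmax over candidates, then argmax) can be sketched as follows. The weight vector `w` and bias `b` stand in for the trained fully-connected layer; they are illustrative placeholders, not trained parameters.

```python
import math

def score_candidates(h_pairs, w, b):
    """For each candidate a_k, h_pairs holds the (0th, (m+1)th) output
    vectors of TB_dqa. Their element-wise product passes through a
    linear layer to a scalar logit; softmax over all candidates gives
    the scores."""
    logits = []
    for h0, hm1 in h_pairs:
        fused = [x * y for x, y in zip(h0, hm1)]          # element-wise product
        logits.append(sum(wi * fi for wi, fi in zip(w, fused)) + b)
    z = sum(math.exp(l) for l in logits)
    return [math.exp(l) / z for l in logits]              # softmax scores

def predict(probs):
    """The option with the highest score is the output answer."""
    return max(range(len(probs)), key=probs.__getitem__)
```

Note that at inference time only the argmax matters, so the softmax is not strictly required to pick the answer; it is kept here to match the scoring described in the claims.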
CN202110892487.9A 2021-08-04 2021-08-04 Schematic question-answering method based on hierarchical multi-task learning Expired - Fee Related CN113869349B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110892487.9A CN113869349B (en) 2021-08-04 2021-08-04 Schematic question-answering method based on hierarchical multi-task learning

Publications (2)

Publication Number Publication Date
CN113869349A CN113869349A (en) 2021-12-31
CN113869349B true CN113869349B (en) 2022-10-14

Family

ID=78990250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110892487.9A Expired - Fee Related CN113869349B (en) 2021-08-04 2021-08-04 Schematic question-answering method based on hierarchical multi-task learning

Country Status (1)

Country Link
CN (1) CN113869349B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10909329B2 (en) * 2015-05-21 2021-02-02 Baidu Usa Llc Multilingual image question answering
CN111008293A (en) * 2018-10-06 2020-04-14 上海交通大学 Visual question-answering method based on structured semantic representation
CN110348535B (en) * 2019-07-17 2022-05-31 北京金山数字娱乐科技有限公司 Visual question-answering model training method and device
CN111782839B (en) * 2020-06-30 2023-08-22 北京百度网讯科技有限公司 Image question-answering method, device, computer equipment and medium

Similar Documents

Publication Publication Date Title
Wu et al. Image-to-markup generation via paired adversarial learning
Atienza Advanced Deep Learning with Keras: Apply deep learning techniques, autoencoders, GANs, variational autoencoders, deep reinforcement learning, policy gradients, and more
Davis et al. End-to-end document recognition and understanding with dessurt
Li et al. Improving attention-based handwritten mathematical expression recognition with scale augmentation and drop attention
CN110866542B (en) Depth representation learning method based on feature controllable fusion
CN111160037A (en) Fine-grained emotion analysis method supporting cross-language migration
CN112115238A (en) Question-answering method and system based on BERT and knowledge base
Sharma et al. A survey of methods, datasets and evaluation metrics for visual question answering
CN111831789A (en) Question-answer text matching method based on multilayer semantic feature extraction structure
CN111680484B (en) Answer model generation method and system for visual general knowledge reasoning question and answer
CN110781672A (en) Question bank production method and system based on machine intelligence
Davis et al. Visual FUDGE: form understanding via dynamic graph editing
CN115953569A (en) One-stage visual positioning model construction method based on multi-step reasoning
Kovvuri et al. Pirc net: Using proposal indexing, relationships and context for phrase grounding
Palash et al. Bangla image caption generation through cnn-transformer based encoder-decoder network
Deng et al. A position-aware transformer for image captioning
CN117609536A (en) Language-guided reference expression understanding reasoning network system and reasoning method
CN113869349B (en) Schematic question-answering method based on hierarchical multi-task learning
Sharma et al. Visual question answering model based on the fusion of multimodal features by a two-way co-attention mechanism
CN115409028A (en) Knowledge and data driven multi-granularity Chinese text sentiment analysis method
CN114691848A (en) Relational triple combined extraction method and automatic question-answering system construction method
CN112364654A (en) Education-field-oriented entity and relation combined extraction method
Pei et al. Visual relational reasoning for image caption
Xie et al. Enhancing multimodal deep representation learning by fixed model reuse
Cai et al. Multimodal Visual Question Answering Model Enhanced with Image Emotional Information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20221014