CN115129848A - Method, device, equipment and medium for processing visual question-answering task - Google Patents


Publication number
CN115129848A
Authority
CN
China
Prior art keywords
text
image
target detection
fusion
features
Prior art date
Legal status
Granted
Application number
CN202211068333.9A
Other languages
Chinese (zh)
Other versions
CN115129848B (en)
Inventor
李仁刚
张润泽
赵雅倩
郭振华
范宝余
李晓川
Current Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202211068333.9A priority Critical patent/CN115129848B/en
Publication of CN115129848A publication Critical patent/CN115129848A/en
Priority to PCT/CN2022/142512 priority patent/WO2024045444A1/en
Application granted granted Critical
Publication of CN115129848B publication Critical patent/CN115129848B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/19007Matching; Proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the technical field of image processing and discloses a method, a device, equipment and a medium for processing a visual question-answering task. Feature fusion processing is carried out on an image to be analyzed and a first text to obtain fusion features; the fusion features contain the coordinate information of each detection box. According to the correlation between the image to be analyzed and the first text, target detection boxes meeting the correlation requirement are screened out from the fusion features. The coordinate information, classification categories and semantic features corresponding to the target detection boxes are input into a trained visual question-answering model to obtain a second text matched with the first text, where the first text and the second text have a logical correspondence. By performing feature fusion processing on the image to be analyzed and the first text, a joint analysis of the two can be realized. Detection boxes are pruned based on the correlation, which effectively reduces the interference caused by invalid detection boxes, reduces the computation of the visual question-answering model, and improves the performance of the visual question-answering task.

Description

Method, device, equipment and medium for processing visual question-answering task
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method, an apparatus, a device, and a computer-readable storage medium for processing a visual question-answering task.
Background
Visual question answering refers to the task in which a computer produces a correct answer given an image and a natural language question associated with that image. Visual question answering has become an interesting test for evaluating the reasoning and generalization capabilities of computational models. It involves visual recognition, logic, arithmetic, spatial reasoning, intuitive physics, and causal and multi-hop reasoning. It also requires combining two modalities of different natures: images and language. The high-dimensional visual modality contains a large amount of redundant information, so attention must be focused on the information most relevant to the underlying reasoning problem, which also requires identifying key regions or objects and linking them to the question.
Generally speaking, vision plays an important role in multi-modal understanding tasks: given a question, clues need to be found in the visual input in order to derive the corresponding answer. The visual cues usually come from semantic features of the picture, in two main forms: one is taken directly from an image classification network; the other comes from the coordinate boxes produced by target detection. The second form is the mainstream choice of current multimodal understanding models. However, current implementations do not consider the trade-off between the quality and the quantity of the detected coordinate boxes.
Generally, the number of coordinate boxes can be controlled using a classification confidence threshold, but the result is highly sensitive to that threshold. If the threshold is too small, the number of coordinate boxes is too large and contains much redundant information, which inevitably adds noise to the downstream VQA (Visual Question Answering) model; if the threshold is too large, the number of coordinate boxes is too small, and coordinate boxes directly or indirectly related to the question may be filtered out. As for quality, only coordinate boxes that are directly or indirectly related to the question can be considered high-quality coordinate boxes. In traditional target detection, the visual clues extracted according to a classification confidence threshold often contain many redundant target boxes, so the performance of the visual question-answering task suffers.
Therefore, how to improve the performance of the visual question-answering task is a problem to be solved by those skilled in the art.
Disclosure of Invention
The embodiment of the application aims to provide a method, a device and equipment for processing a visual question-answering task and a computer readable storage medium, which can improve the performance of the visual question-answering task.
In order to solve the foregoing technical problem, an embodiment of the present application provides a method for processing a visual question-answering task, including:
performing feature fusion processing on the image to be analyzed and the first text to obtain fusion features; the fusion features comprise coordinate information of each detection frame;
according to the correlation between the image to be analyzed and the first text, screening a target detection frame meeting the correlation requirement from the fusion characteristics;
inputting coordinate information, classification categories and semantic features corresponding to the target detection box into a trained visual question-answering model to obtain a second text matched with the first text; wherein the first text and the second text have a logical correspondence.
Optionally, the screening, according to the correlation between the image to be analyzed and the first text, a target detection box that meets the correlation requirement from the fusion features includes:
calculating the intersection ratio of each image detection box contained in the image characteristics of the image to be analyzed and the text detection box corresponding to the text characteristics of the first text;
and selecting a target detection frame with the intersection ratio larger than a preset threshold value from all the image detection frames.
Optionally, the screening, according to the correlation between the image to be analyzed and the first text, a target detection box that meets the correlation requirement from the fusion features includes:
screening out a target detection frame meeting the correlation requirement from the fusion characteristics by using a trained target detection model; the target detection model is obtained based on historical images and historical texts in a training mode.
Optionally, for the training process of the target detection model, the method includes:
training an initial detection model by using a target detection data set to obtain a weight parameter corresponding to the initial detection model;
performing positive and negative sample discrimination training on the initial detection model based on the sample label corresponding to each sample in the target detection data set;
after the positive and negative sample discrimination training is finished, calculating a loss function of the initial detection model; the loss function comprises an initial loss function and loss functions corresponding to positive and negative samples;
and adjusting the respective initialization weights of a language coding module and a fusion module contained in the initial detection model and the weight parameters corresponding to the initial detection model according to the loss function of the initial detection model to obtain a trained target detection model.
Optionally, the performing, based on the sample label corresponding to each sample in the target detection data set, positive and negative sample discrimination training on the initial detection model includes:
identifying probability values corresponding to all samples in the target detection data set by using the initial detection model;
determining a loss function corresponding to the positive and negative samples according to the sample label and the probability value corresponding to each sample in the target detection data set;
and adjusting parameters corresponding to a fusion module in the initial detection model based on the loss function corresponding to the positive and negative samples so as to finish the discrimination training of the positive and negative samples.
Optionally, the determining, according to the sample label and the probability value corresponding to each sample in the target detection data set, a loss function corresponding to positive and negative samples includes:
inputting the sample label and the probability value corresponding to each sample in the target detection data set into a positive and negative sample loss function calculation formula to determine a loss function corresponding to the positive and negative samples; wherein, the positive and negative sample loss function calculation formula is:
$$L_{pn} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,w_{+}\,y_i\log p_i + w_{-}\,(1-y_i)\log(1-p_i)\,\right]$$

wherein N represents the total number of samples; y_i represents the value corresponding to the sample label of the i-th sample, with y_i = 1 when the sample label is a positive sample and y_i = 0 when it is a negative sample; w_+ represents the weight corresponding to positive samples; p_i represents the probability value that the i-th sample belongs to a positive sample; and w_- represents the weight corresponding to negative samples.
Optionally, for the training process of the visual question-answering model, the method includes:
screening out a positive sample from the target detection data set by using a trained target detection model;
and training the initial visual question-answer model by using the coordinate information, the classification category and the semantic features corresponding to the positive sample to obtain the trained visual question-answer model.
Optionally, the performing feature fusion processing on the image to be analyzed and the first text to obtain fusion features includes:
extracting image characteristics of the image to be analyzed by using a target detection module of the target detection model; the image features comprise image features corresponding to the detection frames respectively;
performing feature coding on the first text by using a language coding module of the target detection model to obtain text features;
and fusing the image features and the text features by utilizing a fusion module of the target detection model to obtain fusion features.
Optionally, the first text is a question text; the second text is an answer text matched with the question text.
Optionally, the first text is a plurality of question texts, and the second text is an answer text matched with each question text;
correspondingly, the step of screening out the target detection box meeting the requirement of the correlation from the fusion features according to the correlation between the image to be analyzed and the first text comprises the following steps:
and carrying out parallel analysis on the image to be analyzed and the plurality of question texts by using the trained target detection model to obtain target detection frames corresponding to the question texts.
Optionally, the performing feature fusion processing on the image to be analyzed and the first text to obtain fusion features includes:
extracting image characteristics of the image to be analyzed; the image features comprise image features corresponding to the detection frames respectively;
performing feature coding on the first text to obtain text features;
and fusing the image features and the text features to obtain fused features.
The embodiment of the application also provides a processing device of the visual question answering task, which comprises a fusion unit, a screening unit and an obtaining unit;
the fusion unit is used for performing feature fusion processing on the image to be analyzed and the first text to obtain fusion features; the fusion features comprise coordinate information of each detection frame;
the screening unit is used for screening a target detection frame meeting the correlation requirement from the fusion characteristics according to the correlation between the image to be analyzed and the first text;
the obtaining unit is used for inputting the coordinate information, the classification category and the semantic features corresponding to the target detection box into a trained visual question-answering model so as to obtain a second text matched with the first text; wherein the first text and the second text have a logical correspondence.
Optionally, the screening unit includes a calculating subunit and a selecting subunit;
the calculation subunit is used for calculating the intersection ratio of each image detection box contained in the image characteristics of the image to be analyzed and the text detection box corresponding to the text characteristics of the first text;
and the selecting subunit is used for selecting the target detection frame with the intersection ratio larger than a preset threshold value from all the image detection frames.
Optionally, the screening unit is configured to screen a target detection frame meeting the requirement of relevance from the fusion features by using a trained target detection model; the target detection model is obtained based on historical images and historical texts in a training mode.
Optionally, for a training process of the target detection model, the apparatus includes a training unit, a discrimination unit, a calculation unit, and an adjustment unit;
the training unit is used for training an initial detection model by using a target detection data set to obtain a weight parameter corresponding to the initial detection model;
the judging unit is used for carrying out positive and negative sample judging training on the initial detection model based on the sample label corresponding to each sample in the target detection data set;
the calculation unit is used for calculating a loss function of the initial detection model after finishing the discrimination training of the positive and negative samples; the loss function comprises an initial loss function and loss functions corresponding to positive and negative samples;
and the adjusting unit is used for adjusting the respective initialization weights of the language coding module and the fusion module contained in the initial detection model and the weight parameters corresponding to the initial detection model according to the loss function of the initial detection model to obtain the trained target detection model.
Optionally, the distinguishing unit includes an identifying subunit, a determining subunit and a parameter adjusting subunit;
the identification subunit is configured to identify, by using the initial detection model, a probability value corresponding to each sample in the target detection data set;
the determining subunit is configured to determine a loss function corresponding to the positive and negative samples according to the sample label and the probability value corresponding to each sample in the target detection data set;
and the parameter adjusting subunit is used for adjusting the parameters corresponding to the fusion module in the initial detection model based on the loss function corresponding to the positive and negative samples so as to complete the discrimination training of the positive and negative samples.
Optionally, the determining subunit is configured to input the sample label and the probability value corresponding to each sample in the target detection data set to a positive and negative sample loss function calculation formula, so as to determine a loss function corresponding to a positive and negative sample; wherein, the positive and negative sample loss function calculation formula is:
$$L_{pn} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,w_{+}\,y_i\log p_i + w_{-}\,(1-y_i)\log(1-p_i)\,\right]$$

wherein N represents the total number of samples; y_i represents the value corresponding to the sample label of the i-th sample, with y_i = 1 when the sample label is a positive sample and y_i = 0 when it is a negative sample; w_+ represents the weight corresponding to positive samples; p_i represents the probability value that the i-th sample belongs to a positive sample; and w_- represents the weight corresponding to negative samples.
Optionally, for the training process of the visual question-answer model, the apparatus comprises a question-answer training unit;
the screening unit is further used for screening out a positive sample from the target detection data set by using the trained target detection model;
and the question-answer training unit is used for training an initial visual question-answer model by using the coordinate information, the classification category and the semantic features corresponding to the positive sample so as to obtain the trained visual question-answer model.
Optionally, the fusion unit includes an extraction subunit, a coding subunit, and a feature fusion subunit;
the extraction subunit is configured to extract, by using a target detection module of the target detection model, image features of the image to be analyzed; the image features comprise image features corresponding to the detection frames respectively;
the coding subunit is configured to perform feature coding on the first text by using a language coding module of the target detection model to obtain text features;
and the feature fusion subunit is configured to fuse the image feature and the text feature by using a fusion module of the target detection model to obtain a fusion feature.
Optionally, the first text is a question text; the second text is an answer text matched with the question text.
Optionally, the first text is a plurality of question texts, and the second text is an answer text matched with each question text;
correspondingly, the screening unit is configured to perform parallel analysis on the image to be analyzed and the plurality of question texts by using the trained target detection model to obtain a target detection box corresponding to each question text.
Optionally, the fusion unit includes an extraction subunit, a coding subunit, and a feature fusion subunit;
the extraction subunit is used for extracting the image characteristics of the image to be analyzed; the image features comprise image features corresponding to the detection frames respectively;
the coding subunit is configured to perform feature coding on the first text to obtain text features;
and the feature fusion subunit is used for fusing the image features and the text features to obtain fusion features.
The embodiment of the application also provides terminal equipment, which comprises a display screen, an input interface and a processor, wherein the processor is respectively connected with the display screen and the input interface;
the input interface is used for receiving an image to be analyzed and a first text;
the processor is used for carrying out feature fusion processing on the image to be analyzed and the first text to obtain fusion features; the fusion features comprise coordinate information of each detection frame; according to the correlation between the image to be analyzed and the first text, screening out a target detection box meeting the correlation requirement from the fusion characteristics; inputting coordinate information, classification categories and semantic features corresponding to the target detection box into a trained visual question-answering model to obtain a second text matched with the first text; wherein the first text and the second text have a logical correspondence;
the display screen is used for displaying the first text and the second text corresponding to the first text.
An embodiment of the present application further provides an electronic device, including:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of the processing method of the visual question-answering task as described above.
The embodiment of the present application also provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps of the processing method of the visual question-answering task are implemented.
According to the above technical scheme, feature fusion processing is performed on the image to be analyzed and the first text to obtain fusion features; the fusion features comprise the coordinate information of each detection box. Each detection box has its corresponding image information, the number of detection boxes corresponding to the fusion features is often large, and these detection boxes contain not only detection boxes strongly correlated with the first text but also detection boxes weakly correlated with the first text. In order to prune the weakly correlated detection boxes, target detection boxes meeting the correlation requirement can be screened out from the fusion features according to the correlation between the image to be analyzed and the first text. The coordinate information, classification categories and semantic features corresponding to the target detection boxes are input into the trained visual question-answering model to obtain a second text matched with the first text, where the first text and the second text have a logical correspondence. In this technical scheme, by performing feature fusion processing on the image to be analyzed and the first text, a joint analysis of the two can be realized. Pruning detection boxes based on the correlation effectively reduces the interference caused by invalid detection boxes, reduces the computation of the visual question-answering model, and improves the performance of the visual question-answering task.
Drawings
In order to more clearly illustrate the embodiments of the present application, the drawings required for the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained by those skilled in the art without inventive effort.
Fig. 1 is a schematic diagram of a hardware composition framework to which a processing method of a visual question-answering task provided in an embodiment of the present application is applied;
fig. 2 is a schematic diagram of a hardware composition framework to which another method for processing a visual question-answering task according to an embodiment of the present application is applied;
fig. 3 is a flowchart of a processing method of a visual question-answering task according to an embodiment of the present application;
fig. 4 is a network structure diagram of a target detection model according to an embodiment of the present application;
fig. 5 is a flowchart of a training method of a target detection model according to an embodiment of the present disclosure;
fig. 6 is a network structure diagram of a fusion module according to an embodiment of the present application;
fig. 7 is a schematic diagram illustrating parallel processing of different visual question-answering tasks at a mobile phone end according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a processing apparatus for a visual question-answering task according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without any creative effort belong to the protection scope of the present application.
The terms "including" and "having," and any variations thereof, in the description and claims of this application and the drawings described above, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may include other steps or elements not expressly listed.
In order that those skilled in the art will better understand the disclosure, the following detailed description will be given with reference to the accompanying drawings.
For convenience of understanding, a hardware composition framework used in a scheme corresponding to the processing method of the visual question-answering task provided in the embodiment of the present application is introduced first. Referring to fig. 1, fig. 1 is a schematic diagram of a hardware composition framework applicable to a processing method of a visual question-answering task according to an embodiment of the present application. Wherein the electronic device 100 may include a processor 101 and a memory 102, and may further include one or more of a multimedia component 103, an information input/information output (I/O) interface 104, and a communication component 105.
The processor 101 is configured to control the overall operation of the electronic device 100 to complete all or part of the steps in the processing method of the visual question-answering task; the memory 102 is used to store various types of data to support operation at the electronic device 100, such data may include, for example, instructions for any application or method operating on the electronic device 100, as well as application-related data. The Memory 102 may be implemented by any type or combination of volatile and non-volatile Memory devices, such as one or more of Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic or optical disk. In the present embodiment, the memory 102 stores therein at least programs and/or data for realizing the following functions:
performing feature fusion processing on the image to be analyzed and the first text to obtain fusion features; the fusion characteristics comprise coordinate information of each detection frame;
according to the correlation between the image to be analyzed and the first text, screening out a target detection frame meeting the correlation requirement from the fusion characteristics;
inputting coordinate information, classification categories and semantic features corresponding to the target detection box into the trained visual question-answering model to obtain a second text matched with the first text; and the first text and the second text have a logical correspondence.
The multimedia component 103 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in the memory 102 or transmitted through the communication component 105. The audio component also includes at least one speaker for outputting audio signals. An information input/information output (I/O) interface 104 provides an interface between the processor 101 and other interface modules, such as a keyboard, a mouse, buttons, etc. These buttons may be virtual buttons or physical buttons. The communication component 105 is used for wired or wireless communication between the electronic device 100 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G or 4G, or a combination of one or more of them, so the corresponding communication component 105 may include a Wi-Fi part, a Bluetooth part and an NFC part.
The electronic Device 100 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, and may be used to perform the Processing of visual question answering tasks.
Of course, the structure of the electronic device 100 shown in fig. 1 does not constitute a limitation to the electronic device in the embodiment of the present application, and in practical applications, the electronic device 100 may include more or less components than those shown in fig. 1, or some components may be combined.
It is to be understood that, in the embodiment of the present application, the number of electronic devices is not limited, and a plurality of electronic devices may cooperate to complete the processing method of the visual question-answering task. In a possible implementation manner, please refer to fig. 2; fig. 2 is a schematic diagram of a hardware composition framework applicable to another processing method for a visual question-answering task provided in an embodiment of the present application. As can be seen from fig. 2, the hardware composition framework may include: the first electronic device 11 and the second electronic device 12, which are connected to each other through a network 13.
In the embodiment of the present application, the hardware structures of the first electronic device 11 and the second electronic device 12 may refer to the electronic device 100 in fig. 1. That is, it can be understood that there are two electronic devices 100 in the present embodiment, and the two devices perform data interaction. Further, in the embodiment of the present application, the form of the network 13 is not limited, that is, the network 13 may be a wireless network (such as WIFI, bluetooth, etc.), or may also be a wired network.
The first electronic device 11 and the second electronic device 12 may be the same electronic device, for example, the first electronic device 11 and the second electronic device 12 are both servers; or may be different types of electronic devices, for example, the first electronic device 11 may be a smartphone or other intelligent terminal, and the second electronic device 12 may be a server. In one possible embodiment, a server with high computing power may be used as the second electronic device 12 to improve the data processing efficiency and reliability, and thus the processing efficiency of the model training and/or the visual question answering. Meanwhile, a smartphone with low cost and wide application range is used as the first electronic device 11 to realize interaction between the second electronic device 12 and the user. It is to be understood that the interaction process may be: the first electronic device 11 transmits the image to be analyzed and the first text to the second electronic device 12, and the second electronic device 12 performs feature fusion processing on the image to be analyzed and the first text to obtain fusion features; the fusion characteristics comprise coordinate information of each detection frame; according to the correlation between the image to be analyzed and the first text, screening out a target detection frame meeting the correlation requirement from the fusion characteristics; and inputting the coordinate information, the classification category and the semantic features corresponding to the target detection box into the trained visual question-answering model to obtain a second text matched with the first text, and feeding the second text back to the first electronic device 11.
Next, a method for processing a visual question-answering task provided in an embodiment of the present application will be described in detail. Fig. 3 is a flowchart of a processing method of a visual question-answering task according to an embodiment of the present application, where the method includes:
s301: and performing feature fusion processing on the image to be analyzed and the first text to obtain fusion features.
In the traditional approach, target detection pre-training weights are used directly to run inference on all pictures of a data set, and then, for each picture, the detection boxes, classification confidences and the semantic features extracted for the coordinate frames are computed. Corresponding detection boxes are then selected by setting a threshold on the classification confidence or by setting the number of detection boxes output for each image.
The detection frame refers to a position area where the target object is located in the picture. The target object may be a person or an object associated with the text, or may be a person or an object not associated with the text. For example, one picture includes a girl, a dog and a sky, the girl, the dog and the sky can be used as target objects, and the detection frame corresponding to the target object can include a location area where the girl is located, a location area where the dog is located and a location area where the sky is located.
The number of detection boxes generated in the conventional method is often large, and selecting detection boxes according to a set threshold or a set number cannot reliably pick out the detection boxes that are strongly correlated with the text, so the answer to the text generated by the subsequent visual question-answering model tends to be unsuitable.
Therefore, in the embodiment of the application, in order to improve the performance of the visual question-answering task, feature fusion processing may be performed on the image to be analyzed and the first text to obtain fusion features, so that the detection boxes are conveniently screened according to the fusion features, and the detection boxes with weak text correlation are deleted.
In practical application, the image characteristics of the image to be analyzed can be extracted; the image features comprise image features corresponding to the detection frames respectively. Performing feature coding on the first text to obtain text features; and fusing the image features and the text features to obtain fused features. The fusion features contain coordinate information of each detection frame.
In this embodiment, the image to be analyzed may be any picture, and the first text may be a question raised about the image to be analyzed. For example, where the picture includes a girl and a dog playing on a beach, the first text may be "Where is the woman sitting".
S302: and screening a target detection frame meeting the correlation requirement from the fusion characteristics according to the correlation between the image to be analyzed and the first text.
The fusion feature may be derived based on image features of the image to be analyzed and text features of the first text.
Both image features and text features may be presented in the form of detection boxes. For the correlation of the image to be analyzed and the first text, evaluation may be performed based on an IOU value (Intersection Over Union) between detection boxes.
In a specific implementation, the intersection ratio of each image detection box contained in the image features and the text detection box corresponding to the text features can be calculated; and selecting a target detection frame with the intersection ratio larger than a preset threshold value from all the image detection frames.
The value of the preset threshold can be set flexibly according to actual requirements; it can be set to 0.5, for example. Each image detection box is processed in the same manner: for a given image detection box, the IoU value between that image detection box and the text detection box is calculated. An IoU value greater than 0.5 indicates that the image detection box and the text detection box are strongly correlated and the image detection box belongs to the positive samples; in this case, the image detection box can be used as a target detection box to participate in the subsequent analysis process.
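By way of illustration only, the screening step described above can be sketched as follows; the box format ([x1, y1, x2, y2]), the function names and the 0.5 threshold are assumptions made for the example, not a prescribed implementation.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_target_boxes(image_boxes, text_box, threshold=0.5):
    """Keep only the image detection boxes whose IoU with the text detection box
    exceeds the preset threshold; these become the target detection boxes."""
    return [box for box in image_boxes if iou(box, text_box) > threshold]
```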
S303: and inputting the coordinate information, classification category and semantic features corresponding to the target detection box into the trained visual question-answering model to obtain a second text matched with the first text.
And the first text and the second text have a logical correspondence. For example, the first text may be a question text and the second text may be an answer text.
The number of the target detection frames is less than that of the image detection frames contained in the image features.
After the target detection boxes are screened out, the coordinate information, classification categories and semantic features corresponding to the target detection boxes can be extracted through a Feed-Forward Network (FFN) module. The coordinate information, classification categories and semantic features corresponding to the target detection boxes are then input into the trained visual question-answering model to obtain a second text matched with the first text.
The visual question-answering model can be, for example, VinVL (Visual Representations in Vision-Language Models) or LXMERT (Learning Cross-Modality Encoder Representations from Transformers).
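For illustration, one simple way to assemble such an input is to concatenate each target box's semantic features with its coordinates and class index; the function below is a hypothetical sketch and not the input format prescribed by VinVL, LXMERT or the present application.

```python
import torch

def build_vqa_inputs(semantic_feats, boxes, class_ids):
    """semantic_feats: (M, D) region features; boxes: (M, 4) normalized coordinates;
    class_ids: (M,) classification categories. Returns an (M, D + 4 + 1) visual input."""
    return torch.cat([semantic_feats, boxes, class_ids.unsqueeze(-1).float()], dim=-1)
```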
According to the technical scheme, feature fusion processing is performed on the image to be analyzed and the first text to obtain fusion features; the fusion features comprise the coordinate information of each detection box. Each detection box has its corresponding image information, the number of detection boxes corresponding to the fusion features is often large, and these detection boxes contain not only detection boxes strongly correlated with the first text but also detection boxes weakly correlated with the first text. In order to prune the weakly correlated detection boxes, target detection boxes meeting the correlation requirement can be screened out from the fusion features according to the correlation between the image to be analyzed and the first text. The coordinate information, classification categories and semantic features corresponding to the target detection boxes are input into the trained visual question-answering model to obtain a second text matched with the first text, where the first text and the second text have a logical correspondence. In this technical scheme, by performing feature fusion processing on the image to be analyzed and the first text, a joint analysis of the two can be realized. Pruning detection boxes based on the correlation effectively reduces the interference caused by invalid detection boxes, reduces the computation of the visual question-answering model, and improves the performance of the visual question-answering task.
In the embodiment of the application, the processing of the visual question-answering task can be realized by adopting a mode of combining a target detection model and a visual question-answering model. The target detection model can analyze the image to be analyzed and the first text, so that a target detection box meeting the correlation requirement is screened out, and coordinate information, classification categories and semantic features corresponding to the target detection box are extracted. The target detection model can be trained based on historical images and historical texts.
In practical application, the trained target detection model can be used for carrying out feature fusion processing on the image to be analyzed and the first text, so that fusion features are obtained, and a target detection frame meeting the correlation requirement is screened out from the fusion features.
The basic model on which the target detection model adopted in the embodiment of the present application depends may be DETR (DEtection TRansformer, a Transformer-based target detection network), which uses the recently popular Transformer structure to turn target detection into a bipartite matching problem between detection boxes and standard boxes (Ground Truth).
Fig. 4 is a network structure diagram of a target detection model according to an embodiment of the present application, where the target detection model includes a backbone network, a coding module (Transformer encoder), a decoding module (Transformer decoder), a fusion module, a Feed-Forward Network (FFN) module, and a language coding module (RoBERTa). The backbone network, the coding module and the decoding module can extract image features. The language coding module can extract text features. The fusion module can realize the fusion of the image features and the text features, so that the target detection boxes are screened out. The Feed-Forward Network module can extract the coordinate information, classification categories and semantic features of the target detection boxes.
In the embodiment of the present application, the backbone network, the encoding module and the decoding module may be used as the target detection module. Extracting image characteristics of an image to be analyzed by using a target detection module of a target detection model; the image feature may include an image feature corresponding to each of the plurality of detection frames. Performing feature coding on the first text by using a language coding module of the target detection model to obtain text features; and fusing the image features and the text features by utilizing a fusion module of the target detection model to obtain fusion features.
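For orientation only, the following PyTorch-style skeleton shows how the pieces of Fig. 4 fit together: a backbone feeding a Transformer encoder/decoder, a projection of the text encoding, a cross-modal fusion step, and heads for box coordinates and a relevance score. All module sizes, names and layer choices here are assumptions made for the sketch, not the claimed implementation.

```python
import torch
import torch.nn as nn

class QuestionGuidedDetector(nn.Module):
    """Illustrative skeleton: image branch (backbone + Transformer encoder/decoder),
    projected text features (e.g. from a RoBERTa-style encoder, dim 768), and a
    fusion step that scores each of the 100 query boxes against the question."""

    def __init__(self, d_model=256, num_queries=100):
        super().__init__()
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)  # stand-in for a CNN/Swin backbone
        self.encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, 8, batch_first=True), 2)
        self.decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, 8, batch_first=True), 2)
        self.queries = nn.Embedding(num_queries, d_model)            # learnable query embeddings
        self.text_proj = nn.Linear(768, d_model)                     # project text features to d_model
        self.fusion = nn.TransformerDecoderLayer(d_model, 8, batch_first=True)  # cross-modal attention
        self.box_head = nn.Linear(d_model, 4)                        # FFN-style coordinate regression
        self.score_head = nn.Linear(d_model, 1)                      # positive/negative (relevance) score

    def forward(self, images, text_feats):
        b = images.size(0)
        feats = self.backbone(images).flatten(2).transpose(1, 2)     # (B, HW, d_model)
        memory = self.encoder(feats)
        queries = self.queries.weight.unsqueeze(0).repeat(b, 1, 1)   # (B, num_queries, d_model)
        box_feats = self.decoder(queries, memory)
        fused = self.fusion(box_feats, self.text_proj(text_feats))   # box queries attend to the question
        return self.box_head(fused), self.score_head(fused).squeeze(-1)

# e.g. boxes, scores = QuestionGuidedDetector()(torch.rand(2, 3, 224, 224), torch.rand(2, 20, 768))
```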
The trained target detection model can be used for realizing the analysis processing of the image to be analyzed and the first text. The training of the target detection model is a basic premise of the target detection model for performing the visual question-answering task processing, and fig. 5 is a flowchart of a training method of the target detection model provided by the embodiment of the application, and the method includes:
s501: and training the initial detection model by using the target detection data set to obtain a weight parameter corresponding to the initial detection model.
The target detection dataset may include the COCO (Common Objects in Context) dataset, the Visual Genome dataset, the Objects365 dataset, and the like.
In the model training stage, features are first extracted from the picture through a backbone network (CNN), and position coding features are added at the same time. The position coding features are obtained adaptively according to the resolution of the picture, and their significance is to capture the local position information of the picture feature map. The image features are then encoded by a Transformer encoder, a set of learnable initialized embedding parameters (queries) is defined, and the corresponding target positions and classifications are decoded from the encoded image features. These queries are equivalent to adaptive anchor information (the predefined anchors of target detection), and the detection position and corresponding class of each object are decoded by the decoder. Bipartite matching is introduced in the training process to complete the matching between the Ground Truth coordinate frames and the detection boxes. The matching strategy is as follows:
$$\hat{\sigma} = \operatorname*{argmin}_{\sigma} \sum_{i=1}^{N} L_{match}\left(y_i,\; y^{pred}_{\sigma(i)}\right)$$

wherein y_i represents a Ground Truth coordinate frame and y^{pred}_{\sigma(i)} represents a detection box; the Hungarian matching algorithm is utilized to match the detection boxes with the coordinate frames. argmin denotes the assignment \sigma at which the summation over y_i and y^{pred}_{\sigma(i)} reaches its minimum value. L_{match} denotes the degree of matching between a detection box and a coordinate frame.
Assuming that there are N (N < 100) objects in the picture, after the Hungarian matching algorithm is performed on the 100 queries, only N detection boxes correspond to the Ground Truth coordinate frames, so the operation of removing duplicate boxes by NMS in the conventional target detection framework is not needed.
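As a minimal sketch of the bipartite matching step, assuming SciPy's Hungarian solver and a simplified L1-distance cost (the full matching cost also involves the classification probability and an IoU term):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_boxes, gt_boxes):
    """Match each Ground Truth box to one predicted query box by minimizing a pairwise cost.
    pred_boxes: (num_queries, 4) array; gt_boxes: (num_gt, 4) array.
    Returns matched (pred_idx, gt_idx) index arrays."""
    # Simplified cost: sum of absolute coordinate differences between every pair of boxes.
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)  # (num_queries, num_gt)
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return pred_idx, gt_idx
```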
S502: and performing positive and negative sample discrimination training on the initial detection model based on the sample label corresponding to each sample in the target detection data set.
Taking the example of generating an answer corresponding to a question according to image features, a positive sample may be an image feature having a strong correlation with the question, and a negative sample may be an image feature having a weak correlation with the question.
In a specific implementation, the probability value corresponding to each sample in the target detection data set may be identified by using the initial detection model. The higher the probability value, the stronger the correlation between the image features contained in the sample and the problem.
The samples in the target detection dataset may be detection frames corresponding to respective pictures in the target detection dataset, each detection frame having its corresponding image feature.
According to the sample label and the probability value corresponding to each sample in the target detection data set, a loss function corresponding to the positive and negative samples can be determined; and adjusting parameters corresponding to a fusion module in the initial detection model based on the loss function corresponding to the positive and negative samples so as to finish the discrimination training of the positive and negative samples.
For the determination of the positive and negative sample loss functions, a positive and negative sample loss function calculation formula can be set, and the sample labels and the probability values corresponding to the samples in the target detection data set are input into the positive and negative sample loss function calculation formula to determine the loss functions corresponding to the positive and negative samples; wherein, the positive and negative sample loss function calculation formula is:
$$L_{pn} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,w_{+}\,y_i\log p_i + w_{-}\,(1-y_i)\log(1-p_i)\,\right]$$

wherein N represents the total number of samples; y_i represents the value corresponding to the sample label of the i-th sample, with y_i = 1 when the sample label is a positive sample and y_i = 0 when it is a negative sample; w_+ represents the weight corresponding to positive samples; p_i represents the probability value that the i-th sample belongs to a positive sample; and w_- represents the weight corresponding to negative samples.

Considering that in practical applications the proportions of positive and negative samples are imbalanced, and the proportion of positive samples is often small, the weights can be set to w_+ = 40 and w_- = 1.
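Assuming the loss takes the form of a class-weighted binary cross-entropy consistent with the variable definitions above (this exact form is an assumption of the sketch), it could be written as:

```python
import torch

def pn_sample_loss(p, y, w_pos=40.0, w_neg=1.0, eps=1e-7):
    """Class-weighted binary cross-entropy over N detection boxes.
    p: probability that each box is a positive sample; y: 1.0 for positive, 0.0 for negative."""
    p = p.clamp(eps, 1.0 - eps)
    loss = -(w_pos * y * torch.log(p) + w_neg * (1.0 - y) * torch.log(1.0 - p))
    return loss.mean()
```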
S503: and after the positive and negative sample discrimination training is finished, calculating a loss function of the initial detection model.
The loss function may include an initial loss function and a loss function corresponding to positive and negative samples. The calculation method of the loss function corresponding to the positive and negative samples can be referred to the above description, and is not described herein again.
The initial loss function is calculated as follows:
$$L_{init}(y, y^{pred}) = \sum_{i=1}^{N}\left[-\log p_{\sigma(i)}(c_i) + \lambda_{iou}\,L_{iou}\left(b_i, b_{\sigma(i)}\right) + \lambda_{1}\,L_{1}\left(b_i, b_{\sigma(i)}\right)\right]$$

wherein the initial loss function comprises three terms: the first term represents the classification loss, the second term represents the IoU loss, and the third term represents the L_1 loss. y represents a Ground Truth coordinate frame, and y^{pred} represents a detection box obtained from the extracted image features. \sigma(i) indicates the serial number of the detection box corresponding to the Ground Truth coordinate frame with serial number i. p_{\sigma(i)}(c_i) represents the classification probability of the detection box corresponding to the Ground Truth. b_i represents the coordinates of the i-th Ground Truth box, i.e. [x1, y1, x2, y2]. In the same way, b_{\sigma(i)} represents the coordinates of the detection box matched with the Ground Truth. \lambda_{iou} and \lambda_{1} respectively represent the regression loss coefficients of the coordinate frames, and may both be set to 1 in the present application. L_{iou} denotes the IoU loss, and L_{1} denotes the L_1 loss.
The calculation formula of L_{iou} is as follows:

$$L_{iou}\left(b_i, b_{\sigma(i)}\right) = 1 - \frac{\left|b_i \cap b_{\sigma(i)}\right|}{\left|b_i \cup b_{\sigma(i)}\right|}$$
The calculation formula of L_1 is as follows, namely the sum of the absolute differences between the four corner coordinates of the detection box and those of the Ground Truth:

$$L_{1}\left(b_i, b_{\sigma(i)}\right) = \left\| b_i - b_{\sigma(i)} \right\|_{1}$$
s504: and adjusting the respective initialization weights of the language coding module and the fusion module contained in the initial detection model and the weight parameters corresponding to the initial detection model according to the loss function of the initial detection model to obtain the trained target detection model.
Compared with the traditional target detection model, the question-optimized target detection model additionally contains a language coding module and a fusion module. The language coding module can adopt RoBERTa-base pre-training weights to generate encoding features for the question, with a feature dimension of 768.
Fig. 6 shows a network structure diagram of the fusion module. The fusion module includes two single-modality transformer models (intra-transformers), a cross-modality transformer model (cross-transformer), a linear layer, and a positive/negative sample discrimination module. The linear layer can be connected to the FFN module of the target detection model.
In practical applications, the language coding module inputs the Roberta-encoded text features into one intra-transformer network module, and the image features output by the decoder of the DETR model are input into the other intra-transformer network module. The two modality features then pass together through the cross-modality transformer (cross-transformer) network module, and the resulting fused features are finally input into the linear layer. 100 query vectors are preset in the DETR model, corresponding to 100 detection frames. Each detection frame is given either a positive or a negative sample label based on the coordinate frame of the location referred to by the question. The decision criterion may be the IOU value between the detection coordinate frame and the ground truth coordinate frame given for the question in the GQA dataset: if the IOU value is greater than 0.5, the detection frame is determined to be a positive sample; otherwise, it is a negative sample. The positive and negative sample discrimination is trained first, and as the number of training iterations increases, the FFN module is gradually added to classify the coordinate frames and optimize the regression of the related position coordinates.
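For illustration only, a minimal sketch of such a fusion module in PyTorch might look like the following; the class name, layer sizes, and the use of nn.TransformerEncoderLayer and nn.MultiheadAttention are assumptions made for this example, not the exact structure disclosed in Fig. 6:

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Two intra-modal transformers, one cross-modal attention block, a linear layer,
    and a positive/negative sample discrimination head."""

    def __init__(self, dim=256, text_dim=768, heads=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)                 # project 768-d Roberta features
        self.intra_text = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.intra_image = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.linear = nn.Linear(dim, dim)                         # feeds the downstream FFN module
        self.pos_neg_head = nn.Linear(dim, 1)                     # positive/negative discrimination

    def forward(self, image_feats, text_feats):
        # image_feats: (B, 100, dim) query features from the DETR decoder
        # text_feats:  (B, L, text_dim) Roberta-encoded question features
        t = self.intra_text(self.text_proj(text_feats))
        v = self.intra_image(image_feats)
        fused, _ = self.cross_attn(query=v, key=t, value=t)       # cross-modal fusion
        fused = self.linear(fused)
        pos_neg_logits = self.pos_neg_head(fused).squeeze(-1)     # one score per detection frame
        return fused, pos_neg_logits
```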
The method for adjusting the model parameters based on the loss function belongs to the conventional method, and is not described herein again.
After the trained target detection model is obtained, a positive sample can be screened out from the target detection data set by using the trained target detection model; and training the initial visual question-answer model by using the coordinate information, classification category and semantic features corresponding to the positive sample to obtain the trained visual question-answer model.
The framework of the target detection model provided by the embodiments of the present application focuses on optimizing the process of extracting visual clues for target detection. By inputting the question into the target detection model, the target detection frames that are directly related to the question, or indirectly related to it through reasoning, can be successfully detected, and the redundant target detection frames of the traditional scheme can be largely pruned. From the perspective of visual question answering task performance, optimizing the visual clues greatly improves the task performance.
The processing scheme of the visual question answering task provided by the embodiments of the present application can be conveniently applied to terminal devices such as mobile phones and Field-Programmable Gate Array (FPGA) chips. Based on the functions to be realized, the scheme can be divided into an optimized visual cue module and a visual question answering module. The optimized visual cue module mainly comprises a backbone network, a target detection module (comprising an encoding module and a decoding module), and an MLP module (comprising a fusion module and an FFN module).
The backbone network adopts a Swin Transformer structure, the target detection module adopts basic Transformer encoder and Transformer decoder modules, and the MLP module consists of a series of fully connected layers and matrix-vector operations. Because the Transformer and MLP networks consist entirely of matrix multiply-and-add operations, parallel acceleration can conveniently be carried out on hardware devices.
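As an assumed sketch of how these pieces compose (the class name and the injected sub-modules are placeholders for this example, not the disclosed implementation):

```python
import torch.nn as nn

class OptimizedVisualCueModule(nn.Module):
    """Backbone + Transformer encoder/decoder + MLP (fusion module + FFN), as outlined above."""

    def __init__(self, backbone, encoder, decoder, fusion, ffn):
        super().__init__()
        # backbone: e.g. a Swin Transformer feature extractor (assumed to be provided)
        self.backbone, self.encoder, self.decoder = backbone, encoder, decoder
        self.fusion, self.ffn = fusion, ffn          # the MLP part: fusion module + FFN head

    def forward(self, image, text_feats):
        feats = self.backbone(image)                 # image feature map
        memory = self.encoder(feats)                 # Transformer encoder
        queries = self.decoder(memory)               # 100 query features / detection frames
        fused, _ = self.fusion(queries, text_feats)  # cross-modal fusion with the question
        return self.ffn(fused)                       # classes and box coordinates
```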
Therefore, in practical applications, the first text may be a plurality of question texts, and correspondingly, the second text is an answer text matched with each question text. In specific implementation, the trained target detection model can be used to perform parallel analysis on the image to be analyzed and the plurality of question texts, so as to obtain the target detection frames corresponding to the question texts.
Taking a mobile phone as an example, Fig. 7 is a schematic diagram of processing different visual question answering tasks in parallel on a mobile phone according to an embodiment of the present application. Two models may be deployed on the mobile phone, each comprising an optimized visual cue module and a visual question answering module. Neither model has convolution operations, so inference can be performed in parallel. Given a question and the entire image, the optimized visual cue module outputs the partial image regions related to the question and the classifications of the corresponding regions. Given the question "Is the person happy" and the entire image, the output is a girl region and a dog region, classified as girl and dog; the visual question answering module takes this result and the question as input and infers the final answer "Yes". Given the question "What is the weather like" and the entire image, the output is a sky region classified as sky; the visual question answering module takes this result and the question as input and infers the final answer "Sunny".
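As a purely illustrative sketch (the module objects and function names are assumptions), the parallel handling of multiple question texts on a terminal device could be organized along these lines:

```python
from concurrent.futures import ThreadPoolExecutor

def answer_question(image, question, cue_module, vqa_module):
    """Run one optimized-visual-cue + VQA pipeline for a single question."""
    regions, classes = cue_module(image, question)   # question-related regions and their classes
    return vqa_module(regions, classes, question)    # e.g. "Yes", "Sunny"

def answer_in_parallel(image, questions, pipelines):
    """Each question gets its own pipeline instance so inference can run concurrently."""
    with ThreadPoolExecutor(max_workers=len(pipelines)) as pool:
        futures = [pool.submit(answer_question, image, q, cue, vqa)
                   for q, (cue, vqa) in zip(questions, pipelines)]
        return [f.result() for f in futures]
```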
By deploying the plurality of optimized visual cue modules and the plurality of visual question-answering modules on the terminal equipment, the parallel processing of the plurality of visual question-answering tasks can be realized, the processing efficiency of the visual question-answering tasks is greatly improved, and the performance of the terminal equipment can be fully exerted.
Fig. 8 is a schematic structural diagram of a processing apparatus for a visual question-answering task according to an embodiment of the present application, including a fusion unit 81, a screening unit 82, and an obtaining unit 83;
the fusion unit 81 is configured to perform feature fusion processing on the image to be analyzed and the first text to obtain a fusion feature; the fusion characteristics comprise coordinate information of each detection frame;
the screening unit 82 is used for screening a target detection frame meeting the correlation requirement from the fusion characteristics according to the correlation between the image to be analyzed and the first text;
the obtaining unit 83 is configured to input the coordinate information, the classification category, and the semantic features corresponding to the target detection box into the trained visual question-answering model to obtain a second text matched with the first text; and the first text and the second text have a logical correspondence.
Optionally, the screening unit includes a calculating subunit and a selecting subunit;
the calculation subunit is used for calculating the intersection ratio of each image detection box contained in the image characteristics of the image to be analyzed and the text detection box corresponding to the text characteristics of the first text;
and the selecting subunit is used for selecting the target detection frame with the intersection ratio larger than the preset threshold value from all the image detection frames.
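By way of illustration only (the function name, box format, and the concrete threshold value are assumptions), the intersection-ratio-based screening performed by these subunits might be sketched as:

```python
def screen_detection_boxes(image_boxes, text_box, threshold=0.5):
    """Keep image detection boxes whose intersection ratio (IoU) with the text detection box
    exceeds a preset threshold; boxes are [x1, y1, x2, y2], threshold assumed 0.5 here."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-7)

    return [box for box in image_boxes if iou(box, text_box) > threshold]
```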
Optionally, the screening unit is configured to screen a target detection frame that meets the requirement of the correlation from the fusion features by using the trained target detection model; the target detection model is obtained based on historical images and historical texts.
Optionally, for a training process of the target detection model, the device includes a training unit, a discrimination unit, a calculation unit, and an adjustment unit;
the training unit is used for training the initial detection model by using the target detection data set to obtain a weight parameter corresponding to the initial detection model;
the judging unit is used for carrying out positive and negative sample judging training on the initial detection model based on the sample label corresponding to each sample in the target detection data set;
the calculation unit is used for calculating a loss function of the initial detection model after the positive and negative sample discrimination training is finished; the loss function comprises an initial loss function and loss functions corresponding to positive and negative samples;
and the adjusting unit is used for adjusting the respective initialization weights of the language coding module and the fusion module contained in the initial detection model and the weight parameters corresponding to the initial detection model according to the loss function of the initial detection model to obtain the trained target detection model.
Optionally, the distinguishing unit includes an identifying subunit, a determining subunit and a parameter adjusting subunit;
the identification subunit is used for identifying the probability value corresponding to each sample in the target detection data set by using the initial detection model;
the determining subunit is used for determining a loss function corresponding to the positive and negative samples according to the sample label and the probability value corresponding to each sample in the target detection data set;
and the parameter adjusting subunit is used for adjusting the parameters corresponding to the fusion module in the initial detection model based on the loss functions corresponding to the positive and negative samples so as to complete the discrimination training of the positive and negative samples.
Optionally, the determining subunit is configured to input the sample label and the probability value corresponding to each sample in the target detection data set to a positive and negative sample loss function calculation formula, so as to determine a loss function corresponding to the positive and negative samples; wherein, the positive and negative sample loss function calculation formula is:
L_{pn} = -\frac{1}{N}\sum_{i=1}^{N}\left[ w_{+}\, y_{i}\log(p_{i}) + w_{-}\,(1-y_{i})\log(1-p_{i}) \right]
wherein N denotes the total number of samples; y_i denotes the numerical value corresponding to the sample label of the i-th sample, with y_i = 1 when the sample label is a positive sample and y_i = 0 when it is a negative sample; w_+ denotes the weight corresponding to positive samples; p_i denotes the probability value that the i-th sample belongs to a positive sample; and w_- denotes the weight corresponding to negative samples.
Optionally, for the training process of the visual question-answer model, the apparatus includes a question-answer training unit;
the screening unit is also used for screening out positive samples from the target detection data set by using the trained target detection model;
and the question-answer training unit is used for training the initial visual question-answer model by utilizing the coordinate information, the classification category and the semantic features corresponding to the positive sample so as to obtain the trained visual question-answer model.
Optionally, the fusion unit includes an extraction subunit, a coding subunit, and a feature fusion subunit;
the extraction subunit is used for extracting the image characteristics of the image to be analyzed by utilizing a target detection module of the target detection model; the image features comprise image features corresponding to the detection frames respectively;
the coding subunit is used for carrying out feature coding on the first text by utilizing a language coding module of the target detection model to obtain text features;
and the feature fusion subunit is used for fusing the image features and the text features by using a fusion module of the target detection model to obtain fusion features.
Optionally, the first text is a question text; the second text is an answer text matching the question text.
Optionally, the first text is a plurality of question texts, and the second text is an answer text matched with each question text;
correspondingly, the screening unit is used for performing parallel analysis on the image to be analyzed and the plurality of question texts by using the trained target detection model to obtain the target detection frames corresponding to the question texts.
Optionally, the fusion unit includes an extraction subunit, a coding subunit, and a feature fusion subunit;
the extraction subunit is used for extracting the image characteristics of the image to be analyzed; the image features comprise image features corresponding to the detection frames respectively;
the encoding subunit is used for carrying out feature encoding on the first text to obtain text features;
and the feature fusion subunit is used for fusing the image features and the text features to obtain fusion features.
For the description of the features in the embodiment corresponding to fig. 8, reference may be made to the related description of the embodiments corresponding to fig. 3 and fig. 5, which is not repeated here.
According to the technical scheme, the image to be analyzed and the first text are subjected to feature fusion processing to obtain fusion features, where the fusion features include the coordinate information of each detection frame. Each detection frame has its corresponding image information; the number of detection frames corresponding to the fusion features is often large, and they include not only detection frames strongly correlated with the first text but also detection frames weakly correlated with it. In order to prune the weakly correlated detection frames, target detection frames meeting the correlation requirement can be screened from the fusion features according to the correlation between the image to be analyzed and the first text. The coordinate information, classification categories, and semantic features corresponding to the target detection frames are input into the trained visual question answering model to obtain a second text matched with the first text, where the first text and the second text have a logical correspondence. In this technical scheme, performing feature fusion processing on the image to be analyzed and the first text allows the two to be analyzed comprehensively; pruning the detection frames based on correlation effectively reduces the interference caused by invalid detection frames, reduces the computational load of the visual question answering model, and improves the performance of the visual question answering task.
Fig. 9 is a schematic structural diagram of a terminal device according to an embodiment of the present application, including a display screen 91, an input interface 92, and a processor respectively connected to the display screen 91 and the input interface 92; since the processor is built in the terminal device, the processor is not shown in fig. 9.
An input interface 92 for receiving an image to be analyzed and a first text;
the processor is used for carrying out feature fusion processing on the image to be analyzed and the first text to obtain fusion features; the fusion characteristics comprise coordinate information of each detection frame; according to the correlation between the image to be analyzed and the first text, screening out a target detection frame meeting the correlation requirement from the fusion characteristics; inputting coordinate information, classification categories and semantic features corresponding to the target detection box into the trained visual question-answering model to obtain a second text matched with the first text; the first text and the second text have a logical corresponding relation;
and the display screen 91 is used for displaying the first text and the corresponding second text.
For the description of the features in the embodiment corresponding to fig. 9, reference may be made to the related description of the embodiments corresponding to fig. 3 and fig. 5, which is not repeated here.
The input interface 92 may be used to enable connection to an external device such as a USB flash drive. There may be a plurality of input interfaces; one input interface is illustrated in Fig. 9. In practical applications, a user may input the image to be analyzed and the first text into the terminal device through an input keyboard, or write the image to be analyzed and the first text to a USB flash drive and insert the drive into the input interface 92 of the terminal device. After acquiring the image to be analyzed and the first text, the terminal device may transmit them to the processor; after analyzing them, the processor obtains a second text matched with the first text, and the terminal device may then display the second text through the display screen 91.
It should be noted that the functional modules, such as the display 91, the input interface 92, and the processor, included in the terminal device in fig. 9 are merely examples, and in an actual application, the terminal device may also include more or less functional modules based on actual requirements, which is not limited thereto.
According to the technical scheme, the image to be analyzed and the first text are subjected to feature fusion processing to obtain fusion features, where the fusion features include the coordinate information of each detection frame. Each detection frame has its corresponding image information; the number of detection frames corresponding to the fusion features is often large, and they include not only detection frames strongly correlated with the first text but also detection frames weakly correlated with it. In order to prune the weakly correlated detection frames, target detection frames meeting the correlation requirement can be screened from the fusion features according to the correlation between the image to be analyzed and the first text. The coordinate information, classification categories, and semantic features corresponding to the target detection frames are input into the trained visual question answering model to obtain a second text matched with the first text, where the first text and the second text have a logical correspondence. In this technical scheme, performing feature fusion processing on the image to be analyzed and the first text allows the two to be analyzed comprehensively; pruning the detection frames based on correlation effectively reduces the interference caused by invalid detection frames, reduces the computational load of the visual question answering model, and improves the performance of the visual question answering task.
It is to be understood that, if the processing method of the visual question answering task in the above embodiments is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the present application may be substantially or partially embodied in the form of a software product, which is stored in a storage medium and executes all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrically erasable programmable ROM, a register, a hard disk, a removable magnetic disk, a CD-ROM, and a magnetic or optical disk.
Based on this, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the processing method of the visual question-answering task.
The method, the apparatus, the device and the computer-readable storage medium for processing the visual question answering task provided by the embodiments of the present application are described in detail above. The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the components and steps of the various examples have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The method, apparatus, device and computer-readable storage medium for processing a visual question-answering task provided by the present application are described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present application.

Claims (15)

1. A method for processing a visual question-answering task, comprising:
performing feature fusion processing on the image to be analyzed and the first text to obtain fusion features; the fusion features comprise coordinate information of each detection frame;
according to the correlation between the image to be analyzed and the first text, screening out a target detection box meeting the correlation requirement from the fusion characteristics;
inputting coordinate information, classification categories and semantic features corresponding to the target detection box into a trained visual question-answering model to obtain a second text matched with the first text; wherein the first text and the second text have a logical correspondence.
2. The method for processing the visual question-answering task according to claim 1, wherein the step of screening out the target detection boxes meeting the correlation requirement from the fusion features according to the correlation between the image to be analyzed and the first text comprises the following steps:
calculating the intersection ratio of each image detection box contained in the image characteristics of the image to be analyzed and the text detection box corresponding to the text characteristics of the first text;
and selecting a target detection frame with the intersection ratio larger than a preset threshold value from all the image detection frames.
3. The method for processing the visual question-answering task according to claim 1, wherein the step of screening out the target detection boxes meeting the correlation requirement from the fusion features according to the correlation between the image to be analyzed and the first text comprises the following steps:
screening out a target detection frame meeting the correlation requirement from the fusion characteristics by using a trained target detection model; the target detection model is obtained based on historical images and historical texts.
4. The method for processing the visual question-answering task according to claim 3, wherein aiming at the training process of the target detection model, the method comprises the following steps:
training an initial detection model by using a target detection data set to obtain a weight parameter corresponding to the initial detection model;
performing positive and negative sample discrimination training on the initial detection model based on the sample label corresponding to each sample in the target detection data set;
after the positive and negative sample discrimination training is finished, calculating a loss function of the initial detection model; the loss function comprises an initial loss function and loss functions corresponding to positive and negative samples;
and adjusting the respective initialization weights of the language coding module and the fusion module contained in the initial detection model and the weight parameters corresponding to the initial detection model according to the loss function of the initial detection model to obtain the trained target detection model.
5. The method for processing the visual question-answering task according to claim 4, wherein the performing positive and negative sample discrimination training on the initial detection model based on the sample labels corresponding to the samples in the target detection data set includes:
identifying probability values corresponding to all samples in the target detection data set by using the initial detection model;
determining a loss function corresponding to positive and negative samples according to the sample label and the probability value corresponding to each sample in the target detection data set;
and adjusting parameters corresponding to a fusion module in the initial detection model based on the loss function corresponding to the positive and negative samples so as to complete the discrimination training of the positive and negative samples.
6. The method for processing the visual question-answering task according to claim 5, wherein the determining a loss function corresponding to positive and negative samples according to the sample labels and the probability values corresponding to the samples in the target detection data set comprises:
inputting the sample label and the probability value corresponding to each sample in the target detection data set into a positive and negative sample loss function calculation formula to determine a loss function corresponding to the positive and negative samples; wherein, the positive and negative sample loss function calculation formula is:
L_{pn} = -\frac{1}{N}\sum_{i=1}^{N}\left[ w_{+}\, y_{i}\log(p_{i}) + w_{-}\,(1-y_{i})\log(1-p_{i}) \right]
wherein N denotes the total number of samples; y_i denotes the numerical value corresponding to the sample label of the i-th sample, with y_i = 1 when the sample label is a positive sample and y_i = 0 when it is a negative sample; w_+ denotes the weight corresponding to positive samples; p_i denotes the probability value that the i-th sample belongs to a positive sample; and w_- denotes the weight corresponding to negative samples.
7. The method for processing the visual question-answering task according to claim 4, wherein aiming at the training process of the visual question-answering model, the method comprises the following steps:
screening out a positive sample from the target detection data set by using a trained target detection model;
and training the initial visual question-answering model by utilizing the coordinate information, the classification category and the semantic features corresponding to the positive sample to obtain the trained visual question-answering model.
8. The method for processing the visual question-answering task according to claim 4, wherein the performing feature fusion processing on the image to be analyzed and the first text to obtain a fusion feature comprises:
extracting image features of the image to be analyzed by using a target detection module of the target detection model; the image features comprise image features corresponding to the detection frames respectively;
performing feature coding on the first text by using a language coding module of the target detection model to obtain text features;
and fusing the image features and the text features by utilizing a fusion module of the target detection model to obtain fusion features.
9. The method for processing a visual question-answering task according to any one of claims 1 to 8, wherein the first text is a question text; the second text is an answer text matched with the question text.
10. The method of claim 9, wherein the first text is a plurality of question texts, and the second text is an answer text matching each of the question texts;
correspondingly, the step of screening out the target detection box meeting the requirement of the correlation from the fusion features according to the correlation between the image to be analyzed and the first text comprises the following steps:
and carrying out parallel analysis on the image to be analyzed and the plurality of question texts by using the trained target detection model to obtain target detection frames corresponding to the question texts.
11. The method for processing the visual question-answering task according to claim 1, wherein the performing feature fusion processing on the image to be analyzed and the first text to obtain a fusion feature comprises:
extracting image characteristics of the image to be analyzed; the image features comprise image features corresponding to the detection frames respectively;
performing feature coding on the first text to obtain text features;
and fusing the image features and the text features to obtain fused features.
12. A processing device of a visual question-answering task is characterized by comprising a fusion unit, a screening unit and an obtaining unit;
the fusion unit is used for performing feature fusion processing on the image to be analyzed and the first text to obtain fusion features; the fusion features comprise coordinate information of each detection frame;
the screening unit is used for screening a target detection frame meeting the correlation requirement from the fusion characteristics according to the correlation between the image to be analyzed and the first text;
the obtaining unit is used for inputting the coordinate information, the classification category and the semantic features corresponding to the target detection box into a trained visual question-answering model so as to obtain a second text matched with the first text; wherein the first text and the second text have a logical correspondence.
13. The terminal equipment is characterized by comprising a display screen, an input interface and a processor which is respectively connected with the display screen and the input interface;
the input interface is used for receiving an image to be analyzed and a first text;
the processor is used for carrying out feature fusion processing on the image to be analyzed and the first text to obtain fusion features; the fusion features comprise coordinate information of each detection frame; according to the correlation between the image to be analyzed and the first text, screening out a target detection box meeting the correlation requirement from the fusion characteristics; inputting coordinate information, classification categories and semantic features corresponding to the target detection box into a trained visual question-answering model to obtain a second text matched with the first text; wherein the first text and the second text have a logical correspondence;
the display screen is used for displaying the first text and the second text corresponding to the first text.
14. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of the method for processing a visual question-answering task according to any one of claims 1 to 11.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, realizes the steps of the method for processing a visual question-answering task according to any one of claims 1 to 11.
CN202211068333.9A 2022-09-02 2022-09-02 Method, device, equipment and medium for processing visual question-answering task Active CN115129848B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211068333.9A CN115129848B (en) 2022-09-02 2022-09-02 Method, device, equipment and medium for processing visual question-answering task
PCT/CN2022/142512 WO2024045444A1 (en) 2022-09-02 2022-12-27 Processing method and apparatus for visual question answering task, and device and non-volatile readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211068333.9A CN115129848B (en) 2022-09-02 2022-09-02 Method, device, equipment and medium for processing visual question-answering task

Publications (2)

Publication Number Publication Date
CN115129848A true CN115129848A (en) 2022-09-30
CN115129848B CN115129848B (en) 2023-02-28

Family

ID=83387703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211068333.9A Active CN115129848B (en) 2022-09-02 2022-09-02 Method, device, equipment and medium for processing visual question-answering task

Country Status (2)

Country Link
CN (1) CN115129848B (en)
WO (1) WO2024045444A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115861995A (en) * 2023-02-08 2023-03-28 山东海量信息技术研究院 Visual question-answering method and device, electronic equipment and storage medium
CN116884003A (en) * 2023-07-18 2023-10-13 南京领行科技股份有限公司 Picture automatic labeling method and device, electronic equipment and storage medium
WO2024045444A1 (en) * 2022-09-02 2024-03-07 苏州浪潮智能科技有限公司 Processing method and apparatus for visual question answering task, and device and non-volatile readable storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117874706B (en) * 2024-03-12 2024-05-31 之江实验室 Multi-modal knowledge distillation learning method and device
CN117892140B (en) * 2024-03-15 2024-05-31 浪潮电子信息产业股份有限公司 Visual question and answer and model training method and device thereof, electronic equipment and storage medium
CN118194923A (en) * 2024-05-17 2024-06-14 北京大学 Method, device, equipment and computer readable medium for constructing large language model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228703A (en) * 2017-10-31 2018-06-29 北京市商汤科技开发有限公司 Image answering method, device, system and storage medium
CN111860653A (en) * 2020-07-22 2020-10-30 苏州浪潮智能科技有限公司 Visual question answering method and device, electronic equipment and storage medium
CN111994377A (en) * 2020-07-21 2020-11-27 浙江大华技术股份有限公司 Method and device for detecting packaging box process and computer equipment
CN112949630A (en) * 2021-03-01 2021-06-11 北京交通大学 Weak supervision target detection method based on frame classification screening
CN113435998A (en) * 2021-06-23 2021-09-24 平安科技(深圳)有限公司 Loan overdue prediction method and device, electronic equipment and storage medium
CN114840651A (en) * 2022-04-20 2022-08-02 南方科技大学 Visual question-answering training method and system and computer readable storage medium
CN114972944A (en) * 2022-06-16 2022-08-30 中国电信股份有限公司 Training method and device of visual question-answering model, question-answering method, medium and equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111782839B (en) * 2020-06-30 2023-08-22 北京百度网讯科技有限公司 Image question-answering method, device, computer equipment and medium
CN115129848B (en) * 2022-09-02 2023-02-28 苏州浪潮智能科技有限公司 Method, device, equipment and medium for processing visual question-answering task

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228703A (en) * 2017-10-31 2018-06-29 北京市商汤科技开发有限公司 Image answering method, device, system and storage medium
CN111994377A (en) * 2020-07-21 2020-11-27 浙江大华技术股份有限公司 Method and device for detecting packaging box process and computer equipment
CN111860653A (en) * 2020-07-22 2020-10-30 苏州浪潮智能科技有限公司 Visual question answering method and device, electronic equipment and storage medium
CN112949630A (en) * 2021-03-01 2021-06-11 北京交通大学 Weak supervision target detection method based on frame classification screening
CN113435998A (en) * 2021-06-23 2021-09-24 平安科技(深圳)有限公司 Loan overdue prediction method and device, electronic equipment and storage medium
CN114840651A (en) * 2022-04-20 2022-08-02 南方科技大学 Visual question-answering training method and system and computer readable storage medium
CN114972944A (en) * 2022-06-16 2022-08-30 中国电信股份有限公司 Training method and device of visual question-answering model, question-answering method, medium and equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024045444A1 (en) * 2022-09-02 2024-03-07 苏州浪潮智能科技有限公司 Processing method and apparatus for visual question answering task, and device and non-volatile readable storage medium
CN115861995A (en) * 2023-02-08 2023-03-28 山东海量信息技术研究院 Visual question-answering method and device, electronic equipment and storage medium
CN116884003A (en) * 2023-07-18 2023-10-13 南京领行科技股份有限公司 Picture automatic labeling method and device, electronic equipment and storage medium
CN116884003B (en) * 2023-07-18 2024-03-22 南京领行科技股份有限公司 Picture automatic labeling method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN115129848B (en) 2023-02-28
WO2024045444A1 (en) 2024-03-07

Similar Documents

Publication Publication Date Title
CN115129848B (en) Method, device, equipment and medium for processing visual question-answering task
CN110263324B (en) Text processing method, model training method and device
CN111294646A (en) Video processing method, device, equipment and storage medium
CN116012488A (en) Stylized image generation method, device, computer equipment and storage medium
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product
CN114339450B (en) Video comment generation method, system, device and storage medium
CN115455171B (en) Text video mutual inspection rope and model training method, device, equipment and medium
CN110263218B (en) Video description text generation method, device, equipment and medium
CN117521675A (en) Information processing method, device, equipment and storage medium based on large language model
CN115658955B (en) Cross-media retrieval and model training method, device, equipment and menu retrieval system
KR20190000587A (en) Computer program stored in computer-readable medium and user device having translation algorithm using by deep learning neural network circuit
CN115512005A (en) Data processing method and device
CN111597341A (en) Document level relation extraction method, device, equipment and storage medium
WO2024083121A1 (en) Data processing method and apparatus
WO2023173552A1 (en) Establishment method for target detection model, application method for target detection model, and device, apparatus and medium
CN113095072B (en) Text processing method and device
CN116310983A (en) Multi-mode emotion recognition method and device
CN116452706A (en) Image generation method and device for presentation file
CN112861474B (en) Information labeling method, device, equipment and computer readable storage medium
CN114548274A (en) Multi-modal interaction-based rumor detection method and system
CN116913278B (en) Voice processing method, device, equipment and storage medium
CN117746078A (en) Object detection method and system based on user-defined category
CN116993963A (en) Image processing method, device, equipment and storage medium
CN115761332A (en) Smoke and flame detection method, device, equipment and storage medium
CN115147931A (en) Person-object interaction detection method based on person paired decoding interaction of DETR (digital enhanced tomography)

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant