CN116894996A - Training of visual question-answering model and visual question-answering task processing method and device - Google Patents

Training of visual question-answering model and visual question-answering task processing method and device

Info

Publication number
CN116894996A
Authority
CN
China
Prior art keywords
answer
model
initial
data
training data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310833232.4A
Other languages
Chinese (zh)
Inventor
王昊
杨明川
刘振华
李伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Technology Innovation Center
China Telecom Corp Ltd
Original Assignee
China Telecom Technology Innovation Center
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Technology Innovation Center, China Telecom Corp Ltd filed Critical China Telecom Technology Innovation Center
Priority to CN202310833232.4A priority Critical patent/CN116894996A/en
Publication of CN116894996A publication Critical patent/CN116894996A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G06F 18/256 Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/809 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V 10/811 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The disclosure relates to the technical field of machine learning, and in particular to a training method and device for a visual question-answering model, a visual question-answering task processing method and device, a computer readable storage medium, and an electronic device. The training method of the visual question-answering model includes: acquiring initial training data, where the initial training data includes image-text input data and true value answers corresponding to the image-text input data, and the image-text input data includes image feature data and text feature data; inputting the image-text input data into a first initial model to obtain a first reference answer; screening the initial training data based on the first reference answer and the true value answer to obtain first target training data; and updating a second initial model with the first reference answer of the first target training data and the corresponding true value answer to obtain a visual question-answering model, where the model framework of the first initial model is consistent with that of the second initial model. The technical solution of the embodiments of the disclosure improves the processing accuracy of the obtained visual question-answering model.

Description

Training of visual question-answering model and visual question-answering task processing method and device
Technical Field
The disclosure relates to the technical field of machine learning, in particular to a training method and device of a visual question-answering model, a visual question-answering task processing method and device, a computer readable storage medium and electronic equipment.
Background
Visual question-answering models have been developed in the multi-modal learning field, but current visual question-answering methods all require accurate data labels to construct a complete data set.
However, constructing a labeled data set for a visual question-answering model requires substantial labor and time costs, and the labeled data inevitably contains noise, so the processing accuracy of the trained visual question-answering model is low.
Disclosure of Invention
The present disclosure aims to provide a training method of a visual question-answering model, a training device of the visual question-answering model, a visual question-answering task processing method, a visual question-answering task processing device, a computer readable medium and an electronic device, so that processing accuracy of obtaining the visual question-answering model is improved at least to some extent.
According to a first aspect of the present disclosure, there is provided a training method of a visual question-answering model, including: acquiring initial training data, wherein the initial training data comprises image-text input data and true value answers corresponding to the image-text input data, and the image-text input data comprises image feature data and text feature data; inputting the image-text input data into a first initial model to obtain a first reference answer; screening the initial training data based on the first reference answer and the true value answer to obtain first target training data; updating the second initial model by using a first reference answer and a corresponding true value answer of the first target training data to obtain a visual question-answer model; the model framework of the first initial model is consistent with that of the second initial model.
According to a second aspect of the present disclosure, there is provided a training apparatus of a visual question-answering model, comprising: the data acquisition module is used for acquiring initial training data, wherein the initial training data comprises image-text input data and true value answers corresponding to the image-text input data, and the image-text input data comprises image feature data and text feature data; the data processing module is used for inputting the image-text input data into the first initial model to obtain a first reference answer; the data screening module is used for screening the initial training data based on the first reference answer and the true value answer to obtain first target training data; the model updating module is used for updating the second initial model by using the first reference answer of the first target training data and the corresponding true value answer to obtain a visual question-answer model; the model framework of the first initial model is consistent with that of the second initial model.
According to a third aspect of the present disclosure, there is provided a visual question-answering task processing method, including: acquiring reference image features corresponding to an image to be questioned and corresponding reference text features; inputting the reference image features and the reference text features into a visual question-answering model to obtain a target answer; the visual question-answering model can be obtained according to a training method of the visual question-answering model.
According to a fourth aspect of the present disclosure, there is provided a visual question-answering task processing device including: the feature acquisition module is used for acquiring the reference image features corresponding to the image to be questioned and the corresponding reference text features; the task processing module is used for inputting the reference image features and the reference text features into the visual question-answering model to obtain a target answer; the visual question-answering model is obtained according to a training method of the visual question-answering model.
According to a fifth aspect of the present disclosure, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements the method described above.
According to a sixth aspect of the present disclosure, there is provided an electronic apparatus, comprising: one or more processors; and a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the methods described above.
According to the training method of the visual question-answering model, on one hand, the first initial model is used for screening the initial training data to obtain the first target training data, so that the accuracy of the training data of the visual question-answering model is improved, and the accuracy of the obtained visual question-answering model is higher; on the other hand, parameters of the second initial model are updated through the first reference answer output by the first initial model to obtain a visual question-answer model, so that the anti-noise capability of the obtained visual question-answer model is further improved, and the accuracy of the visual question-answer model is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort. In the drawings:
FIG. 1 illustrates a schematic diagram of an exemplary system architecture to which embodiments of the present disclosure may be applied;
FIG. 2 schematically illustrates a flow chart of a method of training a visual question-answering model in an exemplary embodiment of the present disclosure;
FIG. 3 schematically illustrates a visual question-answering model architecture diagram in an exemplary embodiment of the present disclosure;
FIG. 4 schematically illustrates a partition diagram of an initial sub-model in an exemplary embodiment of the present disclosure;
FIG. 5 schematically illustrates a flowchart of another method of training a visual question-answering model in an exemplary embodiment of the present disclosure;
FIG. 6 schematically illustrates a data flow diagram of a training method for a visual question-answering model in an exemplary embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow chart of a visual question-answering task processing method in an exemplary embodiment of the present disclosure;
FIG. 8 schematically illustrates a data flow diagram of a visual question-answering task processing method in an exemplary embodiment of the present disclosure;
fig. 9 schematically illustrates a composition diagram of a visual question-answering task processing device in an exemplary embodiment of the present disclosure;
FIG. 10 schematically illustrates a composition diagram of a training apparatus of a visual question-answering model in an exemplary embodiment of the present disclosure;
fig. 11 shows a schematic diagram of an electronic device to which embodiments of the present disclosure may be applied.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
Visual question answering is a practical machine learning task that requires an AI model to output an answer to a question about an image. The challenge of this task lies in its multitasking nature and openness, as it involves simultaneously solving multiple research problems in computer vision and natural language understanding. Progress on this task will enable multi-modal machine learning to be used in a wide variety of applications, such as aiding blind and visually impaired people or communicating with robots, enhancing the user's visual experience with external knowledge.
Existing visual question-answering systems have the following problem: without extensive initial training data consisting of high-quality triads of visual pictures, questions, and answers, it is difficult to train an effective visual question-answering model with stable performance.
However, preparing a complete dataset is time consuming and burdensome. At present, visual question-answering research mainly focuses on the design of complex models rather than on the construction of the training data required in real scenarios. Therefore, how to train a model robustly under lower-quality labels is a problem that needs to be solved.
Based on the above drawbacks, the present disclosure provides a training method of a visual question-answering model, and fig. 1 is a schematic diagram illustrating a system architecture in which the training method of the visual question-answering model may be implemented, where the system architecture 100 may include a terminal 110 and a server 120. The terminal 110 may be a terminal device such as a smart phone, a tablet computer, a desktop computer, a notebook computer, etc., and the server 120 generally refers to a background system that provides a visual question-answering related service in the present exemplary embodiment, and may be a server or a cluster formed by multiple servers. The terminal 110 and the server 120 may form a connection through a wired or wireless communication link for data interaction.
In one embodiment, the training method of the visual question-answering model described above may be performed by the terminal 110. For example, a user obtains initial training data using the terminal 110, where the initial training data includes image-text input data and the true value answers corresponding to the image-text input data, and the image-text input data includes image feature data and text feature data. The terminal 110 first inputs the image-text input data into a first initial model to obtain a first reference answer, then screens the initial training data based on the first reference answer and the true value answer to obtain first target training data, and finally updates a second initial model with the first reference answer of the first target training data and the corresponding true value answer to obtain a visual question-answering model; the model framework of the first initial model is consistent with that of the second initial model.
In one embodiment, the training method of the visual question-answering model described above may be performed by the server 120. For example, the user obtains initial training data using the terminal 110, where the initial training data includes image-text input data and the true value answer corresponding to the image-text input data, and the image-text input data includes image feature data and text feature data. The terminal 110 uploads the initial training data to the server 120; the server 120 inputs the image-text input data into the first initial model to obtain a first reference answer, screens the initial training data based on the first reference answer and the true value answer to obtain first target training data, updates the second initial model with the first reference answer of the first target training data and the corresponding true value answer to obtain a visual question-answering model, and then returns the visual question-answering model to the terminal 110.
As can be seen from the above, the execution subject of the training method of the visual question-answering model in the present exemplary embodiment may be the terminal 110 or the server 120 described above, which is not limited by the present disclosure.
The training method of the visual question-answering model in the present exemplary embodiment will be described below with reference to fig. 2, which shows an exemplary flow of the training method of the visual question-answering model, and may include steps S210 to S240.
Referring to fig. 2, in step S210, initial training data is acquired, where the initial training data includes image-text input data and a true value answer corresponding to the image-text input data, and the image-text input data includes image feature data and text feature data.
In an example embodiment of the present disclosure, the processor may first obtain initial training data, where the initial training data includes image-text input data and a true value answer corresponding to the image-text input data, and the image-text input data may include image feature data and text feature data.
Specifically, the processor may first obtain an initial image, an initial question text, and a true value answer corresponding to the initial question text, and then perform feature extraction on the initial image and the initial question text to obtain the image feature data and the text feature data.
In one example embodiment, the image feature data may be obtained by performing feature extraction on the initial image using Faster R-CNN (a region-based convolutional neural network with a region proposal network). In the present exemplary embodiment, the bottom-up attention model is implemented with Faster R-CNN, and overlapping regions of interest are permitted via a set threshold, which enables a more effective understanding of the image content and yields more accurate image feature data.
In one example embodiment, feature extraction of the initial question text may be implemented with a text feature extraction model, which may include two LSTMs (Long Short-Term Memory networks), including a top-down attention LSTM that helps re-weight the image features. After the features are extracted, a GRU (Gated Recurrent Unit) module performs sequence encoding.
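As a non-authoritative illustration of this pipeline, the following is a minimal PyTorch sketch of the question-encoding stage, assuming the Faster R-CNN region features have been extracted offline. The dimensions (36 regions of 2048 dimensions, 300-d embeddings, 512-d hidden state) and all names are illustrative assumptions, not requirements of the disclosure.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Encodes a tokenized question into a fixed-size vector with a GRU."""
    def __init__(self, vocab_size: int, embed_dim: int = 300, hidden_dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, question_tokens: torch.Tensor) -> torch.Tensor:
        # question_tokens: (batch, seq_len) integer token ids
        embedded = self.embedding(question_tokens)   # (batch, seq_len, embed_dim)
        _, h_n = self.gru(embedded)                  # h_n: (1, batch, hidden_dim)
        return h_n.squeeze(0)                        # (batch, hidden_dim)

# Region features are assumed to be precomputed offline with Faster R-CNN
# (e.g. k = 36 regions, 2048-d each), as is common for bottom-up attention.
image_features = torch.randn(8, 36, 2048)            # (batch, k, feat_dim)
question = torch.randint(1, 10000, (8, 14))          # (batch, seq_len)
q_vec = QuestionEncoder(vocab_size=10000)(question)  # (batch, 512)
```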
In step S220, the image-text input data is input into the first initial model to obtain a first reference answer.
In an exemplary embodiment of the present disclosure, after the image-text input data is obtained, it may be input into the first initial model to obtain the first reference answer.
The frame structures of the first initial model and the second initial model are the same, and the model parameters of the two models may be the same or different. Preferably, the parameters of the first initial model differ from those of the second initial model, so that during training the second initial model can absorb the noise resistance of the first initial model, making the trained visual question-answering model more accurate.
In this exemplary embodiment, the first initial model and the second initial model may adopt the UpDn (Bottom-Up and Top-Down attention) model structure with a GRU, and the initial training data is $D = \{(I_i, Q_i, a_i)\}_{i=1}^{N}$, which includes $N$ triples of image $I_i$, question $Q_i$, and answer $a_i$.
Specifically, referring to fig. 3, the first initial model and the second initial model may include a top-down attention LSTM module 310, an attention weighting module 320, a language LSTM module 330, and a loss function layer 340.
Specifically, the input of the t-th time step of the top-down attention LSTM module is:
$x_t^1 = \left[ h_{t-1}^2,\; \bar{v},\; W_e \Pi_t \right]$
where $h_{t-1}^2$ is the output of the language LSTM module at the previous time step, $\bar{v} = \frac{1}{k}\sum_{i=1}^{k} v_i$ is the mean of the image feature data, $W_e$ is the embedding matrix of the question-answering text, and $\Pi_t$ is the one-hot encoding of the word input at the current step; these respectively provide the top-down attention LSTM module with the context of the current language model, a summary of the image content, and the current word.
A weight may further be computed for each feature $v_i$:
$a_{i,t} = w_a^{\top} \tanh\left( W_{va} v_i + W_{ha} h_t^1 \right)$
and normalized by a softmax function to obtain: $\alpha_t = \mathrm{softmax}(a_t)$
finally obtaining the weighted image feature:
$\hat{v}_t = \sum_{i=1}^{k} \alpha_{i,t} v_i$
On the other hand, the probability distribution is calculated through the language LSTM network; specifically, the weighted image feature $\hat{v}_t$ is concatenated with the attention LSTM output to form the input of the language LSTM module:
$x_t^2 = \left[ \hat{v}_t,\; h_t^1 \right]$
At this time, the probability distribution of the predicted word (i.e., the reference answer) at time t is:
$p\left(y_t \mid y_{1:t-1}\right) = \mathrm{softmax}\left( W_p h_t^2 + b_p \right)$
The loss functions of the first initial model and the second initial model are the binary cross-entropy:
$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$
where $y_i$ is the label of sample $i$ (1 for the positive class, 0 for the negative class), and $p_i$ is the probability that sample $i$ is predicted to be positive.
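As a non-authoritative illustration, the following is a minimal PyTorch sketch of the attention-weighting step defined by the equations above. The module names ($W_{va}$, $W_{ha}$, $w_a$) follow the notation here, while the feature and hidden dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Computes a_{i,t} = w_a^T tanh(W_va v_i + W_ha h_t^1) and the
    attention-weighted image feature v_hat_t = sum_i alpha_{i,t} v_i."""
    def __init__(self, feat_dim: int = 2048, hidden_dim: int = 512, attn_dim: int = 512):
        super().__init__()
        self.W_va = nn.Linear(feat_dim, attn_dim, bias=False)
        self.W_ha = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.w_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, v: torch.Tensor, h1: torch.Tensor) -> torch.Tensor:
        # v: (batch, k, feat_dim) region features; h1: (batch, hidden_dim)
        scores = self.w_a(torch.tanh(self.W_va(v) + self.W_ha(h1).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)   # alpha_t = softmax(a_t), over the k regions
        return (alpha * v).sum(dim=1)      # v_hat_t: (batch, feat_dim)
```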
It should be noted that the above description about the specific frameworks of the first initial model and the second initial model is exemplary, and the specific frameworks of the first initial model and the second initial model are not defined in detail in this disclosure.
In step S230, the initial training data is filtered based on the first reference answer and the true answer to obtain first target training data.
In the present exemplary embodiment, the above-described screening may include step S310 and step S320.
In step S310, a loss value of the first reference answer and the true answer is calculated.
In this exemplary embodiment, the loss value between the first reference answer and the true value answer may be determined using the loss function described above; alternatively, the similarity between the first reference answer and the true value answer may be determined first, and the loss value obtained by subtracting the similarity from 1. The calculation of the loss value may be customized according to user requirements, which is not specifically limited in this exemplary embodiment.
In step S320, initial training data corresponding to the first reference answer whose loss value is smaller than the preset threshold is used as the first target training data.
After the loss value corresponding to each first reference answer is obtained, a preset threshold may be determined, where the preset threshold may be 0.2, 0.3, etc., or may be customized according to user requirements. The loss values are then respectively compared with the preset threshold, and the initial training data corresponding to first reference answers whose loss values are smaller than the preset threshold are used as the first target training data.
Screening the initial training data with the loss values to obtain the first target training data can reduce noise during training and improve the accuracy of the first target training data, so that the trained visual question-answering model achieves higher processing accuracy.
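The following is a minimal sketch of this small-loss screening step, assuming a model that maps (image features, question tokens) to answer logits and soft answer targets scored with binary cross-entropy; the function signature and the threshold value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def select_small_loss_samples(model, images, questions, answers, threshold=0.3):
    """Keeps only the samples whose per-sample loss against the true-value
    answer falls below the preset threshold (e.g. 0.2 or 0.3)."""
    model.eval()
    with torch.no_grad():
        logits = model(images, questions)                    # (batch, num_answers)
        losses = F.binary_cross_entropy_with_logits(
            logits, answers, reduction="none").mean(dim=1)   # per-sample loss
    keep = losses < threshold
    return images[keep], questions[keep], answers[keep]
```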
In an example embodiment of the present disclosure, when the first target training data is acquired, steps S410 to S430 may be further included.
In step S410, the initial training data is divided into a plurality of sets of initial sub-data;
in this exemplary embodiment, referring to fig. 4, the initial training data may be first divided into a plurality of sets of initial sub-data, where the number of image feature data and text feature data in each set of initial sub-data may be one or more, for example, 3, 5, etc., and the number of initial training data in each set of initial sub-data may be the same or different, which is not specifically limited in this exemplary embodiment.
In step S420, screening each initial sub-data based on the first reference answer and the true value answer to obtain target sub-data;
after the grouping is completed, each initial sub-data is screened by using the first reference answer and the true value answer to obtain a plurality of target sub-data, and a specific screening process may refer to steps S310 to S320, which is not specifically limited in this exemplary embodiment.
In step S430, a plurality of sets of target sub-data are set as first target training data.
After the plurality of target sub-data are obtained, they may be used as the first target training data. Dividing the initial training data into a plurality of groups allows the groups to be screened in parallel, which increases the screening rate; meanwhile, when the processor is busy, screening the initial sub-data sequentially reduces the processing pressure on the processor, so that the initial training data can be screened even under a narrow bandwidth.
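A minimal sketch of the grouping step might look as follows, where group_size is an illustrative assumption (the disclosure allows groups of different sizes):

```python
def split_into_groups(initial_training_data, group_size=3):
    """Divides the initial training data into multiple groups of initial
    sub-data, e.g. groups of 3 samples each (the last group may be smaller)."""
    return [initial_training_data[i:i + group_size]
            for i in range(0, len(initial_training_data), group_size)]
```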
In step S240, the second initial model is updated with the first reference answer and the corresponding true answer of the first target training data to obtain a visual question-answer model.
In this example embodiment, after the first target training data is obtained, the visual question-answering model may be obtained by updating the second initial model with the first reference answer in the first target training data and the true value answer corresponding to the first reference answer.
Specifically, a first gradient value of back propagation may be determined based on the first reference answer in the first target training data and the true value answer corresponding to the first reference answer, and the parameters of the second initial model are then updated with the calculated first gradient value to obtain the visual question-answering model.
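A minimal sketch of this update step, under the same assumptions as the screening sketch above, could look like:

```python
import torch

def update_peer_model(peer_model, optimizer, images, questions, answers, loss_fn):
    """Backpropagates the loss on the filtered target training data through
    the peer (second) model and applies one optimizer step."""
    peer_model.train()
    optimizer.zero_grad()
    logits = peer_model(images, questions)
    loss = loss_fn(logits, answers)
    loss.backward()          # gradient value of back propagation
    optimizer.step()         # update the second initial model's parameters
    return loss.item()
```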
In an example embodiment of the present disclosure, referring to fig. 5, the training method of the visual question-answering model may further include steps S250 to S270.
In step S250, the image-text input data is input into the second initial model to obtain a second reference answer.
In an exemplary embodiment of the present disclosure, after the image-text input data is obtained, it may be input into the second initial model to obtain the second reference answer.
As noted above, the frame structures of the first initial model and the second initial model are the same, and the model parameters of the two models may be the same or different. Preferably, the parameters of the first initial model differ from those of the second initial model, so that during training each model can absorb the other's resistance to noise, making the trained visual question-answering model more accurate.
In step S260, the initial training data is filtered based on the second reference answer and the true answer to obtain second target training data.
In the present exemplary embodiment, the above-described screening may include step S510 and step S520.
In step S510, a loss value of the second reference answer and the true answer is calculated.
In this exemplary embodiment, the loss value between the second reference answer and the true value answer may be determined using the loss function described above; alternatively, the similarity between the second reference answer and the true value answer may be determined first, and the loss value obtained by subtracting the similarity from 1. The calculation of the loss value may be customized according to user requirements, which is not specifically limited in this exemplary embodiment.
In step S520, the initial training data corresponding to the second reference answer with the loss value smaller than the preset threshold is used as the second target training data.
After the loss value corresponding to each second reference answer is obtained, a preset threshold may be determined, where the preset threshold may be 0.2, 0.3, etc., or may be customized according to user requirements. The loss values are then respectively compared with the preset threshold, and the initial training data corresponding to second reference answers whose loss values are smaller than the preset threshold are used as the second target training data.
Screening the initial training data with the loss values to obtain the second target training data can reduce noise during training and improve the accuracy of the second target training data, so that the trained visual question-answering model achieves higher processing accuracy.
In an example embodiment of the present disclosure, when the second target training data is acquired, steps S610 to S630 may be further included.
In step S610, the initial training data is divided into a plurality of sets of initial sub-data;
in this exemplary embodiment, referring to fig. 5, the initial training data may be first divided into a plurality of sets of initial sub-data, where the number of image feature data and text feature data in each set of initial sub-data may be one or more, for example, 3, 5, etc., and the number of initial training data in each set of initial sub-data may be the same or different, which is not specifically limited in this exemplary embodiment.
In step S620, screening the initial sub-data based on the second reference answer and the true value answer to obtain target sub-data;
after the grouping is completed, each initial sub-data is screened by using the second reference answer and the true value answer to obtain a plurality of target sub-data, and the specific screening process may refer to steps S510 to S520, which is not specifically limited in this exemplary embodiment.
In step S630, the plurality of sets of target sub-data are regarded as second target training data.
After the plurality of target sub-data are obtained, they may be used as the second target training data. Dividing the initial training data into a plurality of groups allows the groups to be screened in parallel, which increases the screening rate; meanwhile, when the processor is busy, screening the initial sub-data sequentially reduces the processing pressure on the processor, so that the initial training data can be screened even under a narrow bandwidth.
In step S270, the first initial model is updated with the second reference answer and the corresponding true answer of the second target training data to obtain a visual question-answer model.
In this example embodiment, after the second target training data is obtained, the visual question-answering model may be obtained by updating the first initial model with the second reference answer in the second target training data and the true value answer corresponding to the second reference answer.
Specifically, a second gradient value of back propagation may be determined based on the second reference answer in the second target training data and a true value answer corresponding to the second reference answer, and then the parameters in the first initial model are updated by using the calculated second gradient value to obtain the visual question-answer model.
In this exemplary embodiment, referring to fig. 6, the output of the first initial model together with the true value answer may be used to update the second initial model, and the output of the second initial model together with the true value answer may be used to update the first initial model, yielding two visual question-answering models, either of which can complete the visual question-answering task.
For example, assume each initial sub-data group includes three initial training data. The initial sub-data is fed into the first initial model, the loss values of the three initial training data are calculated respectively, two first target sub-data are selected, the back-propagation gradient values corresponding to the first target sub-data are calculated, and the second initial model is trained with these gradient values. Similarly, the initial sub-data is fed into the second initial model, the loss values of the three initial training data are calculated respectively, two second target sub-data are selected, the back-propagation gradient values corresponding to the second target sub-data are calculated, and the first initial model is trained with these gradient values. After training is completed on all initial sub-data, two visual question-answering models are obtained, either of which can accurately complete visual question-answering tasks.
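Putting the pieces together, the following is a minimal co-teaching-style sketch of one training epoch, reusing the select_small_loss_samples and update_peer_model helpers sketched above. How the gradients of the selected samples are computed is an interpretation of this embodiment: here, each peer computes its own loss on the samples the other model selected.

```python
import torch
import torch.nn.functional as F

def co_teaching_epoch(model_a, model_b, opt_a, opt_b, loader, threshold=0.3):
    """One epoch of mutual screening: each model selects the small-loss
    samples in a group, and the peer model is trained on that selection."""
    loss_fn = lambda logits, ans: F.binary_cross_entropy_with_logits(logits, ans)
    for images, questions, answers in loader:
        # model A screens, model B learns on A's selection
        imgs, qs, ans = select_small_loss_samples(model_a, images, questions, answers, threshold)
        if len(imgs) > 0:
            update_peer_model(model_b, opt_b, imgs, qs, ans, loss_fn)
        # model B screens, model A learns on B's selection
        imgs, qs, ans = select_small_loss_samples(model_b, images, questions, answers, threshold)
        if len(imgs) > 0:
            update_peer_model(model_a, opt_a, imgs, qs, ans, loss_fn)
```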
In summary, in this exemplary embodiment, on the one hand, the first initial model is used to screen the initial training data to obtain the first target training data, which improves the accuracy of the training data, so that the obtained visual question-answering model is more accurate; on the other hand, the parameters of the second initial model are updated with the first reference answers output by the first initial model to obtain a visual question-answering model, which further improves the anti-noise capability and thus the accuracy of the visual question-answering model. In addition, dividing the initial training data into multiple groups for screening allows the groups to be screened in parallel, increasing the screening rate; meanwhile, when the processor is busy, screening the initial sub-data sequentially reduces the processing pressure on the processor, so that the initial training data can be screened even under a narrow bandwidth.
Further, referring to fig. 7, the disclosure also provides a visual question-answering task processing method, which may be executed by the terminal in fig. 1 or by the server in fig. 1. For example, the user may use the terminal 110 to obtain the reference image feature corresponding to the image to be questioned and the corresponding reference text feature, and load a visual question-answering model trained by the training method of the visual question-answering model described above. The reference image feature and the reference text feature are then input into the visual question-answering model to obtain a target answer, completing the processing of the visual question-answering task.
For another example, after the user obtains the reference image feature corresponding to the image to be questioned and the corresponding reference text feature using the terminal 110, they are uploaded to the server 120; the server 120 loads a visual question-answering model trained by the training method of the visual question-answering model described above, and then uses the visual question-answering model to process the reference image feature and the corresponding reference text feature to obtain the target answer.
The visual question-answering task processing method may specifically include steps S710 to S720.
In step S710, a reference image feature corresponding to the image to be questioned and a corresponding reference text feature are obtained.
In this exemplary embodiment, referring to fig. 8, after obtaining an image to be questioned and a corresponding question, the image to be questioned may be input to the image feature extraction module to obtain a reference image feature, and the question may be input to the text feature extraction module to obtain a reference text feature.
The specific flow of image feature extraction and text feature extraction may refer to the training method of the visual question-answering model, and will not be described herein.
In step S720, the reference image features and the reference text features are input to the visual question-answer model to obtain a target answer.
After the reference image features and the reference text features are obtained, the reference image features and the reference text features may be input into the visual question-answering model to obtain a target answer.
The specific structure of the visual question-answering model may refer to the training method of the visual question-answering model and will not be described herein.
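For completeness, a minimal inference sketch under the same assumptions as above (a classification-style VQA head over a fixed answer vocabulary) might look like:

```python
import torch

@torch.no_grad()
def answer_question(vqa_model, image_features, question_tokens, answer_vocab):
    """Runs one forward pass and maps the highest-scoring class back to an
    answer string from the answer vocabulary."""
    vqa_model.eval()
    logits = vqa_model(image_features, question_tokens)   # (1, num_answers)
    return answer_vocab[logits.argmax(dim=-1).item()]
```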
It is noted that the above-described figures are merely schematic illustrations of processes involved in a method according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
Further, referring to fig. 9, in this exemplary embodiment, a training apparatus 900 for a visual question-answering model is further provided, which includes a data acquisition module 910, a data processing module 920, a data screening module 930, and a model updating module 940. Wherein:
the data acquisition module 910 may be configured to acquire initial training data, where the initial training data includes graphic input data and a true value answer corresponding to the graphic input data, and the graphic input data includes image feature data and text feature data; the data processing module 920 may be configured to input the teletext input data to the first initial model to obtain a first reference answer; the data filtering module 930 may be configured to filter the initial training data based on the first reference answer and the true answer to obtain first target training data; model update module 940 may update the second initial model with the first reference answer and the corresponding true answer for the first target training data to obtain a visual question-answer model. The model framework of the first initial model is consistent with that of the second initial model.
In an example embodiment, the data acquisition module 910 may be configured to acquire an initial image, an initial question text, and a true value answer corresponding to the initial question text, and to perform feature extraction on the initial image and the initial question text to obtain the image feature data and the text feature data.
In an example embodiment, the data filtering module 930 may be configured to calculate the penalty value for the first reference answer and the true value answer; and taking initial training data corresponding to the first reference answer with the loss value smaller than the preset threshold value as first target training data.
In another example embodiment, the data filtering module 930 may be configured to divide the initial training data into multiple sets of initial sub-data; screening all the initial sub-data based on the first reference answer and the true value answer to obtain target sub-data; and taking the multiple groups of target sub-data as first target training data.
In an example embodiment, the model updating module 940 may be configured to update the second initial model with the first reference answer in each target sub-data and the corresponding true value answer to obtain the visual question-answering model.
Further, referring to fig. 10, in this exemplary embodiment, a visual question-answering task processing device 1000 is further provided, which includes a feature obtaining module 1010 and a task processing module 1020.
Wherein:
the feature acquisition module 1010 may be configured to acquire a reference image feature corresponding to an image to be questioned and a corresponding reference text feature; the task processing module 1020 may be configured to input the reference image feature and the reference text feature to the visual question-answering model to obtain a target answer; wherein the visual question model is obtained according to the training method of the visual question model of any one of claims 1-6.
The specific details of each module in the above apparatus are already described in the method section, and the details that are not disclosed can be referred to the embodiment of the method section, so that they will not be described in detail.
The exemplary embodiments of the present disclosure also provide an electronic device for performing the training method of the visual question-answer model, which may be the terminal 110 or the server 120. In general, the electronic device may include a processor and a memory for storing executable instructions of the processor, the processor configured to perform the training method of the visual question-answer model described above via execution of the executable instructions.
The configuration of the electronic device will be exemplarily described below using the mobile terminal 1100 of fig. 11 as an example. It will be appreciated by those skilled in the art that the configuration of fig. 11 can also be applied to stationary type devices in addition to components specifically for mobile purposes.
As shown in fig. 11, the mobile terminal 1100 may specifically include: processor 1101, memory 1102, bus 1103, mobile communication module 1104, antenna 1, wireless communication module 1105, antenna 2, display 1106, camera module 1107, audio module 1108, power module 1109, and sensor module 1110.
The processor 1101 may include one or more processing units, such as: an AP (Application Processor), a modem processor, a GPU (Graphics Processing Unit), an ISP (Image Signal Processor), a controller, an encoder, a decoder, a DSP (Digital Signal Processor), a baseband processor, and/or an NPU (Neural-Network Processing Unit), and the like. The training method of the visual question-answering model in the present exemplary embodiment may be performed by the AP, GPU, or DSP, and may be performed by the NPU when the method involves neural-network-related processing.
The processor 1101 may form a connection with the memory 1102 or other components through a bus 1103.
Memory 1102 may be used to store computer-executable program code that includes instructions. The processor 1101 performs various functional applications and data processing of the mobile terminal 1100 by executing instructions stored in the memory 1102. Memory 1102 may also store application data, such as files that store images, videos, and the like.
The communication functions of the mobile terminal 1100 may be implemented by the mobile communication module 1104, the antenna 1, the wireless communication module 1105, the antenna 2, a modem processor, a baseband processor, and the like. The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. The mobile communication module 1104 may provide a mobile communication solution of 2G, 3G, 4G, 5G, etc. applied on the mobile terminal 1100. The wireless communication module 1105 may provide a wireless communication solution for wireless local area network, bluetooth, near field communication, etc. that is applied to the mobile terminal 1100.
The display 1106 is used to implement display functions, such as displaying user interfaces, images, and video. The camera module 1107 is used to implement capturing functions, such as capturing images and video. The audio module 1108 is used to implement audio functions, such as playing audio and collecting speech. The power module 1109 is used to implement power management functions, such as charging the battery, powering the device, and monitoring the battery status. The sensor module 1110 may include a depth sensor 11101, a pressure sensor 11102, a gyro sensor 11103, a barometric pressure sensor 11104, etc., to implement corresponding sensing functions.
Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, various aspects of the disclosure may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module," or "system."
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification. In some possible implementations, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the disclosure as described in the "exemplary methods" section of this specification, when the program product is run on the terminal device.
It should be noted that the computer readable medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Furthermore, the program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (11)

1. A method for training a visual question-answering model, comprising:
acquiring initial training data, wherein the initial training data comprises image-text input data and true value answers corresponding to the image-text input data, and the image-text input data comprises image feature data and text feature data;
inputting the image-text input data into a first initial model to obtain a first reference answer;
screening the initial training data based on a first reference answer and the true value answer to obtain first target training data;
updating a second initial model by using a first reference answer of the first target training data and the corresponding true value answer to obtain the visual question-answer model;
the model framework of the first initial model is consistent with that of the second initial model.
2. The method according to claim 1, wherein the method further comprises:
inputting the image-text input data into the second initial model to obtain a second reference answer;
screening the initial training data based on a second reference answer and the true value answer to obtain second target training data;
and updating the first initial model by using a second reference answer of the second target training data and the corresponding true value answer to obtain the visual question-answer model.
3. The method of claim 1, wherein the screening the initial training data based on the first reference answer and the true answer to obtain first target training data comprises:
calculating a loss value of the first reference answer and the true value answer;
and taking initial training data corresponding to a first reference answer with the loss value smaller than a preset threshold value as the first target training data.
4. The method of claim 1, wherein the screening the initial training data based on the first reference answer and the true answer to obtain first target training data comprises:
dividing the initial training data into a plurality of groups of initial sub-data;
screening each initial sub-data based on the first reference answer and the true value answer to obtain target sub-data;
and taking a plurality of groups of target sub-data as the first target training data.
5. The method of claim 4, wherein updating the second initial model with the first reference answer and the corresponding true answer of the first target training data to obtain the visual question-answer model comprises:
and updating the second initial model by using a first reference answer in each target sub-data and the corresponding true value answer to obtain the visual question-answer model.
6. The method of claim 1, wherein the acquiring initial training data comprises:
acquiring an initial image, an initial question text and a true value answer corresponding to the initial question text;
and carrying out feature extraction on the initial image and the initial question text to obtain the image feature data and the text feature data.
7. A visual question-answering task processing method, comprising:
acquiring reference image features corresponding to an image to be questioned and corresponding reference text features;
inputting the reference image features and the reference text features into a visual question-answering model to obtain a target answer;
wherein the visual question-answering model is obtainable according to the training method of the visual question-answering model according to any one of claims 1 to 6.
8. A training device for a visual question-answering model, comprising:
the data acquisition module is used for acquiring initial training data, wherein the initial training data comprises image-text input data and true value answers corresponding to the image-text input data, and the image-text input data comprises image characteristic data and text characteristic data;
the data processing module is used for inputting the image-text input data into a first initial model to obtain a first reference answer;
the data screening module is used for screening the initial training data based on the first reference answer and the true value answer to obtain first target training data;
the model updating module is used for updating a second initial model by using the first reference answer of the first target training data and the corresponding true value answer to obtain the visual question-answer model;
the model framework of the first initial model is consistent with that of the second initial model.
9. A visual question-answering task processing device, comprising:
the feature acquisition module is used for acquiring the reference image features corresponding to the image to be questioned and the corresponding reference text features;
the task processing module is used for inputting the reference image features and the reference text features into a visual question-answering model to obtain a target answer;
wherein the visual question-answering model is obtained according to the training method of the visual question-answering model according to any one of claims 1 to 6.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any one of claims 1 to 7.
11. An electronic device, comprising:
one or more processors; and
a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.
CN202310833232.4A 2023-07-07 2023-07-07 Training of visual question-answering model and visual question-answering task processing method and device Pending CN116894996A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310833232.4A CN116894996A (en) 2023-07-07 2023-07-07 Training of visual question-answering model and visual question-answering task processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310833232.4A CN116894996A (en) 2023-07-07 2023-07-07 Training of visual question-answering model and visual question-answering task processing method and device

Publications (1)

Publication Number Publication Date
CN116894996A true CN116894996A (en) 2023-10-17

Family

ID=88314381

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310833232.4A Pending CN116894996A (en) 2023-07-07 2023-07-07 Training of visual question-answering model and visual question-answering task processing method and device

Country Status (1)

Country Link
CN (1) CN116894996A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117592567A (en) * 2023-11-21 2024-02-23 广州方舟信息科技有限公司 Medicine question-answer model training method, device, electronic equipment and storage medium
CN117592567B (en) * 2023-11-21 2024-05-28 广州方舟信息科技有限公司 Medicine question-answer model training method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111381909B (en) Page display method and device, terminal equipment and storage medium
CN109740018B (en) Method and device for generating video label model
CN110189246B (en) Image stylization generation method and device and electronic equipment
CN107578453A (en) Compressed image processing method, apparatus, electronic equipment and computer-readable medium
EP4354343A1 (en) Data processing method and device
CN113762052A (en) Video cover extraction method, device, equipment and computer readable storage medium
CN111078940B (en) Image processing method, device, computer storage medium and electronic equipment
CN109816023B (en) Method and device for generating picture label model
CN116894996A (en) Training of visual question-answering model and visual question-answering task processing method and device
CN113610034B (en) Method and device for identifying character entities in video, storage medium and electronic equipment
CN113850012B (en) Data processing model generation method, device, medium and electronic equipment
CN114385662A (en) Road network updating method and device, storage medium and electronic equipment
CN113610911A (en) Training method and device of depth prediction model, medium and electronic equipment
CN113902636A (en) Image deblurring method and device, computer readable medium and electronic equipment
CN110197459B (en) Image stylization generation method and device and electronic equipment
CN113515994A (en) Video feature extraction method, device, equipment and storage medium
CN116824004A (en) Icon generation method and device, storage medium and electronic equipment
CN114422698B (en) Video generation method, device, equipment and storage medium
CN114945108A (en) Method and device for assisting vision-impaired person in understanding picture
CN114330239A (en) Text processing method and device, storage medium and electronic equipment
CN111353470B (en) Image processing method and device, readable medium and electronic equipment
CN111696041A (en) Image processing method and device and electronic equipment
CN111325093A (en) Video segmentation method and device and electronic equipment
CN117857916B (en) MINI LED display method and device based on artificial intelligence
CN115661238B (en) Method and device for generating travelable region, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination