CN114972823A - Data processing method, device, equipment and computer medium


Info

Publication number
CN114972823A
CN114972823A
Authority
CN
China
Prior art keywords
information
target
initial
model
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210671652.2A
Other languages
Chinese (zh)
Inventor
张新松
刁诗哲
周王春澍
王嘉伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202210671652.2A
Publication of CN114972823A
Priority to PCT/CN2023/098690 (WO2023241410A1)
Legal status: Pending

Classifications

    • G06V 10/757: Image or video pattern matching; matching configurations of points or features
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 20/70: Labelling scene content, e.g. deriving syntactic or semantic representations


Abstract

The application discloses a data processing method, device, equipment and computer medium. The method comprises: acquiring to-be-processed image-text feature information comprising mutually matched to-be-processed text feature information and to-be-processed image feature information, where either the to-be-processed text feature information or the to-be-processed image feature information contains a mask identifier; processing the to-be-processed image-text feature information based on an initial vector generation rule to obtain a first vector information group and a second vector information group; encoding the first vector information group and the second vector information group through an initial encoding rule to obtain a fused vector information group; decoding the fused vector information group through an initial decoding rule to obtain a prediction result corresponding to the mask identifier; and training the initial pre-training model based on the prediction result and the mask identifier to obtain a target pre-training model. This improves the efficiency of training the corresponding pre-training model when task processing models for multiple task categories need to be trained.

Description

Data processing method, device, equipment and computer medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a data processing method, apparatus, device, and computer medium.
Background
Currently, pre-training models in the related art generally focus on processing tasks of only a single task category, where the task category may be: text understanding tasks, visual understanding tasks, multimodal understanding tasks, image-to-text generation tasks, or text-to-image generation tasks. A multimodal understanding task simultaneously understands visual information and language information to solve visual question answering, visual reasoning, visual entailment, and the like. An image-to-text generation task generates a corresponding text description by understanding the input image information. A text-to-image generation task generates a corresponding image according to the input text information. When tasks of multiple categories need to be completed, the related art must train a task processing model for each category of task, and a pre-training model corresponding to each category of task, so the training efficiency of the pre-training models is low.
Disclosure of Invention
The embodiment of the application provides an implementation scheme different from the prior art, so as to solve the technical problem that in the prior art, when a task processing model corresponding to various tasks needs to be trained, the training efficiency of a corresponding pre-training model is low.
In a first aspect, the present application provides a data processing method, including: the method comprises the steps of obtaining to-be-processed image-text characteristic information, wherein the to-be-processed image-text characteristic information comprises to-be-processed text characteristic information and to-be-processed image characteristic information; the text characteristic information to be processed or the image characteristic information to be processed comprises mask identification, and the text characteristic information to be processed is matched with the image characteristic information to be processed;
generating a first vector information group corresponding to the feature information of the text to be processed and a second vector information group corresponding to the feature information of the image to be processed based on an initial vector generation rule;
coding the first vector information group and the second vector information group through an initial coding rule to obtain a corresponding fusion vector information group; wherein the fused vector information set comprises a plurality of fused vector information, each fused vector information being associated with the first vector information set and the second vector information set;
decoding the fusion vector information group through an initial decoding rule to obtain a prediction result corresponding to the mask identification;
and training the initial pre-training model based on the prediction result and the mask identification to obtain a target pre-training model, wherein the target pre-training model is used for training a target task processing model corresponding to the target task category according to the obtained target task category.
In a second aspect, the present application provides a model training method, including:
acquiring a target task category and sample task information corresponding to the target task category, wherein the sample task information comprises sample task input information and a sample task result label corresponding to the sample task input information;
obtaining a plurality of candidate units in a target pre-training model, the plurality of candidate units comprising: the system comprises a preprocessing unit, a first target vector generating unit, a first target cross mode encoding unit and a first target cross mode decoding unit;
determining a target unit corresponding to the target task type from the candidate units according to a preset corresponding relation;
constructing an initial task processing model corresponding to the target task category based on the target unit;
training the initial task processing model by using the sample task information to obtain a target task processing model for completing a target task corresponding to the target task type;
the target pre-training model is obtained through training of an initial pre-training model in the data processing method.
In a third aspect, the present application provides a task processing method, including:
acquiring target task information of a target task category, wherein the target task information comprises target task input information;
determining a corresponding target task processing model according to the target task category;
inputting the target task input information into the target task processing model to obtain a target task processing result corresponding to the target task category and the target task input information;
the target task processing model is obtained through the model training method.
In a fourth aspect, the present application provides a data processing apparatus comprising:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring the image-text characteristic information to be processed, and the image-text characteristic information to be processed comprises the text characteristic information to be processed and the image characteristic information to be processed; the text characteristic information to be processed or the image characteristic information to be processed comprises mask identification, and the text characteristic information to be processed is matched with the image characteristic information to be processed;
the generating unit is used for generating a first vector information group corresponding to the feature information of the text to be processed and a second vector information group corresponding to the feature information of the image to be processed based on an initial vector generating rule;
the encoding unit is used for encoding the first vector information group and the second vector information group through an initial encoding rule to obtain a corresponding fusion vector information group; wherein the fused vector information set comprises a plurality of fused vector information, each fused vector information being associated with the first vector information set and the second vector information set;
the decoding unit is used for decoding the fusion vector information group through an initial decoding rule to obtain a prediction result corresponding to the mask identification;
and the determining unit is used for training the initial pre-training model based on the prediction result and the mask identification to obtain a target pre-training model, and the target pre-training model is used for training a target task processing model corresponding to the target task category according to the obtained target task category.
In a fifth aspect, the present application provides a model training apparatus, comprising:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a target task category and sample task information corresponding to the target task category, and the sample task information comprises sample task input information and a sample task result label corresponding to the sample task input information; and means for obtaining a plurality of candidate units in the target pre-training model, the plurality of candidate units comprising: the system comprises a preprocessing unit, a first target vector generating unit, a first target cross mode encoding unit and a first target cross mode decoding unit;
the determining unit is used for determining a target unit corresponding to the target task type from the candidate units according to a preset corresponding relation;
the construction unit is used for constructing an initial task processing model corresponding to the target task category based on the target unit;
the training unit is used for training the initial task processing model by utilizing the sample task information to obtain a target task processing model for completing a target task corresponding to the target task type;
the target pre-training model is obtained through training of an initial pre-training model in the data processing method.
In a sixth aspect, the present application provides a task processing apparatus, including:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring target task information of a target task category, and the target task information comprises target task input information;
the determining unit is used for determining a corresponding target task processing model according to the target task category;
the input unit is used for inputting the target task input information into the target task processing model to obtain a target task processing result corresponding to the target task type and the target task input information;
the target task processing model is obtained through the model training method.
In a seventh aspect, the present application provides an electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform any one of the methods of the first aspect, the second aspect, the third aspect, the possible implementations of the first aspect, the possible implementations of the second aspect, or the possible implementations of the third aspect via execution of executable instructions.
In an eighth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements any one of the methods in the first aspect, the second aspect, the third aspect, the possible implementations of the first aspect, the possible implementations of the second aspect, or the possible implementations of the third aspect.
By acquiring to-be-processed image-text feature information comprising mutually matched to-be-processed text feature information and to-be-processed image feature information, where either the text feature information or the image feature information contains a mask identifier; generating, based on an initial vector generation rule, a first vector information group corresponding to the to-be-processed text feature information and a second vector information group corresponding to the to-be-processed image feature information; encoding the first vector information group and the second vector information group through an initial encoding rule to obtain a corresponding fused vector information group, in which each piece of fused vector information is associated with both the first vector information group and the second vector information group; decoding the fused vector information group through an initial decoding rule to obtain a prediction result corresponding to the mask identifier; and training the initial pre-training model based on the prediction result and the mask identifier to obtain a target pre-training model used for training a target task processing model according to an obtained target task category, this scheme unifies the processing of images, and of text matched with images, into the same pre-training model, and the sample data for training the target pre-training model involves multimodal information. As a result, the efficiency of training the corresponding pre-training model is improved when task processing models for multiple task categories need to be trained.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts. In the drawings:
FIG. 1 is a block diagram of a data processing system according to an embodiment of the present application;
fig. 2a is a schematic flowchart of a data processing method according to an embodiment of the present application;
fig. 2b is another schematic flow chart of a data processing method according to an embodiment of the present application;
fig. 2c is another schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 3a is a schematic flow chart illustrating a model training method according to an embodiment of the present disclosure;
FIG. 3b is a schematic diagram of a determination of a target unit according to an embodiment of the present application;
fig. 3c is an exemplary diagram of a target task processing result when the target task category is analyzing an image to obtain its corresponding semantic recognition result (image classification), according to an embodiment of the present application;
fig. 3d is an exemplary diagram of a processing result of a corresponding target task when a target task type provided by an embodiment of the application is to answer a question according to image-text information;
fig. 3e is an exemplary diagram of a target task processing result corresponding to the case that the target task type provided in the embodiment of the present application is to determine whether the text correctly describes the image pair;
fig. 3f is an exemplary diagram of a processing result of a corresponding target task when the target task type provided in an embodiment of the present application is a task that gives an image and a text description and determines whether a relationship between the image and the text is an implication, a contradiction, or a neutral task;
fig. 3g is an exemplary diagram of a processing result of a corresponding target task when a text description of an image is output given as a target task type according to an embodiment of the present application;
fig. 3h is an exemplary diagram of a processing result of a corresponding target task when a given text description is used as the target task type and an image corresponding to the text description is output according to an embodiment of the present application;
fig. 3i is a comparison, for a given text description, between the image output by a target task processing model trained by the model training method of the present application and the target task processing results determined by the related-art DALL·E and OFA models, according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a task processing method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a model training apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a task processing device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings. The embodiments described below with reference to the accompanying drawings are illustrative and intended to explain the present application and should not be construed as limiting the present application.
The terms "first" and "second," and the like in the description, the claims, and the drawings of the embodiments of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.
MIM: Masked Image Model.
MLM: Masked Language Model.
FLAVA: A Foundational Language And Vision Alignment model.
CLIP: Contrastive Language-Image Pre-training, a contrastive text-image pre-training model. It is a multimodal model with parallel image and text branches, whose training objective is constructed by computing the similarity of the feature vectors of the two branches.
SimVLM: Simple Visual Language Model Pre-training with Weak Supervision.
MNLI: Multi-Genre Natural Language Inference, a textual entailment recognition task.
CoLA: The Corpus of Linguistic Acceptability, a grammatical acceptability dataset; the task is mainly to determine whether a given sentence is grammatically acceptable.
MRPC: Microsoft Research Paraphrase Corpus; the task is to judge whether two given sentences have the same semantics, a sentence-pair text classification task.
QQP: Quora Question Pairs, a text-matching dataset released by Quora; the task is to judge whether two questions are semantically consistent, a sentence-pair text classification task.
SST: The Stanford Sentiment Treebank, a sentiment analysis dataset built mainly from movie reviews, so SST is a single-sentence text classification task (SST-2 is binary classification; SST-5 is five-way classification with finer-grained sentiment polarity).
QNLI: Question Natural Language Inference, whose predecessor is the SQuAD 1.0 dataset; given a question, the task is to judge whether the given text contains the correct answer to that question. A sentence-pair text classification task.
RTE: Recognizing Textual Entailment. Similar to MNLI, it is also a textual entailment task, except that MNLI is three-way classification while RTE only needs to judge whether one sentence entails the other, a sentence-pair binary text classification task.
STS-B: Semantic Textual Similarity Benchmark, a semantic textual similarity scoring dataset.
ImageNet: is a large visual database for visual object recognition software research.
Food-101 dataset: an image dataset containing 101 food categories, 101,000 images in total, with an average of 250 test images and 750 training images per category. The training images were not cleaned. All images have been resized to a maximum side length of 512 pixels.
CIFAR-10 dataset: a small dataset for recognizing everyday objects. It contains RGB color images in 10 classes, with 50,000 training images and 10,000 test images.
CIFAR-100 dataset: contains 100 classes. Each class has 600 color images, 500 used for training and 100 for testing.
Cars: an automobile dataset.
Aircraft dataset: contains 10,200 aircraft images covering 102 different aircraft models, with 100 images per model.
DTD: Describable Textures Dataset, a texture recognition dataset.
Pets dataset: provided by Oxford, it contains about 7,000 images of cats and dogs; a portion of the images annotate the position of the cat or dog face.
Flowers102 dataset: the data set contains image data sets of 102 flower classes, each class containing 40-258 images. These images have rich variations in scale, pose, and lighting.
MNIST dataset: a dataset of handwritten digit images.
STL-10 dataset: an image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms.
Country211: a country classification image dataset.
VQA_v2: Visual Question Answering, version 2 of the visual question answering task, whose form is to output an answer given an image and a question about the image.
SNLI-VE: built on The Stanford Natural Language Inference (SNLI) corpus, which is a set of 500,000 labeled English sentence pairs.
NLVR2: a dataset introduced by a Cornell University research team containing 107,292 examples of human-written English sentences paired with photographs.
OFA (OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework): a multi-task training framework that unifies different tasks under a sequence-to-sequence training objective and achieves pre-training by simultaneously training multiple downstream multimodal tasks. The model requires annotated data for the downstream tasks, and therefore has drawbacks in scalability and operability.
DALL·E: a model for generating an image from text; it discretizes the image and then jointly models image tokens and text tokens to generate a picture from the text.
The prefix language model is a left-to-right language model that can generate the remaining text from an input image plus a text prefix, or generate the remaining image from an input text plus an image prefix.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 1 is a schematic structural diagram of a data processing system according to an exemplary embodiment of the present application, which includes: a task processing device 11 and a model training device 12. The task processing device 11 and the model training device 12 may be computer devices, where a computer device may be a terminal or a server. The terminal may be a smartphone, a tablet computer, a notebook computer, an intelligent voice interaction device, a smart home appliance, a wearable smart device, an aircraft, an intelligent vehicle-mounted terminal, or the like, and may further run a client, which may be a video client, a browser client, an instant messaging client, or the like. The server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms.
Optionally, the task processing device 11 is configured to:
acquiring target task information of a target task category, wherein the target task information comprises target task input information;
determining a corresponding target task processing model according to the target task category;
inputting the target task input information into the target task processing model to obtain a target task processing result corresponding to the target task category and the target task input information;
and the target task processing model is obtained by training through the model training method.
Specifically, the model training device 12 may be configured to train an initial pre-training model to obtain a target pre-training model, and train an initial task processing model to obtain the target task processing model.
Optionally, when the model training device 12 is used to train the initial pre-training model to obtain the target pre-training model, specifically, the model training device is configured to:
acquiring to-be-processed image-text characteristic information, wherein the to-be-processed image-text characteristic information comprises to-be-processed text characteristic information and to-be-processed image characteristic information; the text characteristic information to be processed or the image characteristic information to be processed comprises mask identification, and the text characteristic information to be processed is matched with the image characteristic information to be processed;
generating a first vector information group corresponding to the feature information of the text to be processed and a second vector information group corresponding to the feature information of the image to be processed based on an initial vector generation rule;
coding the first vector information group and the second vector information group through an initial coding rule to obtain a corresponding fusion vector information group; wherein the fused vector information set comprises a plurality of fused vector information, each fused vector information being associated with the first vector information set and the second vector information set;
decoding the fusion vector information group through an initial decoding rule to obtain a prediction result corresponding to the mask identification;
and training the initial pre-training model based on the prediction result and the mask identification to obtain a target pre-training model, wherein the target pre-training model is used for training a target task processing model corresponding to the target task category according to the obtained target task category.
Further, when the model training device 12 is used to train the initial task processing model to obtain the target task processing model, it is specifically configured to:
acquiring a target task category and sample task information corresponding to the target task category, wherein the sample task information comprises sample task input information and a sample task result label corresponding to the sample task input information;
obtaining a plurality of candidate units in a target pre-training model, the plurality of candidate units comprising: the system comprises a preprocessing unit, a first target vector generating unit, a first target cross mode encoding unit and a first target cross mode decoding unit;
determining a target unit corresponding to the target task type from the candidate units according to a preset corresponding relation;
constructing an initial task processing model corresponding to the target task category based on the target unit;
training the initial task processing model by using the sample task information to obtain a target task processing model for completing a target task corresponding to the target task type;
and the target pre-training model is obtained based on the initial pre-training model.
The execution principle and the interaction process of the components in the embodiment of the system, such as the task processing device 11 and the model training device 12, can be referred to the following description of the embodiments of the method.
Fig. 2a is a schematic flow chart of a data processing method according to an exemplary embodiment of the present application, which may be applied to a model training apparatus for training an initial pre-training model into a target pre-training model; the method includes at least the following steps S201-S205:
s201, obtaining to-be-processed image-text characteristic information, wherein the to-be-processed image-text characteristic information comprises to-be-processed text characteristic information and to-be-processed image characteristic information; the text characteristic information to be processed or the image characteristic information to be processed comprises mask identification, and the text characteristic information to be processed is matched with the image characteristic information to be processed;
specifically, the mask identifier may be a preset identifier, for example the preset Arabic numeral 0. Regarding the manner of determining the to-be-processed image-text feature information, the method further includes the following steps S01-S04:
s01, obtaining sample image-text information, wherein the sample image-text information comprises sample text information and sample image information, and the sample text information is matched with the sample image information;
optionally, the matching of the sample text information and the sample image information refers to: the content indicated by the sample text information is related to the content indicated by the sample image information; for example, the content indicated by the sample text information is "puppy", and the content indicated by the sample image information is an image of the puppy.
For another example, when the sample text information indicates: "The Last Supper of Jesus with the Twelve Apostles, painting by Leonardo da Vinci", the content indicated by the sample image information is an image of The Last Supper.
S02, determining the tag information of each character in the sample text information in a preset word stock according to the sample text information and the preset word stock, to obtain a tag information group corresponding to the sample text information;
s03, coding the sample image information according to a first preset coding rule to obtain an initial vector information group corresponding to the sample image information;
in some alternative embodiments of the present application, referring to fig. 2b and fig. 2c, the initial pre-training model includes a preprocessing unit, which comprises a look-up table unit and a first image encoder; the aforementioned S02 can be performed by the look-up table unit, and the aforementioned first preset encoding rule can be embedded in the first image encoder, which performs the aforementioned S03.
The tag information group comprises a plurality of pieces of tag information, each corresponding one-to-one to a character in the sample text information. Alternatively, the tag information may be: address information, index information, location information, etc.
The pieces of tag information included in the tag information group may be Arabic numbers; for example, they may be the tag information a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14 in fig. 2b and fig. 2c, where each of a1 through a14 may be an Arabic number, and the pieces of tag information are not necessarily consecutive.
The first image encoder is used for extracting features of sample image information, and specifically comprises a splitting module and an encoding module, wherein the splitting module can split the sample image information into a plurality of sub-images with a preset number; then, a plurality of initial vector information corresponding to the plurality of sub-images is determined based on the encoding module, and the plurality of initial vector information form an initial vector information group. Optionally, the initial vector information corresponds to the sub-images one to one.
The initial vector information included in the initial vector information group may be numerical vectors, specifically b1, b2, b3, b4, b5, b6, b7, b8, b9 as shown in fig. 2c.
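For illustration only, the following Python sketch shows how the preprocessing unit described above might operate: the look-up table unit maps each character to its tag information in a preset word stock (S02), and the first image encoder splits the sample image into sub-images and encodes each one into a piece of initial vector information (S03). The class name, the character-level vocabulary, the 16-pixel patch size, and the linear encoding module are all assumptions made for illustration; the patent does not fix a concrete architecture.

```python
# Hypothetical sketch of the preprocessing unit (look-up table unit + first
# image encoder); names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class PreprocessingUnit(nn.Module):
    def __init__(self, vocab, patch_size=16, embed_dim=768):
        super().__init__()
        self.vocab = vocab              # preset word stock: character -> tag info
        self.patch_size = patch_size
        # encoding module: projects each flattened sub-image to one initial vector
        self.patch_proj = nn.Linear(3 * patch_size * patch_size, embed_dim)

    def text_to_tags(self, text):
        # S02: look up each character's tag information in the preset word stock
        return [self.vocab[ch] for ch in text]

    def image_to_initial_vectors(self, image):
        # S03: split the image (C, H, W) into sub-images, then encode each one
        c, h, w = image.shape
        p = self.patch_size
        patches = image.unfold(1, p, p).unfold(2, p, p)   # (C, H/p, W/p, p, p)
        patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * p * p)
        return self.patch_proj(patches)  # initial vector information group

vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}
unit = PreprocessingUnit(vocab)
tags = unit.text_to_tags("puppy")                                # e.g. [15, 20, 15, 15, 24]
vectors = unit.image_to_initial_vectors(torch.rand(3, 48, 48))   # 9 patches -> (9, 768)
```

With a 48x48 input and 16-pixel patches, the image splits into nine sub-images, matching the nine initial vectors b1 through b9 in the running example.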
And S04, performing mask processing on part of the marking information in the marking information group or part of the initial vector information in the initial vector information group based on a preset mask rule to obtain the image-text characteristic information to be processed.
Optionally, the initial pre-training model further includes a mask processing unit; the preset mask rule may be embedded in the mask processing unit and may include a first mask rule and a second mask rule. When the mask processing unit performs mask processing on part of the tag information in the tag information group, specifically, a first mask processing module with the first mask rule in the mask processing unit performs the masking; in this case the to-be-processed image feature information is identical to the pieces of initial vector information in the initial vector information group, and masking part of the tag information yields the mask identifiers replacing the masked tag information together with the remaining unmasked tag information in the tag information group. The mask identifiers and the remaining tag information together constitute the to-be-processed text feature information, and the to-be-processed image-text feature information in this case is the to-be-processed image-text feature information 1 in fig. 2b.
The number of pieces of tag information that are masked is the same as the number of mask identifiers.
When part of the initial vector information in the initial vector information group is masked by the mask processing unit with the preset mask rule, specifically, a second mask processing module with the second mask rule in the mask processing unit performs the masking; in this case the to-be-processed text feature information is identical to the pieces of tag information in the tag information group, and masking part of the initial vector information yields the mask identifiers replacing the masked initial vector information together with the remaining unmasked initial vector information in the initial vector information group. The mask identifiers and the remaining initial vector information together constitute the to-be-processed image feature information, and the to-be-processed image-text feature information in this case is the to-be-processed image-text feature information 2 in fig. 2c.
The number of pieces of initial vector information that are masked is the same as the number of mask identifiers.
Optionally, when the mask processing unit performs the mask processing on part of the tag information in the tag information group, or performs the mask processing on part of the initial vector information in the initial vector information group, the mask processing unit may perform the mask processing from the back to the front with a random length or a random proportion.
Optionally, when the sample text information is matched with the sample image information, it is considered that the text feature information to be processed is matched with the image feature information to be processed.
Optionally, the first mask processing module and the second mask processing module may or may not be the same module, which is not limited in this application. Specifically, an example of masking part of the tag information in the tag information group by the mask processing unit is shown in fig. 2b, where, of the tag information group a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, the part a5, a6, a7, a8, a9, a10, a11, a12, a13, a14 is the masked content.
An example of masking part of the initial vector information in the initial vector information group by the mask processing unit with the built-in preset mask rule is shown in fig. 2c, where, of the initial vector information group b1, b2, b3, b4, b5, b6, b7, b8, b9, the part b6, b7, b8, b9 is the masked content.
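A minimal sketch of the masking behaviour just described, assuming masking proceeds from the back to the front with a random proportion and that the preset Arabic numeral 0 serves as the mask identifier; the proportion range is an assumption.

```python
# Hedged sketch of suffix masking with a random proportion; MASK_ID = 0
# follows the "preset Arabic numeral 0" example, the ratio bounds are assumed.
import random

MASK_ID = 0

def mask_suffix(items, min_ratio=0.1, max_ratio=0.5):
    ratio = random.uniform(min_ratio, max_ratio)
    n_masked = max(1, int(len(items) * ratio))
    kept = items[: len(items) - n_masked]
    return kept + [MASK_ID] * n_masked   # remaining info + mask identifiers

tag_group = ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9"]
print(mask_suffix(tag_group))            # e.g. ['a1', ..., 'a6', 0, 0, 0]
```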
S202, generating a first vector information group corresponding to the feature information of the text to be processed and a second vector information group corresponding to the feature information of the image to be processed based on an initial vector generation rule;
The first vector information group includes a plurality of pieces of first vector information. Optionally, referring to fig. 2b and fig. 2c, the initial pre-training model may further include an initial vector generation unit in which the initial vector generation rule is built; the initial vector generation unit may specifically execute the step S202.
The first vector information may be t_1, t_2, ..., t_n in fig. 2b and fig. 2c; the second vector information group comprises a plurality of pieces of second vector information, which may be v_1, v_2, ..., v_m in fig. 2b and fig. 2c.
Specifically, the initial vector generation rule includes a first vector generation rule and a second vector generation rule. The to-be-processed text feature information is processed by a first vector generation module (with the built-in first vector generation rule) in the initial vector generation unit to obtain the first vector information group corresponding to the to-be-processed text feature information; the to-be-processed image feature information is processed by a second vector generation module (with the built-in second vector generation rule) in the initial vector generation unit to obtain the second vector information group corresponding to the to-be-processed image feature information. The second vector generation module may comprise the first three layers of ResNet-101.
The first vector generation module can perform position embedding and normalization processing on the feature information of the text to be processed, and correspondingly, the second vector generation module can perform position embedding and normalization processing on the feature information of the image to be processed.
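The following sketch illustrates one plausible form of the first vector generation module, assuming learned position embeddings and LayerNorm for the normalization step; the patent only states that position embedding and normalization are applied. The second vector generation module would treat the initial vector information analogously (and, per the text above, may include the first three layers of ResNet-101).

```python
# A plausible sketch of a vector generation module: token embedding plus
# position embedding, then normalization. Learned position embeddings and
# LayerNorm are assumptions.
import torch
import torch.nn as nn

class VectorGenerationModule(nn.Module):
    def __init__(self, vocab_size, max_len=512, dim=768):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.pos_embed = nn.Embedding(max_len, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tag_ids):  # tag_ids: (batch, seq_len) tag information
        positions = torch.arange(tag_ids.size(1), device=tag_ids.device)
        x = self.token_embed(tag_ids) + self.pos_embed(positions)
        return self.norm(x)      # first vector information t_1 ... t_n

module = VectorGenerationModule(vocab_size=30000)
t = module(torch.randint(0, 30000, (1, 14)))  # (1, 14, 768), i.e. t_1 ... t_14
```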
S203, coding the first vector information group and the second vector information group through an initial coding rule to obtain a corresponding fusion vector information group; wherein the fused vector information set comprises a plurality of fused vector information, each fused vector information being associated with the first vector information set and the second vector information set;
optionally, the initial pre-training model may include an initial cross mode encoding unit, and the initial encoding rule may be embedded in the initial cross mode encoding unit, and specifically, the initial cross mode encoding unit may execute the foregoing S203.
In particular, the initial encoding rules may include splicing rules and cross-modal encoding rules.
Referring to fig. 2b and 2c, the initial cross mode coding unit may include a splicing unit with a splicing rule and a cross mode encoder with a cross mode coding rule; the splicing unit is used for splicing the first vector information group and the second vector information group to obtain a spliced vector, and the cross mode encoder is used for encoding the spliced vector to obtain the fusion vector information group.
The spliced vector may be, as in fig. 2b and fig. 2c: v_1, v_2, ..., v_m, t_1, t_2, ..., t_n, or t_1, t_2, ..., t_n, v_1, v_2, ..., v_m. The fused vector information included in the fused vector information group is h_1, h_2, ..., h_l, where l = m + n.
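A hedged sketch of the initial cross-mode encoding unit: the splicing unit concatenates the two vector groups, and a self-attention encoder produces fused vectors in which each h_i attends to both modalities. The Transformer configuration (dimension, heads, depth) is an assumption.

```python
# Sketch of splicing + cross-modal encoding; the encoder configuration is
# assumed, not prescribed by the patent.
import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=6,
)

v = torch.rand(1, 9, 768)            # second vector group v_1 ... v_m (m = 9)
t = torch.rand(1, 14, 768)           # first vector group t_1 ... t_n (n = 14)
spliced = torch.cat([v, t], dim=1)   # splicing unit
fused = encoder(spliced)             # h_1 ... h_l with l = m + n = 23
```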
S204, decoding the fusion vector information group through an initial decoding rule to obtain a prediction result corresponding to the mask identification;
the prediction result may include predicted text feature information corresponding to the text feature information to be processed, or predicted image feature information corresponding to the image feature information to be processed.
Optionally, referring to fig. 2b and fig. 2c, the initial pre-training model further includes an initial cross-mode decoding unit, and the initial decoding rule may be embedded in the initial cross-mode decoding unit. Specifically, the foregoing S204 may be performed by the initial cross-mode decoding unit.
Specifically, when the mask identifier is obtained by performing mask processing on part of the tag information in the tag information group through a mask processing unit, the prediction result may include predicted text feature information corresponding to the text feature information to be processed; when the mask identifier is obtained by performing mask processing on part of the initial vector information in the initial vector information group through a mask processing unit, the prediction result may include a prediction coding numerical value corresponding to the feature information of the image to be processed. The explanation of the predictive coding values is described in detail below.
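For illustration, the cross-mode decoding step can be pictured as prediction heads over the fused vectors: a vocabulary head when tag information was masked, and an encoded-value head when initial vector information was masked. The head sizes below are assumptions.

```python
# Illustrative prediction heads for the initial cross-mode decoding unit; the
# vocabulary size, codebook size, and hidden dimension are assumptions.
import torch
import torch.nn as nn

VOCAB_SIZE, CODEBOOK_SIZE, DIM = 30000, 8192, 768
text_head = nn.Linear(DIM, VOCAB_SIZE)       # scores over tag information
image_head = nn.Linear(DIM, CODEBOOK_SIZE)   # scores over encoded values

fused = torch.rand(1, 23, DIM)               # h_1 ... h_l from the encoder
text_logits = text_head(fused)               # used when text tags were masked
image_logits = image_head(fused)             # used when image vectors were masked
```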
S205, training the initial pre-training model based on the prediction result and the mask identification to obtain a target pre-training model, wherein the target pre-training model is used for training a target task processing model corresponding to the target task category according to the obtained target task category.
Alternatively, the aforementioned S205 may be specifically performed by a training unit in the initial pre-training model.
Optionally, in S205, training the initial pre-training model based on the prediction result and the mask identifier to obtain a target pre-training model, including the following S2051 to S2053:
s2051, acquiring target information corresponding to the mask mark;
optionally, when the mask identifier is obtained by performing mask processing on part of the tag information in the tag information group through a mask processing unit, the target information corresponding to the mask identifier is part of the tag information.
When the mask identifier is obtained by performing mask processing on part of the initial vector information in the initial vector information group through a mask processing unit, the method further includes, for a determination mode of target information corresponding to the mask identifier:
encoding the sample image information according to a second preset encoding rule to obtain an encoding value group corresponding to the sample image information, wherein the encoding value group comprises a plurality of encoding values, and the number of the encoding values in the encoding value group is the same as the number of the initial vector information in the initial vector information group; and determining target information corresponding to the mask identification according to the coding value group.
The code value may be an Arabic number; as shown in fig. 2c, the plurality of code values may be: 123, 234, 345, 987, 654, 321, 999, 888, 777.
specifically, as shown in fig. 2c, a second preset encoding rule may be built in the second image encoder, and the second image encoder with the built-in second preset encoding rule functions to convert the characteristics of the sample image information into discrete values, that is, a plurality of encoding values in the encoding value group, so that the initial cross mode decoding unit can output a predictive encoding value corresponding to the corresponding sample image information, compare the predictive encoding value with target information corresponding to a mask identifier in the encoding value group, and train the initial pre-training model.
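A sketch of the second image encoder's role, assuming a vector-quantization-style tokenizer (in the spirit of dVAE/VQ-VAE) that maps each sub-image feature to the index of its nearest codebook entry; the patent only requires that image features be converted to discrete encoded values.

```python
# Hedged sketch of converting sub-image features to discrete encoded values;
# the codebook and nearest-neighbour rule are assumptions.
import torch

def encode_to_values(patch_features, codebook):
    # patch_features: (num_patches, dim); codebook: (codebook_size, dim)
    dists = torch.cdist(patch_features, codebook)   # pairwise distances
    return dists.argmin(dim=1)                      # one encoded value per patch

codebook = torch.rand(8192, 768)
values = encode_to_values(torch.rand(9, 768), codebook)  # e.g. 9 values like 123, 234, ...
```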
Further, determining the target information corresponding to the mask id according to the encoding value set may include the following S001 to S003:
s001, acquiring first position information of the masked content corresponding to the mask identification in a mask object;
if the image-text characteristic information to be processed is obtained by performing mask processing on part of initial vector information in the initial vector information group based on a preset mask rule, the masked content is the part of initial vector information, and the mask object is an initial vector information group.
Wherein the first position information may include index information of each piece of initial vector information in the masked content within the initial vector information group. For example, for the initial vector information group b1, b2, b3, b4, b5, b6, b7, b8, b9 in fig. 2c, when the partial initial vector information b6, b7, b8, b9 is the masked content, the first position information of b6, b7, b8, b9 may be: 6, 7, 8, and 9.
S002, selecting a target coding numerical value corresponding to the first position information from the coding numerical value group;
since the number of the encoded values included in the encoded value group is the same as the number of the initial vector information included in the initial vector information group, optionally, the second position information of each encoded value in the encoded value group corresponds to the first position information of each initial vector information in the initial vector information group. The second location information may also be index information. Specifically, the index information of each encoded value in the encoded value group in fig. 2c, and the index information of each initial vector information in the initial vector information group can be shown in table 1:
table 1 shows that when the encoded value group is different from the initial vector information group in the index information, the corresponding initial vector information and encoded value are obtained.
Index information 1 2 3 4 5 6 7 8 9
Initial vector information set b1 b2 b3 b4 b5 b6 b7 b8 b9
Encoding a set of values 123 234 345 987 654 321 999 888 777
Optionally, the encoded value group further includes second position information of each encoded value, and selecting the target encoded value corresponding to the first position information from the encoded value group may include: and determining a target encoding value corresponding to the second position information matched with the first position information from the encoding value group. When the first position information is the same as the second position information, the first position information and the second position information can be considered to be matched.
S003, taking the target coding numerical value as the target information;
For example, referring to fig. 2c and table 1, when the first position information is: 6, 7, 8, and 9, the target encoded values are: 321, 999, 888, 777.
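The selection in S001-S003 reduces to indexing the encoded value group by the first position information, as the following sketch shows (1-based indices, per Table 1):

```python
# Minimal sketch of S001-S003: pick the target encoded values at the masked
# positions; the 1-based index convention follows Table 1.
encoded_values = [123, 234, 345, 987, 654, 321, 999, 888, 777]
first_positions = [6, 7, 8, 9]           # positions of masked b6, b7, b8, b9
target_info = [encoded_values[p - 1] for p in first_positions]
print(target_info)                        # [321, 999, 888, 777]
```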
S2052, determining a similarity value between the prediction result and the target information, wherein the similarity value can be determined through a cross entropy function;
alternatively, as can be seen in fig. 2b and 2c, the training unit may comprise a comparison unit, which may perform S2052.
S2053, if the similarity value is smaller than a preset similarity value, taking the initial pre-training model as a target pre-training model;
if the similarity value is not smaller than the preset similarity value, updating the model parameters in the initial pre-training model based on the similarity value to obtain an initial pre-training model with updated model parameters, and returning to the step of generating, based on the initial vector generation rule, the first vector information group corresponding to the to-be-processed text feature information and the second vector information group corresponding to the to-be-processed image feature information, until the similarity value is smaller than the preset similarity value, thereby obtaining the target pre-training model;
wherein the model parameters in the initial pre-training model comprise: at least one of a parameter in the initial vector generation rule, a parameter in the initial encoding rule, and a parameter in the initial decoding rule.
Optionally, the model parameters in the initial pre-training model may further include: parameters in the preprocessing unit.
Optionally, the initial pre-training model may further include another look-up table unit, allowing relevant personnel to confirm the training completion of the initial training model from its output. For example, as shown in fig. 2b, if the output of the initial cross-mode decoding unit is: a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, the look-up table unit can resolve it to: "Jesus with the Twelve Apostles, painting by Leonardo da Vinci".
Optionally, the initial pre-training model may further include an image decoder corresponding to the second image encoder, allowing relevant personnel to confirm the training completion of the initial training model from the image decoder's output. For example, as shown in fig. 2c, if the output of the initial cross-mode decoding unit is: 321, 999, 888, 777, the image decoder can decode, among the aforementioned sub-images: the sub-image corresponding to 321, the sub-image corresponding to 999, the sub-image corresponding to 888, and the sub-image corresponding to 777.
In other optional embodiments of the present application, training the initial pre-training model based on the prediction result and the mask identifier to obtain a target pre-training model includes:
determining whether the updating times of the model parameters in the initial pre-training model are larger than preset times, if so, taking the initial pre-training model as a target pre-training model;
if not, acquiring target information corresponding to the mask identifier; determining corresponding loss information according to the prediction result, the target information, and a preset loss function; updating the model parameters in the initial pre-training model according to the loss information to obtain an initial pre-training model with updated model parameters; and returning to the step of generating, based on the initial vector generation rule, the first vector information group corresponding to the to-be-processed text feature information and the second vector information group corresponding to the to-be-processed image feature information, taking the initial pre-training model as the target pre-training model once the number of updates exceeds the preset number;
wherein the model parameters in the initial pre-training model comprise: at least one of a parameter in the initial vector generation rule, a parameter in the initial encoding rule, and a parameter in the initial decoding rule.
Wherein the loss function may be a cross-entropy function.
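As a minimal sketch of this update-count criterion with a cross-entropy loss (all names are hypothetical and stand in for the units described above, not for the disclosed implementation):

import torch.nn.functional as F

def train_for_preset_times(model, batches, preset_times, optimizer):
    # Stop once the number of parameter updates exceeds the preset number
    # of times; otherwise compute the loss between the prediction result
    # and the target information and update the model parameters.
    update_count = 0
    for batch in batches:
        if update_count > preset_times:
            break  # the current model is taken as the target pre-training model
        logits = model(batch["text_features"], batch["image_features"])
        loss = F.cross_entropy(logits, batch["mask_target_ids"])  # preset loss function
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        update_count += 1
    return model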
In summary, referring to fig. 2b and 2c, the initial pre-training model may include: the preprocessing unit, the mask processing unit, the initial vector generating unit, the initial cross mode encoding unit, the initial cross mode decoding unit, and the training unit.
Further, the method further includes the following S1-S2:
S1, acquiring a target task category and sample task information corresponding to the target task category, wherein the sample task information comprises sample task input information and a sample task result label corresponding to the sample task input information;
S2, training, by using the sample task information and the target pre-training model, a target task processing model for completing the target task corresponding to the target task category.
Alternatively, the target task category may be any of: a text understanding task, an image understanding task (i.e., a visual understanding task), a text-to-image generation task, an image-to-text generation task, a multimodal recognition task, and the like.
The sample task input information comprises the task preconditions that need to be analyzed to obtain a sample task result; the sample task result label is the task processing result obtained from the sample task input information. For example, if the target task category is image-to-text generation, training the target task processing model corresponding to this category requires a plurality of sample images and a text generation result corresponding to each sample image; at least one of the plurality of sample images serves as sample task input information, and the text generation result corresponding to each sample image serves as a sample task result label.
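To make the structure of the sample task information concrete, a hypothetical container for an image-to-text sample might look like this (all field names and values are illustrative only, not part of the disclosure):

from dataclasses import dataclass

@dataclass
class SampleTaskInfo:
    task_category: str   # e.g. "image_to_text"
    task_input: str      # sample task input information, here a path to a sample image
    result_label: str    # sample task result label, here the reference caption

sample = SampleTaskInfo(
    task_category="image_to_text",
    task_input="images/sample_001.jpg",  # hypothetical sample image
    result_label="a seabird walks on the bank",
)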
Optionally, the target pre-training model is a trained initial pre-training model. Specifically, the target pre-training model includes a plurality of candidate units: a preprocessing unit, a first target vector generating unit, a first target cross mode encoding unit and a first target cross mode decoding unit;
optionally, each candidate unit may include multiple sub-units, and the determined target unit may include only a part of sub-units in one candidate unit, or may include all sub-units.
Optionally, the preprocessing unit includes a table lookup unit and a first image encoder, and the preprocessing unit included in the determined target unit may only include the table lookup unit or the first image encoder, or may include both the table lookup unit and the first image encoder.
Optionally, the first target cross mode encoding unit includes a splicing unit and a cross mode encoder, and the first target cross mode encoding unit included in the determined target unit may only include the cross mode encoder, or may include both the splicing unit and the cross mode encoder.
In S2, training the target task processing model for completing the target task corresponding to the target task category using the sample task information and the target pre-training model includes the following S21-S23:
S21, determining a target unit corresponding to the target task category from the plurality of candidate units according to a preset corresponding relation;
the preset corresponding relation stores a plurality of task categories and the association between each task category and the units making up its corresponding task processing model.
S22, constructing an initial task processing model corresponding to the target task type based on the target unit;
S23, training the initial task processing model by using the sample task information to obtain the target task processing model;
the first target vector generation unit corresponds to the initial vector generation unit, the first target cross mode coding unit corresponds to the initial cross mode coding unit, and the first target cross mode decoding unit corresponds to the initial cross mode decoding unit.
Specifically, the first target vector generation unit is: after a target pre-training model is trained, an initial vector generating unit in the target pre-training model; the first target cross mode coding unit is: after a target pre-training model is trained, an initial cross mode coding unit in the target pre-training model; the first target cross-mode decoding unit is: and after the target pre-training model is trained, an initial cross mode decoding unit in the target pre-training model.
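One hypothetical way to encode the preset corresponding relation used in S21 is a simple mapping from task categories to the sub-units that make up the target unit. The category and unit names below are illustrative only and follow the examples given later in this description:

PRESET_CORRESPONDENCE = {
    # text-only tasks need the table look-up unit but no image encoder
    "text_understanding": ["lookup_unit", "vector_generator",
                           "cross_modal_encoder", "decoder"],
    # image-only tasks need the first image encoder but no look-up unit
    "image_classification": ["image_encoder", "vector_generator",
                             "cross_modal_encoder", "decoder"],
    # image-text tasks need the full preprocessing unit and the splicing unit
    "visual_question_answering": ["lookup_unit", "image_encoder", "vector_generator",
                                  "splicing_unit", "cross_modal_encoder", "decoder"],
}

def select_target_units(task_category):
    # Step S21: determine the target unit for the target task category.
    return PRESET_CORRESPONDENCE[task_category]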
In other alternative embodiments of the present application, the foregoing manner of determining the target unit may also be implemented based on a selection instruction of a user for the target unit, which is not limited in the present application.
Optionally, the sample data for training the target pre-training model may also include sample data of plain text. The sample data used to train the target pre-training model (e.g., sample image information, sample data in plain text) may be derived from a network, or a public data set.
Optionally, the target pre-training model in the present application is a prefix language model that can fully associate language with images, so that the target pre-training model gains text generation capability and image coding capability together with a sufficient association between text and images, thereby enhancing its cross-modal understanding capability.
In the above scheme, to-be-processed image-text feature information is acquired, comprising text feature information to be processed and image feature information to be processed; the text feature information to be processed or the image feature information to be processed comprises a mask identifier, and the two are matched with each other. A first vector information group corresponding to the feature information of the text to be processed and a second vector information group corresponding to the feature information of the image to be processed are generated based on an initial vector generation rule. The first vector information group and the second vector information group are encoded through an initial encoding rule to obtain a corresponding fused vector information group, wherein the fused vector information group comprises a plurality of pieces of fused vector information, each associated with both the first vector information group and the second vector information group. The fused vector information group is decoded through an initial decoding rule to obtain a prediction result corresponding to the mask identifier, and the initial pre-training model is trained based on the prediction result and the mask identifier to obtain a target pre-training model, which is used to train a target task processing model corresponding to an acquired target task category. In this way, the processing of images and of texts matched with images is unified into one and the same pre-training model, and the sample data for training the target pre-training model covers multi-modal information.
In addition, the second image encoder in the target pre-training model endows the target pre-training model with image generation capability, and the table look-up unit endows it with text generation capability; the first target cross mode encoding unit and the first target cross mode decoding unit endow it with multi-modal understanding, text understanding and visual understanding capabilities. The target pre-training model is therefore highly compatible and extensible, and can provide the basis for training various task processing models, improving the efficiency of processing various tasks.
In addition, the scheme encodes images into discrete data through the second image encoder, so that the target pre-training model can be trained with image-text information containing both image information and text information as sample data; image information is thus processed in a manner similar to text information, which makes training of the target pre-training model faster.
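As an illustration of encoding an image into discrete data, a vector-quantization-style lookup, which is one common realization and not necessarily the one used here, maps each sub-image feature to the index of its nearest codebook entry:

import torch

def discretize_patches(patch_features, codebook):
    # patch_features: (num_patches, dim) sub-image features;
    # codebook: (num_codes, dim) learned code vectors.
    distances = torch.cdist(patch_features, codebook)  # pairwise distances
    return distances.argmin(dim=1)  # one discrete code value per sub-image

codes = discretize_patches(torch.randn(4, 64), torch.randn(1024, 64))
print(codes.tolist())  # e.g. four code values such as 321, 999, 888, 777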
The target task processing model obtained by the above model training method is trained on a target pre-training model covering multiple modalities, and therefore achieves higher accuracy when processing tasks. Table 2 compares the accuracy values of the task processing results of the target task processing model determined by the scheme of the present application with those of task processing models determined by other methods in the related art.
Table 2: comparison of the accuracy values of the task processing results of the task processing model of the present scheme with those of task processing models determined by other methods
[Table 2 is reproduced as an image in the original publication and is not shown here.]
Wherein, MIM, MLM, FLAVA, CLIP and SimVLM refer to the categories of the compared task processing models;
MNLI, CoLA, MRPC, QQP, SST-2, QNLI, RTE and STS-B are the categories of the processed tasks, wherein the MNLI result is the average of MNLI-m and MNLI-mm, the MRPC and QQP results are the average of accuracy and F1 score, CoLA reports the Matthews Correlation Coefficient (MCC), and STS-B reports the Pearson Correlation Coefficient (PCC).
"M" in 70M, 46.4M, and 647.7M refers to "million", i.e., 70M, 46.4M, and 647.7M refers to the amount of data used to calculate the accuracy value of the task processing result.
NLP Avg refers to the average of the accuracy values of the task processing results at the natural language processing level.
Vision Avg refers to the average of the accuracy values of the task processing results at the visual recognition (i.e., image recognition) level.
Multi-modal Avg refers to the average of the accuracy values of the task processing results of the multi-modal processing tasks.
The Eval method refers to the method used to evaluate the corresponding task, specifically: 1) fine-tuning, which fully trains the model on the corresponding task; 2) linear eval, which fixes the model and predicts the result of the corresponding task by adding only a classifier; 3) zero-shot, which keeps the model completely fixed and solves the corresponding task without adding any learnable parameters.
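A minimal sketch of these three regimes, assuming a PyTorch module and a hypothetical hidden size of 768 (both assumptions, not taken from the disclosure):

import torch.nn as nn

def prepare_for_eval(model, eval_method, num_classes, hidden=768):
    if eval_method == "fine-tuning":
        for p in model.parameters():
            p.requires_grad = True   # completely train the model on the task
        return model
    if eval_method == "linear eval":
        for p in model.parameters():
            p.requires_grad = False  # fix the model ...
        return nn.Sequential(model, nn.Linear(hidden, num_classes))  # ... add a classifier
    if eval_method == "zero-shot":
        for p in model.parameters():
            p.requires_grad = False  # completely fixed, no learnable parameters added
        return model
    raise ValueError("unknown eval method: " + eval_method)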
ImageNet, Food101, CIFAR10, Cars, Aircraft, DTD, Pets, Flowers102, MNIST, STL10 and Country211 refer to dataset names. The accuracy value corresponding to each dataset is the accuracy of the task processing results when the current dataset serves as the task's analysis data.
VQAv2, SNLI-VE and NLVR2 likewise refer to dataset names;
I2T and T2I denote the tasks of generating text from images and generating images from text, respectively. I2T@B4 and I2T@C are evaluation indexes of the image-to-text generation task, where B4 denotes the 4-gram Bilingual Evaluation Understudy (BLEU) index and C denotes the Consensus-based Image Description Evaluation (CIDEr) index. T2I@IS and T2I@FID are evaluation indexes of the text-to-image generation task, where IS denotes the Inception Score and FID denotes the Fréchet Inception Distance.
Wherein, "↓" indicates smaller corresponding accuracy value, which indicates higher accuracy of the task processing result.
As can be seen from table 2, the target task processing model determined by the present scheme achieves state-of-the-art results among models of the same type at the same model scale and data scale. The related models compared with it are FLAVA and SimVLM. The target task processing model determined by the scheme of the present application performs best on 22 of the 26 tasks. Apart from being on par with the related models on the text understanding tasks, it improves substantially on all their strengths, including the visual understanding tasks, the multi-modal understanding tasks, the text-to-image generation tasks, and the image-to-text generation tasks.
Fig. 3a is a schematic flow chart of a model training method provided by an exemplary embodiment of the present application and applicable to a model training apparatus; the method includes at least the following steps S301-S305:
S301, acquiring a target task category and sample task information corresponding to the target task category, wherein the sample task information comprises sample task input information and a sample task result label corresponding to the sample task input information;
S302, obtaining a plurality of candidate units in a target pre-training model, wherein the plurality of candidate units comprise: a preprocessing unit, a first target vector generating unit, a first target cross mode encoding unit and a first target cross mode decoding unit;
S303, determining a target unit corresponding to the target task category from the plurality of candidate units according to a preset corresponding relation;
S304, constructing an initial task processing model corresponding to the target task category based on the target unit;
S305, training the initial task processing model by using the sample task information to obtain a target task processing model for completing a target task corresponding to the target task category;
optionally, the target pre-training model is obtained by training an initial pre-training model in the data processing method in the embodiment corresponding to fig. 2 a.
As can be seen in conjunction with the corresponding embodiment of fig. 2a, the initial pre-training model includes: the device comprises an initial vector generating unit, an initial cross mode encoding unit and an initial cross mode decoding unit.
In the data processing method, a first vector information group corresponding to the feature information of the text to be processed and a second vector information group corresponding to the feature information of the image to be processed may be generated specifically based on an initial vector generation rule built in an initial vector generation unit;
coding the first vector information group and the second vector information group through an initial coding rule built in an initial cross mode coding unit to obtain a corresponding fusion vector information group; wherein the fused vector information set comprises a plurality of fused vector information, each fused vector information being associated with the first vector information set and the second vector information set;
decoding the fused vector information group through an initial decoding rule built in an initial cross mode decoding unit to obtain a prediction result corresponding to the mask identification;
and training an initial pre-training model based on the prediction result and the mask identification to obtain a target pre-training model, wherein the target pre-training model is used for training a target task processing model corresponding to the target task type according to the obtained target task type.
The target pre-training model is a trained initial pre-training model, the first target vector generation unit is a trained initial vector generation unit in the initial pre-training model, the first target cross mode encoding unit is a trained initial cross mode encoding unit in the initial pre-training model, and the first target cross mode decoding unit is a trained initial cross mode decoding unit in the initial pre-training model. The target pre-training model comprises a plurality of candidate units, the plurality of candidate units comprising at least two of the following units: a preprocessing unit, a first target vector generating unit, a first target cross mode encoding unit and a first target cross mode decoding unit.
Optionally, when the initial task processing model is constructed based on the target unit, additional units other than the target unit may be acquired according to the target task category. That is, the initial task processing model may be constructed by the target unit and the additional unit, and in training the target task processing model, parameters of the additional unit may be updated in addition to parameters of the target unit.
Optionally, the aforementioned additional unit may further include a recovery unit, and the recovery unit may include a table look-up unit and/or an image decoder corresponding to the second image encoder.
Optionally, if the target unit includes the preprocessing unit, the first target vector generating unit, the first target cross mode encoding unit and the first target cross mode decoding unit, then the trained initial task processing model, that is, the target task processing model, may include: the trained preprocessing unit, a second target vector generating unit, a second target cross mode encoding unit and a second target cross mode decoding unit, and may further include an image decoder.
The second target vector generation unit is a trained first target vector generation unit, the second target cross mode coding unit is a trained first target cross mode coding unit, and the second target cross mode decoding unit is a trained first target cross mode decoding unit.
Optionally, different target task categories and different determined target units are different, in some optional embodiments of the present application, if the target task category is a text understanding task, referring to fig. 3b, the determining, from the plurality of candidate units, a target unit corresponding to the target task category may include: a table look-up unit in the preprocessing unit, a first target vector generating unit, a cross mode encoder in the first target cross mode encoding unit, and a first target cross mode decoding unit.
In some optional embodiments of the present application, if the target task category is a text classification, only analysis of the text is involved, and analysis of the image is not involved, so that determining the target unit corresponding to the target task category from the plurality of candidate units may include: a table look-up unit in the preprocessing unit, a first target vector generating unit, a cross mode encoder in the first target cross mode encoding unit, and a first target cross mode decoding unit.
In the initial task processing model constructed based on the target unit, a table look-up unit in the preprocessing unit is connected with a first target vector generating unit, the first target vector generating unit is connected with a cross mode encoder in a first target cross mode encoding unit, and the cross mode encoder in the first target cross mode encoding unit is connected with a first target cross mode decoding unit.
Further, the initial task processing model may further include other units besides the target unit, for example, the initial task processing model further includes an initial classifier, an input interface of the initial classifier is connected to an output interface of the first target cross-mode decoding unit, and an output interface of the initial classifier is used to output a predicted task processing result, so that in the process of training the target task processing model, the initial task processing model is trained based on the sample task information, the predicted task processing result, and the sample task result label.
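For illustration, the chain described above for a text task could be assembled as follows. All module choices and dimensions below are placeholders standing in for the trained units, not the disclosed implementations:

import torch
import torch.nn as nn

vocab_size, hidden, num_classes = 30000, 768, 2
lookup_unit = nn.Embedding(vocab_size, hidden)       # table look-up unit
vector_generator = nn.Linear(hidden, hidden)         # first target vector generating unit
encoder_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
cross_modal_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
decoder_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
decoding_unit = nn.TransformerEncoder(decoder_layer, num_layers=2)  # stands in for the decoding unit
initial_classifier = nn.Linear(hidden, num_classes)  # the added initial classifier

token_ids = torch.randint(0, vocab_size, (1, 16))    # a toy tokenized sentence
h = decoding_unit(cross_modal_encoder(vector_generator(lookup_unit(token_ids))))
predicted_result = initial_classifier(h.mean(dim=1)) # predicted task processing result
print(predicted_result.shape)                        # torch.Size([1, 2])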
In some optional embodiments of the present application, after the target task processing model corresponding to the text classification task category is obtained by training, consider the target task of acquiring two sentences and determining the relationship of the second sentence to the first sentence, wherein:
a first statement: oneofourumberwellcaryoutyouutritionstitutionminutely;
the second statement: ameber of myteamwillebracteuteuroprostswithmmenserecision.
The first statement and the second statement can be used as the target task input information of the target task processing model, and the target task processing result, namely entailment (meaning that the semantics of the first statement entail the semantics of the second statement), can be obtained.
In some optional embodiments of the present application, if the target task category is analyzing the semantic recognition result corresponding to an image so as to classify the image, only analysis and classification of the image are involved. Therefore, determining the target unit corresponding to the target task category from the plurality of candidate units may include: the first image encoder in the preprocessing unit, the first target vector generating unit, the cross mode encoder in the first target cross mode encoding unit, and the first target cross mode decoding unit.
In an initial task processing model constructed based on the target unit, a first image encoder in the preprocessing unit is connected with a first target vector generating unit, the first target vector generating unit is connected with a cross mode encoder in the first target cross mode encoding unit, and the cross mode encoder in the first target cross mode encoding unit is connected with a first target cross mode decoding unit.
Further, the initial task processing model may further include other units besides the target unit, for example, the initial task processing model further includes an initial classifier, an input interface of the initial classifier is connected to an output interface of the first target cross-mode decoding unit, and an output interface of the initial classifier is used to output a predicted task processing result, so that in the process of training the target task processing model, the initial task processing model is trained based on the sample task information, the predicted task processing result, and the sample task result label.
In some optional embodiments of the present application, after the target task processing model corresponding to this category (analyzing the semantic recognition result of an image and classifying it) is obtained by training: for the target task of analyzing the semantic recognition result of image 1 in fig. 3c and classifying the image, image 1 can be used as the target task input information of the target task processing model, and the target task processing result, namely "table lamp", can be obtained; for the target task of analyzing the semantic recognition result of image 2 in fig. 3c and classifying the image, image 2 can be used as the target task input information of the target task processing model, and the target task processing result, namely "ice cream", can be obtained.
In some optional embodiments of the present application, if the target task category is answering a question according to image-text information, both image analysis and text analysis are involved. Therefore, determining the target unit corresponding to the target task category from the plurality of candidate units may include: the preprocessing unit, the first target vector generating unit, the first target cross mode encoding unit and the first target cross mode decoding unit.
In an initial task processing model constructed based on the target unit, the preprocessing unit is connected with the first target vector generating unit, the first target vector generating unit is connected with the first target cross mode encoding unit, and the first target cross mode encoding unit is connected with the first target cross mode decoding unit.
Further, the initial task processing model may further include other units besides the target unit, for example, the initial task processing model further includes an initial classifier, an input interface of the initial classifier is connected to an output interface of the first target cross-mode decoding unit, and an output interface of the initial classifier is used to output a predicted task processing result, so that in the process of training the target task processing model, the initial task processing model is trained based on the sample task information, the predicted task processing result, and the sample task result label.
In some optional embodiments of the present application, after the target task processing model corresponding to the category of answering questions according to image-text information is obtained by training: for the target task of answering the question "Who is wearing glasses?" according to the image and text contained in the teletext information 1 in fig. 3d, the teletext information 1 can be used as the target task input information of the target task processing model, and the target task processing result, namely "man", can be obtained; for the target task of answering the question "Who is wearing glasses?" according to the image and text contained in the teletext information 2 in fig. 3d, the teletext information 2 can be used as the target task input information of the target task processing model, and the target task processing result, namely "woman", can be obtained.
In some optional embodiments of the present application, if the target task category is determining whether a text correctly describes an image pair, both image analysis and text analysis are involved. Therefore, determining the target unit corresponding to the target task category from the plurality of candidate units may include: the preprocessing unit, the first target vector generating unit, the first target cross mode encoding unit and the first target cross mode decoding unit.
In an initial task processing model constructed based on the target unit, the preprocessing unit is connected with the first target vector generating unit, the first target vector generating unit is connected with the first target cross mode encoding unit, and the first target cross mode encoding unit is connected with the first target cross mode decoding unit.
Further, the initial task processing model may further include other units besides the target unit, for example, the initial task processing model further includes an initial classifier, an input interface of the initial classifier is connected to an output interface of the first target cross-mode decoding unit, and an output interface of the initial classifier is used to output a predicted task processing result, so that in the process of training the target task processing model, the initial task processing model is trained based on the sample task information, the predicted task processing result, and the sample task result label.
In some optional embodiments of the present application, after the target task processing model corresponding to the category of judging whether a text correctly describes an image pair is obtained by training: for the target task of judging whether the text in the teletext information 3 correctly describes the image pair in the teletext information 3, the teletext information 3 can be used as the target task input information of the target task processing model, and the target task processing result, namely "true", can be obtained; for the target task of judging whether the text in the teletext information 4 correctly describes the image pair in the teletext information 4, the teletext information 4 can be used as the target task input information of the target task processing model, and the target task processing result, namely "false", can be obtained.
In some optional embodiments of the present application, if the target task category is giving an image and a text description and determining whether the relationship between the image and the text is entailment, contradiction or neutral, both image analysis and text analysis are involved. Therefore, determining the target unit corresponding to the target task category from the plurality of candidate units may include: the preprocessing unit, the first target vector generating unit, the first target cross mode encoding unit and the first target cross mode decoding unit.
In an initial task processing model constructed based on the target unit, the preprocessing unit is connected with the first target vector generating unit, the first target vector generating unit is connected with the first target cross mode encoding unit, and the first target cross mode encoding unit is connected with the first target cross mode decoding unit.
Further, the initial task processing model may further include other units besides the aforementioned target unit, for example, the initial task processing model further includes an initial classifier, an input interface of the initial classifier is connected to an output interface of the first target cross-modal decoding unit, and an output interface of the initial classifier is used to output a predicted task processing result, so that in the process of training the target task processing model, the initial task processing model is trained based on the sample task information, the predicted task processing result, and the sample task result label.
In some optional embodiments of the present application, after the target task processing model corresponding to the category of giving an image and a text description and determining whether their relationship is entailment, contradiction or neutral is obtained by training:
aiming at the target task: giving the premise image in fig. 3f, together with the text description 1: "Two wooman are holding packages", and whether the relation between the precondition image and the text description 1 is inclusion, contradiction or neutral is judged, so that the precondition image and the text description 1 can be used as target task input information of a target task processing model, and a target task processing result, namely inclusion, can be obtained.
For the target task of giving the premise image in fig. 3f together with the text description 2, "The sisters are hugging goodbye while holding to go packages after just eating lunch.", and judging whether the relation between the premise image and the text description 2 is entailment, contradiction or neutral, the premise image and the text description 2 can be used as the target task input information of the target task processing model, and the target task processing result, namely neutral, can be obtained.
For the target task of giving the premise image in fig. 3f together with the text description 3, "The men are fighting outside a deli.", and judging whether the relation between the premise image and the text description 3 is entailment, contradiction or neutral, the premise image and the text description 3 can be used as the target task input information of the target task processing model, and the target task processing result, namely contradiction, can be obtained.
In some optional embodiments of the present application, if the target task category is giving an image and outputting a text description of the image, only analysis of the image is involved on the input side. Therefore, determining the target unit corresponding to the target task category from the plurality of candidate units may include: the first image encoder in the preprocessing unit, the first target vector generating unit, the cross mode encoder in the first target cross mode encoding unit, and the first target cross mode decoding unit.
In the initial task processing model constructed based on the target unit, a first image encoder in the preprocessing unit is connected with a first target vector generating unit, the first target vector generating unit is connected with a cross mode encoder in the first target cross mode encoding unit, and the cross mode encoder in the first target cross mode encoding unit is connected with a first target cross mode decoding unit.
In some optional embodiments of the present application, after the target task processing model corresponding to the category of giving an image and outputting a text description of the image is obtained by training: for the target task of giving the image in fig. 3g and outputting the text description corresponding to the image, the image can be used as the target task input information of the target task processing model, and the target task processing result, "a seabird walks on the bank", can be obtained.
In some optional embodiments of the present application, if the target task category is a given text description, when outputting the image corresponding to the text description, only analysis of the text is involved, and therefore, determining the target unit corresponding to the target task category from the plurality of candidate units may include: a table look-up unit in the preprocessing unit, a first target vector generating unit, a cross mode encoder in the first target cross mode encoding unit, and a first target cross mode decoding unit.
In the initial task processing model constructed based on the target unit, a table look-up unit in the preprocessing unit is connected with a first target vector generating unit, the first target vector generating unit is connected with a cross mode encoder in a first target cross mode encoding unit, and the cross mode encoder in the first target cross mode encoding unit is connected with a first target cross mode decoding unit.
In some optional embodiments of the present application, after the target task processing model corresponding to the category of giving a text description and outputting the corresponding image is obtained by training: for the target task of giving the text in fig. 3h, "a baseball player holding a bat next to a base", and outputting the image corresponding to the text, this text can be used as the target task input information of the target task processing model, and the image in fig. 3h can be obtained.
For the task of giving a text description and outputting the corresponding image, fig. 3i compares the target task processing results of the target task processing model trained by the model training method of the present application with those of DALLE and OFA in the related art. As can be seen from fig. 3i, the images generated by the scheme of the present application are of higher quality, more realistic and more accurate.
Fig. 4 is a flowchart illustrating a task processing method according to an exemplary embodiment of the present application, where the method includes the following steps S401 to S403:
S401, acquiring target task information of a target task category, wherein the target task information comprises target task input information;
S402, determining a corresponding target task processing model according to the target task category;
S403, inputting the target task input information into the target task processing model to obtain a target task processing result corresponding to the target task category and the target task input information;
the target task processing model is obtained through the model training method.
For the correspondence between the target task category and the target task input information, reference may be made to the embodiment corresponding to fig. 3a, which is not described herein again.
Fig. 5 is a schematic structural diagram of a data processing apparatus according to an exemplary embodiment of the present application;
wherein, the device includes:
the acquiring unit 51 is configured to acquire to-be-processed image-text characteristic information, where the to-be-processed image-text characteristic information includes to-be-processed text characteristic information and to-be-processed image characteristic information; the text characteristic information to be processed or the image characteristic information to be processed comprises mask identification, and the text characteristic information to be processed is matched with the image characteristic information to be processed;
a generating unit 52, configured to generate a first vector information group corresponding to the to-be-processed text feature information and a second vector information group corresponding to the to-be-processed image feature information based on an initial vector generation rule;
the encoding unit 53 is configured to encode the first vector information group and the second vector information group according to an initial encoding rule to obtain a corresponding fused vector information group; wherein the fused vector information set comprises a plurality of fused vector information, each fused vector information being associated with the first vector information set and the second vector information set;
a decoding unit 54, configured to decode the fused vector information group according to an initial decoding rule, to obtain a prediction result corresponding to the mask identifier;
a determining unit 55, configured to train the initial pre-training model based on the prediction result and the mask identifier to obtain a target pre-training model, where the target pre-training model is used to train a target task processing model corresponding to the target task category according to the obtained target task category.
According to one or more embodiments of the present application, the apparatus is further configured to:
acquiring sample image-text information, wherein the sample image-text information comprises sample text information and sample image information, and the sample text information is matched with the sample image information;
determining the mark information of each character in the sample text information in a preset word bank according to the sample text information and the preset word bank to obtain a mark information group corresponding to the sample text information;
coding the sample image information according to a first preset coding rule to obtain an initial vector information group corresponding to the sample image information;
and performing mask processing on part of the marking information in the marking information group or part of the initial vector information in the initial vector information group based on a preset mask rule to obtain the image-text characteristic information to be processed.
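A minimal sketch of such a preset mask rule follows; the mask identifier value and mask ratio are illustrative assumptions only:

import random

MASK_ID = -1  # hypothetical mask identifier

def mask_partial(info_group, mask_ratio=0.15):
    # Replace part of the marking information (or of the initial vector
    # indices) with the mask identifier, and record the masked content
    # together with its first position information as target information.
    masked = list(info_group)
    targets = {}
    num_masked = max(1, int(len(info_group) * mask_ratio))
    for pos in random.sample(range(len(info_group)), num_masked):
        targets[pos] = masked[pos]
        masked[pos] = MASK_ID
    return masked, targets

masked_group, target_info = mask_partial([101, 2054, 2003, 1996, 3007, 102])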
According to one or more embodiments of the present application, when the apparatus is configured to train the initial pre-training model based on the prediction result and the mask identifier to obtain a target pre-training model, the apparatus is specifically configured to:
acquiring target information corresponding to the mask identification;
determining a similarity value of the prediction result and the target information;
if the similarity value is smaller than a preset similarity value, taking the initial pre-training model as a target pre-training model;
if the similarity value is not smaller than the preset similarity value, updating model parameters in the initial pre-training model based on the similarity value to obtain an initial pre-training model with updated model parameters, and returning to the step of generating, based on an initial vector generation rule, a first vector information group corresponding to the feature information of the text to be processed and a second vector information group corresponding to the feature information of the image to be processed, until the similarity value is smaller than the preset similarity value, so as to obtain a target pre-training model;
wherein the model parameters in the initial pre-training model include: at least one of a parameter in the initial vector generation rule, a parameter in the initial encoding rule, and a parameter in the initial decoding rule.
According to one or more embodiments of the present application, the apparatus is further configured to:
encoding the sample image information according to a second preset encoding rule to obtain an encoding value group corresponding to the sample image information, wherein the encoding value group comprises a plurality of encoding values, and the number of the encoding values in the encoding value group is the same as the number of the initial vector information in the initial vector information group;
and determining target information corresponding to the mask identification according to the coding value group.
According to one or more embodiments of the present application, when the apparatus is configured to determine, according to the encoded value group, target information corresponding to the mask identifier, the apparatus is specifically configured to:
acquiring first position information of masked contents corresponding to the mask identification in a mask object;
selecting a target encoding numerical value corresponding to the first position information from the encoding numerical value group;
taking the target coded value as the target information;
if the image-text characteristic information to be processed is obtained by performing mask processing on part of initial vector information in the initial vector information group based on a preset mask rule, the masked content is the part of initial vector information, and the mask object is the initial vector information group.
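In code, selecting the target information by the first position information is a simple index lookup; the code values below reuse the fig. 2c example and are illustrative only:

def target_info_for_mask(code_value_group, first_position):
    # Select the target encoding value at the masked content's first
    # position information within the mask object.
    return code_value_group[first_position]

code_value_group = [321, 999, 888, 777]
print(target_info_for_mask(code_value_group, 2))  # 888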
According to one or more embodiments of the present application, when the apparatus is configured to train an initial pre-training model based on the prediction result and the mask identifier to obtain a target pre-training model, the apparatus is specifically configured to:
determining whether the number of updates of the model parameters in the initial pre-training model exceeds a preset number of times; if so, taking the initial pre-training model as the target pre-training model;
if not, acquiring target information corresponding to the mask identification; determining corresponding loss information according to the prediction result, the target information and a preset loss function; updating the model parameters in the initial pre-training model according to the loss information to obtain an initial pre-training model with updated model parameters, and returning to the step of generating, based on an initial vector generation rule, a first vector information group corresponding to the feature information of the text to be processed and a second vector information group corresponding to the feature information of the image to be processed, until the number of updates exceeds the preset number of times, whereupon the initial pre-training model is taken as the target pre-training model;
wherein the model parameters in the initial pre-training model comprise: at least one of a parameter in the initial vector generation rule, a parameter in the initial encoding rule, and a parameter in the initial decoding rule.
Fig. 6 is a schematic structural diagram of a model training apparatus according to an exemplary embodiment of the present application, where the apparatus includes:
the acquiring unit 61 is configured to acquire a target task category and sample task information corresponding to the target task category, where the sample task information includes sample task input information and a sample task result label corresponding to the sample task input information; and is configured to obtain a plurality of candidate units in the target pre-training model, the plurality of candidate units comprising: a preprocessing unit, a first target vector generating unit, a first target cross mode encoding unit and a first target cross mode decoding unit;
a determining unit 62, configured to determine, according to a preset correspondence, a target unit corresponding to the target task category from the multiple candidate units;
a constructing unit 63, configured to construct, based on the target unit, an initial task processing model corresponding to the target task category;
a training unit 64, configured to train the initial task processing model by using the sample task information to obtain a target task processing model for completing a target task corresponding to the target task category;
the target pre-training model is obtained through training of an initial pre-training model in the data processing method.
FIG. 7 is a schematic structural diagram of a task processing device according to an exemplary embodiment of the present application; wherein, the device includes:
an obtaining unit 71, configured to obtain target task information of a target task category, where the target task information includes target task input information;
a determining unit 72, configured to determine a corresponding target task processing model according to the target task category;
an input unit 73, configured to input the target task input information into the target task processing model, and obtain a target task processing result corresponding to the target task category and the target task input information;
the target task processing model is obtained through the model training method.
It is to be understood that apparatus embodiments and method embodiments may correspond to one another and that similar descriptions may refer to method embodiments. To avoid repetition, further description is omitted here. Specifically, the apparatus may perform the method embodiment, and the foregoing and other operations and/or functions of each module in the apparatus are respectively corresponding flows in each method in the method embodiment, and for brevity, are not described again here.
The apparatus of the embodiments of the present application is described above in connection with the drawings from the perspective of functional modules. It should be understood that the functional modules may be implemented by hardware, by instructions in software, or by a combination of hardware and software modules. Specifically, the steps of the method embodiments in the present application may be implemented by integrated logic circuits of hardware in a processor and/or instructions in the form of software, and the steps of the method disclosed in conjunction with the embodiments in the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. Alternatively, the software modules may be located in random access memory, flash memory, read only memory, programmable read only memory, electrically erasable programmable memory, registers, and the like, as is well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps in the above method embodiments in combination with hardware thereof.
Fig. 8 is a schematic block diagram of an electronic device provided in an embodiment of the present application, where the electronic device may include:
a memory 801 and a processor 802, the memory 801 being adapted to store a computer program and to transfer the program code to the processor 802. In other words, the processor 802 may call and run a computer program from the memory 801 to implement the method in the embodiment of the present application.
For example, the processor 802 may be configured to perform the above-described method embodiments in accordance with instructions in the computer program.
In some embodiments of the present application, the processor 802 may include, but is not limited to:
general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like.
In some embodiments of the present application, the memory 801 includes, but is not limited to:
volatile memory and/or non-volatile memory. The non-volatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash Memory. Volatile Memory can be Random Access Memory (RAM), which acts as external cache Memory. By way of example, but not limitation, many forms of RAM are available, such as Static random access memory (Static RAM, SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic random access memory (Synchronous DRAM, SDRAM), Double Data Rate Synchronous Dynamic random access memory (DDR SDRAM), Enhanced Synchronous SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
In some embodiments of the present application, the computer program may be partitioned into one or more modules that are stored in the memory 801 and executed by the processor 802 to perform the methods provided herein. The one or more modules may be a series of computer program instruction segments capable of performing certain functions, the instruction segments describing the execution of the computer program in the electronic device.
As shown in fig. 8, the electronic device may further include:
a transceiver 803, the transceiver 803 being connectable to the processor 802 or the memory 801.
The processor 802 may control the transceiver 803 to communicate with other devices, and specifically, may transmit information or data to the other devices or receive information or data transmitted by the other devices. The transceiver 803 may include a transmitter and a receiver. The transceiver 803 may further include an antenna, and the number of antennas may be one or more.
It should be understood that the various components in the electronic device are connected by a bus system that includes a power bus, a control bus, and a status signal bus in addition to a data bus.
The present application also provides a computer storage medium having stored thereon a computer program which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. Alternatively, the present application also provides a computer program product containing instructions, which when executed by a computer, cause the computer to execute the method of the above method embodiment.
When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions described in accordance with the embodiments of the present application occur, in whole or in part, when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that includes one or more of the available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a Digital Video Disk (DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
According to one or more embodiments of the present application, there is provided a data processing method including:
acquiring to-be-processed image-text characteristic information, wherein the to-be-processed image-text characteristic information comprises to-be-processed text characteristic information and to-be-processed image characteristic information; the text characteristic information to be processed or the image characteristic information to be processed comprises mask identification, and the text characteristic information to be processed is matched with the image characteristic information to be processed;
generating a first vector information group corresponding to the feature information of the text to be processed and a second vector information group corresponding to the feature information of the image to be processed based on an initial vector generation rule;
coding the first vector information group and the second vector information group through an initial coding rule to obtain a corresponding fusion vector information group; wherein the fused vector information set comprises a plurality of fused vector information, each fused vector information being associated with the first vector information set and the second vector information set;
decoding the fusion vector information group through an initial decoding rule to obtain a prediction result corresponding to the mask identification;
and training the initial pre-training model based on the prediction result and the mask identification to obtain a target pre-training model, wherein the target pre-training model is used for training a target task processing model corresponding to the target task category according to the obtained target task category.
According to one or more embodiments of the present application, the method further comprises:
acquiring sample image-text information, wherein the sample image-text information comprises sample text information and sample image information, and the sample text information is matched with the sample image information;
determining the mark information of each character in the sample text information in a preset word bank according to the sample text information and the preset word bank to obtain a mark information group corresponding to the sample text information;
coding the sample image information according to a first preset coding rule to obtain an initial vector information group corresponding to the sample image information;
and performing mask processing on part of the marking information in the marking information group or part of the initial vector information in the initial vector information group based on a preset mask rule to obtain the image-text characteristic information to be processed.
According to one or more embodiments of the present application, training the initial pre-training model based on the prediction result and the mask identifier to obtain a target pre-training model, includes:
acquiring target information corresponding to the mask identification;
determining a similarity value of the prediction result and the target information;
if the similarity value is smaller than a preset similarity value, taking the initial pre-training model as a target pre-training model;
if the similarity value is not smaller than the preset similarity value, updating model parameters in the initial pre-training model based on the similarity value to obtain an initial pre-training model with updated model parameters, and returning to the step of generating, based on an initial vector generation rule, a first vector information group corresponding to the feature information of the text to be processed and a second vector information group corresponding to the feature information of the image to be processed, until the similarity value is smaller than the preset similarity value, so as to obtain a target pre-training model;
wherein the model parameters in the initial pre-training model comprise: at least one of a parameter in the initial vector generation rule, a parameter in the initial encoding rule, and a parameter in the initial decoding rule.
According to one or more embodiments of the present application, the method further comprises:
encoding the sample image information according to a second preset encoding rule to obtain an encoding value group corresponding to the sample image information, wherein the encoding value group comprises a plurality of encoding values, and the number of the encoding values in the encoding value group is the same as the number of the initial vector information in the initial vector information group;
and determining target information corresponding to the mask identification according to the encoding value group.
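One plausible realization of the second preset encoding rule, in the spirit of discrete visual tokenizers (e.g. a learned codebook), is sketched below; the codebook itself and nearest-neighbour quantization are assumptions, chosen because they naturally yield one encoding value per initial vector.

```python
import torch

def second_encoding_rule(patch_vecs, codebook):
    """patch_vecs: (num_patches, d); codebook: (codebook_size, d).
    Returns one discrete encoding value per patch, so the encoding value
    group matches the initial vector information group in count."""
    dists = torch.cdist(patch_vecs, codebook)  # pairwise L2 distances
    return dists.argmin(dim=1)                 # nearest-code index per patch
```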
According to one or more embodiments of the present application, determining the target information corresponding to the mask identifier according to the encoding value group includes:
acquiring first position information of masked contents corresponding to the mask identification in a mask object;
selecting a target encoding value corresponding to the first position information from the encoding value group;
taking the target encoding value as the target information;
if the image-text characteristic information to be processed is obtained by performing mask processing on part of initial vector information in the initial vector information group based on a preset mask rule, the masked content is the part of initial vector information, and the mask object is the initial vector information group.
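The position-based lookup itself reduces to indexing; a minimal sketch, with hypothetical argument names:

```python
def target_for_image_mask(encoding_value_group, first_position_info):
    """encoding_value_group: one encoding value per initial vector;
    first_position_info: indices of the masked initial vectors.
    Returns the target encoding values used as the target information."""
    return encoding_value_group[first_position_info]
```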
According to one or more embodiments of the present application, training an initial pre-training model based on the prediction result and the mask identifier to obtain a target pre-training model includes:
determining whether the number of updates of the model parameters in the initial pre-training model is greater than a preset number, and if so, taking the initial pre-training model as a target pre-training model;
if not, acquiring target information corresponding to the mask identification; determining corresponding loss information according to the prediction result, the target information and a preset loss function; updating the model parameters in the initial pre-training model according to the loss information to obtain an initial pre-training model with updated model parameters; and returning to the step of generating, based on the initial vector generation rule, a first vector information group corresponding to the feature information of the text to be processed and a second vector information group corresponding to the feature information of the image to be processed, wherein the initial pre-training model is taken as the target pre-training model once the number of updates exceeds the preset number;
wherein the model parameters in the initial pre-training model comprise: at least one of a parameter in the initial vector generation rule, a parameter in the initial encoding rule, and a parameter in the initial decoding rule.
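A hedged sketch of this update-count-bounded variant follows; cross-entropy stands in for the unspecified preset loss function, `target_lookup` is a hypothetical helper returning a class index per sample, and the budget of 10000 updates is an invented example of the preset number.

```python
import torch.nn.functional as F

def train_with_update_budget(model, optimizer, batches, target_lookup,
                             preset_times=10000):
    """Update until the number of parameter updates exceeds the preset
    number, then keep the current model as the target pre-training model."""
    updates = 0
    for text_ids, image_vecs, mask_pos in batches:
        if updates > preset_times:
            break                                        # budget exhausted
        pred = model(text_ids, image_vecs)[:, mask_pos]  # prediction at the mask
        target = target_lookup(mask_pos)                 # target information id
        loss = F.cross_entropy(pred, target)             # assumed preset loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        updates += 1
    return model
```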
According to one or more embodiments of the present application, there is provided a model training method including:
acquiring a target task category and sample task information corresponding to the target task category, wherein the sample task information comprises sample task input information and a sample task result label corresponding to the sample task input information;
obtaining a plurality of candidate units in a target pre-training model, the plurality of candidate units comprising: a preprocessing unit, a first target vector generation unit, a first target cross-modal encoding unit and a first target cross-modal decoding unit;
determining a target unit corresponding to the target task category from the candidate units according to a preset correspondence;
constructing an initial task processing model corresponding to the target task category based on the target unit;
training the initial task processing model by using the sample task information to obtain a target task processing model for completing a target task corresponding to the target task category;
the target pre-training model is obtained through training of an initial pre-training model in the data processing method.
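The unit-selection step can be pictured as a table lookup followed by composition; in the sketch below, the category names, the unit names, and the assumption that the selected units compose sequentially are all placeholders for the preset correspondence, not the actual mapping used here.

```python
import torch.nn as nn

# Invented example of the preset correspondence between task categories
# and candidate units taken from the target pre-training model.
PRESET_CORRESPONDENCE = {
    "image_captioning": ("preprocess", "vector_gen", "encoder", "decoder"),
    "image_text_matching": ("preprocess", "vector_gen", "encoder"),
}

def build_initial_task_model(pretrained_units, task_category):
    """pretrained_units: name -> nn.Module from the target pre-training
    model. Returns an initial task processing model for the category."""
    names = PRESET_CORRESPONDENCE[task_category]
    return nn.Sequential(*(pretrained_units[name] for name in names))
```

The returned model would then be fine-tuned on the sample task information as described above.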
According to one or more embodiments of the present application, there is provided a task processing method including:
acquiring target task information of a target task category, wherein the target task information comprises target task input information;
determining a corresponding target task processing model according to the target task category;
inputting the target task input information into the target task processing model to obtain a target task processing result corresponding to the target task category and the target task input information;
the target task processing model is obtained through the model training method.
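At inference time the method amounts to routing by category; a minimal sketch with a hypothetical model registry:

```python
# Hypothetical registry mapping each task category to its fine-tuned
# target task processing model.
TASK_MODELS = {}

def process_task(task_category, task_input):
    """Determine the target task processing model for the category and
    return the corresponding task processing result."""
    model = TASK_MODELS[task_category]
    return model(task_input)
```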
According to one or more embodiments of the present application, there is provided a data processing apparatus including:
the acquisition unit is used for acquiring image-text feature information to be processed, wherein the image-text feature information to be processed comprises the text feature information to be processed and the image feature information to be processed; the text feature information to be processed or the image feature information to be processed comprises mask identification, and the text feature information to be processed is matched with the image feature information to be processed;
the generating unit is used for generating a first vector information group corresponding to the feature information of the text to be processed and a second vector information group corresponding to the feature information of the image to be processed based on an initial vector generation rule;
the encoding unit is used for encoding the first vector information group and the second vector information group through an initial encoding rule to obtain a corresponding fused vector information group, wherein the fused vector information group comprises a plurality of pieces of fused vector information, each piece of fused vector information being associated with both the first vector information group and the second vector information group;
the decoding unit is used for decoding the fused vector information group through an initial decoding rule to obtain a prediction result corresponding to the mask identification;
and the determining unit is used for training the initial pre-training model based on the prediction result and the mask identification to obtain a target pre-training model, and the target pre-training model is used for training a target task processing model corresponding to the target task category according to the obtained target task category.
According to one or more embodiments of the present application, the apparatus is further configured to:
acquiring sample image-text information, wherein the sample image-text information comprises sample text information and sample image information, and the sample text information is matched with the sample image information;
determining the mark information of each character in the sample text information in a preset word bank according to the sample text information and the preset word bank to obtain a mark information group corresponding to the sample text information;
coding the sample image information according to a first preset coding rule to obtain an initial vector information group corresponding to the sample image information;
and performing mask processing on part of the mark information in the mark information group or part of the initial vector information in the initial vector information group based on a preset mask rule to obtain the image-text characteristic information to be processed.
According to one or more embodiments of the present application, when the apparatus is configured to train the initial pre-training model based on the prediction result and the mask identifier to obtain a target pre-training model, the apparatus is specifically configured to:
acquiring target information corresponding to the mask identification;
determining a similarity value of the prediction result and the target information;
if the similarity value is smaller than a preset similarity value, taking the initial pre-training model as a target pre-training model;
if the similarity value is not smaller than the preset similarity value, updating model parameters in the initial pre-training model based on the similarity value to obtain an initial pre-training model with updated model parameters, and returning to the step of generating, based on the initial vector generation rule, a first vector information group corresponding to the feature information of the text to be processed and a second vector information group corresponding to the feature information of the image to be processed, until the similarity value is smaller than the preset similarity value, so as to obtain a target pre-training model;
wherein the model parameters in the initial pre-training model comprise: at least one of a parameter in the initial vector generation rule, a parameter in the initial encoding rule, and a parameter in the initial decoding rule.
According to one or more embodiments of the present application, the apparatus is further configured to:
encoding the sample image information according to a second preset encoding rule to obtain an encoding value group corresponding to the sample image information, wherein the encoding value group comprises a plurality of encoding values, and the number of the encoding values in the encoding value group is the same as the number of the initial vector information in the initial vector information group;
and determining target information corresponding to the mask identification according to the encoding value group.
According to one or more embodiments of the present application, when the apparatus is configured to determine, according to the encoded value group, target information corresponding to the mask identifier, the apparatus is specifically configured to:
acquiring first position information of masked contents corresponding to the mask identification in a mask object;
selecting a target encoding value corresponding to the first position information from the encoding value group;
taking the target encoding value as the target information;
if the image-text characteristic information to be processed is obtained by performing mask processing on part of initial vector information in the initial vector information group based on a preset mask rule, the masked content is the part of initial vector information, and the mask object is the initial vector information group.
According to one or more embodiments of the present application, when the apparatus is configured to train an initial pre-training model based on the prediction result and the mask identification to obtain a target pre-training model, the apparatus is specifically configured to:
determining whether the number of updates of the model parameters in the initial pre-training model is greater than a preset number, and if so, taking the initial pre-training model as a target pre-training model;
if not, acquiring target information corresponding to the mask identification; determining corresponding loss information according to the prediction result, the target information and a preset loss function; updating the model parameters in the initial pre-training model according to the loss information to obtain an initial pre-training model with updated model parameters; and returning to the step of generating, based on the initial vector generation rule, a first vector information group corresponding to the feature information of the text to be processed and a second vector information group corresponding to the feature information of the image to be processed, wherein the initial pre-training model is taken as the target pre-training model once the number of updates exceeds the preset number;
wherein the model parameters in the initial pre-training model comprise: at least one of a parameter in the initial vector generation rule, a parameter in the initial encoding rule, and a parameter in the initial decoding rule.
According to one or more embodiments of the present application, there is provided a model training apparatus including:
the acquisition unit is used for acquiring a target task category and sample task information corresponding to the target task category, wherein the sample task information comprises sample task input information and a sample task result label corresponding to the sample task input information; and is further used for obtaining a plurality of candidate units in the target pre-training model, the plurality of candidate units comprising: a preprocessing unit, a first target vector generation unit, a first target cross-modal encoding unit and a first target cross-modal decoding unit;
the determining unit is used for determining a target unit corresponding to the target task category from the candidate units according to a preset correspondence;
the construction unit is used for constructing an initial task processing model corresponding to the target task category based on the target unit;
the training unit is used for training the initial task processing model by using the sample task information to obtain a target task processing model for completing a target task corresponding to the target task category;
the target pre-training model is obtained through training of an initial pre-training model in the data processing method.
According to one or more embodiments of the present application, there is provided a task processing apparatus including:
the acquisition unit is used for acquiring target task information of a target task category, wherein the target task information comprises target task input information;
the determining unit is used for determining a corresponding target task processing model according to the target task category;
the input unit is used for inputting the target task input information into the target task processing model to obtain a target task processing result corresponding to the target task category and the target task input information;
the target task processing model is obtained through the model training method.
According to one or more embodiments of the present application, there is provided an electronic device including:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the aforementioned methods via execution of the executable instructions.
According to one or more embodiments of the present application, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the aforementioned methods.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules is merely a logical division, and other divisions may be adopted in practice; for instance, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, devices or modules, and may be electrical, mechanical or in another form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. For example, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto; any changes or substitutions that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (13)

1. A method of data processing, comprising:
acquiring to-be-processed image-text characteristic information, wherein the to-be-processed image-text characteristic information comprises to-be-processed text characteristic information and to-be-processed image characteristic information; the text characteristic information to be processed or the image characteristic information to be processed comprises mask identification, and the text characteristic information to be processed is matched with the image characteristic information to be processed;
generating a first vector information group corresponding to the feature information of the text to be processed and a second vector information group corresponding to the feature information of the image to be processed based on an initial vector generation rule;
encoding the first vector information group and the second vector information group through an initial encoding rule to obtain a corresponding fused vector information group, wherein the fused vector information group comprises a plurality of pieces of fused vector information, each piece of fused vector information being associated with both the first vector information group and the second vector information group;
decoding the fused vector information group through an initial decoding rule to obtain a prediction result corresponding to the mask identification;
and training an initial pre-training model based on the prediction result and the mask identification to obtain a target pre-training model, wherein the target pre-training model is used for training, according to an obtained target task category, a target task processing model corresponding to that target task category.
2. The method of claim 1, further comprising:
acquiring sample image-text information, wherein the sample image-text information comprises sample text information and sample image information, and the sample text information is matched with the sample image information;
determining the mark information of each character in the sample text information in a preset word bank according to the sample text information and the preset word bank to obtain a mark information group corresponding to the sample text information;
coding the sample image information according to a first preset coding rule to obtain an initial vector information group corresponding to the sample image information;
and performing mask processing on part of the mark information in the mark information group or part of the initial vector information in the initial vector information group based on a preset mask rule to obtain the image-text characteristic information to be processed.
3. The method of claim 2, wherein training the initial pre-training model based on the prediction result and the mask identification to obtain a target pre-training model comprises:
acquiring target information corresponding to the mask identification;
determining a similarity value of the prediction result and the target information;
if the similarity value is smaller than a preset similarity value, taking the initial pre-training model as a target pre-training model;
if the similarity value is not smaller than the preset similarity value, updating model parameters in the initial pre-training model based on the similarity value to obtain an initial pre-training model with updated model parameters, and returning to the step of generating, based on the initial vector generation rule, a first vector information group corresponding to the feature information of the text to be processed and a second vector information group corresponding to the feature information of the image to be processed, until the similarity value is smaller than the preset similarity value, so as to obtain a target pre-training model;
wherein the model parameters in the initial pre-training model comprise: at least one of a parameter in the initial vector generation rule, a parameter in the initial encoding rule, and a parameter in the initial decoding rule.
4. The method of claim 3, further comprising:
encoding the sample image information according to a second preset encoding rule to obtain an encoding value group corresponding to the sample image information, wherein the encoding value group comprises a plurality of encoding values, and the number of the encoding values in the encoding value group is the same as the number of the initial vector information in the initial vector information group;
and determining target information corresponding to the mask identification according to the encoding value group.
5. The method of claim 4, wherein determining the target information corresponding to the mask identification according to the encoded value set comprises:
acquiring first position information of masked contents corresponding to the mask identification in a mask object;
selecting a target encoding value corresponding to the first position information from the encoding value group;
taking the target encoding value as the target information;
if the image-text characteristic information to be processed is obtained by performing mask processing on part of initial vector information in the initial vector information group based on a preset mask rule, the masked content is the part of initial vector information, and the mask object is the initial vector information group.
6. The method of claim 2, wherein training an initial pre-training model based on the prediction result and the mask identification to obtain a target pre-training model comprises:
determining whether the number of updates of the model parameters in the initial pre-training model is greater than a preset number, and if so, taking the initial pre-training model as a target pre-training model;
if not, acquiring target information corresponding to the mask identification; determining corresponding loss information according to the prediction result, the target information and a preset loss function; updating the model parameters in the initial pre-training model according to the loss information to obtain an initial pre-training model with updated model parameters; and returning to the step of generating, based on the initial vector generation rule, a first vector information group corresponding to the feature information of the text to be processed and a second vector information group corresponding to the feature information of the image to be processed, wherein the initial pre-training model is taken as the target pre-training model once the number of updates exceeds the preset number;
wherein the model parameters in the initial pre-training model comprise: at least one of a parameter in the initial vector generation rule, a parameter in the initial encoding rule, and a parameter in the initial decoding rule.
7. A method of model training, comprising:
acquiring a target task category and sample task information corresponding to the target task category, wherein the sample task information comprises sample task input information and a sample task result label corresponding to the sample task input information;
obtaining a plurality of candidate units in a target pre-training model, the plurality of candidate units comprising: a preprocessing unit, a first target vector generation unit, a first target cross-modal encoding unit and a first target cross-modal decoding unit;
determining a target unit corresponding to the target task category from the candidate units according to a preset correspondence;
constructing an initial task processing model corresponding to the target task category based on the target unit;
training the initial task processing model by using the sample task information to obtain a target task processing model for completing a target task corresponding to the target task category;
wherein the target pre-training model is trained by an initial pre-training model in the data processing method of any one of claims 1 to 6.
8. A task processing method, comprising:
acquiring target task information of a target task category, wherein the target task information comprises target task input information;
determining a corresponding target task processing model according to the target task category;
inputting the target task input information into the target task processing model to obtain a target task processing result corresponding to the target task category and the target task input information;
wherein the target task processing model is trained by the model training method as claimed in claim 7.
9. A data processing apparatus, comprising:
the acquisition unit is used for acquiring the image-text characteristic information to be processed, wherein the image-text characteristic information to be processed comprises the text characteristic information to be processed and the image characteristic information to be processed; the text characteristic information to be processed or the image characteristic information to be processed comprises mask identification, and the text characteristic information to be processed is matched with the image characteristic information to be processed;
the generating unit is used for generating a first vector information group corresponding to the feature information of the text to be processed and a second vector information group corresponding to the feature information of the image to be processed based on an initial vector generating rule;
the encoding unit is used for encoding the first vector information group and the second vector information group through an initial encoding rule to obtain a corresponding fused vector information group, wherein the fused vector information group comprises a plurality of pieces of fused vector information, each piece of fused vector information being associated with both the first vector information group and the second vector information group;
the decoding unit is used for decoding the fused vector information group through an initial decoding rule to obtain a prediction result corresponding to the mask identification;
and the determining unit is used for training the initial pre-training model based on the prediction result and the mask identification to obtain a target pre-training model, and the target pre-training model is used for training a target task processing model corresponding to the target task category according to the obtained target task category.
10. A model training apparatus, comprising:
the acquisition unit is used for acquiring a target task category and sample task information corresponding to the target task category, wherein the sample task information comprises sample task input information and a sample task result label corresponding to the sample task input information; and is further used for obtaining a plurality of candidate units in the target pre-training model, the plurality of candidate units comprising: a preprocessing unit, a first target vector generation unit, a first target cross-modal encoding unit and a first target cross-modal decoding unit;
the determining unit is used for determining a target unit corresponding to the target task category from the candidate units according to a preset correspondence;
the construction unit is used for constructing an initial task processing model corresponding to the target task category based on the target unit;
the training unit is used for training the initial task processing model by using the sample task information to obtain a target task processing model for completing a target task corresponding to the target task category;
wherein the target pre-training model is obtained by training an initial pre-training model in the data processing method according to any one of claims 1 to 6.
11. A task processing apparatus, comprising:
the acquisition unit is used for acquiring target task information of a target task category, wherein the target task information comprises target task input information;
the determining unit is used for determining a corresponding target task processing model according to the target task category;
the input unit is used for inputting the target task input information into the target task processing model to obtain a target task processing result corresponding to the target task category and the target task input information;
wherein the target task processing model is trained by the model training method of claim 7.
12. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the data processing method of any one of claims 1-6, or the model training method of claim 7, or the task processing method of claim 8, via execution of the executable instructions.
13. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the data processing method of any one of claims 1 to 6, or the model training method of claim 7, or the task processing method of claim 8.
CN202210671652.2A 2022-06-14 2022-06-14 Data processing method, device, equipment and computer medium Pending CN114972823A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210671652.2A CN114972823A (en) 2022-06-14 2022-06-14 Data processing method, device, equipment and computer medium
PCT/CN2023/098690 WO2023241410A1 (en) 2022-06-14 2023-06-06 Data processing method and apparatus, and device and computer medium

Publications (1)

Publication Number Publication Date
CN114972823A (en) 2022-08-30

Family

ID=82963036

Country Status (2)

Country Link
CN (1) CN114972823A (en)
WO (1) WO2023241410A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117726990B (en) * 2023-12-27 2024-05-03 浙江恒逸石化有限公司 Method and device for detecting spinning workshop, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11281863B2 (en) * 2019-04-18 2022-03-22 Salesforce.Com, Inc. Systems and methods for unifying question answering and text classification via span extraction
CN113792113A (en) * 2020-07-31 2021-12-14 北京京东尚科信息技术有限公司 Visual language model obtaining and task processing method, device, equipment and medium
CN114372414A (en) * 2022-01-06 2022-04-19 腾讯科技(深圳)有限公司 Multi-modal model construction method and device and computer equipment
CN114549935B (en) * 2022-02-25 2024-05-07 北京百度网讯科技有限公司 Information generation method and device
CN114972823A (en) * 2022-06-14 2022-08-30 北京有竹居网络技术有限公司 Data processing method, device, equipment and computer medium

Also Published As

Publication number Publication date
WO2023241410A1 (en) 2023-12-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination