CN116861995A - Training method for multi-modal pre-training model, and multi-modal data processing method and device


Info

Publication number
CN116861995A
CN116861995A
Authority
CN
China
Prior art keywords
training
target
module
features
feature
Prior art date
Legal status
Pending
Application number
CN202310841914.XA
Other languages
Chinese (zh)
Inventor
齐心悦
宋阳
Current Assignee
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd
Priority to CN202310841914.XA
Publication of CN116861995A
Legal status: Pending


Classifications

    • G06N 3/098 — Computing arrangements based on biological models; neural networks; learning methods; distributed learning, e.g. federated learning
    • G06F 18/2148 — Pattern recognition; generating training patterns; bootstrap methods characterised by the process organisation or structure, e.g. boosting cascade
    • G06F 18/253 — Pattern recognition; fusion techniques of extracted features
    • G06N 3/0455 — Neural networks; architecture; auto-encoder networks; encoder-decoder networks
    • G06N 3/084 — Neural networks; learning methods; backpropagation, e.g. using gradient descent
    • G06N 3/096 — Neural networks; learning methods; transfer learning


Abstract

The disclosure provides a training method for a multi-modal pre-training model, and a multi-modal data processing method and device, and relates to the field of computer technology. The method comprises the following steps: extracting features from a target training text through a language characterization module to obtain target text features; extracting features from a target training image through a visual characterization module to obtain target image features; performing feature alignment and feature fusion on the target text features and the target image features through a multi-modal alignment module to obtain fusion features; and making predictions on the fusion features through a first pre-training task and a second pre-training task, and determining training loss information from the two prediction results so as to train the language characterization module, the visual characterization module and the multi-modal alignment module. The method and device can alleviate, to a certain extent, the problems in the related art of low pre-training model accuracy and the large computational resources and long training time required by subsequent training, which are caused by the lack of cross-modal information learning.

Description

Training method for multi-modal pre-training model, and multi-modal data processing method and device
Technical Field
The disclosure relates to the field of computer technology, and in particular to a training method and apparatus for a multi-modal pre-training model, an electronic device and a storage medium.
Background
With the continued development of internet platforms, multi-modal prediction tasks based on images and text have emerged, such as image-text retrieval, image-text question answering, and visual reasoning. Currently, a pre-training model is usually obtained by training a visual encoder and a language encoder separately, and further training is performed on the basis of this pre-training model to obtain a prediction model. In this approach, no cross-modal information learning is performed during the pre-training stage, which reduces the accuracy of the resulting pre-training model and makes the computational resources and training time required by subsequent training relatively large.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The embodiments of the disclosure aim to provide a training method and apparatus for a multi-modal pre-training model, an electronic device and a storage medium, so as to alleviate, to a certain extent, the problems in the related art of low pre-training model accuracy and the large computational resources and long training time required by subsequent training, which are caused by the lack of cross-modal information learning.
According to a first aspect of the present disclosure, there is provided a training method for a multi-modal pre-training model, comprising: extracting features from a target training text through a language characterization module to obtain target text features; extracting features from a target training image through a visual characterization module to obtain target image features, the target training text and the target training image being obtained by performing content masking on an original training sample; performing feature alignment and feature fusion on the target text features and the target image features through a multi-modal alignment module to obtain fusion features; predicting the masked content from the fusion features through a first pre-training task to obtain a first prediction result; performing image-text matching prediction on the fusion features through a second pre-training task to obtain a second prediction result; determining training loss information according to the first prediction result and the second prediction result; and training the language characterization module, the visual characterization module and the multi-modal alignment module according to the training loss information to obtain a multi-modal pre-training model.
Optionally, extracting features from the target training text includes: performing word segmentation on the target training text; mapping each word segmentation result into a corresponding symbol sequence based on a pre-training dictionary; encoding each symbol sequence to obtain corresponding word embedding vectors; and performing modal embedding on each word embedding vector according to a first modal matrix to obtain the target text features.
Optionally, extracting features from the target training image includes: extracting features from the target training image through the visual characterization module to obtain first features, the visual characterization module being a model with initialized network parameters; adjusting the number of channels of the first features to obtain second features; and performing modal embedding on each second feature according to a second modal matrix to obtain the target image features.
Optionally, the visual characterization module includes N convolution layers, and performing feature extraction on the target training image includes: for the first convolution layer, performing nonlinear activation processing and convolution processing on the input target training image to obtain an output first convolution feature; and for the 2nd to N-th convolution layers, performing nonlinear activation processing and convolution processing on the outputs of the convolution layers preceding the current convolution layer, to obtain the first features output by the N-th convolution layer, where N is a positive integer.
Optionally, performing feature alignment and feature fusion on the target text features and the target image features includes: splicing the target text features and the target image features; and performing attention-based weighting processing and fully connected processing on the splicing result and a position embedding vector through a pre-trained coding model to obtain the fusion features.
Optionally, predicting the masked content from the fusion features includes: determining target fusion features from the fusion features according to the masking positions of the masked content; and determining the probability of each target fusion feature over each preset word category to obtain the first prediction result.
Optionally, determining the training loss information according to the first prediction result and the second prediction result includes: determining first loss information according to the first prediction result and the masked content corresponding to the masking positions; determining second loss information according to the cross-entropy loss between the second prediction result and an image-text matching label; and weighting the first loss information and the second loss information based on a loss weight to determine the training loss information.
Optionally, training the language characterization module, the visual characterization module and the multi-modal alignment module according to the training loss information includes: adjusting the model parameters of the language characterization module, the visual characterization module, the multi-modal alignment module, the first pre-training task and the second pre-training task, as well as the loss weight, with the objective of minimizing the training loss information; and determining the multi-modal pre-training model according to the trained language characterization module, visual characterization module and multi-modal alignment module.
According to a second aspect of the present disclosure, there is provided a multi-modal data processing method, comprising: acquiring multi-modal data to be processed; and inputting the multi-modal data to be processed into a multi-modal data processing model for data processing, wherein the multi-modal data processing model is obtained by performing task migration training based on the multi-modal pre-training model obtained in any of the above embodiments.
According to a third aspect of the present disclosure, there is provided a training apparatus for a multi-modal pre-training model, the apparatus comprising: a feature extraction module configured to perform feature extraction on a target training text to obtain target text features, and to perform feature extraction on a target training image to obtain target image features, the target training text and the target training image being obtained by performing content masking on an original training sample; a multi-modal alignment module configured to perform feature alignment and feature fusion on the target text features and the target image features to obtain fusion features; a prediction module configured to predict the masked content from the fusion features to obtain a first prediction result, and to perform image-text matching prediction on the fusion features to obtain a second prediction result; a loss determination module configured to determine training loss information according to the first prediction result and the second prediction result; and a training module configured to train the language characterization module, the visual characterization module and the multi-modal alignment module according to the training loss information to obtain a multi-modal pre-training model.
According to a fourth aspect of the present disclosure, there is provided a multi-modal data processing apparatus, the apparatus comprising an acquisition module and a processing module, wherein the acquisition module is configured to acquire multi-modal data to be processed, and the processing module is configured to input the multi-modal data to be processed into a multi-modal data processing model for data processing, the multi-modal data processing model being obtained by performing task migration training based on the multi-modal pre-training model obtained in any of the above embodiments.
According to a fifth aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of the above embodiments.
According to a sixth aspect of the present disclosure, there is provided an electronic device comprising: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to perform the method of any of the embodiments described above.
Exemplary embodiments of the present disclosure may have some or all of the following advantages:
In the training method of the multi-modal pre-training model provided by the exemplary embodiments of the disclosure, on the one hand, feature alignment and feature fusion can be performed on the target text features and the target image features through the multi-modal alignment module, so that cross-modal features are learned by training the multi-modal alignment module, the prediction accuracy of the multi-modal pre-training model is improved, and the computational resources and training time required by subsequent training are reduced. On the other hand, masked-content prediction and image-text matching prediction are performed on the fusion features through the first pre-training task and the second pre-training task respectively, training loss information is determined according to the resulting first prediction result and second prediction result, and the language characterization module, the visual characterization module and the multi-modal alignment module are trained on this basis, so that the trained multi-modal pre-training model combines feature prediction and image-text matching capabilities, further improving its prediction accuracy. Furthermore, by training the visual characterization module, the multi-modal pre-training model can comprehensively extract the image features of the original training sample, improving the image prediction accuracy of the multi-modal pre-training model.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
FIG. 1 schematically illustrates a flow chart of a training method of a multi-modal pre-training model according to one embodiment of the present disclosure.
FIG. 2 schematically illustrates a flow chart of the feature extraction process of the language characterization module according to one embodiment of the disclosure.
FIG. 3 schematically illustrates a flow chart of the feature extraction process of the visual characterization module according to one embodiment of the disclosure.
FIG. 4 schematically illustrates a schematic diagram of the training process of a multi-modal pre-training model according to one embodiment of the present disclosure.
FIG. 5 schematically illustrates a flow chart of a multi-modal data processing method according to one embodiment of the disclosure.
FIG. 6 schematically illustrates a block diagram of a training apparatus for a multi-modal pre-training model in one embodiment of the present disclosure.
FIG. 7 illustrates a block diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. However, those skilled in the art will recognize that the aspects of the present disclosure may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
The disclosure provides an exemplary application-scenario system for the training method and apparatus of the multi-modal pre-training model. The method of this embodiment can be applied to a server; it can also be applied to a terminal, or to a system comprising the terminal and the server and implemented through interaction between the terminal and the server. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data and artificial intelligence platforms, or a node in a blockchain.
The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a vehicle-mounted device, etc. When the training method of the multi-mode pre-training model provided in this embodiment is implemented through interaction between the terminal and the server, the terminal and the server may be directly or indirectly connected through wired or wireless communication, for example, the terminal sends a training instruction to the server, the server performs a training process of the multi-mode pre-training model according to the training instruction, and the server sends a training completion signal to the terminal after training is completed. The present disclosure is not limited herein.
The training method of the multi-modal pre-training model provided by the embodiments of the disclosure can be executed on a server, and accordingly, the training apparatus for the multi-modal pre-training model is generally deployed on the server.
The following describes the training method of the multimodal pre-training model disclosed in the embodiments of the present specification with reference to specific embodiments.
Referring to fig. 1, a training method of a multi-modal pre-training model according to an exemplary embodiment provided by the present disclosure may be applied to a server, and may include the following steps.
Step S110, extracting features of a target training text through a language characterization module to obtain target text features; extracting features of the target training image through a visual characterization module to obtain target image features; the target training text and the target training image are obtained by performing content masking on the original training sample.
Step S120, performing feature alignment and feature fusion on the target text features and the target image features through a multi-modal alignment module to obtain fusion features.
Step S130, predicting the masking content of the fusion features through a first pre-training task to obtain a first prediction result; and carrying out image-text matching prediction on the fusion characteristics through a second pre-training task to obtain a second prediction result.
Step S140, determining training loss information according to the first prediction result and the second prediction result.
Step S150, training the language characterization module, the visual characterization module and the multi-modal alignment module according to the training loss information to obtain a multi-modal pre-training model.
In the training method of the multi-modal pre-training model provided by the present exemplary embodiment, on the one hand, feature alignment and feature fusion can be performed on the target text features and the target image features through the multi-modal alignment module, so that cross-modal features are learned by training the multi-modal alignment module, the prediction accuracy of the multi-modal pre-training model is improved, and the computational resources and training time required by subsequent training are reduced. On the other hand, masked-content prediction and image-text matching prediction are performed on the fusion features through the first pre-training task and the second pre-training task respectively, training loss information is determined according to the resulting first prediction result and second prediction result, and the language characterization module, the visual characterization module and the multi-modal alignment module are trained on this basis, so that the trained multi-modal pre-training model combines feature prediction and image-text matching capabilities, further improving its prediction accuracy. Furthermore, by training the visual characterization module, the multi-modal pre-training model can comprehensively extract the image features of the original training sample, improving the image prediction accuracy of the multi-modal pre-training model.
In step S110, feature extraction is performed on the target training text by the language characterization module to obtain target text features, and feature extraction is performed on the target training image by the visual characterization module to obtain target image features.
In this example embodiment, the target training text and the target training image are obtained by content masking the original training sample. An original training sample refers to an original image-text corpus sample; for example, an image and a sentence describing the image may form one original training sample. Content masking randomly masks part of the content of the original training sample: the text or the image in the original training sample may be masked according to a certain proportion (such as 15%), for example by replacing the original content with "MASK". In some embodiments, content masking of the original training sample may be performed by first selecting content positions, then replacing part of the selected positions with "MASK", replacing another part with random other content, and leaving the remaining positions unchanged.
In this example embodiment, the language characterization module may be any of various text characterization models, such as a pre-trained text encoder (BERT), and the visual characterization module may be any of various image characterization models, such as various convolutional network models.
For example, referring to fig. 2, feature extraction of the target training text may include the following steps.
First, word segmentation is carried out on target training texts.
And secondly, mapping each word segmentation result into a corresponding symbol sequence based on the pre-training dictionary.
And thirdly, carrying out coding processing on each symbol sequence to correspondingly obtain word embedded vectors.
Fourthly, performing modal embedding on the word embedding vectors according to the first modal matrix to obtain target text features.
In this example embodiment, the target training text may be segmented to obtain a plurality of word segmentation results, and each word segmentation result is then mapped into a symbol sequence (such as symbol sequence 1, symbol sequence 2 and symbol sequence 3) using the dictionary of the pre-trained BERT model, where all symbol sequences have the same length. Each symbol sequence is encoded (for example, one-hot encoded) to obtain the word embedding vector (such as word embedding vector 1, word embedding vector 2 and word embedding vector 3) corresponding to each word segmentation result, and modal embedding is performed using the first modal matrix to obtain the target text features. The first modal matrix may be the modal embedding matrix corresponding to text, used to indicate the modality (such as text or image) of the current features; one training sample may correspond to one first modal matrix. For example, the first modal matrix may be combined, through a matrix operation, with the word embedding matrix formed by the word embedding vectors of the word segmentation results, so as to obtain the target text features.
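For intuition, the following is a minimal PyTorch-style sketch of the text branch described above. The tokenizer name, vocabulary size, hidden dimension, and the representation of the first modal matrix as a learned broadcast embedding are illustrative assumptions rather than values specified by this disclosure.

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer  # assumed tooling; the text only requires "the dictionary of the pre-trained BERT model"

class TextBranch(nn.Module):
    """Hypothetical language characterization module: segment, map to token ids, embed, add a text-modality embedding."""
    def __init__(self, vocab_size=21128, hidden_dim=768):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed dictionary
        # one-hot encoding followed by a linear projection is equivalent to an embedding lookup
        self.word_embedding = nn.Embedding(vocab_size, hidden_dim)
        # "first modal matrix": a learned vector broadcast over every text position
        self.text_modal_embedding = nn.Parameter(torch.zeros(1, 1, hidden_dim))

    def forward(self, texts):
        # word segmentation + mapping each segmentation result to a symbol (token id) sequence
        token_ids = self.tokenizer(texts, padding=True, return_tensors="pt")["input_ids"]
        word_vectors = self.word_embedding(token_ids)      # word embedding vectors, (B, L, hidden_dim)
        return word_vectors + self.text_modal_embedding    # target text features
```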
For example, referring to fig. 3, feature extraction of the target training image may include the following steps.
The first step, extracting features of a target training image through a visual characterization module to obtain first features.
And step two, adjusting the channel number of the first feature to obtain a second feature.
Thirdly, performing modal embedding on each second feature according to the second modal matrix to obtain target image features.
In this example embodiment, a visual characterization model with initialized network parameters may be used as the visual characterization module to perform feature extraction on the target training image, so that the features of the target training image can be extracted more comprehensively, avoiding the limitation of extracting only region features with an object detection model trained on specific visual scenes and improving the breadth of image feature extraction. For example, the dense convolutional network DenseNet121 may be used to extract features from the target training image to obtain high-dimensional first features with a stacked number of channels, and the number of channels of the first features is then reduced by a 1×1 convolution so that the channel count of the resulting second features is consistent with the multi-modal alignment module; for example, the number of channels may be converted from M to C using a convolution kernel of M×1×1×C. The second modal matrix may be the modal embedding matrix corresponding to images, used to indicate the modality (such as text or image) of the current features; one training sample may correspond to one second modal matrix. For example, the second modal matrix may be combined with the second features through a matrix operation to obtain the target image features.
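The image branch could be sketched along the following lines; DenseNet121 and the 1×1 channel projection follow the description above, while the torchvision call, hidden dimension, and the flattening of the feature map into a token sequence are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import densenet121  # DenseNet121 is named above; this particular API is an assumption

class VisualBranch(nn.Module):
    """Hypothetical visual characterization module: dense-conv features, 1x1 channel projection, image-modality embedding."""
    def __init__(self, hidden_dim=768):
        super().__init__()
        self.backbone = densenet121(weights=None).features  # randomly initialized network parameters
        # 1x1 convolution converts the stacked channel count M (1024 for DenseNet121) to the alignment width C
        self.channel_proj = nn.Conv2d(1024, hidden_dim, kernel_size=1)
        # "second modal matrix": a learned vector broadcast over every image position
        self.image_modal_embedding = nn.Parameter(torch.zeros(1, 1, hidden_dim))

    def forward(self, images):                               # images: (B, 3, H, W)
        first_features = self.backbone(images)               # (B, 1024, h, w)
        second_features = self.channel_proj(first_features)  # (B, hidden_dim, h, w)
        tokens = second_features.flatten(2).transpose(1, 2)  # (B, h*w, hidden_dim)
        return tokens + self.image_modal_embedding           # target image features
```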
In some embodiments, the image may instead be partitioned into patches, features extracted for each image patch, and the patch information then mapped to a low-dimensional vector by a linear transformation.
For example, the visual characterization module includes N convolution layers, and feature extraction on the target training image includes: for the first convolution layer, performing nonlinear activation processing and convolution processing on the input target training image to obtain an output first convolution feature; and for the 2nd to N-th convolution layers, performing nonlinear activation processing and convolution processing on the outputs of the convolution layers preceding the current convolution layer, to obtain the first features output by the N-th convolution layer, where N is a positive integer.
In this example embodiment, the visual characterization module may include N convolution layers cascaded in sequence, where each convolution layer may include a normalization module, a nonlinear activation module and a convolution module. The output of each earlier convolution layer is fed into every subsequent convolution layer, so that the input of each later convolution layer is formed by concatenating the outputs of all preceding layers along the channel dimension. The convolution kernel sizes of different convolution layers may differ, and the nonlinear activation may use a ReLU activation function. Each convolution layer may also include pooling and dropout, which is not limited in this example.
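As an illustration only, the dense connectivity described here could be written as follows; the layer count, growth rate, and kernel sizes are placeholders rather than values taken from this disclosure.

```python
import torch
import torch.nn as nn

class DenseConvStack(nn.Module):
    """Hypothetical stand-in for the N cascaded convolution layers: each layer receives the
    channel-wise concatenation of the original input and all earlier layer outputs."""
    def __init__(self, in_channels=3, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),                               # normalization module
                nn.ReLU(inplace=True),                                  # nonlinear activation module
                nn.Conv2d(channels, growth, kernel_size=3, padding=1),  # convolution module
            ))
            channels += growth  # every later layer also sees this layer's output

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # concatenate all preceding outputs along the channel dimension
            features.append(out)
        return features[-1]                          # the "first features" output by the N-th layer
```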
In step S120, feature alignment and feature fusion are performed on the target text features and the target image features by the multi-modal alignment module, so as to obtain fusion features.
In this example embodiment, the multi-modal alignment module is configured to perform alignment and fusion of text features and image features, so as to implement cross-modal information learning.
For example, feature alignment and feature fusion may be performed on the target text feature and the target image feature by the following steps.
Firstly, the target text features and the target image features are spliced; then, attention-based weighting processing and fully connected processing are performed on the splicing result and the position embedding vector through a pre-trained coding model, so as to obtain the fusion features.
In this example embodiment, since the multi-modal alignment module is insensitive to the input order, feature position information is introduced through the position embedding vector. A SEP separator may be placed between the target text features and the target image features when splicing them, and a CLS identifier may be added at the beginning, so that the output at the CLS position can be used directly for prediction. For example, for the splicing result X, the pre-trained coding model BERT may be used to perform feature alignment and fusion; BERT contains m coding layers (e.g., 12), each consisting of an attention unit and a fully connected unit. The attention unit performs attention-based weighting: the matrices W_q, W_k and W_v produce the intermediate variables Q (query vectors), K (key vectors) and V (value vectors); a weight matrix γ is then generated from Q and K, and a matrix operation of γ with V gives the weighted output X_att. The fully connected unit then maps the attention output X_att to the output X_out through fully connected processing:

Q = W_q X, K = W_k X, V = W_v X

γ = softmax(Q K^T / √d)

X_att = γ V

X_out = FFN(X_att)

where d denotes the dimension of the query vectors, the superscript T denotes the transpose operation, and FFN denotes the fully connected processing. After the m layers of processing, each position of the resulting fusion features contains information from the elements at every other position, and the text features and the image features are aligned.
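A single coding layer of the alignment module, following the formulas above, might look like the sketch below; it uses one attention head and omits the residual connections and layer normalization a real BERT layer would have, so it should be read as an illustration of the Q/K/V weighting and FFN steps only.

```python
import math
import torch
import torch.nn as nn

class AlignmentLayer(nn.Module):
    """Hypothetical single coding layer: attention-based weighting followed by fully connected processing."""
    def __init__(self, dim=768):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)   # W_q
        self.w_k = nn.Linear(dim, dim)   # W_k
        self.w_v = nn.Linear(dim, dim)   # W_v
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # x: (B, L, dim) spliced text+image features with position embeddings already added
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        gamma = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)  # weight matrix γ
        x_att = gamma @ v                 # X_att = γ·V
        return self.ffn(x_att)            # X_out = FFN(X_att)
```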
In step S130, predicting the masking content of the fusion feature through the first pre-training task to obtain a first prediction result; and carrying out image-text matching prediction on the fusion characteristics through a second pre-training task to obtain a second prediction result.
In this example embodiment, the first pre-training task refers to the task of predicting the masked content, which can be regarded as a multi-class classification problem; for example, the masked content may be predicted by a multi-layer perceptron. The second pre-training task refers to predicting the degree of matching between the image and the text, which can be regarded as a binary classification problem, and a corresponding classifier (such as a fully connected network) is used to predict the image-text matching degree.
Illustratively, the first prediction result may be obtained as follows: firstly, target fusion features are determined from the fusion features according to the masking positions of the masked content; then, the probability of each target fusion feature over each preset word category is determined to obtain the first prediction result.
In this exemplary embodiment, the vectors corresponding to the masked content, that is, the target fusion features, may be selected from the fusion feature matrix; the target fusion features are mapped by a multi-layer perceptron and converted into, for each masked word, a probability over every word category, where the number of word categories equals the vocabulary size.
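The first pre-training task could be sketched as below; selecting the masked positions with a boolean mask and using a two-layer perceptron are assumptions consistent with, but not dictated by, the description above.

```python
import torch
import torch.nn as nn

class MaskedContentHead(nn.Module):
    """Hypothetical first pre-training task head: pick the fused vectors at the masked positions
    and map them to a probability over every word category (the vocabulary)."""
    def __init__(self, dim=768, vocab_size=21128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, vocab_size))

    def forward(self, fused, mask_positions):
        # fused: (B, L, dim) fusion features; mask_positions: (B, L) boolean mask of the masked locations
        target_fusion_features = fused[mask_positions]   # (num_masked, dim)
        logits = self.mlp(target_fusion_features)        # (num_masked, vocab_size)
        return torch.softmax(logits, dim=-1)              # first prediction result: probability per word category
```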
In step S140, training loss information is determined according to the first prediction result and the second prediction result.
In this exemplary embodiment, the total training loss information may be determined by combining the loss functions corresponding to the first pretraining task and the second pretraining task, for example, the sum of the loss functions of the two pretraining tasks is used as the training loss information, or the loss functions of the two pretraining tasks are weighted, which is not limited in this example.
Illustratively, the training loss may be determined by: determining first loss information according to the first prediction result and the masking content corresponding to the masking position; determining second loss information according to the cross entropy loss between the second prediction result and the image-text matching label; the first loss information and the second loss information are weighted based on the loss weights, and training loss information is determined.
In this exemplary embodiment, the masked content corresponding to a masking position, that is, the real word at that position, may be represented by its one-hot code over the whole vocabulary. The cross entropy between this code vector and the first prediction result is computed and then averaged with weights, and the first loss information L_1(θ) is calculated as follows:

L_1(θ) = -E_W [ log P_θ(W_m) ]

where W_m denotes the code vector of the masked content corresponding to a masking position, θ denotes the model parameters, P_θ(W_m) denotes the first prediction result, and E_W denotes the weighted average.

For the second loss information, the image-text matching label may be y ∈ {0, 1}, where 0 indicates that the text content does not match the image and 1 indicates that the text content matches the image. The second loss information L_2(θ) is calculated as follows:

L_2(θ) = -[ y log S_θ(W) + (1 - y) log(1 - S_θ(W)) ]

where S_θ(W) denotes the second prediction result.

The two loss terms are combined by a weighted sum into the final training loss information L(θ), with the loss weight α participating in the gradient update process as a trainable parameter:

L(θ) = L_2(θ) + α L_1(θ)

The training loss information is determined through the above process.
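Putting the two loss terms together, a sketch of L(θ) = L_2(θ) + α·L_1(θ) could read as follows; the tensor shapes and the use of standard PyTorch loss functions are assumptions.

```python
import torch
import torch.nn.functional as F

def training_loss(mask_probs, mask_targets, itm_logits, itm_labels, alpha):
    """Hypothetical assembly of the training loss L(θ) = L2(θ) + α·L1(θ).

    mask_probs:   (num_masked, vocab_size) probabilities from the first pre-training task
    mask_targets: (num_masked,) ids of the real words at the masked positions
    itm_logits:   (B, 2) image-text matching scores from the second pre-training task
    itm_labels:   (B,) image-text matching labels, 1 = match, 0 = mismatch
    alpha:        loss weight, kept as a trainable parameter
    """
    # L1: weighted average of the negative log-likelihood of the true masked words
    l1 = F.nll_loss(torch.log(mask_probs + 1e-9), mask_targets)
    # L2: cross entropy between the matching prediction and the image-text matching label
    l2 = F.cross_entropy(itm_logits, itm_labels)
    return l2 + alpha * l1
```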
In step S150, the language characterization module, the visual characterization module, and the multi-modal alignment module are trained according to the training loss information to obtain a multi-modal pre-training model.
In this example embodiment, according to the training loss information, gradient back-propagation may be performed using a gradient descent method to update the model parameters until a training cutoff condition is reached (e.g., the model converges or the maximum number of training iterations is reached), so as to obtain the multi-modal pre-training model.
Illustratively, the language characterization module, the visual characterization module, the multi-modal alignment module, the model parameters of the first pre-training task, the second pre-training task, and the loss weights are adjusted with the aim of minimizing training loss information; and determining a multi-mode pre-training model according to the language characterization module, the visual characterization module and the multi-mode alignment module which are obtained after training.
In this example embodiment, the model parameters may be adjusted by gradient back propagation during each round of training, where the adjusted model parameters may include a language characterization module, a visual characterization module, a multi-modal alignment module, model parameters in a first pre-training task and a second pre-training task, and a loss weight (such as α), until training loss information is less than a preset value, to obtain a multi-modal pre-training model including the language characterization module, the visual characterization module, and the multi-modal alignment module.
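One training iteration might then be organized as in the sketch below, reusing the hypothetical modules and the training_loss helper from the earlier sketches; the batch keys and the itm_head classifier are illustrative names, and α is assumed to be registered with the optimizer alongside the module parameters.

```python
def train_step(batch, text_branch, visual_branch, align_module, mlm_head, itm_head, alpha, optimizer):
    """Hypothetical single pre-training iteration."""
    text_feat = text_branch(batch["masked_texts"])          # language characterization module
    image_feat = visual_branch(batch["masked_images"])       # visual characterization module
    fused = align_module(text_feat, image_feat)              # splice + m coding layers -> fusion features

    mask_probs = mlm_head(fused, batch["mask_positions"])    # first pre-training task
    itm_logits = itm_head(fused[:, 0])                       # second pre-training task, e.g. from the CLS position

    loss = training_loss(mask_probs, batch["mask_targets"],
                         itm_logits, batch["itm_labels"], alpha)
    optimizer.zero_grad()
    loss.backward()      # gradient back-propagation
    optimizer.step()     # updates all module parameters and the loss weight α
    return loss.item()
```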
In some embodiments, the multi-modal pre-training model may include the language characterization module, the visual characterization module, the multi-modal alignment module and the classifier corresponding to the second pre-training task, so as to form a pre-training model for predicting the image-text matching degree; this model can be used to match image-text information to be matched. In practical applications (such as image-text retrieval, image-text question answering, etc.), other model structures (such as other classifiers) may be added on top of the multi-modal pre-training model (consisting of the language characterization module, the visual characterization module and the multi-modal alignment module) according to the actual downstream task, followed by joint model training, so as to make predictions in the actual scenario.
For example, the implementation process of the training method of the multimodal pre-training model of the present disclosure is shown in fig. 4, where the multimodal pre-training model may include a language characterization module, a visual characterization module, and a multimodal alignment module, and may be specifically implemented by the following steps.
Firstly, building a pre-training framework, and adding a first pre-training task and a second pre-training task after a pre-training model.
Secondly, acquiring training samples: an image and a text describing the image may serve as one training sample, and an image may also be combined with other texts to form a training sample. The training samples may cover multiple domains and scenes, and this example does not limit the scenes of the training samples.
Thirdly, partially masking the training sample to obtain a target training sample, and initializing the model parameters.
Fourthly, inputting the text in the target training sample into the initialized language characterization module and inputting the image in the target training sample into the initialized visual characterization module; the outputs of the two characterization modules enter the multi-modal alignment module, and the output of the multi-modal alignment module is fed in parallel into the two pre-training tasks.
Fifth, the training loss is calculated using the outputs of the two pre-training tasks.
Sixthly, the model parameters of all modules and the loss weight are adjusted according to the training loss.
Iterative training is then performed according to the second to sixth steps until the model converges, yielding a multi-modal pre-training model consisting of the language characterization module, the visual characterization module and the multi-modal alignment module.
In the multi-modal pre-training process of the above embodiment, the language characterization module, the visual characterization module, the multi-modal alignment module, the model parameters of the first pre-training task and the second pre-training task, and the loss weights of the two tasks all need to be trained.
According to the above method, on the one hand, using an initialized visual characterization module allows the global semantic information of the image training samples to be learned, so that the finally obtained multi-modal pre-training model can extract the semantic information of an image to be processed more comprehensively. On the other hand, by combining the first pre-training task and the second pre-training task and continuously training their loss weights, the pre-training model learns both image-text matching and image-text fusion prediction capabilities. Feature extraction, feature alignment and fusion are performed through the language characterization module, the visual characterization module and the multi-modal alignment module, so that cross-modal information is learned, model accuracy is improved, and the computational resources and training time for model training on downstream tasks are saved.
In other embodiments, referring to fig. 5, the present disclosure further provides a multi-modal data processing method, the method comprising the steps of:
step S510, acquiring multi-modal data to be processed.
Step S520, inputting the multi-modal data to be processed into a multi-modal data processing model for data processing, where the multi-modal data processing model is obtained by performing task migration training based on the multi-modal pre-training model obtained in the above embodiment.
In the present exemplary embodiment, the multi-modal data to be processed may include text and images to be processed, for example a consultation question containing an image sent during a user consultation, or a search text containing an image entered while a user searches for target information, which is not limited in this example. Task migration training refers to determining and training a migration task according to actual needs and the application scenario; the migration task may be user intention prediction or user search target prediction, or another task, and this example does not limit the migration task. Based on the migration task, the model consisting of the multi-modal pre-training model and the migration task structure is trained with a small number of training samples from the specific application scenario, with the training mainly applied to the migration task part; the multi-modal data processing model obtained after training is then used to perform the data processing.
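Task migration training can be pictured as wrapping the pre-trained modules with a new task head and fine-tuning on a small number of scenario-specific samples; the sketch below is one possible arrangement, with the classifier size and the use of the first fused position as assumptions.

```python
import torch.nn as nn

class DownstreamModel(nn.Module):
    """Hypothetical task-migration wrapper: reuse the multi-modal pre-training model and add a task head."""
    def __init__(self, pretrained, hidden_dim=768, num_classes=2):
        super().__init__()
        self.pretrained = pretrained                   # language + visual + multi-modal alignment modules
        self.task_head = nn.Linear(hidden_dim, num_classes)

    def forward(self, texts, images):
        fused = self.pretrained(texts, images)         # fusion features, (B, L, hidden_dim)
        return self.task_head(fused[:, 0])             # predict the migration task from the first fused position
```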
Further, the present disclosure also provides a training device 600 for a multimodal pre-training model. The multi-modal pretraining apparatus 600 may be applied to a server. Referring to fig. 6, the multi-modal pretraining apparatus 600 may include: a feature extraction module 610, a multi-modal alignment module 620, a prediction module 630, a loss determination module 640, and a training module 650; wherein: the feature extraction module 610 is configured to perform feature extraction on the target training text to obtain target text features; extracting features of the target training image to obtain target image features; the target training text and the target training image are obtained by performing content masking on the original training sample; the multi-mode alignment module 620 is configured to perform feature alignment and feature fusion on the target text feature and the target image feature to obtain a fusion feature; the prediction module 630 is configured to predict the masking content of the fusion feature to obtain a first prediction result; carrying out image-text matching prediction on the fusion characteristics to obtain a second prediction result; a loss determination module 640 configured to determine training loss information based on the first prediction result and the second prediction result; the training module 650 is configured to train the language characterization module, the visual characterization module, and the multimodal alignment module to obtain a multimodal pre-training model based on the training loss information.
In an exemplary embodiment of the present disclosure, the feature extraction module 610 includes a language characterization module 611, and the language characterization module 611 is configured to: perform word segmentation on the target training text; map each word segmentation result into a corresponding symbol sequence based on a pre-training dictionary; encode each symbol sequence to obtain corresponding word embedding vectors; and perform modal embedding on each word embedding vector according to the first modal matrix to obtain the target text features.
In an exemplary embodiment of the present disclosure, the feature extraction module 610 further includes a visual characterization module 612, and the visual characterization module 612 is configured to: perform feature extraction on the target training image to obtain first features, the visual characterization module being a visual characterization model with initialized network parameters; adjust the number of channels of the first features to obtain second features; and perform modal embedding on each second feature according to the second modal matrix to obtain the target image features.
In one exemplary embodiment of the present disclosure, the visual characterization module 612 includes N convolution layers, and the visual characterization module 612 is further configured to: for the first convolution layer, perform nonlinear activation processing and convolution processing on the input target training image to obtain an output first convolution feature; and for the 2nd to N-th convolution layers, perform nonlinear activation processing and convolution processing on the outputs of the convolution layers preceding the current convolution layer, to obtain the first features output by the N-th convolution layer, where N is a positive integer.
In one exemplary embodiment of the present disclosure, the multi-modal alignment module 620 is further configured to: splice the target text features and the target image features; and perform attention-based weighting processing and fully connected processing on the splicing result and the position embedding vector through a pre-trained coding model to obtain the fusion features.
In an exemplary embodiment of the present disclosure, the prediction module 630 is further configured to: determining target fusion characteristics from the fusion characteristics according to the masking positions of the masking contents; and determining the probability of the target fusion feature on each preset category word so as to obtain a first prediction result.
In one exemplary embodiment of the present disclosure, the loss determination module 640 is further configured to: determining first loss information according to the first prediction result and the masking content corresponding to the masking position; determining second loss information according to the cross entropy loss between the second prediction result and the image-text matching label; the first loss information and the second loss information are weighted based on the loss weights, and training loss information is determined.
In one exemplary embodiment of the present disclosure, training module 650 is further configured to: adjusting model parameters and loss weights of a language characterization module, a visual characterization module, a multi-mode alignment module, a first pre-training task and a second pre-training task by taking minimum training loss information as a target; and determining a multi-mode pre-training model according to the language characterization module, the visual characterization module and the multi-mode alignment module which are obtained after training.
The specific details of each module or unit in the training device of the multimodal pre-training model are described in detail in the training method of the corresponding multimodal pre-training model, so that the details are not repeated here.
The disclosure also provides a multi-modal data processing apparatus, which includes an acquisition module and a processing module, wherein the acquisition module is configured to acquire multi-modal data to be processed, and the processing module is configured to input the multi-modal data to be processed into a multi-modal data processing model for data processing, the multi-modal data processing model being obtained by performing task migration training based on the multi-modal pre-training model obtained in the above embodiments.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments, or may exist alone without being incorporated into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods in the embodiments described above. For example, the electronic device may implement the flow steps shown in fig. 1 to 5.
It should be noted that the computer readable medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
An electronic device 700 according to such an embodiment of the present disclosure is described below with reference to fig. 7. The electronic device 700 shown in fig. 7 is merely an example and should not be construed to limit the functionality and scope of use of embodiments of the present disclosure in any way.
As shown in fig. 7, the electronic device 700 is embodied in the form of a general purpose computing device. Components of electronic device 700 may include, but are not limited to: the at least one processing unit 710, the at least one storage unit 720, a bus 730 connecting the different system components (including the storage unit 720 and the processing unit 710), and a display unit 740.
Wherein the storage unit stores program code that is executable by the processing unit 710 such that the processing unit 710 performs steps according to various exemplary embodiments of the present disclosure described in the above-described "exemplary methods" section of the present specification.
The memory unit 720 may include readable media in the form of volatile memory units, such as Random Access Memory (RAM) 7201 and/or cache memory 7202, and may further include Read Only Memory (ROM) 7203.
The storage unit 720 may also include a program/utility 7204 having a set (at least one) of program modules 7205, such program modules 7205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 730 may be a bus representing one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 700 may also communicate with one or more external devices 770 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 700, and/or any device (e.g., router, modem, etc.) that enables the electronic device 700 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 750. Also, electronic device 700 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through network adapter 760. As shown, network adapter 760 communicates with other modules of electronic device 700 over bus 730. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 700, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of the embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented by software, or by software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to cause a computing device (which may be a personal computer, a server, a terminal device, a network device, etc.) to perform the method according to the embodiments of the present disclosure.
Furthermore, the above-described figures are only schematic illustrations of the processes included in the methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily appreciated that the processes shown in these figures do not indicate or limit the temporal order in which they are performed. It is also readily understood that these processes may be performed synchronously or asynchronously, for example across a plurality of modules.
It should be noted that although the steps of the methods of the present disclosure are illustrated in the accompanying drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all of the illustrated steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps; all of these variations are considered part of the present disclosure.
It should be understood that the present disclosure disclosed and defined herein extends to all alternative combinations of two or more of the individual features mentioned or evident from the text and/or drawings. All of these different combinations constitute various alternative aspects of the present disclosure. Embodiments of the present disclosure describe the best mode known for carrying out the disclosure and will enable one skilled in the art to utilize the disclosure.

Claims (12)

1. A method of training a multi-modal pre-training model, comprising:
extracting features of a target training text through a language characterization module to obtain target text features, and extracting features of a target training image through a visual characterization module to obtain target image features, wherein the target training text and the target training image are each obtained by performing content masking on an original training sample;
performing feature alignment and feature fusion on the target text features and the target image features through a multi-modal alignment module to obtain fusion features;
predicting masked content based on the fusion features through a first pre-training task to obtain a first prediction result, and performing image-text matching prediction on the fusion features through a second pre-training task to obtain a second prediction result;
determining training loss information according to the first prediction result and the second prediction result;
and training the language characterization module, the visual characterization module and the multi-modal alignment module according to the training loss information to obtain the multi-modal pre-training model.
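Purely for orientation, the following is a minimal end-to-end sketch of the kind of training step recited in claim 1, written in PyTorch; the module choices (an embedding table, a patch convolution, a single Transformer encoder layer), the vocabulary size and the embedding dimension are illustrative assumptions, not the disclosed implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Hypothetical stand-ins for the language characterization module, the visual
    # characterization module and the multi-modal alignment module of claim 1.
    text_encoder = nn.Embedding(30522, 768)                          # language characterization module
    image_encoder = nn.Conv2d(3, 768, kernel_size=16, stride=16)     # visual characterization module
    fusion_encoder = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
    mlm_head = nn.Linear(768, 30522)    # first pre-training task: masked-content prediction
    itm_head = nn.Linear(768, 2)        # second pre-training task: image-text matching

    def training_step(token_ids, image, mlm_labels, itm_label):
        text_feat = text_encoder(token_ids)                              # target text features
        img_feat = image_encoder(image).flatten(2).transpose(1, 2)       # target image features
        fused = fusion_encoder(torch.cat([text_feat, img_feat], dim=1))  # fusion features
        mlm_logits = mlm_head(fused[:, :token_ids.size(1)])              # first prediction result
        itm_logits = itm_head(fused[:, 0])                               # second prediction result
        # training loss information from both prediction results
        return (F.cross_entropy(mlm_logits.transpose(1, 2), mlm_labels, ignore_index=-100)
                + F.cross_entropy(itm_logits, itm_label))

    loss = training_step(torch.randint(0, 30522, (2, 12)), torch.randn(2, 3, 224, 224),
                         torch.randint(0, 30522, (2, 12)), torch.randint(0, 2, (2,)))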
2. The method of claim 1, wherein the feature extraction of the target training text comprises:
performing word segmentation on the target training text;
mapping each word segmentation result to a corresponding symbol sequence based on a pre-training dictionary;
encoding each symbol sequence to obtain corresponding word embedding vectors;
and performing modal embedding on each word embedding vector according to a first modal matrix to obtain the target text features.
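As an illustration of the text-side pipeline in claim 2 (word segmentation, mapping to a symbol sequence, encoding, modal embedding), the sketch below uses a whitespace tokenizer, a toy dictionary and a learned modal matrix; all names and sizes are assumptions rather than the disclosed pre-training dictionary or encoder.

    import torch
    import torch.nn as nn

    vocab = {"[UNK]": 0, "[MASK]": 1, "a": 2, "dog": 3, "runs": 4}   # toy pre-training dictionary
    word_embedding = nn.Embedding(len(vocab), 768)                    # encodes each symbol sequence
    first_modal_matrix = nn.Parameter(torch.zeros(1, 1, 768))         # text-side modal embedding

    def extract_text_features(text: str) -> torch.Tensor:
        tokens = text.lower().split()                                          # word segmentation
        ids = torch.tensor([[vocab.get(t, vocab["[UNK]"]) for t in tokens]])   # symbol sequence
        word_vectors = word_embedding(ids)                                     # word embedding vectors
        return word_vectors + first_modal_matrix                               # target text features

    features = extract_text_features("a dog runs")   # shape (1, 3, 768)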
3. The method of claim 1, wherein the feature extraction of the target training image comprises:
extracting features of the target training image through a visual characterization module to obtain first features, wherein the visual characterization module is a model whose network parameters are initialized;
adjusting the number of channels of the first features to obtain second features;
and performing modal embedding on each second feature according to a second modal matrix to obtain the target image features.
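One way the image-side pipeline of claim 3 could look is sketched below, assuming a randomly initialized convolutional backbone as the visual characterization module and a 1x1 convolution for the channel-number adjustment; both choices are assumptions for illustration only.

    import torch
    import torch.nn as nn

    backbone = nn.Conv2d(3, 256, kernel_size=16, stride=16)    # visual characterization module (initialized parameters)
    channel_adjust = nn.Conv2d(256, 768, kernel_size=1)        # adjusts the number of channels
    second_modal_matrix = nn.Parameter(torch.ones(1, 1, 768))  # image-side modal embedding

    def extract_image_features(image: torch.Tensor) -> torch.Tensor:
        first = backbone(image)                      # first features, (B, 256, h, w)
        second = channel_adjust(first)               # second features, (B, 768, h, w)
        patches = second.flatten(2).transpose(1, 2)  # one feature per spatial position, (B, h*w, 768)
        return patches + second_modal_matrix         # target image features

    features = extract_image_features(torch.randn(1, 3, 224, 224))   # shape (1, 196, 768)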
4. The method according to claim 1 or 3, wherein the visual characterization module comprises N convolution layers, and wherein the feature extraction of the target training image comprises:
for the first convolution layer, performing nonlinear activation processing and convolution processing on the input target training image to obtain an output first convolution feature;
and for the 2nd to Nth convolution layers, performing nonlinear activation processing and convolution processing according to the outputs of the convolution layers preceding the current convolution layer, so as to obtain the first features output by the Nth convolution layer, where N is a positive integer.
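Claim 4 describes layers that apply nonlinear activation followed by convolution and that take the outputs of all preceding layers as input, which reads like a densely connected convolution block; the sketch below is one plausible reading of that structure, with channel counts and layer depth chosen arbitrarily.

    import torch
    import torch.nn as nn

    class DenseConvBlock(nn.Module):
        """N convolution layers; layer i consumes the concatenated outputs of all earlier layers."""
        def __init__(self, in_channels: int = 3, growth: int = 32, num_layers: int = 4):
            super().__init__()
            self.layers = nn.ModuleList()
            channels = in_channels
            for _ in range(num_layers):
                self.layers.append(nn.Sequential(
                    nn.ReLU(),                                    # nonlinear activation processing
                    nn.Conv2d(channels, growth, 3, padding=1),    # convolution processing
                ))
                channels += growth                                # the next layer also sees this output

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            outputs = [x]
            for layer in self.layers:
                outputs.append(layer(torch.cat(outputs, dim=1)))
            return outputs[-1]                                    # first feature output by the N-th layer

    first_feature = DenseConvBlock()(torch.randn(1, 3, 64, 64))   # shape (1, 32, 64, 64)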
5. The method of claim 1, wherein feature alignment and feature fusion of the target text feature and the target image feature comprises:
concatenating the target text features and the target image features;
and performing, through a pre-trained coding model, attention-based weighting processing and fully connected processing on the concatenation result and position embedding vectors to obtain the fusion features.
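For the alignment-and-fusion step of claim 5, a single Transformer encoder layer can stand in for the pre-trained coding model, its self-attention providing the attention-based weighting and its feed-forward sub-layer the fully connected processing; the sizes below are assumptions.

    import torch
    import torch.nn as nn

    d = 768
    position_embedding = nn.Embedding(512, d)                    # position embedding vectors
    encoder = nn.TransformerEncoderLayer(d_model=d, nhead=8,     # attention-based weighting
                                         dim_feedforward=3072,   # fully connected processing
                                         batch_first=True)

    def fuse(text_features: torch.Tensor, image_features: torch.Tensor) -> torch.Tensor:
        x = torch.cat([text_features, image_features], dim=1)    # concatenation of the two modalities
        positions = torch.arange(x.size(1), device=x.device)
        return encoder(x + position_embedding(positions))        # fusion features

    fused = fuse(torch.randn(1, 12, d), torch.randn(1, 196, d))  # shape (1, 208, 768)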
6. The method of claim 1, wherein predicting the masked content based on the fusion features comprises:
determining target fusion features from the fusion features according to the masked positions of the masked content;
and determining the probability of the target fusion features over the preset category words to obtain the first prediction result.
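The masked-content prediction of claim 6 amounts to selecting the fusion features at the masked positions and scoring them against a word vocabulary; in the sketch below the vocabulary size, the linear head and the softmax normalization are illustrative assumptions.

    import torch
    import torch.nn as nn

    vocab_size, d = 30522, 768
    mlm_head = nn.Linear(d, vocab_size)                 # scores over the preset category words

    def predict_masked(fusion_features: torch.Tensor, mask_positions: torch.Tensor) -> torch.Tensor:
        # fusion_features: (B, L, d); mask_positions: indices of the masked positions along L
        target = fusion_features[:, mask_positions]      # target fusion features
        return mlm_head(target).softmax(dim=-1)          # first prediction result: word probabilities

    probs = predict_masked(torch.randn(1, 208, d), torch.tensor([3, 7]))   # shape (1, 2, 30522)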
7. The method of claim 6, wherein determining training loss information based on the first prediction result and the second prediction result comprises:
determining first loss information according to the first prediction result and the masked content corresponding to the masked positions;
determining second loss information according to the cross-entropy loss between the second prediction result and the image-text matching label;
and weighting the first loss information and the second loss information based on loss weights to determine the training loss information.
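A minimal sketch of the loss combination in claim 7, assuming cross-entropy for both the masked-content loss and the image-text matching loss and a single scalar loss weight; the exact weighting scheme is not specified by the claim and is an assumption here.

    import torch
    import torch.nn.functional as F

    def training_loss(mlm_logits, mlm_labels, itm_logits, itm_labels, loss_weight=0.5):
        # first loss information: prediction vs. masked content at the masked positions
        first_loss = F.cross_entropy(mlm_logits.transpose(1, 2), mlm_labels, ignore_index=-100)
        # second loss information: cross entropy against the image-text matching label
        second_loss = F.cross_entropy(itm_logits, itm_labels)
        # weighted combination -> training loss information
        return loss_weight * first_loss + (1.0 - loss_weight) * second_loss

    loss = training_loss(torch.randn(2, 10, 30522), torch.randint(0, 30522, (2, 10)),
                         torch.randn(2, 2), torch.randint(0, 2, (2,)))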
8. The method of claim 7, wherein training the language characterization module, the visual characterization module, and the multi-modal alignment module according to the training loss information comprises:
adjusting model parameters of the language characterization module, the visual characterization module, the multi-modal alignment module, the first pre-training task, the second pre-training task, and the loss weights with the aim of minimizing the training loss information;
and determining the multi-modal pre-training model according to the language characterization module, the visual characterization module and the multi-modal alignment module obtained after training.
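Claim 8 jointly adjusts the three modules, both pre-training tasks and the loss weights to minimize the training loss; the toy sketch below shows one way such a joint update could be wired, with tiny placeholder modules and a placeholder loss so that every parameter, including a learnable loss weight, receives a gradient.

    import torch
    import torch.nn as nn

    # Toy stand-ins for the language characterization, visual characterization and
    # multi-modal alignment modules; the real architectures are not implied here.
    language_module = nn.Linear(8, 8)
    visual_module = nn.Linear(8, 8)
    alignment_module = nn.Linear(16, 8)
    loss_weight = nn.Parameter(torch.tensor(0.5))          # claim 8 also adjusts the loss weights

    optimizer = torch.optim.AdamW(
        [*language_module.parameters(), *visual_module.parameters(),
         *alignment_module.parameters(), loss_weight], lr=1e-4)

    def compute_training_loss() -> torch.Tensor:
        # Placeholder for the weighted loss of claim 7, routed through all modules.
        fused = alignment_module(torch.cat([language_module(torch.randn(2, 8)),
                                            visual_module(torch.randn(2, 8))], dim=1))
        first_loss, second_loss = fused.pow(2).mean(), fused.abs().mean()
        return loss_weight * first_loss + (1 - loss_weight) * second_loss

    for _ in range(3):                                     # minimize the training loss information
        optimizer.zero_grad()
        compute_training_loss().backward()
        optimizer.step()

    # The multi-modal pre-training model is assembled from the trained modules.
    pretrained = nn.ModuleDict({"language": language_module,
                                "visual": visual_module,
                                "alignment": alignment_module})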
9. A method of multi-modal data processing comprising:
acquiring multi-mode data to be processed;
inputting the multi-modal data to be processed into a multi-modal data processing model for data processing, wherein the multi-modal data processing model is obtained by performing task migration training based on the multi-modal pre-training model obtained by the method of any one of claims 1 to 8.
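For claim 9, task migration training typically means reusing the pre-trained modules and fine-tuning a small task-specific head on downstream data; the classification head, learning rate and dummy features below are assumptions for illustration only.

    import torch
    import torch.nn as nn

    # Toy stand-in for the multi-modal pre-training model obtained by claims 1-8.
    pretrained_alignment = nn.Linear(768, 768)
    task_head = nn.Linear(768, 5)                               # hypothetical downstream head (e.g. 5 classes)
    model = nn.Sequential(pretrained_alignment, task_head)      # multi-modal data processing model

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # task migration training
    fused_features = torch.randn(4, 768)                        # features of the multi-modal data to be processed
    logits = model(fused_features)                              # data processing result, shape (4, 5)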
10. A training device for a multi-modal pre-training model, the device comprising:
a feature extraction module configured to perform feature extraction on a target training text to obtain target text features, and to perform feature extraction on a target training image to obtain target image features, wherein the target training text and the target training image are obtained by performing content masking on an original training sample;
a multi-modal alignment module configured to perform feature alignment and feature fusion on the target text features and the target image features to obtain fusion features;
a prediction module configured to predict masked content based on the fusion features to obtain a first prediction result, and to perform image-text matching prediction on the fusion features to obtain a second prediction result;
a loss determination module configured to determine training loss information according to the first prediction result and the second prediction result;
and a training module configured to train the language characterization module, the visual characterization module and the multi-modal alignment module according to the training loss information to obtain the multi-modal pre-training model.
11. An electronic device, comprising: one or more processors; and
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 9.
12. A computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 9.
CN202310841914.XA 2023-07-10 2023-07-10 Training of multi-mode pre-training model and multi-mode data processing method and device Pending CN116861995A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310841914.XA CN116861995A (en) 2023-07-10 2023-07-10 Training of multi-mode pre-training model and multi-mode data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310841914.XA CN116861995A (en) 2023-07-10 2023-07-10 Training of multi-mode pre-training model and multi-mode data processing method and device

Publications (1)

Publication Number Publication Date
CN116861995A true CN116861995A (en) 2023-10-10

Family

ID=88228165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310841914.XA Pending CN116861995A (en) 2023-07-10 2023-07-10 Training of multi-mode pre-training model and multi-mode data processing method and device

Country Status (1)

Country Link
CN (1) CN116861995A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117132591A (en) * 2023-10-24 2023-11-28 杭州宇谷科技股份有限公司 Battery data processing method and system based on multi-mode information
CN117132591B (en) * 2023-10-24 2024-02-06 杭州宇谷科技股份有限公司 Battery data processing method and system based on multi-mode information
CN117390165A (en) * 2023-10-27 2024-01-12 北京中科闻歌科技股份有限公司 Multi-mode large model-based chart question-answering method, system, medium and equipment
CN117407754A (en) * 2023-10-27 2024-01-16 北京中科闻歌科技股份有限公司 Multi-mode large model training strategy determination method, electronic equipment and medium
CN117409431A (en) * 2023-10-27 2024-01-16 北京中科闻歌科技股份有限公司 Multi-mode large language model training method, electronic equipment and storage medium
CN117407754B (en) * 2023-10-27 2024-04-19 北京中科闻歌科技股份有限公司 Multi-mode large model training strategy determination method, electronic equipment and medium
CN117409431B (en) * 2023-10-27 2024-04-26 北京中科闻歌科技股份有限公司 Multi-mode large language model training method, electronic equipment and storage medium
CN117609527A (en) * 2024-01-16 2024-02-27 合肥人工智能与大数据研究院有限公司 Cross-modal data retrieval optimization method based on vector database
CN117763174A (en) * 2024-01-18 2024-03-26 泰德网聚(北京)科技股份有限公司 Multi-modal retrieval method, device and storage medium
CN117876940A (en) * 2024-03-11 2024-04-12 浪潮电子信息产业股份有限公司 Video language task execution and model training method, device, equipment and medium thereof
CN117876940B (en) * 2024-03-11 2024-05-31 浪潮电子信息产业股份有限公司 Video language task execution and model training method, device, equipment and medium thereof
CN118097686A (en) * 2024-04-25 2024-05-28 支付宝(杭州)信息技术有限公司 Multi-mode multi-task medical large model training method and device

Similar Documents

Publication Publication Date Title
CN116861995A (en) Training of multi-mode pre-training model and multi-mode data processing method and device
CN109101537B (en) Multi-turn dialogue data classification method and device based on deep learning and electronic equipment
CN107979764B (en) Video subtitle generating method based on semantic segmentation and multi-layer attention framework
CN111597830A (en) Multi-modal machine learning-based translation method, device, equipment and storage medium
CN113255755A (en) Multi-modal emotion classification method based on heterogeneous fusion network
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
WO2023241410A1 (en) Data processing method and apparatus, and device and computer medium
CN114676234A (en) Model training method and related equipment
CN111144124B (en) Training method of machine learning model, intention recognition method, and related device and equipment
US20230029759A1 (en) Method of classifying utterance emotion in dialogue using word-level emotion embedding based on semi-supervised learning and long short-term memory model
CN113705299A (en) Video identification method and device and storage medium
CN115221846A (en) Data processing method and related equipment
CN113434683B (en) Text classification method, device, medium and electronic equipment
CN116152833B (en) Training method of form restoration model based on image and form restoration method
CN116226785A (en) Target object recognition method, multi-mode recognition model training method and device
CN112434166A (en) Text classification method, device and equipment based on timeliness and storage medium
CN115828889A (en) Text analysis method, emotion classification model, device, medium, terminal and product
CN114495916B (en) Method, device, equipment and storage medium for determining insertion time point of background music
CN114330474A (en) Data processing method and device, computer equipment and storage medium
CN112989843B (en) Intention recognition method, device, computing equipment and storage medium
CN113870863A (en) Voiceprint recognition method and device, storage medium and electronic equipment
CN116913278B (en) Voice processing method, device, equipment and storage medium
CN117056474A (en) Session response method and device, electronic equipment and storage medium
CN116861363A (en) Multi-mode feature processing method and device, storage medium and electronic equipment
CN115565186A (en) Method and device for training character recognition model, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination