CN117453880A - Multi-modal data processing method and apparatus, electronic device and storage medium - Google Patents

Multi-modal data processing method and apparatus, electronic device and storage medium

Info

Publication number
CN117453880A
CN117453880A
Authority
CN
China
Prior art keywords
mode
sub
data
feature
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311433898.7A
Other languages
Chinese (zh)
Inventor
张韵璇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202311433898.7A priority Critical patent/CN117453880A/en
Publication of CN117453880A publication Critical patent/CN117453880A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/332 - Query formulation
    • G06F 16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a multi-modal data processing method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: acquiring multi-modal data included in input content; performing feature extraction based on the multi-modal data to obtain a sub-modal feature corresponding to each modality, wherein different sub-modal features have different dimensions; performing causal relationship conversion on the highest-dimensional sub-modal feature among the plurality of sub-modal features to obtain causal feature vectors, wherein the causal feature vectors have the same dimension as the other sub-modal features, and the other sub-modal features are the sub-modal features other than the highest-dimensional sub-modal feature among the plurality of sub-modal features; splicing the causal feature vectors with the other sub-modal features to obtain a spliced feature sequence; and generating reply content for the input content based on the spliced feature sequence. Through the present application, the diversity and accuracy of the reply content generated from the input content can be improved.

Description

Multi-modal data processing method and apparatus, electronic device and storage medium
Technical Field
The present disclosure relates to computer technology, and in particular to a multi-modal data processing method and apparatus, an electronic device, and a storage medium.
Background
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
In the related art, during artificial-intelligence question answering, text content input by a user can be received and corresponding answer content generated. However, data modalities are diverse, the input content that artificial-intelligence question answering can accept is limited, the form of the reply content is single, and it is difficult to satisfy the diverse requirements of users.
In the related art, there is no effective way to improve the performance of artificial-intelligence question answering.
Disclosure of Invention
The embodiments of the present application provide a multi-modal data processing method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product, which can take multi-modal data as input content and improve the diversity and accuracy of the reply content generated from the input content.
The technical scheme of the embodiment of the application is realized as follows:
An embodiment of the present application provides a multi-modal data processing method, comprising the following steps:
acquiring multi-modal data included in input content;
performing feature extraction based on the multi-modal data to obtain a sub-modal feature corresponding to each modality, wherein different sub-modal features have different dimensions;
performing causal relationship conversion on the highest-dimensional sub-modal feature among the plurality of sub-modal features to obtain causal feature vectors, wherein the causal feature vectors have the same dimension as the other sub-modal features, and the other sub-modal features are the sub-modal features other than the highest-dimensional sub-modal feature among the plurality of sub-modal features;
splicing the causal feature vectors with the other sub-modal features to obtain a spliced feature sequence;
and generating reply content for the input content based on the spliced feature sequence.
An embodiment of the present application provides a multi-modal data processing apparatus, comprising:
a data acquisition module, configured to acquire multi-modal data included in input content;
a feature extraction module, configured to perform feature extraction based on the multi-modal data to obtain a sub-modal feature corresponding to each modality, wherein different sub-modal features have different dimensions;
a causal conversion module, configured to perform causal relationship conversion on the highest-dimensional sub-modal feature among the plurality of sub-modal features to obtain causal feature vectors, wherein the causal feature vectors have the same dimension as the other sub-modal features, and the other sub-modal features are the sub-modal features other than the highest-dimensional sub-modal feature among the plurality of sub-modal features;
the causal conversion module being further configured to splice the causal feature vectors with the other sub-modal features to obtain a spliced feature sequence;
and a generation module, configured to generate reply content for the input content based on the spliced feature sequence.
An embodiment of the present application provides an electronic device, including:
a memory for storing computer executable instructions;
and a processor, configured to implement the multi-modal data processing method provided in the embodiments of the present application when executing the computer-executable instructions stored in the memory.
An embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the multi-modal data processing method provided in the embodiments of the present application.
An embodiment of the present application provides a computer program product, comprising a computer program or computer-executable instructions which, when executed by a processor, implement the multi-modal data processing method provided in the embodiments of the present application.
The embodiment of the application has the following beneficial effects:
the method and the device allow the input content to be multi-mode data, and compared with the scheme that the input content and the reply content are text in the related technology, the method and the device improve the richness of the input content and the richness of the reply content and improve user experience. The multi-modal data is used for extracting the characteristics, the characteristics of different modalities are aligned through causal relation conversion, the computing resources required for generating the reply content based on the characteristics of the different modalities are saved, compared with a scheme for directly respectively generating the data of different dimension data, the complexity of the computing process is reduced, the relevance between the input content of different modalities is utilized, the relevance between the reply content and the input content can be further improved, the reply content is more accurate, and the requirements of users are met.
Drawings
Fig. 1 is a schematic diagram of an application mode of a multi-modal data processing method according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
Fig. 3A to 3C are schematic flowcharts of a multi-modal data processing method according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a multi-modal processing model according to an embodiment of the present application;
Fig. 5 is a schematic flowchart of an alternative multi-modal data processing method according to an embodiment of the present application;
Fig. 6A is a schematic diagram of an alternative architecture of a multi-modal processing model according to an embodiment of the present application;
Fig. 6B is a schematic diagram of a causal converter according to an embodiment of the present application;
Fig. 7 is a schematic diagram of a human-computer interaction interface according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", "third" and the like are merely used to distinguish similar objects and do not represent a particular ordering of the objects. It can be understood that "first", "second", "third" may be interchanged in a particular order or sequence where permitted, so that the embodiments of the application described herein can be implemented in an order other than that illustrated or described herein.
It should be noted that, in the embodiments of the present application, the collection of relevant data (for example, the input content of a user in a question-answering application or a chat application) should strictly follow the requirements of relevant national laws and regulations, obtain the informed consent or independent consent of the personal-information subject, and the subsequent use and processing of the data should be carried out within the scope of the laws and regulations and the authorization of the personal-information subject.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
Before further describing embodiments of the present application in detail, the terms and expressions that are referred to in the embodiments of the present application are described, and are suitable for the following explanation.
1) The Modality (Modality), the source or form of each piece of information, may be referred to as a Modality. For example, the media of the information include voice, image, text, etc.; there are a wide variety of sources of information such as radar, infrared, accelerometers, etc. Each of the above may be referred to as a modality.
2) Pre-training: a task-independent pre-trained model is obtained from large-scale data through self-supervised learning, capturing the semantic representation of a word in a particular context. The second step is fine-tuning, which adapts the network to a specific task. The training data may be text, text-image pairs, or text-video pairs. The pre-trained model can be trained with self-supervised learning techniques (such as autoregressive language modelling and auto-encoding). Monolingual, multilingual, and multi-modal models can be trained. Such models can be used to support tasks such as classification, sequence labelling, structure prediction, and sequence generation, and to build applications such as summarization, machine translation, image retrieval, and video annotation.
3) Prompt word (prompt): the input of a Transformer model, which may take the form of a template. For example, if the goal is to have the model output a description of a picture, the prompt is the picture features plus the text "a photo of"; the model completes the input according to the prompt and produces a textual description of the picture, such as "a dog sits on a lawn". If the goal is to have the model perform classification, the prompt is the picture features plus "what is in the picture?", and the model outputs "dog".
4) Contrast learning (contrastive learning), which is one type of self-supervised learning, requires learning of feature representations from unlabeled image data and use in downstream tasks. The guiding principle is as follows: by automatically constructing similar and dissimilar instances, a representation learning model is learned by which similar instances are made closer together in projection space and dissimilar instances are made farther apart in projection space.
5) Stable Diffusion model (Stable Diffusion): a system composed of multiple components and models based on latent diffusion models (Latent Diffusion Models), rather than a single model, which can be used for the text-to-image generation task. It contains three main components, each of which has its own neural network.
6) Stable Diffusion decoder (Stable Diffusion Decoder), also referred to as the auto-decoder (Auto-Decoder): it uses the processed information matrix to render the final image. Its input is the processed information matrix with dimensions (4, 64, 64), and its output is the resulting image with dimensions (3, 512, 512), i.e., (red/green/blue, width, height).
7) Discrete variational auto-encoder (discrete Variational Auto-Encoder, dVAE): an unsupervised model whose core is an auto-encoder consisting of an encoder (Encoder) and a decoder (Decoder). The encoder encodes an image into a feature vector, and the decoder reconstructs the image from the feature vector. The discrete variational auto-encoder is pre-trained through this encoding and decoding of images.
8) Vision pre-training model (EVA): a vision-centric foundation model that aims to explore the limits of large-scale visual representations using only publicly accessible data. It uses masked reconstruction as a pre-training task, scales the model effectively, and improves performance on various visual tasks. Based on the vision pre-training model, multi-modal models that require large-scale training, such as the contrastive language-image pre-training model (Contrastive Language-Image Pre-training, EVA-CLIP), can obtain better performance with fewer samples and less computation; this strategy provides a new direction for scaling up and accelerating the expensive training of multi-modal foundation models. The vision pre-training model is a pre-trained general-purpose vision Transformer that reconstructs masked, image-text-aligned visual features conditioned on the visible image patches. This proxy task can effectively scale the vision pre-training model to one billion parameters and set new records on a wide range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation, and semantic segmentation, without extensive supervised training. Furthermore, scaling up the vision pre-training model leads to qualitative changes in transfer-learning performance that are not present in other models. In addition to being a pure vision encoder, the vision pre-training model can also serve as a vision-centric multi-modal pivot connecting images and text. Initializing the giant contrastive language-image pre-training model from the vision pre-training model stabilizes training and is greatly superior to training from scratch.
9) Bidirectional Encoder Representations from Transformers (BERT): a pre-trained language model whose core architecture is the encoder part of the Transformer model. It enables training on unlabeled data through two tasks: the masked language model (Mask Language Model, MLM) and next sentence prediction (Next Sentence Prediction, NSP). Through this training, the feature vector of each sentence or word on a general corpus can be obtained; these feature vectors carry general semantic features and can greatly improve the training efficiency of downstream tasks.
10) Virtual object: an object that interacts in a virtual scene and is controlled by a user or by a robot program (e.g., an artificial-intelligence-based robot program), and that can stand still, move, and perform various actions in the virtual scene, such as various characters in a game.
The embodiments of the present application provide a multi-modal data processing method, a multi-modal data processing apparatus, an electronic device, a computer-readable storage medium, and a computer program product, which can improve the diversity and accuracy of the reply content generated from input content.
An exemplary application of the electronic device provided by the embodiments of the present application is described below. The electronic device provided by the embodiments of the present application may be implemented as various types of user terminals, such as a notebook computer, a tablet computer, a desktop computer, a set-top box, a smart television, a mobile device (for example, a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, or a portable game device), a vehicle-mounted terminal, a Virtual Reality (VR) device, or an Augmented Reality (AR) device, and may also be implemented as a server. In the following, exemplary applications where the electronic device is implemented as a terminal device or a server are described.
Referring to fig. 1, fig. 1 is an application mode schematic diagram of a method for processing multi-mode data according to an embodiment of the present application; for example, fig. 1 relates to a server 200, a network 300, a terminal device 400, and a database 500. The terminal device 400 is connected to the server 200 via the network 300, and the network 300 may be a wide area network or a local area network, or a combination of both.
In some embodiments, the server 200 may be a server of a chat-robot platform, and the database 500 is used to store the image and text corpus data required by the chat robot for replying to users. The user chats with the artificial-intelligence robot in the chat application through the terminal device 400.
For example, a user inputs an image and a text in a chat application program through the terminal device 400, the terminal device 400 uploads the image and the text to the server 200, the server 200 invokes the processing method of multimodal data provided in the embodiment of the present application, generates a reply content of the input content, feeds back the reply content to the terminal device 400, and displays the reply content in a chat interface of the terminal device 400. In the offline mode, the terminal device 400 may also call the processing method of the multimodal data provided in the embodiment of the present application, and display the reply content in the chat interface of the terminal device 400.
In some embodiments, the method for processing multi-mode data according to the embodiments of the present application may also be applied in the following application scenarios: the vehicle map navigation application program is used for inputting voice and images in the application program by a user, converting the voice into texts by the application program, searching corresponding places based on the texts and the images, and replying data such as images, texts, routes and the like associated with the places.
The embodiments of the present application may be implemented through database technology. In short, a database (Database) can be regarded as an electronic filing cabinet, i.e., a place where electronic files are stored, and a user can perform operations such as adding, querying, updating, and deleting on the data in the files. A "database" is a collection of data that is stored together in a way that can be shared with multiple users, has as little redundancy as possible, and is independent of the application.
A database management system (Database Management System, DBMS) is a computer software system designed for managing databases, and generally has basic functions such as storage, retrieval, security, and backup. Database management systems can be classified according to the database models they support, e.g., relational or XML (Extensible Markup Language); or according to the types of computers they support, e.g., server clusters or mobile phones; or according to the query languages they use, such as the Structured Query Language (SQL, Structured Query Language) or XQuery; or according to their performance focus, such as maximum scale or maximum operating speed; or according to other classification schemes. Regardless of the classification used, some DBMSs can span categories, for example, supporting multiple query languages simultaneously.
The embodiments of the present application may also be implemented through cloud technology. Cloud technology (Cloud Technology) is a general term for the network technology, information technology, integration technology, management platform technology, application technology, and the like, applied under the cloud computing business model; it can form a resource pool that is used on demand and is flexible and convenient. Cloud computing technology will become an important support. The background services of technical network systems require a large amount of computing and storage resources, such as video websites, picture websites, and more portal websites. With the advanced development and application of the internet industry and the demands of search services, social networks, mobile commerce, open collaboration, and the like, each item may have its own hash-code identification mark, which needs to be transmitted to a background system for logical processing; data of different levels are processed separately, and all kinds of industry data require strong system backing support, which can only be realized through cloud computing.
Embodiments of the present application may be implemented with a large language model (Large Language Model, LLM), an artificial intelligence model intended to understand and generate human language. Large language models are trained on large amounts of text data and can perform a wide range of tasks, including text summarization, translation, sentiment analysis, and so on. They are characterized by their large scale, containing billions of parameters that help them learn complex patterns in language data. These models are typically based on deep learning architectures, such as the Transformer, which helps them achieve impressive performance on various natural language processing tasks.
In some embodiments, the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligence platforms, and the like. The electronic device may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiments of the present application.
In some embodiments, the method for processing multi-mode data provided in the embodiments of the present application may be applied in a game scenario, and may be implemented in cooperation with a terminal device and a server. The user in the game virtual scene may be a player, the first virtual object being a player controlled virtual object, and the second virtual object being a non-player character. The player can control the first virtual object to talk with the second virtual object, and the second virtual object calls the processing method of the multi-mode data provided by the embodiment of the application to feed back reply contents to the player.
Aiming at the scheme of collaborative implementation of terminal equipment and a server, two game modes, namely a local game mode and a cloud game mode, are mainly involved, wherein the local game mode refers to that the terminal equipment and the server cooperatively run game processing logic, an operation instruction input by a player in the terminal equipment is partially processed by the game logic run by the terminal equipment, the other part is processed by the game logic run by the server, and the game logic process run by the server is more complex and consumes more calculation power; the cloud game mode is that a server runs game logic processing, and a cloud server renders game scene data into audio and video streams and transmits the audio and video streams to a terminal device for display. The terminal device only needs to have the basic streaming media playing capability and the capability of acquiring the operation instruction of the player and sending the operation instruction to the server.
The processing method of multi-mode data provided by the embodiment of the application is suitable for an application mode of completing virtual scene calculation depending on the calculation capability of the server 200 and outputting a virtual scene at the terminal device 400.
Taking the formation of the visual perception of the virtual scene as an example, the server 200 computes the display data related to the virtual scene (such as scene data) and sends it to the terminal device 400 through the network 300; the terminal device 400 relies on graphics computing hardware to load, parse and render the computed display data, and relies on graphics output hardware to output the virtual scene and form visual perception, for example, a two-dimensional video frame may be presented on the display screen of a smartphone, or a video frame realizing a three-dimensional display effect may be projected onto the lenses of augmented reality/virtual reality glasses. As for other forms of perception of the virtual scene, it can be understood that auditory perception may be formed by means of the corresponding hardware output of the terminal device 400, for example using a microphone, and tactile perception may be formed using a vibrator, and so on.
Taking a computer program as an example of an application program, in actual implementation, the terminal device 400 installs and runs an application program supporting a virtual scene. The application may be any one of a First person shooter game (FPS), a third person shooter game, a virtual reality application, a three-dimensional map program, or a multiplayer game. The user uses the terminal device 400 to operate a virtual object located in a virtual scene to perform activities including, but not limited to: at least one of body posture adjustment, crawling, walking, running, riding, jumping, driving, picking up, shooting, attacking, throwing, building a virtual building. Illustratively, the virtual object may be a virtual character, such as an emulated persona or a cartoon persona, or the like.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device provided in the embodiment of the present application, where the electronic device may be the server 200 or the terminal device 400 in fig. 1, and in the embodiment of the present application, the terminal device is taken as an example for explanation, and the terminal device 400 shown in fig. 2 includes: at least one processor 410, a memory 450, at least one network interface 420, and a user interface 430. The various components in terminal device 400 are coupled together by bus system 440. It is understood that the bus system 440 is used to enable connected communication between these components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled in fig. 2 as bus system 440.
The processor 410 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor (e.g., a microprocessor or any conventional processor), a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable presentation of the media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
Memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 450 optionally includes one or more storage devices physically remote from processor 410.
Memory 450 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a random access Memory (RAM, random Access Memory). The memory 450 described in the embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 451 including system programs, e.g., framework layer, core library layer, driver layer, etc., for handling various basic system services and performing hardware-related tasks, for implementing various basic services and handling hardware-based tasks;
a network communication module 452 for accessing other electronic devices via one or more (wired or wireless) network interfaces 420, the exemplary network interface 420 comprising: bluetooth, wireless compatibility authentication (WiFi), and universal serial bus (USB, universal Serial Bus), etc.;
A presentation module 453 for enabling presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more output devices 431 (e.g., a display screen, speakers, etc.) associated with the user interface 430;
an input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided in the embodiments of the present application may be implemented in software. Fig. 2 shows a multi-modal data processing apparatus 455 stored in the memory 450, which may be software in the form of programs and plug-ins, and includes the following software modules: a data acquisition module 4551, a feature extraction module 4552, a causal conversion module 4553, and a generation module 4554. These modules are logical, and therefore may be arbitrarily combined or further split depending on the functions implemented.
In some embodiments, the terminal or the server may implement the method for processing multi-mode data provided by the embodiments of the present application by running a computer program. For example, the computer program may be a native program or a software module in an operating system; a local (Native) Application program (APP), i.e. a program that needs to be installed in an operating system to run, such as a question-answering APP or an instant messaging APP; the method can also be an applet, namely a program which can be run only by being downloaded into a browser environment; but also an applet that can be embedded in any APP. In general, the computer programs described above may be any form of application, module or plug-in.
The method for processing multi-mode data provided by the embodiment of the present application will be described with reference to exemplary applications and implementations of the terminal provided by the embodiment of the present application.
In the following, the method for processing multimodal data provided in the embodiment of the present application is described, and as described above, the electronic device implementing the method for processing multimodal data in the embodiment of the present application may be a terminal or a server, or a combination of both. The execution subject of the respective steps will not be repeated hereinafter.
It should be noted that, in the following examples of processing multi-modal data, the multi-modal data includes text and image, and those skilled in the art may apply the processing method of multi-modal data provided in the embodiments of the present application to processing data including other modalities, for example: audio, video, etc.
Referring to fig. 3A, fig. 3A is a flowchart of a method for processing multi-mode data according to an embodiment of the present application, and will be described with reference to the steps shown in fig. 3A.
In step 301, multimodal data included in input content is acquired.
For example, the multimodal data includes at least two different modalities of data. The modality is a source or form of each information, and types of modalities include: images, text, audio, and video.
The input content is input by the user through the terminal device. The multi-modal data processing method provided in the embodiments of the present application is described by taking the case where the server and the terminal device cooperatively execute the method as an example. If the multi-modal data processing method provided in the embodiments of the present application is applied to a chat application between a user and a virtual object, the user inputs, through the terminal device, input content including multi-modal data, for example, text and a meme image. The terminal device sends the text and the meme image to the server for processing, so that the server feeds back the reply content of the virtual object.
In step 302, feature extraction is performed based on the multi-modal data to obtain a sub-modal feature corresponding to each modality.
For example, different sub-modal features have different dimensions. For instance, the sub-modal feature of the image modality is a two-dimensional feature composed of the color information corresponding to each pixel in the image, while the sub-modal feature of the text modality is a one-dimensional feature, i.e., a token sequence obtained by tokenizing (token) the text.
In some embodiments, the multi-modal data includes at least image data and text data; step 302 may be implemented as follows: encoding the image data to obtain image features, which serve as the sub-modal feature of the image modality; performing word segmentation on the text data to obtain a plurality of first segmented words; and tokenizing the plurality of first segmented words to obtain a text identifier sequence of the text data, which serves as the sub-modal feature of the text modality.
For example, image feature extraction may be implemented by a convolutional neural network, e.g., extracting the pixel value of each pixel in the image and converting all pixel values of the image into image features in the form of a two-dimensional matrix. For text data, feature extraction can be implemented by a Transformer model: the text is segmented into words, each segmented word is converted into a corresponding identifier according to a vocabulary that stores the mapping between segmented words and identifiers, and the identifiers of the segmented words are combined in their order in the text to obtain the sub-modal feature.
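To illustrate step 302, a minimal Python sketch of per-modality feature extraction is given below; the toy convolutional encoder, the whitespace tokenizer and the small vocabulary are assumptions made only for the example and are not part of the embodiment.

# Illustrative sketch of step 302: per-modality feature extraction.
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Toy convolutional encoder: maps an RGB image to a 2-D feature map."""
    def __init__(self, out_channels=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_channels, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, image):            # image: (B, 3, H, W)
        return self.conv(image)          # (B, C, H/4, W/4), the 2-D sub-modal feature

def tokenize(text, vocab):
    """Toy whitespace tokenizer: maps text to a 1-D identifier sequence."""
    return torch.tensor([vocab.get(w, vocab["<unk>"]) for w in text.split()])

vocab = {"<unk>": 0, "what": 1, "is": 2, "in": 3, "the": 4, "picture": 5}
image_feat = ImageEncoder()(torch.randn(1, 3, 224, 224))   # sub-modal feature of the image modality
text_ids = tokenize("what is in the picture", vocab)        # sub-modal feature of the text modality
print(image_feat.shape, text_ids.shape)

The two sub-modal features have different dimensions, which is exactly the situation addressed by the causal relationship conversion in step 303.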
In some embodiments, the multi-modal data includes video data as well as text data; step 302 may be implemented as follows: performing framing on the video data to obtain the video frame images of the video data; encoding each video frame image to obtain the video frame feature of each video frame image, which serves as the sub-modal feature of the image modality; performing word segmentation on the text data to obtain a plurality of first segmented words; and tokenizing the plurality of first segmented words to obtain a text identifier sequence of the text data, which serves as the sub-modal feature of the text modality.
For example, the framing may extract every video frame in the video and process all frame images of the video. To save computing resources, the video may also be divided into a plurality of segments, at least one video frame image extracted for each segment, and feature extraction performed on the extracted video frame images, as illustrated by the sketch after this paragraph group. For the manner of feature extraction, refer to the principle of image feature extraction above. The manner of text feature extraction is described above and is not repeated here.
In some embodiments, when the multi-modal data includes only video data, step 302 may be implemented as follows: performing framing on the video data to obtain the video frame images of the video data; encoding each video frame image to obtain the video frame feature of each video frame image, which serves as the sub-modal feature of the image modality; performing text recognition on each video frame image to obtain the first video text contained in each video frame image, and performing speech recognition on the audio of the video data to obtain the second video text contained in the audio of the video data; performing word segmentation on the first video text and the second video text respectively to obtain a plurality of second segmented words; and tokenizing the plurality of second segmented words to obtain a text identifier sequence of the text data, which serves as the sub-modal feature of the text modality.
For example, a user may input only video data as the content to be replied to, and the video may include multi-modal data such as images, text within the images, and audio carrying speech. The first video text may be the video subtitles, and the second video text may be the video dubbing text; if the content of the first video text is the same as that of the second video text, only one of them is processed. The manner of text feature extraction is described above and is not repeated here.
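As an illustration of the segment-based frame sampling mentioned above, a minimal Python sketch is given below; the fixed number of segments and the choice of the middle frame of each segment are assumptions for the example, not requirements of the embodiment.

# Illustrative sketch of sampling one frame per segment of a video (step 302 for
# video input); decoding the video into frames is assumed to be done elsewhere.
import torch

def sample_frames(video_frames, num_segments=8):
    """video_frames: tensor (T, 3, H, W); returns (num_segments, 3, H, W)."""
    t = video_frames.shape[0]
    # pick the middle frame of each of the num_segments equal-length segments
    idx = [min(t - 1, int((i + 0.5) * t / num_segments)) for i in range(num_segments)]
    return video_frames[idx]

frames = sample_frames(torch.randn(120, 3, 224, 224))
print(frames.shape)   # torch.Size([8, 3, 224, 224])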
In the embodiments of the present application, supporting data of different modalities as input content enriches the diversity of the dialogue between the user (through the terminal device) and the artificial intelligence, improves the richness of both the input content and the reply content, and improves user experience.
In step 303, causal relationship conversion is performed on the highest-dimensional sub-modal feature among the plurality of sub-modal features to obtain causal feature vectors.
Illustratively, the causal feature vectors have the same dimension as the other sub-modal features, which are the sub-modal features other than the highest-dimensional one. Taking text and image as the input modalities as an example, the sub-modal feature of the image modality has a higher dimension than the other sub-modal features, so causal relationship conversion is performed on the sub-modal feature of the image modality. Causal relationship conversion means acquiring the correlation information between this sub-modal feature and the features of the other modalities and characterizing the conversion result as a sequence, so as to align the features.
In some embodiments, the highest dimensional sub-modal feature is an image feature; referring to fig. 3B, fig. 3B is a flow chart of a method for processing multi-mode data according to an embodiment of the present application; step 303 may be implemented by steps 3031 through 3032 of fig. 3B, as described in detail below.
In step 3031, a randomly initialized pre-configured embedded vector is obtained.
Illustratively, the pre-configured embedding vector may be a latent hidden-layer output, for example of a Transformer model. The pre-configured embedding vectors may be characterized as {e1, e2, ..., eN}, where N is a positive integer.
In step 3032, attention-based conversion is performed on the highest-dimensional sub-modal feature and the pre-configured embedding vectors to obtain at least one causal feature vector.
Here, the number of causal feature vectors is the same as the number of pre-configured embedding vectors, and the dimension of the causal feature vectors is the same as the dimension of the other sub-modal features.
For example, assume that the input data includes text and images, so that the highest-dimensional sub-modal feature is an image feature. The sub-modal feature of the two-dimensional image modality is converted into a one-dimensional causal sequence in the latent space Z. For example, for an image I and the sub-modal feature g(I) of the image modality, a randomly initialized embedding sequence {e1, e2, ..., eN} is taken as the pre-configured embedding vectors, and N embedding vectors {z1, z2, ..., zN}, i.e., the plurality of causal feature vectors, are output. This process can be characterized by the following formula (1):
{z1,z2,...,zN}=CausalTransformer(g(I),{e1,e2,...,eN}) (1)
where CausalTransformer(·) denotes the processing function of the causal converter.
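As an illustration of formula (1), the following Python sketch builds a causal converter from N randomly initialized query embeddings that attend to the flattened image feature g(I) through cross-attention; the layer sizes, the single-layer depth and the use of nn.MultiheadAttention are assumptions made for the example, not the claimed structure.

# Hedged sketch of formula (1): N query embeddings {e1..eN} attend to g(I)
# and yield N causal vectors {z1..zN} whose dimension matches the text features.
import torch
import torch.nn as nn

class CausalTransformer(nn.Module):
    def __init__(self, num_queries=32, dim=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))   # {e1,...,eN}
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, image_feat):        # image_feat: (B, L, dim), flattened g(I)
        b = image_feat.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        z, _ = self.cross_attn(q, image_feat, image_feat)   # queries attend to g(I)
        return self.ffn(z)                # (B, N, dim): causal feature vectors {z1,...,zN}

g_i = torch.randn(1, 196, 768)            # e.g. a 14x14 patch grid flattened to 196 tokens
z = CausalTransformer()(g_i)
print(z.shape)                             # torch.Size([1, 32, 768])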
In some embodiments, step 3032 may be implemented as follows: determining self-attention-based query vectors based on the highest-dimensional sub-modal feature and the pre-configured embedding vectors; determining at least one cross-attention-based key vector and at least one value vector based on the highest-dimensional sub-modal feature; generating an attention matrix based on the query vectors and the at least one key vector; and normalizing the attention matrix, and taking the product of the normalization result and each value vector as a causal feature vector.
For example, in the cross-attention mechanism, given the highest-dimensional sub-modal feature sequence S1 and the pre-configured embedding sequence S2: the key vectors and value vectors are computed from sequence S1; the query vectors are computed from sequence S2; the query matrix and the key matrix are multiplied to obtain the attention matrix; the attention matrix is applied to the value vectors; and the output sequence S3, which has the same size and length as sequence S2, constitutes the causal feature vectors. The normalization may be implemented by a softmax function, followed by, for example, a feed-forward mechanism (Feed Forward). The attention mechanism can be characterized by the following formula (2):
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V (2)
where softmax is the normalization function that makes the weight probability distribution sum to 1, QK^T is the raw attention score obtained from the dot product of the query matrix Q and the key matrix K, and sqrt(d_k) is the scaling factor.
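The following Python sketch illustrates formula (2) with plain tensors; the shapes are chosen only for demonstration.

# Minimal sketch of formula (2): softmax(QK^T / sqrt(d_k)) V.
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # raw attention scores QK^T / sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)              # normalise so each row sums to 1
    return weights @ V

Q = torch.randn(32, 768)    # queries from the pre-configured embeddings
K = torch.randn(196, 768)   # keys from the highest-dimensional sub-modal feature
V = torch.randn(196, 768)   # values from the highest-dimensional sub-modal feature
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)            # torch.Size([32, 768])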
In the embodiments of the present application, the causal dependency among the input features is acquired by invoking the attention mechanism, so as to further align the vision and language modalities, which can improve the accuracy of the generated reply content. Converting high-dimensional modal data into a low-dimensional causal relationship sequence saves the computing resources required for subsequent processing and can improve the efficiency of outputting the reply content.
With continued reference to FIG. 3A, in step 304, the causal feature vector is spliced with other sub-modal features to obtain a spliced feature sequence.
For example, the other sub-modal features are also characterized in feature-vector form; each feature vector is taken as one dimension, and the feature vectors are combined into a sequence in turn, thereby obtaining the spliced feature sequence. In the spliced feature sequence, the markers of the marker types corresponding to different sub-modal features are different.
In some embodiments, referring to fig. 3C, fig. 3C is a flow chart of a method for processing multi-modal data provided in an embodiment of the present application; step 304 may be implemented by steps 3041 through 3043 of fig. 3C, as described in detail below.
In step 3041, a modal start marker and a modal end marker corresponding to the causal feature vector are obtained.
For example, assuming the modality corresponding to the causal feature vector is the image modality, the modality start marker of the image modality is [IMG], and the modality end marker is [/IMG].
In step 3042, a modal start marker is added to the head of the causal feature vector and a modal end marker is added to the tail of the causal feature vector to form the sub-modal feature to be spliced.
Continuing the above example: marking the causal feature vector Zi of an image yields [IMG] Zi [/IMG], which is the sub-modal feature to be spliced.
In step 3043, the sub-mode features to be spliced and other sub-mode features are connected end to end in sequence to obtain a spliced feature sequence.
For example, the head of the sub-modal feature to be spliced carries a start marker, and its tail carries an end marker; the head and tail of each sub-modal feature are likewise labeled with the corresponding start and end markers. After the start and end markers are added to each sub-modal feature, the splicing is performed; the start and end positions and the type of each sub-modal feature can be determined from the markers corresponding to the feature vectors of each dimension in the resulting spliced feature sequence. Continuing from the example above, the sub-modal feature [IMG] Zi [/IMG] is spliced with the features of the other modalities. The beginning of the spliced feature sequence may also be marked by <s> and its end by </s>, to facilitate subsequent processing.
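A minimal Python sketch of steps 3041 to 3043 is given below; representing the markers as learned embeddings, and the specific sequence lengths, are assumptions for the example.

# Hedged sketch of splicing: wrap the causal vectors with [IMG]/[/IMG] marker
# embeddings and concatenate them with the text sub-modal features into one
# spliced sequence bracketed by <s> and </s>.
import torch
import torch.nn as nn

dim = 768
markers = nn.Embedding(4, dim)            # 0:<s>, 1:</s>, 2:[IMG], 3:[/IMG]
BOS, EOS, IMG, IMG_END = (markers(torch.tensor(i)).unsqueeze(0) for i in range(4))

causal_vecs = torch.randn(32, dim)        # {z1,...,zN} from the causal converter
text_feats = torch.randn(12, dim)         # embedded text identifier sequence

spliced = torch.cat([BOS, IMG, causal_vecs, IMG_END, text_feats, EOS], dim=0)
print(spliced.shape)                       # torch.Size([48, 768])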
With continued reference to fig. 3A, in step 305, reply content to the input content is generated based on the splice feature sequence.
For example, the modality of the reply content may be the same as the input content or the same as one of the modalities of the input content. For example, the input content is a meme image and the reply content is an image; or the input content is text and the reply content is an image plus text. The spliced feature sequence is processed by a stable diffusion model or a Transformer model to obtain the reply content.
Referring to fig. 7, fig. 7 is a schematic diagram of a human-computer interaction interface provided in an embodiment of the present application. The human-computer interaction interface may be a chat interface 700 in the terminal device, in which a user may chat with a virtual object controlled by artificial intelligence, and the user may input data of different modalities through an input control 705. For example, the user inputs the text 702 "what is in the picture?" and the input image 701; the terminal device invokes the multi-modal data processing method provided in the embodiments of the present application and, based on the input text 702 and the input image 701, generates the reply content: the output text 703 and the output image 704. The content of the output text 703 is "a pizza shaped cat", and the output image 704 is a pizza menu picture generated from the input text 702 and the input image 701, including a description of the pizza.
In some embodiments, the multimodal data includes image data and text data, the type of reply content being an image; step 305 may be implemented by: decoding the spliced characteristic sequences to obtain decoded images; and performing stable diffusion processing on the decoded image to obtain image reply content of the input content.
By way of example, the stable diffusion process may be implemented by a stable diffusion model. Referring to fig. 4, fig. 4 is a schematic structural diagram of a multi-modal processing model provided in an embodiment of the present application, where a multi-modal data processing method is implemented by using a multi-modal processing model 401, where the multi-modal processing model includes: an encoder 402, a causal converter 403, and a decoder 404; the encoder 402 is for performing a feature extraction process, the causal converter 403 is for performing causal conversion and feature stitching, and the decoder 404 is for generating a reply content to the input content based on the stitched feature sequence.
The decoder 404 may be a stable diffusion model that adds noise to an image, performs forward propagation processing, and obtains other images different from the input image to obtain image reply content of the input content.
In some embodiments, the multi-modal data includes image data and text data, and the type of the reply content is text; step 305 may be implemented as follows: decoding the spliced feature sequence to obtain a discrete text token sequence; and mapping the discrete text token sequence to obtain the text reply content for the input content.
Illustratively, the decoding is implemented by the decoder above, which invokes the language-modelling capability of natural language processing, predicts the discrete text token sequence of the following text based on the spliced feature sequence, and maps the discrete text token sequence from tokens to text through the vocabulary, thereby obtaining the reply content.
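The following Python sketch illustrates the mapping from a discrete text token sequence back to text through a vocabulary; the token ids and the vocabulary are made up for the example and match the reply shown in fig. 7.

# Illustrative sketch of the text branch of step 305: the decoder emits a
# discrete text token sequence, which is mapped back to a string through the vocabulary.
id_to_word = {0: "<s>", 1: "a", 2: "pizza", 3: "shaped", 4: "cat", 5: "</s>"}

def detokenize(token_ids):
    words = [id_to_word[i] for i in token_ids if id_to_word[i] not in ("<s>", "</s>")]
    return " ".join(words)

print(detokenize([0, 1, 2, 3, 4, 5]))   # "a pizza shaped cat"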
According to the embodiment of the application, the multi-mode data can be replied based on the multi-mode data, so that the richness of replied contents is improved, and the contents replied by the artificial intelligence are more vivid and accurate in an artificial intelligence question-answering scene.
In some embodiments, before step 301 or after step 305, the multi-modal processing model may also be trained: sample text data and sample image data are obtained, wherein the feature dimension of the sample image data is higher than that of the sample text data; and the multi-modal processing model is invoked to perform joint training based on the sample text data and the sample image data, yielding a trained multi-modal processing model, which is used to generate reply content of at least one modality based on the input multi-modal data.
For example, the joint training in the embodiments of the present application may be self-supervised or unsupervised. The training targets of the joint training include: the multi-modal processing model can be used to classify the sub-modal features of the text data; and the multi-modal processing model can be used to generate an image based on the spliced feature sequence.
In the embodiments of the present application, the training targets of the joint training reinforce each other; compared with the existing separately deployed understanding and generation tasks, the two tasks promote each other during training, which improves the final performance.
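As an illustration of the joint training objective described above, the following Python sketch combines a text language-modelling loss with an image-feature regression loss; the specific loss forms, the regression target and the weighting are assumptions made for the example, not the claimed training procedure.

# Hedged sketch of a joint objective: text prediction + image generation trained together.
import torch
import torch.nn.functional as F

def joint_loss(text_logits, text_targets, image_pred, image_target, alpha=1.0):
    # text branch: predict the next text token (classification over the vocabulary)
    lm_loss = F.cross_entropy(text_logits.reshape(-1, text_logits.shape[-1]),
                              text_targets.reshape(-1))
    # image branch: regress the target visual features used by the image decoder
    img_loss = F.mse_loss(image_pred, image_target)
    return lm_loss + alpha * img_loss

loss = joint_loss(torch.randn(2, 10, 30522), torch.randint(0, 30522, (2, 10)),
                  torch.randn(2, 32, 768), torch.randn(2, 32, 768))
print(loss.item())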
The embodiments of the present application allow the input content to be multi-modal data; compared with related-art schemes in which both the input content and the reply content are text, this improves the richness of the input content and of the reply content and improves user experience. Based on the multi-modal data, the features of different modalities are aligned through causal relationship conversion, which saves the computing resources required to generate reply content based on features of different modalities. Compared with schemes that directly process data of different dimensions separately, the complexity of the computation is reduced, and the correlation between input contents of different modalities is exploited, so that the correlation between the reply content and the input content can be improved, the reply content is more accurate, and users' requirements are satisfied. In an artificial-intelligence question-answering scenario, the content replied by the artificial intelligence is more vivid and accurate, the realism of the artificial intelligence is enhanced, and the user's experience of artificial-intelligence question answering is further improved.
Next, an exemplary application of the method for processing multimodal data according to the embodiment of the present application in an actual application scenario will be described.
Question-answering systems between virtual objects and users have become a development trend of artificial intelligence technology, but single-modality virtual object systems produce monotonous dialogue and a single form of interaction, making it difficult to raise users' interest. Multi-modal multi-task learning is a recent trend in the development of artificial intelligence technology: on the one hand, it can be regarded as a continuation of "stacking data" that exploits the cross-modal correlation of the data; on the other hand, the artificial intelligence defined under multi-modal multi-task learning is also a form closer to general artificial intelligence.
For processing of multi-modal data, the related art has the following ways:
1. Generating pictures based on discrete tokens. The BEiT series converts visual signals into discrete tokens (token) that can be pre-trained like language data; BEiT-3 is trained with a unified masked BERT-style training objective, i.e., part of the input is masked so that the model learns to repair the masked portion.
BEiT successfully applies the BERT training concept to image processing tasks, which is also a form of self-supervised training, and is therefore named the BERT pre-training model of the Vision Transformer. BEiT is also a pre-training algorithm based on the image modality alone. Its biggest innovation is combining a discrete variational autoencoder with BERT and proposing a masked image modeling (MIM) pre-training method based on visual tokens, which reduces excessive training of the model on image details and relations between adjacent pixels. These visual tokens are generated by the discrete variational autoencoder, which is equivalent to compressing the image into a low-dimensional semantic space.
2. Multi-modal data processing models. Flamingo is a model that can quickly adapt to new tasks using only a few annotated examples, and includes key architectural innovations: 1) connecting strong pre-trained vision and language models; 2) processing arbitrarily interleaved sequences of visual and text data; 3) seamlessly taking images or videos as input. Because of this flexibility, Flamingo models can be trained on large multi-modal web corpora containing arbitrarily interleaved text and images, which is critical to giving them in-context few-shot learning ability. Flamingo integrates pre-trained vision and language models, thereby integrating multi-modal capability, and was the first to realize multi-modal one-shot and few-shot learning. However, in Flamingo, the two modalities are pre-trained separately.
3. Multi-modal pre-training. With the growing influence and openness of large language models, some models integrate the two modalities on the basis of a large language model; for example, BLIP-2 utilizes a Q-Former to align visual and textual features. However, these methods generally take prediction of the next character (next token) as the training target and do not apply reasonable supervision to the data modality.
The existing separate understanding models and generation models suffer from high deployment cost and high training cost. The multi-modal data processing method provided by the embodiment of the application can support both understanding and generation functions, which greatly facilitates deployment of intelligent chat models; the two tasks promote each other in the training process, which improves the final performance. For visual tokenization (token) representations, existing methods based on the Vector Quantized Variational AutoEncoder (VQ-VAE) represent an image sequence (image sequence) as raster-ordered discrete tokens carrying low-level (low-level) pixel information for image generation; such low-level discretization has no image-understanding capability. In contrast, the visual tokens in the embodiment of the application carry semantic information, satisfy causal correlation (causal correlation), and are formally aligned with the language sequence (language sequence), so that the two modalities can be trained together.
The following explains the method for processing multi-mode data provided in the embodiment of the present application with reference to fig. 5, where a server is taken as an execution body, and fig. 5 is a schematic flowchart of an alternative method for processing multi-mode data provided in the embodiment of the present application.
In step 501, input multimodal data is received.
By way of example, the multimodal data may be entered by a user into a chat interface of the terminal device, and the multimodal data may include modalities such as images, text, video, audio, and the like.
In step 502, an encoder of a multi-modal processing model is called based on multi-modal data to perform feature extraction processing, so as to obtain a visual feature sequence and a text feature sequence.
The model used in the multi-modal data processing method is a multi-modal processing model, which is used to perceive input of multi-modal data and to generate output of different modalities. Referring to FIG. 6A, FIG. 6A is a schematic illustration of an alternative architecture of a multi-modal processing model provided by embodiments of the present application; the multi-modal processing model includes an encoder 610, a causal converter 620, a multi-modal modeling large language model 630, and a diffusion decoder 640. An autoregressive mode is used to unify the different modalities: the visual signal is first encoded by the encoder 610 into visual embedding vectors (embeddings), converted into a causal sequence by the causal converter 620, and combined with text tokens into an interleaved sequence.
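The data flow through these four components may be sketched as follows; all module stand-ins, dimensions and the bidirectional transformer used here are illustrative assumptions rather than the concrete architecture of FIG. 6A:

```python
import torch
import torch.nn as nn

class MultiModalPipelineSketch(nn.Module):
    """Illustrative data flow: encoder -> causal converter -> multimodal LLM -> output head."""
    def __init__(self, d_vis=256, d_model=512, n_causal=8, vocab=1000):
        super().__init__()
        self.encoder = nn.Linear(3 * 32 * 32, d_vis)                   # stand-in for the visual encoder 610
        self.causal_converter = nn.Linear(d_vis, n_causal * d_model)   # stand-in for the causal converter 620
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)          # stand-in for the language model 630
        self.lm_head = nn.Linear(d_model, vocab)                       # text output; the image decoder 640 is omitted
        self.n_causal, self.d_model = n_causal, d_model

    def forward(self, image, text_embeddings):
        vis = self.encoder(image.flatten(1))                           # visual embedding vector
        causal = self.causal_converter(vis).view(-1, self.n_causal, self.d_model)  # causal sequence
        seq = torch.cat([causal, text_embeddings], dim=1)              # combine with text embeddings into one sequence
        hidden = self.llm(seq)                                         # the real model is autoregressive / causal-masked
        return self.lm_head(hidden)

logits = MultiModalPipelineSketch()(torch.randn(1, 3, 32, 32), torch.randn(1, 6, 512))
```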
By way of example, the encoder may be a visual pre-training model, such as a contrastive language-image pre-training model for extracting image features, which can use masked reconstruction as a pre-training target, effectively scaling the model and improving performance on various visual tasks.
In the embodiment of the application, an integrated multi-modal understanding and generation model is used; compared with existing separate understanding models and generation models, it supports both understanding and generation functions, which greatly facilitates deployment of intelligent chat models.
In step 503, a causal converter of the multi-modal processing model is invoked based on the visual feature sequence to obtain a causal relationship sequence, and the text feature sequence and the causal relationship sequence are spliced to obtain a spliced embedded feature.
Illustratively, after the image features (i.e., the visual feature sequence) are obtained, they are input to the causal converter to obtain visual causal embedding vectors (visual causal embeddings), so that the levels of abstraction of the image and of natural language are aligned. The visual causal embedding is a fixed-length sequence of N embedding vectors, which can be characterized as a causal sequence.
Referring to FIG. 6B, FIG. 6B is a schematic diagram of a causal converter provided by an embodiment of the present application; the causal converter consists of a self-attention mechanism (Self-Attention), a cross-attention mechanism (Cross-Attention) and a feed-forward network (Feed Forward). The embodiment of the application uses the self-attention mechanism to capture causal dependencies between the input latent embedding vectors (input latent embeddings) in order to further align the visual and linguistic modalities. The cross-attention mechanism aggregates visual information extracted from the contrastive language-image pre-training model, where the visual embedding vectors are treated as key vectors K and value vectors V, and the output of the self-attention mechanism is used as the query vector Q.
The processing of the visual embedding vectors by the causal converter may be achieved as follows: the 2D spatial visual signal is converted into a 1D causal sequence in the latent space Z using the causal converter. For example, the causal converter takes a randomly initialized embedding sequence {e1, e2, ..., eN} as input and outputs N embedding vectors {z1, z2, ..., zN} (i.e., the causal feature vectors above), which determine the causal dependencies of the image I. The above process can be characterized by the following equation (1):
{z1, z2, ..., zN} = CausalTransformer(g(I), {e1, e2, ..., eN})    (1)
where CausalTransformer(·) denotes the processing function of the causal converter, and g(I) denotes the visual features extracted from the image I by the encoder.
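A minimal sketch of equation (1), assuming single-head attention and omitting residual connections and normalization for brevity (the module and parameter names are illustrative, not the concrete causal converter 620):

```python
import torch
import torch.nn as nn

class CausalConverterSketch(nn.Module):
    """Converts visual features g(I) into N causal embedding vectors {z1, ..., zN}."""
    def __init__(self, d=512, n_queries=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d))        # randomly initialized {e1, ..., eN}
        self.self_attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, visual_feats):                                   # visual_feats: [B, L, d] = g(I)
        q = self.queries.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
        # self-attention captures dependencies among the query embeddings (a causal mask could be applied here)
        q, _ = self.self_attn(q, q, q)
        # cross-attention: queries attend to the visual features, which serve as keys and values
        z, _ = self.cross_attn(q, visual_feats, visual_feats)
        return self.ffn(z)                                             # {z1, ..., zN}: [B, N, d]

z = CausalConverterSketch()(torch.randn(2, 49, 512))
```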
Taking video data as an example, a video has T frames, and T×N visual causal embedding vectors are obtained through the conversion process of the causal converter. Two special image markers [IMG] and [/IMG] are then added to each image or frame to indicate the beginning and end of the encoded image/frame embedding. The visual causal embedding vectors are combined with the text feature sequence to form a multimodal sequence, and the <s> and </s> tags are appended to the beginning and end of each sequence. The sequence is sent to the multi-modal modeling large language model for unified autoregressive modeling.
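At the token level, the assembly of one such interleaved multimodal sequence may be sketched as follows; the placeholder strings standing in for embeddings are hypothetical, and only the marker conventions mirror the description above:

```python
from typing import List

IMG, IMG_END, BOS, EOS = "[IMG]", "[/IMG]", "<s>", "</s>"

def build_multimodal_sequence(frame_embeddings: List[List[str]], text_tokens: List[str]) -> List[str]:
    """frame_embeddings: per-frame lists of N visual causal embedding placeholders;
    text_tokens: tokenized text. Returns the interleaved sequence fed to the language model."""
    seq = [BOS]
    for frame in frame_embeddings:            # T frames -> T * N visual causal embeddings in total
        seq += [IMG] + frame + [IMG_END]      # mark the beginning and end of each encoded frame
    seq += text_tokens
    seq.append(EOS)
    return seq

# toy example: 2 frames with N = 3 causal embeddings each, plus a short text
print(build_multimodal_sequence([["z1", "z2", "z3"], ["z1", "z2", "z3"]], ["what", "is", "this", "?"]))
```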
In step 504, when the preset output content is text, text reply content is generated based on the splice embedding feature.
The model provided by the embodiment of the application can perform unified autoregressive modeling of different modalities, so it has strong multi-modal understanding capability, can accept multi-modal sequences as input, and outputs signals across the visual and text modalities. For example, given a multi-modal context, when the intended output format is text, the model generates discrete text tokens using a language modeling head and then converts the discrete text tokens into text reply content.
In step 505, when the preset output content is an image, a decoder of the multi-modal processing model is invoked based on the spliced embedded feature to generate image reply content.
Illustratively, embodiments of the present application use a latent diffusion model (Latent Diffusion Model) to decode the visual embedding vectors into an image and initialize it with the weights of Stable Diffusion. Specifically, the embodiment of the application inputs the N visual embedding vectors into the diffusion model as the basis for image decoding. For example, when two image-text pairs of the same task are used as prompts, the multi-modal processing model can automatically infer and complete the corresponding task given a new input. When the desired output is an image, the embodiment of the application appends an [IMG] token at the end of the input sequence; the multi-modal processing model autoregressively processes the input sequence to generate N visual embedding vectors and invokes the visual decoder to decode the visual embedding vectors into a real-world image.
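A shape-level sketch of this image-generation path is given below; the decoder stub is a stand-in rather than the fine-tuned latent diffusion decoder itself, and all sizes and function names are assumptions for illustration:

```python
import torch
import torch.nn as nn

N, D = 8, 64                                     # assumed number and size of visual embeddings

class VisualDecoderStub(nn.Module):
    """Stand-in for the latent diffusion decoder: maps N visual embeddings to an RGB image."""
    def __init__(self, n=N, d=D, size=64):
        super().__init__()
        self.proj = nn.Linear(n * d, 3 * size * size)
        self.size = size

    def forward(self, visual_embeddings):        # [B, N, D], regressed by the language model
        x = self.proj(visual_embeddings.flatten(1))
        return x.view(-1, 3, self.size, self.size).sigmoid()   # pseudo "image" in [0, 1]

def generate_image_reply(llm_regress, prompt_states):
    # the model autoregressively produces N visual embeddings after the [IMG] token ...
    visual_embeddings = llm_regress(prompt_states)              # [B, N, D]
    # ... which the (here stubbed) diffusion decoder turns into an image
    return VisualDecoderStub()(visual_embeddings)

img = generate_image_reply(lambda h: torch.randn(1, N, D), prompt_states=None)
```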
In some embodiments, the training objectives of the multi-modal processing model in the embodiment of the application are: classifying the next text token (text token); and regressing the next visual embedding vector, then calling the fine-tuned Stable Diffusion decoder to generate a picture based on the regressed visual embedding vectors. Compared with understanding and generation tasks that exist separately, this multi-modal joint training method makes the two tasks promote each other in the training process, improving the final performance. Compared with the related art, the method of the embodiment of the application takes autoregressively aligned next-token prediction over the vision and text modalities as the training target; this unified training mode allows all modalities to be well fused, so that various downstream tasks can be supported.
In some embodiments, referring to fig. 7, fig. 7 is a schematic diagram of a human-computer interaction interface provided in an embodiment of the present application. The human-computer interaction interface may be a chat interface 700 in the terminal device, in which a user may chat with a virtual object controlled by artificial intelligence, and the user may input data of different modalities through an input control 705. For example, the user inputs the text 702 "What is in the picture?" and the input image 701; the terminal device invokes the multi-modal data processing method provided in the embodiment of the present application and, based on the input text 702 and the input image 701, generates the output text 703 and the output image 704. The content of the output text 703 is "a pizza-shaped cat", and the output image 704 is a pizza menu picture generated from the input text 702 and the input image 701, including a description of the pizza.
At present, demand for virtual objects is gradually increasing. The model applied in the multi-modal data processing method provided by the embodiment of the application can serve as an important base model for virtual objects; because it has both understanding and generation capabilities, it greatly enhances the liveliness of the virtual object's chat, and the integrated understanding-and-generation model greatly reduces training cost and deployment cost.
In the embodiment of the application, the multi-modal joint training mode improves training efficiency and reduces training cost and deployment cost, while providing both understanding and generation capabilities, thereby greatly enhancing the liveliness of chat between the virtual object and the user.
Continuing with the description below of an exemplary architecture of the multi-modal data processing apparatus 455 implemented as a software module provided by embodiments of the present application, in some embodiments, as shown in fig. 2, the software modules stored in the multi-modal data processing apparatus 455 of the memory 450 may include: the data acquisition module 4551 is configured to acquire multi-modal data included in the input content, where the multi-modal data includes data of at least two different modalities; the feature extraction module 4552 is configured to perform feature extraction processing based on the multi-modal data to obtain sub-mode features corresponding to each modality respectively, where dimensions of different sub-mode features are different; the causal conversion module 4553 is configured to perform causal relation conversion on the sub-mode feature with the highest dimension among the plurality of sub-mode features to obtain a causal feature vector, where the causal feature vector has the same dimension as the other sub-mode features, and the other sub-mode features are the sub-mode features except for the sub-mode feature with the highest dimension among the plurality of sub-mode features; the causal conversion module 4553 is further configured to splice the causal feature vector with the other sub-mode features to obtain a spliced feature sequence; and the generating module 4554 is configured to generate reply content of the input content based on the spliced feature sequence.
In some embodiments, the multimodal data includes at least image data and text data; the feature extraction module 4552 is configured to perform encoding processing on the image data to obtain image features, and take the image features as sub-mode features of an image mode; performing word segmentation processing on the text data to obtain a plurality of first word segments; and carrying out symbolization processing on the plurality of first segmentation words to obtain a text identifier sequence of the text data, and taking the text identifier sequence as a sub-mode characteristic of a text mode.
In some embodiments, the multimodal data includes video data as well as text data; the feature extraction module 4552 is configured to perform frame-division processing on the video data to obtain each video frame image of the video data; coding each video frame image to obtain video frame characteristics of each video frame image, wherein the video frame characteristics are used as sub-mode characteristics of an image mode; performing word segmentation processing on the text data to obtain a plurality of first word segments; and carrying out symbolization processing on the plurality of first segmentation words to obtain a text identifier sequence of the text data, and taking the text identifier sequence as a sub-mode characteristic of a text mode.
In some embodiments, the feature extraction module 4552 is configured to, when the multimodal data includes only video data, perform framing processing on the video data to obtain each video frame image of the video data; perform coding processing on each video frame image to obtain video frame features of each video frame image, where the video frame features are used as sub-mode features of the image modality; perform text recognition processing on each video frame image to obtain a first video text contained in each video frame image, and perform voice recognition processing on the audio of the video data to obtain a second video text contained in the audio of the video data; perform word segmentation processing on the first video text and the second video text respectively to obtain a plurality of second words; and perform symbolization processing on the plurality of second words to obtain a text identifier sequence of the text data, the text identifier sequence being used as a sub-mode feature of the text modality.
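This video-only branch may be sketched as follows; the framing, frame-encoding, text-recognition, speech-recognition and tokenization callables are hypothetical injected dependencies rather than concrete libraries:

```python
from typing import Callable, List, Tuple

def extract_video_submodal_features(
    video_path: str,
    split_frames: Callable[[str], List["Frame"]],         # framing: video -> frame images
    encode_frame: Callable[["Frame"], List[float]],        # image-modality sub-mode feature per frame
    ocr: Callable[["Frame"], str],                         # first video text (on-screen text)
    asr: Callable[[str], str],                             # second video text (from the audio track)
    tokenize: Callable[[str], List[int]],                  # word segmentation + symbolization
) -> Tuple[List[List[float]], List[int]]:
    frames = split_frames(video_path)
    frame_features = [encode_frame(f) for f in frames]     # sub-mode features of the image modality
    text = " ".join(ocr(f) for f in frames) + " " + asr(video_path)
    text_ids = tokenize(text)                              # sub-mode feature of the text modality
    return frame_features, text_ids

# toy usage with stand-in callables
frames, ids = extract_video_submodal_features(
    "demo.mp4",
    split_frames=lambda path: ["frame0", "frame1"],
    encode_frame=lambda f: [0.0, 1.0],
    ocr=lambda f: "on-screen text",
    asr=lambda path: "spoken text",
    tokenize=lambda s: [hash(w) % 1000 for w in s.split()],
)
```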
In some embodiments, the highest-dimensional sub-modal feature is an image feature; the causal conversion module 4553 is configured to obtain a randomly initialized preconfigured embedded vector; and perform attention-based conversion processing on the highest-dimensional sub-modal feature and the preconfigured embedded vector to obtain at least one causal feature vector, where the number of the causal feature vectors is the same as the number of dimensions of the preconfigured embedded vector, and the dimensions of the causal feature vectors are the same as the dimensions of the other sub-modal features.
In some embodiments, the causal conversion module 4553 is configured to determine a self-attention-based query vector based on the highest-dimensional sub-modal feature and the preconfigured embedded vector; determine, based on cross attention, at least one key vector and at least one value vector from the highest-dimensional sub-modal feature; generate an attention matrix based on the query vector and the at least one key vector; and normalize the attention matrix, and take the product between the normalization result and each value vector as a causal feature vector.
In some embodiments, the causal conversion module 4553 is configured to obtain a modal start marker and a modal end marker corresponding to the causal feature vector; add the modal start marker to the head of the causal feature vector and the modal end marker to the tail of the causal feature vector to form a sub-mode feature to be spliced; and connect the sub-mode feature to be spliced with the other sub-mode features sequentially end to end to obtain a spliced feature sequence, where the head of the spliced feature sequence is provided with a sequence start marker and the tail of the spliced feature sequence is provided with a sequence end marker.
In some embodiments, the multimodal data includes image data and text data, and the reply content is of the image type; the generating module 4554 is configured to decode the spliced feature sequence to obtain a decoded image; and perform stable diffusion processing on the decoded image to obtain image reply content of the input content.
In some embodiments, the multimodal data includes image data and text data, and the reply content is of the text type; the generating module 4554 is configured to decode the spliced feature sequence to obtain a discrete text token sequence; and map the discrete text token sequence to obtain text reply content of the input content.
In some embodiments, the method of processing multi-modal data is implemented by a multi-modal processing model, the multi-modal processing model comprising: an encoder, a causal converter, and a decoder; the encoder is configured to perform the feature extraction process, the causal converter is configured to perform the causal conversion and the feature stitching, and the decoder is configured to generate a reply content of the input content based on the stitched feature sequence.
In some embodiments, the data acquisition module is configured to acquire sample text data and sample image data before the acquiring the multimodal data included in the input content, wherein a feature dimension of the sample image data is higher than that of the sample text data; and calling the multi-modal processing model to perform joint training processing based on the sample text data and the sample image data to obtain a trained multi-modal processing model, wherein the trained multi-modal processing model is used for generating reply content of at least one mode based on the input multi-modal data.
In some embodiments, the training objectives of the joint training process include:
the multi-modal processing model is operable to classify sub-modal features of the text data; the multimodal processing model can be used to generate an image based on the stitched feature sequence.
Embodiments of the present application provide a computer program product comprising a computer program or computer-executable instructions stored in a computer-readable storage medium. The processor of the electronic device reads the computer executable instructions from the computer readable storage medium, and the processor executes the computer executable instructions, so that the electronic device executes the method for processing multi-mode data according to the embodiment of the application.
The embodiments of the present application provide a computer-readable storage medium storing computer-executable instructions or a computer program which, when executed by a processor, cause the processor to perform the method for processing multi-modal data provided by the embodiments of the present application, for example, the method for processing multi-modal data illustrated in fig. 3A.
In some embodiments, the computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; but may be a variety of devices including one or any combination of the above memories.
In some embodiments, computer-executable instructions may be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, in the form of programs, software modules, scripts, or code, and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, computer-executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, such as in one or more scripts in a hypertext markup language (HTML, hyper Text Markup Language) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one electronic device or on multiple electronic devices located at one site or, alternatively, on multiple electronic devices distributed across multiple sites and interconnected by a communication network.
In summary, the embodiment of the application allows the input content to be multi-modal data; compared with related-art schemes in which both the input content and the reply content are text, this improves the richness of the input content and of the reply content and improves the user experience. Features are extracted from the multi-modal data, and features of different modalities are aligned through causal relation conversion, which saves the computing resources required to generate reply content from features of different modalities; compared with schemes that directly generate data of different dimensions separately, the complexity of the computing process is reduced. Because the relevance between input contents of different modalities is exploited, the relevance between the reply content and the input content can be further improved, making the reply content more accurate and better matched to user requirements.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modifications, equivalent substitutions, improvements, etc. that are within the spirit and scope of the present application are intended to be included within the scope of the present application.

Claims (16)

1. A method for processing multi-modal data, the method comprising:
acquiring multi-mode data included in input content;
performing feature extraction processing based on the multi-mode data to obtain sub-mode features respectively corresponding to each mode, wherein the dimensions of different sub-mode features are different;
performing causal relation conversion on the sub-mode feature with the highest dimensionality in the sub-mode features to obtain a causal feature vector, wherein the causal feature vector is the same as the dimensionality of other sub-mode features, and the other sub-mode features are sub-mode features except for the sub-mode feature with the highest dimensionality in the sub-mode features;
splicing the causal feature vector with the other sub-mode features to obtain a spliced feature sequence;
and generating reply content of the input content based on the spliced characteristic sequence.
2. The method of claim 1, wherein the multimodal data includes at least image data and text data;
the feature extraction processing is performed based on the multi-mode data to obtain sub-mode features corresponding to each mode respectively, including:
coding the image data to obtain image features, wherein the image features are used as sub-mode features of an image mode;
performing word segmentation processing on the text data to obtain a plurality of first word segments;
and carrying out symbolization processing on the plurality of first segmentation words to obtain a text identifier sequence of the text data, and taking the text identifier sequence as a sub-mode characteristic of a text mode.
3. The method of claim 1, wherein the multimodal data includes video data and text data;
the feature extraction processing is performed based on the multi-mode data to obtain sub-mode features corresponding to each mode respectively, including:
carrying out framing treatment on the video data to obtain each video frame image of the video data;
coding each video frame image to obtain video frame characteristics of each video frame image, wherein the video frame characteristics are used as sub-mode characteristics of an image mode;
performing word segmentation processing on the text data to obtain a plurality of first word segments;
and carrying out symbolization processing on the plurality of first segmentation words to obtain a text identifier sequence of the text data, and taking the text identifier sequence as a sub-mode characteristic of a text mode.
4. The method according to claim 1, wherein when the multi-modality data includes only video data, the performing feature extraction processing based on the multi-modality data to obtain sub-modality features corresponding to each modality respectively includes:
carrying out framing treatment on the video data to obtain each video frame image of the video data;
coding each video frame image to obtain video frame characteristics of each video frame image, wherein the video frame characteristics are used as sub-mode characteristics of an image mode;
performing text recognition processing on each video frame image to obtain a first video text contained in each video frame image, and performing voice recognition processing on the audio of the video data to obtain a second video text contained in the audio of the video data;
word segmentation processing is respectively carried out on the first video text and the second video text to obtain a plurality of second words;
and carrying out symbolization processing on the plurality of second words to obtain a text identifier sequence of the text data, and taking the text identifier sequence as a sub-mode characteristic of the text mode.
5. The method of claim 1, wherein the highest dimensional sub-modal feature is an image feature;
the performing causal relation conversion on the highest-dimensional sub-modal feature among the plurality of sub-modal features to obtain a causal feature vector comprises:
obtaining a randomly initialized pre-configured embedded vector;
and performing attention-based conversion processing on the highest-dimensional sub-modal feature and the preconfigured embedded vector to obtain at least one causal feature vector, wherein the number of the causal feature vectors is the same as the number of dimensions of the preconfigured embedded vector, and the dimensions of the causal feature vectors are the same as the dimensions of the other sub-modal features.
6. The method of claim 5, wherein the performing attention-based conversion processing on the highest-dimensional sub-modal feature and the preconfigured embedded vector to obtain at least one causal feature vector comprises:
determining a self-attention based query vector based on the highest-dimensional sub-modal feature and the preconfigured embedded vector;
determining at least one key vector and at least one value vector based on cross attention based on the highest dimensional sub-mode feature;
generating an attention matrix based on the query vector and the at least one key vector;
and normalizing the attention matrix, and taking the product between the normalization result and each value vector as a causal feature vector.
7. The method according to any one of claims 1 to 6, wherein said concatenating the causal feature vector with the other sub-modal features to obtain a concatenated feature sequence comprises:
acquiring a modal starting mark and a modal ending mark corresponding to the causal feature vector;
adding the modal start marker to the head of the causal feature vector and the modal end marker to the tail of the causal feature vector to form sub-modal features to be spliced;
and sequentially connecting the sub-mode features to be spliced with other sub-mode features end to obtain a spliced feature sequence, wherein the head of the spliced feature sequence is provided with a sequence start mark, and the tail of the spliced feature sequence is provided with a sequence end mark.
8. The method according to any one of claims 1 to 6, wherein the multimodal data includes image data and text data, and the reply content is of the type of image; the generating reply content of the input content based on the splicing feature sequence comprises the following steps:
decoding the spliced feature sequence to obtain a decoded image;
and performing stable diffusion processing on the decoded image to obtain image reply content of the input content.
9. The method according to any one of claims 1 to 6, wherein the multimodal data includes image data and text data, and the reply content is of a type of text; the generating reply content of the input content based on the splicing feature sequence comprises the following steps:
decoding the spliced feature sequence to obtain a discrete text token sequence;
and mapping the discrete text token sequence to obtain text reply content of the input content.
10. The method according to any one of claims 1 to 6, wherein the method is implemented by a multi-modal processing model comprising: an encoder, a causal converter, and a decoder; the encoder is configured to perform the feature extraction process, the causal converter is configured to perform the causal conversion and the feature stitching, and the decoder is configured to generate a reply content of the input content based on the stitched feature sequence.
11. The method of claim 10, wherein prior to the obtaining the multimodal data included in the input content, the method further comprises:
acquiring sample text data and sample image data, wherein the characteristic dimension of the sample image data is higher than that of the sample text data;
and calling the multi-modal processing model to perform joint training processing based on the sample text data and the sample image data to obtain a trained multi-modal processing model, wherein the trained multi-modal processing model is used for generating reply content of at least one mode based on the input multi-modal data.
12. The method of claim 11, wherein the training objectives of the joint training process comprise:
the multi-modal processing model is operable to classify sub-modal features of the text data;
the multimodal processing model can be used to generate an image based on the stitched feature sequence.
13. A multi-modal data processing apparatus, the apparatus comprising:
the data acquisition module is configured to acquire multi-mode data included in the input content;
the feature extraction module is configured to perform feature extraction processing based on the multi-mode data to obtain sub-mode features corresponding to each mode respectively, wherein the dimensions of different sub-mode features are different;
The causal conversion module is configured to perform causal relation conversion on the sub-mode feature with the highest dimensionality in the plurality of sub-mode features to obtain a causal feature vector, wherein the causal feature vector has the same dimensionality as other sub-mode features, and the other sub-mode features are sub-mode features except the sub-mode feature with the highest dimensionality in the plurality of sub-mode features;
the causal conversion module is further configured to splice the causal feature vector with the other sub-mode features to obtain a spliced feature sequence;
and the generation module is configured to generate reply content of the input content based on the splicing characteristic sequence.
14. An electronic device, the electronic device comprising:
a memory for storing computer executable instructions or computer programs;
a processor for implementing the method of processing multimodal data according to any of claims 1 to 12 when executing computer executable instructions or computer programs stored in the memory.
15. A computer-readable storage medium storing computer-executable instructions or a computer program, wherein the computer-executable instructions or the computer program when executed by a processor implement the method of processing multimodal data as claimed in any one of claims 1 to 12.
16. A computer program product comprising computer executable instructions or a computer program which when executed by a processor implements the method of processing multimodal data as claimed in any of claims 1 to 12.
CN202311433898.7A 2023-10-30 2023-10-30 Multi-mode data processing method and device, electronic equipment and storage medium Pending CN117453880A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311433898.7A CN117453880A (en) 2023-10-30 2023-10-30 Multi-mode data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311433898.7A CN117453880A (en) 2023-10-30 2023-10-30 Multi-mode data processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117453880A true CN117453880A (en) 2024-01-26

Family

ID=89592530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311433898.7A Pending CN117453880A (en) 2023-10-30 2023-10-30 Multi-mode data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117453880A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118114727A (en) * 2024-04-29 2024-05-31 北京邮电大学 Multi-mode processing system and method for edge equipment


Similar Documents

Publication Publication Date Title
US11210836B2 (en) Applying artificial intelligence to generate motion information
CN111414506B (en) Emotion processing method and device based on artificial intelligence, electronic equipment and storage medium
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN108345612A (en) A kind of question processing method and device, a kind of device for issue handling
CN117453880A (en) Multi-mode data processing method and device, electronic equipment and storage medium
CN116958323A (en) Image generation method, device, electronic equipment, storage medium and program product
CN116611496A (en) Text-to-image generation model optimization method, device, equipment and storage medium
CN114359775A (en) Key frame detection method, device, equipment, storage medium and program product
Khurram et al. Dense-captionnet: a sentence generation architecture for fine-grained description of image semantics
CN116863003A (en) Video generation method, method and device for training video generation model
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution
CN116541492A (en) Data processing method and related equipment
CN116913278B (en) Voice processing method, device, equipment and storage medium
CN117011875A (en) Method, device, equipment, medium and program product for generating multimedia page
CN117540703A (en) Text generation method, model training method, device and electronic equipment
CN116966574A (en) Interaction processing method and device for non-player character, electronic equipment and storage medium
CN116361511A (en) Video retrieval method, device and equipment of composite semantics and storage medium
CN117877125B (en) Action recognition and model training method and device, electronic equipment and storage medium
WO2024066549A1 (en) Data processing method and related device
CN118152609A (en) Image generation method, device and computer equipment
Chen et al. A review of multimodal learning for text to images
CN117523046A (en) Method and device for generating mouth-shaped animation, electronic equipment and storage medium
CN114048757A (en) Sign language synthesis method and device, computer equipment and storage medium
CN117011429A (en) Virtual expression generating method and related device
CN117114008A (en) Semantic action matching method device, equipment and storage medium for virtual image

Legal Events

Date Code Title Description
PB01 Publication