CN116861363A - Multi-mode feature processing method and device, storage medium and electronic equipment - Google Patents

Multi-mode feature processing method and device, storage medium and electronic equipment

Info

Publication number
CN116861363A
Authority
CN
China
Prior art keywords: information, feature, mode, processed, initial
Prior art date
Legal status
Pending
Application number
CN202310854673.2A
Other languages
Chinese (zh)
Inventor
姚顺雨
Current Assignee
China Telecom Technology Innovation Center
China Telecom Corp Ltd
Original Assignee
China Telecom Technology Innovation Center
China Telecom Corp Ltd
Priority date
Filing date
Publication date
Application filed by China Telecom Technology Innovation Center and China Telecom Corp Ltd
Priority to CN202310854673.2A
Publication of CN116861363A

Classifications

    • G — PHYSICS; G06 — COMPUTING, CALCULATING OR COUNTING; G06F — ELECTRIC DIGITAL DATA PROCESSING; G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06F 18/253 — Pattern recognition; analysing; fusion techniques of extracted features
    • G06F 18/2415 — Pattern recognition; classification techniques relating to the classification model, based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/045 — Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N 3/0464 — Neural networks; convolutional networks [CNN, ConvNet]
    • G06N 3/047 — Neural networks; probabilistic or stochastic networks
    • G06N 3/08 — Neural networks; learning methods


Abstract

The disclosure provides a multi-modal feature processing method, a multi-modal feature processing device, a storage medium and an electronic device, and relates to the technical fields of artificial intelligence and multi-modal processing. The method comprises the following steps: acquiring first-modality information to be processed and second-modality information to be processed; extracting first-modality initial features from the first-modality information to be processed, and extracting second-modality initial features from the second-modality information to be processed; determining first query information according to the first-modality initial features, determining first key information and first value information according to the second-modality initial features, and processing them through an attention mechanism to obtain first intermediate features; determining second query information according to the second-modality initial features, determining second key information and second value information according to the first-modality initial features, and processing them through an attention mechanism to obtain second intermediate features; and fusing the first intermediate features and the second intermediate features to obtain multi-modal target features. The present disclosure enables high-quality fusion of multi-modal features.

Description

Multi-mode feature processing method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence and multi-modal technologies, and in particular, to a multi-modal feature processing method, a multi-modal feature processing apparatus, a computer readable storage medium, and an electronic device.
Background
With the development of artificial intelligence technology, more and more artificial intelligence tasks need to be implemented by using multi-modal information. For example, in an event detection task, using multi-modal information can provide more comprehensive information than using single-modal information, facilitating better implementation of event detection.
In the related art, when multi-modal information is adopted, it is difficult to fuse the features of the multiple modalities well. For example, fusion of two modalities is often achieved simply by concatenating the image features and the text features. Such fusion approaches fail to represent the correlation between the modalities, which degrades the results obtained when the multi-modal features are used.
Disclosure of Invention
The disclosure provides a multi-modal feature processing method, a multi-modal feature processing device, a computer readable storage medium and an electronic device, so as to solve the problem that the related technology is difficult to well fuse multi-modal features at least to a certain extent.
According to a first aspect of the present disclosure, there is provided a multi-modal feature processing method, including: acquiring information to be processed in a first mode and information to be processed in a second mode; extracting a first mode initial feature from the first mode information to be processed, and extracting a second mode initial feature from the second mode information to be processed; determining first query information according to the first mode initial feature, determining first key information and first value information according to the second mode initial feature, and processing the first query information, the first key information and the first value information through an attention mechanism to obtain a first intermediate feature; determining second query information according to the second mode initial characteristics, determining second key information and second value information according to the first mode initial characteristics, and processing the second query information, the second key information and the second value information through an attention mechanism to obtain second intermediate characteristics; and fusing the first intermediate feature and the second intermediate feature to obtain the multi-mode target feature.
According to a second aspect of the present disclosure, there is provided a multi-modal feature processing apparatus comprising: the information acquisition module is configured to acquire first-mode information to be processed and second-mode information to be processed; the initial feature extraction module is configured to extract first-mode initial features from the first-mode information to be processed and extract second-mode initial features from the second-mode information to be processed; the first attention processing module is configured to determine first query information according to the first modal initial characteristics, determine first key information and first value information according to the second modal initial characteristics, and process the first query information, the first key information and the first value information through an attention mechanism to obtain first intermediate characteristics; the second attention processing module is configured to determine second query information according to the second modal initial characteristics, determine second key information and second value information according to the first modal initial characteristics, and process the second query information, the second key information and the second value information through an attention mechanism to obtain second intermediate characteristics; and the intermediate feature fusion module is configured to fuse the first intermediate feature and the second intermediate feature to obtain a multi-mode target feature.
According to a third aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the multi-modal feature processing method of the first aspect described above and possible implementations thereof.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the multi-modal feature processing method of the first aspect described above and possible implementations thereof via execution of the executable instructions.
The technical scheme of the present disclosure has the following beneficial effects:
on the one hand, two attention mechanism processing paths are adopted. In the first path, first query information is determined according to the first-modality initial features, first key information and first value information are determined according to the second-modality initial features, and the first query information, the first key information and the first value information are processed through an attention mechanism to obtain first intermediate features. In the second path, second query information is determined according to the second-modality initial features, second key information and second value information are determined according to the first-modality initial features, and the second query information, the second key information and the second value information are processed through an attention mechanism to obtain second intermediate features. The first intermediate features mainly characterize the second-modality information under modality alignment, and the second intermediate features mainly characterize the first-modality information under modality alignment, so information aggregation is achieved by exploiting the correlation between the two modalities. Fusing the first intermediate features and the second intermediate features into the multi-modal target features then allows the information of the two modalities to complement each other, improving the quality of the multi-modal target features. On the other hand, the method is applicable to feature fusion of any two different modalities; the scheme is simple to implement and highly general.
Drawings
Fig. 1 shows a flowchart of a multi-modal feature processing method in the present exemplary embodiment.
Fig. 2 shows a flowchart of one way of deriving the first intermediate feature in the present exemplary embodiment.
Fig. 3 shows a flowchart for obtaining multi-modal target features in the present exemplary embodiment.
Fig. 4 shows a flowchart of another multi-modal feature processing method in the present exemplary embodiment.
Fig. 5 shows a schematic diagram of generating multi-modal target features in the present exemplary embodiment.
Fig. 6 shows a schematic diagram of model training in the present exemplary embodiment.
Fig. 7 shows a schematic structural diagram of a multi-modal feature processing apparatus in the present exemplary embodiment.
Fig. 8 shows a schematic structural diagram of an electronic device in the present exemplary embodiment.
Detailed Description
Exemplary embodiments of the present disclosure will be described more fully hereinafter with reference to the accompanying drawings.
The drawings are schematic illustrations of the present disclosure and are not necessarily drawn to scale. Some of the block diagrams shown in the figures may be functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software, or in hardware modules or integrated circuits, or in networks, processors or microcontrollers. Embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein. The described features, structures, or characteristics of the disclosure may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough description of embodiments of the present disclosure. However, it will be recognized by one skilled in the art that one or more of the specific details may be omitted, or other methods, components, devices, steps, etc. may be used instead of one or more of the specific details in implementing the aspects of the present disclosure.
Multi-modal information is increasingly being used in a variety of artificial intelligence tasks. Taking the event detection task as an example, the purpose of event detection is to determine whether a piece of information expresses the occurrence of a specific event and, if so, what category the event belongs to. If only single-modality information such as an image or a text is used, the amount of information is relatively limited, which affects the accuracy of the event detection results. If multi-modal information is used, for example an image and a text adopted together, the image and the text have correlation and complementarity: the image can supply information missing from the text, and the text can supply information missing from the image, so the content the model can learn is more comprehensive and sufficient, improving the accuracy of the event detection results.
Since information of different modalities lies in different dimensions or feature spaces, the different modalities need to be combined when multi-modal information is employed.
In one scheme of the related art, features of different modalities are concatenated; for example, features are extracted separately from the image and the text and then spliced together to achieve fusion of the features of the different modalities. However, this scheme does not unify the dimensions or feature spaces of the features of the different modalities, cannot characterize the correlation between the modalities, and makes no use of their complementarity, so the multi-modal features are not truly fused. This in turn affects subsequent processing results; for example, when event detection is performed based on the spliced features, the accuracy of the event detection results is low.
In another scheme of the related art, the features of different modalities are not directly fused. Instead, each modality's information is used independently to obtain a corresponding task processing result (such as an event detection result), and the task processing results of the different modalities are then combined, for example by voting, to determine the final processing result. This scheme does not consider the correlation between modalities and likewise makes no use of their complementarity, which affects the accuracy of the final processing result. Moreover, this approach does not produce a fused multi-modal feature, which limits its use in some task scenarios.
In view of one or more of the above problems, exemplary embodiments of the present disclosure provide a multi-modal feature processing method, which can well fuse multi-modal features and output high-quality multi-modal target features.
Fig. 1 shows an exemplary flow of a multi-modal feature processing method, which may include the following steps S110 to S150:
step S110, obtaining information to be processed in a first mode and information to be processed in a second mode;
step S120, extracting first mode initial characteristics from the first mode information to be processed, and extracting second mode initial characteristics from the second mode information to be processed;
Step S130, determining first query information according to the initial features of the first modality, determining first key information and first value information according to the initial features of the second modality, and processing the first query information, the first key information and the first value information through an attention mechanism to obtain first intermediate features;
step S140, determining second query information according to the initial features of the second modality, determining second key information and second value information according to the initial features of the first modality, and processing the second query information, the second key information and the second value information through an attention mechanism to obtain second intermediate features;
and step S150, fusing the first intermediate feature and the second intermediate feature to obtain the multi-mode target feature.
In the method shown in fig. 1, on the one hand, two attention mechanism processing paths are adopted. In the first path, first query information is determined according to the first-modality initial features, first key information and first value information are determined according to the second-modality initial features, and the first query information, the first key information and the first value information are processed through an attention mechanism to obtain first intermediate features. In the second path, second query information is determined according to the second-modality initial features, second key information and second value information are determined according to the first-modality initial features, and the second query information, the second key information and the second value information are processed through an attention mechanism to obtain second intermediate features. The first intermediate features mainly characterize the second-modality information under modality alignment, and the second intermediate features mainly characterize the first-modality information under modality alignment, so information aggregation is achieved by exploiting the correlation between the two modalities. Fusing the first intermediate features and the second intermediate features into the multi-modal target features then allows the information of the two modalities to complement each other, improving the quality of the multi-modal target features. On the other hand, the method is applicable to feature fusion of any two different modalities; the scheme is simple to implement and highly general.
Each step in fig. 1 is described in detail below.
Referring to fig. 1, in step S110, first-modality information to be processed and second-modality information to be processed are acquired.
The first-modality information to be processed and the second-modality information to be processed are information of different modalities. The first-modality information to be processed and the second-modality information to be processed can be any two different-modality information in text, images, audio and video. For example, the first modality of information to be processed may be text to be processed, and the second modality of information to be processed may be an image to be processed.
The first-modality information to be processed and the second-modality information to be processed may have a preset association relationship, and both may be used to express the same or related information content. In one embodiment, the first-modality information to be processed and the second-modality information to be processed come from the same source, for example from the same web page or the same topic on the Internet. For example, a message on a news or social platform (e.g., an article, a microblog, a WeChat Moments post, or a forum post) may include text and an image; the text may be obtained as the text to be processed, i.e. the first-modality information to be processed, and the image as the image to be processed, i.e. the second-modality information to be processed.
With continued reference to fig. 1, in step S120, a first modality initial feature is extracted for the first modality information to be processed, and a second modality initial feature is extracted for the second modality information to be processed.
The first-mode initial feature is a feature which is extracted from the information to be processed in the first mode independently and does not contain the feature of the information to be processed in the second mode. The initial characteristics of the second mode are the characteristics of the information to be processed of the second mode which are extracted independently and do not contain the characteristics of the information to be processed of the first mode. The feature extraction mode under various modes can be adopted to extract the initial features of the first mode and the initial features of the second mode respectively.
In one embodiment, the first modality information to be processed may be text to be processed, and the first modality initial feature may be a text feature. The extracting the first modality initial feature from the information to be processed in the first modality may include the following steps:
encoding words in the text to be processed to obtain word vectors, and fusing the word vectors to obtain text characteristics of the text to be processed.
Each word in the text to be processed may be encoded, or only the keywords in the text to be processed may be encoded. For example, the text to be processed is segmented into words, and each word is matched against a pre-configured lexicon, so that the words present in the lexicon are selected as keywords to be encoded. Alternatively, the text to be processed is segmented, words without substantive meaning (such as stop words) are deleted, and the remaining words are used as keywords and encoded. A word may be densely encoded using an embedding model or the like to obtain a word vector. In one embodiment, the BERT model (Bidirectional Encoder Representations from Transformers, a Transformer-based bidirectional semantic encoding representation model) may be employed to encode the words in the text to be processed. The BERT model guides each word to capture semantic and syntactic dependency information in its context, and can give the same word different representations in different contexts, so the encoded word vectors carry comprehensive information.
After the word vectors are obtained, they are fused into a text feature. The fusion manner may include, but is not limited to: splicing the word vectors into a higher-dimensional vector; splicing the word vectors into a matrix; or adding the word vectors, or performing fully connected processing on them. The text feature is obtained after fusion; the present disclosure does not limit its data form, which may for example be a vector or a matrix, depending on the specific manner in which the word vectors are fused.
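The text branch described above can be illustrated with a minimal sketch, assuming the HuggingFace transformers library and a Chinese BERT checkpoint; the model name, pooling choice and dimensions are illustrative assumptions rather than requirements of the disclosure.

```python
# Minimal sketch of the text branch: encode the words/tokens with BERT, then fuse
# the token vectors into one text feature. Checkpoint name and pooling are assumptions.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # hypothetical checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

def extract_text_feature(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = bert(**inputs)
    token_vectors = outputs.last_hidden_state        # (1, seq_len, 768) word/token vectors
    # One possible fusion: mean-pool the token vectors into a single text feature;
    # concatenation or a fully connected layer would equally fit the description above.
    return token_vectors.mean(dim=1)                 # (1, 768) text feature
```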
In one embodiment, the second modality information to be processed may be an image to be processed and the second modality initial feature may be an image feature. The extracting the initial characteristic of the second modality from the information to be processed of the second modality may include the following steps:
and processing the image to be processed by utilizing a pre-trained residual network to obtain the image characteristics of the image to be processed.
The residual network (ResNet) is a convolutional neural network with residual connections that can combine the features of different intermediate layers. In general, as network depth increases, the extracted image features become more abstract, favoring macro-level semantics while appearance and fine-grained information may be lost, which makes alignment (or matching) with features of other modalities such as text difficult. Through residual connections, features from different layers can be combined, improving the comprehensiveness of the features. Thus, using a residual network facilitates extracting high-quality image features from the image to be processed.
By way of example, a data set of image classification (or object detection, or other image processing task) may be employed to train a residual network for image classification. After training is completed, the residual network may be used to extract image features, specifically, the image to be processed may be input into the residual network, and features may be obtained from a specific middle layer (e.g., may be the last convolution layer, the last pooling layer, the first full connection layer, the last full connection layer, etc.) of the residual network as the image features of the image to be processed.
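A minimal sketch of this image branch is given below, assuming a torchvision ResNet-50 pre-trained on ImageNet and taking the output of the last pooling layer as the image feature; the backbone, weights and layer choice are assumptions for illustration only.

```python
# Minimal sketch of the image branch: a pre-trained ResNet with its classification
# head removed, so the last pooling layer's output serves as the image feature.
import torch
from torchvision import models, transforms
from PIL import Image

resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
resnet.fc = torch.nn.Identity()   # drop the classifier; keep features from the last pooling layer
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_image_feature(path: str) -> torch.Tensor:
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)
    with torch.no_grad():
        return resnet(image)   # (1, 2048) image feature
```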
The above description is directed to information of two modes, namely text and image, how to extract initial features. If information of other modes is adopted, the initial characteristics can be extracted by using a corresponding model. For example, if the information to be processed in the second mode is audio to be processed, an audio coding model or an audio feature extraction model may be used to extract audio features, that is, initial features of the second mode.
With continued reference to fig. 1, in step S130, first query information is determined according to the first modality initial feature, first key information and first value information are determined according to the second modality initial feature, and the first query information, the first key information and the first value information are processed through an attention mechanism, so as to obtain a first intermediate feature.
In step S140, second query information is determined according to the second mode initial feature, second key information and second value information are determined according to the first mode initial feature, and the second query information, the second key information and the second value information are processed through an attention mechanism to obtain a second intermediate feature.
Since the processing in steps S130 and S140 is similar, the two steps will be explained together. Query information (query), key information (key), value information (value) are three representations of information in the attention mechanism. The query information and the key information are used to calculate a correlation between the information, which can be used to further represent the value information, resulting in a feature based on the attention mechanism.
Take the case where the first-modality initial feature is a text feature and the second-modality initial feature is an image feature. In step S130, the first query information is determined according to the text feature, and the first key information and first value information are determined according to the image feature; information alignment between the image and text modalities can then be performed through the first query information and the first key information, which can be understood as aligning the image information with the text information as the reference. The result of this alignment is a set of parameters for re-characterizing the image information, which are then used to re-characterize the first value information. In essence, guided by the correlation between the image and the text, the features in the image are re-learned to obtain the first intermediate feature, which mainly characterizes the image information under modality alignment. Similarly, in step S140, the second query information is determined according to the image feature, and the second key information and second value information are determined according to the text feature; information alignment between the image and text modalities is performed through the second query information and the second key information, which can be understood as aligning the text information with the image information as the reference. The result of this alignment is a set of parameters for re-characterizing the text information, which are then used to re-characterize the second value information. In essence, guided by the correlation between the text and the image, the features in the text are re-learned to obtain the second intermediate feature, which mainly characterizes the text information under modality alignment.
Through the attention mechanism, when re-characterizing the first value information or the second value information, the encoding of the information at a given position is not restricted to that position: information at other positions (particularly positions with higher correlation) can also be learned, which ensures the comprehensiveness of the intermediate features. In addition, by drawing the query information, key information and value information from features of different modalities, the present exemplary embodiment better achieves complementation of information across modalities and improves the overall expressive power of the information.
In one embodiment, referring to fig. 2, the determining the first query information according to the first modality initial feature, determining the first key information and the first value information according to the second modality initial feature, and processing the first query information, the first key information, and the first value information through the attention mechanism to obtain the first intermediate feature may include the following steps S210 to S240:
step S210, multiplying the initial characteristic of the first mode with a first query weight parameter to obtain first query information;
step S220, multiplying the initial characteristic of the second mode by the first key weight parameter to obtain first key information, and multiplying the initial characteristic of the second mode by the first value weight parameter to obtain first value information;
Step S230, determining a first attention weight according to the first query information and the first key information;
step S240, the first value information is weighted based on the first attention weight, and a first intermediate feature is obtained.
The processing procedure of the attention mechanism can be implemented through an attention model, and the parameters of the model comprise a first query weight parameter, a first key weight parameter and a first value weight parameter, wherein all three parameters can be vectors or matrixes. Multiplying the initial features of the first modality by the first query weight parameters is equivalent to extracting features of the initial features of the first modality once by using the first query weight parameters to obtain first query information. Multiplying the initial features of the second modality by the first key weight parameters and multiplying the initial features by the first value weight parameters, and extracting features of the initial features of the second modality by the first key weight parameters and the first value weight parameters respectively to obtain first key information and first value information.
The first attention weight is determined according to the first query information and the first key information. For example, a similarity calculation can be performed between the first query information and the first key information to obtain the distribution of correlation between the two modalities' information, yielding the first attention weight. Finally, the first value information is weighted based on the first attention weight; for example, the first attention weight may be multiplied by the first value information to obtain the first intermediate feature.
The calculation process of fig. 2 can refer to the following formula:

M1 = Attention(Q1, K1, V1) = softmax(Q1·K1ᵀ / √d_K)·V1

wherein Q1, K1 and V1 respectively represent the first query information, the first key information and the first value information; d_K represents the dimension of the key information; Attention denotes the attention-mechanism operation; softmax denotes the normalized exponential function; and M1 represents the first intermediate feature.
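A hedged sketch of steps S210 to S240 is given below: learnable weight matrices produce Q from the first-modality feature and K, V from the second-modality feature, and scaled dot-product attention then yields the first intermediate feature M1. The feature dimensions are illustrative assumptions.

```python
# Sketch of steps S210-S240: cross-modal attention where the query comes from one
# modality and the key/value come from the other. Dimensions are assumptions.
import math
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, attn_dim=256):
        super().__init__()
        self.w_q = nn.Linear(text_dim, attn_dim, bias=False)   # first query weight parameter
        self.w_k = nn.Linear(image_dim, attn_dim, bias=False)  # first key weight parameter
        self.w_v = nn.Linear(image_dim, attn_dim, bias=False)  # first value weight parameter

    def forward(self, first_modal_feat, second_modal_feat):
        q = self.w_q(first_modal_feat)                          # S210: first query information
        k = self.w_k(second_modal_feat)                         # S220: first key information
        v = self.w_v(second_modal_feat)                         # S220: first value information
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
        attn = torch.softmax(scores, dim=-1)                    # S230: first attention weight
        return attn @ v                                         # S240: weighted value information = M1
```

The second intermediate feature can be obtained with the same sketch by swapping which modality supplies the query and which supplies the key and value.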
In one embodiment, a multi-head attention mechanism may also be used for feature aggregation to obtain the first intermediate feature. For example, in the method shown in fig. 2, multiple sets of first query weight parameters, first key weight parameters and first value weight parameters may be configured; each set correspondingly computes a set of first attention weights and weights the first value information, so that each set of parameters yields a candidate first intermediate feature. The candidate first intermediate features of all sets can then be fused, for example by concatenation followed by dimensionality reduction, to obtain the final first intermediate feature. The multi-head attention mechanism allows the attention model to learn and organize information in multiple representation subspaces, so the first intermediate feature contains encoded representations from different subspaces, enhancing the expressive power of the model.
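One possible way to realize such a multi-head cross-modal step is PyTorch's built-in nn.MultiheadAttention, with the query drawn from one modality and the key/value from the other. The embedding dimension, head count and sequence lengths below are illustrative assumptions; the disclosure's own multi-modal multi-head attention model need not coincide with this module.

```python
# Assumed multi-head variant: query from projected text tokens, key/value from image regions.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, kdim=2048, vdim=2048, batch_first=True)

text_feat = torch.randn(1, 16, 256)    # e.g. 16 projected token features (query side)
image_feat = torch.randn(1, 49, 2048)  # e.g. 49 region features (key/value side)

# Per-head attention outputs are concatenated and projected internally by the module.
m1, _ = attn(query=text_feat, key=image_feat, value=image_feat)  # (1, 16, 256) first intermediate feature
```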
In an embodiment, the determining the second query information according to the second mode initial feature, determining the second key information and the second value information according to the first mode initial feature, and processing the second query information, the second key information, and the second value information through an attention mechanism to obtain the second intermediate feature may include the following steps:
multiplying the initial characteristic of the second mode with a second query weight parameter to obtain second query information;
multiplying the first modal initial feature by a second key weight parameter to obtain second key information, and multiplying the first modal initial feature by a second value weight parameter to obtain second value information;
determining a second attention weight according to the second query information and the second key information;
and weighting the second value information based on the second attention weight to obtain a second intermediate feature.
The second query weight parameter may be the same as or different from the first query weight parameter. For example, an attention model may be trained, where the attention model is used when the first intermediate feature is extracted and the second intermediate feature is extracted, and the query weight parameters in the attention model are the first query weight parameter and the second query weight parameter, which are the same. Alternatively, two attention models may be trained, one for extracting a first intermediate feature, the query weight parameter of which is the first query weight parameter, and the other for extracting a second intermediate feature, the query weight parameter of which is the second query weight parameter, the two generally being different. Similarly, the second key weight parameter may be the same as the first key weight parameter or may be different from the first key weight parameter. The second value weight parameter may be the same as or different from the first value weight parameter.
For the specific implementation of the above steps for obtaining the second intermediate feature, reference may be made to the description of fig. 2; the details are therefore not repeated here.
With continued reference to fig. 1, in step S150, the first intermediate feature and the second intermediate feature are fused to obtain a multi-modal target feature.
The first intermediate feature and the second intermediate feature are features obtained by aggregating the first modality information and the second modality information, and the information emphasis in the features is different. And further fusing the first intermediate feature and the second intermediate feature to obtain a multi-modal target feature as a multi-modal feature fusion final result.
In one embodiment, referring to fig. 3, the above-mentioned merging the first intermediate feature and the second intermediate feature to obtain the multi-modal target feature may include the following steps S310 and S320:
step S310, carrying out weighted fusion on the first intermediate feature and the second intermediate feature based on the first fusion weight parameter and the second fusion weight parameter;
step S320, activating the weighted fusion result to obtain the multi-mode target feature.
The first fusion weight parameter is a weight parameter for weighting the first intermediate feature, and the second fusion weight parameter is a weight parameter for weighting the second intermediate feature. Based on the first fusion weight parameter and the second fusion weight parameter, the fusion proportions of the first intermediate feature and the second intermediate feature can be adjusted, which can be understood as adjusting the fusion proportions of the second-modality information (since the first intermediate feature mainly characterizes the second-modality information) and the first-modality information (since the second intermediate feature mainly characterizes the first-modality information). Taking image-text fusion as an example, the contributions of the image information and the text information to the multi-modal target feature can be dynamically controlled based on the two fusion weight parameters; for instance, aligning function words in the text (such as "the" and "of") with arbitrary image blocks can be avoided. Compared with directly adding or concatenating the features of different modalities, this scheme reduces fusion errors.
After the weighted fusion result is obtained, an activation process can be applied to it so that the feature acquires nonlinear characteristics and useless information in the feature is filtered out, yielding the multi-modal target feature.
For example, the process of fusing the first intermediate feature and the second intermediate feature may refer to the following formula:

G = σ(W1ᵀ·M1 + W2ᵀ·M2)

wherein W1 represents the first fusion weight parameter, which may be a vector or a matrix, and the superscript T denotes the transpose; W2 represents the second fusion weight parameter, which may likewise be a vector or a matrix; M1 and M2 represent the first intermediate feature and the second intermediate feature; σ represents the sigmoid activation function, although other activation functions such as the ReLU (Rectified Linear Unit) may also be employed; and G represents the multi-modal target feature.
In one embodiment, the weighted fusion and activation processes may be implemented by a multi-mode fusion model, where the multi-mode fusion model may include one or more fully connected layers and an activation layer, and may also include other types of intermediate layers. The weight parameters in the full-connection layer comprise a first fusion weight parameter and a second fusion weight parameter, the first intermediate feature and the second intermediate feature are subjected to weighted fusion through the full-connection layer, and then the activation layer is used for carrying out activation processing to output the multi-mode target feature.
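A minimal sketch of such a fusion step follows: a fully connected layer carrying the two fusion weight parameters, followed by a sigmoid activation. The layer sizes are illustrative assumptions, and an actual multi-modal fusion model could contain additional intermediate layers as noted above.

```python
# Sketch of the weighted fusion + activation described above (G = sigmoid(W1^T M1 + W2^T M2)).
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, feat_dim=256, out_dim=256):
        super().__init__()
        self.w1 = nn.Linear(feat_dim, out_dim, bias=False)  # first fusion weight parameter
        self.w2 = nn.Linear(feat_dim, out_dim, bias=False)  # second fusion weight parameter
        self.activation = nn.Sigmoid()                       # other activations (e.g. ReLU) also possible

    def forward(self, m1, m2):
        # weighted fusion of the two intermediate features, then activation
        return self.activation(self.w1(m1) + self.w2(m2))    # multi-modal target feature G
```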
In one embodiment, referring to fig. 4, after step S150, the multi-modal feature processing method may further include the following step S160:
step S160, obtaining event detection results of the first-mode to-be-processed information and the second-mode to-be-processed information by classifying the multi-mode target features.
For example, regression processing may be performed on the multi-mode target feature to obtain probability values of occurrence of the specific event represented by the first-mode to-be-processed information and the second-mode to-be-processed information, so as to obtain an event detection result. Alternatively, an event detection model (e.g., may be a fully connected network) for classification may be pre-trained, multi-modal target features may be input into the event detection model, event detection results may be output, including whether the first modality information to be processed and the second modality information to be processed indicate the occurrence of a particular event, and the type of event that occurred.
Because the multi-mode target features in the exemplary embodiment fully integrate the information of the first mode and the second mode, the complementation of the information of the two modes is realized, and the event detection is carried out based on the multi-mode target features, thereby being beneficial to obtaining the event detection result with higher accuracy.
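A hedged sketch of step S160 is shown below: a small fully connected classifier over the multi-modal target feature that outputs event-class probabilities. The layer sizes and the number of event types are assumptions made for illustration.

```python
# Sketch of step S160: classify the multi-modal target feature into event types.
import torch
import torch.nn as nn

class EventDetector(nn.Module):
    def __init__(self, feat_dim=256, num_event_types=34):  # num_event_types is an assumption
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_event_types),
        )

    def forward(self, target_feature):
        logits = self.classifier(target_feature)
        return torch.softmax(logits, dim=-1)   # probability of each event type (or of "no event")
```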
In one embodiment, event argument identification may also be performed based on the multi-modal target feature or the event detection result, i.e., identifying the argument roles of an event, such as the time, place and persons involved in the event. This realizes event extraction, which has important application value in fields such as information retrieval, text summarization and knowledge graphs.
Of course, in addition to event detection, the multi-modal object features may also be used for other tasks, such as generating related information maps, outputting semantic information (or translation information) of text or images, and so forth.
In one embodiment, the multi-modal feature fusion process may be as shown in fig. 5. First, the text to be processed is input into the BERT model to obtain text features, and the image to be processed is input into the residual network to obtain image features. Then, query information (Q) is determined according to the text features, key information (K) and value information (V) are determined according to the image features, and the first intermediate feature is output through a multi-modal multi-head attention model. Likewise, query information (Q) is determined according to the image features, key information (K) and value information (V) are determined according to the text features, and the second intermediate feature is output through a multi-modal multi-head attention model. Finally, the first intermediate feature and the second intermediate feature are input into the multi-modal fusion model, which outputs the multi-modal target feature.
In one embodiment, the training process for one or more of the models described above may be as shown in fig. 6. First, training data is acquired. The training data comprises triplets of the form (sample text, sample image, event detection label), where the sample text and sample image in the same triplet have a preset association relationship and the event detection label (ground truth) indicates whether the sample text and sample image express the occurrence of a specific event. The sample text is input into the BERT model to obtain sample text features, and the sample image is input into the residual network to obtain sample image features. Then, query information (Q) is determined according to the sample text features, key information (K) and value information (V) are determined according to the sample image features, and first intermediate sample features are output through the multi-modal multi-head attention model. Likewise, query information (Q) is determined according to the sample image features, key information (K) and value information (V) are determined according to the sample text features, and second intermediate sample features are output through the multi-modal multi-head attention model. The first intermediate sample features and the second intermediate sample features are input into the multi-modal fusion model, which outputs multi-modal sample features. The multi-modal sample features are then input into the event detection model, which outputs an event detection sample result. A loss function is determined from the event detection sample result and the event detection label, for example from the difference between the two. Finally, the parameters of the models to be trained are updated based on the loss function. The models to be trained may include, for example, the event detection model, the multi-modal fusion model and the multi-modal multi-head attention model; the BERT model and the residual network may be pre-trained models and therefore need not be trained in the flow of fig. 6. The high-quality models obtained through training can then be applied to process the first-modality information to be processed and the second-modality information to be processed, extract high-quality multi-modal target features, and obtain accurate event detection results.
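A simplified sketch of this training flow is given below, assuming the hypothetical modules sketched earlier (CrossModalAttention, MultiModalFusion, EventDetector) and a dataloader that yields pre-extracted (text feature, image feature, event label) triplets; BERT and the residual network are treated as frozen, pre-trained feature extractors, as described above.

```python
# Sketch of the training flow in fig. 6 under the stated assumptions.
import torch
import torch.nn as nn

text_to_image_attn = CrossModalAttention()                               # Q from text, K/V from image
image_to_text_attn = CrossModalAttention(text_dim=2048, image_dim=768)   # roles swapped: Q from image, K/V from text
fusion = MultiModalFusion()
detector = EventDetector()

params = (list(text_to_image_attn.parameters()) + list(image_to_text_attn.parameters())
          + list(fusion.parameters()) + list(detector.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)
criterion = nn.CrossEntropyLoss()

for text_feat, image_feat, label in dataloader:      # dataloader is assumed; label = class index tensor
    m1 = text_to_image_attn(text_feat, image_feat)   # first intermediate sample feature
    m2 = image_to_text_attn(image_feat, text_feat)   # second intermediate sample feature
    target_feature = fusion(m1, m2)                  # multi-modal sample feature
    logits = detector.classifier(target_feature)     # raw scores for the loss
    loss = criterion(logits, label)                  # difference from the event detection label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```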
Exemplary embodiments of the present disclosure also provide a multi-modal feature processing apparatus. Referring to fig. 7, the multi-modal feature processing apparatus 700 may include the following program modules:
the information acquisition module 710 is configured to acquire first-modality information to be processed and second-modality information to be processed;
the initial feature extraction module 720 is configured to extract a first modality initial feature for the first modality information to be processed and extract a second modality initial feature for the second modality information to be processed;
the first attention processing module 730 is configured to determine first query information according to the first modality initial feature, determine first key information and first value information according to the second modality initial feature, and process the first query information, the first key information and the first value information through an attention mechanism to obtain a first intermediate feature;
the second attention processing module 740 is configured to determine second query information according to the second modal initial feature, determine second key information and second value information according to the first modal initial feature, and process the second query information, the second key information and the second value information through an attention mechanism to obtain a second intermediate feature;
The intermediate feature fusion module 750 is configured to fuse the first intermediate feature and the second intermediate feature to obtain a multi-modal target feature.
In one embodiment, the first modality of information to be processed includes text to be processed and the second modality of information to be processed includes images to be processed.
In one implementation, the first modality initial feature comprises a text feature; the extracting the first modality initial feature from the information to be processed in the first modality includes:
encoding words in the text to be processed to obtain word vectors, and fusing the word vectors to obtain text characteristics of the text to be processed.
In one embodiment, the second modality initial feature comprises an image feature; extracting the initial characteristics of the second modality from the information to be processed of the second modality includes:
and processing the image to be processed by utilizing a pre-trained residual network to obtain the image characteristics of the image to be processed.
In an embodiment, the determining the first query information according to the first modality initial feature, determining the first key information and the first value information according to the second modality initial feature, and processing the first query information, the first key information, and the first value information through an attention mechanism to obtain a first intermediate feature includes:
Multiplying the initial characteristic of the first modality by a first query weight parameter to obtain first query information;
multiplying the initial characteristic of the second mode with the first key weight parameter to obtain first key information, and multiplying the initial characteristic of the second mode with the first value weight parameter to obtain first value information;
determining a first attention weight according to the first query information and the first key information;
and weighting the first value information based on the first attention weight to obtain a first intermediate feature.
In one embodiment, the fusing the first intermediate feature and the second intermediate feature to obtain the multi-modal target feature includes:
weighting and fusing the first intermediate feature and the second intermediate feature based on the first fusion weight parameter and the second fusion weight parameter;
and activating the weighted fusion result to obtain the multi-mode target feature.
In one embodiment, the multi-modal feature processing apparatus 700 may further include an event detection module configured to: after the intermediate feature fusion module 750 obtains the multi-mode target feature, the event detection result of the first-mode to-be-processed information and the second-mode to-be-processed information is obtained by classifying the multi-mode target feature.
The specific details of each part in the above apparatus are already described in the method part embodiments, and the details not disclosed can refer to the embodiment content of the method part, so that the details are not repeated.
Exemplary embodiments of the present disclosure also provide a computer readable storage medium, which may be implemented in the form of a program product comprising program code for causing an electronic device to carry out the steps according to the various exemplary embodiments of the disclosure as described in the above section of the "exemplary method" when the program product is run on the electronic device. In an alternative embodiment, the program product may be implemented as a portable compact disc read only memory (CD-ROM) and comprises program code and may run on an electronic device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
Exemplary embodiments of the present disclosure also provide an electronic device. The electronic device may include a processor and a memory. The memory stores executable instructions of the processor, such as program code. The processor performs the method of the present exemplary embodiment by executing the executable instructions.
With reference now to FIG. 8, an electronic device is illustrated in the form of a general purpose computing device. It should be understood that the electronic device 800 illustrated in fig. 8 is merely an example and should not be taken as limiting the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 8, an electronic device 800 may include: processor 810, memory 820, bus 830, I/O (input/output) interface 840, network adapter 850.
The memory 820 may include volatile memory such as a RAM 821 and a cache unit 822, and non-volatile memory such as a ROM 823. The memory 820 may also include one or more program units 824, which include, but are not limited to: an operating system, one or more application programs, other program units, and program data; each or some combination of these may include an implementation of a network environment. For example, the program units 824 may include the modules of the apparatus described above.
Bus 830 is used to enable connections between the different components of electronic device 800 and may include a data bus, an address bus, and a control bus.
The electronic device 800 may communicate with one or more external devices 900 (e.g., keyboard, mouse, external controller, etc.) via the I/O interface 840.
The electronic device 800 may communicate with one or more networks through the network adapter 850; for example, the network adapter 850 may provide a mobile communication solution such as 3G/4G/5G, or a wireless communication solution such as wireless local area network, Bluetooth, or near field communication. The network adapter 850 may communicate with other units of the electronic device 800 via the bus 830.
Although not shown in fig. 8, other hardware and/or software elements may also be provided in electronic device 800, including, but not limited to: displays, microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
It should be noted that although several modules or units of a device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, in accordance with exemplary embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units.
Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, various aspects of the disclosure may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein generally as a "circuit," "module," or "system."

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of multi-modal feature processing, comprising:
acquiring information to be processed in a first mode and information to be processed in a second mode;
extracting a first mode initial feature from the first mode information to be processed, and extracting a second mode initial feature from the second mode information to be processed;
determining first query information according to the first mode initial feature, determining first key information and first value information according to the second mode initial feature, and processing the first query information, the first key information and the first value information through an attention mechanism to obtain a first intermediate feature;
determining second query information according to the second mode initial feature, determining second key information and second value information according to the first mode initial feature, and processing the second query information, the second key information and the second value information through an attention mechanism to obtain a second intermediate feature;
and fusing the first intermediate feature and the second intermediate feature to obtain a multi-mode target feature.
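For illustration only (not part of the claimed subject matter), the following Python/PyTorch sketch shows one possible arrangement of the bidirectional attention and fusion described in claim 1. The feature dimensions, the use of multi-head attention, mean pooling over sequence positions, the ReLU activation, and all module and variable names are assumptions of the sketch, not limitations derived from the claim.

# Illustrative sketch of the cross-modal processing of claim 1 (assumptions noted above).
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, d=256, heads=4):
        super().__init__()
        # First modality attends to the second: query from modality 1, key/value from modality 2.
        self.attn_1_to_2 = nn.MultiheadAttention(d, heads, batch_first=True)
        # Second modality attends to the first: query from modality 2, key/value from modality 1.
        self.attn_2_to_1 = nn.MultiheadAttention(d, heads, batch_first=True)
        self.w1 = nn.Parameter(torch.tensor(0.5))  # first fusion weight parameter
        self.w2 = nn.Parameter(torch.tensor(0.5))  # second fusion weight parameter

    def forward(self, feat_mod1, feat_mod2):
        # feat_mod1: (batch, n1, d) first mode initial features
        # feat_mod2: (batch, n2, d) second mode initial features
        inter1, _ = self.attn_1_to_2(feat_mod1, feat_mod2, feat_mod2)  # first intermediate feature
        inter2, _ = self.attn_2_to_1(feat_mod2, feat_mod1, feat_mod1)  # second intermediate feature
        # Pool over the sequence dimension so both features share one shape, then fuse.
        fused = self.w1 * inter1.mean(dim=1) + self.w2 * inter2.mean(dim=1)
        return torch.relu(fused)  # multi-mode target feature

fusion = MultiModalFusion()
target = fusion(torch.randn(2, 10, 256), torch.randn(2, 49, 256))  # e.g. 10 text tokens, 7x7 image patches
print(target.shape)  # torch.Size([2, 256])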
2. The method of claim 1, wherein the information to be processed in the first mode comprises a text to be processed, and the information to be processed in the second mode comprises an image to be processed.
3. The method of claim 2, wherein the first mode initial feature comprises a text feature, and extracting the first mode initial feature from the information to be processed in the first mode comprises:
encoding words in the text to be processed to obtain word vectors, and fusing the word vectors to obtain the text feature of the text to be processed.
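A minimal sketch of this step, assuming a learned word-embedding table and mean pooling as the fusion of the word vectors (the claim fixes neither the encoding nor the fusion; the vocabulary size, embedding dimension, and token ids below are hypothetical):

# Encode words into word vectors, then fuse them into a single text feature.
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=30000, embedding_dim=256)  # word -> word vector
token_ids = torch.tensor([[12, 845, 97, 3021]])                    # ids of the words in one text to be processed
word_vectors = embedding(token_ids)                                # (1, 4, 256)
text_feature = word_vectors.mean(dim=1)                            # fused text feature, shape (1, 256)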
4. The method of claim 2, wherein the second mode initial feature comprises an image feature, and extracting the second mode initial feature from the information to be processed in the second mode comprises:
processing the image to be processed by using a pre-trained residual network to obtain the image feature of the image to be processed.
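A minimal sketch of this step, assuming the pre-trained residual network is a torchvision ResNet-50 with its classification head removed; the specific network depth, weights, and pooled feature size are assumptions, since the claim only requires a pre-trained residual network:

# Extract an image feature with a pre-trained residual network.
import torch
from torchvision import models

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop the classification head, keep conv stages + pooling
backbone.eval()

image = torch.randn(1, 3, 224, 224)             # an already pre-processed image tensor
with torch.no_grad():
    image_feature = backbone(image).flatten(1)  # image feature of the image to be processed, shape (1, 2048)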
5. The method according to claim 1, wherein determining the first query information according to the first mode initial feature, determining the first key information and the first value information according to the second mode initial feature, and processing the first query information, the first key information and the first value information through an attention mechanism to obtain the first intermediate feature comprises:
multiplying the first mode initial feature by a first query weight parameter to obtain the first query information;
multiplying the second mode initial feature by a first key weight parameter to obtain the first key information, and multiplying the second mode initial feature by a first value weight parameter to obtain the first value information;
determining a first attention weight according to the first query information and the first key information;
and weighting the first value information based on the first attention weight to obtain the first intermediate feature.
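The steps of claim 5 can be read as a single cross-attention head. The sketch below assumes scaled dot-product attention with a softmax as the way the query and key information are turned into an attention weight; the claim itself only specifies the multiplications by the weight parameters and the weighting of the value information, and all dimensions are hypothetical.

# Step-by-step sketch of claim 5 for a single attention head.
import math
import torch

d = 256
X1 = torch.randn(10, d)   # first mode initial features  (e.g. 10 text tokens)
X2 = torch.randn(49, d)   # second mode initial features (e.g. 49 image patches)
Wq = torch.randn(d, d)    # first query weight parameter
Wk = torch.randn(d, d)    # first key weight parameter
Wv = torch.randn(d, d)    # first value weight parameter

Q1 = X1 @ Wq              # first query information
K1 = X2 @ Wk              # first key information
V1 = X2 @ Wv              # first value information

attn = torch.softmax(Q1 @ K1.T / math.sqrt(d), dim=-1)  # first attention weight, shape (10, 49)
intermediate_1 = attn @ V1                               # first intermediate feature, shape (10, 256)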
6. The method of claim 1, wherein fusing the first intermediate feature and the second intermediate feature to obtain the multi-mode target feature comprises:
weighting and fusing the first intermediate feature and the second intermediate feature based on a first fusion weight parameter and a second fusion weight parameter;
and activating the weighted fusion result to obtain the multi-mode target feature.
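A minimal sketch of this fusion step, assuming scalar fusion weight parameters and a tanh activation; the claim leaves both the form of the weights and the activation function open, and the pooled feature shapes are hypothetical:

# Weighted fusion of the two intermediate features followed by an activation.
import torch

F1 = torch.randn(1, 256)   # first intermediate feature (pooled)
F2 = torch.randn(1, 256)   # second intermediate feature (pooled)
alpha, beta = 0.6, 0.4     # first and second fusion weight parameters

target_feature = torch.tanh(alpha * F1 + beta * F2)  # multi-mode target feature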
7. The method according to any one of claims 1 to 6, wherein after obtaining the multi-mode target feature, the method further comprises:
classifying the multi-mode target feature to obtain an event detection result of the information to be processed in the first mode and the information to be processed in the second mode.
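A minimal sketch of this classification step, assuming a linear classifier followed by a softmax over a hypothetical set of event classes; the claim does not fix the classifier or the number of classes:

# Classify the multi-mode target feature into an event detection result.
import torch
import torch.nn as nn

num_event_classes = 8                                 # hypothetical number of event types
classifier = nn.Linear(256, num_event_classes)

target_feature = torch.randn(1, 256)                  # multi-mode target feature
probs = torch.softmax(classifier(target_feature), -1) # class probabilities
event = probs.argmax(dim=-1)                          # predicted event type (the event detection result)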
8. A multi-modal feature processing apparatus, comprising:
an information acquisition module, configured to acquire information to be processed in a first mode and information to be processed in a second mode;
an initial feature extraction module, configured to extract a first mode initial feature from the information to be processed in the first mode and extract a second mode initial feature from the information to be processed in the second mode;
a first attention processing module, configured to determine first query information according to the first mode initial feature, determine first key information and first value information according to the second mode initial feature, and process the first query information, the first key information and the first value information through an attention mechanism to obtain a first intermediate feature;
a second attention processing module, configured to determine second query information according to the second mode initial feature, determine second key information and second value information according to the first mode initial feature, and process the second query information, the second key information and the second value information through an attention mechanism to obtain a second intermediate feature;
and an intermediate feature fusion module, configured to fuse the first intermediate feature and the second intermediate feature to obtain a multi-mode target feature.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 7.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any one of claims 1 to 7 via execution of the executable instructions.
CN202310854673.2A 2023-07-12 2023-07-12 Multi-mode feature processing method and device, storage medium and electronic equipment Pending CN116861363A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310854673.2A CN116861363A (en) 2023-07-12 2023-07-12 Multi-mode feature processing method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310854673.2A CN116861363A (en) 2023-07-12 2023-07-12 Multi-mode feature processing method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116861363A true CN116861363A (en) 2023-10-10

Family

ID=88231870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310854673.2A Pending CN116861363A (en) 2023-07-12 2023-07-12 Multi-mode feature processing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116861363A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117370933A (en) * 2023-10-31 2024-01-09 中国人民解放军总医院 Multi-mode unified feature extraction method, device, equipment and medium
CN117370933B (en) * 2023-10-31 2024-05-07 中国人民解放军总医院 Multi-mode unified feature extraction method, device, equipment and medium

Similar Documents

Publication Publication Date Title
WO2021082953A1 (en) Machine reading understanding method and apparatus, storage medium, and device
CN109101537B (en) Multi-turn dialogue data classification method and device based on deep learning and electronic equipment
CN112084337B (en) Training method of text classification model, text classification method and equipment
CN108959482B (en) Single-round dialogue data classification method and device based on deep learning and electronic equipment
CN116720004B (en) Recommendation reason generation method, device, equipment and storage medium
CN113704460B (en) Text classification method and device, electronic equipment and storage medium
CN116861995A (en) Training of multi-mode pre-training model and multi-mode data processing method and device
CN114676234A (en) Model training method and related equipment
CN110188158B (en) Keyword and topic label generation method, device, medium and electronic equipment
CN112069309A (en) Information acquisition method and device, computer equipment and storage medium
CN111651573B (en) Intelligent customer service dialogue reply generation method and device and electronic equipment
CN110263218B (en) Video description text generation method, device, equipment and medium
CN116050496A (en) Determination method and device, medium and equipment of picture description information generation model
CN116861363A (en) Multi-mode feature processing method and device, storage medium and electronic equipment
CN117437317A (en) Image generation method, apparatus, electronic device, storage medium, and program product
CN116310983A (en) Multi-mode emotion recognition method and device
CN115828889A (en) Text analysis method, emotion classification model, device, medium, terminal and product
WO2024021685A1 (en) Reply content processing method and media content interactive content interaction method
CN116975347A (en) Image generation model training method and related device
CN116361511A (en) Video retrieval method, device and equipment of composite semantics and storage medium
CN114707633B (en) Feature extraction method, device, electronic equipment and storage medium
CN116090471A (en) Multitasking model pre-training method and device, storage medium and electronic equipment
Singh et al. Visual content generation from textual description using improved adversarial network
CN111667025A (en) Model transfer learning method and device, electronic equipment and computer readable medium
CN113779232A (en) Article abstract generation method and device, computer storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination