CN114491006A - Text abstract generation method, electronic device and medium for referring to multi-mode information - Google Patents

Text abstract generation method, electronic device and medium for referring to multi-mode information

Info

Publication number
CN114491006A
Authority
CN
China
Prior art keywords
information
text
features
enhanced
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210104367.2A
Other languages
Chinese (zh)
Inventor
张梓键
付卫婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Tongshan Artificial Intelligence Technology Co ltd
Original Assignee
Zhejiang Tongshan Artificial Intelligence Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Tongshan Artificial Intelligence Technology Co ltd filed Critical Zhejiang Tongshan Artificial Intelligence Technology Co ltd
Priority to CN202210104367.2A
Publication of CN114491006A
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/34 - Browsing; Visualisation therefor
    • G06F 16/345 - Summarisation for human users
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention relates to a text abstract generation method, an electronic device and a medium referring to multi-modal information, comprising an encoding step and a decoding step. The encoding step includes: obtaining serialized features of the multi-modal information through feature mapping, wherein the multi-modal information comprises text information, audio information and image information; enhancing the serialized features of the multi-modal information through an attention mechanism to obtain enhanced features of the multi-modal information; classifying the enhanced features of the multi-modal information to obtain a plurality of classes of enhanced feature sets; inputting the plurality of classes of enhanced feature sets into a plurality of feedforward neural networks in one-to-one correspondence to obtain a plurality of groups of fusion features; and splicing the plurality of groups of fusion features into a feature fusion vector. The decoding step includes: taking the feature fusion vector as the hidden-state input and fusing the enhanced features of the text information with the feature fusion vector through a cross-attention mechanism to generate the text abstract. Compared with the prior art, the method has the advantages of high efficiency and high accuracy.

Description

Text abstract generation method, electronic device and medium for referring to multi-mode information
Technical Field
The present invention relates to a text summary generation technology, and in particular, to a text summary generation method, an electronic device, and a medium for referencing multimodal information.
Background
With the abundance of network multimedia information, people's demand for short and effective information keeps growing. Many cross-media platforms rely on text related to video or audio, such as news headlines. Therefore, developing summarization technology that generates abstract sentences from multi-modal input can avoid wasting resources, reduce repetitive work and promote the development of society.
The problem the present invention addresses is the automated generation of text summaries from multi-modal input information. Existing summarization research mainly focuses on a single modality (text summarization or video summarization); some researchers have summarized multi-modal input into text or multi-modal output, and although some progress has been made, two major problems remain:
(1) multimodal information cannot be aligned effectively;
(2) all features are treated identically, and no divide-and-conquer strategy is adopted to handle the different parts of the features.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and to provide a text abstract generation method, an electronic device and a medium referring to multi-modal information that are highly efficient and highly accurate.
The purpose of the invention can be realized by the following technical scheme:
a text abstract generating method referring to multi-mode information comprises an encoding step and a decoding step;
the encoding step comprises:
obtaining serialized features of the multi-modal information through feature mapping, wherein the multi-modal information comprises text information, audio information and image information;
enhancing the serialized features of the multi-modal information through an attention mechanism to obtain enhanced features of the multi-modal information;
classifying the enhanced features of the multi-modal information to obtain a plurality of classes of enhanced feature sets;
inputting the plurality of classes of enhanced feature sets into a plurality of feedforward neural networks in one-to-one correspondence to obtain a plurality of groups of fusion features;
splicing the plurality of groups of fusion features into a feature fusion vector;
the decoding step comprises:
taking the feature fusion vector as the hidden-state input and fusing the enhanced features of the text information with the feature fusion vector through a cross-attention mechanism to generate the text abstract.
Further, the serialized feature of the text information is a word embedding feature;
the serialized features of the audio information are MFCC features;
the serialized features of the image information are acquired as follows:
the image is cut into patches, wherein the feature dimension of each cut patch is the same as that of the word embedding features, and the data length of the cut image is the same as that of the MFCC features.
Further, the enhanced features of the multi-modal information are classified through a classifier, and the expression of the classifier is as follows:
$$\{R_1, R_2, \ldots, R_n\} = \mathrm{Classifier}(R)$$
where $n$ is the number of feedforward neural networks, $R$ is the input information, and $R_i$ is the classified $i$-th class enhanced feature set; the text information lies in $\mathbb{R}^{L_T \times D_T}$, with $L_T$ the length of the text information and $D_T$ its feature dimension; the audio information lies in $\mathbb{R}^{L_I \times D_I}$, with $L_I$ the length of the audio information and $D_I$ its feature dimension; and the image information lies in $\mathbb{R}^{L_A \times D_A}$, with $L_A$ the length of the image information and $D_A$ its feature dimension.
Further, the expression of the fusion feature is as follows:
$$F_i = \mathrm{FFN}_i(R_i), \quad i = 1, 2, \ldots, n$$
where $\mathrm{FFN}_i$ is the $i$-th feedforward neural network and $F_i$ is the $i$-th group of fusion features.
Further, the process of enhancing the serialized features of the multi-modal information through the attention mechanism includes:
the text information is enhanced through a self-attention mechanism, and the audio information and the image information are enhanced through a cross-attention mechanism by utilizing the text information.
Further, the self-attention mechanism is realized through a Transformer model.
Further, the cross-attention mechanism is realized through a Transformer model.
Further, the mask matrix in the attention mechanism is masked.
An electronic device comprising a memory storing a computer program and a processor that calls the program instructions to execute the text abstract generation method.
A computer-readable storage medium comprising a computer program executable by a processor to implement the text summary generation method.
Compared with the prior art, the invention has the following beneficial effects:
(1) the invention performs feature mapping on the text-image-audio multi-modal data triple and converts it into a serialized data structure; it enhances the feature expression of the text information with a self-attention mechanism and, with a cross-attention mechanism, learns from the image information and the audio information respectively the information that benefits the text expression, obtaining the enhanced features of the multi-modal triple; the enhanced features are classified and input into a plurality of feedforward neural networks in one-to-one correspondence to obtain a plurality of groups of fusion features, which are spliced into a feature fusion vector; finally, taking the feature fusion vector as the hidden vector and the enhanced features of the text information as input, interaction through the cross-attention mechanism generates the final text abstract prediction result. In this way the different information in the fusion features can be better captured, the final expression is more efficient and accurate, and using the feature fusion vector to feed back into the text abstract generation process allows multi-modal information to be fused better;
(2) in order to make the whole text abstract follow a serialized generation strategy in the generation process, the method operates in the form of masks, that is, the mask matrices in the cross-attention mechanism and the self-attention mechanism are masked.
Drawings
FIG. 1 is a schematic diagram of a multi-modal hierarchical selection framework structure based on a Transformer model.
Detailed Description
The invention is described in detail below with reference to the figures and the specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.
Example 1
A text abstract generating method referring to multi-mode information comprises an encoding step and a decoding step;
the encoding step includes:
obtaining the serialized characteristics of multi-modal information through characteristic mapping, wherein the multi-modal information comprises text information, audio information and image information;
the method comprises the steps of enhancing the serialized features of the multi-modal information through an attention mechanism to obtain enhanced features of the multi-modal information;
classifying the enhanced features of the multi-modal information to obtain a plurality of types of enhanced feature sets;
inputting a plurality of types of enhanced feature sets into a plurality of feedforward neural networks in a one-to-one correspondence manner, and correspondingly obtaining a plurality of groups of fusion features;
splicing the plurality of groups of fusion features into feature fusion vectors;
the decoding step includes:
and performing feature fusion on the enhanced features of the text information and the feature fusion vector by taking the feature fusion vector as hidden state input through a cross-attention mechanism to generate a text abstract.
Feature characterization (feature mapping) maps the features of data of different modalities so that the data take a serialized data-structure form. The serialized features of the text information are word embedding features, and the serialized features of the audio information are MFCC features. The image information needs to be segmented, because an image does not natively have serialized data characteristics; the serialized features of the image information are acquired as follows:
the original image is a three-dimensional matrix of length x width x channel (H x W x C), and then the image is sliced into pieces of size P x P (the value P is set in advance, usually 8 pixels), then the dimension of the sliced image can be represented as P2 x C (corresponding to the feature dimension of the word embedding feature), and the length of the image is HW/P2 (corresponding to the data length of the MFCC feature).
Classifying the enhanced features of the multi-modal information through a classifier, wherein the expression of the classifier is as follows:
$$\{R_1, R_2, \ldots, R_n\} = \mathrm{Classifier}(R)$$
where $n$ is the number of feedforward neural networks, $R$ is the input information, and $R_i$ is the classified $i$-th class enhanced feature set; the text information lies in $\mathbb{R}^{L_T \times D_T}$, with $L_T$ the length of the text information and $D_T$ its feature dimension; the audio information lies in $\mathbb{R}^{L_I \times D_I}$, with $L_I$ the length of the audio information and $D_I$ its feature dimension; and the image information lies in $\mathbb{R}^{L_A \times D_A}$, with $L_A$ the length of the image information and $D_A$ its feature dimension.
The expression of the fusion features is:
$$F_i = \mathrm{FFN}_i(R_i), \quad i = 1, 2, \ldots, n$$
where $\mathrm{FFN}_i$ is the $i$-th feedforward neural network and $F_i$ is the $i$-th group of fusion features.
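A minimal PyTorch sketch of this classification-plus-routing computation follows. The patent does not disclose the classifier's internal structure, so a learned linear scorer with an argmax decision is assumed here; the class count and dimensions are likewise illustrative.

```python
import torch
import torch.nn as nn

class SelectiveRouter(nn.Module):
    """Classify enhanced features into n classes and route each class to
    its own feedforward network (F_i = FFN_i(R_i)), then splice the
    results back into a single feature-fusion sequence."""

    def __init__(self, d_model: int, n_classes: int = 3):
        super().__init__()
        self.scorer = nn.Linear(d_model, n_classes)  # assumed classifier
        self.ffns = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                          nn.Linear(d_model, d_model))
            for _ in range(n_classes)
        ])

    def forward(self, R: torch.Tensor) -> torch.Tensor:
        # R: (L, d_model), the enhanced features of all modalities stacked.
        # Hard argmax routing (an inference-time view; training would need a
        # differentiable relaxation, which is omitted here).
        labels = self.scorer(R).argmax(dim=-1)
        fused = torch.zeros_like(R)
        for i, ffn in enumerate(self.ffns):
            idx = labels == i
            if idx.any():
                fused[idx] = ffn(R[idx])        # F_i = FFN_i(R_i)
        return fused                            # spliced fusion features

router = SelectiveRouter(d_model=512, n_classes=3)
R = torch.randn(180, 512)        # e.g. L_T + L_I + L_A enhanced features
fusion_vector = router(R)        # (180, 512) feature-fusion output
```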
Because the text abstract generation method provided by this embodiment targets a text abstract, only the text data itself needs to be attended to in the feature-expression process, without attending to the features of the other modalities. The process of enhancing the serialized features of the multi-modal information through the attention mechanism includes:
the text information is enhanced through a self-attention mechanism, and the audio information and the image information are enhanced through a cross-attention mechanism by utilizing the text information.
As shown in FIG. 1, the text abstract generation method referring to multi-modal information according to this embodiment is implemented by a multi-modal hierarchical selection framework based on the Transformer model, where the framework includes a cross-modal encoder and a sequence mask decoder;
the cross-modal encoder comprises a feature characterization module, a low-dimensional cross-modal interaction module and a high-dimensional selective routing module;
the feature characterization module mainly performs feature mapping on data of different modalities to enable the data to have a serialized data structure form (data is required to have serialization characteristics because a Transformer model is parallel processing), and performs feature mapping on input information (text-image-audio triples) respectively.
The low-dimensional cross-modal interaction module enhances the serialized features of each modality to make them more suitable for the task; that is, it simply lets the different data interact at a low level (low-dimensional relative to the other operations, since it is done immediately after feature characterization). After feature mapping, the multi-modal triple is serialized data whose dimensions are all L × D, where L denotes the data length and D denotes the feature-expression dimension. The text is input into a self-attention module (part of the Transformer model) to enhance its feature expression (because the method of this embodiment targets a text abstract, only the text data itself needs attention in the feature-expression process, not the features of the other modalities); images and audio serve as supplementary information to the text information and are aligned to the text by a cross-attention mechanism (the mechanism between the Encoder and the Decoder in the Transformer model), as sketched below.
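A sketch of this low-dimensional interaction using PyTorch's stock attention layer. Single `MultiheadAttention` calls stand in for the full Transformer blocks the patent uses; the dimensions, head count, and the sharing of one cross-attention module between audio and image are simplifying assumptions.

```python
import torch
import torch.nn as nn

d = 512
self_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
cross_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)

# Serialized features, already projected to a common dimension d.
text = torch.randn(1, 40, d)    # (batch, L_T, D): word embeddings
audio = torch.randn(1, 60, d)   # (batch, L_I, D): MFCC features
image = torch.randn(1, 80, d)   # (batch, L_A, D): flattened patches

# Text enhances its own expression (self-attention: Q = K = V = text).
text_enh, _ = self_attn(text, text, text)

# Audio and image are aligned to the text (cross-attention: each modality
# queries the enhanced text, which supplies the keys and values).
audio_enh, _ = cross_attn(audio, text_enh, text_enh)
image_enh, _ = cross_attn(image, text_enh, text_enh)
```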
The high-dimensional selective routing module adopts a divide-and-conquer strategy to treat the different parts of the features (for example, the data of the different modalities are divided into several parts, such as three: semantically strongly correlated, semantically weakly correlated and semantically irrelevant, and the data belonging to each part are input into the corresponding downstream sub-module for processing). Thus, after feature enhancement and supplementation with the remaining modality information, the data of the three different modalities are feature-spliced (fused), and the different information in the fused features can be better captured.
The sequence mask decoder takes the enhanced features of the text information as input, takes the fusion features as hidden state input for interaction, and finally generates the required text abstract.
In order to make the whole text abstract follow a serialized generation strategy during generation, the operation is carried out in the form of masks; that is, the mask matrices in the cross-attention mechanism and the self-attention mechanism are masked (a diagonal mask covering the upper-triangular part of the matrix is applied), as sketched below.
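A sketch of the sequence mask decoder using PyTorch's stock `TransformerDecoder`, whose built-in encoder-decoder attention plays the role of the cross-attention described above. The layer count, dimensions, and tensor sizes are illustrative assumptions, and the final vocabulary projection is omitted.

```python
import torch
import torch.nn as nn

d = 512
layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=6)

fusion_vector = torch.randn(1, 180, d)  # spliced fusion features (hidden state)
text_enh = torch.randn(1, 40, d)        # enhanced text features (decoder input)

# Upper-triangular mask: position t may attend only to positions <= t,
# enforcing the serialized (left-to-right) generation strategy.
L = text_enh.size(1)
causal_mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)

# The decoder's cross-attention fuses the text features (tgt) with the
# feature-fusion vector (memory); a linear layer over the vocabulary
# would then predict the summary tokens.
hidden = decoder(tgt=text_enh, memory=fusion_vector, tgt_mask=causal_mask)
```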
Example 2
An electronic device comprising a memory storing a computer program and a processor that calls the program instructions to execute the text abstract generation method of embodiment 1.
Example 3
A computer-readable storage medium comprising a computer program executable by a processor to implement the text summary generation method of embodiment 1.
Embodiments 1, 2 and 3 provide a text abstract generation method, an electronic device and a medium referring to multi-modal information, which focus on fusing and expressing multi-modal information well: cross-modal interaction fuses information at a low level, routing makes the final expression more efficient at a high level, and the fusion information feeds back into text abstract generation on the decoder side, so that multi-modal information is better fused. The method has many application scenarios in real life, such as news tags, summary generation and the automatic generation of video conference summaries.
The preferred embodiments of the invention have been described in detail above. It should be understood that numerous modifications and variations can be devised by those skilled in the art in light of the present teachings without departing from the inventive concept. Therefore, technical solutions obtainable by those skilled in the art through logical analysis, reasoning and limited experiments based on the prior art and the concept of the present invention shall fall within the scope of protection defined by the claims.

Claims (10)

1. A text abstract generating method referring to multi-mode information is characterized by comprising an encoding step and a decoding step;
the encoding step comprises:
obtaining serialized features of the multi-modal information through feature mapping, wherein the multi-modal information comprises text information, audio information and image information;
enhancing the serialized features of the multi-modal information through an attention mechanism to obtain enhanced features of the multi-modal information;
classifying the enhanced features of the multi-modal information to obtain a plurality of classes of enhanced feature sets;
inputting the plurality of classes of enhanced feature sets into a plurality of feedforward neural networks in one-to-one correspondence to obtain a plurality of groups of fusion features;
splicing the plurality of groups of fusion features into a feature fusion vector;
the decoding step comprises:
taking the feature fusion vector as the hidden-state input and fusing the enhanced features of the text information with the feature fusion vector through a cross-attention mechanism to generate the text abstract.
2. The method of claim 1, wherein the serialized feature of the text information is a word embedding feature;
the serialized features of the audio information are MFCC features;
the process for acquiring the serialization characteristic of the image information comprises the following steps:
and cutting the image, wherein the characteristic dimension of the cut image is the same as that of the word embedding characteristic, and the data length of the cut image is the same as that of the MFCC characteristic.
3. The method according to claim 1, wherein the enhanced features of the multi-modal information are classified by a classifier, the expression of the classifier is:
$$\{R_1, R_2, \ldots, R_n\} = \mathrm{Classifier}(R)$$
where $n$ is the number of feedforward neural networks, $R$ is the input information, and $R_i$ is the classified $i$-th class enhanced feature set; the text information lies in $\mathbb{R}^{L_T \times D_T}$, with $L_T$ the length of the text information and $D_T$ its feature dimension; the audio information lies in $\mathbb{R}^{L_I \times D_I}$, with $L_I$ the length of the audio information and $D_I$ its feature dimension; and the image information lies in $\mathbb{R}^{L_A \times D_A}$, with $L_A$ the length of the image information and $D_A$ its feature dimension.
4. The method according to claim 3, wherein the expression of the fusion feature is:
$$F_i = \mathrm{FFN}_i(R_i), \quad i = 1, 2, \ldots, n$$
where $\mathrm{FFN}_i$ is the $i$-th feedforward neural network and $F_i$ is the $i$-th group of fusion features.
5. The method of claim 1, wherein the enhancing the serialized features of the multimodal information through the attention mechanism comprises:
the text information is enhanced through a self-attention mechanism, and the audio information and the image information are enhanced through a cross-attention mechanism by utilizing the text information.
6. The method of claim 5, wherein the self-attention mechanism is implemented by a Transformer model.
7. The method of claim 5, wherein the cross-attention mechanism is implemented by a Transformer model.
8. The method of claim 1, wherein a mask matrix in the attention mechanism is masked.
9. An electronic device, comprising a memory storing a computer program and a processor that calls the program instructions to execute the text abstract generation method according to any one of claims 1 to 8.
10. A computer-readable storage medium, comprising a computer program executable by a processor to implement the text summary generation method of any of claims 1-8.
CN202210104367.2A 2022-01-28 2022-01-28 Text abstract generation method, electronic device and medium for referring to multi-mode information Pending CN114491006A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210104367.2A CN114491006A (en) 2022-01-28 2022-01-28 Text abstract generation method, electronic device and medium for referring to multi-mode information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210104367.2A CN114491006A (en) 2022-01-28 2022-01-28 Text abstract generation method, electronic device and medium for referring to multi-mode information

Publications (1)

Publication Number Publication Date
CN114491006A (en) 2022-05-13

Family

ID=81477210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210104367.2A Pending CN114491006A (en) 2022-01-28 2022-01-28 Text abstract generation method, electronic device and medium for referring to multi-mode information

Country Status (1)

Country Link
CN (1) CN114491006A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115659987A (en) * 2022-12-28 2023-01-31 华南师范大学 Multi-mode named entity recognition method, device and equipment based on double channels
CN115797943A (en) * 2023-02-08 2023-03-14 广州数说故事信息科技有限公司 Multimode-based video text content extraction method, system and storage medium

Similar Documents

Publication Publication Date Title
CN113762322B (en) Video classification method, device and equipment based on multi-modal representation and storage medium
CN111177366B (en) Automatic generation method, device and system for extraction type document abstract based on query mechanism
CN112069811B (en) Electronic text event extraction method with multi-task interaction enhancement
CN108595708A (en) A kind of exception information file classification method of knowledge based collection of illustrative plates
CN114491006A (en) Text abstract generation method, electronic device and medium for referring to multi-mode information
CN115982350A (en) False news detection method based on multi-mode Transformer
Wang et al. On the role of scene graphs in image captioning
CN116401376A (en) Knowledge graph construction method and system for manufacturability inspection
Wang et al. Tag: Boosting text-vqa via text-aware visual question-answer generation
CN111597816A (en) Self-attention named entity recognition method, device, equipment and storage medium
CN117173730A (en) Document image intelligent analysis and processing method based on multi-mode information
CN116842944A (en) Entity relation extraction method and device based on word enhancement
CN111368532A (en) Topic word embedding disambiguation method and system based on LDA
CN114399646B (en) Image description method and device based on transform structure
CN113177478B (en) Short video semantic annotation method based on transfer learning
CN115730071A (en) Electric power public opinion event extraction method and device, electronic equipment and storage medium
Yang et al. Modality-specific multimodal global enhanced network for text-based visual question answering
Wang et al. RSRNeT: a novel multi-modal network framework for named entity recognition and relation extraction
Li et al. Without detection: Two‐step clustering features with local–global attention for image captioning
Xu et al. Stdnet: Spatio-temporal decomposed network for video grounding
CN116911268B (en) Table information processing method, apparatus, processing device and readable storage medium
CN117934997B (en) Large language model system and method for generating camera case sample
Kitada et al. DM2S2: Deep Multimodal Sequence Sets With Hierarchical Modality Attention
CN111008283B (en) Sequence labeling method and system based on composite boundary information
CN117540810A (en) Visual question-answering method and system based on multi-mode hierarchical structure representation and alignment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination