CN114491006A - Text abstract generation method, electronic device and medium for referring to multi-mode information - Google Patents

Text abstract generation method, electronic device and medium for referring to multi-mode information

Info

Publication number
CN114491006A
Authority
CN
China
Prior art keywords
information
text
features
enhanced
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210104367.2A
Other languages
Chinese (zh)
Inventor
张梓键
付卫婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Tongshan Artificial Intelligence Technology Co ltd
Original Assignee
Zhejiang Tongshan Artificial Intelligence Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Tongshan Artificial Intelligence Technology Co ltd filed Critical Zhejiang Tongshan Artificial Intelligence Technology Co ltd
Priority to CN202210104367.2A
Publication of CN114491006A
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/34 - Browsing; Visualisation therefor
    • G06F 16/345 - Summarisation for human users
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention relates to a text abstract generation method, an electronic device and a medium referring to multi-modal information, comprising an encoding step and a decoding step. The encoding step includes: obtaining serialized features of the multi-modal information through feature mapping, wherein the multi-modal information comprises text information, audio information and image information; enhancing the serialized features of the multi-modal information through an attention mechanism to obtain enhanced features of the multi-modal information; classifying the enhanced features of the multi-modal information to obtain a plurality of classes of enhanced feature sets; inputting the plurality of classes of enhanced feature sets into a plurality of feedforward neural networks in one-to-one correspondence to obtain a plurality of groups of fusion features; and splicing the plurality of groups of fusion features into a feature fusion vector. The decoding step includes: taking the feature fusion vector as the hidden-state input and fusing the enhanced features of the text information with the feature fusion vector through a cross-attention mechanism to generate the text abstract. Compared with the prior art, the method has the advantages of high efficiency and high accuracy.

Description

Text abstract generation method, electronic device and medium for referring to multi-mode information
Technical Field
The present invention relates to a text summary generation technology, and in particular, to a text summary generation method, an electronic device, and a medium for referencing multimodal information.
Background
With the abundance of network multimedia information, people's demand for short and effective information keeps growing. Many cross-media platforms rely on text related to video or audio, such as news headlines. Therefore, developing summarization technology that generates abstract sentences from multi-modal input can avoid wasting resources, reduce repetitive work and promote the development of society.
The problem the present invention addresses is the automated generation of text summaries from multi-modal input information. Existing summarization research mainly focuses on a single modality (text summarization or video summarization); some researchers have summarized multi-modal input into text or multi-modal output, and although some progress has been made, two major problems remain:
(1) multimodal information cannot be aligned effectively;
(2) all features are treated identically, and no divide-and-conquer strategy is adopted to handle the different parts of the features.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and to provide a text abstract generation method, an electronic device and a medium referring to multi-modal information that are highly efficient and highly accurate.
The purpose of the invention can be realized by the following technical scheme:
a text abstract generating method referring to multi-mode information comprises an encoding step and a decoding step;
the encoding step comprises:
obtaining serialized features of the multi-modal information through feature mapping, wherein the multi-modal information comprises text information, audio information and image information;
enhancing the serialized features of the multi-modal information through an attention mechanism to obtain enhanced features of the multi-modal information;
classifying the enhanced features of the multi-modal information to obtain a plurality of classes of enhanced feature sets;
inputting the plurality of classes of enhanced feature sets into a plurality of feedforward neural networks in one-to-one correspondence to obtain a plurality of groups of fusion features;
splicing the plurality of groups of fusion features into a feature fusion vector;
the decoding step comprises:
taking the feature fusion vector as the hidden-state input and fusing the enhanced features of the text information with the feature fusion vector through a cross-attention mechanism to generate the text abstract.
Further, the serialized feature of the text information is a word embedding feature;
the serialized features of the audio information are MFCC features;
the serialized features of the image information are acquired as follows:
the image is cut into patches, wherein the feature dimension of each cut patch is the same as that of the word embedding features, and the data length of the cut image is the same as that of the MFCC features.
Further, the enhanced features of the multi-modal information are classified through a classifier, and the expression of the classifier is as follows:
$$\{R_1, R_2, \ldots, R_n\} = \mathrm{Classifier}(R)$$
where $n$ is the number of feedforward neural networks, $R$ is the input information, and $R_i$ is the classified $i$-th class enhanced feature set; the text information lies in $\mathbb{R}^{L_T \times D_T}$, with $L_T$ the length of the text information and $D_T$ its feature dimension; the audio information lies in $\mathbb{R}^{L_I \times D_I}$, with $L_I$ the length of the audio information and $D_I$ its feature dimension; and the image information lies in $\mathbb{R}^{L_A \times D_A}$, with $L_A$ the length of the image information and $D_A$ its feature dimension.
Further, the expression of the fusion feature is as follows:
$$F_i = \mathrm{FFN}_i(R_i), \quad i = 1, 2, \ldots, n$$
where $\mathrm{FFN}_i$ is the $i$-th feedforward neural network and $F_i$ is the $i$-th group of fusion features.
Further, the process of enhancing the serialized features of the multi-modal information through the attention mechanism includes:
the text information is enhanced through a self-attention mechanism, and the audio information and the image information are enhanced through a cross-attention mechanism by utilizing the text information.
Further, the self-attention mechanism is realized through a Transformer model.
Further, the cross-attention mechanism is realized through a Transformer model.
Further, the mask matrix in the attention mechanism is masked.
An electronic device comprising a memory storing a computer program and a processor that calls the program instructions to execute the text abstract generation method.
A computer-readable storage medium comprising a computer program executable by a processor to implement the text summary generation method.
Compared with the prior art, the invention has the following beneficial effects:
(1) the invention performs feature mapping on the text-image-audio multi-modal data triple and converts it into a serialized data structure; it enhances the feature expression of the text information with a self-attention mechanism and, with a cross-attention mechanism, learns from the image information and the audio information respectively the information that benefits the text expression, obtaining the enhanced features of the multi-modal triple; the enhanced features are classified and input into a plurality of feedforward neural networks in one-to-one correspondence to obtain a plurality of groups of fusion features, which are spliced into a feature fusion vector; finally, taking the feature fusion vector as the hidden vector and the enhanced features of the text information as input, interaction through the cross-attention mechanism generates the final text abstract prediction result. In this way the different information in the fusion features can be better captured, the final expression is more efficient and accurate, and using the feature fusion vector to feed back into the text abstract generation process allows multi-modal information to be fused better;
(2) in order to make the whole text abstract follow a serialized generation strategy in the generation process, the method operates in the form of masks, that is, the mask matrices in the cross-attention mechanism and the self-attention mechanism are masked.
Drawings
FIG. 1 is a schematic diagram of a multi-modal hierarchical selection framework structure based on a Transformer model.
Detailed Description
The invention is described in detail below with reference to the figures and the specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.
Example 1
A text abstract generating method referring to multi-mode information comprises an encoding step and a decoding step;
the encoding step includes:
obtaining the serialized characteristics of multi-modal information through characteristic mapping, wherein the multi-modal information comprises text information, audio information and image information;
the method comprises the steps of enhancing the serialized features of the multi-modal information through an attention mechanism to obtain enhanced features of the multi-modal information;
classifying the enhanced features of the multi-modal information to obtain a plurality of types of enhanced feature sets;
inputting a plurality of types of enhanced feature sets into a plurality of feedforward neural networks in a one-to-one correspondence manner, and correspondingly obtaining a plurality of groups of fusion features;
splicing the plurality of groups of fusion features into feature fusion vectors;
the decoding step includes:
and performing feature fusion on the enhanced features of the text information and the feature fusion vector by taking the feature fusion vector as hidden state input through a cross-attention mechanism to generate a text abstract.
Feature characterization (feature mapping) maps the features of data of different modalities so that the data take a serialized data-structure form. The serialized features of the text information are word embedding features, and the serialized features of the audio information are MFCC features. The image information needs to be segmented, because an image does not natively have serialized data characteristics; the serialized features of the image information are acquired as follows:
the original image is a three-dimensional matrix of length x width x channel (H x W x C), and then the image is sliced into pieces of size P x P (the value P is set in advance, usually 8 pixels), then the dimension of the sliced image can be represented as P2 x C (corresponding to the feature dimension of the word embedding feature), and the length of the image is HW/P2 (corresponding to the data length of the MFCC feature).
Classifying the enhanced features of the multi-modal information through a classifier, wherein the expression of the classifier is as follows:
$$\{R_1, R_2, \ldots, R_n\} = \mathrm{Classifier}(R)$$
where $n$ is the number of feedforward neural networks, $R$ is the input information, and $R_i$ is the classified $i$-th class enhanced feature set; the text information lies in $\mathbb{R}^{L_T \times D_T}$, with $L_T$ the length of the text information and $D_T$ its feature dimension; the audio information lies in $\mathbb{R}^{L_I \times D_I}$, with $L_I$ the length of the audio information and $D_I$ its feature dimension; and the image information lies in $\mathbb{R}^{L_A \times D_A}$, with $L_A$ the length of the image information and $D_A$ its feature dimension.
The expression of the fusion features is:
$$F_i = \mathrm{FFN}_i(R_i), \quad i = 1, 2, \ldots, n$$
where $\mathrm{FFN}_i$ is the $i$-th feedforward neural network and $F_i$ is the $i$-th group of fusion features.
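A minimal PyTorch sketch of this classification-plus-routing computation follows. The patent does not disclose the classifier's internal structure, so a learned linear scorer with an argmax decision is assumed here; the class count and dimensions are likewise illustrative.

```python
import torch
import torch.nn as nn

class SelectiveRouter(nn.Module):
    """Classify enhanced features into n classes and route each class to
    its own feedforward network (F_i = FFN_i(R_i)), then splice the
    results back into a single feature-fusion sequence."""

    def __init__(self, d_model: int, n_classes: int = 3):
        super().__init__()
        self.scorer = nn.Linear(d_model, n_classes)  # assumed classifier
        self.ffns = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                          nn.Linear(d_model, d_model))
            for _ in range(n_classes)
        ])

    def forward(self, R: torch.Tensor) -> torch.Tensor:
        # R: (L, d_model), the enhanced features of all modalities stacked.
        # Hard argmax routing (an inference-time view; training would need a
        # differentiable relaxation, which is omitted here).
        labels = self.scorer(R).argmax(dim=-1)
        fused = torch.zeros_like(R)
        for i, ffn in enumerate(self.ffns):
            idx = labels == i
            if idx.any():
                fused[idx] = ffn(R[idx])        # F_i = FFN_i(R_i)
        return fused                            # spliced fusion features

router = SelectiveRouter(d_model=512, n_classes=3)
R = torch.randn(180, 512)        # e.g. L_T + L_I + L_A enhanced features
fusion_vector = router(R)        # (180, 512) feature-fusion output
```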
Because the text abstract generation method provided by this embodiment targets a text abstract, only the text data itself needs to be attended to in the feature-expression process, without attending to the features of the other modalities. The process of enhancing the serialized features of the multi-modal information through the attention mechanism includes:
the text information is enhanced through a self-attention mechanism, and the audio information and the image information are enhanced through a cross-attention mechanism by utilizing the text information.
As shown in FIG. 1, the text abstract generation method referring to multi-modal information according to this embodiment is implemented by a multi-modal hierarchical selection framework based on the Transformer model, where the framework includes a cross-modal encoder and a sequence mask decoder;
the cross-modal encoder comprises a feature characterization module, a low-dimensional cross-modal interaction module and a high-dimensional selective routing module;
the feature characterization module mainly performs feature mapping on data of different modalities to enable the data to have a serialized data structure form (data is required to have serialization characteristics because a Transformer model is parallel processing), and performs feature mapping on input information (text-image-audio triples) respectively.
The low-dimensional cross-modal interaction module enhances the serialized features of each modality to make them more suitable for the task; that is, it simply lets the different data interact at a low level (low-dimensional relative to the other operations, since it is done immediately after feature characterization). After feature mapping, the multi-modal triple is serialized data whose dimensions are all L × D, where L denotes the data length and D denotes the feature-expression dimension. The text is input into a self-attention module (part of the Transformer model) to enhance its feature expression (because the method of this embodiment targets a text abstract, only the text data itself needs attention in the feature-expression process, not the features of the other modalities); images and audio serve as supplementary information to the text information and are aligned to the text by a cross-attention mechanism (the mechanism between the Encoder and the Decoder in the Transformer model), as sketched below.
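A sketch of this low-dimensional interaction using PyTorch's stock attention layer. Single `MultiheadAttention` calls stand in for the full Transformer blocks the patent uses; the dimensions, head count, and the sharing of one cross-attention module between audio and image are simplifying assumptions.

```python
import torch
import torch.nn as nn

d = 512
self_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
cross_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)

# Serialized features, already projected to a common dimension d.
text = torch.randn(1, 40, d)    # (batch, L_T, D): word embeddings
audio = torch.randn(1, 60, d)   # (batch, L_I, D): MFCC features
image = torch.randn(1, 80, d)   # (batch, L_A, D): flattened patches

# Text enhances its own expression (self-attention: Q = K = V = text).
text_enh, _ = self_attn(text, text, text)

# Audio and image are aligned to the text (cross-attention: each modality
# queries the enhanced text, which supplies the keys and values).
audio_enh, _ = cross_attn(audio, text_enh, text_enh)
image_enh, _ = cross_attn(image, text_enh, text_enh)
```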
The high-dimensional selective routing module adopts a divide-and-conquer strategy to treat the different parts of the features (for example, the data of the different modalities are divided into several parts, such as three: semantically strongly correlated, semantically weakly correlated and semantically irrelevant, and the data belonging to each part are input into the corresponding downstream sub-module for processing). Thus, after feature enhancement and supplementation with the remaining modality information, the data of the three different modalities are feature-spliced (fused), and the different information in the fused features can be better captured.
The sequence mask decoder takes the enhanced features of the text information as input, takes the fusion features as hidden state input for interaction, and finally generates the required text abstract.
In order to make the whole text abstract follow a serialized generation strategy during generation, the operation is carried out in the form of masks; that is, the mask matrices in the cross-attention mechanism and the self-attention mechanism are masked (a diagonal mask covering the upper-triangular part of the matrix is applied), as sketched below.
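A sketch of the sequence mask decoder using PyTorch's stock `TransformerDecoder`, whose built-in encoder-decoder attention plays the role of the cross-attention described above. The layer count, dimensions, and tensor sizes are illustrative assumptions, and the final vocabulary projection is omitted.

```python
import torch
import torch.nn as nn

d = 512
layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=6)

fusion_vector = torch.randn(1, 180, d)  # spliced fusion features (hidden state)
text_enh = torch.randn(1, 40, d)        # enhanced text features (decoder input)

# Upper-triangular mask: position t may attend only to positions <= t,
# enforcing the serialized (left-to-right) generation strategy.
L = text_enh.size(1)
causal_mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)

# The decoder's cross-attention fuses the text features (tgt) with the
# feature-fusion vector (memory); a linear layer over the vocabulary
# would then predict the summary tokens.
hidden = decoder(tgt=text_enh, memory=fusion_vector, tgt_mask=causal_mask)
```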
Example 2
An electronic device comprising a memory storing a computer program and a processor that calls the program instructions to execute the text abstract generation method of embodiment 1.
Example 3
A computer-readable storage medium comprising a computer program executable by a processor to implement the text summary generation method of embodiment 1.
Embodiments 1, 2 and 3 provide a text abstract generation method, an electronic device and a medium referring to multi-modal information, which focus on fusing and expressing multi-modal information well: cross-modal interaction fuses information at a low level, routing makes the final expression more efficient at a high level, and the fusion information feeds back into text abstract generation on the decoder side, so that multi-modal information is better fused. The method has many application scenarios in real life, such as news tags, summary generation and the automatic generation of video conference summaries.
The preferred embodiments of the invention have been described in detail above. It should be understood that numerous modifications and variations can be devised by those skilled in the art in light of the present teachings without departing from the inventive concept. Therefore, technical solutions obtainable by those skilled in the art through logical analysis, reasoning and limited experiments based on the prior art and the concept of the present invention shall fall within the scope of protection defined by the claims.

Claims (10)

1. A text abstract generating method referring to multi-mode information is characterized by comprising an encoding step and a decoding step;
the encoding step comprises:
obtaining serialized features of the multi-modal information through feature mapping, wherein the multi-modal information comprises text information, audio information and image information;
enhancing the serialized features of the multi-modal information through an attention mechanism to obtain enhanced features of the multi-modal information;
classifying the enhanced features of the multi-modal information to obtain a plurality of classes of enhanced feature sets;
inputting the plurality of classes of enhanced feature sets into a plurality of feedforward neural networks in one-to-one correspondence to obtain a plurality of groups of fusion features;
splicing the plurality of groups of fusion features into a feature fusion vector;
the decoding step comprises:
taking the feature fusion vector as the hidden-state input and fusing the enhanced features of the text information with the feature fusion vector through a cross-attention mechanism to generate the text abstract.
2. The method of claim 1, wherein the serialized feature of the text information is a word embedding feature;
the serialized features of the audio information are MFCC features;
the process for acquiring the serialization characteristic of the image information comprises the following steps:
and cutting the image, wherein the characteristic dimension of the cut image is the same as that of the word embedding characteristic, and the data length of the cut image is the same as that of the MFCC characteristic.
3. The method according to claim 1, wherein the enhanced features of the multi-modal information are classified by a classifier, the expression of the classifier is:
$$\{R_1, R_2, \ldots, R_n\} = \mathrm{Classifier}(R)$$
where $n$ is the number of feedforward neural networks, $R$ is the input information, and $R_i$ is the classified $i$-th class enhanced feature set; the text information lies in $\mathbb{R}^{L_T \times D_T}$, with $L_T$ the length of the text information and $D_T$ its feature dimension; the audio information lies in $\mathbb{R}^{L_I \times D_I}$, with $L_I$ the length of the audio information and $D_I$ its feature dimension; and the image information lies in $\mathbb{R}^{L_A \times D_A}$, with $L_A$ the length of the image information and $D_A$ its feature dimension.
4. The method according to claim 3, wherein the expression of the fusion feature is:
$$F_i = \mathrm{FFN}_i(R_i), \quad i = 1, 2, \ldots, n$$
where $\mathrm{FFN}_i$ is the $i$-th feedforward neural network and $F_i$ is the $i$-th group of fusion features.
5. The method of claim 1, wherein the enhancing the serialized features of the multimodal information through the attention mechanism comprises:
the text information is enhanced through a self-attention mechanism, and the audio information and the image information are enhanced through a cross-attention mechanism by utilizing the text information.
6. The method of claim 5, wherein the self-attention mechanism is implemented by a Transformer model.
7. The method of claim 5, wherein the cross-attention mechanism is implemented by a Transformer model.
8. The method of claim 1, wherein a mask matrix in the attention mechanism is masked.
9. An electronic device, comprising a memory storing a computer program and a processor that calls the program instructions to execute the text abstract generation method according to any one of claims 1 to 8.
10. A computer-readable storage medium, comprising a computer program executable by a processor to implement the text summary generation method of any of claims 1-8.
CN202210104367.2A 2022-01-28 2022-01-28 Text abstract generation method, electronic device and medium for referring to multi-mode information Pending CN114491006A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210104367.2A CN114491006A (en) 2022-01-28 2022-01-28 Text abstract generation method, electronic device and medium for referring to multi-mode information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210104367.2A CN114491006A (en) 2022-01-28 2022-01-28 Text abstract generation method, electronic device and medium for referring to multi-mode information

Publications (1)

Publication Number Publication Date
CN114491006A (en) 2022-05-13

Family

ID=81477210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210104367.2A Pending CN114491006A (en) 2022-01-28 2022-01-28 Text abstract generation method, electronic device and medium for referring to multi-mode information

Country Status (1)

Country Link
CN (1) CN114491006A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115659987A (en) * 2022-12-28 2023-01-31 华南师范大学 Multi-mode named entity recognition method, device and equipment based on double channels
CN115797943A (en) * 2023-02-08 2023-03-14 广州数说故事信息科技有限公司 Multimode-based video text content extraction method, system and storage medium

Similar Documents

Publication Publication Date Title
CN113762322B (en) Video classification method, device and equipment based on multi-modal representation and storage medium
CN111177366B (en) Automatic generation method, device and system for extraction type document abstract based on query mechanism
CN112069811B (en) Electronic text event extraction method with multi-task interaction enhancement
CN108595708A (en) A kind of exception information file classification method of knowledge based collection of illustrative plates
CN114491006A (en) Text abstract generation method, electronic device and medium for referring to multi-mode information
CN115982350A (en) False news detection method based on multi-mode Transformer
Wang et al. On the role of scene graphs in image captioning
CN116401376A (en) Knowledge graph construction method and system for manufacturability inspection
Wang et al. Tag: Boosting text-vqa via text-aware visual question-answer generation
CN111597816A (en) Self-attention named entity recognition method, device, equipment and storage medium
CN117173730A (en) Document image intelligent analysis and processing method based on multi-mode information
CN116842944A (en) Entity relation extraction method and device based on word enhancement
CN111368532A (en) Topic word embedding disambiguation method and system based on LDA
CN114399646B (en) Image description method and device based on transform structure
CN113177478B (en) Short video semantic annotation method based on transfer learning
CN115730071A (en) Electric power public opinion event extraction method and device, electronic equipment and storage medium
Yang et al. Modality-specific multimodal global enhanced network for text-based visual question answering
Wang et al. RSRNeT: a novel multi-modal network framework for named entity recognition and relation extraction
Li et al. Without detection: Two‐step clustering features with local–global attention for image captioning
Xu et al. Stdnet: Spatio-temporal decomposed network for video grounding
CN116911268B (en) Table information processing method, apparatus, processing device and readable storage medium
CN117934997B (en) Large language model system and method for generating camera case sample
Kitada et al. DM2S2: Deep Multimodal Sequence Sets With Hierarchical Modality Attention
CN111008283B (en) Sequence labeling method and system based on composite boundary information
CN117540810A (en) Visual question-answering method and system based on multi-mode hierarchical structure representation and alignment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination