CN115631504A

CN115631504A - Emotion identification method based on bimodal graph network information bottleneck

Info

Publication number: CN115631504A
Application number: CN202211645853.1A
Authority: CN
Inventors: 李丽; 李平; 苟丽
Original assignee: Southwest Petroleum University
Current assignee: Southwest Petroleum University
Priority date: 2022-12-21
Filing date: 2022-12-21
Publication date: 2023-01-20
Anticipated expiration: 2042-12-21
Also published as: CN115631504B

Abstract

The invention provides an emotion recognition method based on bimodal graph network information bottleneck, which comprises the steps of preprocessing data, and respectively coding pictures and texts through corresponding pre-training models; respectively extracting the characteristics of the text and the image by using a long-short term memory network and a feedforward neural network; constructing a topological graph in the modality based on the grammar dependency relationship and the adjacent position relationship of the visual block, and constructing a dual-modality topological graph based on a complete bipartite graph; designing a modal interaction module based on a bimodal graph network, and realizing information interaction in and among the modalities by utilizing a graph convolution network; converting node representation of the bimodal topological graph into graph representation through a graph pooling technology; and (4) performing bimodal emotion recognition by adopting a multilayer perceptron. In addition, an information bottleneck module is established, and the generalization capability of the method is improved. The emotion recognition method based on the bimodal graph network information bottleneck can effectively fuse modal information and is used for guiding emotion recognition.

Description

Emotion identification method based on bimodal graph network information bottleneck

Technical Field

The invention belongs to the field of bimodal emotion recognition in the fields of natural language processing and vision intersection, and particularly relates to an emotion recognition method based on bimodal graph network information bottleneck.

Background

The emotion recognition aims at mining subjective information in data by using a natural language processing technology, and is widely applied to various fields, such as: financial market forecasting, business review analysis, and the like. With the rapid development of internet technology, information in the internet gradually changes from plain text to bimodal, so that the existing emotion analysis method faces new challenges and opportunities. How to effectively extract and fuse features from bimodal data is key to bimodal emotion characterization.

General bimodal emotion recognition can be realized by splicing, adding and calculating Hadamard products of all monomodal features, but correlation among the modals cannot be obtained in the mode. Recently, a cross attention mechanism method is introduced to enhance the feature fusion of bimodal data; however, cross-attention merely establishes the association of global semantics of one modality with local features on another modality, and is not sufficient to reflect the alignment relationship of the modalities on the local features, and using a global feature representation of a modality for semantic alignment may generate a large noise. Furthermore, the attention-based methods have another drawback, and such methods usually require careful attention patterns, such as: multi-layer/multi-pass attention, multi-layer attention will introduce more parameters, increasing the likelihood of overfitting.

Disclosure of Invention

The invention aims to overcome the defects in the prior art, provides an emotion recognition method based on bimodal graph network information bottleneck, and decomposes data of each modality into semantic units with fine granularity, such as: the text word and image visual block establishes the relation between the bimodal fine-grained semantic units by utilizing the relevance in each modality and among the modalities, so that bimodal feature fusion is directly performed among the fine-grained semantic units, namely, a mapping relation is established for the representation information of each modality by adopting a local alignment local mode, and the semantic information of a text and the local information of an image can be fully fused. In addition, an information bottleneck mechanism is added, so that the generalization capability of the method can be effectively improved.

In order to achieve the purpose, the invention adopts the following technical scheme:

s1: preprocessing data, processing the text by adopting a word embedding technology Glove to obtain a text embedding matrix

(ii) a The image is processed using an image processing technique ResNet152, where the image is cut into pieces prior to processing

A visual block for obtaining an image representation matrix

(ii) a Wherein,

indicating the number of visual blocks.

S2: extracting the features of the preprocessed embedded expression, and extracting the text features by using a bidirectional long-short term memory network

Extracting image features using feed-forward neural networks

。

S3: and constructing a topological graph by using the grammar dependency relationship in the text and the spatial position relationship in the image. The specific operation is as follows:

s31, constructing a topological graph in a text mode by taking words in the text as nodes and grammatical dependency relationship in a dependency tree as undirected edges

。

S32: the visual blocks in the image are taken as nodes, and the spatial position relation between the visual blocks is taken as an undirected edgeConstructing a topology map within an image modality

。

S33: taking words in a text and a visual block in an image as two groups of nodes, forming a non-directional edge by any node in the words and each node in the visual block, and constructing a complete bipartite graph as a dual-mode topological graph

。

S4: and designing a modal interaction module based on a bimodal graph network, and performing representation learning by using a message transmission mechanism of the graph convolution network to realize information interaction and feature fusion in and among the modes. The specific operation is as follows:

s41: topological graph in text mode

The extracted text features are word node feature vectors, the expression learning of the word nodes is carried out through a graph convolution network, the information interaction in the text mode is realized, and the calculation formula is as follows:

in the above formula, the first and second carbon atoms are,

in order to train the parameters, the user may,

the function is activated for sigmoid.

S42: in topological graph in image mode

The image features extracted in S2 are visual block node feature vectors, the representation learning of the visual block nodes is carried out through a graph convolution network,the information interaction in the image modality is realized, and the calculation formula is as follows:

in the above formula, the first and second carbon atoms are,

in order to train the parameters, the user may,

the function is activated for sigmoid.

S43: in a bimodal topology

As an adjacency matrix, splicing the text and image features extracted by S2 into a node feature vector

Information aggregation is carried out through a graph convolution network, information fusion between modes is achieved, and a calculation formula is as follows:

in the above formula, the first and second carbon atoms are,

in order to train the parameters, the user may,

the function is activated for sigmoid.

S44: loops S41-S43 are set according to the specific parameters of the model.

S5: and an information bottleneck module is established, and the generalization capability of the method is improved. The specific operation is as follows:

s51: splicing the text embedding and the image embedding after the S1 data preprocessing to obtain the input characteristics of the information bottleneck module

。

S52: splicing the text features and the image features extracted in the step S2 to obtain intermediate features of the information bottleneck module

。

S53: splicing the text representation and the image representation after the modal interaction based on the bimodal graph network S4 to serve as the output characteristic of the information bottleneck module

。

S54: the goal of the information bottleneck is to reduce

And with

Mutual information between, increase

And

the calculation formula is as follows:

in the above formula, the first and second carbon atoms are,

the goal of the optimization required for the information bottleneck module,

for the parameters of the emotion recognition method based on the bimodal graph network information bottleneck,

is composed of

And with

The mutual information between the two groups is obtained,

is composed of

And with

The mutual information between the two groups of the information,

is an adjustable factor.

S6: obtaining a graph representation vector by adopting a graph pooling technology represented by all nodes in the spliced bimodal topological graph, wherein a calculation formula is as follows:

in the above formula, the first and second carbon atoms are,

a graph representation vector representing the merged text and all node representations of the visual block,

for all of the nodes in the bimodal topology map,

as nodes after S4

Is shown.

S7: and identifying bimodal emotional tendency by using a multi-layer perceptron as a classifier.

S8: the model is trained through bimodal data, a cross entropy loss function and an information bottleneck objective function are used as model training targets, and an Adam optimizer with hot start is used for training the model. The training goals for the model are as follows:

in the above formula, the first and second carbon atoms are,

in order to train one sample in the set,

for the set of all the training samples,

is a coefficient which can be adjusted,

is the true value of the sample or samples,

is a predicted value.

S9: and classifying the bimodal data to be classified through the trained model to obtain an emotion recognition result.

Compared with the existing bimodal emotion recognition method, the emotion recognition method based on the bimodal graph network information bottleneck has the following beneficial effects:

1. forming a bimodal topological graph by the text words and the visual blocks, and utilizing grammatical information of the text and spatial position information of the image;

2. the bi-modal topological graph establishes the relation between the bi-modal fine-grained semantic units, so that the multi-modal feature fusion is directly carried out between the fine-grained semantic units, the semantic information of texts and the local information of images can be fully fused, and the defects of the existing method are greatly supplemented;

3. by utilizing an information bottleneck mechanism, the generalization capability of the method is effectively improved.

Drawings

FIG. 1 is an overall flow chart of the present invention;

FIG. 2 is a diagram of a system model of the present invention;

FIG. 3 is a module for constructing a bimodal topology of the present invention.

Detailed Description

In order that the public may better understand the present invention, specific embodiments thereof will be described below with reference to the accompanying drawings. Wherein the showings are for the purpose of illustration only and not for the purpose of limiting the invention; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.

The invention provides an emotion recognition method based on bimodal graph network information bottleneck, which comprises the following steps:

s1: and (3) data preprocessing, namely preprocessing the text and the image respectively through corresponding pre-training models.

As shown in FIG. 1, the text and image in the bimodal data are separated and then preprocessed separately. For texts, the representation of words is searched in pre-trained Glove, each word is mapped to a 300-dimensional vector, and a text embedding matrix is obtained

(ii) a For images, it is first cut into

A visual block, and then adopting image processing techniqueThe operation ResNet152 processes each visual block, processes each visual block into 1024-dimensional expression vectors, and finally obtains an image embedding matrix

(ii) a Wherein,

indicating the number of visual blocks.

S2: and performing feature extraction on the preprocessed embedded representation.

As shown in fig. 1, the text embedding and the image embedding obtained in S1 are subjected to feature extraction, respectively.

Because the text has a front-back order relation, in order to integrate more context information into word embedding, a bidirectional long-short early-stage memory network is adopted to carry out context semantic dependency learning, and text characteristics are extracted

. The specific calculation formula is as follows:

in the above formula, the first and second carbon atoms are,

in order to forget to leave the door,

in order to input the information into the gate,

in order to output the output gate, the output gate is provided with a gate,

is a vector of candidate values, and is,

the memory cells of the last moment are the memory cells,

the memory cells at the current time are the memory cells,

for the hidden state representation at the last moment,

is a hidden state representation of the current time instant,

、

、

、

and

、

、

、

indicating trainable parameters, subscripts, of long and short term memory networks

Representing the index of the position of the current word in the text.

Because no sequence features exist among visual blocks of the image, the feedforward neural network is adopted to extract the image features

. The specific calculation formula is as follows:

in the above formula, the first and second carbon atoms are,

trainable parameters representing a feed forward neural network.

To facilitate the implementation of subsequent feature fusion, text features

And image features

Is set to 128.

S3: and constructing a topological graph by using the grammar dependency relationship in the text and the spatial position relationship in the image.

In order to solve the defects of the prior art, the alignment relation of each modality on local features is reflected. As shown in fig. 3, this step will construct three topologies, namely: two intra-modal topographies and one bi-modal topographies, the operation is as follows.

S31: for the text modality, complex grammatical dependencies exist between words, and modeling grammatical dependencies facilitate learning of text information. Therefore, a topological graph in a text mode is constructed by taking words in the text as nodes and grammar dependency relationship in a dependency tree as undirected edges

。

S32: constructing a topological graph in an image mode by taking visual blocks in an image as nodes and taking spatial position relations among the visual blocks as undirected edges

。

S33: establishing the relation between bimodal fine-grained semantic units, so that bimodal feature fusion can be directly carried out between the fine-grained semantic units, namely: and establishing a mapping relation for the representation information of each mode by adopting a local alignment local mode, so that the semantic information of the text and the local information of the image are fully fused. Therefore, a word in the text and a visual block in the image are used as two groups of nodes, any node in the word and each node in the visual block form an undirected edge, and a complete bipartite graph is constructed to serve as a bimodal topological graph

。

S4: and designing a modal interaction module based on a bimodal graph network, and performing representation learning by using a message transfer mechanism of the graph convolution network to realize information interaction and feature fusion in and among the modes.

As shown in fig. 2, the text features extracted in S2

And image features

And sending the data to a bimodal graph network, and carrying out information interaction and feature fusion through a graph convolution network on the basis of a topological graph constructed in the S3.

S41: topological graph in text mode

In the form of a contiguous matrix, the matrix,

the expression learning of the word nodes is carried out for the word node feature vectors through a graph convolution network, each word node transmits information to a neighbor word node with a grammar dependency relationship, and the information interaction in a text mode is realized, wherein the calculation formula is as follows:

in the above-mentioned formula, the compound has the following structure,

in order to train the parameters, the user may,

the function is activated for sigmoid.

S42: in topological graph in image mode

In the form of a contiguous matrix, the matrix,

for the feature vectors of the visual block nodes, the representation learning of the visual block nodes is carried out through a graph convolution network, and the information transmission is carried out between the adjacent visual blocks, so as to realize the information interaction in the image modality, and the calculation formula is as follows:

in the above formula, the first and second carbon atoms are,

in order to train the parameters, the user may,

the function is activated for sigmoid.

S43: in a bimodal topology

As an adjacent matrix, splicing the text and image features extracted by S2 into a node feature vector

Information aggregation is carried out through a graph convolution network, all neighbor nodes of each node belong to another mode node, so that information fusion between modes is realized, and a calculation formula is as follows:

in the above formula, the first and second carbon atoms are,

in order to train the parameters in a trainable manner,

the function is activated for sigmoid.

S44: as shown in fig. 2, S41 to S43 form a convolutional network block, and after the model is parametrized, a better parameter value of the number of layers of the convolutional network block is obtained, and S41 to S43 are cycled according to the specific parameter value.

S5: and an information bottleneck module is established, and the generalization capability of the method is improved.

The information bottleneck module runs through the whole process of the method, and the specific operation is as follows.

S51: splicing the text embedding and the image embedding after the S1 data preprocessingObtaining input characteristics of information bottleneck module

。

。

S53: s4, splicing the text representation and the image representation after modal interaction based on the bimodal graph network, wherein the text representation and the image representation are used as the output characteristics of the information bottleneck module

。

S54: the goal of information bottlenecks is to reduce

And with

Mutual information between, increase

And with

The mutual information between the two is calculated according to the following formula:

in the above-mentioned formula, the compound has the following structure,

for the goal of the information bottleneck module requiring optimization,

parameters of emotion recognition method based on bimodal graph network information bottleneckThe number of the first and second groups is,

is composed of

And

the mutual information between the two groups is obtained,

is composed of

And

the mutual information between the two groups is obtained,

is an adjustable factor.

S6: a graph pooling technique is employed to convert the node representation of the bimodal topology graph into a graph representation.

The bimodal emotion recognition is to classify the overall emotional tendency of the data, and needs to combine the feature information of all nodes in the bimodal topological graph. Therefore, a graph pooling technology represented by all nodes in the spliced bimodal topological graph is adopted to obtain a graph representation vector, and a calculation formula is as follows:

in the above formula, the first and second carbon atoms are,

for all of the nodes in the bimodal topology map,

is a node after S4

Is shown.

S7: and expressing the vector through the graph obtained in the S6, and identifying the bimodal emotional tendency by using a multilayer perceptron as a classifier, wherein a calculation formula is as follows:

in the above formula, the first and second carbon atoms are,

for the bimodal characterization to be finally learned,

the emotional tendency predicted for the model,

and

representing a trainable weight that is to be weighted,

and

is a trainable bias.

S8: the model is trained through the bimodal data.

In the training process, a cross entropy loss function and an information bottleneck objective function are used as model training targets, and an Adam optimizer with hot start is used for training the model. Wherein the training targets of the model are as follows:

in the above formula, the first and second carbon atoms are,

in order to train one sample in the set of samples,

for the set of all the training samples,

is a coefficient which can be adjusted,

is the true value of the sample or samples,

is a predicted value.

The described embodiments of the present invention are only for describing the preferred embodiments of the present invention, and do not limit the concept and scope of the present invention, and various modifications and improvements made to the technical solutions of the present invention by those skilled in the art should fall into the protection scope of the present invention without departing from the design concept of the present invention, and the technical contents of the present invention which are claimed are all described in the claims.

Claims

1. An emotion recognition method based on bimodal graph network information bottleneck is characterized by comprising the following steps:

s1: data preprocessing, namely preprocessing a text and an image respectively through corresponding pre-training models;

s2: extracting the features of the preprocessed embedded representation, and extracting the text features by using a bidirectional long-short term memory network

Extracting image features using feed-forward neural networks

；

S3: constructing a topological graph by using a syntax dependency relationship in a text and a spatial position relationship in an image;

s4: designing a modal interaction module based on a bimodal graph network, and utilizing a message transmission mechanism of the graph convolution network to carry out representation learning so as to realize information interaction and feature fusion in and among the modes;

s5: an information bottleneck module is established, and the generalization capability of the method is improved;

s6: converting the node representation of the bimodal topological graph into a graph representation by adopting a graph pooling technology;

s7: identifying bimodal emotional tendency by taking a multilayer perceptron as a classifier;

s8: training the model through the bimodal data;

2. The emotion recognition method based on bimodal graph network information bottleneck, according to claim 1, wherein the S1 specifically is: text is processed by adopting word embedding technology Glove to obtain a text embedding matrix

A visual block for obtaining an image representation matrix

(ii) a Wherein,

representing the number of visual blocks.

3. The emotion recognition method based on bimodal graph network information bottleneck, according to claim 1, wherein the specific step of S3 comprises:

；

；

。

4. The emotion recognition method based on bimodal graph network information bottleneck, according to claim 1, wherein the specific step of S4 comprises:

s41: topological graph in text mode

The extracted text features are word node feature vectors, and the expression learning of word nodes is carried out through a graph convolution network, so that information interaction in a text mode is realized;

s42: topology map in image modality

The image features extracted in S2 are visual block node feature vectors, and the representation learning of the visual block nodes is carried out through a graph convolution network, so that information interaction in an image mode is realized;

s43: in a bimodal topology

Information aggregation is carried out through a graph convolution network, and information fusion between modes is realized;

s44: loops S41-S43 are set according to the specific parameters of the model.

5. The emotion recognition method based on bimodal graph network information bottleneck, according to claim 1, wherein the specific step of S5 comprises:

；

；

；

S54: the goal of information bottlenecks is to reduce

And with

Mutual information between, increase

And with

in the above-mentioned formula, the compound has the following structure,

the goal of the optimization required for the information bottleneck module,

is composed of

And

the mutual information between the two groups is obtained,

is composed of

And

the mutual information between the two groups is obtained,

is an adjustable coefficient.

6. The emotion recognition method based on bimodal graph network information bottleneck, as claimed in claim 1, wherein the S6 specifically is: obtaining a graph representation vector by adopting a graph pooling technology of all node representations in the spliced bimodal topological graph, wherein a calculation formula is as follows:

in the above formula, the first and second carbon atoms are,

for all of the nodes in the bimodal topology,

is a node after S4

Is shown.

7. The emotion recognition method based on bimodal graph network information bottleneck, as claimed in claim 1, wherein said S8 specifically is: using a cross entropy loss function and an information bottleneck objective function as a model training objective, and using an Adam optimizer with hot start to train the model; wherein the training targets of the model are as follows:

in the above-mentioned formula, the compound has the following structure,

in order to train one sample in the set,

for the set of all the training samples,

is a coefficient which can be adjusted,

is the true value of the sample or samples,

is a predicted value.