CN116844095A - Video emotion polarity analysis method based on multi-mode depth feature level fusion - Google Patents

Video emotion polarity analysis method based on multi-mode depth feature level fusion

Info

Publication number
CN116844095A
CN116844095A (application CN202311064915.4A)
Authority
CN
China
Prior art keywords
mode
modal
data
video
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311064915.4A
Other languages
Chinese (zh)
Inventor
谢珺
刘琴
续欣莹
郝戍峰
郝雅卉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Technology filed Critical Taiyuan University of Technology
Priority to CN202311064915.4A priority Critical patent/CN116844095A/en
Publication of CN116844095A publication Critical patent/CN116844095A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0499Feedforward networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the technical field of intelligent information processing and multi-modal emotion analysis, and in particular to a video emotion polarity analysis method based on multi-modal depth feature-level fusion that addresses the technical problems noted in the background section. The method comprises extracting original features, constructing a video-segment emotion polarity analysis model, and training and testing the model. The video-segment emotion polarity analysis model comprises a multi-modal feature-level interactive fusion unit and an emotion polarity discrimination unit; the multi-modal feature-level interactive fusion unit comprises a bottom-level bimodal feature interaction module and a high-level trimodal feature-level fusion module, and the high-level trimodal feature-level fusion module comprises a pairwise bilinear gated fusion unit and a trimodal self-attention feed-forward fusion unit. The invention fully fuses the multi-modal features while filtering redundancy and noise from the fused features, improves the representational capability of the multi-modal emotion features, and effectively identifies the positive or negative emotion polarity of speakers in video segments.

Description

Video emotion polarity analysis method based on multi-mode depth feature level fusion
Technical Field
The invention relates to the technical field of intelligent information processing and multi-modal emotion analysis, in particular to a video emotion polarity analysis method based on multi-modal depth feature-level fusion.
Background
With the rapid development of the Internet, intelligent mobile terminals, and domestic and foreign social media platforms such as Bilibili, Douyin, Kuaishou, and YouTube, more and more users tend to share their opinions on events, topics, policies, products, and services in the form of videos, and a large number of multimedia video resources carrying personal emotional attitudes are emerging online. This user-generated video content contains a large amount of information, can reflect the emotion, attitude, and views of the speaker, and has enormous commercial and application value. For example, government departments can infer netizens' opinions of and attitudes toward a policy through video analysis; when negative messages surge, they can take effective measures in time to curb the spread of negative content while gaining guidance for policy improvement. A brand company can gauge the public's evaluation of a brand through social media video analysis, and can take corresponding actions and propose optimization schemes when negative evaluations rise over a short period. In addition, studies have shown that the short-video contribution behavior of non-professional content creators has increased significantly, and governments, social organizations, and others can respond to possible sudden public-opinion shocks by analyzing these multimedia video resources that carry a large number of personal emotional attitudes. Moreover, with the advent of ChatGPT, emotional conversation technology has become a research hotspot, and accurately understanding and recognizing the emotion of a speaker in a video is the primary basis for generating emotional replies.
Compared with plain text, multi-modal video containing visual, speech, and textual information is more consistent with the multi-sensory nature of human expression and perception: a user can express and perceive emotion in video along multiple dimensions. Emotion has very important social value and significance for environmental adaptation and lies at the core of human cross-cultural interaction; humans naturally rely on recognizing the emotion of the other party to judge their behavioral tendencies, so as to mobilize appropriate brain resources, adjust their own behavior, and make reasonable decisions.
As early as the 1960s, the role of emotion in machine intelligence attracted the attention of many scholars. For example, Professor Herbert Simon proposed in 1967 that the influence of emotion must be incorporated into a general theory of thinking and problem solving. Professor Minsky of the Massachusetts Institute of Technology proposed in The Society of Mind, published in 1986, that emotion is an important component of machine intelligence. In 1997, Professor Picard of the MIT Media Lab proposed the concept of Affective Computing: computing that relates to, arises from, or can influence emotion. In 1999, Professor Wang Zhiliang of the University of Science and Technology Beijing proposed the theory of artificial psychology. Professor Hu Baogang of the Institute of Automation, Chinese Academy of Sciences, and colleagues also gave a definition of affective computing based on their own research: its purpose is to create a harmonious human-machine environment by giving computers the ability to recognize, understand, express, and adapt to human emotion, and thereby to endow computers with higher and more comprehensive intelligence. Affective computing has gradually become an emerging research field; with the development of artificial intelligence technology, its theoretical system has been continuously accumulated and refined, and affective computing is now applied maturely in fields such as distance education, healthcare, smart cities, financial technology, smart home appliances, online entertainment, psychological well-being, civil services, and natural human-machine interaction.
Sentiment Analysis (SA) is one of the key technical links in the field of affective computing. Its goal is to use computers to automatically process and analyze collected physiological data (ECG, EEG, EMG, skin conductance, respiratory signals, and the like) and behavioral data (gestures, facial expressions, body posture, speech intonation, and the like), extract the relevant emotional features, build models from those features, analyze the mapping between the external expression and the internal state of emotion, and predict information such as the speaker's current emotional state, emotion category, and emotion intensity.
In early research, sentiment analysis considered only the text modality, extracting emotional cues from the internal context of the text for emotion recognition. In recent years, with the popularization of multi-modal human-computer interaction (MHCI) devices and the development of short-video platforms, new occupations such as video bloggers, vloggers, and other self-media creators have emerged one after another. Professional and ordinary users alike can upload and publish all kinds of videos anytime and anywhere, user-generated content has gradually shifted from text toward video, and research on video sentiment analysis has arisen accordingly.
Two main types of theoretical emotion models exist in psychology: discrete emotion classification models and dimensional emotion models. A discrete emotion model represents emotions as separate labels with no correlation between them. Ekman first demonstrated a correlation between facial expressions and emotions; cross-cultural studies showed that people in different cultural environments perceive certain basic emotions in the same way, and a classification model based on six basic emotions was proposed accordingly. A dimensional emotion model represents emotion at a finer granularity through multiple dimensions: a common three-dimensional model defines emotion through axes and poles, with emotions distributed at different positions between the two poles of each axis; classical examples include the PAD model and the inverted-cone three-dimensional emotion model. At present, using discrete emotion models for sentiment analysis remains the most popular approach in the field of multi-modal affective computing.
The target of video sentiment analysis comprises the multi-modal information separated from the video, such as text, audio, and vision. To mine emotion polarity from this multi-modal data, prior-art methods generally assume that all multi-modal features promote sentiment analysis: they first adopt various feature-extraction models to obtain the original representation of each modality, and then use a linear classifier or a multilayer perceptron (MLP) to integrate all modal features and recognize the emotion in the video. Methods such as MMMU-BA and DEAN focus on innovations in multi-modal fusion, using cross-modal attention mechanisms to fuse aligned multi-modal feature sequences and thus achieve utterance-level sentiment analysis; the CIA method introduces a self-encoding network to guide the model to learn the relations among different modalities.
However, the above methods can capture cross-modal interaction information only from a single angle and do not consider the information redundancy and mutual interference between different modalities; improper multi-modal fusion introduces noisy features into sentiment analysis, so these methods still have limitations. To compensate for the shortcomings of traditional video sentiment analysis methods, the present invention constructs a multi-modal fusion method suited to the characteristics of each modality in a video segment, establishes an emotion polarity analysis model, attends to the data features most relevant to sentiment analysis in the multi-modal video sequence, and jointly learns the associated information among modalities.
Disclosure of Invention
The invention provides a video emotion polarity analysis method based on multi-modal depth feature-level fusion, which aims to overcome the technical defect that existing methods can capture cross-modal interaction information only from a single angle and introduce noisy features into sentiment analysis.
The invention discloses a video emotion polarity analysis method based on multi-mode depth feature level fusion, which is realized by a video processing unit and an emotion analysis unit and comprises the following steps:
s1: extracting original features:
dividing a complete video into a plurality of video fragments by a video processing unit, and dividing the plurality of video fragments into training data, verification data and test data based on a random sampling method; collecting facial expression data, voice signal data and text subtitle data of a speaker in each video segment, and sending the facial expression data, the voice signal data and the text subtitle data to a single-mode original feature extraction unit to obtain original depth features of three single-mode data;
s2: constructing a video fragment emotion polarity analysis model:
the video-segment emotion polarity analysis model comprises a multi-modal feature-level interactive fusion unit and an emotion polarity discrimination unit; the multi-modal feature-level interactive fusion unit comprises a bottom-level bimodal feature interaction module and a high-level trimodal feature-level fusion module, and the high-level trimodal feature-level fusion module comprises a pairwise bilinear gated fusion unit and a trimodal self-attention feed-forward fusion unit. The original depth features of the single-modality data are first processed by the bottom-level bimodal feature interaction module, which introduces a pairwise attention mechanism to capture the semantic relation between any two single-modality data streams; they are then processed by the pairwise bilinear gated fusion unit and the trimodal self-attention feed-forward fusion unit of the high-level trimodal feature-level fusion module, finally yielding the hierarchically and interactively fused multi-modal features. The multi-modal features are processed by the emotion polarity discrimination unit: the multi-modal features extracted by the multi-modal feature-level interactive fusion unit pass through a classification layer, the emotion probability distribution of the target video segment is computed, and the category with the maximum probability is the emotion polarity type judged for that video segment;
s3: model training and testing:
training the constructed video segment emotion polarity analysis model by using training data; using verification data to evaluate the training effect of the video segment emotion polarity analysis model in the training process, and continuously adjusting and optimizing to obtain an optimal video segment emotion polarity analysis model; and testing the optimal video segment emotion polarity analysis model by using the test data, and calculating a final emotion polarity type classification effect index.
Compared with the prior art, the technical solution provided by the invention has the following advantages: the emotion polarity of the speaker in a video segment is recognized more accurately and with a higher recognition rate; redundant and noisy information in the fused features is filtered while the multi-modal features are fully fused, the representational capability of the multi-modal emotion features is improved to a certain extent, and the positive or negative emotion polarity of speakers in video segments can be effectively identified.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a flow chart of a video emotion polarity analysis method based on multi-mode depth feature level fusion according to an embodiment of the present invention;
FIG. 2 is a technical roadmap of a video emotion polarity analysis method based on multi-mode depth feature level fusion provided by an embodiment of the invention;
FIG. 3 is a schematic diagram of an overall framework of a video emotion polarity analysis method based on multi-modal depth feature level fusion according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a tri-modal self-attention feedforward fusion unit according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a paired bilinear fusion unit according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a multi-mode gating mechanism according to an embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the invention will be more clearly understood, a further description of the invention will be made. It should be noted that, without conflict, the embodiments of the present invention and features in the embodiments may be combined with each other.
In the description, it should be noted that the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. It should also be noted that, unless explicitly stated or limited otherwise, the terms "mounted," "connected," and "coupled" are to be construed broadly: a connection may, for example, be fixed, detachable, or integral; mechanical or electrical; direct, or indirect through an intermediate medium, or internal communication between two elements. The specific meaning of the above terms will be understood by those of ordinary skill in the art as the case may be.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced otherwise than as described herein; it will be apparent that the embodiments in the specification are only some, but not all, embodiments of the invention.
Specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
In an embodiment of the invention, a video emotion polarity analysis method based on multi-mode depth feature level fusion is disclosed, which is realized by a video processing unit and an emotion analysis unit, and comprises the following steps:
s1: extracting original features: dividing a complete video into a plurality of video fragments by a video processing unit, and dividing the plurality of video fragments into training data, verification data and test data based on a random sampling method; collecting facial expression data, voice signal data and text subtitle data of a speaker in each video segment, and sending the facial expression data, the voice signal data and the text subtitle data to a single-mode original feature extraction unit to obtain original depth features of three single-mode data;
s2: constructing a video fragment emotion polarity analysis model: the video segment emotion polarity analysis model comprises a multi-modal feature level interaction fusion unit and an emotion polarity discrimination unit, wherein the multi-modal feature level interaction fusion unit comprises a bottom-layer dual-modal feature interaction module and a high-layer tri-modal feature level fusion module, and the high-layer tri-modal feature level fusion module comprises a paired dual-linear gating fusion unit and a tri-modal self-attention feedforward fusion unit; firstly, processing original depth features of single-mode data through a bottom layer double-mode feature interaction module, introducing a paired attention mechanism, and capturing semantic relations between any two single-mode data; the multi-modal characteristics after the hierarchical interaction fusion are finally obtained through the processing of a paired bilinear gating fusion unit and a tri-modal self-attention feedforward fusion unit of the high-level tri-modal characteristic hierarchical fusion module; the multi-modal characteristics extracted by the multi-modal characteristic hierarchy interaction fusion unit pass through a classification layer, wherein in a specific embodiment, the classification layer is a full connection layer and a Softmax normalization layer, the emotion probability distribution result of a target video segment is calculated, and the category corresponding to the maximum probability is the emotion polarity type judged for the video segment;
s3: model training and testing: training the constructed video segment emotion polarity analysis model by using training data; using verification data to evaluate the training effect of the video segment emotion polarity analysis model in the training process, and continuously adjusting and optimizing to obtain an optimal video segment emotion polarity analysis model; and testing the optimal video segment emotion polarity analysis model by using the test data, and calculating a final emotion polarity type classification effect index.
On the basis of the above embodiment, in a preferred embodiment, in step S1, when the complete video is divided, it is divided in units of utterances, where an utterance is defined as one of a plurality of speaking segments cut out according to pauses or breaks in the speech signal of the video. The facial expression data, speech signal data, and text subtitle data correspond to the visual modality, the audio modality, and the text modality, respectively; the facial expression data of the speaker in a video segment are obtained from a camera, the speech signal data of the speaker in the video segment are obtained from a microphone, and the text subtitle data of the video segment are obtained from a subtitle document. The original depth features of the three single-modality data streams are extracted as follows:
s11, aligning data of a visual mode and an audio mode to a text mode for extracting three mode characteristics of each segment, so that time step lengths of the visual mode, the audio mode and the text mode are consistent, aligning subsequences of the three modes according to word levels because the most basic language components are words in various languages, acquiring start and stop time stamps of each word in the text mode by using a P2FA tool, and aligning the start and stop time stamps to the visual mode sequence and the audio mode sequence according to the word level, so that the subsequences of the visual mode, the audio mode and the text mode are aligned according to the word level, and obtaining three single mode data aligned video segments;
s12, performing feature extraction on three single-mode data aligned video clips by using different depth feature extraction methods, converting heterogeneous multi-mode information with different forms and sources into computer-understandable dense feature vectors, namely extracting original depth features of a text mode by using a word embedding technology and a CNN network, extracting original depth features of an audio mode by using common voice analysis frames such as OpenSMILE or COVAREP, and extracting original depth features of a visual mode by using common visual analysis frames such as FACET or 3D-CNN.
On the basis of the foregoing embodiment, in a preferred embodiment, in step S2 the processing of the bottom-level bimodal feature interaction module is as follows: the internal temporal dependencies of the original depth features of each single-modality data stream are captured and the features are encoded separately; the encoded features of each modality are then mapped into a common semantic space by a Dense layer to eliminate the semantic gap; finally, the features are fed into pairwise attention mechanism units to learn the interaction dependencies between pairs of modalities. The three bimodal combinations text-audio, audio-visual, and text-visual are trained independently, and the hidden-layer features of the three bimodal combinations output by the pairwise attention mechanism units are taken as the initial input of the high-level module. The processing of the high-level trimodal feature-level fusion module is as follows: the trimodal self-attention feed-forward fusion unit filters the noisy components of the hidden-layer features of the three bimodal combinations and then extracts the denoised trimodal features; the pairwise bilinear gated fusion unit captures the dependencies among the trimodal feature representations and then performs feature filtering, retaining only the multi-modal features relevant to sentiment analysis.
On the basis of the above embodiment, in a preferred embodiment, dividing the plurality of video segments into training data, validation data, and test data by random sampling means that the segments are first split into overall training data and test data at a ratio of 8:2, and the overall training data are then split into training data and validation data at a ratio of 8:2, as in the sketch below.
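A minimal sketch of the two 8:2 random splits described above, using scikit-learn's train_test_split; the segment ids and labels are placeholders, not data defined by the patent.

```python
from sklearn.model_selection import train_test_split

segments = list(range(1000))            # placeholder video-segment ids
labels = [i % 2 for i in segments]      # placeholder polarity labels

# 8:2 split into overall training data and test data
train_all, test, y_train_all, y_test = train_test_split(
    segments, labels, test_size=0.2, random_state=42)

# 8:2 split of the overall training data into training and validation data
train, val, y_train, y_val = train_test_split(
    train_all, y_train_all, test_size=0.2, random_state=42)

print(len(train), len(val), len(test))  # 640 160 200
```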
Based on the above embodiment, in a preferred embodiment, when the underlying bimodal feature interaction module processes, a cross-modal context attention unit is selected as a paired attention mechanism unit, which specifically includes the following steps:
s211: capturing the time sequence dependency relationship inside a single mode by the original depth feature of each single mode data through a bidirectional gating circulating unit, and obtaining fragment feature characterization containing context information;
s212: projecting the segment characteristic representation to a public semantic space with a dimension D through a full-connection layer with nonlinear excitation, and obtaining vector representations of a text mode, an audio mode and a visual mode in the public semantic space;
s213: combining original depth features of each single-mode data in pairs, and fusing the dual-mode information by adopting three pairs of cross-mode context attention units to obtain feature expression vectors fused by the three dual-mode information;
s214: s211 is applied to encode original depth features of each piece of single-mode data, the feature expression vectors fused by three pieces of double-mode information are spliced with the original depth feature pairs of the encoded single-mode data respectively to serve as input of an emotion polarity judging unit, and emotion probability distribution of a target utterance is obtained through a full-connection layer and a Softmax normalization layer;
s215: and (3) respectively performing independent training on three bimodal combinations of a text modality-audio modality, an audio modality-visual modality and a text modality-visual modality by applying step S211-step S214, and finally taking three hidden layer features output by a cross-modality context attention unit as initial input of a high-level module.
In a specific embodiment, the feature expression vector of step S213 is the bimodally fused representation output by the cross-modal context attention unit, whose calculation proceeds as follows: the representations of two different modalities are combined by matrix multiplication to obtain a cross-modal matching matrix, the matching scores are used to attend over the context of the other modality, and a Hadamard product gates the attended context back onto the original features, yielding, for example, the feature vector fused from the text and visual modalities (a sketch of one possible formulation is given below).
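A minimal sketch of one plausible pairwise cross-modal context attention unit, assuming an MMMU-BA-style bi-directional formulation (matrix multiplication, softmax normalisation, Hadamard gating); the function name, tensor shapes, and the final concatenation are illustrative assumptions rather than the patented computation.

```python
import torch
import torch.nn.functional as F

def cross_modal_context_attention(X_a, X_b):
    """Pairwise cross-modal context attention between two modalities.

    X_a, X_b : (num_utterances, D) utterance sequences of two modalities,
               already projected into a common semantic space of dimension D.
    Returns the attended features of both directions, concatenated.
    """
    # cross-modal matching matrices (matrix multiplication)
    M1 = X_a @ X_b.T                    # (T, T)
    M2 = X_b @ X_a.T
    # normalise the matching scores over the context dimension
    N1 = F.softmax(M1, dim=-1)
    N2 = F.softmax(M2, dim=-1)
    # cross-modal context vectors
    O1 = N1 @ X_b                       # (T, D)
    O2 = N2 @ X_a
    # Hadamard product gates the original features with the attended context
    A1 = O1 * X_a
    A2 = O2 * X_b
    return torch.cat([A1, A2], dim=-1)  # fused bimodal representation

# toy usage: 20 utterances, common dimension D = 100
fused_tv = cross_modal_context_attention(torch.randn(20, 100), torch.randn(20, 100))
print(fused_tv.shape)  # torch.Size([20, 200])
```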
In a specific embodiment, the emotion probability distribution of the target utterance in step S214 is calculated as

$$P = \mathrm{softmax}(W_{s} h + b_{s}),$$

where $W_{s}$ and $b_{s}$ are the learnable weight matrix and bias term of the fully connected layer, $h$ is the concatenated feature input to the discrimination unit, and $P$ is the emotion category probability distribution of the target video segment.
In a specific embodiment, during training of the bottom-level module, the important network parameters are set as follows: the number of hidden-layer neurons is set to 100, the learning rate to 1e-3, and the batch size to 16; 100 epochs are trained in total, and the dropout rate is set to 0.5. An Adam optimizer is used to train the network; compared with plain stochastic gradient descent, Adam is simple to implement and computationally efficient.
On the basis of the above embodiment, in a preferred embodiment, the trimodal self-attention feed-forward fusion unit comprises a connected self-attention filter layer and a feed-forward-network shallow fusion layer, and the pairwise bilinear gated fusion unit comprises a connected pairwise bilinear fusion unit and a multi-modal gated output layer. The processing of the high-level trimodal feature-level fusion module specifically comprises the following steps: the self-attention filter layer filters the noisy components of the hidden-layer features of the three bimodal combinations text-audio, audio-visual, and text-visual, obtaining the self-attention-filtered bimodal features; the feed-forward-network shallow fusion layer then extracts the denoised trimodal features, and the self-attention-filtered bimodal features are concatenated with the feed-forward-fused features to promote gradient back-propagation, finally obtaining the self-attention feed-forward-fused trimodal features. The pairwise bilinear fusion unit captures the dependencies among the trimodal feature representations; the multi-modal gated output layer then adaptively learns the proportion occupied by each input feature, activates the features useful for emotion classification, and performs feature filtering so that only the multi-modal features most relevant to sentiment analysis are retained. The specific strategy by which the multi-modal gating layer assigns different weights to different input features is as follows: the inputs are the different bilinearly fused features of the multi-modal gated output layer, and the weight of each input is learned adaptively by three independent two-layer nonlinear feed-forward networks; after each weight is assigned to its input feature, the result is averaged over the feature dimension, which reduces the dimensionality while retaining the feature information to the greatest extent; finally, the trimodal features after bilinear gated fusion are obtained.
The multi-modal gating layer can adaptively learn the proportions of different input features and activate the features useful for emotion classification. Specifically, the bilinearly fused features are taken as input, and the multi-modal gated output mechanism of the gating layer assigns different weights to different input features, eliminating redundant and noisy features and improving the discriminability of the emotion features. Multi-modal gating mechanisms are widely used in multi-modal sentiment analysis tasks; for example, the DEAN method uses such a mechanism to compute the importance of the different modalities and to weight and control the output of each target modality.
In some embodiments, all the bilinearly fused features are first concatenated; three independent two-layer nonlinear feed-forward networks then adaptively learn the weight of each input, each weight is assigned to its input feature, and the weighted features are averaged over the feature dimension, which reduces the dimensionality while retaining the feature information to the greatest extent, finally yielding the trimodal features after bilinear gated fusion. In this calculation, a sigmoid activation function is applied on top of each two-layer feed-forward network, whose first layer has a learnable weight matrix and bias term and whose second layer has a learnable weight matrix (a sketch of one possible implementation follows).
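A minimal PyTorch sketch of a multi-modal gated output layer of the kind described above: three independent two-layer nonlinear feed-forward networks with sigmoid outputs learn a weight for each bilinearly fused input, and the weighted features are then averaged. The hidden size, the use of the concatenated features as gating evidence, and the element-wise averaging across the three weighted inputs are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GatedOutputLayer(nn.Module):
    """Adaptively weights three bilinearly fused inputs with gate networks."""

    def __init__(self, dim, hidden=128):
        super().__init__()
        # one two-layer nonlinear feed-forward gate per fused input
        self.gates = nn.ModuleList([
            nn.Sequential(nn.Linear(3 * dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, dim), nn.Sigmoid())
            for _ in range(3)
        ])

    def forward(self, f_ta, f_av, f_tv):
        # concatenate all bilinearly fused features as gating evidence
        concat = torch.cat([f_ta, f_av, f_tv], dim=-1)            # (T, 3*dim)
        weighted = [gate(concat) * feat                           # assign each learned weight
                    for gate, feat in zip(self.gates, (f_ta, f_av, f_tv))]
        # averaging the weighted features reduces the concatenated 3*dim inputs to dim
        return torch.stack(weighted, dim=0).mean(dim=0)           # (T, dim)

gom = GatedOutputLayer(dim=300)
fused = gom(torch.randn(20, 300), torch.randn(20, 300), torch.randn(20, 300))
print(fused.shape)  # torch.Size([20, 300])
```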
The module introduces a two-level trimodal fusion mechanism, consisting of the units abbreviated Tri-SAFFU and PBGFU. Tri-SAFFU adopts a self-attention mechanism to filter the noise in the bimodal features obtained by training the bottom-level bimodal feature interaction module, and then uses feed-forward fusion to obtain a trimodal feature representation. PBGFU captures the dependencies among the trimodal features using a Pairwise Bilinear Fusion module (PBF), and then adopts a multi-modal Gated Output Module (GOM) to perform feature filtering, retaining only the features most relevant to sentiment analysis.
In some embodiments,
S221: steps S211 to S215 are applied to obtain three bimodal input feature matrices, whose temporal dependencies are encoded with a BiGRU so that each specific segment contains information from its preceding and following context; a fully connected layer with nonlinear excitation is added to project each bimodal utterance feature into a common feature space of dimension D;
s222: the bimodal vectors of the public feature space are respectively subjected to two-level fusion units to obtain three-modal fine granularity emotion characterization, which is specifically as follows: a trimodal self-attention feedforward fusion unit and a paired bilinear gating fusion unit.
Step S221 obtains three groups of bimodal feature matrices through BiGRU and nonlinear fully connected encoding; on this basis, a self-attention operation is applied to each of the three feature matrices to remove the redundant components in the bimodal interaction information. The self-attended bimodal feature representations are then combined in pairs: each pair is concatenated along the feature dimension and passed through a fully connected layer with a learnable weight matrix, realizing the feed-forward fusion of the trimodal features. In some embodiments, inspired by residual networks, the feed-forward-fused representations are concatenated with the self-attention-filtered bimodal representations to facilitate gradient back-propagation, finally yielding the self-attention feed-forward-fused trimodal feature representation (a sketch of one possible implementation follows).
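A minimal sketch of a trimodal self-attention feed-forward fusion unit as described above: each bimodal representation is filtered by self-attention, the filtered representations are fused pairwise through concatenation and a fully connected layer, and the filtered and fused representations are concatenated residual-style. The multi-head attention module, head count, and layer sizes are illustrative assumptions, not the patent's exact configuration.

```python
import torch
import torch.nn as nn

class TriSAFFU(nn.Module):
    """Sketch of a tri-modal self-attention feed-forward fusion unit."""

    def __init__(self, dim, heads=4):
        super().__init__()
        # one self-attention filter per bimodal combination (ta, av, tv)
        self.filters = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(3)])
        # shallow feed-forward fusion of pairwise-concatenated features
        self.ffn = nn.ModuleList(
            [nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU()) for _ in range(3)])

    def forward(self, h_ta, h_av, h_tv):
        feats = [h_ta, h_av, h_tv]
        # self-attention filtering removes redundant / noisy components
        filtered = [attn(x, x, x)[0] for attn, x in zip(self.filters, feats)]
        # pairwise concatenation + fully connected layer = feed-forward fusion
        pairs = [(0, 1), (1, 2), (0, 2)]
        fused = [ffn(torch.cat([filtered[i], filtered[j]], dim=-1))
                 for ffn, (i, j) in zip(self.ffn, pairs)]
        # residual-style concatenation of filtered and fused representations
        return torch.cat(filtered + fused, dim=-1)

unit = TriSAFFU(dim=300)
x = torch.randn(1, 20, 300)   # (batch, utterances, dim)
print(unit(x, x, x).shape)    # torch.Size([1, 20, 1800])
```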
in a specific embodiment, the obtaining of the dependence between the three-mode feature characterization by using the paired bilinear fusion unit specifically includes that the paired bilinear fusion unit uses a low-rank bilinear model to enable the input bimodal feature matrix to be embedded into a new feature space through a nonlinear Dense layer; the Hadamard product is applied to approximate a bilinear model for full feature interaction, and a self-attention mechanism is added to correlate the context information, so that the emotion feature representation is further improved.
The low-rank bilinear model is widely used in classification tasks. To reduce the computational cost, the bilinear model is approximated with a Hadamard product of low-rank projections of the two inputs, where all incoming bimodal pairs share the same projection weight parameters (a sketch of this approximation follows).
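A minimal sketch of the low-rank bilinear approximation described above: the two inputs are projected into a shared low-rank space through nonlinear Dense layers and combined with a Hadamard product. The rank, the tanh nonlinearity, and the class name are assumptions for illustration; the follow-up self-attention step discussed next is omitted here.

```python
import torch
import torch.nn as nn

class LowRankBilinearFusion(nn.Module):
    """Approximates bilinear pooling with a Hadamard product of projections."""

    def __init__(self, dim, rank=128):
        super().__init__()
        # shared low-rank projections (all bimodal pairs share these weights)
        self.U = nn.Linear(dim, rank)
        self.V = nn.Linear(dim, rank)

    def forward(self, x_a, x_b):
        # nonlinear Dense embedding into a new feature space,
        # then element-wise (Hadamard) interaction
        return torch.tanh(self.U(x_a)) * torch.tanh(self.V(x_b))

pbf = LowRankBilinearFusion(dim=300)
z = pbf(torch.randn(20, 300), torch.randn(20, 300))
print(z.shape)  # torch.Size([20, 128])
```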
in some embodiments, to further refine emotion feature representation to enhance the model's ability to recognize emotion, the module uses self-attention to correlate context information after bilinear computation toFor example, the specific calculation process is as follows:
likewise, toAndcan be used as input to obtainAnd
in some embodiments, the self-care filtered bimodal characterization is fused with the feedforward bimodal characterization, i.eAndsplicing while inspiring the residual network, to facilitate model gradient back propagation, input features of the moduleAndthe jump is linked to the output of the module. Finally, emotion characteristics for emotion discrimination are expressed as:
for emotion polarity classification tasksAnd obtaining the emotion probability distribution of the target video segment through a full connection layer and a Softmax normalization layer, wherein the emotion probability distribution comprises the following formula:
wherein Andrespectively a weight matrix and a bias term which can be learned in the full connection layer.Representing the first in videoAnd the probability distribution of emotion labels corresponding to the segments.
In some embodiments, to compare the performance of the models fairly, all compared models are trained for the emotion polarity classification task with a cross-entropy (CE) loss:

$$\mathcal{L} = -\sum_{i=1}^{N}\sum_{j=1}^{c_{i}} \left[\, y_{ij}\,\log \hat{p}_{ij} + (1 - y_{ij})\,\log (1 - \hat{p}_{ij}) \,\right],$$

where $N$ is the number of videos, $c_{i}$ is the number of segments contained in the $i$-th video, $\hat{p}_{ij}$ is the probability predicted by the model that the $j$-th segment of the $i$-th video is a positive example (so $1-\hat{p}_{ij}$ is the predicted probability that it is a negative example), and $y_{ij}$ is the true label of the $j$-th segment of the $i$-th video, taken as 1 for a positive example and 0 otherwise.
In some embodiments, to prevent overfitting caused by a mismatch between the amount of data and the number of neural network parameters, a Dropout layer is added;
in some embodiments, during training, important network hyper-parameter values are set as follows: the learning rate is set to be 1e-3, the Dropout rate of the Dense layer and the Dropout rate of the BiGRU layer are set to be 0.7 and 0.5 respectively, the number of hidden layer units is set to be 300, and the batch size is set to be 16; in the experiment, an Early Stop strategy (Early Stop) is adopted, training is stopped when the number of times that the Loss value on the test set is not reduced is accumulated to a set threshold value, and the threshold values of two data sets in the experiment are set to be 10; in training the neural network, an Adam optimizer based on random gradient descent is used to optimize model parameters.
On the basis of the above embodiment, in a preferred embodiment, the model training and testing method includes:
s41: training the constructed video segment emotion polarity analysis model by using training data, and adjusting model parameters according to a training structure;
s42: the training effect of the video segment emotion polarity analysis model after the parameters are adjusted in the training process is evaluated by using the verification data, the training effect is used for adjusting and optimizing the model, and when the number of times that the model is no longer excellent in performance on the verification data is accumulated to a set threshold value, the training is stopped, and the optimal video segment emotion polarity analysis model is obtained;
s43: and testing the optimal emotion polarity analysis model of the video segment by using the test data to obtain an emotion polarity analysis result of the video segment to be analyzed, and calculating a final classification effect index of the emotion polarity type.
In some embodiments, to measure the sentiment analysis effect of the invention, two evaluation metrics, classification Accuracy and Weighted-Average F1 (abbreviated Acc and F1), are used to evaluate the model. A confusion matrix of the classification results is also introduced to display the quality of the classification intuitively. To reduce the randomness of the experimental process and to demonstrate the stability of the proposed method, ten fixed random seeds are used, and the mean and standard deviation of the Acc and F1 values over the ten runs are reported as the experimental results; the smaller the standard deviation, the more consistent the model's performance across repeated experiments and the more stable the model (see the evaluation sketch below).
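A minimal sketch of the evaluation protocol described above: run the experiment with ten fixed seeds and report the mean and standard deviation of Accuracy and weighted-average F1. The `run_experiment` callable is a placeholder for the patent's training-and-testing pipeline.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def evaluate_over_seeds(run_experiment, seeds=range(10)):
    """Report mean and standard deviation of Acc and weighted-average F1.

    `run_experiment(seed)` is a placeholder that trains and tests the model
    with a fixed random seed and returns (y_true, y_pred) on the test data.
    """
    accs, f1s = [], []
    for seed in seeds:
        y_true, y_pred = run_experiment(seed)
        accs.append(accuracy_score(y_true, y_pred))
        f1s.append(f1_score(y_true, y_pred, average="weighted"))
    return (np.mean(accs), np.std(accs)), (np.mean(f1s), np.std(f1s))
```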
The foregoing is only a specific embodiment of the invention, provided to enable those skilled in the art to understand or practice the invention. Although described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can be modified, or some or all of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the embodiments and shall be covered by the scope of the appended claims.

Claims (8)

1. The video emotion polarity analysis method based on multi-mode depth feature level fusion is realized by a video processing unit and an emotion analysis unit and is characterized by comprising the following steps of:
s1: extracting original features:
dividing a complete video into a plurality of video fragments by a video processing unit, and dividing the plurality of video fragments into training data, verification data and test data based on a random sampling method; collecting facial expression data, voice signal data and text subtitle data of a speaker in each video segment, and sending the facial expression data, the voice signal data and the text subtitle data to a single-mode original feature extraction unit to obtain original depth features of three single-mode data;
s2: constructing a video fragment emotion polarity analysis model:
the video segment emotion polarity analysis model comprises a multi-modal feature level interaction fusion unit and an emotion polarity discrimination unit, wherein the multi-modal feature level interaction fusion unit comprises a bottom-layer dual-modal feature interaction module and a high-layer tri-modal feature level fusion module, and the high-layer tri-modal feature level fusion module comprises a paired dual-linear gating fusion unit and a tri-modal self-attention feedforward fusion unit; firstly, processing original depth features of single-mode data through a bottom layer double-mode feature interaction module, introducing a paired attention mechanism, and capturing semantic relations between any two single-mode data; the multi-modal characteristics after the hierarchical interaction fusion are finally obtained through the processing of a paired bilinear gating fusion unit and a tri-modal self-attention feedforward fusion unit of the high-level tri-modal characteristic hierarchical fusion module; the multi-modal characteristics are processed through the emotion polarity judging unit, namely the multi-modal characteristics extracted by the multi-modal characteristic level interaction fusion unit pass through the classification layer, the emotion probability distribution result of the target video segment is calculated, and the category corresponding to the maximum probability is the emotion polarity type judged for the video segment;
s3: model training and testing:
training the constructed video segment emotion polarity analysis model by using training data; using verification data to evaluate the training effect of the video segment emotion polarity analysis model in the training process, and continuously adjusting and optimizing to obtain an optimal video segment emotion polarity analysis model; and testing the optimal video segment emotion polarity analysis model by using the test data, and calculating a final emotion polarity type classification effect index.
2. The video emotion polarity analysis method based on multi-modal depth feature level fusion of claim 1, wherein in step S1, when dividing a complete video, the complete video is divided in units of utterances, wherein the utterances are a plurality of speaking segments segmented according to pauses or breaks of speech signal data in the video; facial expression data, voice signal data and text subtitle data respectively correspond to a visual mode, an audio mode and a text mode, and the extraction steps of the original depth features of the three single-mode data are as follows:
s11, aligning data of a visual mode and an audio mode to a text mode, enabling time step lengths of the visual mode, the audio mode and the text mode to be consistent, acquiring a start and stop time stamp of each word in the text mode by using a P2FA tool, and aligning to a visual mode sequence and an audio mode sequence according to the start and stop time stamp, so that subsequences of the visual mode, the audio mode and the text mode are aligned according to word levels, and obtaining three video segments with aligned single-mode data;
s12, performing feature extraction on the video segments aligned with the three single-mode data by using different depth feature extraction methods, namely extracting original depth features of a text mode by using a word embedding technology and a CNN network, extracting original depth features of an audio mode by using a voice analysis framework, and extracting original depth features of a visual mode by using a visual analysis framework.
3. The video emotion polarity analysis method based on multi-modal depth feature level fusion of claim 2, wherein the processing procedure of the bottom layer dual-modal feature interaction module is as follows: capturing internal time sequence dependence of original depth features of each single-mode data, respectively encoding the original depth features, then mapping the original depth features of each single-mode data to a public semantic space by utilizing a Dense layer, finally, sending the original depth features of each single-mode data into a pair of attention mechanism units to learn interaction dependence among double modes, respectively independently training three double-mode combinations of a text mode-audio mode, an audio mode-visual mode and a text mode-visual mode, and finally taking out hidden layer features of the three double-mode combinations output by the pair of attention mechanism units as initial input of a high-level module; the processing procedure of the high-level three-mode feature hierarchy fusion module is as follows: the noise characteristics of the hidden layer characteristics of three bimodal combinations are filtered by utilizing the trimodal self-attention feedforward fusion unit, then the trimodal characteristics after noise is removed are extracted, dependence among the trimodal characteristics is acquired by utilizing the paired bilinear gating fusion unit, then characteristic filtering is carried out, and only the multimode characteristics related to emotion analysis are reserved.
4. The video emotion polarity analysis method based on multi-modal depth feature level fusion of claim 3, wherein when the underlying dual-modal feature interaction module processes, a cross-modal context attention unit is selected as a pair attention mechanism unit, and the specific steps are as follows:
s211: capturing the time sequence dependency relationship inside a single mode by the original depth feature of each single mode data through a bidirectional gating circulating unit, and obtaining fragment feature characterization containing context information;
s212: projecting the segment characteristic representation to a public semantic space with a dimension D through a full-connection layer with nonlinear excitation, and obtaining vector representations of a text mode, an audio mode and a visual mode in the public semantic space;
s213: combining original depth features of each single-mode data in pairs, and fusing the dual-mode information by adopting three pairs of cross-mode context attention units to obtain feature expression vectors fused by the three dual-mode information;
s214: s211 is applied to encode original depth features of each piece of single-mode data, the feature expression vectors fused by three pieces of double-mode information are spliced with the original depth feature pairs of the encoded single-mode data respectively to serve as input of an emotion polarity judging unit, and emotion probability distribution of a target utterance is obtained through a full-connection layer and a Softmax normalization layer;
s215: and (3) respectively performing independent training on three bimodal combinations of a text modality-audio modality, an audio modality-visual modality and a text modality-visual modality by applying step S211-step S214, and finally taking three hidden layer features output by a cross-modality context attention unit as initial input of a high-level module.
5. The video emotion polarity analysis method based on multi-modal depth feature level fusion of claim 4, wherein the three-modal self-attention feedforward fusion unit comprises a self-attention filter layer and a feedforward network shallow fusion layer which are connected, the paired bilinear gating fusion unit comprises a paired bilinear fusion unit and a multi-modal gating output layer which are connected, and the specific steps of the high-level three-modal feature level fusion module when processing are as follows: the self-attention filter layer respectively filters noise characteristics of hidden layer characteristics of three bimodal combinations, namely a text mode-audio mode, an audio mode-visual mode and a text mode-visual mode, so as to obtain a self-attention filtered bimodal characteristic, then a feedforward network shallow fusion layer is utilized to extract the noise-removed bimodal characteristic, the self-attention filtered bimodal characteristic is spliced with the feedforward fused bimodal characteristic, and finally a self-attention feedforward fused trimodal characteristic is obtained; the dependence among the three-mode characteristic characterization is obtained by utilizing a pair of bilinear fusion units, then the specific gravity occupied by different input characteristics is adaptively learned by adopting a multi-mode gating output layer, the characteristics useful for emotion classification are activated, the characteristics are filtered, only the multi-mode characteristics most relevant to emotion analysis are reserved, and the specific strategy that the multi-mode gating layer distributes different weights for different input characteristics is as follows:
wherein the inputs are the different bilinearly fused features of the multi-modal gated output layer, and the weight of each input is learned adaptively by three independent two-layer nonlinear feed-forward networks; after each weight is assigned to its input feature, the result is averaged over the feature dimension, which reduces the dimensionality while retaining the feature information to the greatest extent; finally, the trimodal features after bilinear gated fusion are obtained.
6. The video emotion polarity analysis method based on multi-modal depth feature level fusion of claim 5, wherein when the emotion polarity discrimination unit processes multi-modal features, the classification layers are a full connection layer and a Softmax normalization layer.
7. The video emotion polarity analysis method based on multi-modal depth feature level fusion of claim 6, wherein the model training and testing method comprises:
s41: training the constructed video segment emotion polarity analysis model by using training data, and adjusting model parameters according to a training structure;
s42: the training effect of the video segment emotion polarity analysis model after the parameters are adjusted in the training process is evaluated by using the verification data, the training effect is used for adjusting and optimizing the model, and when the number of times that the model is no longer excellent in performance on the verification data is accumulated to a set threshold value, the training is stopped, and the optimal video segment emotion polarity analysis model is obtained;
s43: and testing the optimal emotion polarity analysis model of the video segment by using the test data to obtain an emotion polarity analysis result of the video segment to be analyzed, and calculating a final classification effect index of the emotion polarity type.
8. The video emotion polarity analysis method based on multi-modal depth feature level fusion of claim 7, wherein dividing a plurality of video segments into training data, verification data and test data based on a random sampling method means dividing the divided video segments into total training data and test data in a ratio of 8:2, and then dividing the total training data into training data and verification data in a ratio of 8:2.
CN202311064915.4A 2023-08-23 2023-08-23 Video emotion polarity analysis method based on multi-mode depth feature level fusion Pending CN116844095A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311064915.4A CN116844095A (en) 2023-08-23 2023-08-23 Video emotion polarity analysis method based on multi-mode depth feature level fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311064915.4A CN116844095A (en) 2023-08-23 2023-08-23 Video emotion polarity analysis method based on multi-mode depth feature level fusion

Publications (1)

Publication Number Publication Date
CN116844095A true CN116844095A (en) 2023-10-03

Family

ID=88174573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311064915.4A Pending CN116844095A (en) 2023-08-23 2023-08-23 Video emotion polarity analysis method based on multi-mode depth feature level fusion

Country Status (1)

Country Link
CN (1) CN116844095A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117493043A (en) * 2023-11-09 2024-02-02 上海交通大学 SMG realization system and method based on original data
CN118193791A (en) * 2024-05-17 2024-06-14 南京信息工程大学 Multi-mode emotion analysis method and system for social network short video


Similar Documents

Publication Publication Date Title
Qureshi et al. Multitask representation learning for multimodal estimation of depression level
CN108899050B (en) Voice signal analysis subsystem based on multi-modal emotion recognition system
CN116844095A (en) Video emotion polarity analysis method based on multi-mode depth feature level fusion
Soleymani et al. Multimodal analysis and estimation of intimate self-disclosure
CN113380271B (en) Emotion recognition method, system, device and medium
Zhang et al. Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and future prospects
Qian et al. Multimodal open-vocabulary video classification via pre-trained vision and language models
Tzirakis et al. Speech emotion recognition using semantic information
Salam et al. Learning personalised models for automatic self-reported personality recognition
Vlachostergiou et al. Investigating context awareness of affective computing systems: a critical approach
Tseng et al. Approaching Human Performance in Behavior Estimation in Couples Therapy Using Deep Sentence Embeddings.
Alhussein et al. Emotional climate recognition in interactive conversational speech using deep learning
Zhao et al. Deep personality trait recognition: a survey
Wang et al. A review of sentiment semantic analysis technology and progress
Anwar et al. Deepsafety: Multi-level audio-text feature extraction and fusion approach for violence detection in conversations
Xue et al. Auffusion: Leveraging the power of diffusion and large language models for text-to-audio generation
CN115223214A (en) Identification method of synthetic mouth-shaped face, model acquisition method, device and equipment
Abu Shaqra et al. A multi-modal deep learning system for Arabic emotion recognition
Gupta et al. REDE-Detecting human emotions using CNN and RASA
Zou et al. Multimodal prompt transformer with hybrid contrastive learning for emotion recognition in conversation
Bai et al. Low-rank multimodal fusion algorithm based on context modeling
Esposito et al. Recent Advances in Nonlinear Speech Processing: Directions and Challenges
Voß et al. Addressing data scarcity in multimodal user state recognition by combining semi-supervised and supervised learning
CN114492579A (en) Emotion recognition method, camera device, emotion recognition device and storage device
Li et al. Bidirectional lstm and attention for depression detection on clinical interview transcripts

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination