CN111680541A - Multi-modal emotion analysis method based on multi-dimensional attention fusion network - Google Patents
- Publication number
- CN111680541A (application CN202010292014.0A)
- Authority
- CN
- China
- Prior art keywords
- fusion
- autocorrelation
- target
- modal
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Abstract
The invention discloses a multi-modal emotion analysis method based on a multi-dimensional attention fusion network, comprising the following steps: for sample data containing multiple modalities (voice, video and text), extract voice, video and text preprocessing features; then construct a multi-dimensional attention fusion network for each modality, use the autocorrelation feature extraction module in the network to extract first-level and second-level autocorrelation features, combine the autocorrelation information of the three modalities, and obtain cross-modal fusion features of the three modalities with the cross-modal fusion module in the network; combine the second-level autocorrelation features with the cross-modal fusion features to obtain modal multi-dimensional features; finally, splice the modal multi-dimensional features, determine emotion scores and perform emotion analysis. The method can effectively perform feature fusion on non-aligned multi-modal data and make full use of multi-modal association information for emotion analysis.
Description
Technical Field
The invention belongs to the field of multi-modal affective computing, and particularly relates to a multi-modal emotion analysis method based on a multi-dimensional attention fusion network.
Background
Emotion analysis has numerous applications in daily life. With the development of big data and multimedia technology, multi-modal emotion analysis examines the voice, video and text modalities of data to better mine the meaning behind them. In a return-visit survey, for example, a user's satisfaction with a service or product can be determined through comprehensive analysis of the user's voice, face and speech content.
At present, the difficulty of multi-modal emotion analysis lies in how to effectively fuse multi-modal information, since voice, video and text features are acquired in completely different ways. When the same content is described, the sequence lengths of the voice and video modalities differ greatly from that of the text in the time dimension, so the features of the three modalities cannot be placed in one-to-one correspondence in time, which greatly complicates inter-modal fusion.
Two methods are currently common. One is based on modal integration: a data layer, feature layer or decision layer is chosen in the overall emotion analysis system at which to splice intermediate results, after which emotion prediction is carried out. This method merely integrates the results of the three modalities and ignores the correlation information among them, and the resulting information redundancy easily causes model overfitting. The other is based on modal annotation alignment: during data annotation the three modalities are forcibly aligned in the time dimension by character or phoneme to guarantee their temporal correspondence, and modal fusion is then performed with recurrent neural networks, convolutional neural networks, attention mechanisms and Seq2Seq frameworks. The annotation cost of this approach is high, which hinders its use in real production and living environments.
Disclosure of Invention
The invention aims to provide a multi-modal emotion analysis method based on a multi-dimensional attention fusion network, which avoids both the overfitting problem of integration-based methods and the excessive labeling cost of alignment-based methods, and obtains more accurate and reliable emotion analysis results by fully exploiting multi-dimensional information within and among the modalities.
The invention solves this technical problem through the following steps:
step one, a multi-modal emotion analysis database is established, the size of the database is N, each sample in the database contains three target modal data of voice, video and text, preprocessing characteristics of the three target modalities are extracted in advance, and emotion labeling is carried out on each sample.
Step two, respective multi-dimensional attention fusion networks are constructed for the three target modalities in step one.
The multi-dimensional attention fusion network of each of the three target modalities comprises an autocorrelation feature extraction module and a cross-modal fusion module, both built from Transformer networks.
Step three, the preprocessed features of step one are processed by the respective autocorrelation feature extraction modules of step two to extract the autocorrelation information of the three modalities: voice autocorrelation information, text autocorrelation information and video autocorrelation information.
The voice autocorrelation feature extraction module, the text autocorrelation feature extraction module and the video autocorrelation feature extraction module respectively comprise a primary autocorrelation feature extractor and a secondary autocorrelation feature extractor.
The voice autocorrelation feature extraction module is configured to extract autocorrelation information of the input voice pre-processing features.
The text autocorrelation feature extraction module is configured to extract autocorrelation information of input text pre-processing features.
The video autocorrelation feature extraction module is configured to extract autocorrelation information of input video pre-processing features.
The autocorrelation information of the three modes comprises a first-level autocorrelation characteristic and a second-level autocorrelation characteristic.
Step four, the first-level autocorrelation features of any target modality from step three are selected as the target features to be fused, and the second-level autocorrelation features of the other two target modalities as auxiliary fusion features; these are sent, grouped as specified below, to the cross-modal fusion module of that target modality, yielding voice-based, text-based and video-based cross-modal fusion features respectively.
The cross-modal fusion module comprises two bimodal fusion devices and a weighted integration network.
Step five, the cross-modal fusion features of each target modality are added to the second-level autocorrelation features of step three to obtain the multi-dimensional fusion features.
Step six, the voice, text and video multi-dimensional fusion features obtained in step five are spliced to obtain the full-scale multi-dimensional features, which are sent to a scoring module to obtain the emotion score.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the preprocessing features of the three target modalities in step one are extracted as follows: the voice preprocessing features are MFCC features extracted from the voice with the Kaldi speech recognition toolkit, the video preprocessing features are facial expression unit features extracted with Facet, and the text preprocessing features are word vector features extracted with word2vec.
The voice, video and text preprocessing features are then each aligned in feature dimension through a linear transformation.
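The dimension alignment step can be sketched as a per-modality linear projection to a shared width. The following Python sketch is illustrative only (the function name is hypothetical and the random matrices stand in for learned weights); the 39/35/300 input widths echo the MFCC, Facet and word2vec feature dimensions used in the embodiment:

```python
import numpy as np

def align_dims(features, d_model, rng):
    """Project per-modality features of differing widths to a shared
    dimension d_model via (illustrative, randomly initialised) linear maps."""
    aligned = {}
    for name, X in features.items():
        # Stand-in for a learned projection weight of shape (d_in, d_model)
        W = rng.standard_normal((X.shape[-1], d_model)) * 0.02
        aligned[name] = X @ W
    return aligned

rng = np.random.default_rng(0)
# Toy sequence lengths; widths follow the embodiment's preprocessing features
feats = {
    "speech": rng.standard_normal((50, 39)),    # 39-d MFCC frames
    "video":  rng.standard_normal((40, 35)),    # 35-d Facet expression units
    "text":   rng.standard_normal((12, 300)),   # 300-d word2vec vectors
}
aligned = align_dims(feats, d_model=64, rng=rng)
```

After alignment every modality has width 64 while keeping its own (unaligned) sequence length, which is what the later cross-modal attention relies on.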
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the emotion label in step one is a bounded continuous interval, which can be divided continuously and equidistantly into M sub-intervals, each representing a range of emotion intensity.
The interval can be taken as the integer interval [-K, K], where a score greater than 0 is judged positive, equal to 0 neutral, and less than 0 negative; finer sub-intervals can be divided according to the required emotion granularity.
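A minimal, hypothetical Python helper illustrating this interval scheme (the function names and the rounding rule for sub-interval assignment are illustrative assumptions, not the patent's specification):

```python
def clamp(score, K=3):
    """Clip a raw emotion score into the labeled interval [-K, K]."""
    return max(-K, min(K, score))

def polarity(score):
    """> 0 positive, == 0 neutral, < 0 negative, as in the description."""
    if score > 0:
        return "positive"
    if score == 0:
        return "neutral"
    return "negative"

def emotion_class(score, K=3):
    """Map a score to the nearest integer sub-interval of [-K, K];
    interval size 1 gives a (2K+1)-class split, e.g. 7 classes for K=3."""
    return int(round(clamp(score, K)))
```

With K=3 this reproduces the 7-class division of [-3, 3] used in the MOSEI embodiment below.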
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the multi-dimensional attention fusion networks in step two have the same structure for the three target modalities of voice, video and text.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the autocorrelation feature extraction module in step two is a Transformer network.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the first-level autocorrelation feature extractor and the second-level autocorrelation feature extractor in step three are cascaded.
The first-level autocorrelation feature extractor is configured to extract first-level autocorrelation features from the input target-modality preprocessing features based on the Transformer multi-head self-attention mechanism.
The calculation formula of the multi-head self-attention mechanism is as follows:
Q_i = X·W_i^Q;  K_i = X·W_i^K;  V_i = X·W_i^V
head_i = softmax(Q_i·K_i^T / √d_k)·V_i
MultiHead_X_1 = Concat(head_1, …, head_n)
wherein X is the target-modality preprocessing feature of step two, W_i^Q, W_i^K and W_i^V are the query, key and value mapping weights of the i-th head, softmax is the weight normalization function, K_i^T is the transposed matrix of K_i, d_k is the scaling factor, n is the number of heads, and MultiHead_X_1 is the obtained first-level autocorrelation feature.
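The multi-head self-attention computation above can be sketched in NumPy as follows; this is an illustrative re-implementation with random stand-in weights, not the patent's code:

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, n_heads):
    """head_i = softmax(Q_i K_i^T / sqrt(d_k)) V_i; heads are concatenated."""
    heads = []
    for i in range(n_heads):
        Q, K, V = X @ Wq[i], X @ Wk[i], X @ Wv[i]
        d_k = Q.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(d_k))   # (T, T) attention weights
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1)     # MultiHead_X_1

rng = np.random.default_rng(1)
T, d_model, n_heads = 10, 32, 4
d_k = d_model // n_heads
X = rng.standard_normal((T, d_model))
Wq = rng.standard_normal((n_heads, d_model, d_k))
Wk = rng.standard_normal((n_heads, d_model, d_k))
Wv = rng.standard_normal((n_heads, d_model, d_k))
out = multi_head_self_attention(X, Wq, Wk, Wv, n_heads)
```

Because queries, keys and values all come from the same sequence X, the output keeps X's length: here `out` has shape (10, 32).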
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the second-level autocorrelation feature extractor in step three is configured to extract second-level autocorrelation features from the input target modality's first-level autocorrelation features based on a feedforward neural network.
Wherein the feed-forward neural network is configured to:
MultiHead_X_2 = max(0, MultiHead_X_1·W_1 + b_1)·W_2 + b_2
wherein MultiHead_X_2 is the second-level autocorrelation feature of the target modality, W_1 is the weight applied to the first-level autocorrelation feature, W_2 is the network hidden-layer weight, b_1 is the bias on the first-level autocorrelation feature, and b_2 is the network hidden-layer bias.
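A tiny NumPy sketch of this feed-forward step (illustrative only; the weights are toy values chosen so the output can be checked by hand):

```python
import numpy as np

def feed_forward(H, W1, b1, W2, b2):
    """MultiHead_X_2 = max(0, MultiHead_X_1·W_1 + b_1)·W_2 + b_2:
    a ReLU feed-forward layer applied to the first-level features."""
    return np.maximum(0.0, H @ W1 + b1) @ W2 + b2

# Deterministic check: ReLU zeroes the negative component.
H = np.array([[1.0, -2.0]])
W1, b1 = np.eye(2), np.zeros(2)
W2, b2 = np.ones((2, 1)), np.zeros(1)
out = feed_forward(H, W1, b1, W2, b2)   # relu([1, -2]) = [1, 0] -> [[1.0]]
```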
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the grouping in step four is (X_{0,main}, X_{1,aide}) and (X_{0,main}, X_{2,aide}) for subsequent grouped fusion, wherein X_{0,main} is the target feature to be fused in step four, and X_{1,aide}, X_{2,aide} are the auxiliary fusion features of the other two modalities.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the bimodal fuser in step four performs the grouped fusion and is configured to take the groups as input and produce two sets of cross-modal fusion features.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the bimodal fuser in step four performs cross-modal fusion based on the Transformer multi-head attention mechanism, computed as follows:
Q_{m,main} = X_{m,main}·W^Q′;  K_j = X_{j,aide}·W_j^K′;  V_j = X_{j,aide}·W_j^V′
head_i = softmax(Q_{m,main}·K_j^T / √d_k′)·V_j
CrossFusion_X_{aide→main} = Concat(head_1, …, head_n)
wherein X_{m,main} is the self-attention feature of the current target modality m, X_{j,aide} is the auxiliary fusion feature of auxiliary modality j, CrossFusion_X_{aide→main} is the fusion result, W^Q′ is the query mapping weight of target modality m, W_j^K′ and W_j^V′ are the key and value mapping weights of auxiliary modality j, and d_k′ is the scaling factor.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the bimodal fuser performs cross-modal fusion based on the multi-head attention mechanism; the specific fusion process is as follows:
(1) Perform query mapping on the target feature X_{0,main} to be fused in step four:
Q_{0,main} = X_{0,main}·W^Q′
wherein W^Q′ is the learned query mapping weight.
(2) Perform key and value mapping on the auxiliary fusion features X_{1,aide} and X_{2,aide} of step four:
K_1 = X_{1,aide}·W_1^K′;  V_1 = X_{1,aide}·W_1^V′
K_2 = X_{2,aide}·W_2^K′;  V_2 = X_{2,aide}·W_2^V′
wherein W_1^K′ and W_1^V′ are the key and value mapping weights of auxiliary modality 1, W_2^K′ and W_2^V′ are the key and value mapping weights of auxiliary modality 2, K_1 and V_1 are the key and value features mapped from X_{1,aide}, and K_2 and V_2 are the key and value features mapped from X_{2,aide}.
(3) Perform cross-modal fusion based on the multi-head attention mechanism using the mapping results.
For the group (X_{0,main}, X_{1,aide}), the fusion is:
head_i = softmax(Q_{0,main}·K_1^T / √d_k′)·V_1;  CrossFusion_X_{1→0} = Concat(head_1, …, head_n)
For the group (X_{0,main}, X_{2,aide}), the fusion is:
head_i = softmax(Q_{0,main}·K_2^T / √d_k′)·V_2;  CrossFusion_X_{2→0} = Concat(head_1, …, head_n)
wherein X_{0,main} is the self-attention feature of the current target modality, X_{1,aide} and X_{2,aide} are the auxiliary fusion features of the remaining target modalities, and CrossFusion_X_{aide→main} denotes the result of fusing the target modality main with the auxiliary modality aide.
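One head of this cross-modal fusion can be sketched in NumPy; this is an assumption-laden illustration (single head, random stand-in weights), whose point is that the query sequence and the key/value sequence may have different lengths, so no time alignment is required:

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_head(X_main, X_aide, Wq, Wk, Wv):
    """One attention head of the bimodal fuser: the query comes from the
    target modality, keys and values from the auxiliary modality."""
    Q, K, V = X_main @ Wq, X_aide @ Wk, X_aide @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (T_main, T_aide) weights
    return A @ V                                  # output keeps the target's length

rng = np.random.default_rng(2)
d, d_k = 16, 8
X_main = rng.standard_normal((6, d))    # e.g. text: 6 tokens
X_aide = rng.standard_normal((40, d))   # e.g. speech: 40 frames
Wq, Wk, Wv = (rng.standard_normal((d, d_k)) for _ in range(3))
fused = cross_modal_head(X_main, X_aide, Wq, Wk, Wv)
```

Here a 6-token text sequence attends over a 40-frame speech sequence; `fused` keeps the text's length (6, 8), which is why the method tolerates non-aligned modalities.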
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the weighted integration network in step four is based on an adaptive weighted fusion algorithm and is configured to take the two sets of cross-modal fusion features as input and extract the target's cross-modal fusion feature.
The adaptive weighted fusion algorithm is:
CrossFusion_X_m = Σ_n λ_n·CrossFusion_X_{n→m},  with Σ_n λ_n = 1
wherein W_j and b_j are the hidden-layer network weight and bias of the j-th fusion submodule, which produce the integration weights λ_n of the fusion submodules, and CrossFusion_X_m is the cross-modal fusion feature obtained in step four.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the scoring module in step six is based on a regression network and is configured to take the full-scale multi-dimensional features as input and produce the final emotion score, computed as:
Score = W_out·Relu(W_s·Concat([CrossFusion_X_1 … CrossFusion_X_m]))
wherein Relu is the activation function, W_s is the weight applied to the full-scale multi-dimensional features, W_out is the hidden-layer parameter of the final fully connected layer, and Concat is the matrix splicing operation.
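A hand-checkable NumPy sketch of this scoring step (illustrative toy weights, hypothetical function name):

```python
import numpy as np

def emotion_score(multi_dim_feats, Ws, Wout):
    """Score = W_out · Relu(W_s · Concat([...])): splice the per-modality
    multi-dimensional features and regress them to a single scalar."""
    z = np.concatenate(multi_dim_feats)   # Concat: splicing operation
    h = np.maximum(0.0, Ws @ z)           # Relu activation
    return float(Wout @ h)

# Deterministic toy check with identity/ones weights.
feats = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
score = emotion_score(feats, np.eye(6), np.ones(6))  # relu keeps [1,0,0,1,0,0]
```

With these toy weights the ReLU zeroes the single negative entry, leaving a score of 2.0.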
Generally, compared with the prior art, the technical scheme of the invention has the following beneficial effects:
(1) By processing the preprocessing features of the three modalities with the Transformer-based multi-head attention mechanism, the multi-modal data need not be aligned in the time dimension in advance, unlike traditional recurrent and convolutional neural networks. This helps reduce data labeling cost and suits real production environments;
(2) By using the Transformer-based autocorrelation feature extraction module and cross-modal fusion module, the method considers both the intra-modal information that helps depict emotion and the inter-modal fusion information, avoiding the model overfitting caused by directly concatenating modal features, unlike traditional methods;
(3) When performing modal fusion with the adaptive weighted fusion algorithm, different adaptive weights are assigned by learning the dependency relationships between modalities, which accounts for the inherent differences between modalities better than traditional methods.
Drawings
FIG. 1 is a flow chart of a multi-modal emotion analysis method based on a multi-dimensional attention fusion network;
FIG. 2 is a schematic structural diagram of a multi-dimensional attention fusion network according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a cross-modal fusion module according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a multi-modal emotion analysis method based on a multi-dimensional attention fusion network; the overall flow is shown in FIG. 1, while FIG. 2 shows the structure of the multi-dimensional attention fusion network and FIG. 3 the structure of the cross-modal fusion module in an embodiment of the invention. The implementation steps are as follows:
1. Process the multi-modal emotion database and align the feature dimensions.
The experiments of the invention are based on the MOSEI multi-modal emotion database, which contains 23,454 data samples, each comprising preprocessing features of the three modalities voice, video and text: the video preprocessing features are 35-dimensional facial expression unit features extracted with Facet, the voice preprocessing features are 39-dimensional MFCC features extracted with the Kaldi speech recognition toolkit, and the text preprocessing features are 300-dimensional word vector features extracted with word2vec. Each data sample carries an emotion label score in the range [-3, 3], where (0, 3] is positive emotion and [-3, 0) is negative emotion. The emotion classes are determined by the interval size; for example, with interval 1 the range is divided into [-3, -2, -1, 0, 1, 2, 3], i.e. 7 emotion classes.
Because the three kinds of features have different distributions and different dimensions, they are mapped to the same dimension through a linear transformation to facilitate subsequent cross-modal fusion.
2. Extract the autocorrelation information of the three modalities.
The invention extracts the autocorrelation information of the three modalities with Transformer feature extractors; modal autocorrelation information is the important intra-modal information, helpful for emotion recognition, that the Transformer network extracts. A Transformer contains two key parts: a multi-head self-attention mechanism and a feedforward network. As shown in FIG. 2, the invention uses a first-level autocorrelation feature extractor based on the multi-head self-attention mechanism to extract important intra-modal information from the preprocessed features, taking the result as the first-level autocorrelation features, and uses the feedforward network as the second-level autocorrelation feature extractor to nonlinearly fit the first-level autocorrelation features, taking the result as the second-level autocorrelation features. Three sets of autocorrelation information, for video, voice and text, are thus obtained.
3. Extract the multi-dimensional fusion features.
(3-1) Group the autocorrelation features.
Based on the autocorrelation information of the three modalities extracted in step 2 and the cross-modal fusion module shown in FIG. 3, the first-level autocorrelation features of one target modality serve as the target features to be fused, and the second-level autocorrelation features of the other two modalities serve as auxiliary fusion features. For example: select the first-level autocorrelation features of voice together with the second-level autocorrelation features of video and text, and send them to the cross-modal fusion module; select the first-level autocorrelation features of video together with the second-level autocorrelation features of voice and text, and send them to the cross-modal fusion module; select the first-level autocorrelation features of text together with the second-level autocorrelation features of voice and video, and send them to the cross-modal fusion module. The cross-modal fusion module comprises two bimodal fusers and a weighted integration network.
(3-2) Extract the cross-modal fusion features.
The feature groups of (3-1) are sent to the cross-modal fusion module shown in FIG. 3, and the cross-modal fusion features are computed based on the query-key-value concept of the Transformer attention mechanism. For example, the first-level autocorrelation feature X_{0,main} of voice is linearly mapped to obtain the query vector Q_{0,main}, the second-level autocorrelation features of video (X_{1,aide}) and text (X_{2,aide}) are linearly mapped to obtain their respective key and value vectors, and multi-modal fusion is then carried out as in FIG. 3 to obtain the video->voice and text->voice cross-modal fusion features respectively. The specific calculation process is as follows:
For X_{0,main}: Q_{0,main} = X_{0,main}·W^Q′
For the group (X_{0,main}, X_{1,aide}), the fusion is:
head_i = softmax(Q_{0,main}·K_1^T / √d_k′)·V_1;  CrossFusion_X_{1→0} = Concat(head_1, …, head_n)
For the group (X_{0,main}, X_{2,aide}), the fusion is:
head_i = softmax(Q_{0,main}·K_2^T / √d_k′)·V_2;  CrossFusion_X_{2→0} = Concat(head_1, …, head_n)
wherein X_{0,main} is the self-attention feature of the current target modality, X_{1,aide} and X_{2,aide} are the auxiliary fusion features of the remaining target modalities, and CrossFusion_X_{aide→main} denotes the result of fusing the target modality main with the auxiliary modality aide.
The two groups of features are sent to the weighted integration network shown in FIG. 2 to obtain the (video, text)->voice cross-modal fusion feature. The specific calculation is:
CrossFusion_X_m = λ·CrossFusion_X_{1→0} + (1−λ)·CrossFusion_X_{2→0}
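This weighted integration is a convex combination of the two group-fusion results; a minimal NumPy sketch (λ passed in as a scalar for illustration, whereas the network learns it adaptively):

```python
import numpy as np

def weighted_integration(f_1to0, f_2to0, lam):
    """CrossFusion_X_m = λ·CrossFusion_X_{1→0} + (1−λ)·CrossFusion_X_{2→0}.
    Here λ is a given scalar; in the network it is produced adaptively
    by the weighted integration network's hidden layers."""
    return lam * f_1to0 + (1.0 - lam) * f_2to0

f1 = np.array([4.0, 0.0])   # stand-in for CrossFusion_X_{1→0}
f2 = np.array([0.0, 4.0])   # stand-in for CrossFusion_X_{2→0}
mixed = weighted_integration(f1, f2, lam=0.25)
```

With λ=0.25 the result is weighted toward the second group's features, illustrating how the learned λ can favour the more informative auxiliary modality.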
The above processes run simultaneously in the three multi-dimensional attention fusion networks shown in FIG. 1, finally yielding the (video, text)->voice, (video, voice)->text and (voice, text)->video cross-modal fusion features.
(3-3) Extract the video, voice and text multi-dimensional features.
To take account of both types of features, this embodiment obtains the multi-dimensional fusion features by fusing the autocorrelation features with the cross-modal features, as follows:
Add the second-level autocorrelation features of video to the (voice, text)->video cross-modal fusion features to obtain the video multi-dimensional fusion features.
Add the second-level autocorrelation features of voice to the (video, text)->voice cross-modal fusion features to obtain the voice multi-dimensional fusion features.
Add the second-level autocorrelation features of text to the (video, voice)->text cross-modal fusion features to obtain the text multi-dimensional fusion features.
4. Calculate the emotion score.
As shown in FIG. 1, the obtained voice, video and text multi-dimensional fusion features are spliced, and a regression is then computed to obtain the specific emotion score:
Score = W_out·Relu(W_s·Concat([CrossFusion_X_1, CrossFusion_X_2, CrossFusion_X_3]))
wherein CrossFusion_X_1, CrossFusion_X_2 and CrossFusion_X_3 denote the video, voice and text multi-dimensional fusion features respectively, W_s is the hidden-layer weight, and W_out is the regression-network hidden-layer parameter.
The emotion interval into which the sample's computed emotion score falls is then determined with reference to the database sample's emotion label, giving the final emotion class.
The effectiveness of the invention is demonstrated by the following experiments, whose results show that the invention improves the recognition accuracy of emotion analysis.
The method is compared with 4 existing representative emotion analysis methods on the MOSEI data set. Table 1 shows the accuracy (ACC) under the 2-class and 7-class settings and the F1 score of the method and the 4 comparison methods on this data set; larger values indicate higher emotion analysis quality, and the improvement of the method (denoted Our Method in Table 1) is substantial.
Table 1: Performance of the ACC and F1 indices of different methods on the MOSEI data set
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (10)
1. A multi-modal emotion analysis method based on a multi-dimensional attention fusion network is characterized by comprising the following steps:
step one, establishing a multi-modal emotion analysis database, wherein each sample in the database contains data of three target modalities, namely voice, video and text, preprocessing features of the three target modalities are extracted in advance, and each sample is labeled with an emotion;
step two, constructing respective multidimensional attention fusion networks for the three target modalities in the step one, wherein the respective multidimensional attention fusion networks of the three target modalities all comprise an autocorrelation feature extraction module and a cross-modality fusion module, the multidimensional attention fusion network of the voice target modality comprises a voice autocorrelation feature extraction module and a voice cross-modality fusion module, the multidimensional attention fusion network of the video target modality comprises a video autocorrelation feature extraction module and a video cross-modality fusion module, and the multidimensional attention fusion network of the text target modality comprises a text autocorrelation feature extraction module and a text cross-modality fusion module;
step three, passing the preprocessing features of the three target modalities in step one through the autocorrelation feature extraction module of the corresponding target modality in step two, and extracting the autocorrelation information of the three modalities, namely voice autocorrelation information, text autocorrelation information and video autocorrelation information; the voice, text and video autocorrelation feature extraction modules each comprise a first-level autocorrelation feature extractor and a second-level autocorrelation feature extractor, and the autocorrelation information of the three target modalities comprises first-level autocorrelation features and second-level autocorrelation features;
step four, selecting the first-level autocorrelation feature of any one target modality in step three as the target feature to be fused and the second-level autocorrelation features of the other two target modalities as auxiliary fusion features, and sending them, according to a preset grouping manner, into the cross-modal fusion module of that target modality to obtain the voice, text and video cross-modal fusion features respectively, wherein the cross-modal fusion module comprises two bimodal fusers and a weighted integration network;
step five, adding the cross-modal fusion feature of each target modality in step four to its second-level autocorrelation feature in step three to obtain the multi-dimensional fusion features;
step six, concatenating the voice, text and video multi-dimensional fusion features obtained in step five to obtain a full-scale multi-dimensional feature, and sending it to a scoring module to obtain the emotion score.
2. The multi-modal emotion analysis method based on a multi-dimensional attention fusion network as claimed in claim 1, wherein the emotion label in step one is a finite continuous interval, the interval being continuously and equidistantly divided into M sub-intervals, each sub-interval representing a degree range of the emotion.
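A hedged sketch of this labeling scheme: a continuous emotion score is mapped to one of M equidistant sub-intervals of a finite interval. The bounds [lo, hi] and M = 7 are assumed values for illustration, not taken from the patent:

```python
# Map a continuous emotion score to an equidistant sub-interval index
# (emotion grade). lo, hi and M are illustrative assumptions.
def emotion_grade(score, lo=-3.0, hi=3.0, M=7):
    score = min(max(score, lo), hi)          # clamp into the labeling interval
    idx = int((score - lo) / (hi - lo) * M)  # equidistant sub-interval index
    return min(idx, M - 1)                   # right endpoint falls in the last bin

print(emotion_grade(-3.0))  # -> 0
print(emotion_grade(2.9))   # -> 6
```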
3. The method for multi-modal emotion analysis based on multi-dimensional attention fusion network as claimed in claim 1 or 2, wherein in step two, the autocorrelation feature extraction module is a Transformer network.
4. The multi-modal emotion analysis method based on a multi-dimensional attention fusion network as claimed in claim 1 or 2, wherein in step three the first-level autocorrelation feature extractor and the second-level autocorrelation feature extractor are cascaded; the first-level autocorrelation feature extractor takes the preprocessing features of the target modality as input and extracts first-level autocorrelation features with the multi-head self-attention mechanism of the Transformer, and the second-level autocorrelation feature extractor takes the first-level autocorrelation features of the target modality as input and extracts second-level autocorrelation features with the feedforward neural network of the Transformer.
5. The multi-modal emotion analysis method based on a multi-dimensional attention fusion network as claimed in claim 4, wherein the multi-head self-attention mechanism of the Transformer is calculated by the following formulas:
Q_i = X·W_i^Q;  K_i = X·W_i^K;  V_i = X·W_i^V
head_i = softmax(Q_i·K_i^T / √d_k)·V_i
MultiHead_X_1 = Concat(head_1, …, head_n)
where X is the preprocessing feature of the target modality, W_i^Q is the query mapping weight of the ith head of the target-modality preprocessing feature, W_i^K is the key mapping weight of the ith head, W_i^V is the value mapping weight of the ith head, softmax is the weight normalization function, K_i^T is the transposed matrix of K_i, d_k is the scaling factor, n is the number of heads, and MultiHead_X_1 is the resulting first-level autocorrelation feature.
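The formulas of claim 5 can be sketched as follows; the head count, sequence length and feature dimensions are illustrative assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, WQ, WK, WV):
    """Transformer-style multi-head self-attention over one modality's
    preprocessed features X of shape (T, d). WQ/WK/WV are per-head weight
    lists of shape (d, dk). Shapes are illustrative, not from the patent."""
    heads = []
    for Wq, Wk, Wv in zip(WQ, WK, WV):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        dk = Q.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(dk))   # (T, T) attention weights
        heads.append(A @ V)                  # (T, dk) per-head output
    return np.concatenate(heads, axis=-1)    # MultiHead_X_1: (T, n*dk)

rng = np.random.default_rng(1)
T, d, n, dk = 5, 8, 2, 4
WQ = [rng.standard_normal((d, dk)) for _ in range(n)]
WK = [rng.standard_normal((d, dk)) for _ in range(n)]
WV = [rng.standard_normal((d, dk)) for _ in range(n)]
X = rng.standard_normal((T, d))
out = multi_head_self_attention(X, WQ, WK, WV)
assert out.shape == (T, n * dk)
```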
6. The multi-modal emotion analysis method based on multi-dimensional attention fusion network as claimed in claim 4, wherein the feedforward neural network formula is:
MultiHead_X_2 = max(0, MultiHead_X_1·W_1 + b_1)·W_2 + b_2
where MultiHead_X_2 is the second-level autocorrelation feature of the target modality, W_1 is the weight applied to the first-level autocorrelation feature, W_2 is the hidden-layer weight of the network, b_1 is the bias of the first-level autocorrelation term, and b_2 is the bias of the network hidden layer.
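A minimal sketch of the claim 6 feedforward network; the sequence length and layer sizes are illustrative assumptions:

```python
import numpy as np

def feed_forward(h1, W1, b1, W2, b2):
    """Position-wise feedforward network of claim 6:
    MultiHead_X_2 = max(0, MultiHead_X_1·W1 + b1)·W2 + b2."""
    return np.maximum(0.0, h1 @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(2)
T, d, d_ff = 5, 8, 16                       # illustrative sizes
h1 = rng.standard_normal((T, d))            # first-level autocorrelation feature
W1, b1 = rng.standard_normal((d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d)), np.zeros(d)
h2 = feed_forward(h1, W1, b1, W2, b2)       # second-level autocorrelation feature
assert h2.shape == (T, d)
```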
7. The multi-modal emotion analysis method based on a multi-dimensional attention fusion network as claimed in claim 1 or 2, wherein the grouping manner in step four is (X_0,main, X_1,aide), (X_0,main, X_2,aide) for subsequent grouped fusion, where X_0,main is the target feature to be fused in step four and X_1,aide, X_2,aide are the auxiliary fusion features of the other two modalities; (X_0,main, X_1,aide) is input into one bimodal fuser and (X_0,main, X_2,aide) into the other.
8. The multi-modal emotion analysis method based on a multi-dimensional attention fusion network as claimed in claim 1 or 2, wherein the bimodal fuser in step four performs cross-modal fusion based on the multi-head attention mechanism of the Transformer, calculated as follows:
Q_m,main = X_m,main·W^Q′;  K_j = X_j,aide·W^K′;  V_j = X_j,aide·W^V′
head_i = softmax(Q_m,main·K_j^T / √d_k′)·V_j
CrossFusion_X_aide→main = Concat(head_1, …, head_n)
where X_m,main denotes the self-attention feature of the current target modality m, X_j,aide denotes an auxiliary fusion feature, CrossFusion_X_aide→main denotes the fusion result, W^Q′ is the query mapping weight of target modality m, W^K′ is the key mapping weight of target modality m, W^V′ is the value mapping weight of target modality m, and d_k′ is the scaling factor.
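A single-head sketch of the claim 8 bimodal fuser, in which queries come from the target modality and keys/values from the auxiliary modality; shapes and weights are illustrative assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(X_main, X_aide, WQ, WK, WV):
    """Single-head cross-modal attention: the target modality queries the
    auxiliary modality's keys/values. Shapes illustrative, not from the patent."""
    Q = X_main @ WQ          # (Tm, dk) queries from the target modality
    K = X_aide @ WK          # (Ta, dk) keys from the auxiliary modality
    V = X_aide @ WV          # (Ta, dk) values from the auxiliary modality
    dk = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(dk)) @ V   # (Tm, dk) fused feature

rng = np.random.default_rng(3)
Tm, Ta, d, dk = 4, 6, 8, 8
X_main = rng.standard_normal((Tm, d))
X_aide = rng.standard_normal((Ta, d))
WQ, WK, WV = (rng.standard_normal((d, dk)) for _ in range(3))
fused = cross_modal_attention(X_main, X_aide, WQ, WK, WV)
assert fused.shape == (Tm, dk)
```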
9. The multi-modal emotion analysis method based on multi-dimensional attention fusion network as claimed in claim 1 or 2, wherein the weighted integration network is based on an adaptive weighted fusion algorithm, and the formula is as follows:
where W_j, b_j are the hidden-layer weight and bias of the jth fusion submodule, λ_n is the integration weight over the fusion submodules, and CrossFusion_X_m is the cross-modal fusion feature obtained in step four.
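The adaptive weighted fusion formula itself is not reproduced in the text, so the following is only one plausible reading consistent with the symbols described (a per-submodule affine map W_j, b_j and normalized integration weights); it should be treated as an assumption, not the patent's actual formula:

```python
import numpy as np

# ASSUMPTION: the weighted integration network is taken to normalize the
# integration weights and sum the affine-mapped bimodal fuser outputs.
def weighted_integration(fusions, Ws, bs, lambdas):
    lambdas = np.asarray(lambdas) / np.sum(lambdas)   # normalize weights to 1
    return sum(l * (f @ W + b) for l, f, W, b in zip(lambdas, fusions, Ws, bs))

rng = np.random.default_rng(4)
T, d = 4, 8
fusions = [rng.standard_normal((T, d)) for _ in range(2)]  # two fuser outputs
Ws = [rng.standard_normal((d, d)) for _ in range(2)]
bs = [np.zeros(d) for _ in range(2)]
out = weighted_integration(fusions, Ws, bs, [0.6, 0.4])
assert out.shape == (T, d)
```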
10. The multi-modal emotion analysis method based on multi-dimensional attention fusion network as claimed in claim 1 or 2, wherein the scoring module in the sixth step is based on a regression network and is configured to input the full-scale multi-dimensional features to obtain a final emotion score, and the calculation process is as follows:
Score = W_out·Relu(W_s·Concat([CrossFusion_X_1, …, CrossFusion_X_m]))
where Relu is the activation function, W_s is the weight of the full-scale multi-dimensional feature, W_out is the hidden-layer parameter of the last fully connected layer, and Concat is the matrix concatenation operation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010292014.0A CN111680541B (en) | 2020-04-14 | 2020-04-14 | Multi-modal emotion analysis method based on multi-dimensional attention fusion network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111680541A true CN111680541A (en) | 2020-09-18 |
CN111680541B CN111680541B (en) | 2022-06-21 |
Family
ID=72433356
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010292014.0A Active CN111680541B (en) | 2020-04-14 | 2020-04-14 | Multi-modal emotion analysis method based on multi-dimensional attention fusion network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111680541B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160378965A1 (en) * | 2015-06-26 | 2016-12-29 | Samsung Electronics Co., Ltd. | Electronic apparatus and method for controlling functions in the electronic apparatus using a bio-metric sensor |
CN109614895A (en) * | 2018-10-29 | 2019-04-12 | 山东大学 | A method of the multi-modal emotion recognition based on attention Fusion Features |
US20190163965A1 (en) * | 2017-11-24 | 2019-05-30 | Genesis Lab, Inc. | Multi-modal emotion recognition device, method, and storage medium using artificial intelligence |
CN110033029A (en) * | 2019-03-22 | 2019-07-19 | 五邑大学 | A kind of emotion identification method and device based on multi-modal emotion model |
CN110188343A (en) * | 2019-04-22 | 2019-08-30 | 浙江工业大学 | Multi-modal emotion identification method based on fusion attention network |
CN110287389A (en) * | 2019-05-31 | 2019-09-27 | 南京理工大学 | The multi-modal sensibility classification method merged based on text, voice and video |
CN110399841A (en) * | 2019-07-26 | 2019-11-01 | 北京达佳互联信息技术有限公司 | A kind of video classification methods, device and electronic equipment |
Non-Patent Citations (1)
Title |
---|
JIAN HUANG等: "Multimodal continuous emotion recognition with data augmentation using recurrent neural networks", 《ACM》 * |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112053690A (en) * | 2020-09-22 | 2020-12-08 | 湖南大学 | Cross-modal multi-feature fusion audio and video voice recognition method and system |
CN112053690B (en) * | 2020-09-22 | 2023-12-29 | 湖南大学 | Cross-mode multi-feature fusion audio/video voice recognition method and system |
CN112233698A (en) * | 2020-10-09 | 2021-01-15 | 中国平安人寿保险股份有限公司 | Character emotion recognition method and device, terminal device and storage medium |
CN112233698B (en) * | 2020-10-09 | 2023-07-25 | 中国平安人寿保险股份有限公司 | Character emotion recognition method, device, terminal equipment and storage medium |
CN112489635A (en) * | 2020-12-03 | 2021-03-12 | 杭州电子科技大学 | Multi-mode emotion recognition method based on attention enhancement mechanism |
CN112489635B (en) * | 2020-12-03 | 2022-11-11 | 杭州电子科技大学 | Multi-mode emotion recognition method based on attention enhancement mechanism |
CN112765323A (en) * | 2021-01-24 | 2021-05-07 | 中国电子科技集团公司第十五研究所 | Voice emotion recognition method based on multi-mode feature extraction and fusion |
CN112765323B (en) * | 2021-01-24 | 2021-08-17 | 中国电子科技集团公司第十五研究所 | Voice emotion recognition method based on multi-mode feature extraction and fusion |
US20220237420A1 (en) * | 2021-01-25 | 2022-07-28 | Harbin Inst. of Tech (Shenzhen) (Shenzhen Inst. of Science and Tech Innovation, Harbin Inst. of Tech | Multimodal fine-grained mixing method and system, device, and storage medium |
CN112819052A (en) * | 2021-01-25 | 2021-05-18 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Multi-modal fine-grained mixing method, system, device and storage medium |
US11436451B2 (en) * | 2021-01-25 | 2022-09-06 | Harbin Institute Of Technology (Shenzhen) (Shenzhen Institute Of Science And Technology Innovation, Harbin Institute Of Technology) | Multimodal fine-grained mixing method and system, device, and storage medium |
US11963771B2 (en) * | 2021-02-19 | 2024-04-23 | Institute Of Automation, Chinese Academy Of Sciences | Automatic depression detection method based on audio-video |
CN112560811A (en) * | 2021-02-19 | 2021-03-26 | 中国科学院自动化研究所 | End-to-end automatic detection research method for audio-video depression |
US20220265184A1 (en) * | 2021-02-19 | 2022-08-25 | Institute Of Automation, Chinese Academy Of Sciences | Automatic depression detection method based on audio-video |
CN112989977A (en) * | 2021-03-03 | 2021-06-18 | 复旦大学 | Audio-visual event positioning method and device based on cross-modal attention mechanism |
WO2022199504A1 (en) * | 2021-03-26 | 2022-09-29 | 腾讯科技(深圳)有限公司 | Content identification method and apparatus, computer device and storage medium |
CN113807440A (en) * | 2021-09-17 | 2021-12-17 | 北京百度网讯科技有限公司 | Method, apparatus, and medium for processing multimodal data using neural networks |
CN113806609B (en) * | 2021-09-26 | 2022-07-12 | 郑州轻工业大学 | Multi-modal emotion analysis method based on MIT and FSM |
CN113806609A (en) * | 2021-09-26 | 2021-12-17 | 郑州轻工业大学 | Multi-modal emotion analysis method based on MIT and FSM |
CN113723112A (en) * | 2021-11-02 | 2021-11-30 | 天津海翼科技有限公司 | Multi-modal emotion analysis prediction method, device, equipment and storage medium |
CN113723112B (en) * | 2021-11-02 | 2022-02-22 | 天津海翼科技有限公司 | Multi-modal emotion analysis prediction method, device, equipment and storage medium |
CN114387997A (en) * | 2022-01-21 | 2022-04-22 | 合肥工业大学 | Speech emotion recognition method based on deep learning |
CN114387997B (en) * | 2022-01-21 | 2024-03-29 | 合肥工业大学 | Voice emotion recognition method based on deep learning |
WO2023138188A1 (en) * | 2022-01-24 | 2023-07-27 | 腾讯科技(深圳)有限公司 | Feature fusion model training method and apparatus, sample retrieval method and apparatus, and computer device |
CN114821385A (en) * | 2022-03-08 | 2022-07-29 | 阿里巴巴(中国)有限公司 | Multimedia information processing method, device, equipment and storage medium |
CN115205179A (en) * | 2022-07-15 | 2022-10-18 | 小米汽车科技有限公司 | Image fusion method and device, vehicle and storage medium |
CN116070169A (en) * | 2023-01-28 | 2023-05-05 | 天翼云科技有限公司 | Model training method and device, electronic equipment and storage medium |
CN116189272A (en) * | 2023-05-05 | 2023-05-30 | 南京邮电大学 | Facial expression recognition method and system based on feature fusion and attention mechanism |
CN117975342A (en) * | 2024-03-28 | 2024-05-03 | 江西尚通科技发展有限公司 | Semi-supervised multi-mode emotion analysis method, system, storage medium and computer |
CN117975342B (en) * | 2024-03-28 | 2024-06-11 | 江西尚通科技发展有限公司 | Semi-supervised multi-mode emotion analysis method, system, storage medium and computer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||