CN112101096B - Multi-mode fusion suicide emotion perception method based on voice and micro-expression

Multi-mode fusion suicide emotion perception method based on voice and micro-expression

Info

Publication number
CN112101096B
CN112101096B · CN202010764408.1A · CN202010764408A
Authority
CN
China
Prior art keywords
layer
emotion
fusion
feature
micro
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010764408.1A
Other languages
Chinese (zh)
Other versions
CN112101096A (en)
Inventor
Du Guanglong (杜广龙)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202010764408.1A priority Critical patent/CN112101096B/en
Publication of CN112101096A publication Critical patent/CN112101096A/en
Application granted granted Critical
Publication of CN112101096B publication Critical patent/CN112101096B/en
Current legal status: Active

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/08Feature extraction
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a multi-modal fusion suicide emotion perception method based on voice and micro-expressions. The method comprises the following steps: collecting video and audio using a Kinect with an infrared camera; analyzing the image frames and the audio in the video and converting them into corresponding feature texts using different methods; fusing the feature texts, i.e., performing dimension-reduction processing to obtain fusion features; and classifying the fusion features with a SoftMax activation function to judge whether the emotion is a suicidal emotion. The invention aligns the multimodal data at the text layer; the intermediate text representation and the proposed fusion method form a framework for fusing speech and facial expressions. The invention reduces the dimensionality of the speech and facial-expression information and unifies the two kinds of information into a single component. The invention uses the Kinect for data acquisition and is therefore non-invasive, high-performing and convenient to operate.

Description

Multi-mode fusion suicide emotion perception method based on voice and micro-expression
Technical Field
The invention belongs to the field of emotion perception, and particularly relates to a multi-mode fusion suicide emotion perception method based on voice and micro-expressions.
Background
In daily life, suicides occur unexpectedly, sometimes over a seemingly small and unpleasant matter; they cause enormous psychological harm to the people who love the deceased and bring mental and material losses to society. In fact, before a suicide attempt there are usually corresponding anomalies in speech, body movement and facial expression. If these anomalies can be observed carefully and recognized with suitable technology, a life may be saved.
Besides grasping the relevant psychological knowledge and popularizing sound mental-health education, using technical means is an effective approach: a camera observes the person, a computer judges the person's abnormality index and emotional state from multiple aspects, and professionals can then intervene and counsel in time, so that manpower and material resources are used effectively. In terms of technical implementation, Takahashi classified the emotion elicited by videos using electroencephalogram signals (K. Takahashi, "Remarks on emotion recognition from multi-modal bio-potential signals", Proc. IEEE Int. Conf. Ind. Technology (ICIT), vol. 3, pp. 1138-1143, Jun. 2004); Chanel et al. used EEG time-frequency features for three-class emotion recognition (G. Chanel, J. J. M. Kierkels, M. Soleymani, T. Pun, "Short-term emotion assessment in a recall paradigm", Int. J. Human-Computer Studies, vol. 67, no. 8, pp. 607-627, Aug. 2009); and Kim et al. used biosensors to classify music-induced emotion from electromyography, electrocardiogram, skin conductance and respiration changes (J. Kim and E. André, "Emotion recognition based on physiological changes in music listening", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 30, no. 12, pp. 2067-2083, 2008). Although physiological signals are easy to obtain, wearing sensor hardware is inconvenient in many situations, so non-contact data acquisition should be considered. Starting from facial-expression perception, Xu et al. proposed a method for perceiving a person's emotion from the micro-expressions in a video sequence (F. Xu, J. Zhang and J. Z. Wang, "Microexpression Identification and Categorization Using a Facial Dynamics Map", IEEE Transactions on Affective Computing, vol. 8, no. 2, 2017), and a probabilistic facial-expression recognition method based on two-dimensional geometric features was proposed in (X. Zhang, U. A. Ciftci, and L. Yin, "Mouth gesture based emotion awareness and interaction in virtual reality", ACM SIGGRAPH, 2015). On the speech side, many studies perform emotion recognition on plain text data (C.-H. Wu, Z.-J. Chuang and Y.-C. Lin, "Emotion Recognition from Text Using Semantic Labels and Separable Mixture Models", ACM Trans. Asian Language Information Processing, vol. 5, no. 2, pp. 165-182, June 2006; C.-M. Lee and S. S. Narayanan, "Toward Detecting Emotions in Spoken Dialogs", IEEE Trans. Speech and Audio Processing, vol. 13, no. 2, pp. 293-303, Mar. 2005; L. Devillers, L. Lamel and I. Vasilescu, "Emotion Detection in Task-Oriented Spoken Dialogues", Proc. IEEE Int'l Conf. Multimedia and Expo, pp. 549-552, 2003). It follows that most emotion-recognition methods focus on a single factor, which is one-sided: an individual can control the inner emotion so that it does not show in one channel, which makes such results insufficiently reliable. Multi-modal emotion perception based on speech and micro-expressions therefore deserves to exist and to be developed.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a multi-modal fusion suicide emotion perception method based on voice and micro-expressions. By combining expression and language features, the judgement is closer to the true emotional state. The method first collects audio and video, then extracts features from each modality and converts the extracted features into corresponding text descriptions, fuses the multiple features through a neural network and other algorithms so that the features are more representative, and finally performs classification to achieve emotion recognition. Experiments show that, compared with other algorithms, the method considerably improves emotion-recognition accuracy.
The object of the invention is achieved by at least one of the following technical solutions.
A multi-mode fusion suicide emotion perception method based on voice and micro-expressions comprises the following steps:
s1, acquiring video and audio by using Kinect with an infrared camera;
S2, analyzing the image frames and the audio in the video and converting them into corresponding feature texts using different methods;
S3, fusing the feature texts, i.e., obtaining fusion features after dimension-reduction processing;
and S4, classifying the fusion features with a SoftMax activation function and judging whether the emotion is a suicidal emotion.
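For orientation only, the following Python sketch shows how steps S1-S4 hand data to each other once the modalities have been reduced to feature-text embeddings. It is a minimal outline under stated assumptions (four embeddings, 16-dimensional vectors, two output classes); the real S3 stage uses the LSTM, self-organizing map and compensation layer described below, and the mean-pooling stand-in here is not the patented fusion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the S2 output: four feature-text embeddings
# (speech content, intonation, speech speed, facial expression),
# each already mapped to a fixed-size vector. Dimensions are arbitrary.
feature_embeddings = rng.normal(size=(4, 16))

# S3 (drastically simplified): fuse the per-modality vectors into one component.
# The patent uses an LSTM, a self-organizing map and a compensation layer here;
# a plain mean is used only to keep this outline runnable.
fused = feature_embeddings.mean(axis=0)

# S4: linear scores for two classes (non-suicidal, suicidal) followed by SoftMax.
W, b = rng.normal(size=(2, 16)), np.zeros(2)
logits = W @ fused + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print("P(suicidal emotion) =", float(probs[1]))
```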
Further, in step S2, features are extracted from the acquired audio along three dimensions (speech content, intonation and speech speed) and converted into three corresponding sets of feature texts; for the acquired image frames, the facial expression is captured, features are extracted and dimension-reduced, and the result is classified by a neural network and converted into the corresponding expression text description.
Further, the step S2 specifically includes the following steps:
S2.1, after noise-reduction processing of the audio signal, the speech is converted in turn into three corresponding feature-text descriptions according to speech content, intonation and speech speed, which are then converted into tone symbols through a BP neural network (Back Propagation Neural Network, BPNN) for emotion recognition;
the BP neural network is the most basic neural network and comprises an input layer, a hidden layer and an output layer; the output is computed by forward propagation and the error is propagated backwards;
S2.2, facial expression recognition is performed with a local method: the segmented regions of the face are obtained from the face image frames captured by the Kinect in real time; after the image is cropped, scaled, filtered, denoised, histogram-equalized and gray-level-equalized, Gabor wavelets are used for feature extraction and linear discriminant analysis is used for dimension reduction to obtain the corresponding feature vectors; finally a three-layer neural network performs classification to obtain the recognition result, i.e., the corresponding feature-text description.
The three-layer neural network comprises an input layer, a hidden layer and an output layer; the input layer receives the data, the output layer outputs the result, and the hidden layer transmits the information after 'activation'.
Further, in step S3, feature-text fusion and compensation are performed with an LSTM network, a self-organizing map (Self-Organization Mapping, SOM) and a compensation layer; the specific steps are as follows:
S3.1, first, the feature-text descriptions generated in step S2 are input into an LSTM network so that each feature is embedded into a vector of fixed size;
S3.2, the vectors from step S3.1 are normalized with the self-organizing map (SOM) algorithm;
S3.3, since the SOM algorithm may lose information, a compensation layer is arranged to compensate for the lost information;
S3.4, the result vectors generated in steps S3.2 and S3.3 are globally optimized and fused to obtain the fusion feature vector.
Further, in step S3.1, assume a given input sequence x = {x_1, x_2, …, x_t, …, x_T}, where x_t denotes the t-th feature text and T is the total number of feature texts; the computation at each layer of the LSTM network is:
h_t = σ_h(W_xh·x_t + W_hh·h_{t-1} + b_h)
where h_t denotes the output of the hidden layer at time t, W_xh the weight matrix from the input layer to the hidden layer, W_hh the weight matrix from the hidden layer to itself, b_h the bias of the hidden layer, and σ_h the activation function.
Further, in step S3.2, the self-organizing map (SOM) algorithm comprises the following steps:
S3.2.1, the vectors generated in step S3.1 are used as the input of the SOM algorithm and the corresponding outputs are determined; the emotion type is indicated by the maximum response value;
S3.2.2, the neighborhood of the winning neuron is determined and the weights of the neurons within that neighborhood are adjusted so that they converge toward the text-description embedding vector;
S3.2.3, as learning proceeds, the neighborhood shrinks, the feature vectors of the different text descriptions separate from one another, and the output result vector represents a specific emotion category.
Further, in step S3.3, the compensation layer is composed of A layer weight matrices, where A denotes the number of emotion categories; every node in each layer has its own weight, and the computation is:
u_i = w_i·μ_i + b
where u_i denotes the output of the i-th layer weight matrix, w_i the weight matrix of the i-th layer, μ_i the input of the i-th layer, and b a bias constant equal to 1; to obtain a compensation value of suitable magnitude, a tanh function is placed after the compensation layer so that the compensation value lies in [-1, 1].
Further, in step S4, the SoftMax activation function is used to classify the fusion feature vector, and it is determined whether the emotion belongs to a suicidal emotion.
Compared with the prior art, the invention has the following advantages:
(1) The invention aligns the multimodal data at the text layer. The intermediate text representation and the proposed fusion method form a framework for fusing speech and facial expressions. The invention reduces the dimensionality of the speech and facial-expression information and unifies the two kinds of information into a single component.
(2) To fuse the text descriptions, the invention provides a two-stage multi-modal emotion recognition framework that fuses speech and facial expressions.
(3) The invention uses the Kinect for data acquisition and is high-performing and convenient to operate.
Drawings
FIG. 1 is a flow chart of a multi-modal fusion suicide emotion perception method based on speech and micro-expressions according to the present invention;
FIG. 2 is a diagram of a multi-modal emotion recognition neural network of the present invention;
fig. 3 is a block diagram of a BP neural network in an embodiment of the present invention.
Detailed Description
Specific embodiments of the present invention will be described further below with reference to examples and drawings, but the embodiments of the present invention are not limited thereto.
Examples:
a multi-modal fusion suicide emotion perception method based on voice and micro-expressions is shown in fig. 1, and comprises the following steps:
s1, acquiring video and audio by using Kinect with an infrared camera;
s2, analyzing and converting the image frames and the audio in the video into corresponding characteristic texts by using different methods;
For the acquired audio, features are extracted along three dimensions (speech content, intonation and speech speed) and converted into three corresponding sets of feature texts; for the acquired image frames, the facial expression is captured, features are extracted and dimension-reduced, and the result is classified by a neural network and converted into the corresponding expression text description.
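As an illustration of the acoustic side of step S2, the sketch below pulls rough intonation (pitch statistics) and speech-speed (onset-rate) descriptors from an audio file with librosa. The pitch range, sampling rate and the use of onset counts as a speech-rate proxy are assumptions made for this sketch; the speech content itself would come from a separate speech recognizer that is not shown.

```python
import numpy as np
import librosa

def intonation_and_rate(wav_path, sr=16000):
    """Rough intonation (pitch statistics) and speech-rate proxies for one utterance."""
    y, sr = librosa.load(wav_path, sr=sr)

    # Intonation: fundamental-frequency track and its spread (YIN pitch estimator).
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)
    f0 = f0[np.isfinite(f0)]
    pitch_mean, pitch_std = float(np.mean(f0)), float(np.std(f0))

    # Speech speed: onset count per second as a crude syllable-rate estimate.
    onsets = librosa.onset.onset_detect(y=y, sr=sr)
    rate = len(onsets) / (len(y) / sr)

    return {"pitch_mean": pitch_mean, "pitch_std": pitch_std, "onsets_per_sec": rate}
```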
The step S2 specifically comprises the following steps:
S2.1, after noise-reduction processing of the audio signal, the speech is converted in turn into three corresponding feature-text descriptions according to speech content, intonation and speech speed, which are then converted into tone symbols through a BP neural network (Back Propagation Neural Network, BPNN) for emotion recognition;
the BP neural network is the most basic neural network and comprises an input layer, a hidden layer and an output layer; the output is computed by forward propagation and the error is propagated backwards. The BP neural network structure of the invention is shown in Fig. 3: the input layer has three nodes, the hidden layer has three nodes and the output layer has one node.
For each layer, the following formulas are used:
z = w^T·x + b
a = σ(z)
where x denotes the layer input, w^T the weight matrix of the layer, and b the bias of the layer; a, obtained after the activation function, serves as the input of the next layer. This is the forward-propagation process. The back-propagation process means descending along the gradient of the error so that the error gradually decreases, updating the parameters w and b until the minimum-error point is reached.
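A minimal numpy sketch of the 3-3-1 back-propagation network of Fig. 3, trained by gradient descent, is given below. The sigmoid activation, squared-error loss, learning rate and toy data are assumptions; the patent only fixes the layer sizes and the forward/backward scheme.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# 3-3-1 topology as in Fig. 3: three inputs, three hidden nodes, one output.
W1, b1 = rng.normal(scale=0.5, size=(3, 3)), np.zeros((3, 1))
W2, b2 = rng.normal(scale=0.5, size=(1, 3)), np.zeros((1, 1))
lr = 0.5

X = rng.normal(size=(3, 8))            # 8 toy training samples (columns)
Y = rng.integers(0, 2, size=(1, 8))    # toy binary targets

for _ in range(1000):
    # Forward propagation: z = W·x + b, a = sigma(z) at every layer.
    Z1 = W1 @ X + b1; A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2; A2 = sigmoid(Z2)

    # Backward propagation: gradients of the mean squared error.
    dZ2 = (A2 - Y) * A2 * (1 - A2)
    dW2 = dZ2 @ A1.T / X.shape[1]; db2 = dZ2.mean(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)
    dW1 = dZ1 @ X.T / X.shape[1]; db1 = dZ1.mean(axis=1, keepdims=True)

    # Gradient descent on w and b toward the minimum-error point.
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
```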
S2.2, facial expression recognition is performed with a local method: the segmented regions of the face are obtained from the face image frames captured by the Kinect in real time; after the image is cropped, scaled, filtered, denoised, histogram-equalized and gray-level-equalized, Gabor wavelets are used for feature extraction and linear discriminant analysis is used for dimension reduction to obtain the corresponding feature vectors; finally a three-layer neural network performs classification to obtain the recognition result, i.e., the corresponding feature-text description.
The three-layer neural network comprises an input layer, a hidden layer and an output layer: the input layer receives the data, the output layer outputs the result, and the hidden layer transmits the information after activation. The structure is built as a BP neural network in which the input layer contains one node that receives the feature vector, the hidden layer contains three nodes that process the information and add non-linearity, and the output layer contains one node that outputs the corresponding feature text.
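The S2.2 chain (Gabor filtering, linear discriminant analysis, small neural-network classifier) could be prototyped as follows with OpenCV and scikit-learn. The filter-bank parameters, the pooled statistics and the use of MLPClassifier as the three-layer network are illustrative assumptions, not the patented configuration.

```python
import cv2
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier

def gabor_features(gray_face, ksize=21):
    """Responses of a small Gabor filter bank over one preprocessed face region."""
    feats = []
    for theta in np.arange(0, np.pi, np.pi / 4):          # 4 orientations
        for lambd in (8.0, 12.0):                         # 2 wavelengths
            kern = cv2.getGaborKernel((ksize, ksize), 4.0, theta, lambd, 0.5, 0)
            resp = cv2.filter2D(gray_face, cv2.CV_32F, kern)
            feats.extend([resp.mean(), resp.std()])       # simple pooled statistics
    return np.array(feats)

def train_expression_classifier(faces, labels, n_classes):
    """faces: preprocessed grayscale face regions; labels: expression classes."""
    X = np.stack([gabor_features(f) for f in faces])
    lda = LinearDiscriminantAnalysis(n_components=min(n_classes - 1, X.shape[1]))
    X_low = lda.fit_transform(X, labels)                  # dimension reduction
    clf = MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000).fit(X_low, labels)
    return lda, clf
```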
S3, fusing the feature texts, i.e., obtaining fusion features after dimension-reduction processing.
As shown in Fig. 2, feature-text fusion and compensation are performed using an LSTM network, a self-organizing map (Self-Organization Mapping, SOM) and a compensation layer; the specific steps are as follows:
S3.1, first, the feature-text descriptions generated in step S2 are input into an LSTM network so that each feature is embedded into a vector of fixed size;
Assume a given input sequence x = {x_1, x_2, …, x_t, …, x_T}, where x_t denotes the t-th feature text and T is the total number of feature texts; the computation at each layer of the LSTM network is:
h_t = σ_h(W_xh·x_t + W_hh·h_{t-1} + b_h)
where h_t denotes the output of the hidden layer at time t, W_xh the weight matrix from the input layer to the hidden layer, W_hh the weight matrix from the hidden layer to itself, b_h the bias of the hidden layer, and σ_h the activation function.
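The recurrence above (the LSTM hidden-state update written without its gates) can be coded directly; in the sketch below the embedding dimension and the choice of tanh for σ_h are assumptions, and the final hidden state h_T is taken as the fixed-size embedding of one feature-text sequence.

```python
import numpy as np

def embed_sequence(x_seq, W_xh, W_hh, b_h):
    """h_t = sigma_h(W_xh·x_t + W_hh·h_{t-1} + b_h); returns the last hidden state."""
    h = np.zeros(W_hh.shape[0])
    for x_t in x_seq:                            # x_seq: T feature vectors x_1 ... x_T
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h                                     # fixed-size embedding of the sequence

rng = np.random.default_rng(0)
d_in, d_hid = 8, 16                              # assumed input / embedding dimensions
W_xh = rng.normal(scale=0.1, size=(d_hid, d_in))
W_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))
b_h = np.zeros(d_hid)

tokens = rng.normal(size=(5, d_in))              # a toy sequence of T = 5 token vectors
embedding = embed_sequence(tokens, W_xh, W_hh, b_h)   # shape (16,)
```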
S3.2, the vectors from step S3.1 are normalized with a self-organizing map (Self-Organization Mapping, SOM) algorithm, which comprises the following steps:
S3.2.1, the vectors generated in step S3.1 are used as the input of the SOM algorithm and the corresponding outputs are determined; the emotion type is indicated by the maximum response value;
S3.2.2, the neighborhood of the winning neuron is determined and the weights of the neurons within that neighborhood are adjusted so that they converge toward the text-description embedding vector;
S3.2.3, as learning proceeds, the neighborhood shrinks, the feature vectors of the different text descriptions separate from one another, and the output result vector represents a specific emotion category.
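A minimal self-organizing-map training loop matching steps S3.2.1-S3.2.3 is sketched below: each input embedding selects a winning node, and the weights inside a shrinking neighborhood are pulled toward that embedding. The one-dimensional grid, learning-rate schedule and neighborhood decay are assumptions for illustration.

```python
import numpy as np

def train_som(embeddings, n_nodes=6, epochs=50, lr0=0.5, radius0=2.0, seed=0):
    """1-D SOM: the nodes compete for the embedding vectors (steps S3.2.1-S3.2.3)."""
    rng = np.random.default_rng(seed)
    dim = embeddings.shape[1]
    weights = rng.normal(size=(n_nodes, dim))
    for epoch in range(epochs):
        lr = lr0 * np.exp(-epoch / epochs)              # learning rate decays
        radius = radius0 * np.exp(-epoch / epochs)      # neighborhood shrinks
        for x in embeddings:
            winner = np.argmin(np.linalg.norm(weights - x, axis=1))   # S3.2.1: winning node
            dist = np.abs(np.arange(n_nodes) - winner)                # grid distance to winner
            influence = np.exp(-(dist ** 2) / (2 * radius ** 2))      # S3.2.2: neighborhood
            weights += lr * influence[:, None] * (x - weights)        # converge toward x
    return weights

# Usage: pass the LSTM embeddings, one row per feature text, e.g.
# mapped = train_som(np.stack(list_of_feature_text_embeddings))
```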
S3.3, since the SOM algorithm may lose information, a compensation layer is arranged to compensate for the lost information.
The compensation layer is composed of A layer weight matrices, where A denotes the number of emotion categories; every node in each layer has its own weight, and the computation is:
u_i = w_i·μ_i + b
where u_i denotes the output of the i-th layer weight matrix, w_i the weight matrix of the i-th layer, μ_i the input of the i-th layer, and b a bias constant equal to 1; to obtain a compensation value of suitable magnitude, a tanh function is placed after the compensation layer so that the compensation value lies in [-1, 1].
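The compensation computation u_i = w_i·μ_i + b followed by tanh can be written out as below. Whether the A weight matrices act in parallel (one per emotion category) or in sequence is not fully specified in the text; the sketch applies them in parallel to the same input, and all dimensions are assumed.

```python
import numpy as np

def compensation_layer(mu, weight_mats, b=1.0):
    """u_i = tanh(w_i·mu + b) for each of the A per-category weight matrices."""
    outputs = []
    for w_i in weight_mats:           # one weight matrix per emotion category (A in total)
        u_i = w_i @ mu + b            # bias constant b = 1 as in the text
        outputs.append(np.tanh(u_i))  # tanh keeps the compensation value in [-1, 1]
    return np.stack(outputs)

rng = np.random.default_rng(0)
A, dim = 2, 16                                   # assumed: 2 emotion categories, 16-dim input
mats = rng.normal(scale=0.1, size=(A, dim, dim))
compensated = compensation_layer(rng.normal(size=dim), mats)   # shape (2, 16)
```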
S3.4, the result vectors generated in steps S3.2 and S3.3 are globally optimized and fused to obtain the fusion feature vectors.
S4, the fusion feature vectors are classified with the SoftMax activation function to judge whether the emotion is a suicidal emotion.
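Step S4 then amounts to a SoftMax over class scores computed from the fused feature vector; in the sketch below the linear score layer and the two-class setup (non-suicidal vs. suicidal) are assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_fused(fused_vec, W, b, suicide_class=1):
    """Linear scores + SoftMax over emotion classes; returns (is_suicidal, probabilities)."""
    probs = softmax(W @ fused_vec + b)
    return int(np.argmax(probs)) == suicide_class, probs

rng = np.random.default_rng(0)
fused = rng.normal(size=16)                       # fused feature vector from step S3.4
W, b = rng.normal(size=(2, 16)), np.zeros(2)      # 2 classes: non-suicidal / suicidal
is_suicidal, probs = classify_fused(fused, W, b)
```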

Claims (5)

1. A multi-mode fusion suicide emotion perception method based on voice and micro-expressions is characterized by comprising the following steps:
s1, acquiring video and audio by using Kinect with an infrared camera;
s2, analyzing and converting the image frames and the audio in the video into corresponding characteristic texts by using different methods;
S3, fusing the feature texts, i.e., obtaining fusion features after dimension-reduction processing; feature-text fusion and compensation are performed with an LSTM network, a self-organizing map and a compensation layer, the specific steps being as follows:
S3.1, first, the feature-text descriptions generated in step S2 are input into an LSTM network so that each feature is embedded into a vector of fixed size; assume a given input sequence x = {x_1, x_2, …, x_t, …, x_T}, where x_t denotes the t-th feature text and T is the total number of feature texts; the computation at each layer of the LSTM network is:
h_t = σ_h(W_xh·x_t + W_hh·h_{t-1} + b_h)
where h_t denotes the output of the hidden layer at time t, W_xh the weight matrix from the input layer to the hidden layer, W_hh the weight matrix from the hidden layer to itself, b_h the bias of the hidden layer, and σ_h the activation function;
s3.2, carrying out normalization processing on the vectors in the step S3.1 by adopting a self-organizing map algorithm; the self-organizing map algorithm comprises the following steps:
S3.2.1, the vectors generated in step S3.1 are used as the input of the SOM algorithm and the corresponding outputs are determined; the emotion type is indicated by the maximum response value;
S3.2.2, the neighborhood of the winning neuron is determined and the weights of the neurons within that neighborhood are adjusted so that they converge toward the text-description embedding vector;
S3.2.3, as learning proceeds, the neighborhood shrinks, the feature vectors of the different text descriptions separate from one another, and the output result vector represents an emotion category;
S3.3, since the SOM algorithm may lose information, a compensation layer is arranged to compensate for the lost information;
s3.4, performing global optimization and fusion on the result vectors generated in the steps S3.2 and S3.3 to obtain fusion feature vectors;
and S4, classifying the fusion characteristics by using a SoftMax activation function, and judging whether the emotion belongs to suicide emotion.
2. The method for multi-modal fusion of suicidal emotion perception based on speech and micro-expressions according to claim 1, wherein in step S2, features are extracted from the acquired audio along three dimensions (speech content, intonation and speech speed) and converted into three corresponding sets of feature texts; and the facial expression is captured from the acquired image frames, features are extracted and dimension-reduced, and the result is classified by a neural network and converted into the corresponding expression text description.
3. The method for multi-modal fusion of suicidal emotion perception based on speech and micro-expressions according to claim 2, wherein step S2 comprises the following steps:
S2.1, after noise-reduction processing of the audio signal, the speech is converted in turn into three corresponding feature-text descriptions according to speech content, intonation and speech speed, which are then converted into tone marks through a BP neural network for emotion recognition;
the BP neural network is the most basic neural network and comprises an input layer, a hidden layer and an output layer; the output is computed by forward propagation and the error is propagated backwards;
S2.2, facial expression recognition is performed with a local method: the segmented regions of the face are obtained from the face image frames captured by the Kinect in real time; the image is cropped, scaled, filtered, denoised, histogram-equalized and gray-level-equalized, Gabor wavelets are used for feature extraction, and linear discriminant analysis is used for dimension reduction to obtain the corresponding feature vectors; finally a three-layer neural network performs classification to obtain the recognition result, i.e., the corresponding feature-text description;
the three-layer neural network comprises an input layer, a hidden layer and an output layer; the input layer receives the data, the output layer outputs the result, and the hidden layer transmits the information after 'activation'.
4. The method for multi-modal fusion of suicidal emotion perception based on speech and micro-expressions according to claim 1, wherein in step S3.3 the compensation layer is composed of A layer weight matrices, where A denotes the number of emotion categories; every node in each layer has its own weight, and the computation is:
u_i = w_i·μ_i + b
where u_i denotes the output of the i-th layer weight matrix, w_i the weight matrix of the i-th layer, μ_i the input of the i-th layer, and b a bias constant equal to 1; to obtain a compensation value of suitable magnitude, a tanh function is placed after the compensation layer so that the compensation value lies in [-1, 1].
5. The method for multi-modal fusion of suicidal emotion perception based on speech and micro-expressions according to claim 1, wherein in step S4, the SoftMax activation function is used to classify the fusion feature vectors and determine whether the emotion is a suicidal emotion.
CN202010764408.1A 2020-08-02 2020-08-02 Multi-mode fusion suicide emotion perception method based on voice and micro-expression Active CN112101096B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010764408.1A CN112101096B (en) 2020-08-02 2020-08-02 Multi-mode fusion suicide emotion perception method based on voice and micro-expression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010764408.1A CN112101096B (en) 2020-08-02 2020-08-02 Multi-mode fusion suicide emotion perception method based on voice and micro-expression

Publications (2)

Publication Number Publication Date
CN112101096A CN112101096A (en) 2020-12-18
CN112101096B true CN112101096B (en) 2023-09-22

Family

ID=73749954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010764408.1A Active CN112101096B (en) 2020-08-02 2020-08-02 Multi-mode fusion suicide emotion perception method based on voice and micro-expression

Country Status (1)

Country Link
CN (1) CN112101096B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766063B (en) * 2020-12-31 2024-04-23 沈阳康泰电子科技股份有限公司 Micro-expression fitting method and system based on displacement compensation
CN112784804B (en) * 2021-02-03 2024-03-19 杭州电子科技大学 Micro expression recognition method based on neural network sensitivity analysis
CN113326703B (en) * 2021-08-03 2021-11-16 国网电子商务有限公司 Emotion recognition method and system based on multi-modal confrontation fusion in heterogeneous space
CN113469153B (en) * 2021-09-03 2022-01-11 中国科学院自动化研究所 Multi-modal emotion recognition method based on micro-expressions, limb actions and voice

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516696A (en) * 2019-07-12 2019-11-29 东南大学 It is a kind of that emotion identification method is merged based on the adaptive weighting bimodal of voice and expression
CN110569869A (en) * 2019-07-23 2019-12-13 浙江工业大学 feature level fusion method for multi-modal emotion detection

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11074495B2 (en) * 2013-02-28 2021-07-27 Z Advanced Computing, Inc. (Zac) System and method for extremely efficient image and pattern recognition and artificial intelligence platform

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516696A (en) * 2019-07-12 2019-11-29 东南大学 It is a kind of that emotion identification method is merged based on the adaptive weighting bimodal of voice and expression
CN110569869A (en) * 2019-07-23 2019-12-13 浙江工业大学 feature level fusion method for multi-modal emotion detection

Also Published As

Publication number Publication date
CN112101096A (en) 2020-12-18

Similar Documents

Publication Publication Date Title
CN112101096B (en) Multi-mode fusion suicide emotion perception method based on voice and micro-expression
Liu et al. Emotion recognition by deeply learned multi-channel textual and EEG features
Abdullah et al. Multimodal emotion recognition using deep learning
WO2020248376A1 (en) Emotion detection method and apparatus, electronic device, and storage medium
Tuncer et al. Novel dynamic center based binary and ternary pattern network using M4 pooling for real world voice recognition
Chen et al. Emotion recognition with audio, video, EEG, and EMG: a dataset and baseline approaches
Jayanthi et al. An integrated framework for emotion recognition using speech and static images with deep classifier fusion approach
Hussain et al. A radial base neural network approach for emotion recognition in human speech
Renjith et al. Speech based emotion recognition in Tamil and Telugu using LPCC and hurst parameters—A comparitive study using KNN and ANN classifiers
Jinliang et al. EEG emotion recognition based on granger causality and capsnet neural network
CN116230234A (en) Multi-mode feature consistency psychological health abnormality identification method and system
Kuang et al. Simplified inverse filter tracked affective acoustic signals classification incorporating deep convolutional neural networks
Ribeiro et al. Binary neural networks for classification of voice commands from throat microphone
Atkar et al. Speech emotion recognition using dialogue emotion decoder and CNN Classifier
CN112466284B (en) Mask voice identification method
Zhou et al. Multimodal emotion recognition method based on convolutional auto-encoder
Hu et al. Speech emotion recognition based on attention mcnn combined with gender information
CN112069897B (en) Knowledge-graph-based speech and micro-expression recognition suicide emotion perception method
Aggarwal et al. Acoustic methodologies for classifying gender and emotions using machine learning algorithms
CN114881668A (en) Multi-mode-based deception detection method
Mostafa et al. Voiceless Bangla vowel recognition using sEMG signal
CN112489787A (en) Method for detecting human health based on micro-expression
Ghosh et al. Classification of silent speech in english and bengali languages using stacked autoencoder
Mavaddati Voice-based age, gender, and language recognition based on ResNet deep model and transfer learning in spectro-temporal domain
Ying et al. A Multimodal Driver Emotion Recognition Algorithm Based on the Audio and Video Signals in Internet of Vehicles Platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant