CN110931043A - Integrated speech emotion recognition method, device, equipment and storage medium - Google Patents

Integrated speech emotion recognition method, device, equipment and storage medium

Info

Publication number: CN110931043A
Application number: CN201911246545.XA
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 孙亚新, 叶青
Original Assignee: Hubei University of Arts and Science
Current Assignee: Hubei University of Arts and Science
Application filed by Hubei University of Arts and Science
Prior art keywords: data, feature, preset, training, voice

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L 25/63: Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L 25/03: Speech or voice analysis techniques characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of voice signal processing and pattern recognition, and discloses an integrated voice emotion recognition method, device, equipment and storage medium. The method comprises the following steps: performing feature extraction on a voice sample to be recognized to obtain voice signal features of a preset dimension; performing feature statistics on the voice signal features through a preset statistical function to obtain a feature statistical result; normalizing the feature statistical result to obtain feature initial data; screening the feature initial data to obtain feature target data; and inputting the feature target data into a preset training classification model to obtain an integrated speech emotion recognition result. In this way, the feature selection result is prevented from over-fitting the training data, so that speech emotion recognition accuracy is improved.

Description

Integrated speech emotion recognition method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of voice signal processing and pattern recognition, and in particular to an integrated voice emotion recognition method, device, equipment and storage medium.
Background
The purpose of speech emotion recognition is to enable a computer to detect human emotional states from speech signals and to let machines understand human affective thinking, giving computers more humanized and sophisticated functions. Because speech is the most common way for humans to communicate and speech signals are easy to capture, speech emotion recognition technology can be used in a very large number of fields. (1) Recommendation systems. A recommendation that understands your mood is not an advertisement but human consideration; what is recommended is not merely a service but an understanding of you, and the recommendation effect is naturally greatly improved. (2) Telephone customer service emotion management systems. The quality of customer service can be improved, and psychological problems of customer service staff can be avoided. (3) Personal health monitoring. A person can be prevented from staying in a negative emotional state for a long time. In addition, the technology is highly useful in fields such as smart homes, distance education, game feedback and emotion therapy. If speech emotion recognition technology matures, portable devices will be able to understand the user's emotional state at any time and then serve the user accordingly. This would have a disruptive effect on the entire Internet service model and trigger a reshuffle of everyday devices; after all, no one wants to face a cold, unfeeling machine. It is also of great significance for improving the competitiveness of IT industries such as electronic commerce, social software, smart televisions, mobile phones and robots in China.
At present, there are many speech emotion recognition methods, among which ensemble learning is a good way to improve the recognition effect. Common ensemble-learning-based speech emotion recognition methods include the following: (1) manually defined hierarchical ensemble frameworks; (2) manually specified feature submodels; (3) the use of a variety of classifiers; (4) random submodels and other general ensemble learning methods. Methods (1) and (2) require considerable manual participation, and the designed models generalize poorly. Method (3) is mostly executed in the original feature space and may suffer from the curse of dimensionality. For method (4), the diversity and classification capability of the feature submodels are difficult to guarantee.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide an integrated speech emotion recognition method, device, equipment and storage medium, and aims to solve the technical problem of improving speech emotion recognition accuracy.
In order to achieve the above object, the present invention provides an integrated speech emotion recognition method, which comprises the following steps:
performing feature extraction on a voice sample to be recognized to obtain voice signal features with preset dimensionality;
carrying out feature statistics on the voice signal features through a preset statistical function to obtain a feature statistical result;
carrying out normalization processing on the feature statistical result to obtain feature initial data;
screening the characteristic initial data to obtain characteristic target data;
and inputting the characteristic target data into a preset training classification model to obtain an integrated speech emotion recognition result.
Preferably, the step of performing normalization processing on the feature statistical result to obtain feature initial data includes:
carrying out primary normalization processing on the characteristic statistical result to obtain sample characteristic data;
and carrying out speaker normalization processing on the sample characteristic data to obtain sample processing data, and taking the sample processing data as characteristic initial data.
Preferably, the step of screening the characteristic initial data to obtain characteristic target data includes:
and according to the sample processing data, obtaining label sample target data through a preset feature selection algorithm, and taking the label sample target data as the feature target data.
Preferably, before the step of performing feature extraction on the voice sample to be recognized to obtain the voice signal features of the preset dimension, the method further includes:
performing feature extraction on the training recognition voice sample to obtain training voice signal features with preset dimensionality;
performing feature statistics on the training voice signal features through a preset statistical function to obtain a training feature statistical result;
carrying out primary normalization processing on the training characteristic statistical result to obtain training sample data;
carrying out speaker normalization processing on the training sample data to obtain training sample processing data;
obtaining label training sample processing data through a preset feature selection algorithm according to the training sample processing data;
obtaining a class label corresponding to the label training sample processing data according to the label training sample processing data;
and establishing a preset training classification model according to the label training sample processing data and the class label.
Preferably, the preset training classification model comprises a plurality of preset training classification submodels;
the step of inputting the feature target data into a preset training classification model to obtain an integrated speech emotion recognition result comprises the following steps:
inputting the characteristic target data into the preset training classification submodel to obtain voice emotion category data;
performing data statistics on the voice emotion type data to obtain a voice emotion type data value;
and acquiring an integrated voice emotion recognition result according to the voice emotion category data value.
Preferably, the step of obtaining an integrated speech emotion recognition result according to the speech emotion category data value includes:
judging whether the voice emotion type data value belongs to a preset voice emotion type threshold range or not;
and if the voice emotion category data value belongs to the preset voice emotion category threshold range, acquiring an integrated voice emotion recognition result according to the voice emotion category data value.
Preferably, after the step of determining whether the speech emotion category data value belongs to a preset speech emotion category threshold range, the method further includes:
and if the voice emotion category data value does not belong to the preset voice emotion category threshold range, returning to the step of inputting the characteristic target data into the preset training classification submodel to obtain voice emotion category data.
In addition, to achieve the above object, the present invention further provides an integrated speech emotion recognition apparatus, including:
the acquisition module is used for extracting the characteristics of the voice sample to be recognized to acquire the voice signal characteristics with preset dimensionality;
the statistical module is used for carrying out feature statistics on the voice signal features through a preset statistical function to obtain a feature statistical result;
the processing module is used for carrying out normalization processing on the characteristic statistical result to obtain characteristic initial data;
the screening module is used for screening the characteristic initial data to obtain characteristic target data;
and the determining module is used for inputting the characteristic target data into a preset training classification model to obtain an integrated speech emotion recognition result.
In addition, to achieve the above object, the present invention also provides an electronic device, including: a memory, a processor and an integrated speech emotion recognition program stored on the memory and executable on the processor, the integrated speech emotion recognition program configured to implement the steps of the integrated speech emotion recognition method as described in any of the above.
In addition, to achieve the above object, the present invention further provides a storage medium, on which an integrated speech emotion recognition program is stored, and the integrated speech emotion recognition program, when executed by a processor, implements the steps of the integrated speech emotion recognition method as described in any one of the above.
The invention performs feature extraction on a voice sample to be recognized to obtain voice signal features of a preset dimension, performs feature statistics on the voice signal features through a preset statistical function to obtain a feature statistical result, performs preliminary normalization processing and speaker normalization processing on the feature statistical result to obtain feature initial data, and then screens the feature initial data to obtain feature target data. The feature target data are input into the preset training classification submodels to obtain voice emotion category data, data statistics are performed on the voice emotion category data to obtain a voice emotion category data value, and finally an integrated voice emotion recognition result is obtained according to the voice emotion category data value. This prevents the feature selection result from over-fitting the training data and selects features that are useful for recognizing the speaker's speech emotion, so the diversity and classification capability of the feature submodels are improved, which in turn improves the effect of the integrated classifier.
Drawings
Fig. 1 is a schematic structural diagram of an electronic device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of an integrated speech emotion recognition method according to the present invention;
FIG. 3 is a flowchart illustrating a second embodiment of the integrated speech emotion recognition method according to the present invention;
FIG. 4 is a block diagram of the integrated speech emotion recognition device according to the first embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an electronic device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the electronic device may include: a processor 1001 such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. Wherein the communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and optionally the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration shown in fig. 1 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a storage medium, may include an operating system, a network communication module, a user interface module, and an integrated speech emotion recognition program therein.
In the electronic apparatus shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 of the electronic device according to the present invention may be disposed in the electronic device, and the electronic device calls the integrated speech emotion recognition program stored in the memory 1005 through the processor 1001 and executes the integrated speech emotion recognition method according to the embodiment of the present invention.
An embodiment of the present invention provides an integrated speech emotion recognition method, and referring to fig. 2, fig. 2 is a schematic flowchart of a first embodiment of the integrated speech emotion recognition method according to the present invention.
In this embodiment, the integrated speech emotion recognition method includes the following steps:
step S10: and performing feature extraction on the voice sample to be recognized to obtain voice signal features with preset dimensionality.
In addition, before the step of performing feature extraction on the voice sample to be recognized to obtain the voice signal features of the preset dimension, the method further includes: performing feature extraction on the training recognition voice samples to obtain training voice signal features of the preset dimension; performing feature statistics on the training voice signal features through a preset statistical function to obtain a training feature statistical result; performing preliminary normalization processing on the training feature statistical result to obtain labeled training sample data and unlabeled training sample data; performing speaker normalization processing on the labeled training sample data and the unlabeled training sample data to obtain labeled training sample processing data and unlabeled training sample processing data; obtaining labeled training sample selection data through a trained semi-supervised feature selection algorithm according to the labeled training sample processing data and the unlabeled training sample processing data; obtaining the category labels corresponding to the labeled training samples according to the labeled training sample selection data; and establishing the preset training classification model according to the labeled training sample selection data and the category labels.
It should be noted that the speech signal features extracted from the voice sample to be recognized include: Mel Frequency Cepstral Coefficients (MFCC), Log Frequency Power Coefficients (LFPC), Linear Predictive Cepstral Coefficients (LPCC), Zero Crossings with Peak Amplitudes (ZCPA), Perceptual Linear Prediction (PLP) and RASTA-filtered Perceptual Linear Prediction (R-PLP).
It should be understood that the feature extraction result of each type of feature is a two-dimensional matrix, one dimension of which is the time dimension. For each type of feature F_i, the first derivative ΔF_i and the second derivative ΔΔF_i along the time dimension are then computed, and the original features, the first-derivative result and the second-derivative result are concatenated in the non-time dimension to form the final feature extraction result of that type of feature. The final feature extraction results of all feature types are concatenated in the non-time dimension to obtain the feature extraction result of the sample.
Further, for ease of understanding, the following is exemplified:
Suppose that for the MFCC features F_MFCC ∈ R^(39×z), ΔF_MFCC ∈ R^(39×z) and ΔΔF_MFCC ∈ R^(39×z), where z is the number of frames, i.e. the size of the time dimension. The concatenation result in the non-time dimension is then [F_MFCC; ΔF_MFCC; ΔΔF_MFCC] ∈ R^(117×z). When the MFCC and LPCC features are connected, suppose the final LPCC result lies in R^(36×z); after concatenation in the non-time dimension the result lies in R^(153×z).
Furthermore, it should be understood that at each feature extraction the MFCC, LFPC, LPCC, ZCPA, PLP and R-PLP features are extracted, where the number of Mel filters of the MFCC and LFPC is 40; the linear prediction orders of the LPCC, PLP and R-PLP are 12, 16 and 16, respectively; and the frequency segmentation of the ZCPA is: 0, 106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689, 3089, 3522, 4000. The dimensions of each class of feature for each sentence are therefore ti × 39, ti × 40, ti × 12 and ti × 16 (the last for each of the ZCPA, PLP and R-PLP), where ti is the number of frames of the i-th sentence and the number after the multiplication sign is the dimension of each frame feature. To capture the change of the speech signal in the time dimension, the first and second derivatives of the above features are also computed along the time dimension. The final dimensions of each class of feature are therefore ti × 117, ti × 140, ti × 36 and ti × 48 (again for each of the last three). The extracted speech signal features of the i-th sample are the combination of all the above features, with dimension ti × (117 + 140 + 36 + 48 + 48 + 48) = ti × 437.
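To make the feature layout above concrete, the following sketch computes just the MFCC branch with its first and second time derivatives and stacks them in the non-time dimension. It assumes the librosa library with default frame settings; the remaining feature types (LFPC, LPCC, ZCPA, PLP, R-PLP) and the patent's exact front-end parameters are not reproduced here.

```python
# Minimal sketch of one feature branch (MFCC + derivatives), assuming librosa.
import numpy as np
import librosa

def mfcc_with_derivatives(waveform, sample_rate, n_mfcc=39, n_mels=40):
    # Base MFCC matrix of shape (n_mfcc, z), where z is the number of frames.
    mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate,
                                n_mfcc=n_mfcc, n_mels=n_mels)
    d1 = librosa.feature.delta(mfcc)            # first derivative along time
    d2 = librosa.feature.delta(mfcc, order=2)   # second derivative along time
    # Concatenation in the non-time dimension: shape (3 * n_mfcc, z), here (117, z).
    return np.concatenate([mfcc, d1, d2], axis=0)

if __name__ == "__main__":
    y = np.random.default_rng(0).normal(size=16000).astype(np.float32)  # stand-in signal
    print(mfcc_with_derivatives(y, 16000).shape)
```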
Step S20: and carrying out feature statistics on the voice signal features through a preset statistical function to obtain a feature statistical result.
The statistical results of the above features in the time dimension are obtained using the following statistical functions: mean, standard deviation (std), minimum (min), maximum (max), kurtosis and skewness.
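A short sketch of these six statistics applied along the time dimension, assuming NumPy and SciPy; the function and variable names are illustrative.

```python
# Statistics over the time dimension of a (feature_dim, z) feature matrix.
import numpy as np
from scipy.stats import kurtosis, skew

def time_statistics(features):
    # Returns a vector of length 6 * feature_dim.
    return np.concatenate([
        features.mean(axis=1), features.std(axis=1),
        features.min(axis=1), features.max(axis=1),
        kurtosis(features, axis=1), skew(features, axis=1),
    ])
```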
Step S30: and carrying out normalization processing on the characteristic statistical result to obtain characteristic initial data.
It should be understood that the feature statistics are subjected to a preliminary normalization process to obtain sample feature data, the sample feature data are subjected to a speaker normalization process to obtain sample processing data, and the sample processing data are used as feature initial data.
In addition, the feature statistical results of the labeled samples obtained in the above steps are denoted {x_1, x_2, …, x_n}, and the feature statistical results of the unlabeled training samples of a given speaker are denoted {x_{n+1}, x_{n+2}, …, x_{n+m}}. All feature statistical results {x_1, x_2, …, x_{n+m}} are first subjected to the preliminary normalization

x'_i = (x_i - μ) / σ,   i = 1, 2, …, n + m,   (1)

where μ denotes the mean of all samples and σ² denotes the variance of all samples. The preliminary normalization results {x'_1, x'_2, …, x'_{n+m}} are then subjected to speaker normalization:

x̃_i = x'_i - (1 / n_i) Σ_{j=1}^{n_i} x'_j,   (2)

where x'_j, j = 1, 2, …, n_i, are the training samples with the same speaker label as x'_i, and n_i is the number of training samples with the same speaker label as x'_i.
Furthermore, it should be understood that a plurality of feature subsets are obtained through a unified feature selection framework, and the data description capability and classification capability of each feature subset can be ensured. The base classifier trained on the basis has better diversity and classification strength. The number of the base classifiers can be obviously reduced, and the classification capability of the base classifiers can be improved.
Meanwhile, the feature statistical results are normalized by using an improved normalization algorithm. The normalization algorithm comprises two steps of initial normalization and speaker normalization, wherein the initial normalization uses the mean value and the variance of all samples to normalize each sample, and can avoid the influence caused by different characteristic value ranges; the speaker normalization only needs to use the mean value of all samples of the speaker, and the mean value estimation can obtain higher confidence coefficient when the number of the samples is less, so that a better speaker normalization effect can be achieved under the condition that the number of unlabeled samples of the speaker is less.
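The two-step normalization can be sketched as below, reading the preliminary step as a global z-score with the mean and standard deviation of all samples and the speaker step as subtraction of the per-speaker mean; this reading and all names are assumptions of the sketch.

```python
import numpy as np

def preliminary_normalize(X, mu=None, sigma=None):
    # X: (n_samples, n_stats). Global z-score with the mean/std of all samples.
    mu = X.mean(axis=0) if mu is None else mu
    sigma = X.std(axis=0) if sigma is None else sigma
    return (X - mu) / (sigma + 1e-12), mu, sigma

def speaker_normalize(X_norm, speaker_ids):
    # Subtract from each sample the mean of all samples of the same speaker.
    X_out = np.empty_like(X_norm)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        X_out[mask] = X_norm[mask] - X_norm[mask].mean(axis=0)
    return X_out
```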
Step S40: and screening the characteristic initial data to obtain characteristic target data.
It should be noted that, according to the sample processing data, the label sample target data is obtained by the preset feature selection method, and the label sample target data is used as the feature target data.
Step S50: and inputting the characteristic target data into a preset training classification model to obtain an integrated speech emotion recognition result.
It should be noted that the preset training classification model comprises a plurality of preset training classification submodels, and each preset classification submodel is a support vector machine.
Furthermore, it should be understood that the preset training classification submodels are obtained as follows:

(1) Define a matrix L that describes the local geometry of the samples:

L = (I - S)^T (I - S)

where I ∈ R^(n×n) is the identity matrix, i.e. its diagonal element values are 1 and the other element values are 0, and S is a coefficient matrix obtained by minimizing the error of reconstructing each sample from the other samples.

(2) Define the relationship G between samples, then compute the Laplacian matrix L̃ = D - G, where D is the diagonal matrix with D_ii = Σ_j G_ij. Solve the corresponding eigendecomposition problem and let V = [v_1, v_2, …, v_C] be the eigenvectors corresponding to the 2nd through (C+1)-th smallest eigenvalues, where C is the number of speech emotion categories.

(3) Optimize, with the loop in (4), an objective over the submodel matrices W_k that couples the submodels through P_kq(i, j) = W_k(i, j) · W_q(i, j).

(4) The objective is optimized with the following loop:

for k = 1 to p (p is the required number of submodels):
    initialize D_k as the identity matrix (diagonal 1, the rest 0) and set t = 0;
    iteratively optimize W_k by repeating:
    (4-1) compute W_k^(t+1) in closed form from the training data X, the identity matrix I, three balance parameters α, β, γ, the matrix L defined in (1) and the matrix V computed in (2);
    (4-2) compute the diagonal matrix D_k^(t+1), whose i-th diagonal element is computed from the i-th row of W_k^(t+1);
    (4-3) compute the coupling matrix whose element in row i and column j is obtained from P_qk(i, ·), with P_kq(i, j) = W_k(i, j) · W_q(i, j);
    (4-4) set t = t + 1; repeat until the difference between W_k^(t+1) and W_k^(t) is smaller than a predetermined threshold.
In addition, it should be understood that the feature target data is input into the preset training classification submodel to obtain speech emotion category data, the speech emotion category data is subjected to data statistics to obtain a speech emotion category data value, and an integrated speech emotion recognition result is obtained according to the speech emotion category data value.
In addition, it should be noted that, the step of obtaining the integrated speech emotion recognition result according to the speech emotion category data value is to determine whether the speech emotion category data value belongs to a preset speech emotion category threshold range; if the voice emotion type data value belongs to the preset voice emotion type threshold range, obtaining an integrated voice emotion recognition result according to the voice emotion type data value, and if the voice emotion type data value does not belong to the preset voice emotion type threshold range, returning to the step of inputting the feature target data into the preset training classification sub-model to obtain voice emotion type data.
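The voting and threshold check described above can be sketched as follows, taking the "speech emotion category data value" to be the vote count of the winning class over the submodels; this interpretation and all names are assumptions of the sketch.

```python
# Majority voting over the preset training classification submodels (illustrative).
import numpy as np
from collections import Counter

def ensemble_predict(submodels, feature_subsets, x, min_votes):
    # submodels: trained classifiers; feature_subsets: index arrays; x: feature target data.
    votes = [clf.predict(x[idx].reshape(1, -1))[0]
             for clf, idx in zip(submodels, feature_subsets)]
    label, count = Counter(votes).most_common(1)[0]
    # Accept only if the vote count reaches the preset threshold, otherwise signal a retry.
    return label if count >= min_votes else None
```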
Furthermore, it should be understood that the above is an identification phase, the steps of which are:
in this stage, the speech signal of the emotion sample to be recognized of the known speaker is processed, and the emotion classification of the emotion sample to be recognized is obtained according to the training classifier obtained in the training stage. The specific process is as follows:
the first step is as follows: extracting MFCC, LFPC, LPCC, ZCP A, PLP and R-PLP characteristics from a voice signal of an emotion sample to be recognized, wherein the number of Mel filters of the MFCC and the LFPC is 40; the linear prediction orders of the LPCC, the PLP and the R-PLP are respectively 12, 16 and 16; the frequency segmentation of the ZCAP is as follows: 0,106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689, 3089, 3522, 4000. So that the dimensions of each class of features of each statement are respectively: t 39, t 40, t 12, t 16, wherein t is the number of frames of the emotion sentences to be identified, and the number after the multiplication number is the dimension of each frame feature. To obtain the change of the speech signal in the time dimension, a first derivative, a second derivative, is also calculated for the above features in the time dimension. Finally, the dimensionality of each type of feature is respectively as follows: t 117, t 140, t 36, t 48. The speech signal features extracted from the emotion sentences to be recognized are combined by all the features, and the dimension is t (117+140+36+48+48+48).
The second step: the following statistical functions are used: mean, standard deviation (std), minimum (min), maximum (max), kurtosis and skewness, obtaining the feature statistical result x of the emotion sentence to be recognized.
The third step: speaker normalization. First, the preliminary normalization result x' of x is computed with formula (1), using the μ and σ obtained in the training stage; the speaker normalization result x̃ is then computed from x' with formula (2).
The fourth step: according to the feature selection result V obtained in the training process, compute the feature selection result z by taking the components of x̃ indexed by V.
The fifth step: and obtaining the speech emotion class l of z by using the classifier obtained in the training process.
The corpus used for evaluating the emotion recognition effect is the German EMO-DB speech emotion database, a standard database in the field of speech emotion recognition. The training process is completed first, and then the recognition test is performed. The test is carried out in a 5-fold cross-validation fashion. Seven emotions can be recognized: anger, fear, irritability, disgust, happiness, neutrality and sadness. In the speaker-dependent case the average classification accuracy is 90.84%, and apart from some confusion involving anger, the emotions are well distinguished from each other. In the speaker-independent case the average classification accuracy is 86.50%.
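The 5-fold evaluation protocol can be sketched as below, with a plain linear SVM standing in for the full ensemble and random data standing in for the EMO-DB feature statistics; the accuracies quoted above come from the patent, not from this sketch.

```python
# Sketch of 5-fold cross-validated accuracy (placeholder data, stand-in classifier).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 437))     # stand-in for the 437-dimensional statistics
y = rng.integers(0, 7, size=200)    # stand-in labels for the 7 emotions
print("mean 5-fold accuracy:", cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean())
```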
In this embodiment, feature extraction is performed on the voice sample to be recognized to obtain voice signal features of a preset dimension; feature statistics are performed on the voice signal features through a preset statistical function to obtain a feature statistical result; the feature statistical result is normalized to obtain feature initial data; the feature initial data are screened to obtain feature target data; and the feature target data are input into a preset training classification model to obtain an integrated speech emotion recognition result. In this way, multiple feature subsets with sufficient ability to describe the data are found, the data are used more fully, and the speech emotion recognition result can be obtained more accurately.
Referring to fig. 3, fig. 3 is a flowchart illustrating a method for integrated speech emotion recognition according to a second embodiment of the present invention.
Based on the first embodiment, before the step S10, the integrated speech emotion recognition method of this embodiment further includes:
step S000: and extracting the characteristics of the training recognition voice sample to obtain the training voice signal characteristics with preset dimensionality.
Step S001: and carrying out feature statistics on the training voice signal features through a preset statistical function to obtain a training feature statistical result.
Step S002: and carrying out primary normalization processing on the training characteristic statistical result to obtain training sample data.
Step S003: and carrying out speaker normalization processing on the training sample data to obtain training sample processing data.
Step S004: and obtaining label training sample processing data through a preset feature selection algorithm according to the training sample processing data.
Step S005: and obtaining the class label corresponding to the label training sample processing data according to the label training sample processing data.
Step S006: and establishing a preset training classification model according to the label training sample processing data and the class label.
Further, it should be understood that, in the training phase, (1-1) the features of the labeled training samples and the features of the unlabeled samples of each speaker are extracted; (1-2) feature statistics are performed on all the features; (1-3) the normalization algorithm is applied to the feature statistical results; (1-4) a plurality of feature submodels are selected using a unified feature selection framework; (1-5) a support vector machine is trained for each feature submodel; and (1-6) the classification result is obtained by voting over the results of all the support vector machines.
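A compact sketch of steps (1-4) through (1-6): several feature subsets are chosen, one support vector machine is trained per subset, and classification is done by voting as in the earlier voting sketch. The random subsets here are only a placeholder for the patent's unified feature selection framework, and all names are illustrative.

```python
# One SVM per feature submodel (illustrative; random subsets stand in for the
# unified feature selection framework of the patent).
import numpy as np
from sklearn.svm import SVC

def train_ensemble(X, y, n_submodels=5, subset_size=100, seed=0):
    # X: (n_samples, n_features) normalized feature statistics; y: emotion labels.
    rng = np.random.default_rng(seed)
    subsets, models = [], []
    for _ in range(n_submodels):
        idx = rng.choice(X.shape[1], size=subset_size, replace=False)
        subsets.append(idx)
        models.append(SVC(kernel="linear").fit(X[:, idx], y))
    return models, subsets
```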
Furthermore, to facilitate understanding, the specific steps of the training phase are as follows:
in this stage, training is performed for all speakers respectively to obtain a classifier corresponding to each speaker, and the specific process is as follows:
the first step is as follows: extracting the characteristics of MFCC, LFPC, LPCC, ZCAP, PLP and R-PLP from all voice training signals (all voice signals with label samples and voice signals without label samples of a certain speaker in each training), wherein the number of Mel filters of MFCC and LFPC is 40; the linear prediction orders of the LPCC, the PLP and the R-PLP are respectively 12, 16 and 16; the frequency segmentation of the ZCAP is as follows: 0,106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689, 3089, 3522, 4000. So that the dimensions of each class of features of each statement are respectively: ti 39, ti 40, ti 12, ti 16, where ti is the number of frames in the i-th sentence, and the number following the multiplier is the dimension of each frame feature. To obtain the change of the speech signal in the time dimension, a first derivative, a second derivative, is also calculated for the above features in the time dimension. Finally, the dimensionality of each type of feature is respectively as follows: ti 117, ti 140, ti 36, ti 48. The extracted speech signal features of the ith sample are combined from all the above features, and have a dimension ti (117+140+36+48+48+48).
The second step: the following statistical functions are used: mean, standard deviation (std), minimum (min), maximum (max), kurtosis and skewness, giving the statistics of the above features in the time dimension. The feature statistical results of the labeled samples are denoted {x_1, x_2, …, x_n}, and the feature statistical results of the unlabeled training samples of a given speaker are denoted {x_{n+1}, x_{n+2}, …, x_{n+m}}, where n is the number of labeled samples and m is the number of unlabeled samples of that speaker.
The third step: normalize the feature statistical results. The method comprises the following steps:
(1) All feature statistical results {x_1, x_2, …, x_{n+m}} obtained in the second step are subjected to the preliminary normalization

x'_i = (x_i - μ) / σ,   i = 1, 2, …, n + m,   (1)

where μ denotes the mean of all samples and σ² denotes the variance of all samples.
(2) The preliminary normalization results {x'_1, x'_2, …, x'_{n+m}} are subjected to speaker normalization:

x̃_i = x'_i - (1 / n_i) Σ_{j=1}^{n_i} x'_j,   (2)

where x'_j, j = 1, 2, …, n_i, are the training samples with the same speaker label as x'_i, and n_i is the number of training samples with the same speaker label as x'_i.
The fourth step: train the semi-supervised feature selection algorithm (this is the preset feature selection method referred to above). The algorithm comprises the following steps:
(1) The relationship between samples is defined as follows:

S_ij = 1 / n_{l_i}   if x̃_i and x̃_j are both labeled and l_i = l_j;
S_ij = A_ij          if at least one of x̃_i and x̃_j is unlabeled and x̃_j ∈ N_k(x̃_i) or x̃_i ∈ N_k(x̃_j);
S_ij = 0             otherwise;

where S_ij represents the relationship between samples, n_{l_i} is the number of samples with class label l_i, l_i and l_j are the class labels of samples x̃_i and x̃_j, and N_k(x̃_i) is the k-nearest neighborhood of sample x̃_i. A_ij is a heat-kernel similarity computed from the Euclidean distance d(x̃_i, x̃_j) between x̃_i and x̃_j, scaled by the Euclidean distance from x̃_i to its k-th nearest neighbor and the Euclidean distance from x̃_j to its k-th nearest neighbor.
(2) Compute the Laplacian matrix L = D - S, where D is the diagonal matrix with D_ii = Σ_j S_ij.
(3) Solve the eigendecomposition problem associated with L and let Y = [y_1, y_2, …, y_C] be the eigenvectors corresponding to the 2nd through (C+1)-th smallest eigenvalues, where C is the number of speech emotion categories.
(4) Solve an L1-regularized regression of each eigenvector y_c on the training data using the Least Angle Regression algorithm (LARs), obtaining C sparse coefficient vectors {w_1, w_2, …, w_C}, where y_c is the c-th eigenvector obtained in step (3).
(5) Compute an importance score score(j) for each feature from the C sparse coefficient vectors, where j denotes the j-th feature and score(j) denotes the score of the j-th feature.
(6) Return the indices of the d features with the largest scores as the feature selection result V, where d is the number of features to be selected.
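Steps (1) through (6) can be sketched as follows. The exact similarity kernel for A_ij, the use of the generalized eigenproblem L y = λ D y, the choice of scikit-learn's LassoLars for the L1-regularized least-angle regression, and the scoring rule score(j) = max_c |w_c(j)| are all assumptions where the original formulas are not fully legible; function and parameter names are illustrative.

```python
# Illustrative sketch of the semi-supervised feature selection (steps (1)-(6)).
# The A_ij kernel, the generalized eigenproblem and the scoring rule are assumptions.
import numpy as np
from scipy.linalg import eigh
from sklearn.linear_model import LassoLars

def build_relation_matrix(X, labels, k=5):
    # Step (1). X: (n_samples, dim); labels: class label per sample, -1 if unlabeled.
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    sigma = np.sort(dist, axis=1)[:, k]          # distance to the k-th nearest neighbor
    knn = np.argsort(dist, axis=1)[:, 1:k + 1]   # k nearest neighbors (self excluded)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if labels[i] != -1 and labels[j] != -1:
                if labels[i] == labels[j]:
                    S[i, j] = 1.0 / np.sum(labels == labels[i])
            elif j in knn[i] or i in knn[j]:
                # Assumed heat-kernel similarity with local scaling.
                S[i, j] = np.exp(-dist[i, j] ** 2 / (sigma[i] * sigma[j] + 1e-12))
    return S

def select_features(X, S, n_classes, d, alpha=0.01):
    # Steps (2)-(6): Laplacian eigenvectors, LARs-based sparse regression, scoring.
    D = np.diag(S.sum(axis=1))
    L = D - S
    _, eigvecs = eigh(L, D + 1e-9 * np.eye(len(D)))   # generalized eigenproblem (assumed)
    Y = eigvecs[:, 1:n_classes + 1]                   # 2nd..(C+1)-th smallest eigenvectors
    coefs = np.stack([LassoLars(alpha=alpha).fit(X, Y[:, c]).coef_
                      for c in range(n_classes)])
    scores = np.abs(coefs).max(axis=0)                # importance score per feature (assumed)
    return np.argsort(scores)[::-1][:d]               # indices of the d highest-scoring features
```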
In addition, it should be noted that the semi-supervised feature selection algorithm takes into account the manifold structure of the data, the class structure of the data and the information provided by the unlabeled samples, thereby preventing the feature selection result from over-fitting the training data and selecting features that are useful for recognizing the speaker's speech emotion.
The fifth step: obtaining the feature selection result { z of the labeled sample according to the feature selection result V1,z2,…,zn}. Will be provided withThe feature selection results are stored in the speech emotion vector database.
The sixth step: train the classifier using {z_1, z_2, …, z_n} and their class labels.
Further, it should be understood that the feature selection results {z_1, z_2, …, z_n} of the labeled samples are obtained from the feature selection result V, and the speech emotion classes of {z_1, z_2, …, z_n} are obtained with the classifier obtained in the training process.
In addition, it should be noted that after the training process is completed, the recognition test is performed. The test is carried out in a 5-fold cross-validation fashion. Seven emotions can be recognized: anger, fear, irritability, disgust, happiness, neutrality and sadness. In the speaker-dependent case the average classification accuracy is 90.84%, and apart from some confusion involving anger, the emotions are well distinguished from each other. In the speaker-independent case the average classification accuracy is 86.50%.
In this embodiment, features are extracted from the training recognition voice samples to obtain training voice signal features of a preset dimension; feature statistics are performed on the training voice signal features through a preset statistical function to obtain a training feature statistical result; preliminary normalization processing is performed on the training feature statistical result to obtain labeled training sample data and unlabeled training sample data; speaker normalization processing is performed on the labeled training sample data and the unlabeled training sample data to obtain labeled training sample processing data and unlabeled training sample processing data; labeled training sample selection data are obtained through the trained semi-supervised feature selection algorithm according to the labeled training sample processing data and the unlabeled training sample processing data; the category labels corresponding to the labeled training samples are obtained according to the labeled training sample selection data; and the preset training classification model is established according to the labeled training sample selection data and the category labels. In this way, the influence of unlabeled samples of other speakers is avoided, the influence of the speaker on the manifold structure of the speech data is exploited to the greatest extent, and features that are useful for recognizing the speaker's speech emotion are selected.
In addition, an embodiment of the present invention further provides a storage medium, where the storage medium stores an integrated speech emotion recognition program, and the integrated speech emotion recognition program, when executed by a processor, implements the steps of the integrated speech emotion recognition method described above.
Referring to fig. 4, fig. 4 is a block diagram illustrating a first embodiment of an integrated speech emotion recognition device according to the present invention.
As shown in fig. 4, the integrated speech emotion recognition apparatus according to the embodiment of the present invention includes: the obtaining module 4001 is configured to perform feature extraction on a voice sample to be recognized to obtain voice signal features of a preset dimension; the statistic module 4002 is configured to perform feature statistics on the speech signal features through a preset statistic function to obtain a feature statistical result; a processing module 4003, configured to perform normalization processing on the feature statistical result to obtain feature initial data; a screening module 4004, configured to screen the feature initial data to obtain feature target data; the determining module 4005 is configured to input the feature target data into a preset training classification model, so as to obtain an integrated speech emotion recognition result.
The obtaining module 4001 is configured to perform feature extraction on a voice sample to be recognized, and obtain a voice signal feature with a preset dimension.
In addition, before the step of performing feature extraction on a voice sample to be recognized to obtain voice signal features of a preset dimension, feature extraction is performed on a training recognition voice sample to obtain training voice signal features of the preset dimension, feature statistics is performed on the training voice signal features through a preset statistical function to obtain a training feature statistical result, preliminary normalization processing is performed on the training feature statistical result to obtain labeled training sample data and unlabeled training sample data, speaker normalization processing is performed on the labeled training sample data and the unlabeled training sample data to obtain labeled training sample processing data and unlabeled training sample processing data, and labeled training sample selection data is obtained through a training half-governor characteristic selection algorithm according to the labeled training sample processing data and the unlabeled training sample processing data, and obtaining a category label corresponding to the label training sample according to the label training sample selection data, and establishing a preset training classification model according to the label training sample selection data and the category label.
It should be noted that the speech signal features extracted from the speech sample to be recognized include: mel Frequency Cepstrum Coefficient (MFCC), Log Frequency Power Coefficient (LFPC), Linear Predictive Cepstrum Coefficient (LPCC), Zero Crossing with peak Amplitude (zcap), Perceptual Linear Prediction (PLP), Rasta filter Perceptual Linear prediction (R-PLP).
It should be understood that the above feature extraction result of each type of feature is a two-dimensional matrix, wherein one dimension is a time dimension, and then each type of feature F is calculatediFirst derivative in the time dimension Δ FiSecond derivative Δ Δ FiConnecting the original features, the first derivative result and the second derivative result in series in a non-time dimension to form a final feature extraction result of each type of features; and (4) connecting the final feature extraction results of the features of all the classes in series on a non-time dimension to obtain the feature extraction result of the sample.
Further, for ease of understanding, the following is exemplified:
suppose, MFCC corresponds to FMFCC∈R39×z,ΔFMFCC∈R39×z,ΔΔFi∈R39×zWherein z is the number of frames, i.e. the number of degrees of the time dimension, the concatenation result in the non-time dimension
Figure RE-GDA0002370638650000161
When MFCC and LPCC are connected, suppose
Figure RE-GDA0002370638650000162
After being connected in seriesIs composed of
Figure RE-GDA0002370638650000163
Furthermore, it should be understood that at each time of speech signal feature extraction, MFCC, L FPC, LPCC, ZCPA, PLP, R-PLP features are extracted, where the number of Mel filters of MFCC, LFPC is 40; the linear prediction orders of the LPCC, the PLP and the R-PLP are respectively 12, 16 and 16; the frequency segmentation of the ZCAP is as follows: 0,106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689, 3089, 3522, 4000. So that the dimensions of each class of features of each statement are respectively: ti 39, ti 40, ti 12, ti 16, where ti is the number of frames in the i-th sentence, and the number following the multiplier is the dimension of each frame feature. To obtain the change of the speech signal in the time dimension, a first derivative, a second derivative, is also calculated for the above features in the time dimension. Finally, the dimensionality of each type of feature is respectively as follows: ti 117, ti 140, ti 36, ti 48. The extracted speech signal features of the ith sample are combined from all the above features, and have a dimension ti (117+140+36+48+48+48).
The statistic module 4002 is configured to perform feature statistics on the speech signal features through a preset statistic function, so as to obtain a feature statistical result.
The statistical results of the above features in the time dimension are obtained using a statistical function using a mean (mean), a standard deviation (standard deviation), a minimum (min), a maximum (max), a kurtosis (kurtosis), and a skewness (skewness).
The processing module 4003 is configured to perform normalization processing on the feature statistical result to obtain an operation of feature initial data.
It should be understood that the feature statistics are subjected to a preliminary normalization process to obtain sample feature data, the sample feature data are subjected to a speaker normalization process to obtain sample processing data, and the sample processing data are used as feature initial data.
In addition, the feature statistics result of the labeled sample known in the above steps is denoted as { x1,x2,…,xnAnd recording the characteristic statistical result of the unlabeled training sample of a certain speaker as { x }n+1,xn+2,…,xn+mGet statistics of all features { x }1,x2,…,xn+mPreliminary normalization was performed using the following equations, respectively:
Figure RE-GDA0002370638650000171
wherein
Figure RE-GDA0002370638650000172
The mean of all the samples is represented by,
Figure RE-GDA0002370638650000173
represents the variance of all samples;
thereafter, the preliminary normalization result { x'1,x'2,…,x'n+mSpeaker normalization is performed using the following equation: .
Figure RE-GDA0002370638650000174
Wherein x'j,j=1,2,…,niIs of the training sample with x'iSamples with the same speaker label, niIs of x 'in the training sample'iThe number of samples with the same speaker label.
Furthermore, it should be understood that a plurality of feature subsets are obtained through a unified feature selection framework, and the data description capability and classification capability of each feature subset can be ensured. The base classifier trained on the basis has better diversity and classification strength. The number of the base classifiers can be obviously reduced, and the classification capability of the base classifiers can be improved.
Meanwhile, the feature statistical results are normalized by using an improved normalization algorithm. The normalization algorithm comprises two steps of initial normalization and speaker normalization, wherein the initial normalization uses the mean value and the variance of all samples to normalize each sample, and can avoid the influence caused by different characteristic value ranges; the speaker normalization only needs to use the mean value of all samples of the speaker, and the mean value estimation can obtain higher confidence coefficient when the number of the samples is less, so that a better speaker normalization effect can be achieved under the condition that the number of unlabeled samples of the speaker is less.
The screening module 4004 is configured to screen the feature initial data to obtain a feature target data.
It should be noted that, according to the sample processing data, the label sample processing data and the non-label sample processing data are obtained by a preset feature selection method, and the label sample target data is used as the feature target data.
The determining module 4005 is configured to input the feature target data into a preset training classification model, and obtain an operation of integrating a speech emotion recognition result.
It should be noted that the preset training classification model includes a plurality of preset training classification submodels, and each preset classification submodel supports a vector machine.
Furthermore, it should be understood that the above steps of obtaining the preset training classification submodel are:
(1) defining a matrix L describing the local geometry of the sample:
L=(I-S)T(I-S)
in which I is ∈ Rn×nIs an identity matrix, i.e. the diagonal element value is 1 and the other element values are 0; s is optimized by the following formula:
Figure RE-GDA0002370638650000181
(2) the relationship between samples is defined using the following equation:
Figure RE-GDA0002370638650000182
a Laplace matrix is then calculated
Figure RE-GDA0002370638650000183
Where D is a diagonal matrix Dii=∑jGij(ii) a Solving the problem of feature decomposition
Figure RE-GDA0002370638650000184
And let V be [ V ]1,v2,L,vC]The feature vectors corresponding to the minimum 2 to C +1 feature values, wherein C is the category number of the speech emotion;
(3) optimization of the following equation Using (4)
Figure RE-GDA0002370638650000185
Wherein P iskq(i,j)=Wk(i,j)*Wq(i,j)
(4) The above formula was optimized using the following cycle
for k=1top
Initialization
Figure RE-GDA0002370638650000186
Is an identity matrix, t is set to be 0 (the diagonal is 1, and the rest is 0), p is the required number of submodels,
Figure RE-GDA0002370638650000187
iteratively optimizing W using the following loopkRepeating:
(4-1) calculation Using the following formula
Figure RE-GDA0002370638650000188
Figure RE-GDA0002370638650000189
Wherein X is training data, I is an identity matrix, α, gamma is three balance parameters, L is obtained by calculation in the step (1-4-1), and V is obtained by calculation in the step (1-4-2).
(4-2) calculation of
Figure RE-GDA00023706386500001810
Is a diagonal matrix, wherein
Figure RE-GDA00023706386500001811
Is calculated by the following formula:
Figure RE-GDA00023706386500001812
(4-3) calculation of
Figure RE-GDA00023706386500001813
Wherein
Figure RE-GDA00023706386500001814
The ith row and jth column of (a) are calculated by:
Figure RE-GDA00023706386500001815
in the formula Pqk(i, ·) the element in row i and column j is calculated by:
Pkq(i,j)=Wk(i,j)*Wq(i,j)
(4-4) t ═ t +1, up to
Figure RE-GDA0002370638650000191
And
Figure RE-GDA0002370638650000192
the difference is less than a predetermined threshold.
In addition, it should be understood that the feature target data is input into the preset training classification submodel to obtain speech emotion category data, the speech emotion category data is subjected to data statistics to obtain a speech emotion category data value, and an integrated speech emotion recognition result is obtained according to the speech emotion category data value.
In addition, it should be noted that, the step of obtaining the integrated speech emotion recognition result according to the speech emotion category data value is to determine whether the speech emotion category data value meets a preset speech emotion category threshold range; if the voice emotion type data value meets the preset voice emotion type threshold range, obtaining an integrated voice emotion recognition result according to the voice emotion type data value, and if the voice emotion type data value does not meet the preset voice emotion type threshold range, returning to the step of inputting the feature target data into the preset training classification submodel to obtain voice emotion type data.
Furthermore, it should be understood that the recognition stage proceeds as follows: the speech signal of an emotion sample to be recognized from a known speaker is processed, and the emotion category of the sample is obtained with the classifier obtained in the training stage. The specific process is as follows:
the first step is as follows: extracting MFCC, LFPC, LPCC, ZCP A, PLP and R-PLP characteristics from a voice signal of an emotion sample to be recognized, wherein the number of Mel filters of the MFCC and the LFPC is 40; the linear prediction orders of the LPCC, the PLP and the R-PLP are respectively 12, 16 and 16; the frequency segmentation of the ZCAP is as follows: 0,106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689, 3089, 3522, 4000. So that the dimensions of each class of features of each statement are respectively: t 39, t 40, t 12, t 16, wherein t is the number of frames of the emotion sentences to be identified, and the number after the multiplication number is the dimension of each frame feature. To obtain the change of the speech signal in the time dimension, a first derivative, a second derivative, is also calculated for the above features in the time dimension. Finally, the dimensionality of each type of feature is respectively as follows: t 117, t 140, t 36, t 48. The speech signal features extracted from the emotion sentences to be recognized are combined by all the features, and the dimension is t (117+140+36+48+48+48).
The second step: the following statistical functions are used: mean, standard deviation, minimum, maximum, kurtosis and skewness, to obtain the feature statistics result x of the emotion sentence to be recognized.
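A minimal sketch of this step: the six statistics applied along the time axis of a (t, d) frame-level feature matrix, using numpy and scipy; the function name is illustrative.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def utterance_statistics(features):
    """Sketch: mean, standard deviation, min, max, kurtosis and skewness per
    feature dimension, concatenated into the statistics vector x."""
    return np.concatenate([
        features.mean(axis=0),
        features.std(axis=0),
        features.min(axis=0),
        features.max(axis=0),
        kurtosis(features, axis=0),
        skew(features, axis=0),
    ])
```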
The third step: speaker normalization. First, the preliminary normalization result x' of x is calculated with formula (1), using the μ and σ obtained in the training stage; the speaker normalization result is then calculated from x' with formula (2).
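Formulas (1) and (2) are referenced but not reproduced in this excerpt; a common reading is a global z-score with the training-stage μ and σ followed by a per-speaker z-score, and the sketch below is written under that assumption.

```python
import numpy as np

def normalize(x, mu, sigma, speaker_mu, speaker_sigma, eps=1e-8):
    """Sketch: preliminary normalization with the training statistics (assumed
    form of formula (1)), then speaker normalization (assumed form of (2))."""
    x_prime = (x - mu) / (sigma + eps)                     # corpus-level z-score
    return (x_prime - speaker_mu) / (speaker_sigma + eps)  # speaker-level z-score
```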
The fourth step: according to the feature selection vectors V obtained in the training process, the feature selection result z is calculated [formula image not reproduced].
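The formula for this step is also only an image in the source; assuming the feature selection result is a linear projection of the normalized statistics vector onto the selection vectors V learned in training, it might look like this one-liner.

```python
def select_features(x_norm, V):
    """Sketch: z = V^T x'' (assumption; the patent's own formula is not shown)."""
    return V.T @ x_norm
```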
The fifth step: the speech emotion category l of z is obtained with the classifier obtained in the training process.
The corpus used to evaluate the emotion recognition performance is the German EMO-DB speech emotion database, a standard database in the field of speech emotion recognition. The training process is completed first and the recognition test is then performed, using 5-fold cross-validation. Seven emotions can be recognized: anger, fear, irritability, disgust, happiness, neutrality and sadness. In the speaker-dependent case the average classification accuracy is 90.84%, and except for one emotion that is more easily confused with anger, the emotions are well distinguished from one another. In the speaker-independent case the average classification accuracy is 86.50%.
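A sketch of the evaluation protocol described here: 5-fold cross-validation over a labelled EMO-DB-style feature set, reporting mean accuracy. Using StratifiedKFold and a single SVC as a stand-in for the full ensemble are simplifying assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def five_fold_accuracy(Z, labels):
    """Sketch: 5-fold cross-validated accuracy on selected features Z
    (one row per utterance) with their emotion labels (numpy arrays)."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in skf.split(Z, labels):
        clf = SVC(kernel="rbf", C=1.0, gamma="scale")
        clf.fit(Z[train_idx], labels[train_idx])
        scores.append(accuracy_score(labels[test_idx], clf.predict(Z[test_idx])))
    return float(np.mean(scores))
```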
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in a specific application, a person skilled in the art may set the technical solution as needed, and the present invention is not limited thereto.
In this embodiment, feature extraction is performed on the speech sample to be recognized to obtain speech signal features of preset dimensionality; feature statistics are computed on the speech signal features with a preset statistical function to obtain a feature statistics result; the feature statistics result is normalized to obtain feature initial data; the feature initial data are screened to obtain feature target data; and the feature target data are input into a preset training classification model to obtain the integrated speech emotion recognition result. In this way, multiple feature subsets that are each sufficient to describe the data are found, the data are used more fully, and the speech emotion recognition result is obtained more accurately.
It should be noted that the above-described work flows are only exemplary, and do not limit the scope of the present invention, and in practical applications, a person skilled in the art may select some or all of them to achieve the purpose of the solution of the embodiment according to actual needs, and the present invention is not limited herein.
In addition, the technical details that are not described in detail in this embodiment may refer to the integrated speech emotion recognition method provided in any embodiment of the present invention, and are not described herein again.
Further, it is to be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention or portions thereof that contribute to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (e.g. Read Only Memory (ROM)/RAM, magnetic disk, optical disk), and includes several instructions for enabling a terminal device (e.g. a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An integrated speech emotion recognition method, characterized in that the method comprises:
performing feature extraction on a voice sample to be recognized to obtain voice signal features with preset dimensionality;
carrying out feature statistics on the voice signal features through a preset statistical function to obtain a feature statistical result;
carrying out normalization processing on the feature statistical result to obtain feature initial data;
screening the characteristic initial data to obtain characteristic target data;
and inputting the characteristic target data into a preset training classification model to obtain an integrated speech emotion recognition result.
2. The method of claim 1, wherein the step of normalizing the feature statistics to obtain feature initial data comprises:
carrying out primary normalization processing on the characteristic statistical result to obtain sample characteristic data;
and carrying out speaker normalization processing on the sample characteristic data to obtain sample processing data, and taking the sample processing data as characteristic initial data.
3. The method of claim 2, wherein the step of screening the characteristic initial data to obtain characteristic target data comprises:
and obtaining label sample processing data through a preset feature selection algorithm according to the sample processing data, and taking the label sample processing data as feature target data.
4. The method of claim 1, wherein before the step of extracting features of the voice sample to be recognized and obtaining the voice signal features of the preset dimension, the method further comprises:
performing feature extraction on the training recognition voice sample to obtain training voice signal features with preset dimensionality;
performing feature statistics on the training voice signal features through a preset statistical function to obtain a training feature statistical result;
carrying out primary normalization processing on the training characteristic statistical result to obtain training sample data;
carrying out speaker normalization processing on the training sample data to obtain training sample processing data;
obtaining label training sample processing data through a preset feature selection algorithm according to the training sample processing data;
obtaining a class label corresponding to the label training sample processing data according to the label training sample processing data;
and establishing a preset training classification model according to the label training sample processing data and the class label.
5. The method of claim 4, wherein the pre-set training classification model comprises a plurality of pre-set training classification submodels;
the step of inputting the feature target data into a preset training classification model to obtain an integrated speech emotion recognition result comprises the following steps:
inputting the characteristic target data into the preset training classification submodel to obtain voice emotion category data;
performing data statistics on the voice emotion type data to obtain a voice emotion type data value;
and acquiring an integrated voice emotion recognition result according to the voice emotion category data value.
6. The method of claim 5, wherein the step of obtaining integrated speech emotion recognition results according to the speech emotion classification data value comprises:
judging whether the voice emotion type data value belongs to a preset voice emotion type threshold range or not;
and if the voice emotion category data value belongs to the preset voice emotion category threshold range, acquiring an integrated voice emotion recognition result according to the voice emotion category data value.
7. The method of claim 6, wherein the step of determining whether the speech emotion classification data value falls within a preset speech emotion classification threshold range further comprises:
and if the voice emotion category data value does not belong to the preset voice emotion category threshold range, returning to the step of inputting the characteristic target data into the preset training classification submodel to obtain voice emotion category data.
8. An integrated speech emotion recognition apparatus, characterized in that the apparatus comprises:
the acquisition module is used for extracting the characteristics of the voice sample to be recognized to acquire the voice signal characteristics with preset dimensionality;
the statistical module is used for carrying out feature statistics on the voice signal features through a preset statistical function to obtain a feature statistical result;
the processing module is used for carrying out normalization processing on the characteristic statistical result to obtain characteristic initial data;
the screening module is used for screening the characteristic initial data to obtain characteristic target data;
and the determining module is used for inputting the characteristic target data into a preset training classification model to obtain an integrated speech emotion recognition result.
9. An electronic device, characterized in that the device comprises: a memory, a processor and an integrated speech emotion recognition program stored on the memory and executable on the processor, the integrated speech emotion recognition program being configured to implement the steps of the integrated speech emotion recognition method as claimed in any of claims 1 to 7.
10. A storage medium having stored thereon an integrated speech emotion recognition program, which when executed by a processor implements the steps of the integrated speech emotion recognition method as claimed in any one of claims 1 to 7.
CN201911246545.XA 2019-12-06 2019-12-06 Integrated speech emotion recognition method, device, equipment and storage medium Pending CN110931043A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911246545.XA CN110931043A (en) 2019-12-06 2019-12-06 Integrated speech emotion recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110931043A true CN110931043A (en) 2020-03-27

Family

ID=69858247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911246545.XA Pending CN110931043A (en) 2019-12-06 2019-12-06 Integrated speech emotion recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110931043A (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008754A (en) * 2014-05-21 2014-08-27 华南理工大学 Speech emotion recognition method based on semi-supervised feature selection

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950644A (en) * 2020-08-18 2020-11-17 东软睿驰汽车技术(沈阳)有限公司 Model training sample selection method and device and computer equipment
CN111950644B (en) * 2020-08-18 2024-03-26 东软睿驰汽车技术(沈阳)有限公司 Training sample selection method and device for model and computer equipment
CN112530409A (en) * 2020-12-01 2021-03-19 平安科技(深圳)有限公司 Voice sample screening method and device based on geometry and computer equipment
WO2022116442A1 (en) * 2020-12-01 2022-06-09 平安科技(深圳)有限公司 Speech sample screening method and apparatus based on geometry, and computer device and storage medium
CN112530409B (en) * 2020-12-01 2024-01-23 平安科技(深圳)有限公司 Speech sample screening method and device based on geometry and computer equipment
CN113889149A (en) * 2021-10-15 2022-01-04 北京工业大学 Speech emotion recognition method and device
CN113889149B (en) * 2021-10-15 2023-08-29 北京工业大学 Speech emotion recognition method and device

Similar Documents

Publication Publication Date Title
CN111179975B (en) Voice endpoint detection method for emotion recognition, electronic device and storage medium
Xia et al. A multi-task learning framework for emotion recognition using 2D continuous space
CN112818861B (en) Emotion classification method and system based on multi-mode context semantic features
CN109767765A (en) Talk about art matching process and device, storage medium, computer equipment
CN110931043A (en) Integrated speech emotion recognition method, device, equipment and storage medium
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
Seng et al. Video analytics for customer emotion and satisfaction at contact centers
CN110168535A (en) A kind of information processing method and terminal, computer storage medium
US20230206928A1 (en) Audio processing method and apparatus
CN107767881B (en) Method and device for acquiring satisfaction degree of voice information
Noroozi et al. Supervised vocal-based emotion recognition using multiclass support vector machine, random forests, and adaboost
CN110598869B (en) Classification method and device based on sequence model and electronic equipment
CN108509416A (en) Sentence realizes other method and device, equipment and storage medium
CN110827797A (en) Voice response event classification processing method and device
CN111784372A (en) Store commodity recommendation method and device
Qi et al. Exploiting low-rank tensor-train deep neural networks based on Riemannian gradient descent with illustrations of speech processing
CN110956981B (en) Speech emotion recognition method, device, equipment and storage medium
Liu et al. Learning salient features for speech emotion recognition using CNN
CN113053395A (en) Pronunciation error correction learning method and device, storage medium and electronic equipment
CN113111855B (en) Multi-mode emotion recognition method and device, electronic equipment and storage medium
CN108831487A (en) Method for recognizing sound-groove, electronic device and computer readable storage medium
CN110942358A (en) Information interaction method, device, equipment and medium
CN112633381B (en) Audio recognition method and training method of audio recognition model
CN114765028A (en) Voiceprint recognition method and device, terminal equipment and computer readable storage medium
CN113421573A (en) Identity recognition model training method, identity recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200327)