CN110931043A - Integrated speech emotion recognition method, device, equipment and storage medium - Google Patents
- Publication number
- CN110931043A (application number CN201911246545.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- feature
- preset
- training
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
Abstract
The invention belongs to the technical field of speech signal processing and pattern recognition, and discloses an integrated speech emotion recognition method, device, equipment and storage medium. The method comprises the following steps: performing feature extraction on a speech sample to be recognized to obtain speech signal features of a preset dimensionality; computing feature statistics on the speech signal features through preset statistical functions to obtain a feature statistical result; normalizing the feature statistical result to obtain feature initial data; screening the feature initial data to obtain feature target data; and inputting the feature target data into a preset training classification model to obtain an integrated speech emotion recognition result. This approach prevents the feature selection result from over-fitting the training data, and thereby improves speech emotion recognition accuracy.
Description
Technical Field
The invention relates to the technical field of speech signal processing and pattern recognition, and in particular to an integrated speech emotion recognition method, device, equipment and storage medium.
Background
The purpose of speech emotion recognition is to enable a computer to detect human emotional states from speech signals, so that machines can understand human feelings and offer more humanized and sophisticated functions. Because speech is the dominant mode of human communication and speech signals are easy to capture, speech emotion recognition can be applied in a great many fields. (1) Recommendation systems: a recommendation that understands the user's mood feels less like an advertisement and more like genuine understanding, which naturally improves recommendation effectiveness. (2) Telephone customer-service emotion management: the quality of customer service can be improved, and psychological strain on customer-service staff can be reduced. (3) Personal health monitoring: a person can be warned against remaining in a negative emotional state for a long time. The technology is also valuable in smart homes, distance education, game feedback, emotion therapy, and other fields. If speech emotion recognition matures, portable devices will be able to understand the user's emotional state at any time and tailor services accordingly. This would have a disruptive effect on the entire Internet service model and reshape everyday devices; after all, no one wants to face a cold, unfeeling machine. It is also significant for improving the competitiveness of China's IT industries such as e-commerce, social software, smart televisions, mobile phones, and robotics.
At present, many speech emotion recognition methods exist, and ensemble learning is an effective way to improve recognition performance. Common ensemble-learning-based approaches include: (1) manually defined hierarchical integration frameworks; (2) manually specified feature submodels; (3) combinations of multiple classifier types; (4) random submodels and other generic ensemble learning methods. Methods (1) and (2) require substantial manual effort, and the resulting models generalize poorly. Method (3) mostly operates in the original feature space and may suffer from the curse of dimensionality. In method (4), the diversity and classification ability of the feature submodels are difficult to guarantee.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide an integrated speech emotion recognition method, device, equipment and storage medium, and aims to solve the technical problem of low speech emotion recognition accuracy.
In order to achieve the above object, the present invention provides an integrated speech emotion recognition method, which comprises the following steps:
performing feature extraction on a voice sample to be recognized to obtain voice signal features with preset dimensionality;
carrying out feature statistics on the voice signal features through a preset statistical function to obtain a feature statistical result;
carrying out normalization processing on the feature statistical result to obtain feature initial data;
screening the characteristic initial data to obtain characteristic target data;
and inputting the characteristic target data into a preset training classification model to obtain an integrated speech emotion recognition result.
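The five steps above can be sketched as a minimal end-to-end pipeline. The function and variable names below are illustrative assumptions (the patent does not specify an implementation), the preliminary z-score stands in for the two-stage normalization, and the submodels are any trained classifiers with a `predict` method:

```python
import numpy as np
from scipy.stats import kurtosis, skew

def recognize(sample_features, selector, submodels):
    """Illustrative sketch of the five claimed steps (names are assumptions).

    sample_features: 2D array (feature_dim, n_frames) from feature extraction.
    selector: index array produced by the feature-screening step.
    submodels: trained classifier submodels, each with a .predict method.
    """
    # Step 2: statistical functionals over the time dimension
    stats = np.concatenate([
        sample_features.mean(axis=1),
        sample_features.std(axis=1),
        sample_features.min(axis=1),
        sample_features.max(axis=1),
        kurtosis(sample_features, axis=1),
        skew(sample_features, axis=1),
    ])
    # Step 3: normalization (preliminary z-score shown; speaker step omitted)
    stats = (stats - stats.mean()) / (stats.std() + 1e-8)
    # Step 4: screening -> feature target data
    target = stats[selector]
    # Step 5: majority vote over the ensemble of trained submodels
    votes = [m.predict(target[None, :])[0] for m in submodels]
    return max(set(votes), key=votes.count)
```

The ensemble decision here is a plain majority vote; the threshold-checked variant described later in the text would wrap this final step.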
Preferably, the step of performing normalization processing on the feature statistical result to obtain feature initial data includes:
carrying out primary normalization processing on the characteristic statistical result to obtain sample characteristic data;
and carrying out speaker normalization processing on the sample characteristic data to obtain sample processing data, and taking the sample processing data as characteristic initial data.
Preferably, the step of screening the characteristic initial data to obtain characteristic target data includes:
and according to the sample processing data, obtaining labeled sample processing data through a preset feature selection algorithm, and taking the labeled sample processing data as the feature target data.
Preferably, before the step of performing feature extraction on the voice sample to be recognized to obtain the voice signal features of the preset dimension, the method further includes:
performing feature extraction on the training recognition voice sample to obtain training voice signal features with preset dimensionality;
performing feature statistics on the training voice signal features through a preset statistical function to obtain a training feature statistical result;
carrying out primary normalization processing on the training characteristic statistical result to obtain training sample data;
carrying out speaker normalization processing on the training sample data to obtain training sample processing data;
obtaining label training sample processing data through a preset feature selection algorithm according to the training sample processing data;
obtaining a class label corresponding to the label training sample processing data according to the label training sample processing data;
and establishing a preset training classification model according to the label training sample processing data and the class label.
Preferably, the preset training classification model comprises a plurality of preset training classification submodels;
the step of inputting the feature target data into a preset training classification model to obtain an integrated speech emotion recognition result comprises the following steps:
inputting the characteristic target data into the preset training classification submodel to obtain voice emotion category data;
performing data statistics on the voice emotion type data to obtain a voice emotion type data value;
and acquiring an integrated voice emotion recognition result according to the voice emotion category data value.
Preferably, the step of obtaining an integrated speech emotion recognition result according to the speech emotion category data value includes:
judging whether the voice emotion type data value belongs to a preset voice emotion type threshold range or not;
and if the voice emotion category data value belongs to the preset voice emotion category threshold range, acquiring an integrated voice emotion recognition result according to the voice emotion category data value.
Preferably, after the step of determining whether the speech emotion category data value belongs to a preset speech emotion category threshold range, the method further includes:
and if the voice emotion category data value does not belong to the preset voice emotion category threshold range, returning to the step of inputting the characteristic target data into the preset training classification submodel to obtain voice emotion category data.
In addition, to achieve the above object, the present invention further provides an integrated speech emotion recognition apparatus, including:
the acquisition module is used for extracting the characteristics of the voice sample to be recognized to acquire the voice signal characteristics with preset dimensionality;
the statistical module is used for carrying out feature statistics on the voice signal features through a preset statistical function to obtain a feature statistical result;
the processing module is used for carrying out normalization processing on the characteristic statistical result to obtain characteristic initial data;
the screening module is used for screening the characteristic initial data to obtain characteristic target data;
and the determining module is used for inputting the characteristic target data into a preset training classification model to obtain an integrated speech emotion recognition result.
In addition, to achieve the above object, the present invention also provides an electronic device, including: a memory, a processor and an integrated speech emotion recognition program stored on the memory and executable on the processor, the integrated speech emotion recognition program configured to implement the steps of the integrated speech emotion recognition method as described in any of the above.
In addition, to achieve the above object, the present invention further provides a storage medium, on which an integrated speech emotion recognition program is stored, and the integrated speech emotion recognition program, when executed by a processor, implements the steps of the integrated speech emotion recognition method as described in any one of the above.
The invention performs feature extraction on a speech sample to be recognized to obtain speech signal features of a preset dimensionality, and computes feature statistics on those features through preset statistical functions to obtain a feature statistical result. Preliminary normalization and speaker normalization are applied to the feature statistical result to obtain feature initial data, which is then screened to obtain feature target data. The feature target data is input into preset training classification submodels to obtain speech emotion category data; data statistics over the category data yield a speech emotion category data value, from which the integrated speech emotion recognition result is finally obtained. This prevents the feature selection result from over-fitting the training data and selects features that help recognize the speaker's emotion, so the diversity and classification ability of the feature submodels are improved, and with them the performance of the integrated classifier.
Drawings
Fig. 1 is a schematic structural diagram of an electronic device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of an integrated speech emotion recognition method according to the present invention;
FIG. 3 is a flowchart illustrating a second embodiment of the integrated speech emotion recognition method according to the present invention;
FIG. 4 is a block diagram of the integrated speech emotion recognition device according to the first embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an electronic device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the electronic device may include: a processor 1001 such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM) or a Non-Volatile Memory (NVM) such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration shown in fig. 1 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a storage medium, may include an operating system, a network communication module, a user interface module, and an integrated speech emotion recognition program therein.
In the electronic apparatus shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 of the electronic device according to the present invention may be disposed in the electronic device, and the electronic device calls the integrated speech emotion recognition program stored in the memory 1005 through the processor 1001 and executes the integrated speech emotion recognition method according to the embodiment of the present invention.
An embodiment of the present invention provides an integrated speech emotion recognition method, and referring to fig. 2, fig. 2 is a schematic flowchart of a first embodiment of the integrated speech emotion recognition method according to the present invention.
In this embodiment, the integrated speech emotion recognition method includes the following steps:
step S10: and performing feature extraction on the voice sample to be recognized to obtain voice signal features with preset dimensionality.
In addition, before the step of performing feature extraction on a voice sample to be recognized to obtain voice signal features of a preset dimension, the preset training classification model is built as follows. Feature extraction is performed on training recognition voice samples to obtain training voice signal features of the preset dimensionality, and feature statistics are computed on those features through preset statistical functions to obtain a training feature statistical result. Preliminary normalization is applied to the training feature statistical result to obtain labeled training sample data and unlabeled training sample data, and speaker normalization is then applied to both to obtain labeled and unlabeled training sample processing data. From the labeled and unlabeled training sample processing data, labeled training sample selection data is obtained through a semi-supervised feature selection algorithm; the category label corresponding to each labeled training sample is obtained from the labeled training sample selection data; and the preset training classification model is established from the labeled training sample selection data and the category labels.
It should be noted that the speech signal features extracted from the speech sample to be recognized include: Mel Frequency Cepstral Coefficients (MFCC), Log Frequency Power Coefficients (LFPC), Linear Predictive Cepstral Coefficients (LPCC), Zero Crossings with Peak Amplitudes (ZCPA), Perceptual Linear Prediction (PLP), and RASTA-filtered Perceptual Linear Prediction (R-PLP).
It should be understood that the extraction result for each class of feature is a two-dimensional matrix, one dimension of which is the time dimension. For each class of feature $F_i$, its first derivative $\Delta F_i$ and second derivative $\Delta\Delta F_i$ in the time dimension are then calculated, and the original features, the first-derivative result, and the second-derivative result are concatenated in the non-time dimension to form the final extraction result for that feature class. The final extraction results of all feature classes are then concatenated in the non-time dimension to obtain the feature extraction result of the sample.
Further, for ease of understanding, an example: suppose the MFCC features are $F_{MFCC} \in R^{39 \times z}$, with $\Delta F_{MFCC} \in R^{39 \times z}$ and $\Delta\Delta F_{MFCC} \in R^{39 \times z}$, where $z$ is the number of frames, i.e. the size of the time dimension; the concatenation in the non-time dimension is then $[F_{MFCC}; \Delta F_{MFCC}; \Delta\Delta F_{MFCC}] \in R^{117 \times z}$.
Furthermore, it should be understood that at each speech signal feature extraction, the MFCC, LFPC, LPCC, ZCPA, PLP, and R-PLP features are extracted, where the number of Mel filters for MFCC and LFPC is 40; the linear prediction orders of LPCC, PLP, and R-PLP are 12, 16, and 16 respectively; and the frequency segmentation of ZCPA is: 0, 106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689, 3089, 3522, 4000. The dimensions of each class of features of each sentence are therefore $t_i \times 39$, $t_i \times 40$, $t_i \times 12$, and $t_i \times 16$, where $t_i$ is the number of frames in the i-th sentence and the number after the multiplication sign is the per-frame feature dimension. To capture the change of the speech signal in the time dimension, first and second derivatives of the above features are also calculated along the time dimension. The final dimensions of each class of features are then $t_i \times 117$, $t_i \times 140$, $t_i \times 36$, and $t_i \times 48$. The extracted speech signal features of the i-th sample combine all the features above and have dimension $t_i \times (117+140+36+48+48+48)$.
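The derivative-and-concatenation scheme above can be sketched with numpy. This is a hedged sketch: `np.gradient` is one possible discrete time derivative (the patent does not specify the exact delta computation), and the function names are illustrative:

```python
import numpy as np

def add_deltas(F):
    """Append first and second time-derivatives to a (dim, n_frames) feature
    matrix, tripling the non-time dimension (e.g. 39 -> 117 for MFCC)."""
    dF = np.gradient(F, axis=1)      # first derivative along the time axis
    ddF = np.gradient(dF, axis=1)    # second derivative along the time axis
    return np.vstack([F, dF, ddF])

def combine(features):
    """Concatenate the per-class extraction results in the non-time dimension."""
    return np.vstack([add_deltas(F) for F in features])
```

For a 10-frame utterance, a (39, 10) MFCC matrix becomes (117, 10), matching the dimensionality bookkeeping in the text.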
Step S20: and carrying out feature statistics on the voice signal features through a preset statistical function to obtain a feature statistical result.
The statistical results of the above features in the time dimension are obtained using the following statistical functions: mean, standard deviation, minimum (min), maximum (max), kurtosis, and skewness.
Step S30: and carrying out normalization processing on the characteristic statistical result to obtain characteristic initial data.
It should be understood that the feature statistics are subjected to a preliminary normalization process to obtain sample feature data, the sample feature data are subjected to a speaker normalization process to obtain sample processing data, and the sample processing data are used as feature initial data.
In addition, the feature statistical results of the known labeled samples from the above steps are denoted $\{x_1, x_2, \ldots, x_n\}$, and the feature statistical results of the unlabeled training samples of a given speaker are denoted $\{x_{n+1}, x_{n+2}, \ldots, x_{n+m}\}$. All feature statistics $\{x_1, x_2, \ldots, x_{n+m}\}$ are preliminarily normalized using formula (1): $x'_i = (x_i - \mu)/\sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation over all samples.
Thereafter, the preliminary normalization results $\{x'_1, x'_2, \ldots, x'_{n+m}\}$ are speaker-normalized using formula (2): $x''_i = x'_i - \frac{1}{n_i}\sum_{j=1}^{n_i} x'_j$, where $x'_j, j = 1, 2, \ldots, n_i$ are the training samples with the same speaker label as $x'_i$, and $n_i$ is the number of such samples.
Furthermore, it should be understood that a plurality of feature subsets are obtained through a unified feature selection framework, and the data description capability and classification capability of each feature subset can be ensured. The base classifier trained on the basis has better diversity and classification strength. The number of the base classifiers can be obviously reduced, and the classification capability of the base classifiers can be improved.
Meanwhile, the feature statistical results are normalized by using an improved normalization algorithm. The normalization algorithm comprises two steps of initial normalization and speaker normalization, wherein the initial normalization uses the mean value and the variance of all samples to normalize each sample, and can avoid the influence caused by different characteristic value ranges; the speaker normalization only needs to use the mean value of all samples of the speaker, and the mean value estimation can obtain higher confidence coefficient when the number of the samples is less, so that a better speaker normalization effect can be achieved under the condition that the number of unlabeled samples of the speaker is less.
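The two-step normalization described above can be sketched as follows. The exact formula images are not reproduced in the source text, so this sketch assumes a standard z-score for the preliminary step and per-speaker mean subtraction for the speaker step, which is what the surrounding description states:

```python
import numpy as np

def preliminary_normalize(X, mu=None, sigma=None):
    """Z-score each feature using the mean/std of all training samples
    (formula (1) in the text). X: (n_samples, n_features)."""
    if mu is None:
        mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / (sigma + 1e-8), mu, sigma

def speaker_normalize(Xp, speaker_ids):
    """Subtract each speaker's mean from that speaker's samples
    (formula (2)); only the per-speaker mean is needed, so few
    unlabeled samples per speaker suffice, as the text notes."""
    out = Xp.copy()
    for s in np.unique(speaker_ids):
        mask = speaker_ids == s
        out[mask] -= Xp[mask].mean(axis=0)
    return out
```

At recognition time the stored training-stage `mu` and `sigma` would be passed in rather than re-estimated.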
Step S40: and screening the characteristic initial data to obtain characteristic target data.
It should be noted that, according to the sample processing data, the labeled and unlabeled sample processing data are passed through the preset feature selection method, and the resulting labeled sample target data is used as the feature target data.
Step S50: and inputting the characteristic target data into a preset training classification model to obtain an integrated speech emotion recognition result.
It should be noted that the preset training classification model includes a plurality of preset training classification submodels, and each preset classification submodel is a support vector machine.
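An ensemble of SVM submodels, one per selected feature subset, can be sketched with scikit-learn. This is an assumption about tooling (the patent names no library), and the RBF kernel is an illustrative default:

```python
import numpy as np
from sklearn.svm import SVC

def train_submodels(X, y, subsets):
    """Train one SVM per feature subset; `subsets` is a list of column
    index arrays produced by the feature-selection stage."""
    return [SVC(kernel="rbf").fit(X[:, s], y) for s in subsets]

def predict_submodels(models, subsets, x):
    """Collect one emotion-category vote per submodel for sample x."""
    return [m.predict(x[s][None, :])[0] for m, s in zip(models, subsets)]
```

Each submodel sees only its own feature subset, which is what gives the ensemble its diversity.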
Furthermore, it should be understood that the above steps of obtaining the preset training classification submodel are:
(1) Define a matrix L describing the local geometry of the samples:

L = (I − S)^T (I − S)

where $I \in R^{n \times n}$ is the identity matrix (diagonal elements 1, all others 0), and S is a reconstruction weight matrix obtained by a separate optimization (the formula appears as an image in the original).

(2) The relationship between samples is defined by a similarity matrix G (formula given as an image in the original). A Laplacian matrix $\tilde{L} = D - G$ is then calculated, where D is the diagonal matrix with $D_{ii} = \sum_j G_{ij}$. Solving the eigendecomposition problem of $\tilde{L}$, let $V = [v_1, v_2, \ldots, v_C]$ be the eigenvectors corresponding to the 2nd through (C+1)-th smallest eigenvalues, where C is the number of speech emotion categories.
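The Laplacian construction just described can be sketched in a few lines of numpy. The unnormalized form $D - G$ is an assumption consistent with the definition $D_{ii} = \sum_j G_{ij}$ above (the original equation is an image and is not reproduced in the text):

```python
import numpy as np

def laplacian_eigvecs(G, C):
    """Build the graph Laplacian D - G from a symmetric similarity matrix G
    and return the eigenvectors for the 2nd..(C+1)-th smallest eigenvalues,
    skipping the first (trivial) eigenvector, as described above."""
    D = np.diag(G.sum(axis=1))
    Lap = D - G
    vals, vecs = np.linalg.eigh(Lap)   # eigenvalues in ascending order
    return vecs[:, 1:C + 1]            # V = [v_1, ..., v_C]
```

`np.linalg.eigh` is appropriate here because the Laplacian of a symmetric similarity matrix is itself symmetric.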
(3) Optimize the following objective (the formula appears as an image in the original), where $P_{kq}(i,j) = W_k(i,j) \cdot W_q(i,j)$.

(4) The objective above is optimized using the following loop:

for k = 1 to p:
    initialize $W_k^{(0)}$ as the identity matrix (diagonal 1, the rest 0) and set t = 0, where p is the required number of submodels;
    iteratively optimize $W_k$ with the update formula (given as an image in the original), repeating until convergence.

Here X is the training data, I is the identity matrix, α, β, γ are the three balance parameters, L is calculated by formula (1), and V is calculated by formula (2). In the update formula, the element in row i and column j of $P_{kq}$ is calculated by $P_{kq}(i,j) = W_k(i,j) \cdot W_q(i,j)$.
In addition, it should be understood that the feature target data is input into the preset training classification submodel to obtain speech emotion category data, the speech emotion category data is subjected to data statistics to obtain a speech emotion category data value, and an integrated speech emotion recognition result is obtained according to the speech emotion category data value.
In addition, it should be noted that obtaining the integrated speech emotion recognition result from the speech emotion category data value proceeds as follows: determine whether the speech emotion category data value falls within the preset speech emotion category threshold range. If it does, the integrated speech emotion recognition result is obtained from the value; if it does not, the process returns to the step of inputting the feature target data into the preset training classification submodels to obtain speech emotion category data.
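The threshold-checked voting step above can be sketched as follows; the threshold value and the convention of returning `None` to signal "re-run the submodel step" are illustrative assumptions:

```python
from collections import Counter

def vote_with_threshold(votes, threshold):
    """Return the majority emotion category if its vote count reaches the
    preset threshold, else None (the caller then re-runs the submodel step,
    as described in the text)."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= threshold else None
```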
Furthermore, it should be understood that the above is an identification phase, the steps of which are:
in this stage, the speech signal of the emotion sample to be recognized of the known speaker is processed, and the emotion classification of the emotion sample to be recognized is obtained according to the training classifier obtained in the training stage. The specific process is as follows:
the first step is as follows: extracting MFCC, LFPC, LPCC, ZCP A, PLP and R-PLP characteristics from a voice signal of an emotion sample to be recognized, wherein the number of Mel filters of the MFCC and the LFPC is 40; the linear prediction orders of the LPCC, the PLP and the R-PLP are respectively 12, 16 and 16; the frequency segmentation of the ZCAP is as follows: 0,106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689, 3089, 3522, 4000. So that the dimensions of each class of features of each statement are respectively: t 39, t 40, t 12, t 16, wherein t is the number of frames of the emotion sentences to be identified, and the number after the multiplication number is the dimension of each frame feature. To obtain the change of the speech signal in the time dimension, a first derivative, a second derivative, is also calculated for the above features in the time dimension. Finally, the dimensionality of each type of feature is respectively as follows: t 117, t 140, t 36, t 48. The speech signal features extracted from the emotion sentences to be recognized are combined by all the features, and the dimension is t (117+140+36+48+48+48).
The second step: obtain the feature statistical result x of the emotion sentence to be recognized using the following statistical functions: mean, standard deviation, minimum, maximum, kurtosis, and skewness.
The third step: speaker normalization. First, calculate the preliminary normalization result x' of x using formula (1) with the μ and σ obtained in the training stage; then calculate the speaker normalization result of x' using formula (2).
The fourth step: using the feature selection vector V obtained in the training process, calculate the feature selection result z.
The fifth step: obtain the speech emotion class l of z using the classifier obtained in the training process.
The corpus used to evaluate the emotion recognition effect is the German EMO-DB speech emotion database, a standard database in the field of speech emotion recognition. The training process is completed first, and then the recognition test is performed in 5-fold cross-validation fashion. Seven emotions can be recognized: anger, fear, irritability, disgust, happiness, neutrality and sadness. The average classification accuracy is 90.84% in the speaker-dependent case; apart from happiness and anger being relatively easy to confuse with each other, the remaining emotions are well separated. The average classification accuracy is 86.50% in the speaker-independent case.
In this embodiment, feature extraction is performed on a voice sample to be recognized to obtain voice signal features of a preset dimension; feature statistics are computed on the voice signal features through a preset statistical function to obtain a feature statistical result; the feature statistical result is normalized to obtain feature initial data; the feature initial data are screened to obtain feature target data; and the feature target data are input into a preset training classification model to obtain an integrated voice emotion recognition result. In this way, a plurality of feature subsets with sufficient ability to describe the data are found, the data are used more fully, and the voice emotion recognition result is obtained more accurately.
Referring to fig. 3, fig. 3 is a flowchart illustrating a method for integrated speech emotion recognition according to a second embodiment of the present invention.
Based on the first embodiment, before the step S10, the integrated speech emotion recognition method of this embodiment further includes:
step S000: and extracting the characteristics of the training recognition voice sample to obtain the training voice signal characteristics with preset dimensionality.
Step S001: and carrying out feature statistics on the training voice signal features through a preset statistical function to obtain a training feature statistical result.
Step S002: and carrying out primary normalization processing on the training characteristic statistical result to obtain training sample data.
Step S003: and carrying out speaker normalization processing on the training sample data to obtain training sample processing data.
Step S004: and obtaining label training sample processing data through a preset feature selection algorithm according to the training sample processing data.
Step S005: and obtaining the class label corresponding to the label training sample processing data according to the label training sample processing data.
Step S006: and establishing a preset training classification model according to the label training sample processing data and the class label.
Further, it should be understood that the training phase proceeds as follows: (1-1) for each speaker, the features of the labeled training samples and the features of that speaker's unlabeled samples are extracted; (1-2) feature statistics are computed on all the features; (1-3) the normalization algorithm is applied to the feature statistical results; (1-4) a plurality of feature submodels are selected using a unified feature selection framework; (1-5) a support vector machine is trained for each feature submodel; (1-6) the classification result is obtained by voting over the results of all the support vector machines.
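Steps (1-4) to (1-6) can be sketched as an ensemble trained on feature subsets. To keep the sketch self-contained, a nearest-centroid classifier stands in for the support vector machine of step (1-5); all function names and data are hypothetical.

```python
import numpy as np
from collections import Counter

def train_base(Z, labels):
    # Nearest-centroid stand-in for the per-subset SVM of step (1-5)
    return {c: Z[labels == c].mean(axis=0) for c in np.unique(labels)}

def predict_base(model, z):
    # Assign the label whose centroid is closest in Euclidean distance
    return min(model, key=lambda c: np.linalg.norm(z - model[c]))

def train_ensemble(X, labels, feature_subsets):
    # Steps (1-4)/(1-5): one base classifier per selected feature subset
    return [(V, train_base(X[:, V], labels)) for V in feature_subsets]

def predict_ensemble(ensemble, x):
    # Step (1-6): majority vote over all base classifiers
    votes = [predict_base(model, x[V]) for V, model in ensemble]
    return Counter(votes).most_common(1)[0][0]

# Toy data: two well-separated classes, two feature subsets
X = np.array([[0., 0., 0., 0.], [0.2, 0., 0.1, 0.],
              [5., 5., 5., 5.], [5., 4.9, 5., 5.1]])
labels = np.array([0, 0, 1, 1])
ens = train_ensemble(X, labels, [np.array([0, 1]), np.array([2, 3])])
print(predict_ensemble(ens, np.array([4.8, 5.0, 5.1, 5.0])))  # 1
```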
Furthermore, for ease of understanding, the specific steps of the training phase are as follows:
in this stage, training is performed for each speaker separately to obtain a classifier corresponding to each speaker. The specific process is as follows:
the first step is as follows: extract MFCC, LFPC, LPCC, ZCPA, PLP and R-PLP features from all voice training signals (all labeled-sample voice signals plus the unlabeled-sample voice signals of a certain speaker in each training run), where the number of Mel filters of the MFCC and the LFPC is 40; the linear prediction orders of the LPCC, the PLP and the R-PLP are 12, 16 and 16, respectively; and the frequency segmentation of the ZCPA is: 0, 106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689, 3089, 3522, 4000. The dimensions of each class of features of each sentence are therefore ti×39, ti×40, ti×12 and ti×16, respectively, where ti is the number of frames in the i-th sentence and the number after the multiplication sign is the per-frame feature dimension. To capture the change of the speech signal in the time dimension, first and second derivatives are also calculated for the above features along the time dimension. Finally, the dimensions of each class of features are ti×117, ti×140, ti×36 and ti×48, respectively. The speech signal features extracted from the i-th sample are the combination of all the above features, with dimension ti×(117+140+36+48+48+48).
The second step is as follows: use the following statistical functions: mean, standard deviation, minimum (min), maximum (max), kurtosis and skewness, to obtain statistics of the above features in the time dimension. The feature statistical results of the labeled samples are denoted {x1, x2, …, xn}, and the feature statistical results of the unlabeled training samples of a certain speaker are denoted {xn+1, xn+2, …, xn+m}, where n is the number of labeled samples and m is the number of unlabeled samples of the speaker.
The third step: and normalizing the characteristic statistical result. The method comprises the following steps:
(1) For all the feature statistical results {x1, x2, …, xn+m} obtained in the second step, perform preliminary normalization using the following formula:
(2) For the preliminary normalization results {x'1, x'2, …, x'n+m}, perform speaker normalization using the following formula:
where x'j, j = 1, 2, …, ni, are the training samples carrying the same speaker label as x'i, and ni is the number of training samples with the same speaker label as x'i.
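The images containing formulas (1) and (2) are not reproduced in this text, so the following is one plausible reading of the two-step normalization described above: a global z-score over all samples followed by subtraction of the per-speaker mean. All names are illustrative.

```python
import numpy as np

def preliminary_normalize(X):
    """Formula (1), read as a z-score using the mean and standard
    deviation of ALL samples (rows of X)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0) + 1e-12  # guard against constant features
    return (X - mu) / sigma, mu, sigma

def speaker_normalize(X_prime, speaker_ids):
    """Formula (2), read as subtracting, for each sample, the mean of
    all samples sharing its speaker label."""
    X_out = np.empty_like(X_prime)
    for s in np.unique(speaker_ids):
        mask = speaker_ids == s
        X_out[mask] = X_prime[mask] - X_prime[mask].mean(axis=0)
    return X_out

X = np.array([[1., 2.], [3., 4.], [5., 6.], [7., 8.]])
Xp, mu, sigma = preliminary_normalize(X)
Xs = speaker_normalize(Xp, np.array([0, 0, 1, 1]))
print(np.allclose(Xs[:2].mean(axis=0), 0.0))  # True
```

This reading matches the stated rationale: the speaker step needs only the per-speaker mean, which can be estimated from few samples.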
The fourth step: train a semi-supervised feature selection algorithm (the preset feature selection method mentioned above). The algorithm comprises the following steps:
(1) the relationship between samples is defined using the following equation:
where Sij represents the relationship between samples xi and xj, nli is the number of samples with class label li, li and lj are the class labels of the samples, Nk(xi) is the k-neighborhood of sample xi, Nk(xj) is the k-neighborhood of sample xj, and Aij is defined as follows:
where d(xi, xj) represents the Euclidean distance between xi and xj, d(xi, xi(k)) represents the Euclidean distance from xi to its k-th neighbor xi(k), and d(xj, xj(k)) represents the Euclidean distance from xj to its k-th neighbor xj(k).
(2) Calculate the graph Laplacian L = D - S, where D is a diagonal matrix with Dii = Σj Sij.
(3) Solve the eigendecomposition problem of L, and let Y = [y1, y2, …, yC] be the eigenvectors corresponding to the 2nd to (C+1)-th smallest eigenvalues, where C is the number of speech emotion categories.
(4) Solve the L1-regularized regression problem using the Least Angle Regression algorithm (LARs), obtaining C sparse coefficient vectors, where yc is the c-th eigenvector found in (1-4-3).
(5) Calculate an importance score for each feature, where j denotes the j-th feature and score(j) denotes the score of the j-th feature.
(6) Return the indices of the d features with the largest scores as the feature selection result V, where d is the dimension of the features to be selected.
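Steps (2) to (6) can be sketched as follows. The patent's exact formulas are in images not reproduced here, so ordinary least squares stands in for the LARs L1 regression of step (4), and score(j) is taken as the maximum absolute coefficient across the C regressions (a common choice in multi-cluster feature selection); the function name and similarity matrix are assumptions.

```python
import numpy as np

def semi_supervised_feature_scores(X, S, C, d):
    """Sketch of steps (2)-(6).

    X: (N, D) normalized sample matrix; S: (N, N) symmetric similarity
    matrix from step (1); C: number of emotion classes; d: number of
    features to keep.  Returns the indices of the top-d features.
    """
    D_mat = np.diag(S.sum(axis=1))              # degree matrix, Dii = sum_j Sij
    L = D_mat - S                               # graph Laplacian of step (2)
    eigval, eigvec = np.linalg.eigh(L)          # eigenvalues in ascending order
    Y = eigvec[:, 1:C + 1]                      # 2nd..(C+1)-th smallest, step (3)
    A, *_ = np.linalg.lstsq(X, Y, rcond=None)   # (D, C) coefficients, step (4)
    scores = np.abs(A).max(axis=1)              # per-feature importance, step (5)
    return np.argsort(scores)[::-1][:d]         # top-d feature indices, step (6)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 10))
S = np.abs(rng.standard_normal((20, 20)))
S = (S + S.T) / 2                               # symmetric nonnegative similarities
print(semi_supervised_feature_scores(X, S, C=3, d=4))
```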
In addition, it should be noted that the semi-supervised feature selection algorithm takes into account the manifold structure of the data, the category structure of the data, and the information provided by the unlabeled samples, thereby preventing the feature selection result from over-fitting the training data and selecting features that help recognize the speech emotion of the speaker.
The fifth step: obtain the feature selection results {z1, z2, …, zn} of the labeled samples according to the feature selection result V, and store the feature selection results in the speech emotion vector database.
The sixth step: train the classifier using {z1, z2, …, zn} and their class labels.
Further, it should be understood that the feature selection results {z1, z2, …, zn} of the labeled samples are obtained from the feature selection result V, and the speech emotion categories of {z1, z2, …, zn} are then obtained with the classifier obtained in the training process.
In addition, it should be noted that after the training process is completed, the recognition test is performed in 5-fold cross-validation fashion. Seven emotions can be recognized: anger, fear, irritability, disgust, happiness, neutrality and sadness. The average classification accuracy is 90.84% in the speaker-dependent case; apart from happiness and anger being relatively easy to confuse with each other, the remaining emotions are well separated. The average classification accuracy is 86.50% in the speaker-independent case.
In this embodiment, feature extraction is performed on a training recognition voice sample to obtain training voice signal features of a preset dimension; feature statistics are computed on the training voice signal features through a preset statistical function to obtain a training feature statistical result; preliminary normalization is performed on the training feature statistical result to obtain labeled training sample data and unlabeled training sample data; speaker normalization is performed on the labeled and unlabeled training sample data to obtain labeled and unlabeled training sample processing data; labeled training sample selection data are obtained through the trained semi-supervised feature selection algorithm according to the labeled and unlabeled training sample processing data; the category label corresponding to each labeled training sample is obtained according to the labeled training sample selection data; and a preset training classification model is established according to the labeled training sample selection data and the category labels. In this way, the influence of unlabeled samples of other speakers is avoided, so that the manifold structure of the speaker's voice data can be exploited to the greatest extent and features helpful for recognizing the speech emotion of the speaker are selected.
In addition, an embodiment of the present invention further provides a storage medium, where the storage medium stores an integrated speech emotion recognition program, and the integrated speech emotion recognition program, when executed by a processor, implements the steps of the integrated speech emotion recognition method described above.
Referring to fig. 4, fig. 4 is a block diagram illustrating a first embodiment of an integrated speech emotion recognition device according to the present invention.
As shown in fig. 4, the integrated speech emotion recognition apparatus according to the embodiment of the present invention includes: the obtaining module 4001 is configured to perform feature extraction on a voice sample to be recognized to obtain voice signal features of a preset dimension; the statistic module 4002 is configured to perform feature statistics on the speech signal features through a preset statistic function to obtain a feature statistical result; a processing module 4003, configured to perform normalization processing on the feature statistical result to obtain feature initial data; a screening module 4004, configured to screen the feature initial data to obtain feature target data; the determining module 4005 is configured to input the feature target data into a preset training classification model, so as to obtain an integrated speech emotion recognition result.
The obtaining module 4001 is configured to perform feature extraction on a voice sample to be recognized, and obtain a voice signal feature with a preset dimension.
In addition, before the step of performing feature extraction on a voice sample to be recognized to obtain voice signal features of a preset dimension, the following training is performed: feature extraction is performed on a training recognition voice sample to obtain training voice signal features of the preset dimension; feature statistics are computed on the training voice signal features through a preset statistical function to obtain a training feature statistical result; preliminary normalization is performed on the training feature statistical result to obtain labeled training sample data and unlabeled training sample data; speaker normalization is performed on the labeled and unlabeled training sample data to obtain labeled and unlabeled training sample processing data; labeled training sample selection data are obtained through a trained semi-supervised feature selection algorithm according to the labeled and unlabeled training sample processing data; the category label corresponding to each labeled training sample is obtained according to the labeled training sample selection data; and a preset training classification model is established according to the labeled training sample selection data and the category labels.
It should be noted that the speech signal features extracted from the speech sample to be recognized include: Mel Frequency Cepstrum Coefficients (MFCC), Log Frequency Power Coefficients (LFPC), Linear Predictive Cepstrum Coefficients (LPCC), Zero Crossings with Peak Amplitudes (ZCPA), Perceptual Linear Prediction (PLP) and Rasta-filtered Perceptual Linear Prediction (R-PLP).
It should be understood that the feature extraction result of each class of features is a two-dimensional matrix, one dimension of which is the time dimension. For each class of features Fi, the first derivative ΔFi and the second derivative ΔΔFi in the time dimension are then calculated, and the original features, the first-derivative result and the second-derivative result are concatenated along the non-time dimension to form the final feature extraction result of that class of features; the final feature extraction results of all classes of features are concatenated along the non-time dimension to obtain the feature extraction result of the sample.
Further, for ease of understanding, the following is exemplified:
Suppose the MFCC corresponds to FMFCC ∈ R^(39×z), ΔFMFCC ∈ R^(39×z) and ΔΔFMFCC ∈ R^(39×z), where z is the number of frames, i.e. the size of the time dimension; the concatenation result along the non-time dimension then lies in R^(117×z).
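The derivative-and-concatenation step of this example can be sketched as follows; np.gradient is used as an illustrative stand-in, since the text does not specify which difference scheme computes the derivatives.

```python
import numpy as np

def with_deltas(F):
    """Concatenate a feature matrix with its first and second time
    derivatives along the non-time (feature) dimension.

    F: (dims, z) matrix with z frames -> (3 * dims, z) matrix.
    """
    d1 = np.gradient(F, axis=1)   # first derivative over time
    d2 = np.gradient(d1, axis=1)  # second derivative over time
    return np.vstack([F, d1, d2])

F_mfcc = np.random.randn(39, 120)   # 39-dim MFCC over 120 frames
print(with_deltas(F_mfcc).shape)    # (117, 120)
```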
Furthermore, it should be understood that each time speech signal features are extracted, the MFCC, LFPC, LPCC, ZCPA, PLP and R-PLP features are extracted, where the number of Mel filters of the MFCC and the LFPC is 40; the linear prediction orders of the LPCC, the PLP and the R-PLP are 12, 16 and 16, respectively; and the frequency segmentation of the ZCPA is: 0, 106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689, 3089, 3522, 4000. The dimensions of each class of features of each sentence are therefore ti×39, ti×40, ti×12 and ti×16, respectively, where ti is the number of frames in the i-th sentence and the number after the multiplication sign is the per-frame feature dimension. To capture the change of the speech signal in the time dimension, first and second derivatives are also calculated for the above features along the time dimension. Finally, the dimensions of each class of features are ti×117, ti×140, ti×36 and ti×48, respectively. The speech signal features extracted from the i-th sample are the combination of all the above features, with dimension ti×(117+140+36+48+48+48).
The statistic module 4002 is configured to perform feature statistics on the speech signal features through a preset statistic function, so as to obtain a feature statistical result.
The statistical results of the above features in the time dimension are obtained using the following statistical functions: mean, standard deviation, minimum (min), maximum (max), kurtosis and skewness.
The processing module 4003 is configured to perform normalization processing on the feature statistical result to obtain feature initial data.
It should be understood that the feature statistics are subjected to a preliminary normalization process to obtain sample feature data, the sample feature data are subjected to a speaker normalization process to obtain sample processing data, and the sample processing data are used as feature initial data.
In addition, the feature statistical results of the labeled samples known from the above steps are denoted {x1, x2, …, xn}, and the feature statistical results of the unlabeled training samples of a certain speaker are denoted {xn+1, xn+2, …, xn+m}; all the feature statistical results {x1, x2, …, xn+m} are preliminarily normalized using the following formula:
Thereafter, speaker normalization is performed on the preliminary normalization results {x'1, x'2, …, x'n+m} using the following formula:
where x'j, j = 1, 2, …, ni, are the training samples carrying the same speaker label as x'i, and ni is the number of training samples with the same speaker label as x'i.
Furthermore, it should be understood that a plurality of feature subsets are obtained through a unified feature selection framework, which ensures the data description ability and classification ability of each feature subset. The base classifiers trained on this basis therefore have better diversity and classification strength; the number of base classifiers can be markedly reduced while their classification ability is improved.
Meanwhile, the feature statistical results are normalized using an improved normalization algorithm comprising two steps: preliminary normalization and speaker normalization. Preliminary normalization normalizes each sample using the mean and variance of all samples, which avoids the influence of different feature value ranges; speaker normalization needs only the mean of all samples of the speaker, and since the mean can be estimated with high confidence even from few samples, a good speaker normalization effect is achieved even when the speaker has few unlabeled samples.
The screening module 4004 is configured to screen the feature initial data to obtain feature target data.
It should be noted that labeled sample processing data and unlabeled sample processing data are obtained from the sample processing data through a preset feature selection method, and the labeled sample target data are used as the feature target data.
The determining module 4005 is configured to input the feature target data into a preset training classification model to obtain the integrated speech emotion recognition result.
It should be noted that the preset training classification model includes a plurality of preset training classification submodels, and each preset classification submodel is a support vector machine.
Furthermore, it should be understood that the above steps of obtaining the preset training classification submodel are:
(1) Define a matrix L describing the local geometry of the samples:
L = (I - S)^T (I - S)
where I ∈ R^(n×n) is the identity matrix, i.e. the diagonal element values are 1 and the other element values are 0, and S is optimized by the following formula:
(2) the relationship between samples is defined using the following equation:
A Laplacian matrix is then calculated, where D is a diagonal matrix with Dii = Σj Gij; the eigendecomposition problem is solved, and V = [v1, v2, …, vC] is set to the eigenvectors corresponding to the 2nd to (C+1)-th smallest eigenvalues, where C is the number of speech emotion categories;
(3) Optimize the following formula using the loop in (4), where Pkq(i, j) = Wk(i, j) * Wq(i, j):
(4) The above formula is optimized using the following loop:
for k = 1 to p:
initialize Wk as an identity matrix (diagonal 1, the rest 0) and set t = 0, where p is the required number of submodels;
iteratively optimize Wk by repeating the following:
where X is the training data, I is the identity matrix, α, β and γ are three balance parameters, L is obtained by the calculation in step (1-4-1), and V is obtained by the calculation in step (1-4-2).
where the element in row i and column j of Pqk is calculated by:
Pkq(i, j) = Wk(i, j) * Wq(i, j)
In addition, it should be understood that the feature target data are input into the preset training classification submodels to obtain speech emotion category data, data statistics are performed on the speech emotion category data to obtain a speech emotion category data value, and the integrated speech emotion recognition result is obtained according to the speech emotion category data value.
In addition, it should be noted that the step of obtaining the integrated speech emotion recognition result from the speech emotion category data value is as follows: determine whether the speech emotion category data value meets a preset speech emotion category threshold range; if it does, obtain the integrated speech emotion recognition result from the speech emotion category data value; if it does not, return to the step of inputting the feature target data into the preset training classification submodel to obtain speech emotion category data.
Furthermore, it should be understood that the above corresponds to the recognition phase, whose steps are as follows:
in this stage, the speech signal of an emotion sample to be recognized from a known speaker is processed, and the emotion classification of the sample is obtained with the classifier obtained in the training stage. The specific process is as follows:
the first step is as follows: extract MFCC, LFPC, LPCC, ZCPA, PLP and R-PLP features from the voice signal of the emotion sample to be recognized, where the number of Mel filters of the MFCC and the LFPC is 40; the linear prediction orders of the LPCC, the PLP and the R-PLP are 12, 16 and 16, respectively; and the frequency segmentation of the ZCPA is: 0, 106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689, 3089, 3522, 4000. The dimensions of each class of features of each sentence are therefore t×39, t×40, t×12 and t×16, respectively, where t is the number of frames of the emotion sentence to be recognized and the number after the multiplication sign is the per-frame feature dimension. To capture the change of the speech signal in the time dimension, first and second derivatives are also calculated for the above features along the time dimension. Finally, the dimensions of each class of features are t×117, t×140, t×36 and t×48, respectively. The speech signal features extracted from the emotion sentence to be recognized are the combination of all the above features, with dimension t×(117+140+36+48+48+48).
The second step is as follows: apply the following statistical functions: mean, standard deviation, minimum (min), maximum (max), kurtosis and skewness, to obtain the feature statistical result x of the emotion sentence to be recognized.
The third step: speaker normalization. First, calculate the preliminary normalization result x' of x using formula (1) with the μ and σ obtained in the training stage; then calculate the speaker normalization result of x' using formula (2).
The fourth step: using the feature selection vector V obtained in the training process, calculate the feature selection result z.
The fifth step: obtain the speech emotion class l of z using the classifier obtained in the training process.
The corpus used to evaluate the emotion recognition effect is the German EMO-DB speech emotion database, a standard database in the field of speech emotion recognition. The training process is completed first, and then the recognition test is performed in 5-fold cross-validation fashion. Seven emotions can be recognized: anger, fear, irritability, disgust, happiness, neutrality and sadness. The average classification accuracy is 90.84% in the speaker-dependent case; apart from happiness and anger being relatively easy to confuse with each other, the remaining emotions are well separated. The average classification accuracy is 86.50% in the speaker-independent case.
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in a specific application, a person skilled in the art may set the technical solution as needed, and the present invention is not limited thereto.
In this embodiment, feature extraction is performed on a voice sample to be recognized to obtain voice signal features of a preset dimension; feature statistics are computed on the voice signal features through a preset statistical function to obtain a feature statistical result; the feature statistical result is normalized to obtain feature initial data; the feature initial data are screened to obtain feature target data; and the feature target data are input into a preset training classification model to obtain an integrated voice emotion recognition result. In this way, a plurality of feature subsets with sufficient ability to describe the data are found, the data are used more fully, and the voice emotion recognition result is obtained more accurately.
It should be noted that the above-described work flows are only exemplary, and do not limit the scope of the present invention, and in practical applications, a person skilled in the art may select some or all of them to achieve the purpose of the solution of the embodiment according to actual needs, and the present invention is not limited herein.
In addition, the technical details that are not described in detail in this embodiment may refer to the integrated speech emotion recognition method provided in any embodiment of the present invention, and are not described herein again.
Further, it is to be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention or portions thereof that contribute to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (e.g. Read Only Memory (ROM)/RAM, magnetic disk, optical disk), and includes several instructions for enabling a terminal device (e.g. a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Claims (10)
1. An integrated speech emotion recognition method, characterized in that the method comprises:
performing feature extraction on a voice sample to be recognized to obtain voice signal features with preset dimensionality;
carrying out feature statistics on the voice signal features through a preset statistical function to obtain a feature statistical result;
carrying out normalization processing on the feature statistical result to obtain feature initial data;
screening the characteristic initial data to obtain characteristic target data;
and inputting the characteristic target data into a preset training classification model to obtain an integrated speech emotion recognition result.
2. The method of claim 1, wherein the step of normalizing the feature statistics to obtain feature initial data comprises:
carrying out primary normalization processing on the characteristic statistical result to obtain sample characteristic data;
and carrying out speaker normalization processing on the sample characteristic data to obtain sample processing data, and taking the sample processing data as characteristic initial data.
3. The method of claim 2, wherein the step of screening the characteristic initial data to obtain characteristic target data comprises:
and according to the sample processing data, obtaining label sample processing data through a preset feature selection algorithm, and taking the label sample target data as feature target data.
4. The method of claim 1, wherein before the step of performing feature extraction on the voice sample to be recognized to obtain the voice signal features of the preset dimensionality, the method further comprises:
performing feature extraction on a training voice sample to obtain training voice signal features of the preset dimensionality;
performing feature statistics on the training voice signal features through the preset statistical function to obtain a training feature statistical result;
performing primary normalization processing on the training feature statistical result to obtain training sample data;
performing speaker normalization processing on the training sample data to obtain training sample processing data;
obtaining label training sample processing data from the training sample processing data through a preset feature selection algorithm;
obtaining a class label corresponding to the label training sample processing data;
and establishing the preset training classification model according to the label training sample processing data and the class label.
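Claim 4 builds the preset training classification model from labeled data, and claim 5 reveals it to be an ensemble of submodels. A sketch assuming bootstrap-resampled nearest-centroid submodels (the patent does not specify the submodel type, so these names and choices are illustrative):

```python
import numpy as np

class CentroidSubmodel:
    # Minimal stand-in for one "preset training classification submodel":
    # nearest-centroid classification fitted on a resampled training set.
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Distance from each sample to each class centroid; pick the nearest.
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[np.argmin(d, axis=1)]

def build_ensemble(X, y, n_models=5, seed=0):
    # Train each submodel on a bootstrap resample so the ensemble
    # members disagree on borderline samples.
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), len(X))
        models.append(CentroidSubmodel().fit(X[idx], y[idx]))
    return models

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 4)), rng.normal(5, 0.1, (20, 4))])
y = np.array([0] * 20 + [1] * 20)     # two well-separated emotion classes
models = build_ensemble(X, y)
preds = np.array([m.predict(X) for m in models])
print(preds.shape)                    # (5, 40): one row of votes per submodel
```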
5. The method of claim 4, wherein the preset training classification model comprises a plurality of preset training classification submodels;
the step of inputting the feature target data into the preset training classification model to obtain the integrated speech emotion recognition result comprises:
inputting the feature target data into each of the preset training classification submodels to obtain voice emotion category data;
performing data statistics on the voice emotion category data to obtain a voice emotion category data value;
and obtaining the integrated voice emotion recognition result according to the voice emotion category data value.
6. The method of claim 5, wherein the step of obtaining the integrated speech emotion recognition result according to the voice emotion category data value comprises:
judging whether the voice emotion category data value falls within a preset voice emotion category threshold range;
and if the voice emotion category data value falls within the preset voice emotion category threshold range, obtaining the integrated voice emotion recognition result according to the voice emotion category data value.
7. The method of claim 6, wherein after the step of judging whether the voice emotion category data value falls within the preset voice emotion category threshold range, the method further comprises:
and if the voice emotion category data value does not fall within the preset voice emotion category threshold range, returning to the step of inputting the feature target data into each of the preset training classification submodels to obtain the voice emotion category data.
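Claims 5 to 7 describe majority voting over the submodel outputs with a threshold check and a retry loop. A minimal sketch of the decision rule, assuming the "voice emotion category data value" is the vote count of the majority label and the threshold range is a minimum agreement count (both assumptions; the patent leaves the statistic unspecified):

```python
from collections import Counter

def ensemble_decision(votes, min_agreement):
    # votes: emotion labels predicted by the submodels for one sample.
    # Accept the majority label only if its vote count reaches the
    # preset threshold; otherwise return None, signalling that the
    # method should loop back and re-classify (claim 7).
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None

print(ensemble_decision(["happy", "happy", "sad", "happy", "angry"], 3))  # happy
print(ensemble_decision(["happy", "sad", "angry"], 2))                    # None
```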
8. An integrated speech emotion recognition apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to perform feature extraction on a voice sample to be recognized to obtain voice signal features of a preset dimensionality;
a statistics module, configured to perform feature statistics on the voice signal features through a preset statistical function to obtain a feature statistical result;
a processing module, configured to perform normalization processing on the feature statistical result to obtain feature initial data;
a screening module, configured to screen the feature initial data to obtain feature target data;
and a determining module, configured to input the feature target data into a preset training classification model to obtain an integrated speech emotion recognition result.
9. An electronic device, characterized in that the device comprises: a memory, a processor, and an integrated speech emotion recognition program stored on the memory and executable on the processor, the integrated speech emotion recognition program being configured to implement the steps of the integrated speech emotion recognition method according to any one of claims 1 to 7.
10. A storage medium having an integrated speech emotion recognition program stored thereon, wherein the integrated speech emotion recognition program, when executed by a processor, implements the steps of the integrated speech emotion recognition method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911246545.XA CN110931043A (en) | 2019-12-06 | 2019-12-06 | Integrated speech emotion recognition method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911246545.XA CN110931043A (en) | 2019-12-06 | 2019-12-06 | Integrated speech emotion recognition method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110931043A true CN110931043A (en) | 2020-03-27 |
Family
ID=69858247
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911246545.XA Pending CN110931043A (en) | 2019-12-06 | 2019-12-06 | Integrated speech emotion recognition method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110931043A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104008754A (en) * | 2014-05-21 | 2014-08-27 | 华南理工大学 | Speech emotion recognition method based on semi-supervised feature selection |
- 2019-12-06: Application CN201911246545.XA filed in China (CN); patent CN110931043A/en, status: Pending
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111950644A (en) * | 2020-08-18 | 2020-11-17 | 东软睿驰汽车技术(沈阳)有限公司 | Model training sample selection method and device and computer equipment |
CN111950644B (en) * | 2020-08-18 | 2024-03-26 | 东软睿驰汽车技术(沈阳)有限公司 | Training sample selection method and device for model and computer equipment |
CN112530409A (en) * | 2020-12-01 | 2021-03-19 | 平安科技(深圳)有限公司 | Voice sample screening method and device based on geometry and computer equipment |
WO2022116442A1 (en) * | 2020-12-01 | 2022-06-09 | 平安科技(深圳)有限公司 | Speech sample screening method and apparatus based on geometry, and computer device and storage medium |
CN112530409B (en) * | 2020-12-01 | 2024-01-23 | 平安科技(深圳)有限公司 | Speech sample screening method and device based on geometry and computer equipment |
CN113889149A (en) * | 2021-10-15 | 2022-01-04 | 北京工业大学 | Speech emotion recognition method and device |
CN113889149B (en) * | 2021-10-15 | 2023-08-29 | 北京工业大学 | Speech emotion recognition method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111179975B (en) | Voice endpoint detection method for emotion recognition, electronic device and storage medium | |
Xia et al. | A multi-task learning framework for emotion recognition using 2D continuous space | |
CN112818861B (en) | Emotion classification method and system based on multi-mode context semantic features | |
CN109767765A (en) | Talk about art matching process and device, storage medium, computer equipment | |
CN110931043A (en) | Integrated speech emotion recognition method, device, equipment and storage medium | |
CN110990543A (en) | Intelligent conversation generation method and device, computer equipment and computer storage medium | |
Seng et al. | Video analytics for customer emotion and satisfaction at contact centers | |
CN110168535A (en) | A kind of information processing method and terminal, computer storage medium | |
US20230206928A1 (en) | Audio processing method and apparatus | |
CN107767881B (en) | Method and device for acquiring satisfaction degree of voice information | |
Noroozi et al. | Supervised vocal-based emotion recognition using multiclass support vector machine, random forests, and adaboost | |
CN110598869B (en) | Classification method and device based on sequence model and electronic equipment | |
CN108509416A (en) | Sentence realizes other method and device, equipment and storage medium | |
CN110827797A (en) | Voice response event classification processing method and device | |
CN111784372A (en) | Store commodity recommendation method and device | |
Qi et al. | Exploiting low-rank tensor-train deep neural networks based on Riemannian gradient descent with illustrations of speech processing | |
CN110956981B (en) | Speech emotion recognition method, device, equipment and storage medium | |
Liu et al. | Learning salient features for speech emotion recognition using CNN | |
CN113053395A (en) | Pronunciation error correction learning method and device, storage medium and electronic equipment | |
CN113111855B (en) | Multi-mode emotion recognition method and device, electronic equipment and storage medium | |
CN108831487A (en) | Method for recognizing sound-groove, electronic device and computer readable storage medium | |
CN110942358A (en) | Information interaction method, device, equipment and medium | |
CN112633381B (en) | Audio recognition method and training method of audio recognition model | |
CN114765028A (en) | Voiceprint recognition method and device, terminal equipment and computer readable storage medium | |
CN113421573A (en) | Identity recognition model training method, identity recognition method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200327 |