CN110931043A - Integrated speech emotion recognition method, device, equipment and storage medium - Google Patents

Integrated speech emotion recognition method, device, equipment and storage medium

Info

Publication number: CN110931043A
Application number: CN201911246545.XA
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 孙亚新, 叶青
Original Assignee: Hubei University of Arts and Science
Current Assignee: Hubei University of Arts and Science
Application filed by Hubei University of Arts and Science
Prior art keywords: data, feature, preset, training, voice

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L 25/63: Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L 25/03: Speech or voice analysis techniques characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of voice signal processing and pattern recognition, and discloses an integrated voice emotion recognition method, device, equipment and storage medium. The method comprises the following steps: performing feature extraction on a voice sample to be recognized to obtain voice signal features of a preset dimension; performing feature statistics on the voice signal features through a preset statistical function to obtain a feature statistical result; normalizing the feature statistical result to obtain feature initial data; screening the feature initial data to obtain feature target data; and inputting the feature target data into a preset training classification model to obtain an integrated speech emotion recognition result. In this way, the feature selection result is prevented from over-fitting the training data, so that speech emotion recognition accuracy is improved.

Description

Integrated speech emotion recognition method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of voice signal processing and pattern recognition, and in particular to an integrated voice emotion recognition method, device, equipment and storage medium.
Background
The purpose of speech emotion recognition is to enable a computer to detect human emotional states from speech signals and to let machines understand human affective thinking, giving computers more humanized and sophisticated functions. Because speech is the most common way for humans to communicate and speech signals are easy to capture, speech emotion recognition technology can be used in a very large number of fields. (1) Recommendation systems. A recommendation that understands your mood is not an advertisement but human consideration; what is recommended is not merely a service but an understanding of you, and the recommendation effect is naturally greatly improved. (2) Telephone customer service emotion management systems. The quality of customer service can be improved, and psychological problems of customer service staff can be avoided. (3) Personal health monitoring. A person can be prevented from staying in a negative emotional state for a long time. In addition, the technology is highly useful in fields such as smart homes, distance education, game feedback and emotion therapy. If speech emotion recognition technology matures, portable devices will be able to understand the user's emotional state at any time and then serve the user accordingly. This would have a disruptive effect on the entire Internet service model and trigger a reshuffle of everyday devices; after all, no one wants to face a cold, unfeeling machine. It is also of great significance for improving the competitiveness of IT industries such as electronic commerce, social software, smart televisions, mobile phones and robots in China.
At present, there are many speech emotion recognition methods, among which ensemble learning is a good way to improve the recognition effect. Common ensemble-learning-based speech emotion recognition methods include the following: (1) manually defined hierarchical ensemble frameworks; (2) manually specified feature submodels; (3) the use of a variety of classifiers; (4) random submodels and other general ensemble learning methods. Methods (1) and (2) require considerable manual participation, and the designed models generalize poorly. Method (3) is mostly executed in the original feature space and may suffer from the curse of dimensionality. For method (4), the diversity and classification capability of the feature submodels are difficult to guarantee.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide an integrated speech emotion recognition method, device, equipment and storage medium, and aims to solve the technical problem of improving speech emotion recognition accuracy.
In order to achieve the above object, the present invention provides an integrated speech emotion recognition method, which comprises the following steps:
performing feature extraction on a voice sample to be recognized to obtain voice signal features with preset dimensionality;
carrying out feature statistics on the voice signal features through a preset statistical function to obtain a feature statistical result;
carrying out normalization processing on the feature statistical result to obtain feature initial data;
screening the characteristic initial data to obtain characteristic target data;
and inputting the characteristic target data into a preset training classification model to obtain an integrated speech emotion recognition result.
Preferably, the step of performing normalization processing on the feature statistical result to obtain feature initial data includes:
carrying out primary normalization processing on the characteristic statistical result to obtain sample characteristic data;
and carrying out speaker normalization processing on the sample characteristic data to obtain sample processing data, and taking the sample processing data as characteristic initial data.
Preferably, the step of screening the characteristic initial data to obtain characteristic target data includes:
and according to the sample processing data, obtaining label sample target data through a preset feature selection algorithm, and taking the label sample target data as the feature target data.
Preferably, before the step of performing feature extraction on the voice sample to be recognized to obtain the voice signal features of the preset dimension, the method further includes:
performing feature extraction on the training recognition voice sample to obtain training voice signal features with preset dimensionality;
performing feature statistics on the training voice signal features through a preset statistical function to obtain a training feature statistical result;
carrying out primary normalization processing on the training characteristic statistical result to obtain training sample data;
carrying out speaker normalization processing on the training sample data to obtain training sample processing data;
obtaining label training sample processing data through a preset feature selection algorithm according to the training sample processing data;
obtaining a class label corresponding to the label training sample processing data according to the label training sample processing data;
and establishing a preset training classification model according to the label training sample processing data and the class label.
Preferably, the preset training classification model comprises a plurality of preset training classification submodels;
the step of inputting the feature target data into a preset training classification model to obtain an integrated speech emotion recognition result comprises the following steps:
inputting the characteristic target data into the preset training classification submodel to obtain voice emotion category data;
performing data statistics on the voice emotion type data to obtain a voice emotion type data value;
and acquiring an integrated voice emotion recognition result according to the voice emotion category data value.
Preferably, the step of obtaining an integrated speech emotion recognition result according to the speech emotion category data value includes:
judging whether the voice emotion type data value belongs to a preset voice emotion type threshold range or not;
and if the voice emotion category data value belongs to the preset voice emotion category threshold range, acquiring an integrated voice emotion recognition result according to the voice emotion category data value.
Preferably, after the step of determining whether the speech emotion category data value belongs to a preset speech emotion category threshold range, the method further includes:
and if the voice emotion category data value does not belong to the preset voice emotion category threshold range, returning to the step of inputting the characteristic target data into the preset training classification submodel to obtain voice emotion category data.
In addition, to achieve the above object, the present invention further provides an integrated speech emotion recognition apparatus, including:
the acquisition module is used for extracting the characteristics of the voice sample to be recognized to acquire the voice signal characteristics with preset dimensionality;
the statistical module is used for carrying out feature statistics on the voice signal features through a preset statistical function to obtain a feature statistical result;
the processing module is used for carrying out normalization processing on the characteristic statistical result to obtain characteristic initial data;
the screening module is used for screening the characteristic initial data to obtain characteristic target data;
and the determining module is used for inputting the characteristic target data into a preset training classification model to obtain an integrated speech emotion recognition result.
In addition, to achieve the above object, the present invention also provides an electronic device, including: a memory, a processor and an integrated speech emotion recognition program stored on the memory and executable on the processor, the integrated speech emotion recognition program configured to implement the steps of the integrated speech emotion recognition method as described in any of the above.
In addition, to achieve the above object, the present invention further provides a storage medium, on which an integrated speech emotion recognition program is stored, and the integrated speech emotion recognition program, when executed by a processor, implements the steps of the integrated speech emotion recognition method as described in any one of the above.
The invention performs feature extraction on a voice sample to be recognized to obtain voice signal features of a preset dimension, performs feature statistics on the voice signal features through a preset statistical function to obtain a feature statistical result, performs preliminary normalization processing and speaker normalization processing on the feature statistical result to obtain feature initial data, and then screens the feature initial data to obtain feature target data. The feature target data are input into the preset training classification submodels to obtain voice emotion category data, data statistics are performed on the voice emotion category data to obtain a voice emotion category data value, and finally an integrated voice emotion recognition result is obtained according to the voice emotion category data value. This prevents the feature selection result from over-fitting the training data and selects features that are useful for recognizing the speaker's speech emotion, so the diversity and classification capability of the feature submodels are improved, which in turn improves the effect of the integrated classifier.
Drawings
Fig. 1 is a schematic structural diagram of an electronic device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of an integrated speech emotion recognition method according to the present invention;
FIG. 3 is a flowchart illustrating a second embodiment of the integrated speech emotion recognition method according to the present invention;
FIG. 4 is a block diagram of the integrated speech emotion recognition device according to the first embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an electronic device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the electronic device may include: a processor 1001 such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. Wherein the communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and optionally the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration shown in fig. 1 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a storage medium, may include an operating system, a network communication module, a user interface module, and an integrated speech emotion recognition program therein.
In the electronic apparatus shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 of the electronic device according to the present invention may be disposed in the electronic device, and the electronic device calls the integrated speech emotion recognition program stored in the memory 1005 through the processor 1001 and executes the integrated speech emotion recognition method according to the embodiment of the present invention.
An embodiment of the present invention provides an integrated speech emotion recognition method, and referring to fig. 2, fig. 2 is a schematic flowchart of a first embodiment of the integrated speech emotion recognition method according to the present invention.
In this embodiment, the integrated speech emotion recognition method includes the following steps:
step S10: and performing feature extraction on the voice sample to be recognized to obtain voice signal features with preset dimensionality.
In addition, before the step of performing feature extraction on the voice sample to be recognized to obtain the voice signal features of the preset dimension, the method further includes: performing feature extraction on the training recognition voice samples to obtain training voice signal features of the preset dimension; performing feature statistics on the training voice signal features through a preset statistical function to obtain a training feature statistical result; performing preliminary normalization processing on the training feature statistical result to obtain labeled training sample data and unlabeled training sample data; performing speaker normalization processing on the labeled training sample data and the unlabeled training sample data to obtain labeled training sample processing data and unlabeled training sample processing data; obtaining labeled training sample selection data through a trained semi-supervised feature selection algorithm according to the labeled training sample processing data and the unlabeled training sample processing data; obtaining the category labels corresponding to the labeled training samples according to the labeled training sample selection data; and establishing the preset training classification model according to the labeled training sample selection data and the category labels.
It should be noted that the speech signal features extracted from the voice sample to be recognized include: Mel Frequency Cepstral Coefficients (MFCC), Log Frequency Power Coefficients (LFPC), Linear Predictive Cepstral Coefficients (LPCC), Zero Crossings with Peak Amplitudes (ZCPA), Perceptual Linear Prediction (PLP) and RASTA-filtered Perceptual Linear Prediction (R-PLP).
It should be understood that the feature extraction result of each type of feature is a two-dimensional matrix, one dimension of which is the time dimension. For each type of feature F_i, the first derivative ΔF_i and the second derivative ΔΔF_i along the time dimension are then computed, and the original features, the first-derivative result and the second-derivative result are concatenated in the non-time dimension to form the final feature extraction result of that type of feature. The final feature extraction results of all feature types are concatenated in the non-time dimension to obtain the feature extraction result of the sample.
Further, for ease of understanding, the following is exemplified:
Suppose that for the MFCC features F_MFCC ∈ R^(39×z), ΔF_MFCC ∈ R^(39×z) and ΔΔF_MFCC ∈ R^(39×z), where z is the number of frames, i.e. the size of the time dimension. The concatenation result in the non-time dimension is then [F_MFCC; ΔF_MFCC; ΔΔF_MFCC] ∈ R^(117×z). When the MFCC and LPCC features are connected, suppose the final LPCC result lies in R^(36×z); after concatenation in the non-time dimension the result lies in R^(153×z).
Furthermore, it should be understood that at each feature extraction the MFCC, LFPC, LPCC, ZCPA, PLP and R-PLP features are extracted, where the number of Mel filters of the MFCC and LFPC is 40; the linear prediction orders of the LPCC, PLP and R-PLP are 12, 16 and 16, respectively; and the frequency segmentation of the ZCPA is: 0, 106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689, 3089, 3522, 4000. The dimensions of each class of feature for each sentence are therefore ti × 39, ti × 40, ti × 12 and ti × 16 (the last for each of the ZCPA, PLP and R-PLP), where ti is the number of frames of the i-th sentence and the number after the multiplication sign is the dimension of each frame feature. To capture the change of the speech signal in the time dimension, the first and second derivatives of the above features are also computed along the time dimension. The final dimensions of each class of feature are therefore ti × 117, ti × 140, ti × 36 and ti × 48 (again for each of the last three). The extracted speech signal features of the i-th sample are the combination of all the above features, with dimension ti × (117 + 140 + 36 + 48 + 48 + 48) = ti × 437.
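To make the feature layout above concrete, the following sketch computes just the MFCC branch with its first and second time derivatives and stacks them in the non-time dimension. It assumes the librosa library with default frame settings; the remaining feature types (LFPC, LPCC, ZCPA, PLP, R-PLP) and the patent's exact front-end parameters are not reproduced here.

```python
# Minimal sketch of one feature branch (MFCC + derivatives), assuming librosa.
import numpy as np
import librosa

def mfcc_with_derivatives(waveform, sample_rate, n_mfcc=39, n_mels=40):
    # Base MFCC matrix of shape (n_mfcc, z), where z is the number of frames.
    mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate,
                                n_mfcc=n_mfcc, n_mels=n_mels)
    d1 = librosa.feature.delta(mfcc)            # first derivative along time
    d2 = librosa.feature.delta(mfcc, order=2)   # second derivative along time
    # Concatenation in the non-time dimension: shape (3 * n_mfcc, z), here (117, z).
    return np.concatenate([mfcc, d1, d2], axis=0)

if __name__ == "__main__":
    y = np.random.default_rng(0).normal(size=16000).astype(np.float32)  # stand-in signal
    print(mfcc_with_derivatives(y, 16000).shape)
```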
Step S20: and carrying out feature statistics on the voice signal features through a preset statistical function to obtain a feature statistical result.
The statistical results of the above features in the time dimension are obtained using the following statistical functions: mean, standard deviation (std), minimum (min), maximum (max), kurtosis and skewness.
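A short sketch of these six statistics applied along the time dimension, assuming NumPy and SciPy; the function and variable names are illustrative.

```python
# Statistics over the time dimension of a (feature_dim, z) feature matrix.
import numpy as np
from scipy.stats import kurtosis, skew

def time_statistics(features):
    # Returns a vector of length 6 * feature_dim.
    return np.concatenate([
        features.mean(axis=1), features.std(axis=1),
        features.min(axis=1), features.max(axis=1),
        kurtosis(features, axis=1), skew(features, axis=1),
    ])
```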
Step S30: and carrying out normalization processing on the characteristic statistical result to obtain characteristic initial data.
It should be understood that the feature statistics are subjected to a preliminary normalization process to obtain sample feature data, the sample feature data are subjected to a speaker normalization process to obtain sample processing data, and the sample processing data are used as feature initial data.
In addition, the feature statistical results of the labeled samples obtained in the above steps are denoted {x_1, x_2, …, x_n}, and the feature statistical results of the unlabeled training samples of a given speaker are denoted {x_{n+1}, x_{n+2}, …, x_{n+m}}. All feature statistical results {x_1, x_2, …, x_{n+m}} are first subjected to the preliminary normalization

x'_i = (x_i - μ) / σ,   i = 1, 2, …, n + m,   (1)

where μ denotes the mean of all samples and σ² denotes the variance of all samples. The preliminary normalization results {x'_1, x'_2, …, x'_{n+m}} are then subjected to speaker normalization:

x̃_i = x'_i - (1 / n_i) Σ_{j=1}^{n_i} x'_j,   (2)

where x'_j, j = 1, 2, …, n_i, are the training samples with the same speaker label as x'_i, and n_i is the number of training samples with the same speaker label as x'_i.
Furthermore, it should be understood that a plurality of feature subsets are obtained through a unified feature selection framework, and the data description capability and classification capability of each feature subset can be ensured. The base classifier trained on the basis has better diversity and classification strength. The number of the base classifiers can be obviously reduced, and the classification capability of the base classifiers can be improved.
Meanwhile, the feature statistical results are normalized by using an improved normalization algorithm. The normalization algorithm comprises two steps of initial normalization and speaker normalization, wherein the initial normalization uses the mean value and the variance of all samples to normalize each sample, and can avoid the influence caused by different characteristic value ranges; the speaker normalization only needs to use the mean value of all samples of the speaker, and the mean value estimation can obtain higher confidence coefficient when the number of the samples is less, so that a better speaker normalization effect can be achieved under the condition that the number of unlabeled samples of the speaker is less.
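The two-step normalization can be sketched as below, reading the preliminary step as a global z-score with the mean and standard deviation of all samples and the speaker step as subtraction of the per-speaker mean; this reading and all names are assumptions of the sketch.

```python
import numpy as np

def preliminary_normalize(X, mu=None, sigma=None):
    # X: (n_samples, n_stats). Global z-score with the mean/std of all samples.
    mu = X.mean(axis=0) if mu is None else mu
    sigma = X.std(axis=0) if sigma is None else sigma
    return (X - mu) / (sigma + 1e-12), mu, sigma

def speaker_normalize(X_norm, speaker_ids):
    # Subtract from each sample the mean of all samples of the same speaker.
    X_out = np.empty_like(X_norm)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        X_out[mask] = X_norm[mask] - X_norm[mask].mean(axis=0)
    return X_out
```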
Step S40: and screening the characteristic initial data to obtain characteristic target data.
It should be noted that, according to the sample processing data, the label sample target data is obtained by the preset feature selection method, and the label sample target data is used as the feature target data.
Step S50: and inputting the characteristic target data into a preset training classification model to obtain an integrated speech emotion recognition result.
It should be noted that the preset training classification model comprises a plurality of preset training classification submodels, and each preset classification submodel is a support vector machine.
Furthermore, it should be understood that the preset training classification submodels are obtained as follows:

(1) Define a matrix L that describes the local geometry of the samples:

L = (I - S)^T (I - S)

where I ∈ R^(n×n) is the identity matrix, i.e. its diagonal element values are 1 and the other element values are 0, and S is a coefficient matrix obtained by minimizing the error of reconstructing each sample from the other samples.

(2) Define the relationship G between samples, then compute the Laplacian matrix L̃ = D - G, where D is the diagonal matrix with D_ii = Σ_j G_ij. Solve the corresponding eigendecomposition problem and let V = [v_1, v_2, …, v_C] be the eigenvectors corresponding to the 2nd through (C+1)-th smallest eigenvalues, where C is the number of speech emotion categories.

(3) Optimize, with the loop in (4), an objective over the submodel matrices W_k that couples the submodels through P_kq(i, j) = W_k(i, j) · W_q(i, j).

(4) The objective is optimized with the following loop:

for k = 1 to p (p is the required number of submodels):
    initialize D_k as the identity matrix (diagonal 1, the rest 0) and set t = 0;
    iteratively optimize W_k by repeating:
    (4-1) compute W_k^(t+1) in closed form from the training data X, the identity matrix I, three balance parameters α, β, γ, the matrix L defined in (1) and the matrix V computed in (2);
    (4-2) compute the diagonal matrix D_k^(t+1), whose i-th diagonal element is computed from the i-th row of W_k^(t+1);
    (4-3) compute the coupling matrix whose element in row i and column j is obtained from P_qk(i, ·), with P_kq(i, j) = W_k(i, j) · W_q(i, j);
    (4-4) set t = t + 1; repeat until the difference between W_k^(t+1) and W_k^(t) is smaller than a predetermined threshold.
In addition, it should be understood that the feature target data is input into the preset training classification submodel to obtain speech emotion category data, the speech emotion category data is subjected to data statistics to obtain a speech emotion category data value, and an integrated speech emotion recognition result is obtained according to the speech emotion category data value.
In addition, it should be noted that, the step of obtaining the integrated speech emotion recognition result according to the speech emotion category data value is to determine whether the speech emotion category data value belongs to a preset speech emotion category threshold range; if the voice emotion type data value belongs to the preset voice emotion type threshold range, obtaining an integrated voice emotion recognition result according to the voice emotion type data value, and if the voice emotion type data value does not belong to the preset voice emotion type threshold range, returning to the step of inputting the feature target data into the preset training classification sub-model to obtain voice emotion type data.
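The voting and threshold check described above can be sketched as follows, taking the "speech emotion category data value" to be the vote count of the winning class over the submodels; this interpretation and all names are assumptions of the sketch.

```python
# Majority voting over the preset training classification submodels (illustrative).
import numpy as np
from collections import Counter

def ensemble_predict(submodels, feature_subsets, x, min_votes):
    # submodels: trained classifiers; feature_subsets: index arrays; x: feature target data.
    votes = [clf.predict(x[idx].reshape(1, -1))[0]
             for clf, idx in zip(submodels, feature_subsets)]
    label, count = Counter(votes).most_common(1)[0]
    # Accept only if the vote count reaches the preset threshold, otherwise signal a retry.
    return label if count >= min_votes else None
```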
Furthermore, it should be understood that the above is an identification phase, the steps of which are:
in this stage, the speech signal of the emotion sample to be recognized of the known speaker is processed, and the emotion classification of the emotion sample to be recognized is obtained according to the training classifier obtained in the training stage. The specific process is as follows:
the first step is as follows: extracting MFCC, LFPC, LPCC, ZCP A, PLP and R-PLP characteristics from a voice signal of an emotion sample to be recognized, wherein the number of Mel filters of the MFCC and the LFPC is 40; the linear prediction orders of the LPCC, the PLP and the R-PLP are respectively 12, 16 and 16; the frequency segmentation of the ZCAP is as follows: 0,106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689, 3089, 3522, 4000. So that the dimensions of each class of features of each statement are respectively: t 39, t 40, t 12, t 16, wherein t is the number of frames of the emotion sentences to be identified, and the number after the multiplication number is the dimension of each frame feature. To obtain the change of the speech signal in the time dimension, a first derivative, a second derivative, is also calculated for the above features in the time dimension. Finally, the dimensionality of each type of feature is respectively as follows: t 117, t 140, t 36, t 48. The speech signal features extracted from the emotion sentences to be recognized are combined by all the features, and the dimension is t (117+140+36+48+48+48).
The second step: the following statistical functions are used: mean, standard deviation (std), minimum (min), maximum (max), kurtosis and skewness, obtaining the feature statistical result x of the emotion sentence to be recognized.
The third step: speaker normalization. First, the preliminary normalization result x' of x is computed with formula (1), using the μ and σ obtained in the training stage; the speaker normalization result x̃ is then computed from x' with formula (2).
The fourth step: according to the feature selection result V obtained in the training process, compute the feature selection result z by taking the components of x̃ indexed by V.
The fifth step: and obtaining the speech emotion class l of z by using the classifier obtained in the training process.
The corpus used for evaluating the emotion recognition effect is the German EMO-DB speech emotion database, a standard database in the field of speech emotion recognition. The training process is completed first, and then the recognition test is performed. The test is carried out in a 5-fold cross-validation fashion. Seven emotions can be recognized: anger, fear, irritability, disgust, happiness, neutrality and sadness. In the speaker-dependent case the average classification accuracy is 90.84%, and apart from some confusion involving anger, the emotions are well distinguished from each other. In the speaker-independent case the average classification accuracy is 86.50%.
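The 5-fold evaluation protocol can be sketched as below, with a plain linear SVM standing in for the full ensemble and random data standing in for the EMO-DB feature statistics; the accuracies quoted above come from the patent, not from this sketch.

```python
# Sketch of 5-fold cross-validated accuracy (placeholder data, stand-in classifier).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 437))     # stand-in for the 437-dimensional statistics
y = rng.integers(0, 7, size=200)    # stand-in labels for the 7 emotions
print("mean 5-fold accuracy:", cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean())
```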
In this embodiment, feature extraction is performed on the voice sample to be recognized to obtain voice signal features of a preset dimension; feature statistics are performed on the voice signal features through a preset statistical function to obtain a feature statistical result; the feature statistical result is normalized to obtain feature initial data; the feature initial data are screened to obtain feature target data; and the feature target data are input into a preset training classification model to obtain an integrated speech emotion recognition result. In this way, multiple feature subsets with sufficient ability to describe the data are found, the data are used more fully, and the speech emotion recognition result can be obtained more accurately.
Referring to fig. 3, fig. 3 is a flowchart illustrating a method for integrated speech emotion recognition according to a second embodiment of the present invention.
Based on the first embodiment, before the step S10, the integrated speech emotion recognition method of this embodiment further includes:
step S000: and extracting the characteristics of the training recognition voice sample to obtain the training voice signal characteristics with preset dimensionality.
Step S001: and carrying out feature statistics on the training voice signal features through a preset statistical function to obtain a training feature statistical result.
Step S002: and carrying out primary normalization processing on the training characteristic statistical result to obtain training sample data.
Step S003: and carrying out speaker normalization processing on the training sample data to obtain training sample processing data.
Step S004: and obtaining label training sample processing data through a preset feature selection algorithm according to the training sample processing data.
Step S005: and obtaining the class label corresponding to the label training sample processing data according to the label training sample processing data.
Step S006: and establishing a preset training classification model according to the label training sample processing data and the class label.
Further, it should be understood that, in the training phase, (1-1) the features of the labeled training samples and the features of the unlabeled samples of each speaker are extracted; (1-2) feature statistics are performed on all the features; (1-3) the normalization algorithm is applied to the feature statistical results; (1-4) a plurality of feature submodels are selected using a unified feature selection framework; (1-5) a support vector machine is trained for each feature submodel; and (1-6) the classification result is obtained by voting over the results of all the support vector machines.
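A compact sketch of steps (1-4) through (1-6): several feature subsets are chosen, one support vector machine is trained per subset, and classification is done by voting as in the earlier voting sketch. The random subsets here are only a placeholder for the patent's unified feature selection framework, and all names are illustrative.

```python
# One SVM per feature submodel (illustrative; random subsets stand in for the
# unified feature selection framework of the patent).
import numpy as np
from sklearn.svm import SVC

def train_ensemble(X, y, n_submodels=5, subset_size=100, seed=0):
    # X: (n_samples, n_features) normalized feature statistics; y: emotion labels.
    rng = np.random.default_rng(seed)
    subsets, models = [], []
    for _ in range(n_submodels):
        idx = rng.choice(X.shape[1], size=subset_size, replace=False)
        subsets.append(idx)
        models.append(SVC(kernel="linear").fit(X[:, idx], y))
    return models, subsets
```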
Furthermore, to facilitate understanding, the specific steps of the training phase are as follows:
in this stage, training is performed for all speakers respectively to obtain a classifier corresponding to each speaker, and the specific process is as follows:
the first step is as follows: extracting the characteristics of MFCC, LFPC, LPCC, ZCAP, PLP and R-PLP from all voice training signals (all voice signals with label samples and voice signals without label samples of a certain speaker in each training), wherein the number of Mel filters of MFCC and LFPC is 40; the linear prediction orders of the LPCC, the PLP and the R-PLP are respectively 12, 16 and 16; the frequency segmentation of the ZCAP is as follows: 0,106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689, 3089, 3522, 4000. So that the dimensions of each class of features of each statement are respectively: ti 39, ti 40, ti 12, ti 16, where ti is the number of frames in the i-th sentence, and the number following the multiplier is the dimension of each frame feature. To obtain the change of the speech signal in the time dimension, a first derivative, a second derivative, is also calculated for the above features in the time dimension. Finally, the dimensionality of each type of feature is respectively as follows: ti 117, ti 140, ti 36, ti 48. The extracted speech signal features of the ith sample are combined from all the above features, and have a dimension ti (117+140+36+48+48+48).
The second step: the following statistical functions are used: mean, standard deviation (std), minimum (min), maximum (max), kurtosis and skewness, giving the statistics of the above features in the time dimension. The feature statistical results of the labeled samples are denoted {x_1, x_2, …, x_n}, and the feature statistical results of the unlabeled training samples of a given speaker are denoted {x_{n+1}, x_{n+2}, …, x_{n+m}}, where n is the number of labeled samples and m is the number of unlabeled samples of that speaker.
The third step: normalize the feature statistical results. The method comprises the following steps:
(1) All feature statistical results {x_1, x_2, …, x_{n+m}} obtained in the second step are subjected to the preliminary normalization

x'_i = (x_i - μ) / σ,   i = 1, 2, …, n + m,   (1)

where μ denotes the mean of all samples and σ² denotes the variance of all samples.
(2) The preliminary normalization results {x'_1, x'_2, …, x'_{n+m}} are subjected to speaker normalization:

x̃_i = x'_i - (1 / n_i) Σ_{j=1}^{n_i} x'_j,   (2)

where x'_j, j = 1, 2, …, n_i, are the training samples with the same speaker label as x'_i, and n_i is the number of training samples with the same speaker label as x'_i.
The fourth step: train the semi-supervised feature selection algorithm (this is the preset feature selection method referred to above). The algorithm comprises the following steps:
(1) The relationship between samples is defined as follows:

S_ij = 1 / n_{l_i}   if x̃_i and x̃_j are both labeled and l_i = l_j;
S_ij = A_ij          if at least one of x̃_i and x̃_j is unlabeled and x̃_j ∈ N_k(x̃_i) or x̃_i ∈ N_k(x̃_j);
S_ij = 0             otherwise;

where S_ij represents the relationship between samples, n_{l_i} is the number of samples with class label l_i, l_i and l_j are the class labels of samples x̃_i and x̃_j, and N_k(x̃_i) is the k-nearest neighborhood of sample x̃_i. A_ij is a heat-kernel similarity computed from the Euclidean distance d(x̃_i, x̃_j) between x̃_i and x̃_j, scaled by the Euclidean distance from x̃_i to its k-th nearest neighbor and the Euclidean distance from x̃_j to its k-th nearest neighbor.
(2) Compute the Laplacian matrix L = D - S, where D is the diagonal matrix with D_ii = Σ_j S_ij.
(3) Solve the eigendecomposition problem associated with L and let Y = [y_1, y_2, …, y_C] be the eigenvectors corresponding to the 2nd through (C+1)-th smallest eigenvalues, where C is the number of speech emotion categories.
(4) Solve an L1-regularized regression of each eigenvector y_c on the training data using the Least Angle Regression algorithm (LARs), obtaining C sparse coefficient vectors {w_1, w_2, …, w_C}, where y_c is the c-th eigenvector obtained in step (3).
(5) Compute an importance score score(j) for each feature from the C sparse coefficient vectors, where j denotes the j-th feature and score(j) denotes the score of the j-th feature.
(6) Return the indices of the d features with the largest scores as the feature selection result V, where d is the number of features to be selected.
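Steps (1) through (6) can be sketched as follows. The exact similarity kernel for A_ij, the use of the generalized eigenproblem L y = λ D y, the choice of scikit-learn's LassoLars for the L1-regularized least-angle regression, and the scoring rule score(j) = max_c |w_c(j)| are all assumptions where the original formulas are not fully legible; function and parameter names are illustrative.

```python
# Illustrative sketch of the semi-supervised feature selection (steps (1)-(6)).
# The A_ij kernel, the generalized eigenproblem and the scoring rule are assumptions.
import numpy as np
from scipy.linalg import eigh
from sklearn.linear_model import LassoLars

def build_relation_matrix(X, labels, k=5):
    # Step (1). X: (n_samples, dim); labels: class label per sample, -1 if unlabeled.
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    sigma = np.sort(dist, axis=1)[:, k]          # distance to the k-th nearest neighbor
    knn = np.argsort(dist, axis=1)[:, 1:k + 1]   # k nearest neighbors (self excluded)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if labels[i] != -1 and labels[j] != -1:
                if labels[i] == labels[j]:
                    S[i, j] = 1.0 / np.sum(labels == labels[i])
            elif j in knn[i] or i in knn[j]:
                # Assumed heat-kernel similarity with local scaling.
                S[i, j] = np.exp(-dist[i, j] ** 2 / (sigma[i] * sigma[j] + 1e-12))
    return S

def select_features(X, S, n_classes, d, alpha=0.01):
    # Steps (2)-(6): Laplacian eigenvectors, LARs-based sparse regression, scoring.
    D = np.diag(S.sum(axis=1))
    L = D - S
    _, eigvecs = eigh(L, D + 1e-9 * np.eye(len(D)))   # generalized eigenproblem (assumed)
    Y = eigvecs[:, 1:n_classes + 1]                   # 2nd..(C+1)-th smallest eigenvectors
    coefs = np.stack([LassoLars(alpha=alpha).fit(X, Y[:, c]).coef_
                      for c in range(n_classes)])
    scores = np.abs(coefs).max(axis=0)                # importance score per feature (assumed)
    return np.argsort(scores)[::-1][:d]               # indices of the d highest-scoring features
```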
In addition, it should be noted that the semi-supervised feature selection algorithm takes into account the manifold structure of the data, the class structure of the data and the information provided by the unlabeled samples, thereby preventing the feature selection result from over-fitting the training data and selecting features that are useful for recognizing the speaker's speech emotion.
The fifth step: obtaining the feature selection result { z of the labeled sample according to the feature selection result V1,z2,…,zn}. Will be provided withThe feature selection results are stored in the speech emotion vector database.
The sixth step: train the classifier using {z_1, z_2, …, z_n} and their class labels.
Further, it should be understood that the feature selection results {z_1, z_2, …, z_n} of the labeled samples are obtained from the feature selection result V, and the speech emotion classes of {z_1, z_2, …, z_n} are obtained with the classifier obtained in the training process.
In addition, it should be noted that after the training process is completed, the recognition test is performed. The test is carried out in a 5-fold cross-validation fashion. Seven emotions can be recognized: anger, fear, irritability, disgust, happiness, neutrality and sadness. In the speaker-dependent case the average classification accuracy is 90.84%, and apart from some confusion involving anger, the emotions are well distinguished from each other. In the speaker-independent case the average classification accuracy is 86.50%.
In this embodiment, features are extracted from the training recognition voice samples to obtain training voice signal features of a preset dimension; feature statistics are performed on the training voice signal features through a preset statistical function to obtain a training feature statistical result; preliminary normalization processing is performed on the training feature statistical result to obtain labeled training sample data and unlabeled training sample data; speaker normalization processing is performed on the labeled training sample data and the unlabeled training sample data to obtain labeled training sample processing data and unlabeled training sample processing data; labeled training sample selection data are obtained through the trained semi-supervised feature selection algorithm according to the labeled training sample processing data and the unlabeled training sample processing data; the category labels corresponding to the labeled training samples are obtained according to the labeled training sample selection data; and the preset training classification model is established according to the labeled training sample selection data and the category labels. In this way, the influence of unlabeled samples of other speakers is avoided, the influence of the speaker on the manifold structure of the speech data is exploited to the greatest extent, and features that are useful for recognizing the speaker's speech emotion are selected.
In addition, an embodiment of the present invention further provides a storage medium, where the storage medium stores an integrated speech emotion recognition program, and the integrated speech emotion recognition program, when executed by a processor, implements the steps of the integrated speech emotion recognition method described above.
Referring to fig. 4, fig. 4 is a block diagram illustrating a first embodiment of an integrated speech emotion recognition device according to the present invention.
As shown in fig. 4, the integrated speech emotion recognition apparatus according to the embodiment of the present invention includes: the obtaining module 4001 is configured to perform feature extraction on a voice sample to be recognized to obtain voice signal features of a preset dimension; the statistic module 4002 is configured to perform feature statistics on the speech signal features through a preset statistic function to obtain a feature statistical result; a processing module 4003, configured to perform normalization processing on the feature statistical result to obtain feature initial data; a screening module 4004, configured to screen the feature initial data to obtain feature target data; the determining module 4005 is configured to input the feature target data into a preset training classification model, so as to obtain an integrated speech emotion recognition result.
The obtaining module 4001 is configured to perform feature extraction on a voice sample to be recognized, and obtain a voice signal feature with a preset dimension.
In addition, before the step of performing feature extraction on a voice sample to be recognized to obtain voice signal features of a preset dimension, feature extraction is performed on a training recognition voice sample to obtain training voice signal features of the preset dimension, feature statistics is performed on the training voice signal features through a preset statistical function to obtain a training feature statistical result, preliminary normalization processing is performed on the training feature statistical result to obtain labeled training sample data and unlabeled training sample data, speaker normalization processing is performed on the labeled training sample data and the unlabeled training sample data to obtain labeled training sample processing data and unlabeled training sample processing data, and labeled training sample selection data is obtained through a training half-governor characteristic selection algorithm according to the labeled training sample processing data and the unlabeled training sample processing data, and obtaining a category label corresponding to the label training sample according to the label training sample selection data, and establishing a preset training classification model according to the label training sample selection data and the category label.
It should be noted that the speech signal features extracted from the speech sample to be recognized include: mel Frequency Cepstrum Coefficient (MFCC), Log Frequency Power Coefficient (LFPC), Linear Predictive Cepstrum Coefficient (LPCC), Zero Crossing with peak Amplitude (zcap), Perceptual Linear Prediction (PLP), Rasta filter Perceptual Linear prediction (R-PLP).
It should be understood that the above feature extraction result of each type of feature is a two-dimensional matrix, wherein one dimension is a time dimension, and then each type of feature F is calculatediFirst derivative in the time dimension Δ FiSecond derivative Δ Δ FiConnecting the original features, the first derivative result and the second derivative result in series in a non-time dimension to form a final feature extraction result of each type of features; and (4) connecting the final feature extraction results of the features of all the classes in series on a non-time dimension to obtain the feature extraction result of the sample.
Further, for ease of understanding, the following is exemplified:
suppose, MFCC corresponds to FMFCC∈R39×z,ΔFMFCC∈R39×z,ΔΔFi∈R39×zWherein z is the number of frames, i.e. the number of degrees of the time dimension, the concatenation result in the non-time dimension
Figure RE-GDA0002370638650000161
When MFCC and LPCC are connected, suppose
Figure RE-GDA0002370638650000162
After being connected in seriesIs composed of
Figure RE-GDA0002370638650000163
Furthermore, it should be understood that at each time of speech signal feature extraction, MFCC, L FPC, LPCC, ZCPA, PLP, R-PLP features are extracted, where the number of Mel filters of MFCC, LFPC is 40; the linear prediction orders of the LPCC, the PLP and the R-PLP are respectively 12, 16 and 16; the frequency segmentation of the ZCAP is as follows: 0,106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689, 3089, 3522, 4000. So that the dimensions of each class of features of each statement are respectively: ti 39, ti 40, ti 12, ti 16, where ti is the number of frames in the i-th sentence, and the number following the multiplier is the dimension of each frame feature. To obtain the change of the speech signal in the time dimension, a first derivative, a second derivative, is also calculated for the above features in the time dimension. Finally, the dimensionality of each type of feature is respectively as follows: ti 117, ti 140, ti 36, ti 48. The extracted speech signal features of the ith sample are combined from all the above features, and have a dimension ti (117+140+36+48+48+48).
The statistic module 4002 is configured to perform feature statistics on the speech signal features through a preset statistic function, so as to obtain a feature statistical result.
The statistical results of the above features in the time dimension are obtained using a statistical function using a mean (mean), a standard deviation (standard deviation), a minimum (min), a maximum (max), a kurtosis (kurtosis), and a skewness (skewness).
The processing module 4003 is configured to perform normalization processing on the feature statistical result to obtain an operation of feature initial data.
It should be understood that the feature statistics are subjected to a preliminary normalization process to obtain sample feature data, the sample feature data are subjected to a speaker normalization process to obtain sample processing data, and the sample processing data are used as feature initial data.
In addition, the feature statistics result of the labeled sample known in the above steps is denoted as { x1,x2,…,xnAnd recording the characteristic statistical result of the unlabeled training sample of a certain speaker as { x }n+1,xn+2,…,xn+mGet statistics of all features { x }1,x2,…,xn+mPreliminary normalization was performed using the following equations, respectively:
Figure RE-GDA0002370638650000171
wherein
Figure RE-GDA0002370638650000172
The mean of all the samples is represented by,
Figure RE-GDA0002370638650000173
represents the variance of all samples;
thereafter, the preliminary normalization result { x'1,x'2,…,x'n+mSpeaker normalization is performed using the following equation: .
Figure RE-GDA0002370638650000174
Wherein x'j,j=1,2,…,niIs of the training sample with x'iSamples with the same speaker label, niIs of x 'in the training sample'iThe number of samples with the same speaker label.
Furthermore, it should be understood that a plurality of feature subsets are obtained through a unified feature selection framework, and the data description capability and classification capability of each feature subset can be ensured. The base classifier trained on the basis has better diversity and classification strength. The number of the base classifiers can be obviously reduced, and the classification capability of the base classifiers can be improved.
Meanwhile, the feature statistical results are normalized by using an improved normalization algorithm. The normalization algorithm comprises two steps of initial normalization and speaker normalization, wherein the initial normalization uses the mean value and the variance of all samples to normalize each sample, and can avoid the influence caused by different characteristic value ranges; the speaker normalization only needs to use the mean value of all samples of the speaker, and the mean value estimation can obtain higher confidence coefficient when the number of the samples is less, so that a better speaker normalization effect can be achieved under the condition that the number of unlabeled samples of the speaker is less.
The screening module 4004 is configured to screen the feature initial data to obtain a feature target data.
It should be noted that, according to the sample processing data, the label sample processing data and the non-label sample processing data are obtained by a preset feature selection method, and the label sample target data is used as the feature target data.
The determining module 4005 is configured to input the feature target data into a preset training classification model, and obtain an operation of integrating a speech emotion recognition result.
It should be noted that the preset training classification model includes a plurality of preset training classification submodels, and each preset classification submodel supports a vector machine.
Furthermore, it should be understood that the above steps of obtaining the preset training classification submodel are:
(1) defining a matrix L describing the local geometry of the sample:
L=(I-S)T(I-S)
in which I is ∈ Rn×nIs an identity matrix, i.e. the diagonal element value is 1 and the other element values are 0; s is optimized by the following formula:
Figure RE-GDA0002370638650000181
(2) the relationship between samples is defined using the following equation:
Figure RE-GDA0002370638650000182
a Laplace matrix is then calculated
Figure RE-GDA0002370638650000183
Where D is a diagonal matrix Dii=∑jGij(ii) a Solving the problem of feature decomposition
Figure RE-GDA0002370638650000184
And let V be [ V ]1,v2,L,vC]The feature vectors corresponding to the minimum 2 to C +1 feature values, wherein C is the category number of the speech emotion;
(3) optimization of the following equation Using (4)
Figure RE-GDA0002370638650000185
Wherein P iskq(i,j)=Wk(i,j)*Wq(i,j)
(4) The above formula was optimized using the following cycle
for k=1top
Initialization
Figure RE-GDA0002370638650000186
Is an identity matrix, t is set to be 0 (the diagonal is 1, and the rest is 0), p is the required number of submodels,
Figure RE-GDA0002370638650000187
iteratively optimizing W using the following loopkRepeating:
(4-1) calculation Using the following formula
Figure RE-GDA0002370638650000188
Figure RE-GDA0002370638650000189
Wherein X is training data, I is an identity matrix, α, gamma is three balance parameters, L is obtained by calculation in the step (1-4-1), and V is obtained by calculation in the step (1-4-2).
(4-2) calculation of
Figure RE-GDA00023706386500001810
Is a diagonal matrix, wherein
Figure RE-GDA00023706386500001811
Is calculated by the following formula:
Figure RE-GDA00023706386500001812
(4-3) calculation of
Figure RE-GDA00023706386500001813
Wherein
Figure RE-GDA00023706386500001814
The ith row and jth column of (a) are calculated by:
Figure RE-GDA00023706386500001815
in the formula Pqk(i, ·) the element in row i and column j is calculated by:
Pkq(i,j)=Wk(i,j)*Wq(i,j)
(4-4) t ═ t +1, up to
Figure RE-GDA0002370638650000191
And
Figure RE-GDA0002370638650000192
the difference is less than a predetermined threshold.
In addition, it should be understood that the feature target data is input into the preset training classification submodel to obtain speech emotion category data, the speech emotion category data is subjected to data statistics to obtain a speech emotion category data value, and an integrated speech emotion recognition result is obtained according to the speech emotion category data value.
In addition, it should be noted that, the step of obtaining the integrated speech emotion recognition result according to the speech emotion category data value is to determine whether the speech emotion category data value meets a preset speech emotion category threshold range; if the voice emotion type data value meets the preset voice emotion type threshold range, obtaining an integrated voice emotion recognition result according to the voice emotion type data value, and if the voice emotion type data value does not meet the preset voice emotion type threshold range, returning to the step of inputting the feature target data into the preset training classification submodel to obtain voice emotion type data.
Furthermore, it should be understood that the recognition stage proceeds as follows: the speech signal of an emotion sample to be recognized from a known speaker is processed, and the emotion category of the sample is obtained with the classifier obtained in the training stage. The specific process is as follows:
the first step is as follows: extracting MFCC, LFPC, LPCC, ZCP A, PLP and R-PLP characteristics from a voice signal of an emotion sample to be recognized, wherein the number of Mel filters of the MFCC and the LFPC is 40; the linear prediction orders of the LPCC, the PLP and the R-PLP are respectively 12, 16 and 16; the frequency segmentation of the ZCAP is as follows: 0,106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689, 3089, 3522, 4000. So that the dimensions of each class of features of each statement are respectively: t 39, t 40, t 12, t 16, wherein t is the number of frames of the emotion sentences to be identified, and the number after the multiplication number is the dimension of each frame feature. To obtain the change of the speech signal in the time dimension, a first derivative, a second derivative, is also calculated for the above features in the time dimension. Finally, the dimensionality of each type of feature is respectively as follows: t 117, t 140, t 36, t 48. The speech signal features extracted from the emotion sentences to be recognized are combined by all the features, and the dimension is t (117+140+36+48+48+48).
The second step: the following statistical functions are used: mean, standard deviation, minimum, maximum, kurtosis and skewness, to obtain the feature statistics result x of the emotion sentence to be recognized.
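A minimal sketch of this step: the six statistics applied along the time axis of a (t, d) frame-level feature matrix, using numpy and scipy; the function name is illustrative.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def utterance_statistics(features):
    """Sketch: mean, standard deviation, min, max, kurtosis and skewness per
    feature dimension, concatenated into the statistics vector x."""
    return np.concatenate([
        features.mean(axis=0),
        features.std(axis=0),
        features.min(axis=0),
        features.max(axis=0),
        kurtosis(features, axis=0),
        skew(features, axis=0),
    ])
```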
The third step: speaker normalization. First, the preliminary normalization result x' of x is calculated with formula (1), using the μ and σ obtained in the training stage; the speaker normalization result is then calculated from x' with formula (2).
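Formulas (1) and (2) are referenced but not reproduced in this excerpt; a common reading is a global z-score with the training-stage μ and σ followed by a per-speaker z-score, and the sketch below is written under that assumption.

```python
import numpy as np

def normalize(x, mu, sigma, speaker_mu, speaker_sigma, eps=1e-8):
    """Sketch: preliminary normalization with the training statistics (assumed
    form of formula (1)), then speaker normalization (assumed form of (2))."""
    x_prime = (x - mu) / (sigma + eps)                     # corpus-level z-score
    return (x_prime - speaker_mu) / (speaker_sigma + eps)  # speaker-level z-score
```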
The fourth step: according to the feature selection vectors V obtained in the training process, the feature selection result z is calculated [formula image not reproduced].
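The formula for this step is also only an image in the source; assuming the feature selection result is a linear projection of the normalized statistics vector onto the selection vectors V learned in training, it might look like this one-liner.

```python
def select_features(x_norm, V):
    """Sketch: z = V^T x'' (assumption; the patent's own formula is not shown)."""
    return V.T @ x_norm
```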
The fifth step: the speech emotion category l of z is obtained with the classifier obtained in the training process.
The corpus used to evaluate the emotion recognition performance is the German EMO-DB speech emotion database, a standard database in the field of speech emotion recognition. The training process is completed first and the recognition test is then performed, using 5-fold cross-validation. Seven emotions can be recognized: anger, fear, irritability, disgust, happiness, neutrality and sadness. In the speaker-dependent case the average classification accuracy is 90.84%, and except for one emotion that is more easily confused with anger, the emotions are well distinguished from one another. In the speaker-independent case the average classification accuracy is 86.50%.
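A sketch of the evaluation protocol described here: 5-fold cross-validation over a labelled EMO-DB-style feature set, reporting mean accuracy. Using StratifiedKFold and a single SVC as a stand-in for the full ensemble are simplifying assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def five_fold_accuracy(Z, labels):
    """Sketch: 5-fold cross-validated accuracy on selected features Z
    (one row per utterance) with their emotion labels (numpy arrays)."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in skf.split(Z, labels):
        clf = SVC(kernel="rbf", C=1.0, gamma="scale")
        clf.fit(Z[train_idx], labels[train_idx])
        scores.append(accuracy_score(labels[test_idx], clf.predict(Z[test_idx])))
    return float(np.mean(scores))
```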
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in a specific application, a person skilled in the art may set the technical solution as needed, and the present invention is not limited thereto.
In this embodiment, feature extraction is performed on the speech sample to be recognized to obtain speech signal features of preset dimensionality; feature statistics are computed on the speech signal features with a preset statistical function to obtain a feature statistics result; the feature statistics result is normalized to obtain feature initial data; the feature initial data are screened to obtain feature target data; and the feature target data are input into a preset training classification model to obtain the integrated speech emotion recognition result. In this way, multiple feature subsets that are each sufficient to describe the data are found, the data are used more fully, and the speech emotion recognition result is obtained more accurately.
It should be noted that the above-described work flows are only exemplary, and do not limit the scope of the present invention, and in practical applications, a person skilled in the art may select some or all of them to achieve the purpose of the solution of the embodiment according to actual needs, and the present invention is not limited herein.
In addition, the technical details that are not described in detail in this embodiment may refer to the integrated speech emotion recognition method provided in any embodiment of the present invention, and are not described herein again.
Further, it is to be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention or portions thereof that contribute to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (e.g. Read Only Memory (ROM)/RAM, magnetic disk, optical disk), and includes several instructions for enabling a terminal device (e.g. a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An integrated speech emotion recognition method, characterized in that the method comprises:
performing feature extraction on a voice sample to be recognized to obtain voice signal features with preset dimensionality;
carrying out feature statistics on the voice signal features through a preset statistical function to obtain a feature statistical result;
carrying out normalization processing on the feature statistical result to obtain feature initial data;
screening the characteristic initial data to obtain characteristic target data;
and inputting the characteristic target data into a preset training classification model to obtain an integrated speech emotion recognition result.
2. The method of claim 1, wherein the step of normalizing the feature statistics to obtain feature initial data comprises:
carrying out primary normalization processing on the characteristic statistical result to obtain sample characteristic data;
and carrying out speaker normalization processing on the sample characteristic data to obtain sample processing data, and taking the sample processing data as characteristic initial data.
3. The method of claim 2, wherein the step of screening the characteristic initial data to obtain characteristic target data comprises:
and obtaining label sample processing data through a preset feature selection algorithm according to the sample processing data, and taking the label sample processing data as feature target data.
4. The method of claim 1, wherein before the step of extracting features of the voice sample to be recognized and obtaining the voice signal features of the preset dimension, the method further comprises:
performing feature extraction on the training recognition voice sample to obtain training voice signal features with preset dimensionality;
performing feature statistics on the training voice signal features through a preset statistical function to obtain a training feature statistical result;
carrying out primary normalization processing on the training characteristic statistical result to obtain training sample data;
carrying out speaker normalization processing on the training sample data to obtain training sample processing data;
obtaining label training sample processing data through a preset feature selection algorithm according to the training sample processing data;
obtaining a class label corresponding to the label training sample processing data according to the label training sample processing data;
and establishing a preset training classification model according to the label training sample processing data and the class label.
5. The method of claim 4, wherein the pre-set training classification model comprises a plurality of pre-set training classification submodels;
the step of inputting the feature target data into a preset training classification model to obtain an integrated speech emotion recognition result comprises the following steps:
inputting the characteristic target data into the preset training classification submodel to obtain voice emotion category data;
performing data statistics on the voice emotion type data to obtain a voice emotion type data value;
and acquiring an integrated voice emotion recognition result according to the voice emotion category data value.
6. The method of claim 5, wherein the step of obtaining integrated speech emotion recognition results according to the speech emotion classification data value comprises:
judging whether the voice emotion type data value belongs to a preset voice emotion type threshold range or not;
and if the voice emotion category data value belongs to the preset voice emotion category threshold range, acquiring an integrated voice emotion recognition result according to the voice emotion category data value.
7. The method of claim 6, wherein the step of determining whether the speech emotion classification data value falls within a preset speech emotion classification threshold range further comprises:
and if the voice emotion category data value does not belong to the preset voice emotion category threshold range, returning to the step of inputting the characteristic target data into the preset training classification submodel to obtain voice emotion category data.
8. An integrated speech emotion recognition apparatus, characterized in that the apparatus comprises:
the acquisition module is used for extracting the characteristics of the voice sample to be recognized to acquire the voice signal characteristics with preset dimensionality;
the statistical module is used for carrying out feature statistics on the voice signal features through a preset statistical function to obtain a feature statistical result;
the processing module is used for carrying out normalization processing on the characteristic statistical result to obtain characteristic initial data;
the screening module is used for screening the characteristic initial data to obtain characteristic target data;
and the determining module is used for inputting the characteristic target data into a preset training classification model to obtain an integrated speech emotion recognition result.
9. An electronic device, characterized in that the device comprises: a memory, a processor and an integrated speech emotion recognition program stored on the memory and executable on the processor, the integrated speech emotion recognition program being configured to implement the steps of the integrated speech emotion recognition method as claimed in any of claims 1 to 7.
10. A storage medium having stored thereon an integrated speech emotion recognition program, which when executed by a processor implements the steps of the integrated speech emotion recognition method as claimed in any one of claims 1 to 7.
CN201911246545.XA 2019-12-06 2019-12-06 Integrated speech emotion recognition method, device, equipment and storage medium Pending CN110931043A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911246545.XA CN110931043A (en) 2019-12-06 2019-12-06 Integrated speech emotion recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110931043A true CN110931043A (en) 2020-03-27

Family

ID=69858247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911246545.XA Pending CN110931043A (en) 2019-12-06 2019-12-06 Integrated speech emotion recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110931043A (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008754A (en) * 2014-05-21 2014-08-27 华南理工大学 Speech emotion recognition method based on semi-supervised feature selection

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950644A (en) * 2020-08-18 2020-11-17 东软睿驰汽车技术(沈阳)有限公司 Model training sample selection method and device and computer equipment
CN111950644B (en) * 2020-08-18 2024-03-26 东软睿驰汽车技术(沈阳)有限公司 Training sample selection method and device for model and computer equipment
CN112530409A (en) * 2020-12-01 2021-03-19 平安科技(深圳)有限公司 Voice sample screening method and device based on geometry and computer equipment
WO2022116442A1 (en) * 2020-12-01 2022-06-09 平安科技(深圳)有限公司 Speech sample screening method and apparatus based on geometry, and computer device and storage medium
CN112530409B (en) * 2020-12-01 2024-01-23 平安科技(深圳)有限公司 Speech sample screening method and device based on geometry and computer equipment
CN113889149A (en) * 2021-10-15 2022-01-04 北京工业大学 Speech emotion recognition method and device
CN113889149B (en) * 2021-10-15 2023-08-29 北京工业大学 Speech emotion recognition method and device

Similar Documents

Publication Publication Date Title
CN111179975B (en) Voice endpoint detection method for emotion recognition, electronic device and storage medium
Xia et al. A multi-task learning framework for emotion recognition using 2D continuous space
CN112818861B (en) Emotion classification method and system based on multi-mode context semantic features
CN109767765A (en) Talk about art matching process and device, storage medium, computer equipment
CN110931043A (en) Integrated speech emotion recognition method, device, equipment and storage medium
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
Seng et al. Video analytics for customer emotion and satisfaction at contact centers
CN110168535A (en) A kind of information processing method and terminal, computer storage medium
US20230206928A1 (en) Audio processing method and apparatus
CN107767881B (en) Method and device for acquiring satisfaction degree of voice information
Noroozi et al. Supervised vocal-based emotion recognition using multiclass support vector machine, random forests, and adaboost
CN110598869B (en) Classification method and device based on sequence model and electronic equipment
CN108509416A (en) Sentence realizes other method and device, equipment and storage medium
CN110827797A (en) Voice response event classification processing method and device
CN111784372A (en) Store commodity recommendation method and device
Qi et al. Exploiting low-rank tensor-train deep neural networks based on Riemannian gradient descent with illustrations of speech processing
CN110956981B (en) Speech emotion recognition method, device, equipment and storage medium
Liu et al. Learning salient features for speech emotion recognition using CNN
CN113053395A (en) Pronunciation error correction learning method and device, storage medium and electronic equipment
CN113111855B (en) Multi-mode emotion recognition method and device, electronic equipment and storage medium
CN108831487A (en) Method for recognizing sound-groove, electronic device and computer readable storage medium
CN110942358A (en) Information interaction method, device, equipment and medium
CN112633381B (en) Audio recognition method and training method of audio recognition model
CN114765028A (en) Voiceprint recognition method and device, terminal equipment and computer readable storage medium
CN113421573A (en) Identity recognition model training method, identity recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200327)