CN113409821A - Method for recognizing unknown emotional state of voice signal - Google Patents

Method for recognizing unknown emotional state of voice signal

Info

Publication number
CN113409821A
Authority
CN
China
Prior art keywords
emotion
unknown
category
sample
emotion category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110584445.9A
Other languages
Chinese (zh)
Other versions
CN113409821B (en)
Inventor
徐新洲
顾正
吕震
刘硕
吴尘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202110584445.9A priority Critical patent/CN113409821B/en
Publication of CN113409821A publication Critical patent/CN113409821A/en
Application granted granted Critical
Publication of CN113409821B publication Critical patent/CN113409821B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for recognizing unknown emotional states of speech signals. Paralinguistic features are extracted from speech segment samples whose emotional state information is unknown, combined with semantic embeddings of the emotional state labels, and classified by a synthesized-classifier approach. In the training stage, paralinguistic features are extracted from training speech segment samples of known emotion categories, the prototype weights of the known emotion categories are obtained by processing the known emotion category names, and the optimal virtual classifiers are solved for by combining the labels of the known-emotion training speech segment samples. In the testing stage, the optimal virtual classifiers are used, together with the paralinguistic features of the unknown-emotion test speech segment samples and the prototype weights of the unknown emotion categories, to decide the unknown emotion category of each test sample. The invention provides a semantic-embedding-based method for recognizing unknown emotions in speech signals and can effectively distinguish unknown emotions in speech.

Description

Method for recognizing unknown emotional state of voice signal
Technical Field
The invention belongs to the field of speech signal emotion recognition, and particularly relates to a method for recognizing unknown emotion states of speech signals.
Background
Speech Emotion Recognition (SER) has wide application in fields such as human-computer interaction. By analyzing the emotion information in a speech signal, it is possible to infer the subjective intention a speaker wishes to convey in an utterance, as well as the speaker's deeper emotional expression. In addition, analyzing the emotion information in speech makes it possible to synthesize speech with expressive emotion. In psychological disease diagnosis, related techniques can support preliminary screening of patients with depression and similar conditions, providing a basis for further diagnosis and treatment; in virtual reality, they can give robots stronger capabilities for emotion analysis and expression.
Prior art schemes cannot effectively recognize unknown emotional states in speech signals: in a large body of prior SER work, an emotional state that never appears in the training samples cannot be recognized, so an unknown emotion category cannot be decided for a speech signal sample. For example, in human-computer interaction, a machine may need to decide, upon receiving an utterance, whether the speaker is conveying a complex emotion such as trustworthiness, friendliness, or aggressiveness. Without being taught how to estimate such complex emotions or intentions, the machine cannot accomplish this task.
In the prior art, for example, the following publication: Xu X, Deng J, Cummins N, et al. Autonomous Emotion Learning in Speech: A View of Zero-Shot Speech Emotion Recognition [C]// Proc. INTERSPEECH 2019, 2019: 949-. In the disclosed recognition scheme, the emotion-space dimension values of each known-emotion training sample must be annotated, which brings a high manual workload and labeling cost and increases the complexity of the computation.
Disclosure of Invention
The invention provides a method for recognizing unknown emotional states of speech signals, aiming to solve two problems: the prior art cannot recognize unknown emotions in speech signals, and the existing scheme for recognizing unknown emotions in speech signals requires every known-emotion training sample to be annotated in every dimension of the emotion space.
In order to solve the technical problems, the invention adopts the following technical scheme:
a speech signal unknown emotion state recognition method comprises the steps of firstly establishing a speech emotion database which comprises a plurality of speech segment samples, wherein each sample has an emotion category label corresponding to the sample; dividing a voice emotion database into a training set consisting of known emotion type samples and a test set consisting of unknown emotion type samples; each sample has a known and unique emotion category label. The method comprises the following steps of:
step one, extracting and generating nFOriginal features of dimension: processing each language segment sample in the training sample set and the test sample set respectively, extracting corresponding secondary language features as original features, and regularizing the original features to obtain N(S)Regularization features corresponding to individual training samples
Figure BDA0003087629330000021
And normalized feature x corresponding to any one test sample(U)
Step two, perform semantic embedding mapping on the known emotion category names to generate the semantic embedding prototypes of the known emotion categories A^(S) = [a_1^(S), a_2^(S), ..., a_{c^(S)}^(S)], where c^(S) is the number of known emotion categories and n_A is the semantic embedding dimension of the emotion category names;
step three, a prototype matrix A of the known emotion category(S)And a virtual class prototype matrix
Figure BDA0003087629330000023
Calculating to obtain a prototype weight matrix of the known emotion category
Figure BDA0003087629330000024
Step four, use the paralinguistic features X^(S) of the known-emotion speech segment samples and the emotion category labels Y^(S) of the corresponding samples, together with the semantic embedding prototypes A^(S) of the known emotion categories and the emotion category prototype weight matrix S, to optimize the linear virtual classifiers W^(P) according to the optimization target f (detailed below), and obtain the optimal virtual classifiers W^(P)*;
Step five, testing: for the paralinguistic feature x^(U) of each unknown-emotion test speech segment sample, use the optimal virtual classifiers obtained in step four to perform the classification decision of the unknown emotion category for each test sample.
Further, the normalization processing in step one is as follows:
The pre-normalization feature column vector of any speech segment sample is x^(0), and the set of pre-normalization feature column vectors of the N^(S) known-emotion training samples is X^(0) = [x_1^(0), x_2^(0), ..., x_{N^(S)}^(0)]; let x_{·j}^(0) denote the j-th feature element of x^(0).
For the element x_{·j}^(0) corresponding to feature j of the feature column vector x^(0) of any sample, the normalization formula is:
x_{·j} = (x_{·j}^(0) − min_j(X^(0))) / (max_j(X^(0)) − min_j(X^(0)))        (2)
where max_j(X^(0)) denotes the largest element in the j-th row of X^(0), min_j(X^(0)) denotes the smallest element in the j-th row of X^(0), and x_{·j} is the normalization result of x_{·j}^(0).
Applying formula (2) to all elements of any sample yields the normalized feature column vector x = [x_{·1}, x_{·2}, ..., x_{·n_F}]^T of any training or test sample; the normalized feature vectors of the speech segment samples belonging to the known-emotion training sample set form the normalized training feature set X^(S).
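A minimal sketch of this min-max normalization, assuming the features are held in NumPy arrays with one column per sample as in the notation above (variable names are illustrative, not from the patent):

```python
import numpy as np

def minmax_normalize(X_train, X_test):
    """Min-max normalize each feature (row) with the training-set
    minimum and maximum, as in formula (2).

    X_train: (n_F, N_S) array of known-emotion training features X^(0).
    X_test:  (n_F, N_U) array of test features, normalized with the
             same per-row statistics as the training set.
    """
    row_min = X_train.min(axis=1, keepdims=True)   # smallest element of row j over training samples
    row_max = X_train.max(axis=1, keepdims=True)   # largest element of row j over training samples
    span = np.where(row_max - row_min > 0, row_max - row_min, 1.0)  # guard against constant features
    X_train_norm = (X_train - row_min) / span
    X_test_norm = (X_test - row_min) / span
    return X_train_norm, X_test_norm
```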
Further, the semantic embedding mapping in step two can be implemented by applying a pre-trained word vector model to the emotion category names: feeding an emotion category name into the pre-trained model yields the n_A-dimensional semantic embedding vector of that category. The semantic embedding vectors of the known emotion categories corresponding to the training set are denoted A^(S) = [a_1^(S), ..., a_{c^(S)}^(S)]; for the c^(U) unknown emotion categories to be predicted for the test set, the semantic embedding vectors are denoted A^(U) = [a_1^(U), ..., a_{c^(U)}^(U)].
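As one possible realization of this mapping, the sketch below looks up pre-trained word vectors for the category names with gensim; the model path and the handling of multi-word names (averaging the word vectors) are assumptions for illustration, not details fixed by the patent:

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical path to a pre-trained 300-dimensional word2vec model (n_A = 300).
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def embed_category(name):
    """Return the n_A-dimensional semantic embedding of an emotion category name.
    Multi-word names such as 'hot anger' are averaged word by word (an assumption)."""
    words = [w for w in name.split() if w in wv]
    return np.mean([wv[w] for w in words], axis=0)

known_names = ["amusement", "anxiety", "pride", "relief"]          # example known categories
A_S = np.stack([embed_category(n) for n in known_names], axis=1)   # n_A x c_S prototype matrix
```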
Further, in the prototype weight matrix S of the known emotion categories in step three, the element corresponding to virtual category c_P and known emotion category c_S is:
s(c_P, c_S) = exp(−d(a_{c_S}^(S), b_{c_P})) / Σ_{c=1}^{c^(P)} exp(−d(a_{c_S}^(S), b_c))
where the distance measure between the semantic embedding prototype a_{c_S}^(S) of known emotion category c_S and the virtual category prototype b_{c_P} is
d(a_{c_S}^(S), b_{c_P}) = ||a_{c_S}^(S) − b_{c_P}||^2 / σ^2,
and σ^2 is the distance weight.
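A small sketch of this weight computation, assuming the softmax-over-distance form written above (column c_S of S holds the weights of all virtual categories for known category c_S):

```python
import numpy as np

def prototype_weights(A_S, B, sigma2):
    """Compute the c_P x c_S prototype weight matrix S.

    A_S: (n_A, c_S) semantic embedding prototypes of the known emotion categories.
    B:   (n_A, c_P) virtual category prototypes.
    sigma2: distance weight sigma^2.
    """
    # Squared Euclidean distances d(a_cS, b_cP) divided by sigma^2, shape (c_P, c_S).
    diff = B[:, :, None] - A_S[:, None, :]          # (n_A, c_P, c_S)
    d = (diff ** 2).sum(axis=0) / sigma2
    expd = np.exp(-d)
    return expd / expd.sum(axis=0, keepdims=True)   # normalize over the virtual categories
```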
Further, the virtual category prototype matrix B in step three can be constructed in either of the following two ways:
(1) randomly generate an n_A × c^(P) matrix with elements drawn from a uniform distribution on [0, 1];
(2) set B to the semantic embedding matrix A^(S) of the known emotion categories.
Further, the optimization target for obtaining the optimal virtual classifiers W^(P)* in step four is:
W^(P)* = argmin_{W^(P)} Σ_{n=1}^{N^(S)} Σ_{c=1}^{c^(S)} L(y_c^(n), (w_c^(S))^T x_n^(S)) + (τ/2) ||W^(P)||_F^2
where the regularization term weight τ > 0, and the linear classifiers of the known emotion categories are synthesized from the virtual classifiers through the prototype weight matrix:
W^(S) = [w_1^(S), w_2^(S), ..., w_{c^(S)}^(S)] = W^(P) S;
the loss function is the squared hinge loss:
L(y, f) = max(0, 1 − y·f)^2;
where y_c^(n) is the label information of the c-th known emotion category for the n-th training sample: y_c^(n) = 1 when training sample n belongs to known emotion category c, and y_c^(n) = −1 otherwise.
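A minimal training sketch under the assumptions stated above (synthesized classifiers W^(S) = W^(P)·S and squared hinge loss); the plain gradient-descent optimizer and its learning rate are illustrative choices, not specified by the patent:

```python
import numpy as np

def train_virtual_classifiers(X_S, Y, S, tau, lr=1e-3, iters=2000):
    """Optimize the virtual classifiers W_P (n_F x c_P) by gradient descent.

    X_S: (n_F, N_S) normalized training features.
    Y:   (c_S, N_S) label matrix with +1 / -1 entries (y_c^(n)).
    S:   (c_P, c_S) prototype weight matrix.
    tau: regularization term weight (> 0).
    """
    n_F, _ = X_S.shape
    c_P = S.shape[0]
    W_P = np.zeros((n_F, c_P))
    for _ in range(iters):
        W_S = W_P @ S                              # synthesized known-class classifiers, (n_F, c_S)
        margins = Y * (W_S.T @ X_S)                # y_c^(n) * w_c^T x_n, (c_S, N_S)
        slack = np.maximum(0.0, 1.0 - margins)     # squared hinge: max(0, 1 - y*f)^2
        grad_WS = -2.0 * X_S @ (slack * Y).T       # gradient w.r.t. W_S, (n_F, c_S)
        grad_WP = grad_WS @ S.T + tau * W_P        # chain rule back to W_P plus regularizer
        W_P -= lr * grad_WP
    return W_P
```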
Further, the classification decision on an unknown-emotion test sample in step five comprises the following sub-steps executed in sequence (see the sketch following this list):
(1) from the computed semantic embedding prototype a_m^(U) of unknown emotion category m, compute the emotion category prototype weight vector s_m^(U) for category m, using the same weight formula as in step three with a_m^(U) in place of the known-category prototype;
(2) for the paralinguistic feature x^(U) of a test sample, predict the label of the unknown emotion category to which the sample belongs as
m* = argmax_m (W^(P) s_m^(U))^T x^(U),
and thereby obtain the unknown-emotion classification decision for the test sample.
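The test-time decision can then be sketched as follows, reusing the prototype_weights helper from the earlier sketch (illustrative names, not from the patent):

```python
import numpy as np

def predict_unknown_emotion(x_U, A_U, B, W_P, sigma2):
    """Predict the unknown emotion category index for one test feature x_U.

    x_U: (n_F,) normalized paralinguistic feature of the test sample.
    A_U: (n_A, c_U) semantic embedding prototypes of the unknown categories.
    B:   (n_A, c_P) virtual category prototypes; W_P: (n_F, c_P) virtual classifiers.
    """
    S_U = prototype_weights(A_U, B, sigma2)   # (c_P, c_U) weights for the unknown categories
    W_U = W_P @ S_U                           # synthesized unknown-class classifiers, (n_F, c_U)
    scores = W_U.T @ x_U                      # one score per unknown category
    return int(np.argmax(scores))             # index m* of the predicted category
```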
Beneficial effects: As shown in Fig. 1, the method for recognizing unknown emotional states of speech signals according to the invention extracts paralinguistic features from speech segment samples whose emotional state information is unknown, and performs classification decisions by a synthesized-classifier approach combined with semantic embeddings of the emotional state labels. Specifically, in the training stage, paralinguistic features are extracted from the known-emotion training speech segment samples, the prototype weights of the known emotion categories are obtained by processing the known emotion category names, and the optimal virtual classifiers are then solved for by combining the labels of the known-emotion training samples; in the testing stage, the optimal virtual classifiers are used, together with the paralinguistic features of the unknown-emotion test speech segment samples and the prototype weights of the unknown emotion categories, to decide the unknown emotion category of each test sample.
Existing speech emotion recognition methods have two problems: a general SER method can only recognize the emotion categories for which samples are provided in the training set, and has difficulty handling unknown categories; and although solutions have been proposed for unknown-emotion recognition in speech signals, their successful implementation still relies on adequate labeling of the emotion-space dimensions of the training samples.
Therefore, the method for recognizing unknown emotions in speech signals disclosed by the invention, based on a synthesized classifier and semantic embedding, can support the recognition of unknown emotions in speech signals without increasing the cost of manual labeling, so that unknown emotions in speech can be effectively recognized.
Experiments prove that the semantic-embedding-based method for recognizing unknown emotions in speech signals proposed by the invention can effectively distinguish unknown emotions in speech signals.
Drawings
Fig. 1 is a flow chart of a method for recognizing an unknown emotional state of a speech signal according to the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and the detailed description.
As shown in Fig. 1, the method of the invention first extracts paralinguistic features from speech segment samples whose emotional state information is unknown, and performs classification decisions by a synthesized-classifier approach combined with semantic embeddings of the emotional state labels. In the training stage, paralinguistic features are extracted from the known-emotion training speech segment samples, the prototype weights of the known emotion categories are obtained by processing the known emotion category names, and the optimal virtual classifiers are solved for by combining the labels of the known-emotion training samples; in the testing stage, the optimal virtual classifiers are used, together with the paralinguistic features of the unknown-emotion test speech segment samples and the prototype weights of the unknown emotion categories, to decide the unknown emotion category of each test sample.
In the following, the method of the invention is compared with existing zero-shot learning methods in experiments measuring the Unweighted Accuracy (UA) recognition rate.
The effectiveness of the method of the embodiments of the invention is verified using the speech signal part of the GEMEP (GEneva Multimodal Emotion Portrayals) database.
The bimodal GEMEP database comprises a set of speech samples and the corresponding set of video samples, GEMEP-FERA. The database contains 18 emotion categories: admiration, amusement, anxiety, cold anger, contempt, despair, disgust, elation, hot anger, interest, panic fear, pleasure, pride, relief, sadness, shame, surprise, and tenderness. The database was recorded in French and contains 1260 samples from 10 speakers, including 5 females. The experiment uses 12 of the emotion categories, specifically amusement, anxiety, cold anger, despair, elation, hot anger, interest, panic fear, pleasure, pride, relief and sadness, with an average of 90 samples per category and 1080 samples in total. All samples of every pair of emotion categories in the data set are used in turn as the unknown-emotion test speech segment sample set, with the samples of the remaining emotion categories used as the known-emotion training speech segment sample set; there are 66 such combinations, so the experiment is trained and tested 66 times.
The original paralinguistic features in the experiment use the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) feature set, with original feature dimension n_F = 88. The features are derived from 25 Low-Level Descriptors (LLDs) combined with high-level statistical functions (HSFs), together with temporal features and the equivalent sound level, and are extracted with the openSMILE 2.3 toolbox.
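For reference, eGeMAPS functionals can be extracted in Python with the opensmile package; this is a sketch assuming eGeMAPSv02 as a close stand-in for the 88-dimensional feature set used in the experiment, and the file name is illustrative:

```python
import opensmile

# eGeMAPS functionals (88 features per utterance in eGeMAPSv02).
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("sample_utterance.wav")  # pandas DataFrame, one row per utterance
print(features.shape)  # expect (1, 88)
```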
The semantic embedding prototypes of the emotion categories use n_A = 300-dimensional English word semantic vectors, derived from pre-trained word2vec, GloVe, and fastText models. The word2vec model in the experiment is the Google pre-trained model trained on a Google News corpus containing 3 million words; the GloVe model in the experiment uses Wikipedia 2014 and Gigaword 5 as training data and contains 400,000 words; fastText uses 2 million word vectors trained on web crawl data and 1 million word vectors trained on Wikipedia 2017, the UMBC webbase corpus and the statmt.org news dataset.
In the experiment, in order to show the effect of the method of the invention, the methods used for comparison are: SAE (Semantic AutoEncoder), DEM (Deep Embedding Model), LatEm (Latent Embeddings), ESZSL (Embarrassingly Simple Zero-Shot Learning), and EXEM (exemplar synthesis).
The speech signal emotion state recognition models of the invention comprise two variants: SYNC(origin) (embodiment 1, which uses the prototypes of the known emotion categories as the virtual category prototype matrix B in step three, i.e. B = A^(S)), and SYNC(rand) (embodiment 2, which uses a randomly generated virtual category prototype matrix B in step three, with the number of virtual categories c^(P) = 1000).
In the experiment, optimal parameters are selected on the training set using emotion-category-independent five-fold cross validation, and the experiment is repeated 10 times for the random generation of the virtual category prototypes in step three. For the embodiments of the invention, the parameter ranges are: regularization term weight τ ∈ {2^−24, 2^−23, ..., 2^−9}, distance weight σ^2 ∈ {2^−5, 2^−4, ..., 2^5}.
The average optimal UA over all semantic embedding prototypes on the GEMEP database is shown in Table 1:
TABLE 1

Method                          UA
Comparative example 1 (SAE)     57.2%
Comparative example 2 (DEM)     59.3%
Comparative example 3 (LatEm)   64.2%
Comparative example 4 (ESZSL)   64.6%
Comparative example 5 (EXEM)    62.3%
Embodiment 1 (SYNC(origin))     64.4%
Embodiment 2 (SYNC(rand))       65.0% ± 0.9%
As can be seen from Table 1, the SYNC methods of embodiments 1 and 2 achieve better UA performance for recognizing unknown emotions in speech signals than the other comparative methods.
Further, the three best results of embodiment 2 (SYNC(rand)) over the 10 repetitions are compared with the result of SYNC(origin), as shown in Table 2. As can be seen from Table 2, a randomly selected virtual category prototype matrix enables the method of the invention to achieve better performance.
TABLE 2

Method                  UA
SYNC(origin)            64.4%
SYNC(rand), best        66.6%
SYNC(rand), 2nd best    66.2%
SYNC(rand), 3rd best    65.7%
In summary, by learning the discriminative information between the known emotion categories from the paralinguistic features used in speech emotion recognition, the SYNC method adopted in these embodiments provides better performance on the problem of recognizing unknown emotions in speech signals.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (10)

1. A method for recognizing an unknown emotional state of a speech signal, characterized in that: paralinguistic features are extracted from training speech segment samples of known emotion categories, the prototype weights of the known emotion categories are obtained by processing the known emotion category names, and the optimal virtual classifiers are solved for by combining the labels of the known-emotion training speech segment samples; paralinguistic features are extracted from speech segment samples whose emotional state information is unknown, and the unknown emotion category of each test sample is decided using the optimal virtual classifiers in combination with the semantic embeddings of the emotional state labels and the prototype weights of the unknown emotion categories.
2. The method for recognizing an unknown emotional state of a speech signal according to claim 1, characterized in that the method specifically comprises the following steps:
step one, process each speech segment sample in the training sample set and the test sample set, extract the corresponding paralinguistic features as original features to generate n_F-dimensional original features, and normalize the original features to obtain the normalized features X^(S) corresponding to the N^(S) training samples and the normalized feature x^(U) corresponding to any test sample;
step two, input each emotion category name into a pre-trained model to obtain the n_A-dimensional semantic embedding vector of that category; the semantic embedding vectors of the known emotion categories corresponding to the training set are denoted A^(S), where c^(S) is the number of known emotion categories and n_A is the semantic embedding dimension of the emotion category names; for the c^(U) unknown emotion categories to be predicted for the test set, the semantic embedding vectors are denoted A^(U);
step three, from the known emotion category prototype matrix A^(S) and a virtual category prototype matrix B, compute the prototype weight matrix S of the known emotion categories;
step four, use the paralinguistic features X^(S) of the known-emotion speech segment samples and the emotion category labels Y^(S) of the corresponding samples, together with the semantic embedding prototypes A^(S) of the known emotion categories and the emotion category prototype weight matrix S, to optimize the linear virtual classifiers W^(P) according to the optimization target f, and obtain the optimal virtual classifiers W^(P)*;
step five, for the paralinguistic feature x^(U) of each unknown-emotion test speech segment sample, use the optimal virtual classifiers W^(P)* obtained in step four to perform the classification decision of the unknown emotion category for each test sample.
3. The method for recognizing an unknown emotional state of a speech signal according to claim 2, characterized in that, in the prototype weight matrix S of the known emotion categories, the element corresponding to virtual category c_P and known emotion category c_S is:
s(c_P, c_S) = exp(−d(a_{c_S}^(S), b_{c_P})) / Σ_{c=1}^{c^(P)} exp(−d(a_{c_S}^(S), b_c))
where the distance measure between the semantic embedding prototype a_{c_S}^(S) and the virtual category prototype b_{c_P} is d(a_{c_S}^(S), b_{c_P}) = ||a_{c_S}^(S) − b_{c_P}||^2 / σ^2, and σ^2 is the distance weight.
4. The method for recognizing an unknown emotional state of a speech signal according to claim 3, characterized in that the distance weight σ^2 ∈ {2^−5, 2^−4, ..., 2^5}.
5. The method for recognizing an unknown emotional state of a speech signal according to claim 2, characterized in that the virtual category prototype matrix B in step three is constructed by randomly generating an n_A × c^(P) matrix with elements drawn from a uniform distribution on [0, 1].
6. The method for recognizing an unknown emotional state of a speech signal according to claim 2, characterized in that the virtual category prototype matrix B in step three is constructed by setting it to the semantic embedding matrix A^(S) of the known emotion categories.
7. The method for recognizing an unknown emotional state of a speech signal according to claim 2, characterized in that the optimization target for obtaining the optimal virtual classifiers W^(P)* in step four is:
W^(P)* = argmin_{W^(P)} Σ_{n=1}^{N^(S)} Σ_{c=1}^{c^(S)} L(y_c^(n), (w_c^(S))^T x_n^(S)) + (τ/2) ||W^(P)||_F^2
where the regularization term weight τ > 0, the linear classifiers of the known emotion categories are W^(S) = [w_1^(S), ..., w_{c^(S)}^(S)] = W^(P) S, and the loss function is the squared hinge loss L(y, f) = max(0, 1 − y·f)^2, in which y_c^(n) is the label information of the c-th known emotion category for the n-th training sample: y_c^(n) = 1 when training sample n belongs to known emotion category c, and y_c^(n) = −1 otherwise.
8. The method for recognizing an unknown emotional state of a speech signal according to claim 7, characterized in that the regularization term weight τ ∈ {2^−24, 2^−23, ..., 2^−9}.
9. The method for recognizing an unknown emotional state of a speech signal according to claim 2, characterized in that, in step five, the unknown emotion category decision is performed on the test sample using the optimal virtual classifiers, specifically: from the computed semantic embedding prototype a_m^(U) of unknown emotion category m, the emotion category prototype weight vector s_m^(U) for category m is computed; for the paralinguistic feature x^(U) of a test sample, the label of the unknown emotion category to which the sample belongs is predicted as
m* = argmax_m (W^(P) s_m^(U))^T x^(U),
and the unknown-emotion classification decision of the test sample is obtained.
10. The method for recognizing an unknown emotional state of a speech signal according to claim 2, characterized in that the original features adopt the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS), with original feature dimension n_F = 88.
CN202110584445.9A 2021-05-27 2021-05-27 Method for recognizing unknown emotional state of voice signal Active CN113409821B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110584445.9A CN113409821B (en) 2021-05-27 2021-05-27 Method for recognizing unknown emotional state of voice signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110584445.9A CN113409821B (en) 2021-05-27 2021-05-27 Method for recognizing unknown emotional state of voice signal

Publications (2)

Publication Number Publication Date
CN113409821A true CN113409821A (en) 2021-09-17
CN113409821B CN113409821B (en) 2023-04-18

Family

ID=77674667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110584445.9A Active CN113409821B (en) 2021-05-27 2021-05-27 Method for recognizing unknown emotional state of voice signal

Country Status (1)

Country Link
CN (1) CN113409821B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110141258A1 (en) * 2007-02-16 2011-06-16 Industrial Technology Research Institute Emotion recognition method and system thereof
CN103854645A (en) * 2014-03-05 2014-06-11 东南大学 Speech emotion recognition method based on punishment of speaker and independent of speaker
CN107886942A (en) * 2017-10-31 2018-04-06 东南大学 A kind of voice signal emotion identification method returned based on local punishment random spectrum
CN108615052A (en) * 2018-04-13 2018-10-02 南京邮电大学 A kind of image-recognizing method without under similar training sample situation
CN109933664A (en) * 2019-03-12 2019-06-25 中南大学 A kind of fine granularity mood analysis improved method based on emotion word insertion
US20200335086A1 (en) * 2019-04-19 2020-10-22 Behavioral Signal Technologies, Inc. Speech data augmentation
CN111324734A (en) * 2020-02-17 2020-06-23 昆明理工大学 Case microblog comment emotion classification method integrating emotion knowledge
CN112466284A (en) * 2020-11-25 2021-03-09 南京邮电大学 Mask voice identification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHAO Xiaolei et al.: "Speech emotion recognition method fusing functional paralanguage", Journal of Frontiers of Computer Science and Technology *

Also Published As

Publication number Publication date
CN113409821B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
Zehra et al. Cross corpus multi-lingual speech emotion recognition using ensemble learning
US20200335086A1 (en) Speech data augmentation
Luo et al. Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network.
Anagnostopoulos et al. Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011
CN107943784B (en) Relationship extraction method based on generation of countermeasure network
Kumar et al. Multilayer Neural Network Based Speech Emotion Recognition for Smart Assistance.
Sultana et al. Bangla speech emotion recognition and cross-lingual study using deep CNN and BLSTM networks
Mamyrbayev et al. Voice identification using classification algorithms
Albadr et al. Extreme learning machine for automatic language identification utilizing emotion speech data
Chen et al. Music emotion recognition using deep Gaussian process
KR20200105057A (en) Apparatus and method for extracting inquiry features for alalysis of inquery sentence
CN111563373A (en) Attribute-level emotion classification method for focused attribute-related text
CN110992988A (en) Speech emotion recognition method and device based on domain confrontation
CN112397092A (en) Unsupervised cross-library speech emotion recognition method based on field adaptive subspace
CN112466284B (en) Mask voice identification method
CN112489689A (en) Cross-database voice emotion recognition method and device based on multi-scale difference confrontation
CN113409821B (en) Method for recognizing unknown emotional state of voice signal
Sasidharan Rajeswari et al. Speech Emotion Recognition Using Machine Learning Techniques
CN113626553B (en) Cascade binary Chinese entity relation extraction method based on pre-training model
Deshmukh et al. Application of probabilistic neural network for speech emotion recognition
Saputri et al. Identifying Indonesian local languages on spontaneous speech data
Durrani et al. Transfer learning based speech affect recognition in Urdu
CN107886942B (en) Voice signal emotion recognition method based on local punishment random spectral regression
Das et al. Towards interpretable and transferable speech emotion recognition: Latent representation based analysis of features, methods and corpora
Vasuki Design of Hierarchical Classifier to Improve Speech Emotion Recognition.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant