CN107945790B - Emotion recognition method and emotion recognition system - Google Patents

Emotion recognition method and emotion recognition system

Info

Publication number
CN107945790B
CN107945790B (application number CN201810007403.7A)
Authority
CN
China
Prior art keywords
features
acoustic
text
emotion
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810007403.7A
Other languages
Chinese (zh)
Other versions
CN107945790A (en)
Inventor
王雪云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BOE Technology Group Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd filed Critical BOE Technology Group Co Ltd
Priority to CN201810007403.7A
Publication of CN107945790A
Application granted
Publication of CN107945790B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/48 - Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques specially adapted for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • User Interface Of Digital Computer (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention discloses an emotion recognition method and an emotion recognition system. The method includes: acquiring a current voice signal; extracting voice features of the current voice signal, the voice features including acoustic features and text features; and recognizing, according to the voice features and a preset depth model, the emotion type corresponding to the current voice signal, the emotion type being positive, neutral or negative. The technical scheme of the invention can identify the corresponding emotion type from a voice signal, so as to supervise service personnel and improve the service level.

Description

Emotion recognition method and emotion recognition system
Technical Field
The embodiment of the invention relates to the technical field of communication, in particular to an emotion recognition method and an emotion recognition system.
Background
In interpersonal communication, language is one of the most natural and important means. The emotion carried in a speaker's voice can strongly affect the mood of the people around, in both positive and negative ways. This is especially true for service personnel: in public places such as buses, nursing homes or hospitals, if a service worker has a poor attitude, an arrogant tone or harsh language, that is, if the emotion is negative, the people being served are adversely affected, which is detrimental to social harmony and to improving the happiness index.
The inventor has found through research that there is currently no effective technical means of judging the emotion of service personnel from their speech, so that their service can be supervised and the service level improved.
Disclosure of Invention
In order to solve the above technical problem, embodiments of the present invention provide an emotion recognition method and an emotion recognition system, which can recognize corresponding emotion through a speech signal.
In one aspect, an embodiment of the present invention provides an emotion recognition method, including:
acquiring a current voice signal;
extracting voice features of a current voice signal, wherein the voice features comprise: acoustic features and text features;
according to the voice features and a preset depth model, recognizing emotion types corresponding to the current voice signals, wherein the emotion types comprise: positive, neutral and negative.
Optionally, before the extracting the speech feature of the current speech signal, the method further includes:
and preprocessing the current voice signal.
Optionally, after the emotion type corresponding to the current speech signal is identified, the method further includes:
and activating a corresponding preset coping scheme according to the emotion type.
Optionally, the acoustic features include: fundamental frequency, duration, energy and frequency spectrum.
Optionally, the recognizing, according to the speech feature and the preset depth model, an emotion type corresponding to the current speech signal includes:
obtaining acoustic characteristic information and text characteristic information for emotion recognition according to the acoustic characteristic and the text characteristic;
obtaining K acoustic feature vectors according to the acoustic feature information;
obtaining K text feature vectors according to the K acoustic feature vectors and the text feature information;
and recognizing the emotion type of the current voice signal according to the K acoustic feature vectors, the K text feature vectors and the preset depth model.
Optionally, the obtaining acoustic feature information and text feature information for emotion recognition according to the acoustic feature and the text feature includes:
respectively converting the acoustic features and the text features into corresponding vectors;
and respectively inputting the vector corresponding to the acoustic feature and the vector corresponding to the text feature into a convolutional neural network to obtain acoustic feature information and text feature information for emotion recognition.
Optionally, the obtaining K acoustic feature vectors according to the acoustic feature information includes:
pooling the acoustic feature information to obtain K acoustic feature vectors;
the obtaining K text feature vectors according to the K acoustic feature vectors and the text feature information includes:
focusing the text feature information by adopting a focusing mechanism according to the mean value of the K acoustic feature vectors;
pooling the focused text feature information to obtain K text feature vectors.
On the other hand, an embodiment of the present invention further provides an emotion recognition system, including:
a voice acquisition module configured to acquire a current voice signal;
a feature extraction module configured to extract voice features of a current voice signal, the voice features including: acoustic features and text features;
the emotion recognition module is configured to recognize an emotion type corresponding to the current voice signal according to the voice feature and a preset depth model, wherein the emotion type comprises: positive, neutral and negative.
Optionally, the system further comprises: the device comprises a signal preprocessing module and an activation module;
the signal preprocessing module is configured to preprocess the current voice signal;
the activation module is configured to activate a corresponding preset coping scheme according to the emotion type.
Optionally, the emotion recognition module includes:
the first obtaining unit is configured to obtain acoustic feature information and text feature information for emotion recognition according to the acoustic feature and the text feature, and specifically includes: respectively converting the acoustic features and the text features into corresponding vectors; respectively inputting the vector corresponding to the acoustic feature and the vector corresponding to the text feature into a convolutional neural network to obtain acoustic feature information and text feature information for emotion recognition; the acoustic features include: fundamental frequency, duration, energy and frequency spectrum;
a second obtaining unit, configured to obtain K acoustic feature vectors according to the acoustic feature information, including: pooling the acoustic feature information to obtain K acoustic feature vectors; and further configured to obtain K text feature vectors according to the K acoustic feature vectors and the text feature information, specifically including: focusing the text feature information by adopting a focusing mechanism according to the mean value of the K acoustic feature vectors; pooling the focused text feature information to obtain K text feature vectors;
and the emotion recognition unit is configured to recognize the emotion type of the current voice signal according to the K acoustic feature vectors, the K text feature vectors and the preset depth model.
The embodiment of the invention provides an emotion recognition method and an emotion recognition system. The method includes: acquiring a current voice signal; extracting voice features of the current voice signal, the voice features including acoustic features and text features; and recognizing, according to the voice features and a preset depth model, the emotion type corresponding to the current voice signal, the emotion type being positive, neutral or negative. The technical scheme of the invention can identify the corresponding emotion type from a voice signal, so as to supervise service personnel and improve the service level.
Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention. Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the embodiments of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification; they illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention without limiting it.
Fig. 1 is a flowchart of an emotion recognition method according to an embodiment of the present invention;
FIG. 2 is another flow chart of a method for emotion recognition according to an embodiment of the present invention;
FIG. 3 is a flow chart of step 300 provided by an embodiment of the present invention;
FIG. 4 is a block diagram of an emotion recognition system provided in an embodiment of the present invention;
FIG. 5 is another block diagram of an emotion recognition system provided in an embodiment of the present invention;
FIG. 6 is a block diagram of an emotion recognition module according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
In order to explain the technical solutions of the embodiments of the present invention, the following description is given by way of specific examples.
Example one
Fig. 1 is a flowchart of an emotion recognition method provided in an embodiment of the present invention, and as shown in fig. 1, the emotion recognition method provided in the embodiment of the present invention specifically includes the following steps:
step 100, obtaining a current voice signal.
Specifically, step 100 acquires a speech signal through a microphone or a microphone array.
Step 200, extracting the voice characteristics of the current voice signal.
Wherein the voice features include: acoustic features and text features.
Optionally, the acoustic features include fundamental frequency, duration, energy and frequency spectrum. The fundamental frequency determines pitch, and fundamental-frequency features can be extracted with an autocorrelation algorithm. Duration is related to speech rate, and the silence information in the current voice signal is valuable for emotion recognition; duration features can be extracted with a tool such as Visual Speech. Energy is related to amplitude, and energy and spectral features can be extracted with existing techniques.
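To make the autocorrelation-based fundamental-frequency extraction concrete, the following is a minimal NumPy sketch of pitch estimation for a single frame; the sample rate, pitch search range and function name are illustrative assumptions, not parameters fixed by the invention.

```python
import numpy as np

def estimate_f0_autocorr(frame, sample_rate=16000, f0_min=60.0, f0_max=400.0):
    """Estimate the fundamental frequency (Hz) of one speech frame by autocorrelation."""
    frame = frame - np.mean(frame)                    # remove DC offset
    corr = np.correlate(frame, frame, mode="full")    # full autocorrelation
    corr = corr[len(corr) // 2:]                      # keep non-negative lags only
    lag_min = int(sample_rate / f0_max)               # shortest plausible pitch period
    lag_max = min(int(sample_rate / f0_min), len(corr) - 1)
    peak_lag = lag_min + int(np.argmax(corr[lag_min:lag_max + 1]))
    return sample_rate / peak_lag                     # convert period (in samples) to Hz
```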
Optionally, the text features are the text information contained in the current speech signal, extracted by speech recognition technology, for example, iFlytek's automatic speech recognition.
Step 300, recognizing the emotion type corresponding to the current voice signal according to the voice features and the preset depth model.
Wherein the emotion types include positive, neutral and negative. It should be noted that a positive emotion type may please the served person, a neutral emotion type may not affect the mood of the served person, and a negative emotion type may make the served person uncomfortable. The same sentence, such as "you are a fool", may be banter directed at a friend or a jeer at an adversary, so its emotion may be positive or negative.
It should be noted that the preset depth model has been extensively trained on a sample database, so the accuracy of the recognized emotion types is high.
Optionally, the emotion recognition method provided by the embodiment of the invention can be applied to public places such as buses, nursing homes and hospitals.
The emotion recognition method provided by the embodiment of the invention includes: acquiring a current voice signal; extracting voice features of the current voice signal, the voice features including acoustic features and text features; and recognizing, according to the voice features and the preset depth model, the emotion type corresponding to the current voice signal, the emotion type being positive, neutral or negative. The technical scheme of the invention can identify the corresponding emotion type from the voice signal, so as to supervise service personnel and improve the service level.
Optionally, fig. 2 is another flowchart of the emotion recognition method provided in the embodiment of the present invention, as shown in fig. 2, before step 200, the emotion recognition method provided in the embodiment of the present invention further includes:
step 400, preprocessing the current voice signal.
Specifically, the preprocessing in step 400 includes eliminating environmental noise, enhancing the useful signal, segmenting the current speech signal, and the like. It should be noted that segmenting the current speech signal can be realized by windowing and framing, for example using a Hamming window with a window length of 25 ms and a window shift of 10 ms (i.e., each frame covers 25 ms of speech and successive frames advance in 10 ms steps).
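A minimal sketch of the windowing and framing described above, assuming a 16 kHz sample rate (the function name and NumPy usage are illustrative):

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a speech signal into overlapping Hamming-windowed frames
    (25 ms window length, 10 ms window shift, as in the example above)."""
    frame_len = int(sample_rate * frame_ms / 1000)    # 400 samples at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # 160 samples at 16 kHz
    window = np.hamming(frame_len)
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    frames = np.stack([
        signal[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames  # shape: (n_frames, frame_len)
```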
Optionally, after step 300, the emotion recognition method provided in the embodiment of the present invention further includes:
and 500, activating a corresponding preset coping scheme according to the emotion type.
Specifically, in step 500, when the emotion type is positive or neutral, the service personnel are encouraged to keep it up; when the emotion type is negative, a preset coping scheme is activated. The coping scheme includes: (1) raising a timely alarm to remind the service personnel to pay attention to their service attitude, where the alarm may optionally be a text display, a buzzer, a voice broadcast, or the like; (2) collecting the current voice signals corresponding to the negative emotion and storing them in the cloud, so that the service organization can evaluate and improve service quality; (3) pushing scheduled messages, that is, pushing the day's service-quality information to the service personnel's mobile phone after work each day, so that they gain a comprehensive picture of their service that day and can further improve their service level.
Optionally, fig. 3 is a flowchart of step 300 provided in the embodiment of the present invention, as shown in fig. 3, step 300 includes:
step 301, obtaining acoustic feature information and text feature information for emotion recognition according to the acoustic feature and the text feature.
Specifically, step 301 includes: respectively converting the acoustic features and the text features into corresponding vectors; and respectively inputting the vector corresponding to the acoustic feature and the vector corresponding to the text feature into a convolutional neural network to obtain acoustic feature information and text feature information for emotion recognition.
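Purely as an illustrative sketch of step 301, not the patented network, the two convolutional encoders might look as follows; the framework (PyTorch), layer sizes and input dimensions are assumptions:

```python
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """1-D convolutional encoder that turns a sequence of feature vectors into
    per-time-step information usable for emotion recognition."""
    def __init__(self, in_dim, hidden_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, hidden_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )

    def forward(self, x):                       # x: (batch, time, in_dim)
        x = x.transpose(1, 2)                   # -> (batch, in_dim, time) for Conv1d
        return self.conv(x).transpose(1, 2)     # -> (batch, time, hidden_dim)

# One encoder per modality (the dimensions are assumed for the example):
acoustic_encoder = FeatureEncoder(in_dim=40)    # e.g. 40-dim acoustic vectors per frame
text_encoder = FeatureEncoder(in_dim=300)       # e.g. 300-dim word-embedding vectors
```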
Step 302, obtaining K acoustic feature vectors according to the acoustic feature information.
Specifically, step 302 includes: pooling the acoustic feature information to obtain K acoustic feature vectors.
Step 303, obtaining K text feature vectors according to the K acoustic feature vectors and the text feature information.
Specifically, step 303 includes: focusing the text feature information by adopting a focusing mechanism according to the mean value of the K acoustic feature vectors; pooling the focused text feature information to obtain K text feature vectors.
It should be noted that the focusing mechanism assigns different weights to different words; for example, a higher weight is assigned to offensive words, which strongly influence the judgment of emotion. Put colloquially: if the features output by the convolutional network indicate that the current speaker's attitude is harsh, the focusing mechanism assigns a higher weight to offensive words (e.g., "bastard", "idiot"); if the features indicate that the speaker's attitude is mild, the focusing mechanism does not assign a higher weight to such words.
Specifically, the focusing mechanism operates on the text feature information as follows: weights are assigned to the text feature information, and the weights are determined according to the K acoustic feature vectors.
In particular, suppose that at time t the text feature information is h_a(t) and the acoustic feature information is O_q. After the action of the focusing mechanism, each piece of text feature information becomes the focused feature ĥ_a(t), where

m_{a,q}(t) = tanh(W_am · h_a(t) + W_qm · O_q)

S_{a,q}(t) = exp(W_ms · m_{a,q}(t)) / Σ_t' exp(W_ms · m_{a,q}(t'))

ĥ_a(t) = S_{a,q}(t) · h_a(t)

where W_am, W_qm and W_ms are the focus parameters, S_{a,q}(t) is the weight, and ĥ_a(t) is the focused text feature information.
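The following NumPy sketch illustrates one way to realize the focusing mechanism and the pooling into K vectors described above; the matrix shapes, the chunk-wise mean pooling and the function names are assumptions made for this example, not details fixed by the patent:

```python
import numpy as np

def focus_text_features(h_a, O_q, W_am, W_qm, W_ms):
    """Apply the focusing (attention) equations above to text features h_a,
    guided by the mean acoustic feature vector O_q.
    Assumed shapes: h_a (T, d_text), O_q (d_ac,),
    W_am (d_m, d_text), W_qm (d_m, d_ac), W_ms (d_m,)."""
    m = np.tanh(h_a @ W_am.T + O_q @ W_qm.T)   # m_{a,q}(t), shape (T, d_m)
    scores = m @ W_ms                          # scalar score per time step, shape (T,)
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    S = weights / weights.sum()                # S_{a,q}(t): weight for each time step
    return h_a * S[:, None]                    # focused text features ĥ_a(t)

def pool_into_k_vectors(features, K=4):
    """Pool a (T, d) feature sequence into K vectors by mean-pooling K time chunks."""
    chunks = np.array_split(features, K, axis=0)
    return np.stack([chunk.mean(axis=0) for chunk in chunks])  # shape (K, d)
```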
Step 304, recognizing the emotion type of the current voice signal according to the K acoustic feature vectors, the K text feature vectors and the preset depth model.
Specifically, step 304 includes: performing logistic regression on the K acoustic feature vectors and the K text feature vectors, and recognizing the emotion type of the current voice signal according to the K acoustic feature vectors, the K text feature vectors and the depth model after the logistic regression.
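A minimal sketch of this classification step, assuming the logistic regression is a softmax over the concatenated K acoustic and K text feature vectors with trained parameters W and b (the exact interaction with the depth model is an assumption here):

```python
import numpy as np

EMOTIONS = ["positive", "neutral", "negative"]

def classify_emotion(acoustic_vecs, text_vecs, W, b):
    """Multinomial logistic regression (softmax) over the K acoustic and
    K text feature vectors; W with shape (3, 2*K*d) and b with shape (3,)
    are trained parameters."""
    x = np.concatenate([acoustic_vecs.ravel(), text_vecs.ravel()])  # (2*K*d,)
    logits = W @ x + b
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # class probabilities
    return EMOTIONS[int(np.argmax(probs))], probs
```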
The working principle of the embodiment of the invention is as follows: a current voice signal is obtained through a microphone or a microphone array; the current voice signal is preprocessed; acoustic features of the current voice signal are extracted, text features of the current voice signal are extracted through speech recognition technology, and the acoustic features and the text features are respectively converted into corresponding vectors; the vector corresponding to the acoustic features and the vector corresponding to the text features are respectively input into a convolutional neural network to obtain acoustic feature information and text feature information for emotion recognition; the acoustic feature information is pooled to obtain K acoustic feature vectors; the text feature information is focused by a focusing mechanism according to the mean value of the K acoustic feature vectors; the focused text feature information is pooled to obtain K text feature vectors; logistic regression is performed on the K acoustic feature vectors and the K text feature vectors, and the emotion type of the current voice signal is recognized according to the K acoustic feature vectors, the K text feature vectors and the depth model after the logistic regression; and a corresponding preset coping scheme is activated according to the emotion type.
Example two
Based on the inventive concept of the above embodiment, fig. 4 is a schematic structural diagram of an emotion recognition system provided in an embodiment of the present invention, and as shown in fig. 4, the emotion recognition system provided in an embodiment of the present invention includes: a voice acquisition module 10, a feature extraction module 20 and an emotion recognition module 30.
In the present embodiment, the voice acquiring module 10 is configured to acquire a current voice signal; a feature extraction module 20 configured to extract a voice feature of the current voice signal; and the emotion recognition module 30 is configured to recognize an emotion type corresponding to the current voice signal according to the voice feature and the preset depth model.
Optionally, the acoustic features include fundamental frequency, duration, energy and frequency spectrum. The fundamental frequency determines pitch, and fundamental-frequency features can be extracted with an autocorrelation algorithm. Duration is related to speech rate, and the silence information in the current voice signal is valuable for emotion recognition; duration features can be extracted with a tool such as Visual Speech. Energy is related to amplitude, and energy and spectral features can be extracted with existing techniques.
Optionally, the text features are the text information contained in the current speech signal, extracted by speech recognition technology, for example, iFlytek's automatic speech recognition.
Wherein the emotion types include positive, neutral and negative. It should be noted that a positive emotion type may please the served person, a neutral emotion type may not affect the mood of the served person, and a negative emotion type may make the served person uncomfortable. The same sentence, such as "you are a fool", may be banter directed at a friend or a jeer at an adversary, so its emotion may be positive or negative.
Optionally, the emotion recognition system provided by the embodiment of the invention can be applied to public places such as buses, nursing homes and hospitals.
The emotion recognition system provided by the embodiment of the invention includes: a voice acquisition module configured to acquire a current voice signal; a feature extraction module configured to extract voice features of the current voice signal, the voice features including acoustic features and text features; and an emotion recognition module configured to recognize, according to the voice features and a preset depth model, the emotion type corresponding to the current voice signal, the emotion type being positive, neutral or negative. The technical scheme of the invention can identify the corresponding emotion type from the voice signal, so as to supervise service personnel and improve the service level.
Optionally, fig. 5 is another schematic structural diagram of the emotion recognition system provided in the embodiment of the present invention, and as shown in fig. 5, the system provided in the embodiment of the present invention further includes: a signal preprocessing module 40 and an activation module 50.
A signal preprocessing module 40 configured to preprocess the current speech signal.
Specifically, the preprocessing includes eliminating environmental noise, enhancing the useful signal, segmenting the current speech signal, and the like. It should be noted that segmenting the current speech signal can be realized by windowing and framing, for example using a Hamming window with a window length of 25 ms and a window shift of 10 ms (i.e., each frame covers 25 ms of speech and successive frames advance in 10 ms steps).
And the activation module 50 is configured to activate the corresponding preset coping schemes according to the emotion types.
Specifically, the activation module 50 encourages the service personnel to keep it up when the emotion type is positive or neutral, and activates a preset coping scheme when the emotion type is negative. The coping scheme includes, but is not limited to: (1) raising a timely alarm to remind the service personnel to pay attention to their service attitude, where the alarm may optionally be a text display, a buzzer, a voice broadcast, or the like; (2) collecting the current voice signals corresponding to the negative emotion and storing them in the cloud, so that the service organization can evaluate and improve service quality; (3) pushing scheduled messages, that is, pushing the day's service-quality information to the service personnel's mobile phone after work each day, so that they gain a comprehensive picture of their service that day and can further improve their service level.
Optionally, fig. 6 is a schematic structural diagram of an emotion recognition module provided in an embodiment of the present invention, and as shown in fig. 6, the emotion recognition module includes: a first obtaining unit 31, a second obtaining unit 32 and an emotion recognition unit 33.
The first obtaining unit 31 is configured to obtain acoustic feature information and text feature information for emotion recognition according to the acoustic features and the text features, specifically by: respectively converting the acoustic features and the text features into corresponding vectors; and respectively inputting the vector corresponding to the acoustic features and the vector corresponding to the text features into a convolutional neural network to obtain acoustic feature information and text feature information for emotion recognition; the acoustic features include: fundamental frequency, duration, energy and frequency spectrum;
The second obtaining unit 32 is configured to obtain K acoustic feature vectors according to the acoustic feature information, specifically by pooling the acoustic feature information to obtain the K acoustic feature vectors; and is further configured to obtain K text feature vectors according to the K acoustic feature vectors and the text feature information, specifically by focusing the text feature information with a focusing mechanism according to the mean value of the K acoustic feature vectors and pooling the focused text feature information to obtain the K text feature vectors;
and the emotion recognition unit 33 is configured to recognize the emotion type of the current voice signal according to the K acoustic feature vectors, the K text feature vectors and the preset depth model.
Those skilled in the art can understand that each module or unit included in the second embodiment is only divided according to functional logic, but is not limited to the above division as long as the corresponding function can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
It will be further understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing the relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium includes: ROM/RAM, magnetic disks, optical disks, and the like.
Although the embodiments of the present invention have been described above, the above description is only for the convenience of understanding the present invention, and is not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. An emotion recognition method, comprising:
acquiring a current voice signal;
extracting voice features of a current voice signal, wherein the voice features comprise: acoustic features and text features;
according to the voice features and a preset depth model, recognizing emotion types corresponding to the current voice signals, wherein the emotion types comprise: positive, neutral and negative;
respectively converting the acoustic features and the text features into corresponding vectors; respectively inputting the vector corresponding to the acoustic feature and the vector corresponding to the text feature into a convolutional neural network to obtain acoustic feature information and text feature information for emotion recognition;
pooling the acoustic feature information to obtain K acoustic feature vectors;
focusing the text feature information by adopting a focusing mechanism according to the mean value of the K acoustic feature vectors; pooling focused text feature information to obtain K text feature vectors;
and performing logistic regression on the K acoustic feature vectors and the K text feature vectors, and identifying the emotion type of the current voice signal according to the K acoustic feature vectors, the K text feature vectors and the depth model after the logistic regression.
2. The method of claim 1, wherein prior to extracting the speech feature of the current speech signal, the method further comprises:
and preprocessing the current voice signal.
3. The method of claim 1 or 2, wherein after the emotion type corresponding to the current speech signal is identified, the method further comprises:
and activating a corresponding preset coping scheme according to the emotion type.
4. The method of claim 1, wherein the acoustic features comprise: fundamental frequency, duration, energy and frequency spectrum.
5. An emotion recognition system, comprising:
a voice acquisition module configured to acquire a current voice signal;
a feature extraction module configured to extract voice features of a current voice signal, the voice features including: acoustic features and text features;
an emotion recognition module configured to include:
a first obtaining unit configured to convert the acoustic features and the text features into corresponding vectors, respectively; respectively inputting the vector corresponding to the acoustic feature and the vector corresponding to the text feature into a convolutional neural network to obtain acoustic feature information and text feature information for emotion recognition;
the second obtaining unit is configured to pool the acoustic feature information, obtain K acoustic feature vectors, and focus the text feature information by adopting a focusing mechanism according to a mean value of the K acoustic feature vectors; pooling focused text feature information to obtain K text feature vectors;
the emotion recognition unit is configured to perform logistic regression on the K acoustic feature vectors and the K text feature vectors, and recognize the emotion type of the current voice signal according to the K acoustic feature vectors, the K text feature vectors and the depth model after the logistic regression;
the emotion types include: positive, neutral and negative.
6. The system of claim 5, further comprising: the device comprises a signal preprocessing module and an activation module;
the signal preprocessing module is configured to preprocess the current voice signal;
the activation module is configured to activate a corresponding preset coping scheme according to the emotion type.
7. The system of claim 5, wherein the acoustic features comprise: fundamental frequency, duration, energy and frequency spectrum.
CN201810007403.7A 2018-01-03 2018-01-03 Emotion recognition method and emotion recognition system Active CN107945790B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810007403.7A CN107945790B (en) 2018-01-03 2018-01-03 Emotion recognition method and emotion recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810007403.7A CN107945790B (en) 2018-01-03 2018-01-03 Emotion recognition method and emotion recognition system

Publications (2)

Publication Number Publication Date
CN107945790A CN107945790A (en) 2018-04-20
CN107945790B true CN107945790B (en) 2021-01-26

Family

ID=61938328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810007403.7A Active CN107945790B (en) 2018-01-03 2018-01-03 Emotion recognition method and emotion recognition system

Country Status (1)

Country Link
CN (1) CN107945790B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108833722B (en) * 2018-05-29 2021-05-11 平安科技(深圳)有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN110660412A (en) * 2018-06-28 2020-01-07 Tcl集团股份有限公司 Emotion guiding method and device and terminal equipment
CN110728983B (en) * 2018-07-16 2024-04-30 科大讯飞股份有限公司 Information display method, device, equipment and readable storage medium
CN109741732B (en) * 2018-08-30 2022-06-21 京东方科技集团股份有限公司 Named entity recognition method, named entity recognition device, equipment and medium
CN109192225B (en) * 2018-09-28 2021-07-09 清华大学 Method and device for recognizing and marking speech emotion
CN109243490A (en) * 2018-10-11 2019-01-18 平安科技(深圳)有限公司 Driver's Emotion identification method and terminal device
CN109410986B (en) * 2018-11-21 2021-08-06 咪咕数字传媒有限公司 Emotion recognition method and device and storage medium
CN111354361A (en) * 2018-12-21 2020-06-30 深圳市优必选科技有限公司 Emotion communication method and system and robot
CN110047517A (en) * 2019-04-24 2019-07-23 京东方科技集团股份有限公司 Speech-emotion recognition method, answering method and computer equipment
CN110473571A (en) * 2019-07-26 2019-11-19 北京影谱科技股份有限公司 Emotion identification method and device based on short video speech
CN110600033B (en) * 2019-08-26 2022-04-05 北京大米科技有限公司 Learning condition evaluation method and device, storage medium and electronic equipment
CN111128189A (en) * 2019-12-30 2020-05-08 秒针信息技术有限公司 Warning information prompting method and device
US11810596B2 (en) 2021-08-16 2023-11-07 Hong Kong Applied Science and Technology Research Institute Company Limited Apparatus and method for speech-emotion recognition with quantified emotional states

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1391876A1 (en) * 2002-08-14 2004-02-25 Sony International (Europe) GmbH Method of determining phonemes in spoken utterances suitable for recognizing emotions using voice quality features
EP1429314A1 (en) * 2002-12-13 2004-06-16 Sony International (Europe) GmbH Correction of energy as input feature for speech processing
JP2005283647A (en) * 2004-03-26 2005-10-13 Matsushita Electric Ind Co Ltd Feeling recognition device
CN101894550A (en) * 2010-07-19 2010-11-24 东南大学 Speech emotion classifying method for emotion-based characteristic optimization
US9493130B2 (en) * 2011-04-22 2016-11-15 Angel A. Penilla Methods and systems for communicating content to connected vehicle users based detected tone/mood in voice input
JP5772448B2 (en) * 2011-09-27 2015-09-02 富士ゼロックス株式会社 Speech analysis system and speech analysis apparatus
JP6213476B2 (en) * 2012-10-31 2017-10-18 日本電気株式会社 Dissatisfied conversation determination device and dissatisfied conversation determination method
CN104050965A (en) * 2013-09-02 2014-09-17 广东外语外贸大学 English phonetic pronunciation quality evaluation system with emotion recognition function and method thereof
US9324320B1 (en) * 2014-10-02 2016-04-26 Microsoft Technology Licensing, Llc Neural network-based speech processing
KR20160116586A (en) * 2015-03-30 2016-10-10 한국전자통신연구원 Method and apparatus for emotion recognition
US10276188B2 (en) * 2015-09-14 2019-04-30 Cogito Corporation Systems and methods for identifying human emotions and/or mental health states based on analyses of audio inputs and/or behavioral data collected from computing devices
CN107516511B (en) * 2016-06-13 2021-05-25 微软技术许可有限责任公司 Text-to-speech learning system for intent recognition and emotion
CN106297826A (en) * 2016-08-18 2017-01-04 竹间智能科技(上海)有限公司 Speech emotional identification system and method
CN106782615B (en) * 2016-12-20 2020-06-12 科大讯飞股份有限公司 Voice data emotion detection method, device and system

Also Published As

Publication number Publication date
CN107945790A (en) 2018-04-20

Similar Documents

Publication Publication Date Title
CN107945790B (en) Emotion recognition method and emotion recognition system
CN109256150B (en) Speech emotion recognition system and method based on machine learning
CN105096941B (en) Audio recognition method and device
CN106601259B (en) Information recommendation method and device based on voiceprint search
CN106504768B (en) Phone testing audio frequency classification method and device based on artificial intelligence
CN109543020B (en) Query processing method and system
WO2016173132A1 (en) Method and device for voice recognition, and user equipment
CN106782615A (en) Speech data emotion detection method and apparatus and system
CN103531198A (en) Speech emotion feature normalization method based on pseudo speaker clustering
CN108597505A (en) Audio recognition method, device and terminal device
CN106033669B (en) Audio recognition method and device
CN110428853A (en) Voice activity detection method, Voice activity detection device and electronic equipment
CN104103272A (en) Voice recognition method and device and blue-tooth earphone
CN108986798A (en) Processing method, device and the equipment of voice data
Gupta et al. Speech feature extraction and recognition using genetic algorithm
CN113744742B (en) Role identification method, device and system under dialogue scene
CN116665676A (en) Semantic recognition method for intelligent voice outbound system
CN110246518A (en) Speech-emotion recognition method, device, system and storage medium based on more granularity sound state fusion features
CN109817223A (en) Phoneme marking method and device based on audio fingerprints
CN113743267A (en) Multi-mode video emotion visualization method and device based on spiral and text
KR20130068624A (en) Apparatus and method for recognizing speech based on speaker group
CN115168563B (en) Airport service guiding method, system and device based on intention recognition
CN110930794A (en) Intelligent language education system and method
CN110807370B (en) Conference speaker identity noninductive confirmation method based on multiple modes
CN114242061A (en) Order dispatching method and system based on voice recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant