CN114822598A - Server and speech emotion recognition method - Google Patents

Server and speech emotion recognition method

Info

Publication number
CN114822598A
CN114822598A (application CN202210459756.7A)
Authority
CN
China
Prior art keywords
emotion
server
words
data
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210459756.7A
Other languages
Chinese (zh)
Inventor
芮智琦 (Rui Zhiqi)
李俊彦 (Li Junyan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Electronic Technology Wuhan Co ltd
Original Assignee
Hisense Electronic Technology Wuhan Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Electronic Technology Wuhan Co., Ltd.
Priority to CN202210459756.7A
Publication of CN114822598A
Legal status: Pending

Classifications

    • G10L 25/63: Speech or voice analysis techniques specially adapted for estimating an emotional state
    • G06F 16/3334: Information retrieval of unstructured textual data; query processing; selection or weighting of terms from queries, including natural language queries
    • G06F 16/3338: Information retrieval of unstructured textual data; query processing; query expansion
    • G06F 16/683: Information retrieval of audio data; retrieval characterised by metadata automatically derived from the content
    • G10L 15/26: Speech recognition; speech-to-text systems
    • G10L 25/27: Speech or voice analysis techniques characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application provide a server and a speech emotion recognition method. Received speech is recognized as text data, and a vector representation of the text data is obtained; emotion words in the text data are marked through remote supervision according to an emotion polarity word list and mapped into a random vector space to obtain vector representations of the emotion words; the vector representation of the text data and the vector representations of the emotion words are concatenated to obtain bottom-layer shared parameters; and the bottom-layer shared parameters are input into a multi-task learning model to obtain an emotion analysis result and emotion keywords. Because the word vectors of the emotion words are concatenated with the sentence vector of the sentence they belong to, the model can focus on the emotion words in the sentence, which improves the accuracy of emotion keyword extraction and, in turn, the accuracy of emotion recognition. In addition, the server continuously supplements the emotion polarity word list with the extracted emotion keywords, so that emotion words in subsequent text data can be marked more comprehensively through remote supervision based on the word list.

Description

Server and speech emotion recognition method
Technical Field
The present application relates to the field of Internet technology, and in particular to a server and a speech emotion recognition method.
Background
A display device is a television product that supports bidirectional human-computer interaction and integrates functions such as audio and video, entertainment, and data. To meet users' diverse needs, the display device is equipped with various applications, such as audio/video and entertainment applications, and interacts and exchanges information with users through a user interface.
With the continuous development of human-computer interaction, voice is becoming a new paradigm of human-computer interaction, and current display devices support dialogue and question answering. For example, a user may say "I want to watch XX" to a display device, and the display device retrieves video content related to XX from a server and presents and recommends it to the user. While browsing the recommended video content, the user can respond by voice to express whether they are satisfied with the recommendation. To obtain the user's satisfaction with the recommendation result, the display device can perform emotion recognition on the user's voice and determine the user's emotional state.
At present, speech emotion recognition is mainly based on text content, and its accuracy is low. For example, some emotion words in Chinese lead to different emotion recognition results depending on where they appear in the sentence and how they are collocated with other words.
Disclosure of Invention
The present application provides a server and a speech emotion recognition method, aiming to solve the technical problem of low accuracy of speech emotion recognition in the prior art.
In a first aspect, the present application provides a server configured to:
recognizing received speech as text data, and obtaining a vector representation of the text data;
marking emotion words in the text data through remote supervision according to an emotion polarity word list, mapping the emotion words into a random vector space, and obtaining vector representations of the emotion words;
concatenating the vector representation of the text data with the vector representations of the emotion words to obtain bottom-layer shared parameters;
and inputting the bottom-layer shared parameters into a trained multi-task learning model to obtain an emotion analysis result and emotion keywords, where the multi-task learning model includes an emotion analysis task and an emotion keyword extraction task, and the emotion keywords are used to supplement the emotion polarity word list.
In a second aspect, the present application provides a speech emotion recognition method, including:
recognizing received speech as text data, and obtaining a vector representation of the text data;
marking emotion words in the text data through remote supervision according to an emotion polarity word list, mapping the emotion words into a random vector space, and obtaining vector representations of the emotion words;
concatenating the vector representation of the text data with the vector representations of the emotion words to obtain bottom-layer shared parameters;
and inputting the bottom-layer shared parameters into a trained multi-task learning model to obtain an emotion analysis result and emotion keywords, where the multi-task learning model includes an emotion analysis task and an emotion keyword extraction task, and the emotion keywords are used to supplement the emotion polarity word list.
Compared with the prior art, the beneficial effects of the present application are as follows:
The present application provides a server and a speech emotion recognition method. The server recognizes received speech as text data and obtains the sentence vector corresponding to the text data. Meanwhile, the server extracts the emotion words in the text data using the emotion polarity word list and converts them into word vectors. The server concatenates the sentence vector corresponding to the text data with the word vectors corresponding to the emotion words to obtain bottom-layer shared parameters, and inputs the bottom-layer shared parameters into the multi-task learning model so that the model outputs an emotion analysis result and emotion keywords. In the present application, because the server concatenates the word vectors of the emotion words with the sentence vector of the text data in which they appear, the multi-task learning model focuses on the vectors corresponding to the emotion words within the bottom-layer shared parameters, which improves the accuracy of emotion keyword extraction and, in turn, the accuracy of emotion recognition. In addition, the server continuously supplements the emotion polarity word list with the emotion keywords output by the multi-task learning model, so that emotion words in text data can subsequently be marked more comprehensively through remote supervision based on the word list.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings needed in the embodiments are briefly described below. It is obvious that those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of a system architecture to which the speech recognition method and apparatus can be applied, according to some embodiments;
FIG. 2 is a block diagram of a hardware configuration of a smart device 200, according to some embodiments;
FIG. 3 is a schematic diagram of a configuration of a smart device 200, according to some embodiments;
FIG. 4 is a schematic diagram of a voice interaction network architecture, according to some embodiments;
FIG. 5 is a training flow diagram of a multi-task learning model, according to some embodiments;
FIG. 6 is a schematic diagram of the acquisition of an initial emotion polarity word list, according to some embodiments;
FIG. 7 is a training diagram of a multi-task learning model, according to some embodiments;
FIG. 8 is a network architecture diagram of a multi-task learning model, according to some embodiments;
FIG. 9 is a flow diagram of a speech emotion recognition method, according to some embodiments;
FIG. 10 is another flow diagram of a speech emotion recognition method, according to some embodiments.
Detailed Description
To make the purpose and embodiments of the present application clearer, the exemplary embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described exemplary embodiments are only some, not all, of the embodiments of the present application.
It should be noted that the brief descriptions of the terms in the present application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.
The terms "first," "second," "third," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between similar or analogous objects or entities and not necessarily for describing a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
Fig. 1 shows an exemplary system architecture to which the speech recognition method and apparatus of the present application can be applied. As shown in fig. 1, 10 is a server and 200 is a terminal device; exemplary terminal devices include a smart TV 200a, a mobile device 200b, and a smart speaker 200c.
In the present application, the server 10 and the smart device 200 exchange data through multiple communication modes. The smart device 200 may be communicatively connected through a local area network (LAN), a wireless local area network (WLAN), or other networks. The server 10 can provide various content and interactions to the smart device 200. Illustratively, the smart device 200 and the server 10 may receive software program updates by sending and receiving information.
The server 10 may be a server that provides various services, such as a background server that provides support for audio data collected by the smart device 200. The background server may analyze and perform other processing on the received data such as audio, and feed back a processing result (e.g., endpoint information) to the terminal device. The server 10 may be a server cluster, or may be a plurality of server clusters, and may include one or more types of servers.
The smart device 200 may be hardware or software. When the smart device 200 is hardware, it may be any of various electronic devices with a sound collection function, including but not limited to a smart speaker, a smartphone, a television, a tablet computer, an e-book reader, a smart watch, a player, a computer, an AI device, a robot, a smart vehicle, and so on. When the smart device 200 is software, it may be installed in the electronic devices listed above and may be implemented as multiple software modules (for example, to provide a sound collection service) or as a single software module, which is not specifically limited here.
It should be noted that the speech emotion recognition method provided in the embodiments of the present application may be executed by the server 10, by the smart device 200, or by both, which is not limited in the present application.
Fig. 2 shows a block diagram of a hardware configuration of an intelligent device 200 according to an exemplary embodiment. The smart device 200 as shown in fig. 2 includes at least one of a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface 280. The controller includes a central processor, an audio processor, a graphic processor, a RAM, a ROM, and first to nth interfaces for input/output.
The display 260 includes a display screen component for presenting pictures and a driving component for driving image display, and is used to receive image signals output by the controller and display video content, image content, menu manipulation interfaces, and user manipulation UI interfaces.
The display 260 may be a liquid crystal display, an OLED display, or a projection display, and may also be a projection device with a projection screen.
The communicator 220 is a component for communicating with external devices or servers according to various communication protocol types. For example, the communicator may include at least one of a Wi-Fi module, a Bluetooth module, a wired Ethernet module, other network communication protocol chips or near field communication protocol chips, and an infrared receiver. The smart device 200 may send and receive control signals and data signals to and from the server 10 through the communicator 220.
The user interface 280 may be used to receive external control signals.
The detector 230 is used to collect signals of the external environment or interaction with the outside. For example, detector 230 includes a light receiver, a sensor for collecting ambient light intensity; alternatively, the detector 230 includes an image collector, such as a camera, which may be used to collect external environment scenes, attributes of the user, or user interaction gestures, or the detector 230 includes a sound collector, such as a microphone, which is used to receive external sounds.
The sound collector may be a microphone, also called a "mic" or transmitter, used to receive the user's voice and convert the sound signal into an electrical signal. The smart device 200 may be provided with at least one microphone. In other embodiments, the smart device 200 may be provided with two microphones, which collect sound signals and also enable noise reduction. In other embodiments, the smart device 200 may include three, four, or more microphones, which collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.
In addition, the microphone may be built into the smart device 200, or the microphone may be connected to the smart device 200 in a wired or wireless manner. Of course, the position of the microphone on the smart device 200 is not limited in the embodiments of the present application. Alternatively, the smart device 200 may not include a microphone, i.e., no microphone is provided in the smart device 200. In that case, the smart device 200 may be connected to an external microphone via an interface (e.g., the USB interface 130). The external microphone may be secured to the smart device 200 by an external fixture (e.g., a camera mount with a clip).
The controller 250 controls the operation of the smart device 200 and responds to user operations through various software control programs stored in the memory. The controller 250 controls the overall operation of the smart device 200.
Illustratively, the controller includes at least one of a central processing unit (CPU), an audio processor, a graphics processing unit (GPU), a random access memory (RAM), a read-only memory (ROM), first to nth input/output interfaces, a communication bus, and the like.
In some examples, taking the Android operating system as an example, the smart TV 200-1 may, as shown in fig. 3, be logically divided into an application layer 21, a kernel layer 22, and a hardware layer 23.
As shown in fig. 3, the hardware layer may include the controller 250, the communicator 220, the detector 230, and the like shown in fig. 2. The application layer 21 includes one or more applications. The application may be a system application or a third party application. For example, the application layer 21 includes a voice recognition application, and the voice recognition application may provide a voice interaction interface and a service for realizing the connection of the smart tv 200-1 with the server 10.
The kernel layer 22 acts as software middleware between the hardware layer and the application layer 21 for managing and controlling hardware and software resources.
In some examples, the kernel layer 22 includes a detector driver for sending voice data collected by the detector 230 to a voice recognition application. Illustratively, when the voice recognition application in the smart device 200 is started and the smart device 200 has established a communication connection with the server 10, the detector driver transmits the user's voice data collected by the detector 230 to the voice recognition application. The voice recognition application then sends query information containing the voice data to the intent recognition module 202 in the server. The intent recognition module 202 inputs the voice data sent by the smart device 200 into the intent recognition model.
For clarity of explanation of the embodiments of the present application, a speech emotion recognition network architecture provided by the embodiments of the present application is described below with reference to fig. 4.
Referring to fig. 4, fig. 4 is a schematic diagram of a voice interaction network architecture according to an embodiment of the present application. In fig. 4, the smart device is configured to receive input information and output the processing result of that information. The voice recognition module is deployed with a voice recognition service for recognizing audio as text; the semantic understanding module is deployed with a semantic understanding service for performing semantic analysis on the text; the business management module is deployed with a business instruction management service for providing business instructions; the language generation module is deployed with a natural language generation (NLG) service for converting instructions to be executed by the smart device into text; and the voice synthesis module is deployed with a text-to-speech (TTS) service for processing the text corresponding to the instruction and sending it to a loudspeaker for broadcasting. In one embodiment, in the architecture shown in fig. 4, multiple entity service devices may be deployed with different business services, and one or more function services may also be aggregated in one or more entity service devices.
In some embodiments, the following describes the process of handling information input to the smart device based on the architecture shown in fig. 4, taking a query statement input by voice as an example:
[ Speech recognition ]
After receiving the query statement input by voice, the smart device may perform noise reduction processing and feature extraction on the audio of the query statement, where the noise reduction processing may include removing echo and ambient noise.
[ semantic understanding ]
Natural language understanding is performed on the recognized candidate texts and associated context information using the acoustic model and the language model, and the text is parsed into structured, machine-readable information, such as business domain, intent, and word slots, to express its semantics. An intent confidence score is derived for each actionable intent, and the semantic understanding module selects one or more candidate actionable intents based on the determined intent confidence scores.
[ Business management ]
The semantic understanding module issues a query instruction to the corresponding business management module according to the semantic analysis result of the text of the query statement to acquire the query result given by the business service, executes the action required by the final request of the user, and feeds back the equipment execution instruction corresponding to the query result.
It should be noted that the architecture shown in fig. 4 is only an example, and is not a limitation to the scope of the present application. In the embodiment of the present application, other architectures may also be adopted to implement similar functions, for example: all or part of the above process can be completed by the intelligent terminal, which is not described herein.
With the continuous development of human-computer interaction, controlling smart devices through voice commands is favored by many users. When the user makes a query through the smart device, the smart device obtains the corresponding recommendation result from the server 10 and feeds it back to the user. The user can then give voice feedback on the recommendation to express whether they are satisfied with it, for example by voicing dissatisfaction with the result. When receiving such query feedback, the server needs to promptly recognize that the user is dissatisfied with the previous recommendation, identify that the user's current emotional state is negative, and adjust the dialogue or the recommendation and query results in time. At present, speech emotion recognition is mainly based on text content, and its accuracy is low. To improve the accuracy of speech emotion recognition, the present application provides, in some embodiments, a server configured to perform a speech emotion recognition process. The speech emotion recognition process is described below with reference to the drawings.
In some embodiments, the speech emotion recognition process may be performed by the server 10 or by another device. The following takes execution by the server 10 as an example.
In some embodiments, a multi-task learning model may be trained before speech emotion recognition is performed. In some embodiments, the multi-task learning model includes two tasks, an emotion analysis task and an emotion keyword extraction task, where the emotion keyword extraction task extracts words with emotion polarity from the voice uttered by the user, and the emotion analysis task determines whether the emotion of the voice uttered by the user is positive or negative.
The training process of the multi-task learning model is described below with reference to the accompanying drawings.
A training flow diagram of a multi-task learning model according to some embodiments is illustrated in fig. 5. In conjunction with FIG. 5, the training process for the multitask learning model is as follows:
S501: Acquire an initial emotion polarity word list and training data according to the user log data, where the initial emotion polarity word list is the emotion polarity word list before it is supplemented with emotion keywords.
In some embodiments, the server 10 may use the user log data as the base corpus, and after obtaining the user log data, a sampled portion of it may be manually annotated. Using the PMI (pointwise mutual information) algorithm, the server 10 computes a PMI value between a first word, i.e., a word in the user log data that appears to express emotion, and a second word, i.e., a word whose emotion polarity is already known. The PMI value indicates whether the two words have the same emotion polarity: by measuring how often the two words appear together in the corpus, when the PMI value between a word and a word with positive emotion polarity reaches a certain threshold, the word can be judged to be a positive emotion word as well, and the same applies to negative emotion words.
In some embodiments, if an emotional transition occurs in a sentence, the same polarity cannot be assumed even when the two words occur together in that sentence. Therefore, the server 10 may use a dependency parsing algorithm to determine whether an emotional transition occurs in the sentence containing the first word. If there is an emotional transition, the PMI value of the first word and the second word is not increased, and it can be determined directly that the two words do not have the same polarity. If there is no emotional transition, the PMI value of the first word and the second word is increased, and it can then be determined that the two words have the same polarity.
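The bootstrap described above can be pictured with a small sketch. It assumes the log corpus is already tokenized into sentences and that a dependency parser has flagged, per sentence, whether an emotional transition occurs; the function name, data structures, and threshold value are illustrative assumptions, not details taken from the patent.

```python
import math

def pmi_polarity_match(sentences, seed_word, candidate, has_transition, threshold=3.0):
    """Decide whether `candidate` shares the emotion polarity of `seed_word`.

    sentences:       list of token lists built from user log data
    seed_word:       a word whose emotion polarity is already known
    candidate:       a word that appears to express emotion
    has_transition:  list of booleans, one per sentence, from dependency parsing
                     (True if the sentence contains an emotional transition)
    """
    n = len(sentences)
    seed_count = cand_count = joint_count = 0
    for tokens, transition in zip(sentences, has_transition):
        seed_in, cand_in = seed_word in tokens, candidate in tokens
        seed_count += seed_in
        cand_count += cand_in
        # A co-occurrence only counts when the sentence has no emotional transition.
        if seed_in and cand_in and not transition:
            joint_count += 1
    if joint_count == 0 or seed_count == 0 or cand_count == 0:
        return False
    # PMI = log( P(seed, candidate) / (P(seed) * P(candidate)) )
    pmi = math.log((joint_count / n) / ((seed_count / n) * (cand_count / n)))
    return pmi >= threshold
```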
Fig. 6 illustrates the acquisition of the initial emotion polarity word list according to some embodiments. As shown in fig. 6, the pointwise mutual information algorithm is applied to common emotion polarity words and the user log data, and the result is corrected using a dependency parsing algorithm. The initial emotion polarity word list and the emotion classification training data are obtained from the manually annotated sample of the log data together with the above process.
S502: and obtaining input layer data according to the initial emotion polarity vocabulary and the training data.
Fig. 7 illustrates the training of the multi-task learning model according to some embodiments. With reference to fig. 7, after the training data and the initial emotion polarity word list are obtained, for a training sentence such as "I am happy and really good", the server 10 uses the initial emotion polarity word list to mark the emotion-bearing words "happy" and "good" in the sentence.
In some embodiments, the server 10 inputs the sentence into a BERT (Bidirectional Encoder Representations from Transformers) model and obtains the vector representation of the sentence from the BERT output. Meanwhile, the server 10 maps the marked emotion words into a random vector space (Extra-Feature Embedding) to obtain vector representations of the emotion words, and concatenates these vector representations with the BERT output of the whole sentence to obtain a sentence vector representation carrying the extra information, i.e., the input-layer data, also referred to as the bottom-layer shared parameters.
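A minimal PyTorch sketch of this step is shown below: emotion words found in the polarity word list are looked up in a randomly initialized embedding (the Extra-Feature Embedding), pooled, and concatenated onto the sentence vector. The embedding dimension, the mean pooling, and the handling of sentences without emotion words are assumptions for illustration, not requirements stated in the patent.

```python
import torch
import torch.nn as nn

class ExtraFeatureConcat(nn.Module):
    """Concatenate a sentence vector with an embedding of its marked emotion words."""
    def __init__(self, emotion_vocab, bert_dim=768, extra_dim=64):
        super().__init__()
        self.word2id = {w: i + 1 for i, w in enumerate(sorted(emotion_vocab))}  # 0 means "no emotion word"
        # Random vector space for emotion words (Extra-Feature Embedding).
        self.extra_embedding = nn.Embedding(len(self.word2id) + 1, extra_dim)

    def forward(self, sentence_vec, tokens):
        # Remote supervision: mark tokens that appear in the emotion polarity word list.
        ids = [self.word2id[t] for t in tokens if t in self.word2id] or [0]
        ids = torch.tensor(ids)
        # Pool the emotion-word vectors (mean) and splice them onto the sentence vector.
        emotion_vec = self.extra_embedding(ids).mean(dim=0)
        return torch.cat([sentence_vec, emotion_vec], dim=-1)  # bottom-layer shared parameters

emotion_vocab = {"happy", "good"}
layer = ExtraFeatureConcat(emotion_vocab)
shared = layer(torch.randn(768), ["i", "am", "happy", "and", "really", "good"])
print(shared.shape)  # torch.Size([832])
```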
S503: and inputting the input layer data into a plurality of expert networks to obtain a first characteristic, and inputting the input layer data into a gate network to obtain the weights of the plurality of expert networks.
S504: and weighting the first characteristics according to the weights of the plurality of expert networks to obtain second characteristics corresponding to the tasks.
S505: and inputting the second characteristics into a corresponding Tower network to obtain the data of the output layer.
Fig. 8 illustrates the network architecture of the multi-task learning model according to some embodiments. Referring to fig. 8, the multi-task learning model includes a plurality of expert networks, gate networks equal in number to the tasks, and Tower networks equal in number to the tasks. The first, second, and third expert networks perform feature extraction on the input-layer data. The first gate network calculates the weights of the expert networks for the emotion analysis task, and the second gate network calculates the weights of the expert networks for the emotion keyword extraction task. The server 10 inputs the second feature weighted according to the first gate network into the first Tower network for emotion analysis, and inputs the second feature weighted according to the second gate network into the second Tower network for emotion keyword extraction.
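The layout in fig. 8 (shared expert networks, one gate network and one Tower network per task) matches the multi-gate mixture-of-experts (MMoE) pattern. Below is a minimal PyTorch sketch of such a forward pass; the number of experts, the dimensions (832 just follows the previous sketch), and the use of single linear layers for experts and towers are assumptions for illustration, not details taken from the patent.

```python
import torch
import torch.nn as nn

class MMoE(nn.Module):
    """Generic multi-gate mixture-of-experts sketch: shared experts, one gate
    and one Tower network per task, matching the layout described for fig. 8."""
    def __init__(self, in_dim, expert_dim, n_experts=3, n_tasks=2, out_dims=(2, 2)):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(in_dim, expert_dim) for _ in range(n_experts)])
        self.gates = nn.ModuleList([nn.Linear(in_dim, n_experts) for _ in range(n_tasks)])
        self.towers = nn.ModuleList([nn.Linear(expert_dim, d) for d in out_dims])

    def forward(self, shared):                                          # bottom-layer shared parameters
        first = torch.stack([e(shared) for e in self.experts], dim=0)   # first features
        outputs = []
        for gate, tower in zip(self.gates, self.towers):
            weights = torch.softmax(gate(shared), dim=-1)               # per-task expert weights
            second = (weights.unsqueeze(-1) * first).sum(dim=0)         # second feature for the task
            outputs.append(tower(second))                               # output-layer data
        # In the described system the second tower would feed the pointer-network
        # decoding; here both towers are plain linear heads for brevity.
        return outputs

model = MMoE(in_dim=832, expert_dim=256)
emotion_out, keyword_out = model(torch.randn(832))
```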
In some embodiments, the server 10 iteratively trains the multi-task learning model through the above process. In each iteration, the server 10 obtains the output-layer data generated by the multi-task learning model, which includes the extracted emotion keywords. The server 10 filters the emotion keywords in the output-layer data according to a nonsense word list, which contains words that clearly carry no emotion polarity, such as particles like "oh" and "la". The server 10 supplements the initial emotion polarity word list with the filtered emotion keywords in the output-layer data, and the continuously supplemented word list is the emotion polarity word list actually used by the multi-task learning model later on.
The speech emotion recognition process provided by some embodiments of the present application is described below with reference to the accompanying drawings.
A flowchart of a method of speech emotion recognition according to some embodiments is illustrated in FIG. 9. As shown in fig. 9, the method comprises the steps of:
S901: Recognize the received speech as text data, and obtain a vector representation of the text data.
In some embodiments, a user inputs speech into the smart device, and the smart device may send the received speech to the server 10, where the server 10 recognizes the user's speech as text data through ASR (automatic speech recognition). Alternatively, the smart device may directly convert the user's speech into text data through ASR and then send the text data to the server 10. In some embodiments, the server 10 then obtains the vector representation of the text data through the BERT model.
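As a concrete illustration of this step, the sketch below obtains a sentence vector for the ASR-recognized text with the Hugging Face transformers library, using the [CLS] hidden state as the sentence representation. The checkpoint name and the pooling choice are assumptions; the patent only says a BERT model is used.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint; the patent does not name a specific BERT model.
MODEL_NAME = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def sentence_vector(text: str) -> torch.Tensor:
    """Return a sentence-level vector for ASR-recognized text (here: the [CLS] token)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[0, 0]  # [CLS] hidden state, shape (768,)

vec = sentence_vector("I do not like this result")
```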
S902: and marking out the emotion words in the text data by using remote supervision according to an emotion polar word list, mapping the emotion words to a random vector space, and acquiring vector representation of the emotion words.
In some embodiments, during the iterative training of the multi-task learning model, the server 10 continuously adds the newly obtained emotion keywords to the initial emotion polarity word list, yielding an emotion polarity word list rich in emotion words. The server 10 marks the emotion words in the text data through remote supervision with this emotion polarity word list and then obtains the vector representations of the emotion words.
S903: and splicing the vector representation of the text data and the vector representation of the emotion words to obtain a bottom layer sharing parameter.
In some embodiments, the vector representation of the text data and the vector representations of the emotion words are concatenated to obtain the bottom-layer shared parameters of the multi-task learning model. This process is illustrated in fig. 7.
S904: and inputting the bottom layer shared parameters into a trained multi-task learning model to obtain an emotion analysis result and emotion keywords, wherein the multi-task learning model comprises an emotion analysis task and an emotion keyword extraction task, and the emotion keywords are used for supplementing the emotion polarity word list.
In some embodiments, after obtaining the emotion analysis result and the emotion keywords from the multi-task learning model, the server 10 can continue to supplement the emotion polarity word list with the obtained emotion keywords, regardless of whether the emotion analysis result is positive or negative. Of course, only new emotion words, i.e., emotion words not already present in the list, are added to the emotion polarity word list.
In some embodiments, in the emotion keyword extraction task, the server 10 uses a pointer network to convert the task into a plurality of binary classification networks that predict a head pointer and a tail pointer, respectively. That is, the server 10 decodes the second feature weighted according to the second gate network and outputs, for each position, the probability that it is the start or end position of a keyword, i.e., the so-called pointers. The server 10 extracts the characters between the head pointer and the tail pointer as the emotion keyword.
In some embodiments, considering that the span predicted by the pointer network may be too long, which does not fit the goal of extracting short emotion keywords, the server 10 further performs post-processing, i.e., constrains the interval between the head pointer and the tail pointer to be smaller than a preset value, for example 3, to avoid extracting an emotion keyword that is too long.
In some embodiments, when the server 10 decodes the second feature weighted according to the second gate network, corresponding handling is defined for the cases of one result, multiple results, or no decoded result, to ensure that an emotion keyword can ultimately be extracted. For example, if no result is decoded, i.e., no emotion keyword is obtained, the server 10 may select the position with the second-largest value in the probability distribution as the head pointer and the tail pointer.
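The decoding rules in the three paragraphs above can be sketched as follows. The sketch assumes the extraction tower already produced per-position start and end probabilities (sigmoid outputs, matching the "plurality of binary classification networks" formulation); the 0.5 threshold, the max_len value, and the exact fallback logic are illustrative readings of the description, not a verbatim implementation.

```python
import torch

def decode_keyword(tokens, start_probs, end_probs, max_len=3, threshold=0.5):
    """Pick head/tail pointers from per-position binary probabilities.

    start_probs / end_probs: one probability per token position, giving the
    chance that the position starts / ends an emotion keyword.
    """
    starts = (start_probs >= threshold).nonzero().flatten().tolist()
    ends = (end_probs >= threshold).nonzero().flatten().tolist()
    if not starts or not ends:
        # Nothing decoded: fall back to the second-largest probability positions.
        starts = [int(torch.topk(start_probs, 2).indices[1])]
        ends = [int(torch.topk(end_probs, 2).indices[1])]
    head = starts[0]
    # Post-processing: the tail must follow the head and keep the span short.
    candidates = [e for e in ends if head <= e < head + max_len]
    tail = candidates[0] if candidates else head
    return " ".join(tokens[head:tail + 1])

tokens = ["this", "result", "is", "really", "bad"]
start_p = torch.sigmoid(torch.randn(len(tokens)))
end_p = torch.sigmoid(torch.randn(len(tokens)))
print(decode_keyword(tokens, start_p, end_p))
```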
In some embodiments, in the emotion analysis task, the server 10 may input the second feature weighted according to the first gate network into a fully connected layer and then perform binary classification.
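One possible shape of this emotion-analysis head is sketched below; the layer sizes, the hidden dimension, and the mapping of class index to label are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Hypothetical emotion-analysis head: the gate-weighted second feature passes
# through a fully connected layer and is then binary-classified.
sentiment_head = nn.Sequential(
    nn.Linear(256, 64),
    nn.ReLU(),
    nn.Linear(64, 2),  # two classes: positive / negative
)

second_feature = torch.randn(256)          # gate-weighted mixture of expert outputs
logits = sentiment_head(second_feature)
label = "negative" if int(torch.argmax(logits)) == 1 else "positive"
```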
In some embodiments, when the server 10 determines from the emotion analysis result that the user's emotion is negative, the recommendation policy needs to be updated in time so that media content can be re-recommended to the user.
The process of speech emotion recognition described above is further described below with reference to the drawings.
Fig. 10 illustrates another flow of the speech emotion recognition method according to some embodiments. With reference to fig. 10, the server 10 first applies remote supervision, using the emotion polarity word list, to a small amount of existing labeled data, uses the marked words as prior features for emotion classification and emotion keyword extraction, maps these prior features into a new vector space, and concatenates them with the sentence vectors of the original text to train the multi-task learning model, i.e., the initial model that performs emotion classification and emotion keyword extraction simultaneously. The emotion polarity word list used at this stage is the initial emotion polarity word list, which can be regarded as a list of common emotion polarity words. After the initial model is obtained, the server 10 repeatedly runs prediction on the user log data and performs N iterations to obtain a staged model. Each time the server 10 analyzes the user log data with the current model, new emotion polarity words are obtained and added to the emotion polarity word list, which then serves as the remote supervision for the next round of training; after several iterations, a more accurate emotion classification model for users' common reactions is obtained. When, after several rounds, the number of new emotion polarity words produced by the model gradually decreases and levels off, the model can be considered stable and put into use. Once the staged model is obtained, real-time user speech can be processed, and emotion keyword extraction and emotion analysis are performed with the staged model. The new emotion polarity words extracted by the model continue to supplement the emotion polarity word list. That is, after the emotion classification model is put into use, the server 10 can still collect newly appearing words with emotion polarity, and once a certain number has been reached, the model can be retrained and updated, realizing a semi-automatic optimization function. Finally, when the model determines that the user's emotion is negative, the server 10 needs to update the recommendation policy.
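The iterative procedure described above can be summarized in a short sketch. The training and prediction steps are passed in as callables because they stand for the multi-task model training and inference described in the patent; the stopping threshold and round limit are assumptions.

```python
from typing import Callable, Iterable, Set

def bootstrap_emotion_vocab(
    initial_vocab: Set[str],
    labeled_data: list,
    user_logs: Iterable[str],
    nonsense_words: Set[str],
    train_model: Callable,        # trains the multi-task model, returns the model
    extract_keywords: Callable,   # runs the model over user logs, returns emotion keywords
    min_new_words: int = 10,
    max_rounds: int = 20,
):
    vocab = set(initial_vocab)
    model = train_model(labeled_data, vocab)            # initial model
    for _ in range(max_rounds):
        keywords = extract_keywords(model, user_logs)   # predict on user log data
        new_words = {w for w in keywords
                     if w not in vocab and w not in nonsense_words}
        vocab |= new_words                              # supplement the emotion polarity word list
        model = train_model(labeled_data, vocab)        # retrain with richer remote supervision
        if len(new_words) < min_new_words:              # few new words left: model is stable
            break
    return model, vocab
```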
In the present application, the server concatenates the word vectors corresponding to the emotion words with the sentence vector of the text data in which they appear, so that the multi-task learning model focuses on the vectors corresponding to the emotion words within the bottom-layer shared parameters. This improves the accuracy of emotion keyword extraction in the multi-task learning model and, in turn, the accuracy of emotion recognition. In addition, the server continuously supplements the emotion polarity word list with the emotion keywords output by the multi-task learning model, so that emotion words in text data can subsequently be marked more comprehensively through remote supervision based on the word list. Through remote supervision, the server achieves finer-grained and more accurate emotion recognition, thereby providing direction and ideas for product optimization, providing real samples of changes in users' emotional states, and providing basic technology and data guarantees for the continuous optimization of products.
Corresponding to the server, the present application also provides a speech emotion recognition method, which includes the following steps. The server 10 recognizes received speech as text data and obtains a vector representation of the text data. The server 10 marks the emotion words in the text data through remote supervision according to an emotion polarity word list, maps the emotion words into a random vector space, and obtains vector representations of the emotion words. The server 10 concatenates the vector representation of the text data with the vector representations of the emotion words to obtain bottom-layer shared parameters, and inputs the bottom-layer shared parameters into the trained multi-task learning model to obtain an emotion analysis result and emotion keywords, where the multi-task learning model includes an emotion analysis task and an emotion keyword extraction task, and the emotion keywords are used to supplement the emotion polarity word list.
In some embodiments, during the training of the multi-task learning model, the method includes: the server 10 obtains an initial emotion polarity word list and training data according to the user log data, where the initial emotion polarity word list is the emotion polarity word list before it is supplemented with emotion keywords. The server 10 obtains input-layer data according to the initial emotion polarity word list and the training data. The server 10 inputs the input-layer data into a plurality of expert networks to obtain the first features, and inputs the input-layer data into the gate networks to obtain the weights of the plurality of expert networks. The server 10 weights the first features according to the weights of the plurality of expert networks to obtain the second feature corresponding to each task. The server 10 inputs the second features into the corresponding Tower networks to obtain the output-layer data.
In some embodiments, during the training of the multi-task learning model, the method further includes: the server 10 obtains the output-layer data generated by the multi-task learning model during the iterative training process. The server 10 filters the emotion keywords in the output-layer data according to a nonsense word list, which contains words that clearly carry no emotion polarity. The server 10 supplements the initial emotion polarity word list with the filtered emotion keywords in the output-layer data.
Since the above embodiments are described with reference to and in combination with one another, different embodiments share common parts, and the identical or similar parts among the various embodiments in this specification may be referred to one another. They are not described in detail here.
It is noted that, in this specification, relational terms such as "first" and "second," and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a circuit structure, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such circuit structure, article, or apparatus. Without further limitation, the presence of an element identified by the phrase "comprising a ..." does not exclude the presence of other like elements in a circuit structure, article, or device comprising the element.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
The above embodiments of the present application do not limit the scope of the present application.

Claims (10)

1. A server, wherein the server is configured to:
recognizing received speech as text data, and obtaining a vector representation of the text data;
marking emotion words in the text data through remote supervision according to an emotion polarity word list, mapping the emotion words into a random vector space, and obtaining vector representations of the emotion words;
concatenating the vector representation of the text data with the vector representations of the emotion words to obtain bottom-layer shared parameters;
and inputting the bottom-layer shared parameters into a trained multi-task learning model to obtain an emotion analysis result and emotion keywords, wherein the multi-task learning model comprises an emotion analysis task and an emotion keyword extraction task, and the emotion keywords are used to supplement the emotion polarity word list.
2. The server of claim 1, wherein during the training of the multitask learning model, the server is configured to:
acquiring an initial emotion polarity word list and training data according to user log data, wherein the initial emotion polarity word list is the emotion polarity word list before emotion keyword supplement;
obtaining input layer data according to the initial emotion polarity vocabulary and the training data;
inputting the input layer data into a plurality of expert networks to obtain a first characteristic, and inputting the input layer data into a gate network to obtain weights of the plurality of expert networks;
weighting the first features according to the weights of the plurality of expert networks to obtain second features corresponding to tasks;
and inputting the second characteristics into a corresponding Tower network to obtain the data of the output layer.
3. The server of claim 2, wherein during the training of the multitask learning model, the server is further configured to:
acquiring the output layer data generated by the multi-task learning model in the iterative training process;
filtering the emotion keywords in the output-layer data according to a nonsense word list, wherein the nonsense word list contains words that clearly have no emotion polarity;
and supplementing the initial emotion polarity word list according to the filtered emotion key words in the data of the output layer.
4. The server according to claim 2, wherein in the step of obtaining an initial emotion polarity vocabulary from user log data, the server is configured to:
acquiring user log data, and calculating a PMI value of a first word and a second word by using a PMI algorithm, wherein the first word is a word used for representing emotion in the user log data, the second word is a word with definite emotion polarity, and the PMI value is used for confirming whether the emotion polarities of the first word and the second word are the same;
judging whether the sentence where the first word is located has emotion turning or not by using a dependency syntax analysis algorithm;
if the sentence where the first word is located has emotion turning, the PMI value does not rise;
and if the sentence where the first word is located does not have emotion turning, the PMI value is increased.
5. The server according to claim 1, wherein after inputting the underlying sharing parameters into the trained multi-task learning model, obtaining emotion analysis results and emotion keywords, the server is further configured to:
if the emotion analysis result represents that the user emotion is positive, supplementing the emotion polar word list according to the emotion key word;
and if the emotion analysis result represents that the user emotion is negative, supplementing the emotion polar word list according to the emotion key words and updating a recommendation strategy, wherein the recommendation strategy is used for recommending media resource content for the user.
6. The server according to claim 1, wherein in the emotion keyword extraction task, the server is configured to:
converting the emotion keyword extraction task into a plurality of binary classification networks by using a pointer network, and predicting a head pointer and a tail pointer, respectively;
and extracting characters between the head pointer and the tail pointer to be used as emotion keywords.
7. The server of claim 6, wherein the server is configured to:
and controlling the interval length between the head pointer and the tail pointer to be smaller than a preset value.
8. A speech emotion recognition method, characterized in that the method comprises:
recognizing received speech as text data, and obtaining a vector representation of the text data;
marking emotion words in the text data through remote supervision according to an emotion polarity word list, mapping the emotion words into a random vector space, and obtaining vector representations of the emotion words;
concatenating the vector representation of the text data with the vector representations of the emotion words to obtain bottom-layer shared parameters;
and inputting the bottom-layer shared parameters into a trained multi-task learning model to obtain an emotion analysis result and emotion keywords, wherein the multi-task learning model comprises an emotion analysis task and an emotion keyword extraction task, and the emotion keywords are used to supplement the emotion polarity word list.
9. The method according to claim 8, wherein in the training process of the multi-task learning model, the method comprises:
acquiring an initial emotion polarity word list and training data according to user log data, wherein the initial emotion polarity word list is the emotion polarity word list before emotion keyword supplement;
obtaining input layer data according to the initial emotion polarity vocabulary and the training data;
inputting the input layer data into a plurality of expert networks to obtain a first characteristic, and inputting the input layer data into a gate network to obtain weights of the plurality of expert networks;
weighting the first features according to the weights of the plurality of expert networks to obtain second features corresponding to tasks;
and inputting the second characteristics into a corresponding Tower network to obtain the data of the output layer.
10. The method for recognizing speech emotion according to claim 9, wherein during the training of the multitask learning model, the method further comprises:
acquiring the output layer data generated by the multi-task learning model in the iterative training process;
filtering the emotion keywords in the output-layer data according to a nonsense word list, wherein the nonsense word list contains words that clearly have no emotion polarity;
and supplementing the initial emotion polarity word list according to the filtered emotion key words in the data of the output layer.
Application CN202210459756.7A, filed 2022-04-24 (priority date 2022-04-24): Server and speech emotion recognition method. Status: Pending. Published as CN114822598A.

Priority Applications (1)

CN202210459756.7A: Server and speech emotion recognition method (published as CN114822598A)


Publications (1)

CN114822598A, published 2022-07-29

Family

ID=82509709

Family Applications (1)

CN202210459756.7A, priority date 2022-04-24, filing date 2022-04-24: Server and speech emotion recognition method (CN114822598A, Pending)

Country Status (1)

Country Link
CN (1) CN114822598A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
CN117851588A * (priority date 2023-06-19, published 2024-04-09), assignee 合肥奕谦信息科技有限公司: Service information processing method and device based on big data and computer equipment



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination