CN116434787A - Voice emotion recognition method and device, storage medium and electronic equipment - Google Patents

Voice emotion recognition method and device, storage medium and electronic equipment

Info

Publication number
CN116434787A
Authority
CN
China
Prior art keywords
emotion
prediction result
local
probability
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310705248.7A
Other languages
Chinese (zh)
Other versions
CN116434787B (en)
Inventor
李太豪
黄宇鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202310705248.7A priority Critical patent/CN116434787B/en
Publication of CN116434787A publication Critical patent/CN116434787A/en
Application granted granted Critical
Publication of CN116434787B publication Critical patent/CN116434787B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The specification discloses a voice emotion recognition method and apparatus, a storage medium, and an electronic device. A target voice is acquired, and a plurality of voice segments of a preset length are selected from it. Each voice segment and the target voice are respectively input into a pre-trained emotion prediction model to obtain a local emotion prediction result for each voice segment and a global emotion prediction result for the target voice. The global emotion prediction result is then fused with at least one local emotion prediction result to obtain an optimized global emotion prediction result, and a final emotion prediction result of the target voice is determined from the optimized global emotion prediction result. Because the model can output local emotion prediction results, and the global and local emotion prediction results are fused, the global emotion prediction result is optimized and the accuracy of the final emotion prediction result is improved.

Description

Voice emotion recognition method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a method and apparatus for speech emotion recognition, a storage medium, and an electronic device.
Background
With the development of artificial intelligence, it has been applied in many fields. When artificial intelligence is used to perform tasks related to user demands, recognizing the user's emotion is often required in order to better meet those demands. When the user's emotion is recognized from the user's voice, features are typically extracted from the voice by a neural network and then passed to a classifier to obtain an emotion result. However, the accuracy of the result obtained in this way is low, and the local emotion expressed within the user's voice cannot be obtained.
Based on this, the present specification provides a method of speech emotion recognition.
Disclosure of Invention
The present disclosure provides a method, an apparatus, a storage medium, and an electronic device for speech emotion recognition, so as to partially solve the foregoing problems in the prior art.
The technical scheme adopted in the specification is as follows:
the specification provides a method for speech emotion recognition, comprising the following steps:
acquiring target voice;
selecting a plurality of voice fragments with preset lengths from the target voice;
inputting each voice segment into a pre-trained emotion prediction model for obtaining a local emotion prediction result corresponding to the voice segment according to the emotion prediction model; inputting the target voice into the emotion prediction model to obtain a global emotion prediction result of the target voice according to the emotion prediction model;
fusing the global emotion prediction result with at least one local emotion prediction result to obtain an optimized global emotion prediction result;
and determining a final emotion prediction result of the target voice according to the optimized global emotion prediction result.
Optionally, the global emotion prediction result includes probabilities that the target voice belongs to each emotion type respectively, and the local emotion prediction result includes probabilities that the voice segments belong to each emotion type respectively;
fusing the global emotion prediction result with at least one local emotion prediction result, wherein the method specifically comprises the following steps:
taking the probability that the target voice belongs to each emotion type as global probability and taking the probability that the voice fragment belongs to each emotion type as local probability;
for each emotion type, determining the maximum value of the local probability of the emotion type in the local probability of at least one local emotion prediction result as the local fusion probability of the emotion type;
and for each emotion type, weighting the global probability of the emotion type and the local fusion probability of the emotion type according to a preset fusion weight, to obtain an optimized global probability of the emotion type.
Optionally, determining a final emotion prediction result of the target voice according to the optimized global emotion prediction result specifically includes: for each local emotion prediction result, optimizing the local emotion prediction result according to the optimized global emotion prediction result to obtain an optimized local emotion prediction result;
and determining a final emotion prediction result of the target voice according to the optimized global emotion prediction result and the optimized local emotion prediction result.
Optionally, the global emotion prediction result includes probabilities that the target voice belongs to each emotion type respectively;
the local emotion prediction result comprises probabilities that the voice fragments belong to each emotion type respectively;
optimizing the local emotion prediction result according to the optimized global emotion prediction result, specifically including:
aiming at the optimized global emotion prediction result, taking the probability that the target voice belongs to each emotion type as the optimized global probability, and taking the probability that the voice fragment belongs to each emotion type as the local probability;
weighting the optimized global probability of the emotion type and the local probability of the emotion type in the local emotion prediction result according to preset weights for each emotion type to obtain the optimized local probability of the emotion type in the local emotion prediction result;
and obtaining the optimized local emotion prediction result according to the optimized local probability of each emotion type in the local emotion prediction result.
Optionally, determining the final emotion prediction result of the target voice according to the optimized global emotion prediction result and the optimized local emotion prediction result specifically includes: taking the probability that the target voice belongs to each emotion type as a global probability, and taking the probability that the voice segment belongs to each emotion type as a local probability;
selecting an emotion type corresponding to the maximum value of the global probability from the optimized global emotion prediction results, and taking the emotion type as a final emotion first prediction result of the target voice;
selecting an emotion type corresponding to the maximum value of the local probability as a final emotion second prediction result of the target voice aiming at each optimized local prediction result;
and determining the final emotion prediction result of the target voice according to the final emotion first prediction result of the target voice and the final emotion second prediction result of the target voice.
Optionally, training the emotion prediction model specifically includes:
acquiring sample voice and emotion marking of the sample voice;
inputting the sample voice into an emotion prediction model to determine an emotion prediction result of the sample according to the emotion prediction model;
determining the emotion prediction result and the difference of emotion marks corresponding to the sample voice;
and training the emotion prediction model according to the difference.
The specification provides a device for speech emotion recognition, comprising:
the target voice acquisition module is used for acquiring target voice;
the voice segment acquisition module is used for selecting a plurality of voice segments with preset lengths from the target voice;
the prediction result acquisition module is used for inputting each voice fragment into a pre-trained emotion prediction model so as to acquire a local emotion prediction result corresponding to the voice fragment according to the emotion prediction model; inputting the target voice into the emotion prediction model to obtain a global emotion prediction result of the target voice according to the emotion prediction model;
the optimization module is used for fusing the global emotion prediction result with at least one local emotion prediction result to obtain an optimized global emotion prediction result;
and the final result determining module is used for determining a final emotion prediction result of the target voice according to the optimized global emotion prediction result.
Optionally, the global emotion prediction result includes probabilities that the target voice belongs to each emotion type respectively, and the local emotion prediction result includes probabilities that the voice segments belong to each emotion type respectively. The optimization module is specifically configured to: take the probability that the target voice belongs to each emotion type as a global probability, and the probability that the voice segment belongs to each emotion type as a local probability; for each emotion type, determine the maximum value of the local probability of the emotion type among the local probabilities of the at least one local emotion prediction result as the local fusion probability of the emotion type; and, for each emotion type, weight the global probability of the emotion type and the local fusion probability of the emotion type according to a preset fusion weight to obtain the optimized global probability of the emotion type.
Optionally, the final result determining module is specifically configured to optimize, for each local emotion prediction result, the local emotion prediction result according to the optimized global emotion prediction result, so as to obtain an optimized local emotion prediction result; and determining a final emotion prediction result of the target voice according to the optimized global emotion prediction result and the optimized local emotion prediction result.
Optionally, the global emotion prediction result includes probabilities that the target voice belongs to each emotion type respectively; the local emotion prediction result comprises probabilities that the voice fragments belong to each emotion type respectively; the final result determining module is specifically configured to optimize the local emotion prediction result according to the optimized global emotion prediction result, and specifically includes: aiming at the optimized global emotion prediction result, taking the probability that the target voice belongs to each emotion type as the optimized global probability, and taking the probability that the voice fragment belongs to each emotion type as the local probability; weighting the optimized global probability of the emotion type and the local probability of the emotion type in the local emotion prediction result according to preset weights for each emotion type to obtain the optimized local probability of the emotion type in the local emotion prediction result; and obtaining the optimized local emotion prediction result according to the optimized local probability of each emotion type in the local emotion prediction result.
Optionally, the final result determining module is specifically configured to take a probability that the target voice belongs to each emotion type as a global probability, and a probability that the voice segment belongs to each emotion type as a local probability; selecting an emotion type corresponding to the maximum value of the global probability from the optimized global emotion prediction results, and taking the emotion type as a final emotion first prediction result of the target voice; selecting an emotion type corresponding to the maximum value of the local probability as a final emotion second prediction result of the target voice aiming at each optimized local prediction result; and determining the final emotion prediction result of the target voice according to the final emotion first prediction result of the target voice and the final emotion second prediction result of the target voice.
Optionally, the apparatus further comprises:
the model training module is used for acquiring sample voice and emotion marking of the sample voice; inputting the sample voice into an emotion prediction model to determine an emotion prediction result of the sample according to the emotion prediction model; determining the emotion prediction result and the difference of emotion marks corresponding to the sample voice; and training the emotion prediction model according to the difference.
The present specification provides a computer readable storage medium storing a computer program which when executed by a processor implements the method of speech emotion recognition described above.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of speech emotion recognition described above when executing the program.
At least one of the technical solutions adopted in the present specification can achieve the following beneficial effects:
according to the voice emotion recognition method provided by the specification, the local emotion prediction results can be output through the emotion prediction model, and because a plurality of local emotion prediction results of the target voice can generate certain influence on the global emotion prediction result of the target voice, the global emotion prediction result and the local emotion prediction result are fused to optimize the global emotion prediction result, so that the accuracy of the final emotion prediction result is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, illustrate exemplary embodiments of the present specification and, together with the description, serve to explain it; they are not intended to limit the specification unduly. In the drawings:
FIG. 1 is a schematic flow chart of a method for speech emotion recognition provided in the present specification;
FIG. 2 is a schematic diagram showing the local prediction results provided in the present specification;
FIG. 3 is a schematic diagram of the internal structure of the emotion prediction model provided in the present specification;
FIG. 4 is a schematic diagram of a speech emotion recognition apparatus provided in the present specification;
fig. 5 is a schematic structural diagram of an electronic device corresponding to fig. 1 provided in the present specification.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a method for identifying speech emotion provided in the present specification, which includes the following steps:
S100: and acquiring target voice.
With the development of artificial intelligence, a user's emotion can be predicted from various kinds of information related to the user, such as text, voice, and images. However, when voice is used to predict the user's emotion, the accuracy of the obtained prediction result is low. To improve the accuracy of the predicted user emotion, the present specification provides a voice emotion recognition method. The execution subject of the present specification may be a server used for model training, or another electronic device capable of predicting the user's emotion. For convenience of explanation, the voice emotion recognition method provided in the present specification is described below with the server as the execution subject.
In one or more embodiments of the present specification, recognizing the user's emotion first requires acquiring the user's voice, i.e., acquiring the target voice. To predict the user's emotion more accurately, the server may process the target voice before prediction, for example by performing format conversion, denoising, or removing audio segments that do not belong to the user, which is not limited in this specification.
S102: and selecting a plurality of voice fragments with preset lengths from the target voice.
To obtain an emotion prediction result for the user at a particular moment from the target voice, the server may divide the target voice into a plurality of voice segments, i.e., select a plurality of voice segments of a preset length from the target voice. For example, if the target voice is 10 seconds long and the preset length is 1 second, the target voice is divided into 10 voice segments each 1 second long. The preset length may be fixed, or may vary with the length of the target voice, which is not limited in this specification.
It should be noted that if the target voice cannot be divided evenly, the remaining voice at the end is taken as the last segment. For example, if the target voice is 9.5 seconds long and the preset length is 1 second, the target voice is first divided into 9 segments of 1 second each, leaving 0.5 seconds; the preset length of the last segment becomes 0.5 seconds, so the target voice is ultimately divided into 9 segments of 1 second and 1 segment of 0.5 seconds.
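A minimal Python sketch of this splitting step, assuming the target voice is available as a mono waveform array; the function name and the 16 kHz sample rate are illustrative assumptions, not part of the patent.

```python
import numpy as np

def split_into_segments(waveform: np.ndarray, sample_rate: int, segment_seconds: float = 1.0):
    """Split a mono waveform into consecutive segments of a preset length.

    The last segment keeps whatever remains when the audio does not divide
    evenly, mirroring the 9.5-second example above.
    """
    segment_len = int(segment_seconds * sample_rate)
    segments = []
    for start in range(0, len(waveform), segment_len):
        segment = waveform[start:start + segment_len]
        if len(segment) > 0:
            segments.append(segment)
    return segments

# Example: a 9.5-second target voice at 16 kHz yields 9 one-second segments
# plus a final 0.5-second segment (10 segments in total).
target_voice = np.random.randn(int(9.5 * 16000)).astype(np.float32)
print([len(s) / 16000 for s in split_into_segments(target_voice, 16000)])
```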
S104: and inputting each voice segment into a pre-trained emotion prediction model to obtain a local emotion prediction result corresponding to the voice segment according to the emotion prediction model.
It should be noted that, the emotion types of the user include neutral (no emotion), anger, happiness, aversion, fear, surprise, sadness, and the like, and the emotion types can be further divided according to the business requirement, which is not limited in this specification. The local emotion refers to the emotion of the user corresponding to a certain speech segment in the target speech, for example, the time length of the target speech is 1 minute, the emotion of the user is aversive in the first half minute, and the emotion of the user is anger in the second half minute. Of course, during the same time period, the user may exhibit multiple emotion types.
S106: inputting the target voice into the emotion prediction model to obtain a global emotion prediction result of the target voice according to the emotion prediction model.
In one or more embodiments of the present disclosure, the global emotion refers to an overall emotion of the target voice, for example, the time length of the target voice is 1 minute, and during the 1 minute, the emotion of the user is expressed as happiness as a whole.
S108: and fusing the global emotion prediction result with at least one local emotion prediction result to obtain an optimized global emotion prediction result.
Because a plurality of local emotion prediction results of the target voice can generate a certain influence on the global emotion prediction result of the target voice, in order to improve the accuracy of the global emotion prediction result, the server can fuse the global emotion prediction result with at least one local emotion prediction result to obtain an optimized global emotion prediction result.
Specifically, if the global emotion prediction result is fused with only one local emotion prediction result, the global probability of each emotion type in the global emotion prediction result is fused with the corresponding local probability in that local emotion prediction result. Of course, the global probability and the local probability may also be fused for only one emotion type, which is not limited in this specification. For example, suppose the global emotion prediction result and the local emotion prediction result cover the same emotion types, namely emotion type A, emotion type B, and emotion type C; the server may fuse only the global probability and the local probability of emotion type A, in which case the prediction results of the other two emotion types are not optimized, rather than optimizing the prediction results of all three emotion types.
If the global emotion prediction result is fused with a plurality of local emotion prediction results, then for each emotion type, the maximum of that emotion type's local probabilities across the plurality of local emotion prediction results is determined as the local fusion probability of the emotion type. For example, for emotion type A, if the local probabilities in three local emotion prediction results are 75%, 40%, and 85%, then 85% is the local fusion probability of emotion type A.
After determining the local fusion probability, the server weights, for each emotion type, the global probability of the emotion type and the local fusion probability of the emotion type according to a preset fusion weight. The fusion weight can be set as required. For example, suppose the preset fusion weight is 0.6, and for emotion type A the global probability in the global emotion prediction result is 80% and the local fusion probability is 75%. Weighting the global probability of emotion type A with its local fusion probability gives an optimized global probability of 80% × 0.6 + 75% × 0.4 = 78%. As before, the server may also fuse the global and local probabilities of only one emotion type, which is not limited in this specification.
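Taken together with the 80%/75% example above, the fusion step could look like the following minimal Python sketch; the function name, the dictionary representation of prediction results, and the 0.6 fusion weight are assumptions for illustration only, not part of the patent.

```python
def fuse_global_with_locals(global_probs: dict, local_results: list, fusion_weight: float = 0.6) -> dict:
    """Optimize the global emotion prediction by fusing it with local predictions.

    For each emotion type, the local fusion probability is the maximum local
    probability across all local results; the optimized global probability is
    a weighted sum of the global probability and that maximum.
    """
    optimized = {}
    for emotion, g_prob in global_probs.items():
        local_fusion_prob = max(local[emotion] for local in local_results)
        optimized[emotion] = fusion_weight * g_prob + (1.0 - fusion_weight) * local_fusion_prob
    return optimized

# Reproducing the numbers in the text: 0.6 * 80% + 0.4 * 75% = 78% for emotion type A
# (only one local result is fused here, so its probability is also the maximum).
fused = fuse_global_with_locals({"A": 0.80}, [{"A": 0.75}])
print(round(fused["A"], 2))  # 0.78
```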
S110: and determining a final emotion prediction result of the target voice according to the optimized global emotion prediction result.
Because the global emotion prediction result of the target voice also exerts a certain influence on the multiple local emotion prediction results, the server can use the global emotion prediction result to optimize the local emotion prediction results and improve their accuracy. That is, for each local emotion prediction result, the server optimizes it according to the optimized global emotion prediction result to obtain an optimized local emotion prediction result, and then determines the final emotion prediction result of the target voice from the optimized global emotion prediction result and the optimized local emotion prediction results. The global emotion prediction result includes the probability (i.e., confidence) that the target voice belongs to each emotion type, and each local emotion prediction result includes the probability (i.e., confidence) that the corresponding voice segment belongs to each emotion type. For example, a global emotion prediction result may be: anger 70%, aversion 15%, surprise 5%, sadness 5%, fear 3%, happiness 2%.
When optimizing a local emotion prediction result according to the optimized global emotion prediction result, the server takes the probability that the target voice belongs to each emotion type in the optimized global emotion prediction result as the optimized global probability, and the probability that the voice segment belongs to each emotion type as the local probability. For each emotion type, the optimized global probability of the emotion type and the local probability of the emotion type in the local emotion prediction result are weighted according to a preset weight, yielding the optimized local probability of that emotion type in the local emotion prediction result. The weight can be set as required. For example, suppose the preset weight is 0.9, and for emotion type A the optimized global probability is 80% and the local probability is 75%. The optimized local probability of emotion type A is then 75% × 0.9 + 80% × 0.1 = 75.5%.
After obtaining the optimized local probability of the emotion type in the local emotion prediction result, the server can obtain the optimized local emotion prediction result according to the optimized local probability of each emotion type in the local emotion prediction result. It should be noted that, for each local emotion prediction result, at least one emotion type prediction result in the local emotion prediction results is optimized, that is, at least the optimized global probability of one emotion type is fused with the local probability of the emotion type.
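A minimal sketch, under the same assumed dictionary representation, of how a local emotion prediction result could be optimized with the optimized global probabilities and the 0.9 weight from the example; the function and parameter names are illustrative.

```python
def optimize_local(local_probs: dict, optimized_global_probs: dict, local_weight: float = 0.9) -> dict:
    """Optimize a local emotion prediction using the optimized global prediction.

    Each emotion type's local probability is weighted against the corresponding
    optimized global probability with a preset weight.
    """
    return {
        emotion: local_weight * l_prob + (1.0 - local_weight) * optimized_global_probs[emotion]
        for emotion, l_prob in local_probs.items()
    }

# Reproducing the example above: 0.9 * 75% + 0.1 * 80% = 75.5% for emotion type A.
optimized = optimize_local({"A": 0.75}, {"A": 0.80})
print(round(optimized["A"], 3))  # 0.755
```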
After the optimized local emotion prediction result is obtained, the server can determine a final emotion prediction result of the target voice according to the optimized global emotion prediction result and the optimized local emotion prediction result.
Specifically, the final emotion prediction result includes a final emotion first prediction result and a final emotion second prediction result. The emotion type corresponding to the maximum global probability in the optimized global emotion prediction result is selected as the final emotion first prediction result of the target voice. For example, if the global probabilities in the optimized global emotion prediction result are anger 70%, aversion 15%, surprise 5%, sadness 5%, fear 3%, and happiness 2%, the maximum global probability is 70% and the corresponding emotion type is anger, so anger is the final emotion first prediction result of the target voice.
For each optimized local prediction result, the emotion type corresponding to the maximum local probability is selected as a final emotion second prediction result of the target voice. For example, if an optimized local prediction result is happiness 80%, surprise 6%, aversion 5%, anger 5%, and sadness 4%, then happiness is the final emotion second prediction result for that segment. The server then determines the final emotion prediction result of the target voice from the final emotion first prediction result and the final emotion second prediction results.
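The selection of the final first and second prediction results described above amounts to taking the argmax of each optimized result; a minimal sketch, again with the assumed dictionary representation, follows.

```python
def final_predictions(optimized_global: dict, optimized_locals: list):
    """Pick the final first prediction (from the optimized global result) and the
    final second predictions (one per optimized local result) by maximum probability."""
    first = max(optimized_global, key=optimized_global.get)
    seconds = [max(local, key=local.get) for local in optimized_locals]
    return first, seconds

optimized_global = {"anger": 0.70, "aversion": 0.15, "surprise": 0.05,
                    "sadness": 0.05, "fear": 0.03, "happiness": 0.02}
optimized_locals = [{"happiness": 0.80, "surprise": 0.06, "aversion": 0.05,
                     "anger": 0.05, "sadness": 0.04}]
print(final_predictions(optimized_global, optimized_locals))  # ('anger', ['happiness'])
```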
Based on the voice emotion recognition method shown in fig. 1, the method can output local emotion prediction results through the emotion prediction model, and because a plurality of local emotion prediction results of the target voice can generate a certain influence on the global emotion prediction result of the target voice, the global emotion prediction result and the local emotion prediction result are fused to optimize the global emotion prediction result, so that the accuracy of the final emotion prediction result is improved.
After performing step S110, the server may present the final emotion prediction result to the user. In addition, the server may present the global and local prediction results to the user, for example using a histogram or pie chart to display the confidence of each emotion type in the global prediction result. Fig. 2 is a schematic diagram of displaying local prediction results provided in the present specification. As shown in Fig. 2, for the local prediction results of the target voice, the server may label each segment of preset length with the confidence of each emotion type within that segment, so as to display how the user's emotion changes over time.
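One possible way to render the per-segment display that Fig. 2 describes is sketched below; matplotlib and the chart layout are assumptions for illustration, not part of the patent.

```python
import matplotlib.pyplot as plt

def plot_local_confidences(optimized_locals, segment_seconds: float = 1.0):
    """Plot, for each preset-length segment, the confidence of its top emotion type,
    to visualize how the user's emotion changes over time (cf. Fig. 2)."""
    times = [i * segment_seconds for i in range(len(optimized_locals))]
    top_emotions = [max(p, key=p.get) for p in optimized_locals]
    top_conf = [p[e] for p, e in zip(optimized_locals, top_emotions)]
    plt.bar(times, top_conf, width=segment_seconds * 0.9, align="edge")
    for t, e, c in zip(times, top_emotions, top_conf):
        plt.text(t + segment_seconds * 0.45, c, e, ha="center", va="bottom")
    plt.xlabel("time (s)")
    plt.ylabel("confidence of top emotion type")
    plt.show()

plot_local_confidences([
    {"happiness": 0.80, "surprise": 0.06, "aversion": 0.05, "anger": 0.05, "sadness": 0.04},
    {"anger": 0.55, "aversion": 0.25, "sadness": 0.10, "surprise": 0.05, "happiness": 0.05},
])
```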
The present disclosure also provides a training method for the emotion prediction model, and fig. 3 is a schematic diagram of the internal structure of the emotion prediction model provided in the present disclosure, as shown in fig. 3.
When training the emotion prediction model, the server can acquire sample voice and emotion marking of the sample voice, and then input the sample voice into the emotion prediction model so as to determine an emotion prediction result of the sample according to the emotion prediction model. And then, determining the difference between the emotion prediction result and emotion marks corresponding to the sample voice. And finally training the emotion prediction model according to the difference.
As shown in FIG. 3, the emotion prediction model includes a backbone network, a feature mapping module, and a classifier. During training, the sample voice is input into the emotion prediction model, and the backbone network extracts frame-level emotion features of the sample voice. The backbone network may be a convolutional neural network (Convolutional Neural Network, CNN), a recurrent neural network (Recurrent Neural Networks, RNN), a convolutional recurrent neural network (Convolutional Recurrent Neural Network, CRNN), or the like, and may be either a pre-trained or a non-pre-trained model. After the frame-level emotion features of the sample voice are obtained, the feature mapping module maps them to segment-level features of the whole voice by means such as average pooling or max pooling. Finally, the classifier maps the segment-level features to the predicted probability of each emotion type, which can be implemented with a fully connected layer and a Softmax layer.
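A minimal PyTorch sketch of the three-part structure shown in Fig. 3 and of the training step described above; the CNN backbone, layer sizes, feature dimensions, and the choice of a negative log-likelihood loss are illustrative assumptions, since the patent does not fix these details.

```python
import torch
import torch.nn as nn

class EmotionPredictionModel(nn.Module):
    """Backbone -> feature mapping (pooling) -> classifier, mirroring Fig. 3."""

    def __init__(self, n_mels: int = 80, hidden: int = 128, num_emotions: int = 7):
        super().__init__()
        # Backbone: a small 1-D CNN that produces frame-level emotion features.
        self.backbone = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Feature mapping: average pooling over time turns frame-level features
        # into a single segment-level feature vector.
        self.pool = nn.AdaptiveAvgPool1d(1)
        # Classifier: fully connected layer followed by Softmax over emotion types.
        self.classifier = nn.Linear(hidden, num_emotions)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, n_mels, frames), e.g. a log-Mel spectrogram
        frame_feats = self.backbone(features)               # (batch, hidden, frames)
        segment_feat = self.pool(frame_feats).squeeze(-1)   # (batch, hidden)
        return torch.softmax(self.classifier(segment_feat), dim=-1)

# Training sketch: the difference between the predicted probabilities and the
# emotion marks is measured with a negative log-likelihood loss and used to
# update the model, as described in the paragraphs above.
model = EmotionPredictionModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
sample_features = torch.randn(8, 80, 300)   # a batch of (assumed) log-Mel features
emotion_labels = torch.randint(0, 7, (8,))  # emotion marks of the sample voices
loss = nn.NLLLoss()(torch.log(model(sample_features) + 1e-8), emotion_labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(model(sample_features).shape)  # torch.Size([8, 7]): one probability per emotion type
```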
In one or more embodiments of the present specification, the server trains the emotion prediction model only so that it can produce emotion prediction results: when obtaining the global and local emotion prediction results of the target voice, the server simply inputs the target voice and its voice segments into the trained model. The subsequent mutual optimization between the global emotion prediction result and the local emotion prediction results does not involve the emotion prediction model itself, so the model only needs to be trained for the purpose of obtaining emotion prediction results.
In addition, an audio segmentation module may be added to the emotion prediction model to divide the target voice into voice segments of the preset length before the subsequent steps are carried out. For example, the target voice may be segmented with a sliding window: the window length is set, and the target voice is split once per window length to obtain a plurality of voice segments, which is not limited in this specification.
The foregoing is a method implemented by one or more embodiments of the present disclosure, and based on the same concept, the present disclosure further provides a corresponding apparatus for speech emotion recognition, as shown in fig. 4.
Fig. 4 is a schematic diagram of a device for speech emotion recognition provided in the present specification, including:
a target voice acquisition module 400, configured to acquire target voice;
a voice segment obtaining module 402, configured to select a plurality of voice segments with preset lengths from the target voice;
a prediction result obtaining module 404, configured to input, for each speech segment, the speech segment into a pre-trained emotion prediction model, so as to obtain, according to the emotion prediction model, a local emotion prediction result corresponding to the speech segment; inputting the target voice into the emotion prediction model to obtain a global emotion prediction result of the target voice according to the emotion prediction model;
the optimizing module 406 is configured to fuse the global emotion prediction result with at least one local emotion prediction result, so as to obtain an optimized global emotion prediction result;
and the final result determining module 408 is configured to determine a final emotion prediction result of the target speech according to the optimized global emotion prediction result.
Optionally, the global emotion prediction result includes probabilities that the target voice belongs to each emotion type respectively, and the local emotion prediction result includes probabilities that the voice segments belong to each emotion type respectively. The optimization module 406 is specifically configured to: take the probability that the target voice belongs to each emotion type as a global probability, and the probability that the voice segment belongs to each emotion type as a local probability; for each emotion type, determine the maximum value of the local probability of the emotion type among the local probabilities of the at least one local emotion prediction result as the local fusion probability of the emotion type; and, for each emotion type, weight the global probability of the emotion type and the local fusion probability of the emotion type according to a preset fusion weight to obtain the optimized global probability of the emotion type.
Optionally, the final result determining module 408 is specifically configured to optimize, for each local emotion prediction result, the local emotion prediction result according to the optimized global emotion prediction result, so as to obtain an optimized local emotion prediction result; and determining a final emotion prediction result of the target voice according to the optimized global emotion prediction result and the optimized local emotion prediction result.
Optionally, the global emotion prediction result includes probabilities that the target voice belongs to each emotion type respectively; the local emotion prediction result comprises probabilities that the voice fragments belong to each emotion type respectively; the final result determining module 408 is specifically configured to optimize the local emotion prediction result according to the optimized global emotion prediction result, and specifically includes: aiming at the optimized global emotion prediction result, taking the probability that the target voice belongs to each emotion type as the optimized global probability, and taking the probability that the voice fragment belongs to each emotion type as the local probability; weighting the optimized global probability of the emotion type and the local probability of the emotion type in the local emotion prediction result according to preset weights for each emotion type to obtain the optimized local probability of the emotion type in the local emotion prediction result; and obtaining the optimized local emotion prediction result according to the optimized local probability of each emotion type in the local emotion prediction result.
Optionally, the final result determining module 408 is specifically configured to take, as a global probability, a probability that the target speech belongs to each emotion type, and take, as a local probability, a probability that the speech segment belongs to each emotion type; selecting an emotion type corresponding to the maximum value of the global probability from the optimized global emotion prediction results, and taking the emotion type as a final emotion first prediction result of the target voice; selecting an emotion type corresponding to the maximum value of the local probability as a final emotion second prediction result of the target voice aiming at each optimized local prediction result; and determining the final emotion prediction result of the target voice according to the final emotion first prediction result of the target voice and the final emotion second prediction result of the target voice.
Optionally, the apparatus further comprises:
the model training module 410 is configured to obtain a sample voice and emotion marks of the sample voice; inputting the sample voice into an emotion prediction model to determine an emotion prediction result of the sample according to the emotion prediction model; determining the emotion prediction result and the difference of emotion marks corresponding to the sample voice; and training the emotion prediction model according to the difference.
The present specification also provides a computer readable storage medium storing a computer program operable to perform a method of speech emotion recognition as provided in fig. 1 above.
The present specification also provides a schematic structural diagram of the electronic device shown in fig. 5, which corresponds to fig. 1. At the hardware level, as shown in fig. 5, the electronic device includes a processor, an internal bus, a network interface, a memory, and a nonvolatile storage, and may of course include hardware required by other services. The processor reads the corresponding computer program from the non-volatile memory into the memory and then runs the computer program to implement a method for speech emotion recognition as described above with respect to fig. 1.
Of course, other implementations, such as logic devices or combinations of hardware and software, are not excluded from the present description, that is, the execution subject of the following processing flows is not limited to each logic unit, but may be hardware or logic devices.
In the 1990s, an improvement to a technology could clearly be distinguished as an improvement in hardware (e.g., improvements to circuit structures such as diodes, transistors, switches, etc.) or an improvement in software (an improvement to the method flow). However, with the development of technology, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD) (e.g., a field programmable gate array (Field Programmable Gate Array, FPGA)) is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD without requiring the chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, nowadays, instead of manually manufacturing integrated circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code to be compiled must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can be readily obtained by merely slightly logically programming the method flow into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner, for example, the controller may take the form of, for example, a microprocessor or processor and a computer readable medium storing computer readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, application specific integrated circuits (Application Specific Integrated Circuit, ASIC), programmable logic controllers, and embedded microcontrollers, examples of which include, but are not limited to, the following microcontrollers: ARC 625D, atmel AT91SAM, microchip PIC18F26K20, and Silicone Labs C8051F320, the memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller in a pure computer readable program code, it is well possible to implement the same functionality by logically programming the method steps such that the controller is in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc. Such a controller may thus be regarded as a kind of hardware component, and means for performing various functions included therein may also be regarded as structures within the hardware component. Or even means for achieving the various functions may be regarded as either software modules implementing the methods or structures within hardware components.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in one or more software and/or hardware elements when implemented in the present specification.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing is merely exemplary of the present disclosure and is not intended to limit the disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present description.

Claims (10)

1. A method of speech emotion recognition, the method comprising:
acquiring target voice;
selecting a plurality of voice fragments with preset lengths from the target voice;
inputting each voice segment into a pre-trained emotion prediction model for obtaining a local emotion prediction result corresponding to the voice segment according to the emotion prediction model; inputting the target voice into the emotion prediction model to obtain a global emotion prediction result of the target voice according to the emotion prediction model;
fusing the global emotion prediction result with at least one local emotion prediction result to obtain an optimized global emotion prediction result;
and determining a final emotion prediction result of the target voice according to the optimized global emotion prediction result.
2. The method of claim 1, wherein the global emotion prediction result comprises probabilities that the target speech belongs to each emotion type, respectively;
the local emotion prediction result comprises probabilities that the voice fragments belong to each emotion type respectively;
fusing the global emotion prediction result with at least one local emotion prediction result, wherein the method specifically comprises the following steps:
taking the probability that the target voice belongs to each emotion type as global probability and taking the probability that the voice fragment belongs to each emotion type as local probability;
for each emotion type, determining the maximum value of the local probability of the emotion type in the local probability of at least one local emotion prediction result as the local fusion probability of the emotion type;
and for each emotion type, weighting the global probability of the emotion type and the local fusion probability of the emotion type according to a preset fusion weight, to obtain an optimized global probability of the emotion type.
3. The method of claim 1, wherein determining a final emotion prediction result for the target speech based on the optimized global emotion prediction result, specifically comprises:
optimizing the local emotion prediction results according to the optimized global emotion prediction results aiming at each local emotion prediction result to obtain optimized local emotion prediction results;
and determining a final emotion prediction result of the target voice according to the optimized global emotion prediction result and the optimized local emotion prediction result.
4. The method of claim 3, wherein the global emotion prediction result includes probabilities that the target speech belongs to each emotion type, respectively;
the local emotion prediction result comprises probabilities that the voice fragments belong to each emotion type respectively;
optimizing the local emotion prediction result according to the optimized global emotion prediction result, specifically including:
aiming at the optimized global emotion prediction result, taking the probability that the target voice belongs to each emotion type as the optimized global probability, and taking the probability that the voice fragment belongs to each emotion type as the local probability;
weighting the optimized global probability of the emotion type and the local probability of the emotion type in the local emotion prediction result according to preset weights for each emotion type to obtain the optimized local probability of the emotion type in the local emotion prediction result;
and obtaining the optimized local emotion prediction result according to the optimized local probability of each emotion type in the local emotion prediction result.
5. The method of claim 3, wherein determining the final emotion prediction result of the target voice according to the optimized global emotion prediction result and the optimized local emotion prediction result specifically comprises:
taking the probability that the target voice belongs to each emotion type as a global probability, and taking the probability that the voice segment belongs to each emotion type as a local probability;
selecting, from the optimized global emotion prediction result, the emotion type corresponding to the maximum value of the global probability as a first prediction result of the final emotion of the target voice;
for each optimized local emotion prediction result, selecting the emotion type corresponding to the maximum value of the local probability as a second prediction result of the final emotion of the target voice;
and determining the final emotion prediction result of the target voice according to the first prediction result and the second prediction results.
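Claim 5 fixes how the first and second prediction results are obtained (an arg-max per result) but leaves the rule for combining them open; the majority vote in the sketch below is only one plausible choice and is not taken from the patent.

    import numpy as np
    from collections import Counter

    def final_emotion(optimized_global_probs, optimized_local_probs_list):
        # First prediction result: emotion type with the largest optimized global probability.
        first = int(np.argmax(optimized_global_probs))
        # Second prediction results: one emotion type per optimized local emotion prediction result.
        seconds = [int(np.argmax(p)) for p in optimized_local_probs_list]
        # Combine the first and second results; a simple majority vote is assumed here.
        return Counter([first] + seconds).most_common(1)[0][0]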
6. The method of claim 1, wherein training the emotion prediction model comprises:
acquiring a sample voice and an emotion annotation of the sample voice;
inputting the sample voice into an emotion prediction model to obtain an emotion prediction result of the sample voice from the emotion prediction model;
determining a difference between the emotion prediction result and the emotion annotation corresponding to the sample voice;
and training the emotion prediction model according to the difference.
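The training procedure of claim 6 is a standard supervised loop; the sketch below assumes a PyTorch model with batch-first inputs and uses a cross-entropy loss as the "difference", neither of which is specified in the claim.

    import torch.nn.functional as F

    def train_step(model, optimizer, sample_speech, emotion_label):
        # Emotion prediction result for the sample voice (logits over emotion types).
        logits = model(sample_speech)
        # Difference between the prediction and the emotion annotation.
        loss = F.cross_entropy(logits, emotion_label)
        # Train the emotion prediction model according to the difference.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()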
7. A voice emotion recognition apparatus, said apparatus comprising:
the target voice acquisition module is used for acquiring target voice;
the voice segment acquisition module is used for selecting a plurality of voice segments with preset lengths from the target voice;
the prediction result acquisition module is used for inputting each voice segment into a pre-trained emotion prediction model to obtain a local emotion prediction result corresponding to the voice segment from the emotion prediction model, and inputting the target voice into the emotion prediction model to obtain a global emotion prediction result of the target voice from the emotion prediction model;
the optimization module is used for fusing the global emotion prediction result with at least one local emotion prediction result to obtain an optimized global emotion prediction result;
and the final result determining module is used for determining a final emotion prediction result of the target voice according to the optimized global emotion prediction result.
8. The apparatus of claim 7, wherein the global emotion prediction result comprises the probability that the target voice belongs to each emotion type; the local emotion prediction result comprises the probability that the voice segment belongs to each emotion type; and the optimization module is specifically used for: taking the probability that the target voice belongs to each emotion type as a global probability, and taking the probability that the voice segment belongs to each emotion type as a local probability; for each emotion type, determining the maximum value of the local probability of the emotion type over the at least one local emotion prediction result as the local fusion probability of the emotion type; and weighting, according to a preset fusion weight, the global probability of each emotion type and the local fusion probability of that emotion type to obtain the optimized global emotion prediction result.
9. A computer readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-6.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of the preceding claims 1-6 when executing the program.
CN202310705248.7A 2023-06-14 2023-06-14 Voice emotion recognition method and device, storage medium and electronic equipment Active CN116434787B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310705248.7A CN116434787B (en) 2023-06-14 2023-06-14 Voice emotion recognition method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310705248.7A CN116434787B (en) 2023-06-14 2023-06-14 Voice emotion recognition method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN116434787A true CN116434787A (en) 2023-07-14
CN116434787B CN116434787B (en) 2023-09-08

Family

ID=87092949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310705248.7A Active CN116434787B (en) 2023-06-14 2023-06-14 Voice emotion recognition method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116434787B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103531206A (en) * 2013-09-30 2014-01-22 华南理工大学 Voice affective characteristic extraction method capable of combining local information and global information
US20160350801A1 (en) * 2015-05-29 2016-12-01 Albert Charles VINCENT Method for analysing comprehensive state of a subject
KR20200109958A (en) * 2019-03-15 2020-09-23 숭실대학교산학협력단 Method of emotion recognition using audio signal, computer readable medium and apparatus for performing the method
CN110992988A (en) * 2019-12-24 2020-04-10 东南大学 Speech emotion recognition method and device based on domain confrontation
CN111564164A (en) * 2020-04-01 2020-08-21 中国电力科学研究院有限公司 Multi-mode emotion recognition method and device
CN112489687A (en) * 2020-10-28 2021-03-12 深兰人工智能芯片研究院(江苏)有限公司 Speech emotion recognition method and device based on sequence convolution
CN112487824A (en) * 2020-11-19 2021-03-12 平安科技(深圳)有限公司 Customer service speech emotion recognition method, device, equipment and storage medium
KR20220098991A (en) * 2021-01-05 2022-07-12 세종대학교산학협력단 Method and apparatus for recognizing emtions based on speech signal
CN113255755A (en) * 2021-05-18 2021-08-13 北京理工大学 Multi-modal emotion classification method based on heterogeneous fusion network
WO2023065619A1 (en) * 2021-10-21 2023-04-27 北京邮电大学 Multi-dimensional fine-grained dynamic sentiment analysis method and system
CN114387996A (en) * 2022-01-14 2022-04-22 普强时代(珠海横琴)信息技术有限公司 Emotion recognition method, device, equipment and storage medium
CN114387997A (en) * 2022-01-21 2022-04-22 合肥工业大学 Speech emotion recognition method based on deep learning
CN114821740A (en) * 2022-05-17 2022-07-29 中国科学技术大学 Multi-mode information fusion-based emotion recognition method and device and electronic equipment
CN115312080A (en) * 2022-08-09 2022-11-08 南京工业大学 Voice emotion recognition model and method based on complementary acoustic characterization

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Jiaxing Liu et al.: "Speech Emotion Recognition with Local-Global Aware Deep Representation Learning", ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
宋明虎; 余正涛; 高盛祥; 李; 沈韬: "Speech emotion recognition method fusing local features of speech emotion words" (融合语音情感词局部特征的语音情感识别方法), Computer Engineering & Science (计算机工程与科学), no. 01 *
隋小芸; 朱廷劭; 汪静莹: "Speech emotion recognition based on local feature optimization" (基于局部特征优化的语音情感识别), Journal of University of Chinese Academy of Sciences (中国科学院大学学报), no. 04 *

Also Published As

Publication number Publication date
CN116434787B (en) 2023-09-08

Similar Documents

Publication Publication Date Title
CN112990375B (en) Model training method and device, storage medium and electronic equipment
CN113221555B (en) Keyword recognition method, device and equipment based on multitasking model
CN112735407B (en) Dialogue processing method and device
CN116663618B (en) Operator optimization method and device, storage medium and electronic equipment
CN112417093B (en) Model training method and device
CN116312480A (en) Voice recognition method, device, equipment and readable storage medium
CN110414572B (en) Image recognition method and device
CN117409466B (en) Three-dimensional dynamic expression generation method and device based on multi-label control
CN113887206B (en) Model training and keyword extraction method and device
CN116578877B (en) Method and device for model training and risk identification of secondary optimization marking
CN116186330B (en) Video deduplication method and device based on multi-mode learning
CN116434787B (en) Voice emotion recognition method and device, storage medium and electronic equipment
CN115456114A (en) Method, device, medium and equipment for model training and business execution
CN116186231A (en) Method and device for generating reply text, storage medium and electronic equipment
CN114120273A (en) Model training method and device
CN113344590A (en) Method and device for model training and complaint rate estimation
CN115862675B (en) Emotion recognition method, device, equipment and storage medium
CN117576522B (en) Model training method and device based on mimicry structure dynamic defense
CN115017915B (en) Model training and task execution method and device
CN116384515B (en) Model training method and device, storage medium and electronic equipment
CN116501852B (en) Controllable dialogue model training method and device, storage medium and electronic equipment
CN117520850A (en) Model training method and device, storage medium and electronic equipment
CN116340852B (en) Model training and business wind control method and device
CN118098266A (en) Voice data processing method and device based on multi-model selection
CN117933707A (en) Wind control model interpretation method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant