CN113689886A - Voice data emotion detection method and device, electronic equipment and storage medium


Info

Publication number
CN113689886A
CN113689886A
Authority
CN
China
Prior art keywords
emotion
layer
voice data
area
features
Prior art date
Legal status
Granted
Application number
CN202110791641.3A
Other languages
Chinese (zh)
Other versions
CN113689886B (en)
Inventor
李建强
张硕
付光晖
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202110791641.3A
Publication of CN113689886A
Application granted
Publication of CN113689886B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a voice data emotion detection method and device, electronic equipment and a storage medium. The method comprises: inputting voice data into an emotion detection model to obtain an emotion detection result output by the emotion detection model. After extracting the voice features of the voice data, the emotion detection model divides the voice features into a plurality of candidate emotion area features, determines target emotion area features from the candidate emotion area features based on non-maximum suppression, and performs emotion classification on the target emotion area features to obtain the emotion detection result. Because the voice data corresponding to each candidate emotion area feature is one or more complete sentences, emotion detection can be performed on the basis of complete sentences. This avoids the shortcoming of the conventional approach, which classifies emotion for each sentence of a conversation, does not consider the start and end positions of each emotion, ignores the process and stages of emotion expression, and therefore cannot detect emotion accurately.

Description

Voice data emotion detection method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of natural language processing, in particular to a voice data emotion detection method and device, electronic equipment and a storage medium.
Background
Emotion detection of voice data analyzes the original audio to obtain the emotion information expressed by the speaker. Speech records the speaker's complete vocal information, including both the linguistic content and the intonation of the voice. Since speakers usually express their inner emotions through the combined action of what they say and how they say it, emotion analysis based on speech has an advantage over emotion analysis based on text alone.
Long continuous conversational speech is common in daily life, for example in service and rescue scenarios such as customer service hotlines and psychological assistance hotlines, where the emotional changes of a caller are obtained by analyzing the long conversation and an overall evaluation is finally produced. At present, the main method for emotion analysis of long continuous conversations is to classify the emotion of each sentence. However, the expression of psychological emotion is a temporal process, an accumulated result of several sentences uttered in sequence, and the number of sentences needed to judge an emotion is often difficult to determine. Because this sentence-by-sentence method does not consider the start and end positions of each emotion and ignores the process and stages of emotion expression, it cannot detect emotion accurately.
Disclosure of Invention
The invention provides a voice data emotion detection method and device, electronic equipment and a storage medium, which are used for solving the defect of low voice data emotion detection precision in the prior art.
The invention provides a voice data emotion detection method, which comprises the following steps:
determining voice data to be detected, wherein the voice data comprises at least one complete sentence;
inputting the voice data into an emotion detection model to obtain an emotion detection result output by the emotion detection model;
the emotion detection model is obtained by training based on sample voice data containing at least one complete sentence and a corresponding sample emotion detection result; the emotion detection model is used for dividing the voice features into a plurality of candidate emotion area features after the voice features of the voice data are extracted, determining target emotion area features from the candidate emotion area features based on non-maximum suppression, and performing emotion classification on the target emotion area features to obtain emotion detection results; the voice data corresponding to each candidate emotional area feature is one or more complete sentences.
According to the emotion detection method of the voice data provided by the invention, the voice data is input to an emotion detection model to obtain an emotion detection result output by the emotion detection model, and the method comprises the following steps:
inputting the voice data to a feature extraction layer of the emotion detection model to obtain voice features of the voice data output by the feature extraction layer;
inputting the voice features to a candidate region detection layer of the emotion detection model to obtain a plurality of candidate emotion region features output by the candidate region detection layer;
inputting the candidate emotional area characteristics to a target area detection layer of the emotion detection model, and performing non-maximum suppression processing on the candidate emotional area characteristics by the target area detection layer to obtain the target emotional area characteristics output by the target area detection layer;
and inputting the target emotion region characteristics to an emotion classification layer of the emotion detection model to obtain the emotion detection result output by the emotion classification layer.
According to the emotion detection method for voice data provided by the invention, the inputting the voice data into the feature extraction layer of the emotion detection model to obtain the voice feature of the voice data output by the feature extraction layer comprises the following steps:
inputting the voice data into a spectrogram conversion layer of the feature extraction layer to obtain a spectrogram corresponding to the voice data output by the spectrogram conversion layer;
inputting the spectrogram into an up-sampling layer of the feature extraction layer, and performing up-sampling on the spectrogram by the up-sampling layer to obtain high-dimensional features output by the up-sampling layer;
and inputting the high-dimensional features into a context fusion layer of the feature extraction layer, and performing context information fusion on the high-dimensional features by the context fusion layer to obtain the voice features of the voice data output by the context fusion layer.
According to the emotion detection method for voice data provided by the invention, the step of inputting the voice data into the spectrogram conversion layer of the feature extraction layer to obtain the spectrogram corresponding to the voice data output by the spectrogram conversion layer comprises the following steps:
and inputting the voice data into a spectrogram conversion layer of the feature extraction layer, and sequentially performing framing processing, windowing processing and Fourier transform on the voice data by the spectrogram conversion layer to obtain a spectrogram corresponding to the voice data output by the spectrogram conversion layer.
According to the emotion detection method for voice data provided by the invention, the inputting the voice features into the candidate region detection layer of the emotion detection model to obtain the candidate emotion region features output by the candidate region detection layer comprises the following steps:
inputting the voice features to an emotion area prediction layer of the candidate area detection layer to obtain a plurality of initial candidate emotion area features output by the emotion area prediction layer;
outputting each initial candidate emotional area feature to an endpoint detection layer of the candidate area detection layer, and adjusting an initial endpoint and/or a termination endpoint of each initial candidate emotional area feature by the endpoint detection layer to obtain the candidate emotional area features output by the endpoint detection layer.
According to the emotion detection method for voice data provided by the invention, the obtaining of the target emotion area feature output by the target area detection layer further comprises:
inputting the target emotional area characteristics and the voice characteristics to a risk degree prediction model to obtain a risk degree prediction result output by the risk degree prediction model;
the risk degree prediction model is obtained by training based on sample emotional area characteristics and the risk degree corresponding to the sample emotional area characteristics.
According to the emotion detection method of voice data provided by the invention, the step of inputting the characteristics of the target emotion area into a risk degree prediction model to obtain a risk degree prediction result output by the risk degree prediction model comprises the following steps:
inputting the target emotional area characteristics and the voice characteristics to a characteristic fusion layer of the risk degree prediction model to obtain fusion characteristics output by the characteristic fusion layer;
and inputting the fusion characteristics to a result prediction layer of the risk degree prediction model to obtain the risk degree prediction result output by the result prediction layer.
The invention also provides a voice data emotion detection device, which comprises:
the determining unit is used for determining voice data to be detected, and the voice data comprises at least one complete statement;
the detection unit is used for inputting the voice data into an emotion detection model to obtain an emotion detection result output by the emotion detection model;
the emotion detection model is obtained by training based on sample voice data containing at least one complete sentence and a corresponding sample emotion detection result; the emotion detection model is used for dividing the voice features into a plurality of candidate emotion area features after the voice features of the voice data are extracted, determining target emotion area features from the candidate emotion area features based on non-maximum suppression, and performing emotion classification on the target emotion area features to obtain emotion detection results; the voice data corresponding to each candidate emotional area feature is one or more complete sentences.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the computer program to realize the steps of any one of the voice data emotion detection methods.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method for emotion detection of speech data as described in any of the above.
According to the voice data emotion detection method and device, the electronic equipment and the storage medium, after the emotion detection model extracts the voice features of the voice data, it divides the voice features into a plurality of candidate emotion area features, determines target emotion area features from the candidate emotion area features based on non-maximum suppression, and performs emotion classification on the target emotion area features to obtain the emotion detection result. Because the voice data corresponding to each candidate emotion area feature is one or more complete sentences, emotion detection can be performed on the basis of complete sentences, avoiding the shortcoming of the conventional approach, which classifies emotion for each sentence of a conversation, does not consider the start and end positions of each emotion, ignores the process and stages of emotion expression, and therefore cannot detect emotion accurately.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a method for emotion detection of speech data according to the present invention;
FIG. 2 is a second flowchart of the emotion detection method for voice data according to the present invention;
FIG. 3 is a schematic structural diagram of an emotion detection apparatus for speech data according to the present invention;
fig. 4 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Currently, the main method for emotion detection in long continuous conversations is to classify the emotion of each sentence. Although this method considers context information, it does not consider the start and end positions at which each emotion arises and ignores the process nature and stages of emotion expression. Because the expression of psychological emotion is a temporal process, an accumulated result of several sentences uttered in sequence, and the number of sentences needed to judge an emotion is often difficult to determine, an emotion region detection method based on long-speech analysis is needed: different emotions in the speech are located and analyzed to obtain different emotion regions, an emotion judgment is given for each region, and on this basis a final evaluation of the whole conversation is produced.
For example, a psychological assistance hotline is an important long-conversation speech scenario. In the past, screening and assessment of the emotional risk in such calls mostly relied on various clinical assessment scales. However, such conventional quantitative evaluation requires the close cooperation of the caller to report the relevant circumstances truthfully and comprehensively, which not only takes a great deal of time but also easily causes dissatisfaction. If artificial intelligence can instead uncover the caller's hidden psychological state from the fluctuations of voice and emotion during the conversation, the time spent on manual evaluation can be reduced in routine intervention, the quality of the hotline service can be improved, and the caller's emotion can still be detected even when the caller does not cooperate with the assessment.
In view of the above, the present invention provides a method for emotion detection of voice data. Fig. 1 is a schematic flow chart of a speech data emotion detection method provided by the present invention, as shown in fig. 1, the method includes the following steps:
step 110, determining voice data to be detected, wherein the voice data comprises at least one complete sentence.
Specifically, the voice data to be detected refers to voice data on which emotion detection is to be performed. The voice data to be detected comprises at least one complete sentence, where a complete sentence is a sentence with complete meaning and structure. A structurally complete sentence generally has two parts: the former part mainly states "who" or "what", and the latter part mainly states "what is done", "how it is", and so on. Both parts are needed for the sentence to be clearly understood as expressing a complete meaning. For example, in the sentence "My sister is learning to type", "my sister" states "who" and "is learning to type" states "what she is doing", so together they constitute a complete sentence. In addition, a complete sentence may also be delimited according to punctuation marks such as ".", "?" and "!", which is not specifically limited in the embodiments of the present invention.
The voice data to be detected may be voice data collected by a voice device. It can be understood that, because the voice collected by the device may be disturbed by various kinds of noise in the surrounding environment, the collected original speech is not clean but noisy speech polluted by noise; under heavy interference, the useful speech may even be submerged by the noise. Therefore, noise reduction may be performed on the noisy original speech to extract the useful speech from the noise background and suppress the interference, so that emotion detection can be performed accurately on clean voice data. The collected voice data may be denoised with a noise reduction algorithm (e.g., the OMLSA algorithm, the LTSA algorithm, etc.), which is not specifically limited in the embodiments of the present invention.
Step 120, inputting the voice data into an emotion detection model to obtain an emotion detection result output by the emotion detection model;
the emotion detection model is obtained by training based on sample voice data containing at least one complete sentence and a corresponding sample emotion detection result; the emotion detection model is used for dividing the voice features into a plurality of candidate emotion area features after the voice features of the voice data are extracted, determining target emotion area features from the candidate emotion area features on the basis of non-maximum inhibition, and performing emotion classification on the target emotion area features to obtain emotion detection results; the voice data corresponding to each candidate emotional area feature is one or more complete sentences.
Specifically, after voice data is input into the emotion detection model, the emotion detection model firstly extracts voice features of the voice data, carries out emotion area detection on the voice features, and takes the voice features corresponding to one or more sentences containing the same emotion as a candidate emotion area feature, so that the voice features are divided into a plurality of candidate emotion area features. Because a plurality of candidate emotion area features may have redundant candidate frames, the target emotion area features are determined from the plurality of candidate emotion area features based on non-maximum suppression, and then emotion classification is performed on the target emotion area features to obtain emotion detection results. The emotion detection result may be probabilities (e.g., probabilities corresponding to happiness, sadness, anger, and the like) of various kinds of emotions corresponding to each target emotion area, and may also be directly output an emotion classification corresponding to each target emotion area.
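As a purely illustrative, non-limiting sketch, the non-maximum suppression step over candidate emotion areas could be implemented as below; the temporal IoU threshold and the NumPy-based greedy formulation are assumptions for illustration, not requirements of this embodiment.

```python
import numpy as np

def nms_1d(regions, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression over 1-D time regions.

    regions: (N, 2) array of [start, end] times of candidate emotion areas.
    scores:  (N,) confidence scores from the candidate region detection layer.
    Returns the indices of the kept (target) emotion areas.
    """
    order = np.argsort(scores)[::-1]            # highest-scoring candidates first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # temporal IoU between the best region and the remaining candidates
        inter_start = np.maximum(regions[i, 0], regions[order[1:], 0])
        inter_end = np.minimum(regions[i, 1], regions[order[1:], 1])
        inter = np.clip(inter_end - inter_start, 0.0, None)
        union = (regions[i, 1] - regions[i, 0]) + \
                (regions[order[1:], 1] - regions[order[1:], 0]) - inter
        iou = inter / np.maximum(union, 1e-8)
        order = order[1:][iou <= iou_threshold]  # drop redundant overlapping candidate frames
    return keep
```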
It should be noted that the voice data corresponding to each candidate emotion area feature is one or more complete sentences, so emotion detection can be performed on the basis of complete sentences, avoiding the shortcoming of the conventional approach, which classifies emotion for each sentence of a conversation, does not consider the start and end positions of each emotion, ignores the process and stages of emotion expression, and therefore cannot detect emotion accurately.
According to the voice data emotion detection method provided by the embodiment of the invention, after the emotion detection model extracts the voice features of the voice data, it divides the voice features into a plurality of candidate emotion area features, determines target emotion area features from the candidate emotion area features based on non-maximum suppression, and performs emotion classification on the target emotion area features to obtain the emotion detection result. Because the voice data corresponding to each candidate emotion area feature is one or more complete sentences, emotion detection can be performed on the basis of complete sentences, avoiding the shortcoming of the conventional approach, which classifies emotion for each sentence of a conversation, does not consider the start and end positions of each emotion, ignores the process and stages of emotion expression, and therefore cannot detect emotion accurately.
Based on the above embodiment, inputting the voice data to the emotion detection model to obtain an emotion detection result output by the emotion detection model, including:
inputting the voice data into a feature extraction layer of the emotion detection model to obtain voice features of the voice data output by the feature extraction layer;
inputting the voice features into a candidate region detection layer of the emotion detection model to obtain a plurality of candidate emotion region features output by the candidate region detection layer;
inputting the candidate emotional area characteristics to a target area detection layer of the emotion detection model, and performing non-maximum suppression processing on the candidate emotional area characteristics by the target area detection layer to obtain target emotional area characteristics output by the target area detection layer;
and inputting the characteristics of the target emotion area into an emotion classification layer of the emotion detection model to obtain an emotion detection result output by the emotion classification layer.
Specifically, after the voice data to be detected is determined, the voice data is input to the feature extraction layer, the voice data is converted into a spectrogram by the feature extraction layer, and voice features are extracted based on the spectrogram.
After the voice feature is obtained, the voice feature is input to the candidate region detection layer, and the candidate region detection layer performs emotion region detection. It should be noted that the initial candidate emotion area feature obtained when the candidate area detection layer performs emotion area detection may not be a complete sentence, and in order to ensure the accuracy of the subsequent emotion detection result, the candidate area detection layer performs endpoint detection on the initial candidate emotion area feature to adjust the start and stop points corresponding to the initial candidate emotion area feature, so that the candidate emotion area feature corresponds to one or more complete sentences.
Because the obtained candidate emotion area features may contain redundant candidate boxes, the candidate emotion area features are subjected to non-maximum suppression to obtain the target emotion area features, which further refines the extracted emotion areas. The target emotion area features are then classified by the emotion classification layer to obtain the emotion detection result. The emotion classification layer may perform emotion classification on the target emotion area features based on a recurrent neural network (RNN).
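For illustration only, an RNN-based emotion classification layer could take the form sketched below; PyTorch, the GRU variant, the feature dimension and the four emotion classes are assumed choices rather than limitations of this embodiment.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    """RNN-based emotion classification head for one target emotion region."""

    def __init__(self, feature_dim=256, hidden_dim=128, num_emotions=4):
        super().__init__()
        self.rnn = nn.GRU(feature_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_emotions)

    def forward(self, region_features):            # (batch, frames, feature_dim)
        _, h_n = self.rnn(region_features)          # final hidden state summarizes the region
        return torch.softmax(self.out(h_n[-1]), dim=-1)  # per-class emotion probabilities
```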
Based on any one of the above embodiments, inputting the voice data to the feature extraction layer of the emotion detection model to obtain the voice features of the voice data output by the feature extraction layer, including:
inputting the voice data into a spectrogram conversion layer of the feature extraction layer to obtain a spectrogram corresponding to the voice data output by the spectrogram conversion layer;
inputting the spectrogram into an up-sampling layer of the feature extraction layer, and performing up-sampling on the spectrogram by the up-sampling layer to obtain high-dimensional features output by the up-sampling layer;
and inputting the high-dimensional features into a context fusion layer of the feature extraction layer, and performing context information fusion on the high-dimensional features by the context fusion layer to obtain the voice features of the voice data output by the context fusion layer.
Specifically, the voice data is input into the spectrogram conversion layer to obtain the corresponding spectrogram S_n. The spectrogram is then up-sampled by the up-sampling layer to obtain the high-dimensional features G_ct. Finally, the context fusion layer further fuses context information into the extracted high-dimensional feature vectors G_ct to obtain the final voice features G_t of the voice data.
The up-sampling layer may be constructed based on a convolutional neural network (CNN), and the context fusion layer may be constructed based on a bidirectional recurrent neural network (BiLSTM), which is not specifically limited in the embodiments of the present invention.
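By way of example, a feature extraction layer built from a CNN up-sampling stage and a two-layer BiLSTM context fusion stage might look as follows; the channel counts, kernel sizes and hidden sizes are illustrative assumptions, not values fixed by this disclosure.

```python
import torch.nn as nn

class FeatureExtractionLayer(nn.Module):
    """Spectrogram -> high-dimensional features (CNN) -> context-fused features (BiLSTM)."""

    def __init__(self, n_freq_bins=128, cnn_channels=64, lstm_hidden=128):
        super().__init__()
        # "Up-sampling" layer: a small CNN that lifts the spectrogram to high-dimensional features G_ct
        self.cnn = nn.Sequential(
            nn.Conv2d(1, cnn_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(cnn_channels, cnn_channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Context fusion layer: two-layer bidirectional LSTM over the time axis, producing G_t
        self.bilstm = nn.LSTM(cnn_channels * n_freq_bins, lstm_hidden,
                              num_layers=2, batch_first=True, bidirectional=True)

    def forward(self, spectrogram):                 # (batch, time, n_freq_bins)
        x = self.cnn(spectrogram.unsqueeze(1))      # (batch, channels, time, n_freq_bins)
        x = x.permute(0, 2, 1, 3).flatten(2)        # (batch, time, channels * n_freq_bins)
        g_t, _ = self.bilstm(x)                     # (batch, time, 2 * lstm_hidden)
        return g_t
```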
Based on any one of the above embodiments, inputting the voice data to the spectrogram conversion layer of the feature extraction layer to obtain a spectrogram corresponding to the voice data output by the spectrogram conversion layer, including:
and inputting the voice data into a spectrogram conversion layer of the feature extraction layer, and sequentially performing framing processing, windowing processing and Fourier transform on the voice data by the spectrogram conversion layer to obtain a spectrogram corresponding to the voice data output by the spectrogram conversion layer.
Specifically, the spectrogram conversion layer frames the voice data x(n), applies a window w(n − m) to each frame, and then performs a short-time Fourier transform (STFT) on each frame n; the per-frame results are stacked along the time dimension to generate the spectrogram. The short-time Fourier transform is given by the following formula, where x(m) is the framed speech signal, w(n − m) is the window function applied to frame n, and the width of the window determines the number of samples summed per frame:
S(n, ω) = Σ_m x(m) · w(n − m) · e^(−jωm)
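The following is a minimal sketch of the framing, windowing and Fourier transform described above; SciPy is used for convenience, and the 25 ms frame length, 10 ms hop and Hamming window are assumed defaults rather than values prescribed by this disclosure.

```python
import numpy as np
from scipy.signal import stft

def to_spectrogram(waveform, sample_rate=16000, frame_len=400, hop_len=160):
    """Framing + windowing + STFT, stacked along time into a log-magnitude spectrogram."""
    _, _, Z = stft(waveform, fs=sample_rate, window="hamming",
                   nperseg=frame_len, noverlap=frame_len - hop_len)
    magnitude = np.abs(Z)                 # (freq_bins, frames) complex STFT -> magnitude
    return np.log(magnitude + 1e-8).T     # (frames, freq_bins), log scale for stability
```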
Based on any of the above embodiments, inputting the speech features into the candidate region detection layer of the emotion detection model to obtain a plurality of candidate emotion region features output by the candidate region detection layer, including:
inputting the voice features into an emotion area prediction layer of a candidate area detection layer to obtain a plurality of initial candidate emotion area features output by the emotion area prediction layer;
and outputting each initial candidate emotional area feature to an endpoint detection layer of the candidate area detection layer, and adjusting the initial endpoint and/or the termination endpoint of each initial candidate emotional area feature by the endpoint detection layer to obtain a plurality of candidate emotional area features output by the endpoint detection layer.
Specifically, the endpoint detection layer is able to detect the temporal location of a complete sentence. The emotion area prediction layer can detect a time area in which a phrase including a certain emotion is present. In the emotion area prediction process, the endpoint detection layer as an auxiliary task can help to adjust the boundary of the emotion area, so that the predicted emotion area contains complete voice content, and the specific detection process is as follows:
(a) Input data: receive the voice features G_t of the voice data output by the feature extraction layer.
(b) Endpoint detection and emotion area prediction: a fully connected neural network (FC) is used to obtain the predicted emotion areas and the predicted speech endpoints, and a loss value L is calculated with a loss function for back-propagation. The loss function consists of two parts, an endpoint loss L_vad and a bounding-box loss L_loc, balanced by coefficients λ1 and λ2:
L = λ1 · L_vad + λ2 · L_loc
where L_vad is the loss between the predicted endpoints and the truly annotated endpoints E_t^vad, calculated using a cross-entropy loss function, and L_loc is the loss between the predicted emotion areas and the truly annotated emotion areas E_t^loc, calculated using the following formula:
L_loc = Σ_i ( | t_start,i^pred − t_start,i^truth | + | t_end,i^pred − t_end,i^truth | )
where t denotes the start or end time of an emotion area, the subscript start denotes the start time, the subscript end denotes the end time, the subscript i indexes the predicted candidate boxes, the superscript pred denotes a predicted value, and the superscript truth denotes an annotated value.
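A minimal sketch of this combined loss, assuming PyTorch tensors; the plain L1 form of the boundary-regression term is one reasonable reading of the bounding-box loss above and is not asserted to be the exact formula of this disclosure.

```python
import torch.nn.functional as F

def detection_loss(vad_logits, vad_labels, pred_regions, true_regions,
                   lambda_vad=1.0, lambda_loc=1.0):
    """Combined training loss for the candidate region detection layer.

    vad_logits:  (frames, 2) endpoint predictions; vad_labels: (frames,) integer 0/1 targets.
    pred_regions / true_regions: (num_candidates, 2) [t_start, t_end] pairs.
    lambda_vad / lambda_loc play the role of the balance coefficients λ1, λ2.
    """
    l_vad = F.cross_entropy(vad_logits, vad_labels)                 # endpoint loss L_vad
    l_loc = (pred_regions - true_regions).abs().sum(dim=1).mean()   # boundary regression loss L_loc
    return lambda_vad * l_vad + lambda_loc * l_loc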
Based on any of the above embodiments, obtaining the target emotional area characteristics output by the target area detection layer, then further includes:
inputting the target emotional area characteristics and the voice characteristics into a risk degree prediction model to obtain a risk degree prediction result output by the risk degree prediction model;
the risk degree prediction model is obtained by training based on the sample emotional area characteristics and the risk degree corresponding to the sample emotional area characteristics.
Specifically, after the target emotion area characteristics are obtained, the target emotion area characteristics and the voice characteristics can be input into the risk degree prediction model, and a risk degree prediction result output by the risk degree prediction model is obtained, so that the risk degree of negative emotion carried in the voice data to be detected can be judged based on the prediction result, and the emotional state of the voice data user can be further accurately known.
Before the target emotion area features and the voice features are input into the risk degree prediction model, the risk degree prediction model can be obtained through pre-training, which may be realized by the following steps: first, a large number of sample emotion area features are collected, and the risk degree corresponding to each sample emotion area feature is determined through manual annotation; then an initial model is trained based on the sample emotion area features and their corresponding risk degrees to obtain the risk degree prediction model.
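For illustration, such pre-training could follow a standard supervised loop like the sketch below; the optimizer, learning rate, and the assumption that the risk degree is a discrete label drawn from a `loader` of annotated samples are illustrative choices, not part of this disclosure.

```python
import torch
import torch.nn as nn

def train_risk_model(model, loader, epochs=10, lr=1e-3):
    """Minimal supervised training loop for the risk degree prediction model.

    `loader` is assumed to yield (region_features, speech_features, risk_label)
    batches built from manually annotated sample emotion areas.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()        # risk degree treated here as discrete levels
    for _ in range(epochs):
        for region_feat, speech_feat, risk_label in loader:
            pred = model(region_feat, speech_feat)
            loss = criterion(pred, risk_label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```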
Based on any one of the above embodiments, inputting the target emotional area characteristics to the risk degree prediction model to obtain a risk degree prediction result output by the risk degree prediction model, including:
inputting the target emotional area characteristics and the voice characteristics into a characteristic fusion layer of the risk degree prediction model to obtain fusion characteristics output by the characteristic fusion layer;
and inputting the fusion characteristics to a result prediction layer of the risk degree prediction model to obtain a risk degree prediction result output by the result prediction layer.
Specifically, in the feature fusion layer, the target emotion area features finally output for each area are concatenated along the time dimension and used as the keys and values (K, V), the voice features obtained from the feature extraction layer are used along their final dimension as the query (Q), and attention fusion is performed using the following function:
Attention(Q, K, V) = softmax(Q · K^T / √d) · V
where d is the time dimension of Q.
The result prediction layer then performs risk degree prediction on the fused features using a fully connected neural network (FCN) to obtain the risk degree prediction result.
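The sketch below shows one possible realization of this feature fusion layer (scaled dot-product attention) followed by the result prediction layer; the dimensions, the number of risk levels and the √d scaling are assumptions for illustration.

```python
import torch
import torch.nn as nn

class RiskDegreePredictor(nn.Module):
    """Attention-based feature fusion followed by a fully connected result prediction layer."""

    def __init__(self, feature_dim=256, num_risk_levels=3):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(feature_dim, 128), nn.ReLU(),
            nn.Linear(128, num_risk_levels),
        )

    def forward(self, region_features, speech_features):
        # region_features: (batch, num_regions, feature_dim) -> keys/values K, V
        # speech_features: (batch, feature_dim)              -> query Q
        q = speech_features.unsqueeze(1)                             # (batch, 1, d)
        d = q.size(-1)
        attn = torch.softmax(q @ region_features.transpose(1, 2) / d ** 0.5, dim=-1)
        fused = (attn @ region_features).squeeze(1)                  # (batch, feature_dim)
        return self.fc(fused)                                        # risk degree scores
```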
Based on any of the above embodiments, the present invention further provides a method for detecting emotion in voice data, as shown in fig. 2, the method includes:
First, the original audio file is input into the spectrogram conversion module (SFM) to obtain the corresponding spectrogram S_n. A feature extraction module (FEM) then extracts features from the spectrogram S_n to obtain the feature vectors G_t. Next, the high-dimensional feature vectors G_t are input into the emotion area prediction module (EPM) to predict the different emotion areas E_loc. A regional emotion classification module (ECM) performs emotion classification for each predicted emotion area. Finally, the emotion feature vectors obtained during emotion classification are fused with the feature vectors produced by the feature extraction module, and a risk degree prediction module (DPM) predicts the risk degree D.
The feature extraction module adopts a model structure combining a convolutional neural network (CNN) and a bidirectional recurrent neural network (BiLSTM): the CNN up-samples the spectrogram to obtain the high-dimensional features G_ct, and the two-layer BiLSTM further fuses context information into the high-dimensional feature vectors G_ct extracted by the CNN to obtain the final feature vectors G_t.
In addition, the emotion area prediction module (EPM) detects the time region in which several sentences containing a certain emotion are located. During emotion area prediction, the endpoint detection module (VADM), acting as an auxiliary task, helps adjust the boundaries of the emotion areas so that the predicted emotion areas contain complete speech content; the endpoint detection module (VADM) can detect the temporal position of a complete sentence.
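Putting the modules of Fig. 2 together, an end-to-end pass could be orchestrated as in the sketch below; the module objects and their call signatures are assumed placeholders for illustration, not a definitive implementation of this embodiment.

```python
def detect_emotions(waveform, sfm, fem, epm, ecm, dpm):
    """Illustrative end-to-end flow of Fig. 2; the five modules are assumed callables."""
    spectrogram = sfm(waveform)                # SFM: raw audio -> spectrogram S_n
    features = fem(spectrogram)                # FEM: CNN + BiLSTM -> feature vectors G_t
    regions = epm(features)                    # EPM: list of (start, end) frame indices, E_loc
    region_feats = [features[:, s:e] for s, e in regions]
    emotions = [ecm(f) for f in region_feats]  # ECM: emotion class for each predicted area
    risk = dpm(region_feats, features)         # DPM: fused features -> risk degree D
    return emotions, risk
```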
The following describes the emotion detection device for voice data provided by the present invention, and the emotion detection device for voice data described below and the emotion detection method for voice data described above can be referred to each other.
Based on any of the above embodiments, the present invention provides a speech data emotion detection apparatus, as shown in fig. 3, the apparatus includes:
a determining unit 310, configured to determine voice data to be detected, where the voice data includes at least one complete sentence;
the detection unit 320 is configured to input the voice data to an emotion detection model, and obtain an emotion detection result output by the emotion detection model;
the emotion detection model is obtained by training based on sample voice data containing at least one complete sentence and a corresponding sample emotion detection result; the emotion detection model is used for dividing the voice features into a plurality of candidate emotion area features after the voice features of the voice data are extracted, determining target emotion area features from the candidate emotion area features based on non-maximum suppression, and performing emotion classification on the target emotion area features to obtain emotion detection results; the voice data corresponding to each candidate emotional area feature is one or more complete sentences.
Based on any of the above embodiments, the detecting unit 320 includes:
the feature extraction unit is used for inputting the voice data to a feature extraction layer of the emotion detection model to obtain voice features of the voice data output by the feature extraction layer;
a candidate region detection unit, configured to input the speech feature to a candidate region detection layer of the emotion detection model, so as to obtain the multiple candidate emotional region features output by the candidate region detection layer;
a target area detection unit, configured to input the multiple candidate emotional area features into a target area detection layer of the emotion detection model, and perform non-maximum suppression processing on the multiple candidate emotional area features by the target area detection layer to obtain the target emotional area features output by the target area detection layer;
and the emotion classification unit is used for inputting the target emotion area characteristics to an emotion classification layer of the emotion detection model to obtain the emotion detection result output by the emotion classification layer.
Based on any one of the above embodiments, the feature extraction unit includes:
the spectrogram conversion unit is used for inputting the voice data into a spectrogram conversion layer of the feature extraction layer to obtain a spectrogram corresponding to the voice data output by the spectrogram conversion layer;
the up-sampling unit is used for inputting the spectrogram into an up-sampling layer of the feature extraction layer, and up-sampling the spectrogram by the up-sampling layer to obtain high-dimensional features output by the up-sampling layer;
and the context fusion unit is used for inputting the high-dimensional features into a context fusion layer of the feature extraction layer, and performing context information fusion on the high-dimensional features by the context fusion layer to obtain the voice features of the voice data output by the context fusion layer.
Based on any embodiment, the spectrogram converting unit is configured to:
and inputting the voice data into a spectrogram conversion layer of the feature extraction layer, and sequentially performing framing processing, windowing processing and Fourier transform on the voice data by the spectrogram conversion layer to obtain a spectrogram corresponding to the voice data output by the spectrogram conversion layer.
Based on any of the above embodiments, the candidate region detection unit includes:
the emotion area prediction unit is used for inputting the voice features into an emotion area prediction layer of the candidate area detection layer to obtain a plurality of initial candidate emotion area features output by the emotion area prediction layer;
and the endpoint detection unit is used for outputting each initial candidate emotional area feature to an endpoint detection layer of the candidate area detection layer, and the endpoint detection layer adjusts the starting endpoint and/or the ending endpoint of each initial candidate emotional area feature to obtain the candidate emotional area features output by the endpoint detection layer.
Based on any of the above embodiments, the apparatus further comprises a risk degree prediction unit, configured to: after the target emotion area features output by the target area detection layer are obtained, input the target emotion area features and the voice features into a risk degree prediction model to obtain a risk degree prediction result output by the risk degree prediction model;
the risk degree prediction model is obtained by training based on sample emotional area characteristics and the risk degree corresponding to the sample emotional area characteristics.
Based on any embodiment, the risk degree prediction unit includes:
the characteristic fusion unit is used for inputting the target emotional area characteristics and the voice characteristics into a characteristic fusion layer of the risk degree prediction model to obtain fusion characteristics output by the characteristic fusion layer;
and the result prediction unit is used for inputting the fusion characteristics to a result prediction layer of the risk degree prediction model to obtain the risk degree prediction result output by the result prediction layer.
Fig. 4 is a schematic structural diagram of an electronic device provided in the present invention, and as shown in fig. 4, the electronic device may include: a processor (processor)410, a memory (memory)420, a communication Interface (Communications Interface)430 and a communication bus 440, wherein the processor 410, the memory 420 and the communication Interface 430 are configured to communicate with each other via the communication bus 440. Processor 410 may invoke logic instructions in memory 420 to perform a method for emotion detection of speech data, the method comprising: determining voice data to be detected, wherein the voice data comprises at least one complete sentence; inputting the voice data into an emotion detection model to obtain an emotion detection result output by the emotion detection model; the emotion detection model is obtained by training based on sample voice data containing at least one complete sentence and a corresponding sample emotion detection result; the emotion detection model is used for dividing the voice features into a plurality of candidate emotion area features after the voice features of the voice data are extracted, determining target emotion area features from the candidate emotion area features based on non-maximum suppression, and performing emotion classification on the target emotion area features to obtain emotion detection results; the voice data corresponding to each candidate emotional area feature is one or more complete sentences.
Furthermore, the logic instructions in the memory 420 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods of determining speech data to be detected provided by the above methods, the speech data comprising at least one complete sentence; inputting the voice data into an emotion detection model to obtain an emotion detection result output by the emotion detection model; the emotion detection model is obtained by training based on sample voice data containing at least one complete sentence and a corresponding sample emotion detection result; the emotion detection model is used for dividing the voice features into a plurality of candidate emotion area features after the voice features of the voice data are extracted, determining target emotion area features from the candidate emotion area features based on non-maximum suppression, and performing emotion classification on the target emotion area features to obtain emotion detection results; the voice data corresponding to each candidate emotional area feature is one or more complete sentences.
In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program being implemented by a processor to perform the above-mentioned determining voice data to be detected, the voice data including at least one complete sentence; inputting the voice data into an emotion detection model to obtain an emotion detection result output by the emotion detection model; the emotion detection model is obtained by training based on sample voice data containing at least one complete sentence and a corresponding sample emotion detection result; the emotion detection model is used for dividing the voice features into a plurality of candidate emotion area features after the voice features of the voice data are extracted, determining target emotion area features from the candidate emotion area features based on non-maximum suppression, and performing emotion classification on the target emotion area features to obtain emotion detection results; the voice data corresponding to each candidate emotional area feature is one or more complete sentences.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for emotion detection of voice data, comprising:
determining voice data to be detected, wherein the voice data comprises at least one complete sentence;
inputting the voice data into an emotion detection model to obtain an emotion detection result output by the emotion detection model;
the emotion detection model is obtained by training based on sample voice data containing at least one complete sentence and a corresponding sample emotion detection result; the emotion detection model is used for dividing the voice features into a plurality of candidate emotion area features after the voice features of the voice data are extracted, determining target emotion area features from the candidate emotion area features based on non-maximum suppression, and performing emotion classification on the target emotion area features to obtain emotion detection results; the voice data corresponding to each candidate emotional area feature is one or more complete sentences.
2. The method for emotion detection of speech data according to claim 1, wherein said inputting the speech data to an emotion detection model to obtain an emotion detection result outputted by the emotion detection model comprises:
inputting the voice data to a feature extraction layer of the emotion detection model to obtain voice features of the voice data output by the feature extraction layer;
inputting the voice features to a candidate region detection layer of the emotion detection model to obtain a plurality of candidate emotion region features output by the candidate region detection layer;
inputting the candidate emotional area characteristics to a target area detection layer of the emotion detection model, and performing non-maximum suppression processing on the candidate emotional area characteristics by the target area detection layer to obtain the target emotional area characteristics output by the target area detection layer;
and inputting the target emotion region characteristics to an emotion classification layer of the emotion detection model to obtain the emotion detection result output by the emotion classification layer.
3. The method for emotion detection of speech data according to claim 2, wherein said inputting the speech data to a feature extraction layer of the emotion detection model to obtain the speech features of the speech data output by the feature extraction layer comprises:
inputting the voice data into a spectrogram conversion layer of the feature extraction layer to obtain a spectrogram corresponding to the voice data output by the spectrogram conversion layer;
inputting the spectrogram into an up-sampling layer of the feature extraction layer, and performing up-sampling on the spectrogram by the up-sampling layer to obtain high-dimensional features output by the up-sampling layer;
and inputting the high-dimensional features into a context fusion layer of the feature extraction layer, and performing context information fusion on the high-dimensional features by the context fusion layer to obtain the voice features of the voice data output by the context fusion layer.
4. The method for emotion detection of voice data according to claim 3, wherein the inputting the voice data to a spectrogram conversion layer of the feature extraction layer to obtain a spectrogram corresponding to the voice data output by the spectrogram conversion layer comprises:
and inputting the voice data into a spectrogram conversion layer of the feature extraction layer, and sequentially performing framing processing, windowing processing and Fourier transform on the voice data by the spectrogram conversion layer to obtain a spectrogram corresponding to the voice data output by the spectrogram conversion layer.
5. The method according to any one of claims 2 to 4, wherein the inputting the speech features into a candidate region detection layer of the emotion detection model to obtain the candidate emotion region features output by the candidate region detection layer comprises:
inputting the voice features to an emotion area prediction layer of the candidate area detection layer to obtain a plurality of initial candidate emotion area features output by the emotion area prediction layer;
outputting each initial candidate emotional area feature to an endpoint detection layer of the candidate area detection layer, and adjusting an initial endpoint and/or a termination endpoint of each initial candidate emotional area feature by the endpoint detection layer to obtain the candidate emotional area features output by the endpoint detection layer.
6. The emotion detection method of voice data according to any of claims 2 to 4, wherein the obtaining of the target emotion region feature output by the target region detection layer further comprises:
inputting the target emotional area characteristics and the voice characteristics to a risk degree prediction model to obtain a risk degree prediction result output by the risk degree prediction model;
the risk degree prediction model is obtained by training based on sample emotional area characteristics and the risk degree corresponding to the sample emotional area characteristics.
7. The emotion detection method of voice data according to claim 6, wherein the step of inputting the features of the target emotion area to a risk degree prediction model to obtain a risk degree prediction result output by the risk degree prediction model includes:
inputting the target emotional area characteristics and the voice characteristics to a characteristic fusion layer of the risk degree prediction model to obtain fusion characteristics output by the characteristic fusion layer;
and inputting the fusion characteristics to a result prediction layer of the risk degree prediction model to obtain the risk degree prediction result output by the result prediction layer.
8. An emotion detection device for speech data, comprising:
the determining unit is used for determining voice data to be detected, and the voice data comprises at least one complete statement;
the detection unit is used for inputting the voice data into an emotion detection model to obtain an emotion detection result output by the emotion detection model;
the emotion detection model is obtained by training based on sample voice data containing at least one complete sentence and a corresponding sample emotion detection result; the emotion detection model is used for dividing the voice features into a plurality of candidate emotion area features after the voice features of the voice data are extracted, determining target emotion area features from the candidate emotion area features based on non-maximum suppression, and performing emotion classification on the target emotion area features to obtain emotion detection results; the voice data corresponding to each candidate emotional area feature is one or more complete sentences.
9. An electronic device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the emotion detection method for voice data according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the emotion detection method for voice data according to any one of claims 1 to 7.
CN202110791641.3A 2021-07-13 2021-07-13 Voice data emotion detection method and device, electronic equipment and storage medium Active CN113689886B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110791641.3A CN113689886B (en) 2021-07-13 2021-07-13 Voice data emotion detection method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113689886A (en) 2021-11-23
CN113689886B (en) 2023-05-30

Family

ID=78577218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110791641.3A Active CN113689886B (en) 2021-07-13 2021-07-13 Voice data emotion detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113689886B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118098286A (en) * 2024-02-29 2024-05-28 盐城工学院 Long-voice emotion intelligent recognition method, system and computer readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810994A (en) * 2013-09-05 2014-05-21 江苏大学 Method and system for voice emotion inference on basis of emotion context
CN110399484A (en) * 2019-06-25 2019-11-01 平安科技(深圳)有限公司 Sentiment analysis method, apparatus, computer equipment and the storage medium of long text
CN111370030A (en) * 2020-04-03 2020-07-03 龙马智芯(珠海横琴)科技有限公司 Voice emotion detection method and device, storage medium and electronic equipment
CN111524534A (en) * 2020-03-20 2020-08-11 北京捷通华声科技股份有限公司 Voice analysis method, system, device and storage medium
WO2020199590A1 (en) * 2019-04-03 2020-10-08 平安科技(深圳)有限公司 Mood detection analysis method and related device
WO2020216064A1 (en) * 2019-04-24 2020-10-29 京东方科技集团股份有限公司 Speech emotion recognition method, semantic recognition method, question-answering method, computer device and computer-readable storage medium

Also Published As

Publication number Publication date
CN113689886B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
CN112804400B (en) Customer service call voice quality inspection method and device, electronic equipment and storage medium
CN111128223B (en) Text information-based auxiliary speaker separation method and related device
US10412223B2 (en) Personalized support routing based on paralinguistic information
CN109256150B (en) Speech emotion recognition system and method based on machine learning
Mariooryad et al. Compensating for speaker or lexical variabilities in speech for emotion recognition
JP5602653B2 (en) Information processing apparatus, information processing method, information processing system, and program
JP5506738B2 (en) Angry emotion estimation device, anger emotion estimation method and program thereof
CN111968679A (en) Emotion recognition method and device, electronic equipment and storage medium
CN113488024B (en) Telephone interrupt recognition method and system based on semantic recognition
KR102100214B1 (en) Method and appratus for analysing sales conversation based on voice recognition
CN112735385B (en) Voice endpoint detection method, device, computer equipment and storage medium
US11270691B2 (en) Voice interaction system, its processing method, and program therefor
CN112233680A (en) Speaker role identification method and device, electronic equipment and storage medium
US20190371305A1 (en) Voice interaction system, its processing method, and program therefor
Kopparapu Non-linguistic analysis of call center conversations
KR20210071713A (en) Speech Skill Feedback System
US10522135B2 (en) System and method for segmenting audio files for transcription
CN113689886B (en) Voice data emotion detection method and device, electronic equipment and storage medium
JP6327252B2 (en) Analysis object determination apparatus and analysis object determination method
CN114694688A (en) Speech analyzer and related methods
CN113763992A (en) Voice evaluation method and device, computer equipment and storage medium
KR101890704B1 (en) Simple message output device using speech recognition and language modeling and Method
Singh et al. Automatic articulation error detection tool for Punjabi language with aid for hearing impaired people
Mehra et al. ERIL: An Algorithm for Emotion Recognition from Indian Languages Using Machine Learning
Girirajan et al. Hybrid Feature Extraction Technique for Tamil Automatic Speech Recognition System in Noisy Environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant