CN107221344A - A speech emotion transfer method - Google Patents

A speech emotion transfer method

Info

Publication number
CN107221344A
CN107221344A (application CN201710222674.XA)
Authority
CN
China
Prior art keywords
speech
emotional
target
feature
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710222674.XA
Other languages
Chinese (zh)
Inventor
李华康
杜阳阳
金旭
胡晓东
丘添元
张笑源
孙国梓
李涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201710222674.XA
Publication of CN107221344A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Child & Adolescent Psychology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a speech emotion transfer method. A speech emotion data set is first generated from a speech database and labelled; audio features are then extracted from each audio file with a speech feature parameter model to obtain speech feature sets. Next, a machine learning tool is applied to the speech feature sets and the speech emotion labels to build an emotion model library. The target emotion to transfer to is selected, a speech signal is input from a multimedia terminal, the feature set of the current speech signal is obtained, and the current emotion category is determined by emotion classification. If it matches the selected target, the original input speech is output directly as the target emotional speech; otherwise feature-level emotion transfer is performed. Finally, speech synthesis produces the target emotional speech output. The proposed method, based on emotion classification and feature transfer, can change the emotion of speech without losing the vocal characteristics of the original speaker.

Description

A speech emotion transfer method
Technical field
The invention belongs to the technical field of speech recognition and relates to the transfer of speech emotion, and in particular to a speech emotion transfer method that is not based on the models of different speech vendors.
Background technology
With the development of intelligent chip technology, terminal devices have become increasingly intelligent and integrated, and their miniaturization, light weight and networking have made daily life ever more convenient. Users constantly exchange voice and video over networked terminals, accumulating massive amounts of multimedia data. With this accumulation of platform data, intelligent question-answering systems have emerged. Such systems involve cutting-edge technologies including speech recognition, sentiment analysis, information retrieval, semantic matching, sentence generation and speech synthesis.
Speech recognition technology lets a machine convert a speech signal into corresponding text or machine instructions through recognition and understanding, so that the machine can understand what a person expresses. It mainly involves speech-unit selection, speech feature extraction, pattern matching and model training. Speech units include words (sentences), syllables and phonemes, chosen according to the scenario and task: word (or sentence) units are mainly suited to small-vocabulary recognition systems; syllable units are better suited to Chinese speech recognition; and although phonemes describe the basis of speech well, speaker variability makes stable data sets hard to obtain, so they remain under study.
Another research direction is speech emotion recognition, which mainly consists of speech signal acquisition, emotion feature extraction and emotion recognition. Emotion features fall into three groups: prosodic features, spectrum-based features and voice-quality features. They are typically extracted with the frame as the minimum granularity, and emotion recognition is generally performed on global statistics of these features. Recognition algorithms fall into two broad classes: discrete speech emotion classifiers and dimensional speech emotion predictors. Speech emotion recognition is already widely used in fields such as telephone call centers, driver mental-state monitoring and online distance education.
Intelligent agents are described as the crystallization of next-generation artificial intelligence. They must not only perceive environmental factors and understand human behavior and language, but also, when communicating with people, understand human emotions and express human-like emotion in order to achieve more natural interaction. Current research on agent emotion concentrates mainly on virtual-image (avatar) processing and draws on results from computer graphics, psychology, cognitive science, neurophysiology and artificial intelligence. Although more than 90% of human perception of the environment comes from vision, the vast majority of emotion perception comes from speech. How to build a human-like emotion system for an agent from speech has so far not been addressed in published research.
Summary of the invention
The purpose of the present invention is to take machine learning as the main means to propose a human-like speech emotion expression method and, on this basis, to use deep learning and convolutional network algorithms to realize the transfer of speech emotion within the system. This not only provides a reference method for speech recognition and sentiment analysis, but can also find wide application in future human-like intelligent agents.
To achieve the above object, the technical scheme proposed by the present invention is a speech emotion transfer method that specifically comprises the following steps:
Step 1: prepare a speech database and generate a speech emotion data set S = {s_1, s_2, ..., s_n} by standard sampling;
Step 2: label the speech database of step 1 manually, marking the emotion E = {e_1, e_2, ..., e_n} of each speech file;
Step 3: use a speech feature parameter model to extract audio features from each audio file s_i in the speech library, obtaining a basic speech feature set F_i = {f_1^i, f_2^i, ..., f_n^i};
Step 4: apply a machine learning tool to each speech feature set from step 3 and the speech emotion labels from step 2, learn a feature model for each emotion class, and build the emotion model library E_b;
Step 5: through a multimedia terminal, select the target emotion Target_e to which the speech should be transferred;
Step 6: input a speech signal s_t from the multimedia terminal;
Step 7: feed the current input s_t to the speech emotion feature extraction module to obtain the feature set F_t = {f_1^t, f_2^t, ..., f_n^t} of the current speech signal;
Step 8: using the same machine learning algorithm as step 4, classify the feature set F_t of s_t against the emotion model library E_b obtained in step 4 to obtain the current emotion category s_e of s_t;
Step 9: judge whether the s_e obtained in step 8 matches the Target_e selected in step 5; if s_e = Target_e, output the original input speech signal directly as the target emotional speech; if s_e ≠ Target_e, call step 10 to perform feature-level emotion transfer;
Step 10: transfer the principal features of the current speech emotion toward the principal features of the target emotion in the emotion model library;
Step 11: process the transferred features from step 10 with a speech synthesis algorithm and synthesize the final target emotional speech output.
Further, in step 1 above, the sampling frequency of the speech data is 44.1 kHz, the recording length is between 3 and 10 s, and the recordings are saved in WAV format.
Also in step 1, to obtain good performance, the natural attributes of the sampled speakers should not be overly concentrated; the samples should, as far as possible, cover speakers of different ages, sexes and occupations.
In step 6, the input can be entered in real time or recorded and then submitted with a click.
The invention has the following advantages:
1. The invention is the first to propose the concept of speech emotion transfer, which can provide an emotion construction method for future virtual reality.
2. The proposed method, based on emotion classification and feature transfer, can change the emotion of speech without losing the vocal characteristics of the original speaker.
Brief description of the drawings
Fig. 1 is a schematic diagram of the speech emotion transfer method provided by the invention.
Fig. 2 is a spectral feature plot of an original input speech sample.
Fig. 3 is a spectral feature plot of the original speech sample after emotion conversion.
Embodiment
The invention is described in further detail below with reference to the accompanying drawings.
The invention provides a user-driven speech emotion transfer method based on a speech emotion database, as shown in Fig. 1. The modules and functions involved in the method include:
Basic speech library: raw speech data covering different ages, sexes and scenes.
Label library: emotion labels assigned to the basic speech library, e.g. calm, happy, annoyed, angry, sad.
Speech input device: e.g. a microphone, enabling real-time voice input by the user.
Speech emotion feature extraction: an acoustic analysis tool obtains general acoustic features, and the feature set required as speech emotion features is selected according to the characteristics of human speech signals and of emotional expression.
Machine learning: a machine learning algorithm is used, together with the speech emotion label library, to build and train models on the speech emotion feature sets.
Emotion model library: the speech emotion models obtained by machine learning from the speech library, organized along dimensions such as sex, age and emotion.
Emotion selection: before inputting a speech signal, the user selects the emotion model into which the current speech should be converted.
Emotion category judgment: judge whether the emotion of the current user input matches the selected emotion; if it does, the target emotional speech is output directly, and if not, the emotion transfer module is called.
Emotion transfer: when the input speech and the selected emotion differ, the feature distances between the input speech emotion feature set and the selected emotion feature set are compared, the feature-space representation of the input speech emotion is adjusted to realize the transfer, and the adjusted emotional speech is output as the target emotional speech.
An embodiment is now given to illustrate the speech emotion transfer process. It comprises the following steps:
Step 1: prepare a speech database. Preferably the speech data are sampled at the standard 44.1 kHz, each recording is a single sentence between 3 and 10 s long saved in WAV format, and recordings from several testers yield the speech emotion data set S = {s_1, s_2, ..., s_n}. To obtain good performance, the natural attributes of the sampled speakers, such as age, sex and occupation, should not be overly concentrated.
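By way of illustration only (the patent does not prescribe any toolkit), the data preparation of step 1 could be sketched in Python with librosa; the speech_db folder name and file layout are hypothetical assumptions:

```python
# A minimal sketch: load every WAV file at 44.1 kHz and keep only clips whose
# duration falls in the 3-10 s range described in step 1.
import glob
import librosa

SAMPLE_RATE = 44100

def load_speech_dataset(folder="speech_db"):
    dataset = []
    for path in sorted(glob.glob(f"{folder}/*.wav")):
        signal, sr = librosa.load(path, sr=SAMPLE_RATE, mono=True)
        duration = len(signal) / sr
        if 3.0 <= duration <= 10.0:          # keep clips in the 3-10 s range
            dataset.append((path, signal))
    return dataset
```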
Step 2: label the speech database prepared in step 1 manually, marking the emotion E = {e_1, e_2, ..., e_n} of each speech file, e.g. "worried", "surprised", "angry", "disappointed", "sad" and so on.
Step 3: use a speech feature parameter model to extract audio features from each audio file s_i in the speech library, obtaining a basic speech feature set F_i = {f_1^i, f_2^i, ..., f_n^i} (Fig. 2 shows the spectral features of an original speech sample). Typical features include the envelope (env), speech rate (speed), zero-crossing rate (zcr), energy (eng), energy entropy (eoe), spectral centroid (spec_cent), spectral spread (spec_spr), mel-frequency cepstral coefficients (mfccs) and the chroma vector (chroma).
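A minimal sketch of the feature extraction of step 3, assuming librosa as the analysis tool (the patent does not name one); spectral bandwidth stands in for the spectral spread feature, and the envelope, speech-rate and energy-entropy features are omitted for brevity:

```python
import numpy as np
import librosa

def extract_features(signal, sr=44100):
    zcr      = librosa.feature.zero_crossing_rate(signal)          # zero-crossing rate
    energy   = librosa.feature.rms(y=signal)                       # frame energy
    centroid = librosa.feature.spectral_centroid(y=signal, sr=sr)  # spectral centroid
    spread   = librosa.feature.spectral_bandwidth(y=signal, sr=sr) # ~ spectral spread
    mfccs    = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)    # MFCCs
    chroma   = librosa.feature.chroma_stft(y=signal, sr=sr)        # chroma vector
    # Summarize each frame-level feature by its mean over the utterance,
    # matching the global-statistics strategy mentioned in the background.
    frames = np.vstack([zcr, energy, centroid, spread, mfccs, chroma])
    return frames.mean(axis=1)
```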
Step 4: apply a machine learning tool (such as LIBSVM) to the feature set of each speech file from step 3 and the speech emotion labels from step 2, learn a feature model for each emotion class, and build the emotion model library E_b.
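A minimal sketch of step 4. The patent mentions LIBSVM as an example tool; scikit-learn's SVC wraps libsvm, so it is used here. Representing the emotion model library E_b as one mean feature vector per emotion class is an assumption of this sketch, not a requirement of the patent; load_speech_dataset and extract_features are the hypothetical helpers sketched above:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def train_emotion_classifier(dataset, labels):
    # dataset: list of (path, signal) pairs; labels: one emotion string per file.
    X = np.array([extract_features(sig) for _, sig in dataset])
    y = np.array(labels)
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
    clf.fit(X, y)
    return clf

def build_emotion_model_library(dataset, labels):
    # E_b sketched as the mean feature vector of each emotion class,
    # which the step 10 sketch later uses as a transfer target.
    X = np.array([extract_features(sig) for _, sig in dataset])
    y = np.array(labels)
    return {emotion: X[y == emotion].mean(axis=0) for emotion in set(labels)}
```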
Step 5: through a multimedia terminal, select the target emotion Target_e to which the speech should be transferred, e.g. "sad".
Step 6: input a speech signal s_t from the multimedia terminal, either in real time or by recording it and then clicking to submit.
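One possible way to capture the real-time input of step 6, sketched with the sounddevice package; the patent does not prescribe a capture library, and the recording duration is an illustrative assumption:

```python
import sounddevice as sd

SAMPLE_RATE = 44100

def record_input(duration_s=5.0):
    # Record duration_s seconds of mono audio from the default microphone.
    frames = sd.rec(int(duration_s * SAMPLE_RATE),
                    samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()                    # block until recording has finished
    return frames.squeeze()      # 1-D signal s_t, ready for feature extraction
```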
Step 7: feed the current input s_t to the speech emotion feature extraction module to obtain the feature set F_t = {f_1^t, f_2^t, ..., f_n^t} of the current speech signal.
Step 8: using the same machine learning algorithm as step 4, classify the feature set F_t of s_t against the emotion model library E_b obtained in step 4 to obtain the current emotion category s_e of s_t.
Step 9: judge whether the s_e obtained in step 8 matches the Target_e selected in step 5. If s_e = Target_e, output the original input speech signal directly as the target emotional speech. If s_e ≠ Target_e, call step 10 to perform feature-level emotion transfer.
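A sketch of the decision logic of steps 8 and 9, reusing the hypothetical helpers above; transfer_and_synthesize is the step 10-11 sketch given further below:

```python
# Classify the input, then either pass it through unchanged (step 9, match)
# or hand it to the transfer/synthesis stage (steps 10-11, mismatch).
def classify_and_decide(signal, target_emotion, clf, emotion_library):
    features = extract_features(signal)               # step 7: feature set F_t
    current = clf.predict([features])[0]              # step 8: emotion class s_e
    if current == target_emotion:                     # step 9: already the target
        return signal                                 # output the input as-is
    target_profile = emotion_library[target_emotion]
    return transfer_and_synthesize(signal, features, target_profile)
```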
Step 10: transfer the principal features of the current speech emotion toward the principal features of the target emotion in the emotion model library (Fig. 3 shows the spectral features after transfer), e.g. envelope result_env = (s_env + Target_env) / 2 and speech-rate result_speed = (s_speed + Target_speed) / 2.
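The averaging rule of step 10, applied here to a whole feature vector rather than only the envelope and speech rate, can be sketched as follows; treating every feature this way is an assumption of the sketch:

```python
import numpy as np

# Each transferred feature is the mean of the input value and the value stored
# for the target emotion in the emotion model library, the same rule the text
# gives for the envelope and the speech rate.
def transfer_features(input_features, target_profile):
    input_features = np.asarray(input_features, dtype=float)
    target_profile = np.asarray(target_profile, dtype=float)
    return (input_features + target_profile) / 2.0    # result = (s + Target) / 2
```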
Step 11: process the transferred features from step 10 with a speech synthesis algorithm (pitch-synchronous overlap-add, PSOLA) and synthesize the final target emotional speech output.
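The patent names PSOLA for resynthesis. librosa ships no PSOLA implementation, so the following sketch only approximates the effect by shifting pitch and stretching time toward the target with librosa's effects module; the semitone and rate values are illustrative assumptions, and a full system would derive them from the transferred feature vector:

```python
import librosa

def transfer_and_synthesize(signal, features, target_profile, sr=44100,
                            pitch_semitones=-2.0, rate=0.9):
    # Step 10: where the transferred feature vector would be computed.
    shifted = transfer_features(features, target_profile)
    # Step 11 stand-in: lower the pitch and slow the speech rate slightly,
    # instead of a true PSOLA resynthesis driven by `shifted`.
    y = librosa.effects.pitch_shift(signal, sr=sr, n_steps=pitch_semitones)
    y = librosa.effects.time_stretch(y, rate=rate)
    return y
```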
The above describes only preferred embodiments of the invention and is not intended to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in those embodiments or replace some of their technical features with equivalents. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (4)

1. A speech emotion transfer method, characterised in that it comprises the following steps:
Step 1: prepare a speech database and generate a speech emotion data set S = {s_1, s_2, ..., s_n} by standard sampling;
Step 2: label the speech database of step 1 manually, marking the emotion E = {e_1, e_2, ..., e_n} of each speech file;
Step 3: use a speech feature parameter model to extract audio features from each audio file s_i in the speech library, obtaining a basic speech feature set F_i = {f_1^i, f_2^i, ..., f_n^i};
Step 4: apply a machine learning tool to each speech feature set from step 3 and the speech emotion labels from step 2, learn a feature model for each emotion class, and build the emotion model library E_b;
Step 5: through a multimedia terminal, select the target emotion Target_e to which the speech should be transferred;
Step 6: input a speech signal s_t from the multimedia terminal;
Step 7: feed the current input s_t to the speech emotion feature extraction module to obtain the feature set F_t = {f_1^t, f_2^t, ..., f_n^t} of the current speech signal;
Step 8: using the same machine learning algorithm as step 4, classify the feature set F_t of s_t against the emotion model library E_b obtained in step 4 to obtain the current emotion category s_e of s_t;
Step 9: judge whether the s_e obtained in step 8 matches the Target_e selected in step 5; if s_e = Target_e, output the original input speech signal directly as the target emotional speech; if s_e ≠ Target_e, call step 10 to perform feature-level emotion transfer;
Step 10: transfer the principal features of the current speech emotion toward the principal features of the target emotion in the emotion model library;
Step 11: process the transferred features from step 10 with a speech synthesis algorithm and synthesize the final target emotional speech output.
2. The speech emotion transfer method according to claim 1, characterised in that in step 1 the sampling frequency of the speech data is 44.1 kHz, the recording length is between 3 and 10 s, and the recordings are saved in WAV format.
3. The speech emotion transfer method according to claim 1, characterised in that in step 1, to obtain good performance, the natural attributes of the sampled speakers are not overly concentrated.
4. The speech emotion transfer method according to claim 1, characterised in that the input in step 6 can be entered in real time or recorded and then submitted with a click.
CN201710222674.XA 2017-04-07 2017-04-07 A speech emotion transfer method Pending CN107221344A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710222674.XA CN107221344A (en) 2017-04-07 2017-04-07 A speech emotion transfer method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710222674.XA CN107221344A (en) 2017-04-07 2017-04-07 A speech emotion transfer method

Publications (1)

Publication Number Publication Date
CN107221344A 2017-09-29

Family

ID=59928228

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710222674.XA Pending CN107221344A (en) A speech emotion transfer method

Country Status (1)

Country Link
CN (1) CN107221344A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1787074A (en) * 2005-12-13 2006-06-14 浙江大学 Method for distinguishing speak person based on feeling shifting rule and voice correction
CN101064104A (en) * 2006-04-24 2007-10-31 中国科学院自动化研究所 Emotion voice creating method based on voice conversion
CN101261832A (en) * 2008-04-21 2008-09-10 北京航空航天大学 Extraction and modeling method for Chinese speech sensibility information
CN102184731A (en) * 2011-05-12 2011-09-14 北京航空航天大学 Method for converting emotional speech by combining rhythm parameters with tone parameters
CN103198827A (en) * 2013-03-26 2013-07-10 合肥工业大学 Voice emotion correction method based on relevance of prosodic feature parameter and emotion parameter
CN103544963A (en) * 2013-11-07 2014-01-29 东南大学 Voice emotion recognition method based on core semi-supervised discrimination and analysis

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019218773A1 (en) * 2018-05-15 2019-11-21 中兴通讯股份有限公司 Voice synthesis method and device, storage medium, and electronic device
CN112786026A (en) * 2019-12-31 2021-05-11 深圳市木愚科技有限公司 Parent-child story personalized audio generation system and method based on voice migration learning
CN112786026B (en) * 2019-12-31 2024-05-07 深圳市木愚科技有限公司 Parent-child story personalized audio generation system and method based on voice transfer learning
CN111951778A (en) * 2020-07-15 2020-11-17 天津大学 Method for synthesizing emotion voice by using transfer learning under low resource
CN111951778B (en) * 2020-07-15 2023-10-17 天津大学 Method for emotion voice synthesis by utilizing transfer learning under low resource
CN113421544A (en) * 2021-06-30 2021-09-21 平安科技(深圳)有限公司 Singing voice synthesis method and device, computer equipment and storage medium
CN113421544B (en) * 2021-06-30 2024-05-10 平安科技(深圳)有限公司 Singing voice synthesizing method, singing voice synthesizing device, computer equipment and storage medium
CN113555004A (en) * 2021-07-15 2021-10-26 复旦大学 Voice depression state identification method based on feature selection and transfer learning
CN114495988A (en) * 2021-08-31 2022-05-13 荣耀终端有限公司 Emotion processing method of input information and electronic equipment
CN116955572A (en) * 2023-09-06 2023-10-27 宁波尚煦智能科技有限公司 Online service feedback interaction method based on artificial intelligence and big data system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2017-09-29)