CN105654942A - Speech synthesis method of interrogative sentence and exclamatory sentence based on statistical parameter - Google Patents


Info

Publication number
CN105654942A
CN105654942A (application CN201610000676.XA)
Authority
CN
China
Prior art keywords
sentence
acoustic model
neural network
deep neural
interrogative
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610000676.XA
Other languages
Chinese (zh)
Inventor
徐明星 (Xu Mingxing)
车浩 (Che Hao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Times Ruilang Technology Co Ltd
Original Assignee
Beijing Times Ruilang Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Times Ruilang Technology Co Ltd
Priority to CN201610000676.XA
Publication of CN105654942A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10: Prosody rules derived from text; Stress or intonation
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a speech synthesis method for interrogative and exclamatory sentences based on statistical parameters. The method comprises three parts: training a model on declarative sentences to obtain an initial declarative acoustic model; adaptively training on interrogative or exclamatory sentences to obtain an interrogative or exclamatory acoustic model; and generating interrogative or exclamatory speech from that acoustic model. The invention provides a way to quickly realize interrogative or exclamatory speech synthesis from a small-scale corpus, addressing the fact that interrogative and exclamatory material is much harder to collect than declarative material, and thereby obtains synthetic speech of comparatively high quality and naturalness despite the small corpus size.

Description

Statistical-parameter-based speech synthesis method for interrogative and exclamatory sentences
Technical field
The present invention relates to a speech synthesis method, and in particular to a statistical-parameter-based speech synthesis method for interrogative and exclamatory sentences.
Background art
Sentence-mood synthesis is an important aspect of research into expressive speech synthesis. Sentences can broadly be divided into declarative, interrogative, imperative and exclamatory types. Most current speech synthesis systems are designed for declarative sentences, and when other sentence moods are synthesized the intended mood is barely expressed. If speech synthesis systems could make a substantial breakthrough in mood synthesis, the expressiveness of synthetic speech would improve further and human-computer interaction would become more harmonious and natural.
Interrogative and exclamatory sentences are common phenomena in natural spoken language. For their synthesis, one existing method analyzes the prosodic features of interrogative sentences carrying emotion labels and then, under a waveform-concatenation synthesis framework, builds a new prosody template library and a new objective cost function to realize interrogative and exclamatory synthesis. This approach has several shortcomings. First, it requires emotion-annotated text as the basis for the prosodic analysis of interrogative sentences. Second, it has to attribute intonation changes to a few positions before and after key syllables, which does not generalize. Finally, because the system synthesizes mood by waveform concatenation, it retains the drawbacks of that method. There is also work that, within the statistical parametric speech synthesis framework, trains on a certain amount of interrogative material to realize interrogative generation. That approach needs no prosodic analysis of interrogative sentences: it learns the prosody of the interrogative mood through machine learning and thereby realizes mood synthesis, making the method more general. However, it requires a fairly large interrogative corpus, and large amounts of interrogative material are relatively hard to obtain. How to quickly build an interrogative synthesis system from a small amount of interrogative material is therefore an urgent problem for the industry.
Summary of the invention
To remedy the shortcomings of the above techniques, the invention provides a statistical-parameter-based speech synthesis method for interrogative and exclamatory sentences.
To solve the above technical problem, the technical solution adopted by the invention is a statistical-parameter-based speech synthesis method for interrogative and exclamatory sentences, divided into the following three parts:
Part I: train a model on declarative sentences to obtain an initial declarative acoustic model;
A large-scale corpus of recorded declarative sentences is obtained as training data, and an acoustic model based on a hidden semi-Markov model (HSMM) or on a deep neural network is trained as the initial acoustic model;
When a multi-space probability distribution hidden semi-Markov model (MSD-HSMM) is used for the initial acoustic model, excitation parameters and spectral parameters are first extracted from the text annotation and the speech signal respectively, the fundamental frequency and spectral parameters are fused into a single vector, and first- and second-order dynamic parameters are appended as the input for MSD-HSMM training, finally yielding the MSD-HSMM-based initial acoustic model of declarative sentences;
Alternatively, a deep neural network is used for the initial acoustic model: using multi-task learning, the deep neural network performs the mapping from text to acoustic parameters, yielding the DNN-based initial acoustic model;
Part II: adaptively train on interrogative or exclamatory sentences to obtain the interrogative or exclamatory acoustic model;
A small-scale corpus of recorded interrogative or exclamatory sentences is obtained as adaptation data; parameters are extracted from the text annotation and the speech signal, and adaptive training is then carried out on the basis of the MSD-HSMM-based initial acoustic model obtained in Part I, yielding the MSD-HSMM-based acoustic model of interrogative or exclamatory sentences;
Alternatively, adaptive training is carried out on the basis of the DNN-based initial acoustic model obtained in Part I: the deep neural network model is adjusted within the multi-task learning framework, yielding the DNN-based acoustic model of interrogative or exclamatory sentences;
Part III: generate interrogative or exclamatory speech from the corresponding acoustic model;
For the text to be synthesized, text analysis is carried out; speech parameters are generated with the MSD-HSMM-based or DNN-based interrogative or exclamatory acoustic model obtained in Part II; and the parameters are passed through a vocoder, finally synthesizing the interrogative or exclamatory speech.
In the DNN-based initial acoustic model, context-dependent text features serve as the input of the deep neural network and acoustic parameters as its output. The context-dependent text features include phoneme identity, syllable position and phrase position; the acoustic parameters include spectrum, fundamental frequency and the voiced/unvoiced decision. In the multi-task learning deep neural network, the voiced/unvoiced decision is the secondary learning task: at the output of the network, one neuron is coupled to the softmax layer of a softmax regression model, outputting the voiced/unvoiced decision, and a linear transformation layer outputs the speech parameters; the two layers are stacked in parallel on the pre-trained hidden layers.
The MSD-HSMM-based acoustic model is trained with an adaptation method blending constrained maximum likelihood linear regression (CMLLR) with structural maximum a posteriori (SMAP) estimation: CMLLR is first used to make a large-scale adjustment of all the model parameters of the initial MSD-HSMM-based acoustic model, and SMAP is then used to adaptively train the parameters of the models that occur in the adaptation data.
The invention provides a method that, under small-corpus conditions, uses a small-scale corpus to quickly realize interrogative or exclamatory speech synthesis. It addresses the fact that interrogative and exclamatory material is harder to collect than declarative material, and obtains synthetic speech of comparatively high quality and naturalness despite the relatively small corpus.
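For orientation, the three-part flow can be summarized as a skeleton in code. This is a minimal, hypothetical sketch; every function name is an illustrative placeholder, not an interface defined by the patent:

```python
def train_initial_model(declarative_corpus):
    """Part I: train the declarative initial acoustic model
    (MSD-HSMM, or multi-task DNN) on a large declarative corpus."""
    raise NotImplementedError  # stands in for the training of Fig. 2 / Fig. 3

def adapt_model(initial_model, small_corpus):
    """Part II: adapt to interrogative/exclamatory speech
    (CMLLR + SMAP for the MSD-HSMM; retraining for the DNN)."""
    raise NotImplementedError

def synthesize(adapted_model, text):
    """Part III: text analysis -> parameter generation -> vocoder."""
    raise NotImplementedError
```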
Brief description of the drawings
The invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is the flow diagram of the invention.
Fig. 2 is the training flow diagram of the MSD-HSMM in the speech synthesis method of Fig. 1.
Fig. 3 is the overall deep neural network learning diagram in the speech synthesis method of Fig. 1.
Fig. 4 is the speech synthesis diagram of the MSD-HSMM-based acoustic model in the speech synthesis method of Fig. 1.
Detailed description of the invention
As shown in Fig. 1, the method of the invention is divided into the following three parts:
Part I: train a model on declarative sentences to obtain an initial declarative acoustic model;
A large-scale corpus of recorded declarative sentences is obtained as training data, and an acoustic model based on a hidden Markov model (HMM) or on a deep neural network (DNN) is trained as the initial acoustic model;
When a multi-space probability distribution hidden semi-Markov model (MSD-HSMM) is used for the initial acoustic model, excitation parameters and spectral parameters are first extracted from the text annotation and the speech signal respectively, the fundamental frequency and spectral parameters are fused into a single vector, and first- and second-order dynamic parameters are appended as the input for MSD-HSMM training, finally yielding the MSD-HSMM-based initial acoustic model of declarative sentences;
Alternatively, a deep neural network is used for the initial acoustic model: using multi-task learning, the deep neural network performs the mapping from text to acoustic parameters, yielding the DNN-based initial acoustic model;
Part II: adaptively train on interrogative or exclamatory sentences to obtain the interrogative or exclamatory acoustic model;
A small-scale corpus of recorded interrogative or exclamatory sentences is obtained as adaptation data; parameters are extracted from the text annotation and the speech signal, and adaptive training is then carried out on the basis of the MSD-HSMM-based initial acoustic model obtained in Part I, yielding the MSD-HSMM-based acoustic model of interrogative or exclamatory sentences;
Alternatively, adaptive training is carried out on the basis of the DNN-based initial acoustic model obtained in Part I: the deep neural network model is adjusted within the multi-task learning framework, yielding the DNN-based acoustic model of interrogative or exclamatory sentences;
Part III: generate interrogative or exclamatory speech from the corresponding acoustic model;
For the text to be synthesized, text analysis is carried out; speech parameters are generated with the MSD-HSMM-based or DNN-based interrogative or exclamatory acoustic model obtained in Part II; and the parameters are passed through a vocoder, finally synthesizing the interrogative or exclamatory speech.
In the DNN-based initial acoustic model, context-dependent text features serve as the input of the deep neural network and acoustic parameters as its output;
The context-dependent text features include phoneme identity, syllable position and phrase position; the acoustic parameters include spectrum, fundamental frequency and the voiced/unvoiced decision;
In the multi-task learning deep neural network, the voiced/unvoiced decision is the secondary learning task: at the output of the network, one neuron is coupled to the softmax layer of a softmax regression model, outputting the voiced/unvoiced decision, and a linear (affine) transformation layer outputs the speech parameters; the two layers are stacked in parallel on the pre-trained hidden layers.
The MSD-HSMM-based acoustic model is trained with an adaptation method blending constrained maximum likelihood linear regression (CMLLR) with structural maximum a posteriori (SMAP) estimation: CMLLR is first used to make a large-scale adjustment of all the model parameters of the initial MSD-HSMM-based acoustic model, and SMAP is then used to adaptively train the parameters of the models that occur in the adaptation data.
In the Part I declarative model training, the invention takes a large-scale corpus of recorded declarative sentences as training data, extracts excitation and spectral parameters, and performs MSD-HSMM training; the resulting model serves as the initial acoustic model for adaptive training. Note that, across experiments with different adaptation corpora, this part needs to be trained only once: different experiments simply take the initial acoustic model obtained here as their input model.
In the Part II adaptive training on interrogative or exclamatory sentences, a small-scale corpus of recorded interrogative or exclamatory sentences is taken as adaptation data; an adaptive training algorithm adapts the declarative initial acoustic model (MSD-HSMM-based or DNN-based) obtained in the previous step to the interrogative or exclamatory mood, yielding the MSD-HSMM-based or DNN-based acoustic model of interrogative or exclamatory sentences.
Part III is the speech generation stage. The input text is converted by text analysis into a context-dependent phoneme label sequence. From this sequence a sentence-level MSD-HSMM sequence is obtained; a parameter generation algorithm based on the maximum likelihood criterion produces the fundamental frequency, spectrum and duration parameters of each phoneme, and the speech parameters are fed into a vocoder to obtain the synthetic interrogative or exclamatory speech, as sketched below.
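The maximum-likelihood parameter generation step can be sketched for a single feature dimension. This is a minimal illustration under assumptions the patent does not spell out (diagonal covariances; the common delta windows [-0.5, 0, 0.5] and [1, -2, 1]); it solves the standard normal equations W'U^{-1}Wc = W'U^{-1}mu for the static trajectory c:

```python
import numpy as np

def mlpg(means, variances):
    """Maximum-likelihood parameter generation, one feature dimension.
    means, variances: (T, 3) per-frame Gaussian statistics for the
    static, delta and delta-delta streams, read off the state sequence.
    Returns the (T,) static trajectory that maximizes the likelihood
    subject to the dynamic-feature constraints."""
    T = means.shape[0]
    windows = [np.array([0.0, 1.0, 0.0]),    # static
               np.array([-0.5, 0.0, 0.5]),   # first-order dynamics
               np.array([1.0, -2.0, 1.0])]   # second-order dynamics
    W = np.zeros((3 * T, T))                 # maps statics to full features
    for t in range(T):
        for s, win in enumerate(windows):
            for tau in (-1, 0, 1):
                if 0 <= t + tau < T:
                    W[3 * t + s, t + tau] = win[tau + 1]
    mu = means.reshape(-1)                   # stacked frame-wise means
    prec = 1.0 / variances.reshape(-1)       # diagonal precisions
    A = W.T @ (prec[:, None] * W)            # W' U^-1 W
    b = W.T @ (prec * mu)                    # W' U^-1 mu
    return np.linalg.solve(A, b)
```

In a full system this is solved per dimension of the fused spectrum/F0 vector, with the multi-space distribution handling of unvoiced F0 frames on top.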
Alternatively, the context-dependent text features of the speech to be synthesized are taken as the DNN input; the DNN output is the speech parameters, which are fed into a vocoder to obtain the synthetic interrogative or exclamatory speech.
Embodiments of each part of the method are described in detail below: embodiment one models with the MSD-HSMM, and embodiment two models with a deep neural network.
Embodiment one
Step 1: obtain a large-scale declarative corpus of nearly 6,700 recorded sentences. All recordings are single-channel, 16 kHz sample rate, 16-bit WAV files. A 7-state context-dependent first-order MSD-HSMM serves as the initial acoustic model, with the output distribution of each state being a single Gaussian. The feature vector of each speech frame consists of the spectrum, energy and fundamental frequency together with their first- and second-order differences (the construction of the dynamic features is sketched below). The spectral parameters are 40th-order line spectral pair (LSP) parameters extracted with STRAIGHT (Speech Transformation and Representation based on Adaptive Interpolation of weighted spectrogram); on the text side, text analysis yields text features with context labels.
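The composition of the observation vectors (statics plus first- and second-order differences) can be sketched as follows; the regression windows are a common choice assumed here, since the patent does not state them:

```python
import numpy as np

def add_dynamic_features(static):
    """Append first- and second-order differences to the static frames.
    static: (T, D) array, e.g. 40th-order LSP + energy + log-F0 per frame.
    Returns (T, 3*D): [static, delta, delta-delta]."""
    padded = np.pad(static, ((1, 1), (0, 0)), mode="edge")
    delta = 0.5 * (padded[2:] - padded[:-2])                 # first difference
    delta2 = padded[2:] - 2.0 * padded[1:-1] + padded[:-2]   # second difference
    return np.hstack([static, delta, delta2])
```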
Step 2: align the text with the speech data to obtain each phoneme's relative position and duration within the speech.
Step 3: take the context-dependent text features and speech parameters obtained in step 1 and the phoneme segmentation obtained in step 2 as input features, and train the MSD-HSMM. The training flow is shown in Fig. 2: variance-floor estimation, monophone HMM training, context-dependent HMM training, a first decision-tree-based state clustering followed by training of the clustered HMMs, untying of the clustering, training of the untied context-dependent HMMs, and a second decision-tree-based state clustering followed by training of the clustered HMMs, finally yielding the declarative HMM model set and decision trees.
Step 4: obtain a small-scale interrogative or exclamatory corpus. For interrogatives, 300 sentences containing comparable numbers of general questions, alternative questions, special (wh-) questions and yes-no questions are chosen as adaptation data, and CMLLR and SMAP adaptive training is carried out.
The HSMM constrained maximum likelihood linear regression (HSMM-CMLLR) adaptation uses a shared transformation to convert the means and variances of a state's output distribution and duration distribution simultaneously, as in formulas (1) and (2):
$$ b_i(o) = \mathcal{N}\big(o;\ \zeta'\mu_i - \epsilon',\ \zeta'\Sigma_i\zeta'^{T}\big) = |\zeta|\,\mathcal{N}\big(\zeta o + \epsilon;\ \mu_i,\ \Sigma_i\big) = |\zeta|\,\mathcal{N}\big(W\xi;\ \mu_i,\ \Sigma_i\big) \qquad (1) $$
$$ p_i(d) = \mathcal{N}\big(d;\ \chi' m_i - \nu',\ \chi'\sigma_i^2\chi'\big) = |\chi|\,\mathcal{N}\big(\chi d + \nu;\ m_i,\ \sigma_i^2\big) = |\chi|\,\mathcal{N}\big(X\varphi;\ m_i,\ \sigma_i^2\big) \qquad (2) $$
where $b_i(o)$ is the output distribution of state $i$, $p_i(d)$ the duration distribution of state $i$, $\mathcal{N}$ a normal distribution, $\mu_i$ the mean and $\Sigma_i$ the covariance matrix of the output Gaussian of state $i$, and $m_i$ and $\sigma_i^2$ the mean and variance of the duration distribution of state $i$; the scalar $\chi'$ transforms the mean and variance of the duration distribution.
The second and third equalities in formulas (1) and (2) show that transforming the model parameters is equivalent to transforming the observation vector $o$ and the duration $d$. Here $\epsilon' \in \mathbb{R}^L$ and $\nu'$ are bias terms, $o \in \mathbb{R}^L$ is an $L$-dimensional observation vector, $d$ is the duration, and $\zeta' \in \mathbb{R}^{L \times L}$ is the matrix transforming the mean and variance of the output Gaussian of state $i$, with $\zeta = \zeta'^{-1}$, $\epsilon = \zeta'^{-1}\epsilon'$, $\chi = \chi'^{-1}$, $\nu = \chi'^{-1}\nu'$, $\xi = [o^T, 1]^T$ and $\varphi = [d, 1]^T$; $W = [\zeta, \epsilon] \in \mathbb{R}^{L \times (L+1)}$ and $X = [\chi, \nu] \in \mathbb{R}^{1 \times 2}$ are the transformation matrices of the state output and duration distributions. The transformation parameters $\tilde{\Lambda}$ are estimated by maximizing the likelihood of the adaptation data $O$, as in formula (3):
$$ \tilde{\Lambda} = (\tilde{W}, \tilde{X}) = \arg\max_{\Lambda} P(O \mid \lambda, \Lambda) \qquad (3) $$
The transformation parameters $\tilde{\Lambda}$ are solved with the expectation-maximization (EM) algorithm, where $\lambda$ denotes the original model parameters and $\tilde{\Lambda}$ the newly estimated transform of the current round; after a number of iterations, $\tilde{\Lambda}$ is taken as the final transformation parameters.
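Once the transform $\tilde{\Lambda} = (\tilde{W}, \tilde{X})$ has been estimated, applying it to a state's Gaussians is direct. A minimal numpy sketch of the model-space view of formulas (1) and (2), assuming the transform has already been estimated by EM:

```python
import numpy as np

def apply_cmllr_output(mu, Sigma, A, b):
    """Adapt one state-output Gaussian per formula (1):
    mean  zeta' mu - eps', covariance  zeta' Sigma zeta'^T,
    with A = zeta' and b = eps' from the estimated transform."""
    return A @ mu - b, A @ Sigma @ A.T

def apply_cmllr_duration(m, var, chi, nu):
    """Adapt one duration Gaussian per formula (2):
    mean  chi' m - nu', variance  chi'^2 var (chi' is a scalar)."""
    return chi * m - nu, (chi ** 2) * var
```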
The basic idea of the maximum a posteriori (MAP) algorithm is illustrated by formula (4):
$$ \theta_{MAP} = \arg\max_{\theta}\, g(\theta \mid x) = \arg\max_{\theta}\, f(x \mid \theta)\, g(\theta) \qquad (4) $$
where $g$ and $f$ are probability density functions, $x$ is the observed adaptation data, $\theta$ the model parameters, and $\theta_{MAP}$ the model parameters obtained after adaptation. Compared with the maximum likelihood (ML) algorithm, MAP introduces a prior distribution over the model parameters, making parameter estimation more reliable when the amount of data is small.
Traditional MAP, however, can adapt only the part of the models for which adaptation data are observed. Structural MAP (SMAP) targets precisely this locality of MAP estimation: by building a hierarchical structure over the model parameter space, it adjusts all of the model parameters from a small amount of adaptation data.
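The MAP estimate of formula (4) has a closed form for a Gaussian mean with a conjugate prior. A minimal sketch, with the prior weight tau as an assumed hyper-parameter; SMAP applies updates of this kind hierarchically, so that Gaussians unseen in the adaptation data inherit the adjustment estimated at their parent node of the tree:

```python
def map_update_mean(prior_mean, occ_weighted_sum, occupancy, tau=10.0):
    """MAP re-estimation of a Gaussian mean (formula (4), Gaussian case).
    prior_mean:       mean of the prior (declarative) model
    occ_weighted_sum: sum of occupancy-weighted adaptation observations
    occupancy:        total Gaussian occupancy in the adaptation data
    tau:              prior weight; larger values trust the prior more
    The result interpolates between the prior mean and the data mean."""
    return (tau * prior_mean + occ_weighted_sum) / (tau + occupancy)
```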
Step 5: once the interrogative or exclamatory acoustic model is obtained, text-to-speech synthesis can be carried out. The text to be synthesized is passed through text analysis to obtain context-dependent text features, which serve as input to the MSD-HSMM-based acoustic model for speech parameter generation; the speech parameters are finally sent to a vocoder to produce the generated speech. Speech synthesis with the MSD-HSMM-based acoustic model is shown in Fig. 4.
The invention proposes training an initial model on a declarative corpus and performing adaptive MSD-HSMM training with a CMLLR- and SMAP-based adaptation algorithm to obtain interrogative and exclamatory acoustic models; the method can be extended to the synthesis of other sentence moods.
Embodiment two
Step 1: obtain a large-scale declarative corpus of nearly 6,700 recorded sentences. All recordings are single-channel, 16 kHz sample rate, 16-bit WAV files. A 7-state context-dependent first-order MSD-HSMM serves as the initial acoustic model, with the output distribution of each state being a single Gaussian. The feature vector of each speech frame consists of the spectrum, energy and fundamental frequency together with their first- and second-order differences. The spectral parameters are 40th-order line spectral pair (LSP) parameters extracted with STRAIGHT (Speech Transformation and Representation based on Adaptive Interpolation of weighted spectrogram); on the text side, text analysis yields text features with context labels.
Step 2: align the text with the speech data to obtain each phoneme's relative position and duration within the speech.
Step 3: a number of restricted Boltzmann machines (RBMs) are stacked, and a linear transformation layer is added at the top as the output layer, outputting speech parameters including spectral parameters, fundamental frequency and the voiced/unvoiced decision. The input layer takes context-dependent text features, including phoneme class, syllable position, word length and phrase position. In this conventional DNN acoustic model there is only one value distinguishing voiced sounds from noise-excited unvoiced ones: a single linear output of the linear transformation layer with a value between 0 and 1. Modeling the voiced/unvoiced distinction with this one value alone is clearly insufficient. Therefore, to improve the modeling of the voiced/unvoiced distinction, multi-task learning is introduced on top of the original deep neural network for model training.
The secondary learning task is voiced/unvoiced classification. The overall deep neural network learning framework is shown in Fig. 3. Four hidden layers are formed by stacking RBMs pre-trained with the contrastive divergence (CD) criterion. The first (input) layer is normalized to a standard Gaussian distribution, so the first pre-trained RBM is a Gaussian-Bernoulli RBM and the remaining ones are Bernoulli-Bernoulli RBMs. The output of the last hidden layer is given by formula (5):
$$ h = g\big(\cdots g(W_1 x + b_1) \cdots\big) \qquad (5) $$
where the $W$ and $b$ are weight matrices and bias vectors whose subscripts indicate the layer, $g$ is the sigmoid function, $x$ is the input text feature, and $h$ is the hidden-layer output.
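Pre-training with the contrastive divergence criterion can be illustrated with a single CD-1 update for a Bernoulli-Bernoulli RBM (the Gaussian-Bernoulli first layer differs only in its visible units). A minimal sketch, not the exact training recipe of the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, lr=1e-3):
    """One CD-1 step for a Bernoulli-Bernoulli RBM.
    v0: (batch, n_vis) visible data; W: (n_vis, n_hid)."""
    ph0 = sigmoid(v0 @ W + b_hid)                 # positive phase
    h0 = (rng.random(ph0.shape) < ph0).astype(v0.dtype)
    pv1 = sigmoid(h0 @ W.T + b_vis)               # one Gibbs step back
    ph1 = sigmoid(pv1 @ W + b_hid)                # negative phase
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n      # approximate gradient
    b_vis += lr * (v0 - pv1).mean(axis=0)
    b_hid += lr * (ph0 - ph1).mean(axis=0)
    return W, b_vis, b_hid
```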
The output side of the deep neural network has a softmax layer acting as the voiced/unvoiced classifier and a linear transformation layer acting as the output layer for the speech parameters; the two layers are stacked side by side on the pre-trained hidden layers.
The linear transformation layer is trained with the minimum mean square error (MMSE) criterion, as in formula (6):
$$ D_{MSE}(\hat{y}, y) = \frac{1}{T} \sum_{t=1}^{T} (\hat{y}_t - y_t)^2 \qquad (6) $$
where $T$ is the number of frames, $y$ the target speech parameters, and $\hat{y}$ the generated speech parameters, produced as in formula (7):
$$ \hat{y} = \tilde{g}(W_A h + b_A) \qquad (7) $$
where $\tilde{g}$ is the linear transformation function, $W$, $b$ and $h$ have the same meanings as in formula (5), and the subscript $A$ denotes the layer index of the output layer.
The softmax layer is trained with the cross-entropy (CE) criterion, with the objective function shown in formula (8):
$$ D_{CE}(\hat{s}, s) = -\sum_{n=1}^{N} \sum_{t=1}^{T} \log \hat{s}_{s} \qquad (8) $$
where $D_{CE}$ is the resulting cross entropy, $N$ the number of sentences, $T$ the number of frames, $s$ the target voiced/unvoiced label, $\hat{s}_s$ the predicted probability of that label, and $\hat{s}$ the generated voiced/unvoiced posterior, computed as in formula (9):
$$ \hat{s} = \frac{\exp\big(\tilde{g}(W_s h + b_s)\big)}{\sum \exp\big(\tilde{g}(W_s h + b_s)\big)} \qquad (9) $$
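Formulas (5) through (9) together define the multi-task forward pass and its two training criteria. A compact numpy sketch under assumptions (layer sizes, the task weight alpha, and the two-class voiced/unvoiced encoding are illustrative, not specified by the patent):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multitask_forward(x, hidden, lin_head, vuv_head):
    """hidden: list of (W, b) shared layers (formula (5));
    lin_head: (W_A, b_A) linear output for speech parameters (formula (7));
    vuv_head: (W_s, b_s) softmax output for voiced/unvoiced (formula (9))."""
    h = x
    for W, b in hidden:
        h = sigmoid(h @ W + b)                    # formula (5)
    W_a, b_a = lin_head
    y_hat = h @ W_a + b_a                         # formula (7)
    W_s, b_s = vuv_head
    logits = h @ W_s + b_s
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    s_hat = e / e.sum(axis=1, keepdims=True)      # formula (9)
    return y_hat, s_hat

def multitask_loss(y_hat, y, s_hat, s, alpha=1.0):
    """MMSE for the parameter head (formula (6)) plus cross entropy for
    the voiced/unvoiced head (formula (8)); s holds integer labels and
    alpha is an assumed weight balancing the two tasks."""
    mse = np.mean((y_hat - y) ** 2)
    ce = -np.mean(np.log(s_hat[np.arange(len(s)), s] + 1e-12))
    return mse + alpha * ce
```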
Step 4: obtain a small-scale interrogative or exclamatory corpus; for interrogatives, 300 sentences containing comparable numbers of general questions, alternative questions, special (wh-) questions and yes-no questions are chosen as adaptation data. The original deep neural network model is then retrained on these data, adjusting the original declarative DNN acoustic model to obtain the interrogative or exclamatory acoustic model.
Step 5: once the interrogative or exclamatory acoustic model is obtained, text-to-speech synthesis can be carried out. The text to be synthesized is passed through text analysis to obtain context-dependent text features, which serve as input to the multi-task-learning DNN acoustic model for speech parameter generation; the speech parameters are finally sent to a vocoder to produce the generated speech.
The invention proposes acoustic modeling with a multi-task-learning deep neural network, taking the voiced/unvoiced decision as the secondary learning task so that the model becomes more accurate. The multi-task learning can be extended: it is not restricted to the voiced/unvoiced secondary task, and can be broadened to multi-task learning over phoneme labels, phrase positions, and so on.
The embodiments above do not limit the invention, and the invention is not restricted to the examples given; changes, modifications, additions or substitutions made by those skilled in the art within the scope of the technical solution of the invention also belong to the protection scope of the invention.

Claims (3)

1. A statistical-parameter-based speech synthesis method for interrogative and exclamatory sentences, characterized in that the method is divided into the following three parts:
Part I: train a model on declarative sentences to obtain an initial declarative acoustic model;
A large-scale corpus of recorded declarative sentences is obtained as training data, and an acoustic model based on a hidden semi-Markov model or on a deep neural network is trained as the initial acoustic model;
When a multi-space probability distribution hidden semi-Markov model is used for the initial acoustic model, excitation parameters and spectral parameters are first extracted from the text annotation and the speech signal respectively, the fundamental frequency and spectral parameters are fused into a single vector, and first- and second-order dynamic parameters are appended as the input for multi-space probability distribution hidden semi-Markov model training, finally yielding the declarative initial acoustic model based on the multi-space probability distribution hidden semi-Markov model;
Alternatively, a deep neural network is used for the initial acoustic model: using multi-task learning, the deep neural network performs the mapping from text to acoustic parameters, yielding the initial acoustic model based on the deep neural network;
Part II: adaptively train on interrogative or exclamatory sentences to obtain the interrogative or exclamatory acoustic model;
A small-scale corpus of recorded interrogative or exclamatory sentences is obtained as adaptation data; parameters are extracted from the text annotation and the speech signal, and adaptive training is then carried out on the basis of the initial acoustic model based on the multi-space probability distribution hidden semi-Markov model obtained in Part I, yielding the interrogative or exclamatory acoustic model based on the multi-space probability distribution hidden semi-Markov model;
Alternatively, adaptive training is carried out on the basis of the initial acoustic model based on the deep neural network obtained in Part I: the deep neural network model is adjusted within the multi-task learning framework, yielding the interrogative or exclamatory acoustic model based on the deep neural network;
Part III: generate interrogative or exclamatory speech from the corresponding acoustic model;
For the text to be synthesized, text analysis is carried out; speech parameters are generated with the interrogative or exclamatory acoustic model obtained in Part II, based either on the multi-space probability distribution hidden semi-Markov model or on the deep neural network; and the parameters are passed through a vocoder, finally synthesizing the interrogative or exclamatory speech.
2. The statistical-parameter-based speech synthesis method for interrogative and exclamatory sentences according to claim 1, characterized in that: in the initial acoustic model based on the deep neural network, context-dependent text features serve as the input of the deep neural network and acoustic parameters as its output;
the context-dependent text features include phoneme identity, syllable position and phrase position; the acoustic parameters include spectrum, fundamental frequency and the voiced/unvoiced decision;
in the multi-task learning deep neural network, the voiced/unvoiced decision is the secondary learning task: at the output of the network, one neuron is coupled to the softmax layer of a softmax regression model, outputting the voiced/unvoiced decision, and a linear transformation layer outputs the speech parameters; the two layers are stacked in parallel on the pre-trained hidden layers.
3. The statistical-parameter-based speech synthesis method for interrogative and exclamatory sentences according to claim 1, characterized in that: the acoustic model based on the multi-space probability distribution hidden semi-Markov model is trained with an adaptation method blending constrained maximum likelihood linear regression with structural maximum a posteriori estimation; the method first uses constrained maximum likelihood linear regression to make a large-scale adjustment of all the model parameters of the initial acoustic model based on the multi-space probability distribution hidden semi-Markov model, and then uses structural maximum a posteriori estimation to adaptively train the parameters of the models occurring in the adaptation data.
CN201610000676.XA 2016-01-04 2016-01-04 Speech synthesis method of interrogative sentence and exclamatory sentence based on statistical parameter Pending CN105654942A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610000676.XA CN105654942A (en) 2016-01-04 2016-01-04 Speech synthesis method of interrogative sentence and exclamatory sentence based on statistical parameter

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610000676.XA CN105654942A (en) 2016-01-04 2016-01-04 Speech synthesis method of interrogative sentence and exclamatory sentence based on statistical parameter

Publications (1)

Publication Number Publication Date
CN105654942A 2016-06-08

Family

ID=56491319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610000676.XA Pending CN105654942A (en) 2016-01-04 2016-01-04 Speech synthesis method of interrogative sentence and exclamatory sentence based on statistical parameter

Country Status (1)

Country Link
CN (1) CN105654942A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1835074A (en) * 2006-04-07 2006-09-20 安徽中科大讯飞信息科技有限公司 Speaking person conversion method combined high layer discription information and model self adaption
US20140257809A1 (en) * 2011-10-28 2014-09-11 Vaibhava Goel Sparse maximum a posteriori (map) adaption
CN103035247A (en) * 2012-12-05 2013-04-10 北京三星通信技术研究有限公司 Method and device of operation on audio/video file based on voiceprint information
CN103345656A (en) * 2013-07-17 2013-10-09 中国科学院自动化研究所 Method and device for data identification based on multitask deep neural network
CN105184303A (en) * 2015-04-23 2015-12-23 南京邮电大学 Image marking method based on multi-mode deep learning
CN105118498A (en) * 2015-09-06 2015-12-02 百度在线网络技术(北京)有限公司 Training method and apparatus of speech synthesis model
CN105206258B (en) * 2015-10-19 2018-05-04 百度在线网络技术(北京)有限公司 The generation method and device and phoneme synthesizing method and device of acoustic model

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108364631A (en) * 2017-01-26 2018-08-03 北京搜狗科技发展有限公司 A kind of phoneme synthesizing method and device
CN108364631B (en) * 2017-01-26 2021-01-22 北京搜狗科技发展有限公司 Speech synthesis method and device
CN107103900A (en) * 2017-06-06 2017-08-29 西北师范大学 A kind of across language emotional speech synthesizing method and system
CN108447474A (en) * 2018-03-12 2018-08-24 北京灵伴未来科技有限公司 A kind of modeling and the control method of virtual portrait voice and Hp-synchronization
CN110942763A (en) * 2018-09-20 2020-03-31 阿里巴巴集团控股有限公司 Voice recognition method and device
CN110942763B (en) * 2018-09-20 2023-09-12 阿里巴巴集团控股有限公司 Speech recognition method and device
CN111710326A (en) * 2020-06-12 2020-09-25 携程计算机技术(上海)有限公司 English voice synthesis method and system, electronic equipment and storage medium
CN111710326B (en) * 2020-06-12 2024-01-23 携程计算机技术(上海)有限公司 English voice synthesis method and system, electronic equipment and storage medium
CN111950545A (en) * 2020-07-23 2020-11-17 南京大学 Scene text detection method based on MSNDET and space division
CN111950545B (en) * 2020-07-23 2024-02-09 南京大学 Scene text detection method based on MSDNet and space division
WO2022134833A1 (en) * 2020-12-23 2022-06-30 深圳壹账通智能科技有限公司 Speech signal processing method, apparatus and device, and storage medium


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 310000 Room 1105, 11/F, Building 4, No. 9, Jiuhuan Road, Jianggan District, Hangzhou City, Zhejiang Province

Applicant after: Limit element (Hangzhou) intelligent Polytron Technologies Inc.

Address before: 100089 Floor 1-312-316, No. 1 Building, 35 Shangdi East Road, Haidian District, Beijing

Applicant before: Limit element (Beijing) smart Polytron Technologies Inc.

Address after: 100089 Floor 1-312-316, No. 1 Building, 35 Shangdi East Road, Haidian District, Beijing

Applicant after: Limit element (Beijing) smart Polytron Technologies Inc.

Address before: 100089 Floor 1-312-316, No. 1 Building, 35 Shangdi East Road, Haidian District, Beijing

Applicant before: Limit Yuan (Beijing) Intelligent Technology Co.,Ltd.

Address after: 100089 Floor 1-312-316, No. 1 Building, 35 Shangdi East Road, Haidian District, Beijing

Applicant after: Limit Yuan (Beijing) Intelligent Technology Co.,Ltd.

Address before: 100085 Block 318, Yiquanhui Office Building, 35 Shangdi East Road, Haidian District, Beijing

Applicant before: BEIJING TIMES RUILANG TECHNOLOGY Co.,Ltd.

RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160608