CN117789771A - Cross-language end-to-end emotion voice synthesis method and system - Google Patents
- Publication number
- CN117789771A (application CN202311545240.5A)
- Authority
- CN
- China
- Prior art keywords
- emotion
- voice
- module
- text
- language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to the field of intelligent digital signal processing, in particular to a cross-language end-to-end emotion voice synthesis method and system. By training a deep neural network model with this method, given a text to be synthesized in language A and an emotional reference speech in language B, natural, fluent speech of the target speaker in language A with good emotional expression can be synthesized. The specific method comprises the following steps: collecting original training data of voice-text pairs, extracting frequency-domain features of the voice, discretely encoding the text, extracting language-independent emotion embedded codes, constructing a complete end-to-end emotion voice synthesis model, and performing supervised training. The speech synthesis model comprises an emotion text fusion coding module, a target duration prediction module, a posterior coding module, an audio decoding module and a discrimination module. After the speech synthesis model is trained to convergence, the required emotional speech of the target speaker can be inferred through the prior encoding module, the duration prediction module and the audio decoding module.
Description
Technical Field
The invention relates to the field of intelligent digital signal processing, in particular to a cross-language end-to-end emotion voice synthesis method and system.
Background
Speech synthesis is a technique that converts text into natural, fluent speech audio. In recent years, deep learning has proven highly effective in digital signal processing: deep neural networks, with their strong generalization and fitting capability, have driven continuous advances in speech synthesis technology, achieving remarkable progress. Traditional speech synthesis systems tend to sound flat and monotonous, whereas emotion speech synthesis endows the synthesized speech with vivid emotional color, making the system more human-like and natural. Research in this direction also relies mainly on deep learning, especially models such as the Recurrent Neural Network (RNN) and the Variational Autoencoder (VAE); by introducing emotion-related labels or features during training, the synthesized speech can be made to express emotional states such as anger or fear.
The Chinese patent application with publication number CN115359774A discloses a cross-language speech synthesis method based on end-to-end timbre and emotion transfer. That method collects a small amount of a user's speech, preprocesses it into speech data, and feeds the data into a trained learning network architecture: a speaker encoder extracts an embedded vector carrying the speaker's timbre and emotion characteristics, a synthesizer generates the Mel spectrum of the target speech, and finally a vocoder converts the Mel spectrum into the target speech. The processing pipeline is complex and the synthesis quality is limited.
Moreover, that method lacks a targeted decoupling process for the emotion contained in the speech, so the emotional effect of the generated speech is mediocre and overly dependent on the quality of the original speaker data, leaving insufficient flexibility in use.
Disclosure of Invention
The invention aims to meet the need for generating emotional speech in a given language under the low-resource condition in which emotion data for that language are missing.
In order to achieve the above purpose, the present invention is realized by the following technical scheme.
The invention provides a cross-language end-to-end emotion voice synthesis method, which comprises the following steps:
collecting paired speech-phoneme level text data of a large number of different speakers;
extracting frequency domain features and emotion embedded codes of voice, performing discretization coding on speaker ID and text data, and arranging to form a training data set;
constructing a complete end-to-end emotion voice synthesis model, and training the model by utilizing a training data set until the model converges;
the full end-to-end emotion voice synthesis model comprises: the system comprises an emotion text fusion encoding module, a target duration prediction module and an audio decoding module; wherein,
the emotion text fusion coding module is used for acquiring, through an encoder structure, intermediate embedded representations related to the speaker ID, the text and the emotion;
the target duration prediction module is used for outputting predicted phoneme duration;
the audio decoding module is used for obtaining a generated waveform through an up-sampling network operation according to the intermediate embedded representation and the phoneme durations;
inputting the ID of the speaker to be synthesized, the text sequence and the emotion embedded code into a complete end-to-end emotion voice synthesis model trained to be converged, and outputting to obtain a synthesized emotion voice audio result.
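A common way to connect the duration prediction module to the audio decoding module is to expand each phoneme-level embedding into frame-level features by its predicted duration. The following Python sketch of such a "length regulation" step is illustrative only; the function name and data layout are assumptions, not taken from the patent:

```python
from typing import List, Sequence

def length_regulate(embeddings: Sequence[List[float]],
                    durations: Sequence[int]) -> List[List[float]]:
    """Expand phoneme-level embeddings to frame level: repeat each
    embedding vector by its predicted duration (in frames)."""
    assert len(embeddings) == len(durations)
    frames: List[List[float]] = []
    for emb, dur in zip(embeddings, durations):
        frames.extend([emb] * max(dur, 0))  # guard against negative predictions
    return frames

# two phonemes with predicted durations 2 and 3 -> 5 output frames
frames = length_regulate([[0.1, 0.2], [0.3, 0.4]], [2, 3])
```

The expanded frame sequence then has the temporal resolution the up-sampling decoder expects.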
As one of the improvements of the above technical solution, the full end-to-end emotion voice synthesis model further includes: the posterior coding module and the discrimination module;
the posterior coding module is used for extracting posterior coding of the voice in the training stage; the posterior coding and the prior coding are constrained by the same Gaussian distribution, and the prior coding comprises an intermediate embedded representation and a phoneme duration;
the discrimination module forms a generative-adversarial relationship with the audio decoding module and is used for improving the generation capability of the audio decoding module.
As one of the improvements of the above technical solution, the frequency domain features of the extracted speech include: linear spectrum and Mel-domain spectrum; wherein,
the linear spectrum X_k(ω) of the kth short period is expressed as:
X_k(ω) = Σ_n x(n) h(n − kL) e^(−jωn)
wherein x(n) is the original voice signal, h(·) is the selected window function, L is the frame shift, ω represents frequency, e is the base of the natural logarithm, and j is the imaginary unit;
the ith-dimension Mel-domain spectrum Mel_k(ω, i) of the kth short period is expressed as:
Mel_k(ω, i) = Fbank(i, ω_max, ω_min, ω) P_k(ω)
wherein P_k(ω) is the short-time power spectrum of the kth short period, P_k(ω) = |X_k(ω)|²; Fbank is an i-dimensional Mel-domain filter bank constructed from a preset maximum frequency ω_max and a minimum frequency ω_min.
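The Mel-domain computation above can be sketched in Python. The snippet below builds a triangular Mel filter bank as a stand-in for Fbank(i, ω_max, ω_min, ω) and applies it to a single-frame power spectrum P_k(ω) = |X_k(ω)|²; the bin-mapping details are standard conventions assumed here, not specified by the patent:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate, f_min, f_max):
    """Triangular Mel filter bank over the n_fft//2 + 1 linear-frequency bins,
    spanning [f_min, f_max]."""
    mel_pts = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for b in range(left, center):                      # rising slope
            fbank[i - 1, b] = (b - left) / max(center - left, 1)
        for b in range(center, right):                     # falling slope
            fbank[i - 1, b] = (right - b) / max(right - center, 1)
    return fbank

n_fft, sr = 512, 16000
frame = np.random.randn(n_fft)
power = np.abs(np.fft.rfft(frame)) ** 2                    # P_k(w) = |X_k(w)|^2
mel = mel_filterbank(20, n_fft, sr, 0.0, sr / 2) @ power   # Mel_k(w, i)
```

The result is one 20-dimensional Mel-domain frame; stacking such frames over all short periods yields the Mel spectrogram.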
As one of the improvements of the above technical solution, the discrete coding of text data includes: all the different phonemes in the text data are mapped to positive integers starting from 1 in any order, and the text data is converted into a sequence comprising a number of positive integers.
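The discrete coding described above amounts to a lookup table from phonemes to positive integers. A minimal sketch (function names are illustrative assumptions):

```python
def build_phoneme_table(texts):
    """Map every distinct phoneme to a positive integer starting from 1.
    Any fixed order is valid; sorting just makes the mapping reproducible."""
    phonemes = sorted({p for text in texts for p in text})
    return {p: idx for idx, p in enumerate(phonemes, start=1)}

def encode_text(text, table):
    """Convert a phoneme-level text into its positive-integer sequence."""
    return [table[p] for p in text]

corpus = [["n", "i", "h", "ao"], ["h", "ao"]]
table = build_phoneme_table(corpus)   # {"ao": 1, "h": 2, "i": 3, "n": 4}
seq = encode_text(corpus[0], table)   # [4, 3, 2, 1]
```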
As one of the improvements of the above technical scheme, the language-independent emotion embedded code is obtained by inputting each voice into a pre-trained emotion judgment system;
the emotion distinguishing system comprises a classifier formed by a two-way parallel neural network and a speaker decoupling network;
the two-path parallel neural network comprises two branches, a Transformer structure and a convolutional neural network, together with a linear projection layer on the trunk for dimension reduction and classification output;
the speaker decoupling network comprises a time-delay neural network (TDNN, Time Delay Neural Network) structure.
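As a rough illustration of the TDNN building block named above, the sketch below implements one time-delay layer in numpy: each output frame is an affine transform of the input frames at fixed temporal offsets, followed by a nonlinearity. The offsets, dimensions, and ReLU are illustrative assumptions, not parameters given by the patent:

```python
import numpy as np

def tdnn_layer(x, weight, bias, context=(-2, 0, 2)):
    """One time-delay layer: every output frame is an affine transform of
    the input frames at the given temporal offsets, followed by a ReLU.
    x: (T, d_in); weight: (len(context) * d_in, d_out); bias: (d_out,)."""
    T, d_in = x.shape
    pad = max(abs(c) for c in context)
    xp = np.pad(x, ((pad, pad), (0, 0)))          # zero-pad the time axis
    out = np.empty((T, weight.shape[1]))
    for t in range(T):
        ctx = np.concatenate([xp[t + pad + c] for c in context])
        out[t] = ctx @ weight + bias
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 8))                  # 50 frames, 8-dim features
w = rng.standard_normal((3 * 8, 16)) * 0.1
h = tdnn_layer(x, w, np.zeros(16))
```

Stacking several such layers with widening contexts gives the network a large temporal receptive field, which is why TDNNs are popular for speaker representations.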
As an improvement of the above technical solution, the sorting forms a training data set, including:
for each voice, packing an original signal, a speaker ID, an extracted voice linear frequency spectrum, a Mel domain spectrum, a quantized phoneme text sequence and a language independent emotion embedding code into a tuple, and converting the original signal, the frequency spectrum, the Mel domain spectrum and the emotion embedding code into a floating point type high-dimensional tensor, and converting the speaker ID and the quantized phoneme text sequence into a long-integer tensor; all data are divided into a training set and a testing set according to a certain capacity proportion.
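The packing step can be sketched in plain Python, with nested float lists and ints standing in for the floating-point and long-integer tensors; the field order, the 90/10 split ratio, and the fixed seed are illustrative assumptions:

```python
import random

def pack_example(raw, speaker_id, linear_spec, mel_spec, phoneme_seq, emotion_emb):
    """One training tuple: continuous features as floats (stand-ins for
    floating-point tensors), IDs and phoneme sequences as ints (long tensors)."""
    return (
        [float(v) for v in raw],
        int(speaker_id),
        [[float(v) for v in row] for row in linear_spec],
        [[float(v) for v in row] for row in mel_spec],
        [int(p) for p in phoneme_seq],
        [float(v) for v in emotion_emb],
    )

def train_test_split(examples, train_ratio=0.9, seed=0):
    """Shuffle and split by a fixed capacity proportion."""
    pool = list(examples)
    random.Random(seed).shuffle(pool)
    cut = int(len(pool) * train_ratio)
    return pool[:cut], pool[cut:]

data = [pack_example([0.0, 0.1], i, [[1.0]], [[2.0]], [1, 2], [0.5])
        for i in range(10)]
train_set, test_set = train_test_split(data)
```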
The invention also provides a cross-language end-to-end emotion voice synthesis system, and the processing process of the system comprises the following steps:
collecting paired speech-phoneme level text data of a large number of different speakers;
extracting frequency domain features and emotion embedded codes of voice, performing discretization coding on speaker ID and text data, and arranging to form a training data set;
constructing a complete end-to-end emotion voice synthesis model, and training the model by utilizing a training data set until the model converges;
the full end-to-end emotion voice synthesis model comprises: the system comprises an emotion text fusion encoding module, a target duration prediction module and an audio decoding module; wherein,
the emotion text fusion coding module is used for acquiring, through an encoder structure, intermediate embedded representations related to the speaker ID, the text and the emotion;
the target duration prediction module is used for outputting predicted phoneme duration;
the audio decoding module is used for obtaining a generated waveform through an up-sampling network operation according to the intermediate embedded representation and the phoneme durations;
inputting the ID of the speaker to be synthesized, the text sequence and the emotion embedded code into a fully end-to-end emotion voice synthesis model which is established in advance and trained to be converged, and outputting to obtain a synthesized emotion voice audio result.
Compared with the prior art, the invention has the advantages that:
In general, the method constructs a complete end-to-end speech synthesis model architecture that directly establishes a mapping from text to speech and, by exploiting the strong feature-learning capability of deep learning, can synthesize speech that is natural, fluent, and close to human speech in auditory perception. The method breaks through the bottleneck that emotion is difficult to describe accurately in speech: by decoupling emotion from speech and encoding it independently, it overcomes the obstacle of low-resource data conditions and realizes cross-language emotion synthesis. The method specifically offers the following advantages:
1. The pre-trained emotion discrimination system is used as the encoder. Because the discrimination system is trained on a classification objective, it is better able to decouple emotional expression from speech, so the emotion coding is more accurate and pure and the generated result better matches human hearing;
2. the data collection and training process imposes no restriction on the language of the audio or the text, so the synthesis system can work across languages and is not affected by residual traits of the speaker's own language that arise in single-language coding;
3. inference can proceed from a reference speech alone; no specific emotion category or description needs to be given. Supplying only an "example sentence" whose emotion the output should approximate lets the target speaker's voice be synthesized imitating the emotion of the reference speech, which improves flexibility of use.
Drawings
FIG. 1 is a diagram of the overall architecture of the present invention;
FIG. 2 is a flow chart of a cross-language end-to-end emotion voice synthesis method of the present invention;
FIG. 3 is a diagram showing the constitution of the emotion judging system of the present invention;
FIG. 4 is a diagram of emotion text fusion encoding module components;
FIG. 5 is a block diagram of a posterior coding module;
FIG. 6 is a diagram of a target duration prediction module;
fig. 7 is a diagram showing the constitution of the audio decoding module and the discriminating module.
Detailed Description
As shown in FIG. 1, the invention provides a cross-language end-to-end emotion voice synthesis method. By training a deep neural network model with this method, given a text to be synthesized in language A and an emotional reference speech in language B, natural, fluent speech of the target speaker in language A with good emotional expression can be synthesized. The method comprises the following steps:
step 1) collecting a large number of paired speech-phoneme level text data;
step 2) performing data preprocessing by means of digital signal processing and the like, including extracting frequency domain features of voice and performing discretization coding on text; extracting language independent emotion embedded codes of the voice by utilizing a pre-trained emotion judging system;
step 3) carrying out normalization processing on the results obtained in the steps 1-2 to form a training data set for a training model for standby;
step 4) constructing a complete end-to-end emotion voice synthesis model, training the model by utilizing the data set (original audio, key features and emotion embedded codes) prepared in the step 3 until the model converges, finishing training, and storing the model;
and 5) training to a converged emotion voice synthesis model by utilizing the step 4, inputting parameters to be synthesized and reference voices, and automatically reasoning emotion voice synthesis audio results of the target speaker by the model.
As an improvement of the above method, the step 2) specifically includes:
The frequency-domain features of the speech are extracted as follows: the original speech signal at the sampling-point level is framed and windowed, and the linear spectrum is obtained through the short-time Fourier transform:
X_k(ω) = Σ_n x(n) h(n − kL) e^(−jωn)
where x(n) is the original signal, h(·) is the selected window function, k indexes the kth short period, L is the frame shift, ω represents frequency, e is the base of the natural logarithm, and j is the imaginary unit.
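The framing, windowing, and short-time Fourier transform just described can be sketched as follows; the frame length, hop size, and Hann window are illustrative choices not specified by the patent:

```python
import numpy as np

def stft(x, frame_len=400, hop=160):
    """Frame, apply a Hann window, and take the FFT of each frame:
    row k approximates X_k(w) for frame shift L = hop."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    spec = np.empty((n_frames, frame_len // 2 + 1), dtype=complex)
    for k in range(n_frames):
        spec[k] = np.fft.rfft(x[k * hop : k * hop + frame_len] * window)
    return spec

t = np.arange(16000) / 16000.0
sig = np.sin(2 * np.pi * 440 * t)     # one second of a 440 Hz tone
X = stft(sig)
power = np.abs(X) ** 2                # short-time power spectrum P_k(w)
```

At a 16 kHz sampling rate these defaults correspond to 25 ms frames with a 10 ms shift, a common speech-processing configuration.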
The power spectrum is obtained by taking the squared modulus of the linear spectrum, and a Mel-domain filter bank is then applied to obtain the short-time Mel-domain spectrum of the signal:
P_k(ω) = |X_k(ω)|²
Mel_k(ω, i) = Fbank(i, ω_max, ω_min, ω) P_k(ω)
wherein P_k(ω) is the kth short-time power spectrum, Fbank is an i-dimensional Mel-domain filter bank constructed from the preset maximum and minimum frequencies ω_max and ω_min, and Mel_k(ω, i) represents the ith-dimension Mel-domain spectrum of the kth short period.
Discretizing the phoneme-level text, mapping all the different phonemes to positive integers starting from 1 in any order, so that the phoneme-level text corresponding to all the voices can be converted into a sequence containing a plurality of positive integers.
Extracting language independent emotion embedded codes of voices by using a pre-trained emotion judging system, and automatically obtaining high-dimensional emotion embedded representation by passing each voice through the emotion judging system.
As an improvement of the above method, the step 3) specifically includes:
for each speech, the original signal, speaker ID, extracted speech linear spectrum, mel domain spectrum, quantized phoneme text sequence, and language independent emotion embedded code are packaged into a tuple, and the data format therein is converted into a long integer or floating point high-dimensional tensor. All data are divided into a training set and a testing set according to a certain capacity proportion.
As an improvement of the above method, the step 4) specifically includes:
the training set is input to an emotion speech synthesis system, which trains the mapping relationship from speaker ID, quantized phone text sequence, and emotion embedding code to audio waveform. The speaker ID, the quantized phoneme text sequence and the emotion embedded code are input to an emotion text fusion priori coding module together, and intermediate embedded representations related to the speaker ID, the text and the emotion are obtained through a self-encoder structure; meanwhile, the information is sent to a target duration prediction coding module, and predicted duration of each phoneme is output. The intermediate embedding and phoneme duration prediction results together form a priori code, and the priori code is input into an audio decoding module through a standardized stream structure, and a final generated waveform is obtained through up-sampling network operation. In the training stage, the posterior coding module is used for extracting posterior coding of the input real voice, and the prior and posterior coding are constrained by the same Gaussian distribution, so that the prior coding is continuously optimized to approach to the real posterior coding in the training process. The judging module and the audio decoding module form an countermeasure generation relation so as to improve the generation capacity of the audio decoding module.
As an improvement of the above method, the step 5) specifically includes:
and (3) training to a converged emotion voice synthesis model by utilizing the step (4), inputting a speaker ID to be synthesized, a quantized phoneme text sequence and emotion embedded codes, and automatically reasoning the synthesized emotion voice audio result by the model through an emotion text fusion coding module, a target duration prediction module and an audio decoding module.
The technical scheme of the invention is described in detail below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 2, embodiment 1 of the present invention provides a cross-language end-to-end emotion voice synthesis method, which includes:
step 101) collecting a plurality of paired speech-phoneme level text data;
102) preprocessing data by means of digital signal processing and the like, including extracting frequency domain characteristics of voice and discretizing and encoding text; extracting language independent emotion embedded codes of the voice by utilizing a pre-trained emotion judging system;
the choice of this emotion recognition system is varied. Off-the-shelf, e.g., https:// github.com/audiong/w 2v2-how-to may be employed. The emotion judgment system can also be built by self, as shown in fig. 3, and is a composition diagram of an emotion judgment system built by the applicant of the invention; the emotion distinguishing system comprises a classifier formed by a two-way parallel neural network and a speaker decoupling network; the two-path parallel neural network comprises a transducer structure, two branches of the convolutional neural network and a linear projection layer which is used for dimension reduction and classification output on a bus; the speaker decoupling network includes a TDNN (Time Delay Neural Network ) module.
Step 103) carrying out normalization processing on the results obtained in the steps 101-102 to form a training data set for a training model for standby;
104), constructing a complete end-to-end emotion voice synthesis model, training the model by utilizing the data set (original audio, key features and emotion embedded codes) prepared in the step 103 until the model converges, finishing training, and storing the model; FIGS. 4-7 are diagrams of the modules of the constructed complete end-to-end emotion voice synthesis model, wherein FIG. 4 is a diagram of emotion text fusion encoding module; FIG. 5 is a block diagram of a posterior coding module; FIG. 6 is a diagram of a target duration prediction module; fig. 7 is a diagram showing the constitution of the audio decoding module and the discriminating module.
The training set is input into the complete end-to-end emotion speech synthesis model, which learns the mapping from speaker ID, quantized phoneme text sequence, and emotion embedded code to the audio waveform. The speaker ID, the quantized phoneme text sequence, and the emotion embedded code are input together to the emotion text fusion prior coding module, and intermediate embedded representations related to speaker ID, text, and emotion are obtained through an encoder structure; meanwhile, this information is sent to the target duration prediction module, which outputs the predicted duration of each phoneme. The intermediate embedding and the phoneme duration predictions together form the prior code, which is passed through a normalizing-flow structure into the audio decoding module, where the final generated waveform is obtained through an up-sampling network operation. In the training stage, the posterior coding module extracts the posterior code of the input real speech, and the prior and posterior codes are constrained to the same Gaussian distribution, so that the prior code is continuously optimized toward the real posterior code during training. The discrimination module forms a generative-adversarial relationship with the audio decoding module to improve the latter's generation capability.
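The generative-adversarial relationship between the discrimination module and the audio decoding module requires a pair of adversarial losses. The patent does not specify their form; a least-squares GAN objective, sketched below, is one common choice and an assumption here:

```python
import numpy as np

def disc_loss(scores_real, scores_fake):
    """Least-squares discriminator loss: real scores toward 1, fake toward 0."""
    return float(np.mean((scores_real - 1.0) ** 2) + np.mean(scores_fake ** 2))

def gen_loss(scores_fake):
    """Least-squares generator (audio decoder) loss: fake scores toward 1."""
    return float(np.mean((scores_fake - 1.0) ** 2))

d_real = np.array([0.9, 1.1])    # discriminator scores on real waveforms
d_fake = np.array([0.1, -0.1])   # discriminator scores on generated waveforms
ld = disc_loss(d_real, d_fake)   # small: discriminator separates well
lg = gen_loss(d_fake)            # large: generator not yet fooling it
```

Training alternates between the two losses, pushing the audio decoder to produce waveforms the discriminator cannot tell from real speech.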
Step 105) training to a converged emotion voice synthesis model by utilizing the step 104, inputting parameters to be synthesized and reference voices, and automatically reasoning emotion voice synthesis audio results of the target speaker by the model.
Example 2
The embodiment 2 of the invention provides a cross-language end-to-end emotion voice synthesis system. The processing procedure of the system comprises the following steps:
collecting paired speech-phoneme level text data of a large number of different speakers;
extracting frequency domain features and emotion embedded codes of voice, performing discretization coding on speaker ID and text data, and arranging to form a training data set;
constructing a complete end-to-end emotion voice synthesis model, and training the model by utilizing a training data set until the model converges;
the full end-to-end emotion voice synthesis model comprises: the system comprises an emotion text fusion encoding module, a target duration prediction module and an audio decoding module; wherein,
the emotion text fusion coding module is used for acquiring, through an encoder structure, intermediate embedded representations related to the speaker ID, the text and the emotion;
the target duration prediction module is used for outputting predicted phoneme duration;
the audio decoding module is used for obtaining a generated waveform through an up-sampling network operation according to the intermediate embedded representation and the phoneme durations;
inputting the ID of the speaker to be synthesized, the text sequence and the emotion embedded code into a fully end-to-end emotion voice synthesis model which is established in advance and trained to be converged, and outputting to obtain a synthesized emotion voice audio result.
Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the appended claims.
Claims (7)
1. A cross-language end-to-end emotion voice synthesis method comprises the following steps:
collecting paired speech-phoneme level text data of a large number of different speakers;
extracting frequency domain features and emotion embedded codes of voice, performing discretization coding on speaker ID and text data, and arranging to form a training data set;
constructing a complete end-to-end emotion voice synthesis model, and training the model by utilizing a training data set until the model converges;
the full end-to-end emotion voice synthesis model comprises: the system comprises an emotion text fusion encoding module, a target duration prediction module and an audio decoding module; wherein,
the emotion text fusion coding module is used for acquiring, through an encoder structure, intermediate embedded representations related to the speaker ID, the text and the emotion;
the target duration prediction module is used for outputting predicted phoneme duration;
the audio decoding module is used for obtaining a generated waveform through an up-sampling network operation according to the intermediate embedded representation and the phoneme durations;
inputting the ID of the speaker to be synthesized, the text sequence and the emotion embedded code into a complete end-to-end emotion voice synthesis model trained to be converged, and outputting to obtain a synthesized emotion voice audio result.
2. The cross-language end-to-end emotion voice synthesis method of claim 1, wherein said full end-to-end emotion voice synthesis model further comprises: the posterior coding module and the discrimination module;
the posterior coding module is used for extracting posterior coding of the voice in the training stage; the posterior coding and the prior coding are constrained by the same Gaussian distribution, and the prior coding comprises an intermediate embedded representation and a phoneme duration;
the discrimination module forms a generative-adversarial relationship with the audio decoding module and is used for improving the generation capability of the audio decoding module.
3. The cross-language end-to-end emotion voice synthesis method of claim 1, wherein the extracted frequency domain features of the voice comprise: linear spectrum and Mel-domain spectrum; wherein,
the linear spectrum X_k(ω) of the kth short period is expressed as:
X_k(ω) = Σ_n x(n) h(n − kL) e^(−jωn)
wherein x(n) is the original voice signal, h(·) is the selected window function, L is the frame shift, ω represents frequency, e is the base of the natural logarithm, and j is the imaginary unit;
the ith-dimension Mel-domain spectrum Mel_k(ω, i) of the kth short period is expressed as:
Mel_k(ω, i) = Fbank(i, ω_max, ω_min, ω) P_k(ω)
wherein P_k(ω) is the short-time power spectrum of the kth short period, P_k(ω) = |X_k(ω)|²; Fbank is an i-dimensional Mel-domain filter bank constructed from a preset maximum frequency ω_max and a minimum frequency ω_min.
4. The method for synthesizing end-to-end emotion voice across languages according to claim 1, wherein said discretizing the encoding of the text data comprises: all the different phonemes in the text data are mapped to positive integers starting from 1 in any order, and the text data is converted into a sequence comprising a number of positive integers.
5. The method for synthesizing end-to-end emotion voice across languages according to claim 1, wherein said language independent emotion embedded codes are obtained by inputting each voice into a pre-trained emotion discrimination system;
the emotion distinguishing system comprises a classifier formed by a two-way parallel neural network and a speaker decoupling network;
the two-path parallel neural network comprises two branches, a Transformer structure and a convolutional neural network, together with a linear projection layer on the trunk for dimension reduction and classification output;
the speaker decoupling network comprises a time-delay neural network structure.
6. The method for end-to-end emotion voice synthesis across language of claim 1, wherein said sorting forms a training data set comprising:
for each voice, packing an original signal, a speaker ID, an extracted voice linear frequency spectrum, a Mel domain spectrum, a quantized phoneme text sequence and a language independent emotion embedding code into a tuple, and converting the original signal, the frequency spectrum, the Mel domain spectrum and the emotion embedding code into a floating point type high-dimensional tensor, and converting the speaker ID and the quantized phoneme text sequence into a long-integer tensor; all data are divided into a training set and a testing set according to a certain capacity proportion.
7. A cross-language end-to-end emotion voice synthesis system is characterized in that the processing procedure of the system comprises the following steps:
collecting paired speech-phoneme level text data of a large number of different speakers;
extracting frequency domain features and emotion embedded codes of voice, performing discretization coding on speaker ID and text data, and arranging to form a training data set;
constructing a complete end-to-end emotion voice synthesis model, and training the model by utilizing a training data set until the model converges;
the complete end-to-end emotion voice synthesis model comprises: an emotion-text fusion encoding module, a target duration prediction module, and an audio decoding module; wherein,
the emotion-text fusion encoding module is used for acquiring an intermediate embedded representation related to the speaker ID, the text, and the emotion through an encoder structure;
the target duration prediction module is used for outputting predicted phoneme durations;
the audio decoding module is used for obtaining the generated waveform through an up-sampling network operation according to the intermediate embedded representation and the phoneme durations;
inputting the ID of the speaker to be synthesized, the text sequence, and the emotion embedding code into the complete end-to-end emotion voice synthesis model built in advance and trained to convergence, and outputting a synthesized emotional speech audio result.
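The hand-off between the duration prediction module and the audio decoding module described above is commonly realized by repeating each phoneme's intermediate embedding for its predicted number of frames, then up-sampling frame-rate features toward the sample rate. The sketch below assumes that structure; `length_regulate`, `upsample`, and all dimensions are illustrative, and nearest-neighbour repetition stands in for the model's learned up-sampling network.

```python
import numpy as np

def length_regulate(embeddings, durations):
    """Repeat each phoneme's intermediate embedding by its predicted frame count."""
    return np.repeat(embeddings, durations, axis=0)

def upsample(frames, factor):
    """Toy stand-in for the audio decoder's up-sampling network:
    nearest-neighbour expansion of frame-rate features toward sample rate."""
    return np.repeat(frames, factor, axis=0)

emb = np.arange(12, dtype=np.float32).reshape(4, 3)   # 4 phonemes, embedding dim 3
dur = np.array([2, 1, 3, 2])                          # predicted durations (frames)
frames = length_regulate(emb, dur)                    # (8, 3): one row per frame
wave_feats = upsample(frames, 256)                    # (2048, 3): 256 samples/frame
print(frames.shape, wave_feats.shape)  # (8, 3) (2048, 3)
```

In a trained model, the final projection of `wave_feats` to a 1-D waveform would be learned jointly with the rest of the network, which is what makes the pipeline end-to-end.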
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311545240.5A CN117789771A (en) | 2023-11-20 | 2023-11-20 | Cross-language end-to-end emotion voice synthesis method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117789771A true CN117789771A (en) | 2024-03-29 |
Family
ID=90398992
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311545240.5A Pending CN117789771A (en) | 2023-11-20 | 2023-11-20 | Cross-language end-to-end emotion voice synthesis method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117789771A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118135990A (en) * | 2024-05-06 | 2024-06-04 | 厦门立马耀网络科技有限公司 | End-to-end text-to-speech synthesis method and system combining autoregression |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112017644B (en) | Sound transformation system, method and application | |
Morgan et al. | Neural networks and speech processing | |
CN102779508B (en) | Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof | |
CN106971709A (en) | Statistic parameter model method for building up and device, phoneme synthesizing method and device | |
CN116364055B (en) | Speech generation method, device, equipment and medium based on pre-training language model | |
CN105654939A (en) | Voice synthesis method based on voice vector textual characteristics | |
CN112581963B (en) | Voice intention recognition method and system | |
Ghule et al. | Feature extraction techniques for speech recognition: A review | |
CN110010136A (en) | The training and text analyzing method, apparatus, medium and equipment of prosody prediction model | |
KR20230133362A (en) | Generate diverse and natural text-to-speech conversion samples | |
CN111508469A (en) | Text-to-speech conversion method and device | |
KR20200084443A (en) | System and method for voice conversion | |
CN117789771A (en) | Cross-language end-to-end emotion voice synthesis method and system | |
KR20190135853A (en) | Method and system of text to multiple speech | |
KR20200088263A (en) | Method and system of text to multiple speech | |
Wu et al. | Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations | |
CN116092473A (en) | Prosody annotation model, training method of prosody prediction model and related equipment | |
Zhao et al. | Research on voice cloning with a few samples | |
CN116798403A (en) | Speech synthesis model method capable of synthesizing multi-emotion audio | |
CN112242134A (en) | Speech synthesis method and device | |
Sharma et al. | Soft-Computational Techniques and Spectro-Temporal Features for Telephonic Speech Recognition: an overview and review of current state of the art | |
Chandra et al. | Towards the development of accent conversion model for (l1) bengali speaker using cycle consistent adversarial network (cyclegan) | |
CN113436607A (en) | Fast voice cloning method | |
CN113763924B (en) | Acoustic deep learning model training method, and voice generation method and device | |
CN118197277B (en) | Speech synthesis method, device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||