CN117789771A - Cross-language end-to-end emotion voice synthesis method and system - Google Patents

Cross-language end-to-end emotion voice synthesis method and system

Info

Publication number: CN117789771A
Application number: CN202311545240.5A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: emotion, voice, module, text, language
Legal status: Pending
Inventors: 张鹏远, 华桦, 尚增强, 黎塔, 王丽
Current Assignee: Institute of Acoustics CAS
Original Assignee: Institute of Acoustics CAS
Application filed by Institute of Acoustics CAS
Priority to CN202311545240.5A
Publication of CN117789771A

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to the field of intelligent digital signal processing, in particular to a cross-language end-to-end emotion voice synthesis method and system. After a deep neural network model is trained with the method, given a text to be synthesized in language A and an emotional reference utterance in language B, natural, fluent speech of the target speaker in language A with good emotional expression can be synthesized. The method comprises the following steps: collecting original training data of speech-text pairs, extracting frequency domain features of the speech, discretely encoding the text, extracting language-independent emotion embedded codes, constructing a full end-to-end emotion voice synthesis model and performing supervised training. The speech synthesis model comprises an emotion text fusion encoding module, a target duration prediction module, a posterior coding module, an audio decoding module and a discrimination module. After the speech synthesis model is trained to convergence, the desired emotional speech of the target speaker can be inferred through the prior encoding module, the duration prediction module and the audio decoding module.

Description

Cross-language end-to-end emotion voice synthesis method and system
Technical Field
The invention relates to the field of intelligent digital signal processing, in particular to a cross-language end-to-end emotion voice synthesis method and system.
Background
Speech synthesis is a technique that converts text into natural, fluent speech audio. In recent years, deep learning has proved highly effective in digital signal processing; with their strong generalization and fitting capability, deep neural networks have driven continuous advances in speech synthesis technology, and remarkable progress has been achieved. Speech produced by traditional synthesis systems tends to sound flat and monotonous, whereas emotional speech synthesis gives the synthesized speech vivid emotional color, making the speech synthesis system more human-like and natural. Research in this area also relies mainly on deep learning, in particular the development of models such as recurrent neural networks (RNNs) and variational autoencoders (VAEs); by introducing emotion-related labels or features during training, the synthesized speech can be made to express emotional states such as anger and fear.
The Chinese patent application with publication number CN115359774A discloses a cross-language speech synthesis method based on end-to-end timbre and emotion transfer. That method collects a small amount of speech from a user, preprocesses it to obtain speech data, and feeds the data into a trained learning network architecture: a speaker encoder in the architecture extracts an embedded vector carrying the speaker's timbre and emotion characteristics, a synthesizer generates the Mel spectrum of the target speech, and the Mel spectrum is finally input to a vocoder to synthesize the target speech. The processing pipeline of that method is complicated and its synthesis effect is poor.
Moreover, that method lacks a targeted process for decoupling the emotion contained in the speech, so the emotional effect of the generated speech is mediocre and depends too heavily on the quality of the original speaker data, and the method is not flexible to use.
Disclosure of Invention
The invention aims to meet the need of generating emotional speech in a given language under the low-resource condition that emotional data in that language are missing.
To achieve the above purpose, the present invention is realized through the following technical solution.
The invention provides a cross-language end-to-end emotion voice synthesis method, which comprises the following steps:
collecting paired speech-phoneme level text data of a large number of different speakers;
extracting frequency domain features and emotion embedded codes of voice, performing discretization coding on speaker ID and text data, and arranging to form a training data set;
constructing a complete end-to-end emotion voice synthesis model, and training the model by utilizing a training data set until the model converges;
the full end-to-end emotion voice synthesis model comprises: an emotion text fusion encoding module, a target duration prediction module and an audio decoding module; wherein,
the emotion text fusion encoding module is used for obtaining, through an encoder structure, intermediate embedded representations related to the speaker ID, the text and the emotion;
the target duration prediction module is used for outputting predicted phoneme durations;
the audio decoding module is used for obtaining a generated waveform through an up-sampling network according to the intermediate embedded representation and the phoneme durations;
inputting the speaker ID to be synthesized, the text sequence and the emotion embedded code into the full end-to-end emotion voice synthesis model trained to convergence, and outputting a synthesized emotional speech audio result.
As one of the improvements of the above technical solution, the full end-to-end emotion voice synthesis model further comprises a posterior coding module and a discrimination module;
the posterior coding module is used for extracting the posterior code of the speech in the training stage; the posterior code and the prior code are constrained by the same Gaussian distribution, and the prior code comprises the intermediate embedded representation and the phoneme durations;
the discrimination module forms a generative-adversarial relationship with the audio decoding module and is used for improving the generation capability of the audio decoding module.
As one of the improvements of the above technical solution, the extracted frequency domain features of the speech include a linear spectrum and a Mel domain spectrum; wherein,
the linear spectrum X_k(ω) of the kth short period is expressed as:
X_k(ω) = Σ_n x(n)·h(n − kM)·e^(−jωn)
wherein x(n) is the original voice signal, h(·) is the selected window function, M is the frame shift, ω represents the frequency, e is the base of the natural logarithm, and j is the imaginary unit;
the ith dimension Mel domain spectrum Mel_k(ω, i) of the kth short period is expressed as:
Mel_k(ω, i) = Fbank(i, ω_max, ω_min, ω)·P_k(ω)
wherein P_k(ω) is the short-time power spectrum of the kth short period, P_k(ω) = |X_k(ω)|²; Fbank is the i-dimensional Mel domain filter bank constructed from a preset maximum frequency ω_max and minimum frequency ω_min.
As one of the improvements of the above technical solution, the discrete coding of text data includes: all the different phonemes in the text data are mapped to positive integers starting from 1 in any order, and the text data is converted into a sequence comprising a number of positive integers.
As one of the improvements of the above technical solution, the language-independent emotion embedded code is obtained by inputting each utterance into a pre-trained emotion discrimination system;
the emotion discrimination system comprises a classifier formed by a two-branch parallel neural network, and a speaker decoupling network;
the two-branch parallel neural network comprises a Transformer branch and a convolutional neural network branch, together with a linear projection layer on the main path for dimension reduction and classification output;
the speaker decoupling network comprises a time delay neural network (TDNN) structure.
As one of the improvements of the above technical solution, the organizing to form a training data set comprises:
for each utterance, packing the original signal, the speaker ID, the extracted linear spectrum, the Mel domain spectrum, the quantized phoneme text sequence and the language-independent emotion embedded code into a tuple; the original signal, the spectra and the emotion embedded code are converted into floating-point high-dimensional tensors, and the speaker ID and the quantized phoneme text sequence are converted into long-integer tensors; all data are divided into a training set and a test set at a preset proportion.
The invention also provides a cross-language end-to-end emotion voice synthesis system, and the processing procedure of the system comprises the following steps:
collecting paired speech-phoneme level text data of a large number of different speakers;
extracting frequency domain features and emotion embedded codes of voice, performing discretization coding on speaker ID and text data, and arranging to form a training data set;
constructing a complete end-to-end emotion voice synthesis model, and training the model by utilizing a training data set until the model converges;
the full end-to-end emotion voice synthesis model comprises: an emotion text fusion encoding module, a target duration prediction module and an audio decoding module; wherein,
the emotion text fusion encoding module is used for obtaining, through an encoder structure, intermediate embedded representations related to the speaker ID, the text and the emotion;
the target duration prediction module is used for outputting predicted phoneme durations;
the audio decoding module is used for obtaining a generated waveform through an up-sampling network according to the intermediate embedded representation and the phoneme durations;
inputting the speaker ID to be synthesized, the text sequence and the emotion embedded code into the full end-to-end emotion voice synthesis model established in advance and trained to convergence, and outputting a synthesized emotional speech audio result.
Compared with the prior art, the invention has the advantages that:
in general, the method constructs a set of complete end-to-end voice synthesis model architecture, directly establishes a mapping relation from text to voice, and can synthesize natural, smooth and human voice close to the sense of speech and hearing by utilizing the strong characteristic learning capability of deep learning. The method breaks through the bottleneck that emotion is difficult to describe accurately in voice, and the emotion is decoupled and encoded independently from voice, so that the obstacle of low-resource data condition is overcome, and the emotion synthesis of cross-language is realized. The method specifically comprises the following advantages:
1. the pre-trained emotion distinguishing system is used as an encoder, and the aim of the distinguishing system is a classification task, so that the encoder has better capacity of decoupling emotion expression from voice, and emotion coding is more accurate and pure, so that the generation effect is more consistent with human hearing;
2. the data acquisition and training process has no limitation on the languages of the audio and the text, so that the synthesis system can work across languages and is not influenced by the residues of the spoken voice of the speaker during single language coding;
3. the inference can be carried out from the reference voice, namely, the specific emotion type or description is not required to be pointed out, only one sentence of 'example sentence' which is required to be close to the reference voice is required to be given, and the emotion possessed by the reference voice can be imitatively synthesized by the target speaker, so that the flexibility of use is improved.
Drawings
FIG. 1 is a diagram of the overall architecture of the present invention;
FIG. 2 is a flow chart of a cross-language end-to-end emotion voice synthesis method of the present invention;
FIG. 3 is a diagram showing the constitution of the emotion discrimination system of the present invention;
FIG. 4 is a diagram of emotion text fusion encoding module components;
FIG. 5 is a block diagram of a posterior coding module;
FIG. 6 is a diagram of a target duration prediction module;
FIG. 7 is a diagram showing the constitution of the audio decoding module and the discrimination module.
Detailed Description
As shown in FIG. 1, the invention provides a cross-language end-to-end emotion voice synthesis method. After a deep neural network model is trained with the method, given a text to be synthesized in language A and an emotional reference utterance in language B, natural, fluent speech of the target speaker in language A with good emotional expression can be synthesized. The method comprises the following steps:
step 1) collecting a large number of paired speech-phoneme level text data;
step 2) performing data preprocessing by digital signal processing and similar means, including extracting the frequency domain features of the speech and discretely encoding the text; extracting the language-independent emotion embedded code of each utterance using a pre-trained emotion discrimination system;
step 3) organizing the results obtained in steps 1)-2) to form a training data set for model training;
step 4) constructing a full end-to-end emotion voice synthesis model, training the model with the data set (original audio, key features and emotion embedded codes) prepared in step 3) until the model converges, then finishing training and saving the model;
step 5) using the emotion voice synthesis model trained to convergence in step 4), inputting the parameters to be synthesized and the reference voice; the model automatically infers the synthesized emotional speech audio of the target speaker.
As an improvement of the above method, the step 2) specifically includes:
the frequency domain feature of the voice is extracted, and the steps are as follows: the original voice signal at the sampling point level is subjected to framing and windowing, and the linear frequency spectrum is obtained through short-time Fourier transform,
where x (n) is the original signal, h (·) is the selected window function, k represents the kth short period, ω represents the frequency, e is the base of the natural logarithm, and j is the imaginary unit.
Obtaining power spectrum by taking the modulus square of the linear spectrum, then using a Mel domain filter bank to carry out filtering treatment to obtain short-time Mel domain spectrum of the signal,
P k (ω)=|X k (ω)| 2
Mel k (ω,i)=Fbank(i,ω maxmin ,ω)P k (ω)
wherein P is k (omega) is the kth short-time power spectrum, fbank is the frequency ω by the preset maximum and minimum maxmin Constructed i-dimensional Mel domain filter bank, mel k (ω, i) represents the ith dimension Mel-domain spectrum of the kth short period.
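For illustration only, a minimal sketch of this feature-extraction step is given below using librosa; the window type, FFT size, hop length, filter-bank size and frequency bounds are assumed values chosen here, not parameters specified by the patent.

```python
import numpy as np
import librosa

def extract_features(wav_path, n_fft=1024, hop_length=256,
                     n_mels=80, fmin=0.0, fmax=8000.0):
    """Compute the linear spectrum |X_k(w)| and the Mel domain spectrum
    Mel_k(w, i) = Fbank * P_k(w) described above (parameter values assumed)."""
    x, sr = librosa.load(wav_path, sr=None)            # original signal x(n)
    # Framing + windowing + short-time Fourier transform -> X_k(w)
    X = librosa.stft(x, n_fft=n_fft, hop_length=hop_length, window="hann")
    linear_spec = np.abs(X)                            # linear spectrum, shape (freq, frames)
    P = linear_spec ** 2                               # short-time power spectrum P_k(w)
    # i-dimensional Mel filter bank built from the preset fmin / fmax
    fbank = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels,
                                fmin=fmin, fmax=fmax)
    mel_spec = fbank @ P                               # Mel domain spectrum Mel_k(w, i)
    return linear_spec, mel_spec
```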
Discretizing the phoneme-level text, mapping all the different phonemes to positive integers starting from 1 in any order, so that the phoneme-level text corresponding to all the voices can be converted into a sequence containing a plurality of positive integers.
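A minimal sketch of this discretization follows; the toy phoneme transcriptions are hypothetical, and only the 1-based mapping and the conversion to integer sequences come from the description above.

```python
def build_phoneme_map(phoneme_texts):
    """Map every distinct phoneme to a positive integer starting from 1
    (the order is arbitrary, as stated above)."""
    symbols = sorted({p for text in phoneme_texts for p in text.split()})
    return {p: i + 1 for i, p in enumerate(symbols)}

def encode_phonemes(text, phoneme_map):
    """Convert one phoneme-level text into a sequence of positive integers."""
    return [phoneme_map[p] for p in text.split()]

# Usage with toy, hypothetical phoneme transcriptions:
texts = ["n i2 h ao3", "h ao3"]
pmap = build_phoneme_map(texts)
print(encode_phonemes("n i2 h ao3", pmap))   # -> [4, 3, 2, 1] for this sorted order
```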
The language-independent emotion embedded code of each utterance is extracted using a pre-trained emotion discrimination system: each utterance is passed through the emotion discrimination system, and a high-dimensional emotion embedded representation is obtained automatically.
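One way this extraction might look is sketched below, assuming the emotion discrimination system is wrapped as a PyTorch module exposing a method that returns a hidden-layer embedding; the class interface and the `embed` method name are hypothetical, not part of the patent.

```python
import torch

@torch.no_grad()
def extract_emotion_embedding(wav_tensor, emotion_model):
    """Pass one utterance through a pre-trained emotion discrimination model and
    take a hidden representation before the classification layer as the
    language-independent emotion embedding (interface names are assumed)."""
    emotion_model.eval()
    # hypothetical API: the model exposes its pre-classifier hidden vector
    embedding = emotion_model.embed(wav_tensor.unsqueeze(0))  # (1, emb_dim)
    return embedding.squeeze(0)                               # (emb_dim,)
```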
As an improvement of the above method, the step 3) specifically includes:
for each speech, the original signal, speaker ID, extracted speech linear spectrum, mel domain spectrum, quantized phoneme text sequence, and language independent emotion embedded code are packaged into a tuple, and the data format therein is converted into a long integer or floating point high-dimensional tensor. All data are divided into a training set and a testing set according to a certain capacity proportion.
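A sketch of how such tuples might be assembled with PyTorch tensors is given below; the 9:1 split ratio and the field order are assumptions made here, only the data types (floating-point vs. long-integer) follow the description above.

```python
import random
import torch

def make_example(wav, speaker_id, linear_spec, mel_spec, phoneme_ids, emotion_emb):
    """Pack one utterance into a tuple: float tensors for signal / spectra /
    emotion embedding, long tensors for speaker ID and phoneme sequence."""
    return (
        torch.tensor(wav, dtype=torch.float32),
        torch.tensor(speaker_id, dtype=torch.long),
        torch.tensor(linear_spec, dtype=torch.float32),
        torch.tensor(mel_spec, dtype=torch.float32),
        torch.tensor(phoneme_ids, dtype=torch.long),
        torch.tensor(emotion_emb, dtype=torch.float32),
    )

def split_dataset(examples, train_ratio=0.9, seed=0):
    """Divide all examples into a training set and a test set at a fixed ratio."""
    rng = random.Random(seed)
    examples = examples[:]
    rng.shuffle(examples)
    n_train = int(len(examples) * train_ratio)
    return examples[:n_train], examples[n_train:]
```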
As an improvement of the above method, the step 4) specifically includes:
the training set is input to an emotion speech synthesis system, which trains the mapping relationship from speaker ID, quantized phone text sequence, and emotion embedding code to audio waveform. The speaker ID, the quantized phoneme text sequence and the emotion embedded code are input to an emotion text fusion priori coding module together, and intermediate embedded representations related to the speaker ID, the text and the emotion are obtained through a self-encoder structure; meanwhile, the information is sent to a target duration prediction coding module, and predicted duration of each phoneme is output. The intermediate embedding and phoneme duration prediction results together form a priori code, and the priori code is input into an audio decoding module through a standardized stream structure, and a final generated waveform is obtained through up-sampling network operation. In the training stage, the posterior coding module is used for extracting posterior coding of the input real voice, and the prior and posterior coding are constrained by the same Gaussian distribution, so that the prior coding is continuously optimized to approach to the real posterior coding in the training process. The judging module and the audio decoding module form an countermeasure generation relation so as to improve the generation capacity of the audio decoding module.
As an improvement of the above method, the step 5) specifically includes:
and (3) training to a converged emotion voice synthesis model by utilizing the step (4), inputting a speaker ID to be synthesized, a quantized phoneme text sequence and emotion embedded codes, and automatically reasoning the synthesized emotion voice audio result by the model through an emotion text fusion coding module, a target duration prediction module and an audio decoding module.
The technical scheme of the invention is described in detail below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 2, embodiment 1 of the present invention provides a cross-language end-to-end emotion voice synthesis method, which includes:
step 101) collecting a plurality of paired speech-phoneme level text data;
Step 102) preprocessing the data by digital signal processing and similar means, including extracting the frequency domain features of the speech and discretely encoding the text; extracting the language-independent emotion embedded code of each utterance using a pre-trained emotion discrimination system;
The choice of this emotion discrimination system is flexible. An off-the-shelf system may be employed, e.g. https://github.com/audeering/w2v2-how-to. The emotion discrimination system can also be built from scratch; FIG. 3 shows the composition of an emotion discrimination system built by the applicant of the invention. The emotion discrimination system comprises a classifier formed by a two-branch parallel neural network, and a speaker decoupling network; the two-branch parallel neural network comprises a Transformer branch and a convolutional neural network branch, together with a linear projection layer on the main path for dimension reduction and classification output; the speaker decoupling network comprises a TDNN (Time Delay Neural Network) module.
Step 103) organizing the results obtained in steps 101)-102) to form a training data set for model training;
Step 104) constructing a full end-to-end emotion voice synthesis model, training the model with the data set (original audio, key features and emotion embedded codes) prepared in step 103 until the model converges, then finishing training and saving the model. FIGS. 4-7 show the composition of the modules of the constructed full end-to-end emotion voice synthesis model: FIG. 4 is the emotion text fusion encoding module; FIG. 5 is the posterior coding module; FIG. 6 is the target duration prediction module; FIG. 7 shows the audio decoding module and the discrimination module.
The training set is input into the full end-to-end emotion speech synthesis model, which learns the mapping from the speaker ID, the quantized phoneme text sequence and the emotion embedded code to the audio waveform. The speaker ID, the quantized phoneme text sequence and the emotion embedded code are jointly input to the emotion text fusion (prior) encoding module, and an intermediate embedded representation related to the speaker ID, the text and the emotion is obtained through the encoder structure; at the same time, this information is fed to the target duration prediction module, which outputs the predicted duration of each phoneme. The intermediate embedding and the phoneme duration predictions together form the prior code, which is passed through a normalizing flow structure into the audio decoding module, where the final generated waveform is obtained through an up-sampling network. In the training stage, the posterior coding module extracts the posterior code of the input real speech; the prior and posterior codes are constrained by the same Gaussian distribution, so the prior code is continuously optimized during training to approach the real posterior code. The discrimination module forms a generative-adversarial relationship with the audio decoding module, which improves the generation capability of the audio decoding module.
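The objectives implied by this paragraph (Gaussian-constrained prior/posterior codes, duration prediction, adversarial generation) might be combined roughly as follows. The Mel reconstruction term, all loss weights and the least-squares GAN form are assumptions, and `model` and `discriminator` follow the hypothetical interfaces sketched earlier.

```python
import torch
import torch.nn.functional as F

def training_losses(model, discriminator, mel_fn, batch,
                    lambda_kl=1.0, lambda_dur=1.0, lambda_mel=45.0):
    """Generator-side losses for one training step (weights assumed).
    mel_fn: callable computing the Mel domain spectrum with the same filter
    bank used for the training data."""
    (speaker_id, phoneme_ids, emotion_emb,
     linear_spec, mel_true, wav_true, dur_true) = batch
    wav_hat, log_dur, (m_p, logs_p), (m_q, logs_q), z_p = model(
        speaker_id, phoneme_ids, emotion_emb, linear_spec)

    # KL term: constrain prior and posterior codes to the same Gaussian distribution
    kl = 0.5 * torch.mean(
        (logs_p - logs_q)
        + (torch.exp(logs_q) + (m_q - m_p) ** 2) / torch.exp(logs_p) - 1.0)

    # Duration loss: predicted vs. reference phoneme durations (log domain)
    dur_loss = F.mse_loss(log_dur, torch.log(dur_true.float() + 1e-6))

    # Reconstruction loss on the Mel domain spectrum of the generated waveform
    mel_hat = mel_fn(wav_hat)
    mel_loss = F.l1_loss(mel_hat, mel_true)

    # Adversarial loss: the decoder tries to fool the discrimination module
    fake_score = discriminator(wav_hat)
    adv_loss = torch.mean((fake_score - 1.0) ** 2)   # least-squares GAN form (assumed)

    return lambda_kl * kl + lambda_dur * dur_loss + lambda_mel * mel_loss + adv_loss
```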
Step 105) using the emotion voice synthesis model trained to convergence in step 104, inputting the parameters to be synthesized and the reference voice; the model automatically infers the synthesized emotional speech audio of the target speaker.
Example 2
The embodiment 2 of the invention provides a cross-language end-to-end emotion voice synthesis system. The processing procedure of the system comprises the following steps:
collecting paired speech-phoneme level text data of a large number of different speakers;
extracting frequency domain features and emotion embedded codes of voice, performing discretization coding on speaker ID and text data, and arranging to form a training data set;
constructing a complete end-to-end emotion voice synthesis model, and training the model by utilizing a training data set until the model converges;
the full end-to-end emotion voice synthesis model comprises: an emotion text fusion encoding module, a target duration prediction module and an audio decoding module; wherein,
the emotion text fusion encoding module is used for obtaining, through an encoder structure, intermediate embedded representations related to the speaker ID, the text and the emotion;
the target duration prediction module is used for outputting predicted phoneme durations;
the audio decoding module is used for obtaining a generated waveform through an up-sampling network according to the intermediate embedded representation and the phoneme durations;
inputting the speaker ID to be synthesized, the text sequence and the emotion embedded code into the full end-to-end emotion voice synthesis model established in advance and trained to convergence, and outputting a synthesized emotional speech audio result.
Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the appended claims.

Claims (7)

1. A cross-language end-to-end emotion voice synthesis method comprises the following steps:
collecting paired speech-phoneme level text data of a large number of different speakers;
extracting frequency domain features and emotion embedded codes of voice, performing discretization coding on speaker ID and text data, and arranging to form a training data set;
constructing a complete end-to-end emotion voice synthesis model, and training the model by utilizing a training data set until the model converges;
the full end-to-end emotion voice synthesis model comprises: an emotion text fusion encoding module, a target duration prediction module and an audio decoding module; wherein,
the emotion text fusion encoding module is used for obtaining, through an encoder structure, intermediate embedded representations related to the speaker ID, the text and the emotion;
the target duration prediction module is used for outputting predicted phoneme durations;
the audio decoding module is used for obtaining a generated waveform through an up-sampling network according to the intermediate embedded representation and the phoneme durations;
inputting the speaker ID to be synthesized, the text sequence and the emotion embedded code into the full end-to-end emotion voice synthesis model trained to convergence, and outputting a synthesized emotional speech audio result.
2. The cross-language end-to-end emotion voice synthesis method of claim 1, wherein said full end-to-end emotion voice synthesis model further comprises: the posterior coding module and the discrimination module;
the posterior coding module is used for extracting the posterior code of the speech in the training stage; the posterior code and the prior code are constrained by the same Gaussian distribution, and the prior code comprises the intermediate embedded representation and the phoneme durations;
the discrimination module forms a generative-adversarial relationship with the audio decoding module and is used for improving the generation capability of the audio decoding module.
3. The cross-language end-to-end emotion voice synthesis method of claim 1, wherein the extracted frequency domain features of the voice comprise a linear spectrum and a Mel domain spectrum; wherein,
the linear spectrum X_k(ω) of the kth short period is expressed as:
X_k(ω) = Σ_n x(n)·h(n − kM)·e^(−jωn)
wherein x(n) is the original voice signal, h(·) is the selected window function, M is the frame shift, ω represents the frequency, e is the base of the natural logarithm, and j is the imaginary unit;
the ith dimension Mel domain spectrum Mel_k(ω, i) of the kth short period is expressed as:
Mel_k(ω, i) = Fbank(i, ω_max, ω_min, ω)·P_k(ω)
wherein P_k(ω) is the short-time power spectrum of the kth short period, P_k(ω) = |X_k(ω)|²; Fbank is the i-dimensional Mel domain filter bank constructed from a preset maximum frequency ω_max and minimum frequency ω_min.
4. The method for synthesizing end-to-end emotion voice across languages according to claim 1, wherein said discretizing the encoding of the text data comprises: all the different phonemes in the text data are mapped to positive integers starting from 1 in any order, and the text data is converted into a sequence comprising a number of positive integers.
5. The method for synthesizing end-to-end emotion voice across languages according to claim 1, wherein said language independent emotion embedded codes are obtained by inputting each voice into a pre-trained emotion discrimination system;
the emotion discrimination system comprises a classifier formed by a two-branch parallel neural network, and a speaker decoupling network;
the two-branch parallel neural network comprises a Transformer branch and a convolutional neural network branch, together with a linear projection layer on the main path for dimension reduction and classification output;
the speaker decoupling network comprises a time delay neural network (TDNN) structure.
6. The cross-language end-to-end emotion voice synthesis method of claim 1, wherein said organizing to form a training data set comprises:
for each utterance, packing the original signal, the speaker ID, the extracted linear spectrum, the Mel domain spectrum, the quantized phoneme text sequence and the language-independent emotion embedded code into a tuple, converting the original signal, the spectra and the emotion embedded code into floating-point high-dimensional tensors, and converting the speaker ID and the quantized phoneme text sequence into long-integer tensors; and dividing all data into a training set and a test set at a preset proportion.
7. A cross-language end-to-end emotion voice synthesis system is characterized in that the processing procedure of the system comprises the following steps:
collecting paired speech-phoneme level text data of a large number of different speakers;
extracting frequency domain features and emotion embedded codes of voice, performing discretization coding on speaker ID and text data, and arranging to form a training data set;
constructing a complete end-to-end emotion voice synthesis model, and training the model by utilizing a training data set until the model converges;
the full end-to-end emotion voice synthesis model comprises: an emotion text fusion encoding module, a target duration prediction module and an audio decoding module; wherein,
the emotion text fusion encoding module is used for obtaining, through an encoder structure, intermediate embedded representations related to the speaker ID, the text and the emotion;
the target duration prediction module is used for outputting predicted phoneme durations;
the audio decoding module is used for obtaining a generated waveform through an up-sampling network according to the intermediate embedded representation and the phoneme durations;
inputting the speaker ID to be synthesized, the text sequence and the emotion embedded code into the full end-to-end emotion voice synthesis model established in advance and trained to convergence, and outputting a synthesized emotional speech audio result.
CN202311545240.5A 2023-11-20 2023-11-20 Cross-language end-to-end emotion voice synthesis method and system Pending CN117789771A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311545240.5A CN117789771A (en) 2023-11-20 2023-11-20 Cross-language end-to-end emotion voice synthesis method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311545240.5A CN117789771A (en) 2023-11-20 2023-11-20 Cross-language end-to-end emotion voice synthesis method and system

Publications (1)

Publication Number Publication Date
CN117789771A true CN117789771A (en) 2024-03-29

Family

ID=90398992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311545240.5A Pending CN117789771A (en) 2023-11-20 2023-11-20 Cross-language end-to-end emotion voice synthesis method and system

Country Status (1)

Country Link
CN (1) CN117789771A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118135990A (en) * 2024-05-06 2024-06-04 厦门立马耀网络科技有限公司 End-to-end text speech synthesis method and system combining autoregressive


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination